# Required package setup and OpenSearch cluster creation (takes about 40 minutes)
> This notebook was tested in SageMaker Studio with the **`Python 3 (ipykernel)`** kernel on an ml.t3.medium instance.

---
### Important
- This notebook can only be run in an account that has been granted Model Access to Anthropic's Claude-v3.
- If you do not have Model Access, just review the notebook's code and results.
- Keep in mind that running it incurs **charges**.

---

In [1]:
!pip install -r requirements.txt

Collecting langchain==0.1.11 (from -r requirements.txt (line 1))
  Using cached langchain-0.1.11-py3-none-any.whl.metadata (13 kB)
Collecting opensearch-py==2.4.2 (from -r requirements.txt (line 2))
  Using cached opensearch_py-2.4.2-py2.py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu==1.7.4 (from -r requirements.txt (line 3))
  Using cached faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting pdfplumber==0.10.3 (from -r requirements.txt (line 4))
  Using cached pdfplumber-0.10.3-py3-none-any.whl.metadata (38 kB)
Collecting opensearch_dsl==2.1.0 (from -r requirements.txt (line 6))
  Using cached opensearch_dsl-2.1.0-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting unstructured (from -r requirements.txt (line 7))
  Using cached unstructured-0.12.6-py3-none-any.whl.metadata (83 kB)
Collecting pdf2image (from -r requirements.txt (line 8))
  Using cached pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting streamlit (fro

In [2]:
%load_ext autoreload
%autoreload 2

import sys, os
module_path = "."
sys.path.append(os.path.abspath(module_path))

<br>
# 1. Create the OpenSearch cluster
- For testing, this notebook proceeds with `DEV = True`. For real use, set `DEV = False`.

### Prerequisites
- Create an OpenSearch Service domain by following the link below, then copy its opensearch_domain_endpoint and http_auth values and substitute them into the cells below.
    - [OpenSearch creation guide](https://github.com/gonsoomoon-ml/Kor-LLM-On-SageMaker/blob/main/2-Lab02-QA-with-RAG/4.rag-fsi-data-workshop/TASK-4_OpenSearch_Creation_and_Vector_Insertion.ipynb)
- LangChain OpenSearch reference
    - [Langchain Opensearch](https://python.langchain.com/docs/integrations/vectorstores/opensearch)
    
#### [Caution] Creating the OpenSearch domain takes about 15-16 minutes.

In [3]:
import boto3
import uuid
import botocore
import time
DEV = True # True creates a 1-AZ domain without standby; False creates a 3-AZ domain with standby. For workshop use, True is recommended to avoid excessive charges and resource usage.
VERSION = "2.11" # OpenSearch version (e.g., 2.7 / 2.9 / 2.11)

opensearch_user_id = 'raguser'
opensearch_user_password = 'MarsEarth1!'

region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
opensearch = boto3.client('opensearch', region)
rand_str = uuid.uuid4().hex[:8]
domain_name = f'rag-hol-{rand_str}'

cluster_config_prod = {
    'InstanceCount': 3,
    'InstanceType': 'r6g.large.search',
    'ZoneAwarenessEnabled': True,
    'DedicatedMasterEnabled': True,
    'MultiAZWithStandbyEnabled': True,
    'DedicatedMasterType': 'r6g.large.search',
    'DedicatedMasterCount': 3
}

cluster_config_dev = {
    'InstanceCount': 1,
    'InstanceType': 'r6g.large.search',
    'ZoneAwarenessEnabled': False,
    'DedicatedMasterEnabled': False,
}


ebs_options = {
    'EBSEnabled': True,
    'VolumeType': 'gp3',
    'VolumeSize': 100,
}

advanced_security_options = {
    'Enabled': True,
    'InternalUserDatabaseEnabled': True,
    'MasterUserOptions': {
        'MasterUserName': opensearch_user_id,
        'MasterUserPassword': opensearch_user_password
    }
}

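import json

# Open resource policy (any AWS principal, es:*), scoped to this domain.
# Built with json.dumps so the JSON does not have to be escaped by hand.
ap = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "es:*",
        "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
    }],
})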

if DEV:
    cluster_config = cluster_config_dev
else:
    cluster_config = cluster_config_prod
    
response = opensearch.create_domain(
    DomainName=domain_name,
    EngineVersion=f'OpenSearch_{VERSION}',
    ClusterConfig=cluster_config,
    AccessPolicies=ap,
    EBSOptions=ebs_options,
    AdvancedSecurityOptions=advanced_security_options,
    NodeToNodeEncryptionOptions={'Enabled': True},
    EncryptionAtRestOptions={'Enabled': True},
    DomainEndpointOptions={'EnforceHTTPS': True}
)

In [4]:
%%time
def wait_for_domain_creation(domain_name):
    try:
        response = opensearch.describe_domain(
            DomainName=domain_name
        )
        # Every 60 seconds, check whether the domain is processing.
        while 'Endpoint' not in response['DomainStatus']:
            print('Creating domain...')
            time.sleep(60)
            response = opensearch.describe_domain(
                DomainName=domain_name)

        # Once we exit the loop, the domain is ready for ingestion.
        endpoint = response['DomainStatus']['Endpoint']
        print('Domain endpoint ready to receive data: ' + endpoint)
    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'ResourceNotFoundException':
            print('Domain not found.')
        else:
            raise error

wait_for_domain_creation(domain_name)

Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Domain endpoint ready to receive data: search-rag-hol-fa0ef2d6-7waezq44o4foghelby2ngqd64u.us-east-1.es.amazonaws.com
CPU times: user 354 ms, sys: 43.6 ms, total: 398 ms
Wall time: 15min 3s
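The loop above polls until an endpoint appears and would keep waiting indefinitely if the domain never became ready. The following is a minimal sketch of the same polling pattern with an upper time bound. The helper name `wait_until` and its parameters are hypothetical and are not part of boto3 or this notebook.

```python
import time


def wait_until(check, interval_s=60, timeout_s=3600,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or timeout_s elapses.

    check: zero-argument callable (hypothetical hook supplied by the caller).
    Returns the first truthy result; raises TimeoutError otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(interval_s)


# Possible use with the client from the cells above (left commented out
# because it needs a live domain):
# endpoint = wait_until(
#     lambda: opensearch.describe_domain(DomainName=domain_name)["DomainStatus"].get("Endpoint")
# )
```

Passing `sleep` and `clock` as parameters is only a convenience so the helper can be exercised without real waiting.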


In [5]:
response = opensearch.describe_domain(DomainName=domain_name)
opensearch_domain_endpoint = f"https://{response['DomainStatus']['Endpoint']}"

print(opensearch_domain_endpoint)

https://search-rag-hol-fa0ef2d6-7waezq44o4foghelby2ngqd64u.us-east-1.es.amazonaws.com


<br>

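# 2. Install the Nori plugin for Korean analysis
Amazon OpenSearch Service supports Nori, a popular open-source Korean text analyzer, as a plugin. Together with the previously supported Seunjeon plugin, Nori lets developers easily implement full-text search over Korean documents.

Alongside it, the Pinyin and STConvert plugins for Chinese analysis and the Sudachi plugin for Japanese analysis were also added.
The Nori plugin is available on both new and existing domains running OpenSearch 1.0 or later.

#### [Caution] Associating the Nori plugin takes about 25-27 minutes.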

In [6]:
nori_pkg_id = {}
nori_pkg_id['us-east-1'] = {
    '2.3': 'G196105221',
    '2.5': 'G240285063',
    '2.7': 'G16029449', 
    '2.9': 'G60209291',
    '2.11': 'G181660338'
}

nori_pkg_id['us-west-2'] = {
    '2.3': 'G94047474',
    '2.5': 'G138227316',
    '2.7': 'G182407158', 
    '2.9': 'G226587000',
    '2.11': 'G79602591'
}

pkg_response = opensearch.associate_package(
    PackageID=nori_pkg_id[region][VERSION], # nori plugin
    DomainName=domain_name
)
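The lookup `nori_pkg_id[region][VERSION]` raises a bare `KeyError` for any region or version that is not in the table above. A minimal sketch of a lookup that reports the known combinations instead is shown below. The helper name `lookup_nori_package` is hypothetical, and `example_ids` repeats only two entries from the table so the sketch runs on its own.

```python
def lookup_nori_package(pkg_ids, region, version):
    """Return the Nori package ID for (region, version) or raise a descriptive error."""
    try:
        return pkg_ids[region][version]
    except KeyError:
        # Map each known region to its sorted list of known versions.
        known = {r: sorted(v) for r, v in pkg_ids.items()}
        raise ValueError(
            f"No Nori package ID recorded for region={region!r}, "
            f"version={version!r}; known combinations: {known}"
        ) from None


# Subset of the table above, repeated here so the sketch is self-contained.
example_ids = {
    "us-east-1": {"2.11": "G181660338"},
    "us-west-2": {"2.11": "G79602591"},
}
print(lookup_nori_package(example_ids, "us-east-1", "2.11"))
```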

In [7]:
%%time
def wait_for_associate_package(domain_name, max_results=1):

    response = opensearch.list_packages_for_domain(
        DomainName=domain_name,
        MaxResults=max_results
    )
    # Every 60 seconds, check whether the package is still being associated.
    while response['DomainPackageDetailsList'][0]['DomainPackageStatus'] == "ASSOCIATING":
        print('Associating packages...')
        time.sleep(60)
        response = opensearch.list_packages_for_domain(
            DomainName=domain_name,
            MaxResults=max_results
        )

    # Any final status other than ACTIVE (e.g. ASSOCIATION_FAILED) means the plugin is not usable.
    status = response['DomainPackageDetailsList'][0]['DomainPackageStatus']
    if status != "ACTIVE":
        raise RuntimeError(f'Package association ended with status {status}')
    print('Associated!')

wait_for_associate_package(domain_name)

Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associated!
CPU times: user 519 ms, sys: 36.1 ms, total: 555 ms
Wall time: 23min 2s


Use opensearchpy to check whether the Nori plugin is installed.

In [8]:
! pip list | grep langchain
! pip list | grep opensearch

langchain                             0.1.11
langchain-community                   0.0.29
langchain-core                        0.1.33
langchain-text-splitters              0.0.1
opensearch-dsl                        2.1.0
opensearch-py                         2.4.2


In [9]:
from opensearchpy import OpenSearch, RequestsHttpConnection
http_auth = (opensearch_user_id, opensearch_user_password)
os_client = OpenSearch(
                hosts=[
                    {'host': opensearch_domain_endpoint.replace("https://", ""),
                     'port': 443
                    }
                ],
                http_auth=http_auth, # Master username, Master password,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection
            )

res_str = os_client.cat.plugins()

if 'opensearch-analysis-nori' in res_str:
    print('The opensearch-nori plugin is available.')
else:
    print('The opensearch-nori plugin has not been associated.')


The opensearch-nori plugin is available.
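Once the plugin is available, an index can reference the Nori tokenizer in its analysis settings. The following is a minimal sketch of such an index body. The names `korean_tokenizer`, `korean_analyzer`, the field name `text`, and the index name `nori-demo` are hypothetical, and the chosen `decompound_mode` and filters are one possible configuration rather than the one used later in this workshop.

```python
# Index body that routes a text field through a Nori-based custom analyzer.
nori_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_tokenizer": {          # hypothetical name
                    "type": "nori_tokenizer",
                    # "mixed" keeps each compound and also emits its parts.
                    "decompound_mode": "mixed",
                }
            },
            "analyzer": {
                "korean_analyzer": {           # hypothetical name
                    "type": "custom",
                    "tokenizer": "korean_tokenizer",
                    "filter": ["nori_readingform", "lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "korean_analyzer"}
        }
    },
}

# With the os_client from the cell above, the index could be created with
# (left commented out because it needs a live domain):
# os_client.indices.create(index="nori-demo", body=nori_index_body)
```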


In [10]:
%store opensearch_user_id opensearch_user_password domain_name opensearch_domain_endpoint

Stored 'opensearch_user_id' (str)
Stored 'opensearch_user_password' (str)
Stored 'domain_name' (str)
Stored 'opensearch_domain_endpoint' (str)
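Because the domain keeps incurring charges while it exists, it may be worth deleting it once the workshop is finished. The following is a minimal sketch, assuming the same boto3 `opensearch` client and `domain_name` defined earlier. The helper name `delete_domain_if_exists` is hypothetical.

```python
def delete_domain_if_exists(client, name):
    """Delete the named domain.

    Returns True if the delete request was accepted, or False if the service
    reports that the domain does not exist.
    """
    try:
        client.delete_domain(DomainName=name)
        return True
    except client.exceptions.ResourceNotFoundException:
        return False


# Possible use after the workshop (left commented out so the domain is not
# deleted by accident):
# delete_domain_if_exists(opensearch, domain_name)
```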
