# Keyword Search with the OpenSearch Korean Morphological Analyzer
> This notebook was tested on:
> - SageMaker Studio with the **`Data Science 3.0`** kernel on an ml.m5.large instance
> - SageMaker Notebook with the **`conda_python3`** kernel


This notebook assumes that OpenSearch is already set up and shows how to use the Korean morphological analyzer.

---

### [Important]
- This notebook uses the Bedrock Titan Embedding Model by default.
- It assumes that the OpenSearch Service domain is active.
- Every step of the 02_OpenSearch_setup notebook before its clean-up section must already be complete.


---
## Ref: 
- [Amazon OpenSearch Service로 검색 구현하기](https://catalog.us-east-1.prod.workshops.aws/workshops/de4e38cb-a0d9-4ffe-a777-bf00d498fa49/ko-KR/indexing/blog-reindex)
- [OpenSearch Python Client](https://opensearch.org/docs/1.3/clients/python-high-level/)
- [OpenSearch Match, Multi-Match, and Match Phrase Queries](https://opster.com/guides/opensearch/opensearch-search-apis/opensearch-match-multi-match-and-match-phrase-queries/)
- An explanation of `filter`, `must`, `should`, and `must_not` in OpenSearch queries:
    - [OpenSearch Boolean Queries](https://opster.com/guides/opensearch/opensearch-search-apis/opensearch-boolean-queries/#:~:text=Boolean%20queries%20are%20used%20to,as%20terms%2C%20match%20and%20query_string.)
- [OpenSearch Query Description (in Korean)](https://esbook.kimjmin.net/05-search)


## 1. Environment Setup

In [1]:
import boto3
region = boto3.Session().region_name
opensearch = boto3.client('opensearch', region)

%store -r opensearch_user_id opensearch_user_password domain_name opensearch_domain_endpoint

try:
    opensearch_user_id
    opensearch_user_password
    domain_name
    opensearch_domain_endpoint
   
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Run 00_setup notebook first or Create Your Own OpenSearch Domain")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

no stored variable or alias opensearch_user_id
no stored variable or alias opensearch_user_password
no stored variable or alias domain_name
no stored variable or alias opensearch_domain_endpoint
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[ERROR] Run 00_setup notebook first or Create Your Own OpenSearch Domain
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Set the OpenSearch connection details

# [Required] Change the OpenSearch values below to match your own environment.

opensearch_user_id = 'raguser'
opensearch_user_password = 'Passw0rd1!'

domain_name = 'jesamkim-opensearch-rag'
opensearch_domain_endpoint = 'https://search-jesamkim-opensearch-rag-5wbkv7qjrlci47h5ka63cw5fxy.us-west-2.es.amazonaws.com'

### Create the Bedrock Client

In [4]:
import boto3
import os
import json
from botocore.config import Config
import botocore 
from pprint import pprint
from termcolor import colored

session = boto3.Session()

retry_config = Config(
    region_name=os.environ.get("AWS_DEFAULT_REGION", None),
    retries={
        "max_attempts": 10,
        "mode": "standard",
    },
)

# modelId = "anthropic.claude-instant-v1"  # (Change this to try different model versions)
modelId = "anthropic.claude-3-sonnet-20240229-v1:0"
accept = "application/json"
contentType = "application/json"

bedrock = boto3.client(service_name='bedrock')
boto3_bedrock = boto3.client(service_name='bedrock-runtime',config=retry_config)

model_list = bedrock.list_foundation_models()
result = [(fm_list["modelName"], fm_list["modelId"]) for fm_list in model_list["modelSummaries"] if fm_list['inferenceTypesSupported'] == ['ON_DEMAND']]
pprint(result)

[('Titan Text Large', 'amazon.titan-tg1-large'),
 ('Titan Text Embeddings v2', 'amazon.titan-embed-g1-text-02'),
 ('Titan Text G1 - Lite', 'amazon.titan-text-lite-v1'),
 ('Titan Text G1 - Express', 'amazon.titan-text-express-v1'),
 ('Titan Embeddings G1 - Text', 'amazon.titan-embed-text-v1'),
 ('Titan Multimodal Embeddings G1', 'amazon.titan-embed-image-v1'),
 ('Titan Image Generator G1', 'amazon.titan-image-generator-v1'),
 ('SDXL 0.8', 'stability.stable-diffusion-xl'),
 ('SDXL 0.8', 'stability.stable-diffusion-xl-v0'),
 ('SDXL 1.0', 'stability.stable-diffusion-xl-v1'),
 ('J2 Grande Instruct', 'ai21.j2-grande-instruct'),
 ('J2 Jumbo Instruct', 'ai21.j2-jumbo-instruct'),
 ('Jurassic-2 Mid', 'ai21.j2-mid'),
 ('Jurassic-2 Mid', 'ai21.j2-mid-v1'),
 ('Jurassic-2 Ultra', 'ai21.j2-ultra'),
 ('Jurassic-2 Ultra', 'ai21.j2-ultra-v1'),
 ('Claude Instant', 'anthropic.claude-instant-v1'),
 ('Claude', 'anthropic.claude-v2:1'),
 ('Claude', 'anthropic.claude-v2'),
 ('Claude 3 Sonnet', 'anthropic.clau

## 2. Load Titan Embedding and the Claude 3 Sonnet LLM

### Load the LLM (Claude 3 Sonnet)

In [5]:
#from langchain_community.chat_models import BedrockChat
from langchain_aws import ChatBedrock
from langchain_core.messages import HumanMessage
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [6]:
llm_text = ChatBedrock(
    model_id=modelId,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    model_kwargs={
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature" : 0,
        "top_k": 0,
        "top_p": 0.0
    }
)
llm_text



BedrockChat(client=<botocore.client.BedrockRuntime object at 0x7fc049403970>, region_name='us-west-2', model_id='anthropic.claude-3-sonnet-20240229-v1:0', model_kwargs={'anthropic_version': 'bedrock-2023-05-31', 'max_tokens': 4096, 'temperature': 0, 'top_k': 0, 'top_p': 0.0}, streaming=True, callbacks=[<langchain_core.callbacks.streaming_stdout.StreamingStdOutCallbackHandler object at 0x7fc04a365ab0>])

In [7]:
prompt1 = "나는 인공지능 AI 보험 서비스입니다. 생명과 손해 보험의 차이에 대해 설명해 주세요."
messages = [
    HumanMessage(content=prompt1)
]

# messages = [
#     {"role": "user", "content": [{"type": "text", "text": prompt1}]},
# ]

response1 = llm_text.invoke(messages)

생명보험과 손해보험은 보험의 주요 유형으로 다음과 같은 차이점이 있습니다.

1. 보장 대상
- 생명보험: 사람의 생명과 관련된 위험을 보장합니다. 예를 들어 사망, 상해, 질병 등
- 손해보험: 재산상의 손해나 법적 배상책임을 보장합니다. 예를 들어 화재, 자동차사고, 배상책임 등

2. 보험기간
- 생명보험: 일반적으로 장기간 보장되며 종신까지 연장 가능합니다.
- 손해보험: 단기간 보장되며 1년 만기로 갱신하는 것이 일반적입니다.

3. 보험금 지급
- 생명보험: 피보험자의 사망, 상해, 질병 등 약관에서 정한 보험사고 발생 시 보험금을 지급합니다.
- 손해보험: 실제 입은 손해액수를 보상하는 실손보상을 원칙으로 합니다.

4. 보험료 산정
- 생명보험: 연령, 건강상태, 가입금액 등을 기준으로 보험료를 산정합니다.
- 손해보험: 위험률, 대상 가액 등을 기준으로 보험료를 산정합니다.

요컨대 생명보험은 개인의 생명과 관련된 위험을 장기간 보장하고, 손해보험은 재산 및 배상책임 등의 손해를 단기간 보상하는 것이 주요 차이점입니다.

### Select the Embedding Model

In [8]:
from langchain_community.embeddings import BedrockEmbeddings

llm_emb = BedrockEmbeddings(
    client=boto3_bedrock,
    # model_id="cohere.embed-multilingual-v3"
    model_id="amazon.titan-embed-g1-text-02"
)

-------------------

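## 3. Create the OpenSearch Vector Index
### Prerequisites
- LangChain OpenSearch reference
    - [Langchain Opensearch](https://python.langchain.com/docs/integrations/vectorstores/opensearch)

### Delete the index if it already exists
Connect to OpenSearch first. If the target index already exists, it is deleted in the next step.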

In [9]:
from opensearchpy import OpenSearch, RequestsHttpConnection
http_auth = (opensearch_user_id, opensearch_user_password)
os_client = OpenSearch(
            hosts=[
                {'host': opensearch_domain_endpoint.replace("https://", ""),
                 'port': 443
                }
            ],
            http_auth=http_auth, # Master username, Master password,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
        )

### Create the index

In [10]:
# OpenSearch index name
index_name = "index-04"

exists = os_client.indices.exists(index=index_name)

if exists:
    os_client.indices.delete(index=index_name)
    print("Index is deleted")
else:
    print("Index does not exist")

Index does not exist


In [11]:
## The field names metadata, text, and vector_field are the names LangChain expects
### Change the dimension to match the embedding model (Titan: 1536, Cohere: 1024)
import json

with open('index_body_simple.json', 'r') as f:
    index_body = json.load(f)

print(json.dumps(index_body, indent=2))


{
  "settings": {
    "index.knn": true,
    "index.knn.algo_param.ef_search": 512
  },
  "mappings": {
    "properties": {
      "metadata": {
        "properties": {
          "source": {
            "type": "keyword"
          },
          "type": {
            "type": "keyword"
          },
          "timestamp": {
            "type": "date"
          }
        }
      },
      "vector_field": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "engine": "faiss",
          "name": "hnsw",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          },
          "space_type": "l2"
        }
      }
    }
  }
}
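
The comment in the cell above says that `dimension` has to match the embedding model. A minimal sketch of a guard that checks this before any documents are indexed might look like the following. `mapping_dimension`, `check_embedding_dimension`, and `fake_embed` are hypothetical names introduced here for illustration; in this notebook the callable would presumably be `llm_emb.embed_query`.

```python
def mapping_dimension(index_body, field="vector_field"):
    """Return the knn_vector dimension declared in an index body."""
    return index_body["mappings"]["properties"][field]["dimension"]


def check_embedding_dimension(embed_query, index_body, field="vector_field"):
    """Embed a probe string and compare its length with the mapping."""
    actual = len(embed_query("dimension probe"))
    expected = mapping_dimension(index_body, field)
    if actual != expected:
        raise ValueError(
            f"Embedding has {actual} dimensions but '{field}' expects {expected}"
        )
    return actual


# Stand-in for llm_emb.embed_query so the sketch runs without AWS access.
def fake_embed(text):
    return [0.0] * 1536


sample_body = {
    "mappings": {
        "properties": {"vector_field": {"type": "knn_vector", "dimension": 1536}}
    }
}
print(check_embedding_dimension(fake_embed, sample_body))
```

Running the check once before creating the index surfaces a mismatch early instead of at indexing time.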


In [12]:
os_client.indices.create(index_name, body=index_body)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'index-04'}

In [13]:
%%time
from langchain_community.vectorstores import OpenSearchVectorSearch

vector_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
)

CPU times: user 2.79 ms, sys: 3.98 ms, total: 6.78 ms
Wall time: 6.56 ms


## 4. Prepare the Data


In [14]:
import time
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import PyPDFium2Loader


from langchain_core.documents import Document

# from llmsherpa.readers import LayoutPDFReader

In [15]:
import glob

# Path to the PDF files whose contents are stored in the OpenSearch index
data_path = './data/04/*'

pdf_list = glob.glob(data_path)
pdf_list

['./data/04/Renewables_2023_IEA.pdf']

In [16]:
from multiprocessing.pool import ThreadPool
from multiprocessing import  Manager

import pdfplumber

In [17]:
import re

def prune_text(text, current_pdf_file):

    def replace_cid(match):
        # "(cid:N)" marks a glyph that the PDF text extractor could not map to a character
        print(f"Please check PDF file {current_pdf_file} : {match.group(0)}")
        code_point = int(match.group(1))
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            return ''  # In case of conversion failure, return empty string

    # Regular expression to find all (cid:x) patterns
    cid_pattern = re.compile(r'\(cid:(\d+)\)')
    pruned_text = re.sub(cid_pattern, replace_cid, text)
    return pruned_text

In [18]:
from datetime import datetime

def read_pdf(param):
    vector_db = param[0]
    current_pdf_file = param[1]
    print(f"current_pdf_file : {current_pdf_file}")
    docs = []
    source_name = current_pdf_file.split('/')[-1]
    type_name = source_name.split('_')[0]
    
    with pdfplumber.open(current_pdf_file) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                pruned_text = prune_text(page_text, current_pdf_file)
            else:
                pruned_text = ""
            if len(pruned_text) >= 20:  ## keep only pages with at least 20 characters (arbitrary threshold)
                chunk = Document(
                    page_content=pruned_text.replace('\n',' '),
                    metadata={
                        "source" : source_name,
                        "type": type_name,
                        "timestamp": datetime.now()
                    }
                )
                #print(f"chunk : {chunk}")
                docs.append(chunk)
    if len(docs) > 0 :
        vector_db.add_documents(docs)

In [19]:
manager = Manager()
result_dict = manager.dict()

# Confirmed to work on ml.m5.xlarge; ThreadPool runs read_pdf in threads, not separate processes
param = [(vector_db, current_pdf_file) for current_pdf_file in pdf_list]

# Use one worker per PDF, capped at the CPU count and never fewer than one
num_processes = max(1, min(len(pdf_list), os.cpu_count() or 1))

print(f"num of process : {num_processes}")

with ThreadPool(processes=num_processes) as pool:
    pool.map(read_pdf, param)

num of process : 1
current_pdf_file : ./data/04/Renewables_2023_IEA.pdf


### Check the configuration of the index created in OpenSearch

In [20]:
index_info = os_client.indices.get(index=index_name)
print(json.dumps(index_info, indent=2))

{
  "index-04": {
    "aliases": {},
    "mappings": {
      "properties": {
        "metadata": {
          "properties": {
            "source": {
              "type": "keyword"
            },
            "timestamp": {
              "type": "date"
            },
            "type": {
              "type": "keyword"
            }
          }
        },
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "vector_field": {
          "type": "knn_vector",
          "dimension": 1536,
          "method": {
            "engine": "faiss",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {
              "ef_construction": 512,
              "m": 16
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "replication": {
          "type": "DOCUMENT"
        },
        

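## 5. Vocabulary-based full-text search - Lexical Search

> The query sentence is split into morphemes by the Nori analyzer when the `text` field is mapped to it. The mapping printed above declares no analyzer for `text`, so this index falls back to the default standard analyzer.
>
> The query is scored with the BM25 algorithm against the sentences (chunks) extracted from the PDF and stored in OpenSearch, and the top k results (k=5) are returned in descending order of score.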
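
For Korean text to be split into morphemes, the `text` field has to be mapped to a Nori-based analyzer when the index is created. The sketch below shows one way to add such an analyzer to an existing index body. It assumes the Nori analysis plugin is available on the domain. `with_nori_analyzer`, `korean_analyzer`, and `nori_mixed` are names chosen here for illustration and are not part of the notebook's `index_body_simple.json`.

```python
import copy


def with_nori_analyzer(index_body, analyzer_name="korean_analyzer"):
    """Return a copy of index_body whose `text` field uses a Nori analyzer."""
    body = copy.deepcopy(index_body)  # leave the caller's dict untouched
    analysis = body.setdefault("settings", {}).setdefault("analysis", {})
    # nori_tokenizer splits Korean text into morphemes; "mixed" keeps both
    # the original compound and its decomposed parts.
    analysis.setdefault("tokenizer", {})["nori_mixed"] = {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
    }
    analysis.setdefault("analyzer", {})[analyzer_name] = {
        "type": "custom",
        "tokenizer": "nori_mixed",
        "filter": ["nori_readingform", "lowercase"],
    }
    properties = body.setdefault("mappings", {}).setdefault("properties", {})
    properties["text"] = {"type": "text", "analyzer": analyzer_name}
    return body


base_body = {"settings": {"index.knn": True}, "mappings": {"properties": {}}}
nori_body = with_nori_analyzer(base_body)
print(nori_body["mappings"]["properties"]["text"])
```

The returned body could then be passed to `os_client.indices.create(index=index_name, body=nori_body)`, and the resulting tokens inspected with `os_client.indices.analyze`, to confirm that a Korean query is tokenized as expected.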

In [21]:
from opensearch_dsl import Search

In [22]:
def result_to_dataframe(response):
    import pandas as pd

    pd.set_option('display.max_columns', 150)
    pd.set_option('display.max_colwidth', None)

    result = []
    for res in response['hits']['hits']:
        # print(res.keys())
        result.append([res['_index'], round(res['_score'], 4), res['_source']['metadata']['type'], res['_source']['text']])
    df = pd.DataFrame(result, columns=['index_name', 'score', 'type', 'text'])
    return df.style.set_properties(**{'text-align': 'left'})

In [23]:
def query_lexical(query, filter=None, k=5):
    # Copy the caller's list so it is never mutated and each clause appears only once
    filter_clauses = list(filter) if filter else []
    QUERY_TEMPLATE = {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "text": query
                        }
                    }
                ],
                "filter": filter_clauses
            }
        }
    }
    return QUERY_TEMPLATE

In [24]:
# Korean query: "How much was the world's energy supply in 2020?"
query = "2020년 전세계 에너지 공급량은 얼마인가요?"

response_lexical_only = os_client.search(
    body=query_lexical(query),
    index=index_name
)

time_took_lexical_only = response_lexical_only['took']
print('Search time: ', time_took_lexical_only, 'ms')

print("<<User query>>: ", query)

result_to_dataframe(response_lexical_only)

Search time:  2 ms
<<User query>>:  2020년 전세계 에너지 공급량은 얼마인가요?


Unnamed: 0,index_name,score,type,text


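### Check which terms matched in the results

Results are scored with the BM25 algorithm. To see which terms matched in each sentence (chunk) extracted from the PDF, run the code below. The query adds HTML tags so that the matching terms are highlighted in the results.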
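
To make the scoring concrete, here is a small pure-Python sketch of a BM25 scorer. It uses the textbook formula with `k1=1.2` and `b=0.75`. The exact constants and normalization that OpenSearch applies internally may differ, so treat the numbers as illustrative only. `bm25_scores` and the toy documents are hypothetical and assume the text has already been tokenized by an analyzer.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized docs against query terms with a textbook BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term absent from the doc contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["solar", "pv", "capacity", "additions"],
    ["wind", "capacity", "forecast"],
    ["hydropower", "generation"],
]
scores = bm25_scores(["solar", "capacity"], docs)
print(scores)
```

Rare terms receive a larger weight than common ones, and a document that shares no token with the query scores zero, which is why tokenization by the analyzer decides what can match at all.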

In [25]:
def query_lexical_with_highlight(query, filter=None, k=5):
    # Copy the caller's list so it is never mutated and each clause appears only once
    filter_clauses = list(filter) if filter else []
    QUERY_TEMPLATE = {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "text": query
                        }
                    }
                ],
                "filter": filter_clauses
            }
        },
        "highlight": {
            "pre_tags": [
                "<span style='color:red'>"
            ],
            "post_tags": [
                "</span>"
            ],
            "fields": {
                "text": {}
            }
        }
    }
    return QUERY_TEMPLATE

response_lexical_with_highlight = os_client.search(
    body=query_lexical_with_highlight(query),
    index=index_name
)


In [26]:
from IPython.display import HTML

import pandas as pd
temp_arr = []

for res in response_lexical_with_highlight['hits']['hits']:
            # result.append([res['_index'], round(res['_score'], 4), res['_source']['metadata']['type'], res['_source']['text']])

    temp_arr.append([res['_score'], res['highlight']['text']])

# print("---------- results including html tags ------------")
# print(temp_arr)
# print("-------------------------------------------")

#df = pd.DataFrame(temp_arr)
df = pd.DataFrame(temp_arr, columns=['score', 'matched part of each document (chunk)'])

print("<<User query>>: ", query)

HTML(df.to_html(escape=False))


<<User query>>:  2020년 전세계 에너지 공급량은 얼마인가요?


Unnamed: 0,score,matched part of each document (chunk)


Note: a highlight request does not return the full text of every result; it mainly returns the fragments that matched. In the code above, each matching term is wrapped in `<span style='color:red'>` and `</span>` tags so that it stands out.

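## 6. Using Filters
- Document metadata can be used to reduce the search space.
- Filter clauses only include or exclude documents and do not contribute to the score, so narrowing the candidates this way can be expected to speed up the search.
- syntax
    - `filter=[{"term": {"metadata.source": "신한은행"}}]`, where `"term"` is a fixed keyword, `"metadata.source"` is the field name (a metadata field or any other field), and `"신한은행"` is the value to match
    - Several filters can be set by adding more items to the list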
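
The filter syntax above can be generated from a plain dictionary. The following is a minimal sketch under that assumption. `build_term_filters` and `lexical_query_with_filters` are hypothetical helpers, and the field values are taken from the document indexed earlier in this notebook.

```python
def build_term_filters(conditions):
    """Turn {field: value} pairs into a list of term filter clauses."""
    return [{"term": {field: value}} for field, value in conditions.items()]


def lexical_query_with_filters(query, conditions=None, k=5):
    """Build a bool query: scored `match` in must, unscored terms in filter."""
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [{"match": {"text": query}}],
                "filter": build_term_filters(conditions or {}),
            }
        },
    }


body = lexical_query_with_filters(
    "renewable capacity",
    {"metadata.source": "Renewables_2023_IEA.pdf", "metadata.type": "Renewables"},
)
print(body["query"]["bool"]["filter"])
```

Because `metadata.source` and `metadata.type` are mapped as `keyword`, a `term` filter only matches when the value is identical to the stored string, including the file extension.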

In [27]:
filter = [
    {"term": {"metadata.source": "국제 신재생에너지 정책변화 및 시장 분석_22-26.pdf"}}
]

response = os_client.search(
    body=query_lexical(query, filter),
    index=index_name
)
result_to_dataframe(response)

Unnamed: 0,index_name,score,type,text


In [28]:
filter = [
    {"term": {"metadata.source": "국제 신재생에너지 정책변화 및 시장 분석_22-26.pdf"}},
    {"term": {"metadata.type": "국제 신재생에너지 정책변화 및 시장 분석"}},
]

response = os_client.search(
    body=query_lexical(query, filter),
    index=index_name
)
print('Search time earlier, without a filter: ', time_took_lexical_only, 'ms')
print('Search time now, with a filter: ', response['took'], 'ms')

print("<<User query>>: ", query)

result_to_dataframe(response)

Search time earlier, without a filter:  2 ms
Search time now, with a filter:  2 ms
<<User query>>:  2020년 전세계 에너지 공급량은 얼마인가요?


Unnamed: 0,index_name,score,type,text


## 7. Search with vector search (k-NN search) - Semantic Search

- Pass the query and check whether semantically similar content is actually retrieved.

In [29]:
def query_semantic(vector, filter=None, k=5):
    # NOTE: `filter` is accepted for parity with query_lexical but is not applied to the k-NN query
    QUERY_TEMPLATE = {
        "size": k,
        "query": {
            "knn": {
                "vector_field": {
                    "vector": vector,
                    "k": k
                }
            }
        }
    }
    return QUERY_TEMPLATE
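
`query_semantic` takes a `filter` argument but builds the k-NN query without it. One possible way to apply metadata filters is to place them inside the `knn` clause. This is a sketch only: to my understanding, filtering inside `knn` with the `faiss` engine requires a recent OpenSearch version, so check the k-NN documentation for your domain before relying on it. `query_semantic_filtered` is a hypothetical name.

```python
def query_semantic_filtered(vector, filters=None, k=5):
    """Build a k-NN query, optionally restricted by term filters."""
    knn_clause = {"vector": vector, "k": k}
    if filters:
        # Assumption: the domain supports a `filter` key inside the knn clause.
        knn_clause["filter"] = {"bool": {"filter": list(filters)}}
    return {"size": k, "query": {"knn": {"vector_field": knn_clause}}}


example = query_semantic_filtered(
    [0.1, 0.2, 0.3],
    filters=[{"term": {"metadata.type": "Renewables"}}],
    k=3,
)
print(example)
```

With no filters the function produces the same body shape as `query_semantic`, so it can be swapped in without changing the unfiltered behaviour.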

In [30]:
print("<<User query>>: ", query)

response = os_client.search(
    body=query_semantic(llm_emb.embed_query(query)),
    index=index_name
)
result_to_dataframe(response)

<<User query>>:  2020년 전세계 에너지 공급량은 얼마인가요?


Unnamed: 0,index_name,score,type,text
0,index-04,0.0032,Renewables,"Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 Solar PV and wind additions are forecast to more than double by 2028 compared with 2022, continuously breaking records over the forecast period to reach almost 710 GW. At the same time, hydropower and bioenergy capacity additions will be lower than during the last five years as development in emerging economies decelerates, especially in China. Renewables overtake coal in early-2025 to become the largest energy source for electricity generation globally By 2028, potential renewable electricity generation is expected to reach around 14 400 TWh, an increase of almost 70% from 2022. Over the next five years, several renewable energy milestones could be achieved:  In 2024, variable renewable generation surpasses hydropower.  In 2025, renewables surpass coal-fired electricity generation.  In 2025, wind surpasses nuclear electricity generation.  In 2026, solar PV surpasses nuclear electricity generation.  In 2028, solar PV surpasses wind electricity generation. Electricity generation by technology, 2000-2028 45% Solar PV 40% 35% Wind 30% Variable 25% renewables 20% Hydropower 15% Other 10% renewables 5% All renewables 0% 2000 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 2022 2024 2026 2028 IEA. CC BY 4.0. Notes: Electricity generation from wind and solar PV indicate potential generation including current curtailment rates. However, it does not project future curtailment of wind and solar PV, which may be significant in a few countries by 2028. The Curtailment section below discusses some of these recent trends. Over the forecast period, potential renewable electricity generation growth exceeds global demand growth, indicating a slow decline in coal-based generation while natural gas remains stable. In 2028, renewable energy sources account for 42% of global electricity generation, with the wind and solar PV share making up 25%. 
In 2028, hydropower remains the largest renewable electricity source. However, renewable electricity generation needs to expand more quickly in many PAGE | 15 .0.4 YB CC .AEI"
1,index-04,0.0032,Renewables,"Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 Chapter 1. Electricity Global forecast summary 2023 marks a step change for renewable power growth over the next five years Renewable electricity capacity additions reached an estimated 507 GW in 2023, almost 50% higher than in 2022, with continuous policy support in more than 130 countries spurring a significant change in the global growth trend. This worldwide acceleration in 2023 was driven mainly by year-on-year expansion in the People’s Republic of China’s (hereafter “China”) booming market for solar PV (+116%) and wind (+66%). Renewable power capacity additions will continue to increase in the next five years, with solar PV and wind accounting for a record 96% of it because their generation costs are lower than for both fossil and non-fossil alternatives in most countries and policies continue to support them. Renewable electricity capacity additions by technology and segment 1 000 100% Ocean 900 90% CSP 800 80% Geothermal 700 70% 600 60% Bioenergy 500 50% Hydropower 400 40% Wind 300 30% 200 20% Solar PV 100 10% % of wind 0 0% and PV Historical Main case Acc. case IEA. CC BY 4.0. Notes: CSP = concentrated solar power. Capacity additions refer to net additions. Historical and forecast solar PV capacity may differ from previous editions of the renewable energy market report. This year, PV data for all countries have been converted to DC (direct current), increasing capacity for countries reporting in AC (alternating current). Conversions are based on an IEA survey of more than 80 countries and interviews with PV industry associations. Solar PV systems work by capturing sunlight using photovoltaic cells and converting it into DC electricity. The DC electricity is then usually converted using an inverter, as most electrical devices and power systems use AC. 
Until about 2010, AC and DC capacity in most PV systems were similar, but with developments in PV system sizing, these two values may now differ by up to 40%, especially in utility-scale installations. Solar PV and wind additions include capacity dedicated to hydrogen production. PAGE | 14 WG 6102 7102 8102 9102 0202 1202 2202 e3202 4202 5202 6202 7202 8202 e3202 4202 5202 6202 7202 8202 .0.4 YB CC .AEI"
2,index-04,0.0032,Renewables,"Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 Renewable energy capacity in connection queues by project stage (left), and advanced- stage solar PV and wind projects by region (right) 1 000 548 900 976 800 700 600 500 400 300 200 1 505 100 0 Solar PV Wind Late stage (GW) Early stage/unlikely (GW) Under review (GW) US Europe APAC LAM IEA. CC BY 4.0. Notes: APAC = Asia Pacific. LAM = Latin America. All capacity presented is sourced from publicly available country-level connection queue information. US data from CAISO; ERCOT; MISO; PJM; NYISO; ISO-NE and SPP interconnections; Appalachian Electric Cooperative; Arizona Public Service; Black Hills Colorado Electric; Bonneville Power District; Cheyenne Light, Fuel & Power; City of Los Angeles Department of Water and Power; Duke Carolinas; Duke Florida; Duke Progress; El Paso Electric; Florida Light and Power; Georgia Transmission Company; Imperial Irrigation District; Idaho Power; Jacksonville Electric Department; Louisville Gas and Electric Company and Kentucky Utilities Company; NV Energy; Portland General Electric; Public Service Company of New Mexico; Platte River Power Authority; Santee Cooper; Southern Electric Corporation of Mississippi; Southern Company; Salt River Project; Tucson Electric Power; Tri-State Generation and Transmission; Tennessee Valley Authority; and Western Power Administration. Spain data from Red Eléctrica de Espana. 
Japan data from Hokkaido Electric Power Network, Grid connection status of renewable energy projects; Tohoku Electric Power Network, Grid connection status of renewable energy projects; TEPCO Power Grid, Grid connection status of renewable energy projects; Chubu Electric Power Grid, Grid connection status of renewable energy projects; Hokuriku Electric Power Transmission & Distribution, Grid connection status of renewable energy projects; Kansai Transmission and Distribution, Grid connection status of renewable energy projects; Chugoku Electric Power Transmission & Distribution, Grid connection status of renewable energy projects; Shikoku Electric Power Transmission & Distribution, Grid connection status of renewable energy projects; Kyushu Electric Power Transmission and Distribution, Grid connection status of renewable energy projects; Okinawa Electric Power, Grid connection status of renewable energy projects. Brazil data from ANEEL. Italy data from TERNA. UK data from Ofgem. Germany data from Bundesnetzagentur. Australia data from AEMO. Mexico data from CENACE. Chile data from CEN. Colombia data from UPME. India data estimated based on CEA transmission buildout planning. Solar PV values are a mix of AC and DC, depending on the source. Since 2010, entries into interconnection queues across the United States have increased by at least 20 times, while investment in transmission and distribution grids has only doubled. In France, the amount of solar PV and onshore wind capacity waiting for connection has nearly doubled since 2018, and new applications for connection in the United Kingdom have risen 80% since 2022. The increase in connection requests has lengthened project lead times. In the United States, average queue lead times rose from three years in 2015 to five years in 2022, while in the United Kingdom 120 GW of projects awaiting connection have been offered connection in 2030 or later. 
Meanwhile, France’s backlog of projects has led to connection delays of 22 months. In Brazil, increased development of solar PV and onshore wind has increased grid connection queues PAGE | 75 WG .0.4 YB CC .AEI"
3,index-04,0.0031,Renewables,"Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 improvements for auction design and permitting, and a growing corporate PPA market in Germany; positive impacts of IRA incentives in the United States; and speedier streamlined renewable energy auctioning in India. Conversely, we have revised down the forecast for Korea because the government’s policy focus has shifted from renewables to nuclear energy , reducing solar PV targets. We have also reined in forecast growth for other markets compared with last year’s outlook: for Spain because renewable energy auctions have been significantly undersubscribed; for Australia due to slow progress in large-scale renewable capacity for hydrogen production and the Expanded Capacity Investment Scheme only being announced towards the end of this report’s development; for Oman because development time frames for large-scale renewable energy projects have been longer than expected, including for green hydrogen; and for multiple ASEAN countries as a result of sustained policy uncertainty as well as overall power supply gluts limiting additional renewable deployment in the short term. China’s substantial upward forecast revision for PV hides slower progress in other countries Overall, China’s forecast has been revised up by 64% thanks to the country’s improved policy environment and the growing economic attractiveness of solar PV and wind systems. For other countries, however, this year’s forecast is almost 7% higher than our December 2022 outlook. China accounts for almost 90% of the global upward forecast revision, consisting mainly of solar PV. In fact, its solar PV manufacturing capabilities have almost doubled since last year, creating a global supply glut. This has reduced local module prices by nearly 50% from January to December 2023, increasing the economic attractiveness of both utility-scale and distributed solar PV projects. 
Thus, even with the phaseout of subsidies, developers have been accelerating the deployment of utility-scale and commercial solar PV applications to meet growing power demand because it is more affordable than investing in new and existing coal- and gas-fired generation. In addition, China’s government has clarified its green certificate rules, providing additional revenues for renewable energy projects. Similar policy improvements also support a higher wind forecast, but longer project lead times, especially for the growing offshore wind market, limits upward revision. PAGE | 18 .0.4 YB CC .AEI"
4,index-04,0.0031,Renewables,"Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 In 2023 fuel prices returned to pre-crisis levels in the United States, China and India but remained elevated in the European Union. As a result, global net support returned to positive value but only 30-40% of the 2020 level. In a scenario assuming continuation of costs decline trend for new renewables and the price environment for fossil fuels based on the second half of 2023, global support for PV and wind power could turn into savings starting in 2027. Even in an analysis approach considering the changing value of VRE for the power system (VALCOE approach), the required global support will decline to around USD 50 billion by 2028, half of the estimated annual average over 2015-2020. This translates to average costs difference between electricity generation from fossil fuel plants and VRE decreasing from close to 70 USD/MWh in 2020 to -3 USD/MWh (savings) in LCOE approach or about 10 USD/MWh in VALCOE approach by 2028. Average global LCOE decreased from USD 105/MWh to USD 35/MWh for onshore wind and from USD 450/MWh to USD 50/MWh for utility-scale PV between 2010 and 2022. Starting from 2019, generation costs for new VRE plants started to become cheaper than existing fossil fuel plants in many countries, especially when fossil fuel generation costs increased drastically at the end of 2021 and in 2022. In the high fossil fuel price environment of 2022, in European Union almost all installed wind capacity and most of utility-scale PV deployed since 2013 had provided cheaper electricity than coal and natural gas plants. Net global support for solar PV and wind electricity generation, total (left) and per MWh of renewable electricity generation (right), 2015-2028 Total support Support per MWh 60 450 400 40 350 20 300 250 0 200 150 -20 100 -40 50 0 -60 -50 -80 -100 PV utility (LCOE) PV utility (VALCOE) PV comm. (LCOE) PV comm. (VALCOE) PV res. (LCOE) PV res. 
(VALCOE) Wind on. (LCOE) Wind on. (VALCOE) Wind off. (LCOE) Wind off. (VALCOE) IEA. CC BY 4.0. Notes: LCOE = levelized cost of electricity. VALCOE = value-adjusted LCOE. Wind on. = Wind onshore. Wind off. = Wind offshore. PV comm. = PV commercial. PV res. = PV residential. Source: IEA analysis based on IRENA, EIA, Argus, Bloomberg LP, World Energy Outlook 2023. PAGE | 52 noillib DSU 5102 6102 7102 8102 9102 0202 1202 2202 3202 4202 5202 6202 7202 8202 hwM/DSU 5102 6102 7102 8102 9102 0202 1202 2202 3202 4202 5202 6202 7202 8202 .0.4 YB CC .AEI"
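
The scores in the table above are all close to 0.003. As far as I understand the OpenSearch k-NN documentation, approximate search with `space_type` `l2` reports `score = 1 / (1 + d)`, where `d` is the squared Euclidean distance between the query vector and the document vector. Under that assumption, the sketch below converts between the two. Both helper names are hypothetical.

```python
def squared_distance_to_l2_score(distance_sq):
    """Score assumed for the l2 space: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + distance_sq)


def l2_score_to_squared_distance(score):
    """Invert the assumed l2 scoring formula."""
    return 1.0 / score - 1.0


# A score of 0.0032 corresponds to a squared distance of roughly 311.
print(round(l2_score_to_squared_distance(0.0032), 1))
```

Read this way, a small score reflects a large distance between unnormalized vectors rather than a failed search, and only the relative ordering of the scores carries meaning.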


The earlier lexical search results are shown below. Compare them with the table above.

In [31]:
print("<<User query>>: ", query)
result_to_dataframe(response_lexical_only)

<<User query>>:  2020년 전세계 에너지 공급량은 얼마인가요?


Unnamed: 0,index_name,score,type,text


## 8. Question & Answer with LangChain

- How to use LangChain's similarity_search_with_score API
    - [API: similarity_search_with_score](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html#langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.similarity_search)


In [32]:
from langchain.chains.question_answering import load_qa_chain

In [33]:
results = vector_db.similarity_search_with_score(
    query=query,
    k=5,
    search_type="approximate_search",
    boolean_filter={
        "bool": {
            "filter": []
        }
    }
)

In [34]:
[res[0].page_content for res in results[:3]]

['Renewables 2023 Chapter 1. Electricity Analysis and forecasts to 2028 Solar PV and wind additions are forecast to more than double by 2028 compared with 2022, continuously breaking records over the forecast period to reach almost 710 GW. At the same time, hydropower and bioenergy capacity additions will be lower than during the last five years as development in emerging economies decelerates, especially in China. Renewables overtake coal in early-2025 to become the largest energy source for electricity generation globally By 2028, potential renewable electricity generation is expected to reach around 14 400 TWh, an increase of almost 70% from 2022. Over the next five years, several renewable energy milestones could be achieved: \uf09f In 2024, variable renewable generation surpasses hydropower. \uf09f In 2025, renewables surpass coal-fired electricity generation. \uf09f In 2025, wind surpasses nuclear electricity generation. \uf09f In 2026, solar PV surpasses nuclear electricity gene

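### Customizable options
Now that the vector store is ready, you can start asking questions.

LangChain provides a wrapper around the vector store that takes the question and feeds the LLM.
Behind the scenes, this wrapper performs the following steps:
- Takes the question as input.
- Generates an embedding for the question.
- Fetches the relevant documents.
- Fills the prompt with the documents and the question.
- Invokes the model with the prompt and generates a human-readable answer.

The scenario above is the quick and easy way to get a context-aware answer to a question. Now let's look at a more customizable option, [RetrievalQA](https://python.langchain.com/en/latest/modules/chains/index_examples/Vector_db_qa.html), which lets you control how documents are retrieved. The `chain_type` parameter determines how the fetched documents are added to the prompt. To control how many relevant documents are retrieved, change the `k` parameter in the cells below and compare the outputs. In many scenarios you will also want to know which source documents the LLM used to generate its answer. Setting `return_source_documents` returns the documents that were added to the context of the LLM prompt. `RetrievalQA` also accepts a model-specific custom [prompt template](https://python.langchain.com/en/latest/modules/prompts/prompt_templates/getting_started.html).

Note: This example uses Anthropic Claude on Amazon Bedrock as the LLM. This model performs best when the input is provided under `Human:` and the model is asked to generate its output after `Assistant:`. The cells below show an example of how to control the prompt so that the LLM stays grounded and does not answer from outside the context.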
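To make the prompt-filling step concrete, here is a minimal sketch of what a "stuff"-style chain does before calling the model: it joins the retrieved chunks and substitutes them, together with the question, into a `Human:`/`Assistant:` template. The helper name `build_stuff_prompt`, the sample chunks, and the blank-line separator between chunks are assumptions for illustration, not LangChain's actual implementation.

```python
from typing import List

# Template in the Human:/Assistant: style described above.
PROMPT_TEMPLATE = (
    "\n\nHuman: Use the following pieces of context to provide a concise answer "
    "to the question at the end.\n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n\n"
    "\n\nAssistant:"
)


def build_stuff_prompt(docs: List[str], question: str) -> str:
    """Hypothetical helper: join all chunks and fill the template in one shot."""
    # Assumption: chunks are separated by a blank line.
    context = "\n\n".join(docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for retrieved documents.
prompt = build_stuff_prompt(
    ["Chunk A about solar PV.", "Chunk B about wind power."],
    "Which technologies are discussed?",
)
print(prompt)
```

If the retriever returns no documents, `{context}` is filled with an empty string and the model has nothing to ground its answer on, so it is worth checking the retrieved list before invoking the chain.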

#### [[REF] Using langchain for Question Answering on Own Data](https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed)

In [35]:
from langchain.schema import BaseRetriever, Document  # Document is used in _get_relevant_documents
from typing import Any, Dict, List, Optional, Tuple
from langchain.callbacks.manager import CallbackManagerForRetrieverRun

# lexical(keyword) search based (using Amazon OpenSearch)
class OpenSearchLexicalSearchRetriever(BaseRetriever):
    os_client: Any
    index_name: str
    k: int = 3
    filter: List[dict] = []

    def normalize_search_results(self, search_results):
        hits = (search_results["hits"]["hits"])
        max_score = float(search_results["hits"]["max_score"])
        for hit in hits:
            hit["_score"] = float(hit["_score"]) / max_score
        search_results["hits"]["max_score"] = hits[0]["_score"]
        search_results["hits"]["hits"] = hits
        return search_results

    def update_search_params(self, **kwargs):
        self.k = kwargs.get("k", 3)
        self.filter = kwargs.get("filter", [])
        self.index_name = kwargs.get("index_name", self.index_name)

    def _reset_search_params(self, ):
        self.k = 3
        self.filter = []
        
    def query_lexical(self, query, filter=None, k=5):
        # Build a bool query: a full-text match on "text" plus optional filter clauses.
        QUERY_TEMPLATE = {
            "size": k,
            "query": {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "text": {
                                    "query": query,
                                    "operator":  "or"
                                }
                            }
                        }
                    ],
                    # Copy the list so the caller's filter (e.g. self.filter) is never
                    # mutated and each clause is added exactly once.
                    "filter": list(filter or [])
                }
            }
        }

        return QUERY_TEMPLATE
    

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
        
        query = self.query_lexical(
            query=query,
            filter=self.filter,
            k=self.k
        )

        # print ("lexical search query: ")
        # print(query)
        
        search_results = self.os_client.search(
            body=query,
            index=self.index_name
        )

        results = []
        if search_results["hits"]["hits"]:
            search_results = self.normalize_search_results(search_results)
            for res in search_results["hits"]["hits"]:

                metadata = res["_source"]["metadata"]
                metadata["id"] = res["_id"]

                doc = Document(
                    page_content=res["_source"]["text"],
                    metadata=metadata
                )
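                results.append(doc)

        # Slice before resetting: the reset below restores k=3 and an empty filter,
        # so values passed to the constructor or update_search_params() apply to one call only.
        top_k = results[:self.k]
        self._reset_search_params()

        return top_k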
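The score normalization used by the retriever above can be checked in isolation. The sketch below applies the same max-score scaling to a made-up response; the function name `normalize_scores` and the sample hits are illustrative assumptions, and only the `hits` / `max_score` / `_score` layout follows the OpenSearch search response used in this notebook.

```python
def normalize_scores(search_results: dict) -> dict:
    """Divide each hit's _score by max_score so the top hit scores 1.0."""
    hits = search_results["hits"]["hits"]
    max_score = float(search_results["hits"]["max_score"])
    for hit in hits:
        hit["_score"] = float(hit["_score"]) / max_score
    # Hits arrive sorted by score, so the first one now holds the new maximum.
    search_results["hits"]["max_score"] = hits[0]["_score"]
    return search_results


# Made-up response with three hits (illustrative values only).
sample = {
    "hits": {
        "max_score": 8.0,
        "hits": [
            {"_id": "a", "_score": 8.0},
            {"_id": "b", "_score": 4.0},
            {"_id": "c", "_score": 2.0},
        ],
    }
}

normalized = normalize_scores(sample)
scores = [hit["_score"] for hit in normalized["hits"]["hits"]]
print(scores)  # [1.0, 0.5, 0.25]
```

Scaling by the maximum keeps the relative ordering intact and puts lexical scores on a 0 to 1 range, which makes them easier to compare with normalized vector-search scores.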


In [36]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# from utils.rag import run_RetrievalQA, show_context_used

In [37]:
prompt_template = """
\n\nHuman: Use the following pieces of context to provide a concise answer to the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

\n\nAssistant:"""


PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

In [38]:
chain = load_qa_chain(
    llm=llm_text,
    chain_type="stuff",
    prompt=PROMPT,
    verbose=True
)

In [39]:
boolean_filter = []
boolean_filter = [
    {"term": {"metadata.source": "국제 신재생에너지 정책변화 및 시장 분석_22-26.pdf"}},
    {"term": {"metadata.type": "국제 신재생에너지 정책변화 및 시장 분석"}},
]

In [40]:
opensearch_lexical_retriever = OpenSearchLexicalSearchRetriever(
    os_client=os_client,
    index_name=index_name,
    k=3,
    filter=boolean_filter
)

In [41]:
answer = chain.invoke(
    {
        "input_documents": opensearch_lexical_retriever.get_relevant_documents(query), 
        "question": query
    }, 
    # return_only_outputs=True
)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:



H: Use the following pieces of context to provide a concise answer to the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.



Question: 2020년 전세계 에너지 공급량은 얼마인가요?



A:
죄송합니다. 2020년 전세계 에너지 공급량에 대한 정확한 수치를 제공할 수 있는 정보가 없습니다. 에너지 통계는 일반적으로 국제기구나 에너지 관련 기관에서 발표하지만, 최신 글로벌 데이터를 갖고 있지 않아 정확한 답변을 드리기 어렵습니다.
> Finished chain.

> Finished chain.


In [42]:
opensearch_semantic_retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 5,
        "boolean_filter": boolean_filter
    }
)

In [43]:
answer = chain.invoke(
    {
        "input_documents": opensearch_semantic_retriever.get_relevant_documents(query), 
        "question": query
    }, 
    # return_only_outputs=True
)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:



H: Use the following pieces of context to provide a concise answer to the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.



Question: 2020년 전세계 에너지 공급량은 얼마인가요?



A:
죄송합니다. 2020년 전세계 에너지 공급량에 대한 정확한 수치를 제공할 수 있는 정보가 없습니다. 에너지 통계는 일반적으로 국제기구나 에너지 관련 기관에서 발표하지만, 최신 글로벌 데이터를 갖고 있지 않아 정확한 답변을 드리기 어렵습니다.
> Finished chain.

> Finished chain.


## 9. OpenSearch Hybrid Search

OpenSearch hybrid search works as follows.
- (1) Run a vector search, then normalize the resulting scores.
    - After normalization, the highest score in the result set becomes 1.0.
- (2) Do the same for the keyword search.
- (3) Re-rank with Reciprocal Rank Fusion (RRF).
    - Paper: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
    - Desc: https://medium.com/@sowmiyajaganathan/hybrid-search-with-re-ranking-ff120c8a426d
    - **RRF uses ranking information rather than scores, so score normalization is not required.**

LangChain provides RRF through its "Ensemble Retriever" API.
- https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble


### Defining the Ensemble retriever
- https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble
- Supports RRF only
- Rank constant (param "c")
    - This value determines how much influence documents in individual result sets per query have over the final ranked result set. A higher value indicates that lower ranked documents have more influence. This value must be greater than or equal to 1. Defaults to 60.
    - In short: the higher the value, the more weight lower-ranked documents receive.
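The following is a minimal sketch of weighted Reciprocal Rank Fusion, assuming the common formulation in which each document receives `weight / (rank + c)` from every ranked list it appears in, with ranks starting at 1. The function name `weighted_rrf` and the document IDs are made up for illustration; this is not LangChain's code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Only the position in each list matters, not the raw search score.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists from a lexical and a semantic retriever.
lexical = ["doc1", "doc2", "doc3"]
semantic = ["doc3", "doc1", "doc4"]

fused = weighted_rrf([lexical, semantic], weights=[0.5, 0.5], c=60)
print(fused)  # ['doc1', 'doc3', 'doc2', 'doc4']
```

Documents that appear in both lists (`doc1`, `doc3`) rise to the top. The constant `c` flattens the curve: with `c=1` a rank-1 hit contributes twice as much as a rank-3 hit, while with `c=60` it contributes only about 3% more.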

In [44]:
from langchain.retrievers import EnsembleRetriever

In [45]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[opensearch_lexical_retriever, opensearch_semantic_retriever],
    weights=[0.5, 0.5],
    c=100,
    k=5
)

In [46]:
%%time
answer = chain.invoke(
    {
        "input_documents": ensemble_retriever.get_relevant_documents(query), 
        "question": query
    }
)

print("##############################")
print("query: \n", query)
print("answer: \n", answer)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:



H: Use the following pieces of context to provide a concise answer to the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.



Question: 2020년 전세계 에너지 공급량은 얼마인가요?



A:
죄송합니다. 2020년 전세계 에너지 공급량에 대한 정확한 수치를 제공할 수 있는 정보가 없습니다. 에너지 통계는 일반적으로 국제기구나 에너지 관련 기관에서 발표하지만, 최신 글로벌 데이터를 갖고 있지 않아 정확한 답변을 드리기 어렵습니다.
> Finished chain.

> Finished chain.
##############################
query: 
 2020년 전세계 에너지 공급량은 얼마인가요?
answer: 
 {'input_documents': [], 'question': '2020년 전세계 에너지 공급량은 얼마인가요?', 'output_text': '죄송합니다. 2020년 전세계 에너지 공급량에 대한 정확한 수치를 제공할 수 있는 정보가 없습니다. 에너지 통계는 일반적으로 국제기구나 에너지 관련 기관에서 발표하지만, 최신 글로벌 데이터를 갖고 있지 않아 정확한 답변을 드리기 어렵습니다.'}
CPU times: user 105 ms, sys: 24.2 ms, total: 130 ms
Wall time: 2.88 s
