# Document Summary Index

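This demo showcases the document summary index. It was written for Wikipedia articles on different cities, which is what the Toronto queries and outputs in the retrieval sections refer to; the dataset loaded and summarized in this copy is `visa_kor_docs.json`, a set of Korean-language documents on the E-9 non-professional employment visa.

The document summary index extracts a summary from each document and stores that summary together with all nodes that belong to the document.

Retrieval can be performed through the LLM or through embeddings. The retriever first selects the documents relevant to the query based on their summaries, then returns all nodes that belong to the selected documents.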
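
The two-stage lookup described above can be sketched without any library. This is a toy illustration, not the LlamaIndex implementation: `ToySummaryIndex` is a hypothetical class, and word overlap stands in for the LLM or embedding scoring that the real index uses.

```python
from dataclasses import dataclass, field


@dataclass
class ToySummaryIndex:
    """Toy stand-in for a document summary index (not the LlamaIndex class)."""

    summaries: dict[str, str] = field(default_factory=dict)    # doc_id -> summary
    nodes: dict[str, list[str]] = field(default_factory=dict)  # doc_id -> node texts

    def add(self, doc_id: str, summary: str, node_texts: list[str]) -> None:
        self.summaries[doc_id] = summary
        self.nodes[doc_id] = list(node_texts)

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Stage 1: rank documents by how many query words their summary contains.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(summary.lower().split())), doc_id)
            for doc_id, summary in self.summaries.items()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected = [doc_id for score, doc_id in scored[:top_k] if score > 0]
        # Stage 2: return every node of each selected document.
        return [text for doc_id in selected for text in self.nodes[doc_id]]


index = ToySummaryIndex()
index.add("visa", "visa issuance and stay extension rules", ["visa node 1", "visa node 2"])
index.add("sports", "professional sports teams and leagues", ["sports node 1"])
print(index.retrieve("which sports teams play here"))  # ['sports node 1']
```

The point of the design is that scoring only ever touches the short summaries, while the answer is synthesized from the full set of nodes behind each selected summary.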

In [1]:
from dotenv import load_dotenv
load_dotenv()

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# # Uncomment if you want to temporarily disable logger
# logger = logging.getLogger()
# logger.disabled = True

import nest_asyncio

nest_asyncio.apply()

from llama_index.core import SimpleDirectoryReader, get_response_synthesizer
from llama_index.core import DocumentSummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter



## Load the dataset

Load `visa_kor_docs.json`.

In [20]:
import json

# The file contains Korean text, so read it explicitly as UTF-8.
with open("data/visa_kor_docs.json", "r", encoding="utf-8") as fp:
    visa_kor_docs = json.load(fp)

In [19]:
# from llama_index.readers.json import JSONReader

# json_reader = JSONReader()
# docs = json_reader.load_data(data)

In [26]:
from llama_index.core import Document

docs = [Document(text=item["page_content"], metadata=item["metadata"]) for item in visa_kor_docs]
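
The list comprehension above assumes every record carries both a `page_content` string and a `metadata` dict, and a record missing either key would raise a `KeyError`. A minimal sketch of a pre-check follows; `validate_records` is a hypothetical helper, not part of LlamaIndex, and the sample records are made up for illustration.

```python
def validate_records(records):
    """Return (valid, problems) for a list of loaded JSON records."""
    valid, problems = [], []
    for position, item in enumerate(records):
        if not isinstance(item, dict):
            problems.append((position, "record is not an object"))
        elif not isinstance(item.get("page_content"), str):
            problems.append((position, "missing or non-string 'page_content'"))
        elif not isinstance(item.get("metadata"), dict):
            problems.append((position, "missing or non-dict 'metadata'"))
        else:
            valid.append(item)
    return valid, problems


# Made-up sample records; the second one has no metadata.
sample = [
    {"page_content": "E-9 visa overview", "metadata": {"page": 1}},
    {"page_content": "no metadata here"},
]
valid, problems = validate_records(sample)
print(len(valid), problems)  # 1 [(1, "missing or non-dict 'metadata'")]
```

Running `validate_records(visa_kor_docs)` before building `docs` would surface malformed records by position instead of failing partway through the comprehension.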

In [31]:
# print(len(docs))
docs[0]

Document(id_='bb1582a8-fe36-4988-870c-cf7a0f9a259b', embedding=None, metadata={'page': 1, 'total_pages': 19, 'Title': 'E-9 비전문취업비자: 목차'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='\n        # E-9: 고용허가제 비전문취업 비자\n\n        고용허가제\n            고용허가제 해당자와 활동범위\n        고용허가제 해당자에 대한 사증 발급\n            사증발급 후 1회 체류기간의 상한\n            사증발급 허용업종 및 체류자격 약호(기호): E-9 근로자를 채용할 수 있는 업종과 범위\n            사증발급인정서를 받아야만 E-9 사증을 발급받을 수 있음\n            2012.8.1부터 범죄경력증명서 및 건강상태확인서 제출\n            비전문취업(E-9) 자격 사증발급인정서 발급 절차\n            재입국특례 제도(구 성실근로자 제도)\n            재입국특례자에 대한 사증 신청 및 발급 방법\n            재입국특례자에 대한 우대 내용\n            고용허가제 해당자의 근무처(직장)의 변경\n            고용허가제 해당자의 근무처(직장)를 변경할 수 있는 조건\n            고용허가제 해당자의 근무처(직장)를 변경하는 절차 및 제출 서류\n            고용허가제 농업 분야 외국인근로자의 근무처(직장) 추가\n            고용허가제 해당자의 체류자격 변경허가 - 사증 변경\n            고용허가제 해당자의 체류기간 연장 허가 - 사증 연장\n            고용허가제 해당자의 재입국허가\n            고용허가제 해당자의 외국인등록\n            고용허가제 해

## Document summary index
We show two ways of building the index:

* the default mode of building the document summary index
* customizing the summary query

In [3]:
# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
splitter = SentenceSplitter(chunk_size=1024)

In [29]:
# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    docs,
    llm=chatgpt,
    # transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes: 100%|██████████| 20/20 [00:00<00:00, 1158.01it/s]
Summarizing documents:   0%|          | 0/20 [00:00<?, ?it/s]

current doc id: bb1582a8-fe36-4988-870c-cf7a0f9a259b


Summarizing documents:   5%|▌         | 1/20 [00:03<00:58,  3.05s/it]

current doc id: e76dd4e2-02e2-425f-87ab-f7e2ee5f04d3


Summarizing documents:  10%|█         | 2/20 [00:05<00:46,  2.60s/it]

current doc id: f67ba4b0-503f-4821-88ed-1a01d04b0560


Summarizing documents:  15%|█▌        | 3/20 [00:08<00:46,  2.71s/it]

current doc id: 12cd24de-f878-4790-9692-779377c23258


Summarizing documents:  20%|██        | 4/20 [00:09<00:37,  2.33s/it]

current doc id: 9753654f-d4f0-4b7a-9bb9-906cae4f9e9f


Summarizing documents:  25%|██▌       | 5/20 [00:14<00:46,  3.13s/it]

current doc id: b8393b74-f834-4a00-a2ac-4887421030ba


Summarizing documents:  30%|███       | 6/20 [00:17<00:43,  3.13s/it]

current doc id: efe3011f-ac10-4076-8a4c-bc16dfad61fa


Summarizing documents:  35%|███▌      | 7/20 [00:21<00:43,  3.31s/it]

current doc id: 8083055a-e349-406b-9c82-fb4077a11601


Summarizing documents:  40%|████      | 8/20 [00:24<00:38,  3.17s/it]

current doc id: 70e0eb78-8758-4237-b4be-b5e59760c9cd


Summarizing documents:  45%|████▌     | 9/20 [00:27<00:34,  3.14s/it]

current doc id: b105abd5-3742-45f5-a9df-d777014095cd


Summarizing documents:  50%|█████     | 10/20 [00:30<00:32,  3.27s/it]

current doc id: 59e0e193-e090-4715-86ce-1ffd7d664926


Summarizing documents:  55%|█████▌    | 11/20 [00:33<00:28,  3.12s/it]

current doc id: f56bfa77-21b4-4876-8170-940adcbfa51e


Summarizing documents:  60%|██████    | 12/20 [00:37<00:27,  3.46s/it]

current doc id: b9ce6414-8fd9-4a4c-b5a7-7525d6eefcf3


Summarizing documents:  65%|██████▌   | 13/20 [00:41<00:24,  3.49s/it]

current doc id: 6c02d60d-7e8f-4c04-959a-5436d1c9e88b


Summarizing documents:  70%|███████   | 14/20 [00:44<00:20,  3.41s/it]

current doc id: 7a0d4dc9-45d2-4ee6-8e58-73bb594e4889


Summarizing documents:  75%|███████▌  | 15/20 [00:48<00:17,  3.41s/it]

current doc id: 0fc8113e-98cc-4075-bdd3-53a0d2e77f81


Summarizing documents:  80%|████████  | 16/20 [00:51<00:13,  3.35s/it]

current doc id: a4118b37-7861-47df-8692-a5c5d99b41f7


Summarizing documents:  85%|████████▌ | 17/20 [00:54<00:09,  3.27s/it]

current doc id: f6f354fa-b442-4134-a1c1-44aa4fdd3cf8


Summarizing documents:  90%|█████████ | 18/20 [00:59<00:07,  3.72s/it]

current doc id: 4b797200-b67a-494e-bbe6-830dd9372e17


Summarizing documents:  95%|█████████▌| 19/20 [01:02<00:03,  3.54s/it]

current doc id: 845b3703-e8cf-4409-ae8e-184a093e7f40


Summarizing documents: 100%|██████████| 20/20 [01:05<00:00,  3.27s/it]
Generating embeddings: 100%|██████████| 20/20 [00:00<00:00, 20.07it/s]
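
The build step above uses `response_mode="tree_summarize"`. As a rough intuition for that mode, the sketch below reduces a list of text chunks level by level until a single text remains. It is a simplified illustration under stated assumptions, not the library's code: `tree_summarize`, `fan_out`, and `fake_summarize` are hypothetical names, and the fake summarizer only concatenates its inputs where a real one would call the LLM.

```python
def tree_summarize(chunks, summarize, fan_out=2):
    """Repeatedly summarize groups of `fan_out` texts until one text remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        # Each pass replaces every group of `fan_out` texts with one summary.
        level = [
            summarize(level[start:start + fan_out])
            for start in range(0, len(level), fan_out)
        ]
    return level[0]


def fake_summarize(texts):
    # Stand-in for an LLM call: wrap the joined inputs in brackets.
    return "[" + " + ".join(texts) + "]"


print(tree_summarize(["a", "b", "c"], fake_summarize))  # [[a + b] + [c]]
```

Note that a single chunk is returned unchanged in this sketch, since there is nothing to combine.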


In [32]:
doc_summary_index.get_document_summary("bb1582a8-fe36-4988-870c-cf7a0f9a259b")

'The provided text is about the E-9 non-professional employment visa in South Korea. It covers various aspects related to the visa, such as the issuance of work permits, documentation requirements, special re-entry provisions, changing workplaces, extending stay permits, foreign worker registration, and reporting changes in employment status. \n\nSome questions that this text can answer include:\n- What are the eligibility criteria for obtaining an E-9 visa?\n- What documents are required for the issuance of a work permit under the E-9 visa category?\n- How can a foreign worker apply for a permit extension or change in employment status under the E-9 visa?\n- What are the special provisions for re-entry for certain E-9 visa holders?\n- How is the registration process for foreign workers conducted under the E-9 visa category?'

In [8]:
doc_summary_index.storage_context.persist("index")

In [9]:
from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index")
doc_summary_index = load_index_from_storage(storage_context)
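
Since summarizing the documents takes about a minute and costs LLM calls, the persist and reload cells above are often combined into a "load if present, otherwise build" pattern. The sketch below shows that pattern with stub callables; `load_or_build` is a hypothetical helper, and in this notebook `build` would wrap `DocumentSummaryIndex.from_documents(...)` plus `storage_context.persist(...)`, while `load` would wrap `StorageContext.from_defaults(...)` plus `load_index_from_storage(...)`.

```python
import tempfile
from pathlib import Path


def load_or_build(persist_dir, build, load):
    """Call `load` if `persist_dir` already holds files, otherwise call `build`."""
    directory = Path(persist_dir)
    if directory.is_dir() and any(directory.iterdir()):
        return load(persist_dir)
    return build(persist_dir)


with tempfile.TemporaryDirectory() as tmp:
    def build(path):
        # Stub: a real build would create the index and persist it to `path`.
        Path(path, "marker.json").write_text("{}")
        return "built"

    def load(path):
        # Stub: a real load would rebuild the storage context from `path`.
        return "loaded"

    print(load_or_build(tmp, build, load))  # built  (directory was empty)
    print(load_or_build(tmp, build, load))  # loaded (directory now has a file)
```

The check only looks for a non-empty directory; it does not verify that the files inside form a complete or compatible index.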

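## Perform retrieval from the document summary index

We show how to run queries at a high level.

We also show how to perform retrieval at a lower level, so that you can see which parameters are available.

We show both LLM-based retrieval and embedding-based retrieval using the document summaries.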

In [10]:
"""
High-level querying
Note: this uses the default embedding-based form of retrieval.
"""


query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [13]:
response = query_engine.query("What are the sports teams in Toronto?")

In [14]:
print(response)

The sports teams in Toronto are the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), Toronto Rush (American Ultimate Disc League), and the Toronto Six (National Women's Hockey League).


## LLM-based retrieval

In [15]:
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
)

retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # choice_select_prompt=None,
    # choice_batch_size=10,
    # choice_top_k=1,
    # format_node_batch_fn=None,
    # parse_choice_select_answer_fn=None,
)
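
The commented-out `choice_batch_size` and `choice_top_k` parameters hint at how LLM-based selection works: summaries are shown to the LLM in batches, each gets a relevance score, and the best ones are kept. The sketch below illustrates that idea only. `llm_style_select` and `fake_score_batch` are hypothetical, the scorer is a keyword check instead of an LLM call, and how the library itself applies `choice_top_k` (per batch or overall) is not shown in this notebook; this sketch applies it once over all batches.

```python
def llm_style_select(summaries, score_batch, batch_size=10, top_k=1):
    """Score summaries batch by batch, then keep the `top_k` best doc ids."""
    doc_ids = list(summaries)
    scored = []
    for start in range(0, len(doc_ids), batch_size):
        batch = {doc_id: summaries[doc_id] for doc_id in doc_ids[start:start + batch_size]}
        # `score_batch` stands in for one LLM call; it returns {doc_id: relevance}.
        scored.extend(score_batch(batch).items())
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def fake_score_batch(batch):
    # Hypothetical scorer: 10.0 if the summary mentions "visa", else 1.0.
    return {doc_id: (10.0 if "visa" in text else 1.0) for doc_id, text in batch.items()}


summaries = {"a": "city history", "b": "visa issuance rules", "c": "local sports"}
print(llm_style_select(summaries, fake_score_batch, batch_size=2, top_k=1))  # [('b', 10.0)]
```

Batching keeps each prompt within the context window at the cost of one LLM call per batch, which is why this retriever is slower than the embedding-based one.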

In [16]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [18]:
print(len(retrieved_nodes))

21


In [19]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture and is one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper Canada. During the 

## Embedding-based retrieval

In [21]:
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)
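
Embedding-based selection compares the query vector with one vector per document summary and keeps the `similarity_top_k` closest summaries. The sketch below shows that ranking with plain cosine similarity; the function names and the two-dimensional vectors are made up for illustration, and real embeddings would come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_summaries(query_vec, summary_vecs, k=1):
    """Return the `k` doc ids whose summary vectors are closest to the query."""
    ranked = sorted(
        summary_vecs,
        key=lambda doc_id: cosine(query_vec, summary_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up summary embeddings, one per document.
summary_vecs = {"visa": [1.0, 0.0], "sports": [0.0, 1.0], "mixed": [1.0, 1.0]}
print(top_k_summaries([0.9, 0.1], summary_vecs, k=2))  # ['visa', 'mixed']
```

As with the LLM-based retriever, the ranking is over summaries only; every node of each selected document is then returned.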

In [22]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [23]:
len(retrieved_nodes)

21

In [24]:
print(retrieved_nodes[0].node.get_text())

Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture and is one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper Canada. During the War o

In [25]:
# use retriever as part of a query engine
from llama_index.core.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto are the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), Toronto Rush (American Ultimate Disc League), and the Toronto Six (National Women's Hockey League).
