# Document Summary Index

This demo showcases the document summary index over Wikipedia articles on different cities.

The document summary index extracts a summary from each document and stores that summary together with all of the nodes that belong to the document.

Retrieval can be performed with an LLM or with embeddings. We first select the documents relevant to the query based on their summaries, then return all nodes that belong to the selected documents.
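
The structure and the two-stage retrieval described above can be sketched in plain Python. This is a toy model, not the LlamaIndex implementation: the `ToySummaryIndex` class, its field names, and the word-overlap scoring are all assumptions made for illustration, with word overlap standing in for the LLM or embedding comparison.

```python
from dataclasses import dataclass, field


@dataclass
class ToySummaryIndex:
    """Hypothetical stand-in for a document summary index."""

    summaries: dict[str, str] = field(default_factory=dict)  # doc_id -> summary
    nodes: dict[str, list[str]] = field(default_factory=dict)  # doc_id -> chunks

    def add(self, doc_id: str, summary: str, chunks: list[str]) -> None:
        # Store one summary per document plus every chunk of that document.
        self.summaries[doc_id] = summary
        self.nodes[doc_id] = list(chunks)

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Stage 1: rank documents by comparing the query to each summary.
        # Word overlap is a placeholder for an LLM or embedding score.
        query_words = set(query.lower().split())

        def overlap(doc_id: str) -> int:
            return len(query_words & set(self.summaries[doc_id].lower().split()))

        ranked = sorted(self.summaries, key=overlap, reverse=True)
        # Stage 2: return every chunk of the selected documents.
        return [chunk for doc_id in ranked[:top_k] for chunk in self.nodes[doc_id]]


index = ToySummaryIndex()
index.add("Toronto", "toronto history sports teams", ["toronto chunk 1", "toronto chunk 2"])
index.add("Boston", "boston history universities", ["boston chunk 1"])
toy_result = index.retrieve("sports teams in toronto")
```

Because stage 2 returns every chunk of a selected document, the number of retrieved nodes depends on the length of that document rather than on the query.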

In [3]:
from dotenv import load_dotenv
load_dotenv()

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# # Uncomment if you want to temporarily disable logger
# logger = logging.getLogger()
# logger.disabled = True

import nest_asyncio

nest_asyncio.apply()

from llama_index.core import SimpleDirectoryReader, get_response_synthesizer
from llama_index.core import DocumentSummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter



## Load Datasets
Load Wikipedia pages on different cities.

In [1]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [3]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

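    # Create the output directory if it does not exist yet
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write as UTF-8 so non-ASCII characters survive on every platform
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)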

In [4]:
# Load all wiki documents
city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)

In [10]:
city_docs[0].doc_id

'Toronto'

## Build Document Summary Index
We show two ways of building the index:

* the default mode of building the document summary index
* customizing the summary query

In [5]:
# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
splitter = SentenceSplitter(chunk_size=1024)

In [6]:
# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    llm=chatgpt,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes: 100%|██████████| 5/5 [00:00<00:00, 26.03it/s]
Summarizing documents:   0%|          | 0/5 [00:00<?, ?it/s]

current doc id: Toronto


Summarizing documents:  20%|██        | 1/5 [00:12<00:49, 12.48s/it]

current doc id: Seattle


Summarizing documents:  40%|████      | 2/5 [00:24<00:36, 12.16s/it]

current doc id: Chicago


Summarizing documents:  60%|██████    | 3/5 [00:39<00:27, 13.68s/it]

current doc id: Boston


Summarizing documents:  80%|████████  | 4/5 [01:55<00:38, 38.08s/it]

current doc id: Houston


Summarizing documents: 100%|██████████| 5/5 [02:06<00:00, 25.37s/it]
Generating embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.59it/s]


In [7]:
doc_summary_index.get_document_summary("Boston")

"The provided text offers a comprehensive overview of the city of Boston, Massachusetts, covering its history, geography, neighborhoods, climate, demographics, economy, education, healthcare, public safety, culture, environment, sports, government, media presence, transportation infrastructure, international relations, and cultural significance in popular media. It discusses Boston's evolution from its indigenous era to its status in the 21st century as an intellectual, technological, and political center. The text delves into various aspects such as Boston's diverse population, strong economy driven by sectors like technology and biotechnology, renowned educational institutions like Harvard and MIT, prominent healthcare facilities associated with universities, cultural landmarks, historic sites related to the American Revolution, art museums, sports teams, government structure, media outlets, transportation system, international relationships, and portrayal in popular culture.\n\nSome

In [8]:
doc_summary_index.storage_context.persist("data")

In [9]:
from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="data")
doc_summary_index = load_index_from_storage(storage_context)

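## Perform Retrieval from Document Summary Index

We show how to execute queries at a high level.

We also show how to perform retrieval at a lower level so that you can view the parameters that are in place.

We show both LLM-based and embedding-based retrieval using the document summaries.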

In [10]:
"""
High-level querying
Note: this uses the default, embedding-based form of retrieval.
"""


query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [13]:
response = query_engine.query("What are the sports teams in Toronto?")

In [14]:
print(response)

The sports teams in Toronto are the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), Toronto Rush (American Ultimate Disc League), and the Toronto Six (National Women's Hockey League).


## LLM-based Retrieval

In [15]:
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
)

retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # choice_select_prompt=None,
    # choice_batch_size=10,
    # choice_top_k=1,
    # format_node_batch_fn=None,
    # parse_choice_select_answer_fn=None,
)

In [16]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [18]:
print(len(retrieved_nodes))

21


In [19]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture and is one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper Canada. During the 
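
The score of `10.0` printed above is a relevance rating produced by the LLM, not a similarity value. As a rough sketch of the idea, assuming the LLM replies with lines shaped like `Doc: 1, Relevance: 10`, the ratings could be parsed and the best documents kept as follows. The `parse_ratings` function and the sample reply are hypothetical; the library's own parsing (configurable through `parse_choice_select_answer_fn`) may differ.

```python
import re


def parse_ratings(answer: str, num_docs: int) -> list[tuple[int, float]]:
    """Parse 'Doc: <n>, Relevance: <score>' lines into (doc_index, score) pairs."""
    pattern = re.compile(r"Doc:\s*(\d+),\s*Relevance:\s*(\d+(?:\.\d+)?)")
    ratings = []
    for line in answer.splitlines():
        match = pattern.search(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        doc_number = int(match.group(1))
        if 1 <= doc_number <= num_docs:  # ignore out-of-range document numbers
            ratings.append((doc_number - 1, float(match.group(2))))
    # Highest-rated documents first.
    return sorted(ratings, key=lambda pair: pair[1], reverse=True)


# Hypothetical LLM reply for a batch of five summaries.
sample_answer = "Doc: 3, Relevance: 4\nDoc: 1, Relevance: 10\nnot a rating line"
top_choices = parse_ratings(sample_answer, num_docs=5)[:1]  # similar in spirit to choice_top_k=1
```

Keeping only the top-rated entry mirrors what a setting such as `choice_top_k=1` is meant to achieve: one document is selected, and all of its nodes are then returned.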

## Embedding-based Retrieval

In [21]:
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

In [22]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [23]:
len(retrieved_nodes)

21

In [24]:
print(retrieved_nodes[0].node.get_text())

Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture and is one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper Canada. During the War o
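
Embedding-based retrieval ranks documents by how close the query embedding is to each summary embedding. A minimal sketch using cosine similarity over hand-made three-dimensional vectors follows. The vectors, the `top_k_documents` helper, and the choice of cosine similarity are assumptions for illustration; real embeddings come from an embedding model, and the vector store decides the similarity metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_vec: list[float], summary_vecs: dict[str, list[float]], k: int = 1
) -> list[str]:
    """Return the ids of the k summaries most similar to the query."""
    scored = sorted(
        summary_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Made-up summary embeddings, one per document.
summary_vectors = {
    "Toronto": [0.9, 0.1, 0.0],
    "Seattle": [0.1, 0.9, 0.0],
    "Boston": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # made-up query embedding
selected = top_k_documents(query_vector, summary_vectors, k=1)
```

Only the summaries are compared with the query here; the nodes of the selected documents are returned afterwards without being scored individually.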

In [25]:
# use retriever as part of a query engine
from llama_index.core.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto are the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), Toronto Rush (American Ultimate Disc League), and the Toronto Six (National Women's Hockey League).
