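# Query Routing 2
- Routing at the retriever/query level
- Typical use case: keep a summarization engine and a general semantic search engine over the document body as separate engines, and dispatch each incoming query to one of them according to what kind of query it is.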
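
The routing idea in the bullets above can be sketched without any library. This is a minimal toy, not how LlamaIndex implements it: the `Tool` class, the word-overlap `keyword_selector`, and the two fake engines are all hypothetical stand-ins, with the overlap score playing the role an LLM selector plays in the notebook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """A query engine plus the description a selector reads (hypothetical)."""
    name: str
    description: str
    run: Callable[[str], str]


def keyword_selector(query: str, tools: List[Tool]) -> int:
    """Stand-in for an LLM selector: score each tool by the number of words
    shared between the query and the tool description."""
    query_words = set(query.lower().split())
    scores = [
        len(query_words & set(tool.description.lower().split()))
        for tool in tools
    ]
    return scores.index(max(scores))


def route(query: str, tools: List[Tool]) -> str:
    """Pick one tool for the query and delegate to it."""
    chosen = tools[keyword_selector(query, tools)]
    return chosen.run(query)


tools = [
    Tool("summary", "summary summarize overview of the whole document",
         lambda q: "[summary engine] " + q),
    Tool("vector", "specific fact detail who what when where",
         lambda q: "[vector engine] " + q),
]

print(route("Give me a summary of the document", tools))
print(route("Who founded Viaweb and when", tools))
```

The point of the sketch is only the shape: descriptions are the routing signal, and the router itself holds no retrieval logic.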

In [None]:
!pip install openai llama_index qdrant_client llama-index-vector-stores-qdrant

Collecting openai
  Downloading openai-1.37.1-py3-none-any.whl.metadata (22 kB)
Collecting llama_index
  Downloading llama_index-0.10.59-py3-none-any.whl.metadata (11 kB)
Collecting qdrant_client
  Downloading qdrant_client-1.10.1-py3-none-any.whl.metadata (10 kB)
Collecting llama-index-vector-stores-qdrant
  Downloading llama_index_vector_stores_qdrant-0.2.14-py3-none-any.whl.metadata (768 bytes)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama_index)
  Downloading llama_index_agent_openai-0.2.9-py3-none-any.whl.metadata (729 bytes)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama_index)
  Downloading llama_index_cli-0.1.13-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core==0.10.59 (from llama_index)
  Downloading llama_index_core-0.10.59-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama_index)
  Downl

In [1]:

import nest_asyncio

nest_asyncio.apply()

In [2]:
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client
from qdrant_client import models
client = qdrant_client.QdrantClient(
    url="",
    api_key="",
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small"
)
Settings.llm= OpenAI(temperature=0,model='gpt-4o-mini')

In [4]:
from llama_index.core import SimpleDirectoryReader

# Load the Paul Graham essay documents
documents = SimpleDirectoryReader("./content/pg").load_data()

In [5]:
# Set the chunk size used by the default node parser
Settings.chunk_size = 1024
nodes = Settings.node_parser.get_nodes_from_documents(documents)
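
To make the effect of a chunk size concrete, here is a simplified sketch of windowed chunking. It is an assumption-laden toy: it counts whitespace-separated words rather than model tokens, the `overlap` parameter is illustrative, and it ignores the sentence boundaries a real node parser would respect.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `chunk_size` whitespace-separated words,
    repeating the last `overlap` words at the start of the next window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks give the vector engine more precise hits, while the summary engine has more pieces to read; the same node list feeds both indexes in this notebook.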

In [6]:
from llama_index.core import StorageContext
vector_store = QdrantVectorStore(client=client, collection_name="routing_exercise")
storage_context = StorageContext.from_defaults(vector_store=vector_store)


In [7]:
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

# SummaryIndex: keeps all nodes in a list and reads them at query time; suited to summarization
summary_index = SummaryIndex(nodes, storage_context=storage_context)

# VectorStoreIndex: stores a dense embedding plus metadata for each node
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

In [8]:

# Use tree_summarize as the response synthesis mode for as_query_engine.
# tree_summarize: summarizes the retrieved chunks hierarchically, as a tree, to produce the final response
list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)
vector_query_engine = vector_index.as_query_engine()
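
The hierarchical summarization described in the comments of this cell can be sketched as follows. This is a sketch under stated assumptions, not the library's code: the fixed `fan_in` grouping and the `fake_summarize` placeholder (which stands in for an LLM call) are hypothetical, and a real synthesizer would group chunks by what fits in the prompt.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Repeatedly summarize groups of `fan_in` texts until one text remains."""
    if not chunks:
        return ""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        # Each pass builds the next, shorter level of the tree.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(texts: List[str]) -> str:
    """Placeholder for an LLM call: wraps the inputs so the tree is visible."""
    return "(" + " + ".join(texts) + ")"


result = tree_summarize(["c1", "c2", "c3", "c4", "c5"], fake_summarize)
print(result)
```

The nesting of parentheses in the printed string mirrors the tree: leaves are merged pairwise, and the partial summaries are merged again until a single root remains.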

In [9]:
from llama_index.core.tools import QueryEngineTool

# Register each index as a query engine tool; the LLM selector reads the description to decide which tool to pick
list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description=(
        "Useful for summarization questions related to Paul Graham essay on"
        " What I Worked On."
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On."
    ),
)

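## Pydantic Selector
Query engine selectors

LLM Selector: has the LLM read the query engine tool descriptions and emit its choice as JSON text; that JSON is then parsed to pick the query engine tool.

Pydantic Selector: uses the OpenAI Function Calling API instead of raw JSON text to perform the selection, so the result comes back as a structured object.

Both are available in single-selection (SingleSelector) and multi-selection (MultiSelector) variants.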
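
The difference between the two selector styles is mostly about how fragile the parsing step is. The sketch below shows the raw-JSON path only, using the standard library. The `Selection` dataclass, the 1-based `choice` key, and `parse_json_selection` are assumptions made for illustration and are not the library's internal format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Selection:
    """One routing decision: which tool (0-based index), and why."""
    index: int
    reason: str


def parse_json_selection(raw: str, num_tools: int) -> List[Selection]:
    """Parse LLM text that is expected to be a JSON list of choices.

    Raises ValueError when the text is malformed or a choice is out of
    range, which is the failure mode structured function calling avoids.
    """
    try:
        items = json.loads(raw)
        selections = [
            Selection(int(item["choice"]) - 1, str(item["reason"]))
            for item in items
        ]
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"malformed selector output: {exc!r}") from exc
    for sel in selections:
        if not 0 <= sel.index < num_tools:
            raise ValueError(f"choice out of range: {sel.index + 1}")
    return selections


good = '[{"choice": 2, "reason": "asks for a specific fact"}]'
print(parse_json_selection(good, num_tools=2))

bad = "Sure! I pick choice 2."
try:
    parse_json_selection(bad, num_tools=2)
except ValueError as exc:
    print("rejected:", exc)
```

With function calling, the model is constrained to return arguments matching a schema, so the chatty `bad` case above is far less likely to reach the parser at all.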

In [10]:

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.selectors import (
    PydanticMultiSelector,
    PydanticSingleSelector,
)

# Wrap the query engine tools in a RouterQueryEngine, the top-level engine that dispatches to them
query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)

In [11]:
# A summarization question
response = query_engine.query("What is the summary of the document?")
print(str(response))

The document is a personal essay detailing the author's journey through writing, programming, and the evolution of their career, particularly in the fields of artificial intelligence, software development, and entrepreneurship. It begins with the author's early experiences with writing short stories and programming on early computers, leading to a shift from philosophy to artificial intelligence in college. The narrative continues through their graduate studies, the realization of the limitations of AI at the time, and a pivot towards Lisp programming.

The author recounts their transition from academia to the tech industry, including the founding of Viaweb, an early e-commerce platform, and its eventual acquisition by Yahoo. The essay reflects on the challenges and lessons learned during this period, including the importance of growth rates in startups and the dynamics of venture capital.

After a period of personal reflection and a desire to return to painting, the author co-founded 

In [12]:
# Check which query engine the single selector chose
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=0, reason='The question asks for a summary of the document, which aligns with the first choice that is specifically useful for summarization.')]
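
The printed result above exposes a `selections` list whose items carry `index` and `reason`. A small helper can map those indexes back to tool names for easier reading. This is a sketch: `describe_selections` is a hypothetical helper, and the `SimpleNamespace` object only imitates the attribute shape seen in the printout so that the example runs without the library.

```python
from types import SimpleNamespace
from typing import List, Sequence


def describe_selections(selector_result, tool_names: Sequence[str]) -> List[str]:
    """Turn a selector result into readable 'tool: reason' lines."""
    return [
        f"{tool_names[sel.index]}: {sel.reason}"
        for sel in selector_result.selections
    ]


# Stand-in object with the same attribute shape as the printed result.
fake_result = SimpleNamespace(
    selections=[SimpleNamespace(index=0, reason="asks for a summary")]
)
lines = describe_selections(fake_result, ["list_tool", "vector_tool"])
print(lines)
```

In the notebook, the same helper could presumably be called with `response.metadata["selector_result"]` and the tool names in the order they were passed to the router.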


In [13]:
# Now try a specific question (the query misspells RISD as "RICS"; the answer below still refers to RISD)
response = query_engine.query("What did Paul Graham do after RICS?")
print(str(response))

After RISD, Paul Graham dropped out and moved to New York City, where he lived in a rent-controlled apartment and identified as a New York artist. He was concerned about money due to the decline of Interleaf and the rarity of freelance Lisp hacking work. To address this, he decided to write another book on Lisp, aiming for it to be popular and potentially used as a textbook. During this time, he also became the de facto studio assistant for Idelle Weber, a painter he had previously studied under.


In [14]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=1, reason="The question asks for specific context regarding Paul Graham's actions after RICS, which aligns with retrieving specific information from the essay.")]


In [15]:
# LLM single selector
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)

In [16]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

The document is a personal essay detailing the author's journey through writing, programming, and the evolution of their career, particularly in the fields of artificial intelligence, software development, and entrepreneurship. It begins with the author's early experiences with writing short stories and programming on early computers, leading to a shift from philosophy to artificial intelligence in college. The narrative continues through their graduate studies, the realization of the limitations of AI at the time, and a pivot towards Lisp programming.

The author recounts their transition from academia to the tech industry, including the founding of Viaweb, an early e-commerce platform, which was later acquired by Yahoo. The essay reflects on the challenges and lessons learned during this period, including the importance of growth rates in startups and the dynamics of venture capital.

After leaving Yahoo, the author explores painting and art, eventually returning to technology and co

In [17]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=0, reason='The question asks for a summary of the document, which aligns with the purpose of choice 1 that is useful for summarization.')]


In [18]:
response = query_engine.query("What did Paul Graham do after RICS?")
print(str(response))

After RISD, Paul Graham dropped out and moved to New York City, where he lived in a rent-controlled apartment and identified as a New York artist. He was concerned about money due to the decline of Interleaf and the rarity of freelance Lisp hacking work. To address this, he decided to write another book on Lisp, aiming for it to be popular and potentially used as a textbook. During this time, he also became the de facto studio assistant for Idelle Weber, a painter he had previously studied under.


In [19]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=1, reason="The question asks for specific context regarding Paul Graham's actions after RICS, which aligns with retrieving specific information from the essay.")]


- Multi selector (for queries that need to retrieve from several indexes)
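
With a multi selector, more than one engine may answer the same query, so the router needs a merge step. The sketch below shows that control flow only. All names are hypothetical: `select_specific` stands in for the selector, and a plain string join stands in for whatever summarizer combines the answers.

```python
from typing import Callable, Dict, List


def multi_route(
    query: str,
    engines: Dict[str, Callable[[str], str]],
    select: Callable[[str], List[str]],
    combine: Callable[[List[str]], str],
) -> str:
    """Run every selected engine and merge their answers into one response."""
    chosen = select(query)
    if not chosen:
        raise ValueError("selector returned no engines")
    answers = [engines[name](query) for name in chosen]
    # A single answer needs no merging step.
    return answers[0] if len(answers) == 1 else combine(answers)


engines = {
    "summary": lambda q: "summary answer",
    "vector": lambda q: "vector answer",
    "keyword": lambda q: "keyword answer",
}


def select_specific(query: str) -> List[str]:
    """Hypothetical selector: fact-style queries go to both retrieval engines."""
    return ["vector", "keyword"]


merged = multi_route("Who ran Interleaf?", engines, select_specific, " | ".join)
print(merged)
```

Each extra selection costs one more engine call plus the merge, which is the trade-off against the broader coverage a multi selector provides.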

In [20]:
from llama_index.core import SimpleKeywordTableIndex

# In addition to the summary and vector engines, define a keyword-based retrieval engine
keyword_index = SimpleKeywordTableIndex(nodes, storage_context=storage_context)
keyword_query_engine = keyword_index.as_query_engine()

keyword_tool = QueryEngineTool.from_defaults(
    # Engine backed by the keyword table index
    query_engine=keyword_query_engine,
    description=(
        "Useful for retrieving specific context using keywords from Paul"
        " Graham essay on What I Worked On."
    ),
)
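
For intuition about what a keyword table index does, here is a minimal inverted-index sketch. It is not the library's implementation: the regex tokenizer, the tiny `STOPWORDS` set, and the match-count ranking are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Tiny illustrative stopword list (an assumption, not a real lexicon).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "was", "is"}


def extract_keywords(text: str) -> Set[str]:
    """Lowercase alphabetic tokens minus the stopword list."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def build_keyword_table(nodes: List[str]) -> Dict[str, Set[int]]:
    """Map each keyword to the positions of the nodes that contain it."""
    table: Dict[str, Set[int]] = defaultdict(set)
    for position, text in enumerate(nodes):
        for keyword in extract_keywords(text):
            table[keyword].add(position)
    return table


def keyword_retrieve(query: str, table: Dict[str, Set[int]]) -> List[int]:
    """Rank node positions by how many query keywords they match."""
    hits: Dict[int, int] = defaultdict(int)
    for keyword in extract_keywords(query):
        for position in table.get(keyword, ()):
            hits[position] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))


nodes = [
    "Viaweb was an early e-commerce platform",
    "Interleaf made software for creating documents",
    "Yahoo acquired Viaweb",
]
table = build_keyword_table(nodes)
print(keyword_retrieve("Who acquired Viaweb?", table))
```

Unlike the vector engine, this kind of lookup matches exact terms only, which is why the two tools can usefully be selected together for the same query.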

In [21]:
query_engine = RouterQueryEngine(
    selector=PydanticMultiSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
        keyword_tool,
    ],
)

In [22]:
# Check whether both tools with similar descriptions (vector tool and keyword tool) get selected
response = query_engine.query(
    "What were notable events and people from the author's time at Interleaf"
    " and YC?"
)
print(str(response))

During the author's time at YC, several notable events and individuals emerged. A significant milestone was the establishment of the Summer Founders Program, which attracted 225 applications and resulted in the selection of 8 startups, including Reddit, Twitch founders Justin Kan and Emmett Shear, Aaron Swartz, and Sam Altman, who later became the second president of YC. This program aimed to provide undergraduates with the opportunity to start their own companies during the summer, proving to be a successful model for funding startups in batches.

The author also noted the transition of YC into a fund in 2009 due to its growth, although it later reverted to being self-funded following the acquisition of Heroku. The importance of community among startups was highlighted, with alumni actively supporting current batches, fostering a collaborative environment. The humorous concept of "YC GDP" illustrated how startups within the same batch often became customers of one another, showcasing 

In [23]:

print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=1, reason="This choice is useful for retrieving specific context related to notable events and people from the author's time at Interleaf and YC."), SingleSelection(index=2, reason='This choice allows for retrieving specific context using keywords, which can help in identifying notable events and people.')]


# Do It Yourself

In [None]:
!pip install datasets



In [24]:
# Load the dataset
from datasets import load_dataset

ds = load_dataset("HAERAE-HUB/KOREAN-WEBTEXT", split='train[:20]')
data = ds.to_pandas()


In [25]:
# Convert to Document objects
from llama_index.core import Document, VectorStoreIndex
docs = []

# Build one Document per DataFrame row
for _, row in data.iterrows():
    docs.append(Document(
        text=row['text'],
        # extra_info={'title': row ['title']}
    ))

In [26]:
Settings.chunk_size = 1024

# Split the documents into nodes
nodes = Settings.node_parser.get_nodes_from_documents(docs)

In [27]:
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.core import SimpleKeywordTableIndex
# Vector store backed by a Qdrant collection
vector_store = QdrantVectorStore(client=client, collection_name="routing_exercise2")

# Use the Qdrant vector store as the backend of the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [28]:
# SummaryIndex: keeps all nodes in a list and reads them at query time; suited to summarization
summary_index = SummaryIndex(nodes, storage_context=storage_context)

# VectorStoreIndex: stores a dense embedding plus metadata for each node
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

# Keyword index
keyword_index = SimpleKeywordTableIndex(nodes, storage_context=storage_context)

In [29]:
list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)
vector_query_engine = vector_index.as_query_engine()

keyword_index_query_engine = keyword_index.as_query_engine()

In [30]:
from llama_index.core.tools import QueryEngineTool


# Register each index as a query engine tool; the LLM selector reads the description to decide which tool to pick
list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description=(
        "Useful for summarization questions"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context"
    ),
)


keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index_query_engine,
    description=(
        "Useful for retrieving specific context using keywords"
    ),
)

In [31]:

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.selectors import (
    PydanticMultiSelector,
    PydanticSingleSelector,
)
# Bundle the tools under a RouterQueryEngine with a multi selector
query_engine = RouterQueryEngine(
    selector=PydanticMultiSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
        keyword_tool,
    ],
)

In [32]:
ds.to_pandas()

Unnamed: 0,text,source,token_count,__index_level_0__
0,사이트의 판매량에 기반하여 판매량 추이를 반영한 인터파크 도서에서의 독립적인 판매 ...,oscar2201,3348,0
1,“아~아~잊으랴 어찌 우리 이날을 조국의 원수들이 짓밟아 오던 날을~”6·25의 노...,oscar2201,1427,1
2,일러전쟁의 승패를 가른 쓰시마 해전은 세계 최강으로 평가 받는 발틱함대를 괴멸시켰다...,oscar2201,2458,2
3,"재테크 채널 유튜버이자, 「빚부터 갚아라」, 「원트재무설계 소원을 말해봐」 저자인 ...",oscar2201,2838,3
4,"상급자의 범죄와 비리, 부패를 하급자에게 돌리는 것으로 따지면 타의추종을 불허하는 ...",oscar2201,1628,4
5,최근 언론보도에 의하면 이재현 CJ그룹 회장이 지난해 말 두 자녀에게 증여하였던 주...,oscar2201,1366,5
6,"나는 노무현의 시대를 살지 않았다. 그러니까, 나는 이 땅의 생명체로 살아있긴 했지...",oscar2201,2017,6
7,CBRE가 21일 발표한 ‘2021년 2분기 국내 상업용 부동산 시장 보고서’에 따...,oscar2201,1421,7
8,"안녕하세요. 한화솔루션입니다. 지난주, 슬기로운 솔루션 직장생활 2탄에 이어 이번엔...",oscar2201,2143,8
9,캐나다는 3 년 연속 지구상에서 가장 주목할만한 국가로 선포되었습니다. 일반 타이틀...,oscar2201,1104,9


In [33]:
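# Ask a summarization question (Korean for: "Summarize why Canada was declared the most notable country on Earth for three years in a row")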
response = query_engine.query(
    "캐나다가 3년 연속 지구상에서 가장 주목할만한 국가로 선포된 이유에 대해서 요약해봐"
)
print(str(response))

캐나다가 3년 연속 지구상에서 가장 주목할 만한 국가로 선포된 이유는 우수한 교육 프로그램, 경이로운 자연, 다문화 사회, 그리고 저렴한 생활비를 제공하기 때문입니다. 이러한 요소들은 국제 학생들에게 매력적인 유학 목적지로 자리 잡게 하였으며, 캐나다의 교육 시스템은 안전하고 개방적이며 관용적인 환경을 갖추고 있습니다.


In [34]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=0, reason='The question asks for a summary of why Canada has been declared the most remarkable country for three consecutive years.')]


In [35]:
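# Ask a factual question (Korean for: "What is the name of the war in which the Baltic Fleet was destroyed?")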
response = query_engine.query(
    "발틱함대가 격파된 전쟁 이름이 뭐지?"
)
print(str(response))

일러전쟁입니다.


In [36]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=1, reason='The question asks for a specific historical event, which requires retrieving specific context.'), SingleSelection(index=2, reason='The question can be answered by retrieving specific context using keywords related to the Baltic Fleet and the war.')]


In [37]:
# Ask a terse keyword-style question (Korean for: "number of Nakdong River dams")
response = query_engine.query(
    "낙동강 댐 개수"
)
print(str(response))

낙동강에 세워진 댐은 8개입니다.


In [38]:
print(str(response.metadata["selector_result"]))

selections=[SingleSelection(index=1, reason='The question asks for specific information about the number of 낙동강 dams, which requires retrieving specific context.'), SingleSelection(index=2, reason='The question can also be addressed by retrieving specific context using keywords related to 낙동강 dams.')]


In [39]:
response

Response(response='낙동강에 세워진 댐은 8개입니다.', source_nodes=[NodeWithScore(node=TextNode(id_='1c88eddd-b53b-4c62-a363-9b929a07491d', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='8ab634b1-fa01-47f8-9dab-2f7727d4e921', node_type='4', metadata={}, hash='80833157745dc9b73a84e6fab31f05117a01a09ea5918cc87cceaa8aec33e3d4'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='36a31e48-c63a-48d3-9829-f539e605be3f', node_type='1', metadata={}, hash='779114c74adfef0b4672a2fcac8f70e04b06033b90b718932b014c31a10e4ca0')}, metadata_template='{key}: {value}', metadata_separator='\n', text='낙동강에 세워진 8개의 댐(보) 중 5개의 보에서 물이 샙니다. 상주댐에서 물이 새는 것을 확인한 뒤 다른 댐들도 확인해보니 아니나 다를까 물이 새고 있었던 것입니다! 어처구니가 없어도 이럴수가 있습니까! 나라를 잘 다스려 달라고 모은 세금을, 70%의 반대에도 불구하고 서둘러 추진하더니, 완공을 바로 앞둔 시점에 70%가 부실이라니요! 물이 새는 댐은 상주댐, 구미댐, 강정고령댐(전 강정댐), 합천창녕댐(전 합천댐), 창녕함안댐(전 함안댐) 등 입니다. 그 뿐 아니라 구미댐은 용꼬리 구조물(날개벽)이 내려앉았고, 칠곡댐도 댐 앞의 구조물