### Retrieval

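**The five stages of RAG**
1. **Document Loader**: load the documents.
2. **Document Transformer**: split the documents into chunks.
3. **Embedding**: convert the text into numeric vectors.
4. **Vector Store**: put the vectors into a store.
5. **Retrieval**: search the store and pass the results to the LLM.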
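
The five stages can be traced end to end in a toy pipeline. The sketch below is not LangChain code. It assumes a hard-coded string in place of a loader, sentence splitting in place of a transformer, word counts in place of a learned embedding, and a plain list in place of a vector store. Every function name in it is hypothetical.

```python
import re
from collections import Counter

# Step 1 (loader): a hypothetical in-memory string stands in for a loaded document.
RAW_TEXT = (
    "Tom Sawyer loves adventures. "
    "Huck Finn is his friend. "
    "Injun Joe is a dangerous man."
)

def split_sentences(text):
    """Step 2: split the document into small chunks (here, sentences)."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

def embed(text):
    """Step 3: toy 'embedding' as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_store(chunks):
    """Step 4: keep (vector, chunk) pairs in a plain list."""
    return [(embed(chunk), chunk) for chunk in chunks]

def retrieve(store, query, k=1):
    """Step 5: rank chunks by how many words they share with the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: sum((item[0] & q).values()), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

chunks = split_sentences(RAW_TEXT)
store = build_store(chunks)
top = retrieve(store, "Who is Huck Finn?")
print(top)
```

The retrieved chunk is what would be handed to the LLM as context. The cells that follow replace each toy stage with a real component.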

In [1]:
%pip install langchain-community pypdf faiss-cpu sentence-transformers

Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface_hub 1.2.3
    Uninstalling huggingface_hub-1.2.3:
      Successfully uninstalled huggingface_hub-1.2.3
Successfully installed huggingface-hub-0.36.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
from dotenv import load_dotenv
load_dotenv()

True

### Document Loader (loading documents)

In [3]:
%pip install bs4

Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_community.document_loaders import WebBaseLoader  # fetches the text of a web page from its URL

url = "https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A0%95%EC%B1%85%EA%B3%BC_%EC%A7%80%EC%B9%A8"

# Create the loader instance
loader = WebBaseLoader(url)

# Request the URL, parse the HTML, and return the extracted text as a list of Document objects
documents = loader.load()

print(len(documents))
print(documents[0].metadata)

# Inspect the first 500 characters of the body
print(documents[0].page_content[:500])

  from .autonotebook import tqdm as notebook_tqdm
USER_AGENT environment variable not set, consider setting it to identify your requests.


1
{'source': 'https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A0%95%EC%B1%85%EA%B3%BC_%EC%A7%80%EC%B9%A8', 'title': '위키백과:정책과 지침 - 위키백과, 우리 모두의 백과사전', 'language': 'ko'}

위키백과:정책과 지침 - 위키백과, 우리 모두의 백과사전

본문으로 이동

주 메뉴

주 메뉴
사이드바로 이동
숨기기

둘러보기

대문최근 바뀜요즘 화제임의의 문서로

사용자 모임

사랑방사용자 모임관리 요청

편집 안내

소개도움말정책과 지침질문방

검색

검색

보이기

기부

계정 만들기

로그인

개인 도구

기부 계정 만들기 로그인

목차
사이드바로 이동
숨기기

처음 위치

1
최상위 정책

2
'정책과 지침'이란?

3
준수

4
집행

5
문서 내용

6
정책과 지침은 백과사전의 일부가 아닙니다
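
Text extracted from raw HTML typically carries navigation labels and runs of blank lines, as the output above shows. A minimal sketch of tidying such text before it is embedded follows. The helper `collapse_whitespace` is hypothetical and not part of LangChain, and the sample string is made up for illustration.

```python
import re

def collapse_whitespace(text):
    """Trim each line and squeeze runs of blank lines down to a single blank line."""
    lines = [line.strip() for line in text.splitlines()]
    joined = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", joined).strip()

# Made-up sample imitating the shape of scraped page text.
sample = "\n\n\n\nTitle\n\n\n\n\n\t\tMenu\n\t\n\n\nBody text\n\n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

In the notebook this could be applied to `documents[0].page_content`. It only normalizes whitespace and does not remove the navigation labels themselves.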






In [5]:
from langchain_community.document_loaders import PyPDFLoader  # loads a PDF file and converts it to text

# Create the loader instance (pass the file path)
loader = PyPDFLoader("The_Adventures_of_Tom_Sawyer.pdf")

# Load the document: each PDF page becomes one Document object, returned as a list
documents = loader.load()

print(len(documents))
print(documents[0].metadata)
print(documents[3].page_content)    # body of the 4th page (index 3)

35
{'producer': '3-Heights(TM) PDF Optimization Shell 5.9.1.5 (http://www.pdf-tools.com)', 'creator': 'Acrobat PDFMaker 7.0 dla programu Word', 'creationdate': '2006-08-26T00:50:00+02:00', 'author': 'GOLDEN', 'company': 'c', 'title': 'Microsoft Word - 1', 'moddate': '2021-01-27T15:00:11+01:00', 'source': 'The_Adventures_of_Tom_Sawyer.pdf', 'total_pages': 35, 'page': 0, 'page_label': '1'}
Pearson Education Limited                                                                            
Edinburgh Gate, Harlow,                                                                               
Essex CM20 2JE, England                                                                              
and Associated Companies throughout the world. 
ISBN 0 582 41923 9 
 
First published 1876                                                                                  
Published by Puffin Books 1950                                                                         
This edition first publis
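
The PDF loader yields one Document per page, and this notebook embeds those pages without running the Document Transformer stage from the list at the top. For longer pages, a splitter would normally cut the text into overlapping chunks first. The sketch below shows the idea with plain character windows. The function name and the sizes are assumptions chosen for illustration; LangChain ships its own splitters in `langchain_text_splitters`, which are not used here.

```python
def split_with_overlap(text, chunk_size=40, overlap=10):
    """Slice text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

page = (
    "Tom Sawyer loves adventures. He has a lot of adventures "
    "at home, at school, and with his friends."
)
chunks = split_with_overlap(page, chunk_size=40, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.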

### Embedding Model (embedding: turning text into numbers)

In [6]:
from langchain_openai import OpenAIEmbeddings
import pandas as pd

# Create the embedding model
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

text = "The quick brown fox jumps over the lazy dog."
vector = embeddings.embed_query(text)   # convert a single string into a vector

print(len(vector))
print(pd.Series(vector).head())

1536
0   -0.018424
1   -0.007258
2    0.003667
3   -0.054205
4   -0.022725
dtype: float64
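
The numbers in an embedding matter only in relation to other embeddings: texts with similar meaning should map to vectors that point in similar directions. A common way to measure that is cosine similarity, sketched below. The three-dimensional vectors are made up for illustration and are not output from `text-embedding-3-small`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors; the real embeddings above have 1536 dimensions.
fox = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(fox, dog))      # close to 1: similar direction
print(cosine_similarity(fox, invoice))  # close to 0: nearly orthogonal
```

The same function would work on two vectors returned by `embeddings.embed_query`, since they are plain lists of floats.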


In [7]:
# Extract only the text of each document
docs = [document.page_content for document in documents]
print(len(docs))

# embed_documents(): takes a list of strings, converts each one to a vector, and returns a list of vectors
vects = embeddings.embed_documents(docs)

print(len(vects))
print(len(vects[0]))
pd.DataFrame(vects)

35
35
1536


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
0,0.015368,-0.034811,-0.009329,0.014481,0.007343,0.014409,-0.052248,0.049236,-0.013593,0.015107,...,-0.008608,0.020671,0.002576,-0.002818,-0.021685,0.024887,0.02503,-0.013593,0.017691,0.019022
1,0.022011,-0.019454,0.014152,-0.002708,0.000608,-0.046838,-0.046292,0.089768,-0.024037,-0.00898,...,0.010905,0.003195,0.030516,-0.008757,-0.015445,0.05382,-0.006613,-0.019051,0.023476,-0.003281
2,-0.011975,-0.009629,0.013875,-0.021334,-0.017252,0.006028,-0.002008,0.046866,-0.026224,-0.010649,...,-0.008984,-0.025591,-0.004706,0.011511,-0.05015,0.032956,0.011476,-0.005061,0.044215,-0.00709
3,0.020603,-0.024786,0.009873,-0.010347,-0.007574,-0.00178,-0.006356,0.012776,-0.050092,-0.016355,...,0.017784,-0.027228,0.007392,-0.018057,-0.047728,0.047676,-0.022162,-0.008743,0.034841,0.009087
4,0.006844,-0.015666,0.023804,-0.000492,-0.017503,-0.030515,0.023473,0.009887,-0.029264,-0.013114,...,-0.041128,-0.024544,-0.022656,0.007463,-0.04891,0.042481,0.010709,-0.001508,0.031382,0.002701
5,0.0184,0.015915,-0.040583,0.035186,0.023646,-0.031528,0.015663,0.020329,-0.047065,0.020998,...,-0.014326,-0.028703,-0.018412,0.009105,-0.02497,0.029258,-0.00075,0.014062,0.02347,-0.01642
6,0.030218,0.047983,-0.024109,-0.001284,0.027657,0.00648,0.030547,-0.018023,-0.040017,-0.009053,...,0.034143,0.037808,-0.021054,-0.015849,-0.008289,0.002178,0.047795,-0.023016,0.033085,-0.02566
7,-0.005995,0.019772,-0.022503,0.019724,0.01244,-0.041624,0.022751,-0.01315,-0.014474,-0.003716,...,-0.017489,-0.034955,-0.031975,0.001982,-0.036232,0.023899,0.006622,0.001106,0.026819,-0.010459
8,-0.012375,-0.000126,-0.045359,0.002184,-0.014723,-0.043351,0.025629,-0.014921,-0.015764,-0.011736,...,-0.027414,-0.014215,-0.040129,-0.010429,-0.024229,0.036138,-0.005332,-0.005007,0.021242,0.009344
9,-0.01884,0.050801,-0.045571,0.008173,0.010839,0.0041,0.04147,0.045238,-0.009331,-0.084979,...,-0.003671,0.012601,-0.030454,-0.012324,0.020192,0.007486,0.039653,-0.032538,0.008539,0.002471


### Vector Store (FAISS)

In [8]:
# FAISS
from langchain_community.vectorstores import FAISS

# Create the vector store
# Internally, this embeds the text of `documents` with the `embeddings` model and builds a FAISS index over the vectors.
vector_store = FAISS.from_documents(documents, embeddings)

print(vector_store)

# Similarity search
query = "Tom Sawyer"

# similarity_search(): finds the documents most similar (closest in distance) to the query.
# k=3: return the 3 most similar documents
retrieved_docs = vector_store.similarity_search(query, k=3)

print(retrieved_docs)


<langchain_community.vectorstores.faiss.FAISS object at 0x0000016B8B57BDD0>
[Document(id='6ad14c39-bc16-45db-a74c-ac69e5968e9a', metadata={'producer': '3-Heights(TM) PDF Optimization Shell 5.9.1.5 (http://www.pdf-tools.com)', 'creator': 'Acrobat PDFMaker 7.0 dla programu Word', 'creationdate': '2006-08-26T00:50:00+02:00', 'author': 'GOLDEN', 'company': 'c', 'title': 'Microsoft Word - 1', 'moddate': '2021-01-27T15:00:11+01:00', 'source': 'The_Adventures_of_Tom_Sawyer.pdf', 'total_pages': 35, 'page': 4, 'page_label': '5'}, page_content='Introduction \n \n \nOne Saturday afternoon Tom wanted to have an adventure                    \nbecause he didn’t want to think about Injun Joe. He went \nto Huck and said, “I’m going to  look for treasure. Do you \nwant to come with me?” \n \nTom Sawyer loves adventures. He has a lot of adventures \nat home, at school, and with his friends. He has one \nadventure in a cave. But why is he there? What does he \nsee in the cave? And why is he afraid? \n \n
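
Conceptually, `similarity_search` embeds the query and returns the `k` stored vectors nearest to it. The sketch below does this by brute force, assuming Euclidean (L2) distance, which appears to be the default for the LangChain FAISS wrapper. It does not call FAISS itself, and the two-dimensional vectors and page labels are made up for illustration.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, indexed, k=3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = ((squared_l2(query_vec, vec), text) for vec, text in indexed)
    return heapq.nsmallest(k, scored)

# Hypothetical 2-dimensional vectors standing in for page embeddings.
indexed = [
    ([0.9, 0.1], "page about Tom"),
    ([0.8, 0.3], "page about Huck"),
    ([0.1, 0.9], "copyright page"),
    ([0.5, 0.5], "table of contents"),
]

results = top_k([1.0, 0.0], indexed, k=3)
for distance, text in results:
    print(f"{distance:.2f}  {text}")
```

A scan like this compares the query against every stored vector, which is fine for 35 pages. Index libraries such as FAISS exist to make the same lookup fast on much larger collections.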

### Retrieval & RAG (connecting a retriever and answering questions)

In [9]:
%pip install langchain

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Convert to a retriever
# Wrap the vector store in LangChain's standard Retriever interface so it can later be plugged into a Chain or Agent.
# retrieval = vector_store.as_retriever()

# from langchain_openai import ChatOpenAI
# from langchain_classic.chains import RetrievalQA

# model = ChatOpenAI(
#    model="gpt-5-nano",
#    temperature=0
# )

# Question-answering chain
# retrieval_qa = RetrievalQA.from_chain_type(
#    llm = model,           # language model that generates the answer
#    retriever = retrieval,  # retriever that fetches the relevant documents
#    chain_type='stuff'     # 'stuff': the most basic strategy; puts every retrieved document into the prompt as-is
# )

# Query (Korean): "Who killed the man who was in the village graveyard?"
# response1 = retrieval_qa.invoke("마을 무덤에 있던 남자를 누가 죽였나요?")
# print('답변 1:', response1)   # "답변" means "answer"

답변 1: {'query': '마을 무덤에 있던 남자를 누가 죽였나요?', 'result': '인준 조(Injun Joe)가 의사를 칼로 죽였습니다.'}


In [16]:
from langchain_openai import ChatOpenAI
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate

retrieval = vector_store.as_retriever()

model = ChatOpenAI(
    model="gpt-5-nano",
    temperature=0
)

# System prompt (Korean): "You are an assistant that helps with question answering.
# Answer the question using the CONTEXT provided below. If you do not know the
# answer, say so; do not make one up."
system_prompt = (
    "당신은 질문, 답변을 돕는 보조원입니다. "
    "아래 제공된 CONTEXT를 사용하여 질문에 답하세요. "
    "답을 모르면 모른다고 하되, 답변을 지어내지 마세요.\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}")
    ]
)

# Document-combining chain: joins the retrieved documents and fills the prompt's {context} slot with them
combine_docs_chain = create_stuff_documents_chain(model, prompt)

# Retrieval chain: takes the question, fetches documents with the retriever, and hands them to the combining chain
rag_chain = create_retrieval_chain(retrieval, combine_docs_chain)

# Query (Korean): "Who killed the man who was in the village graveyard?"
response1 = rag_chain.invoke({"input": "마을 무덤에 있던 남자를 누가 죽였나요?"})

print("답변1: ", response1['answer'])   # "답변" means "answer"

# Query (Korean): "What kind of person is Tom Sawyer?"
response2 = rag_chain.invoke({"input": "톰소여는 어떤 사람인가요?"})
print("답변2 : ", response2['answer'])

답변1:  Injun Joe가 의사를 칼로 죽였습니다. (Muff Potter는 아니었습니다.)
답변2 :  톰 소여는 모험을 사랑하는 소년이에요. 집이나 학교, 친구들(Huck Finn, Joe Harper)과 함께 여러 모험을 즐깁니다. 그가 graveyard(무덤가)에서 겪은 모험에서는 Injun Joe를 보게 되고, Injun Joe가 위험한 악당이기 때문에 두려워합니다.
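
The "stuff" strategy used by the chain above amounts to string assembly: the retrieved texts are joined and dropped into the `{context}` slot of the system prompt. The sketch below reproduces that step without LangChain or a model call. The retrieved texts, the English template, and the helper `stuff_prompt` are all hypothetical, and the blank-line separator is an assumption about a sensible default.

```python
# Hypothetical stand-ins for the page_content of retrieved Document objects.
retrieved_texts = [
    "Tom Sawyer loves adventures.",
    "Injun Joe killed the doctor in the graveyard.",
]

SYSTEM_TEMPLATE = (
    "Answer the question using the CONTEXT below. "
    "If you do not know the answer, say so.\n\n"
    "{context}"
)

def stuff_prompt(texts, question, separator="\n\n"):
    """Join all retrieved texts and place them in the {context} slot."""
    context = separator.join(texts)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]

messages = stuff_prompt(retrieved_texts, "Who killed the doctor?")
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every retrieved document goes into one prompt, this approach is simple but bounded by the model's context window, so the retriever's `k` and the chunk size determine how much text the model sees.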
