# 12-02 Practice 2: PDF

### Environment Setup


Set the API key.


In [1]:
# Keep API keys in a .env file and expose them as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()

True

In [2]:
# Set up LangSmith tracing. https://smith.langchain.com
!pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter the project name.
logging.langsmith("CH12-RAG-practice")

LangSmith 추적을 시작합니다.
[프로젝트명]
CH12-RAG-practice


In [3]:
## Added code
#%pip install --upgrade "pydantic>=2.7.4" langchain langchain-openai

In [2]:
import bs4 
from langchain import hub
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

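Code comparison:
- `bs4.SoupStrainer("main", attrs={"id": ["main-content"]})` extracts text only from the `<main>` element whose id is `main-content`.
- Useful when only one region of a page should be crawled.
- For example, it suits extracting just the body of a Naver News article or a blog post.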

### Load the PDF


In [50]:
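from langchain_community.document_loaders import PyPDFLoader

# Load the PDF file. Pass the file path.
loader = PyPDFLoader("data2/test_paper_2.pdf")

# Load the document page by page (one Document per page)
docs = loader.load()
# Keep the first 500 characters of the 3rd page (index 2) as a plain string
text = docs[2].page_content[:500]
print(f"문서의 수: {len(docs)}")  # "Number of documents"

# Print the content of the 3rd page (index 2)
print(f"\n[페이지내용]\n{docs[2].page_content[:500]}")  # "[Page content]"
# Metadata of the 2nd page (index 1): total page count, file path, and so on
print(f"\n[metadata]\n{docs[1].metadata}\n")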

문서의 수: 43

[페이지내용]
language models can generate chains of thought if demonstrations of chain-of-thought reasoning are
provided in the exemplars for few-shot prompting.
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem
that it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution
and can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it
mimics a step-by-step thought process for ar

[metadata]
{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-01-12T01:06:30+00:00', 'author': '', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data2/test_paper_2.pdf', 'total_pages': 43, 'page': 1, 'page_label': '2'}



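Code explanation:
- `docs = loader.load()` returns a list of Document objects, one per PDF page.
- `text` is a plain string (str), so it can only be passed to `split_text`; `split_documents` requires a list of Document objects such as `docs`.

- `docs[2]`
    - `docs` is a list that holds one Document per page.
    - With zero-based indexing, `docs[2]` is the 3rd page.
- `.page_content[:500]`
    - `docs[2].page_content` is the text of the page.
    - `[:500]` keeps only the first 500 characters so that long text is truncated for display.
- `.metadata`
    - `docs[1].metadata` prints the metadata of the 2nd page.
    - For this PDF it includes the producer, creator, creation date, source path, total page count, and page number.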

### Split Documents


1) CharacterTextSplitter

This is the simplest method. It `splits on a single separator string` (the default is "\n\n") and measures chunk length by the number of characters.

In [51]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)

In [52]:
text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separator="\n\n"
)
text_splitter.split_text(text)

['language models can generate chains of thought if demonstrations of chain-of-thought reasoning are\nprovided in the exemplars for few-shot prompting.\nFigure 1 shows an example of a model producing a chain of thought to solve a math word problem\nthat it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution\nand can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it\nmimics a step-by-step thought process for ar']

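- Splits the document on two consecutive newlines (\n\n).
- Useful for splitting text paragraph by paragraph. The sample `text` contains no blank line, so it comes back as a single chunk that is longer than `chunk_size`.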
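
The single-chunk result above can be reproduced with a small pure-Python sketch of separator-based splitting. This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` and its exact merge rule are assumptions made for this example.

```python
def split_on_separator(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified sketch: split on one separator, then greedily merge the pieces."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []  # pieces of the chunk being built

    def length(parts: list[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        # Close the current chunk when adding this piece would exceed chunk_size.
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces over as overlap, as long as they fit in
            # chunk_overlap and still leave room for the new piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaa bbb ccc ddd"
# No blank line in the sample, so "\n\n" yields one chunk longer than chunk_size.
print(split_on_separator(sample, "\n\n", 7, 3))
# Splitting on spaces yields small chunks that share an overlapping word.
print(split_on_separator(sample, " ", 7, 3))
```

A piece that contains no separator is never cut, which is why a chunk can end up longer than `chunk_size`.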

In [53]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10, separator="\n")
text_splitter.split_text(text)

Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 109, which is longer than the specified 100


['language models can generate chains of thought if demonstrations of chain-of-thought reasoning are',
 'provided in the exemplars for few-shot prompting.',
 'Figure 1 shows an example of a model producing a chain of thought to solve a math word problem',
 'that it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution',
 'and can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it',
 'mimics a step-by-step thought process for ar']

In [54]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10, separator=" ")
text_splitter.split_text(text)

['language models can generate chains of thought if demonstrations of chain-of-thought reasoning',
 'reasoning are\nprovided in the exemplars for few-shot prompting.\nFigure 1 shows an example of a model',
 'of a model producing a chain of thought to solve a math word problem\nthat it would have otherwise',
 'otherwise gotten incorrect. The chain of thought in this case resembles a solution\nand can',
 'can interpreted as one, but we still opt to call it a chain of thought to better capture the idea',
 'the idea that it\nmimics a step-by-step thought process for ar']

- Splits the text on spaces (" ").
- Use this to build chunks at word boundaries rather than at sentence boundaries.

In [55]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separator=" ")
# Split a plain string into chunks.
text_splitter.split_text(text)

# Split the Document list into chunks.
split_docs = text_splitter.split_documents(docs)
len(split_docs)
#print(split_docs)

165

In [56]:
split_docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-01-12T01:06:30+00:00', 'author': '', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data2/test_paper_2.pdf', 'total_pages': 43, 'page': 0, 'page_label': '1'}, page_content='Chain-of-Thought Prompting Elicits Reasoning\nin Large Language Models\nJason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma\nBrian Ichter Fei Xia Ed H. Chi Quoc V . Le Denny Zhou\nGoogle Research, Brain Team\n{jasonwei,dennyzhou}@google.com\nAbstract\nWe explore how generating a chain of thought —a series of intermediate reasoning\nsteps—signiﬁcantly improves the ability of large language models to perform\ncomplex reasoning. In particular, we show how such reasoning abilities emerge\nnaturally in sufﬁciently large language models via a si

2) RecursiveCharacterTextSplitter
- This is the splitter `recommended for generic text`.

In [57]:
# Import the RecursiveCharacterTextSplitter class.
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [58]:
recursive_text_splitter = RecursiveCharacterTextSplitter(
    # Set a very small chunk size.
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)

In [59]:
character_text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separator=" "
)
for sent in character_text_splitter.split_text(text):
    print(sent)
print("===" * 20)
recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=10
)
for sent in recursive_text_splitter.split_text(text):
    print(sent)

language models can generate chains of thought if demonstrations of chain-of-thought reasoning
reasoning are
provided in the exemplars for few-shot prompting.
Figure 1 shows an example of a model
of a model producing a chain of thought to solve a math word problem
that it would have otherwise
otherwise gotten incorrect. The chain of thought in this case resembles a solution
and can
can interpreted as one, but we still opt to call it a chain of thought to better capture the idea
the idea that it
mimics a step-by-step thought process for ar
language models can generate chains of thought if demonstrations of chain-of-thought reasoning are
provided in the exemplars for few-shot prompting.
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem
that it would have otherwise gotten incorrect. The chain of thought in this case resembles a
a solution
and can interpreted as one, but we still opt to call it a chain of thought to better capture the
the idea t

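Comparison:

`CharacterTextSplitter` splits on one fixed separator and merges the pieces up to `chunk_size`.
- Fast
- Does not guarantee sentence boundaries
- Any separator can be used, including a space

`RecursiveCharacterTextSplitter` keeps text units as intact as possible while splitting.
- Tries separators in order: paragraph ("\n\n"), line ("\n"), word (" "), then single characters
- Tries to preserve paragraph and line boundaries, falling back to a finer separator only when a piece is still too long
- Better suited to document summarization, retrieval systems, and similar tasks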

In [60]:
# Check the default separators of recursive_text_splitter.
recursive_text_splitter._separators

['\n\n', '\n', ' ', '']
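
The separator list above suggests how the recursive strategy works: try the coarsest separator first and fall back to the next one only for pieces that are still too long. The following is a minimal sketch of that idea under simplifying assumptions (no overlap, hypothetical function name `recursive_split`); it is not the library's actual code.

```python
def recursive_split(text: str, chunk_size: int, separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Simplified sketch: no overlap; finer separators are used only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("one two\nthree four five\n\nsix", chunk_size=10))
```

Because coarse separators are tried first, paragraphs and lines stay together whenever they fit, and only oversized pieces are broken at word or character level.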

3) Semantic Similarity
- Splits text based on `semantic similarity`.



Source: [Greg Kamradt’s Notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb)

At a high level, the text is split into sentences, the sentences are grouped into windows of three, and adjacent groups that are similar in embedding space are merged.
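
The procedure described above can be sketched in plain Python. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the fixed `max_distance` threshold is hypothetical (real implementations typically derive the threshold from the distribution of distances).

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text: str, max_distance: float = 0.2, buffer_size: int = 1) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # Embed each sentence together with its neighbours (a window of up to 3 sentences).
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(w) for w in windows]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A large distance between consecutive windows marks a topic shift.
        if cosine_distance(vectors[i - 1], vectors[i]) > max_distance:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


demo = "Cats purr. Cats purr loudly. Stocks fell. Stocks fell sharply."
for chunk in semantic_chunks(demo):
    print(chunk)
```

Embedding windows rather than single sentences smooths the distances, so a break appears only where the topic changes.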


In [61]:
# Update to the latest versions.
%pip install -U langchain langchain_experimental -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [62]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Create the SemanticChunker.
semantic_text_splitter = SemanticChunker(OpenAIEmbeddings(), add_start_index=True)

In [63]:
# Split the excerpt from the paper and print each chunk
for sent in semantic_text_splitter.split_text(text):
    print(sent)
    print("===" * 20)

language models can generate chains of thought if demonstrations of chain-of-thought reasoning are
provided in the exemplars for few-shot prompting. Figure 1 shows an example of a model producing a chain of thought to solve a math word problem
that it would have otherwise gotten incorrect.
The chain of thought in this case resembles a solution
and can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it
mimics a step-by-step thought process for ar


### Embedding
Embedding is a technique that converts text (words, sentences, documents) into numeric vectors.

In [64]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separator="\n\n"
)

splits = text_splitter.split_documents(docs)

vectorstore = FAISS.from_documents(
    documents=splits, embedding=HuggingFaceBgeEmbeddings()
)



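Explanation:
- This step creates a vector store: the text is converted to vectors, stored, and made searchable.
- The chunks are embedded with a Hugging Face BGE embedding model and stored in a FAISS index (a vector similarity search library).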

In [65]:
#!pip uninstall fastembed -y
%pip install fastembed -U -q



Note: you may need to restart the kernel to use updated packages.


In [66]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

vectorstore = FAISS.from_documents(documents=splits, embedding=FastEmbedEmbeddings())

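- This cell used to raise an error, but it now runs normally.
- A possible cause is the switch from `!pip` to `%pip install`: `%pip` installs into the environment of the running kernel, whereas `!pip` may target a different Python.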

### Step 4: Create Vectorstore


A vector store is a database that stores text as vectors (numbers) and quickly retrieves the vectors most similar to a query.

In other words, it is what makes the text searchable by meaning.
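
To make the idea concrete, here is a toy in-memory vector store. It is only a sketch: `ToyVectorStore`, `word_count_embed`, and the sample texts are hypothetical names and data for this example, and the word-count "embedding" stands in for a real model such as the ones used in the cells of this section.

```python
import math
from collections import Counter


def word_count_embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a learned vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (vector, text) pairs and ranks them by cosine similarity to a query."""

    def __init__(self, embed):
        self.embed = embed  # any callable mapping str -> vector
        self.rows = []

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        q = self.embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]  # list of (score, text), best first


store = ToyVectorStore(word_count_embed)
store.add_texts([
    "chain of thought prompting",
    "vector databases store embeddings",
    "thought experiments in physics",
])
for score, text in store.similarity_search("chain of thought", k=2):
    print(f"{score:.3f}  {text}")
```

A production vector store replaces the linear scan with an index so that the nearest vectors are found quickly even for large collections.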

In [67]:
from langchain_community.vectorstores import FAISS

# Use the FAISS DB
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings())

In [68]:
from langchain_community.vectorstores import Chroma

# Use the Chroma DB
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

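### Create Retriever

Similarity-based search

- The default `search_type` is `similarity`. The distance metric itself (cosine, L2, and so on) depends on how the vector store is configured.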


In [69]:
query = "What does 'chain of thought' mean?"

retriever = vectorstore.as_retriever(search_type="similarity")
search_result = retriever.invoke(query)
print(search_result)

[Document(metadata={'author': '', 'creationdate': '2023-01-12T01:06:30+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'page': 2, 'page_label': '3', 'producer': 'pdfTeX-1.40.21', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'source': 'data2/test_paper_2.pdf', 'subject': '', 'title': '', 'total_pages': 43, 'trapped': '/False'}, page_content='language models can generate chains of thought if demonstrations of chain-of-thought reasoning are\nprovided in the exemplars for few-shot prompting.\nFigure 1 shows an example of a model producing a chain of thought to solve a math word problem\nthat it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution\nand can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it\nmimics a step-by-step thought process for arriving at the answer (and also, solutio

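`similarity_score_threshold` returns only the results whose relevance score is at least `score_threshold`. The first cell below uses plain `similarity` search limited to `k=1`; the cell after it applies the threshold.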


In [70]:
query = "What does 'chain-of-thought' mean?"

retriever = vectorstore.as_retriever(
    search_type="similarity",  # plain similarity search
    search_kwargs={"k": 1}  # return at most 1 document
)

search_result = retriever.invoke(query)

# Print the result (at most 1 document)
if not search_result:
    print("❌ No relevant documents found.")
else:
    doc = search_result[0]  # take only the most similar document
    print(f"\n📌 [문서 메타데이터]: {doc.metadata}")  # "[Document metadata]"
    print(f"📄 [문서 내용]: {doc.page_content[:500]}")  # "[Document content]", first 500 characters only
    print("=" * 50)


📌 [문서 메타데이터]: {'author': '', 'creationdate': '2023-01-12T01:06:30+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'page': 8, 'page_label': '9', 'producer': 'pdfTeX-1.40.21', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'source': 'data2/test_paper_2.pdf', 'subject': '', 'title': '', 'total_pages': 43, 'trapped': '/False'}
📄 [문서 내용]: experiments on commonsense reasoning underscored how the linguistic nature of chain-of-thought
reasoning makes it generally applicable (Section 4). Finally, we showed that for symbolic reasoning,
chain-of-thought prompting facilitates OOD generalization to longer sequence lengths (Section 5). In
all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language
model. No language models were ﬁnetuned in the process of writing this paper.
The emergence of chain-


In [71]:
query = "What does 'chain-of-thought' mean?"

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8, "k": 1}
)
search_result = retriever.invoke(query)
print(search_result)
if not search_result:
    print("❌ No relevant documents found.")
else:
    doc = search_result[0]  # take only the most similar document
    print(f"\n📌 [문서 메타데이터]: {doc.metadata}")  # "[Document metadata]"
    print(f"📄 [문서 내용]: {doc.page_content}")  # "[Document content]", printed in full
    print("=" * 50)

No relevant docs were retrieved using the relevance score threshold 0.8


[]
❌ No relevant documents found.
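
The empty result above is what the thresholding rule predicts when no candidate reaches 0.8. A minimal sketch of that rule, assuming relevance scores are normalized so that higher means more similar; `filter_by_threshold` and the scores below are made up for illustration:

```python
def filter_by_threshold(scored_docs, score_threshold, k):
    """Keep at most k (doc, score) pairs whose relevance score is >= score_threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical relevance scores for three chunks.
candidates = [("chunk-a", 0.62), ("chunk-b", 0.79), ("chunk-c", 0.41)]

print(filter_by_threshold(candidates, score_threshold=0.8, k=1))  # nothing passes 0.8
print(filter_by_threshold(candidates, score_threshold=0.5, k=1))  # best chunk above 0.5
```

Lowering `score_threshold` is the usual way to get results back when the threshold filters out every candidate.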


Search using `maximal marginal relevance` (MMR), which balances relevance to the query against diversity among the returned documents.
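
The selection rule behind MMR can be sketched with a few hand-made vectors. This is an illustrative sketch, not the vector store's implementation: the function `mmr`, the vectors, and the `lambda_mult` values are assumptions chosen to show the effect.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest document already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.8, 0.0]
docs_vecs = [
    [1.0, 0.0, 0.0],   # relevant
    [1.0, 0.05, 0.0],  # most relevant, nearly identical to the first
    [0.0, 1.0, 0.0],   # less relevant, but covers a different direction
]
print(mmr(query, docs_vecs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs_vecs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is ordinary similarity ranking; smaller values favor documents that differ from those already chosen.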


In [72]:
query = "What does 'chain-of-thought' mean?"

retriever = vectorstore.as_retriever(
    search_type="mmr",  # maximal marginal relevance search
    search_kwargs={"k": 2}  # return at most 2 documents
)

search_result = retriever.invoke(query)

# Print the result (only the first of the returned documents)
if not search_result:
    print("❌ No relevant documents found.")
else:
    doc = search_result[0]  # take only the first document
    print(f"\n📌 [문서 메타데이터]: {doc.metadata}")  # "[Document metadata]"
    print(f"📄 [문서 내용]: {doc.page_content[:500]}")  # "[Document content]", first 500 characters only
    print("=" * 50)


📌 [문서 메타데이터]: {'author': '', 'creationdate': '2023-01-12T01:06:30+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'page': 8, 'page_label': '9', 'producer': 'pdfTeX-1.40.21', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'source': 'data2/test_paper_2.pdf', 'subject': '', 'title': '', 'total_pages': 43, 'trapped': '/False'}
📄 [문서 내용]: experiments on commonsense reasoning underscored how the linguistic nature of chain-of-thought
reasoning makes it generally applicable (Section 4). Finally, we showed that for symbolic reasoning,
chain-of-thought prompting facilitates OOD generalization to longer sequence lengths (Section 5). In
all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language
model. No language models were ﬁnetuned in the process of writing this paper.
The emergence of chain-


### Multi-Query Generation


In [73]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

query = ""

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

In [76]:
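# Set logging for the queries
# Note: the standard-library `logging` imported here shadows the `logging` imported from langchain_teddynote earlier.
import logging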

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [78]:
question = "What is chain-of-thought prompting?"
unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you explain the concept of chain-of-thought prompting?', '2. How does chain-of-thought prompting work?', '3. What are the key aspects of chain-of-thought prompting?']


2
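
Three queries were generated above, yet only 2 documents came back, because the results of all queries are merged and duplicates are dropped. A minimal sketch of that merge step, assuming a hypothetical `retrieve` callable and made-up chunk names:

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query through `retrieve` and return the de-duplicated union in first-seen order."""
    seen = set()
    unique = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique.append(doc)
    return unique


# Hypothetical retrieval results: different phrasings hit overlapping chunks.
fake_index = {
    "Can you explain chain-of-thought prompting?": ["chunk-3", "chunk-9"],
    "How does chain-of-thought prompting work?": ["chunk-9", "chunk-3"],
    "What are the key aspects of chain-of-thought prompting?": ["chunk-3"],
}

unique_chunks = multi_query_retrieve(list(fake_index), lambda q: fake_index[q])
print(len(unique_chunks), unique_chunks)
```

Rephrasing the question widens recall, while the de-duplication keeps the context passed to the model from repeating the same chunk.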

### Ensemble Retriever


In [80]:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

In [None]:
# ✅ Extract only `page_content` into a list of strings
doc_texts = [d.page_content for d in docs]  # usable by both `BM25Retriever` and `FAISS`

# ✅ Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_texts(doc_texts)
bm25_retriever.k = 2  # maximum number of documents to return

# ✅ Initialize the FAISS retriever
faiss_vectorstore = FAISS.from_texts(doc_texts, OpenAIEmbeddings())
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# ✅ Initialize the ensemble retriever (BM25 + FAISS)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.5, 0.5]  # equal weights for BM25 and FAISS
)

print("✅ Retrievers initialized successfully!")

✅ Retrievers initialized successfully!
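
How the two ranked lists are merged can be sketched with weighted Reciprocal Rank Fusion. This is a minimal sketch under the assumption that the ensemble fuses results this way; the function name `weighted_rrf`, the constant `c=60`, and the document labels are chosen for this example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document scores weight / (rank + c) per list it appears in."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["A", "B"]   # hypothetical keyword-based ranking
faiss_ranking = ["C", "A"]  # hypothetical embedding-based ranking
print(weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5]))
```

A document found by both retrievers accumulates score from each list, so it tends to rank above documents found by only one of them.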


In [87]:
def pretty_print(docs):
    for i, doc in enumerate(docs):
        print(f"[{i+1}] {doc.page_content}")

In [88]:
sample_query = "chain-of-thought 개념을 알려줘"
print(f"[Query]\n{sample_query}\n")
relevant_docs = bm25_retriever.invoke(sample_query)
print("[BM25 Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = faiss_retriever.invoke(sample_query)
print("[FAISS Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = ensemble_retriever.invoke(sample_query)
print("[Ensemble Retriever]")
pretty_print(relevant_docs)

[Query]
chain-of-thought 개념을 알려줘

[BM25 Retriever]
[1] 0204060GSM8K
solve rate (%)LaMDA GPT PaLMStandard prompting
Chain-of-thought prompting
Prior supervised best
020406080SV AMP
solve rate (%)
0.4 81370255075100MAWPS
solve rate (%)
0.4 7175 862540
Model scale (# parameters in billions)
Figure 4: Chain-of-thought prompting enables
large language models to solve challenging math
problems. Notably, chain-of-thought reasoning
is an emergent ability of increasing model scale.
Prior best numbers are from Cobbe et al. (2021)
for GSM8K, Jie et al. (2022) for SV AMP, and Lan
et al. (2021) for MAWPS.Second, chain-of-thought prompting has larger
performance gains for more-complicated prob-
lems. For instance, for GSM8K (the dataset
with the lowest baseline performance), perfor-
mance more than doubled for the largest GPT
and PaLM models. On the other hand, for Sin-
gleOp, the easiest subset of MAWPS which only
requires a single step to solve, performance im-
provements were either negative or v

## RAG Template Experiment


In [99]:
# Step 1: Load Documents
# Load the document, split it into chunks, and index it.
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF file. Pass the file path.
file_path = "data2/test_paper_2.pdf"
loader = PyPDFLoader(file_path=file_path)

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

split_docs = loader.load_and_split(text_splitter=text_splitter)

# Steps 3 and 4: Embedding & Create Vectorstore
# Create the vector store from the chunks produced in step 2.
vectorstore = FAISS.from_documents(documents=split_docs, embedding=OpenAIEmbeddings())

# Step 5: Create Retriever
# Retrieve the documents that match the user's question (query).

# Retrieve the K most similar documents.
k = 3

# Initialize the (sparse) BM25 retriever and the (dense) FAISS retriever.
bm25_retriever = BM25Retriever.from_documents(split_docs)
bm25_retriever.k = k

faiss_vectorstore = FAISS.from_documents(split_docs, OpenAIEmbeddings())
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": k})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

# Step 6: Create Prompt
# Pull the prompt from the hub.
prompt = hub.pull("rlm/rag-prompt")

# Step 7: Create LLM
# Create the language model (LLM).
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    # Join the retrieved documents into a single block of text.
    return "\n\n".join(doc.page_content for doc in docs)


# Step 8: Create Chain
rag_chain = (
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

question = "What is chain-of-thought?"
response = rag_chain.invoke(question)

# Print the result
print(f"PDF Path: {file_path}")
print(f"문서의 수: {len(docs)}")  # "Number of documents"; `docs` is the page list loaded in the PDF load section
print("===" * 20)
print(f"[HUMAN]\n{question}\n")
print(f"[AI]\n{response}")

PDF Path: data2/test_paper_2.pdf
문서의 수: 43
[HUMAN]
What is chain-of-thought?

[AI]
Chain-of-thought is a method that allows models to decompose multi-step problems into intermediate steps, aiding in reasoning tasks. It can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation. Language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.
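
The data flow of the chain above (retrieve, format, fill the prompt, call the model) can be traced with plain functions. This is only a sketch: `PROMPT_TEMPLATE`, `fake_retriever`, and `echo_llm` are hypothetical stand-ins, not the hub prompt, the ensemble retriever, or the OpenAI model.

```python
PROMPT_TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {question}"


def format_docs(docs):
    # Same joining idea as in the cell above, applied to plain strings.
    return "\n\n".join(docs)


def rag_chain(question, retriever, llm):
    # Mirrors {"context": retriever | format_docs, "question": passthrough} | prompt | llm
    inputs = {"context": format_docs(retriever(question)), "question": question}
    prompt = PROMPT_TEMPLATE.format(**inputs)
    return llm(prompt)


def fake_retriever(question):
    # Stand-in for the retriever: always returns the same two chunks.
    return [
        "Chain of thought is a series of intermediate reasoning steps.",
        "It helps with math word problems.",
    ]


def echo_llm(prompt):
    # Stand-in for the model: returns the prompt so the assembled input can be inspected.
    return prompt


print(rag_chain("What is chain-of-thought?", fake_retriever, echo_llm))
```

The question reaches the prompt twice: once unchanged, and once indirectly through the retrieved context.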
