# RAG (Retrieval-Augmented Generation) System

## Main Components
1. **PDF processing and text extraction**  
2. **Text chunking**  
3. **Vector store creation (FAISS + OpenAI embeddings)**  
4. **Retriever setup (query search engine)**  
5. **System performance evaluation**

---

## Methodology

### Document Preprocessing
- Load the PDF document with `PyPDFLoader`  
- Split it into chunks with `RecursiveCharacterTextSplitter` (chunk size and overlap are configurable)
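
To make the effect of chunk size and overlap concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplified stand-in, not the `RecursiveCharacterTextSplitter` algorithm (which also tries to split on separators such as paragraph and line breaks); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk when the overlap is large enough.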

### Text Cleaning
- Apply the `replace_t_with_space` function to fix PDF-specific formatting problems
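
The cleaning step might look roughly like the sketch below. It assumes, from the helper's name alone, that `replace_t_with_space` replaces tab characters with spaces; the real implementation lives in `helper_functions` and is not shown in this notebook. `Doc` and `replace_tabs_with_spaces` are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is modeled)."""
    page_content: str


def replace_tabs_with_spaces(docs: list[Doc]) -> list[Doc]:
    """Replace every tab character in each document's text with a single space, in place."""
    for doc in docs:
        doc.page_content = doc.page_content.replace("\t", " ")
    return docs


docs = replace_tabs_with_spaces([Doc("long\t-term\tchanges")])
print(docs[0].page_content)  # prints: long -term changes
```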

### Vector Store Creation
- Embed each text chunk with **OpenAI embeddings**  
- Index the resulting vectors with **FAISS** for efficient similarity search

### Retriever Setup
- Configure **top-2** retrieval, which returns the two chunks most relevant to the query
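
Conceptually, top-2 retrieval is a nearest-neighbor search over the stored vectors. The sketch below does this by brute force with the standard library, assuming Euclidean distance as the similarity measure. The two-dimensional vectors are made-up toy values rather than OpenAI embeddings, and `top_k` is a hypothetical helper, not a FAISS or LangChain API.

```python
import heapq
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors closest to the query, nearest first."""
    return heapq.nsmallest(k, index, key=lambda chunk_id: l2_distance(query, index[chunk_id]))


# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
index = {
    "chunk_a": [0.9, 0.1],
    "chunk_b": [0.8, 0.3],
    "chunk_c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.0], index, k=2)
print(nearest)  # prints: ['chunk_a', 'chunk_b']
```

A brute-force scan like this compares the query against every stored vector; a library such as FAISS exists to make the same kind of search fast on large collections.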

### Encoding Function
- The **`encode_pdf()`** function wraps the whole pipeline: load PDF → chunk → clean → embed → store in FAISS


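## Key Features
✅ **Modular design**: the `encode_pdf()` function provides one consistent pipeline  
✅ **Configurable chunk size**: adjust it to fit the document's characteristics  
✅ **Fast retrieval**: high-speed similarity search with **FAISS**  
✅ **Flexible scalability**: scales efficiently as the number of documents grows  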


### Function to evaluate metrics for each chunk size

### Import libraries and environment variables

In [1]:
import os
import sys
from dotenv import load_dotenv
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) 
from helper_functions import *
from evaluation.evalute_rag import *
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
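
One caveat about the last line of the cell above: if `OPENAI_API_KEY` is not defined, `os.getenv` returns `None`, and assigning `None` into `os.environ` raises a `TypeError` that does not name the missing variable. A small guard such as the following sketch gives a clearer error; `require_env` is a hypothetical helper, not part of `helper_functions`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value
```

After `load_dotenv()`, the assignment would then read `os.environ["OPENAI_API_KEY"] = require_env("OPENAI_API_KEY")`.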


For example, replace imports like `from langchain_core.pydantic_v1 import BaseModel` with `from pydantic import BaseModel`, or, if you are working in a code base that has not been fully upgraded to pydantic 2 yet, with the v1 compatibility namespace: `from pydantic.v1 import BaseModel`.



### Read Docs

In [2]:
path = "../data/Understanding_Climate_Change.pdf"

### Encode document

In [7]:
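# Chunk the document, embed the chunks, and store them in a FAISS index
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    # Load the PDF; each page becomes one document
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split the pages into overlapping chunks, measuring length in characters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)  # Fixes stray characters, line breaks, and whitespace issues

    # Embed the cleaned chunks and build the FAISS vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore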

In [8]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create retriever

In [9]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test retriever

In [10]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)



### Evaluate results

In [11]:
evaluate_rag(chunks_query_retriever)

Answering the question from the retrieved context...
Answering the question from the retrieved context...
Answering the question from the retrieved context...
Answering the question from the retrieved context...
Answering the question from the retrieved context...


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 5 test case(s) in parallel: |          |  0% (0/5) [Time Taken: 00:00, ?test case/s]

None
None
None
None
None


ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 3 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:ro
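
The log above shows the evaluation calls being retried after rate-limit errors. For reference, the general retry-with-exponential-backoff pattern can be sketched as follows. This illustrates the technique only and is not the evaluation library's actual retry code; `call_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential delay with random jitter so parallel callers do not retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Running fewer test cases in parallel is another way to stay under a rate limit, at the cost of a longer evaluation.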



Metrics Summary

  - ✅ Correctness (GEval) (score: 0.9686545006846778, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output is factually correct and expands accurately on the expected output by including additional details about temperature, precipitation, and wind patterns., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4, reason: None, error: None)
  - ❌ Contextual Relevancy (score: 0.6153846153846154, threshold: 1.0, strict: False, evaluation model: gpt-4, reason: The score is 0.62 because while there are multiple statements providing substantial information about what climate change refers to, such as 'Climate change refers to significant, long -term changes in the global climate.' and 'Over the past century, human activities, particularly the burning of fossil fuels and deforestation, have significantly contributed to climate change.', there are also several statements that are irrelevant to the in
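
The pass and fail marks in the summary follow from comparing each score with its threshold. The sketch below reproduces that comparison with the numbers reported above, assuming a metric passes when its score is greater than or equal to its threshold; `MetricResult` is a hypothetical container, not a class from the evaluation library.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a metric passes when its score meets or exceeds the threshold.
        return self.score >= self.threshold


# Scores and thresholds copied from the Metrics Summary output
results = [
    MetricResult("Correctness (GEval)", 0.9686545006846778, 0.5),
    MetricResult("Faithfulness", 1.0, 0.7),
    MetricResult("Contextual Relevancy", 0.6153846153846154, 1.0),
]
for r in results:
    mark = "pass" if r.passed else "fail"
    print(f"{r.name}: {r.score:.2f} vs threshold {r.threshold} -> {mark}")
```

With a threshold of 1.0, Contextual Relevancy passes only on a perfect score, which is why it is marked ❌ at 0.62 while the other two metrics pass.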


