# Reranking Methods for RAG Systems

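## Overview
Reranking is a step in a **RAG (Retrieval-Augmented Generation)** system that improves the **relevance** and **quality** of the retrieved documents. It re-scores the documents found by the initial retrieval so that the **most suitable information** is used first in the final generation step.

**Why reranking is needed:** Initial retrieval typically relies on a simple similarity measure, so it can miss **contextual relevance** and **subtle differences in meaning**.  
Reranking compensates for these limitations with a more refined evaluation of each candidate.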


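## Key Components
- **Initial retriever**: an embedding-based vector search tool such as FAISS.  
- **Reranking model**: uses one of the following approaches.  
  - **LLM-based**: uses a prompt to rate each document's relevance.  
  - **Cross-Encoder-based**: takes a query-document pair as input and assigns a precise relevance score.  
- **Scoring**: computes the scores used for reranking.  
- **Sorting and selection**: selects the top K documents according to the new scores.  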

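## Method
1. **Initial retrieval**: retrieve potentially relevant documents.  
2. **Pair creation**: pair each retrieved document with the query.  
3. **Scoring**:  
   - **LLM method**: the LLM assigns a relevance score via a prompt.  
   - **Cross-Encoder method**: the query-document pair is fed to the model, which computes a precise score.  
4. **Sorting**: reorder the documents by the new scores.  
5. **Top-K selection**: keep the top K documents as the final result.  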
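
The five steps above can be sketched as a single function with a pluggable scorer. This is a minimal illustration, not the notebook's implementation: `rerank`, `keyword_overlap_score`, and the toy candidate list are hypothetical, and the keyword-overlap scorer merely stands in for an LLM or a Cross-Encoder.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair, sort by score, keep the top_k."""
    # Step 2: build query-document pairs from the initial retrieval results.
    pairs = [(query, doc) for doc in candidates]
    # Step 3: score each pair with the supplied scorer.
    scored = [(doc, score_fn(q, doc)) for q, doc in pairs]
    # Step 4: sort by the new score, highest first.
    scored.sort(key=lambda item: item[1], reverse=True)
    # Step 5: keep only the top_k documents.
    return scored[:top_k]


def keyword_overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Step 1 (initial retrieval) is assumed to have produced these candidates.
candidates = [
    "Sea levels are rising worldwide.",
    "Climate change reduces biodiversity in coral reefs.",
    "Biodiversity loss is accelerating.",
]
top = rerank("climate change biodiversity", candidates, keyword_overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Keeping the scorer as a parameter means the same sorting and selection logic can be reused whether the scores come from an LLM prompt or from a Cross-Encoder.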


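## Benefits
- **Improved accuracy**: an LLM or Cross-Encoder captures subtle relevance signals that similarity search alone misses.  
- **Noise reduction**: less relevant documents are filtered out, which improves the quality of the generated output.  
- **Flexibility**: different methods can be applied depending on the data and the available resources.  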

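## Conclusion
Reranking is a key technique that can substantially improve **the performance of a RAG system**.  
In particular, the **LLM-based** and **Cross-Encoder-based** methods capture **fine-grained contextual information** that retrieval alone can miss, helping the generation model work from more accurate information.  
Reranking can therefore serve as an essential component for raising the **information quality** passed from the retrieval stage to the generation stage.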


<div style="text-align: center;">

<img src="../images/reranking-visualization.svg" alt="rerank llm" style="width:100%; height:auto;">
</div>

<div style="text-align: center;">

<img src="../images/reranking_comparison.svg" alt="rerank llm" style="width:100%; height:auto;">
</div>

### Import relevant libraries

In [5]:
import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document
from typing import List, Dict, Any, Tuple
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.retrievers import BaseRetriever
from sentence_transformers import CrossEncoder


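# Add the parent directory to the path, since this notebook runs from a subdirectory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
# PromptTemplate, FAISS, OpenAIEmbeddings, BaseModel, Field, and encode_pdf are used below
# without explicit imports; they are assumed to come from these star imports.
from helper_functions import *
from evaluation.evalute_rag import *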

# Load environment variables from a .env file
load_dotenv()

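# Set the OpenAI API key, failing early with a clear message if it is missing
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key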

In [50]:
# ! pip install sentence-transformers
# ! pip install tf-keras



### Define the document's path

In [2]:
path = "../data/Understanding_Climate_Change.pdf"

### Create a vector store

In [10]:
vectorstore = encode_pdf(path)

## Method 1: LLM based function to rerank the retrieved documents

<div style="text-align: center;">

<img src="../images/rerank_llm.svg" alt="rerank llm" style="width:40%; height:auto;">
</div>

### Create a custom reranking function


In [8]:
class RatingScore(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of a document to a query.")

# Ask an LLM to score how relevant each document is to the query, then keep the best ones
def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    prompt_template = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches.
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )
    
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)
    llm_chain = prompt_template | llm.with_structured_output(RatingScore)

    # Score every document (one LLM call per document), sort by score,
    # and return the top_n highest-scoring documents.
    scored_docs = []
    for doc in docs:
        input_data = {"query": query, "doc": doc.page_content}
        try:
            # Keep the call inside the try block so a parsing failure is actually caught
            score = float(llm_chain.invoke(input_data).relevance_score)
        except (ValueError, TypeError, AttributeError):
            score = 0.0  # Default score if the model output cannot be parsed
        scored_docs.append((doc, score))
    
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked_docs[:top_n]]
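
Because the prompt asks for a score on a 1-10 scale, several documents will often receive the same score. Python's `sorted` is stable even with `reverse=True`, so tied documents keep the order in which the initial retriever returned them. The sketch below illustrates this with made-up data; the document labels and score values are hypothetical.

```python
# Hypothetical (document, LLM score) pairs, listed in initial-retrieval order.
scored_docs = [
    ("doc_a", 7.0),
    ("doc_b", 9.0),
    ("doc_c", 7.0),
    ("doc_d", 9.0),
]

# Same sort call as in rerank_documents: highest score first.
reranked = sorted(scored_docs, key=lambda x: x[1], reverse=True)

# Ties keep their original relative order: doc_b before doc_d, doc_a before doc_c.
order = [name for name, _ in reranked]
print(order)  # ['doc_b', 'doc_d', 'doc_a', 'doc_c']
```

In practice this means the initial similarity ranking acts as the tie-breaker whenever the LLM cannot distinguish two documents.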

### Example usage of the reranking function with a sample query relevant to the document


In [11]:
query = "What are the impacts of climate change on biodiversity?"
# Retrieve the initial candidate documents
initial_docs = vectorstore.similarity_search(query, k=15)
# Rerank the initial documents
reranked_docs = rerank_documents(query, initial_docs)

# print first 3 initial documents
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


# Print results
print(f"Query: {query}\n")
print("Reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

Top initial documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
goals. Policies should promote synergies between biodiversity conservation and climate 
action.  
Chapter 10: Climate Change and Human Health  
Health Impacts  
Heat -Related Illnesses  
Rising temper...

Document 3:
managed retreats.  
Extreme Weather Events  
Climate change is linked to an increase in the frequency and severity of extreme weather 
events, such as hurricanes, heatwaves, droughts, and heavy rainfa...
Query: What are the impacts of climate change on biodiversity?

Reranked documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
Coral reefs are highly 

### Create a custom retriever based on our reranker

In [15]:

class CustomRetriever(BaseRetriever, BaseModel):
    
    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    # Run the initial retrieval, then rerank the retrieved documents
    def get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        return rerank_documents(query, initial_docs, top_n=num_docs)


custom_retriever = CustomRetriever(vectorstore=vectorstore)

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=custom_retriever,
    return_source_documents=True
)



### Example query


In [16]:
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document




Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity by shifting habitat ranges, changing species distributions, and affecting ecosystem functions. In terrestrial ecosystems, such as forests, grasslands, and deserts, these changes can lead to a loss of biodiversity and disrupt ecological balance. In marine ecosystems, rising sea temperatures, ocean acidification, and changing currents affect marine biodiversity, leading to species migration and changes in reproductive cycles, which can disrupt marine food webs and fisheries. Coral reefs, in particular, are highly sensitive to temperature and acidity changes, resulting in coral bleaching and mortality, which threaten biodiversity and fisheries.

Relevant source documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing s

### Example that demonstrates why we should use reranking 

In [17]:
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. 
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.""", 
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]
docs = [Document(page_content=sentence) for sentence in chunks]


def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs, embeddings)

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")
    
    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    print("\nAdvanced Retrieval Result:")
    custom_retriever = CustomRetriever(vectorstore=vectorstore)
    advanced_docs = custom_retriever.get_relevant_documents(query)
    for i, doc in enumerate(advanced_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)


query = "what is the capital of france?"
compare_rag_techniques(query, docs)

Comparison of Retrieval Techniques
==================================
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.

Advanced Retrieval Result:





Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.


## Method 2: Cross Encoder models

<div style="text-align: center;">

<img src="../images/rerank_cross_encoder.svg" alt="rerank cross encoder" style="width:40%; height:auto;">
</div>

### Define the cross encoder class
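**Cross-Encoder**: a Transformer-based model that scores the relevance of a query-document pair.
- It joins the query and the document into a single input sequence, which lets it produce context-aware, high-accuracy relevance scores.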
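
To make "a single input sequence" concrete: BERT-style Cross-Encoders typically receive the query and the document joined with special separator tokens, so attention can relate every query token to every document token. The sketch below only illustrates the shape of that input with plain strings. `build_pair_input` is a hypothetical helper, the real model works on token IDs produced by its tokenizer, and the word-level truncation here is a simplification.

```python
def build_pair_input(query: str, document: str, max_doc_words: int = 50) -> str:
    """Illustrative only: lay out a query-document pair the way a BERT-style pair input is arranged."""
    # Truncate the document so the combined sequence stays bounded
    # (real tokenizers truncate by token count, not by words).
    doc_words = document.split()[:max_doc_words]
    return f"[CLS] {query} [SEP] {' '.join(doc_words)} [SEP]"


pair_input = build_pair_input(
    "what is the capital of france?",
    "I really enjoyed my trip to Paris, France. Such a great capital city.",
    max_doc_words=8,
)
print(pair_input)
```

This contrasts with the initial embedding search, where the query and each document are encoded separately and only their vectors are compared.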


In [18]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

class CrossEncoderRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")
    cross_encoder: Any = Field(description="Cross-encoder model for reranking")
    k: int = Field(default=5, description="Number of documents to retrieve initially")
    rerank_top_k: int = Field(default=3, description="Number of documents to return after reranking")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str) -> List[Document]:
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)
        
        # Pair each initial document with the query as cross-encoder input
        pairs = [[query, doc.page_content] for doc in initial_docs]
        scores = self.cross_encoder.predict(pairs)  # Compute one relevance score per pair
        
        # Sort documents by score
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)
        
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval not implemented")
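
The async method above simply raises `NotImplementedError`. One common way to offer an async path for a blocking function is to run it in a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below shows that pattern on a stand-in function; `retrieve_sync` and `retrieve_async` are hypothetical names and are not wired into `CrossEncoderRetriever`.

```python
import asyncio
from typing import List


def retrieve_sync(query: str) -> List[str]:
    """Hypothetical blocking retrieval; stands in for a synchronous rerank call."""
    return [f"result for: {query}"]


async def retrieve_async(query: str) -> List[str]:
    # Run the blocking function in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(retrieve_sync, query)


results = asyncio.run(retrieve_async("climate change"))
print(results)
```

Whether this is worthwhile depends on the workload: the thread offload keeps the event loop free, but it does not make the underlying model inference any faster.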



### Create an instance and showcase over an example

In [19]:
# Create the cross-encoder retriever
cross_encoder_retriever = CrossEncoderRetriever(
    vectorstore=vectorstore,
    cross_encoder=cross_encoder,
    k=10,  # Retrieve 10 documents initially
    rerank_top_k=5  # Return top 5 after reranking
)

# Set up the LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Create the RetrievalQA chain with the cross-encoder retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=cross_encoder_retriever,
    return_source_documents=True
)

# Example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity by altering terrestrial and marine ecosystems. It shifts habitat ranges, changes species distributions, and affects ecosystem functions, leading to a loss of biodiversity and disrupting ecological balance. In terrestrial ecosystems, such as forests, grasslands, and deserts, there are shifts in plant and animal species composition. Marine ecosystems are also highly vulnerable, with rising sea temperatures, ocean acidification, and changing currents affecting marine biodiversity, disrupting marine food webs and fisheries. These changes can lead to species migration and altered reproductive cycles.

Relevant source documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
protection, and habitat crea