
# Reranking Methods in RAG Systems

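## Overview
Reranking is a key step in Retrieval-Augmented Generation (RAG) systems that improves the relevance and quality of retrieved documents. It re-scores and reorders the initially retrieved documents so that the most relevant information is prioritized for downstream processing or presentation.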

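## Motivation
The main motivation for reranking in RAG systems is to overcome the limitations of initial retrieval methods, which typically rely on simpler similarity metrics. Reranking allows a more sophisticated relevance assessment that accounts for subtle relationships between the query and the documents that traditional retrieval techniques may miss. The goal is to improve the overall performance of the RAG system by ensuring that the generation stage works from the most relevant information.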

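## Key components
A reranking system typically includes the following components:

1. Initial retriever: usually a vector store that performs embedding-based similarity search.
2. Reranking model: either of the following:
   - A large language model (LLM) prompted to score relevance
   - A cross-encoder model trained specifically for relevance assessment
3. Scoring mechanism: a method for assigning a relevance score to each document
4. Sorting and selection logic: reorders the documents according to the new scores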

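## Method details
The reranking process typically follows these steps:

1. Initial retrieval: fetch an initial set of potentially relevant documents.
2. Pair creation: form a query-document pair for each retrieved document.
3. Scoring:
   - LLM method: use a prompt that asks the LLM to rate the document's relevance.
   - Cross-encoder method: feed the query-document pair directly into the model.
4. Score interpretation: parse and normalize the relevance scores.
5. Reordering: sort the documents by their new relevance scores.
6. Selection: take the top K documents from the reordered list.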
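
The steps above can be sketched independently of any particular model. The following is a minimal illustration, not the implementation used later in this notebook: `rerank` and `keyword_overlap` are hypothetical names, and the keyword-overlap scorer is only a toy stand-in for an LLM or cross-encoder so that the example runs without external services.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each (query, doc) pair, sort by score (descending), keep top_k."""
    # Step 2: build one query-document pair per candidate.
    pairs = [(query, doc) for doc in docs]
    # Steps 3-4: score each pair and coerce the result to a float.
    scores = [float(score_fn(q, d)) for q, d in pairs]
    # Step 5: reorder by the new scores (sorted() is stable for ties).
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    # Step 6: keep only the top_k documents.
    return ranked[:top_k]


def keyword_overlap(query: str, doc: str) -> float:
    """Toy scorer (assumption for this sketch): share of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


candidates = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
    "France is a country in Europe.",
]
top = rerank("capital of France", candidates, keyword_overlap, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Swapping `keyword_overlap` for a function that calls an LLM or a cross-encoder changes only the scoring step; the pairing, sorting, and selection logic stays the same.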

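## Benefits of this approach
Reranking offers several advantages:

1. Improved relevance: more sophisticated models can capture subtle relevance signals.
2. Flexibility: different reranking methods can be applied depending on specific needs and available resources.
3. Better context quality: supplying the RAG system with more relevant documents can improve the quality of generated responses.
4. Noise reduction: reranking helps filter out less relevant information and focus on the most relevant content.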

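## Conclusion
Reranking is a powerful technique in RAG systems that can significantly improve the quality of retrieved information. Whether it uses LLM-based scoring or a specialized cross-encoder model, reranking allows a more nuanced and accurate assessment of document relevance. That improved relevance carries over to better performance on downstream tasks, which makes reranking an important component of advanced RAG implementations.

The choice between LLM-based and cross-encoder reranking depends on factors such as the required accuracy, the available compute, and the needs of the specific application. Both approaches offer substantial improvements over basic retrieval and help raise the overall effectiveness of a RAG system.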



### Import relevant libraries

In [20]:
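import os
import sys
from dotenv import load_dotenv
from typing import List, Dict, Any, Tuple
# The names below are used later in this notebook; they are imported explicitly
# here rather than relying only on the star imports further down.
# Assumption: a recent LangChain release; older ones may need
# `from langchain_core.pydantic_v1 import BaseModel, Field` instead.
from pydantic import BaseModel, Field
from langchain.docstore.document import Document
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from sentence_transformers import CrossEncoder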

# Load environment variables from a .env file
load_dotenv()

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
from rag.helper_functions import *
from rag.evaluation.evalute_rag import *


# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

### Define the document's path

In [21]:
path = "../data/Understanding_Climate_Change.pdf"

### Create a vector store

In [22]:
vectorstore = encode_pdf(path)

## Method 1: LLM-based reranking

<div style="text-align: center;">

<img src="../images/rerank_llm.svg" alt="rerank llm" style="width:40%; height:auto;">
</div>

### Custom reranking function


In [23]:
class RatingScore(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of a document to a query.")

def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    prompt_template = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches.
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )
    
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)
    llm_chain = prompt_template | llm.with_structured_output(RatingScore)
    
    scored_docs = []
    for doc in docs:
        input_data = {"query": query, "doc": doc.page_content}
        score = llm_chain.invoke(input_data).relevance_score
        try:
            score = float(score)
        except ValueError:
            score = 0  # Default score if parsing fails
        scored_docs.append((doc, score))
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked_docs[:top_n]]
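
`rerank_documents` relies on `with_structured_output` to get a numeric score back from the model. When structured output is not available and the model replies in free text, the "score interpretation" step has to be done by hand. The sketch below shows one possible way to parse and normalize such a reply; `parse_relevance_score` is a hypothetical helper, not part of LangChain, and the 1-10 range simply mirrors the prompt above.

```python
import re


def parse_relevance_score(raw: str, low: float = 1.0, high: float = 10.0) -> float:
    """Extract the first number from a free-text reply and map it to [0, 1].

    Returns 0.0 when no number is found. Note that the first number wins,
    so a reply that restates the scale ("1-10") before the rating would be
    misread; a stricter prompt or pattern would be needed for that case.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return 0.0  # Default score if parsing fails
    value = float(match.group())
    # Clamp out-of-range answers (e.g. "15") back into the expected scale.
    clamped = min(max(value, low), high)
    # Normalize so scores from different scales become comparable.
    return (clamped - low) / (high - low)


for reply in ["Relevance Score: 8", "10", "not sure", "Score: 15"]:
    print(repr(reply), "->", round(parse_relevance_score(reply), 3))
```

Normalizing to a common range is mainly useful when scores from different prompts or models are compared or thresholded; for sorting a single batch, the raw scores give the same order.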

### Example usage of the reranking function with a sample query relevant to the document

In [24]:
query = "What are the impacts of climate change on biodiversity?"
initial_docs = vectorstore.similarity_search(query, k=15)
reranked_docs = rerank_documents(query, initial_docs)

# print first 3 initial documents
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


# Print results
print(f"Query: {query}\n")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

Top initial documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
goals. Policies should promote synergies between biodiversity conservation and climate 
action.  
Chapter 10: Climate Change and Human Health  
Health Impacts  
Heat -Related Illnesses  
Rising temper...

Document 3:
managed retreats.  
Extreme Weather Events  
Climate change is linked to an increase in the frequency and severity of extreme weather 
events, such as hurricanes, heatwaves, droughts, and heavy rainfa...
Query: What are the impacts of climate change on biodiversity?

Top reranked documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
Coral re

### Create a custom retriever based on our reranker

In [25]:
# Create a custom retriever class
class CustomRetriever(BaseRetriever, BaseModel):
    
    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        return rerank_documents(query, initial_docs, top_n=num_docs)


# Create the custom retriever
custom_retriever = CustomRetriever(vectorstore=vectorstore)

# Create an LLM for answering questions
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Create the RetrievalQA chain with the custom retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=custom_retriever,
    return_source_documents=True
)
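
The custom retriever above fetches 30 candidates and `rerank_documents` scores them one at a time, so each query triggers 30 sequential LLM calls. If latency matters, the calls can be issued concurrently. The following is a rough sketch under stated assumptions: `score_concurrently` is a hypothetical helper, `score_fn` stands in for a function that wraps the LLM chain, and provider rate limits are not handled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def score_concurrently(
    query: str,
    texts: List[str],
    score_fn: Callable[[str, str], float],
    max_workers: int = 8,
) -> List[Tuple[str, float]]:
    """Score every text against the query in parallel threads, best first."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with texts.
        scores = list(pool.map(lambda text: score_fn(query, text), texts))
    return sorted(zip(texts, scores), key=lambda item: item[1], reverse=True)


def fake_llm_score(query: str, text: str) -> float:
    """Stub scorer for illustration only: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


ranked = score_concurrently(
    "climate change biodiversity",
    ["biodiversity loss under climate change", "stock market news", "climate policy"],
    fake_llm_score,
)
print(ranked)
```

Threads suit this case because the work is I/O-bound (waiting on API responses); `max_workers` should stay below whatever request rate the provider allows.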


### Example query


In [26]:
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call on the chain

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity in several significant ways:

1. **Shifting Habitat Ranges**: As temperatures rise and precipitation patterns change, many species are forced to migrate to new areas where the climate is more suitable for their survival. This can lead to a shift in habitat ranges, which may not always be possible for all species, particularly those with limited mobility or specific habitat requirements.

2. **Changing Species Distributions**: The distribution of plant and animal species is changing as they move to new areas in response to climate changes. This can lead to new interactions between species, some of which may be harmful, such as increased competition for resources or the spread of diseases.

3. **Impacting Ecosystem Functions**: Changes in species composition and distribution can disrupt the functions of ecosystems. For example, the loss of key species can affect processes like p

### Why is reranking needed?

In [27]:
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. 
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.""", 
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]
docs = [Document(page_content=sentence) for sentence in chunks]


def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs, embeddings)

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")
    
    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    print("\nAdvanced Retrieval Result:")
    custom_retriever = CustomRetriever(vectorstore=vectorstore)
    advanced_docs = custom_retriever.invoke(query)
    for i, doc in enumerate(advanced_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)


query = "what is the capital of france?"
compare_rag_techniques(query, docs)

Comparison of Retrieval Techniques
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.

Advanced Retrieval Result:

Document 1:
I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city.

Document 2:
Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. 
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.


## Method 2: Cross-encoder models

<div style="text-align: center;">

<img src="../images/rerank_cross_encoder.svg" alt="rerank cross encoder" style="width:40%; height:auto;">
</div>

### Define the cross-encoder class

In [28]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

class CrossEncoderRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")
    cross_encoder: Any = Field(description="Cross-encoder model for reranking")
    k: int = Field(default=5, description="Number of documents to retrieve initially")
    rerank_top_k: int = Field(default=3, description="Number of documents to return after reranking")

    class Config:
        arbitrary_types_allowed = True

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Initial retrieval
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)
        
        # Prepare pairs for cross-encoder
        pairs = [[query, doc.page_content] for doc in initial_docs]
        
        # Get cross-encoder scores
        scores = self.cross_encoder.predict(pairs)
        
        # Sort documents by score
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)
        
        # Return top reranked documents
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval not implemented")
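
The retriever above only sorts by the cross-encoder scores, so their scale does not matter. Depending on the model and library configuration, those scores may be unbounded logits rather than probabilities. If a fixed cutoff is wanted in addition to top-K selection, one option is to squash the scores with a sigmoid first. This is a sketch of that idea, not part of the notebook's pipeline; `rank_with_threshold` and the 0.5 cutoff are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def rank_with_threshold(
    docs: Sequence[str],
    raw_scores: Sequence[float],
    min_prob: float = 0.5,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Convert raw scores to (0, 1), drop those below min_prob, return the best top_k."""
    scored = [(doc, sigmoid(float(score))) for doc, score in zip(docs, raw_scores)]
    kept = [item for item in scored if item[1] >= min_prob]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Made-up raw scores for illustration; real values would come from the model.
example = rank_with_threshold(["doc a", "doc b", "doc c"], [2.0, -3.0, 0.5])
for doc, prob in example:
    print(f"{prob:.3f}  {doc}")
```

Because the sigmoid is monotonic, it never changes the ranking; it only makes a threshold easier to reason about.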





### Create an instance and demonstrate it on an example

In [29]:
# Create the cross-encoder retriever
cross_encoder_retriever = CrossEncoderRetriever(
    vectorstore=vectorstore,
    cross_encoder=cross_encoder,
    k=10,  # Retrieve 10 documents initially
    rerank_top_k=5  # Return top 5 after reranking
)

# Set up the LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Create the RetrievalQA chain with the cross-encoder retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=cross_encoder_retriever,
    return_source_documents=True
)

# Example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call on the chain

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity in several significant ways:

1. **Shifting Habitat Ranges**: As temperatures rise and precipitation patterns change, many species are forced to move to new areas where the climate is more suitable for their survival. This can lead to shifts in habitat ranges for both plants and animals.

2. **Changing Species Distributions**: The distribution of species is altered as they migrate to new habitats. This can result in the introduction of species to areas where they were previously not found, potentially disrupting existing ecosystems.

3. **Impacting Ecosystem Functions**: Changes in species composition and distribution can affect the functions of ecosystems. For example, the loss of key species can disrupt food webs, pollination networks, and other ecological processes.

4. **Loss of Biodiversity**: The combined effects of habitat shifts, altered species distributions, and disr