# Context Enrichment Window for Document Retrieval

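## Overview

This notebook implements a context enrichment window technique for document retrieval from a vector database. It augments standard retrieval by **adding the surrounding context to each retrieved chunk**, which makes the returned information more coherent and complete.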

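## Motivation

Traditional vector search often returns isolated chunks of text that lack the context needed to understand them fully. This approach addresses that by including each retrieved chunk's neighboring chunks, which gives a more complete view of the retrieved information.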

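## Key Components

1. PDF processing and text chunking
2. Vector store creation with FAISS and OpenAI embeddings
3. A custom retrieval function with a context window
4. Comparison of standard and context-enriched retrieval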

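## Method Details

### Document preprocessing

1. Read the PDF and convert it to a string.
2. Split the text into overlapping chunks, tagging each chunk with its index.

### Vector store creation

1. Create vector representations of the chunks with OpenAI embeddings.
2. Build a FAISS vector store from these embeddings.

### Context-enriched retrieval

1. The `retrieve_with_context_overlap` function performs the following steps:
   - Retrieves the chunks relevant to the query
   - Fetches the neighboring chunks of each relevant chunk
   - Concatenates the chunks, accounting for the overlap between them
   - Returns the expanded context for each relevant chunk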
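
The steps above can be sketched without a vector store, on a plain list of strings. This is a minimal illustration, not the notebook's implementation: `split_with_overlap` and `merge_window` are hypothetical helper names, and the position of the retrieved chunk (`hit_index`) is supplied directly instead of coming from a similarity search. Unlike the function defined later in the notebook, this sketch derives the shared span from the previous chunk's actual length, so truncated chunks at the end of the text also merge cleanly.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def merge_window(chunks: List[str], hit_index: int, num_neighbors: int,
                 chunk_size: int, chunk_overlap: int) -> str:
    """Join the hit chunk with its neighbors, dropping the duplicated overlap."""
    step = chunk_size - chunk_overlap
    first = max(0, hit_index - num_neighbors)
    last = min(len(chunks) - 1, hit_index + num_neighbors)
    merged = chunks[first]
    for i in range(first + 1, last + 1):
        # Chunk i starts `step` characters after chunk i-1, so the two chunks
        # share len(chunks[i-1]) - step characters (never fewer than zero).
        shared = max(0, len(chunks[i - 1]) - step)
        merged += chunks[i][shared:]
    return merged


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
window = merge_window(chunks, hit_index=2, num_neighbors=1, chunk_size=10, chunk_overlap=4)
print(chunks[2])  # mnopqrstuv
print(window)     # ghijklmnopqrstuvwxyz
```

The merged window is a contiguous slice of the original text, which is the property the overlap handling is meant to preserve.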

### Retrieval comparison

The notebook includes a section that compares standard retrieval with the context-enriched approach.

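## Benefits of This Approach

1. Provides more coherent, context-rich results
2. Keeps the strengths of vector search while mitigating its tendency to return isolated text fragments
3. Allows the size of the context window to be adjusted flexibly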
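
How much text a given window returns follows directly from the chunking parameters. As a rough sketch (the function name `enriched_length` is hypothetical, and the formula assumes the retrieved chunk and all of its neighbors are full-size chunks away from the document's edges), each extra neighbor contributes `chunk_size - chunk_overlap` new characters:

```python
def enriched_length(chunk_size: int, chunk_overlap: int, num_neighbors: int) -> int:
    """Characters in the merged window around one interior chunk."""
    step = chunk_size - chunk_overlap  # new characters contributed by each neighbor
    return chunk_size + 2 * num_neighbors * step


# The two configurations used later in this notebook, plus a wider window.
print(enriched_length(400, 200, 1))  # 800
print(enriched_length(250, 20, 1))   # 710
print(enriched_length(400, 200, 2))  # 1200
```

With a large overlap, widening the window adds text slowly; with a small overlap, each neighbor adds almost a full chunk.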

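## Conclusion

This context enrichment window technique is a promising way to improve the quality of information retrieved by vector-based document search systems. By supplying the surrounding context, it helps keep retrieved information coherent and complete, which may lead to better understanding and more accurate responses in downstream tasks such as question answering.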


<div style="text-align: center;">

<img src="../images/context_enrichment_window.svg" alt="context enrichment window" style="width:70%; height:auto;">
</div>

### Import libraries and environment variables

In [14]:
import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document

# Load environment variables from a .env file
load_dotenv()

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
from rag.helper_functions import *
from rag.evaluation.evalute_rag import *

# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

### Define path to PDF

In [15]:
path = "../data/Understanding_Climate_Change.pdf"

### Read PDF to string

In [16]:
content = read_pdf_to_string(path)

### Function that splits text into chunks and stores each chunk's sequential index in its metadata

In [18]:
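def split_text_to_chunks_with_indices(text: str, chunk_size: int, chunk_overlap: int) -> List[Document]:
    # Guard against a non-positive step, which would make the loop below run forever
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Record each chunk's position in the original order so its neighbors can be found later
        chunks.append(Document(page_content=chunk, metadata={"index": len(chunks)}))
        start += chunk_size - chunk_overlap  # Advance by the non-overlapping part of the chunk
    return chunks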

### Split our document accordingly

In [23]:
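chunks_size = 400
chunk_overlap = 200
docs = split_text_to_chunks_with_indices(content, chunks_size, chunk_overlap)
print(len(content))                # Total number of characters in the document
print(len(docs))                   # Number of chunks
print(len(docs[0].page_content))   # Length of the first chunk
print(len(docs[-1].page_content))  # Length of the last chunk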

72561
363
400
161
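
These numbers can be sanity-checked by hand. Assuming the first value in the output, 72561, is the character length of the document, the chunk count and the length of the last chunk follow from the step size `chunk_size - chunk_overlap`. The helper name `expected_chunk_stats` is hypothetical and exists only for this check:

```python
import math
from typing import Tuple


def expected_chunk_stats(text_len: int, chunk_size: int, chunk_overlap: int) -> Tuple[int, int]:
    """Return (number of chunks, length of the last chunk) for a sliding-window split."""
    step = chunk_size - chunk_overlap
    n_chunks = math.ceil(text_len / step)      # one chunk starts every `step` characters
    last_start = (n_chunks - 1) * step         # start offset of the final chunk
    last_len = min(chunk_size, text_len - last_start)
    return n_chunks, last_len


print(expected_chunk_stats(72561, 400, 200))  # (363, 161)
```

The result agrees with the 363 chunks and the 161-character final chunk printed above.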


### Create the vector store and retriever

In [24]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
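
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below illustrates that idea with hand-made 2-D vectors and a brute-force Euclidean search in plain Python. It is not how FAISS is implemented, the vectors are made up rather than produced by an embedding model, and `nearest_chunks` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def nearest_chunks(query_vec: Sequence[float],
                   chunk_vecs: Dict[int, Sequence[float]],
                   k: int = 1) -> List[int]:
    """Return the indices of the k chunk vectors closest to query_vec (Euclidean distance)."""
    ranked = sorted(chunk_vecs, key=lambda idx: math.dist(query_vec, chunk_vecs[idx]))
    return ranked[:k]


# Made-up vectors keyed by chunk index; real embeddings have far more dimensions.
toy_vectors = {0: (0.9, 0.1), 1: (0.5, 0.5), 2: (0.1, 0.9)}
print(nearest_chunks((0.2, 0.8), toy_vectors, k=1))  # [2]
print(nearest_chunks((0.2, 0.8), toy_vectors, k=2))  # [2, 1]
```

Because the search returns chunk indices, the neighbors of a hit can be located by simple index arithmetic, which is what the context window relies on.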

### Fetch the k-th chunk (in original order) from the vector store

In [25]:
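from typing import Optional

def get_chunk_by_index(vectorstore, target_index: int) -> Optional[Document]:
    """
    Retrieve a chunk from the vector store by the index stored in its metadata.

    Args:
    vectorstore (VectorStore): The vector store containing the chunks.
    target_index (int): The index of the chunk to retrieve.

    Returns:
    Optional[Document]: The retrieved chunk as a Document object, or None if no chunk has that index.
    """
    # This is a simplified version. In practice you may need a more efficient way
    # to fetch a chunk by index, depending on how your vector store is implemented.

    # Request every stored document (k equals the number of vectors in the FAISS index),
    # then scan the results for the matching index
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc
    return None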

### Check the function

In [26]:
chunk = get_chunk_by_index(vectorstore, 0)
print(chunk.page_content)

Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human 
activities, particularly the burning of fossil fuels and 
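
As the comment in `get_chunk_by_index` notes, scanning every document on each lookup is a simplification. One possible alternative, sketched here under the assumption that the original list of chunks is still in memory, is to build an index-to-chunk dictionary once and reuse it. `Chunk` is a stand-in dataclass for LangChain's `Document` so that the snippet runs on its own, and `build_index_lookup` and `get_chunk_fast` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_index_lookup(chunks: List[Chunk]) -> Dict[int, Chunk]:
    """Map each chunk's metadata['index'] to the chunk, skipping chunks without one."""
    return {c.metadata["index"]: c for c in chunks if "index" in c.metadata}


def get_chunk_fast(lookup: Dict[int, Chunk], target_index: int) -> Optional[Chunk]:
    """Constant-time lookup; returns None when the index is unknown."""
    return lookup.get(target_index)


toy_chunks = [Chunk(f"chunk {i}", {"index": i}) for i in range(3)]
lookup = build_index_lookup(toy_chunks)
print(get_chunk_fast(lookup, 1).page_content)  # chunk 1
print(get_chunk_fast(lookup, 99))              # None
```

The dictionary is built once, so fetching the neighbors of a hit no longer requires a pass over the whole store.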


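### Retrieve from the vector store by semantic similarity, then pad each retrieved chunk with its num_neighbors chunks before and after, accounting for chunk overlap, to build a meaningful wider window around it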

In [28]:
def retrieve_with_context_overlap(vectorstore, retriever, query: str, num_neighbors: int = 1, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """
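    Retrieve chunks for a query, then fetch their neighboring chunks and concatenate them,
    accounting for the overlap between chunks and keeping them in index order.

    Args:
    vectorstore (VectorStore): The vector store containing the chunks.
    retriever: The retriever object used to fetch relevant documents.
    query (str): The query used to search for relevant chunks.
    num_neighbors (int): The number of chunks to retrieve before and after each relevant chunk.
    chunk_size (int): The size of each chunk in the original split (not used in the concatenation below).
    chunk_overlap (int): The overlap between consecutive chunks in the original split.

    Returns:
    List[str]: A list of concatenated chunk sequences, each centered on one relevant chunk.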
    """

    relevant_chunks = retriever.get_relevant_documents(query)
    result_sequences = []

    for chunk in relevant_chunks:
        current_index = chunk.metadata.get('index')
        if current_index is None:
            continue

        # Determine the range of chunks to retrieve
        start_index = max(0, current_index - num_neighbors)
        end_index = current_index + num_neighbors + 1  # +1 because range is exclusive at the end

        # Retrieve all chunks in the range
        neighbor_chunks = []
        for i in range(start_index, end_index):
            neighbor_chunk = get_chunk_by_index(vectorstore, i)
            if neighbor_chunk:
                neighbor_chunks.append(neighbor_chunk)

        # Sort chunks by their index to ensure correct order
        neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

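        # Skip this result if none of the chunks in the range could be found
        if not neighbor_chunks:
            continue

        # Concatenate chunks, accounting for overlap.
        # This assumes every chunk that is followed by another one is full-size,
        # so its last chunk_overlap characters repeat at the start of the next chunk.
        concatenated_text = neighbor_chunks[0].page_content
        for i in range(1, len(neighbor_chunks)):
            current_chunk = neighbor_chunks[i].page_content
            overlap_start = max(0, len(concatenated_text) - chunk_overlap)
            concatenated_text = concatenated_text[:overlap_start] + current_chunk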

        result_sequences.append(concatenated_text)

    return result_sequences

### Compare regular retrieval with context-window retrieval

In [29]:
# Baseline approach
query = "Explain the role of deforestation and fossil fuels in climate change."
baseline_chunk = chunks_query_retriever.get_relevant_documents(query)  # The retriever is already configured with k=1
# Focused context enrichment approach
enriched_chunks = retrieve_with_context_overlap(
    vectorstore,
    chunks_query_retriever,
    query,
    num_neighbors=1,
    chunk_size=400,
    chunk_overlap=200
)

print("Baseline Chunk:")
print(baseline_chunk[0].page_content)
print("\nEnriched Chunks:")
print(enriched_chunks[0])

Baseline Chunk:
ntribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
higher heat-trapping capability than CO2, albeit in smaller 

Enriched Chunks:
n. 
Boreal Forests 
Boreal forests, found in the northern regions of North America, Europe, and Asia, also play a 
crucial role in sequestering carbon. Logging and land-use changes in these regions contribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
higher heat-trapping capa

### An example that shows the advantage of adding a context window

In [30]:

document_content = """
Artificial Intelligence (AI) has a rich history dating back to the mid-20th century. The term "Artificial Intelligence" was coined in 1956 at the Dartmouth Conference, marking the field's official beginning.

In the 1950s and 1960s, AI research focused on symbolic methods and problem-solving. The Logic Theorist, created in 1955 by Allen Newell and Herbert A. Simon, is often considered the first AI program.

The 1960s saw the development of expert systems, which used predefined rules to solve complex problems. DENDRAL, created in 1965, was one of the first expert systems, designed to analyze chemical compounds.

However, the 1970s brought the first "AI Winter," a period of reduced funding and interest in AI research, largely due to overpromised capabilities and underdelivered results.

The 1980s saw a resurgence with the popularization of expert systems in corporations. The Japanese government's Fifth Generation Computer Project also spurred increased investment in AI research globally.

Neural networks gained prominence in the 1980s and 1990s. The backpropagation algorithm, although discovered earlier, became widely used for training multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In 2016, Google's AlphaGo defeated a world champion Go player, a landmark achievement in AI.

The current era of AI is characterized by the integration of deep learning with other AI techniques, the development of more efficient and powerful hardware, and the ethical considerations surrounding AI deployment.

Transformers, introduced in 2017, have become a dominant architecture in natural language processing, enabling models like GPT (Generative Pre-trained Transformer) to generate human-like text.

As AI continues to evolve, new challenges and opportunities arise. Explainable AI, robust and fair machine learning, and artificial general intelligence (AGI) are among the key areas of current and future research in the field.
"""

chunks_size = 250
chunk_overlap = 20
document_chunks = split_text_to_chunks_with_indices(document_content, chunks_size, chunk_overlap)
document_vectorstore = FAISS.from_documents(document_chunks, embeddings)
document_retriever = document_vectorstore.as_retriever(search_kwargs={"k": 1})

query = "When did deep learning become prominent in AI?"
context = document_retriever.get_relevant_documents(query)
context_pages_content = [doc.page_content for doc in context]

print("Regular retrieval:\n")
show_context(context_pages_content)

sequences = retrieve_with_context_overlap(document_vectorstore, document_retriever, query, num_neighbors=1, chunk_size=chunks_size, chunk_overlap=chunk_overlap)
print("\nRetrieval with context enrichment:\n")
show_context(sequences)

Regular retrieval:

Context 1:

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning method



Retrieval with context enrichment:

Context 1:
ng multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In