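# Context Enrichment Window for Document Retrieval

## Overview

This document explains how to retrieve documents from a vector database using the **context enrichment window technique**. The technique adds surrounding context to each retrieved text chunk, with the aim of making the returned information more coherent and complete.

## Motivation

Conventional vector search often returns isolated text chunks that lack the context needed to understand them fully. This approach addresses that by adding the text adjacent to each relevant chunk, so that the result carries more complete information.

## Key Components

1. PDF processing and text chunking
2. Vector store creation using FAISS and OpenAI embeddings
3. Custom retrieval function that adds a context window
4. Comparison between regular retrieval and context-enriched retrieval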

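## Method Details

### Document Preprocessing

1. Read the PDF file and convert it to a string.
2. Split the text into overlapping chunks and tag each chunk with its index.

### Vector Store Creation

1. Use OpenAI embeddings to create a vector representation of each chunk.
2. Build a FAISS vector store from these embeddings.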

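### Context-Enriched Retrieval

1. The `retrieve_with_context_overlap` function works as follows:
   - Retrieves the chunks relevant to the query.
   - Fetches the neighboring chunks of each relevant chunk.
   - Concatenates the chunks, accounting for their overlap, and returns the enriched context.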
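
The steps above can be sketched on plain strings, without a vector store. This is a minimal illustration, assuming fixed-size character chunks; `split_with_overlap` and `stitch_window` are hypothetical helper names rather than functions from the notebook, and the "retrieved" index is picked by hand instead of by similarity search.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def stitch_window(chunks: List[str], hit_index: int, num_neighbors: int, chunk_overlap: int) -> str:
    """Join the hit chunk with its neighbors, dropping the duplicated overlap."""
    first = max(0, hit_index - num_neighbors)
    last = min(len(chunks) - 1, hit_index + num_neighbors)
    window = chunks[first]
    for i in range(first + 1, last + 1):
        # The first chunk_overlap characters of each chunk repeat the tail of the previous one.
        window += chunks[i][chunk_overlap:]
    return window


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
# Pretend similarity search returned chunk 2; widen it by one neighbor on each side.
window = stitch_window(chunks, hit_index=2, num_neighbors=1, chunk_overlap=4)
print(chunks[2])  # mnopqrstuv
print(window)     # ghijklmnopqrstuvwxyz
```

The notebook's `retrieve_with_context_overlap` instead trims the last `chunk_overlap` characters of the accumulated text before appending the next chunk; for full-length chunks the two approaches give the same result.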

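### Retrieval Comparison

The notebook includes a section that compares the results of regular retrieval with those of context-enriched retrieval.

## Benefits of This Approach

1. Provides more coherent and richer context
2. Keeps the advantages of vector search while overcoming the limitations of isolated text fragments
3. Allows the context window size to be adjusted flexibly

## Conclusion

The context enrichment window technique is an effective way to improve the quality of retrieved information in vector-based document search systems. By supplying the surrounding context, it preserves the coherence and completeness of the retrieved information, which can lead to more accurate and easier-to-understand results in downstream tasks such as question answering.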


<div style="text-align: center;">

<img src="../images/vector-search-comparison_context_enrichment.svg" alt="context enrichment window" style="width:70%; height:auto;">
</div>

<div style="text-align: center;">

<img src="../images/context_enrichment_window.svg" alt="context enrichment window" style="width:70%; height:auto;">
</div>

### Import libraries and environment variables

In [1]:
import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document


# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from helper_functions import *
from evaluation.evalute_rag import *

# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')



### Define path to PDF

In [2]:
path = "../data/Understanding_Climate_Change.pdf"

### Read PDF to string

In [3]:
content = read_pdf_to_string(path)

### Function to split text into chunks with metadata of the chunk chronological index

In [4]:
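# Split text into fixed-size character chunks with overlap (e.g. chunk size 400, overlap 200)
def split_text_to_chunks_with_indices(text: str, chunk_size: int, chunk_overlap: int) -> List[Document]:
    # Guard against a non-positive step, which would make the loop below run forever
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Tag each chunk with its position in the original order
        chunks.append(Document(page_content=chunk, metadata={"index": len(chunks), "text": text}))
        # Advance by less than chunk_size so consecutive chunks overlap
        start += chunk_size - chunk_overlap
    return chunks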

### Split our document accordingly

In [5]:
chunks_size = 400
chunk_overlap = 200
docs = split_text_to_chunks_with_indices(content, chunks_size, chunk_overlap)
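
As a quick sanity check on these settings, the number of chunks can be estimated from the text length. The sketch below assumes the same stepping rule as `split_text_to_chunks_with_indices` (the start offset advances by `chunk_size - chunk_overlap` characters each time); `count_chunks` is a hypothetical helper and the 1,000-character length is a made-up example, not the length of the PDF.

```python
import math


def count_chunks(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of chunks produced when the start offset advances by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return math.ceil(text_length / step)


# With chunk_size=400 and chunk_overlap=200 the start offset advances 200 characters at a time,
# so a 1,000-character text yields chunks starting at offsets 0, 200, 400, 600 and 800.
print(count_chunks(1000, 400, 200))  # 5
```

With an overlap of half the chunk size, most characters end up in two chunks, which roughly doubles the number of embeddings compared with non-overlapping chunks.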

In [12]:
docs[0]



### Create vector store and retriever

In [6]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

### Function to draw the k<sup>th</sup> chunk (in the original order) from the vector store 


In [7]:
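from typing import Optional

def get_chunk_by_index(vectorstore, target_index: int) -> Optional[Document]:
    # This is a simplified version. In practice, you might need a more efficient method
    # to retrieve chunks by index, depending on your vectorstore implementation.
    # Fetch every stored chunk (k = total number of vectors), then scan for the target index.
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc
    # No chunk carries this index (e.g. it is past the end of the document)
    return None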
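
Because the lookup above runs a similarity search over every stored chunk on each call, a plain dictionary keyed by chunk index may be a cheaper alternative when the original chunk list is still in memory. The following is a sketch under that assumption; `Chunk` is a hypothetical stand-in for LangChain's `Document` so the example runs on its own, and `build_index_lookup` and `get_chunk_by_index_fast` are not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_index_lookup(chunks: List[Chunk]) -> Dict[int, Chunk]:
    """Map each chunk's 'index' metadata value to the chunk itself."""
    return {c.metadata["index"]: c for c in chunks if "index" in c.metadata}


def get_chunk_by_index_fast(lookup: Dict[int, Chunk], target_index: int) -> Optional[Chunk]:
    """Return the chunk with the given index, or None if it does not exist."""
    return lookup.get(target_index)


chunks = [Chunk(f"chunk {i}", {"index": i}) for i in range(3)]
lookup = build_index_lookup(chunks)
print(get_chunk_by_index_fast(lookup, 1).page_content)  # chunk 1
print(get_chunk_by_index_fast(lookup, 7))               # None
```

The dictionary is built once, so each neighbor lookup becomes a constant-time operation and needs no embedding call.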

### Check the function
- Returns the doc whose `index` metadata equals the target index.

In [8]:
chunk = get_chunk_by_index(vectorstore, 0)
print(chunk.page_content)

Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human 
activities, particularly the burning of fossil fuels and 


### Function that retrieves from the vector store based on semantic similarity and then pads each retrieved chunk with its num_neighbors before and after, taking into account the chunk overlap to construct a meaningful wide window around it

In [13]:
def retrieve_with_context_overlap(vectorstore, retriever, query: str, num_neighbors: int = 1, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """
    Retrieve the chunks relevant to the query and pad each with its neighboring chunks.

    Args:
        vectorstore: Vector store holding the chunks. Chunks must be retrievable by index through get_chunk_by_index.
        retriever: Retriever used to find the documents or chunks relevant to the query.
        query: Query used to retrieve the relevant chunks.
        num_neighbors: Number of neighboring chunks to add on each side of a relevant chunk. Defaults to 1 (one before and one after).
        chunk_size: Size of each chunk when the text was originally split. Defaults to 200.
        chunk_overlap: Overlap between consecutive chunks. Defaults to 20.

    Returns:
        List of concatenated chunk sequences, one per retrieved chunk.
    """
    relevant_chunks = retriever.get_relevant_documents(query)
    result_sequences = []

    for chunk in relevant_chunks:
        current_index = chunk.metadata.get('index')
        if current_index is None:
            continue

        # Determine the range of chunk indices to fetch, including the neighbors on both sides
        start_index = max(0, current_index - num_neighbors)  # clamp at the first chunk
        end_index = current_index + num_neighbors + 1  # +1 because range is exclusive at the end

        # Fetch every chunk in that range and collect them in a list
        neighbor_chunks = []
        for i in range(start_index, end_index):
            neighbor_chunk = get_chunk_by_index(vectorstore, i)
            if neighbor_chunk:
                neighbor_chunks.append(neighbor_chunk)

        # Sort by index to restore the original order
        neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

        # Start with the text of the first chunk
        concatenated_text = neighbor_chunks[0].page_content
        # Append each following chunk, dropping the overlapping part so text is not duplicated
        for i in range(1, len(neighbor_chunks)):
            current_chunk = neighbor_chunks[i].page_content
            overlap_start = max(0, len(concatenated_text) - chunk_overlap)
            concatenated_text = concatenated_text[:overlap_start] + current_chunk

        result_sequences.append(concatenated_text)

    return result_sequences
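
For a retrieved chunk that is not near the start or end of the document, the length of the stitched window follows directly from the parameters: each of the `2 * num_neighbors + 1` chunks contributes `chunk_size` characters, and each of the `2 * num_neighbors` joins removes `chunk_overlap` duplicated characters. A small sketch of that arithmetic, assuming full-length interior chunks (the helper name `expected_window_length` is hypothetical):

```python
def expected_window_length(num_neighbors: int, chunk_size: int, chunk_overlap: int) -> int:
    """Length of the stitched window for a full-length interior chunk."""
    n_chunks = 2 * num_neighbors + 1
    return n_chunks * chunk_size - (n_chunks - 1) * chunk_overlap


# One neighbor per side with 400-character chunks and 200 overlap
print(expected_window_length(1, 400, 200))  # 800
# One neighbor per side with 250-character chunks and 20 overlap
print(expected_window_length(1, 250, 20))   # 710
```

This only holds when the `chunk_overlap` passed to the retrieval function matches the value used when the text was split; a mismatch would either duplicate or drop characters at each join.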

### Comparing regular retrieval and retrieval with context window

In [14]:
# Baseline approach
query = "Explain the role of deforestation and fossil fuels in climate change."
# The retriever was already configured with k=1 through search_kwargs
baseline_chunk = chunks_query_retriever.get_relevant_documents(query)
# Focused context enrichment approach
enriched_chunks = retrieve_with_context_overlap(
    vectorstore,
    chunks_query_retriever,
    query,
    num_neighbors=1,
    chunk_size=400,
    chunk_overlap=200
)

print("Baseline Chunk:")
print(baseline_chunk[0].page_content)
print("\nEnriched Chunks:")
print(enriched_chunks[0])


Baseline Chunk:
ntribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
higher heat-trapping capability than CO2, albeit in smaller 

Enriched Chunks:
n. 
Boreal Forests 
Boreal forests, found in the northern regions of North America, Europe, and Asia, also play a 
crucial role in sequestering carbon. Logging and land-use changes in these regions contribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
higher heat-trapping capa

### An example that showcases the benefit of an additional context window

In [15]:

document_content = """
Artificial Intelligence (AI) has a rich history dating back to the mid-20th century. The term "Artificial Intelligence" was coined in 1956 at the Dartmouth Conference, marking the field's official beginning.

In the 1950s and 1960s, AI research focused on symbolic methods and problem-solving. The Logic Theorist, created in 1955 by Allen Newell and Herbert A. Simon, is often considered the first AI program.

The 1960s saw the development of expert systems, which used predefined rules to solve complex problems. DENDRAL, created in 1965, was one of the first expert systems, designed to analyze chemical compounds.

However, the 1970s brought the first "AI Winter," a period of reduced funding and interest in AI research, largely due to overpromised capabilities and underdelivered results.

The 1980s saw a resurgence with the popularization of expert systems in corporations. The Japanese government's Fifth Generation Computer Project also spurred increased investment in AI research globally.

Neural networks gained prominence in the 1980s and 1990s. The backpropagation algorithm, although discovered earlier, became widely used for training multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In 2016, Google's AlphaGo defeated a world champion Go player, a landmark achievement in AI.

The current era of AI is characterized by the integration of deep learning with other AI techniques, the development of more efficient and powerful hardware, and the ethical considerations surrounding AI deployment.

Transformers, introduced in 2017, have become a dominant architecture in natural language processing, enabling models like GPT (Generative Pre-trained Transformer) to generate human-like text.

As AI continues to evolve, new challenges and opportunities arise. Explainable AI, robust and fair machine learning, and artificial general intelligence (AGI) are among the key areas of current and future research in the field.
"""

chunks_size = 250
chunk_overlap = 20
document_chunks = split_text_to_chunks_with_indices(document_content, chunks_size, chunk_overlap)
document_vectorstore = FAISS.from_documents(document_chunks, embeddings)
document_retriever = document_vectorstore.as_retriever(search_kwargs={"k": 1})

query = "When did deep learning become prominent in AI?"
context = document_retriever.get_relevant_documents(query)
context_pages_content = [doc.page_content for doc in context]

print("Regular retrieval:\n")
show_context(context_pages_content)

sequences = retrieve_with_context_overlap(document_vectorstore, document_retriever, query, num_neighbors=1)
print("\nRetrieval with context enrichment:\n")
show_context(sequences)

Regular retrieval:

Context 1:

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning method



Retrieval with context enrichment:

Context 1:
ng multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In

### My thoughts
- The code fetches the chunks immediately before and after each retrieved chunk and returns them together with it.