# Semantic Chunking for Document Processing

## Overview

This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.

## Motivation

Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.

## Key Components

1. PDF processing and text extraction
2. Semantic chunking using LangChain's SemanticChunker
3. Vector store creation using FAISS and OpenAI embeddings
4. Retriever setup for querying the processed documents

## Method Details

### Document Preprocessing

1. The PDF is read and converted to a string using a custom `read_pdf_to_string` function.

### Semantic Chunking

1. Utilizes LangChain's `SemanticChunker` with OpenAI embeddings.
2. Three breakpoint types are available:
   - 'percentile': Splits wherever the distance between consecutive sentences exceeds the X-th percentile of all such distances.
   - 'standard_deviation': Splits wherever the distance is more than X standard deviations above the mean distance.
   - 'interquartile': Uses the interquartile range of the distances to set the split threshold.
3. In this implementation, the 'percentile' method is used with a threshold of 90.
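
To make the 'percentile' rule concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the two-dimensional vectors are hypothetical stand-ins for real sentence embeddings, `split_by_percentile` is an illustrative helper name, and the real splitter does more work around this core idea. The sketch only shows the thresholding step: measure the distance between each pair of neighbouring sentences, then cut where the distance exceeds the chosen percentile.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def split_by_percentile(sentences, vectors, q=90):
    """Start a new chunk wherever the distance to the next sentence exceeds the q-th percentile."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical toy data: three sentences on one topic, two on another.
sentences = [
    "Greenhouse gases trap heat.",
    "Carbon dioxide is the main one.",
    "Methane is another.",
    "Solar panels convert light to power.",
    "Wind turbines do the same with moving air.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

chunks = split_by_percentile(sentences, vectors, q=90)
for chunk in chunks:
    print(chunk)
```

Because the threshold is derived from the document's own distance distribution, a higher percentile produces fewer, larger chunks and a lower one produces more, smaller chunks.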

### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the semantic chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.
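
The retrieval step can be pictured as a nearest-neighbour lookup over the chunk vectors. The sketch below is a brute-force stand-in written in plain Python, not the FAISS index itself: the vectors are hypothetical toy values, `top_k` is an illustrative helper, and cosine similarity is assumed here for simplicity (the actual index may rank by a different metric).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Hypothetical embeddings for three chunks and one query.
chunk_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query_vec = [0.9, 0.2]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

A vector index such as FAISS serves the same purpose as this loop, but avoids comparing the query against every stored vector one by one in Python.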

## Key Features

1. Context-Aware Splitting: Attempts to maintain semantic coherence within chunks.
2. Flexible Configuration: Allows for different breakpoint types and thresholds.
3. Integration with Advanced NLP Tools: Uses OpenAI embeddings for both chunking and retrieval.

## Benefits of this Approach

1. Improved Coherence: Chunks are more likely to contain complete thoughts or ideas.
2. Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced.
3. Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs.
4. Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments.

## Implementation Details

1. Uses OpenAI's embeddings for both the semantic chunking process and the final vector representations.
2. Employs FAISS for creating an efficient searchable index of the chunks.
3. The retriever is set up to return the top 2 most relevant chunks, which can be adjusted as needed.

## Example Usage

The code includes a test query: "What is the main cause of climate change?". This demonstrates how the semantic chunking and retrieval system can be used to find relevant information from the processed document.

## Conclusion

Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports.

<div style="text-align: center;">

<img src="../images/semantic_chunking_comparison.svg" alt="Semantic chunking comparison" style="width:100%; height:auto;">
</div>

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [None]:
# Install required packages (version specifiers are quoted so the shell does not treat ">" as a redirect)
!pip install -q "langchain>=0.2.0" "langchain-openai>=0.1.0" "langchain-community>=0.2.0" langchain-experimental python-dotenv faiss-cpu pypdf "openai>=1.0.0"

In [None]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')
# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .

In [None]:
import os
import sys
from dotenv import load_dotenv

# SemanticChunker ships in the langchain-experimental package (pip install langchain-experimental)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Import helper functions
try:
    from helper_functions import *
    from evaluation.evalute_rag import *
except ImportError:
    print("Warning: Helper functions not found. Defining fallback functions...")
    # Fallback function for PDF reading
    def read_pdf_to_string(file_path):
        from pypdf import PdfReader
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""  # Guard against pages with no extractable text
        return text
    
    def retrieve_context_per_question(query, retriever):
        docs = retriever.invoke(query)
        return docs
    
    def show_context(context):
        for i, doc in enumerate(context, 1):
            print(f"\n{'='*80}")
            print(f"Document {i}:")
            print(f"{'='*80}")
            print(doc.page_content)
            print(f"\nMetadata: {doc.metadata}")

# Load environment variables from a .env file
load_dotenv()

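# Make sure the OpenAI API key is available (os.environ rejects None values)
api_key = os.getenv("OPENAI_API_KEY")
if api_key:
    os.environ["OPENAI_API_KEY"] = api_key
else:
    print("Warning: OPENAI_API_KEY is not set. Set it before creating embeddings.")

print("✅ All packages imported successfully!")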

### OpenAI API Key Setup

If you don't have a .env file, you can set your API key directly in Colab:

In [None]:
# Optional: Set your OpenAI API key directly (if not using .env file)
# Uncomment and add your key if needed

# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')  # Using Colab secrets

# OR enter it interactively (the key is typed at a prompt rather than saved in the notebook)
# import getpass
# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

### Download Required Data Files

In [None]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -q -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
print("✅ Data files downloaded successfully!")

### Define file path

In [None]:
path = "data/Understanding_Climate_Change.pdf"
print(f"📄 PDF Path: {path}")
print(f"File exists: {os.path.exists(path)}")

### Read PDF to string

In [None]:
content = read_pdf_to_string(path)
print("✅ PDF loaded successfully!")
print(f"📊 Content length: {len(content)} characters")
print(f"\n📝 First 500 characters:\n{content[:500]}...")

### Breakpoint types
* **percentile**: the distances between all consecutive sentences are calculated, and the text is split wherever a distance exceeds the X-th percentile.
* **standard_deviation**: the text is split wherever a distance is more than X standard deviations above the mean.
* **interquartile**: the interquartile range of the distances is used to set the split threshold.

**Note**: `SemanticChunker` is provided by the `langchain-experimental` package and is imported from `langchain_experimental.text_splitter`.
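
The three rules differ only in how they turn the list of sentence-to-sentence distances into a cut-off value. The sketch below compares them on a small, made-up list of distances. The formulas (mean plus X standard deviations, mean plus X times the interquartile range) reflect my reading of how such thresholds are typically computed and should be treated as an assumption, not as the library's exact code; `breakpoint_threshold` and the sample amounts are illustrative.

```python
import math
import statistics

def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def breakpoint_threshold(distances, kind, amount):
    """Turn a list of distances into a cut-off value (assumed formulas)."""
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"Unknown breakpoint type: {kind}")

# Hypothetical distances between nine consecutive sentences (eight gaps).
distances = [0.10, 0.12, 0.11, 0.60, 0.09, 0.13, 0.55, 0.10]

settings = [("percentile", 90), ("standard_deviation", 1), ("interquartile", 1.5)]
cuts_by_kind = {}
for kind, amount in settings:
    threshold = breakpoint_threshold(distances, kind, amount)
    # A cut at index i means the text is split between sentence i and sentence i + 1.
    cuts_by_kind[kind] = [i for i, d in enumerate(distances) if d > threshold]
    print(f"{kind:>18}: threshold={threshold:.3f}, cut indices={cuts_by_kind[kind]}")
```

On this toy input the percentile rule is the strictest and keeps only the single largest jump, which illustrates why the choice of rule and amount changes the number of chunks.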

In [None]:
# Initialize the OpenAI embeddings model
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")  # Reused below for both chunking and retrieval

# Create semantic chunker with updated parameters
text_splitter = SemanticChunker(
    embeddings_model,
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=90
)

print("✅ Semantic Chunker initialized successfully!")

### Split original text to semantic chunks

In [None]:
print("🔄 Creating semantic chunks...")
docs = text_splitter.create_documents([content])
print(f"✅ Created {len(docs)} semantic chunks")
print("\n📊 Chunk Statistics:")
chunk_lengths = [len(doc.page_content) for doc in docs]
print(f"  - Average chunk size: {sum(chunk_lengths) / len(chunk_lengths):.0f} characters")
print(f"  - Min chunk size: {min(chunk_lengths)} characters")
print(f"  - Max chunk size: {max(chunk_lengths)} characters")

### Create vector store and retriever

In [None]:
print("🔄 Creating vector store...")
# Use the same embeddings model for consistency
vectorstore = FAISS.from_documents(docs, embeddings_model)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
print("✅ Vector store and retriever created successfully!")

### Test the retriever

In [None]:
test_query = "What is the main cause of climate change?"
print(f"🔍 Test Query: {test_query}\n")

context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

### Additional Testing: Compare Different Breakpoint Types

In [None]:
# Optional: Test different breakpoint types
print("🧪 Testing different breakpoint types...\n")

breakpoint_types = [
    ('percentile', 90),
    ('standard_deviation', 2),
    ('interquartile', None)
]

for bp_type, bp_amount in breakpoint_types:
    print(f"\n{'='*80}")
    print(f"Testing: {bp_type} (amount: {bp_amount})")
    print(f"{'='*80}")
    
    if bp_amount is not None:  # None means "use the splitter's default amount"
        splitter = SemanticChunker(
            embeddings_model,
            breakpoint_threshold_type=bp_type,
            breakpoint_threshold_amount=bp_amount
        )
    else:
        splitter = SemanticChunker(
            embeddings_model,
            breakpoint_threshold_type=bp_type
        )
    
    test_docs = splitter.create_documents([content])
    print(f"Number of chunks: {len(test_docs)}")
    chunk_sizes = [len(doc.page_content) for doc in test_docs]
    print(f"Average chunk size: {sum(chunk_sizes) / len(chunk_sizes):.0f} characters")

### Save and Load Vector Store (Optional)

In [None]:
# Optional: Save the vector store for later use
# vectorstore.save_local("semantic_chunks_faiss_index")
# print("✅ Vector store saved!")

# To load later:
# vectorstore = FAISS.load_local(
#     "semantic_chunks_faiss_index",
#     embeddings_model,
#     allow_dangerous_deserialization=True
# )
# print("✅ Vector store loaded!")