# Semantic Chunking for Document Processing

## Overview

This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.

## Motivation

Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.

## Key Components

1. PDF processing and text extraction
2. Semantic chunking using LangChain's SemanticChunker
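3. Vector store creation using FAISS and an embedding model (OpenAI embeddings in the original design; the cells below run with the Hugging Face `thenlper/gte-small` model)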
4. Retriever setup for querying the processed documents

## Method Details

### Document Preprocessing

1. The PDF is read and converted to a string using a custom `read_pdf_to_string` function.

### Semantic Chunking

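1. Uses LangChain's `SemanticChunker` together with an embedding model. The OpenAI variant is kept as commented-out code; the cells below run with `thenlper/gte-small`.
2. The chunker compares the embeddings of consecutive sentences and splits where the difference between them is large. Three breakpoint types are covered here:
   - 'percentile': Splits wherever the difference exceeds the Xth percentile of all differences.
   - 'standard_deviation': Splits wherever the difference exceeds the mean by more than X standard deviations.
   - 'interquartile': Uses the interquartile range of the differences to set the split threshold.
3. In this implementation, the 'percentile' method is used with a threshold of 90.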
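
The breakpoint types differ only in how they turn the list of sentence-to-sentence differences into a cut-off value. The following is a minimal sketch of that step, assuming the rules are: a percentile of the differences, the mean plus X standard deviations, and the mean plus X times the interquartile range. The function names and the sample distances are hypothetical, and the snippet is an illustration rather than LangChain's own code.

```python
import statistics


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def breakpoint_threshold(distances, kind, amount):
    """Turn a list of distances into a single cut-off value."""
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        # Population standard deviation, added to the mean.
        return statistics.fmean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.fmean(distances) + amount * iqr
    raise ValueError(f"unknown breakpoint type: {kind}")


def split_indices(distances, kind, amount):
    """Indices of the gaps whose distance exceeds the threshold."""
    threshold = breakpoint_threshold(distances, kind, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between ten pairs of consecutive sentences.
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.13, 0.10, 0.12, 0.11]

print(split_indices(distances, "percentile", 90))
print(split_indices(distances, "standard_deviation", 1))
print(split_indices(distances, "interquartile", 1.5))
```

With this sample, the 90th percentile keeps only the single largest gap, while the other two settings also split at the second large gap. A higher threshold therefore produces fewer, longer chunks.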

### Vector Store Creation

1. The semantic chunks are embedded with the same embedding model that was used for chunking.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.
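
The vector store and retriever steps can be pictured with a small in-memory stand-in. This is a sketch only: `embed` is a toy word-count embedding rather than a real model, `ToyVectorStore` is a hypothetical class that does a brute-force scan instead of using FAISS, and the sample chunks are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: L2-normalized word counts stored as a sparse dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two normalized sparse vectors (a dot product)."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


class ToyVectorStore:
    """Brute-force stand-in for a vector index."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def search(self, query, k=2):
        """Return the k chunks most similar to the query."""
        query_vector = embed(query)
        scores = [cosine(query_vector, vector) for vector in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.chunks[i] for i in ranked[:k]]


chunks = [
    "greenhouse gases trap heat in the atmosphere",
    "coral reefs host diverse marine life",
    "burning fossil fuels releases greenhouse gases",
]
store = ToyVectorStore(chunks)
top_two = store.search("greenhouse gases from fossil fuels", k=2)
print(top_two)
```

The `k` argument plays the same role as `search_kwargs={"k": 2}` in the retriever: it caps how many chunks are handed to the downstream step.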

## Key Features

1. Context-Aware Splitting: Attempts to maintain semantic coherence within chunks.
2. Flexible Configuration: Allows for different breakpoint types and thresholds.
3. Integration with Advanced NLP Tools: Uses one embedding model for both chunking and retrieval.

## Benefits of this Approach

1. Improved Coherence: Chunks are more likely to contain complete thoughts or ideas.
2. Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced.
3. Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs.
4. Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments.

## Implementation Details

1. Uses the same embedding model for both the semantic chunking process and the final vector representations (OpenAI embeddings in the original design, `thenlper/gte-small` in the cells below).
2. Employs FAISS for creating an efficient searchable index of the chunks.
3. The retriever is set up to return the top 2 most relevant chunks, which can be adjusted as needed.

## Example Usage

The code includes a test query: "What is the main cause of climate change?". This demonstrates how the semantic chunking and retrieval system can be used to find relevant information from the processed document.

## Conclusion

Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports.

<div style="text-align: center;">

<img src="https://github.com/NirDiamant/RAG_Techniques/blob/main/images/semantic_chunking_comparison.svg?raw=1" alt="Semantic chunking comparison" style="width:100%; height:auto;">
</div>

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [1]:
# Install required packages
!pip install langchain-experimental langchain-openai python-dotenv


In [2]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')
# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .



In [3]:
!pip install pymupdf -q
!pip install -qU langchain-cohere -q
!pip install rank_bm25 deepeval prompt-template -q

In [57]:
import os
import sys
from dotenv import load_dotenv
from google.colab import userdata

# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Make the cloned repository importable (path adjusted for Colab)
sys.path.append("/content/RAG_TECHNIQUES")

# Helper and evaluation utilities from the cloned repository
from helper_functions import *
from evaluation.evalute_rag import *

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Load environment variables from a .env file
load_dotenv()

False

### Define file path

In [18]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


--2026-01-26 09:46:10--  https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206372 (202K) [application/octet-stream]
Saving to: ‘data/Understanding_Climate_Change.pdf’


2026-01-26 09:46:10 (8.13 MB/s) - ‘data/Understanding_Climate_Change.pdf’ saved [206372/206372]


In [59]:
path = "/content/RAG_TECHNIQUES/data/Understanding_Climate_Change.pdf"

### Read PDF to string

In [60]:
content = read_pdf_to_string(path)
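
`read_pdf_to_string` comes from the repository's helper module and is not shown in this notebook. A rough sketch of what such a helper might look like, assuming it relies on PyMuPDF (the `pymupdf` package installed above, imported as `fitz`) and simply concatenates the text of every page; the name `read_pdf_to_string_sketch` is hypothetical.

```python
def read_pdf_to_string_sketch(path):
    """Concatenate the extracted text of every page in a PDF.

    Assumption: PyMuPDF is installed. The import is kept inside the
    function so that defining it does not require the package.
    """
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    try:
        # A PyMuPDF document can be iterated page by page.
        return "".join(page.get_text() for page in doc)
    finally:
        doc.close()


# Usage (requires the PDF downloaded earlier):
# content = read_pdf_to_string_sketch("data/Understanding_Climate_Change.pdf")
```

Text extracted this way keeps the PDF's line breaks and headers, so the chunker sees the raw page text rather than cleaned paragraphs.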

### Breakpoint types:
* 'percentile': the differences between consecutive sentences are calculated, and the text is split wherever a difference exceeds the Xth percentile.
* 'standard_deviation': the text is split wherever a difference exceeds X standard deviations.
* 'interquartile': the interquartile range of the differences is used to decide where to split.

In [67]:
!pip install sentence-transformers langchain-community -q


# text_splitter = SemanticChunker(
#     OpenAIEmbeddings(
#         model="text-embedding-3-small", api_key=userdata.get("OPENAI_API_KEY")),
#     breakpoint_threshold_type='percentile',
#     breakpoint_threshold_amount=90)  # choose which embeddings, breakpoint type, and threshold to use

from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-small",
    encode_kwargs={"normalize_embeddings": True}
)

text_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)



### Split original text to semantic chunks

In [68]:
docs = text_splitter.create_documents([content])
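
Conceptually, the splitting step walks through the sentences in order and starts a new chunk whenever two neighbouring sentences are too far apart in embedding space. A minimal sketch of that grouping, assuming the sentence vectors and the threshold are already available; `group_sentences`, the two-dimensional vectors, and the sample sentences are hypothetical and are not what `SemanticChunker` does internally line for line.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def group_sentences(sentences, vectors, threshold):
    """Join consecutive sentences, starting a new chunk at each large gap."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(previous, current) > threshold:
            groups.append([])  # the gap is large: begin a new chunk
        groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


sentences = [
    "CO2 traps heat.",
    "Methane traps heat too.",
    "Reefs host many fish.",
]
# Made-up 2-D vectors: the first two point in nearly the same direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

chunks = group_sentences(sentences, vectors, threshold=0.5)
print(chunks)
```

In the notebook the threshold is not fixed by hand; it is derived from the distances themselves according to the chosen breakpoint type.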

### Create vector store and retriever

In [73]:
# embeddings = OpenAIEmbeddings()
# vectorstore = FAISS.from_documents(docs, embeddings)
# chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
!pip install faiss-cpu -q


from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-small",
    encode_kwargs={"normalize_embeddings": True}
)

vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 2}
)

### Test the retriever

In [76]:
def retrieve_context_per_question(query, retriever):
    # Return one string per retrieved chunk. show_context iterates over its
    # argument, so a single joined string would be printed one character at a time.
    docs = retriever.invoke(query)
    return [doc.page_content for doc in docs]


test_query = "What is the main cause of climate change?"

context = retrieve_context_per_question(
    test_query,
    chunks_query_retriever
)

show_context(context)


Context 1:
Understanding Climate Change 
Chapter 1: Introduction to Climat
