# Semantic chunking for document processing

This notebook demonstrates how to use semantic chunking for processing and retrieving information from a PDF document. Traditional methods for splitting text into chunks are typically based on fixed character or word counts, which may break the text in unnatural places, disrupting the flow of information. This can affect the quality of document retrieval, as the context is lost. Semantic chunking aims to solve this problem by splitting the text at more meaningful breakpoints, ensuring that each chunk retains its semantic coherence.



In [1]:
import os
from dotenv import load_dotenv
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import pymupdf

# Load environment variables from a .env file
load_dotenv()

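# Fail early with a clear message if the API key is missing
# (load_dotenv() has already copied any .env values into os.environ)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file or environment")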

## Document preprocessing
Let's start by reading the content of a PDF file and splitting it into manageable chunks.

### Read PDF to string
Here, we read the PDF content and convert it into a string using the `pymupdf` library. This will allow us to process the text further.

In [2]:
# Define the path to the PDF
path = "Understanding_Climate_Change.pdf"

# Open the PDF document
doc = pymupdf.open(path)
content = ""

# Iterate over each page and append its extracted text to the content string
for page in doc:
    content += page.get_text()

The PDF file is opened using `pymupdf.open()`, and we iterate over all pages to extract the text. The extracted text is then concatenated into one large string. This will allow us to process it further and split it into manageable chunks.

## Perform semantic chunking
Now that we have the full text of the document, we apply semantic chunking using LangChain's `SemanticChunker` with OpenAI embeddings. This will allow us to split the text at more meaningful points based on the semantic content, as opposed to arbitrary word or character breaks.

The chunking process works by analyzing the semantic distance between consecutive sentences. If the difference between two consecutive sentences is large (in terms of their meaning), it marks that as a breakpoint where the document should be split into a new chunk.
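
To make the idea of "semantic distance between consecutive sentences" concrete, here is a minimal sketch using cosine distance on hand-made 2-D vectors. The vectors and the names `toy_embeddings` and `cosine_distance` are hypothetical stand-ins for illustration only; real embeddings have many more dimensions, and `SemanticChunker` may differ in its details.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity: near 0 for similar vectors, near 1 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 2-D "embeddings" for four sentences (assumption: toy data, not model output)
toy_embeddings = [
    [1.0, 0.0],  # sentence 0: topic A
    [0.9, 0.1],  # sentence 1: topic A
    [0.0, 1.0],  # sentence 2: topic B
    [0.1, 0.9],  # sentence 3: topic B
]

# Distance between each sentence and the one that follows it
consecutive_distances = [
    cosine_distance(toy_embeddings[i], toy_embeddings[i + 1])
    for i in range(len(toy_embeddings) - 1)
]

# Index of the largest gap: the most likely place for a topic change
largest_gap = max(range(len(consecutive_distances)), key=consecutive_distances.__getitem__)
print(consecutive_distances)
print(largest_gap)
```

In this toy data the largest distance falls between sentences 1 and 2, which is where the topic changes.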

In [3]:
# Initialize the semantic chunker
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='percentile', breakpoint_threshold_amount=90)

# Create semantic chunks
chunks = text_splitter.create_documents([content])

We configure the `SemanticChunker` with the `percentile` breakpoint type and a threshold amount of 90. The splitter measures the difference in meaning between each pair of consecutive sentences and splits the text wherever that difference exceeds the 90th percentile of all such differences in the document, that is, wherever it is larger than 90% of the others. This ensures that splits happen at significant changes in meaning rather than at minor ones.

We then call the `create_documents` method, passing the full content of the document to be chunked. The method returns the text split into smaller chunks, where each chunk contains semantically related content.
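
The percentile rule can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the helper names `percentile` and `split_at_breakpoints`, the sentence labels, and the distance values are all hypothetical, and the distances are supplied by hand rather than computed from embeddings.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def split_at_breakpoints(sentences, distances, q=90):
    """Start a new chunk wherever the distance to the previous sentence exceeds the q-th percentile.

    distances[i] is the distance between sentences[i] and sentences[i + 1].
    """
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk before adding this sentence
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical data: 11 sentence labels and the 10 distances between neighbours
toy_sentences = [f"S{i}" for i in range(11)]
toy_distances = [0.1, 0.2, 0.1, 0.9, 0.15, 0.1, 0.2, 0.1, 0.8, 0.1]

toy_chunks = split_at_breakpoints(toy_sentences, toy_distances, q=90)
print(toy_chunks)
```

With these numbers the 90th percentile is about 0.81, so only the 0.9 jump between `S3` and `S4` becomes a breakpoint; the 0.8 jump falls just below the threshold. Lowering the percentile would produce more, smaller chunks.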

## Vector store creation
Now, let's create a vector store for the text chunks. We convert each chunk into a numerical vector representation using OpenAI's embeddings and store the resulting vectors in a FAISS vector store.

In [4]:
# Initialize the OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Create a FAISS vector store from the document chunks
vectorstore = FAISS.from_documents(chunks, embeddings)

# Create a retriever for the vector store to fetch relevant documents
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

- Here, we initialize the `OpenAIEmbeddings` class, which calls OpenAI's embedding API to convert the text of each chunk into a vector.
- The `FAISS.from_documents` method takes the list of `chunks` (which contains the document text) and the initialized `embeddings` model, converts each chunk into a vector, and stores the vectors in a FAISS vector store.
- The `as_retriever()` method wraps the FAISS vector store in a retriever object, which searches for relevant documents based on the vectors stored in the FAISS index. The `search_kwargs={"k": 2}` parameter ensures that each query returns only the 2 most relevant documents. We can increase `k` to return more documents if needed.

## Retrieve context based on query

Now, we define the query that we will use to search for relevant context in the document. The retriever fetches the top `k=2` most relevant chunks based on their similarity to the query.


In [5]:
# Define the query to search for relevant context
query = "What is the main cause of climate change?"

# Retrieve the top 2 most relevant document chunks based on the query
docs = chunks_query_retriever.invoke(query)

# Extract the page content from the retrieved documents
context = [doc.page_content for doc in docs]

The retriever works by computing the similarity between the query and the document chunks stored in the FAISS vector store. The top `k` most similar chunks are returned as context for the query.
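
As a rough picture of what a top-`k` search does, the sketch below ranks a few hand-made vectors by their Euclidean distance to a query vector and keeps the closest two. It is a brute-force illustration only: the vectors and the `top_k` helper are hypothetical, the choice of Euclidean distance is an assumption, and a real FAISS index performs this search far more efficiently.

```python
import math

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors closest to the query (smallest distance first)."""
    distances = [(math.dist(query_vec, vec), index) for index, vec in enumerate(chunk_vecs)]
    return [index for _, index in sorted(distances)[:k]]

# Hypothetical 2-D vectors standing in for embedded chunks and an embedded query
toy_chunk_vectors = [
    [0.0, 1.0],   # chunk 0
    [1.0, 0.0],   # chunk 1
    [0.9, 0.2],   # chunk 2
    [-1.0, 0.0],  # chunk 3
]
toy_query_vector = [1.0, 0.1]

nearest = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(nearest)
```

Here chunks 1 and 2 are returned because they point in nearly the same direction as the query, while chunks 0 and 3 are much farther away.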

### Display the retrieved context

Now, we print the retrieved context for the query.


In [6]:
# Display the relevant context retrieved for the query
for i, c in enumerate(context):
    print(f"Context {i + 1}:")
    print(c)
    print("\n")

Context 1:
The Intergovernmental Panel on Climate Change (IPCC) has 
documented these changes extensively. Ice core samples, tree rings, and ocean sediments 
provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate. Fossil Fuels 
Burning fossil fuels for energy releases large amounts of CO2. This includes co

Overall, because semantic chunking keeps related sentences together, each retrieved chunk tends to carry coherent context, which can improve the quality of retrieved information and the performance of downstream NLP tasks.