# Chunking.ipynb - Text Chunking Techniques for RAG Applications
## Overview
This Jupyter notebook demonstrates various text chunking strategies for Retrieval-Augmented Generation (RAG) applications. It explores different methods to split large documents into smaller, manageable chunks.

## Dataset
The notebook uses Amazon shareholder letters from 2019-2022 as sample documents:

- AMZN-2022-Shareholder-Letter.pdf
- AMZN-2021-Shareholder-Letter.pdf
- AMZN-2020-Shareholder-Letter.pdf
- AMZN-2019-Shareholder-Letter.pdf

These documents are automatically downloaded and stored in a ./data/ directory.


In [None]:
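!mkdir -p ./data
import ssl
from urllib.request import urlretrieve

# Caution: this disables HTTPS certificate verification for all urllib requests in this session.
# It works around certificate errors in some environments; remove it if downloads succeed without it.
ssl._create_default_https_context = ssl._create_unverified_context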
urls = [
    "https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf",
    "https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf",
    "https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf",
    "https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf",
]

filenames = [
    "AMZN-2022-Shareholder-Letter.pdf",
    "AMZN-2021-Shareholder-Letter.pdf",
    "AMZN-2020-Shareholder-Letter.pdf",
    "AMZN-2019-Shareholder-Letter.pdf",
]

data_root = "./data/"

In [None]:
# Download each letter and save it under its local filename
for url, filename in zip(urls, filenames):
    file_path = data_root + filename
    urlretrieve(url, file_path)

In [None]:
from langchain_community.document_loaders import PyPDFLoader
import os

data_root = "./data/"
folder_path = data_root
documents = []

# Loop through the PDF files in the folder (sorted for a reproducible order)
for filename in sorted(os.listdir(folder_path)):
    if not filename.lower().endswith(".pdf"):
        continue  # skip anything that is not a PDF
    file_path = os.path.join(folder_path, filename)
    loader = PyPDFLoader(file_path)
    # Load the PDF; each page becomes one Document
    data = loader.load()
    # Add the loaded pages to the documents list
    documents.extend(data)

# Print the text of the first page of the first document
if documents:
    print(documents[0].page_content)
else:
    print("No PDF files found in the folder.")

## Overlap Chunking
Use Case: Simple, fixed-size chunks with a small overlap to preserve context across chunk boundaries
- chunk_size: the maximum length, in characters, of each chunk.

- chunk_overlap: the number of characters shared between consecutive chunks. The overlap carries context from the end of one chunk into the start of the next, which helps when a sentence or idea straddles a boundary.

- separator: the string on which the text is split before the pieces are merged into chunks. The default is "\n\n", which splits at blank lines (two consecutive newline characters). The CharacterTextSplitter call in this section passes separator="" instead, so the text is cut purely by character count.
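To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a sliding character window. It is an illustration only, not LangChain's implementation (CharacterTextSplitter splits on the separator first and then merges pieces); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the context that chunk_overlap preserves.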

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10, separator="")
splits = text_splitter.split_documents(documents)

In [None]:
splits[:2]

## Recursive Character Splitting
Use Case: More intelligent splitting that respects document structure (paragraphs → sentences → words)

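RecursiveCharacterTextSplitter tries a list of separators in order, falling back to the next one whenever a piece is still larger than chunk_size. The default list is ["\n\n", "\n", " ", ""], which keeps paragraphs, then lines (roughly sentences), then words together for as long as possible, on the assumption that these are the most closely related pieces of text. The example in this section overrides the default with separators=["\n"], so the text is split on single newlines only, and a line longer than chunk_size is not broken up further.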
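The fallback from coarse to fine separators can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: it ignores chunk_overlap, and the function name `recursive_split` is hypothetical rather than part of LangChain.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to a hard cut by character count
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond paragraph is a bit longer than the first one.\n\nThird."
pieces = recursive_split(text, chunk_size=30, separators=["\n\n", "\n", " ", ""])
for p in pieces:
    print(repr(p))
```

The short paragraphs survive intact, while the long middle paragraph is split at word boundaries because no paragraph or line break fits within the limit.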

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separators=["\n"]
)
rec_text_splits = rec_text_splitter.split_documents(documents)

In [None]:
rec_text_splits[:2]

## Semantic Chunking
Use Case: Content-aware chunking that maintains topical coherence

Features:
- Splits based on semantic similarity between sentences
- Uses embedding vectors to determine natural break points
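The two features above can be illustrated with a small sketch that needs no embedding model. The sentences and 2-D vectors are hand-made toy values standing in for real embeddings, and `cosine_distance` and `split_on_distance` are hypothetical helper names, not SemanticChunker internals.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; larger values mean less similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_on_distance(sentences: list[str], vectors: list[list[float]], threshold: float) -> list[str]:
    """Start a new chunk wherever consecutive sentences are farther apart than threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))  # topic shift: close the current chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# Toy data (assumption): two sentences per topic, with vectors pointing in two directions
sentences = [
    "The cat sat on the mat.",
    "The kitten purred softly.",
    "Interest rates rose again.",
    "Bond yields followed.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

semantic_chunks = split_on_distance(sentences, vectors, threshold=0.5)
print(semantic_chunks)
```

The only large distance is between the second and third sentences, so the sketch produces one chunk per topic.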

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

In [None]:
semantic_text_splitter = SemanticChunker(embedding_model)

semantic_text_splits = semantic_text_splitter.split_documents(documents)

In [None]:
semantic_text_splits[:2]

- breakpoint_threshold_type="percentile" tells the chunker to derive its split threshold from the data rather than use a fixed value. It measures the embedding distance between each pair of consecutive sentences and splits wherever that distance exceeds the Nth percentile of all distances, that is, where similarity is unusually low. N is set by the breakpoint_threshold_amount parameter on a 0-100 scale; in recent langchain_experimental releases the default for the percentile type is 95, though this may vary by version. You can set it explicitly like this:

In [None]:
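# Note: in recent langchain_experimental releases breakpoint_threshold_amount is read as a
# percentile on a 0-100 scale, so 0.175 splits at almost every sentence boundary.
semantic_text_splitter2 = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=0.175,
)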
semantic_text_splits2 = semantic_text_splitter2.split_documents(documents)
semantic_text_splits2[:2]
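To see how a percentile turns into break points, the sketch below applies a percentile threshold to a list of made-up distances between consecutive sentences. It assumes the chunker compares distances (1 minus similarity) against a percentile computed with linear interpolation, as NumPy does by default; the `percentile` helper and the distance values are illustrative, not taken from the library or from the shareholder letters.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile for q in [0, 100] using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical distances between consecutive sentences (index i = gap after sentence i)
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.12, 0.55, 0.09, 0.11]

threshold = percentile(distances, 75)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
# 0.12 [3, 6]
```

Raising the percentile produces fewer, larger chunks; lowering it toward 0 makes nearly every gap a break point.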

## Hierarchical Chunking
Use Case: Complex documents requiring contextual relationships between sections

Hierarchical chunking goes a step further by organizing each document into parent and child chunks: large parent chunks are split into progressively smaller children, and every chunk records which parent it came from.

Because these relationships are preserved, a retriever can match a query against small, precise child chunks and then pull in the larger parent for surrounding context, which helps the model produce more contextually relevant and coherent responses.
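The parent/child structure can be sketched with a small recursive function. This is a simplified illustration that measures chunk sizes in characters and uses hypothetical names (`Node`, `build_hierarchy`); it is not how any particular library implements hierarchical parsing.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: int
    text: str
    level: int                       # 0 = largest chunks, higher = smaller chunks
    parent_id: Optional[int] = None
    child_ids: list[int] = field(default_factory=list)


def build_hierarchy(text: str, chunk_sizes: list[int]) -> list[Node]:
    """Split text at each size in chunk_sizes, linking every chunk to its parent."""
    nodes: list[Node] = []

    def split(segment: str, level: int, parent_id: Optional[int]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            piece = segment[start:start + size]
            node = Node(node_id=len(nodes), text=piece, level=level, parent_id=parent_id)
            nodes.append(node)
            if parent_id is not None:
                nodes[parent_id].child_ids.append(node.node_id)
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node.node_id)  # children are cut from the parent

    split(text, 0, None)
    return nodes


tree = build_hierarchy("abcdefghijklmnopqrst", chunk_sizes=[8, 4])
for node in tree:
    print(node.node_id, node.level, repr(node.text), node.parent_id, node.child_ids)
```

Leaf nodes (the deepest level) are what a retriever would typically index, and the parent_id links let it walk back up to a larger chunk when more context is needed.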

In [None]:
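from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import HierarchicalNodeParser

# Reload the PDFs as LlamaIndex documents (this replaces the LangChain `documents` list used above)
reader = SimpleDirectoryReader(input_dir="data")
documents = reader.load_data()

# Three levels of chunks, from largest (parents) to smallest (leaves); sizes are in tokens
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[512, 254, 128])

nodes = node_parser.get_nodes_from_documents(documents)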

In [None]:
for node in nodes[:2]:
    print(len(node.text), node.id_, node.relationships)