## Introduction to Chunking
Chunking is the process of breaking large pieces of text into smaller, manageable segments for Large Language Model (LLM) applications. It is a critical technique in Retrieval-Augmented Generation (RAG) systems, because where the chunk boundaries fall largely determines how relevant the content retrieved from a vector database will be.


## Why is chunking important?

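* LLMs and embedding models have finite context windows, so long documents must be split to fit
* Chunks that follow topic boundaries stay semantically coherent
* A focused chunk yields an embedding that represents one idea rather than a blend of several
* Cleaner embeddings improve retrieval accuracy
* Search results can point to the specific passage instead of a whole document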

## Core Chunking Parameters
Before diving into strategies, let's understand the key parameters:

1. `chunk_size`: Maximum number of characters or tokens in each chunk
2. `chunk_overlap`: Number of characters/tokens that overlap between consecutive chunks
3. `separator`: Character(s) used to split the text
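
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a sliding character window. The helper name `sliding_window_chunks` is hypothetical and this is not how any particular library implements splitting. It only illustrates that each new chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Hypothetical helper, for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

In the printed list, the last character of each chunk is repeated as the first character of the next one, which is the one-character overlap at work.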

## 1. Fixed-Size Chunking (Character-Based)


In [16]:
from langchain.text_splitter import CharacterTextSplitter

# Sample text
text = """
This is a long document that needs to be split into smaller chunks.
Each chunk will have a maximum of 100 characters.

This approach is simple but may cut sentences in the middle.
We can use overlap to maintain some context between chunks.
"""

# Initialize the splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",           # Split on double newlines
    chunk_size=50,              # Target size; text between separators is never split further
    chunk_overlap=10,           # 10 characters overlap
    length_function=len,        # Use character count
    is_separator_regex=False    # Separator is not a regex
)

# Create documents
docs = text_splitter.create_documents([text])

Created a chunk of size 118, which is longer than the specified 50


In [17]:
docs

[Document(metadata={}, page_content='This is a long document that needs to be split into smaller chunks.\nEach chunk will have a maximum of 100 characters.'),
 Document(metadata={}, page_content='This approach is simple but may cut sentences in the middle.\nWe can use overlap to maintain some context between chunks.')]

In [18]:
# Print results
for i, doc in enumerate(docs):
    print(f"Chunk {i+1}: {doc.page_content}")
    print(f"Length: {len(doc.page_content)}")
    print("---")

Chunk 1: This is a long document that needs to be split into smaller chunks.
Each chunk will have a maximum of 100 characters.
Length: 117
---
Chunk 2: This approach is simple but may cut sentences in the middle.
We can use overlap to maintain some context between chunks.
Length: 120
---

Both chunks are longer than `chunk_size=50`. `CharacterTextSplitter` splits only on the given separator, so a paragraph that exceeds `chunk_size` is emitted whole, which is what the warning above reports.


## 2. Recursive Character Text Splitting

##### How it works:

* Tries to split by the first separator (\n\n)
* If chunks are still too large, moves to the next separator (\n)
* Continues with space ( ) and finally empty string ("")
* Recursively processes until desired chunk size is achieved
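
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are dropped at chunk boundaries.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Simplified recursive splitting without overlap (illustration only)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:] or ("",)
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate        # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)     # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Too large on its own: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


result = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
print(result)
```

The first paragraph is too long for one chunk, so it falls through to the space separator, while the second paragraph fits and is kept whole.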


In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """# Introduction
This is a comprehensive guide to text splitting.

## Chapter 1: Basics
Text splitting is fundamental to RAG applications.
It helps in processing large documents efficiently.

## Chapter 2: Advanced Techniques
There are several sophisticated methods available.
Each has its own advantages and use cases.

The choice depends on your specific requirements.
"""

# Initialize recursive splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,             # Target chunk size
    chunk_overlap=30,           # Overlap between chunks
    length_function=len,        # Length measurement function
    separators=["\n\n", "\n", " ", ""]  # Hierarchy of separators
)

# Split the text
docs = text_splitter.create_documents([text])

for i, doc in enumerate(docs):
    print(f"Chunk {i+1}:")
    print(repr(doc.page_content))
    print(f"Length: {len(doc.page_content)}")
    print("---")

Chunk 1:
'# Introduction\nThis is a comprehensive guide to text splitting.'
Length: 63
---
Chunk 2:
'## Chapter 1: Basics\nText splitting is fundamental to RAG applications.\nIt helps in processing large documents efficiently.'
Length: 123
---
Chunk 3:
'## Chapter 2: Advanced Techniques\nThere are several sophisticated methods available.\nEach has its own advantages and use cases.'
Length: 127
---
Chunk 4:
'The choice depends on your specific requirements.'
Length: 49
---


#### Language-Specific Splitting

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

# Python code example
python_code = '''
class TextProcessor:
    def __init__(self, name):
        self.name = name
    
    def process(self, text):
        return text.upper()

def main():
    processor = TextProcessor("example")
    result = processor.process("hello world")
    print(result)

if __name__ == "__main__":
    main()
'''

# Split Python code with language-specific separators
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=0
)

docs = python_splitter.create_documents([python_code])

In [22]:
for i, doc in enumerate(docs):
    print(f"Chunk {i+1}:")
    print(repr(doc.page_content))
    print(f"Length: {len(doc.page_content)}")
    print("---")

Chunk 1:
'class TextProcessor:\n    def __init__(self, name):\n        self.name = name\n    \n    def process(self, text):\n        return text.upper()'
Length: 137
---
Chunk 2:
'def main():\n    processor = TextProcessor("example")\n    result = processor.process("hello world")\n    print(result)\n\nif __name__ == "__main__":\n    main()'
Length: 155
---


## 3. Token-Based Splitting

* Splits text based on tokens rather than characters.
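
As a rough sketch of the idea, the snippet below windows over tokens instead of characters. It assumes whitespace-separated words as a stand-in tokenizer. Real tokenizers such as tiktoken produce subword tokens, so actual counts will differ. The name `token_window_chunks` and its `tokenize` parameter are hypothetical.

```python
from typing import Callable


def token_window_chunks(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], list[str]] = str.split,  # stand-in tokenizer
) -> list[str]:
    """Group tokens into windows of `chunk_size` with `chunk_overlap` shared tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window reached the last token
    return chunks


token_chunks = token_window_chunks(
    "one two three four five six seven", chunk_size=3, chunk_overlap=1
)
print(token_chunks)
```

Because limits are counted in tokens, the character length of each chunk varies, which matches the behaviour of the token splitter in the next cell.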

In [23]:
from langchain.text_splitter import TokenTextSplitter

text = """

Artificial Intelligence (AI) refers to the development of computer systems capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, understanding language, and recognizing patterns. AI technologies are widely used in various industries such as healthcare, finance, transportation, and entertainment. For example, AI powers voice assistants like Siri and Alexa, recommends shows on Netflix, and helps doctors analyze medical images more accurately.

AI systems are built using algorithms and large datasets, often relying on machine learning—a subset of AI that enables computers to improve their performance over time without being explicitly programmed. With advancements in deep learning and natural language processing, AI has made significant progress in areas like self-driving cars and chatbots.

While AI offers many benefits, including efficiency and automation, it also raises ethical concerns around privacy, bias, and job displacement. Responsible development and regulation of AI are crucial to ensure it serves humanity positively.


"""

# Token-based splitting
token_splitter = TokenTextSplitter(
    chunk_size=100,      # 100 tokens per chunk
    chunk_overlap=10     # 10 tokens overlap
)

docs = token_splitter.create_documents([text])

In [25]:
for i, doc in enumerate(docs):
    print(f"Chunk {i+1}:")
    print(repr(doc.page_content))
    print(f"Length: {len(doc.page_content)}")
    print("---")

Chunk 1:
'\n\nArtificial Intelligence (AI) refers to the development of computer systems capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, understanding language, and recognizing patterns. AI technologies are widely used in various industries such as healthcare, finance, transportation, and entertainment. For example, AI powers voice assistants like Siri and Alexa, recommends shows on Netflix, and helps doctors analyze medical images more accurately.\n\nAI systems are built using algorithms and large datasets'
Length: 579
---
Chunk 2:
'\nAI systems are built using algorithms and large datasets, often relying on machine learning—a subset of AI that enables computers to improve their performance over time without being explicitly programmed. With advancements in deep learning and natural language processing, AI has made significant progress in areas like self-driving cars and chatbots.\n\nWhile AI offers many 

#### Using tiktoken for OpenAI models

In [31]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Using tiktoken encoder for OpenAI models
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100,     # 100 tokens per chunk
    chunk_overlap=10,   # 10 tokens overlap
    encoding_name="cl100k_base"  # Encoding used by GPT-4
)

docs = text_splitter.split_text(text)

In [33]:
len(docs)

2

In [42]:
for i in range(len(docs)):
    print(f"Chunk {i+1}:\n")
    print(docs[i]+"\n\n")

Chunk 1:

Artificial Intelligence (AI) refers to the development of computer systems capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, understanding language, and recognizing patterns. AI technologies are widely used in various industries such as healthcare, finance, transportation, and entertainment. For example, AI powers voice assistants like Siri and Alexa, recommends shows on Netflix, and helps doctors analyze medical images more accurately.


Chunk 2:

AI systems are built using algorithms and large datasets, often relying on machine learning—a subset of AI that enables computers to improve their performance over time without being explicitly programmed. With advancements in deep learning and natural language processing, AI has made significant progress in areas like self-driving cars and chatbots.

While AI offers many benefits, including efficiency and automation, it also raises ethical concerns arou

## 4. Semantic Chunking

* This advanced technique splits text based on semantic similarity rather than fixed sizes, creating more contextually coherent chunks.


In [44]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text = """
Artificial intelligence is transforming industries worldwide.
Machine learning algorithms can process vast amounts of data.
Deep learning has revolutionized computer vision and NLP.

Climate change poses significant challenges for our planet.
Rising temperatures affect weather patterns globally.
Renewable energy sources offer sustainable solutions.

Space exploration continues to capture human imagination.
Mars missions are planned for the next decade.
Satellite technology improves communication systems.
"""

# Initialize semantic chunker
import os
os.environ["OPENAI_API_KEY"] = ""  # Placeholder: supply your own OpenAI API key
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(embeddings)

# Create semantically coherent chunks
docs = semantic_splitter.create_documents([text])

for i, doc in enumerate(docs):
    print(f"Semantic Chunk {i+1}:")
    print(doc.page_content)
    print("---")

Semantic Chunk 1:

Artificial intelligence is transforming industries worldwide. Machine learning algorithms can process vast amounts of data. Deep learning has revolutionized computer vision and NLP. Climate change poses significant challenges for our planet.
---
Semantic Chunk 2:
Rising temperatures affect weather patterns globally. Renewable energy sources offer sustainable solutions. Space exploration continues to capture human imagination. Mars missions are planned for the next decade. Satellite technology improves communication systems. 
---


#### Different Semantic Splitting Methods

In [48]:
# 1. Percentile method (default)
percentile_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95.0  # 95th percentile
)

# 2. Standard deviation method
std_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.0  # 1 standard deviation above the mean
)

# 3. Interquartile range method
iqr_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5  # IQR multiplier
)

# 4. Gradient method (for highly correlated text)
gradient_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=95.0
)

In [46]:
text = """

AI is transforming healthcare.
Machine learning helps in disease prediction.
NLP can extract data from medical records.

Climate change is a serious issue.
Greenhouse gases are rising.
Renewable energy is the solution.

Mars missions are being planned.
SpaceX has sent rockets to space.
Satellites help track weather patterns.
"""

percentile_splitter.create_documents([text])

[Document(metadata={}, page_content='\n\nAI is transforming healthcare. Machine learning helps in disease prediction. NLP can extract data from medical records. Climate change is a serious issue.'),
 Document(metadata={}, page_content='Greenhouse gases are rising. Renewable energy is the solution. Mars missions are being planned. SpaceX has sent rockets to space. Satellites help track weather patterns. ')]

In [49]:
std_splitter.create_documents([text])

[Document(metadata={}, page_content='\n\nAI is transforming healthcare. Machine learning helps in disease prediction. NLP can extract data from medical records. Climate change is a serious issue.'),
 Document(metadata={}, page_content='Greenhouse gases are rising. Renewable energy is the solution. Mars missions are being planned.'),
 Document(metadata={}, page_content='SpaceX has sent rockets to space. Satellites help track weather patterns. ')]

In [50]:
iqr_splitter.create_documents([text])

[Document(metadata={}, page_content='\n\nAI is transforming healthcare. Machine learning helps in disease prediction. NLP can extract data from medical records. Climate change is a serious issue.'),
 Document(metadata={}, page_content='Greenhouse gases are rising. Renewable energy is the solution. Mars missions are being planned.'),
 Document(metadata={}, page_content='SpaceX has sent rockets to space. Satellites help track weather patterns. ')]

## 5. Document Structure-Aware Chunking


#### Markdown Text Splitter
Preserves markdown structure while splitting.

In [51]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# Main Title

## Section 1
This is the first section with important content.

### Subsection 1.1
More detailed information here.

## Section 2
This is the second section.

### Subsection 2.1
Additional details and examples.
"""

markdown_splitter = MarkdownTextSplitter(
    chunk_size=200,
    chunk_overlap=0
)

docs = markdown_splitter.create_documents([markdown_text])

In [52]:
docs

[Document(metadata={}, page_content='# Main Title\n\n## Section 1\nThis is the first section with important content.\n\n### Subsection 1.1\nMore detailed information here.\n\n## Section 2\nThis is the second section.'),
 Document(metadata={}, page_content='### Subsection 2.1\nAdditional details and examples.')]

#### HTML Header Text Splitter
Splits HTML while preserving header hierarchy.

In [53]:
from langchain.text_splitter import HTMLHeaderTextSplitter

html_string = """
<html>
<body>
    <div>
        <h1>Main Title</h1>
        <p>Introduction paragraph</p>
        
        <h2>Chapter 1</h2>
        <p>Content of chapter 1</p>
        
        <h3>Section 1.1</h3>
        <p>Detailed content for section 1.1</p>
        
        <h2>Chapter 2</h2>
        <p>Content of chapter 2</p>
    </div>
</body>
</html>
"""

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

html_header_splits = html_splitter.split_text(html_string)

In [54]:
html_header_splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),
 Document(metadata={'Header 1': 'Main Title'}, page_content='Introduction paragraph'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 1'}, page_content='Chapter 1'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 1'}, page_content='Content of chapter 1'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section 1.1'}, page_content='Section 1.1'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section 1.1'}, page_content='Detailed content for section 1.1'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 2'}, page_content='Chapter 2'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Chapter 2'}, page_content='Content of chapter 2')]

## 6. Sentence-Based Chunking

Splits text while respecting sentence boundaries, so no chunk ends in the middle of a sentence.
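
To illustrate the idea without any dependencies, here is a minimal sketch that detects sentence ends with a naive regular expression and packs whole sentences into chunks. The helper `sentence_chunks` is hypothetical. NLTK's Punkt tokenizer, used below, handles abbreviations and other edge cases that this regex does not.

```python
import re


def sentence_chunks(text: str, chunk_size: int) -> list[str]:
    """Pack whole sentences into chunks of at most `chunk_size` characters."""
    # Naive boundary: '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else f"{current} {sentence}"
        if not current or len(candidate) <= chunk_size:
            # Either the sentence fits, or it starts a new chunk
            # (a single over-long sentence is kept whole, never cut).
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sent_chunks = sentence_chunks(
    "AI is useful. It learns from data. Climate is changing.", chunk_size=35
)
print(sent_chunks)
```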


In [58]:
from langchain.text_splitter import NLTKTextSplitter
import nltk
nltk.download('punkt')

text = """
Artificial intelligence is transforming industries worldwide.
Machine learning algorithms can process vast amounts of data.
Deep learning has revolutionized computer vision and NLP.

Climate change poses significant challenges for our planet.
Rising temperatures affect weather patterns globally.
Renewable energy sources offer sustainable solutions.

Space exploration continues to capture human imagination.
Mars missions are planned for the next decade.
Satellite technology improves communication systems.
"""

nltk_splitter = NLTKTextSplitter(
    chunk_size=100,
    chunk_overlap=10
)

docs = nltk_splitter.split_text(text)

[nltk_data] Downloading package punkt to /home/erginous/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [59]:
docs

['Artificial intelligence is transforming industries worldwide.',
 'Machine learning algorithms can process vast amounts of data.',
 'Deep learning has revolutionized computer vision and NLP.',
 'Climate change poses significant challenges for our planet.',
 'Rising temperatures affect weather patterns globally.',
 'Renewable energy sources offer sustainable solutions.',
 'Space exploration continues to capture human imagination.',
 'Mars missions are planned for the next decade.\n\nSatellite technology improves communication systems.']

## 7. JSON and Structured Data Chunking
For handling JSON and other nested structured data, where splitting should follow the structure so that each chunk is still a valid object.
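
A minimal sketch of structure-aware splitting, assuming we only pack top-level keys greedily by serialized size. The helper `split_json_by_keys` is hypothetical and much simpler than a recursive splitter, which also descends into nested dictionaries.

```python
import json


def split_json_by_keys(data: dict, max_chunk_size: int) -> list[dict]:
    """Greedily group top-level keys so each chunk serializes to <= max_chunk_size
    characters where possible. A single oversized value is kept whole."""
    chunks: list[dict] = []
    current: dict = {}
    for key, value in data.items():
        candidate = {**current, key: value}
        if current and len(json.dumps(candidate)) > max_chunk_size:
            chunks.append(current)      # adding this key would overflow: flush
            current = {key: value}
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = {"a": [1, 2, 3], "b": "hello", "c": {"x": 1}}
small_chunks = split_json_by_keys(sample, max_chunk_size=25)
print(small_chunks)
```

Every chunk is a valid dictionary on its own. Note that a value larger than the limit is still emitted whole, which is similar to what the output below shows: each list ends up in one chunk even though it exceeds `max_chunk_size=50`.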


In [61]:
from langchain.text_splitter import RecursiveJsonSplitter
import json

# Sample JSON data
json_data = {
    "users": [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "email": "bob@example.com"}
    ],
    "products": [
        {"id": 101, "name": "Laptop", "price": 999.99},
        {"id": 102, "name": "Mouse", "price": 29.99}
    ]
}

# Initialize JSON splitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=50)

# Split JSON data
json_chunks = json_splitter.split_json(json_data=json_data)

for chunk in json_chunks:
    print(json.dumps(chunk, indent=2))
    print("---")

{
  "users": [
    {
      "id": 1,
      "name": "Alice",
      "email": "alice@example.com"
    },
    {
      "id": 2,
      "name": "Bob",
      "email": "bob@example.com"
    }
  ]
}
---
{
  "products": [
    {
      "id": 101,
      "name": "Laptop",
      "price": 999.99
    },
    {
      "id": 102,
      "name": "Mouse",
      "price": 29.99
    }
  ]
}
---


This part takes a closer look at **Semantic Chunking** and its **4 threshold methods**, using a small worked example and the **mathematics** behind each method, one step at a time.

---

## ✅ Step 1: What is Semantic Chunking?

### 🧠 Goal:

Split a document where **meaning shifts**.
Rather than using fixed sizes, we use **semantic embeddings** to detect topic changes.

---

## 🔧 How Semantic Chunking Works

Let’s take a simple example:

```text
1. AI is transforming healthcare.
2. Machine learning helps in disease prediction.
3. NLP can extract data from medical records.

4. Climate change is a serious issue.
5. Greenhouse gases are rising.
6. Renewable energy is the solution.

7. Mars missions are being planned.
8. SpaceX has sent rockets to space.
9. Satellites help track weather patterns.
```

We’ll break this into **groups of 3 sentences**:

| Group | Text                    |
| ----- | ----------------------- |
| G1    | Sentences 1–3 (AI)      |
| G2    | Sentences 4–6 (Climate) |
| G3    | Sentences 7–9 (Space)   |

---

## 🔢 Step 2: Generate Embeddings

Each group becomes a **vector** using embeddings.
Let’s assume:

* G1 = `[0.1, 0.2, 0.7]`
* G2 = `[0.9, 0.8, 0.1]`
* G3 = `[0.3, 0.2, 0.9]`

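These are toy **3D vectors** representing meaning. Real embedding models produce vectors with hundreds or thousands of dimensions, but the arithmetic is the same.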

---

## 📏 Step 3: Compute Pairwise Distances

We compare the **cosine distance** between embeddings:

### 🧮 Cosine Distance Formula:

$$
\text{cosine\_distance}(A, B) = 1 - \frac{A \cdot B}{\|A\| \|B\|}
$$
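
The formula translates directly into a few lines of Python. This is a minimal sketch using only the standard library. The function name `cosine_distance` is our own, and the first two calls use simple hypothetical vectors to show the extremes.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # nothing in common
g1_g3 = cosine_distance([0.1, 0.2, 0.7], [0.3, 0.2, 0.9])  # toy G1 vs G3

print(round(same, 2), round(orthogonal, 2), round(g1_g3, 2))
```

A distance near 0 means the two vectors point the same way, and a distance near 1 means they share almost nothing.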

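### Compute:

The figures below are rounded, illustrative values chosen to make the contrast between pairs easy to see. They are not all exact results for the toy vectors above: computed from those vectors, G1–G2 has a similarity of about 0.36 and G2–G3 about 0.44, while G1–G3 is about 0.98.

#### Distance between G1 and G2:

* A = G1 = `[0.1, 0.2, 0.7]`
* B = G2 = `[0.9, 0.8, 0.1]`
* Illustrative cosine similarity ≈ 0.42 → Distance = 1 - 0.42 = **0.58**

#### G2 and G3:

* Illustrative cosine similarity ≈ 0.76 → Distance = 1 - 0.76 = 0.24

#### G1 and G3:

* Cosine similarity ≈ 0.98 → Distance = 1 - 0.98 = 0.02

A semantic chunker normally compares only adjacent groups (G1–G2 and G2–G3). G1–G3 is included here simply to give a third data point for the threshold examples.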

So:

```
G1–G2: 0.58 (high distance → topic shift)
G2–G3: 0.24 (medium)
G1–G3: 0.02 (very similar)
```

---

## 🔍 Step 4: Thresholding Methods in Action

### Now let’s apply the 4 methods to this data:

We assume distances between chunks:

```python
distances = [0.58, 0.24, 0.02]
```

---

## 1. 🎯 Percentile Method

### Config:

```python
SemanticChunker(..., breakpoint_threshold_type="percentile", breakpoint_threshold_amount=90.0)
```

### Math:

Sort distances: `[0.02, 0.24, 0.58]`

* 90th percentile = value below which 90% of distances fall. With linear interpolation between the sorted values: 0.24 + 0.8 × (0.58 − 0.24) ≈ **0.51**

### Rule:

If distance > 0.51 → split

✅ `0.58 > 0.51` → SPLIT
❌ `0.24`, `0.02` → DON’T SPLIT

**➡ Split between G1 and G2**
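
The percentile rule can be checked with a short sketch. It assumes linear interpolation between sorted values (NumPy's default convention). The `percentile` helper is written by hand here so the snippet has no dependencies. Under that assumption the 90th percentile of the three distances comes out near 0.51, and only the 0.58 gap exceeds it.

```python
def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0   # fractional index into the sorted list
    lower = int(pos)                       # floor, since pos >= 0
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


distances = [0.58, 0.24, 0.02]
p_threshold = percentile(distances, 90.0)
# Indices of the gaps whose distance exceeds the threshold.
p_breakpoints = [i for i, d in enumerate(distances) if d > p_threshold]

print(round(p_threshold, 3), p_breakpoints)
```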

---

## 2. 📈 Standard Deviation Method

### Config:

```python
breakpoint_threshold_type="standard_deviation", breakpoint_threshold_amount=1.0
```

### Math:

* Mean = $\mu = \frac{0.58 + 0.24 + 0.02}{3} = 0.28$
* Std dev (population) = $\sigma \approx 0.23$

### Threshold = μ + 1σ = `0.28 + 0.23 = 0.51`

✅ `0.58 > 0.51` → SPLIT
❌ Others → DON’T SPLIT

**➡ Split between G1 and G2**
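
The same calculation as a small sketch using only the standard library. It assumes the population standard deviation (`statistics.pstdev`), which is the convention that reproduces the σ ≈ 0.23 used above.

```python
import statistics

distances = [0.58, 0.24, 0.02]
k = 1.0  # number of standard deviations above the mean

mean = statistics.fmean(distances)
std = statistics.pstdev(distances)  # population standard deviation
sd_threshold = mean + k * std

# Indices of the gaps whose distance exceeds the threshold.
sd_breakpoints = [i for i, d in enumerate(distances) if d > sd_threshold]

print(round(mean, 2), round(std, 2), round(sd_threshold, 2), sd_breakpoints)
```

With the sample standard deviation (`statistics.stdev`) the threshold would rise to about 0.56, which is still below 0.58, so the split decision would not change for this data.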

---

## 3. 📊 Interquartile Range (IQR) Method

### Config:

```python
breakpoint_threshold_type="interquartile", breakpoint_threshold_amount=1.5
```

### Math:

Sorted distances: `[0.02, 0.24, 0.58]`

* Q1 = 0.02, Q3 = 0.58
* IQR = Q3 - Q1 = 0.56
* Threshold = Q3 + 1.5 × IQR = `0.58 + 0.84 = 1.42`

✅ No distance > 1.42 → No SPLIT

**➡ No splits at all**

*Note: You can reduce the multiplier (e.g., 1.0) to make it more sensitive.*
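
Quartile conventions differ for very small samples, so it is worth checking how sensitive the conclusion is. The sketch below assumes linear interpolation (NumPy's default), which gives Q1 ≈ 0.13 and Q3 ≈ 0.41 instead of the min/max values used above. It then compares two fence rules: the `Q3 + k × IQR` rule from this section, and `mean + k × IQR`, which to our understanding is closer to what LangChain's `SemanticChunker` computes (treat that as an assumption to verify against the library source).

```python
import statistics


def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


distances = [0.58, 0.24, 0.02]
k = 1.5  # IQR multiplier

q1, q3 = percentile(distances, 25), percentile(distances, 75)
iqr = q3 - q1

q3_fence = q3 + k * iqr                              # rule used in this section
mean_fence = statistics.fmean(distances) + k * iqr   # assumed library-style rule

for name, fence in [("Q3 + k*IQR", q3_fence), ("mean + k*IQR", mean_fence)]:
    splits = [i for i, d in enumerate(distances) if d > fence]
    print(f"{name}: threshold={fence:.2f}, breakpoints={splits}")
```

Under either rule and either quartile convention, no distance clears the fence for this toy data, so the "no splits" conclusion holds.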

---

## 4. 📐 Gradient Method

### Idea:

Looks at **change** between distances.

* Let’s compute gradient (slope):

```python
grad_1 = 0.58 - 0.24  # ≈ 0.34
grad_2 = 0.24 - 0.02  # ≈ 0.22
```

Let’s say we apply threshold = 0.3 (large change)

✅ grad\_1 = 0.34 → SPLIT
❌ grad\_2 = 0.22 → DON’T SPLIT

**➡ Split between G1 and G2**
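
Here is the simplified difference-based rule from this section as a runnable sketch. The fixed threshold of 0.3 is the illustrative value chosen above, not a library default. Library implementations may differ, for example by taking a numerical gradient of the distance array and applying a percentile to it.

```python
distances = [0.58, 0.24, 0.02]
jump_threshold = 0.3  # illustrative value from the text, not a library default

# Size of the change between each pair of consecutive distances.
jumps = [abs(a - b) for a, b in zip(distances, distances[1:])]

# Index i refers to the change between distances[i] and distances[i + 1].
large_jumps = [i for i, jump in enumerate(jumps) if jump > jump_threshold]

print([round(j, 2) for j in jumps], large_jumps)
```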

---

## 📌 Final Chunking Example:

If only G1–G2 is split:

* **Chunk 1**: G1 = Sentences 1–3 (AI)
* **Chunk 2**: G2 + G3 = Sentences 4–9 (Climate + Space)

Unless another split is detected, G2 and G3 remain grouped.
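
Once the breakpoints are known, building the final chunks is a simple merge. The sketch below assumes a breakpoint index `i` means "cut after group `i`". The helper `merge_groups` and the group labels are hypothetical.

```python
def merge_groups(groups: list[str], breakpoints: list[int]) -> list[str]:
    """Join consecutive groups, starting a new chunk after each breakpoint index."""
    cut_after = set(breakpoints)
    chunks: list[str] = []
    current: list[str] = []
    for i, group in enumerate(groups):
        current.append(group)
        if i in cut_after:
            chunks.append(" ".join(current))  # close the chunk at this breakpoint
            current = []
    if current:
        chunks.append(" ".join(current))      # whatever remains forms the last chunk
    return chunks


groups = ["G1 (AI)", "G2 (Climate)", "G3 (Space)"]
final_chunks = merge_groups(groups, breakpoints=[0])
print(final_chunks)
```

With a single breakpoint after G1, the result is two chunks: G1 alone, and G2 merged with G3.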

---

## 💡 Summary of Behavior

| Method        | Sensitive to                 | Splits Detected         |
| ------------- | ---------------------------- | ----------------------- |
| Percentile    | Top % dissimilarities        | G1–G2                   |
| Std Deviation | Large outliers above average | G1–G2                   |
| IQR           | Statistical outliers         | None (too conservative) |
| Gradient      | Sudden jump in meaning       | G1–G2                   |

---

## 📌 Key Takeaways

| Term                | Meaning                                                         |
| ------------------- | --------------------------------------------------------------- |
| **Cosine Distance** | Measures how different two chunks are in *meaning*              |
| **Breakpoint**      | A place where semantic shift is large enough to justify a split |
| **Threshold Type**  | The rule used to decide when a difference is "too big"          |

---

