## Chunking - Optimizing Vector Database Data Preparation

- Character/Token Based Chunking
- Recursive Character/Token Based Chunking
- Semantic Chunking
- Cluster Semantic Chunking
- LLM Semantic Chunking


### The Chunking Evaluation Repo

In [1]:
%%capture
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

In [1]:
# Main Chunking Functions
from chunking_evaluation.chunking import (
    ClusterSemanticChunker,
    LLMSemanticChunker,
    FixedTokenChunker,
    RecursiveTokenChunker,
    KamradtModifiedChunker
)
# Additional Dependencies
import tiktoken
from chromadb.utils import embedding_functions
from chunking_evaluation.utils import openai_token_count
import os

We use Pride and Prejudice by Jane Austen, freely available from Project Gutenberg. It consists of 476 pages of text, or 175,651 tokens.

In [2]:

with open("./pride_and_prejudice.txt", 'r', encoding='utf-8') as file:
    document = file.read()

print("First 1000 Characters: ", document[:1000])

First 1000 Characters:  The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Pride and Prejudice

Author: Jane Austen

Release date: June 1, 1998 [eBook #1342]
                Most recently updated: June 17, 2024

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)


*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***




                            [Illustration:

                             GEORGE A

#### **Helper Function for Analyzing Chunking!**

- **Display Chunk Count**: _The function prints the length of the provided chunks list (i.e., the number of chunks)._
- **Examine Specific Chunks**: _It prints the 200th and 201st chunks (indices 199 and 200)._
- **Overlap Analysis**: _It looks for text shared by the end of the 200th chunk and the start of the 201st chunk, in one of two modes._
    + **Character-Based** (use_tokens=False): _Finds the longest suffix of the first chunk that is also a prefix of the second chunk._
    + **Token-Based** (use_tokens=True): _Uses the tiktoken library to tokenize both chunks and runs the same suffix/prefix check on the token lists._

In [3]:
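def analyze_chunks(chunks, use_tokens=False):
    """Print the chunk count, the 200th and 201st chunks, and their overlap.

    Assumes the list holds at least 201 chunks.
    """
    # Print the chunks of interest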
    print("\nNumber of Chunks:", len(chunks))
    print("\n", "="*50, "200th Chunk", "="*50,"\n", chunks[199])
    print("\n", "="*50, "201st Chunk", "="*50,"\n", chunks[200])

    chunk1, chunk2 = chunks[199], chunks[200]

    if use_tokens:
        encoding = tiktoken.get_encoding("cl100k_base")
        tokens1 = encoding.encode(chunk1)
        tokens2 = encoding.encode(chunk2)

        # Find the longest run of tokens that ends chunk1 and starts chunk2
        for i in range(min(len(tokens1), len(tokens2)), 0, -1):
            if tokens1[-i:] == tokens2[:i]:
                overlap = encoding.decode(tokens1[-i:])
                print("\n", "="*50, f"\nOverlapping text ({i} tokens):", overlap)
                return
        print("\nNo token overlap found")
    else:
        # Find overlapping characters
        for i in range(min(len(chunk1), len(chunk2)), 0, -1):
            if chunk1[-i:] == chunk2[:i]:
                print("\n", "="*50, f"\nOverlapping text ({i} chars):", chunk1[-i:])
                return
        print("\nNo character overlap found")

### <font color='orange'>**Character Text Splitting**</font>

_The simplest form of chunking is to count a fixed number of characters and split at that count._

In [4]:
def chunk_text(document, chunk_size, overlap):
    # A stride of zero or less would never advance, so reject it up front
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    stride = chunk_size - overlap
    current_idx = 0

    while current_idx < len(document):
        # Take chunk_size characters starting from current_idx
        chunk = document[current_idx:current_idx + chunk_size]
        if not chunk:  # Break if we're out of text
            break
        chunks.append(chunk)
        current_idx += stride  # Move forward by stride

    return chunks

Chunk size of `400` Characters, `no overlap`

In [5]:
character_chunks = chunk_text(document, chunk_size=400, overlap=0)

analyze_chunks(character_chunks)


Number of Chunks: 1871

 ty to their aunt, and
to a milliner’s shop just over the way. The two youngest of the family,
Catherine and Lydia, were particularly frequent in these attentions:
their minds were more vacant than their sisters’, and when nothing
better offered, a walk to Meryton was necessary to amuse their morning
hours and furnish conversation for the evening; and, however bare of
news the country in general mi

 ght be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia regiment in
the neighbourhood; it was to remain the whole winter, and Meryton was
the head-quarters.

Their visits to Mrs. Philips were now productive of the most interesting
intelligence. Every day added something to their knowled

No character overlap found


Chunk size of `800` Characters, `400` overlap

In [6]:
character_overlap_chunks = chunk_text(document, chunk_size=800, overlap=400)

analyze_chunks(character_overlap_chunks)


Number of Chunks: 1871

 ty to their aunt, and
to a milliner’s shop just over the way. The two youngest of the family,
Catherine and Lydia, were particularly frequent in these attentions:
their minds were more vacant than their sisters’, and when nothing
better offered, a walk to Meryton was necessary to amuse their morning
hours and furnish conversation for the evening; and, however bare of
news the country in general might be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia regiment in
the neighbourhood; it was to remain the whole winter, and Meryton was
the head-quarters.

Their visits to Mrs. Philips were now productive of the most interesting
intelligence. Every day added something to their knowled

 ght be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia re
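
Both runs above report the same chunk count because the loop in `chunk_text` advances by the stride (`chunk_size - overlap`), and the stride is 400 in each case. A minimal sketch of that arithmetic follows; `expected_chunk_count` and `sliding_windows` are hypothetical helpers written for illustration and are not part of the notebook or the repo.

```python
import math

def expected_chunk_count(n_chars, chunk_size, overlap):
    """Number of windows produced when the start index advances by the stride."""
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(n_chars / stride)

def sliding_windows(text, chunk_size, overlap):
    """Same windowing idea as chunk_text, repeated here so the sketch is self-contained."""
    stride = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

sample = "abcdefghij"  # 10 characters

# Different chunk sizes, same stride of 2, so the same number of chunks
no_overlap = sliding_windows(sample, chunk_size=2, overlap=0)
with_overlap = sliding_windows(sample, chunk_size=4, overlap=2)

print(len(no_overlap), len(with_overlap))
print(expected_chunk_count(len(sample), 2, 0), expected_chunk_count(len(sample), 4, 2))
```

With overlap, the windows near the end of the text can be shorter than `chunk_size`, since the slice simply runs out of characters.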

### <font color='orange'>**Token Text Splitting**</font>

But language models (usually the end consumers of chunked text) don't operate at the character level. Instead they use tokens: common sequences of characters that represent frequent words, word pieces, and subwords. For example, the word 'hamburger', when run through GPT-4's tokenizer, is split into the tokens ['h', 'amburger']. Common words like 'the' or 'and' are typically single tokens.

This means character-based splitting isn't ideal because:

    1. A 500-character chunk might contain anywhere from 100-500 tokens depending on the text
    2. Different languages and character sets encode to very different numbers of tokens
    3. We might hit token limits in our LLM without realizing it

A good visualizer of tokenization is available [on OpenAI's platform](https://platform.openai.com/tokenizer).

Tokenizers like 'cl100k_base' implement Byte-Pair Encoding (BPE) - a compression algorithm that creates a vocabulary by iteratively merging the most frequent pairs of bytes or characters. The '100k' refers to its vocabulary size of roughly 100,000 tokens, which sets the balance between compression and representation granularity.
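
To make the merge step concrete, here is a toy sketch of BPE training on a tiny corpus. It works on characters instead of bytes and is far simpler than what tiktoken does; the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print("Merges:", merges)
print("Vocabulary view of the corpus:", words)
```

Each merge adds one entry to the vocabulary, so a larger vocabulary means more merges and shorter token sequences for common text.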

When talking to a language model, the first step is tokenizing the text so that it can be processed by the underlying neural network. The LLM outputs tokens which are decoded back into words.

In [7]:
import tiktoken

# Loading cl100k_base tokenizer
encoder = tiktoken.get_encoding("cl100k_base")

# Text Example
text = "hamburger"
tokens = encoder.encode(text)

print("Tokens:", tokens)

Tokens: [71, 47775]


In [8]:
for i in range(len(tokens)):
    print(f"Token {i+1}:", encoder.decode([tokens[i]]))

print("Full Decoding: ", encoder.decode(tokens))

Token 1: h
Token 2: amburger
Full Decoding:  hamburger


#### **Helper Function for Counting Tokens**

In [9]:
def count_tokens(text, encoding_name="cl100k_base"):
    """Print the number of tokens in a text string using tiktoken"""
    encoder = tiktoken.get_encoding(encoding_name)
    print(f"Number of tokens: {len(encoder.encode(text))}")

Chunk Size of `400` Tokens, `0 Overlap`

In [10]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400,
    chunk_overlap=0,
    encoding_name="cl100k_base"
)

token_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_chunks, use_tokens=True)


Number of Chunks: 440

  fortunate as to meet Miss Bennet. The
subject was pursued no further, and the gentlemen soon afterwards went
away.




[Illustration:

“At Church”
]




CHAPTER XXXI.


[Illustration]

Colonel Fitzwilliam’s manners were very much admired at the Parsonage,
and the ladies all felt that he must add considerably to the pleasure of
their engagements at Rosings. It was some days, however, before they
received any invitation thither, for while there were visitors in the
house they could not be necessary; and it was not till Easter-day,
almost a week after the gentlemen’s arrival, that they were honoured by
such an attention, and then they were merely asked on leaving church to
come there in the evening. For the last week they had seen very little
of either Lady Catherine or her daughter. Colonel Fitzwilliam had called
at the Parsonage more than once during the time, but Mr. Darcy they had
only seen at church.

The invitation was accepted, of course, and at a proper h

In [11]:
count_tokens(token_chunks[0])

Number of tokens: 400


Chunk Size of `400` Tokens, `200` Overlap

In [12]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400,
    chunk_overlap=200,
    encoding_name="cl100k_base"
)

token_overlap_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_overlap_chunks, use_tokens=True)


Number of Chunks: 878

  I _heard_ nothing of his going away when I
was at Netherfield. I hope your plans in favour of the ----shire will
not be affected by his being in the neighbourhood.”

“Oh no--it is not for _me_ to be driven away by Mr. Darcy. If _he_
wishes to avoid seeing _me_ he must go. We are not on friendly terms,
and it always gives me pain to meet him, but I have no reason for
avoiding _him_ but what I might proclaim to all the world--a sense of
very great ill-usage, and most painful regrets at his being what he is.
His father, Miss Bennet, the late Mr. Darcy, was one of the best men
that ever breathed, and the truest friend I ever had; and I can never be
in company with this Mr. Darcy without being grieved to the soul by a
thousand tender recollections. His behaviour to myself has been
scandalous; but I verily believe I could forgive him anything and
everything, rather than his disappointing the hopes and disgracing the
memory of his father.”

Elizabeth found the intere

### <font color='orange'>**Recursive Character Text Splitter**</font>

When we write, we naturally separate text into paragraphs, sentences, and other logical units. The recursive character text splitter tries to intelligently split text by looking for natural separators in order, while respecting a maximum character length.

First, it makes a complete pass over the entire document using paragraph breaks (\n\n), creating an initial set of chunks. Then for any chunks that exceed the size limit, it recursively processes them using progressively smaller separators:

    1. First tries to split on paragraph breaks (\n\n)
    2. If chunks are still too big, tries line breaks (\n)
    3. Then sentence boundaries (., ?, !)
    4. Then words ( )
    5. Finally, if no other separators work, splits on individual characters ("")

This way, the splitter preserves as much natural structure as possible - only drilling down to smaller separators when necessary to meet the size limit. A chunk that's already small enough stays intact, while larger chunks get progressively broken down until they fit.
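
The recursion described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the repo's `RecursiveTokenChunker`: the name `recursive_split` is hypothetical, sentence boundaries are reduced to `". "`, there is no overlap handling, and separators are kept attached to the text before them so that nothing is lost.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Assumes the last separator is "" so the recursion always terminates.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at the size limit
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Split on the current separator and re-attach it to every part but the last
    parts = text.split(sep)
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # Greedily merge small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too big: retry with the next, finer separator
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is short.\n\n" + "x" * 50
)
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

In this sketch the short second paragraph survives as one chunk, the first paragraph is broken at its sentence boundary, and the run of 50 `x` characters falls through every separator to the hard cut.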

**Chunk Size of `800` Characters, `0` Overlap**

In [13]:
recursive_character_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Character Length
    chunk_overlap=0,  # Overlap
    length_function=len,  # Character length with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_character_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_chunks, use_tokens=False)


Number of Chunks: 1270

 When tea was over Mr. Hurst reminded his sister-in-law of the
card-table--but in vain. She had obtained private intelligence that Mr.
Darcy did not wish for cards, and Mr. Hurst soon found even his open
petition rejected. She assured him that no one intended to play, and the
silence of the whole party on the subject seemed to justify her. Mr.
Hurst had, therefore, nothing to do but to stretch himself on one of the
sofas and go to sleep. Darcy took up a book. Miss Bingley did the same;
and Mrs. Hurst, principally occupied in playing with her bracelets and
rings, joined now and then in her brother’s conversation with Miss
Bennet.

 Miss Bingley’s attention was quite as much engaged in watching Mr.
Darcy’s progress through _his_ book, as in reading her own; and she was
perpetually either making some inquiry, or looking at his page. She
could not win him, however, to any conversation; he merely answered her
question and read on. At length, quite exhausted by the a

In [14]:
len(recursive_character_chunks[199]) # Chunk 200

635

This means we don't get exact splits - a chunk might be 550 characters long because that's where a paragraph or sentence naturally ends, rather than forcing the full 800 character limit. The chunker prioritizes maintaining these natural text boundaries over hitting the exact maximum size.

**Chunk Size of `800` Tokens, `400` Overlap**

In [15]:
recursive_token_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Token length
    chunk_overlap=400,  # Overlap in tokens
    length_function=openai_token_count,  # Measure length in tokens, not characters
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_token_overlap_chunks = recursive_token_chunker.split_text(document)

analyze_chunks(recursive_token_overlap_chunks, use_tokens=True)


Number of Chunks: 427

 “I do not mean to say that a woman may not be settled too near her
family. The far and the near must be relative, and depend on many
varying circumstances. Where there is fortune to make the expense of
travelling unimportant, distance becomes no evil. But that is not the
case _here_. Mr. and Mrs. Collins have a comfortable income, but not
such a one as will allow of frequent journeys--and I am persuaded my
friend would not call herself _near_ her family under less than _half_
the present distance.”

Mr. Darcy drew his chair a little towards her, and said, “_You_ cannot
have a right to such very strong local attachment. _You_ cannot have
been always at Longbourn.”

Elizabeth looked surprised. The gentleman experienced some change of
feeling; he drew back his chair, took a newspaper from the table, and,
glancing over it, said, in a colder voice,--

“Are you pleased with Kent?”

A short dialogue on the subject of the country ensued, on either side
calm and concise

### <font color='orange'>**Greg Kamradt Semantic Chunker**</font>

Greg Kamradt popularized what's known as the semantic chunker with his 5 Levels of Text Splitting notebook. It takes a different approach from fixed character/token chunking: instead of splitting text at predetermined positions or separators, it uses embeddings to find natural semantic boundaries in the text while maintaining consistent chunk sizes.

Chroma modified the algorithm to provide better size control through binary search. The chunker first splits text into small fixed-size pieces (around 50 tokens) using standard recursive splitting with separators. For each piece, it looks at surrounding context (3 segments before and after) to understand the local meaning - this helps maintain semantic coherence across potential split points.

After embedding these contextualized pieces, it calculates cosine distances between consecutive segments. Higher distances suggest natural topic transitions that make good splitting points. But rather than using Kamradt's original fixed percentile approach for choosing split points, Chroma's version uses binary search to find a distance threshold that produces chunks close to the target size.

The binary search starts with limits of 0.0 and 1.0, calculating the midpoint threshold and counting how many splits it would create. If there are too many splits, it raises the threshold by adjusting the lower limit; too few splits, it lowers the threshold by adjusting the upper limit. This continues until it finds a threshold that creates chunks of approximately the desired size.

This modification makes the chunker more practical by balancing semantic coherence with consistent chunk sizes. While the original version could produce unpredictably large chunks, the modified version maintains better size control while still respecting natural topic boundaries in the text.
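
The three steps described above (adding context, measuring cosine distance between neighbours, and binary-searching a threshold) can be sketched as follows. This is an illustration under simplifying assumptions, not the `KamradtModifiedChunker` code: all function names are hypothetical, the embeddings are toy 2-D vectors, and the search targets a number of splits where the real chunker targets an average chunk size in tokens.

```python
import math

def add_context(pieces, window=3):
    """Join each piece with up to `window` neighbours on either side."""
    return [
        " ".join(pieces[max(0, i - window): i + window + 1])
        for i in range(len(pieces))
    ]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def consecutive_distances(embeddings):
    """Distance between each segment and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]

def find_threshold(distances, target_splits, max_iterations=30):
    """Binary search in [0, 1] for a threshold giving `target_splits` splits."""
    low, high = 0.0, 1.0
    threshold = 0.5
    for _ in range(max_iterations):
        threshold = (low + high) / 2
        n_splits = sum(d > threshold for d in distances)
        if n_splits > target_splits:
            low = threshold   # Too many splits: raise the threshold
        elif n_splits < target_splits:
            high = threshold  # Too few splits: lower the threshold
        else:
            break
    split_after = [i for i, d in enumerate(distances) if d > threshold]
    return threshold, split_after

# Toy embeddings: segments 0-1 share a topic, 2-3 share another, 4 returns to the first
toy_embeddings = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0), (1.0, 0.0)]
distances = consecutive_distances(toy_embeddings)
threshold, split_after = find_threshold(distances, target_splits=2)
print("Distances:", [round(d, 3) for d in distances])
print("Threshold:", threshold, "Split after segments:", split_after)
```

In practice the target would be derived from the desired chunk size, for example total tokens divided by `avg_chunk_size`; that mapping is an assumption of this sketch.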

In [16]:
# Helper Function from the Repo that Returns Embeddings
embedding_function = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-large")

In [19]:
!pip install langchain_experimental langchain_openai


In [17]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

lc_semantic_chunker = SemanticChunker(OpenAIEmbeddings())

lc_semantic_chunks = lc_semantic_chunker.create_documents([document])

In [18]:

print("# of Chunks:", len(lc_semantic_chunks), "\n")
print(lc_semantic_chunks[199].page_content)
print("\n\n", "="*50, "\n\n")
print(lc_semantic_chunks[200].page_content)
print("\n\n", "="*50, "\n\n")

count_tokens(lc_semantic_chunks[199].page_content)
count_tokens(lc_semantic_chunks[200].page_content)

# of Chunks: 305 

“John told us Mr. Darcy was
here when you sent for us;--was it so?”

“Yes; and I told him we should not be able to keep our engagement. _That_ is all settled.”

“What is all settled?” repeated the other, as she ran into her room to
prepare. “And are they upon such terms as for her to disclose the real
truth? Oh, that I knew how it was!”

But wishes were vain; or, at best, could serve only to amuse her in the
hurry and confusion of the following hour. Had Elizabeth been at leisure
to be idle, she would have remained certain that all employment was
impossible to one so wretched as herself; but she had her share of
business as well as her aunt, and amongst the rest there were notes to
be written to all their friends at Lambton, with false excuses for their
sudden departure. An hour, however, saw the whole completed; and Mr. Gardiner, meanwhile, having settled his account at the inn, nothing
remained to be done but to go; and Elizabeth, after all the misery of
the mornin

In [19]:
kamradt_chunker = KamradtModifiedChunker(
    avg_chunk_size=400,      # Target size in tokens
    min_chunk_size=50,       # Initial split size
    embedding_function=embedding_function  # Pass your embedding function
)

# Split your text
modified_kamradt_chunks = kamradt_chunker.split_text(document)

In [20]:
analyze_chunks(modified_kamradt_chunks, use_tokens=True)
print("\n\n", "="*50, "\n\n")
count_tokens(modified_kamradt_chunks[200])


Number of Chunks: 434

 family had quitted the country, he had told his story to no one but
herself; but that after their removal, it had been everywhere discussed; that he had then no reserves, no scruples in sinking Mr. Darcy’s
character, though he had assured her that respect for the father would
always prevent his exposing the son. How differently did everything now appear in which he was concerned! His
attentions to Miss King were now the consequence of views solely and
hatefully mercenary; and the mediocrity of her fortune proved no longer the moderation of his wishes, but his eagerness to grasp at anything.
His behaviour to herself could now have had no tolerable motive: he had
either been deceived with regard to her fortune, or had been gratifying his vanity by encouraging the preference which she believed she had most
incautiously shown. Every lingering struggle in his favour grew fainter
and fainter; and in further justification of Mr. Darcy, she could not but allow that Mr.

### <font color='orange'>**Cluster Semantic Chunker**</font>

The ClusterSemanticChunker takes a **global optimization** approach to chunking, contrasting with Kamradt's **local decisions** about split points. Rather than looking through a sliding window of context, it considers relationships between all pieces of text simultaneously to find the most semantically coherent groupings while maintaining size constraints.

The process begins similarly to other chunkers by splitting text into small fixed-size pieces (defaulting to around 50 tokens) using standard recursive splitting. However, instead of only analyzing consecutive pieces, it creates a similarity matrix by embedding each piece and calculating cosine similarities between all possible pairs. This gives the chunker a complete view of semantic relationships throughout the document.

Using this similarity matrix, the chunker employs dynamic programming to find the optimal way to group pieces into chunks. For each position in the text, it tries different possible chunk sizes and calculates a "reward" based on the total semantic similarity between all pieces within that potential chunk. By building up from small pieces and saving intermediate results, it efficiently explores the space of possible chunkings to find a global optimum.

The size constraints are enforced by limiting the maximum number of pieces that can be combined into a chunk (max_cluster). Within this limit, the algorithm is free to create chunks that maximize semantic coherence. This leads to more natural groupings than approaches that only look at local context, as it can recognize when pieces far apart in the text are actually closely related.

This global optimization strategy helps avoid some common pitfalls of sliding window approaches. While local methods might miss opportunities to group related content that's separated by a brief topic shift, the cluster approach can see these relationships in its similarity matrix. The result is chunks that more accurately reflect the semantic structure of the document while still maintaining practical size limits for downstream processing.
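
The dynamic programming idea can be sketched on a small similarity matrix. This is a simplified illustration, not the `ClusterSemanticChunker` source: `optimal_segmentation` is a hypothetical name, and subtracting the mean off-diagonal similarity is an assumption of this sketch, made so that merging unrelated pieces lowers the reward instead of always raising it.

```python
def optimal_segmentation(similarity, max_cluster):
    """Group consecutive pieces into chunks of at most `max_cluster` pieces,
    maximizing the summed (mean-centered) pairwise similarity inside each chunk."""
    n = len(similarity)

    # Center the scores and ignore self-similarity on the diagonal
    off_diagonal = [similarity[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal) if off_diagonal else 0.0
    score = [
        [0.0 if i == j else similarity[i][j] - mean for j in range(n)]
        for i in range(n)
    ]

    def reward(start, end):
        """Total pairwise score for pieces start..end (inclusive)."""
        return sum(
            score[i][j]
            for i in range(start, end + 1)
            for j in range(start, end + 1)
        )

    # best[k] = best total reward for the first k pieces; prev[k] = start of last chunk
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        for size in range(1, min(max_cluster, end) + 1):
            start = end - size
            candidate = best[start] + reward(start, end - 1)
            if candidate > best[end]:
                best[end] = candidate
                prev[end] = start

    # Walk back through the saved starts to recover the chunk boundaries
    boundaries, end = [], n
    while end > 0:
        boundaries.append((prev[end], end))
        end = prev[end]
    return boundaries[::-1]

# Pieces 0-1 are similar to each other, as are pieces 2-3
toy_similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
segments = optimal_segmentation(toy_similarity, max_cluster=3)
print(segments)  # Half-open (start, end) ranges of piece indices
```

Each returned range covers consecutive pieces, so joining the pieces inside a range yields one final chunk.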

In [21]:
cluster_chunker = ClusterSemanticChunker(
    embedding_function=embedding_function,
    max_chunk_size=400,
    length_function=openai_token_count
)

cluster_chunker_chunks = cluster_chunker.split_text(document)

analyze_chunks(cluster_chunker_chunks, use_tokens=True)


Number of Chunks: 1000

 “WILLIAM COLLINS.”

 “At four o’clock, therefore, we may expect this peace-making gentleman,”
said Mr. Bennet, as he folded up the letter. “He seems to be a most

No token overlap found


### <font color='orange'>**LLM Semantic Chunker**</font>

The LLM Semantic Chunker takes a direct approach to document chunking by literally asking a language model to identify semantic boundaries. The process begins by dividing the input text into small, fixed-size pieces of around 50 tokens using a standard recursive splitter, creating manageable units for the LLM to analyze. These pieces are then wrapped with special tags like `<|start_chunk_1|>` and `<|end_chunk_1|>` to maintain their identity throughout the process.

The core of the chunking process involves presenting text to the LLM in windows of approximately 800 tokens (containing multiple small pieces) at a time. For each window, the LLM is instructed to identify natural semantic breaks, responding in a specific format like `split_after: X, Y, Z` where X, Y, Z are chunk numbers. These splits must be in ascending order and must start from the current position, with at least one split being required to ensure the process continues moving forward.

The chunker maintains a sliding window approach, progressively moving through the document based on the LLM's last suggested split point. This continues until either the end of the document is reached or the remaining text becomes too short to require further splitting (less than ~4 chunks). The suggested split points are then used to reassemble the small pieces into final chunks, with each chunk combining all pieces between two split points.
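
The bookkeeping around the LLM call can be sketched without calling a model. The helpers below are hypothetical and written for illustration only: they tag pieces in the `<|start_chunk_X|>` style, parse a `split_after:` reply, and reassemble pieces into chunks. The reply string is hard-coded as a stand-in for a real model response.

```python
import re

def tag_pieces(pieces, start=0):
    """Wrap each piece in numbered start/end tags so the model can refer to it."""
    return "".join(
        f"<|start_chunk_{i}|>{piece}<|end_chunk_{i}|>"
        for i, piece in enumerate(pieces, start)
    )

def parse_split_response(response, current_position):
    """Extract split points, keeping only ascending values at or past current_position."""
    match = re.search(r"split_after:\s*([\d,\s]+)", response)
    if not match:
        return []
    numbers = [int(n) for n in re.findall(r"\d+", match.group(1))]
    valid = []
    for n in numbers:
        if n >= current_position and (not valid or n > valid[-1]):
            valid.append(n)
    return valid

def assemble_chunks(pieces, split_after):
    """Join all pieces between consecutive split points into final chunks."""
    chunks, start = [], 0
    for index in split_after:
        chunks.append("".join(pieces[start:index + 1]))
        start = index + 1
    if start < len(pieces):
        chunks.append("".join(pieces[start:]))
    return chunks

pieces = ["a ", "b ", "c ", "d ", "e"]
prompt_body = tag_pieces(pieces)
fake_reply = "split_after: 1, 3"  # Stand-in for the model's answer
splits = parse_split_response(fake_reply, current_position=0)
chunks = assemble_chunks(pieces, splits)
print(prompt_body)
print(splits, chunks)
```

Validating the reply before using it matters because the model is only asked, not forced, to return ascending split points at or after the current position.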

Internally, the system prompt reads as follows:

<font color='red' >"You are an assistant specialized in splitting text into thematically consistent sections. "
"The text has been divided into chunks, each marked with <|start_chunk_X|> and <|end_chunk_X|> tags, where X is the chunk number. "
"Your task is to identify the points where splits should occur, such that consecutive chunks of similar themes stay together. "
"Respond with a list of chunk IDs where you believe a split should be made. For example, if chunks 1 and 2 belong together but chunk 3 starts a new topic, you would suggest a split after chunk 2. THE CHUNKS MUST BE IN ASCENDING ORDER."
"Your response should be in the form: 'split_after: 3, 5'."</font>

In [25]:
llm_chunker = LLMSemanticChunker(
    organisation="openai",
    model_name="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"])

llm_chunker_chunks = llm_chunker.split_text(document)

analyze_chunks(llm_chunker_chunks, use_tokens=True)

Processing chunks: 100%|██████████| 4871/4871 [03:13<00:00, 25.17it/s]


Number of Chunks: 700

 the polite inquiries which he directly afterwards approached to make.
Attention, forbearance, patience with Darcy, was injury to Wickham. She
was resolved against any sort of conversation with him, and turned away with a degree of ill-humour which she could not wholly surmount even in
speaking to Mr. Bingley, whose blind partiality provoked her.

 But Elizabeth was not formed for ill-humour; and though every prospect
of her own was destroyed for the evening, it could not dwell long on her spirits; and, having told all her griefs to Charlotte Lucas, whom she
had not seen for a week, she was soon able to make a voluntary

No token overlap found



