# Chunk optimization

To fit within the context window of an LLM, documents are usually split into smaller parts when building RAG pipelines. This is called chunking. Chunking has the added benefits of reducing cost and noise in the *generation* step, but it also introduces a new problem: "How do we avoid losing important information when splitting the document into chunks?"

In baseline RAG, we usually split the document into fixed-size chunks with a fixed overlap between adjacent chunks. This practice works well in most common cases, is computationally efficient, and does not require any NLP models.
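
The fixed-size scheme with overlap can be sketched in a few lines of plain Python. This is a minimal illustration, not what any library does internally: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for tokenizer tokens as a simplifying assumption.

```python
def fixed_size_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words are used as "tokens" purely for illustration.
words = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.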

This notebook explores chunk optimization using a few different strategies:

1. **Fixed size chunking**: Split the document into chunks of fixed size.
2. **Semantic chunking**: Split the document where the semantic meaning of the text changes, producing semantically coherent chunks.
3. **Hyperparameter tuning**: Search over chunking parameters with a traditional ML grid search.
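
For the hyperparameter-tuning strategy listed above, a grid search could look roughly like the sketch below. Everything here is an assumption for illustration: `grid_search` and `toy_score` are hypothetical names, the candidate values are arbitrary, and `toy_score` is a placeholder with an arbitrary peak. In practice the scoring function would build an index for each setting and measure retrieval or answer quality on an evaluation set.

```python
from itertools import product


def grid_search(chunk_sizes, chunk_overlaps, score_fn):
    """Score every valid (chunk_size, chunk_overlap) pair and return the best one."""
    results = []
    for size, overlap in product(chunk_sizes, chunk_overlaps):
        if overlap >= size:
            continue  # an overlap as large as the chunk itself makes no sense
        results.append(((size, overlap), score_fn(size, overlap)))
    if not results:
        raise ValueError("no valid (chunk_size, chunk_overlap) combination")
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_score(chunk_size, chunk_overlap):
    # Placeholder only: a real score would come from evaluating the RAG pipeline.
    return -abs(chunk_size - 512) - abs(chunk_overlap - 40)


best_params, best_score, results = grid_search(
    chunk_sizes=[128, 256, 512, 1024],
    chunk_overlaps=[0, 20, 40],
    score_fn=toy_score,
)
print(best_params, best_score)
```

The cost grows with the product of the candidate lists, since every combination requires re-chunking and re-indexing the documents.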

Other strategies include:

1. **Document-specific chunking**: Split the document along its logical sections. Useful for structured formats such as Markdown and HTML.
2. **Recursive chunking**: Divide the input text into smaller chunks hierarchically and iteratively using a set of separators. If the first split does not produce chunks of the desired size or structure, the method calls itself on the resulting chunks with a different separator or criterion until it does.
3. **Agentic chunking**: Use an LLM as an "agent" that splits the document the way a human would: start at the top and work down the document, deciding at each sentence whether to start a new chunk.
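
Recursive chunking, as described in the list above, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `recursive_split`, the separator order, and the character-based length limit are all choices made here, not taken from any particular library.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_len` characters.

    Tries the coarsest separator first and recurses with finer ones
    only for parts that are still too long.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This part alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
pieces = recursive_split(sample, max_len=20)
for piece in pieces:
    print(repr(piece))
```

The first paragraph fits and is kept whole, while the second is too long and is split again on spaces, which is the "different separator" step from the description.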


### Set up libraries and environment

In [None]:
%pip install python-dotenv
%pip install mdutils==1.6.0
%pip install llama-index==0.10.33
%pip install llama-index-llms-openai==0.1.16

In [1]:
import os
from dotenv import load_dotenv
from util.helpers import get_wiki_pages, create_and_save_wiki_md_files, pretty_print_node

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

In [2]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [3]:
pages = get_wiki_pages(["Vincent Van Gogh"])
create_and_save_wiki_md_files(pages=pages, path="./data/docs/wiki/")
documents = SimpleDirectoryReader("./data/docs/wiki/").load_data()

In [4]:
embedding = OpenAIEmbedding(api_key=OPENAI_API_KEY, model="text-embedding-3-small")
llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-4-turbo")

## Fixed size chunking

In [5]:
# chunk_size and chunk_overlap are measured in tokens
fixed_size_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=40)
fixed_nodes = fixed_size_splitter.get_nodes_from_documents(documents)

In [6]:
print(fixed_nodes[2].get_content())
print("---------------------------")
print(fixed_nodes[3].get_content())
print("---------------------------")
print(fixed_nodes[4].get_content())

Vincent van Gogh


Vincent Willem van Gogh (Dutch: [ˈvɪnsɛnt ˈʋɪləɱ‿vɑŋ‿ˈɣɔx] ; 30 March 1853 – 29 July 1890) was a Dutch Post-Impressionist painter who is among the most famous and influential figures in the history of Western art. In just over a decade, he created approximately 2100 artworks, including around 860 oil paintings, most of them in the last two years of his life. His oeuvre includes landscapes, still lifes, portraits, and self-portraits, most of which are characterized by bold colors and dramatic brushwork that contributed to the rise of expressionism in modern art. Van Gogh's work was beginning to gain critical attention before he died at age 37, by what was suspected at the time to be a suicide. During his lifetime, only one of Van Gogh's paintings, The Red Vineyard, was sold. 
Born into an upper-middle-class family, Van Gogh drew as a child and was serious, quiet and thoughtful, but showed signs of mental instability. As a young man, he worked as an art dealer, often t

In [7]:
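# No embed_model is passed here, so the index uses the globally configured
# default (Settings.embed_model) rather than the `embedding` object created above.
fixed_index = VectorStoreIndex(nodes=fixed_nodes)
fixed_query_engine = fixed_index.as_query_engine(llm=llm)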

## Semantic chunking

In [8]:
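semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # number of neighbouring sentences grouped together before embedding
    breakpoint_percentile_threshold=95,  # split where the distance exceeds this percentile
    embed_model=embedding,
)
semantic_nodes = semantic_splitter.get_nodes_from_documents(documents)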
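
The idea behind a percentile breakpoint can be illustrated without any embedding model. The sketch below is a simplification and not the library's implementation: it ignores `buffer_size`, the names `semantic_chunks`, `cosine_distance`, and `percentile` are hypothetical, and the two-dimensional "embeddings" are made up by hand so the example runs offline.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, embeddings, threshold_percentile=95):
    """Start a new chunk wherever adjacent sentences are unusually far apart."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, threshold_percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Hand-made toy vectors (assumption): the first two sentences point one way,
# the last two another, so the only large distance is in the middle.
toy_sentences = ["He painted sunflowers.", "He painted wheat fields.",
                 "Stocks fell today.", "Bond yields rose."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = semantic_chunks(toy_sentences, toy_embeddings)
print(toy_chunks)
```

Because the threshold is a percentile of the document's own distances, chunk boundaries adapt to the text rather than to a fixed length, which is why semantic chunks vary in size.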

In [9]:
print(semantic_nodes[2].get_content())
print("---------------------------")
print(semantic_nodes[3].get_content())
print("---------------------------")
print(semantic_nodes[10].get_content())

He was keenly aware of modernist trends in art and, while back with his parents, took up painting in 1881. His younger brother, Theo, supported him financially, and the two of them maintained a long correspondence.
Van Gogh's early works consist of mostly still lifes and depictions of peasant laborers. In 1886, he moved to Paris, where he met members of the artistic avant-garde, including Émile Bernard and Paul Gauguin, who were seeking new paths beyond Impressionism. Frustrated in Paris and inspired by a growing spirit of artistic change and collaboration, in February 1888, Van Gogh moved to Arles in southern France to establish an artistic retreat and commune. Once there, Van Gogh's art changed. His paintings grew brighter and he turned his attention to the natural world, depicting local olive groves, wheat fields and sunflowers. Van Gogh invited Gauguin to join him in Arles and eagerly anticipated Gauguin's arrival in the fall of 1888.
Van Gogh suffered from psychotic episodes and d

In [10]:
semantic_index = VectorStoreIndex(nodes=semantic_nodes)
semantic_query_engine = semantic_index.as_query_engine(llm=llm)

## Compare the different chunking strategies

In [11]:
query = "Tell me about Vincent Van Gogh's early life"

In [12]:
fixed_retriever = fixed_index.as_retriever()
fixed_retrieved_nodes = fixed_retriever.retrieve(query)
pretty_print_node(fixed_retrieved_nodes[0])

Node ID: 5355241b-2d46-43cb-bb6c-f63604b91047
Text: Early years   Vincent Willem van Gogh was born on 30 March 1853
in Groot-Zundert, in the predominantly Catholic province of North
Brabant in the Netherlands. He was the oldest surviving child of
Theodorus van Gogh (1822–1885), a minister of the Dutch Reformed
Church, and his wife, Anna Cornelia Carbentus (1819–1907). Van Gogh
was given the name ...
Score:  0.892

Size:  1855
Full text: 
---------------------------
Early years

Vincent Willem van Gogh was born on 30 March 1853 in Groot-Zundert, in the predominantly Catholic province of North Brabant in the Netherlands. He was the oldest surviving child of Theodorus van Gogh (1822–1885), a minister of the Dutch Reformed Church, and his wife, Anna Cornelia Carbentus (1819–1907). Van Gogh was given the name of his grandfather and of a brother stillborn exactly a year before his birth. Vincent was a common name in the Van Gogh family. The name had been borne by his grandfather, the promine

In [13]:
semantic_retriever = semantic_index.as_retriever()
semantic_retrieved_nodes = semantic_retriever.retrieve(query)
pretty_print_node(semantic_retrieved_nodes[0])

Node ID: e45dc402-661e-4cd9-8c1f-0fb78e886c78
Text: Early years   Vincent Willem van Gogh was born on 30 March 1853
in Groot-Zundert, in the predominantly Catholic province of North
Brabant in the Netherlands. He was the oldest surviving child of
Theodorus van Gogh (1822–1885), a minister of the Dutch Reformed
Church, and his wife, Anna Cornelia Carbentus (1819–1907). Van Gogh
was given the name ...
Score:  0.893

Size:  2415
Full text: 
---------------------------


Early years

Vincent Willem van Gogh was born on 30 March 1853 in Groot-Zundert, in the predominantly Catholic province of North Brabant in the Netherlands. He was the oldest surviving child of Theodorus van Gogh (1822–1885), a minister of the Dutch Reformed Church, and his wife, Anna Cornelia Carbentus (1819–1907). Van Gogh was given the name of his grandfather and of a brother stillborn exactly a year before his birth. Vincent was a common name in the Van Gogh family. The name had been borne by his grandfather, the promi

In [14]:
fixed_response = fixed_query_engine.query(query)
print(fixed_response)

Vincent Willem van Gogh was born on March 30, 1853, in Groot-Zundert, North Brabant, Netherlands. He was the oldest surviving child of Theodorus van Gogh, a minister of the Dutch Reformed Church, and Anna Cornelia Carbentus. Vincent was named after his grandfather and a stillborn brother who was born exactly a year before him. His family was of upper-middle-class status, with his father being the youngest son of a minister and his mother coming from a prosperous family in The Hague. Vincent had a brother named Theo, with whom he maintained a close relationship throughout his life, and other siblings including another brother, Cor, and three sisters, Elisabeth, Anna, and Willemina.

From an early age, Vincent was serious and thoughtful. He was initially educated at home by his mother and a governess before attending the village school in 1860. He later went to a boarding school in Zevenbergen in 1864, which he found distressing enough to campaign for his return home. In 1866, he was sen

In [15]:
semantic_response = semantic_query_engine.query(query)
print(semantic_response)

Vincent Willem van Gogh was born on March 30, 1853, in Groot-Zundert, Netherlands, to Theodorus van Gogh, a minister of the Dutch Reformed Church, and Anna Cornelia Carbentus. He was the oldest surviving child in a family that included his brother Theo and three sisters. Vincent was named after his grandfather and a stillborn brother. His early education was conducted at home and later at boarding schools, which he found unhappy. Despite this, his interest in art began early, encouraged by his mother. His first significant job was at the art dealers Goupil & Cie in The Hague, secured by his uncle Cent. This job eventually led him to London, where he experienced a brief period of happiness and success.
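
Beyond reading individual answers, one simple way to compare the two strategies is to look at how chunk lengths are distributed. The helper below is a hypothetical sketch (the name `chunk_length_stats` is not part of the notebook's utilities) and runs on toy strings; it measures length in characters.

```python
from statistics import mean, median


def chunk_length_stats(chunks):
    """Summarize the character lengths of a list of chunk strings."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        raise ValueError("chunks must not be empty")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Toy input for illustration only.
toy_chunk_texts = ["a" * 10, "b" * 20, "c" * 60]
stats = chunk_length_stats(toy_chunk_texts)
print(stats)
```

In this notebook it could be applied to `[n.get_content() for n in fixed_nodes]` and to the same list built from `semantic_nodes` to see how much more the semantic chunk sizes vary.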
