### Why split documents?  [reference](https://github.com/langchain-ai/langchain/blob/master/docs/docs/concepts/text_splitters.mdx#why-split-documents)
There are several reasons to split documents:

- Handling non-uniform document lengths: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
- Overcoming model limitations: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
- Improving representation quality: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
- Enhancing retrieval precision: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
- Optimizing computational resources: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

See Greg Kamradt's [chunkviz](https://chunkviz.up.railway.app/) to visualize different splitting strategies discussed below.


---

### What is the goal of chunking?
As Greg Kamradt says in [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb):
"Your goal is not to chunk for chunking sake, our goal is to get our data in a format where it can be retrieved for value later."

### The Challenge of Chunking
Chunking can split a paragraph or a sentence in half, which may cause the text to lose its semantic meaning. This fragmentation can make it harder for retrieval systems to understand and fetch relevant information accurately.
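
To see the problem concretely, here is a tiny sketch of the most naive approach: slicing text at fixed character offsets. The `fixed_width_chunks` helper is a hypothetical name used only for illustration; it is not part of LangChain.

```python
def fixed_width_chunks(text, width):
    """Slice text every `width` characters, ignoring word and sentence boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]


sentence = "Chunking can split a sentence in half."
for chunk in fixed_width_chunks(sentence, 16):
    print(repr(chunk))
```

The first chunk ends in the middle of the word "split", which is exactly the kind of fragmentation that makes a chunk hard to interpret on its own.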

### Why is Proper Chunking Important?
1. **Semantic Integrity**: Ensures each chunk retains enough context to be meaningful on its own
2. **Retrieval Quality**: Affects how well your RAG system can find relevant information
3. **Model Performance**: Impacts how effectively LLMs can process and generate responses

### Practical Solutions (as shown in later examples):
- **Recursive Splitting**: Breaks down text hierarchically (documents → paragraphs → sentences)
- **Overlap Strategies**: Maintains context between chunks (as demonstrated with `chunk_overlap=4`)
- **Custom Separators**: Uses natural boundaries like `\n\n` for paragraphs or spaces for words

As we'll see in the following cells, LangChain's text splitters provide these capabilities out of the box.

---
### Recursive and Character Text Splitter

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain.schema import Document
import os

chunk=26
overlap=4

RecursiveSplitter = RecursiveCharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap)
CharacterSplitter = CharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap)

In [3]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
print('RecursiveSplitter:', RecursiveSplitter.split_text(text1))
print('CharacterSplitter:', CharacterSplitter.split_text(text1))

RecursiveSplitter: ['abcdefghijklmnopqrstuvwxyz']
CharacterSplitter: ['abcdefghijklmnopqrstuvwxyz']


What is happening?
The Recursive splitter only splits when the text is longer than the chunk size. `text1` is exactly 26 characters, which fits within `chunk_size=26`, so it comes back as a single chunk and the overlap (4) never comes into play.

The Character splitter, on the other hand, looks for its separator to split on. The default separator is a double newline (`\n\n`), which does not exist in `text1`, so the whole text appears to the splitter as a single chunk.
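
A rough way to picture the separator-only behavior is plain `str.split`. This is a simplified sketch, not LangChain's implementation, and `split_on_separator` is a hypothetical helper name:

```python
def split_on_separator(text, separator="\n\n"):
    """Cut only where the separator occurs; blank pieces are dropped."""
    return [piece for piece in text.split(separator) if piece.strip()]


# No "\n\n" in the alphabet string, so nothing can be cut.
print(split_on_separator("abcdefghijklmnopqrstuvwxyz"))

# With a blank line in the text, the cut happens there.
print(split_on_separator("first paragraph\n\nsecond paragraph"))
```

If the separator never appears, there is simply no place to cut, no matter what the chunk size is.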

Let's try another text, changing only the text and none of the values.

In [4]:
text2 = 'Mohamed Adel Hassan Ismaiel'
len(text2)

27

My name is 27 characters long, including the spaces. Let's see how the splitters handle this simple case.

In [5]:
print('RecursiveSplitter:', RecursiveSplitter.split_text(text2))
print('CharacterSplitter:', CharacterSplitter.split_text(text2))

RecursiveSplitter: ['Mohamed Adel Hassan', 'Ismaiel']
CharacterSplitter: ['Mohamed Adel Hassan Ismaiel']


As I guessed, and I hope you did too: the Recursive splitter split at the space after 'Hassan' (the 20th character), because including 'Ismaiel' would make the chunk 27 characters long, exceeding the chunk size of 26. Simple!

Why doesn't the character splitter split the string?
```
The Character splitter uses the \n\n separator by default. This text has no blank lines, so it appears to the Character splitter as one single string.
```


In [6]:
# Let's set the separator to ' ' (a single space).
c_split = CharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap, separator=' ')

print('RecursiveSplitter:', RecursiveSplitter.split_text(text2))
print('CharacterSplitter:', c_split.split_text(text2))

RecursiveSplitter: ['Mohamed Adel Hassan', 'Ismaiel']
CharacterSplitter: ['Mohamed Adel Hassan', 'Ismaiel']


After setting the separator to ' ', both splitters return the same split. However, this is pure luck because of the example text. How does the RecursiveCharacterTextSplitter actually work? It does not use a single ' ' (space) as its default separator. Instead, it uses a list of separators in order of priority, and it tries to split on the “largest meaningful” one first:
```
separators = [
    "\n\n",   # paragraph breaks
    "\n",    # line breaks
    " ",     # spaces
    ""       # as a last resort, split by character
]
```

If you think about it for a second, for longer texts Recursive gives “semantic” splits (paragraphs, then lines, then words as fallbacks), while Character is rigid: it always splits on the one value set for `separator`.
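
The priority idea can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions (no overlap, separators dropped rather than kept), not LangChain's actual code, and `recursive_split` is a hypothetical name:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the highest-priority separator found, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the first (highest-priority) non-empty separator that occurs in the text.
    for index, sep in enumerate(separators):
        if sep and sep in text:
            remaining = separators[index + 1:]
            break
    else:
        # Last resort: nothing to split on, so cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("Mohamed Adel Hassan Ismaiel", 26))
print(recursive_split("a short line\nanother line that is clearly longer than the limit", 26))
```

For the name example this gives the same two chunks as the real splitter did above. Coarse separators are tried first, and finer ones are used only for the pieces that are still too long.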

In [7]:

text3 = """Mohamed Adel Hassan Ismaiel
Lives in Cairo
Works as an AI engineer
Enjoys AI, Python, and teaching"""

chunk=26
overlap=4

r_split = RecursiveCharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap)
c_split = CharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap)

print("RecursiveSplitter:", r_split.split_text(text3))
print("CharacterSplitter:", c_split.split_text(text3))


RecursiveSplitter: ['Mohamed Adel Hassan', 'Ismaiel', 'Lives in Cairo', 'Works as an AI engineer', 'Enjoys AI, Python, and', 'and teaching']
CharacterSplitter: ['Mohamed Adel Hassan Ismaiel\nLives in Cairo\nWorks as an AI engineer\nEnjoys AI, Python, and teaching']


You may have a question in mind, so let me say it out loud: WHY IS THERE ALMOST NO OVERLAP??? (Only the last two chunks share the word 'and'.)

How RecursiveCharacterTextSplitter decides to split

It tries separators in order: ["\n\n", "\n", " ", ""] (paragraph → newline → space → character). It prefers the largest semantic splits first. 

If a separator produces pieces that are each ≤ chunk_size, those pieces are used as chunks and the splitter does not break them further, so no overlap is produced between them. Overlap only shows up when a single semantic piece must itself be split into multiple sub-chunks, and even then it is built from whole sub-pieces that fit inside `chunk_overlap`. That is why, with `chunk_overlap=4`, the short word 'and' is repeated between the last two chunks above, while 'Hassan' (6 characters) is not carried over. This is the behavior many people stumble over. [Stack Overflow source](https://stackoverflow.com/questions/76681318/why-is-recursivecharactertextsplitter-not-giving-any-chunk-overlap)
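
Here is a minimal sketch of how overlap can be built from whole pieces. It is my own simplified reading of the behavior, not LangChain's source; `merge_with_overlap` is a hypothetical name, and the real splitter handles separators a little differently.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, sep=" "):
    """Pack pieces into chunks; carry trailing pieces over only if they fit in chunk_overlap."""
    chunks, window = [], []
    for piece in pieces:
        if window and len(sep.join(window + [piece])) > chunk_size:
            chunks.append(sep.join(window))
            # Drop pieces from the front until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while window and (
                len(sep.join(window)) > chunk_overlap
                or len(sep.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


print(merge_with_overlap("Enjoys AI, Python, and teaching".split(" "), 26, 4))
print(merge_with_overlap("Mohamed Adel Hassan Ismaiel".split(" "), 26, 4))
```

In the first call 'and' (3 characters) fits the overlap budget of 4 and is repeated. In the second call 'Hassan' (6 characters) does not fit, so the second chunk starts fresh.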


### Now that we understand how everything works behind the scenes, let's try some more realistic examples.

In [8]:
more_text = """When reading documents, readers rely on structure to understand the flow of information. \
Headings, subheadings, and paragraphs provide signposts that guide the reader through the material. \
For instance, a heading may introduce a new topic, while a subheading narrows the focus to a detail. \
This organization helps readers follow complex ideas step by step. \n\n  \
Formatting cues like bold or italic text also add meaning. \
They emphasize important words or phrases that the writer wants to highlight. \
Lists, whether numbered or bulleted, show relationships between items in a clear manner. \
Altogether, these features create a document that is easier to scan, read, and comprehend."""


In [9]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""] #default values, shown only to make it visible
)

In [10]:
c_splitter.split_text(more_text)

['When reading documents, readers rely on structure to understand the flow of information. Headings, subheadings, and paragraphs provide signposts that guide the reader through the material. For instance, a heading may introduce a new topic, while a subheading narrows the focus to a detail. This organization helps readers follow complex ideas step by step. \n\n Formatting cues like bold or italic text also add meaning. They emphasize important words',
 'or phrases that the writer wants to highlight. Lists, whether numbered or bulleted, show relationships between items in a clear manner. Altogether, these features create a document that is easier to scan, read, and comprehend.']

In [11]:
r_splitter.split_text(more_text)

['When reading documents, readers rely on structure to understand the flow of information. Headings, subheadings, and paragraphs provide signposts that guide the reader through the material. For instance, a heading may introduce a new topic, while a subheading narrows the focus to a detail. This organization helps readers follow complex ideas step by step.',
 'Formatting cues like bold or italic text also add meaning. They emphasize important words or phrases that the writer wants to highlight. Lists, whether numbered or bulleted, show relationships between items in a clear manner. Altogether, these features create a document that is easier to scan, read, and comprehend.']

To wrap up: the character splitter is simple-minded. It splits only on the one separator it is given and then packs the pieces into chunks of up to chunk_size.
The recursive splitter prioritizes semantic structure. Think of it like the order of operations in math (PEMDAS): it first splits on \n\n (paragraphs); if a piece is still larger than chunk_size, it splits that piece on \n (lines); if it is still too big, it splits on spaces (words); and as a last resort it splits between characters.

This keeps as much of each paragraph together as possible and avoids awkward splits in the middle of a paragraph or sentence.

In [12]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(r'D:\Langchain\Langchain-getting-started\machinelearning-lecture01.pdf')  # raw string so backslashes are not treated as escapes
pdf = loader.load()
len(pdf)

22

In [13]:
character_split = CharacterTextSplitter(chunk_size=450, chunk_overlap=150, separator='\n')
docs = character_split.split_documents(pdf)
len(docs)

196

### Let's pass the retrieved chunks to an LLM and look at the output.

In [14]:
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
import os

current_dir = os.getcwd()
persistent_dir = os.path.join(current_dir, "db", "char_db")
embedding = OllamaEmbeddings(model='nomic-embed-text:v1.5')
db = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persistent_dir)



In [15]:
query = "What is the modules explained in this class ?"
results = db.similarity_search(query, k=3)



In [16]:
for i, doc in enumerate(results, 1):
    print(f"Document {i}:\n{doc.page_content}\n")
    if doc.metadata:
        print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
        
combined_user_query = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([result.page_content for result in results])
    + "\n\nPlease provide an answer based only on the provided documents. Never mention the documents in your response. If the answer is not found in the documents, respond with 'I'm not sure'."
)
print(combined_user_query)

Document 1:
MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so

Source: D:\Langchain\Langchain-getting-started\machinelearning-lecture01.pdf

Document 2:
okay?  
So as an overview of what we're going to do in this class, this class is sort of organized 
into four major sections. We're gonna talk about four major topics in this class, the first 
of which is supervised learning. So let me give you an example of that.  
So suppose you collect a data set of housing prices. And one of the TAs, Dan Ramage, 
actually collected a data set for me last week to use in the example later. But suppose that

Source: D:\Langchain\Langchain-getting-started\machinelearning-lecture01.pdf

Document 

In [17]:
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_ollama import ChatOllama
ollama = ChatOllama(model='llama3.2:3b')
## create the messages
messages = [
    SystemMessage(content="You are a helpful assistant named Gulia."),
    HumanMessage(content=combined_user_query),
]
result = ollama.invoke(messages)
print(result.content)


Based on what's been explained, it seems that one of the topics covered in the class is supervised learning, specifically an example involving housing prices data sets. However, there isn't a clear mention of the specific modules or topics that will be covered throughout the class. The instructor mentions that the class is organized into four major sections, but these are not explicitly stated.


### Markdown Splitter


In [18]:
md_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

Header_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [19]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
md_split = MarkdownHeaderTextSplitter(headers_to_split_on=Header_to_split_on)
md_split_doc = md_split.split_text(md_text)
md_split_doc

[Document(metadata={'Header 1': 'Fun in California', 'Header 2': 'Driving'}, page_content='Try driving on the 1 down to San Diego'),
 Document(metadata={'Header 1': 'Fun in California', 'Header 2': 'Driving', 'Header 3': 'Food'}, page_content="Make sure to eat a burrito while you're there"),
 Document(metadata={'Header 1': 'Fun in California', 'Header 2': 'Hiking'}, page_content='Go to Yosemite')]

- MarkdownHeaderTextSplitter parses a Markdown document.

- It splits text based on headers you define in headers_to_split_on.

- Each resulting chunk is wrapped in a Document object with:

    *   page_content → the text under that section.

    *   metadata → a dictionary storing the header hierarchy (e.g., H1, H2, H3).

#### Key Points

- Each Document retains the header hierarchy in metadata.

- Useful for structured retrieval → you can search/filter not only by content but also by context (e.g., “all text under Hiking”).

- Nested headers get accumulated in the metadata dictionary.
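
To make the metadata accumulation less magical, here is a rough pure-Python sketch of the idea. It is not LangChain's implementation: `split_markdown_by_headers` is a hypothetical helper, it returns plain dicts instead of `Document` objects, and it ignores edge cases such as `#` lines inside code blocks.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group body text under the header hierarchy that is active when it appears."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections, body = [], []
    active = {}  # header name -> title, in hierarchy order
    depths = {}  # header name -> header level (number of "#")

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"metadata": dict(active), "page_content": text})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for closed in [n for n, d in depths.items() if d >= level]:
                    active.pop(closed)
                    depths.pop(closed)
                active[name] = stripped[len(marker):].strip()
                depths[name] = level
                break
        else:
            body.append(line)
    flush()
    return sections


sample = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for section in split_markdown_by_headers(sample, headers):
    print(section)
```

Notice how "## Hiking" drops both the previous "Header 2" and the nested "Header 3" before registering itself, which is why the last section carries only two headers.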

---

### TokenTextSplitter
`TokenTextSplitter` breaks text into chunks based on tokens (the units an LLM processes), not words or characters. This ensures chunks align with the model’s tokenizer and stay within token limits.

1. Suitable Scenarios

- Preparing long documents so each chunk fits inside the model’s token window.

- Splitting text before generating embeddings for retrieval (RAG).

- Preserving context with controlled token-based overlap between chunks.

- When working with texts that include special tokens, where tokenization accuracy matters.

2. When Not to Use

- If you need human-friendly splits (e.g., by paragraph, sentence, or section).

- When preserving semantic meaning is more important than staying strictly within token boundaries.

- Since TokenTextSplitter may cut within a sentence or paragraph, it can break context and reduce readability.
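
The token-based overlap mentioned above boils down to a sliding window over the token sequence. Below is a minimal sketch, assuming whitespace-separated words as stand-in "tokens" (a real tokenizer would produce sub-word ids instead); `window_tokens` is a hypothetical helper, not a LangChain function.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the sequence
    return chunks


# Stand-in tokens: whitespace-separated words, purely for illustration.
words = "one two three four five six seven eight nine ten".split()
for chunk in window_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts one token before the previous one ended, so neighbouring chunks share exactly `chunk_overlap` tokens. The window has no notion of sentences, which is why it can cut mid-thought.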



We will reuse the same texts as above (`text1`, `text2`, `text3`, and `more_text`) for easier comparison.

In [20]:
from langchain.text_splitter import TokenTextSplitter

token_split_gpt_2 = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
token_split_gpt_2.split_text(text1)

['abc',
 'def',
 'gh',
 'ij',
 'kl',
 'mn',
 'op',
 'q',
 'r',
 'st',
 'uv',
 'w',
 'xy',
 'z']

What is the token splitter doing now? Note: the default encoding in `TokenTextSplitter` is `gpt2`, and since we set `chunk_size=1`, each chunk here is a single GPT-2 token.

How the GPT-2 tokenizer works


- GPT-2 uses Byte Pair Encoding (BPE) with a fixed vocabulary (~50,000 tokens).

- It applies its learned merge rules in priority order, which tends to produce long vocabulary tokens for common character sequences.

- It starts with a base vocabulary of all single-byte characters (so every lowercase letter "a"–"z" exists as its own token).

- During training, it merges frequent character pairs into new tokens (e.g., "th", "ing", "er", …).

So when text is tokenized:

- If a run of characters can be merged into a larger token in the vocabulary, it’s grouped together.

- If not, it falls back to smaller pieces, down to single characters.
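
To get a feel for the merging step, here is a toy BPE sketch. The merge table below is made up for illustration and has nothing to do with GPT-2's real table; `bpe_tokenize` is a hypothetical helper name.

```python
def bpe_tokenize(word, merges):
    """Repeatedly apply the highest-priority merge that is present in the token list."""
    rank = {pair: priority for priority, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda pair: rank.get(pair, float("inf")))
        if best not in rank:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge rules, earliest = highest priority.
toy_merges = [("t", "h"), ("i", "n"), ("in", "g"), ("th", "e")]
print(bpe_tokenize("thing", toy_merges))
print(bpe_tokenize("the", toy_merges))
print(bpe_tokenize("xyz", toy_merges))
```

With this toy table, "thing" collapses into two tokens, "the" into one, and "xyz" stays as three single characters because no merge rule covers it. The alphabet output above shows the same mix of merged pieces and single letters.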

In [21]:
token_split_gpt_2.split_text(text2)

['Moh', 'amed', ' Ad', 'el', ' Hassan', ' Is', 'm', 'ai', 'el']

In [22]:
token_split_gpt_2.split_text(text3)

['Moh',
 'amed',
 ' Ad',
 'el',
 ' Hassan',
 ' Is',
 'm',
 'ai',
 'el',
 '\n',
 'L',
 'ives',
 ' in',
 ' Cairo',
 '\n',
 'Works',
 ' as',
 ' an',
 ' AI',
 ' engineer',
 '\n',
 'Enjoy',
 's',
 ' AI',
 ',',
 ' Python',
 ',',
 ' and',
 ' teaching']

In [23]:
token_split_gpt_2.split_text(more_text)

['When',
 ' reading',
 ' documents',
 ',',
 ' readers',
 ' rely',
 ' on',
 ' structure',
 ' to',
 ' understand',
 ' the',
 ' flow',
 ' of',
 ' information',
 '.',
 ' Head',
 'ings',
 ',',
 ' sub',
 'head',
 'ings',
 ',',
 ' and',
 ' paragraphs',
 ' provide',
 ' sign',
 'posts',
 ' that',
 ' guide',
 ' the',
 ' reader',
 ' through',
 ' the',
 ' material',
 '.',
 ' For',
 ' instance',
 ',',
 ' a',
 ' heading',
 ' may',
 ' introduce',
 ' a',
 ' new',
 ' topic',
 ',',
 ' while',
 ' a',
 ' sub',
 'heading',
 ' narrow',
 's',
 ' the',
 ' focus',
 ' to',
 ' a',
 ' detail',
 '.',
 ' This',
 ' organization',
 ' helps',
 ' readers',
 ' follow',
 ' complex',
 ' ideas',
 ' step',
 ' by',
 ' step',
 '.',
 ' ',
 '\n\n',
 ' ',
 ' Format',
 'ting',
 ' cues',
 ' like',
 ' bold',
 ' or',
 ' ital',
 'ic',
 ' text',
 ' also',
 ' add',
 ' meaning',
 '.',
 ' They',
 ' emphasize',
 ' important',
 ' words',
 ' or',
 ' phrases',
 ' that',
 ' the',
 ' writer',
 ' wants',
 ' to',
 ' highlight',
 '.',
 ' Lists',


For more details on how tokens are counted and how they correspond to text, see the OpenAI article in the references below.

According to the OpenAI post, the approximate token counts for English text are as follows:

- 1 token ~= 4 chars in English
- 1 token ~= ¾ words
- 100 tokens ~= 75 words
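
These rules of thumb are easy to turn into a quick back-of-the-envelope estimator. This is only a rough sketch for English text; `estimate_tokens` is a hypothetical helper, and a real tokenizer will give different counts.

```python
def estimate_tokens(text):
    """Return two rough token estimates: one from characters, one from words."""
    by_chars = len(text) / 4             # 1 token ~= 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~= 3/4 of a word
    return by_chars, by_words


sample_text = "Chunking is about getting data into a format that can be retrieved later."
chars_estimate, words_estimate = estimate_tokens(sample_text)
print(f"~{chars_estimate:.0f} tokens by characters, ~{words_estimate:.0f} tokens by words")
```

The two estimates rarely agree exactly, which is a good reminder that they are approximations and not a substitute for counting with the actual tokenizer.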

References

- [langchain Tokens](https://python.langchain.com/docs/concepts/tokens/)
- [OpenAI](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)

## RAG Comparison: All 4 Splitting Techniques

### Setup for Comparison
Let's create a simple test to compare how each splitter performs with the same RAG query.


In [27]:
# Test text for comparison

# load the pdf again
loader = PyPDFLoader(r'D:\Langchain\Langchain-getting-started\machinelearning-lecture01.pdf')  # raw string so backslashes are not treated as escapes
test_text = loader.load()

# Initialize all splitters with same parameters
chunk_size = 450
chunk_overlap = 0

character_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, 
    chunk_overlap=chunk_overlap, 
    separator=' '
)

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, 
    chunk_overlap=chunk_overlap
)

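# Note: TokenTextSplitter measures chunk_size in tokens, not characters,
# so 450 here allows much larger chunks than in the two splitters above.
token_splitter = TokenTextSplitter(
    chunk_size=chunk_size, 
    chunk_overlap=chunk_overlap
)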


In [39]:
# Create persistent vector stores for all splitting techniques
print("\n=== CREATING PERSISTENT VECTOR STORES ===")

# Get current directory
current_dir = os.getcwd()
# Create directories for each vector store
db_dirs = {
    "character": os.path.join(current_dir, "db", "char_db_test"),
    "recursive": os.path.join(current_dir, "db", "recursive_db_test"), 
    "token": os.path.join(current_dir, "db", "token_db_test"),
}

# Ensure directories exist
for db_dir in db_dirs.values():
    os.makedirs(db_dir, exist_ok=True)

# Initialize embedding function
embedding = OllamaEmbeddings(model='nomic-embed-text:v1.5')

# Create documents for each splitter (except markdown which already has documents)
character_docs = character_splitter.split_documents(test_text)
recursive_docs = recursive_splitter.split_documents(test_text)
token_docs = token_splitter.split_documents(test_text)

# Create vector stores
print("Creating vector stores...")
char_db = Chroma.from_documents(documents=character_docs, embedding=embedding, persist_directory=db_dirs["character"])
recursive_db = Chroma.from_documents(documents=recursive_docs, embedding=embedding, persist_directory=db_dirs["recursive"])
token_db = Chroma.from_documents(documents=token_docs, embedding=embedding, persist_directory=db_dirs["token"])

print("Vector stores created successfully!")




=== CREATING PERSISTENT VECTOR STORES ===
Creating vector stores...




Vector stores created successfully!


In [49]:
# Test all vector stores with the same query
print("\n=== TESTING ALL VECTOR STORES WITH SAME QUERY ===")
query = "What are the topics covered in the lecture?"

# Function to test and display results
def test_vector_store(db, db_name, query, k=3):
    print(f"\n--- CREATING {db_name.upper()} VECTOR STORE ---")
    results = db.similarity_search(query, k=k)
    print(f"\n--- {db_name.upper()} DB DONE ---")
    return results

# Test all vector stores
char_results = test_vector_store(char_db, "character", query)
recursive_results = test_vector_store(recursive_db, "recursive", query) 
token_results = test_vector_store(token_db, "token", query)

# Summary comparison
print(f"\nNumber of chunks in each store:")
print(f"Character splitter: {len(character_docs)} chunks")
print(f"Recursive splitter: {len(recursive_docs)} chunks")
print(f"Token splitter: {len(token_docs)} chunks") 


=== TESTING ALL VECTOR STORES WITH SAME QUERY ===

--- CREATING CHARACTER VECTOR STORE ---

--- CHARACTER DB DONE ---

--- CREATING RECURSIVE VECTOR STORE ---

--- RECURSIVE DB DONE ---

--- CREATING TOKEN VECTOR STORE ---

--- TOKEN DB DONE ---

Number of chunks in each store:
Character splitter: 144 chunks
Recursive splitter: 157 chunks
Token splitter: 42 chunks


In [50]:
# Show all retrieved chunks for every splitting technique

def show_all_retrieved_chunks(results, splitter_name):
    print(f"\n=== All retrieved chunks for {splitter_name} splitter ===")
    for i, doc in enumerate(results, 1):
        print(f"\nChunk {i}:")
        print(doc.page_content)
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"Metadata: {doc.metadata}")

show_all_retrieved_chunks(char_results, "character")
show_all_retrieved_chunks(recursive_results, "recursive")
show_all_retrieved_chunks(token_results, "token")



=== All retrieved chunks for character splitter ===

Chunk 1:
So later this quarter, we'll use the discussion sections to talk about things like convex 
optimization, to talk a little bit about hidden Markov models, which is a type of machine 
learning algorithm for modeling time series and a few other things, so extensions to the 
materials that I'll be covering in the main lectures. And attendance at the discussion 
sections is optional, okay? 
So that was all I had from logistics. Before we move on to
Metadata: {'page': 9, 'source': 'D:\\Langchain\\Langchain-getting-started\\machinelearning-lecture01.pdf'}

Chunk 2:
So later this quarter, we'll use the discussion sections to talk about things like convex 
optimization, to talk a little bit about hidden Markov models, which is a type of machine 
learning algorithm for modeling time series and a few other things, so extensions to the 
materials that I'll be covering in the main lectures. And attendance at the discussion 
sections is 

### Analysis of Results

This comparison demonstrates how each splitter divides the same content and performs in a RAG scenario:

**Key observations to look for:**
- **Character splitter**: May split in the middle of sentences, potentially breaking context
- **Recursive splitter**: Tries to preserve semantic boundaries (paragraphs → sentences → words)
- **Token splitter**: Respects token boundaries but may break words or semantic units
- **Markdown splitter**: Preserves header structure and maintains contextual relationships

**Expected outcomes:**
- The query "What are the topics covered in the lecture?" should retrieve chunks containing information about:
  - Supervised learning
  - Unsupervised learning  
  - Reinforcement learning
  - Deep learning

**Performance metrics:**
- Number of chunks generated by each approach
- Relevance of retrieved chunks to the query
- Preservation of metadata (especially for markdown splitting)
- Context preservation across chunk boundaries
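
A couple of these metrics can be eyeballed with a few lines of plain Python. The helpers below are hypothetical and deliberately crude: keyword counting is only a rough proxy for relevance, not a real evaluation method. They work on plain strings, so with LangChain documents you would pass something like `[doc.page_content for doc in results]`.

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarize how many chunks there are and how long they tend to be."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(chunks),
        "mean_chars": mean(lengths) if lengths else 0,
        "max_chars": max(lengths, default=0),
    }


def keyword_hits(chunks, keywords):
    """Count, per keyword, how many chunks mention it (case-insensitive)."""
    return {
        keyword: sum(keyword.lower() in chunk.lower() for chunk in chunks)
        for keyword in keywords
    }


# Made-up chunks, only to show the shape of the output.
example_chunks = ["Supervised learning example", "Reinforcement learning", "Class logistics"]
print(chunk_stats(example_chunks))
print(keyword_hits(example_chunks, ["supervised", "reinforcement"]))
```

Running the same two helpers on the retrieved chunks from each vector store gives a quick side-by-side view before you read the chunks yourself.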


You decide how well each technique preserves semantic context and retrieves relevant information about the asked query.
