<a href="https://colab.research.google.com/github/jesusvillota/CSS_DataScience_2025/blob/main/Session2/2_3_RAG_II_Document_Splitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="max-width: 880px; margin: 20px auto 22px; padding: 0px; border-radius: 18px; border: 1px solid #e5e7eb; background: linear-gradient(180deg, #ffffff 0%, #f9fafb 100%); box-shadow: 0 8px 26px rgba(0,0,0,0.06); overflow: hidden;">

  <!-- Banner Header -->
  <div style="padding: 34px 32px 14px; text-align: center; line-height: 1.38;">
    <div style="font-size: 13px; letter-spacing: 0.14em; text-transform: uppercase; color: #6b7280; font-weight: bold; margin-bottom: 5px;">
      Session #2
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      RAG with LangChain
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      Part II: Document Splitting
    </div>
    <div style="font-size: 16.5px; color: #374151; font-style: italic; margin-bottom: 0;">
      Using Textual Data in Empirical Monetary Economics
    </div>
  </div>

  <!-- Logo Section -->
  <div style="background: none; text-align: center; margin: 30px 0 10px;">
    <img src="https://www.cemfi.es/images/Logo-Azul.png" alt="CEMFI Logo" style="width: 158px; filter: drop-shadow(0 2px 12px rgba(56,84,156,0.05)); margin-bottom: 0;">
  </div>

  <!-- Name -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1.22em; font-weight: bold; margin-bottom: 0px;">
    Jesus Villota Miranda © 2025
  </div>

  <!-- Contact info -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1em; margin-top: 7px; margin-bottom: 20px;">
    <a href="mailto:jesus.villota@cemfi.edu.es" style="color: #38549c; text-decoration: none; margin-right:8px;" title="Email">
      <!-- Email logo -->
      <!-- <img src="https://cdn-icons-png.flaticon.com/512/11679/11679732.png" alt="Email" style="width:18px; vertical-align:middle; margin-right:5px;"> -->
      jesus.villota@cemfi.edu.es
    </a>
    <span style="color:#9fa7bd;">|</span>
    <a href="https://www.linkedin.com/in/jesusvillotamiranda/" target="_blank" style="color: #38549c; text-decoration: none; margin-left:7px;" title="LinkedIn">
      <!-- LinkedIn logo -->
      <!-- <img src="https://1.bp.blogspot.com/-onvhHUdW1Us/YI52e9j4eKI/AAAAAAAAE4c/6s9wzOpIDYcAo4YmTX1Qg51OlwMFmilFACLcBGAsYHQ/s1600/Logo%2BLinkedin.png" alt="LinkedIn" style="width:17px; vertical-align:middle; margin-right:5px;"> -->
      LinkedIn
    </a>
  </div>
</div>


**IMPORTANT**: **Are you running this notebook in Google Colab?**

- If so, please make sure that in the cell below `running_in_colab` is set to `True`

- And, of course, make sure to **run the cell**!

In [1]:
# ARE YOU RUNNING THIS IN GOOGLE COLAB? If YES, type True below
running_in_colab = False

# Document Splitting

After loading documents into a standard format, the next step in building a Retrieval-Augmented Generation (RAG) system is splitting them into smaller, manageable chunks. This step comes after document loading and before the chunks are stored in a vector database.

![](images/rag_pipeline.png)

While document splitting might sound straightforward, there are many subtleties that significantly impact the effectiveness of your RAG system. If text is split improperly, you might end up with chunks that separate semantically related information, making it difficult to retrieve the complete context needed to answer questions correctly.

For example, if a sentence about a car's specifications is split into two separate chunks, when someone asks about those specifications, neither chunk alone would contain the complete answer. Good document splitting ensures that semantically relevant information stays together in the same chunk.

## Understanding Text Splitting Parameters

All text splitters in LangChain operate on two key parameters:

1. **Chunk Size**: This determines the size of each chunk. The size can be measured in different ways, commonly by character count or token count. We define this with the `chunk_size` parameter.

2. **Chunk Overlap**: This creates an overlap between adjacent chunks, like a sliding window. Having overlapping text ensures that context isn't lost at the boundaries between chunks. For example, if important information spans the end of one chunk and the beginning of another, the overlap ensures it's captured in both chunks. We define this with the `chunk_overlap` parameter.

LangChain text splitters provide two main methods:
- `split_text()`: Takes a single text string and returns a list of chunk strings
- `split_documents()`: Takes a list of document objects and splits them while preserving metadata

For this demonstration, we'll use small values for chunk size (26) and overlap (4) to clearly illustrate how splitting works.

In [2]:
if running_in_colab: 
    ! pip install langchain
    ! pip install -U langchain-community

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

## Basic Splitting Examples

Let's start with some simple examples to understand how text splitting works. First, we'll try a string that's exactly the same length as our chunk size.

Will the splitter break this string up? Let's find out!

In [4]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [5]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

Notice that the string wasn't split. This is because the string is exactly 26 characters long, which matches our specified chunk size of 26. The text splitter only splits text when it exceeds the chunk size, so in this case, no splitting was necessary.

Now let's try a longer string that exceeds our chunk size:

In [6]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [7]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Now we can see the text has been split into two chunks! 

The first chunk contains the first 26 characters (abcdefghijklmnopqrstuvwxyz), which is exactly our chunk size. The second chunk starts with "wxyzabcdefg", where the first 4 characters "wxyz" represent our chunk overlap. This overlap creates a sliding window effect between chunks, ensuring that context at the boundaries isn't lost.

The 4-character overlap (wxyz) appears at the end of the first chunk and the beginning of the second chunk, helping maintain continuity between the chunks. This is especially important for maintaining context when chunks are later processed independently.
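
The sliding-window behaviour seen above can be reproduced in a few lines of plain Python. This is a minimal sketch, not LangChain's implementation: the helper name `sliding_window_chunks` is hypothetical, and it slices on raw character positions only, ignoring separators.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-size character windows that overlap (illustrative only)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


example_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4)
print(example_chunks)
```

With `chunk_size=26` and `chunk_overlap=4` the window advances 22 characters at a time, so the second chunk starts at "w".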

Let's try a more complex example with spaces between characters. The recursive character text splitter counts spaces as characters when measuring chunk size, so the same alphabet now takes up more room and is split differently:

In [8]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [9]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

With spaces included, the string is split into three chunks because the spaces count as characters, making the total string length exceed our chunk size multiple times. 

If we look at the overlap between chunks, we can see that the first chunk ends with "...k l m" and the second chunk begins with "l m n...". The shared text "l m" is our chunk overlap. It is only three visible characters, but `chunk_overlap=4` is an upper bound: the splitter carries over whole space-separated pieces, counting the spaces between them, and also carrying over "k" would exceed 4 characters.

Now let's try using the `CharacterTextSplitter` instead of the `RecursiveCharacterTextSplitter`:

In [10]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

Interestingly, the `CharacterTextSplitter` doesn't split the text at all! This is because, by default, the `CharacterTextSplitter` uses a double newline (`"\n\n"`) as its separator, and our text doesn't contain any newlines.

Unlike the `RecursiveCharacterTextSplitter`, which tries several separators in order, the `CharacterTextSplitter` splits on a single separator string only. Let's modify the `CharacterTextSplitter` to use spaces as separators instead:

In [11]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Now that we've specified a space as the separator, the `CharacterTextSplitter` splits the text in a similar way to the `RecursiveCharacterTextSplitter`. This demonstrates the importance of choosing the right separator for your specific text.

The key difference between these two splitters:
- **`RecursiveCharacterTextSplitter`**: Uses a list of separators in order of priority (first double newlines, then single newlines, then spaces, and finally character by character if needed)

- **`CharacterTextSplitter`**: Uses only one separator (by default a double newline, `\n\n`) and won't split at all if that separator isn't present

This is why the `RecursiveCharacterTextSplitter` is generally recommended for generic text as it can adapt to different text structures.
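
To see why a single-separator splitter can return the text untouched, here is a simplified pure-Python sketch of the "split on one separator, then pack pieces into chunks" idea. The function `single_separator_split` is a hypothetical helper written for illustration; it ignores overlap and is not the library's code.

```python
def single_separator_split(text, chunk_size, separator="\n\n"):
    """Split on ONE separator, then greedily pack pieces into chunks (no overlap)."""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is empty (an oversized piece cannot be cut further)
            # or the piece still fits: keep growing the current chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


spaced = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(single_separator_split(spaced, 26))                  # separator absent
print(single_separator_split(spaced, 26, separator=" "))   # separator present
```

When the separator never occurs, `split` yields a single piece, and that piece is returned whole even though it is longer than `chunk_size`.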

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text because it intelligently tries different separators in sequence. 

The recursive splitter works by:
1. First trying to split on double newlines (`\n\n`), which typically separate paragraphs
2. If chunks are still too large, it tries splitting on single newlines (`\n`)
3. If still too large, it splits on spaces, which separate words
4. As a last resort, it splits character by character

This hierarchy of separators helps maintain the semantic structure of the text, keeping related content together when possible. Let's see how it works on a more realistic text example:

In [12]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

print(some_text)

When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. 

  Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.


In [13]:
len(some_text)

496

We can see that this text is about 500 characters long. It contains a natural paragraph break with a double newline (`\n\n`), which is a typical separator between paragraphs. 

Let's set up our splitters with a larger chunk size (450 characters) to see how they handle this more realistic text. We'll explicitly specify the separators for the `RecursiveCharacterTextSplitter` to show how it works:

In [15]:
chunk_size = 450
chunk_overlap = 0

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""]
)

In [13]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [14]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Notice the difference in how these two splitters handle the text:

1. **CharacterTextSplitter**: Splits only on spaces and packs as many words as fit into 450 characters. The first chunk therefore runs straight across the paragraph break and ends mid-sentence ("...but also,"), so the chunks are not semantically cohesive.

2. **RecursiveCharacterTextSplitter**: First tries to split on double newlines (`\n\n`), which results in two paragraphs being kept intact. This is a more natural and semantically meaningful split, even though the first chunk is shorter than the maximum 450 characters we specified.

This demonstrates why the recursive splitter is often better for natural text - it respects the document's structure, keeping paragraphs together rather than splitting arbitrarily in the middle of sentences.
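
The separator hierarchy described in this section can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical name, separators are dropped rather than kept, and there is no overlap. It is not LangChain's implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try coarse separators first; fall back to finer ones only for oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p.strip()):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # The piece alone is too big: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks]


sample = "First paragraph here.\n\nSecond paragraph is here too."
print(recursive_split(sample, 40))  # splits at the paragraph break
print(recursive_split(sample, 25))  # second paragraph is re-split on spaces
```

Only the paragraph that does not fit is broken down further, which is why coarse structure survives whenever the chunk size allows it.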

## Real-World Example: Splitting PDF Documents

Now let's apply what we've learned to a real-world example. We'll load a PDF document and split it into manageable chunks that can be used in a RAG system.

For PDFs, we typically need larger chunk sizes than our toy examples, as PDFs contain substantial amounts of content. We'll use the PyPDFLoader to load a sample PDF document, and then apply our text splitting techniques to it.

PDFs present unique challenges because they might have complex layouts, multiple columns, headers, footers, and other elements that can affect how the text should be split. Additionally, preserving the original page number in the metadata is important for attribution and reference.

Let's load a PDF file and see how the document splitting works in practice:

In [16]:
import os

if running_in_colab: 
    ! pip install pypdf
    from google.colab import drive
    drive.mount('/content/gdrive')
    folder_dir = '/content/gdrive/My Drive'
else: 
    folder_dir = 'docs'

os.makedirs(folder_dir, exist_ok=True)

In [17]:
# In recent LangChain releases the loaders live in the langchain-community package
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(f"{folder_dir}/paper.pdf")
pages = loader.load()

Ignoring wrong pointing object 140 0 (offset 0)

In [18]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

For real-world documents like PDFs, we typically use larger chunk sizes. Here we're using a chunk size of 1000 characters with an overlap of 150 characters, a reasonable starting point that you should tune for your own documents and retrieval task.

We're also explicitly specifying the `length_function` parameter as Python's built-in `len()` function, which counts characters. This is the default, but we include it here for clarity. For PDFs, splitting on newlines is often effective as PDFs naturally have line breaks.

In [19]:
docs = text_splitter.split_documents(pages)

In [20]:
len(docs)

205

In [21]:
for i, doc in enumerate(docs): 
    print(f"\n📄 Doc {i}: {doc.page_content}")


📄 Doc 0: Predicting Market Reactions to News:
An LLM-Based Approach Using Spanish Business Articles
Jesus Villota ∗
Abstract
Markets do not always eﬃciently incorporate news, particularly when information is complex or
ambiguous. Traditional text analysis methods fail to capture the economic structure of information
and its ﬁrm-speciﬁc implications. We propose a novel methodology that guides LLMs to systematically
identify and classify ﬁrm-speciﬁc economic shocks in news articles according to their type, magnitude,
and direction. This economically-informed classiﬁcation allows for a more nuanced understanding of
how markets process complex information. Using a simple trading strategy, we demonstrate that our
LLM-based classiﬁcation signiﬁcantly outperforms a benchmark based on clustering vector embeddings,
generating consistent proﬁts out-of-sample while maintaining transparent and durable trading signals.

📄 Doc 1: generating consistent proﬁts out-of-sample while maintaining transpar

In [20]:
len(pages)

72

Notice that after splitting, we now have many more document objects than we started with: 205 chunks from 72 pages. This is because each original page of the PDF has been split into multiple smaller chunks.

This splitting is critical for several reasons:
1. It helps fit content within the context window of LLMs
2. It enables more precise retrieval of relevant information
3. It allows for more efficient storage in vector databases

> Important to note: When using `split_documents()`, the LangChain splitters automatically preserve the metadata from the original documents and attach it to each new chunk. This ensures that we maintain information about where each chunk came from, which is crucial for proper attribution and context when retrieving information.
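
The metadata behaviour can be illustrated without any library. In the sketch below, `Doc` is a hypothetical stand-in for a LangChain document object and `split_docs_keep_metadata` is an illustrative helper; the point is only that every chunk receives a copy of its parent's metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs_keep_metadata(docs, chunk_size):
    """Cut each document into fixed-size pieces and copy its metadata onto each piece."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            # dict(...) makes a copy so chunks do not share one mutable dict
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


# Made-up pages for illustration only
toy_pages = [
    Doc("a" * 25, {"source": "paper.pdf", "page": 0}),
    Doc("b" * 10, {"source": "paper.pdf", "page": 1}),
]
toy_chunks = split_docs_keep_metadata(toy_pages, chunk_size=10)
for chunk in toy_chunks:
    print(chunk.metadata["page"], len(chunk.page_content))
```

The first page yields three chunks that all report `page` 0, so a retrieved chunk can always be traced back to its page.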

## Token splitting

So far, we've been splitting based on character count. However, there's another important approach: splitting on token count.

This is particularly useful because LLMs process text as tokens, not characters, and they have context windows defined by token limits (e.g., 4096 tokens, 8192 tokens, etc.). Splitting by token count gives us a more accurate measure of how much text an LLM can process at once.

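A token is roughly 4 characters on average in English, but this varies widely. Common words might be a single token, while rare words might be split into multiple tokens. A `TokenTextSplitter` measures `chunk_size` and `chunk_overlap` in tokens rather than characters. It relies on the `tiktoken` library (with the `gpt2` encoding unless you specify another), so the counts match a given LLM only if you select that model's encoding.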

In [22]:
from langchain.text_splitter import TokenTextSplitter

In [24]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [25]:
text1 = "foo bar bazzyfoo"

In [26]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

This demonstrates how tokenization works differently from character splitting. The string "foo bar bazzyfoo" is split into tokens like ["foo", " bar", " b", "az", "zy", "foo"]. Notice how some words remain whole, while others (like "bazzyfoo") get broken into multiple tokens.

This highlights an important point: tokenization doesn't always respect word boundaries. The way text is tokenized depends on the tokenizer's vocabulary and training, and can sometimes break words in unexpected places.
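
The mechanics of token-based chunking (tokenize, group a fixed number of tokens, join them back into text) can be sketched with a toy tokenizer. Note the assumptions: `toy_tokenize` is a hypothetical word-level tokenizer built on a regular expression, so unlike the real subword tokenizer used above it never breaks a word apart. `chunk_by_tokens` is likewise an illustrative helper.

```python
import re


def toy_tokenize(text):
    """Word-level stand-in for a real tokenizer; joining the tokens restores the text."""
    return re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)


def chunk_by_tokens(text, chunk_size, chunk_overlap=0):
    """Group tokens into chunks of at most `chunk_size` tokens each."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


sentence = "Markets do not always incorporate news."
token_chunks = chunk_by_tokens(sentence, chunk_size=3)
print(token_chunks)
```

Chunk lengths in characters now vary, because the budget is counted in tokens and tokens differ in length.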

Now let's try applying token splitting to our PDF documents:

In [27]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [28]:
tokens = text_splitter.split_documents(pages)
# Each element is a chunk of up to 10 tokens, not a single token
print(f"Total number of 10-token chunks in the PDF: {len(tokens)}")

Total number of 10-token chunks in the PDF: 4409


In [29]:
for i, token in enumerate(tokens):
    print(f"🧩 Token {i}: {token.page_content}")

🧩 Token 0: Predicting Market Reactions to News:

🧩 Token 1: An LLM-Based Approach Using Spanish Business Articles
🧩 Token 2: 
Jesus Villota ∗
Abstract
Mark
🧩 Token 3: ets do not always eﬃciently
🧩 Token 4:  incorporate news, particularly when information is complex or

🧩 Token 5: ambiguous. Traditional text analysis methods fail to capture
🧩 Token 6:  the economic structure of information
and its �
🧩 Token 7: �rm-speciﬁc implications
🧩 Token 8: . We propose a novel methodology that guides LLMs
🧩 Token 9:  to systematically
identify and classify ﬁ
🧩 Token 10: rm-speciﬁc economic shocks
🧩 Token 11:  in news articles according to their type, magnitude,
🧩 Token 12: 
and direction. This economically-informed classi
🧩 Token 13: ﬁcation allows for a more nuanced understanding
🧩 Token 14:  of
how markets process complex information. Using a
🧩 Token 15:  simple trading strategy, we demonstrate that our
LL
🧩 Token 16: M-based classiﬁcation sign
🧩 Token 17: iﬁcantly outperforms a benchmark
🧩 Toke

Each entry labelled "Token i" above is a chunk of up to 10 tokens, not a single token. The "�" characters in chunks 6 and 7 most likely appear because a chunk boundary fell inside the multi-byte "ﬁ" ligature, which the tokenizer encodes as several byte-level tokens.

Because we used `split_documents()`, every chunk also carries the metadata (source and page number) of the page it came from. This metadata preservation is crucial: even after splitting a document into many smaller chunks, we can still trace each chunk back to its source. Here is the metadata of the first page, which the chunks cut from that page inherit:

In [30]:
pages[0].metadata

{'producer': 'macOS Version 15.5 (Build 24F74) Quartz PDFContext',
 'creator': 'TexpadTeX CoreGraphicsOutputContext backend: 839',
 'creationdate': "D:20250604103256Z00'00'",
 'moddate': "D:20250604103256Z00'00'",
 'source': 'docs/paper.pdf',
 'total_pages': 72,
 'page': 0,
 'page_label': '1'}

## Context-aware splitting

Beyond preserving existing metadata, sometimes we want to add additional context to our chunks based on the document's structure. This is where context-aware splitting becomes valuable.

The goal of chunking is to keep semantically related text together. While basic text splitters use delimiters like newlines or spaces, many documents (such as Markdown) have explicit structure through headers that can be leveraged for smarter splitting.

`MarkdownHeaderTextSplitter` not only splits text based on headers but also adds those headers as metadata to each chunk. This is particularly useful because:

1. Headers provide critical context about the content's topic
2. They establish hierarchical relationships between chunks
3. This metadata can be used later for more targeted retrieval

Let's see how this works with a sample Markdown document:

In [29]:
from langchain_community.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [30]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [31]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [32]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

The `MarkdownHeaderTextSplitter` requires us to define which headers to look for and what metadata field names to use for them. In this case, we're identifying three levels of headers:
- `#` (Header 1): Top-level headers
- `##` (Header 2): Second-level headers 
- `###` (Header 3): Third-level headers

This allows the splitter to recognize the hierarchical structure of the document and preserve that structure in the metadata.

In [33]:
for i, md_header in enumerate(md_header_splits):
    print(f"\n Markdown Header {i}: {md_header.page_content}")


 Markdown Header 0: Hi this is Jim  
Hi this is Joe

 Markdown Header 1: Hi this is Lance

 Markdown Header 2: Hi this is Molly


In [34]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

Looking at the first split, we can see it contains the content "Hi this is Jim" and "Hi this is Joe" from the Chapter 1 section. Most importantly, look at the metadata - it contains:

- `Header 1: "Title"` - This comes from the top-level header
- `Header 2: "Chapter 1"` - This comes from the second-level header

This metadata provides crucial context about where this content appears in the document hierarchy, which can be extremely valuable when retrieving information later.
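
To make the mechanism concrete, here is a minimal pure-Python sketch of header-aware splitting. It is a simplified illustration, not the `MarkdownHeaderTextSplitter` source: the function `split_on_headers` is hypothetical, it returns plain dictionaries instead of document objects, and it joins content lines with a single newline.

```python
def split_on_headers(markdown, headers_to_split_on):
    """Split Markdown at headers and attach the active headers as metadata."""
    # Check longer prefixes first so "###" is not mistaken for "#"
    prefixes = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}    # metadata name -> header text currently in effect
    buffer = []    # content lines collected since the last header
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(active)})
        buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        for prefix, name in prefixes:
            if line.startswith(prefix + " "):
                flush()  # a header closes the chunk before it
                # A new header invalidates headers at the same or a deeper level
                for other_prefix, other_name in headers_to_split_on:
                    if len(other_prefix) >= len(prefix):
                        active.pop(other_name, None)
                active[name] = line[len(prefix):].strip()
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return chunks


md_text = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
header_chunks = split_on_headers(md_text, levels)
for chunk in header_chunks:
    print(chunk["metadata"], "->", repr(chunk["page_content"]))
```

The chunk under "Chapter 2" no longer carries "Header 3", because starting a new second-level header clears any deeper headers that were active.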