# LangChain - Chat With Your Data: Part 2

## 📚 Document Splitting

Splitting documents into smaller chunks is critical before storing them in vector stores.
This ensures that, during retrieval, the system fetches semantically meaningful and complete information.

**Challenges:**
- Splitting on a fixed number of characters may break sentences or logic.
- Need to preserve semantic context.
- Must retain or enrich metadata.

LangChain offers a rich suite of text splitters with various strategies, from character-based to token-based to header-aware.
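
To make the first challenge concrete, here is a minimal sketch in plain Python (no LangChain involved). The helper name `fixed_size_chunks` and the sample sentence are illustrative assumptions; the only point is that a fixed character budget cuts wherever the count runs out.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "Chunks should be meaningful. A fixed cut ignores that."
chunks = fixed_size_chunks(sample, 20)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends in the middle of a word, which is the failure mode the separator-aware splitters in the following sections are meant to reduce.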

### Environment Setup

In [1]:
import os
import openai
import sys
sys.path.append('../..')

# Load environment variables, including OPENAI_API_KEY
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

### CharacterTextSplitter vs RecursiveCharacterTextSplitter

In [2]:
# RecursiveCharacterTextSplitter and CharacterTextSplitter are the two most commonly used splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# Define chunk size and overlap.
# For these splitters both values are measured in characters (the default length function is len).
# The overlap works like a sliding window between consecutive chunks.
chunk_size = 26
chunk_overlap = 4

# Initialize splitters
r_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text1 = 'abcdefghijklmnopqrstuvwxyz'
print(r_splitter.split_text(text1))

['abcdefghijklmnopqrstuvwxyz']


The splitter is configured with a chunk size of 26 characters and an overlap of 4 characters. `text1` is exactly 26 characters long, so it fits into a single chunk and is not split.
A longer string (`text2`) and a string with spaces (`text3`) are each split into multiple chunks.

In [3]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
print(r_splitter.split_text(text2))

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
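
The overlap in the output above can be modelled as a sliding window over characters. The following is a simplified sketch under that assumption, not LangChain's actual implementation (which merges separator-delimited pieces); `sliding_window` is a hypothetical helper name.

```python
def sliding_window(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Character windows of `chunk_size`, each starting `chunk_size - chunk_overlap` after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_window("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_window("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

For a string with no usable separator, this reproduces the two results shown above: the 26-character string stays whole, and the longer one repeats `wxyz` at the start of its second chunk.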


In [4]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(r_splitter.split_text(text3))

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']


By default, CharacterTextSplitter splits only on `'\n\n'`, so it does not split `text3` unless a whitespace separator is specified explicitly.

In [5]:
print(c_splitter.split_text(text3))  # Default separator is '\n\n', which text3 does not contain

['a b c d e f g h i j k l m n o p q r s t u v w x y z']


In [6]:
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=' ')
print(c_splitter.split_text(text3)) # Uses ' ' as separator

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
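
The two results above follow from a split-then-merge strategy: cut the text on one separator, then greedily pack the pieces back into chunks, carrying a few trailing pieces over as overlap. The sketch below is an approximation written for this note (the function names are assumptions, and LangChain's own merging logic handles more edge cases).

```python
def merge_pieces(pieces: list[str], sep: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedily pack pieces into chunks of at most `chunk_size` characters."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        if current and len(sep.join(current + [piece])) > chunk_size:
            chunks.append(sep.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


def character_split(text: str, sep: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split on a single separator, then merge the pieces back into chunks."""
    return merge_pieces(text.split(sep), sep, chunk_size, chunk_overlap)


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(character_split(text3, "\n\n", 26, 4))  # separator absent: one oversized chunk
print(character_split(text3, " ", 26, 4))     # separator present: three chunks
```

Because a single-separator splitter never looks for a second separator, a text that does not contain the separator comes back as one chunk, even if it exceeds `chunk_size`.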


### Recursive Splitting with Paragraphs and Regex

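RecursiveCharacterTextSplitter tries a list of separators in order and moves on to the next one only for pieces that are still larger than the chunk size.
The list used below is `\n\n`, `\n`, ` ` (space), and `""` (empty string), which is also the splitter's default, so paragraphs are kept together where possible.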

In [7]:
some_text = (
    "When writing documents, writers will use document structure to group content. "
    "This can convey to the reader, which ideas are related. For example, closely related ideas "
    "are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.\n\n"
    "Paragraphs are often delimited with a carriage return or two carriage returns. "
    "Carriage returns are the 'backslash n' you see embedded in this string. "
    "Sentences have a period at the end, "
    "but also, have a space. and words are separated by space."
)

In [8]:
c_splitter = CharacterTextSplitter(chunk_size=450, chunk_overlap=0, separator=' ')
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450, chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)
print(c_splitter.split_text(some_text))
print(r_splitter.split_text(some_text))

["When writing documents, writers will use document structure to group content. This can convey to the reader, which ideas are related. For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.\n\nParagraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the 'backslash n' you see embedded in this string. Sentences have a period at the end, but also, have", 'a space. and words are separated by space.']
['When writing documents, writers will use document structure to group content. This can convey to the reader, which ideas are related. For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.', "Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the 'backslash n' you see embedded in this string. Sentences have a period at the end, but also, have a space. and words are separated by space."]
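
The idea of trying separators in order can be sketched in a few lines of plain Python. This is an illustrative approximation only: it has no overlap handling, `recursive_split` is a hypothetical name, and it is not LangChain's implementation.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones into pieces still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return recursive_split(text, remaining, chunk_size)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try the finer separators on it.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo = "Para one has words.\n\nPara two is here."
print(recursive_split(demo, ["\n\n", "\n", " ", ""], 20))
print(recursive_split(demo, ["\n\n", "\n", " ", ""], 10))
```

With a budget of 20 characters the text breaks only at the paragraph boundary; with a budget of 10 each paragraph is broken further at spaces.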

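With a chunk size of 450, the recursive splitter breaks at the paragraph boundary (`\n\n`), while the character splitter cuts mid-sentence at a space. Let's reduce the chunk size and add a period to the separators. A plain `"."` leaves the period at the start of the following chunk, as the first output shows, so the second version uses a regex lookbehind, `(?<=\. )`, which splits after the period and its trailing space. In more recent LangChain releases, regex separators may also require passing `is_separator_regex=True`.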

In [9]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", ".", " ", ""]
)
print(r_splitter.split_text(some_text))

['When writing documents, writers will use document structure to group content. This can convey to the reader, which ideas are related. For example, clo', 'sely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.', "Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the 'backslash n' you see embedded in this string", '. Sentences have a period at the end, but also, have a space. and words are separated by space.']


In [10]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\\. )", " ", ""]
)
print(r_splitter.split_text(some_text))

['When writing documents, writers will use document structure to group content. This can convey to the reader, which ideas are related.', 'For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.', 'Paragraphs are often delimited with a carriage return or two carriage returns.', "Carriage returns are the 'backslash n' you see embedded in this string. Sentences have a period at the end, but also, have a space.", 'and words are separated by space.']
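
The effect of the lookbehind can be reproduced with Python's standard `re` module, independent of LangChain. The sample sentence is made up for illustration, and splitting on a zero-width match assumes Python 3.7 or later.

```python
import re

sentence = "A one. B two. C"

# Splitting on the period consumes it and leaves a leading space on later pieces.
plain = sentence.split(".")

# A lookbehind matches the empty position after ". ", so nothing is consumed:
# the period and the space stay attached to the sentence they end.
lookbehind = re.split(r"(?<=\. )", sentence)

print(plain)
print(lookbehind)
```

Because the lookbehind is zero-width, joining the pieces with an empty string gives back the original text unchanged.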


### Token-based Splitting

TokenTextSplitter splits text based on token count (e.g., GPT-style tokens).
This is useful when preparing chunks that must respect the LLM's context window (measured in tokens, not characters).
Approximately, 1 token ≈ 4 characters of English text.
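
The 4-characters-per-token rule of thumb can be turned into a quick budget check. This is a rough heuristic sketch, not a tokenizer: `estimate_tokens` and `fits_context` are hypothetical helpers, and real counts depend on the model's tokenizer and on the language of the text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate based on character count only."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, max_tokens: int) -> bool:
    """Check the estimate against a token budget."""
    return estimate_tokens(text) <= max_tokens


chunk = "x" * 400
print(estimate_tokens(chunk))
print(fits_context(chunk, 100))
```

An estimate like this is only useful for sizing; when chunks must respect a hard limit, counting with the model's actual tokenizer (as TokenTextSplitter does) is the safer choice.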

In [11]:
from langchain.text_splitter import TokenTextSplitter

# Example 1: Basic tokenization (chunk_size=1 for individual tokens)
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
tokens = text_splitter.split_text(text1)

print("Tokenized output with chunk_size=1:", tokens)

Tokenized output with chunk_size=1: ['foo', ' bar', ' b', 'az', 'zy', 'foo']


In [12]:
# Example 2: Larger chunk size - first load the document

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

# This will group up to 10 tokens together per chunk.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)  # each chunk keeps the metadata of the page it came from

print("First split document:", docs[0].page_content)
print("Metadata retained:", docs[0].metadata)

First split document: MachineLearning-Lecture01  

Metadata retained: {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}


Notes:
- Token-based splitting aligns better with transformer model constraints.
- Useful when targeting specific models like GPT-3.5, GPT-4, etc., which have max token limits.
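
The metadata line in the output above shows that each chunk inherits the source and page of its parent document. A minimal sketch of that behaviour follows; `Doc` is a hypothetical stand-in for LangChain's document class, and whitespace-separated words stand in for real tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], words_per_chunk: int) -> list[Doc]:
    """Split each document into word windows, copying the parent's metadata."""
    out: list[Doc] = []
    for doc in docs:
        words = doc.page_content.split()
        for i in range(0, len(words), words_per_chunk):
            text = " ".join(words[i:i + words_per_chunk])
            # Copy the dict so that editing one chunk's metadata does not affect the others.
            out.append(Doc(text, dict(doc.metadata)))
    return out


pages = [Doc("one two three four five", {"source": "a.pdf", "page": 0})]
for piece in split_docs(pages, 2):
    print(piece.page_content, piece.metadata)
```

Copying the metadata per chunk is what later lets a retrieved chunk be traced back to its file and page.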

### Context-Aware Splitting

Context-aware splitting uses the document's own structure, rather than fixed character or token counts, to decide where chunks begin and end.

Chunking aims to keep text with common context together. Text splitters often rely on sentences or other delimiters to keep related text together, but many documents (such as Markdown) have explicit structure (headers) that can be used for splitting. MarkdownHeaderTextSplitter splits on those headers and stores them as metadata on each chunk, as shown below.


In [13]:
# Some documents (especially wikis, technical notes, Notion exports) are structured with markdown headers.
# This splitter extracts and attaches headers (H1, H2, H3, etc.) as metadata for each chunk.
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# Title

## Chapter 1

Hi this is Jim

Hi this is Joe

### Section

Hi this is Lance

## Chapter 2

Hi this is Molly"""

# Define which markdown headers to capture and their metadata labels
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

md_chunks = markdown_splitter.split_text(markdown_document)

# Examine output: content + metadata
for idx, chunk in enumerate(md_chunks[:3]):
    print(f"Chunk {idx + 1} content:\n{chunk.page_content}\n")
    print(f"Metadata: {chunk.metadata}\n{'-'*40}")

Chunk 1 content:
Hi this is Jim  
Hi this is Joe

Metadata: {'Header 1': 'Title', 'Header 2': 'Chapter 1'}
----------------------------------------
Chunk 2 content:
Hi this is Lance

Metadata: {'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
----------------------------------------
Chunk 3 content:
Hi this is Molly

Metadata: {'Header 1': 'Title', 'Header 2': 'Chapter 2'}
----------------------------------------
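
The header tracking behind this output can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `split_by_headers` is a hypothetical function, chunks are plain dicts, and details such as fenced code blocks inside Markdown are ignored.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group body lines under the headers that are active when they appear."""
    # Check longer prefixes first so "###" is not mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    active: dict[int, tuple[str, str]] = {}  # header level -> (label, title)
    body: list[str] = []
    chunks: list[dict] = []

    def flush() -> None:
        """Emit the collected body lines with a snapshot of the active headers."""
        if body:
            metadata = {label: title for _, (label, title) in sorted(active.items())}
            chunks.append({"content": "\n".join(body), "metadata": metadata})
            body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        hit = next(((p, l) for p, l in ordered if stripped.startswith(p + " ")), None)
        if hit:
            flush()
            prefix, label = hit
            level = len(prefix)
            # A new header closes every header at the same or a deeper level.
            for old in [lvl for lvl in active if lvl >= level]:
                del active[old]
            active[level] = (label, stripped[level:].strip())
        elif stripped:
            body.append(stripped)
    flush()
    return chunks


sample_md = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for item in split_by_headers(sample_md, levels):
    print(item)
```

Note that a header is only recognised when it starts a line, so documents should be joined with newlines, not spaces, before being passed to a header-based splitter.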


In [14]:
# 🧪 Try this on a real Notion Markdown export (previously loaded Notion DB)
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("docs/Notion_DB")
notion_docs = loader.load()

# Combine content from all documents for header splitting.
# Join with blank lines so a header at the start of a document stays at the start of a line.
full_markdown_text = "\n\n".join([doc.page_content for doc in notion_docs])

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "Header 1"),
    ("##", "Header 2"),
])

md_splits = markdown_splitter.split_text(full_markdown_text)

# Print first structured chunk
print("Example split with header-based metadata:")
print("\n")
print("Content:", md_splits[0].page_content)
print("\n")
print("Metadata:", md_splits[0].metadata)

Example split with header-based metadata:


Content: This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  
**Everything related to working at Blendle and the people of Blendle, made public.**  
These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  
We've made this document public b

Notes:
- This approach is context-aware and ensures chunks preserve logical section boundaries.
- Very effective when used with structured technical documents or user manuals.

## Summary
- LangChain provides various **splitting techniques**: character, recursive, token-based, and header-aware.
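- RecursiveCharacterTextSplitter is the usual default for generic text because it tries larger separators (paragraphs, lines) before smaller ones.
- Token-based splitting is helpful for LLM context sizing.
- MarkdownHeaderTextSplitter splits on document structure and keeps the headers as chunk metadata.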

**Next Step:** Use vector stores and embeddings to store and retrieve these chunks for answering questions.