Why do we need text splitting?

Even when a PDF loads correctly, a single loaded page can be:
- too long for the LLM context window
- too broad for good retrieval (an embedding of a long passage blends several topics, so focused chunks match queries better)

So we split into smaller, overlapping chunks that are:
- searchable (via embeddings)
- small enough to pass into the LLM
- still coherent (overlap keeps continuity)
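
To make "overlapping" concrete, here is a minimal fixed-window sketch in plain Python. It is only an illustration: `sliding_chunks` is a hypothetical helper written for this example, and it cuts at fixed character offsets, whereas LangChain's splitters prefer to cut at separators such as newlines and spaces.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only (not a LangChain API).
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

The last 3 characters of each chunk are repeated at the start of the next one, which is what lets a sentence that straddles a boundary remain readable in at least one chunk.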

#### RecursiveCharacterTextSplitter (most commonly used)

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

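# Load the PDF; by default PyPDFLoader returns one Document per page
docs = PyPDFLoader('Introduction_to_Python_Programming_-_WEB.pdf').load()

# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# Each page is split on its own, so a chunk never spans two pages
splits = splitter.split_documents(docs)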

print('Original docs:', len(docs))
print('Split chunks:', len(splits))


print("\nExample chunk metadata:", splits[0].metadata)
print("\nExample chunk preview:\n", splits[0].page_content[:400])
print("\nChunk length:", len(splits[0].page_content))

Original docs: 415
Split chunks: 800

Example chunk metadata: {'producer': 'Prince 15 (www.princexml.com)', 'creator': 'PyPDF', 'creationdate': '2024-03-15T15:25:16-05:00', 'moddate': '2024-03-15T15:25:16-05:00', 'title': 'Introduction to Python Programming', 'source': 'Introduction_to_Python_Programming_-_WEB.pdf', 'total_pages': 415, 'page': 2, 'page_label': '3'}

Example chunk preview:
 Introduction to Python Programming          SENIOR CONTRIBUTING AUTHORS UDAYAN DAS, SAINT MARY'S COLLEGE OF CALIFORNIA AUBREY LAWSON, WILEY CHRIS MAYFIELD, JAMES MADISON UNIVERSITY NARGES NOROUZI, UC BERKELEY

Chunk length: 208
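
Note that the first chunk is only 208 characters even though chunk_size=1000. chunk_size is an upper bound, not a target: the splitter cuts at separators and never merges text across pages. By default it tries the separators "\n\n", "\n", " " and finally "" (single characters), moving to a finer separator only when a piece is still too long. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not LangChain's actual implementation: `recursive_split` is a hypothetical name, and overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Simplified sketch (no overlap): try the coarsest separator first and
    recurse with finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: split it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
pieces = recursive_split(sample, chunk_size=8)
print(pieces)
```

The first paragraph fits within 8 characters and stays whole; the second does not, so it is split again on spaces.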


In [4]:
def try_split(chunk_size, chunk_overlap):
    """Split `docs` with the given settings and print chunk count and average length."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    splits = splitter.split_documents(docs)
    # max(..., 1) avoids a ZeroDivisionError if no text was extracted
    avg_len = sum(len(s.page_content) for s in splits) / max(len(splits), 1)
    print(f"chunk_size={chunk_size}, overlap={chunk_overlap} -> chunks={len(splits)}, avg_len={avg_len:.0f}")

try_split(500, 50)
try_split(1000, 200)
try_split(1500, 200)


chunk_size=500, overlap=50 -> chunks=1383, avg_len=404
chunk_size=1000, overlap=200 -> chunks=800, avg_len=754
chunk_size=1500, overlap=200 -> chunks=548, avg_len=1027
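
A quick way to reason about these counts: with a strict sliding window, a page of L characters needs 1 chunk if L <= chunk_size, and otherwise ceil((L - overlap) / (chunk_size - overlap)) chunks. The sketch below applies that arithmetic to a few made-up page lengths. `estimate_chunks` and the lengths are assumptions for illustration, not values from the PDF above. The real splitter cuts at separators, so its chunks tend to be shorter than chunk_size (as the avg_len column shows) and its counts usually come out somewhat higher than this estimate.

```python
import math


def estimate_chunks(page_lengths, chunk_size, chunk_overlap):
    """Estimate the chunk count for a strict sliding window, page by page.

    Hypothetical helper: a rough baseline, not what LangChain computes.
    """
    step = chunk_size - chunk_overlap
    total = 0
    for length in page_lengths:
        if length == 0:
            continue  # empty pages produce no chunks
        if length <= chunk_size:
            total += 1
        else:
            total += math.ceil((length - chunk_overlap) / step)
    return total


# Made-up page lengths in characters (not taken from the PDF above)
page_lengths = [2500, 800, 1200]
for size, overlap in [(500, 50), (1000, 200)]:
    print(size, overlap, estimate_chunks(page_lengths, size, overlap))
```

To compare against real data, the same function could be fed `[len(d.page_content) for d in docs]`.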


In [6]:
# Check for empty (whitespace-only) chunks
empty = [s for s in splits if not s.page_content.strip()]
print("Empty chunks:", len(empty))

Empty chunks: 0


#### Practice Problem 1)

1) Using the PDFs in your papers/ folder:
    - Split with chunk_size=1000, chunk_overlap=200
2) Print:
    - the number of chunks per PDF (per paper_name)
    - one chunk from each PDF (metadata + first 200 chars)

In [9]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from collections import defaultdict

# Load every PDF under papers/ (one Document per page)
docs1 = DirectoryLoader(
    "papers",
    glob='**/*.pdf',
    loader_cls=PyPDFLoader,
).load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

splits = splitter.split_documents(docs1)

# Count chunks per source file; each chunk inherits its page's metadata
chunks_per_pdf = defaultdict(int)
for s in splits:
    chunks_per_pdf[s.metadata['source']] += 1

print('Chunks per PDF:')
for source, count in chunks_per_pdf.items():
    print(source, count)


Chunks per PDF:
papers/2203.14465v2.pdf 136
papers/2501.12948v1.pdf 79
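
The cell above covers the chunk counts. The remaining item, one sample chunk per PDF with its metadata and first 200 characters, could be handled along these lines. This is a sketch: `first_chunk_per_source` and `show_samples` are hypothetical helpers, and the demo objects are stand-ins with made-up file names so the cell runs without any PDFs. They only need the `.metadata` and `.page_content` attributes that LangChain `Document` objects also have.

```python
from types import SimpleNamespace


def first_chunk_per_source(chunks):
    """Map each metadata['source'] to the first chunk seen for it."""
    first = {}
    for chunk in chunks:
        first.setdefault(chunk.metadata["source"], chunk)
    return first


def show_samples(chunks, preview_chars=200):
    """Print metadata and a short preview of one chunk per source file."""
    for source, chunk in first_chunk_per_source(chunks).items():
        print(f"\n--- {source} ---")
        print("metadata:", chunk.metadata)
        print("preview:", chunk.page_content[:preview_chars])


# Stand-in chunks with made-up sources (assumption: same attribute names
# as LangChain Document objects)
demo_splits = [
    SimpleNamespace(page_content="Alpha intro", metadata={"source": "papers/a.pdf", "page": 0}),
    SimpleNamespace(page_content="Alpha methods", metadata={"source": "papers/a.pdf", "page": 1}),
    SimpleNamespace(page_content="Beta intro", metadata={"source": "papers/b.pdf", "page": 0}),
]
samples = first_chunk_per_source(demo_splits)
show_samples(demo_splits)
```

In the notebook, calling `show_samples(splits)` on the real chunks would print one preview per paper.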
