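LLMs such as Gemini or GPT cannot process an arbitrarily long document in a single request, because every model has a fixed context window measured in tokens.
So we split large documents into manageable chunks (such as paragraphs or sections), usually with a small overlap between neighbouring chunks to preserve context.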
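
To make the idea of overlapping chunks concrete, here is a minimal pure-Python sketch. The helper name `chunk_with_overlap` and the sample text are illustrative assumptions, not part of LangChain; the parameter names simply mirror the `chunk_size` and `chunk_overlap` settings used later in this section.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so the end of one
    chunk is repeated at the start of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, chosen so the overlap is easy to see.
text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is how a splitter keeps some context across a cut.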

Real-life Example:

Imagine you have loaded a 20-page PDF, such as a user manual. To answer questions about it efficiently, we split it into sections such as "Installation", "Usage", and "Troubleshooting", so that each question only needs the relevant section.
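
As a rough illustration of section-based splitting, the sketch below cuts a tiny made-up manual at its headings. The manual text, the `SECTION_TITLES` list, and the `split_by_headings` helper are all hypothetical; a real PDF would first need its text extracted by a loader.

```python
import re

# Hypothetical manual text; each section starts with a title on its own line.
manual = """Installation
Unpack the device and plug it in.

Usage
Press the power button to start.

Troubleshooting
If the light blinks red, restart the device.
"""

SECTION_TITLES = ["Installation", "Usage", "Troubleshooting"]


def split_by_headings(text: str, titles: list[str]) -> dict[str, str]:
    """Return a mapping of section title -> section body."""
    # Match a line that consists only of one of the known titles.
    alternatives = "|".join(re.escape(title) for title in titles)
    pattern = re.compile(rf"^({alternatives})[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(text))

    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


sections = split_by_headings(manual, SECTION_TITLES)
for title, body in sections.items():
    print(f"{title}: {body}")
```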

LangChain provides text splitters for this, such as:

| Splitter Type                    | Use Case                                      |
| -------------------------------- | --------------------------------------------- |
| `CharacterTextSplitter`          | Basic splitting on a single separator         |
| `RecursiveCharacterTextSplitter` | Best for preserving paragraphs and sentences  |


CharacterTextSplitter (Basic Split)

In [1]:
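# Recent LangChain releases ship loaders and splitters in separate packages
# (langchain-community and langchain-text-splitters).
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter

# Download the page; load() returns a list of Document objects.
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
loader = WebBaseLoader(url)
document = loader.load()

# Split on newlines only, then merge lines into chunks of about 500 characters.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,       # target maximum characters per chunk
    chunk_overlap=50,     # characters shared between neighbouring chunks
    length_function=len,  # measure chunk length in characters
)

docs = text_splitter.split_documents(documents=document)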

print(f"Basic split: {len(docs)} chunks")
print("Sample:\n", docs[0].page_content)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Created a chunk of size 1375, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 631, which is longer than the specified 500
Created a chunk of size 763, which is longer than the specified 500
Created a chunk of size 962, which is longer than the specified 500
Created a chunk of size 664, which is longer than the specified 500
Created a chunk of size 740, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 600, which is longer than the specified 500
Created a chunk of size 603, which is longer than the specified 500
Created a chunk of size 618, which is longer than the specified 500
Created a chunk of size 536, which is longer than the specified 500
Created a chunk of size 649, which is longer than the specified 500
Created a chunk of size 844

Basic split: 490 chunks
Sample:
 Artificial intelligence - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
		Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
Goals
Toggle Goals subsection
1.1
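
The "Created a chunk of size ..." warnings above appear because a splitter that cuts only on one separator cannot shrink a piece that contains no separator. The sketch below is a simplified pure-Python illustration of that behaviour, not LangChain's actual implementation: `split_on_separator` is a hypothetical helper, the sample text is made up, and `chunk_overlap` is ignored to keep it short.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on a single separator, then merge pieces up to chunk_size."""
    pieces = [piece for piece in text.split(separator) if piece]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk. If this piece alone is longer than
            # chunk_size, there is no separator left to cut it on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input: one 40-character line with no newline inside it.
text = "short line\n" + "x" * 40 + "\nanother short line"
chunks = split_on_separator(text, separator="\n", chunk_size=25)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [10, 40, 18]
```

The middle chunk is 40 characters long even though `chunk_size` is 25, which is the same situation the warnings report.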


RecursiveCharacterTextSplitter (Smart Split – Recommended)

In [2]:
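from langchain_text_splitters import RecursiveCharacterTextSplitter

# Tries paragraph breaks first, then line breaks, then spaces, so chunks
# tend to end at natural boundaries and stay within chunk_size.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between neighbouring chunks
)

# Reuses `document` loaded in the previous cell.
docs = text_splitter.split_documents(document)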

print(f"Smart split: {len(docs)} chunks")
print("Sample:\n", docs[0].page_content)

Smart split: 634 chunks
Sample:
 Artificial intelligence - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

		Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

		Contribute

HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search

Appearance

Donate

Create account

Log in

Personal tools

Donate Create account Log in
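
To show why the recursive approach stays within `chunk_size`, here is a simplified pure-Python sketch of the strategy: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is an illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, the separator order (`"\n\n"`, `"\n"`, `" "`, `""`) is assumed to match the library's defaults, and `chunk_overlap` is left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring coarse separators over fine ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(separator):
        if len(part) > chunk_size:
            # Still too long: retry this part with the finer separators.
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part.strip():
            pieces.append(part)

    # Greedily merge neighbouring pieces while the result still fits.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text with three paragraphs of different lengths.
text = (
    "Chunking keeps each piece small.\n\n"
    "Paragraph breaks are tried first, then line breaks, then spaces.\n\n"
    "Short."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits and is kept whole, while the second is too long and is cut at a space rather than in the middle of a word.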
