## Character Text Splitter

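`CharacterTextSplitter` breaks large documents into smaller, manageable chunks based on characters.
It splits the raw text on a single separator (`"\n\n"` by default; a space or a custom marker also works) and then merges the resulting pieces into chunks.
You control the chunk size (the target maximum number of characters per chunk) and the chunk overlap (how many characters may repeat between consecutive chunks).
A piece that contains no separator is never cut, so a chunk can end up longer than the chunk size; the splitter prints a warning when that happens.
This is especially useful when a text is too long for an LLM's context window.
The overlap helps preserve continuity across chunks, and the smaller chunks are easier to embed and retrieve.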
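To make the split-then-merge idea concrete, here is a minimal pure-Python sketch. The function name `split_and_merge` is hypothetical and the logic is a simplified illustration under the assumptions stated in the comments; it is not LangChain's actual implementation.

```python
def split_and_merge(text, separator="\n\n", chunk_size=100, chunk_overlap=20):
    """Simplified illustration only; not LangChain's actual implementation."""
    # 1. Cut the text at every occurrence of the separator.
    pieces = [p.strip() for p in text.split(separator) if p.strip()]

    chunks, current = [], []
    for piece in pieces:
        # 2. If adding this piece would overflow the chunk, emit what we have.
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # 3. Keep trailing pieces as overlap, dropping from the front until
            #    the carry-over fits chunk_overlap and leaves room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_and_merge(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['aaaa\n\nbbbb', 'bbbb\n\ncccc', 'cccc\n\ndddd']
```

Each chunk repeats the last piece of the previous one, which is the overlap. Note that the sketch never cuts inside a piece, so a single piece longer than `chunk_size` is emitted as an oversized chunk.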

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
doc = loader.load()
doc

[Document(metadata={'source': 'speech.txt'}, page_content='Good morning everyone,  \nI am very happy to be here today.  \nFirst of all, I want to thank you all for giving me this opportunity to speak.  \n\nToday, I want to share a few thoughts about the importance of learning and growth.  \nLearning is a lifelong journey. It does not stop after school or college.  \nEvery experience in life teaches us something new, and every challenge helps us grow stronger.  \n\nWe should never be afraid of making mistakes, because mistakes are proof that we are trying.  \nWhat matters is that we learn from them and keep moving forward.  \n\nSo let us stay curious, stay motivated, and never stop improving ourselves.  \nThank you.\n')]

In [11]:
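from langchain_text_splitters import CharacterTextSplitter

# Split on blank lines only; every paragraph in speech.txt is longer than
# chunk_size, so the "longer than the specified 100" warnings are expected.
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
splitter.split_documents(doc)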

Created a chunk of size 141, which is longer than the specified 100
Created a chunk of size 257, which is longer than the specified 100
Created a chunk of size 161, which is longer than the specified 100


[Document(metadata={'source': 'speech.txt'}, page_content='Good morning everyone,  \nI am very happy to be here today.  \nFirst of all, I want to thank you all for giving me this opportunity to speak.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Today, I want to share a few thoughts about the importance of learning and growth.  \nLearning is a lifelong journey. It does not stop after school or college.  \nEvery experience in life teaches us something new, and every challenge helps us grow stronger.'),
 Document(metadata={'source': 'speech.txt'}, page_content='We should never be afraid of making mistakes, because mistakes are proof that we are trying.  \nWhat matters is that we learn from them and keep moving forward.'),
 Document(metadata={'source': 'speech.txt'}, page_content='So let us stay curious, stay motivated, and never stop improving ourselves.  \nThank you.')]
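The warnings above appear because each blank-line-delimited paragraph is already longer than `chunk_size`, and the splitter does not cut inside a paragraph. As a rough diagnostic, the sketch below lists which pieces exceed the limit before any splitter is run. The helper name `oversized_pieces` and the sample text are hypothetical, and the lengths it reports may differ slightly from the library's warnings depending on how whitespace is handled.

```python
def oversized_pieces(text, separator="\n\n", chunk_size=100):
    """Return (index, length) for each separator-delimited piece longer than chunk_size."""
    pieces = text.split(separator)
    return [(i, len(p)) for i, p in enumerate(pieces) if len(p) > chunk_size]


# Illustrative text: one short paragraph followed by one long paragraph.
sample_text = "A short paragraph.\n\n" + "A much longer paragraph. " * 5
too_long = oversized_pieces(sample_text, chunk_size=40)
print(too_long)
```

If this list is non-empty, choosing a finer separator (for example a single newline or a space) or a larger `chunk_size` is one way to bring the chunks back under the limit.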

In [12]:
# Read the raw text directly instead of going through a document loader
with open("speech.txt") as f:
    speech = f.read()

speech

'Good morning everyone,  \nI am very happy to be here today.  \nFirst of all, I want to thank you all for giving me this opportunity to speak.  \n\nToday, I want to share a few thoughts about the importance of learning and growth.  \nLearning is a lifelong journey. It does not stop after school or college.  \nEvery experience in life teaches us something new, and every challenge helps us grow stronger.  \n\nWe should never be afraid of making mistakes, because mistakes are proof that we are trying.  \nWhat matters is that we learn from them and keep moving forward.  \n\nSo let us stay curious, stay motivated, and never stop improving ourselves.  \nThank you.\n'

In [13]:
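# No separator is passed, so the default "\n\n" is used; the paragraphs are
# still longer than chunk_size, hence the same warnings as before.
splitter = CharacterTextSplitter(chunk_size=40, chunk_overlap=10)
# create_documents takes a list of raw strings and returns Document objects
text = splitter.create_documents([speech])
print(text[0])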


Created a chunk of size 141, which is longer than the specified 40
Created a chunk of size 257, which is longer than the specified 40
Created a chunk of size 161, which is longer than the specified 40


page_content='Good morning everyone,  
I am very happy to be here today.  
First of all, I want to thank you all for giving me this opportunity to speak.'
