# ✂️ Text Splitters

Splitting long documents into smaller chunks is a key step in many LLM workflows. It helps with:

- 📏 Handling model input limits  
- 🎯 Improving search & retrieval accuracy  
- 🚀 Optimizing performance  
- 🧠 Creating better embeddings  

![Split Documents](assets/text_splitters-7961ccc13e05e2fd7f7f58048e082f47.png "Split Documents")

---

## 💡 Why Split?

Real-world texts vary in length and structure. Splitting makes processing **consistent** and **effective**, especially for retrieval and summarization.

---

## 🔧 Splitting Strategies

### 🔹 Length-Based
Split by size — tokens or characters. Simple, fast, and model-friendly.

- ✅ Easy to use
- ✅ Consistent chunk sizes
- 🔎 Use token counts when chunks must fit a model's input limit; use character counts for general text processing


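```python
from langchain_text_splitters import CharacterTextSplitter

document = "First paragraph.\n\nSecond paragraph."  # placeholder: use your own text

# Measure chunk_size in cl100k_base tokens (needs the tiktoken package)
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(document)  # list of strings
```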

### 🔹 Text-Structure-Based
Use natural language structure — paragraphs, sentences, words.

- 🧩 Keeps chunks readable and coherent
- 🧠 Great for semantic preservation  
- 💡 Uses recursive splitting when chunks are too big


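```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document = "First paragraph.\n\nSecond paragraph."  # placeholder: use your own text

# chunk_size is measured in characters by default
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(document)  # list of strings
```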
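
To make the "recursive" part concrete, here is a minimal pure-Python sketch of the idea: try the coarsest separator first, and fall back to finer ones only for pieces that are still too big. The function name `recursive_split` and the separator order are assumptions for illustration; this is not the library's implementation and it ignores overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer and needs splitting into words."
chunks = recursive_split(sample, chunk_size=30)
```

The short first paragraph stays whole, while the longer one is broken at word boundaries because no paragraph or line break fits the limit.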


### 🔹 Document-Structure-Based
Split along the structure the file format already provides: Markdown headers, HTML tags, JSON objects, or code blocks.

- 🗂️ Maintains logical sections
- 🤖 Ideal for structured formats like code or web pages
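
As a rough illustration of structure-based splitting, the sketch below cuts a Markdown string at its headers and records the header trail for each section. The helper `split_markdown_by_headers` and its output shape are hypothetical, written with the standard library only; a dedicated splitter from a library would be the usual choice in practice.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown_text):
    """Return a list of {"headers": [...], "content": str} sections (illustrative sketch)."""
    sections, path, lines = [], [], []   # path holds (level, title) pairs
    in_fence = False

    def flush():
        content = "\n".join(lines).strip()
        if content:
            sections.append({"headers": [t for _, t in path], "content": content})
        lines.clear()

    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence      # ignore "#" lines inside code fences
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before adding this one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            lines.append(line)
    flush()
    return sections


doc = "\n".join([
    "# Guide", "Intro text.", "",
    "## Install", "Run the installer.", "",
    "## Usage", FENCE + "python", "# not a header", "print('hi')", FENCE,
])
sections = split_markdown_by_headers(doc)
```

Each section keeps its header trail (for example `["Guide", "Install"]`), so the logical context travels with the chunk.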

### 🔹 Semantic-Based
Split by meaning using embeddings.

- 🧠 More intelligent breaks based on content shifts
- 🧪 Useful for high-quality search or summarization
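
One way to picture semantic splitting: embed each sentence, compare neighbors, and start a new chunk when similarity drops. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and `semantic_split` with its `threshold` value are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(text, threshold=0.2):
    """Group adjacent sentences; break where neighbor similarity falls below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])         # content shift: start a new chunk
        else:
            groups[-1].append(cur)
    return [" ".join(g) for g in groups]


text = (
    "Cats sleep most of the day. Cats also sleep at night. "
    "Python is a programming language. Python code is easy to read."
)
semantic_chunks = semantic_split(text)
```

With a real embedding model in place of `embed`, the same loop would pick up topic shifts that share no vocabulary, which word counts cannot detect.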

---

## 📚 Further Reading

- [Semantic Splitting Notebook (by Greg Kamradt)](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

