# SESSION 13 : Text Splitters in LangChain | Generative AI using LangChain | Video 11

https://youtu.be/SEWS9P4ODmc?list=PLKnIA16_RmvaTbihpo4MtzVm4XOQa0ER0

__Text Splitting__ is the process of breaking large chunks of text (like articles, PDFs, HTML pages, or books) into smaller, manageable pieces (chunks) that an LLM can handle effectively.

### 🔹 Example Usage

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Your long document text here..."
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(text)

print(chunks[:2])  # First 2 chunks
```

### Why we need text splitting

- __Overcoming model limitations__ : Many embedding models and language models have maximum input size constraints __(context window)__. Splitting allows us to process documents that would otherwise exceed these limits.



- __Downstream tasks:__ Text splitting improves nearly every LLM-powered task.

![image.png](attachment:image.png)

- __Optimizing computational resources:__ Working with smaller chunks of text can be more memory-efficient and allows better parallelization of processing tasks.

![Screenshot%202025-08-25%20003921.png](attachment:Screenshot%202025-08-25%20003921.png)

### 🔹 1. **CharacterTextSplitter**

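* Splits by character count, for example every 1000 characters.


* By default it first splits on a separator (`"\n\n"`) and then merges the pieces up to `chunk_size`. To cut purely by a fixed number of characters, pass `separator=""`.

```python
from langchain_text_splitters import CharacterTextSplitter

text = "This is a very long text that needs splitting."

# separator="" makes the splitter cut on character count alone;
# with the default separator ("\n\n") this text would come back as a single chunk.
splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=2)
print(splitter.split_text(text))
```

👉 Output: chunks of at most 10 characters, where consecutive chunks overlap by 2 characters (whitespace at chunk edges is stripped by default).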
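To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size character chunking. The function name `chunk_by_characters` is hypothetical and this is not LangChain's implementation; it only shows the sliding-window idea.

```python
def chunk_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_by_characters(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts 8 characters after the previous one (`chunk_size - chunk_overlap`), so the last 2 characters of one chunk are repeated at the start of the next.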

### 🔹 2. **TokenTextSplitter**

* Splits by **tokens** rather than characters, using a tokenizer (by default `tiktoken`, the tokenizer library used for OpenAI models).


* Best when the LLM's context length matters, because model limits are counted in tokens.

```python
from langchain_text_splitters import TokenTextSplitter

text = "OpenAI models work with tokens instead of raw characters."
splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2)
print(splitter.split_text(text))
```

👉 Output: chunks of at most 10 tokens each (not 10 characters), with a 2-token overlap.
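The same windowing idea can be sketched over tokens instead of characters. The example below is only an illustration: it uses whitespace-separated words as a stand-in tokenizer, which is an assumption made to keep it dependency-free. A real LLM tokenizer splits text differently, and `chunk_by_tokens` is a hypothetical name.

```python
def chunk_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Group tokens into windows of chunk_size tokens that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # stand-in tokenizer: whitespace words, NOT a real LLM tokenizer
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window has already covered the last token
    return chunks


text = "OpenAI models work with tokens instead of raw characters."
token_chunks = chunk_by_tokens(text, chunk_size=4, chunk_overlap=1)
print(token_chunks)
# ['OpenAI models work with', 'with tokens instead of', 'of raw characters.']
```

Note that the chunks have different character lengths: the limit is on the number of tokens, which is what a model's context window actually counts.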

### 🔹 3. **RecursiveCharacterTextSplitter** (most common)

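* Tries a list of separators in order, from coarsest to finest: paragraphs (`"\n\n"`) → lines (`"\n"`) → words (`" "`) → individual characters.


* __Splits hierarchically__: it only falls back to a finer separator when a piece is still larger than `chunk_size`, so related text stays together as long as possible.


* `chunk_size` is measured in characters by default, not tokens.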

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "LangChain helps build LLM apps. It provides tools for retrieval. You can split text smartly."
splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
print(splitter.split_text(text))
```

👉 Output: chunks of at most 40 characters, cut at word boundaries. This sample has no newlines, so the splitter falls back to splitting on spaces.

### Example: when `chunk_size = 10`

![image.png](attachment:image.png)
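The recursive strategy can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no chunk overlap, and the separator list is assumed to be `"\n\n"`, `"\n"`, `" "`, `""`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the coarsest separator first; recurse with finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left (or the empty one): cut into fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = "Para one is short.\n\nPara two is a bit longer than the first one.\n\nEnd."
result = recursive_split(doc, chunk_size=25)
print(result)
# ['Para one is short.', 'Para two is a bit longer', 'than the first one.', 'End.']
```

The first and last paragraphs fit within 25 characters and are kept whole. Only the middle paragraph is broken further, and it is broken at a space rather than in the middle of a word.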

### 4. Document-structure based

- For documents that are not plain prose, such as source code or Markdown files.


- These are not organised into paragraphs and sentences, so paragraph and sentence separators do not work well.


- __It also uses `RecursiveCharacterTextSplitter`, but with a different list of separators.__

![image.png](attachment:image.png)

##### Types of separators for code (e.g. the `class` and `def` keywords)

![image.png](attachment:image.png)
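As a rough illustration of why keyword separators suit code, the sketch below splits a small Python source string just before each top-level `class` or `def`. It uses only the standard `re` module and is an assumption-laden stand-in, not LangChain's separator list; in LangChain itself, language-specific separators are available through `RecursiveCharacterTextSplitter.from_language`.

```python
import re

source = '''class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def add(a, b):
    return a + b

def main():
    print(add(1, 2))
'''

# Split at a newline that is immediately followed by a top-level "class " or "def ".
# The lookahead keeps the keyword with the block it introduces, and indented
# methods (such as greet) are not split off from their class.
blocks = [b for b in re.split(r"\n(?=class |def )", source) if b.strip()]

for block in blocks:
    print(block.splitlines()[0])
```

Each resulting block is a complete class or function, which is usually a more meaningful unit for retrieval than an arbitrary 1000-character window of code.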

### 🔹 4.1. **MarkdownHeaderTextSplitter**


* Splits based on markdown headers (`#`, `##`, etc.).


* Useful for structured documents such as documentation pages and blog posts.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

md_text = "# Title\nThis is intro.\n## Section 1\nDetails here.\n## Section 2\nMore details."
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Title"), ("##", "Section")])
print(splitter.split_text(md_text))
```

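👉 Output: a list of `Document` objects, one per section (the intro, `Section 1`, `Section 2`), with the matching header text stored in each document's `metadata` under the keys `Title` and `Section`.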

![Screenshot%202025-08-25%20021455.png](attachment:Screenshot%202025-08-25%20021455.png)

##### Types of separators for Markdown

![Screenshot%202025-08-25%20021633.png](attachment:Screenshot%202025-08-25%20021633.png)
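Header-based splitting can be sketched without any library. The function below is a hypothetical, simplified version of the idea (it ignores fenced code blocks and other Markdown edge cases) and returns plain dictionaries rather than LangChain `Document` objects.

```python
def split_markdown_by_headers(md_text, headers_to_split_on):
    """Group body lines under the headers that precede them."""
    # Check longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections, metadata, body = [], {}, []

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"content": content, "metadata": dict(metadata)})
        body.clear()

    for line in md_text.splitlines():
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()  # close the section that ended at this header
                # A new header replaces headers of the same or deeper level.
                for other_marker, other_name in markers:
                    if len(other_marker) >= len(marker):
                        metadata.pop(other_name, None)
                metadata[name] = line[len(marker):].strip()
                break
        else:
            body.append(line)  # not a header line: part of the current section
    flush()
    return sections


md_text = "# Title\nThis is intro.\n## Section 1\nDetails here.\n## Section 2\nMore details."
sections = split_markdown_by_headers(md_text, [("#", "Title"), ("##", "Section")])
for section in sections:
    print(section)
```

Keeping the header trail in `metadata` means each chunk still carries the context of where it sat in the document, even after it is separated from its neighbours.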

### 🔹 4.2. **HTMLHeaderTextSplitter**


* Splits HTML based on its header hierarchy.


* Each chunk is the content found under a header tag (`<h1>`, `<h2>`, etc.).

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

html = "<h1>Intro</h1><p>This is text.</p><h2>Details</h2><p>More info here.</p>"
splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")])
print(splitter.split_text(html))
```

👉 Output: a list of `Document` objects whose `metadata` records the `<h1>` and `<h2>` headers each chunk falls under.


### 🔹 5. **NLTKTextSplitter**

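* Splits text into sentences using the NLTK library, then merges sentences into chunks up to `chunk_size`.


* Requires the `nltk` package and its sentence tokenizer data to be installed.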

```python
from langchain_text_splitters import NLTKTextSplitter

text = "LangChain is awesome. It simplifies LLM apps."
splitter = NLTKTextSplitter(chunk_size=20)
print(splitter.split_text(text))
```

👉 Output: `['LangChain is awesome.', 'It simplifies LLM apps.']`
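Sentence-based splitting has two steps: find sentence boundaries, then pack whole sentences into chunks. The sketch below uses a crude regular expression as a stand-in for NLTK's sentence tokenizer (an assumption to keep it dependency-free; it will mis-split abbreviations such as "e.g."), and the function names are hypothetical.

```python
import re


def split_sentences(text):
    # Crude stand-in for a real sentence tokenizer:
    # break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, chunk_size):
    """Merge whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this sentence would overflow the chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "LangChain is awesome. It simplifies LLM apps. Try it."
sentence_chunks = pack_sentences(split_sentences(text), chunk_size=35)
print(sentence_chunks)
# ['LangChain is awesome.', 'It simplifies LLM apps. Try it.']
```

A sentence is never cut in half: a single sentence longer than `chunk_size` is emitted as its own oversized chunk, which matches the example above where both sentences exceed `chunk_size=20`.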

### 🔹 6. **SpacyTextSplitter**

* Uses spaCy's sentence segmentation to split text; requires the `spacy` package and a language model (`en_core_web_sm` by default).



```python
from langchain_text_splitters import SpacyTextSplitter

text = "I love NLP with spaCy. It provides robust sentence segmentation."
splitter = SpacyTextSplitter(chunk_size=30)
print(splitter.split_text(text))
```

👉 Output: clean sentence-based chunks.

### 7. Semantic Meaning Based (still in experimental phase)

#### Splitting based on the semantic meaning of the text

Farmers were working hard in the fields, preparing the soil and planting seeds for the next season. The sun was bright, and the air smelled of earth and fresh grass. The Indian Premier League (IPL) is the biggest cricket league in the world. People all over the world watch the matches and cheer for their favourite teams.



Terrorism is a big danger to peace and safety. It causes harm to people and creates fear in cities and villages. When such attacks happen, they leave behind pain and sadness. To fight terrorism, we need strong laws, alert security forces, and support from people who care about peace and safety.

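```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # needs langchain-openai and an OpenAI API key

# Start a new chunk wherever the embedding distance between neighbouring
# sentences is more than 3 standard deviations above the mean distance.
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3,
)
```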
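The idea behind a standard-deviation breakpoint can be sketched with toy data. Everything below is illustrative: the 2-dimensional "embeddings" are made-up numbers rather than output from a real model, `semantic_chunks` is a hypothetical helper, and the real `SemanticChunker` does more (for example, it can combine neighbouring sentences before embedding them).

```python
import math
from statistics import mean, pstdev


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def semantic_chunks(sentences, embeddings, threshold_amount=1.0):
    """Break between sentences whose embedding distance is unusually large."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    # Breakpoint rule: distance above mean + threshold_amount * standard deviation.
    threshold = mean(distances) + threshold_amount * pstdev(distances)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))  # topic shift: close the current chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Farmers were planting seeds.",
    "The soil was ready.",
    "The IPL is a cricket league.",
    "Fans cheer for their teams.",
]
# Made-up vectors: the first two point one way (farming), the last two another (cricket).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

topic_chunks = semantic_chunks(sentences, toy_embeddings, threshold_amount=1.0)
print(topic_chunks)
```

With these toy vectors the only large distance is between the second and third sentences, so the farming sentences end up in one chunk and the cricket sentences in another. A lower `threshold_amount` than the 3 used above is chosen here because a sample with only three distances cannot produce a 3-standard-deviation outlier.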

✅ **In short**:

* **Text Splitters** = break large docs into smaller, model-friendly chunks.


* **Most used** → `RecursiveCharacterTextSplitter` (general) and `TokenTextSplitter` (LLM-token aware).

