# **1️⃣ Why is Text Splitting Important?**
LLMs and embedding models accept only a limited number of tokens per input, so long documents must be split into manageable chunks. Chunking provides:

✅ Efficient Retrieval – Ensures relevant text is retrieved accurately.

✅ Reduced Token Overhead – Prevents exceeding LLM token limits.

✅ Improved Generation Quality – Keeps responses contextually relevant.

✅ Better Indexing – Enhances searchability in vector databases.

# **2️⃣ Overview of Text Splitting Techniques**
**We explore two main approaches:**

1️⃣ Character-Based Splitting – Splitting by a fixed maximum character length.

2️⃣ HTML Header Splitting – Splitting documents at structured HTML headings (h1, h2, etc.).

# **3️⃣ Character-Based Text Splitting**
**🔹 Concept**

This method splits text into chunks of at most a fixed number of characters, with consecutive chunks overlapping so that context carries across chunk boundaries.

**🔹 Applications**

📌 Handling Long Texts for LLMs → Splitting large documents for GPT, BERT, etc.

📌 Retrieval-Based AI (RAG) → Preparing text for semantic search in vector databases (FAISS, Pinecone).

📌 Summarization Pipelines → Processing large text blocks before summarization.

In [6]:
# Newer LangChain releases ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample text
text = """Retrieval-Augmented Generation (RAG) models improve AI by retrieving external information.
They enhance response accuracy and reduce hallucinations. RAG is widely used in various AI applications."""

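# Configure the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,    # Maximum characters per chunk
    chunk_overlap=20,  # Maximum characters shared between consecutive chunks
    separators=["\n", " ", ""],  # Try newlines first, then spaces, then individual characters
)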

# Perform splitting
chunks = splitter.split_text(text)

# Display results
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")


Chunk 1: Retrieval-Augmented Generation (RAG) models improve AI by retrieving external information.
Chunk 2: They enhance response accuracy and reduce hallucinations. RAG is widely used in various AI applications.
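
The separator list above is tried from coarsest to finest. The following is a simplified sketch of that idea in plain Python, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, and overlap is deliberately left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarser separators before finer ones (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry it with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Retrieval-Augmented Generation (RAG) models improve AI by retrieving external information.\n"
    "They enhance response accuracy and reduce hallucinations. RAG is widely used in various AI applications."
)
parts = recursive_split(sample, 120)
for i, part in enumerate(parts):
    print(f"Chunk {i+1}: {part}")
```

On the sample text this sketch splits at the newline, because each line fits within 120 characters while the two together do not.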


**📌 Why Use Overlap?**

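Chunk **overlap** (e.g., 20 characters, as in the example above) helps **preserve context across** consecutive chunks, avoiding **loss of meaning** in retrieval and generation. In that example each line fits within `chunk_size`, so the split falls on the newline and no overlapping text appears; overlap becomes visible only when a passage has to be cut partway through.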
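
To make overlap visible, here is a minimal sliding-window sketch in plain Python. It cuts at fixed character offsets and ignores separators entirely, so it is an illustration of overlap rather than a stand-in for `RecursiveCharacterTextSplitter`; the helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks


windows = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(windows)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the one before it (`hij`, `opq`, `vwx`), so a phrase that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.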

# **4️⃣ HTML Header-Based Text Splitting**
**🔹 Concept**

HTML documents organize content under headings (h1, h2, h3). Splitting at these headers keeps each section intact, so every chunk covers one topic and retains its heading as context.

**🔹 Applications**

📌 Processing Web Documents → Extracting structured data from HTML pages.

📌 Summarization of Articles & Reports → Separating sections for better analysis.

📌 Chunking Knowledge Bases → Preparing structured documents for vector search.

In [2]:
!pip install bs4


In [3]:
from bs4 import BeautifulSoup

# Sample HTML text
html_text = """
<h1>Introduction to RAG</h1>
<p>Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving external knowledge.</p>
<h2>Benefits of RAG</h2>
<p>RAG improves accuracy and reduces hallucinations.</p>
<h2>Applications</h2>
<p>Used in chatbots, search engines, and enterprise AI.</p>
"""

# Parse HTML
soup = BeautifulSoup(html_text, "html.parser")

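# Extract chunks based on headers
HEADER_TAGS = ["h1", "h2", "h3"]
chunks = []
for header in soup.find_all(HEADER_TAGS):
    section = header.get_text(strip=True)  # Header text
    # Collect every paragraph up to the next header, so multi-paragraph
    # sections stay whole and an empty section never borrows a later paragraph
    paragraphs = []
    for sibling in header.find_next_siblings():
        if sibling.name in HEADER_TAGS:
            break
        if sibling.name == "p":
            paragraphs.append(sibling.get_text(strip=True))
    if paragraphs:
        chunks.append(section + "\n" + "\n".join(paragraphs))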

# Display results
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:
Introduction to RAG
Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving external knowledge.

Chunk 2:
Benefits of RAG
RAG improves accuracy and reduces hallucinations.

Chunk 3:
Applications
Used in chatbots, search engines, and enterprise AI.
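
The chunks above keep only the nearest heading. If the full heading hierarchy matters (for example, knowing that "Benefits of RAG" sits under "Introduction to RAG"), one option is to track the open headings while parsing. The sketch below does this with Python's standard-library `html.parser` so it runs without extra installs; the class name `HeaderSectionParser` and the `headers`/`text` dictionary layout are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect sections as {"headers": [...], "text": "..."} where
    "headers" lists the enclosing headings from outermost to innermost."""

    HEADER_LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.path = []           # (level, title) pairs for the currently open headings
        self.sections = []       # Finished sections
        self._in_header = None   # Level of the header tag being read, if any
        self._buffer = []        # Text collected since the last flush

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_LEVELS:
            self._flush()  # A new header closes the previous section
            self._in_header = self.HEADER_LEVELS[tag]

    def handle_endtag(self, tag):
        if tag in self.HEADER_LEVELS and self._in_header is not None:
            title = "".join(self._buffer).strip()
            self._buffer = []
            level, self._in_header = self._in_header, None
            # Drop headings at the same or a deeper level, then add this one
            self.path = [(lvl, t) for lvl, t in self.path if lvl < level]
            self.path.append((level, title))

    def handle_data(self, data):
        self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())  # Normalize whitespace
        self._buffer = []
        if text:
            self.sections.append({"headers": [t for _, t in self.path], "text": text})

    def close(self):
        super().close()
        self._flush()  # Emit the final section


html_text = """
<h1>Introduction to RAG</h1>
<p>Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving external knowledge.</p>
<h2>Benefits of RAG</h2>
<p>RAG improves accuracy and reduces hallucinations.</p>
<h2>Applications</h2>
<p>Used in chatbots, search engines, and enterprise AI.</p>
"""

parser = HeaderSectionParser()
parser.feed(html_text)
parser.close()
for sec in parser.sections:
    print(" > ".join(sec["headers"]), "|", sec["text"])
```

Storing the heading path as metadata, rather than pasting it into the chunk text, leaves the choice of whether to embed it alongside the body to the retrieval step.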



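# **5️⃣ Combining Both Approaches for Hybrid Splitting**
* **For complex documents, we can first split by HTML headers, then apply character-based chunking within each section.** Sections that already fit within `chunk_size`, like the three in this example, pass through unchanged; only longer sections are split further.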

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example sections from HTML splitting
sections = [
    "Introduction to RAG\nRetrieval-Augmented Generation (RAG) enhances LLMs by retrieving external knowledge.",
    "Benefits of RAG\nRAG improves accuracy and reduces hallucinations.",
    "Applications\nUsed in chatbots, search engines, and enterprise AI."
]

# Apply character splitting within each section
splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=25)

# Process each section separately
final_chunks = []
for section in sections:
    final_chunks.extend(splitter.split_text(section))

# Display results
for i, chunk in enumerate(final_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:
Introduction to RAG
Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving external knowledge.

Chunk 2:
Benefits of RAG
RAG improves accuracy and reduces hallucinations.

Chunk 3:
Applications
Used in chatbots, search engines, and enterprise AI.
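
Because every sample section is shorter than `chunk_size`, the output above is identical to the header-based chunks. The sketch below shows the hybrid step on a deliberately long section, using plain Python so the splitting is visible. It is an illustrative stand-in, not LangChain's algorithm: `split_words` and `hybrid_split` are hypothetical helpers, the long section is built by repeating a sample sentence, and keeping the header as a separate field is an assumption.

```python
def split_words(text, chunk_size, chunk_overlap):
    """Split on word boundaries into chunks of at most chunk_size characters,
    carrying up to chunk_overlap trailing characters (whole words) forward.
    A single word longer than chunk_size is kept as one oversized chunk."""
    chunks, current = [], []
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Gather trailing words that fit within the overlap budget
            overlap = []
            for prev in reversed(current):
                if len(" ".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Skip the overlap if it would push the next chunk over the limit
            if len(" ".join(overlap + [word])) > chunk_size:
                overlap = []
            current = overlap
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def hybrid_split(sections, chunk_size=120, chunk_overlap=25):
    """sections: list of (header, body) pairs from header-based splitting."""
    final_chunks = []
    for header, body in sections:
        for piece in split_words(body, chunk_size, chunk_overlap):
            final_chunks.append({"header": header, "text": piece})
    return final_chunks


sections = [
    ("Benefits of RAG", " ".join(["RAG improves accuracy and reduces hallucinations."] * 4)),
    ("Applications", "Used in chatbots, search engines, and enterprise AI."),
]
result = hybrid_split(sections)
for i, chunk in enumerate(result):
    print(f"Chunk {i+1} [{chunk['header']}]: {chunk['text']}")
```

The long section yields several chunks that all carry the same header, while the short section passes through as a single chunk.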



# **Conclusion & Takeaways**

📌 Character Splitting → Best for unstructured text; keeps every chunk within a fixed maximum length.

📌 HTML Header Splitting → Best for structured documents; maintains section integrity.

📌 Hybrid Splitting → Combines both: chunks follow the document's sections, and long sections are split further to fit size limits.