# Chunking
Text chunking is one of the most critical steps in building a RAG (retrieval-augmented generation) pipeline. How you split your documents determines what the retriever can return, so it directly affects the quality of the whole system. A poor chunking strategy can put irrelevant or truncated context into your prompts, which leads the model to answer incorrectly.

There are three primary approaches to chunking text, each with distinct advantages and trade-offs:

- Size-based: Divide text into pieces of a fixed length (the last piece may be shorter)
- Structure-based: Split based on document structure (headers, paragraphs, sections)
- Semantic-based: Group related sentences or sections using NLP techniques
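
Of the three approaches, the semantic one is the hardest to picture without code. The following is only a minimal sketch: it assumes word overlap (Jaccard similarity) between adjacent sentences as a crude stand-in for the embedding similarity a real pipeline would more likely use. The names `jaccard` and `chunk_by_similarity` and the `threshold` value are hypothetical and chosen for illustration.

```python
import re


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def chunk_by_similarity(document_text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences share too few words."""
    sentences = [
        s for s in re.split(r"(?<=[.!?])\s+", document_text.strip()) if s
    ]
    if not sentences:
        return []

    groups = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        prev_words = set(re.findall(r"\w+", prev.lower()))
        curr_words = set(re.findall(r"\w+", curr.lower()))
        if jaccard(prev_words, curr_words) >= threshold:
            groups[-1].append(curr)  # similar enough: keep in the same chunk
        else:
            groups.append([curr])  # topic shift: start a new chunk
    return [" ".join(group) for group in groups]


sample = (
    "The cache stores query results. The cache evicts old results first. "
    "Revenue grew in the third quarter."
)
semantic_chunks = chunk_by_similarity(sample)
print(semantic_chunks)
```

With this sample, the two cache sentences stay together and the revenue sentence becomes its own chunk. Word overlap is brittle (synonyms score zero), so treat it as an illustration of the grouping logic rather than a recommended similarity measure.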

In [1]:
import re

In [2]:
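def chunk_by_char(
    document_text: str, chunk_size: int = 150, chunk_overlap: int = 20
) -> list[str]:
    """Chunk text into fixed-size character chunks with overlap"""
    # The overlap must be smaller than the chunk size, otherwise the window
    # never advances and the loop below would not terminate.
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    chunks = []
    start_idx = 0

    while start_idx < len(document_text):
        end_idx = min(start_idx + chunk_size, len(document_text))
        chunks.append(document_text[start_idx:end_idx])

        # Stop after the final chunk; otherwise step back by the overlap so
        # consecutive chunks share chunk_overlap characters.
        if end_idx == len(document_text):
            break
        start_idx = end_idx - chunk_overlap

    return chunks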
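
To see how chunk size and overlap interact, it can help to look at offsets instead of text. The sketch below uses a hypothetical helper, `char_chunk_spans`, that mirrors the windowing logic of a fixed-size chunker and returns `(start, end)` pairs. Each window advances by `chunk_size - chunk_overlap` characters.

```python
def char_chunk_spans(
    text_len: int, chunk_size: int = 150, chunk_overlap: int = 20
) -> list[tuple[int, int]]:
    """Return the (start, end) offsets of fixed-size chunks with overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    stride = chunk_size - chunk_overlap  # how far each window advances
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start += stride
    return spans


spans = char_chunk_spans(400)
print(spans)  # [(0, 150), (130, 280), (260, 400)]
```

A 400-character document with the default settings yields three chunks, and each chunk starts 20 characters before the previous one ends.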

In [3]:
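def chunk_by_sentence(
    document_text: str, max_sentences_per_chunk: int = 5, overlap_sentences: int = 1
) -> list[str]:
    """Chunk text into chunks based on sentences with overlap"""
    # The step between chunks must be positive, or the loop never advances.
    if not 0 <= overlap_sentences < max_sentences_per_chunk:
        raise ValueError("Require 0 <= overlap_sentences < max_sentences_per_chunk")

    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document_text.strip())
    chunks = []
    start_idx = 0

    while start_idx < len(sentences):
        end_idx = min(start_idx + max_sentences_per_chunk, len(sentences))
        chunks.append(" ".join(sentences[start_idx:end_idx]))

        # Stop once the last sentence is included, so no trailing chunk is
        # emitted that only repeats the overlap.
        if end_idx == len(sentences):
            break
        start_idx += max_sentences_per_chunk - overlap_sentences

    return chunks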
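
The regular expression used for sentence splitting is a heuristic, and it is worth checking what it does on awkward input. This small example applies the same pattern to a sample string (the sample text and the `SENTENCE_BOUNDARY` name are made up for illustration) and shows that an abbreviation such as "Dr." is treated as a sentence end.

```python
import re

# Same pattern as the sentence chunker: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = r"(?<=[.!?])\s+"

sample = "Dr. Lee reviewed the report. It was fine! Any questions?"
sentences = re.split(SENTENCE_BOUNDARY, sample)
print(sentences)
# ['Dr.', 'Lee reviewed the report.', 'It was fine!', 'Any questions?']
```

If your documents contain many abbreviations or decimal-heavy text, a dedicated sentence tokenizer may be a better fit than this pattern.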

In [None]:
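def chunk_by_section(document_text: str) -> list[str]:
    """Chunk text based on Markdown sections (## headers)"""
    # Split at newlines that precede a level-2 header. The lookahead keeps
    # the "## " marker at the start of each section instead of consuming it.
    pattern = r"\n(?=## )"
    return re.split(pattern, document_text)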
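
How the split pattern treats the `## ` marker affects what ends up in each chunk. The sketch below compares two patterns on a made-up Markdown string: a literal `\n## ` match, which consumes the marker, and a lookahead variant, which keeps it. The variable names are illustrative only.

```python
import re

sample = "# Title\n\nIntro.\n\n## A\n\nText a.\n\n## B\n\nText b."

# Literal match: the "## " marker is part of the separator and is dropped.
without_marker = re.split(r"\n## ", sample)

# Lookahead: only the newline is consumed, so each section keeps its header.
with_marker = re.split(r"\n(?=## )", sample)

print(without_marker)  # ['# Title\n\nIntro.\n', 'A\n\nText a.\n', 'B\n\nText b.']
print(with_marker)     # ['# Title\n\nIntro.\n', '## A\n\nText a.\n', '## B\n\nText b.']
```

Both patterns produce the same number of chunks. Keeping the marker makes it easier to tell a section title apart from body text later in the pipeline.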

In [4]:
def display_chunks(cs: list[str]):
    """Utility to display chunks neatly"""
    for i, chunk in enumerate(cs):
        print(f"--- Chunk {i + 1} ---")
        print(chunk)
        print()

In [5]:
# Read the whole report first; chunking does not need the file to stay open.
with open("report.md", "r", encoding="utf-8") as f:
    text: str = f.read()

char_chunks = chunk_by_char(text)
sentence_chunks = chunk_by_sentence(text)

display_chunks(char_chunks)
display_chunks(sentence_chunks)

--- Chunk 1 ---
# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

## Executive Summary

This report synthesizes the key findings and ongoing rese

--- Chunk 2 ---
ngs and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the 

--- Chunk 3 ---
trength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disc

--- Chunk 4 ---
end traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** fo

--- Chunk 5 ---
edical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent st

--- Chunk 6 ---
ackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`). **Fin