# Choosing Your Strategy

The right chunking strategy depends on your use case and on what you can guarantee about document formatting:

1. Structure-based: Best results when you control document formatting (like internal company reports)
2. Sentence-based: Good middle ground for most text documents
3. Size-based: Most reliable fallback that works with any content type, including code
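The trade-off in the list above can be sketched as a small heuristic. This is only an illustration: `pick_strategy` and its thresholds (two headings, three sentence terminators) are hypothetical choices, not something the strategies themselves prescribe, and real documents may need different rules.

```python
import re

def pick_strategy(text):
    """Return the name of a chunking strategy for `text` (illustrative heuristic)."""
    # Assumption: two or more level-2 Markdown headings mean the structure is reliable.
    if len(re.findall(r"^## ", text, flags=re.MULTILINE)) >= 2:
        return "structure"
    # Assumption: three or more sentence terminators followed by whitespace mean prose.
    if len(re.findall(r"[.!?]\s", text)) >= 3:
        return "sentence"
    # Anything else (code, logs, unpunctuated text) falls back to fixed-size chunks.
    return "size"

print(pick_strategy("# Title\n\n## A\ntext\n\n## B\nmore"))  # structure
print(pick_strategy("One. Two! Three? Four."))               # sentence
print(pick_strategy("def f(x):\n    return x + 1"))          # size
```

In practice the decision is often made once per document source rather than per document, since the guarantee about formatting usually comes from where the document originates.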

In [1]:
import re

## Structure-based Chunking

In [2]:
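def chunk_by_structure(document_text):
    """Split a Markdown document into chunks at each level-2 heading."""
    # The separator "\n## " is consumed by the split, so every chunk after
    # the first starts with the heading text, without its "## " marker.
    pattern = r"\n## "

    return re.split(pattern, document_text)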
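Because `re.split` discards whatever the pattern matches, splitting on `\n## ` removes the heading marker from each chunk. If the marker is useful downstream, one possible variant is to match only the newline and use a lookahead for the heading. The function name `chunk_by_structure_keep_headings` and the sample text are made up for this sketch.

```python
import re

def chunk_by_structure_keep_headings(document_text):
    # The lookahead (?=## ) requires a level-2 heading after the newline
    # without consuming it, so "## " stays at the start of the next chunk.
    return re.split(r"\n(?=## )", document_text)

sample = "# Title\n\n## First\nalpha\n\n## Second\nbeta"
for chunk in chunk_by_structure_keep_headings(sample):
    print(repr(chunk))
# '# Title\n'
# '## First\nalpha\n'
# '## Second\nbeta'
```

Keeping the heading inside the chunk gives each chunk a short label of its own, at the cost of a few extra characters per chunk.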

In [10]:
with open("./assets/report.md", "r") as f:
    text = f.read()

chunks = chunk_by_structure(text)

print(f"len of chunks: {len(chunks)}")
print("Chunks by structure:")
for i, chunk in enumerate(chunks):
    print("="*20, f"Chunk {i + 1}", "="*20)
    print(chunk)

len of chunks: 15
Chunks by structure:
# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

Executive Summary

This report synthesizes the key findings and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`). **Financial Analysis** revealed mixed quarterly performance, prompting strategic reviews, particularly concerning resource allocation impacting R&D pipelines.

Crucial develop

## Sentence-based Chunking

In [14]:
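def chunk_by_sentence(text, max_sentences_per_chunk=5, overlap_sentences=1):
    """Group sentences into chunks that share `overlap_sentences` sentences."""
    # Without this check, a step of zero or less would loop forever.
    if not 0 <= overlap_sentences < max_sentences_per_chunk:
        raise ValueError("overlap_sentences must be >= 0 and < max_sentences_per_chunk")

    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks = []
    start_idx = 0
    len_sentences = len(sentences)

    while start_idx < len_sentences:
        end_idx = min(start_idx + max_sentences_per_chunk, len_sentences)
        chunk = sentences[start_idx : end_idx]
        chunks.append(" ".join(chunk))

        # Step forward, leaving the last `overlap_sentences` sentences for the next chunk.
        start_idx += max_sentences_per_chunk - overlap_sentences

    return chunks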
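The sentence boundary used above is a simple regular expression, so it helps to see where it holds and where it does not. The sketch below applies the same pattern to two made-up strings: one clean, and one with an abbreviation that the pattern mistakes for the end of a sentence.

```python
import re

# Same pattern as in chunk_by_sentence: split on whitespace that follows '.', '!' or '?'.
SENTENCE_BOUNDARY = r"(?<=[.!?])\s+"

clean = re.split(SENTENCE_BOUNDARY, "It works. Does it? Yes!")
print(clean)   # ['It works.', 'Does it?', 'Yes!']

# An abbreviation ends with a period followed by a space, so it is split too.
tricky = re.split(SENTENCE_BOUNDARY, "Dr. Smith arrived. He left.")
print(tricky)  # ['Dr.', 'Smith arrived.', 'He left.']
```

For text with many abbreviations, a dedicated sentence tokenizer may give cleaner boundaries than this pattern.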

In [15]:
with open("./assets/report.md", "r") as f:
    text = f.read()

chunks = chunk_by_sentence(text)

print(f"len of chunks: {len(chunks)}")
print("Chunks by sentence:")
for i, chunk in enumerate(chunks):
    print("="*20, f"Chunk {i + 1}", "="*20)
    print(chunk)

len of chunks: 33
Chunks by sentence:
# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

## Executive Summary

This report synthesizes the key findings and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`).
Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x80070

## Size-based Chunking

In [16]:
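def chunk_by_size(text, chunk_size=150, overlap_size=20):
    """Cut text into `chunk_size`-character chunks that share `overlap_size` characters."""
    # Without this check, a step of zero or less would loop forever.
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and < chunk_size")

    chunks = []
    start_idx = 0
    len_text = len(text)

    while start_idx < len_text:
        end_idx = min(start_idx + chunk_size, len_text)
        chunk = text[start_idx : end_idx]
        chunks.append(chunk)

        # Step forward, leaving the last `overlap_size` characters for the next chunk.
        start_idx += chunk_size - overlap_size

    return chunks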
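One detail of this loop is worth spelling out: it keeps stepping until the start index passes the end of the text, so the final chunk can consist only of characters the previous chunk already contains. For a 400-character text with the defaults, the chunks start at 0, 130, 260 and 390, and the last one is a 10-character tail already covered by the third. The variant below, with the hypothetical name `chunk_by_size_no_tail`, stops as soon as a chunk reaches the end; it is a sketch of one option, not a replacement for the function above.

```python
def chunk_by_size_no_tail(text, chunk_size=150, overlap_size=20):
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and < chunk_size")

    step = chunk_size - overlap_size
    chunks = []
    start_idx = 0

    while start_idx < len(text):
        end_idx = min(start_idx + chunk_size, len(text))
        chunks.append(text[start_idx:end_idx])
        if end_idx == len(text):
            break  # this chunk already reaches the end of the text
        start_idx += step

    return chunks

sample = "x" * 400
print([len(c) for c in chunk_by_size_no_tail(sample)])  # [150, 150, 140]
```

Whether the redundant tail matters depends on the use: it costs one extra chunk per document, which is negligible for a single report but can add up across a large corpus.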

In [18]:
with open("./assets/report.md", "r") as f:
    text = f.read()

chunks = chunk_by_size(text)

print(f"len of chunks: {len(chunks)}")
print("Chunks by size:")
for i, chunk in enumerate(chunks):
    print("="*20, f"Chunk {i + 1}", "="*20)
    print(chunk)

len of chunks: 141
Chunks by size:
# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

## Executive Summary

This report synthesizes the key findings and ongoing rese
ngs and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the 
trength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disc
end traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** fo
edical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent st
ackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`). **Financial
7000E`). **Financial Analysis** revealed mixed quarter