# Semantic Chunking

In [1]:
def chunk_markdown_by_paragraphs_and_headers(markdown_file_path, max_chunk_size=1000):
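    """
    Chunk a markdown file by paragraphs and headers.

    A new chunk starts at every level-1 or level-2 header ('# ' or '## ').
    Deeper headers (such as '### ') stay in the current chunk. A new chunk
    also starts when adding the next paragraph would push the running size
    past max_chunk_size.

    The size limit is approximate: the '\n\n' separators added when joining
    paragraphs are not counted, and a single paragraph longer than
    max_chunk_size is kept whole.

    Args:
        markdown_file_path (str): Path to the markdown file
        max_chunk_size (int): Target maximum size of each chunk in characters

    Returns:
        list: List of chunk strings. A chunk begins with a header only when
            a header triggered the split.
    """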
    # Read the markdown file
    with open(markdown_file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Split content by double newlines (paragraph boundaries)
    paragraphs = content.split('\n\n')

    # Remove empty paragraphs
    paragraphs = [p.strip() for p in paragraphs if p.strip()]

    chunks = []
    current_chunk = []
    current_size = 0

    for paragraph in paragraphs:
        # Check if paragraph is a level-1 or level-2 markdown header
        # ('# ' or '## '). Paragraphs were already stripped above, and deeper
        # levels such as '### ' are deliberately not treated as boundaries.
        is_header = paragraph.startswith(('# ', '## '))
        paragraph_size = len(paragraph)

        # Start a new chunk if we already have content and either:
        # 1. Current paragraph is a header, or
        # 2. Adding this paragraph would exceed max size
        if current_chunk and (is_header or current_size + paragraph_size > max_chunk_size):
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = []
            current_size = 0

        # Add paragraph to current chunk
        current_chunk.append(paragraph)
        current_size += paragraph_size

    # Add the last chunk if it has content
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks
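
The function above only treats `# ` and `## ` as chunk boundaries. If deeper levels should also be recognized, one option is a regular expression over the first line of each paragraph. The following is a minimal sketch, not part of the original function: `ATX_HEADER` and `header_level` are hypothetical names, and the pattern assumes ATX-style headers (one to six `#` characters followed by whitespace).

```python
import re

# Assumed pattern: up to 3 leading spaces, 1-6 '#' characters,
# then whitespace or end of line (ATX-style header).
ATX_HEADER = re.compile(r'^ {0,3}(#{1,6})(?:\s|$)')

def header_level(paragraph):
    """Return the header level (1-6) of the paragraph's first line, or 0 if it is not a header."""
    first_line = paragraph.split('\n', 1)[0]
    match = ATX_HEADER.match(first_line)
    return len(match.group(1)) if match else 0

examples = ["# Title", "## Section", "### Subsection", "#hashtag", "Plain text", "####### seven"]
levels = [header_level(p) for p in examples]
print(levels)  # [1, 2, 3, 0, 0, 0]
```

A check such as `0 < header_level(paragraph) <= 2` would reproduce the current behavior, while raising the upper bound to 3 would also split on `### ` headers.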

In [2]:
file_path = "example.md"
chunks = chunk_markdown_by_paragraphs_and_headers(file_path)
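
Because the size counter ignores the `'\n\n'` separators and never splits a single long paragraph, a returned chunk can end up longer than `max_chunk_size`. A quick check like the one below can flag such chunks. This is a sketch under assumptions: `find_oversized_chunks` is a hypothetical helper, and `sample_chunks` is made-up data standing in for the real `chunks` list.

```python
def find_oversized_chunks(chunks, max_chunk_size=1000):
    """Return (index, length) pairs for chunks longer than max_chunk_size characters."""
    return [(i, len(c)) for i, c in enumerate(chunks) if len(c) > max_chunk_size]

# Made-up data: a normal chunk, one long paragraph, and two paragraphs
# whose joined length (999 + 2 + 1) exceeds the limit.
sample_chunks = ["a" * 400, "b" * 1200, "c" * 999 + "\n\n" + "d"]
oversized = find_oversized_chunks(sample_chunks)
print(oversized)  # [(1, 1200), (2, 1002)]
```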

In [3]:
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(f"{chunk[:150]}..." if len(chunk) > 150 else chunk)

Total chunks: 12

--- Chunk 1 (471 chars) ---
# Johann Sebastian Bach

**Johann Sebastian Bach** (31 March [O.S. 21 March] 1685 – 28 July 1750) was a German composer and musician of the late Baroq...

--- Chunk 2 (915 chars) ---
## Early life and education

Bach was born in Eisenach, in the duchy of Saxe-Eisenach, into a great musical family. His father, Johann Ambrosius Bach,...

--- Chunk 3 (517 chars) ---
## Career

### Early career (1703–1708)

Bach's first position was as court musician in the chapel of Duke Johann Ernst III in Weimar. His role there ...

--- Chunk 4 (547 chars) ---
In 1708, Bach left Mühlhausen, returning to Weimar this time as organist and from 1714 Konzertmeister (director of music) at the ducal court, where he...

--- Chunk 5 (917 chars) ---
Leopold, Prince of Anhalt-Köthen, hired Bach to serve as his Kapellmeister (director of music) in 1717. Prince Leopold, himself a musician, appreciate...

--- Chunk 6 (744 chars) ---
## Musical style and works

### Organ w