# RAG: Load Vector Database (LLM-based Chunking)

This notebook loads book chapters into a Chroma vector database using LLM-based semantic chunking.
Instead of naive paragraph splitting, an LLM reads each chapter and splits it into chunks by topic.

## Initialize

In [None]:
from pathlib import Path

from agentic_patterns.core.config.config import MAIN_PROJECT_DIR
from agentic_patterns.core.agents import get_agent, run_agent
from agentic_patterns.core.vectordb import get_vector_db, vdb_add, load_vectordb_settings

In [None]:
DOCS_DIR = MAIN_PROJECT_DIR / 'tests' / 'data' / 'books'
COLLECTION_NAME = 'books_llm_chunked'
print(f"Books directory: {DOCS_DIR}")

## Vector-db: Setup

This step creates or loads a Chroma collection. It uses its own collection name so that these chunks stay separate from those produced by naive chunking.

In [None]:
vdb = get_vector_db(COLLECTION_NAME)

settings = load_vectordb_settings(MAIN_PROJECT_DIR / "config.yaml")
db_path = Path(settings.get_vectordb().persist_directory)
print(f"Database directory: {db_path}")

In [None]:
count = vdb.count()
create_vdb = (count == 0)
print(f"Collection has {count} documents. Need to populate: {create_vdb}")

## LLM-based Chunking

Instead of splitting text at fixed character boundaries or by paragraphs, we use an LLM to identify semantic boundaries. The LLM reads the text and groups sentences that belong to the same topic, scene, or theme.

The challenge is that large documents can exceed the LLM's context window or become slow to process. Our solution processes the text in batches while preserving semantic coherence across batch boundaries.

In [None]:
CHUNKING_PROMPT = """
You are a text chunking assistant. Your task is to divide the following text into coherent chunks based on topics or themes.

Guidelines:
- Each chunk should be self-contained and focus on a single topic, scene, or theme
- Chunks should be substantial (at least a few sentences) but not too long
- Preserve the original text exactly - do not summarize or modify the content
- Return the chunks as a list of strings
- IMPORTANT: If the text ends mid-topic (incomplete), include that partial content as the LAST chunk so it can be continued in the next batch

TEXT TO CHUNK:
{text}
"""

chunking_agent = get_agent(config_name="fast", output_type=list[str])
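
The prompt asks the model to preserve the original text exactly, but nothing enforces that instruction, so it can be worth checking. The following is a minimal sketch and not part of the original notebook: `normalize_whitespace` and `chunks_preserve_text` are hypothetical helper names, and the check assumes that differences in whitespace alone are acceptable.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunks_preserve_text(original: str, chunks: list[str]) -> bool:
    """Return True if the chunks, joined in order, reproduce the original text.

    Assumption: only whitespace is allowed to differ between input and output.
    """
    return normalize_whitespace(" ".join(chunks)) == normalize_whitespace(original)


original = "The ship left at dawn.\n\nThe crew was uneasy.  A storm was coming."
faithful = ["The ship left at dawn.", "The crew was uneasy. A storm was coming."]
summarized = ["The ship left at dawn.", "The crew worried about a storm."]

faithful_ok = chunks_preserve_text(original, faithful)      # chunks match the input
summarized_ok = chunks_preserve_text(original, summarized)  # model rewrote the text
```

A failed check on a batch is a signal to retry it or to fall back to paragraph splitting for that batch.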

### Batching Strategy

We split the document into batches of approximately 15,000 characters (~3000-4000 tokens). Rather than cutting at arbitrary positions, we split at paragraph boundaries (double newlines) to avoid breaking sentences. Each batch contains only complete paragraphs and stays within the size limit, with one exception: a single paragraph longer than the limit forms a batch on its own.
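
The token figure is a rule of thumb rather than a measurement. As a rough sketch, assuming 4 to 5 characters per token for English prose (an approximation that varies by tokenizer and language), the estimate can be reproduced as follows. `estimate_tokens` is a hypothetical helper, not part of the notebook.

```python
def estimate_tokens(num_chars: int, chars_per_token: float) -> int:
    """Rough token estimate from a character count (assumed ratio, not a tokenizer)."""
    return round(num_chars / chars_per_token)


batch_size_chars = 15000
low = estimate_tokens(batch_size_chars, chars_per_token=5.0)
high = estimate_tokens(batch_size_chars, chars_per_token=4.0)
print(f"{batch_size_chars} characters is roughly {low}-{high} tokens")
```

For an exact count, use the tokenizer of the model behind the `"fast"` configuration.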

In [None]:
BATCH_SIZE_CHARS = 15000

def split_into_batches(text: str, batch_size: int) -> list[str]:
    """Split text into batches by paragraphs, respecting batch_size limit."""
    paragraphs = text.split('\n\n')
    batches = []
    current_batch = []
    current_size = 0
    
    for para in paragraphs:
        para_size = len(para) + 2  # +2 for the \n\n separator
        if current_size + para_size > batch_size and current_batch:
            # Current batch is full, start a new one
            batches.append('\n\n'.join(current_batch))
            current_batch = [para]
            current_size = para_size
        else:
            current_batch.append(para)
            current_size += para_size
    
    # Don't forget the last batch
    if current_batch:
        batches.append('\n\n'.join(current_batch))
    
    return batches
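
`split_into_batches` never cuts inside a paragraph, so a paragraph longer than `batch_size` ends up as an oversized batch. If a hard limit matters, one option is to pre-split long paragraphs before batching. The sketch below is an assumption about how that could be done, not part of the original notebook: `split_oversized` is a hypothetical helper that cuts at the last space before the limit and falls back to a hard cut when there is none.

```python
def split_oversized(para: str, limit: int) -> list[str]:
    """Split one paragraph into pieces of at most `limit` characters.

    Cuts at the last space at or before the limit when one exists,
    otherwise cuts exactly at the limit.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pieces = []
    rest = para
    while len(rest) > limit:
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut
        pieces.append(rest[:cut])
        rest = rest[cut:].lstrip(" ")  # drop the space we cut at
    if rest:
        pieces.append(rest)
    return pieces


long_para = " ".join(["word"] * 10)  # 49 characters
paragraphs = ["short one", long_para]
limit = 20

# Flatten: short paragraphs pass through unchanged, long ones are split.
pieces = [piece for para in paragraphs for piece in split_oversized(para, limit)]
```

The pieces can then be joined with `'\n\n'` and passed to `split_into_batches` as usual. The trade-off is that a cut inside a paragraph may land mid-sentence.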

### Handling Incomplete Chunks (Leftover Logic)

When we split text into batches, a batch boundary might fall in the middle of a semantic topic. For example, batch 1 might end with the beginning of a conversation, and batch 2 continues it.

To handle this, we use a "leftover" strategy: the last chunk returned by the LLM for each batch (except the final batch) is assumed to be potentially incomplete. We remove it from the results and prepend it to the next batch. This way, the LLM sees the incomplete content again with additional context and can properly finish chunking it.

In [None]:
async def chunk_with_llm(file: Path) -> list[tuple[str, str, dict]]:
    """Chunk a file using LLM-based semantic chunking with batching."""
    text = file.read_text(encoding='utf-8')  # explicit encoding; the default depends on the platform
    batches = split_into_batches(text, BATCH_SIZE_CHARS)
    print(f"  Split into {len(batches)} batches")
    
    all_chunks = []
    leftover = ""
    
    for batch_num, batch in enumerate(batches):
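        # Prepend leftover from previous batch, separated by a paragraph break
        # so the leftover text does not fuse with the first paragraph of this batch
        batch_text = f"{leftover}\n\n{batch}" if leftover else batch
        leftover = ""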
        
        prompt = CHUNKING_PROMPT.format(text=batch_text)
        agent_run, _ = await run_agent(chunking_agent, prompt, verbose=False)
        chunks: list[str] = agent_run.result.output
        
        if not chunks:
            # Nothing came back: carry the text forward so it is not silently dropped
            leftover = batch_text
            continue
        
        # Last chunk might be incomplete - save as leftover for next batch
        if batch_num < len(batches) - 1:
            leftover = chunks.pop()
        
        all_chunks.extend(chunks)
        print(f"    Batch {batch_num + 1}/{len(batches)}: {len(chunks)} chunks")
    
    # Add final leftover as the last chunk
    if leftover:
        all_chunks.append(leftover)
    
    # Format results with IDs and metadata
    results = []
    for chunk_num, chunk in enumerate(all_chunks):
        doc = chunk.strip()
        if not doc:
            continue
        doc_id = f"{file.stem}-llm-{chunk_num}"
        metadata = {'source': str(file.stem), 'chunk': chunk_num, 'method': 'llm'}
        results.append((doc, doc_id, metadata))
    
    return results
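
The leftover loop can be exercised without an LLM by swapping in a deterministic chunker. The sketch below is illustrative only: `chunk_batches` is a synchronous restatement of the carry-over loop, and `stub_chunker` is a hypothetical stand-in for the model that groups consecutive paragraphs sharing a `topic:` prefix. It assumes the leftover is joined to the next batch with a blank line.

```python
from typing import Callable


def chunk_batches(batches: list[str], chunker: Callable[[str], list[str]]) -> list[str]:
    """Run `chunker` over each batch, carrying the last chunk into the next batch."""
    all_chunks: list[str] = []
    leftover = ""
    for i, batch in enumerate(batches):
        text = f"{leftover}\n\n{batch}" if leftover else batch
        leftover = ""
        chunks = chunker(text)
        if not chunks:
            continue
        if i < len(batches) - 1:
            leftover = chunks.pop()  # possibly incomplete: revisit with more context
        all_chunks.extend(chunks)
    if leftover:
        all_chunks.append(leftover)
    return all_chunks


def stub_chunker(text: str) -> list[str]:
    """Hypothetical stand-in for the LLM: group consecutive paragraphs by topic prefix."""
    groups: list[list[str]] = []
    last_topic = None
    for para in text.split("\n\n"):
        topic = para.split(":", 1)[0]
        if topic == last_topic:
            groups[-1].append(para)
        else:
            groups.append([para])
            last_topic = topic
    return ["\n\n".join(group) for group in groups]


# The "dogs" topic straddles the batch boundary.
batches = ["cats: A\n\ndogs: B", "dogs: C\n\nbirds: D"]

without_carry = [chunk for batch in batches for chunk in stub_chunker(batch)]
with_carry = chunk_batches(batches, stub_chunker)
```

Without the carry-over, the "dogs" topic is split into two chunks at the batch boundary. With it, the two paragraphs end up in a single chunk, at the cost of sending the leftover text to the chunker twice.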

## Load documents

In [None]:
if create_vdb:
    count_added = 0
    for txt_file in sorted(DOCS_DIR.glob('*.txt')):  # sorted for a deterministic processing order
        print(f"Processing file '{txt_file.name}' with LLM chunking...")
        chunks = await chunk_with_llm(txt_file)
        print(f"  LLM produced {len(chunks)} chunks")
        
        for doc, doc_id, meta in chunks:
            vdb_add(vdb, text=doc, doc_id=doc_id, meta=meta)
            print(f"  Added doc_id: {doc_id}")
            count_added += 1
    
    print(f"\nTotal documents added: {count_added}")
    assert count_added > 0, f"No documents added. Check books directory: {DOCS_DIR}"
else:
    print("Database already populated, skipping load.")

In [None]:
print(f"Final document count: {vdb.count()}")