# Week 01 · Chunk (3 Strategies) → Build Vector Index

**Objective**: Transform the crawled docs into retrieval chunks using three strategies, embed them via OpenRouter or OpenAI, and build a persistent ChromaDB index.

**Architecture**: The chunking functions are imported from `context_engineering.application.ingest_documents_service.chunkers`; none are defined in this notebook.

**Provider Support**: Uses the OpenRouter unified API (one key for OpenAI, Anthropic, Google, and other providers) or OpenAI directly.

In [1]:
#  Setup & Installations
import sys

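# Uncomment to install dependencies (needed on Colab). The version specifiers are
# quoted so the shell does not treat ">" as output redirection.
# if "google.colab" in sys.modules:
#     print(" Installing required packages...")
#     %pip install -q "langchain>=0.1.0" "langchain-openai>=0.0.5" "langchain-community>=0.0.20" "langchain-text-splitters>=0.2.0" "chromadb>=0.4.0" "tiktoken>=0.5.0" "python-dotenv>=1.0.0"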

print(" Packages ready")

 Packages ready


In [2]:
#  Imports & Environment Setup
import os
import sys
import json
import random
from pathlib import Path
from dotenv import load_dotenv

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

# Load environment
load_dotenv(project_root / ".env")

# Check for API key (OpenRouter preferred, OpenAI as fallback)
openrouter_key = os.getenv("OPENROUTER_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")

if not openrouter_key and not openai_key:
    raise EnvironmentError(
        "   No API key found!\n"
        "   Add OPENROUTER_API_KEY (recommended) or OPENAI_API_KEY to .env"
    )

# Load configuration
from context_engineering.config import (
    CRAWL_OUT_DIR, VECTOR_DIR, EMBEDDING_MODEL, PROVIDER
)

random.seed(42)

provider = "OpenRouter" if openrouter_key else "OpenAI"
print(" Environment loaded")
print(f" Provider: {provider}")
print(f" Project root: {project_root}")


 Environment loaded
 Provider: OpenRouter
 Project root: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering


## Import Chunking Services

The chunking functions come from the application layer; they are not defined in this notebook.

In [3]:
#  Import Chunking Services
from context_engineering.application.ingest_documents_service import (
    semantic_chunk,
    fixed_chunk,
    sliding_chunk
)

print(" Chunking services loaded from service layer")
print(" Location: context_engineering.application.ingest_documents_service.chunkers")
print("\n Available strategies:")
print("   1. semantic_chunk  - Split by heading structure")
print("   2. fixed_chunk     - Uniform 800-token chunks with overlap")
print("   3. sliding_chunk   - Overlapping windows for better recall")

 Chunking services loaded from service layer
 Location: context_engineering.application.ingest_documents_service.chunkers

 Available strategies:
   1. semantic_chunk  - Split by heading structure
   2. fixed_chunk     - Uniform 800-token chunks with overlap
   3. sliding_chunk   - Overlapping windows for better recall
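
The service implementations are not shown in this notebook, so the following is only a minimal sketch of how the fixed and sliding strategies can differ. It assumes both are windows over a token list and that the difference is the stride; `window_chunks` and the whitespace tokenization are hypothetical stand-ins, not the actual `fixed_chunk` or `sliding_chunk` code.

```python
def window_chunks(tokens, size, stride):
    """Return windows of `size` tokens, starting every `stride` tokens.

    Consecutive windows overlap by `size - stride` tokens when stride < size.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the token list
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace tokens stand in for real tokenizer output (an assumption)
tokens = "a b c d e f g h i j".split()

fixed = window_chunks(tokens, size=4, stride=3)    # small overlap (1 token)
sliding = window_chunks(tokens, size=4, stride=2)  # larger overlap (2 tokens)

print(len(fixed), len(sliding))
```

A smaller stride produces more chunks for the same text, which matches the higher chunk count the sliding strategy reports later in this notebook.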


## Load Corpus

In [4]:
#  Load Corpus
jsonl_path = CRAWL_OUT_DIR / "nawaloka_docs.jsonl"

if not jsonl_path.exists():
    raise FileNotFoundError(f" Corpus not found at {jsonl_path}. Run 01_crawl_nawaloka.ipynb first.")

with open(jsonl_path, 'r', encoding='utf-8') as f:
    # Skip blank lines so a trailing newline does not break json.loads
    documents = [json.loads(line) for line in f if line.strip()]

print(f" Loaded {len(documents)} documents")
print(f" Total content size: {sum(len(d['content']) for d in documents):,} chars")

 Loaded 10 documents
 Total content size: 403,754 chars
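
Later cells read `content` from every document, so it can help to fail early when a record is malformed. The loader below is a hedged sketch: `load_jsonl` and its `required` parameter are hypothetical names, and the demo file is created only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path, required=("content",)):
    """Load a JSONL file, skipping blank lines and checking required keys."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required if key not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing keys {missing}")
            records.append(record)
    return records


# Demo on a throwaway file (illustrative data, not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text('{"content": "hello"}\n\n{"content": "world"}\n', encoding="utf-8")
    demo_docs = load_jsonl(demo)

print(len(demo_docs))
```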


## Apply Chunking Strategies

In [5]:
# Cleanup Vector Store (prevents corruption)
import shutil
import os
import stat
import time

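def on_rm_error(func, path, exc_info):
    # shutil.rmtree error handler: clear the read-only bit and retry once.
    # Failures are swallowed here, so the caller must verify the result.
    try:
        os.chmod(path, stat.S_IWRITE)
        func(path)
    except Exception:
        pass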

# Try to remove existing vector store
LOCK_DETECTED = False
if VECTOR_DIR.exists():
    print(f" Attempting to clean: {VECTOR_DIR}")
    try:
        shutil.rmtree(VECTOR_DIR, onerror=on_rm_error)
        # on_rm_error swallows failures, so confirm the directory is really gone
        if VECTOR_DIR.exists():
            raise OSError("directory still present after rmtree")
        print("    Cleaned up successfully")
    except Exception as e:
        print(f"    Cleanup failed ({e})")
        # Try renaming as last resort cleanup
        try:
            backup = VECTOR_DIR.with_name(f"vectorstore_locked_{int(time.time())}")
            os.rename(VECTOR_DIR, backup)
            print(f"    Renamed locked dir to: {backup}")
        except Exception as e2:
            LOCK_DETECTED = True
            print(f"    CRITICAL LOCK: Could not delete or rename ({e2})")

if LOCK_DETECTED:
    # OVERRIDE VECTOR_DIR to use a fresh path
    print("\n ⚠️  FILE LOCK DETECTED (likely opened in editor)")
    print("    Switching to a new directory to bypass lock...")
    VECTOR_DIR = VECTOR_DIR.with_name("vectorstore_v2")
    print(f"    NEW TARGET: {VECTOR_DIR}")
else:
    # Create fresh standard directory
    VECTOR_DIR.mkdir(parents=True, exist_ok=True)
    print(f" Fresh vector directory ready: {VECTOR_DIR}")


 Attempting to clean: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\vectorstore
    Cleaned up successfully
 Fresh vector directory ready: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\vectorstore


In [6]:
#  Semantic Chunking (using service!)
print(" Running semantic chunking...")
semantic_chunks = semantic_chunk(documents)

# Save
semantic_path = CRAWL_OUT_DIR / "chunks_semantic.jsonl"
with open(semantic_path, 'w', encoding='utf-8') as f:
    for chunk in semantic_chunks:
        f.write(json.dumps(chunk, ensure_ascii=False) + '\n')

print(f" Semantic chunking complete: {len(semantic_chunks)} chunks")
print(f" Saved to: {semantic_path}")

 Running semantic chunking...
 Semantic chunking complete: 238 chunks
 Saved to: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\chunks_semantic.jsonl
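
The semantic strategy is described above as splitting by heading structure. As a rough illustration only, the sketch below cuts Markdown text at ATX headings (`#` to `######`); `split_by_headings` is a hypothetical helper and may differ from what `semantic_chunk` actually does (for example, it applies no size cap).

```python
import re

# A heading line: 1-6 '#' characters, then spaces or tabs, then some text
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_by_headings(markdown):
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    # Keep any text that appears before the first heading as its own section
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


sample = "intro text\n\n# Title\nbody\n\n## Sub\nmore body\n"
sections = split_by_headings(sample)
for section in sections:
    print(repr(section))
```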


In [7]:
#  Fixed-Window Chunking (using service!)
print(" Running fixed-window chunking...")
fixed_chunks = fixed_chunk(documents)

# Save
fixed_path = CRAWL_OUT_DIR / "chunks_fixed.jsonl"
with open(fixed_path, 'w', encoding='utf-8') as f:
    for chunk in fixed_chunks:
        f.write(json.dumps(chunk, ensure_ascii=False) + '\n')

avg_tokens = sum(c['token_count'] for c in fixed_chunks) / len(fixed_chunks) if fixed_chunks else 0
print(f" Fixed chunking complete: {len(fixed_chunks)} chunks")
print(f" Avg token count: {avg_tokens:.1f}")
print(f" Saved to: {fixed_path}")

 Running fixed-window chunking...
 Fixed chunking complete: 237 chunks
 Avg token count: 1186.7
 Saved to: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\chunks_fixed.jsonl


In [8]:
#  Sliding-Window Chunking (using service!)
print(" Running sliding-window chunking...")
sliding_chunks = sliding_chunk(documents)

# Save
sliding_path = CRAWL_OUT_DIR / "chunks_sliding.jsonl"
with open(sliding_path, 'w', encoding='utf-8') as f:
    for chunk in sliding_chunks:
        f.write(json.dumps(chunk, ensure_ascii=False) + '\n')

print(f" Sliding chunking complete: {len(sliding_chunks)} chunks")
print(f" Saved to: {sliding_path}")

 Running sliding-window chunking...
 Sliding chunking complete: 400 chunks
 Saved to: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\chunks_sliding.jsonl


## Spot-Check Samples

In [9]:
#  Spot-Check Samples
print(" Spot-Check: 2 samples from each strategy\n")

def print_sample(chunk, strategy_name):
    print(f"**{strategy_name}** chunk:")
    print(f"  URL: {chunk['url']}")
    print(f"  Strategy: {chunk['strategy']}")
    print(f"  Text length: {len(chunk['text'])} chars")
    print(f"  Preview: {chunk['text'][:100]}...")
    print()

print("=" * 60)
print("SEMANTIC SAMPLES")
print("=" * 60)
for chunk in random.sample(semantic_chunks, min(2, len(semantic_chunks))):
    print_sample(chunk, "Semantic")

print("=" * 60)
print("FIXED-WINDOW SAMPLES")
print("=" * 60)
for chunk in random.sample(fixed_chunks, min(2, len(fixed_chunks))):
    print_sample(chunk, "Fixed")

print("=" * 60)
print("SLIDING-WINDOW SAMPLES")
print("=" * 60)
for chunk in random.sample(sliding_chunks, min(2, len(sliding_chunks))):
    print_sample(chunk, "Sliding")

 Spot-Check: 2 samples from each strategy

SEMANTIC SAMPLES
**Semantic** chunk:
  URL: https://www.nawaloka.com/blogs-and-news/laparoscopic-surgery-cost-in-sri-lanka
  Strategy: semantic
  Text length: 4000 chars
  Preview: UpShWJWoanYLK1CsLNgavS7dHMlDsrpR1F5tPdHcbe/wBR2rb6UbqEjGytLHbFa4vGotlbjy7VCckJtMQR7fAQUpQlXbq3PqRtuT...

**Semantic** chunk:
  URL: https://www.nawaloka.com/channeling
  Strategy: semantic
  Text length: 3999 chars
  Preview: ![Physiotherapy](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gHYSUNDX1BST0ZJTEUAAQEAAAHIAAAA...

FIXED-WINDOW SAMPLES
**Fixed** chunk:
  URL: https://www.nawaloka.com/
  Strategy: fixed
  Text length: 31 chars
  Preview: Doctor BookingsFind your Doctor...

**Fixed** chunk:
  URL: https://www.nawaloka.com/blogs-and-news/laparoscopic-surgery-cost-in-sri-lanka
  Strategy: fixed
  Text length: 2466 chars
  Preview: #### After Surgery

* Patients typically stay for a shorter duration, sometimes just overnight.
* Re...

SLIDING-WINDOW SAMPLES
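
Several previews above are base64 image data rather than readable text, which suggests the crawled Markdown contains inline `data:` URI images. One possible mitigation, sketched here under that assumption, is to replace such images with their alt text before chunking; `strip_inline_images` is a hypothetical helper and is not part of the ingest service.

```python
import re

# Markdown image whose target is an inline data URI:
#   ![alt](data:image/<type>;base64,<payload>)
DATA_URI_IMAGE = re.compile(r"!\[([^\]]*)\]\(data:image/[^)]*\)")


def strip_inline_images(markdown):
    """Replace inline data-URI images with their alt text (possibly empty)."""
    return DATA_URI_IMAGE.sub(lambda m: m.group(1), markdown)


sample = "![Physiotherapy](data:image/jpeg;base64,/9j/4AAQSkZJRg==)\n\nService Bookings"
cleaned = strip_inline_images(sample)
print(cleaned)
```

This would need to run on each document's `content` before the chunkers are called: once a chunk boundary has split a data URI, the fragments no longer match the pattern.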

## Build ChromaDB Vector Index

In [10]:
#  Build Vector Index with LangChain
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from context_engineering.infrastructure.llm_providers import get_default_embeddings

# Initialize embeddings using service factory (supports OpenRouter)
embeddings = get_default_embeddings()

print(f" Embeddings initialized: {EMBEDDING_MODEL}")
print(f" Provider: {PROVIDER}")

# Prepare directory
VECTOR_DIR.mkdir(parents=True, exist_ok=True)

# Combine all chunks
all_chunks = semantic_chunks + fixed_chunks + sliding_chunks
print(f" Total chunks to embed: {len(all_chunks)}")

# Convert to LangChain Documents
lc_documents = []
for chunk in all_chunks:
    doc = Document(
        page_content=chunk['text'],
        metadata={
            "url": chunk['url'],
            "title": chunk['title'],
            "strategy": chunk['strategy'],
            "chunk_index": chunk['chunk_index']
        }
    )
    lc_documents.append(doc)

print("\n Creating Chroma vector store...\n")

# Create vector store: embeds all documents and persists them under VECTOR_DIR
vectorstore = Chroma.from_documents(
    documents=lc_documents,
    embedding=embeddings,
    persist_directory=str(VECTOR_DIR),
    collection_name="nawaloka"
)

print(" Vector store created!")
print(f" Total vectors indexed: {vectorstore._collection.count()}")

 Embeddings initialized: openai/text-embedding-3-large
 Provider: openrouter
 Total chunks to embed: 875

 Creating Chroma vector store...

 Vector store created!
 Total vectors indexed: 875
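
All three strategies share one collection, so the same source text is indexed up to three times. If the cleanup cell is ever skipped, deterministic IDs are one way to keep re-runs from adding duplicates. The sketch below derives an ID from fields that every chunk already carries (`strategy`, `url`, `chunk_index`); `chunk_id` and `batched` are hypothetical helpers, and the example URL is made up.

```python
import hashlib


def chunk_id(chunk):
    """Derive a stable ID from the fields that identify a chunk."""
    key = f"{chunk['strategy']}|{chunk['url']}|{chunk['chunk_index']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Illustrative chunks, not taken from the real corpus
demo_chunks = [
    {"strategy": "fixed", "url": "https://example.com/a", "chunk_index": i}
    for i in range(5)
]
ids = [chunk_id(c) for c in demo_chunks]
batches = list(batched(demo_chunks, 2))
print(ids[0], [len(b) for b in batches])
```

`Chroma.from_documents` accepts an `ids` argument, so a list built this way could be passed alongside `lc_documents`; whether a repeated ID overwrites or raises depends on the installed LangChain and Chroma versions, so treat that behaviour as something to verify.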


## Index Sanity Check

In [11]:
#  Index Sanity Check
print(" Index Sanity Check\n")

# Verify count
count = vectorstore._collection.count()
print(f" Collection contains {count} vectors")
assert count > 0, " Collection is empty!"

# Test query
test_query = "cardiology services and health checks"
print(f"\n Test query: '{test_query}'\n")

# Chroma reports a distance as the score here: lower means more similar
results = vectorstore.similarity_search_with_score(
    query=test_query,
    k=3
)

print("Top 3 results:")
for i, (doc, score) in enumerate(results, 1):
    print(f"\n{i}. Score: {score:.4f}")
    print(f"   URL: {doc.metadata['url']}")
    print(f"   Strategy: {doc.metadata['strategy']}")
    print(f"   Preview: {doc.page_content[:100]}...")

print("\n Index sanity check passed!")
print(f" Vector store persisted at: {VECTOR_DIR}")

 Index Sanity Check

 Collection contains 875 vectors

 Test query: 'cardiology services and health checks'

Top 3 results:

1. Score: 1.0076
   URL: https://www.nawaloka.com/blogs-and-news/best-cardiologist-in-sri-lanka
   Strategy: sliding
   Preview: IIkIQgiQhCCJCEIIkIQgiQhCCJCEIIkIQgi//9k=)

Service BookingsBook your service here

# Finding the Bes...

2. Score: 1.0355
   URL: https://www.nawaloka.com/blogs-and-news/best-cardiologist-in-sri-lanka
   Strategy: sliding
   Preview: le to yours.
6. **Read Patient Reviews** - Web based reviews are a source of valuable information ab...

3. Score: 1.0382
   URL: https://www.nawaloka.com/blogs-and-news/best-cardiologist-in-sri-lanka
   Strategy: sliding
   Preview: xsDLOyP88dN/ap/70qqc9Ff+5V7WHvSqpz0V/wC5V7WIB2GbNXP1X/qX7GHYZs1c/Vf+pfsYzeT7R/AosND5Bn/Y+qn/AL0qqc9F...

 Index sanity check passed!
 Vector store persisted at: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Context Engineering\data\vectorstore
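
The scores above sit near 1.0, which is easier to read once converted to a cosine similarity. The sketch below rests on two assumptions this notebook does not confirm: that the collection uses Chroma's default squared-L2 distance, and that the embedding vectors are unit length. Under those assumptions, the squared distance equals `2 - 2 * cos`, so a score of about 1.0 corresponds to a cosine similarity of about 0.5.

```python
import math


def cosine_from_squared_l2(distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance / 2.0


# Two unit vectors 60 degrees apart (illustrative, not real embeddings)
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]

squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = cosine_from_squared_l2(squared_l2)
print(round(squared_l2, 6), round(cosine, 6))
```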
