# Week 4: Document Chunking & Hybrid Search

**What We're Building This Week:**

Week 4 transforms our BM25 keyword search (Week 3) into a hybrid search system that combines keyword matching with semantic vector similarity. A query for "deep learning optimization" can now also surface papers about "neural network training", even when the exact query words don't appear in them.

## Week 4 Focus Areas

### Core Objectives
- **Section-Based Chunking**: Split papers into ~600-word overlapping segments that respect document structure
- **Vector Embeddings**: Generate 1024-dim vectors via Jina AI for semantic similarity
- **Hybrid Search**: Combine BM25 + KNN vector search using Reciprocal Rank Fusion (RRF)
- **Production API**: Test the `/api/v1/hybrid-search/` endpoint end-to-end

### What We'll Test In This Notebook
1. **Environment & Service Health** — Verify all Week 1-3 services + Jina API
2. **Text Chunking** — Test TextChunker with real papers from PostgreSQL
3. **Embedding Generation** — Generate real embeddings via Jina AI
4. **Hybrid Indexing Pipeline** — Chunk → Embed → Index into OpenSearch
5. **Search Modes** — Compare BM25 vs Vector vs Hybrid search
6. **Production API** — Test the hybrid search endpoint
7. **Performance Comparison** — Measure latency and result quality

---

### Key Architecture
```
Paper (PostgreSQL) → TextChunker → Chunks → Jina API → Embeddings
                                                          ↓
                                          OpenSearch (chunk + embedding)
                                                          ↓
                          Query → BM25 + KNN → RRF Pipeline → Results
```

## Prerequisites

### Services Required
```bash
docker compose up --build -d
```

### Jina AI API Key
1. Sign up at https://jina.ai/embeddings/ (free tier: 1M tokens/month)
2. Add to `.env`: `JINA_API_KEY=jina_your_key_here`

**Note**: Without a Jina API key, embedding generation will fail. BM25 search still works.

## 1. Environment Setup & Health Check

In [1]:
# Environment Setup and Path Configuration
import sys
import os
from pathlib import Path
import requests
import json

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")

# Find project root and add to Python path
current_dir = Path.cwd()
if current_dir.name == "week4" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"Project root: {project_root}")
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
else:
    print("Missing compose.yml - check directory")

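# Load environment variables from .env
# (skipped when the project root was not found: project_root is None in that
# case and the path join would raise a TypeError)
from dotenv import load_dotenv
if project_root:
    load_dotenv(project_root / ".env")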

print(f"\nJina API key configured: {'Yes' if os.getenv('JINA_API_KEY') else 'No'}")

Python Version: 3.12.7
Project root: /Users/nishantgaurav/Project/PaperAlchemy

Jina API key configured: Yes


In [2]:
# Service Health Verification
print("WEEK 4 PREREQUISITE CHECK")
print("=" * 50)

services_to_test = {
    "FastAPI": "http://localhost:8000/health",
    "OpenSearch": "http://localhost:9201",
    "PostgreSQL (via API)": "http://localhost:8000/api/v1/health",
}

all_healthy = True
for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"  {service_name}: Healthy")
        else:
            print(f"  {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.RequestException:
        # Covers refused connections as well as the 5-second timeout
        print(f"  {service_name}: Not accessible")
        all_healthy = False

# Check Jina API key
jina_key = os.getenv("JINA_API_KEY", "")
if jina_key and jina_key != "jina_your_key_here":
    print(f"  Jina API Key: Configured")
else:
    print(f"  Jina API Key: NOT configured (embedding tests will fail)")
    all_healthy = False

if all_healthy:
    print(f"\nAll services healthy! Ready for Week 4.")
else:
    print(f"\nSome services need attention. Check above.")

WEEK 4 PREREQUISITE CHECK
  FastAPI: Healthy
  OpenSearch: Healthy
  PostgreSQL (via API): Healthy
  Jina API Key: Configured

All services healthy! Ready for Week 4.


## 2. Fetch Sample Papers from PostgreSQL

In [3]:
# Get papers with parsed PDF content for chunking
from src.db.factory import make_database
from src.models.paper import Paper

print("FETCHING PAPERS WITH PDF CONTENT")
print("=" * 50)

database = make_database()

with database.get_session() as session:
    # Get papers that have parsed PDF text
    papers_with_text = session.query(Paper).filter(
        Paper.pdf_content.isnot(None),  # renders as SQL "IS NOT NULL"
        Paper.pdf_content != ""
    ).all()

    # Also get all papers (some may only have abstracts)
    all_papers = session.query(Paper).all()

    print(f"Total papers in database: {len(all_papers)}")
    print(f"Papers with PDF content: {len(papers_with_text)}")

    # Prepare paper data dicts for the indexing pipeline
    sample_papers = []
    for paper in all_papers:
        text = paper.pdf_content or paper.abstract or ""
        sample_papers.append({
            "id": paper.id,
            "arxiv_id": paper.arxiv_id,
            "title": paper.title,
            "abstract": paper.abstract or "",
            "raw_text": text,
            "sections": paper.sections,
            "authors": paper.authors,
            "categories": paper.categories,
            "published_date": paper.published_date.isoformat() if paper.published_date else None,
        })

    print(f"\nPapers prepared for indexing:")
    for p in sample_papers:
        text_len = len(p["raw_text"])
        has_sections = "Yes" if p["sections"] else "No"
        print(f"  [{p['arxiv_id']}] {p['title'][:55]}...")
        print(f"    Text: {text_len:,} chars | Sections: {has_sections}")

    test_paper = sample_papers[0] if sample_papers else None

FETCHING PAPERS WITH PDF CONTENT

## 3. Test TextChunker — Section-Based Chunking

In [4]:
# Test the TextChunker with a real paper
from src.services.indexing.text_chunker import TextChunker
from src.config import get_settings

settings = get_settings()

print("TEXT CHUNKER TEST")
print("=" * 50)

# Create chunker from settings
chunker = TextChunker(
    chunk_size=settings.chunking.chunk_size,
    overlap_size=settings.chunking.overlap_size,
    min_chunk_size=settings.chunking.min_chunk_size,
)

print(f"Chunk size: {settings.chunking.chunk_size} words")
print(f"Overlap: {settings.chunking.overlap_size} words")
print(f"Min chunk: {settings.chunking.min_chunk_size} words")

if test_paper:
    # Chunk the first paper
    chunks = chunker.chunk_paper(
        title=test_paper["title"],
        abstract=test_paper["abstract"],
        full_text=test_paper["raw_text"],
        arxiv_id=test_paper["arxiv_id"],
        paper_id=str(test_paper["id"]),
        sections=test_paper.get("sections"),
    )

    print(f"\nPaper: {test_paper['arxiv_id']}")
    print(f"Original text: {len(test_paper['raw_text'].split()):,} words")
    print(f"Chunks created: {len(chunks)}")

    if chunks:
        avg_words = sum(c.metadata.word_count for c in chunks) / len(chunks)
        print(f"Average chunk size: {avg_words:.0f} words")

        print(f"\nSample chunks:")
        for i, chunk in enumerate(chunks[:3]):
            print(f"\n  Chunk {i}: section={chunk.metadata.section_title}")
            print(f"    Words: {chunk.metadata.word_count}")
            print(f"    Overlap prev/next: {chunk.metadata.overlap_with_previous}/{chunk.metadata.overlap_with_next}")
            print(f"    Preview: {chunk.text[:120]}...")
else:
    print("\nNo papers available. Run Week 2 notebook first.")

TEXT CHUNKER TEST
Chunk size: 600 words
Overlap: 100 words
Min chunk: 100 words

Paper: 2508.11121
Original text: 213 words
Chunks created: 1
Average chunk size: 213 words

Sample chunks:

  Chunk 0: section=None
    Words: 213
    Overlap prev/next: 0/0
    Preview: Spreadsheet manipulation software are widely used for data management and analysis of tabular data, yet the creation of ...


### 3.1 Compare Overlap Strategies

In [5]:
# Compare different overlap sizes
print("OVERLAP STRATEGY COMPARISON")
print("=" * 50)

if test_paper:
    for overlap in [0, 50, 100, 150]:
        test_chunker = TextChunker(chunk_size=600, overlap_size=overlap, min_chunk_size=100)
        test_chunks = test_chunker.chunk_text(
            text=test_paper["raw_text"],
            arxiv_id=test_paper["arxiv_id"],
            paper_id=str(test_paper["id"]),
        )
        avg = sum(c.metadata.word_count for c in test_chunks) / len(test_chunks) if test_chunks else 0
        print(f"  Overlap {overlap:3d} words: {len(test_chunks):3d} chunks, avg {avg:.0f} words/chunk")

    print(f"\nRecommendation: 100-word overlap — good context preservation, minimal redundancy.")
else:
    print("No test paper available.")

OVERLAP STRATEGY COMPARISON
  Overlap   0 words:   1 chunks, avg 213 words/chunk
  Overlap  50 words:   1 chunks, avg 213 words/chunk
  Overlap 100 words:   1 chunks, avg 213 words/chunk
  Overlap 150 words:   1 chunks, avg 213 words/chunk

Recommendation: 100-word overlap — good context preservation, minimal redundancy.
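
The comparison above is uninformative for this sample: the test paper has 213 words, fewer than the 600-word chunk size, so every overlap setting yields a single chunk. The effect of overlap only shows up on longer texts. The following is a minimal, standalone sketch of word-based sliding-window chunking, not the project's `TextChunker`; the function name `sliding_window_chunks` and the synthetic 3,000-word input are assumptions made for illustration.

```python
def sliding_window_chunks(words, chunk_size=600, overlap=100):
    """Split a list of words into overlapping windows (illustrative only)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 3,000-word "paper" so that overlap actually changes the result
long_text = [f"w{i}" for i in range(3000)]
for overlap in (0, 50, 100, 150):
    chunks = sliding_window_chunks(long_text, chunk_size=600, overlap=overlap)
    print(f"Overlap {overlap:3d} words: {len(chunks)} chunks")

# A 213-word text fits in one window regardless of the overlap setting
short_text = [f"w{i}" for i in range(213)]
print(len(sliding_window_chunks(short_text, overlap=150)))
```

Larger overlaps shrink the step between windows, so the same text produces more chunks and more embedding calls. That is the trade-off the 100-word recommendation is balancing.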


## 4. Test Jina Embedding Generation

In [6]:
# Test the JinaEmbeddingsClient directly
from src.services.embeddings.jina_client import JinaEmbeddingsClient

print("JINA EMBEDDING GENERATION TEST")
print("=" * 50)

jina_api_key = os.getenv("JINA_API_KEY", "")

if not jina_api_key:
    print("No JINA_API_KEY set. Skipping embedding test.")
    print("Add JINA_API_KEY to your .env file.")
else:
    client = JinaEmbeddingsClient(api_key=jina_api_key)

    # Test passage embedding (document-side)
    test_texts = [
        "Transformers use self-attention mechanisms for sequence modeling.",
        "Gradient descent is an optimization algorithm for training neural networks.",
        "The cat sat on the mat.",  # Unrelated text for contrast
    ]

    print("Testing passage embeddings (retrieval.passage)...")
    passage_embeddings = await client.embed_passages(test_texts)

    print(f"  Generated {len(passage_embeddings)} embeddings")
    print(f"  Dimension: {len(passage_embeddings[0])}")

    for i, emb in enumerate(passage_embeddings):
        norm = sum(x * x for x in emb) ** 0.5
        print(f"  Text {i+1}: [{emb[0]:.4f}, {emb[1]:.4f}, ...] norm={norm:.3f}")

    # Test query embedding (query-side, asymmetric)
    print(f"\nTesting query embedding (retrieval.query)...")
    query_emb = await client.embed_query("attention mechanism in transformers")
    print(f"  Dimension: {len(query_emb)}")
    print(f"  Preview: [{query_emb[0]:.4f}, {query_emb[1]:.4f}, ...]")

    # Compute cosine similarity to show asymmetric encoding works
    import math
    def cosine_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(f"\nCosine similarity (query vs passages):")
    for i, emb in enumerate(passage_embeddings):
        sim = cosine_sim(query_emb, emb)
        print(f"  '{test_texts[i][:50]}...' -> {sim:.4f}")

    await client.close()
    print(f"\nEmbedding test complete!")

JINA EMBEDDING GENERATION TEST
Testing passage embeddings (retrieval.passage)...
  Generated 3 embeddings
  Dimension: 1024
  Text 1: [0.1255, 0.0527, ...] norm=1.000
  Text 2: [-0.0136, -0.0730, ...] norm=1.000
  Text 3: [-0.0546, -0.0117, ...] norm=1.000

Testing query embedding (retrieval.query)...
  Dimension: 1024
  Preview: [0.1059, 0.0398, ...]

Cosine similarity (query vs passages):
  'Transformers use self-attention mechanisms for seq...' -> 0.6577
  'Gradient descent is an optimization algorithm for ...' -> 0.0707
  'The cat sat on the mat....' -> 0.0176

Embedding test complete!
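
The asymmetric encoding exercised above comes down to sending the same model a different `task` value for documents and for queries. As a rough sketch of what a client such as `JinaEmbeddingsClient` might send (its actual implementation is not shown in this notebook), the helper below only builds the request and makes no network call. The endpoint URL, the payload field names, and the function name `build_embedding_request` are assumptions based on Jina's public embeddings API and should be checked against its documentation.

```python
import json

# Assumed endpoint; verify against the Jina API documentation
JINA_EMBEDDINGS_URL = "https://api.jina.ai/v1/embeddings"


def build_embedding_request(texts, task, api_key, dimensions=1024):
    """Return (headers, payload) for an embeddings request (hypothetical helper)."""
    if task not in ("retrieval.passage", "retrieval.query"):
        raise ValueError(f"unexpected task: {task}")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": "jina-embeddings-v3",  # model name reported by the indexed chunks
        "task": task,                   # passage side vs. query side
        "dimensions": dimensions,
        "input": list(texts),
    }
    return headers, payload


# Documents are embedded with retrieval.passage, queries with retrieval.query
_, passage_payload = build_embedding_request(
    ["Transformers use self-attention mechanisms for sequence modeling."],
    task="retrieval.passage",
    api_key="jina_your_key_here",
)
_, query_payload = build_embedding_request(
    ["attention mechanism in transformers"],
    task="retrieval.query",
    api_key="jina_your_key_here",
)
print(json.dumps(query_payload, indent=2))
```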


## 5. OpenSearch Setup — Hybrid Index & RRF Pipeline

In [7]:
# Setup OpenSearch client and hybrid index
from src.services.opensearch.factory import make_opensearch_client_fresh

print("OPENSEARCH HYBRID INDEX SETUP")
print("=" * 50)

# Create client pointing to localhost (notebook port)
opensearch_client = make_opensearch_client_fresh(
    settings=settings,
    host="http://localhost:9201"
)

print(f"Host: {opensearch_client.host}")
print(f"Index: {opensearch_client.index_name}")
print(f"Health: {'Healthy' if opensearch_client.health_check() else 'Unhealthy'}")

# Setup hybrid index + RRF pipeline (creates if not exists)
results = opensearch_client.setup_indices(force=False)
print(f"\nIndex created: {results.get('hybrid_index', False)}")
print(f"RRF pipeline created: {results.get('rrf_pipeline', False)}")

# Show current stats
stats = opensearch_client.get_index_stats()
print(f"\nCurrent index stats:")
print(f"  Documents: {stats.get('document_count', 0)}")
print(f"  Size: {stats.get('size_in_bytes', 0):,} bytes")

OPENSEARCH HYBRID INDEX SETUP
Host: http://localhost:9201
Index: arxiv-papers-chunks
Health: Healthy

Index created: False
RRF pipeline created: True

Current index stats:
  Documents: 5
  Size: 171,033 bytes
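
`setup_indices` is not shown in this notebook. As a hedged sketch of the two things it reports on, a hybrid index needs a k-NN vector field alongside the text fields, and a search pipeline that fuses the BM25 and vector rankings. The field name `embedding`, the HNSW method settings, the pipeline description, and the rank constant below are illustrative assumptions, not the project's configuration. The `score-ranker-processor` with the `rrf` technique is only available in recent OpenSearch releases, so the version in use should be checked.

```python
EMBEDDING_DIM = 1024  # matches the Jina vectors generated earlier

# Assumed request body for creating the chunk index
hybrid_index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "arxiv_id": {"type": "keyword"},
            "title": {"type": "text"},
            "chunk_text": {"type": "text"},       # BM25 side
            "chunk_index": {"type": "integer"},
            "embedding": {                        # vector side (hypothetical name)
                "type": "knn_vector",
                "dimension": EMBEDDING_DIM,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                },
            },
        }
    },
}

# Assumed request body for the RRF search pipeline
rrf_pipeline_body = {
    "description": "Fuse BM25 and k-NN rankings with reciprocal rank fusion",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {"technique": "rrf", "rank_constant": 60}
            }
        }
    ],
}

print(hybrid_index_body["mappings"]["properties"]["embedding"]["dimension"])
```

These dictionaries would be sent as the bodies of the create-index and put-search-pipeline requests. In the recorded run the index already existed, which is presumably why `Index created` printed `False`.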


## 6. Hybrid Indexing Pipeline — Chunk → Embed → Index

This is the core Week 4 pipeline. For each paper:
1. TextChunker splits it into ~600-word overlapping chunks
2. Jina API embeds each chunk into a 1024-dim vector
3. OpenSearch stores chunk text + embedding + paper metadata
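
A minimal sketch of how these three steps might fit together, using a stand-in embedder so it runs without a Jina key. This is not the service returned by `make_hybrid_indexing_service`; `fake_embed`, `build_chunk_documents`, the batch size, and the hypothetical `embedding` field are assumptions for illustration, while the other document field names mirror the chunk fields this notebook prints when inspecting the index.

```python
from typing import Callable, Dict, List


def fake_embed(texts: List[str], dim: int = 1024) -> List[List[float]]:
    """Stand-in for the embedding API: one constant vector per text."""
    return [[float(len(t) % 7)] * dim for t in texts]


def build_chunk_documents(
    paper: Dict[str, str],
    chunks: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[dict]:
    """Embed chunks in batches and attach paper metadata to each one."""
    docs = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed(batch)  # one API call per batch in a real pipeline
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        for offset, (text, vector) in enumerate(zip(batch, vectors)):
            docs.append({
                "arxiv_id": paper["arxiv_id"],
                "title": paper["title"],
                "chunk_index": start + offset,
                "chunk_text": text,
                "chunk_word_count": len(text.split()),
                "embedding": vector,
            })
    return docs


paper = {"arxiv_id": "0000.00000", "title": "Example paper"}  # made-up paper
chunks = ["first chunk of text", "second chunk", "third and final chunk here"]
docs = build_chunk_documents(paper, chunks, fake_embed)
print(len(docs), [d["chunk_index"] for d in docs])
```

Each resulting dictionary is what a bulk index request would store as one OpenSearch document: the chunk text for BM25, the vector for k-NN, and enough paper metadata to display a hit.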

In [8]:
# Run the full hybrid indexing pipeline
from src.services.indexing.factory import make_hybrid_indexing_service

print("HYBRID INDEXING PIPELINE")
print("=" * 50)

if not jina_api_key:
    print("No JINA_API_KEY. Cannot run indexing pipeline.")
else:
    # Create the fully-wired indexing service
    indexing_service = make_hybrid_indexing_service(
        settings=settings,
        opensearch_host="http://localhost:9201"
    )

    # Delete existing chunks first (clean slate)
    print("Clearing existing chunks...")
    for paper in sample_papers:
        opensearch_client.delete_paper_chunks(paper["arxiv_id"])

    # Index all papers
    print(f"\nIndexing {len(sample_papers)} papers...\n")

    total_stats = await indexing_service.index_papers_batch(
        papers=sample_papers,
        replace_existing=False,
    )

    print(f"\nPipeline Results:")
    print(f"  Papers processed: {total_stats['papers_processed']}")
    print(f"  Chunks created: {total_stats['total_chunks_created']}")
    print(f"  Chunks indexed: {total_stats['total_chunks_indexed']}")
    print(f"  Embeddings generated: {total_stats['total_embeddings_generated']}")
    print(f"  Errors: {total_stats['total_errors']}")

    # Verify in OpenSearch
    import time
    time.sleep(1)  # Give OpenSearch time to refresh so the new chunks are visible to searches and stats
    stats = opensearch_client.get_index_stats()
    print(f"\nOpenSearch index now has {stats.get('document_count', 0)} documents")

HYBRID INDEXING PIPELINE
Clearing existing chunks...

Indexing 5 papers...


Pipeline Results:
  Papers processed: 5
  Chunks created: 5
  Chunks indexed: 5
  Embeddings generated: 5
  Errors: 0

OpenSearch index now has 5 documents


In [9]:
# Inspect indexed chunks for one paper
print("INSPECT INDEXED CHUNKS")
print("=" * 50)

if test_paper:
    indexed_chunks = opensearch_client.get_chunks_by_paper(test_paper["arxiv_id"])
    print(f"Paper: {test_paper['arxiv_id']}")
    print(f"Chunks in OpenSearch: {len(indexed_chunks)}")

    for chunk in indexed_chunks[:3]:
        print(f"\n  Chunk {chunk.get('chunk_index', '?')}:")
        print(f"    Section: {chunk.get('section_title', 'N/A')}")
        print(f"    Words: {chunk.get('chunk_word_count', 0)}")
        print(f"    Model: {chunk.get('embedding_model', 'N/A')}")
        text_preview = chunk.get('chunk_text', '')[:120]
        print(f"    Text: {text_preview}...")

INSPECT INDEXED CHUNKS
Paper: 2508.11121
Chunks in OpenSearch: 1

  Chunk 0:
    Section: None
    Words: 213
    Model: jina-embeddings-v3
    Text: Spreadsheet manipulation software are widely used for data management and analysis of tabular data, yet the creation of ...


## 7. Search Mode Comparison — BM25 vs Vector vs Hybrid

In [10]:
# Test BM25 keyword search
print("MODE 1: BM25 KEYWORD SEARCH")
print("=" * 50)

test_queries = ["machine learning", "neural network", "optimization"]

for query in test_queries:
    results = opensearch_client.search_papers(query=query, size=3)
    total = results.get("total", 0)
    print(f"\n  Query: '{query}' -> {total} results")
    for hit in results.get("hits", [])[:2]:
        print(f"    [{hit.get('score', 0):.2f}] {hit.get('title', 'N/A')[:60]}...")

MODE 1: BM25 KEYWORD SEARCH

  Query: 'machine learning' -> 4 results
    [1.12] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcu...
    [1.22] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Unde...

  Query: 'neural network' -> 1 results
    [6.42] Tabularis Formatus: Predictive Formatting for Tables...

  Query: 'optimization' -> 2 results
    [2.72] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Unde...
    [4.26] Quantization through Piecewise-Affine Regularization: Optimi...


In [11]:
# Test vector similarity search
print("MODE 2: VECTOR SIMILARITY SEARCH")
print("=" * 50)

if not jina_api_key:
    print("No JINA_API_KEY. Skipping vector search.")
else:
    embed_client = JinaEmbeddingsClient(api_key=jina_api_key)

    for query in test_queries:
        query_vec = await embed_client.embed_query(query)
        results = opensearch_client.search_chunks_vectors(
            query_embedding=query_vec, size=3
        )
        total = results.get("total", 0)
        print(f"\n  Query: '{query}' -> {total} results")
        for hit in results.get("hits", [])[:2]:
            print(f"    [{hit.get('score', 0):.4f}] {hit.get('title', 'N/A')[:60]}...")

    await embed_client.close()

MODE 2: VECTOR SIMILARITY SEARCH

  Query: 'machine learning' -> 3 results
    [0.6281] Quantization through Piecewise-Affine Regularization: Optimi...
    [0.6060] Tabularis Formatus: Predictive Formatting for Tables...

  Query: 'neural network' -> 3 results
    [0.6241] Tabularis Formatus: Predictive Formatting for Tables...
    [0.6088] Quantization through Piecewise-Affine Regularization: Optimi...

  Query: 'optimization' -> 3 results
    [0.6233] Quantization through Piecewise-Affine Regularization: Optimi...
    [0.6002] Tabularis Formatus: Predictive Formatting for Tables...


In [12]:
# Test hybrid search (BM25 + Vector with RRF)
print("MODE 3: HYBRID SEARCH (BM25 + VECTOR + RRF)")
print("=" * 50)

if not jina_api_key:
    print("No JINA_API_KEY. Skipping hybrid search.")
else:
    embed_client = JinaEmbeddingsClient(api_key=jina_api_key)

    for query in test_queries:
        query_vec = await embed_client.embed_query(query)
        results = opensearch_client.search_unified(
            query=query,
            query_embedding=query_vec,
            size=3,
            use_hybrid=True,
        )
        total = results.get("total", 0)
        print(f"\n  Query: '{query}' -> {total} results (RRF fused)")
        for hit in results.get("hits", [])[:2]:
            section = hit.get("section_title", "N/A")
            print(f"    [{hit.get('score', 0):.4f}] {hit.get('title', 'N/A')[:50]}... | section: {section}")

    await embed_client.close()

MODE 3: HYBRID SEARCH (BM25 + VECTOR + RRF)

  Query: 'machine learning' -> 3 results (RRF fused)
    [0.0323] Quantization through Piecewise-Affine Regularizati... | section: None
    [0.0164] PyraTok: Language-Aligned Pyramidal Tokenizer for ... | section: None

  Query: 'neural network' -> 3 results (RRF fused)
    [0.0328] Tabularis Formatus: Predictive Formatting for Tabl... | section: None
    [0.0161] Quantization through Piecewise-Affine Regularizati... | section: None

  Query: 'optimization' -> 3 results (RRF fused)
    [0.0328] Quantization through Piecewise-Affine Regularizati... | section: None
    [0.0161] Tabularis Formatus: Predictive Formatting for Tabl... | section: None
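
The fused scores above are what Reciprocal Rank Fusion produces when each result list contributes 1/(k + rank) with k = 60. A document ranked first in both lists scores 2/61 ≈ 0.0328, one ranked first in one list and third in the other scores 1/61 + 1/63 ≈ 0.0323, and one that appears in a single list at rank one or two scores 1/61 ≈ 0.0164 or 1/62 ≈ 0.0161. The rank constant of 60 is inferred from these numbers rather than stated in the notebook, and the sketch below is a plain-Python illustration with made-up document IDs, not OpenSearch's implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so BM25 and cosine scores never need
            # to be put on a common scale.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Mirrors the 'neural network' query: BM25 found one paper, k-NN found three
bm25_ranking = ["paper_a"]
vector_ranking = ["paper_a", "paper_b", "paper_c"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# paper_a: 0.0328
# paper_b: 0.0161
# paper_c: 0.0159
```

Because the fused score depends only on rank positions, a paper that both retrievers agree on rises to the top even when its raw BM25 and vector scores are on very different scales.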


## 8. Test Production API Endpoint

In [13]:
# Test the /api/v1/hybrid-search/ endpoint
import requests

API_BASE = "http://localhost:8000/api/v1"

print("PRODUCTION API ENDPOINT TESTS")
print("=" * 50)

# Test 1: BM25-only via hybrid endpoint (use_hybrid=False)
print("\n--- Test 1: BM25-Only Search ---")
try:
    response = requests.post(
        f"{API_BASE}/hybrid-search/",
        json={"query": "neural network", "use_hybrid": False, "size": 3},
        timeout=10,
    )
    if response.status_code == 200:
        data = response.json()
        print(f"  Search mode: {data['search_mode']}")
        print(f"  Total results: {data['total']}")
        for hit in data["hits"][:2]:
            print(f"    [{hit['score']:.2f}] {hit['title'][:55]}...")
    else:
        print(f"  Failed: HTTP {response.status_code} - {response.text[:200]}")
except Exception as e:
    print(f"  Error: {e}")

# Test 2: Hybrid search (use_hybrid=True)
print("\n--- Test 2: Hybrid Search (BM25 + Vector) ---")
try:
    response = requests.post(
        f"{API_BASE}/hybrid-search/",
        json={"query": "deep learning optimization", "use_hybrid": True, "size": 3},
        timeout=30,  # Longer timeout — includes Jina API call
    )
    if response.status_code == 200:
        data = response.json()
        print(f"  Search mode: {data['search_mode']}")
        print(f"  Total results: {data['total']}")
        for hit in data["hits"][:2]:
            chunk_preview = (hit.get("chunk_text") or "")[:80]
            print(f"    [{hit['score']:.4f}] {hit['title'][:55]}...")
            if chunk_preview:
                print(f"      Chunk: {chunk_preview}...")
    else:
        print(f"  Failed: HTTP {response.status_code} - {response.text[:200]}")
except Exception as e:
    print(f"  Error: {e}")

# Test 3: With category filter
print("\n--- Test 3: Hybrid Search with Category Filter ---")
try:
    response = requests.post(
        f"{API_BASE}/hybrid-search/",
        json={
            "query": "transformer attention",
            "use_hybrid": True,
            "size": 3,
            "categories": ["cs.AI", "cs.LG"],
        },
        timeout=30,
    )
    if response.status_code == 200:
        data = response.json()
        print(f"  Search mode: {data['search_mode']}")
        print(f"  Total results: {data['total']}")
        for hit in data["hits"][:2]:
            print(f"    [{hit['score']:.4f}] {hit['title'][:55]}...")
    else:
        print(f"  Failed: HTTP {response.status_code} - {response.text[:200]}")
except Exception as e:
    print(f"  Error: {e}")

print(f"\nSwagger UI: http://localhost:8000/docs")

PRODUCTION API ENDPOINT TESTS

--- Test 1: BM25-Only Search ---
  Search mode: bm25
  Total results: 1
    [6.42] Tabularis Formatus: Predictive Formatting for Tables...

--- Test 2: Hybrid Search (BM25 + Vector) ---
  Search mode: hybrid
  Total results: 3
    [0.0328] Quantization through Piecewise-Affine Regularization: O...
      Chunk: Optimization problems over discrete or quantized variables are very challenging ...
    [0.0161] Tabularis Formatus: Predictive Formatting for Tables...
      Chunk: Spreadsheet manipulation software are widely used for data management and analys...

--- Test 3: Hybrid Search with Category Filter ---
  Search mode: hybrid
  Total results: 3
    [0.0328] PyraTok: Language-Aligned Pyramidal Tokenizer for Video...
    [0.0161] Why Can't I Open My Drawer? Mitigating Object-Driven Sh...

Swagger UI: http://localhost:8000/docs


## 9. Performance Comparison

In [14]:
# Performance comparison across all search modes
import time

print("SEARCH PERFORMANCE COMPARISON")
print("=" * 60)

query = "machine learning artificial intelligence"
print(f"Query: '{query}'\n")

results_table = []

# BM25 via client
start = time.time()
try:
    bm25_res = opensearch_client.search_papers(query=query, size=5)
    bm25_time = time.time() - start
    results_table.append(("Client BM25", bm25_time, bm25_res.get("total", 0)))
except Exception as e:
    results_table.append(("Client BM25", 0, f"Error: {e}"))

# BM25 via API
start = time.time()
try:
    r = requests.post(f"{API_BASE}/hybrid-search/", json={"query": query, "use_hybrid": False, "size": 5}, timeout=10)
    api_bm25_time = time.time() - start
    results_table.append(("API BM25", api_bm25_time, r.json()["total"] if r.status_code == 200 else "Error"))
except Exception as e:
    results_table.append(("API BM25", 0, f"Error: {e}"))

# Hybrid via API
start = time.time()
try:
    r = requests.post(f"{API_BASE}/hybrid-search/", json={"query": query, "use_hybrid": True, "size": 5}, timeout=30)
    api_hybrid_time = time.time() - start
    if r.status_code == 200:
        d = r.json()
        results_table.append((f"API Hybrid ({d['search_mode']})", api_hybrid_time, d["total"]))
    else:
        results_table.append(("API Hybrid", api_hybrid_time, f"HTTP {r.status_code}"))
except Exception as e:
    results_table.append(("API Hybrid", 0, f"Error: {e}"))

# Display results
print(f"{'Method':<25} {'Time (s)':<12} {'Results'}")
print("-" * 50)
for method, t, count in results_table:
    print(f"{method:<25} {t:<12.3f} {count}")

print(f"\nNotes:")
print(f"  - BM25 is fast (well under a second): pure keyword matching in OpenSearch")
print(f"  - Hybrid adds a Jina API round trip for the query embedding, which dominates its latency")
print(f"  - Hybrid provides semantic matching that BM25 cannot")

SEARCH PERFORMANCE COMPARISON
Query: 'machine learning artificial intelligence'

Method                    Time (s)     Results
--------------------------------------------------
Client BM25               0.163        4
API BM25                  0.185        4
API Hybrid (hybrid)       1.016        5

Notes:
  - BM25 is fast (well under a second): pure keyword matching in OpenSearch
  - Hybrid adds a Jina API round trip for the query embedding, which dominates its latency
  - Hybrid provides semantic matching that BM25 cannot
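
Each figure in the table comes from a single call, so it includes one-off costs such as connection setup and is sensitive to noise. A somewhat more robust approach, sketched here under the assumption that the search calls are safe to repeat, is to time several runs with `time.perf_counter` and report the median. The helper name `median_latency` and its defaults are hypothetical.

```python
import statistics
import time


def median_latency(fn, repeats=5, warmup=1):
    """Call fn several times and return the median wall-clock time in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (connections, caches)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs on its own. In this notebook, fn could
# be something like: lambda: opensearch_client.search_papers(query=query, size=5)
latency = median_latency(lambda: sum(range(10_000)))
print(f"median latency: {latency * 1000:.3f} ms")
```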


## 10. Graceful Degradation Test

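Verify that `search_unified` falls back to BM25 when no query embedding is available (for example, because embedding generation failed) or when hybrid mode is disabled.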

In [15]:
# Test search_unified fallback behavior
print("GRACEFUL DEGRADATION TEST")
print("=" * 50)

# Case 1: No embedding provided -> should use BM25
print("\n1. No embedding (query_embedding=None):")
results = opensearch_client.search_unified(
    query="neural network",
    query_embedding=None,
    use_hybrid=True,
    size=3,
)
print(f"   Results: {results.get('total', 0)} (should use BM25 fallback)")

# Case 2: use_hybrid=False -> should use BM25
print("\n2. Hybrid disabled (use_hybrid=False):")
results = opensearch_client.search_unified(
    query="neural network",
    query_embedding=[0.1] * 1024,  # Dummy embedding
    use_hybrid=False,
    size=3,
)
print(f"   Results: {results.get('total', 0)} (should use BM25)")

# Case 3: Both provided -> should use hybrid
if jina_api_key:
    print("\n3. Full hybrid (embedding + use_hybrid=True):")
    embed_client = JinaEmbeddingsClient(api_key=jina_api_key)
    query_vec = await embed_client.embed_query("neural network")
    results = opensearch_client.search_unified(
        query="neural network",
        query_embedding=query_vec,
        use_hybrid=True,
        size=3,
    )
    print(f"   Results: {results.get('total', 0)} (should use hybrid RRF)")
    await embed_client.close()

print(f"\nDegradation test complete!")

GRACEFUL DEGRADATION TEST

1. No embedding (query_embedding=None):
   Results: 1 (should use BM25 fallback)

2. Hybrid disabled (use_hybrid=False):
   Results: 1 (should use BM25)

3. Full hybrid (embedding + use_hybrid=True):
   Results: 3 (should use hybrid RRF)

Degradation test complete!
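
The three cases above suggest a simple dispatch rule inside `search_unified`: hybrid only when it is requested and an embedding is present, otherwise BM25. The function below is a hypothetical reconstruction of that rule for illustration; the real method's internals are not shown in this notebook. The mode strings match the `search_mode` values returned by the API endpoint.

```python
from typing import Optional, Sequence


def choose_search_mode(
    use_hybrid: bool, query_embedding: Optional[Sequence[float]]
) -> str:
    """Pick the search mode the way the degradation test expects (assumed rule)."""
    has_embedding = query_embedding is not None and len(query_embedding) > 0
    if use_hybrid and has_embedding:
        return "hybrid"
    return "bm25"  # fallback: keyword search needs no embedding


print(choose_search_mode(True, None))           # case 1 -> bm25
print(choose_search_mode(False, [0.1] * 1024))  # case 2 -> bm25
print(choose_search_mode(True, [0.1] * 1024))   # case 3 -> hybrid
```

The notebook cell only prints result counts, so asserting on the mode, as this rule makes possible, would be a stricter check that the fallback actually happened.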


## Summary

### What We Built & Tested

1. **TextChunker** — Section-aware chunking with 600-word target, 100-word overlap
2. **Jina Embeddings** — 1024-dim vectors with asymmetric encoding (passage vs query)
3. **Hybrid Indexing Pipeline** — Chunk → Embed → Bulk index into OpenSearch
4. **Three Search Modes**:
   - **BM25**: Fast keyword matching (about 0.16-0.19 s in the timing run above)
   - **Vector**: Semantic similarity via KNN
   - **Hybrid**: BM25 + Vector fused with RRF pipeline
5. **Production API** — `/api/v1/hybrid-search/` with graceful degradation
6. **Graceful Degradation** — Falls back to BM25 when embeddings fail

### Architecture
```
Paper → TextChunker → Chunks → Jina API → Embeddings → OpenSearch
                                                            ↓
                      Query → Embed → BM25 + KNN → RRF → Results
```

### Next Steps (Week 5)
- LLM integration (Ollama) for answer generation from search results
- Complete RAG pipeline: Query → Search → Context → Generate → Response
- Conversation memory and context management