# Week 3: Keyword Search First - The Critical Foundation

> **The 90% Problem:** Most RAG systems jump straight to vector search and miss the foundation that powers the best retrieval systems. We're doing it right!

## What We're Building This Week

Week 3 focuses on implementing OpenSearch integration for full-text search capabilities using BM25 scoring. This transforms our system from a simple storage solution into a searchable knowledge base.

### Core Objectives
- **OpenSearch Integration**: Connect our FastAPI application to an OpenSearch cluster
- **Index Management**: Create and manage the `arxiv-papers-chunks` index with proper mappings
- **BM25 Search**: Implement full-text search with relevance scoring
- **Data Pipeline**: Transfer papers from PostgreSQL to OpenSearch
- **Search API**: Expose search functionality through REST endpoints

### What We'll Test In This Notebook
1. **Infrastructure Verification** - Ensure all services from Weeks 1-2 are running
2. **OpenSearch Service Integration** - Test client creation and health checks
3. **Index Creation & Management** - Create the `arxiv-papers-chunks` index with proper mappings
4. **Data Pipeline** - Transfer papers from PostgreSQL to OpenSearch
5. **BM25 Search Functionality** - Test search queries with relevance scoring
6. **Search API Endpoints** - Verify FastAPI search endpoints work correctly

---

## Environment Setup

In [1]:
# Environment Setup and Path Configuration
import sys
from pathlib import Path
import json
import requests

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Environment: {sys.executable}")

# Find project root and add to Python path
current_dir = Path.cwd()
if current_dir.name == "week3" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"Project root: {project_root}")
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
else:
    # Stop this cell with a clear message instead of calling the interactive-only exit()
    raise SystemExit("Missing compose.yml - check directory")

Python Version: 3.12.7
Environment: /Users/nishantgaurav/Project/PaperAlchemy/.venv/bin/python
Project root: /Users/nishantgaurav/Project/PaperAlchemy


## 1. Infrastructure Verification

In [2]:
# Service Health Verification
print("WEEK 3 PREREQUISITE CHECK")
print("=" * 50)

services_to_test = {
    "FastAPI": "http://localhost:8000/health",
    "OpenSearch": "http://localhost:9201",
    "Ollama": "http://localhost:11434/api/version",
    "Airflow": "http://localhost:8080/health",
}

all_healthy = True

for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.ConnectionError:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False
    except Exception as e:
        print(f"✗ {service_name}: {type(e).__name__}")
        all_healthy = False

# Test PostgreSQL directly
print("\nChecking PostgreSQL...")
try:
    from src.db.factory import make_database
    db = make_database()
    if db.health_check():
        print("✓ PostgreSQL: Healthy")
    else:
        print("✗ PostgreSQL: Not accessible")
        all_healthy = False
except Exception as e:
    print(f"✗ PostgreSQL: {e}")
    all_healthy = False

print()
if all_healthy:
    print("All services healthy! Ready for Week 3 OpenSearch integration.")
else:
    print("Some services need attention. Please run: docker compose up --build -d")

WEEK 3 PREREQUISITE CHECK
✓ FastAPI: Healthy
✓ OpenSearch: Healthy
✓ Ollama: Healthy
✓ Airflow: Healthy

Checking PostgreSQL...
2026-01-29 23:52:53,920 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2026-01-29 23:52:53,920 INFO sqlalchemy.engine.Engine [raw sql] {}
2026-01-29 23:52:53,933 INFO sqlalchemy.engine.Engine select current_schema()
2026-01-29 23:52:53,933 INFO sqlalchemy.engine.Engine [raw sql] {}
2026-01-29 23:52:53,964 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2026-01-29 23:52:53,965 INFO sqlalchemy.engine.Engine [raw sql] {}
2026-01-29 23:52:53,973 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2026-01-29 23:52:53,980 INFO sqlalchemy.engine.Engine SELECT pg_catalog.pg_class.relname 
FROM pg_catalog.pg_class JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace 
WHERE pg_catalog.pg_class.relname = %(table_name)s AND pg_catalog.pg_class.relkind = ANY (ARRAY[%(param_1)s, %(param_2)s, %(param_3)s, %(param

## 2. OpenSearch Client Setup

In [3]:
# OpenSearch Client Setup
from src.services.opensearch.factory import make_opensearch_client_fresh
from src.config import get_settings

print("OPENSEARCH CLIENT SETUP")
print("=" * 40)

settings = get_settings()

# Create fresh OpenSearch client (localhost for notebook)
opensearch_client = make_opensearch_client_fresh(
    settings=settings,
    host="http://localhost:9201"  # PaperAlchemy OpenSearch port
)

print(f"Client configured with host: {opensearch_client.host}")
print(f"Index name: {opensearch_client.index_name}")

# Test health check
is_healthy = opensearch_client.health_check()
if is_healthy:
    print("✓ OpenSearch health check: PASSED")
    
    # Show cluster info
    cluster_health = opensearch_client.client.cluster.health()
    print(f"   Cluster: {cluster_health['cluster_name']}")
    print(f"   Status: {cluster_health['status']}")
    print(f"   Nodes: {cluster_health['number_of_nodes']}")
else:
    print("✗ OpenSearch health check: FAILED")

OPENSEARCH CLIENT SETUP
Client configured with host: http://localhost:9201
Index name: arxiv-papers-chunks
✓ OpenSearch health check: PASSED
   Cluster: docker-cluster
   Status: yellow
   Nodes: 1


## Index Configuration

In [4]:
# Display Index Configuration
from src.services.opensearch.index_config import ARXIV_PAPERS_INDEX, ARXIV_PAPERS_CHUNKS_MAPPING

print("INDEX CONFIGURATION")
print("=" * 40)
print(f"Index Name: {opensearch_client.index_name}")
print(f"\nKey Features:")
print("• Custom text analyzers for better search")
print("• Multi-field mapping (text + keyword)")
print("• Strict dynamic mapping")
print("\nField Types:")

properties = ARXIV_PAPERS_CHUNKS_MAPPING["mappings"]["properties"]
for field_name, config in properties.items():
    field_type = config.get("type")
    analyzer = config.get("analyzer", "")
    if analyzer:
        print(f"  • {field_name}: {field_type} [{analyzer}]")
    else:
        print(f"  • {field_name}: {field_type}")

INDEX CONFIGURATION
Index Name: arxiv-papers-chunks

Key Features:
• Custom text analyzers for better search
• Multi-field mapping (text + keyword)
• Strict dynamic mapping

Field Types:
  • chunk_id: keyword
  • arxiv_id: keyword
  • paper_id: keyword
  • chunk_index: integer
  • chunk_text: text [text_analyzer]
  • chunk_word_count: integer
  • start_char: integer
  • end_char: integer
  • embedding: knn_vector
  • title: text [text_analyzer]
  • authors: text [standard_analyzer]
  • abstract: text [text_analyzer]
  • categories: keyword
  • published_date: date
  • section_title: keyword
  • embedding_model: keyword
  • created_at: date
  • updated_at: date
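
The three key features listed above (custom analyzers, multi-field mapping, strict dynamic mapping) are expressed in the index body. The actual `ARXIV_PAPERS_CHUNKS_MAPPING` is not printed in this notebook, so the following is only a hypothetical, trimmed-down sketch of how such a body is typically written. The analyzer's filter chain, the `raw` subfield name, and the helper `unmapped_fields` are assumptions for illustration.

```python
# Hypothetical, trimmed-down index body; NOT the project's actual mapping.
EXAMPLE_INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Assumed definition: standard tokenizer, lowercasing,
                # English stop-word removal, and Porter stemming.
                "text_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        # "strict" makes OpenSearch reject documents with unmapped fields.
        "dynamic": "strict",
        "properties": {
            "arxiv_id": {"type": "keyword"},
            "title": {
                "type": "text",
                "analyzer": "text_analyzer",
                # Multi-field: analyzed text for search, keyword for exact match.
                "fields": {"raw": {"type": "keyword"}},
            },
            "categories": {"type": "keyword"},
            "published_date": {"type": "date"},
        },
    },
}


def unmapped_fields(doc: dict, index_body: dict) -> list[str]:
    """Return the document keys that a strict mapping would reject."""
    allowed = index_body["mappings"]["properties"].keys()
    return sorted(key for key in doc if key not in allowed)


print(unmapped_fields({"title": "A paper", "pdf_url": "http://example.org"}, EXAMPLE_INDEX_BODY))
```

A check like this is why the data pipeline later builds each document from mapped fields only: with strict dynamic mapping, an extra field causes the indexing request to fail rather than silently extending the schema.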


### Create Index

In [5]:
# Create Index
print("INDEX CREATION")
print("=" * 40)

try:
    # Setup indices (creates hybrid index + RRF pipeline)
    results = opensearch_client.setup_indices(force=False)
    
    if results.get("hybrid_index"):
        print(f"✓ Index '{opensearch_client.index_name}' created successfully!")
    else:
        print(f"✓ Index '{opensearch_client.index_name}' already exists")
    
    if results.get("rrf_pipeline"):
        print("✓ RRF search pipeline created")
    else:
        print("✓ RRF search pipeline already exists")
    
    # Get current index statistics
    stats = opensearch_client.get_index_stats()
    if stats and 'error' not in stats:
        print(f"\nCurrent Statistics:")
        print(f"   Documents: {stats.get('document_count', 0)}")
        print(f"   Size: {stats.get('size_in_bytes', 0):,} bytes")
            
except Exception as e:
    print(f"✗ Error with index management: {e}")

INDEX CREATION
✓ Index 'arxiv-papers-chunks' already exists
✓ RRF search pipeline created

Current Statistics:
   Documents: 5
   Size: 144,911 bytes


## 3. Data Pipeline - Index Papers from PostgreSQL

Transfer papers from PostgreSQL (Week 2) into OpenSearch for search.

In [6]:
# Load papers from PostgreSQL and index into OpenSearch
from src.db.factory import make_database
from src.models.paper import Paper

print("DATA PIPELINE: PostgreSQL -> OpenSearch")
print("=" * 50)

database = make_database()

try:
    with database.get_session() as session:
        # Get all papers from PostgreSQL
        papers = session.query(Paper).all()
        print(f"Found {len(papers)} papers in PostgreSQL")
        
        if not papers:
            print("\nNo papers found! Please run Week 2 notebook first to fetch papers.")
        else:
            # Prepare documents for indexing
            # Only include fields defined in ARXIV_PAPERS_CHUNKS_MAPPING
            docs = []
            for paper in papers:
                doc = {
                    "arxiv_id": paper.arxiv_id,
                    "title": paper.title,
                    "authors": paper.authors,
                    "abstract": paper.abstract,
                    "categories": paper.categories,
                    "published_date": paper.published_date.isoformat() if paper.published_date else None,
                    "chunk_text": paper.abstract,  # Use abstract as chunk text for now
                    "chunk_index": 0,
                    "chunk_id": f"{paper.arxiv_id}_0",
                    "chunk_word_count": len(paper.abstract.split()) if paper.abstract else 0,
                    "created_at": paper.created_at.isoformat() if paper.created_at else None,
                    "updated_at": paper.updated_at.isoformat() if paper.updated_at else None,
                }
                docs.append(doc)
            
            print(f"\nIndexing {len(docs)} documents into OpenSearch...")
            
            # Index documents one by one (for visibility)
            success_count = 0
            for doc in docs:
                try:
                    response = opensearch_client.client.index(
                        index=opensearch_client.index_name,
                        id=doc["chunk_id"],
                        body=doc,
                        refresh=True
                    )
                    if response["result"] in ["created", "updated"]:
                        success_count += 1
                        print(f"  ✓ [{doc['arxiv_id']}] {doc['title'][:50]}...")
                except Exception as e:
                    print(f"  ✗ [{doc['arxiv_id']}] Error: {e}")
            
            print(f"\nIndexed {success_count}/{len(docs)} documents successfully")
            
            # Verify index stats
            stats = opensearch_client.get_index_stats()
            print(f"\nIndex Statistics:")
            print(f"   Total documents: {stats.get('document_count', 0)}")
            print(f"   Index size: {stats.get('size_in_bytes', 0):,} bytes")

except Exception as e:
    print(f"✗ Pipeline error: {e}")

DATA PIPELINE: PostgreSQL -> OpenSearch
2026-01-29 23:53:05,317 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2026-01-29 23:53:05,322 INFO sqlalchemy.engine.Engine SELECT papers.id AS papers_id, papers.arxiv_id AS papers_arxiv_id, papers.title AS papers_title, papers.authors AS papers_authors, papers.abstract AS papers_abstract, papers.categories AS papers_categories, papers.published_date AS papers_published_date, papers.updated_date AS papers_updated_date, papers.pdf_url AS papers_pdf_url, papers.pdf_content AS papers_pdf_content, papers.sections AS papers_sections, papers.parsing_status AS papers_parsing_status, papers.parsing_error AS papers_parsing_error, papers.created_at AS papers_created_at, papers.updated_at AS papers_updated_at 
FROM papers
2026-01-29 23:53:05,323 INFO sqlalchemy.engine.Engine [generated in 0.00107s] {}
Found 5 papers in PostgreSQL

Indexing 5 documents into OpenSearch...
  ✓ [2508.11121] Tabularis Formatus: Predictive Formatting for Tabl...
  ✓ [2601.16210]
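
The cell above indexes documents one at a time with `refresh=True`, which is convenient for visibility on five papers. For larger batches the bulk API is usually a better fit. As a minimal sketch, the helper below (the name `build_bulk_actions` is hypothetical) builds actions in the format accepted by `opensearchpy.helpers.bulk`. The actual network call is left as a comment so the snippet runs without a cluster.

```python
from typing import Iterable, Iterator


def build_bulk_actions(docs: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Yield one bulk 'index' action per document, keyed by its chunk_id."""
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": doc["chunk_id"],
            "_source": doc,
        }


# Two stand-in documents using IDs from the output above.
sample_docs = [
    {"chunk_id": "2508.11121_0", "arxiv_id": "2508.11121", "chunk_index": 0},
    {"chunk_id": "2601.16210_0", "arxiv_id": "2601.16210", "chunk_index": 0},
]
actions = list(build_bulk_actions(sample_docs, "arxiv-papers-chunks"))
print(f"Prepared {len(actions)} bulk actions")

# With a live cluster, something like this would send them in one request:
# from opensearchpy import helpers
# success, errors = helpers.bulk(opensearch_client.client, actions, refresh=True)
```

Reusing `chunk_id` as the document `_id` keeps the operation idempotent: re-running the pipeline overwrites existing documents instead of duplicating them.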

## 4. Simple BM25 Search

Let's start with a simple search to demonstrate BM25 scoring:

In [7]:
# Simple BM25 Search
print("SIMPLE BM25 SEARCH")
print("=" * 40)

# Change this to any word from your papers
search_term = "learning"  # Try different terms!

print(f"Searching for: '{search_term}'\n")

results = opensearch_client.search_papers(
    query=search_term,
    size=5
)

if results.get('hits'):
    print(f"Found {results.get('total', 0)} total matches\n")
    
    for i, paper in enumerate(results['hits'], 1):
        print(f"{i}. {paper.get('title', 'Unknown')[:70]}...")
        print(f"   Score: {paper.get('score', 0):.2f}")
        print(f"   arXiv ID: {paper.get('arxiv_id', 'N/A')}\n")
else:
    print("No results found. Try searching for:")
    print("  • 'neural', 'model', 'algorithm'")
    print("  • Use '*' to see all papers")

SIMPLE BM25 SEARCH
Searching for: 'learning'

Found 4 total matches

1. Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero...
   Score: 1.12
   arXiv ID: 2601.16211

2. PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding ...
   Score: 1.22
   arXiv ID: 2601.16210

3. Tabularis Formatus: Predictive Formatting for Tables...
   Score: 0.83
   arXiv ID: 2508.11121

4. Quantization through Piecewise-Affine Regularization: Optimization and...
   Score: 0.92
   arXiv ID: 2508.11112
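
To give a feel for where these scores come from, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula with the commonly used defaults `k1 = 1.2` and `b = 0.75`. The whitespace tokenizer and the three-document toy corpus are assumptions for illustration, so the numbers will not match OpenSearch's output, which also depends on the index analyzer and index-level statistics.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase/whitespace tokenizer; a real analyzer does much more.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "deep learning for video understanding",
    "learning learning learning",
    "quantization of discrete variables",
]
for doc, score in zip(toy_corpus, bm25_scores("learning", toy_corpus)):
    print(f"{score:.3f}  {doc}")
```

The document that repeats the term in fewer words scores highest, and the document without the term scores zero, which is the behavior the ranking above relies on.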



## 5. Advanced OpenSearch Queries

Now let's explore different query types using the OpenSearch Python client directly. This shows the power of BM25 without needing vectors!

### 5.1 Match Query

The `match` query is the standard query for full-text search on a single field:

In [8]:
# Match Query - Search in title field
print("MATCH QUERY - Single Field Search")
print("=" * 40)

query = {
    "query": {
        "match": {
            "title": "machine learning"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")
    print(f"Score: {hit['_score']:.2f}\n")

MATCH QUERY - Single Field Search
Found 0 results
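
By default, `match` analyzes the query text and combines the resulting terms with OR, so `"machine learning"` would match a title containing either term. Zero hits here therefore suggests that, after analysis, none of the five indexed titles contains either word. The sketch below (the helper name `match_query` is hypothetical) builds the same request body and exposes the `operator` option for requiring all terms.

```python
def match_query(field: str, text: str, operator: str = "or", size: int = 3) -> dict:
    """Build a single-field match query body for client.search(body=...)."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {
        "query": {
            "match": {
                # The expanded form allows per-query options such as `operator`.
                field: {"query": text, "operator": operator}
            }
        },
        "size": size,
    }


any_term = match_query("title", "machine learning")          # either term may match
all_terms = match_query("abstract", "machine learning", "and")  # both terms required
print(all_terms)
```

Pointing the same query at `abstract` instead of `title` is a quick way to check whether the terms occur anywhere in the indexed text.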



### 5.2 Multi-Match Query

Search across multiple fields simultaneously:

In [9]:
# Multi-Match Query - Search across multiple fields
print("MULTI-MATCH QUERY - Search Multiple Fields")
print("=" * 40)

query = {
    "query": {
        "multi_match": {
            "query": "AI Agents",
            "fields": ["title^2", "abstract", "authors"],  # ^2 boosts title field
            "type": "best_fields"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")
    print(f"Score: {hit['_score']:.2f}")
    authors = hit['_source'].get('authors', [])
    if authors:
        print(f"Authors: {', '.join(authors[:2])}...\n")
    else:
        print()

MULTI-MATCH QUERY - Search Multiple Fields
Found 0 results



### 5.3 Boosting Query

Rank documents that match a positive query while lowering the score of those that also match a negative query:

In [10]:
# Boosting Query - Promote and demote results
print("BOOSTING QUERY - Promote/Demote Results")
print("=" * 40)

query = {
    "query": {
        "boosting": {
            "positive": {
                "match": {
                    "abstract": "deep learning"
                }
            },
            "negative": {
                "match": {
                    "abstract": "multimodal"
                }
            },
            "negative_boost": 0.1  # Reduce score of negative matches
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: Boost 'deep learning', demote 'multimodal' papers\n")
print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    abstract_snippet = hit['_source'].get('abstract', '')[:100]
    print(f"Title: {title}...")
    print(f"Score: {hit['_score']:.2f}")
    print(f"Abstract: {abstract_snippet}...\n")

BOOSTING QUERY - Promote/Demote Results
Query: Boost 'deep learning', demote 'multimodal' papers

Found 4 results

Title: PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding ...
Score: 0.41
Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet ex...

Title: Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero...
Score: 0.37
Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and ...

Title: Quantization through Piecewise-Affine Regularization: Optimization and...
Score: 0.31
Abstract: Optimization problems over discrete or quantized variables are very challenging in general due to th...
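
In a `boosting` query, documents matching the `negative` clause are not removed; their score from the `positive` clause is multiplied by `negative_boost`. A small sketch of that arithmetic, using made-up positive scores and a hypothetical helper `boosting_score`:

```python
def boosting_score(positive_score: float, matches_negative: bool, negative_boost: float = 0.1) -> float:
    """Final score of a hit under a boosting query."""
    if matches_negative:
        # Demoted, not excluded: the hit stays in the results with a lower score.
        return positive_score * negative_boost
    return positive_score


# (title, positive score, matches the negative clause?) -- illustrative values only.
candidates = [
    ("Paper A", 2.0, True),
    ("Paper B", 0.5, False),
    ("Paper C", 1.0, False),
]
ranked = sorted(
    ((title, boosting_score(score, negative)) for title, score, negative in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for title, score in ranked:
    print(f"{score:.2f}  {title}")
```

Paper A has the strongest positive match but drops to the bottom once the `negative_boost` of 0.1 is applied.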



### 5.4 Filter Query

Restrict results with a `filter` clause, which narrows the result set without contributing to the relevance score:

In [11]:
# Filter Query - Filter by categories
print("FILTER QUERY - Category Filtering")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "abstract": "neural"
                    }
                }
            ],
            "filter": [
                {
                    "terms": {
                        "categories": ["cs.AI"]
                    }
                }
            ]
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    categories = ', '.join(hit['_source'].get('categories', []))
    print(f"Title: {title}...")
    print(f"Categories: {categories}")
    print(f"Score: {hit['_score']:.2f}\n")

FILTER QUERY - Category Filtering
Found 1 results

Title: Tabularis Formatus: Predictive Formatting for Tables...
Categories: cs.DB, cs.AI, cs.SE
Score: 1.34
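
Clauses under `must` run in query context and contribute to `_score`, while clauses under `filter` only decide whether a document is included. As a minimal sketch, the hypothetical helper `filtered_query` below assembles such a body from optional category and date restrictions, using the field names from this notebook's index.

```python
def filtered_query(field, text, categories=None, published_from=None, size=3):
    """Build a bool query: scored match in `must`, unscored restrictions in `filter`."""
    filters = []
    if categories:
        # `terms` matches documents whose keyword field has any of the values.
        filters.append({"terms": {"categories": list(categories)}})
    if published_from:
        filters.append({"range": {"published_date": {"gte": published_from}}})

    bool_query = {"must": [{"match": {field: text}}]}
    if filters:
        bool_query["filter"] = filters
    return {"query": {"bool": bool_query}, "size": size}


body = filtered_query("abstract", "neural", categories=["cs.AI"], published_from="2024-01-01")
print(body)
```

The resulting dictionary can be passed as `body=` to `opensearch_client.client.search(...)` in the same way as the hand-written query above.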



### 5.5 Sorting Query

Sort results by a field instead of relevance, here by publication date:

In [12]:
# Sorting Query - Sort by publication date
print("SORTING QUERY - Latest Papers First")
print("=" * 40)

query = {
    "query": {
        "match_all": {}  # Get all papers
    },
    "sort": [
        {
            "published_date": {
                "order": "desc"  # Latest first
            }
        }
    ],
    "size": 5
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: All papers sorted by publication date (newest first)\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = str(hit['_source'].get('published_date', 'N/A'))[:10]
    print(f"Date: {pub_date} | {title}...")

SORTING QUERY - Latest Papers First
Query: All papers sorted by publication date (newest first)

Date: 2026-01-22 | Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero...
Date: 2026-01-22 | PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding ...
Date: 2025-08-14 | Tabularis Formatus: Predictive Formatting for Tables...
Date: 2025-08-14 | Quantization through Piecewise-Affine Regularization: Optimization and...
Date: 2025-08-14 | Diffusion is a code repair operator and generator...
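
Several papers above share the same publication date, so their relative order is not determined by the sort key alone. A common remedy is a secondary sort key. Also note that when sorting by a field, scores are not computed unless `track_scores` is enabled. The sketch below (the helper name `latest_first` and the `arxiv_id` tie-breaker are assumptions) builds a paginated, date-sorted request body.

```python
def latest_first(page: int = 0, page_size: int = 5, track_scores: bool = False) -> dict:
    """Build a match_all query sorted by newest first, with a stable tie-breaker."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "query": {"match_all": {}},
        "sort": [
            {"published_date": {"order": "desc"}},
            # Tie-breaker on a keyword field so same-day papers keep a fixed order.
            {"arxiv_id": {"order": "asc"}},
        ],
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,
        "track_scores": track_scores,
    }


second_page = latest_first(page=1, page_size=5)
print(second_page["from"], second_page["sort"])
```

A deterministic order matters most for pagination, where an unstable sort can show the same paper on two pages.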


### 5.6 Combined Query

Combine multiple query types for complex searches:

In [13]:
# Combined Query - Complex search with multiple criteria
print("COMBINED QUERY - Complex Search")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "transformer",
                        "fields": ["title^3", "abstract"],
                        "type": "best_fields"
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "published_date": {
                            "gte": "2024-01-01"
                        }
                    }
                }
            ],
            "should": [
                {
                    "match": {
                        "categories": "cs.AI"
                    }
                }
            ]
        }
    },
    "sort": [
        "_score",
        {"published_date": {"order": "desc"}}
    ],
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Complex Query:")
print(f"  • Must contain 'transformer' (title boosted 3x)")
print(f"  • Filter: published after 2024-01-01")
print(f"  • Prefer: cs.AI category")
print(f"  • Sort: by relevance, then date\n")

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = str(hit['_source'].get('published_date', 'N/A'))[:10]
    score = hit['_score']
    categories = ', '.join(hit['_source'].get('categories', [])[:2])
    
    print(f"Title: {title}...")
    print(f"  Date: {pub_date} | Score: {score:.2f}")
    print(f"  Categories: {categories}\n")

COMBINED QUERY - Complex Search
Complex Query:
  • Must contain 'transformer' (title boosted 3x)
  • Filter: published after 2024-01-01
  • Prefer: cs.AI category
  • Sort: by relevance, then date

Found 0 results



## 6. Test Short Acronym Queries (AI, ML, NN, CV, NLP, LLM)

Specifically testing the two- and three-letter acronyms that are common in academic search:

In [14]:
# Test Two-Letter Queries
print("TWO-LETTER QUERY TEST")
print("=" * 40)

two_letter_queries = ["AI", "ML", "NN", "CV", "NLP", "LLM"]

for q in two_letter_queries:
    results = opensearch_client.search_papers(query=q, size=2)
    total = results.get('total', 0)
    
    print(f"\n'{q}' -> {total} results")
    
    for hit in results.get('hits', []):
        title = hit.get('title', 'N/A')[:60]
        score = hit.get('score', 0)
        print(f"  [{score:.2f}] {title}...")

TWO-LETTER QUERY TEST

'AI' -> 0 results

'ML' -> 0 results

'NN' -> 0 results

'CV' -> 0 results

'NLP' -> 0 results

'LLM' -> 0 results
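
All six acronyms return zero results, and the notebook does not show why. One way to investigate is the `_analyze` API, which reports the tokens an analyzer produces for a given text, so you can see whether `AI` survives analysis as a searchable token. The sketch below only builds the arguments for `client.indices.analyze`; the helper name `analyze_kwargs` is hypothetical, and `text_analyzer` is the analyzer name shown in the index configuration output.

```python
def analyze_kwargs(index_name: str, text: str, analyzer: str = "text_analyzer") -> dict:
    """Arguments for client.indices.analyze(**kwargs) to inspect produced tokens."""
    return {
        "index": index_name,
        "body": {"analyzer": analyzer, "text": text},
    }


requests_to_run = [analyze_kwargs("arxiv-papers-chunks", q) for q in ["AI", "ML", "NLP"]]
print(requests_to_run[0])

# With a live cluster, something like this would print the tokens per query:
# for kwargs in requests_to_run:
#     tokens = opensearch_client.client.indices.analyze(**kwargs)["tokens"]
#     print(kwargs["body"]["text"], [t["token"] for t in tokens])
```

If the tokens look as expected, the next thing to check would be whether the acronym occurs as a standalone word in any indexed title or abstract; `cs.AI` in the `categories` keyword field is stored as a single exact value and would not match a query for `AI` alone.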


## 7. Test Search API Endpoints

Verify the FastAPI search endpoints are working:

In [16]:
# Test Search API Endpoints
import requests
import json

API_BASE = "http://localhost:8000/api/v1"

print("SEARCH API ENDPOINT TESTS")
print("=" * 50)

# Test 1: Health check
print("\n--- Test 1: Health Check ---")
try:
    response = requests.get(f"{API_BASE}/health", timeout=5)
    if response.status_code == 200:
        health = response.json()
        print(f"✓ Status: {health.get('status', 'unknown')}")
        for service, status in health.get('services', {}).items():
            print(f"  {service}: {status.get('status', 'unknown')} - {status.get('message', '')}")
    else:
        print(f"✗ Health check returned: {response.status_code}")
except Exception as e:
    print(f"✗ Health check error: {e}")
    print("  Make sure the API is running with the updated main.py")

# Test 2: GET /search
print("\n--- Test 2: GET /search ---")
try:
    response = requests.get(
        f"{API_BASE}/search",
        params={"q": "neural network", "size": 3},
        timeout=5
    )
    if response.status_code == 200:
        data = response.json()
        print(f"✓ Found {data.get('total', 0)} results for 'neural network'")
        for hit in data.get('hits', [])[:3]:
            print(f"  [{hit.get('score', 0):.2f}] {hit.get('title', 'N/A')[:60]}...")
    else:
        print(f"✗ Search returned: {response.status_code}")
        print(f"  Response: {response.text[:200]}")
except Exception as e:
    print(f"✗ Search error: {e}")

# Test 3: POST /search
print("\n--- Test 3: POST /search ---")
try:
    search_body = {
        "query": "deep learning transformer",
        "size": 3,
        "categories": ["cs.AI"],
        "latest_papers": False
    }
    response = requests.post(
        f"{API_BASE}/search",
        json=search_body,
        timeout=5
    )
    if response.status_code == 200:
        data = response.json()
        print(f"✓ Found {data.get('total', 0)} results")
        print(f"  Search mode: {data.get('search_mode', 'N/A')}")
        for hit in data.get('hits', [])[:3]:
            print(f"  [{hit.get('score', 0):.2f}] {hit.get('title', 'N/A')[:60]}...")
    else:
        print(f"✗ Search returned: {response.status_code}")
        print(f"  Response: {response.text[:200]}")
except Exception as e:
    print(f"✗ Search error: {e}")

print("\n" + "=" * 50)
print("API endpoint tests complete!")
print(f"\nSwagger UI: http://localhost:8000/docs")

SEARCH API ENDPOINT TESTS

--- Test 1: Health Check ---
✓ Status: ok
  database: healthy - Connected successfully
  opensearch: healthy - Index 'arxiv-papers-chunks' with 5 documents

--- Test 2: GET /search ---
✓ Found 1 results for 'neural network'
  [6.42] Tabularis Formatus: Predictive Formatting for Tables...

--- Test 3: POST /search ---
✓ Found 4 results
  Search mode: bm25
  [4.44] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Unde...
  [1.12] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcu...
  [0.92] Quantization through Piecewise-Affine Regularization: Optimi...

API endpoint tests complete!

Swagger UI: http://localhost:8000/docs


## Summary

### What We Demonstrated

**BM25 Search is Powerful!** Without any vector embeddings, we can:

1. **Simple Search**: Basic keyword search with relevance scoring
2. **Match Queries**: Search specific fields
3. **Multi-Match**: Search across multiple fields with boosting
4. **Boosting**: Promote or demote certain results
5. **Filtering**: Apply filters without affecting scores
6. **Sorting**: Order results by date, score, or other fields
7. **Complex Queries**: Combine all techniques for sophisticated searches

### Key Takeaways

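- **BM25 works well when query terms appear in the documents**: the `learning` and `neural` searches above returned scored matches
- **Exact-term matching has limits**: `machine learning` (title only), `AI Agents`, `transformer`, and the short acronyms all returned 0 results on this 5-paper index
- **No vectors needed** for effective full-text search
- **Simple and fast** compared to embedding-based approaches: no embedding model is required to index or query
- **Filters and sorting** narrow and order results without changing how matches are scored
- **Field boosting** lets matches in important fields such as `title` count for more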

### When to Use BM25 vs Vectors

**Use BM25 when:**
- Searching for specific keywords or phrases
- Need fast, simple implementation
- Have good text fields with clear terminology
- Want explainable search results

**Consider vectors when:**
- Need semantic similarity (concepts, not keywords)
- Dealing with synonyms and paraphrasing
- Cross-language search requirements
- Very short queries or documents

Remember: **You can also combine both** (hybrid search) for best results!
We will see this in Week 4+.
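
The index setup earlier mentions an RRF (reciprocal rank fusion) search pipeline, which is one common way to combine a BM25 ranking with a vector ranking. As a preview, here is a minimal pure-Python sketch of the idea: each document receives `1 / (k + rank)` from every ranking it appears in, and the contributions are summed. The constant `k = 60` is a commonly used default, not necessarily the value configured in this project, and the document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions (smaller rank) contribute more.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["a", "b", "c"]      # placeholder IDs, best match first
vector_ranking = ["b", "d", "a"]
for doc_id, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
    print(f"{score:.4f}  {doc_id}")
```

Because RRF uses only rank positions, it does not need BM25 scores and vector similarities to be on comparable scales.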