# Week 3: Keyword Search First - The Critical Foundation

> **The 90% Problem:** Most RAG systems jump straight to vector search and miss the foundation that powers the best retrieval systems. We're doing it right!

## ESSENTIAL SETUP - Do This First!

**Before running any cells, ensure your environment is properly configured:**

```bash
# 1. CRITICAL: Copy the environment configuration
cp .env.example .env

# 2. Verify these Week 3 settings are in your .env:
# OPENSEARCH__HOST=http://opensearch:9200
# OPENSEARCH__INDEX_NAME=arxiv-papers
# ARXIV__MAX_RESULTS=15
```

**Important:** Week 3 requires the `.env` file for OpenSearch connectivity and service configuration. The defaults in `.env.example` work perfectly out of the box!
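
If you want to confirm the three settings listed above are present before starting the notebook, a small check like the following may help. This is a minimal sketch, not part of the project: `parse_env` and `missing_keys` are hypothetical helpers, and the parser only handles plain `KEY=VALUE` lines (no quoting or variable expansion).

```python
from pathlib import Path

# The three Week 3 settings named in the setup instructions.
REQUIRED_KEYS = ("OPENSEARCH__HOST", "OPENSEARCH__INDEX_NAME", "ARXIV__MAX_RESULTS")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


env_path = Path(".env")
if env_path.exists():
    missing = missing_keys(env_path.read_text())
    print("Missing settings:", missing or "none")
else:
    print("No .env file found - run: cp .env.example .env")
```

Run it from the project root so that the relative `.env` path resolves correctly.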

**Why Keyword Search First?**
- **Exact Match Power:** Find specific technical terms and paper IDs precisely
- **Speed & Efficiency:** BM25 is fast and doesn't require expensive embedding models
- **Interpretable:** You understand exactly why papers were retrieved
- **Production Reality:** Search engines such as Elasticsearch and OpenSearch use BM25 keyword scoring as their default
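
To make the "interpretable" point concrete, here is a minimal pure-Python sketch of the classic BM25 formula. It is an illustration only: OpenSearch relies on Lucene's implementation, which differs in details, and the whitespace tokenizer, the sample documents, and the parameter values `k1=1.2` and `b=0.75` used here are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the classic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms receive a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration only.
docs = [
    "deep learning for medical imaging",
    "learning to rank with deep learning models",
    "inference verification despite nondeterminism",
]
for doc, score in zip(docs, bm25_scores("deep learning", docs)):
    print(f"{score:.3f}  {doc}")
```

Every score is a sum of per-term contributions, so you can trace exactly which query terms caused a document to rank where it did.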

---

# Week 3: OpenSearch Integration & BM25 Search

**What We're Building This Week:**

Week 3 focuses on implementing OpenSearch integration for full-text search capabilities using BM25 scoring. This transforms our system from a simple storage solution into a searchable knowledge base.

## Week 3 Focus Areas

### Core Objectives
- **OpenSearch Integration**: Connect our FastAPI application to OpenSearch cluster
- **Index Management**: Create and manage the arxiv-papers index with proper mappings
- **BM25 Search**: Implement full-text search with relevance scoring
- **Data Pipeline**: Transfer papers from PostgreSQL to OpenSearch
- **Search API**: Expose search functionality through REST endpoints

### What We'll Test In This Notebook
1. **Infrastructure Verification** - Ensure all services from Week 1-2 are running
2. **OpenSearch Service Integration** - Test client creation and health checks
3. **Index Creation & Management** - Create arxiv-papers index with proper mappings
4. **Data Pipeline** - Transfer papers from PostgreSQL to OpenSearch
5. **BM25 Search Functionality** - Test search queries with relevance scoring
6. **Search API Endpoints** - Verify FastAPI search endpoints work correctly

### Success Metrics
- OpenSearch cluster healthy and accessible
- arxiv-papers index created with proper mappings
- Papers successfully indexed from PostgreSQL
- BM25 search returns relevant results with scores
- Search API endpoints respond correctly
- All components ready for production use

---

## Week 3 Component Status
| Component | Purpose | Status |
|-----------|---------|--------|
| **OpenSearch Client** | Connect to OpenSearch cluster | ✅ Complete |
| **Index Management** | Create and manage search indices | ✅ Complete |
| **Query Builder** | Build complex search queries | ✅ Complete |
| **Data Pipeline** | Transfer papers to OpenSearch | ✅ Complete |
| **Search API** | REST endpoints for search | ✅ Complete |
| **BM25 Scoring** | Relevance-based search results | ✅ Complete |

## IMPORTANT: Week 3 Docker Services Restart

**NEW USERS OR INTEGRATION CONFLICTS**: Week 3 introduces OpenSearch integration that requires fresh container state. Use this clean restart approach:

### Fresh Start (Recommended for Week 3)
```bash
# Complete clean slate - removes all data but ensures correct OpenSearch state
docker compose down -v

# Build fresh containers with latest code
docker compose up --build -d
```

**When to use this:**
- First time running Week 3 
- OpenSearch connection issues
- Index conflicts or mapping errors
- Want to start with clean OpenSearch state

**Note**: This destroys existing data but ensures you have the correct Week 3 configuration with proper OpenSearch integration.
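
After a fresh start, OpenSearch can take a little while to accept requests. The sketch below polls the cluster health endpoint until it reports `green` or `yellow` (a single-node development cluster often stays `yellow`). The helper names, the retry count, and the delay are assumptions made for this example, not part of the project code.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_cluster_status(url: str = "http://localhost:9200/_cluster/health") -> str:
    """Return the cluster status ('green', 'yellow', 'red') or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response).get("status", "unknown")
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers invalid JSON.
        return "unreachable"


def wait_for_cluster(
    fetch: Callable[[], str] = fetch_cluster_status,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll until the cluster reports green or yellow; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        status = fetch()
        if status in ("green", "yellow"):
            print(f"OpenSearch ready after {attempt} attempt(s): {status}")
            return True
        time.sleep(delay_seconds)
    print("OpenSearch did not become ready in time")
    return False
```

Call `wait_for_cluster()` once after `docker compose up --build -d`. The `fetch` parameter is injectable so the loop can be exercised without a running cluster.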

---

## Prerequisites Check

**Before starting:**
1. Week 1 infrastructure completed
2. Week 2 arXiv integration working
3. UV environment activated
4. Docker Desktop running
5. Some papers already in PostgreSQL from Week 2

**Why fresh containers?** Week 3 includes OpenSearch integration that requires proper cluster initialization and may conflict with existing index states.

**Service Access Points:**
- **FastAPI**: http://localhost:8000/docs (API documentation)
- **PostgreSQL**: via API or `docker exec -it rag-postgres psql -U rag_user -d rag_db`
- **OpenSearch**: http://localhost:9200/_cluster/health
- **Ollama**: http://localhost:11434 (LLM service)
- **Airflow**: http://localhost:8080 (Username: `admin`, Password: `admin`)

## Environment Setup

In [14]:
# Environment Setup and Path Configuration
import sys
from pathlib import Path
import json
import requests

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Environment: {sys.executable}")

# Find project root and add to Python path
current_dir = Path.cwd()
if current_dir.name == "week3" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"Project root: {project_root}")
    sys.path.insert(0, str(project_root))
else:
    print("Missing compose.yml - check directory")
    sys.exit(1)  # stop here; the bare exit() helper is meant for interactive shells

Python Version: 3.12.12
Environment: /Users/macos/Code/production_rag/.venv/bin/python
Project root: /Users/macos/Code/production_rag


## 1. Infrastructure Verification

In [15]:
# Service Health Verification
print("WEEK 3 PREREQUISITE CHECK")
print("=" * 50)

services_to_test = {
    "FastAPI": "http://localhost:8000/api/v1/health",
    "PostgreSQL (via API)": "http://localhost:8000/api/v1/health", 
    "OpenSearch": "http://localhost:9200/_cluster/health",
    "Airflow": "http://localhost:8080/health"  
}

all_healthy = True

for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.ConnectionError:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False
    except Exception as e:
        print(f"✗ {service_name}: {type(e).__name__}")
        all_healthy = False

print()
if all_healthy:
    print("All services healthy! Ready for Week 3 OpenSearch integration.")
else:
    print("Some services need attention. Please run: docker compose up --build")

WEEK 3 PREREQUISITE CHECK
✓ FastAPI: Healthy
✓ PostgreSQL (via API): Healthy
✓ OpenSearch: Healthy
✓ Airflow: Healthy

All services healthy! Ready for Week 3 OpenSearch integration.


## 2. OpenSearch Client Setup

In [16]:
# OpenSearch Client Setup - Week 3 BM25 Version
from src.services.opensearch.factory import make_bm25_client
from opensearchpy import OpenSearch

print("OPENSEARCH CLIENT SETUP (Week 3 - BM25)")
print("=" * 40)

# Create BM25 OpenSearch client using factory pattern
# This uses the 'arxiv-papers' index for simple keyword search
opensearch_client = make_bm25_client()

# Override for notebook execution (localhost instead of container hostname)
opensearch_client.host = "http://localhost:9200"
opensearch_client.client = OpenSearch(
    hosts=["http://localhost:9200"],
    http_compress=True,
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

print(f"Client configured with host: {opensearch_client.host}")
print(f"Index name: {opensearch_client.index_name}")
print(f"Client type: BM25 (keyword search only)")

# Test health check
is_healthy = opensearch_client.health_check()
if is_healthy:
    print("✓ OpenSearch health check: PASSED")
else:
    print("✗ OpenSearch health check: FAILED")

OPENSEARCH CLIENT SETUP (Week 3 - BM25)
Client configured with host: http://localhost:9200
Index name: arxiv-papers
Client type: BM25 (keyword search only)
✓ OpenSearch health check: PASSED


## Index Configuration

In [17]:
# Display Index Configuration
from src.services.opensearch.index_config import ARXIV_PAPERS_INDEX, ARXIV_PAPERS_MAPPING

print("INDEX CONFIGURATION (Week 3 - BM25)")
print("=" * 40)
print(f"Index Name: {ARXIV_PAPERS_INDEX}")
print(f"Expected Index: {opensearch_client.index_name}")
print(f"\nKey Features:")
print("• Standard text analyzer for BM25 scoring")
print("• Multi-field mapping (text + keyword)")
print("• Full paper documents (no chunking)")
print("\nField Types:")

properties = ARXIV_PAPERS_MAPPING["mappings"]["properties"]
for field_name, config in properties.items():
    field_type = config.get("type")
    analyzer = config.get("analyzer", "")
    if analyzer:
        print(f"  • {field_name}: {field_type} [{analyzer}]")
    else:
        print(f"  • {field_name}: {field_type}")

INDEX CONFIGURATION (Week 3 - BM25)
Index Name: arxiv-papers
Expected Index: arxiv-papers

Key Features:
• Standard text analyzer for BM25 scoring
• Multi-field mapping (text + keyword)
• Full paper documents (no chunking)

Field Types:
  • arxiv_id: keyword
  • title: text [standard]
  • abstract: text [standard]
  • authors: keyword
  • categories: keyword
  • published_date: date
  • pdf_url: keyword
  • raw_text: text [standard]
  • section_titles: text [standard]
  • chunk_id: keyword
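
The "multi-field mapping (text + keyword)" feature does not show up in the listing above because the loop prints only each field's top-level type; sub-fields live under a `fields` key. The fragment below is a hypothetical example of what such a mapping can look like. It is not copied from the project's `index_config`, and the real `ARXIV_PAPERS_MAPPING` may differ.

```python
# Hypothetical mapping fragment for illustration, not the project's actual mapping.
example_mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",          # analyzed, scored with BM25
                "analyzer": "standard",
                "fields": {
                    # Exact-match sub-field, addressed as "title.keyword"
                    "keyword": {"type": "keyword", "ignore_above": 256}
                },
            },
            "arxiv_id": {"type": "keyword"},
        }
    }
}

# List every sub-field so the multi-field structure is visible.
for name, config in example_mapping["mappings"]["properties"].items():
    for sub_name, sub_config in config.get("fields", {}).items():
        print(f"{name}.{sub_name}: {sub_config['type']}")
```

With a mapping like this, full-text queries target `title` while sorting, aggregations, and exact matches target `title.keyword`.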


### Create Index

In [18]:
# Create BM25 Index if it doesn't exist
print("INDEX CREATION (Week 3 - BM25)")
print("=" * 40)

try:
    # Check if index already exists
    index_exists = opensearch_client.client.indices.exists(index=opensearch_client.index_name)
    
    if index_exists:
        print(f"✓ Index '{opensearch_client.index_name}' already exists")
        
        # Get current index statistics
        stats = opensearch_client.get_index_stats()
        if stats and 'error' not in stats:
            print(f"\nCurrent Statistics:")
            print(f"   Documents: {stats.get('document_count', 0)}")
            print(f"   Size: {stats.get('size_in_bytes', 0):,} bytes")
    else:
        print(f"Creating new BM25 index: {opensearch_client.index_name}")
        
        # Create the index with BM25 mapping
        success = opensearch_client.create_index()
        
        if success:
            print(f"✓ Index created successfully!")
        else:
            print(f"✗ Index creation failed")
            
except Exception as e:
    print(f"✗ Error with index management: {e}")

INDEX CREATION (Week 3 - BM25)
✓ Index 'arxiv-papers' already exists

Current Statistics:
   Documents: 15
   Size: 50,624 bytes


## 3. Data Pipeline - Run Airflow DAG

The **arxiv_paper_ingestion_week3** DAG automatically:
1. Fetches papers from arXiv API
2. Stores papers in PostgreSQL
3. **Indexes papers into OpenSearch (BM25 only - no chunking/embeddings)**

### Instructions:

**Before proceeding, run the Week 3 Airflow DAG:**

1. Open Airflow UI: http://localhost:8080
2. Login: username `admin`, password `admin`
3. Find **`arxiv_paper_ingestion_week3`** DAG (not the main one!)
4. Click the DAG name to open it
5. Click **"Trigger DAG"** button (▶️ play icon)
6. Wait ~5-10 minutes for completion
7. Check that all tasks turn green

**Important:** Use the `arxiv_paper_ingestion_week3` DAG which indexes to the `arxiv-papers` index for BM25 search. The main DAG indexes to `arxiv-papers-chunks` for hybrid search (Week 4+).

Then run the cell below to verify:

In [19]:
# Verify Data Pipeline Results
print("VERIFYING DATA PIPELINE")
print("=" * 40)

stats = opensearch_client.get_index_stats()
print(stats)

if stats and 'error' not in stats:
    doc_count = stats.get('document_count', 0)
    
    if doc_count > 0:
        print(f"✓ Success! Found {doc_count} documents in OpenSearch")
        
        # Show sample papers
        sample = opensearch_client.search_papers("*", size=3)
        if sample.get('hits'):
            print(f"\nSample papers:")
            for i, paper in enumerate(sample['hits'], 1):
                title = paper.get('title', 'Unknown')[:60]
                print(f"  {i}. {title}...")
    else:
        print("⚠️  No documents in OpenSearch yet")
        print("\nPlease run the Airflow DAG first (see instructions above)")
else:
    print("✗ Could not retrieve index stats")

VERIFYING DATA PIPELINE
{'index_name': 'arxiv-papers', 'exists': True, 'document_count': 15, 'deleted_count': 0, 'size_in_bytes': 50624}
✓ Success! Found 15 documents in OpenSearch


## 4. Simple BM25 Search

Let's start with a simple search to demonstrate BM25 scoring:

In [20]:
# Simple BM25 Search
print("SIMPLE BM25 SEARCH")
print("=" * 40)

# Change this to any word from your papers
search_term = "learning"  # Try different terms!

print(f"Searching for: '{search_term}'\n")

results = opensearch_client.search_papers(
    query=search_term,
    size=5
)

if results.get('hits'):
    print(f"Found {results.get('total', 0)} total matches\n")
    
    for i, paper in enumerate(results['hits'], 1):
        print(f"{i}. {paper.get('title', 'Unknown')[:70]}...")
        print(f"   Score: {paper.get('score', 0):.2f}")
        print(f"   arXiv ID: {paper.get('arxiv_id', 'N/A')}\n")
else:
    print("No results found. Try searching for:")
    print("  • 'neural', 'model', 'algorithm'")
    print("  • Use '*' to see all papers")

SIMPLE BM25 SEARCH
Searching for: 'learning'

Found 4 total matches

1. Evaluating the Performance of Deep Learning Models in Whole-body Dynam...
   Score: 5.65
   arXiv ID: 2511.20615v1

2. MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Med...
   Score: 5.12
   arXiv ID: 2511.20650v1

3. MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimiz...
   Score: 3.86
   arXiv ID: 2511.20629v1

4. DiFR: Inference Verification Despite Nondeterminism...
   Score: 2.49
   arXiv ID: 2511.20621v1



## 5. Advanced OpenSearch Queries

Now let's explore different query types using the OpenSearch Python client directly. This shows the power of BM25 without needing vectors!

### 5.1 Match Query

The `match` query is the standard query for full-text search on a single field:

In [21]:
# Match Query - Search in title field
print("MATCH QUERY - Single Field Search")
print("=" * 40)

query = {
    "query": {
        "match": {
            "title": "machine learning"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")

MATCH QUERY - Single Field Search
Found 1 results

Title: Evaluating the Performance of Deep Learning Models in Whole-body Dynam...
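
By default, `match` analyzes the query text and combines the resulting terms with OR, so "machine learning" can match a title that contains only one of the two words. The sketch below builds two stricter variants: `operator: and` (all terms required) and `match_phrase` (terms adjacent and in order). The `title_query` helper is a hypothetical convenience function for this example.

```python
def title_query(text: str, mode: str = "or", size: int = 3) -> dict:
    """Build a title search body. mode: 'or' (default match), 'and', or 'phrase'."""
    if mode == "phrase":
        clause = {"match_phrase": {"title": text}}
    elif mode == "and":
        clause = {"match": {"title": {"query": text, "operator": "and"}}}
    elif mode == "or":
        clause = {"match": {"title": text}}
    else:
        raise ValueError(f"Unknown mode: {mode}")
    return {"query": clause, "size": size}


for mode in ("or", "and", "phrase"):
    print(mode, "->", title_query("machine learning", mode)["query"])
```

Each body can be passed to `opensearch_client.client.search(index=opensearch_client.index_name, body=...)` in the same way as the cell above, which makes it easy to compare hit counts across the three modes.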


### 5.2 Multi-Match Query

Search across multiple fields simultaneously:

In [22]:
# Multi-Match Query - Search across multiple fields
print("MULTI-MATCH QUERY - Search Multiple Fields")
print("=" * 40)

query = {
    "query": {
        "multi_match": {
            "query": "AI Agents",
            "fields": ["title^2", "abstract", "authors"],  # ^2 boosts title field
            "type": "best_fields"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")
    print(f"Score: {hit['_score']:.2f}")
    print(f"Authors: {', '.join(hit['_source']['authors'][:2])}...\n")

MULTI-MATCH QUERY - Search Multiple Fields
Found 6 results

Title: BrowseSafe: Understanding and Preventing Prompt Injection Within AI Br...
Score: 8.09
Authors: Kaiyuan Zhang, Mark Tenenholtz...

Title: Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enab...
Score: 4.52
Authors: Anastasia Mavridou, Divya Gopinath...

Title: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tou...
Score: 3.61
Authors: Panayiotis Danassis, Naman Goel...



### 5.3 Boosting Query

Boost certain results while demoting others:

In [23]:
# Boosting Query - Promote and demote results
print("BOOSTING QUERY - Promote/Demote Results")
print("=" * 40)

query = {
    "query": {
        "boosting": {
            "positive": {
                "match": {
                    "abstract": "deep learning"
                }
            },
            "negative": {
                "match": {
                    "abstract": "multimodal"
                }
            },
            "negative_boost": 0.1  # Reduce score of negative matches
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: Boost 'deep learning', demote 'multimodal' papers\n")
print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    abstract_snippet = hit['_source']['abstract'][:100]
    print(f"Title: {title}...")
    print(f"Score: {hit['_score']:.2f}")
    print(f"Abstract: {abstract_snippet}...\n")

BOOSTING QUERY - Promote/Demote Results
Query: Boost 'deep learning', demote 'multimodal' papers

Found 5 results

Title: MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Med...
Score: 2.56
Abstract: Traditional object detection models in medical imaging operate within a closed-set paradigm, limitin...

Title: MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimiz...
Score: 1.93
Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of gener...

Title: The Driver-Blindness Phenomenon: Why Deep Sequence Models Default to A...
Score: 1.78
Abstract: Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informat...
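
A boosting query does not exclude documents that match the negative clause; it keeps them and multiplies their score by `negative_boost`. The arithmetic can be sketched as follows. The paper names and scores are made up for illustration and are unrelated to the results above.

```python
def boosted_score(positive_score: float, matches_negative: bool, negative_boost: float = 0.1) -> float:
    """Final score under a boosting query: demoted only if the negative clause matches."""
    return positive_score * negative_boost if matches_negative else positive_score


# (name, score from the positive clause, matches the negative clause?)
# Made-up values for illustration only.
candidates = [
    ("paper A", 4.0, True),
    ("paper B", 2.5, False),
    ("paper C", 1.0, False),
]

ranked = sorted(
    ((name, boosted_score(score, neg)) for name, score, neg in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Paper A has the strongest positive match, yet it drops to last place (4.0 × 0.1 = 0.4) because it also matches the negative clause.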



### 5.4 Filter Query

Filter results by specific criteria (doesn't affect scoring):

In [24]:
# Filter Query - Filter by categories
print("FILTER QUERY - Category Filtering")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "abstract": "model"
                    }
                }
            ],
            "filter": [
                {
                    "terms": {
                        "categories": ["cs.AI"]
                    }
                }
            ]
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    categories = ', '.join(hit['_source']['categories'])
    print(f"Title: {title}...")
    print(f"Categories: {categories}")
    print(f"Score: {hit['_score']:.2f}\n")

FILTER QUERY - Category Filtering
Found 11 results

Title: MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Med...
Categories: cs.CV, cs.AI
Score: 0.56

Title: ROOT: Robust Orthogonalized Optimizer for Neural Network Training...
Categories: cs.LG, cs.AI
Score: 0.46

Title: Latent Collaboration in Multi-Agent Systems...
Categories: cs.CL, cs.AI, cs.LG
Score: 0.44
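
The key distinction in a `bool` query is that `must` clauses contribute to the relevance score, while `filter` clauses only decide whether a document is included. A small builder makes that split explicit. `filtered_search` is a hypothetical helper written for this example, and the default field name is an assumption based on the mapping shown earlier.

```python
def filtered_search(text: str, categories: list[str], field: str = "abstract", size: int = 3) -> dict:
    """Build a bool query: `must` is scored, `filter` only includes or excludes."""
    return {
        "query": {
            "bool": {
                # Scored full-text clause
                "must": [{"match": {field: text}}],
                # Unscored clause: `terms` matches documents with ANY listed category
                "filter": [{"terms": {"categories": categories}}],
            }
        },
        "size": size,
    }


body = filtered_search("model", ["cs.AI", "cs.LG"])
print(body["query"]["bool"]["filter"])
```

Pass `body` to `opensearch_client.client.search(...)` as in the cell above. Changing the category list changes which documents are returned, but not the score of any document that remains.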



### 5.5 Sorting Query

Sort results by different criteria:

In [25]:
# Sorting Query - Sort by publication date
print("SORTING QUERY - Latest Papers First")
print("=" * 40)

query = {
    "query": {
        "match_all": {}  # Get all papers
    },
    "sort": [
        {
            "published_date": {
                "order": "desc"  # Latest first
            }
        }
    ],
    "size": 5
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: All papers sorted by publication date (newest first)\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = hit['_source']['published_date'][:10]
    print(f"Date: {pub_date} | {title}...")

SORTING QUERY - Latest Papers First
Query: All papers sorted by publication date (newest first)

Date: 2025-11-25 | MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Med...
Date: 2025-11-25 | MotionV2V: Editing Motion in a Video...
Date: 2025-11-25 | Latent Collaboration in Multi-Agent Systems...
Date: 2025-11-25 | MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimiz...
Date: 2025-11-25 | Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enab...


### 5.6 Combined Query

Combine multiple query types for complex searches:

In [26]:
# Combined Query - Complex search with multiple criteria
print("COMBINED QUERY - Complex Search")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "transformer",
                        "fields": ["title^3", "abstract"],
                        "type": "best_fields"
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "published_date": {
                            "gte": "2024-01-01"
                        }
                    }
                }
            ],
            "should": [
                {
                    "match": {
                        "categories": "cs.AI"
                    }
                }
            ]
        }
    },
    "sort": [
        "_score",
        {"published_date": {"order": "desc"}}
    ],
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Complex Query:")
print(f"  • Must contain 'transformer' (title boosted 3x)")
print(f"  • Filter: published on or after 2024-01-01")
print(f"  • Prefer: cs.AI category")
print(f"  • Sort: by relevance, then date\n")

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = hit['_source']['published_date'][:10]
    score = hit['_score']
    categories = ', '.join(hit['_source']['categories'][:2])
    
    print(f"Title: {title}...")
    print(f"  Date: {pub_date} | Score: {score:.2f}")
    print(f"  Categories: {categories}\n")

COMBINED QUERY - Complex Search
Complex Query:
  • Must contain 'transformer' (title boosted 3x)
  • Filter: published on or after 2024-01-01
  • Prefer: cs.AI category
  • Sort: by relevance, then date

Found 1 results

Title: Evaluating the Performance of Deep Learning Models in Whole-body Dynam...
  Date: 2025-11-25 | Score: 3.02
  Categories: cs.CV, cs.AI



## Summary

### What We Demonstrated

**BM25 Search is Powerful!** Without any vector embeddings, we can:

1. **Simple Search**: Basic keyword search with relevance scoring
2. **Match Queries**: Search specific fields
3. **Multi-Match**: Search across multiple fields with boosting
4. **Boosting**: Promote or demote certain results
5. **Filtering**: Apply filters without affecting scores
6. **Sorting**: Order results by date, score, or other fields
7. **Complex Queries**: Combine all techniques for sophisticated searches

### Key Takeaways

- **BM25 works great** for many search use cases
- **No vectors needed** for effective full-text search
- **Simple and fast** compared to embedding-based approaches
- **Filters and sorting** make searches precise and relevant
- **Field boosting** helps prioritize important content

### When to Use BM25 vs Vectors

**Use BM25 when:**
- Searching for specific keywords or phrases
- Need fast, simple implementation
- Have good text fields with clear terminology
- Want explainable search results

**Consider vectors when:**
- Need semantic similarity (concepts, not keywords)
- Dealing with synonyms and paraphrasing
- Cross-language search requirements
- Very short queries or documents

Remember: **you can also combine both** (hybrid search) for the best results!
We'll cover this next week :)