# Advanced Elasticsearch VectorDB & Information Retrieval

This notebook demonstrates **advanced information retrieval techniques** using Elasticsearch, including:

- **BM25**: The classic text-based retrieval algorithm
- **Vector Search**: Semantic search using embeddings
- **Hybrid Search**: Combining BM25 and vector search for best results
- **Reranking**: Improving search results with cross-encoders
- **Advanced Query Techniques**: Multi-match, boosting, filters, and more

## Prerequisites

Make sure you have:
- Elasticsearch running (local or cloud)
- Required Python libraries installed
- An API key if using Elasticsearch Cloud


## Setup and Connection

First, let's connect to Elasticsearch. You can use either:
- **Local**: `http://localhost:9200` or `https://localhost:9200`
- **Cloud**: Your Elasticsearch Cloud endpoint with an API key


In [None]:
import urllib3
urllib3.disable_warnings()

from elasticsearch import Elasticsearch
from pprint import pprint
import numpy as np
import json
from typing import List, Dict, Any

# Connection configuration
ENDPOINT = "https://localhost:9200"  # Change to your endpoint
API_KEY = "TO_COMPLETE"  # http://localhost:5601/app/management/security/api_keys/

# Connect to Elasticsearch
es = Elasticsearch(ENDPOINT, api_key=API_KEY, verify_certs=False)
print("✅ Connected to Elasticsearch")
print(f"Cluster info: {es.info()['cluster_name']}")


## Part I: Understanding BM25

**BM25 (Best Matching 25)** is a ranking function used to estimate the relevance of documents to a given search query. It's the default text search algorithm in Elasticsearch.

### Key Concepts:
- **Term Frequency (TF)**: How often a term appears in a document
- **Inverse Document Frequency (IDF)**: How rare/common a term is across all documents
- **Field Length Normalization**: Adjusts for document length
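
To make these three ingredients concrete, the following is a minimal pure-Python sketch of the textbook BM25 formula. It assumes whitespace tokenization and uses `k1 = 1.2` and `b = 0.75`, which are Elasticsearch's default values; the function name `bm25_scores` and the toy documents are illustrative only, and Lucene's production implementation differs in some details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # IDF: rare terms contribute more than common ones
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized via b
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "machine learning learns from data",
    "deep learning uses neural networks",
    "elasticsearch is a search engine",
]
toy_scores = bm25_scores("machine learning", toy_docs)
print([round(s, 3) for s in toy_scores])
```

The first toy document matches both query terms, the second matches only the more common term "learning", and the third matches nothing, so the scores decrease in that order.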

BM25 is excellent for:
- Keyword matching
- Exact phrase matching
- Handling common vs. rare terms
- Text-based search where semantic understanding isn't critical


### I.1 Creating a Text-Only Index for BM25

Let's create an index with text fields optimized for BM25 search:


In [None]:
# Sample documents for our search experiments
documents = [
    {
        "id": 1,
        "title": "Introduction to Machine Learning",
        "content": "Machine learning is a subset of artificial intelligence that enables systems to learn from data without explicit programming. It uses algorithms to identify patterns and make predictions.",
        "category": "AI",
        "views": 1500
    },
    {
        "id": 2,
        "title": "Deep Learning Fundamentals",
        "content": "Deep learning uses neural networks with multiple layers to process complex data. It's particularly effective for image recognition and natural language processing tasks.",
        "category": "AI",
        "views": 2300
    },
    {
        "id": 3,
        "title": "Elasticsearch Search Engine",
        "content": "Elasticsearch is a distributed search and analytics engine. It provides powerful full-text search capabilities using inverted indices and the BM25 ranking algorithm.",
        "category": "Search",
        "views": 1800
    },
    {
        "id": 4,
        "title": "Vector Databases Explained",
        "content": "Vector databases store high-dimensional vectors for similarity search. They're essential for semantic search, recommendation systems, and AI applications using embeddings.",
        "category": "Database",
        "views": 2100
    },
    {
        "id": 5,
        "title": "Hybrid Search: Combining Text and Vectors",
        "content": "Hybrid search combines traditional keyword search (BM25) with vector similarity search. This approach leverages both lexical matching and semantic understanding for better results.",
        "category": "Search",
        "views": 950
    },
    {
        "id": 6,
        "title": "Natural Language Processing Basics",
        "content": "NLP enables computers to understand and process human language. Key techniques include tokenization, named entity recognition, and sentiment analysis.",
        "category": "AI",
        "views": 1200
    }
]

# Create index for BM25 search
index_name_bm25 = "articles_bm25"

# Delete if exists
if es.indices.exists(index=index_name_bm25):
    es.indices.delete(index=index_name_bm25)
    print(f"🗑️  Deleted existing index: {index_name_bm25}")

# Create index with text fields optimized for BM25
es.indices.create(
    index=index_name_bm25,
    body={
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard",  # Standard analyzer for BM25
                    "fields": {
                        "keyword": {"type": "keyword"}  # For exact matching
                    }
                },
                "content": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "category": {"type": "keyword"},
                "views": {"type": "integer"}
            }
        },
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    }
)

# Index documents
for doc in documents:
    es.index(index=index_name_bm25, id=doc["id"], document=doc)

# Refresh to make documents searchable immediately
es.indices.refresh(index=index_name_bm25)
print(f"✅ Created index '{index_name_bm25}' and indexed {len(documents)} documents")


### I.2 Basic BM25 Search

Let's perform a simple BM25 search. Elasticsearch uses BM25 by default for text fields:


In [None]:
def display_search_results(response, query_text=""):
    """Helper function to display search results nicely"""
    print(f"\n🔍 Search Results for: '{query_text}'")
    print(f"Total hits: {response['hits']['total']['value']}\n")
    
    for i, hit in enumerate(response['hits']['hits'], 1):
        score = hit['_score']
        source = hit['_source']
        print(f"{i}. Score: {score:.4f}")
        print(f"   Title: {source.get('title', 'N/A')}")
        print(f"   Category: {source.get('category', 'N/A')}")
        print(f"   Content preview: {source.get('content', '')[:100]}...")
        print()

# Simple BM25 search
query = "machine learning"
response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "match": {
                "content": query
            }
        }
    }
)

display_search_results(response, query)


### I.3 Advanced BM25: Multi-Match and Boosting

We can search across multiple fields and boost certain fields to give them more importance:
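
As a rough mental model (not Elasticsearch's actual code path), `best_fields` scores each field separately, applies the field boost, and keeps the highest value; the optional `tie_breaker` parameter adds a fraction of the remaining field scores. The helper name and the per-field scores below are made up for illustration:

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields multi_match does."""
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # With tie_breaker=0.0 only the best field counts
    return best + tie_breaker * (sum(boosted) - best)

# Hypothetical raw BM25 scores for one document
raw_scores = {"title": 1.2, "content": 2.0}
boosts = {"title": 3.0}  # corresponds to "title^3"

winner_only = best_fields_score(raw_scores, boosts)
with_tie_breaker = best_fields_score(raw_scores, boosts, tie_breaker=0.3)
print(round(winner_only, 2), round(with_tie_breaker, 2))
```

Without the boost, the content field would win with 2.0; with `title^3`, the title match (1.2 × 3) decides the score instead.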


In [None]:
# Multi-match query: search across multiple fields with boosting
# Title matches are boosted 3x more than content matches
query = "search engine"

response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "content"],  # ^3 means 3x boost
                "type": "best_fields"  # Uses best matching field's score
            }
        }
    }
)

display_search_results(response, query)


### I.4 BM25 with Filters and Function Score

We can combine BM25 with filters and custom scoring functions to create more sophisticated ranking strategies.

#### Understanding Function Score

The `function_score` query allows you to modify the relevance score of documents returned by a query. This is incredibly powerful for:
- **Boosting popular content**: Increase scores for documents with high view counts, ratings, etc.
- **Time-based ranking**: Boost recent content or decay older content
- **Business logic**: Apply custom scoring based on any field value

#### Key Components:

1. **Base Query**: The BM25 query that finds matching documents
2. **Filters**: Restrict results to specific criteria (faster than queries, cached)
3. **Functions**: Mathematical transformations applied to field values
4. **Boost Mode**: How to combine function scores with query scores

#### Function Score Parameters Explained:

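- **`factor`**: Multiplier applied to the field value before the modifier (e.g., 0.001 turns 1500 views into 1.5)
- **`modifier`**: Mathematical function applied to the scaled value, such as `log1p` (base-10 logarithm of 1 + value), `ln1p` (natural logarithm of 1 + value), or `sqrt`
- **`boost_mode`**: How to combine the function score with the query score: `multiply` (the default), `sum`, `avg`, `max`, `min`, or `replace`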
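
The arithmetic behind these parameters can be sketched in plain Python. This is an illustration of the documented behavior, not Elasticsearch code; the function names are hypothetical, and only a few modifiers are included. Note that Elasticsearch's `log1p` modifier uses the base-10 logarithm, while `ln1p` is the natural-log variant.

```python
import math

def field_value_factor(value, factor=1.0, modifier="none"):
    """Scale a field value and apply a modifier, as field_value_factor does."""
    scaled = factor * value
    if modifier == "none":
        return scaled
    if modifier == "log1p":   # base-10 logarithm of (1 + x)
        return math.log10(1 + scaled)
    if modifier == "ln1p":    # natural logarithm of (1 + x)
        return math.log1p(scaled)
    if modifier == "sqrt":
        return math.sqrt(scaled)
    raise ValueError(f"unsupported modifier: {modifier}")

def apply_boost_mode(query_score, function_score, boost_mode="multiply"):
    """Combine the query score and the function score."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "avg":
        return (query_score + function_score) / 2
    if boost_mode == "max":
        return max(query_score, function_score)
    if boost_mode == "min":
        return min(query_score, function_score)
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"unsupported boost_mode: {boost_mode}")

popularity = field_value_factor(1500, factor=0.001, modifier="log1p")
final_score = apply_boost_mode(2.0, popularity, boost_mode="sum")
print(round(popularity, 4), round(final_score, 4))
```

With `sum`, popularity acts as a small additive bonus; with the default `multiply`, the same function value would instead shrink the score whenever it is below 1.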

#### Why Use Filters vs Queries?

- **Filters** are faster because they:
  - Don't calculate relevance scores
  - Are cached automatically
  - Use bit sets for efficient matching
  - Perfect for exact matches (categories, tags, dates, etc.)

- **Queries** calculate relevance scores and are better for:
  - Text matching
  - Fuzzy matching
  - When you need scoring

**Best Practice**: Use filters for exact matches, queries for relevance scoring.
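
One way to keep this split explicit is a small builder that puts scored text matching in `must` and exact-value constraints in `filter`. `build_filtered_query` is a hypothetical helper name; the dictionary it returns is an ordinary `bool` query body of the kind passed to `es.search(index=..., body=...)` elsewhere in this notebook.

```python
from typing import Any, Dict, Optional

def build_filtered_query(
    text: str,
    field: str = "content",
    term_filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a bool query: scored match in `must`, unscored terms in `filter`."""
    bool_query: Dict[str, Any] = {"must": [{"match": {field: text}}]}
    if term_filters:
        # Each filter narrows the result set without affecting _score
        bool_query["filter"] = [
            {"term": {name: value}} for name, value in term_filters.items()
        ]
    return {"query": {"bool": bool_query}}

body = build_filtered_query("learning", term_filters={"category": "AI"})
print(body)
```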


In [None]:
# Combine BM25 search with filters and boost by popularity (views)
query = "learning"

# Step-by-step explanation:
# 1. Base query: Find documents matching "learning" in content (BM25 scoring)
# 2. Filter: Only include documents where category = "AI" (no scoring, just filtering)
# 3. Function: Calculate a boost based on the "views" field
#    - factor: 0.001 means we're working with view counts divided by 1000
#    - modifier: "log1p" applies log10(1 + value) to smooth the effect
#      (Elasticsearch's "log1p" is base 10; "ln1p" is the natural-log variant)
#      This prevents documents with very high view counts from dominating
# 4. boost_mode: "sum" adds the function score to the BM25 query score

response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"content": query}}  # BM25 text search
                        ],
                        "filter": [
                            {"term": {"category": "AI"}}  # Filter: exact match, cached, no scoring
                        ]
                    }
                },
                "functions": [
                    {
                        "field_value_factor": {
                            "field": "views",
                            "factor": 0.001,  # Divide views by 1000 (e.g., 1500 views → 1.5)
                            "modifier": "log1p"  # Apply log10(1 + value) to smooth large differences
                            # Example: log10(1 + 1.5) ≈ 0.398, log10(1 + 2.3) ≈ 0.519
                        }
                    }
                ],
                "boost_mode": "sum"  # Final score = BM25_score + log10(1 + views * 0.001)
            }
        }
    }
)

display_search_results(response, f"{query} (filtered: AI category, boosted by views)")

# Let's also show the scores breakdown for better understanding
print("\n📊 Score Breakdown:")
print("-" * 80)
for hit in response['hits']['hits']:
    views = hit['_source']['views']
    boost = np.log10(1 + views * 0.001)  # Mirrors Elasticsearch's base-10 "log1p" modifier
    print(f"Document: {hit['_source']['title']}")
    print(f"  Final Score: {hit['_score']:.4f}")
    print(f"  Views: {views}")
    print(f"  Function contribution: log10(1 + {views} * 0.001) ≈ {boost:.4f}")
    print(f"  Estimated BM25 score: {hit['_score'] - boost:.4f}")
    print()


## Part II: Vector Search with Embeddings

Vector search uses **embeddings** (dense vector representations) to find semantically similar documents, even if they don't share exact keywords.

### Advantages:
- **Semantic understanding**: Finds documents with similar meaning
- **Multilingual**: Works across languages if embeddings are trained accordingly
- **Context-aware**: Understands synonyms and related concepts

### When to use:
- Semantic similarity is more important than exact keyword matching
- You need to find conceptually similar content
- Working with multilingual content


### II.1 Generating Embeddings

We will now use a real embedding model (`sentence-transformers/all-MiniLM-L6-v2`) to convert text into 384-dimensional vectors. This model provides high-quality semantic embeddings that capture the meaning of text.

**Setup**: To use the HuggingFace Inference API, you need to:
1. Create an account on [Hugging Face](https://huggingface.co/)
2. Generate a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Place your token in the `HF_TOKEN` variable below

The model `all-MiniLM-L6-v2` produces 384-dimensional embeddings and is optimized for speed while maintaining good semantic understanding.
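
For larger corpora it can help to send texts in fixed-size batches and to check the returned dimensionality before indexing. The following is a sketch under assumptions: `embed_in_batches` is a hypothetical helper, `embed_fn` stands for any callable that maps a list of strings to a list of vectors (such as the `get_embeddings_batch` function defined in the next cell), and the batch size of 32 is an arbitrary choice, not a documented API limit.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
    expected_dim: int = 384,
) -> List[List[float]]:
    """Embed texts batch by batch and validate the shape of every result."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embedding count does not match batch size")
        for vector in batch_vectors:
            if len(vector) != expected_dim:
                raise ValueError(f"expected {expected_dim} dims, got {len(vector)}")
        vectors.extend(batch_vectors)
    return vectors

# Stand-in embedder for demonstration: returns zero vectors and records calls
calls = []
def fake_embed(batch):
    calls.append(len(batch))
    return [[0.0] * 384 for _ in batch]

result = embed_in_batches(["a", "b", "c", "d", "e"], fake_embed, batch_size=2)
print(len(result), calls)
```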


In [None]:
# HuggingFace Inference API setup
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
HF_TOKEN = "TO_COMPLETE"  # https://huggingface.co/settings/tokens

import requests

# API endpoint for feature extraction (embeddings)
api_url = f"https://router.huggingface.co/hf-inference/models/{MODEL_ID}/pipeline/feature-extraction"
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

def get_embedding(text: str) -> List[float]:
    """
    Get embedding for a single text using HuggingFace Inference API.
    
    Args:
        text: Input text to embed
        
    Returns:
        List of floats representing the 384-dimensional embedding vector
    """
    # Send the text as a one-element list so the response has the same
    # shape as in get_embeddings_batch: a list of embedding vectors
    response = requests.post(
        api_url, 
        headers=headers, 
        json={"inputs": [text], "options": {"wait_for_model": True}}
    )
    
    if response.status_code != 200:
        raise Exception(f"Error from HuggingFace API: {response.status_code} - {response.text}")
    
    # The API returns a list with one embedding vector
    return response.json()[0]

def get_embeddings_batch(texts: List[str]) -> List[List[float]]:
    """
    Get embeddings for multiple texts in a single API call (more efficient).
    
    Args:
        texts: List of input texts to embed
        
    Returns:
        List of embedding vectors (each is a list of floats)
    """
    response = requests.post(
        api_url,
        headers=headers,
        json={"inputs": texts, "options": {"wait_for_model": True}}
    )
    
    if response.status_code != 200:
        raise Exception(f"Error from HuggingFace API: {response.status_code} - {response.text}")
    
    return response.json()

# Generate embeddings for our documents
# all-MiniLM-L6-v2 produces 384-dimensional vectors
embedding_dim = 384

# Prepare texts for batch embedding (more efficient than individual calls)
texts_to_embed = [f"{doc['title']} {doc['content']}" for doc in documents]

# Get embeddings in batch
print("🔄 Generating embeddings using HuggingFace API...")
embeddings_batch = get_embeddings_batch(texts_to_embed)

# Store embeddings in dictionary
document_embeddings = {}
for doc, embedding in zip(documents, embeddings_batch):
    document_embeddings[doc['id']] = embedding

print(f"✅ Generated {embedding_dim}-dimensional embeddings for {len(documents)} documents")
print(f"Sample embedding (first 10 dims): {document_embeddings[1][:10]}")
print(f"Embedding norm: {np.linalg.norm(document_embeddings[1]):.4f}")


### II.2 Creating a Vector Search Index

Now let's create an index that supports vector similarity search:


In [None]:
# Create index for vector search
index_name_vector = "articles_vector"

# Delete if exists
if es.indices.exists(index=index_name_vector):
    es.indices.delete(index=index_name_vector)
    print(f"🗑️  Deleted existing index: {index_name_vector}")

# Create index with dense_vector field
es.indices.create(
    index=index_name_vector,
    body={
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "content": {"type": "text"},
                "category": {"type": "keyword"},
                "views": {"type": "integer"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": embedding_dim,
                    "index": True,  # Enable approximate k-NN search
                    "similarity": "cosine"  # Use cosine similarity
                }
            }
        },
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    }
)

# Index documents with embeddings
for doc in documents:
    doc_with_embedding = doc.copy()
    doc_with_embedding["embedding"] = document_embeddings[doc["id"]]
    es.index(index=index_name_vector, id=doc["id"], document=doc_with_embedding)

# Refresh
es.indices.refresh(index=index_name_vector)
print(f"✅ Created vector index '{index_name_vector}' with {len(documents)} documents")


### II.3 Vector Similarity Search

Now let's perform a vector similarity search:


In [None]:
# Vector similarity search
query_text = "artificial intelligence and neural networks"
query_embedding = get_embeddings_batch([query_text])[0]  # One text in, one vector out

response = es.search(
    index=index_name_vector,
    body={
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": 5,
            "num_candidates": 10  # Number of candidates to consider
        },
        "_source": ["title", "content", "category"]
    }
)

display_search_results(response, f"Vector search: '{query_text}'")


## Part III: Hybrid Search - The Best of Both Worlds

**Hybrid search** combines BM25 (keyword matching) and vector search (semantic matching) to get the best results. This is one of the most powerful techniques in modern information retrieval.

### Why Hybrid Search?
- **BM25** excels at exact keyword matching and handling rare terms
- **Vector search** excels at semantic understanding and finding conceptually similar content
- **Combined**: You get both lexical and semantic relevance
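
When a single request carries both a `query` and a `knn` clause (as in the hybrid comparison in Part IV), Elasticsearch treats them as a disjunction and adds up the boosted scores of each document. A minimal sketch of that combination, with a hypothetical helper name and made-up scores:

```python
def combine_scores(bm25_scores, knn_scores, query_boost=1.0, knn_boost=1.0):
    """Sum boosted BM25 and kNN scores; a document missing from one side adds 0."""
    doc_ids = set(bm25_scores) | set(knn_scores)
    combined = {
        doc_id: query_boost * bm25_scores.get(doc_id, 0.0)
        + knn_boost * knn_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

bm25 = {"doc_a": 2.0, "doc_b": 1.0}   # lexical matches
knn = {"doc_b": 0.9, "doc_c": 0.8}    # semantic matches
ranking = combine_scores(bm25, knn, knn_boost=0.5)
print(ranking)
```

BM25 scores are unbounded while cosine-based scores stay between 0 and 1, so the boosts usually need tuning per dataset. Rank-based fusion avoids that scale mismatch.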


### III.1 Creating a Hybrid Search Index

We need an index that supports both text search (BM25) and vector search:


In [None]:
# Create index for hybrid search
index_name_hybrid = "articles_hybrid"

# Delete if exists
if es.indices.exists(index=index_name_hybrid):
    es.indices.delete(index=index_name_hybrid)
    print(f"🗑️  Deleted existing index: {index_name_hybrid}")

# Create index with both text and vector fields
es.indices.create(
    index=index_name_hybrid,
    body={
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard",
                    "fields": {
                        "keyword": {"type": "keyword"}
                    }
                },
                "content": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "category": {"type": "keyword"},
                "views": {"type": "integer"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": embedding_dim,
                    "index": True,
                    "similarity": "cosine"
                }
            }
        },
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    }
)

# Index documents with both text and embeddings
for doc in documents:
    doc_with_embedding = doc.copy()
    doc_with_embedding["embedding"] = document_embeddings[doc["id"]]
    es.index(index=index_name_hybrid, id=doc["id"], document=doc_with_embedding)

# Refresh
es.indices.refresh(index=index_name_hybrid)
print(f"✅ Created hybrid index '{index_name_hybrid}' with {len(documents)} documents")


### III.2 Advanced Hybrid Search with RRF (Reciprocal Rank Fusion)

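**Reciprocal Rank Fusion (RRF)** combines results from multiple retrieval methods using only each document's rank in every result list, so BM25 and vector scores never have to be put on a common scale. It's available in Elasticsearch 8.8+, and the request syntax has changed between releases.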
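
The fusion step itself is short enough to sketch in pure Python: each document receives `1 / (rank_constant + rank)` from every ranking it appears in, and the contributions are summed. The function name and document IDs below are illustrative; `rank_constant=60` matches the value used in the request in this section.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["d2", "d1", "d3"]   # best lexical match first
knn_ranking = ["d2", "d3", "d4"]    # best semantic match first
fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (`d2`, `d3`) rise above those found by only one method, regardless of how the underlying scores were scaled.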


In [None]:
# Hybrid search using RRF (Reciprocal Rank Fusion)
# RRF combines results from multiple queries by fusing their rankings
query_text = "neural networks and deep learning"
query_embedding = get_embeddings_batch([query_text])[0]  # One text in, one vector out

# Note: RRF requires Elasticsearch 8.8+. If the request is rejected, the fallback below is used.
try:
    response = es.search(
        index=index_name_hybrid,
        body={
            # Lexical leg: BM25 over title and content
            "query": {
                "multi_match": {
                    "query": query_text,
                    "fields": ["title^3", "content"]
                }
            },
            # Semantic leg: kNN is a top-level clause
            # (entries of "sub_searches" only accept a "query")
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding,
                "k": 10,
                "num_candidates": 20
            },
            "rank": {
                "rrf": {
                    "window_size": 20,  # Newer releases name this "rank_window_size"
                    "rank_constant": 60
                }
            },
            "_source": ["title", "content", "category"]
        }
    )
    display_search_results(response, f"RRF Hybrid search: '{query_text}'")
except Exception as e:
    print(f"⚠️  RRF not available (requires ES 8.8+): {e}")
    print("Using alternative hybrid search method...")
    
    # Fallback: manual hybrid search
    response = es.search(
        index=index_name_hybrid,
        body={
            "query": {
                "bool": {
                    "should": [
                        {"multi_match": {"query": query_text, "fields": ["title^3", "content"]}},
                        {"match_all": {}}
                    ]
                }
            },
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding,
                "k": 5,
                "num_candidates": 10,
                "boost": 0.5  # Weight for vector search
            },
            "_source": ["title", "content", "category"]
        }
    )
    display_search_results(response, f"Hybrid search (fallback): '{query_text}'")


## Part IV: Comparison: BM25 vs Vector vs Hybrid

Let's compare the three approaches side by side:


In [None]:
# Comparison function
def compare_search_methods(query_text: str):
    """Compare BM25, Vector, and Hybrid search for the same query"""
    query_embedding = get_embeddings_batch([query_text])[0]  # One text in, one vector out
    
    print("=" * 80)
    print(f"COMPARISON FOR QUERY: '{query_text}'")
    print("=" * 80)
    
    # 1. BM25 only
    print("\n📝 BM25 SEARCH (Keyword-based):")
    print("-" * 80)
    bm25_response = es.search(
        index=index_name_bm25,
        body={
            "query": {
                "multi_match": {
                    "query": query_text,
                    "fields": ["title^3", "content"]
                }
            },
            "size": 3
        }
    )
    for i, hit in enumerate(bm25_response['hits']['hits'], 1):
        print(f"  {i}. [{hit['_score']:.4f}] {hit['_source']['title']}")
    
    # 2. Vector only
    print("\n🔢 VECTOR SEARCH (Semantic):")
    print("-" * 80)
    vector_response = es.search(
        index=index_name_vector,
        body={
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding,
                "k": 3,
                "num_candidates": 10
            },
            "size": 3
        }
    )
    for i, hit in enumerate(vector_response['hits']['hits'], 1):
        print(f"  {i}. [{hit['_score']:.4f}] {hit['_source']['title']}")
    
    # 3. Hybrid
    print("\n🚀 HYBRID SEARCH (BM25 + Vector):")
    print("-" * 80)
    hybrid_response = es.search(
        index=index_name_hybrid,
        body={
            "query": {
                "multi_match": {
                    "query": query_text,
                    "fields": ["title^3", "content"]
                }
            },
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding,
                "k": 3,
                "num_candidates": 10,
                "boost": 0.5
            },
            "size": 3
        }
    )
    for i, hit in enumerate(hybrid_response['hits']['hits'], 1):
        print(f"  {i}. [{hit['_score']:.4f}] {hit['_source']['title']}")
    
    print("\n" + "=" * 80)

# Test with different queries
test_queries = [
    "machine learning",
    "search engine technology",
    "artificial intelligence"
]

for query in test_queries:
    compare_search_methods(query)
    print("\n")


## Part V: Advanced Query Techniques

### V.1 Query Boosting and Negative Queries

We can boost certain terms and exclude others:


In [None]:
# Advanced query with boosting and exclusions
response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "bool": {
                "must": [
                    {"match": {"content": "learning"}}
                ],
                "should": [
                    {"match": {"content": {"query": "neural", "boost": 2.0}}},
                    {"match": {"content": {"query": "deep", "boost": 1.5}}}
                ],
                "must_not": [
                    {"term": {"category": "Database"}}  # Exclude Database category
                ],
                "minimum_should_match": 0
            }
        }
    }
)

display_search_results(response, "Advanced query: 'learning' (boosted: neural, deep; excluded: Database)")


### V.2 Phrase Matching and Proximity

For exact phrase matching and controlling word proximity:


In [None]:
# Phrase matching: words must appear in exact order
response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "match_phrase": {
                "content": {
                    "query": "machine learning",
                    "slop": 2  # Allow up to 2 words between "machine" and "learning"
                }
            }
        }
    }
)

display_search_results(response, "Phrase match: 'machine learning' (slop: 2)")


### V.3 Fuzzy Matching

Fuzzy matching handles typos and spelling variations:
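
Fuzziness is expressed as an edit distance: the number of single-character changes needed to turn one term into another. The sketch below implements plain Levenshtein distance and the length thresholds documented for `"AUTO"` (exact match up to 2 characters, one edit for 3 to 5, two edits beyond that). It is a simplification: Elasticsearch additionally counts a transposition of adjacent characters as a single edit by default. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def auto_fuzziness(term: str) -> int:
    """Allowed edits under fuzziness "AUTO", based on term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

for typo, target in [("machne", "machine"), ("lerning", "learning")]:
    distance = levenshtein(typo, target)
    print(typo, "->", target, "distance:", distance, "allowed:", auto_fuzziness(typo))
```

Both misspellings in the query are one edit away from the intended word and long enough to allow two edits, which is why the fuzzy query can still match.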


In [None]:
# Fuzzy matching: handles typos
response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "match": {
                "content": {
                    "query": "machne lerning",  # Intentional typos
                    "fuzziness": "AUTO"  # Auto-detect fuzziness based on term length
                }
            }
        }
    }
)

display_search_results(response, "Fuzzy match: 'machne lerning' (with typos)")


## Part VI: Performance Optimization

### VI.1 Index Settings for Performance

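Index settings involve trade-offs: a longer `refresh_interval` reduces indexing overhead at the cost of newly indexed documents becoming searchable later, and the HNSW parameters `m` and `ef_construction` control how densely connected the vector graph is and how carefully it is built: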


In [None]:
# Performance-optimized index settings
index_name_optimized = "articles_optimized"

if es.indices.exists(index=index_name_optimized):
    es.indices.delete(index=index_name_optimized)

es.indices.create(
    index=index_name_optimized,
    body={
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "refresh_interval": "30s"
        },
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "content": {"type": "text"},
                "category": {"type": "keyword"},
                "views": {"type": "integer"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": embedding_dim,
                    "index": True,
                    "similarity": "cosine",
                    "index_options": {
                        "type": "hnsw",
                        "m": 16,
                        "ef_construction": 100
                    }
                }
            }
        }
    }
)

print("✅ Created optimized index with HNSW algorithm for faster vector search")


### VI.2 Query Performance Tips

Key tips for optimizing query performance:


In [None]:
# Performance tips demonstrated:

# 1. Limit result size
response = es.search(
    index=index_name_bm25,
    body={
        "query": {"match": {"content": "learning"}},
        "size": 5,  # Only return top 5 results
        "_source": ["title", "category"]  # Only return needed fields
    }
)

# 2. Use filters (faster than queries) when possible
response = es.search(
    index=index_name_bm25,
    body={
        "query": {
            "bool": {
                "must": [
                    {"match": {"content": "learning"}}
                ],
                "filter": [  # Filters are cached and faster
                    {"term": {"category": "AI"}}
                ]
            }
        },
        "size": 5
    }
)

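# 3. Use search_after for pagination (better than from/size for large datasets)
page_request = {
    "query": {"match_all": {}},
    "size": 2,
    # search_after needs a deterministic sort; "id" breaks ties between equal view counts
    "sort": [{"views": "desc"}, {"id": "asc"}]
}
first_page = es.search(index=index_name_bm25, body=page_request)

# Pass the sort values of the last hit to fetch the next page
last_sort = first_page['hits']['hits'][-1]['sort']
response = es.search(
    index=index_name_bm25,
    body={**page_request, "search_after": last_sort}
)

print("✅ Performance optimization examples:")
print("   - Limited result size")
print("   - Used filters instead of queries where possible")
print("   - Used search_after for pagination")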


## Part VIII: Best Practices and Recommendations

### When to Use Each Method:

1. **BM25 (Text Search)**
   - ✅ Exact keyword matching is important
   - ✅ You need to handle rare terms well
   - ✅ Working with structured text data
   - ✅ Fast, no embedding generation needed

2. **Vector Search (Semantic Search)**
   - ✅ Semantic similarity is more important than exact matches
   - ✅ Multilingual content
   - ✅ Finding conceptually similar content
   - ✅ Working with embeddings from pre-trained models

3. **Hybrid Search**
   - ✅ **Best for most production use cases**
   - ✅ Need both keyword and semantic matching
   - ✅ Want to maximize recall and precision
   - ✅ Have resources for both text and vector indexing

### Performance Considerations:

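- **Index size**: Vector indices are larger than text indices; a 384-dimensional float32 vector alone takes about 1.5 KB per document, before any HNSW graph overhead
- **Query latency**: Hybrid search runs a text query and a vector search per request, so it typically costs more latency in exchange for better relevance
- **Embedding generation**: Cache embeddings for repeated texts and queries instead of recomputing them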
- **HNSW parameters**: Tune `m` and `ef_construction` based on your data size
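
Two of these points lend themselves to a quick sketch: estimating raw vector storage and caching embeddings. Both helpers below are hypothetical illustrations. `raw_vector_bytes` counts only the float32 values (no HNSW graph, no replicas), and `EmbeddingCache` is a simple in-memory cache around any embedding callable, shown here with a stand-in function rather than a real model.

```python
import hashlib

def raw_vector_bytes(n_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_docs vectors of `dims` values (float32 by default)."""
    return n_docs * dims * bytes_per_value

class EmbeddingCache:
    """Memoize embeddings in memory, keyed by a hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only call the (slow, possibly billed) embedder on a cache miss
            self._store[key] = self._embed_fn(text)
        return self._store[key]

# Stand-in embedder that counts how often it is invoked
call_count = 0
def fake_embed(text):
    global call_count
    call_count += 1
    return [float(len(text))] * 3

cache = EmbeddingCache(fake_embed)
first = cache.get("machine learning")
second = cache.get("machine learning")  # served from the cache

one_million_docs = raw_vector_bytes(1_000_000, 384)
print(call_count, one_million_docs)
```

In a long-running service, the dictionary would typically be replaced by a bounded or persistent store so the cache cannot grow without limit.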


## Summary

This notebook covered:

✅ **BM25**: Classic text-based retrieval with advanced query techniques  
✅ **Vector Search**: Semantic search using embeddings  
✅ **Hybrid Search**: Combining BM25 and vector search for optimal results  
✅ **Advanced Queries**: Boosting, filtering, phrase matching, fuzzy search  
✅ **Performance Optimization**: Index settings, query optimization  
✅ **Real-World System**: Complete search system implementation  

### Key Takeaways:

1. **BM25** is excellent for keyword matching and exact term retrieval
2. **Vector search** excels at semantic understanding and finding similar concepts
3. **Hybrid search** combines the best of both worlds and is recommended for production
4. **Performance** can be optimized through proper index settings and query design
5. **Real embedding models** should be used in production (not the demo function)

## Cleanup (Optional)

Uncomment the following cells to clean up the created indices:


In [None]:
# Uncomment to delete indices
# indices_to_delete = [index_name_bm25, index_name_vector, index_name_hybrid, index_name_optimized]
# 
# for index in indices_to_delete:
#     if es.indices.exists(index=index):
#         es.indices.delete(index=index)
#         print(f"🗑️  Deleted index: {index}")
# 
# print("✅ Cleanup complete")
