# 🚀 Specialized Vector Search in OpenSearch
![Course](../../static_images/ai_ml_search_opensearch_intermediate.jpeg)

## Course Demonstration: Advanced Vector Search Techniques

This notebook demonstrates three specialized vector search operations in OpenSearch:
- **Nested KNN Search**: Search within nested vector fields
- **Radial Search**: Find vectors within distance/similarity thresholds
- **MMR (Maximal Marginal Relevance)**: Balance relevance and diversity in results

---

## 📊 Course Flow Diagram

```mermaid
graph TD
    A["🎯 Specialized Vector Search"] --> B["1️⃣ Nested KNN Search"]
    A --> C["2️⃣ Radial Search"]
    A --> D["3️⃣ MMR Reranking"]
    
    B --> B1["📦 Multiple Vectors per Document"]
    B --> B2["🔍 Inner Hits Retrieval"]
    B --> B3["🎭 Nested Field Filtering"]
    
    C --> C1["📏 Max Distance Search"]
    C --> C2["⭐ Min Score Threshold"]
    C --> C3["🔀 Distance-based Filtering"]
    
    D --> D1["⚖️ Relevance vs Diversity"]
    D --> D2["🎪 Lambda Parameter Tuning"]
    D --> D3["👥 Candidate Selection"]
    
    B1 --> E["✨ Real-World Applications"]
    C1 --> E
    D1 --> E
    
    E --> F["🏆 Hybrid Strategies"]
    F --> G["🎓 Master Advanced Vector Search!"]
    
    style A fill:#FF6B6B,stroke:#C92A2A,color:#fff,stroke-width:3px
    style B fill:#4ECDC4,stroke:#1A9B8E,color:#fff,stroke-width:2px
    style C fill:#45B7D1,stroke:#0A7BA7,color:#fff,stroke-width:2px
    style D fill:#96CEB4,stroke:#56A876,color:#fff,stroke-width:2px
    style E fill:#FFEAA7,stroke:#FDCB6E,color:#000,stroke-width:2px
    style G fill:#FF6B6B,stroke:#C92A2A,color:#fff,stroke-width:3px
```

## 🔧 Setup and Configuration

In [12]:
# Import Required Libraries
from opensearchpy import OpenSearch
import sys, os
from opensearchpy.helpers import bulk
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import json
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## 🌐 Initialize OpenSearch Client

In [13]:
# Configuration
IS_AUTH = True
HOST = 'localhost'

# Get the current working directory of the notebook
current_dir = os.getcwd()

DATA_DIR = os.path.abspath(os.path.join(current_dir, '../../0. DATA'))

# Construct the path to the directory levels up
module_paths = [os.path.abspath(os.path.join(current_dir, '../../')),]

# Add the module path to sys.path if it's not already there
for module_path in module_paths:
    if module_path not in sys.path:
        sys.path.append(module_path)

try:
    import helpers as hp
except ImportError as e:
    print(f"⚠️ Note: helpers module not available: {e}")

# Initialize the OpenSearch client
if IS_AUTH:
    client = OpenSearch(
        hosts=[{'host': HOST, 'port': 9200}],
        http_auth=('admin', 'Developer@123'),
        use_ssl=True,
        verify_certs=False,
        ssl_show_warn=False
    )
else:
    client = OpenSearch(
        hosts=[{'host': HOST, 'port': 9200}],
        use_ssl=False,
        verify_certs=False,
        ssl_assert_hostname=False,
        ssl_show_warn=False
    )

# Verify connection
try:
    info = client.info()
    print(f"✅ Connected to {info['version']['distribution']} v{info['version']['number']}")
    health = client.cluster.health()
    print(f"📊 Cluster Status: {health['status']}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    raise

✅ Connected to opensearch v3.3.0
📊 Cluster Status: green
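
The cell above hardcodes the admin password, which is fine for a local demo cluster. As a minimal sketch, the same connection settings could be read from environment variables instead. The variable names (`OPENSEARCH_HOST`, `OPENSEARCH_PORT`, `OPENSEARCH_USER`, `OPENSEARCH_PASSWORD`) and the helper `client_kwargs_from_env` are hypothetical names chosen for illustration; they are not part of opensearch-py.

```python
import os


def client_kwargs_from_env(env=None):
    """Build keyword arguments for OpenSearch(...) from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {
        "hosts": [{
            "host": env.get("OPENSEARCH_HOST", "localhost"),
            "port": int(env.get("OPENSEARCH_PORT", "9200")),
        }],
        "verify_certs": False,   # demo setting, mirrors the notebook cell
        "ssl_show_warn": False,
    }
    password = env.get("OPENSEARCH_PASSWORD")
    if password:
        # Authenticated cluster: use basic auth over TLS
        kwargs["http_auth"] = (env.get("OPENSEARCH_USER", "admin"), password)
        kwargs["use_ssl"] = True
    else:
        # No password set: assume an unsecured local cluster
        kwargs["use_ssl"] = False
    return kwargs
```

With this in place, the client could be created as `client = OpenSearch(**client_kwargs_from_env())`.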


## 🎯 Part 1: Nested KNN Search

**Use Case**: Search documents containing multiple vectors (e.g., product reviews with embeddings for different aspects)

**Key Features**:
- Store multiple vectors per document in nested fields
- Search within nested vector structures
- Retrieve specific nested fields with `inner_hits`
- Apply filters at nested level

### 1.1 Create Index with Nested Vector Fields

In [3]:
# Create a nested vector index
nested_index_name = "nested_vector_index"

# Delete the index if it already exists
try:
    client.indices.delete(index=nested_index_name)
    print(f"🗑️  Deleted existing index: {nested_index_name}")
except Exception:
    # Most likely the index does not exist yet, so there is nothing to delete
    pass

nested_index_body = {
    "settings": {
        "index": {
            "knn": True,
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    },
    "mappings": {
        "properties": {
            "product_name": {
                "type": "text"
            },
            "category": {
                "type": "keyword"
            },
            "in_stock": {
                "type": "boolean"
            },
            "reviews": {
                "type": "nested",
                "properties": {
                    "review_vector": {
                        "type": "knn_vector",
                        "dimension": 3,
                        "space_type": "l2",
                        "method": {
                            "name": "hnsw",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 100,
                                "m": 16
                            }
                        }
                    },
                    "sentiment": {
                        "type": "keyword"
                    },
                    "rating": {
                        "type": "integer"
                    }
                }
            }
        }
    }
}

response = client.indices.create(index=nested_index_name, body=nested_index_body)
print(f"✅ Created nested index: {nested_index_name}")
print(json.dumps(response, indent=2))

✅ Created nested index: nested_vector_index
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "nested_vector_index"
}


### 1.2 Index Documents with Nested Vectors

In [4]:
# Index sample documents with nested vectors
nested_documents = [
    {
        "_index": nested_index_name,
        "_id": "1",
        "_source": {
            "product_name": "Premium Headphones",
            "category": "Electronics",
            "in_stock": True,
            "reviews": [
                {
                    "review_vector": [1.0, 1.0, 1.0],
                    "sentiment": "positive",
                    "rating": 5
                },
                {
                    "review_vector": [1.5, 1.2, 1.1],
                    "sentiment": "positive",
                    "rating": 5
                },
                {
                    "review_vector": [2.0, 2.0, 2.0],
                    "sentiment": "positive",
                    "rating": 4
                }
            ]
        }
    },
    {
        "_index": nested_index_name,
        "_id": "2",
        "_source": {
            "product_name": "Budget Headphones",
            "category": "Electronics",
            "in_stock": True,
            "reviews": [
                {
                    "review_vector": [10.0, 10.0, 10.0],
                    "sentiment": "negative",
                    "rating": 2
                },
                {
                    "review_vector": [20.0, 20.0, 20.0],
                    "sentiment": "negative",
                    "rating": 1
                },
                {
                    "review_vector": [30.0, 30.0, 30.0],
                    "sentiment": "negative",
                    "rating": 2
                }
            ]
        }
    },
    {
        "_index": nested_index_name,
        "_id": "3",
        "_source": {
            "product_name": "Mid-Range Speaker",
            "category": "Audio",
            "in_stock": False,
            "reviews": [
                {
                    "review_vector": [5.0, 5.0, 5.0],
                    "sentiment": "neutral",
                    "rating": 3
                },
                {
                    "review_vector": [6.0, 6.5, 5.5],
                    "sentiment": "positive",
                    "rating": 4
                }
            ]
        }
    }
]

# Bulk index documents; bulk() returns (success_count, list_of_errors)
success, failed = bulk(client, nested_documents)
print(f"✅ Indexed {success} documents successfully")
if failed:
    print(f"❌ Failed to index {len(failed)} documents")

# Wait for the index refresh so the documents become searchable
time.sleep(1)
print("📝 Documents indexed!")

✅ Indexed 3 documents successfully
📝 Documents indexed!


### 1.3 Nested KNN Search with Inner Hits

In [5]:
# Perform nested KNN search with inner_hits
query_vector = [1.0, 1.0, 1.0]

nested_search_body = {
    "_source": False,
    "query": {
        "nested": {
            "path": "reviews",
            "query": {
                "knn": {
                    "reviews.review_vector": {
                        "vector": query_vector,
                        "k": 3
                    }
                }
            },
            "inner_hits": {
                "_source": False,
                "fields": ["reviews.sentiment", "reviews.rating"]
            }
        }
    }
}

response = client.search(index=nested_index_name, body=nested_search_body)

print("🔍 Nested KNN Search Results:")
print("="*80)
for hit in response['hits']['hits']:
    print(f"\n📦 Document ID: {hit['_id']} (Score: {hit['_score']:.4f})")
    if 'inner_hits' in hit:
        inner_hits = hit['inner_hits']['reviews']['hits']['hits']
        print(f"   🎯 Matched {len(inner_hits)} nested review(s):")
        for inner_hit in inner_hits:
            fields = inner_hit['fields']
            print(f"      - Sentiment: {fields.get('reviews.sentiment', ['N/A'])[0]}, "
                  f"Rating: {fields.get('reviews.rating', ['N/A'])[0]} (Score: {inner_hit['_score']:.4f})")

🔍 Nested KNN Search Results:

📦 Document ID: 1 (Score: 1.0000)
   🎯 Matched 1 nested review(s):
      - Sentiment: positive, Rating: 5 (Score: 1.0000)

📦 Document ID: 3 (Score: 0.0204)
   🎯 Matched 1 nested review(s):
      - Sentiment: neutral, Rating: 3 (Score: 0.0204)

📦 Document ID: 2 (Score: 0.0041)
   🎯 Matched 1 nested review(s):
      - Sentiment: negative, Rating: 2 (Score: 0.0041)
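
The scores above are consistent with the usual conversion for the `l2` space type, score = 1 / (1 + d²), where d² is the squared Euclidean distance between the query and the stored vector. The sketch below recomputes the best-matching review of each document from the vectors indexed in 1.2; `l2_score` is a helper name introduced here for illustration.

```python
def l2_score(query, vector):
    """Convert squared Euclidean distance into a similarity score in (0, 1]."""
    d2 = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + d2)


query = [1.0, 1.0, 1.0]

# Closest review vector of each document, copied from the indexed data
best_review_per_doc = {
    "1": [1.0, 1.0, 1.0],
    "3": [5.0, 5.0, 5.0],
    "2": [10.0, 10.0, 10.0],
}

scores = {doc_id: l2_score(query, vec) for doc_id, vec in best_review_per_doc.items()}
for doc_id, score in scores.items():
    print(f"Document {doc_id}: {score:.4f}")
```

Document 3, for example, has d² = 3 × 4² = 48, which gives 1 / 49 ≈ 0.0204.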


### 1.4 Retrieve All Nested Hits with Score Mode

#### Understanding `score_mode`

When searching nested fields, the `score_mode` parameter determines how the parent document's relevance score is calculated:

| Mode | Description | Use Case |
|------|-------------|----------|
| **avg** (default) | Average score of all matching nested documents | When you want balanced scoring across all matches |
| **max** | Highest score among all matching nested documents | When the best match matters most; amplifies relevance |
| **min** | Lowest score among all matching nested documents | When all matches must meet a threshold; conservative scoring |
| **sum** | Sum of all matching nested document scores | When cumulative relevance matters |
| **none** | No scoring applied to nested query | When you only care about matching, not ranking |

**Example**: With 3 nested reviews scoring [0.9, 0.7, 0.6]:
- `avg`: Parent score ≈ 0.73
- `max`: Parent score = 0.9 (highest)
- `min`: Parent score = 0.6 (lowest)
- `sum`: Parent score = 2.2
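
The arithmetic behind each mode can be sketched in a few lines of plain Python. This only illustrates how child scores are combined; it is not how OpenSearch implements the nested query, and `parent_score` is a hypothetical helper. Treating `none` as a score of 0 is an assumption based on the table's "no scoring applied" description.

```python
def parent_score(child_scores, score_mode="avg"):
    """Combine nested (child) scores into one parent score."""
    reducers = {
        "avg": lambda s: sum(s) / len(s),
        "max": max,
        "min": min,
        "sum": sum,
        "none": lambda s: 0.0,  # assumption: child scores are ignored entirely
    }
    if score_mode not in reducers:
        raise ValueError(f"Unknown score_mode: {score_mode}")
    return reducers[score_mode](child_scores)


child_scores = [0.9, 0.7, 0.6]
for mode in ("avg", "max", "min", "sum", "none"):
    print(f"{mode}: {parent_score(child_scores, mode):.2f}")
```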

In this demonstration, we use `score_mode: "max"` combined with `expand_nested_docs: True` to get all matching nested documents ranked by their best score.

In [6]:
# Search with expand_nested_docs to get all nested hits
nested_search_all = {
    "_source": False,
    "query": {
        "nested": {
            "path": "reviews",
            "query": {
                "knn": {
                    "reviews.review_vector": {
                        "vector": query_vector,
                        "k": 3,
                        "expand_nested_docs": True
                    }
                }
            },
            "inner_hits": {
                "_source": False,
                "fields": ["reviews.sentiment", "reviews.rating"]
            },
            "score_mode": "max"
        }
    }
}

response = client.search(index=nested_index_name, body=nested_search_all)

print("🔍 Nested KNN Search - All Nested Hits (expand_nested_docs=True):")
print("="*80)
for hit in response['hits']['hits']:
    print(f"\n📦 Document ID: {hit['_id']} (Score: {hit['_score']:.4f})")
    if 'inner_hits' in hit:
        inner_hits = hit['inner_hits']['reviews']['hits']['hits']
        print(f"   🎯 Found {len(inner_hits)} nested review(s):")
        for idx, inner_hit in enumerate(inner_hits, 1):
            fields = inner_hit['fields']
            print(f"      {idx}. Sentiment: {fields.get('reviews.sentiment', ['N/A'])[0]}, "
                  f"Rating: {fields.get('reviews.rating', ['N/A'])[0]} (Score: {inner_hit['_score']:.4f})")

🔍 Nested KNN Search - All Nested Hits (expand_nested_docs=True):

📦 Document ID: 1 (Score: 1.0000)
   🎯 Found 3 nested review(s):
      1. Sentiment: positive, Rating: 5 (Score: 1.0000)
      2. Sentiment: positive, Rating: 5 (Score: 0.7692)
      3. Sentiment: positive, Rating: 4 (Score: 0.2500)

📦 Document ID: 3 (Score: 0.0204)
   🎯 Found 2 nested review(s):
      1. Sentiment: neutral, Rating: 3 (Score: 0.0204)
      2. Sentiment: positive, Rating: 4 (Score: 0.0131)

📦 Document ID: 2 (Score: 0.0041)
   🎯 Found 3 nested review(s):
      1. Sentiment: negative, Rating: 2 (Score: 0.0041)
      2. Sentiment: negative, Rating: 1 (Score: 0.0009)
      3. Sentiment: negative, Rating: 2 (Score: 0.0004)
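
For further analysis it can be handy to flatten the nested response into one row per review. The sketch below walks the same `inner_hits` structure the loop above reads. `flatten_inner_hits` is a hypothetical helper, and `sample_response` is a hand-written, trimmed stand-in for a real response (values copied from the output above), so the block runs without a cluster.

```python
def flatten_inner_hits(response, path="reviews"):
    """Return one dict per nested hit, tagged with its parent document."""
    rows = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for rank, inner_hit in enumerate(inner, 1):
            fields = inner_hit.get("fields", {})
            rows.append({
                "doc_id": hit["_id"],
                "doc_score": hit["_score"],
                "rank": rank,
                "sentiment": fields.get(f"{path}.sentiment", [None])[0],
                "rating": fields.get(f"{path}.rating", [None])[0],
                "review_score": inner_hit["_score"],
            })
    return rows


# Trimmed, hand-written sample shaped like the search response above
sample_response = {
    "hits": {"hits": [{
        "_id": "3",
        "_score": 0.0204,
        "inner_hits": {"reviews": {"hits": {"hits": [
            {"_score": 0.0204,
             "fields": {"reviews.sentiment": ["neutral"], "reviews.rating": [3]}},
            {"_score": 0.0131,
             "fields": {"reviews.sentiment": ["positive"], "reviews.rating": [4]}},
        ]}}},
    }]}
}

rows = flatten_inner_hits(sample_response)
for row in rows:
    print(row)
```

Passing the result to `pd.DataFrame(rows)` gives a table that can be sorted or filtered by `review_score`.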


### 1.5 Nested Search with Filtering

In [7]:
# Nested search with filter on top-level field
nested_search_filtered = {
    "_source": ["product_name", "category"],
    "query": {
        "nested": {
            "path": "reviews",
            "query": {
                "knn": {
                    "reviews.review_vector": {
                        "vector": query_vector,
                        "k": 3,
                        "filter": {
                            "term": {
                                "in_stock": True
                            }
                        }
                    }
                }
            },
            "inner_hits": {
                "_source": False,
                "fields": ["reviews.sentiment", "reviews.rating"]
            }
        }
    }
}

response = client.search(index=nested_index_name, body=nested_search_filtered)

print("🔍 Nested KNN Search - Filtered (in_stock=True):")
print("="*80)
print(f"\n📊 Found {response['hits']['total']['value']} document(s)\n")
for hit in response['hits']['hits']:
    source = hit['_source']
    print(f"📦 {source['product_name']} (Category: {source['category']})")
    print(f"   Score: {hit['_score']:.4f}")
    if 'inner_hits' in hit:
        inner_hits = hit['inner_hits']['reviews']['hits']['hits']
        print(f"   🎯 Matched {len(inner_hits)} review(s)")

🔍 Nested KNN Search - Filtered (in_stock=True):

📊 Found 2 document(s)

📦 Premium Headphones (Category: Electronics)
   Score: 1.0000
   🎯 Matched 1 review(s)
📦 Budget Headphones (Category: Electronics)
   Score: 0.0041
   🎯 Matched 1 review(s)


---

## 🎯 Part 2: Radial Search (Distance & Similarity-Based)

**Use Case**: Find all vectors within a specific distance or similarity threshold (e.g., product recommendations with minimum quality threshold)

**Key Features**:
- Search with `max_distance`: Return all vectors within a given distance of the query vector
- Search with `min_score`: Return all vectors whose similarity score meets a threshold
- `k` is omitted: a radial query returns every match inside the radius rather than a fixed top-K
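
The two request shapes differ only in which parameter accompanies `vector`. Below is a hedged sketch of a small builder that enforces choosing exactly one radial parameter. `build_radial_query` is a hypothetical helper, not an opensearch-py function. The `size` argument is included because a search response returns at most `size` hits (10 by default), so the reported total can exceed the number of rows that come back.

```python
def build_radial_query(field, vector, max_distance=None, min_score=None, size=10):
    """Build a radial k-NN search body with exactly one threshold parameter."""
    if (max_distance is None) == (min_score is None):
        raise ValueError("Specify exactly one of max_distance or min_score")

    params = {"vector": list(vector)}
    if max_distance is not None:
        params["max_distance"] = max_distance
    else:
        params["min_score"] = min_score

    return {"size": size, "query": {"knn": {field: params}}}


body = build_radial_query("product_vector", [7.1, 8.3], max_distance=2, size=20)
print(body)
```

The resulting dictionary can be passed as `body=` to `client.search(...)` in the same way as the hand-written queries in this part.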

### 2.1 Create Index for Radial Search

In [14]:
# Create radial search index
radial_index_name = "radial_search_index"

# Delete the index if it already exists
try:
    client.indices.delete(index=radial_index_name)
    print(f"🗑️  Deleted existing index: {radial_index_name}")
except Exception:
    # Most likely the index does not exist yet, so there is nothing to delete
    pass

radial_index_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "product_vector": {
                "type": "knn_vector",
                "dimension": 2,
                "space_type": "l2",
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 100,
                        "m": 16,
                        "ef_search": 100
                    }
                }
            },
            "product_name": {
                "type": "text"
            },
            "price": {
                "type": "float"
            },
            "quality_score": {
                "type": "float"
            }
        }
    }
}

response = client.indices.create(index=radial_index_name, body=radial_index_body)
print(f"✅ Created radial search index: {radial_index_name}")

✅ Created radial search index: radial_search_index


### 2.2 Index Documents for Radial Search

In [15]:
# Index sample documents with 2D vectors (20 products with varied vectors and prices)
radial_documents = [
    {"_index": radial_index_name, "_id": "1", "_source": {
        "product_name": "Product A",
        "product_vector": [7.0, 8.2],
        "price": 4.4,
        "quality_score": 0.85
    }},
    {"_index": radial_index_name, "_id": "2", "_source": {
        "product_name": "Product B",
        "product_vector": [7.1, 7.4],
        "price": 14.2,
        "quality_score": 0.92
    }},
    {"_index": radial_index_name, "_id": "3", "_source": {
        "product_name": "Product C",
        "product_vector": [7.3, 8.3],
        "price": 19.1,
        "quality_score": 0.88
    }},
    {"_index": radial_index_name, "_id": "4", "_source": {
        "product_name": "Product D",
        "product_vector": [6.5, 8.8],
        "price": 1.2,
        "quality_score": 0.72
    }},
    {"_index": radial_index_name, "_id": "5", "_source": {
        "product_name": "Product E",
        "product_vector": [5.7, 7.9],
        "price": 16.5,
        "quality_score": 0.91
    }},
    {"_index": radial_index_name, "_id": "6", "_source": {
        "product_name": "Product F",
        "product_vector": [7.2, 8.1],
        "price": 9.8,
        "quality_score": 0.87
    }},
    {"_index": radial_index_name, "_id": "7", "_source": {
        "product_name": "Product G",
        "product_vector": [6.8, 7.9],
        "price": 12.5,
        "quality_score": 0.89
    }},
    {"_index": radial_index_name, "_id": "8", "_source": {
        "product_name": "Product H",
        "product_vector": [7.4, 8.0],
        "price": 22.3,
        "quality_score": 0.93
    }},
    {"_index": radial_index_name, "_id": "9", "_source": {
        "product_name": "Product I",
        "product_vector": [6.9, 8.4],
        "price": 7.6,
        "quality_score": 0.81
    }},
    {"_index": radial_index_name, "_id": "10", "_source": {
        "product_name": "Product J",
        "product_vector": [7.05, 7.8],
        "price": 11.1,
        "quality_score": 0.86
    }},
    {"_index": radial_index_name, "_id": "11", "_source": {
        "product_name": "Product K",
        "product_vector": [5.5, 8.5],
        "price": 2.8,
        "quality_score": 0.70
    }},
    {"_index": radial_index_name, "_id": "12", "_source": {
        "product_name": "Product L",
        "product_vector": [8.0, 7.5],
        "price": 25.4,
        "quality_score": 0.95
    }},
    {"_index": radial_index_name, "_id": "13", "_source": {
        "product_name": "Product M",
        "product_vector": [6.6, 8.6],
        "price": 5.2,
        "quality_score": 0.78
    }},
    {"_index": radial_index_name, "_id": "14", "_source": {
        "product_name": "Product N",
        "product_vector": [7.15, 8.25],
        "price": 17.9,
        "quality_score": 0.90
    }},
    {"_index": radial_index_name, "_id": "15", "_source": {
        "product_name": "Product O",
        "product_vector": [6.7, 7.7],
        "price": 13.2,
        "quality_score": 0.84
    }},
    {"_index": radial_index_name, "_id": "16", "_source": {
        "product_name": "Product P",
        "product_vector": [7.25, 7.95],
        "price": 20.5,
        "quality_score": 0.91
    }},
    {"_index": radial_index_name, "_id": "17", "_source": {
        "product_name": "Product Q",
        "product_vector": [5.9, 8.0],
        "price": 8.7,
        "quality_score": 0.80
    }},
    {"_index": radial_index_name, "_id": "18", "_source": {
        "product_name": "Product R",
        "product_vector": [7.35, 8.15],
        "price": 24.1,
        "quality_score": 0.94
    }},
    {"_index": radial_index_name, "_id": "19", "_source": {
        "product_name": "Product S",
        "product_vector": [6.4, 8.2],
        "price": 3.5,
        "quality_score": 0.75
    }},
    {"_index": radial_index_name, "_id": "20", "_source": {
        "product_name": "Product T",
        "product_vector": [7.12, 8.22],
        "price": 18.6,
        "quality_score": 0.89
    }},
]

success, failed = bulk(client, radial_documents)
print(f"✅ Indexed {success} documents for radial search")

# Wait for indexing
time.sleep(1)
print("📝 Documents ready for radial search!")

✅ Indexed 20 documents for radial search
📝 Documents ready for radial search!


### 2.3 Radial Search with Max Distance

In [16]:
# Radial search with max_distance
query_vector = [7.1, 8.3]
max_distance = 2

radial_search_distance = {
    "query": {
        "knn": {
            "product_vector": {
                "vector": query_vector,
                "max_distance": max_distance
            }
        }
    }
}

response = client.search(index=radial_index_name, body=radial_search_distance)

print(f"🔍 Radial Search - Max Distance: {max_distance}")
print(f"Query Vector: {query_vector}")
print("="*80)
print(f"\n📊 Found {response['hits']['total']['value']} products within distance\n")

results_df = pd.DataFrame([
    {
        "Product": hit['_source']['product_name'],
        "Vector": hit['_source']['product_vector'],
        "Price": hit['_source']['price'],
        "Quality": hit['_source']['quality_score'],
        "Score": hit['_score']
    }
    for hit in response['hits']['hits']
])

print(results_df.to_string(index=False))

🔍 Radial Search - Max Distance: 2
Query Vector: [7.1, 8.3]

📊 Found 18 products within distance

  Product       Vector  Price  Quality    Score
Product N [7.15, 8.25]   17.9     0.90 0.995025
Product T [7.12, 8.22]   18.6     0.89 0.993246
Product A   [7.0, 8.2]    4.4     0.85 0.980392
Product C   [7.3, 8.3]   19.1     0.88 0.961538
Product I   [6.9, 8.4]    7.6     0.81 0.952381
Product F   [7.2, 8.1]    9.8     0.87 0.952381
Product R [7.35, 8.15]   24.1     0.94 0.921659
Product P [7.25, 7.95]   20.5     0.91 0.873362
Product H   [7.4, 8.0]   22.3     0.93 0.847457
Product G   [6.8, 7.9]   12.5     0.89 0.800000


### 2.4 Radial Search with Min Score

In [17]:
# Radial search with min_score
min_score = 0.95

radial_search_score = {
    "query": {
        "knn": {
            "product_vector": {
                "vector": query_vector,
                "min_score": min_score
            }
        }
    }
}

response = client.search(index=radial_index_name, body=radial_search_score)

print(f"🔍 Radial Search - Min Score: {min_score}")
print(f"Query Vector: {query_vector}")
print("="*80)
print(f"\n📊 Found {response['hits']['total']['value']} products with score >= {min_score}\n")

if response['hits']['total']['value'] > 0:
    results_df = pd.DataFrame([
        {
            "Product": hit['_source']['product_name'],
            "Vector": hit['_source']['product_vector'],
            "Price": hit['_source']['price'],
            "Quality": hit['_source']['quality_score'],
            "Similarity Score": hit['_score']
        }
        for hit in response['hits']['hits']
    ])
    print(results_df.to_string(index=False))
else:
    print("⚠️  No products found with this similarity threshold")

🔍 Radial Search - Min Score: 0.95
Query Vector: [7.1, 8.3]

📊 Found 6 products with score >= 0.95

  Product       Vector  Price  Quality  Similarity Score
Product N [7.15, 8.25]   17.9     0.90          0.995025
Product T [7.12, 8.22]   18.6     0.89          0.993246
Product A   [7.0, 8.2]    4.4     0.85          0.980392
Product C   [7.3, 8.3]   19.1     0.88          0.961538
Product I   [6.9, 8.4]    7.6     0.81          0.952381
Product F   [7.2, 8.1]    9.8     0.87          0.952381
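
The counts reported by the two radial searches (18 and 6) can be reproduced by brute force. The results are consistent with `max_distance` being compared against the squared Euclidean distance for the `l2` space type: Product E at [5.7, 7.9] is only about 1.46 away from the query in plain Euclidean terms, yet it is not among the 18 matches, which fits a squared distance of 2.12 > 2. The sketch below assumes that interpretation and the score conversion 1 / (1 + d²); the helper names are introduced here for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Similarity score assumed for the l2 space type."""
    return 1.0 / (1.0 + squared_l2(a, b))


query = [7.1, 8.3]

# Product vectors copied from the indexing cell in 2.2
vectors = {
    "A": [7.0, 8.2], "B": [7.1, 7.4], "C": [7.3, 8.3], "D": [6.5, 8.8],
    "E": [5.7, 7.9], "F": [7.2, 8.1], "G": [6.8, 7.9], "H": [7.4, 8.0],
    "I": [6.9, 8.4], "J": [7.05, 7.8], "K": [5.5, 8.5], "L": [8.0, 7.5],
    "M": [6.6, 8.6], "N": [7.15, 8.25], "O": [6.7, 7.7], "P": [7.25, 7.95],
    "Q": [5.9, 8.0], "R": [7.35, 8.15], "S": [6.4, 8.2], "T": [7.12, 8.22],
}

within_distance = [name for name, vec in vectors.items() if squared_l2(query, vec) <= 2]
above_score = [name for name, vec in vectors.items() if l2_score(query, vec) >= 0.95]

print(f"max_distance=2 -> {len(within_distance)} products")
print(f"min_score=0.95 -> {len(above_score)} products: {sorted(above_score)}")
```

Under this conversion the two thresholds are interchangeable: `max_distance = d` corresponds to `min_score = 1 / (1 + d)`, so a maximum distance of 2 is equivalent to a minimum score of about 0.333.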


### 2.5 Radial Search with Filtering

In [18]:
# Radial search with max_distance and filter
radial_search_filtered = {
    "query": {
        "knn": {
            "product_vector": {
                "vector": query_vector,
                "max_distance": 2,
                "filter": {
                    "range": {
                        "price": {
                            "gte": 1,
                            "lte": 20
                        }
                    }
                }
            }
        }
    }
}

response = client.search(index=radial_index_name, body=radial_search_filtered)

print("🔍 Radial Search - Max Distance with Price Filter")
print(f"Query Vector: {query_vector}")
print("Max Distance: 2, Price Range: $1-$20")
print("="*80)
print(f"\n📊 Found {response['hits']['total']['value']} products\n")

results_df = pd.DataFrame([
    {
        "Product": hit['_source']['product_name'],
        "Price": f"${hit['_source']['price']:.2f}",
        "Quality": hit['_source']['quality_score'],
        "Score": hit['_score']
    }
    for hit in response['hits']['hits']
])

print(results_df.to_string(index=False))

🔍 Radial Search - Max Distance with Price Filter
Query Vector: [7.1, 8.3]
Max Distance: 2, Price Range: $1-$20

📊 Found 14 products

  Product  Price  Quality    Score
Product N $17.90     0.90 0.995025
Product T $18.60     0.89 0.993246
Product A  $4.40     0.85 0.980392
Product C $19.10     0.88 0.961538
Product I  $7.60     0.81 0.952381
Product F  $9.80     0.87 0.952381
Product G $12.50     0.89 0.800000
Product J $11.10     0.86 0.798403
Product M  $5.20     0.78 0.746269
Product S  $3.50     0.75 0.666667


---

## 🎯 Part 3: MMR (Maximal Marginal Relevance) Reranking

**Use Case**: Get diverse results that balance relevance with diversity (e.g., diverse product recommendations, varied search results)

**Key Features**:
- **Diversity Parameter (λ)**: Controls the trade-off between relevance and diversity
  - λ close to 0: Prioritize relevance
  - λ close to 1: Prioritize diversity
- **Candidates**: Number of initial candidates before reranking
- **Formula**: MMR = (1-λ) × relevance_score - λ × max(similarity_with_selected_docs)
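
To make the formula concrete, here is a minimal, self-contained sketch of greedy MMR selection in plain Python. It is an illustration of the formula above, not the OpenSearch implementation. It assumes cosine similarity both as the relevance score and as the similarity between documents, and the toy 2D vectors and the name `mmr_rerank` are made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_rerank(query, candidates, k, diversity):
    """Greedily pick k candidate names using
    MMR = (1 - diversity) * relevance - diversity * max_similarity_to_selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr(name):
            relevance = cosine(query, remaining[name])
            # Redundancy is 0 while nothing has been selected yet
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected


query = [1.0, 0.0]
candidates = {
    "a": [0.98, 0.20],   # very relevant
    "b": [0.97, 0.25],   # very relevant, nearly identical to "a"
    "c": [0.70, -0.70],  # less relevant, but different from "a"
}

relevance_only = mmr_rerank(query, candidates, k=2, diversity=0.0)
balanced = mmr_rerank(query, candidates, k=2, diversity=0.5)
print("diversity=0.0:", relevance_only)
print("diversity=0.5:", balanced)
```

With `diversity=0.0` the two most relevant candidates are chosen even though they are near-duplicates; with `diversity=0.5` the second pick switches to the less relevant but dissimilar candidate.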

### 3.1 Enable MMR System Factories

In [19]:
# Enable MMR system-generated search processor factories
mmr_settings = {
    "persistent": {
        "cluster.search.enabled_system_generated_factories": [
            "mmr_over_sample_factory",
            "mmr_rerank_factory"
        ]
    }
}

try:
    response = client.cluster.put_settings(body=mmr_settings)
    print("✅ MMR system factories enabled")
    print(json.dumps(response, indent=2))
except Exception as e:
    print(f"⚠️  Note: {e}")
    print("   Proceeding with MMR demonstration...")

✅ MMR system factories enabled
{
  "acknowledged": true,
  "persistent": {
    "cluster": {
      "search": {
        "enabled_system_generated_factories": [
          "mmr_over_sample_factory",
          "mmr_rerank_factory"
        ]
      }
    }
  },
  "transient": {}
}


### 3.2 Create Index for MMR Search

In [21]:
# Create MMR search index
mmr_index_name = "mmr_search_index"

# Delete the index if it already exists
try:
    client.indices.delete(index=mmr_index_name)
    print(f"🗑️  Deleted existing index: {mmr_index_name}")
except Exception:
    # Most likely the index does not exist yet, so there is nothing to delete
    pass

mmr_index_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "article_vector": {
                "type": "knn_vector",
                "dimension": 3,
                "space_type": "cosinesimil",
                "method": {
                    "name": "hnsw",
                    "engine": "lucene",
                    "parameters": {
                        "ef_construction": 128,
                        "m": 24
                    }
                }
            },
            "article_title": {
                "type": "text"
            },
            "category": {
                "type": "keyword"
            },
            "views": {
                "type": "integer"
            }
        }
    }
}

response = client.indices.create(index=mmr_index_name, body=mmr_index_body)
print(f"✅ Created MMR search index: {mmr_index_name}")

✅ Created MMR search index: mmr_search_index


### 3.3 Index Documents for MMR Demonstration

In [22]:
# Generate normalized 3D vectors for cosine similarity
def normalize_vector(v):
    """Normalize vector to unit length"""
    norm = np.sqrt(sum(x**2 for x in v))
    return [x/norm for x in v] if norm > 0 else v

# Create diverse documents with related but different vectors (20 articles)
mmr_documents = [
    # Machine Learning & AI (highly similar to query)
    {"_index": mmr_index_name, "_id": "1", "_source": {
        "article_title": "Introduction to Machine Learning",
        "article_vector": normalize_vector([0.9, 0.3, 0.1]),
        "category": "Technology",
        "views": 1500
    }},
    {"_index": mmr_index_name, "_id": "2", "_source": {
        "article_title": "Advanced Neural Networks",
        "article_vector": normalize_vector([0.85, 0.35, 0.15]),
        "category": "Technology",
        "views": 2200
    }},
    {"_index": mmr_index_name, "_id": "3", "_source": {
        "article_title": "Deep Learning Essentials",
        "article_vector": normalize_vector([0.88, 0.32, 0.12]),
        "category": "Technology",
        "views": 2800
    }},
    {"_index": mmr_index_name, "_id": "4", "_source": {
        "article_title": "Transfer Learning Techniques",
        "article_vector": normalize_vector([0.87, 0.34, 0.14]),
        "category": "Technology",
        "views": 1900
    }},
    {"_index": mmr_index_name, "_id": "5", "_source": {
        "article_title": "Reinforcement Learning Guide",
        "article_vector": normalize_vector([0.86, 0.33, 0.11]),
        "category": "Technology",
        "views": 1700
    }},
    # Programming & Development
    {"_index": mmr_index_name, "_id": "6", "_source": {
        "article_title": "Python for Data Science",
        "article_vector": normalize_vector([0.88, 0.38, 0.12]),
        "category": "Programming",
        "views": 3100
    }},
    {"_index": mmr_index_name, "_id": "7", "_source": {
        "article_title": "JavaScript Best Practices",
        "article_vector": normalize_vector([0.3, 0.85, 0.2]),
        "category": "Programming",
        "views": 2400
    }},
    {"_index": mmr_index_name, "_id": "8", "_source": {
        "article_title": "Go Programming Fundamentals",
        "article_vector": normalize_vector([0.25, 0.82, 0.18]),
        "category": "Programming",
        "views": 1600
    }},
    {"_index": mmr_index_name, "_id": "9", "_source": {
        "article_title": "Rust Memory Management",
        "article_vector": normalize_vector([0.28, 0.84, 0.19]),
        "category": "Programming",
        "views": 2000
    }},
    # Web Development
    {"_index": mmr_index_name, "_id": "10", "_source": {
        "article_title": "Web Development Best Practices",
        "article_vector": normalize_vector([0.2, 0.9, 0.15]),
        "category": "Web",
        "views": 1800
    }},
    {"_index": mmr_index_name, "_id": "11", "_source": {
        "article_title": "React Hooks Advanced",
        "article_vector": normalize_vector([0.22, 0.88, 0.16]),
        "category": "Web",
        "views": 2600
    }},
    {"_index": mmr_index_name, "_id": "12", "_source": {
        "article_title": "Vue.js Performance Tips",
        "article_vector": normalize_vector([0.21, 0.89, 0.14]),
        "category": "Web",
        "views": 1900
    }},
    # Cloud & Infrastructure
    {"_index": mmr_index_name, "_id": "13", "_source": {
        "article_title": "Cloud Computing Fundamentals",
        "article_vector": normalize_vector([0.1, 0.2, 0.95]),
        "category": "Infrastructure",
        "views": 2500
    }},
    {"_index": mmr_index_name, "_id": "14", "_source": {
        "article_title": "DevOps Engineering Guide",
        "article_vector": normalize_vector([0.15, 0.25, 0.92]),
        "category": "Infrastructure",
        "views": 1600
    }},
    {"_index": mmr_index_name, "_id": "15", "_source": {
        "article_title": "Kubernetes Best Practices",
        "article_vector": normalize_vector([0.12, 0.22, 0.94]),
        "category": "Infrastructure",
        "views": 2200
    }},
    {"_index": mmr_index_name, "_id": "16", "_source": {
        "article_title": "Docker Container Optimization",
        "article_vector": normalize_vector([0.14, 0.23, 0.93]),
        "category": "Infrastructure",
        "views": 1800
    }},
    # Data Science & Analytics
    {"_index": mmr_index_name, "_id": "17", "_source": {
        "article_title": "Big Data Processing with Spark",
        "article_vector": normalize_vector([0.8, 0.4, 0.2]),
        "category": "Data Science",
        "views": 2300
    }},
    {"_index": mmr_index_name, "_id": "18", "_source": {
        "article_title": "Statistical Analysis Methods",
        "article_vector": normalize_vector([0.75, 0.42, 0.22]),
        "category": "Data Science",
        "views": 1700
    }},
    # Security & DevSecOps
    {"_index": mmr_index_name, "_id": "19", "_source": {
        "article_title": "Cybersecurity Best Practices",
        "article_vector": normalize_vector([0.35, 0.45, 0.8]),
        "category": "Security",
        "views": 2100
    }},
    {"_index": mmr_index_name, "_id": "20", "_source": {
        "article_title": "API Security and Authentication",
        "article_vector": normalize_vector([0.38, 0.48, 0.78]),
        "category": "Security",
        "views": 1900
    }},
]

success, failed = bulk(client, mmr_documents)
print(f"✅ Indexed {success} articles for MMR search")

# Refresh so the new documents are searchable immediately (more reliable than a fixed sleep)
client.indices.refresh(index=mmr_index_name)
print("📝 Articles ready for MMR demonstration!")

✅ Indexed 20 articles for MMR search
📝 Articles ready for MMR demonstration!


### 3.4 Standard KNN Search (Relevance Only)

In [23]:
# Standard KNN search - purely based on relevance
query_vector = normalize_vector([0.9, 0.3, 0.1])

standard_knn = {
    "size": 5,
    "_source": ["article_title", "category", "views"],
    "query": {
        "knn": {
            "article_vector": {
                "vector": query_vector,
                "k": 5
            }
        }
    }
}

response = client.search(index=mmr_index_name, body=standard_knn)

print("🔍 Standard KNN Search (Relevance Only)")
print("Query Vector: [0.9, 0.3, 0.1] (normalized)")
print("="*80)
print(f"\n📊 Found {response['hits']['total']['value']} articles\n")

knn_results = pd.DataFrame([
    {
        "#": idx,
        "Article": hit['_source']['article_title'],
        "Category": hit['_source']['category'],
        "Views": hit['_source']['views'],
        "Relevance Score": f"{hit['_score']:.4f}"
    }
    for idx, hit in enumerate(response['hits']['hits'], 1)
])

print(knn_results.to_string(index=False))
print("\n⚠️  Notice: The top 4 results are all Technology articles clustered near the query - low diversity!")

🔍 Standard KNN Search (Relevance Only)
Query Vector: [0.9, 0.3, 0.1] (normalized)

📊 Found 5 articles

 #                          Article    Category  Views Relevance Score
 1 Introduction to Machine Learning  Technology   1500          1.0000
 2         Deep Learning Essentials  Technology   2800          0.9997
 3     Reinforcement Learning Guide  Technology   1700          0.9995
 4     Transfer Learning Techniques  Technology   1900          0.9989
 5          Python for Data Science Programming   3100          0.9981

⚠️  Notice: The top 4 results are all Technology articles clustered near the query - low diversity!
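
The relevance scores in this table can be reproduced offline. The sketch below assumes the index uses the cosine similarity space, where OpenSearch reports the score as (1 + cos θ) / 2. The index mapping is not shown in this part of the notebook, so treat the space type as an assumption. `unit` and `cosine_score` are illustrative helpers, and the three vectors are copied from the indexing cell above.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_score(query, doc):
    """Cosine similarity mapped to [0, 1] as (1 + cos) / 2 (assumed score formula)."""
    return (1.0 + float(unit(query) @ unit(doc))) / 2.0

query = [0.9, 0.3, 0.1]
articles = {
    "Reinforcement Learning Guide": [0.86, 0.33, 0.11],
    "Transfer Learning Techniques": [0.87, 0.34, 0.14],
    "Python for Data Science": [0.88, 0.38, 0.12],
}

# Score each article against the query and round like the notebook output
scores = {title: round(cosine_score(query, vec), 4) for title, vec in articles.items()}
for title, score in scores.items():
    print(f"{title}: {score:.4f}")
```

Under that assumption the three values round to 0.9995, 0.9989, and 0.9981, matching rows 3 to 5 of the table. The gap between rank 3 and rank 5 is only about 0.0014, which is why relevance-only ranking returns near-duplicates.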


### 3.5 MMR Search with Low Diversity (λ=0.3)

In [25]:
# MMR search with low diversity (favoring relevance)
mmr_low_diversity = {
    "size": 5,
    "_source": ["article_title", "category", "views"],
    "query": {
        "knn": {
            "article_vector": {
                "vector": query_vector,
                "k": 10
            }
        }
    },
    "ext": {
        "mmr": {
            "diversity": 0.3,
            "candidates": 10
        }
    }
}

try:
    response = client.search(index=mmr_index_name, body=mmr_low_diversity)

    print("🔍 MMR Search - Low Diversity (λ=0.3)")
    print("Query Vector: [0.9, 0.3, 0.1] (normalized)")
    print("Diversity Parameter (λ): 0.3 - Favors Relevance")
    print("="*80)
    print(f"\n📊 Found {response['hits']['total']['value']} articles\n")

    mmr_results = pd.DataFrame([
        {
            "#": idx,
            "Article": hit['_source']['article_title'],
            "Category": hit['_source']['category'],
            "Views": hit['_source']['views'],
            "MMR Score": f"{hit['_score']:.4f}"
        }
        for idx, hit in enumerate(response['hits']['hits'], 1)
    ])

    print(mmr_results.to_string(index=False))

except Exception as e:
    print(f"⚠️  MMR Search Note: {e}")
    print("   This is expected if the cluster does not support MMR or it is not enabled. Standard KNN results shown above.")


🔍 MMR Search - Low Diversity (λ=0.3)
Query Vector: [0.9, 0.3, 0.1] (normalized)
Diversity Parameter (λ): 0.3 - Favors Relevance

📊 Found 10 articles

 #                          Article    Category  Views MMR Score
 1 Introduction to Machine Learning  Technology   1500    1.0000
 2         Deep Learning Essentials  Technology   2800    0.9997
 3     Reinforcement Learning Guide  Technology   1700    0.9995
 4     Transfer Learning Techniques  Technology   1900    0.9989
 5          Python for Data Science Programming   3100    0.9981


### 3.6 MMR Search with High Diversity (λ=0.7)

In [26]:
# MMR search with high diversity (favoring diversity over relevance)
mmr_high_diversity = {
    "size": 5,
    "_source": ["article_title", "category", "views"],
    "query": {
        "knn": {
            "article_vector": {
                "vector": query_vector,
                "k": 10
            }
        }
    },
    "ext": {
        "mmr": {
            "diversity": 0.7,
            "candidates": 10
        }
    }
}

try:
    response = client.search(index=mmr_index_name, body=mmr_high_diversity)

    print("🔍 MMR Search - High Diversity (λ=0.7)")
    print("Query Vector: [0.9, 0.3, 0.1] (normalized)")
    print("Diversity Parameter (λ): 0.7 - Favors Diversity")
    print("="*80)
    print(f"\n📊 Found {response['hits']['total']['value']} articles\n")

    # Note: in the output below, _score appears to be the original k-NN similarity score;
    # MMR changes which hits are kept and their order, so the column is not sorted.
    mmr_results_high = pd.DataFrame([
        {
            "#": idx,
            "Article": hit['_source']['article_title'],
            "Category": hit['_source']['category'],
            "Views": hit['_source']['views'],
            "MMR Score": f"{hit['_score']:.4f}"
        }
        for idx, hit in enumerate(response['hits']['hits'], 1)
    ])

    print(mmr_results_high.to_string(index=False))
    print("\n✨ Notice: Results include articles from different categories for better diversity!")
except Exception as e:
    print(f"⚠️  MMR Search Note: {e}")
    print("   This is expected if the cluster does not support MMR or it is not enabled. Standard KNN results shown above.")

🔍 MMR Search - High Diversity (λ=0.7)
Query Vector: [0.9, 0.3, 0.1] (normalized)
Diversity Parameter (λ): 0.7 - Favors Diversity

📊 Found 10 articles

 #                          Article     Category  Views MMR Score
 1 Introduction to Machine Learning   Technology   1500    1.0000
 2           Rust Memory Management  Programming   2000    0.8027
 3     Statistical Analysis Methods Data Science   1700    0.9862
 4         Advanced Neural Networks   Technology   2200    0.9980
 5     Reinforcement Learning Guide   Technology   1700    0.9995

✨ Notice: Results include articles from different categories for better diversity!
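
One way to quantify the change is the mean pairwise cosine similarity within a result list, where lower means more diverse. The sketch below is an illustration, not an OpenSearch feature: `mean_pairwise_cosine` is a hypothetical helper, and each list contains only the hits whose vectors appear in the indexing cell shown above, so the numbers describe subsets of the two result tables.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs in a result list."""
    units = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in vectors]
    sims = [float(a @ b) for a, b in combinations(units, 2)]
    return sum(sims) / len(sims)

# Hits from the standard KNN table (vectors copied from the indexing cell)
knn_subset = [
    [0.86, 0.33, 0.11],  # Reinforcement Learning Guide
    [0.87, 0.34, 0.14],  # Transfer Learning Techniques
    [0.88, 0.38, 0.12],  # Python for Data Science
]

# Hits from the high-diversity MMR table
mmr_subset = [
    [0.86, 0.33, 0.11],  # Reinforcement Learning Guide
    [0.28, 0.84, 0.19],  # Rust Memory Management
    [0.75, 0.42, 0.22],  # Statistical Analysis Methods
]

knn_similarity = mean_pairwise_cosine(knn_subset)
mmr_similarity = mean_pairwise_cosine(mmr_subset)
print(f"Standard KNN subset: {knn_similarity:.4f}")
print(f"MMR (λ=0.7) subset:  {mmr_similarity:.4f}")
```

The standard KNN subset comes out very close to 1.0, while the MMR subset is noticeably lower, which gives a number to track when tuning `diversity`.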


---

## 📊 Comparison Summary

In [27]:
comparison_data = {
    "Technique": [
        "Nested KNN",
        "Radial Search",
        "MMR Reranking"
    ],
    "Use Case": [
        "Multiple vectors per document",
        "Distance/similarity thresholds",
        "Balanced relevance & diversity"
    ],
    "Key Feature": [
        "Inner hits retrieval",
        "max_distance/min_score",
        "Diversity parameter (λ)"
    ],
    "Best For": [
        "Product reviews, multi-part docs",
        "Threshold-based filtering",
        "Diverse recommendations"
    ],
    "Parameter": [
        "expand_nested_docs",
        "max_distance or min_score",
        "candidates, diversity"
    ]
}

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "="*100)
print("🎓 SPECIALIZED VECTOR SEARCH TECHNIQUES - COMPARISON")
print("="*100)
print()
print(comparison_df.to_string(index=False))
print()
print("="*100)


🎓 SPECIALIZED VECTOR SEARCH TECHNIQUES - COMPARISON

    Technique                       Use Case            Key Feature                         Best For                 Parameter
   Nested KNN  Multiple vectors per document   Inner hits retrieval Product reviews, multi-part docs        expand_nested_docs
Radial Search Distance/similarity thresholds max_distance/min_score        Threshold-based filtering max_distance or min_score
MMR Reranking Balanced relevance & diversity Diversity parameter (λ)          Diverse recommendations     candidates, diversity



## 🎯 Key Takeaways

### 1. **Nested KNN Search** 🔍
- **What**: Search multiple vectors stored in nested fields within a single document
- **When**: Documents have complex structures with multiple vector representations
- **Examples**: Product reviews, multi-language documents, aspect-based embeddings
- **Advantage**: Keep related vectors together while enabling granular search
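
As a recap of the request shape, here is a minimal sketch of a nested k-NN search body. The field names `reviews` and `review_vector` are hypothetical, `build_nested_knn_query` is an illustrative helper rather than part of opensearch-py, and `expand_nested_docs` is only accepted by OpenSearch versions that support it.

```python
import json

def build_nested_knn_query(path, vector_field, query_vector, k, expand_nested_docs=True):
    """Build a nested k-NN search body that also returns the matching nested docs."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        f"{path}.{vector_field}": {
                            "vector": list(query_vector),
                            "k": k,
                            # Score every nested doc instead of only the best one per parent
                            "expand_nested_docs": expand_nested_docs,
                        }
                    }
                },
                # Ask for the nested documents that produced the match
                "inner_hits": {},
                "score_mode": "max",
            }
        }
    }

# Hypothetical index layout: each product holds a list of reviews with one vector each
body = build_nested_knn_query("reviews", "review_vector", [0.9, 0.3, 0.1], k=3)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as `body` to `client.search`, assuming the index maps `reviews` as a nested field containing a `knn_vector`.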

### 2. **Radial Search** 📏
- **What**: Find all vectors within a specified distance or similarity threshold
- **When**: The result set should be defined by an absolute distance/similarity cutoff rather than a fixed count
- **Parameters** (use one or the other):
  - `max_distance`: Maximum distance from the query vector, measured in the index's space type
  - `min_score`: Minimum similarity score a hit must reach
- **Advantage**: No need to specify k; every vector that passes the threshold qualifies (the number returned is still capped by `size`)
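
To make the "no k" behavior concrete, here is a brute-force sketch in NumPy. It is not how OpenSearch executes radial search, and it rests on two assumptions: distance is plain Euclidean distance between unit vectors, and score is (1 + cos) / 2. `radial_search` is a hypothetical helper, and the vectors are taken from the MMR demo articles.

```python
import numpy as np

def radial_search(query, docs, max_distance=None, min_score=None):
    """Return every doc inside the threshold; exactly one threshold must be given."""
    if (max_distance is None) == (min_score is None):
        raise ValueError("Specify exactly one of max_distance or min_score")
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    hits = []
    for title, vec in docs.items():
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)
        if max_distance is not None:
            value = float(np.linalg.norm(q - v))   # assumed: Euclidean distance
            keep = value <= max_distance
        else:
            value = (1.0 + float(q @ v)) / 2.0     # assumed: cosine-based score
            keep = value >= min_score
        if keep:
            hits.append((title, value))
    # Closest first for distances, highest first for scores
    return sorted(hits, key=lambda h: h[1], reverse=min_score is not None)

docs = {
    "Reinforcement Learning Guide": [0.86, 0.33, 0.11],
    "Transfer Learning Techniques": [0.87, 0.34, 0.14],
    "Python for Data Science": [0.88, 0.38, 0.12],
    "Statistical Analysis Methods": [0.75, 0.42, 0.22],
    "Rust Memory Management": [0.28, 0.84, 0.19],
    "Cloud Computing Fundamentals": [0.1, 0.2, 0.95],
}
query = [0.9, 0.3, 0.1]

strict = radial_search(query, docs, min_score=0.99)
loose = radial_search(query, docs, min_score=0.98)
nearby = radial_search(query, docs, max_distance=0.1)
print(len(strict), len(loose), len(nearby))
```

Tightening or loosening the threshold changes how many articles come back, which is the main difference from a fixed-`k` query.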

### 3. **MMR (Maximal Marginal Relevance)** ⚖️
- **What**: Rerank results to balance relevance with diversity
- **When**: Users want diverse recommendations, not just similar items
- **Formula**: MMR = (1-λ) × relevance - λ × max(similarity_with_selected), where λ is the `diversity` parameter (0 = relevance only; higher values favor diversity)
- **Advantage**: Improve coverage and reduce redundancy in results
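
The formula can be sketched as a greedy selection loop. This is a minimal NumPy illustration under stated assumptions, not OpenSearch's implementation: relevance and redundancy both use raw cosine similarity, `mmr_rerank` is a hypothetical helper, and the candidate pool is a small subset of the demo articles. It is not expected to reproduce the exact ordering returned by the cluster.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def mmr_rerank(query, candidates, diversity, top_n):
    """Greedy MMR: (1 - diversity) * relevance - diversity * max similarity to picks."""
    titles = list(candidates)
    vecs = np.array([unit(candidates[t]) for t in titles])
    relevance = vecs @ unit(query)   # cosine similarity to the query
    pairwise = vecs @ vecs.T         # cosine similarity between candidates
    selected, remaining = [], list(range(len(titles)))

    while remaining and len(selected) < top_n:
        def score(i):
            # Redundancy is 0 for the first pick because nothing is selected yet
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [titles[i] for i in selected]

candidates = {
    "Reinforcement Learning Guide": [0.86, 0.33, 0.11],
    "Transfer Learning Techniques": [0.87, 0.34, 0.14],
    "Python for Data Science": [0.88, 0.38, 0.12],
    "Statistical Analysis Methods": [0.75, 0.42, 0.22],
    "Rust Memory Management": [0.28, 0.84, 0.19],
    "Cloud Computing Fundamentals": [0.1, 0.2, 0.95],
}
query = [0.9, 0.3, 0.1]

relevance_only = mmr_rerank(query, candidates, diversity=0.0, top_n=3)
diverse = mmr_rerank(query, candidates, diversity=0.7, top_n=3)
print("diversity=0.0:", relevance_only)
print("diversity=0.7:", diverse)
```

With `diversity=0.0` the order is plain relevance ranking. With `diversity=0.7` the first pick is unchanged, but later picks move out of the tightly clustered Technology and Programming articles because their similarity to the first pick is penalized.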

### 🌟 Real-World Combinations:
1. **E-commerce**: Use Nested KNN for product reviews + MMR for recommendation diversity
2. **Content Discovery**: Use Radial Search for quality threshold + MMR for topic diversity
3. **Search Engines**: Use Nested KNN for passage-level vectors + MMR for source diversity

---

**Next Steps**: Experiment with these techniques on your own datasets to understand how they improve search quality!