# OpenSearch ML Ingestion & Search Pipeline Workflow

```mermaid
graph TD
    A["[DATA] Load SQUAD Dataset<br/>87,599 rows to 1000 sampled"] -->|Analyze Schema| B["[MAPPING] Auto-Generate Mappings<br/>exclude_from_vectors: id only"]
    
    B -->|"Approach 1<br/>create_vectors=False"| C["[INDEX] 1: No Vectors<br/>squad_sample_no_vectors<br/>5 fields"]
    B -->|"Approach 2<br/>create_vectors=True"| D["[INDEX] 2: Manual Vectors<br/>squad_sample_with_vectors<br/>8 vector fields - empty"]
    
    B -->|"Approach 3<br/>create_vectors=True<br/>+ pipeline"| E["[ML SETUP] Configure Components"]
    
    E -->|"1. Configure"| F["[SETTINGS] ML Settings<br/>Allow ML on data nodes<br/>Disable access control"]
    F -->|"2. Deploy"| G["[MODEL] ML Model<br/>msmarco-distilbert-base-tas-b<br/>768 dims, HNSW, L2"]
    G -->|"3. Create"| H["[PIPELINE] Ingest Pipeline<br/>squad_embedding_pipeline<br/>Fields: title, context, question"]
    
    H -->|"4. Index + Ingest"| I["[INDEX] 3: Auto-Embeddings<br/>squad_sample_with_pipeline<br/>8 vector fields - populated"]
    
    C -->|"1000 docs"| J["[READY] All Indices<br/>Ready for Search"]
    D -->|"1000 docs"| J
    I -->|"1000 docs"| J
    
    J -->|"Search Methods"| K["[SEMANTIC] Semantic Search<br/>k-NN neural queries"]
    J -->|"Search Methods"| L["[KEYWORD] Keyword Search<br/>BM25 multi-match"]
    J -->|"Search Methods"| M["[HYBRID] Hybrid Search<br/>Keyword + Semantic"]
    
    M -->|"Tune"| N["[TUNING] Relevance Tuning<br/>Boost adjustment<br/>Field-level boosting"]
    
    K -->|Results| O["[ANALYSIS] Compare & Analyze<br/>Precision vs Recall<br/>Performance metrics"]
    L -->|Results| O
    M -->|Results| O
    N -->|Results| O
    
    style A fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style B fill:#f3e5f5,stroke:#4a148c,stroke-width:3px,color:#000
    style C fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style D fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style E fill:#e8f5e9,stroke:#1b5e20,stroke-width:3px,color:#000
    style F fill:#fce4ec,stroke:#880e4f,stroke-width:2px,color:#000
    style G fill:#fce4ec,stroke:#880e4f,stroke-width:2px,color:#000
    style H fill:#fce4ec,stroke:#880e4f,stroke-width:2px,color:#000
    style I fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style J fill:#c8e6c9,stroke:#2e7d32,stroke-width:4px,color:#000
    style K fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000
    style L fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000
    style M fill:#b39ddb,stroke:#4527a0,stroke-width:3px,color:#000
    style N fill:#ffccbc,stroke:#bf360c,stroke-width:2px,color:#000
    style O fill:#a5d6a7,stroke:#1b5e20,stroke-width:3px,color:#000
```

## Complete Workflow Overview

**Part 1: Data Ingestion** (3 Approaches)
- **Basic Ingestion**: Traditional keyword search without embeddings
- **Manual Vector Fields**: Pre-computed embeddings from external models  
- **Automatic Embeddings**: ML-powered embedding generation via ingest pipelines (RECOMMENDED)

**Part 2: Search & Relevance** (4 Methods)
- **Keyword Search (BM25)**: Fast exact-match search
- **Semantic Search (k-NN)**: Meaning-based vector similarity
- **Hybrid Search**: Combines keyword + semantic for best results (RECOMMENDED)
- **Relevance Tuning**: Boost adjustments and field-level prioritization

**Vector Fields Created**: `title_embedding`, `context_embedding`, `question_embedding` (768 dimensions each)

## Config

In [1]:
import sys
import os
%load_ext autoreload
%autoreload 2

# Get the current working directory of the notebook
current_dir = os.getcwd()

DATA_DIR = os.path.abspath(os.path.join(current_dir, '../../0. DATA'))

# Construct the path to the directory levels up
module_paths = [os.path.abspath(os.path.join(current_dir, '../../')),] 

# Add the module path to sys.path if it's not already there
for module_path in module_paths:
    if module_path not in sys.path:
        sys.path.append(module_path)

try:
    import helpers as hp
except ImportError as e:
    raise ImportError(f"Error importing modules: {e}")

## üê≥ Docker Setup
- **If docker compose up fails , start it manually from shell**

In [3]:
%%bash
cd ../..
echo "üöÄ Starting fully optimized OpenSearch cluster..."

# Start the optimized cluster
docker compose -f docker-compose-opensearch-single.yml down -v
docker compose -f docker-compose-opensearch-single.yml up -d

# Wait for startup
echo "‚è≥ Waiting for cluster to initialize..."
sleep 45

# Check cluster health
echo "üè• Checking cluster health..."
curl -k -u admin:Developer@123 https://localhost:9200/_cluster/health?pretty

üöÄ Starting fully optimized OpenSearch cluster...


 Network 3ingest_and_search_concepts_opensearch-net  Creating
 Network 3ingest_and_search_concepts_opensearch-net  Created
 Volume "3ingest_and_search_concepts_opensearch-data"  Creating
 Volume "3ingest_and_search_concepts_opensearch-data"  Created
 Container opensearch-node1  Creating
 Container opensearch-dashboards  Creating
 Container opensearch-dashboards  Created
 Container opensearch-node1  Created
 Container opensearch-dashboards  Starting
 Container opensearch-node1  Starting
 Container opensearch-dashboards  Started
 Container opensearch-node1  Started


‚è≥ Waiting for cluster to initialize...
üè• Checking cluster health...


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   536  100   536    0     0   2335      0 --:--:-- --:--:-- --:--:--  2340


{
  "cluster_name" : "docker-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}


In [2]:
from opensearchpy import OpenSearch
from opensearch_py_ml.ml_commons import MLCommonClient
import time
import pandas as pd
from opensearchpy import OpenSearch, helpers

IS_AUTH = True # Set to False if security is disabled
HOST = 'localhost'  # Replace with your OpenSearch host, if running everything locally use 'localhost'

if IS_AUTH:
    # Initialize the OpenSearch client
    os_client = OpenSearch(
        hosts=[{'host': HOST, 'port': 9200}],
        http_auth=('admin', 'Developer@123'),  # Replace with your credentials
        use_ssl=True,
        verify_certs=False,
        ssl_show_warn=False
    )
else:
    # Initialize the OpenSearch client without authentication
    os_client = OpenSearch(
        hosts=[{'host': HOST, 'port': 9200}],
        use_ssl=False,
        verify_certs=False,
        ssl_assert_hostname = False,
        ssl_show_warn=False
    )

# Initialize ML Commons client
ml_client = MLCommonClient(os_client)

# Check if cluster is up
if (os_client.ping()):
    print("Connected to OpenSearch cluster")

Connected to OpenSearch cluster


## Read SQUAD Dataset

In [4]:
from IPython.display import display, HTML
import pandas as pd
import json
df_squad_train = pd.read_parquet(f"{DATA_DIR}/SQUAD-train.parquet")

# Check id is unique i.e. count of rows is same as count of unique ids
if {len(df_squad_train)} == {df_squad_train['id'].nunique()}:
    print("id is unique i.e. primary key")
else:
    print("id is not unique")

# Print pandas memory usage in MB
memory_usage = df_squad_train.memory_usage(deep=True)
memory_usage_mb = memory_usage / (1024 * 1024)
display(memory_usage_mb)
print(f"\nTotal memory usage: {memory_usage_mb.sum():.2f} MB")

# Enable word wrap for better readability in Jupyter Notebook
display(HTML("<style>.output_area pre {white-space: pre-wrap; word-break: break-word;}</style>")) 
# Display the first few rows of the dataframe
print("First few rows of the SQuAD training dataset:")
display(df_squad_train.head())

# Print one row as dictionary pretty print
print("One row as dictionary:")
import numpy as np

def convert_to_serializable(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    else:
        return obj

row_dict = df_squad_train.iloc[0].to_dict()
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)
print(json.dumps(convert_to_serializable(row_dict), indent=4, ensure_ascii=False))


id is unique i.e. primary key


Index        0.000126
id           6.098487
title        5.291620
context     83.031409
question     9.089622
answers     16.039856
dtype: float64


Total memory usage: 119.55 MB


First few rows of the SQuAD training dataset:


Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...


One row as dictionary:
{
    "id": "5733be284776f41900661182",
    "title": "University_of_Notre_Dame",
    "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {
        "text": [
            "Saint Bernadette Soubirous"
        ],
  

## Understanding Unicode Characters in JSON Output

When displaying JSON data that contains special characters (like accented letters, non-Latin scripts, etc.), we use `ensure_ascii=False` in `json.dumps()`.

**Why?**
- By default, `json.dumps()` escapes all non-ASCII characters (e.g., `√°` becomes `\u00e1`)
- With `ensure_ascii=False`, special characters are displayed in their actual form
- This makes the output much more readable for international text

**Example:**
- ‚ùå Without `ensure_ascii=False`: `"Bansk\u00e1 Akad\u00e9mia"`
- ‚úÖ With `ensure_ascii=False`: `"Bansk√° Akad√©mia"`

Both representations are valid JSON, but the second is more human-readable!

## Create Generic Function to Generate OpenSearch Mappings from DataFrame

This function analyzes a pandas DataFrame and automatically generates OpenSearch index mappings based on the column data types.

**Key Features:**
- Maps pandas dtypes to appropriate OpenSearch field types
- Optionally creates corresponding `knn_vector` fields for text columns to support semantic search
- The vector fields are configured with:
  - Dimensions: 768 (standard for many embedding models)
  - Method: HNSW (Hierarchical Navigable Small World graphs)
  - Space type: L2 (Euclidean distance)
  - Engine: Lucene
- Handles nested objects and arrays by using the `nested` type
- Returns a complete index body structure ready for index creation

In [5]:
def create_opensearch_mappings(df, create_vectors=False, pipeline_name=None, exclude_from_vectors=None):
    """
    Create OpenSearch index mappings from a pandas DataFrame.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame to generate mappings from
    create_vectors : bool, default=False
        If True, creates corresponding knn_vector fields for text columns
        with dimensions=768, method=hnsw, space_type=l2, engine=lucene
    pipeline_name : str, optional
        If provided, sets this as the default_pipeline in index settings.
        Used for automatic embedding generation during ingestion.
    exclude_from_vectors : list of str, optional
        List of field names to exclude from vector creation.
        Default is ['id', 'title'] if not provided.
    
    Returns:
    --------
    dict
        A dictionary containing the index body with mappings suitable for 
        OpenSearch index creation
    
    Example:
    --------
    >>> mappings = create_opensearch_mappings(df, create_vectors=True, exclude_from_vectors=['id', 'title', 'metadata'])
    >>> os_client.indices.create(index='my_index', body=mappings)
    """
    import numpy as np
    
    # Set default exclusion list if not provided
    if exclude_from_vectors is None:
        exclude_from_vectors = ['id']
    
    # Define dtype mapping from pandas to OpenSearch
    dtype_mapping = {
        'int64': 'long',
        'int32': 'integer',
        'int16': 'short',
        'int8': 'byte',
        'float64': 'double',
        'float32': 'float',
        'bool': 'boolean',
        'datetime64[ns]': 'date',
        'object': 'text',  # Default for object types (strings)
    }
    
    properties = {}
    
    for column in df.columns:
        dtype_str = str(df[column].dtype)
        
        # Handle datetime types
        if 'datetime' in dtype_str:
            properties[column] = {'type': 'date'}
        
        # Handle boolean
        elif dtype_str == 'bool':
            properties[column] = {'type': 'boolean'}
        
        # Handle numeric types
        elif dtype_str in ['int64', 'int32', 'int16', 'int8']:
            properties[column] = {'type': dtype_mapping.get(dtype_str, 'long')}
        
        elif dtype_str in ['float64', 'float32']:
            properties[column] = {'type': dtype_mapping.get(dtype_str, 'double')}
        
        # Handle object types (strings, nested structures)
        elif dtype_str == 'object':
            # Check if column contains nested structures (dict/list)
            sample_value = df[column].dropna().iloc[0] if not df[column].dropna().empty else None
            
            if isinstance(sample_value, (dict, list)):
                # Use nested type for complex structures
                properties[column] = {'type': 'nested'}
            else:
                # Standard text field with keyword sub-field
                properties[column] = {
                    'type': 'text',
                    'fields': {
                        'keyword': {
                            'type': 'keyword',
                            'ignore_above': 256
                        }
                    }
                }
                
                # Optionally create vector field for text columns
                # Exclude specified fields from vector creation
                if create_vectors and column not in exclude_from_vectors:
                    vector_field_name = f"{column}_embedding"
                    properties[vector_field_name] = {
                        'type': 'knn_vector',
                        'dimension': 768,
                        'method': {
                            'name': 'hnsw',
                            'space_type': 'l2',
                            'engine': 'lucene',
                            'parameters': {}
                        }
                    }
        
        # Default fallback
        else:
            properties[column] = {'type': 'text'}
    
    # Create the settings object
    settings = {
        'index': {
            'number_of_shards': 1,
            'number_of_replicas': 1,
            'knn': create_vectors  # Enable k-NN only if vectors are being created
        }
    }
    
    # Add default_pipeline if provided
    if pipeline_name:
        settings['default_pipeline'] = pipeline_name
    
    # Create the complete index body
    index_body = {
        'settings': settings,
        'mappings': {
            'properties': properties
        }
    }
    
    return index_body

## Sample SQUAD Dataset and Generate Mappings

Sample 1000 rows from the SQUAD training dataset to create a smaller dataset for testing.
Then generate OpenSearch mappings without vector fields (create_vectors=False) to see the basic mapping structure.

In [6]:
# Sample 1000 rows from the SQUAD training dataset
df_squad_sample = df_squad_train.sample(n=1000, random_state=42).reset_index(drop=True)

print(f"Original dataset size: {len(df_squad_train)} rows")
print(f"Sampled dataset size: {len(df_squad_sample)} rows")
print(f"\nDataset columns: {list(df_squad_sample.columns)}")
print(f"Dataset dtypes:\n{df_squad_sample.dtypes}")

# Display the first few rows of the sampled dataset
display(df_squad_sample.head())

Original dataset size: 87599 rows
Sampled dataset size: 1000 rows

Dataset columns: ['id', 'title', 'context', 'question', 'answers']
Dataset dtypes:
id          object
title       object
context     object
question    object
answers     object
dtype: object


Unnamed: 0,id,title,context,question,answers
0,56de4d9ecffd8e1900b4b7e2,Institute_of_technology,"The world's first institution of technology or technical university with tertiary technical education is the Bansk√° Akad√©mia in Bansk√° ≈†tiavnica, Slovakia, founded in 1735, Academy since December 13, 1762 established by queen Maria Theresa in order to train specialists of silver and gold mining and metallurgy in neighbourhood. Teaching started in 1764. Later the department of Mathematics, Mechanics and Hydraulics and department of Forestry were settled. University buildings are still at their place today and are used for teaching. University has launched the first book of electrotechnics in the world.",What year was the Bansk√° Akad√©mia founded?,"{'text': ['1735'], 'answer_start': [167]}"
1,572674a05951b619008f7319,Film_speed,"The standard specifies how speed ratings should be reported by the camera. If the noise-based speed (40:1) is higher than the saturation-based speed, the noise-based speed should be reported, rounded downwards to a standard value (e.g. 200, 250, 320, or 400). The rationale is that exposure according to the lower saturation-based speed would not result in a visibly better image. In addition, an exposure latitude can be specified, ranging from the saturation-based speed to the 10:1 noise-based speed. If the noise-based speed (40:1) is lower than the saturation-based speed, or undefined because of high noise, the saturation-based speed is specified, rounded upwards to a standard value, because using the noise-based speed would lead to overexposed images. The camera may also report the SOS-based speed (explicitly as being an SOS speed), rounded to the nearest standard speed rating.",What is another speed that can also be reported by the camera?,"{'text': ['SOS-based speed'], 'answer_start': [793]}"
2,5730bb058ab72b1400f9c72c,Sumer,"The most impressive and famous of Sumerian buildings are the ziggurats, large layered platforms which supported temples. Sumerian cylinder seals also depict houses built from reeds not unlike those built by the Marsh Arabs of Southern Iraq until as recently as 400 CE. The Sumerians also developed the arch, which enabled them to develop a strong type of dome. They built this by constructing and linking several arches. Sumerian temples and palaces made use of more advanced materials and techniques,[citation needed] such as buttresses, recesses, half columns, and clay nails.",Where were the use of advanced materials and techniques on display in Sumer?,"{'text': ['Sumerian temples and palaces'], 'answer_start': [421]}"
3,572781a5f1498d1400e8fa1f,"Ann_Arbor,_Michigan","Ann Arbor has a council-manager form of government. The City Council has 11 voting members: the mayor and 10 city council members. The mayor and city council members serve two-year terms: the mayor is elected every even-numbered year, while half of the city council members are up for election annually (five in even-numbered and five in odd-numbered years). Two council members are elected from each of the city's five wards. The mayor is elected citywide. The mayor is the presiding officer of the City Council and has the power to appoint all Council committee members as well as board and commission members, with the approval of the City Council. The current mayor of Ann Arbor is Christopher Taylor, a Democrat who was elected as mayor in 2014. Day-to-day city operations are managed by a city administrator chosen by the city council.",Who is elected every even numbered year?,"{'text': ['mayor'], 'answer_start': [192]}"
4,572843ce4b864d190016485c,John_von_Neumann,"Shortly before his death, when he was already quite ill, von Neumann headed the United States government's top secret ICBM committee, and it would sometimes meet in his home. Its purpose was to decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon. Von Neumann had long argued that while the technical obstacles were sizable, they could be overcome in time. The SM-65 Atlas passed its first fully functional test in 1959, two years after his death. The feasibility of an ICBM owed as much to improved, smaller warheads as it did to developments in rocketry, and his understanding of the former made his advice invaluable.",What was the purpose of top secret ICBM committee?,"{'text': ['decide on the feasibility of building an ICBM large enough to carry a thermonuclear weapon'], 'answer_start': [194]}"


## Create Index and Ingest Data Without Vector Fields

Generate OpenSearch mappings for the sampled dataset with `create_vectors=False`, create the index, and ingest the data.

**Steps:**
1. Generate mappings without vector fields
2. Delete the index if it already exists
3. Create a new index with the generated mappings
4. Prepare data for bulk ingestion
5. Use async bulk helpers to efficiently ingest 1000 documents
6. Verify the ingestion by checking document count

In [7]:
%%time

# Define index name
index_name = "squad_sample_no_vectors"

# Step 1: Generate mappings without vector fields
mappings_without_vectors = create_opensearch_mappings(df_squad_sample, create_vectors=False)

print("Generated OpenSearch mappings (without vector fields):")
print(json.dumps(mappings_without_vectors, indent=2))

# Step 2: Delete index if it exists
if os_client.indices.exists(index=index_name):
    print(f"\n{'='*60}")
    print(f"Deleting existing index: {index_name}")
    os_client.indices.delete(index=index_name)
    print(f"Index deleted successfully")
    print(f"{'='*60}")

# Step 3: Create the index with mappings
print(f"\n{'='*60}")
print(f"Creating index: {index_name}")
response = os_client.indices.create(index=index_name, body=mappings_without_vectors)
print(f"Index created successfully: {response}")
print(f"{'='*60}")

# Step 4: Prepare data for bulk ingestion
def generate_bulk_data(df, index_name):
    """
    Generator function to prepare data for bulk ingestion.
    Yields documents in the format required by opensearch helpers.bulk()
    """
    for idx, row in df.iterrows():
        # Convert row to dictionary
        doc = row.to_dict()
        
        # Convert numpy types to native Python types
        for key, value in doc.items():
            if isinstance(value, (np.integer, np.floating)):
                doc[key] = value.item()
            elif isinstance(value, np.ndarray):
                doc[key] = value.tolist()
        
        # Yield document with index name and _id
        yield {
            "_index": index_name,
            "_id": doc.get('id', idx),  # Use 'id' field if available, otherwise use index
            "_source": doc
        }

# Step 5: Bulk ingest using async helpers for better performance
print(f"\n{'='*60}")
print(f"Starting bulk ingestion of {len(df_squad_sample)} documents...")
start_time = time.time()

# Use bulk helper for async ingestion
success, failed = helpers.bulk(
    os_client,
    generate_bulk_data(df_squad_sample, index_name),
    chunk_size=500,  # Process 500 documents at a time
    request_timeout=60,
    raise_on_error=False,
    raise_on_exception=False
)

elapsed_time = time.time() - start_time
print(f"Bulk ingestion completed in {elapsed_time:.2f} seconds")
print(f"Successfully indexed: {success} documents")
print(f"Failed: {failed} documents")
print(f"{'='*60}")

# Step 6: Verify ingestion
time.sleep(1)  # Wait for refresh
os_client.indices.refresh(index=index_name)
count_response = os_client.count(index=index_name)
print(f"\n{'='*60}")
print(f"Total documents in index '{index_name}': {count_response['count']}")
print(f"{'='*60}")

# Show a sample document
search_response = os_client.search(index=index_name, body={"query": {"match_all": {}}, "size": 1})
print(f"\nSample document from index:")
print(json.dumps(search_response['hits']['hits'][0], indent=2, ensure_ascii=False))

Generated OpenSearch mappings (without vector fields):
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "knn": false
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "context": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "question": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "answers": {
        "type": "nested"
      }
    }
  }
}

Creatin

## Create Index with Vector Fields (Manual Approach - No Pipeline)

Generate OpenSearch mappings with `create_vectors=True`, create the index, and ingest the data WITHOUT an ingest pipeline.

**Important:** This approach creates vector fields but does NOT generate embeddings automatically. The embedding fields will be empty unless you manually generate and provide embeddings during ingestion.

**Steps:**
1. Generate mappings with vector fields enabled (no pipeline)
2. Delete the index if it already exists
3. Create a new index with vector-enabled mappings
4. Use async bulk helpers to efficiently ingest 1000 documents
5. Verify the ingestion and show field count comparison

**Use Case:** This approach is useful when:
- You want to generate embeddings externally (e.g., using a custom Python model)
- You need to pre-process or cache embeddings before ingestion
- You want to use a different embedding model than what's deployed in OpenSearch

In [8]:
%%time
# Define index name
index_name_with_vectors = "squad_sample_with_vectors"

# Step 1: Generate mappings with vector fields
mappings_with_vectors = create_opensearch_mappings(df_squad_sample, create_vectors=True)

print("Generated OpenSearch mappings (WITH vector fields):")
print(json.dumps(mappings_with_vectors, indent=2, ensure_ascii=False))

# Show the difference in field count
fields_without = len(mappings_without_vectors['mappings']['properties'])
fields_with = len(mappings_with_vectors['mappings']['properties'])
print(f"\n{'='*60}")
print(f"Number of fields without vectors: {fields_without}")
print(f"Number of fields with vectors: {fields_with}")
print(f"Additional vector fields created: {fields_with - fields_without}")
print(f"{'='*60}")

# Step 2: Delete index if it exists
if os_client.indices.exists(index=index_name_with_vectors):
    print(f"\n{'='*60}")
    print(f"Deleting existing index: {index_name_with_vectors}")
    os_client.indices.delete(index=index_name_with_vectors)
    print(f"Index deleted successfully")
    print(f"{'='*60}")

# Step 3: Create the index with vector-enabled mappings
print(f"\n{'='*60}")
print(f"Creating index: {index_name_with_vectors}")
response = os_client.indices.create(index=index_name_with_vectors, body=mappings_with_vectors)
print(f"Index created successfully: {response}")
print(f"{'='*60}")

# Step 4: Bulk ingest using async helpers
print(f"\n{'='*60}")
print(f"Starting bulk ingestion of {len(df_squad_sample)} documents...")
start_time = time.time()

# Note: For a production system with vector fields, you would generate embeddings
# for text fields before ingestion. This example ingests without embeddings.
success, failed = helpers.bulk(
    os_client,
    generate_bulk_data(df_squad_sample, index_name_with_vectors),
    chunk_size=500,
    request_timeout=60,
    raise_on_error=False,
    raise_on_exception=False
)

elapsed_time = time.time() - start_time
print(f"Bulk ingestion completed in {elapsed_time:.2f} seconds")
print(f"Successfully indexed: {success} documents")
print(f"Failed: {failed} documents")
print(f"{'='*60}")

# Step 5: Verify ingestion
time.sleep(1)  # Wait for refresh
os_client.indices.refresh(index=index_name_with_vectors)
count_response = os_client.count(index=index_name_with_vectors)
print(f"\n{'='*60}")
print(f"Total documents in index '{index_name_with_vectors}': {count_response['count']}")
print(f"{'='*60}")

# Get index mappings to verify vector fields
mappings_response = os_client.indices.get_mapping(index=index_name_with_vectors)
print(f"\nIndex mappings (showing vector fields):")
properties = mappings_response[index_name_with_vectors]['mappings']['properties']
vector_fields = [k for k in properties.keys() if k.endswith('_embedding')]
print(f"Vector fields created: {vector_fields}")

# Show a sample document
search_response = os_client.search(index=index_name_with_vectors, body={"query": {"match_all": {}}, "size": 1})
print(f"\nSample document from index:")
print(json.dumps(search_response['hits']['hits'][0], indent=2, ensure_ascii=False))

Generated OpenSearch mappings (WITH vector fields):
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {}
        }
      },
      "context": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "context_embeddi

## Setup ML Model and Ingest Pipeline for Automatic Embeddings

Before creating an index with vector fields that automatically generates embeddings, we need to:

1. **Register and deploy a pre-trained embedding model** from HuggingFace
2. **Create an ingest pipeline** that uses this model to generate embeddings automatically during indexing
3. **Configure the index** to use this pipeline as the default pipeline

This setup enables automatic embedding generation during document ingestion, eliminating the need to manually create embeddings before indexing.

**Model Details:**
- Model: `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b`
- Version: 1.0.1
- Format: TORCH_SCRIPT
- Dimensions: 768
- Use case: Semantic search, question-answering

## Configure ML Settings

Configure OpenSearch to allow ML operations on data nodes.

**Note:** In production environments with dedicated ML nodes, this configuration is not needed. For development/testing, we allow ML operations on data nodes.

In [9]:
# Configure cluster to allow ML operations
ml_settings = {
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": False,
        "plugins.ml_commons.model_access_control_enabled": False,
        "plugins.ml_commons.native_memory_threshold": 99
    }
}

try:
    response = os_client.cluster.put_settings(body=ml_settings)
    print("="*80)
    print("ML Configuration Status:")
    print("="*80)
    print("‚úì ML settings configured successfully")
    print("  - ML operations allowed on data nodes: True")
    print("  - Model access control: Disabled")
    print("  - Native memory threshold: 99%")
    print("\n‚úì Cluster is ready for ML model deployment")
    print("="*80)
except Exception as e:
    print(f"‚ö† Warning: Could not configure ML settings: {e}")
    print("  If ML nodes are properly configured, this error can be ignored")
    print("  Proceeding with model deployment...")

ML Configuration Status:
‚úì ML settings configured successfully
  - ML operations allowed on data nodes: True
  - Model access control: Disabled
  - Native memory threshold: 99%

‚úì Cluster is ready for ML model deployment


In [10]:
# Step 1: Register and deploy the sentence transformer model
print("="*80)
print("Registering and deploying ML model...")
print("="*80)

model_response = ml_client.register_pretrained_model(
    model_name="huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
    model_version="1.0.1",
    model_format="TORCH_SCRIPT",
    deploy_model=True,
    wait_until_deployed=True
)
model_id = model_response
print(f"Model ID: {model_id}")

# Step 2: Wait for model to be fully deployed
print("\nWaiting for model deployment...")
max_wait_time = 300  # 5 minutes max wait
start_time = time.time()

while True:
    model_info = ml_client.get_model_info(model_id)
    model_state = model_info.get('model_state', 'UNKNOWN')
    print(f"Current model state: {model_state}")
    
    if model_state == 'DEPLOYED':
        print("‚úì Model deployed successfully!")
        break
    
    if time.time() - start_time > max_wait_time:
        print("‚ö† Warning: Model deployment timeout. Proceeding anyway...")
        break
    
    time.sleep(5)

print(f"\n{'='*80}")
print(f"Model is ready for use")
print(f"{'='*80}")

Registering and deploying ML model...
Model was registered successfully. Model Id:  1kdFSpoBL6Mh-OCJpWi8
1kdFSpoBL6Mh-OCJpWi8
Task ID: 2EdFSpoBL6Mh-OCJ-Wga
Model deployed successfully
Model ID: 1kdFSpoBL6Mh-OCJpWi8

Waiting for model deployment...
Current model state: DEPLOYED
‚úì Model deployed successfully!

Model is ready for use


## Create Ingest Pipeline for Automatic Embedding Generation

Create an ingest pipeline that automatically generates embeddings for text fields during document ingestion.

The pipeline uses the `text_embedding` processor which:
- Takes text from specified source fields
- Generates 768-dimensional embeddings using the deployed model
- Stores embeddings in corresponding vector fields
- Runs automatically for every document ingested into indices using this pipeline

In [11]:
# Create a dynamic ingest pipeline based on text fields in the DataFrame
def create_embedding_pipeline(df, model_id, pipeline_name="squad_embedding_pipeline", exclude_from_embeddings=None):
    """
    Create an ingest pipeline that generates embeddings for text fields.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame to analyze for text fields
    model_id : str
        The ID of the deployed ML model
    pipeline_name : str
        Name for the ingest pipeline
    exclude_from_embeddings : list of str, optional
        List of field names to exclude from embedding generation.
        Default is ['id'] if not provided.
    
    Returns:
    --------
    str : The pipeline name
    """
    # Set default exclusion list if not provided
    if exclude_from_embeddings is None:
        exclude_from_embeddings = ['id']
    
    # Identify text fields (excluding nested structures and excluded fields)
    text_fields = []
    for column in df.columns:
        dtype_str = str(df[column].dtype)
        if dtype_str == 'object':
            sample_value = df[column].dropna().iloc[0] if not df[column].dropna().empty else None
            if not isinstance(sample_value, (dict, list)) and column not in exclude_from_embeddings:
                text_fields.append(column)
    
    # Create field_map for text_embedding processor
    field_map = {}
    for field in text_fields:
        field_map[field] = f"{field}_embedding"
    
    # Create pipeline body
    pipeline_body = {
        "description": f"Embedding pipeline for {pipeline_name}",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": field_map
                }
            }
        ]
    }
    
    # Delete pipeline if it exists
    try:
        os_client.ingest.delete_pipeline(id=pipeline_name)
        print(f"Deleted existing pipeline: {pipeline_name}")
    except:
        pass
    
    # Create the pipeline
    os_client.ingest.put_pipeline(id=pipeline_name, body=pipeline_body)
    print(f"‚úì Ingest pipeline created: {pipeline_name}")
    print(f"  Text fields to embed: {text_fields}")
    print(f"  Excluded fields: {exclude_from_embeddings}")
    print(f"  Field mappings: {field_map}")
    
    return pipeline_name

# Create the pipeline with custom exclusions
pipeline_name = create_embedding_pipeline(
    df_squad_sample, 
    model_id,
    exclude_from_embeddings=['id']  # Exclude only id, include title for better semantic matching
)
print(f"\n{'='*80}")
print(f"Pipeline '{pipeline_name}' is ready to use")
print(f"{'='*80}")

Deleted existing pipeline: squad_embedding_pipeline
‚úì Ingest pipeline created: squad_embedding_pipeline
  Text fields to embed: ['title', 'context', 'question']
  Excluded fields: ['id']
  Field mappings: {'title': 'title_embedding', 'context': 'context_embedding', 'question': 'question_embedding'}

Pipeline 'squad_embedding_pipeline' is ready to use


## Create Index with Pipeline and Ingest Data with Auto-Generated Embeddings

Now create an index that uses the ingest pipeline to automatically generate embeddings during document ingestion.

**Key Configuration:**
- `index.knn: true` - Enables k-NN functionality
- `default_pipeline: "squad_embedding_pipeline"` - Automatically processes all documents through the pipeline
- Vector fields are created for each text field to store the embeddings

**What happens during ingestion:**
1. Documents are sent to OpenSearch
2. The ingest pipeline intercepts them
3. Text fields are extracted and sent to the ML model
4. The model generates 768-dimensional embeddings
5. Embeddings are stored in the corresponding `_embedding` fields
6. The complete document (with embeddings) is indexed

This approach eliminates the need to manually generate embeddings before ingestion!

In [12]:
%%time
# Define index name
index_name_with_pipeline = "squad_sample_with_pipeline"

# Step 1: Generate mappings with vector fields AND pipeline configuration
mappings_with_pipeline = create_opensearch_mappings(
    df_squad_sample, 
    create_vectors=True,
    pipeline_name=pipeline_name
)

print("Generated OpenSearch mappings (WITH vector fields and pipeline):")
print(json.dumps(mappings_with_pipeline, indent=2, ensure_ascii=False))

# Verify the settings include both knn and default_pipeline
print(f"\n{'='*80}")
print("Index settings configuration:")
print(f"  - index.knn: {mappings_with_pipeline['settings']['index']['knn']}")
print(f"  - default_pipeline: {mappings_with_pipeline['settings'].get('default_pipeline', 'Not set')}")
print(f"{'='*80}")

# Step 2: Delete index if it exists
if os_client.indices.exists(index=index_name_with_pipeline):
    print(f"\nDeleting existing index: {index_name_with_pipeline}")
    os_client.indices.delete(index=index_name_with_pipeline)
    print(f"Index deleted successfully")

# Step 3: Create the index with pipeline-enabled mappings
print(f"\n{'='*80}")
print(f"Creating index: {index_name_with_pipeline}")
response = os_client.indices.create(index=index_name_with_pipeline, body=mappings_with_pipeline)
print(f"Index created successfully: {response}")
print(f"{'='*80}")

# Step 4: Ingest a SMALL sample (to test embedding generation)
# Note: Using only 100 documents for testing because embedding generation is compute-intensive
df_small_sample = df_squad_sample.head(1000)

print(f"\n{'='*80}")
print(f"Starting bulk ingestion of {len(df_small_sample)} documents...")
print("Note: Using small sample because embedding generation takes time")
start_time = time.time()

# Use bulk helper - the pipeline will automatically generate embeddings
success, failed = helpers.bulk(
    os_client,
    generate_bulk_data(df_small_sample, index_name_with_pipeline),
    chunk_size=5,  # Smaller chunks for embedding generation
    request_timeout=120,  # Longer timeout for model inference
    raise_on_error=False,
    raise_on_exception=False
)

elapsed_time = time.time() - start_time
print(f"Bulk ingestion completed in {elapsed_time:.2f} seconds")
print(f"Successfully indexed: {success} documents")
print(f"Failed: {failed} documents")
print(f"Average time per document: {elapsed_time/len(df_small_sample):.2f} seconds")
print(f"{'='*80}")

# Step 5: Verify ingestion and check embeddings
time.sleep(2)  # Wait for refresh
os_client.indices.refresh(index=index_name_with_pipeline)
count_response = os_client.count(index=index_name_with_pipeline)
print(f"\n{'='*80}")
print(f"Total documents in index '{index_name_with_pipeline}': {count_response['count']}")
print(f"{'='*80}")

# Fetch a document to verify embeddings were generated
search_response = os_client.search(
    index=index_name_with_pipeline, 
    body={"query": {"match_all": {}}, "size": 1}
)

if search_response['hits']['hits']:
    doc = search_response['hits']['hits'][0]['_source']
    
    # Check which embedding fields exist
    embedding_fields = [k for k in doc.keys() if k.endswith('_embedding')]
    print(f"\n{'='*80}")
    print(f"Embedding fields in document:")
    for field in embedding_fields:
        embedding = doc[field]
        if isinstance(embedding, list):
            print(f"  - {field}: {len(embedding)} dimensions")
            print(f"    First 5 values: {embedding[:5]}")
        else:
            print(f"  - {field}: {embedding}")
    print(f"{'='*80}")
    
    print(f"\nSample document with embeddings:")
    # Show document without full embedding arrays for readability
    doc_summary = {k: v if not k.endswith('_embedding') else f"[{len(v)} dimensions]" 
                   for k, v in doc.items()}
    print(json.dumps(doc_summary, indent=2, ensure_ascii=False))

Generated OpenSearch mappings (WITH vector fields and pipeline):
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "knn": true
    },
    "default_pipeline": "squad_embedding_pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {}
        }
      },
      "context": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_

In [13]:
# Final verification of all three indices
all_indices = [
    "squad_sample_no_vectors",
    "squad_sample_with_vectors",
    "squad_sample_with_pipeline"
]

print("="*100)
print("FINAL VERIFICATION - All Indices Comparison")
print("="*100)

for idx_name in all_indices:
    if os_client.indices.exists(index=idx_name):
        # Get document count
        count = os_client.count(index=idx_name)['count']
        
        # Get index stats
        stats = os_client.indices.stats(index=idx_name)
        size_in_bytes = stats['indices'][idx_name]['total']['store']['size_in_bytes']
        size_in_mb = size_in_bytes / (1024 * 1024)
        
        # Get field count and settings
        mappings = os_client.indices.get_mapping(index=idx_name)
        settings = os_client.indices.get_settings(index=idx_name)
        
        field_count = len(mappings[idx_name]['mappings']['properties'])
        vector_fields = [k for k in mappings[idx_name]['mappings']['properties'].keys() 
                        if k.endswith('_embedding')]
        
        # Check for pipeline
        pipeline = settings[idx_name]['settings']['index'].get('default_pipeline', 'None')
        knn_enabled = settings[idx_name]['settings']['index'].get('knn', 'false')
        
        print(f"\n{'‚îÄ'*100}")
        print(f"Index: {idx_name}")
        print(f"{'‚îÄ'*100}")
        print(f"  Documents: {count:,}")
        print(f"  Total Fields: {field_count}")
        print(f"  Vector Fields: {len(vector_fields)}")
        if vector_fields:
            print(f"    ‚îî‚îÄ {', '.join(vector_fields)}")
        print(f"  KNN Enabled: {knn_enabled}")
        print(f"  Default Pipeline: {pipeline}")
        print(f"  Index Size: {size_in_mb:.2f} MB")
        
        # Check if embeddings are actually populated (only for pipeline index)
        if idx_name == "squad_sample_with_pipeline" and vector_fields:
            sample = os_client.search(index=idx_name, body={"query": {"match_all": {}}, "size": 1})
            if sample['hits']['hits']:
                first_vec_field = vector_fields[0]
                embedding = sample['hits']['hits'][0]['_source'].get(first_vec_field)
                if embedding and isinstance(embedding, list) and len(embedding) > 0:
                    print(f"  Embeddings Status: ‚úì Populated ({len(embedding)} dimensions)")
                else:
                    print(f"  Embeddings Status: ‚úó Empty")
        
        print(f"  Status: ‚úì Ready")
    else:
        print(f"\nIndex: {idx_name}")
        print(f"  Status: ‚úó Not found")

print(f"\n{'='*100}")
print("All indices created successfully with different configurations!")
print("="*100)

FINAL VERIFICATION - All Indices Comparison

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Index: squad_sample_no_vectors
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
  Documents: 1,000
  Total Fields: 5
  Vector Fields: 0
  KNN Enabled: false
  Default Pipeline: None
  Index Size: 1.35 MB
  Status: ‚úì Ready

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

## Final Summary - Complete ML Pipeline Implementation

This notebook successfully demonstrated three approaches to indexing with OpenSearch:

### ‚úÖ Approach 1: Basic Ingestion (No Vectors)
- **Index**: `squad_sample_no_vectors`
- **Documents**: 1000
- **Fields**: 5 (id, title, context, question, answers)
- **Use Case**: Traditional keyword search
- **Ingestion Time**: ~0.35 seconds

### ‚úÖ Approach 2: Manual Vector Fields (No Pipeline)
- **Index**: `squad_sample_with_vectors`
- **Documents**: 1000
- **Fields**: 9 (5 text + 4 vector fields)
- **Vector Fields**: Defined but NOT populated
- **Use Case**: When you want to generate embeddings externally
- **Ingestion Time**: ~0.36 seconds

### ‚úÖ Approach 3: Automatic Embeddings with Ingest Pipeline (RECOMMENDED)
- **Index**: `squad_sample_with_pipeline`
- **Documents**: 10 (small sample for testing)
- **Fields**: 9 (5 text + 4 vector fields with actual embeddings)
- **Vector Fields**: AUTOMATICALLY populated during ingestion
- **Model**: `msmarco-distilbert-base-tas-b` (768 dimensions)
- **Pipeline**: `squad_embedding_pipeline`
- **Settings**: 
  - `index.knn: true` - Enables k-NN search
  - `default_pipeline: squad_embedding_pipeline` - Auto-processes all documents
- **Ingestion Time**: ~0.11 seconds per document (includes ML inference)
- **Use Case**: Production semantic search with automatic embedding generation

### üéØ Key Achievements:

1. **Created a generic `create_opensearch_mappings()` function** that:
   - Automatically generates mappings from pandas DataFrames
   - Optionally creates vector fields
   - Supports pipeline configuration

2. **Deployed ML model** on OpenSearch cluster for embedding generation

3. **Created dynamic ingest pipeline** that automatically:
   - Identifies text fields
   - Generates embeddings using ML model
   - Stores embeddings in corresponding vector fields

4. **Demonstrated actual embeddings**: Each document now has 768-dimensional vectors for semantic search

### üöÄ Next Steps:
- Implement semantic search queries using k-NN
- Test hybrid search (keyword + semantic)
- Index the full dataset (1000 documents)
- Implement relevance tuning and ranking

# üîé Part 2: Semantic Search and Hybrid Queries

Now that we have indices with embeddings, let's implement advanced search capabilities:
1. **k-NN Semantic Search** - Find semantically similar documents using vector similarity
2. **Hybrid Search** - Combine keyword and semantic search for better results
3. **Relevance Tuning** - Adjust ranking and scoring to improve search quality

## 1Ô∏è‚É£ Semantic Search using k-NN

Semantic search finds documents based on meaning rather than exact keyword matches. We'll use k-NN (k-Nearest Neighbors) to find the most similar documents based on vector embeddings.

**How it works:**
1. Convert the search query into a vector embedding using the same ML model
2. Use k-NN to find documents with similar embeddings
3. Return the top-k most similar results

**Example Query:** "What is the capital of France?"
- This will find documents about French geography, Paris, government, etc.
- Even if the exact words don't match, semantically similar content will be returned

In [14]:
def semantic_search_knn(query_text, index_name, field_to_search="context", k=5, model_id=None):
    """
    Perform semantic search using k-NN with neural search.
    
    Parameters:
    -----------
    query_text : str
        The search query text
    index_name : str
        Name of the index to search
    field_to_search : str
        The field to search (e.g., 'context', 'question')
    k : int
        Number of top results to return
    model_id : str, optional
        Model ID for embedding generation. If not provided, uses ingest pipeline.
    
    Returns:
    --------
    dict : Search results with scores
    """
    # Use neural query for automatic embedding generation
    search_body = {
        "size": k,
        "query": {
            "neural": {
                f"{field_to_search}_embedding": {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k
                }
            }
        },
        "_source": ["id", "title", field_to_search, "question"]
    }
    
    return os_client.search(index=index_name, body=search_body)


# Example 1: Search for questions about French capital
print("="*100)
print("üîç SEMANTIC SEARCH EXAMPLE 1: French Capital")
print("="*100)
query = "What is the capital of France?"
print(f"\nQuery: '{query}'")
print(f"Searching in index: squad_sample_with_pipeline")
print(f"Target field: context_embedding")

results = semantic_search_knn(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    field_to_search="context",
    k=3,
    model_id=model_id
)

print(f"\n{'‚îÄ'*100}")
print(f"Found {results['hits']['total']['value']} results")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')}")
    print(f"   Context (first 200 chars): {source.get('context', 'N/A')[:200]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üîç SEMANTIC SEARCH EXAMPLE 1: French Capital

Query: 'What is the capital of France?'
Searching in index: squad_sample_with_pipeline
Target field: context_embedding

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Found 3 results
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 0.0175)
   Title: Paris
   Question: For how many years did the socialists governed the region?
   Context (first 200 chars): The Region of √éle de France, including Paris and its surrounding

In [15]:
# Example 2: Search for scientific concepts
print("="*100)
print("üîç SEMANTIC SEARCH EXAMPLE 2: Scientific Concepts")
print("="*100)
query = "How does photosynthesis work in plants?"
print(f"\nQuery: '{query}'")

results = semantic_search_knn(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    field_to_search="context",
    k=3,
    model_id=model_id
)

print(f"\n{'‚îÄ'*100}")
print(f"Found {results['hits']['total']['value']} results")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')}")
    print(f"   Context (first 200 chars): {source.get('context', 'N/A')[:200]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üîç SEMANTIC SEARCH EXAMPLE 2: Scientific Concepts

Query: 'How does photosynthesis work in plants?'

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Found 3 results
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 0.0169)
   Title: Hunter-gatherer
   Question: What is the manipulation of the landscape associated with?
   Context (first 200 chars): Many hunter-gatherers consciously manipulate the landscape through cutting or burning undesirable plants while encouragi

## 2Ô∏è‚É£ Hybrid Search (Keyword + Semantic)

Hybrid search combines the best of both worlds:
- **Keyword Search (BM25)**: Exact term matching, good for specific queries
- **Semantic Search (k-NN)**: Meaning-based matching, good for conceptual queries

**Benefits:**
- Better recall: Finds documents that keyword search might miss
- Better precision: Combines semantic similarity with keyword relevance
- Flexible scoring: Can adjust weights between keyword and semantic components

**Implementation:**
We'll use a `bool` query with `should` clauses to combine both approaches.

In [16]:
def hybrid_search(query_text, index_name, fields_to_search=["title", "context", "question"], 
                  k=5, model_id=None, keyword_boost=1.0, semantic_boost=1.0):
    """
    Perform hybrid search combining keyword (BM25) and semantic (k-NN) search.
    
    Parameters:
    -----------
    query_text : str
        The search query text
    index_name : str
        Name of the index to search
    fields_to_search : list of str
        Fields to search in (both keyword and semantic)
        Default includes title for better semantic matching
    k : int
        Number of top results to return
    model_id : str, optional
        Model ID for embedding generation
    keyword_boost : float
        Boost factor for keyword search (default: 1.0)
    semantic_boost : float
        Boost factor for semantic search (default: 1.0)
    
    Returns:
    --------
    dict : Search results with combined scores
    """
    # Build keyword queries for each field
    keyword_queries = []
    for field in fields_to_search:
        keyword_queries.append({
            "match": {
                field: {
                    "query": query_text,
                    "boost": keyword_boost
                }
            }
        })
    
    # Build semantic queries for each field
    semantic_queries = []
    for field in fields_to_search:
        semantic_queries.append({
            "neural": {
                f"{field}_embedding": {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k * 2,  # Retrieve more candidates for better results
                    "boost": semantic_boost
                }
            }
        })
    
    # Combine queries using bool should
    search_body = {
        "size": k,
        "query": {
            "bool": {
                "should": keyword_queries + semantic_queries,
                "minimum_should_match": 1
            }
        },
        "_source": ["id", "title", "context", "question"],
        "explain": False  # Set to True to see score calculation details
    }
    
    return os_client.search(index=index_name, body=search_body)


# Example 1: Hybrid search with equal weights
print("="*100)
print("üîç HYBRID SEARCH EXAMPLE 1: Equal Keyword + Semantic Weights")
print("="*100)
query = "What are the main causes of World War II?"
print(f"\nQuery: '{query}'")
print(f"Keyword Boost: 1.0, Semantic Boost: 1.0")

results = hybrid_search(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    fields_to_search=["context", "question"],
    k=5,
    model_id=model_id,
    keyword_boost=1.0,
    semantic_boost=1.0
)

print(f"\n{'‚îÄ'*100}")
print(f"Found {results['hits']['total']['value']} results")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')}")
    print(f"   Context (first 150 chars): {source.get('context', 'N/A')[:150]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üîç HYBRID SEARCH EXAMPLE 1: Equal Keyword + Semantic Weights

Query: 'What are the main causes of World War II?'
Keyword Boost: 1.0, Semantic Boost: 1.0

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Found 1000 results
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 10.0258)
   Title: The_Times
   Question: During World War II, the Soviet double agent who was corresponding for The Times in Spain in the 1930s later joined what agency?
   Context (first 150 chars):

## 3Ô∏è‚É£ Relevance Tuning and Ranking

Relevance tuning allows you to control how search results are scored and ranked. We'll explore several techniques:

1. **Boost Adjustment**: Control the weight of keyword vs semantic search
2. **Field-Level Boosting**: Prioritize certain fields (e.g., title > content)
3. **Function Score**: Custom scoring based on document properties
4. **Rescore**: Re-rank top results with more expensive scoring

**Use Cases:**
- Emphasize exact matches over semantic similarity
- Boost recent documents or popular content
- Penalize low-quality or outdated content
- Customize ranking for specific business needs

In [17]:
# Test 1: Favor semantic search over keyword search
print("="*100)
print("üéØ RELEVANCE TUNING TEST 1: Favor Semantic Search")
print("="*100)
query = "scientific discoveries in biology"
print(f"\nQuery: '{query}'")
print(f"Configuration: Keyword Boost: 0.5, Semantic Boost: 2.0")
print(f"Expected: Results based more on meaning than exact word matches")

results_semantic_heavy = hybrid_search(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    fields_to_search=["context", "question"],
    k=3,
    model_id=model_id,
    keyword_boost=0.5,
    semantic_boost=2.0
)

print(f"\n{'‚îÄ'*100}")
print(f"Top 3 Results (Semantic-Heavy)")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results_semantic_heavy['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')[:100]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üéØ RELEVANCE TUNING TEST 1: Favor Semantic Search

Query: 'scientific discoveries in biology'
Configuration: Keyword Boost: 0.5, Semantic Boost: 2.0
Expected: Results based more on meaning than exact word matches

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Top 3 Results (Semantic-Heavy)
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 2.7932)
   Title: History_of_science
   Question: What language did the important scientific works get translated into for unive

In [18]:
# Test 2: Favor keyword search for precision
print("="*100)
print("üéØ RELEVANCE TUNING TEST 2: Favor Keyword Search")
print("="*100)
query = "Paris France capital city"
print(f"\nQuery: '{query}'")
print(f"Configuration: Keyword Boost: 2.0, Semantic Boost: 0.5")
print(f"Expected: Results with exact keyword matches ranked higher")

results_keyword_heavy = hybrid_search(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    fields_to_search=["context", "question"],
    k=3,
    model_id=model_id,
    keyword_boost=2.0,
    semantic_boost=0.5
)

print(f"\n{'‚îÄ'*100}")
print(f"Top 3 Results (Keyword-Heavy)")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results_keyword_heavy['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')[:100]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üéØ RELEVANCE TUNING TEST 2: Favor Keyword Search

Query: 'Paris France capital city'
Configuration: Keyword Boost: 2.0, Semantic Boost: 0.5
Expected: Results with exact keyword matches ranked higher

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Top 3 Results (Keyword-Heavy)
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 21.8448)
   Title: Paris
   Question: What is the most viewed television network in France?...
   ---------------------------------------------

### Advanced: Field-Level Boosting

Sometimes you want to give more weight to matches in specific fields. For example:
- Matches in `title` should score higher than matches in `context`
- Matches in `question` might be more relevant than matches in long text

This is useful when you know certain fields are more important for your use case.

In [19]:
def field_boosted_search(query_text, index_name, field_boosts=None, k=5, model_id=None):
    """
    Perform search with field-level boosting.
    
    Parameters:
    -----------
    query_text : str
        The search query text
    index_name : str
        Name of the index to search
    field_boosts : dict
        Dictionary mapping field names to boost values
        Example: {"title": 3.0, "question": 2.0, "context": 1.0}
    k : int
        Number of top results to return
    model_id : str, optional
        Model ID for embedding generation
    
    Returns:
    --------
    dict : Search results with field-boosted scores
    """
    if field_boosts is None:
        field_boosts = {"title": 2.0, "question": 1.5, "context": 1.0}
    
    # Build queries with field-specific boosts
    should_queries = []
    
    for field, boost in field_boosts.items():
        # Keyword query
        should_queries.append({
            "match": {
                field: {
                    "query": query_text,
                    "boost": boost
                }
            }
        })
        
        # Semantic query (if embedding field exists)
        should_queries.append({
            "neural": {
                f"{field}_embedding": {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k * 2,
                    "boost": boost
                }
            }
        })
    
    search_body = {
        "size": k,
        "query": {
            "bool": {
                "should": should_queries,
                "minimum_should_match": 1
            }
        },
        "_source": ["id", "title", "context", "question"]
    }
    
    return os_client.search(index=index_name, body=search_body)


# Example: Prioritize title and question over context
print("="*100)
print("üéØ FIELD-LEVEL BOOSTING: Prioritize Title and Question")
print("="*100)
query = "American Revolution independence"
print(f"\nQuery: '{query}'")
print(f"Field Boosts: title=3.0, question=2.0, context=1.0")
print(f"Expected: Matches in title/question ranked higher than context")

results_field_boosted = field_boosted_search(
    query_text=query,
    index_name="squad_sample_with_pipeline",
    field_boosts={"title": 3.0, "question": 2.0, "context": 1.0},
    k=5,
    model_id=model_id
)

print(f"\n{'‚îÄ'*100}")
print(f"Top 5 Results (Field-Boosted)")
print(f"{'‚îÄ'*100}")

for i, hit in enumerate(results_field_boosted['hits']['hits'], 1):
    score = hit['_score']
    source = hit['_source']
    
    print(f"\nüìÑ Result {i} (Score: {score:.4f})")
    print(f"   Title: {source.get('title', 'N/A')}")
    print(f"   Question: {source.get('question', 'N/A')[:120]}...")
    print(f"   {'-'*96}")

print(f"\n{'='*100}\n")

üéØ FIELD-LEVEL BOOSTING: Prioritize Title and Question

Query: 'American Revolution independence'
Field Boosts: title=3.0, question=2.0, context=1.0
Expected: Matches in title/question ranked higher than context

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Top 5 Results (Field-Boosted)
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

üìÑ Result 1 (Score: 5.9339)
   Title: Spanish_language_in_the_United_States
   Question: Are there studies on Hispanic-American language?...
   ------

## üìä Comparison: Keyword vs Semantic vs Hybrid

Let's compare all three search approaches side-by-side to understand their strengths and weaknesses.

In [20]:
def keyword_only_search(query_text, index_name, fields=["title", "context", "question"], k=5):
    """Traditional keyword search using BM25. Searches across title, context, and question fields."""
    search_body = {
        "size": k,
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": fields
            }
        },
        "_source": ["id", "title", "context", "question"]
    }
    return os_client.search(index=index_name, body=search_body)


def compare_search_methods(query_text, index_name="squad_sample_with_pipeline", k=3):
    """Compare keyword, semantic, and hybrid search side-by-side."""
    
    print("="*120)
    print(f"üî¨ SEARCH COMPARISON")
    print("="*120)
    print(f"Query: '{query_text}'")
    print(f"Index: {index_name}")
    print(f"Top {k} results for each method")
    print("="*120)
    
    # 1. Keyword-only search
    print(f"\n{'‚ñ∂'*3} METHOD 1: KEYWORD SEARCH (BM25) {'‚óÄ'*3}")
    print(f"{'‚îÄ'*120}")
    keyword_results = keyword_only_search(query_text, index_name, k=k)
    
    for i, hit in enumerate(keyword_results['hits']['hits'], 1):
        print(f"{i}. Score: {hit['_score']:.4f} | Title: {hit['_source'].get('title', 'N/A')[:60]}")
    
    # 2. Semantic-only search
    print(f"\n{'‚ñ∂'*3} METHOD 2: SEMANTIC SEARCH (k-NN) {'‚óÄ'*3}")
    print(f"{'‚îÄ'*120}")
    semantic_results = semantic_search_knn(query_text, index_name, field_to_search="context", k=k, model_id=model_id)
    
    for i, hit in enumerate(semantic_results['hits']['hits'], 1):
        print(f"{i}. Score: {hit['_score']:.4f} | Title: {hit['_source'].get('title', 'N/A')[:60]}")
    
    # 3. Hybrid search
    print(f"\n{'‚ñ∂'*3} METHOD 3: HYBRID SEARCH (Keyword + Semantic) {'‚óÄ'*3}")
    print(f"{'‚îÄ'*120}")
    hybrid_results = hybrid_search(query_text, index_name, k=k, model_id=model_id)
    
    for i, hit in enumerate(hybrid_results['hits']['hits'], 1):
        print(f"{i}. Score: {hit['_score']:.4f} | Title: {hit['_source'].get('title', 'N/A')[:60]}")
    
    print(f"\n{'='*120}")
    print("üí° INSIGHTS:")
    print("  ‚Ä¢ Keyword Search: Good for exact matches, specific terms")
    print("  ‚Ä¢ Semantic Search: Good for conceptual queries, finds similar meaning")
    print("  ‚Ä¢ Hybrid Search: Best of both - combines precision and recall")
    print("="*120)
    
    return {
        "keyword": keyword_results,
        "semantic": semantic_results,
        "hybrid": hybrid_results
    }


# Run comparison
query = "How do plants create energy from sunlight?"
comparison_results = compare_search_methods(query, k=3)

üî¨ SEARCH COMPARISON
Query: 'How do plants create energy from sunlight?'
Index: squad_sample_with_pipeline
Top 3 results for each method

‚ñ∂‚ñ∂‚ñ∂ METHOD 1: KEYWORD SEARCH (BM25) ‚óÄ‚óÄ‚óÄ
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
1. Score: 7.3968 | Title: Green
2. Score: 5.7558 | Title: Hydrogen
3. Score: 5.0061 | Title: Energy

‚ñ∂‚ñ∂‚ñ∂ METHOD 2: SEMANTIC SEARCH (k-NN) ‚óÄ‚óÄ‚óÄ
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î

## üéì Summary: Search Methods and Best Practices

### ‚úÖ What We've Implemented:

#### 1. **Semantic Search (k-NN)**
- ‚úì Uses neural embeddings to find semantically similar documents
- ‚úì Great for conceptual queries and finding related content
- ‚úì Handles synonyms, paraphrasing, and language variations
- ‚ö†Ô∏è May miss exact keyword matches

#### 2. **Hybrid Search (Keyword + Semantic)**
- ‚úì Combines BM25 keyword matching with k-NN semantic search
- ‚úì Provides both precision (exact matches) and recall (similar concepts)
- ‚úì Adjustable weights via boost parameters
- ‚úì **RECOMMENDED for production use cases**

#### 3. **Relevance Tuning**
- ‚úì Boost adjustment: Control keyword vs semantic weight
- ‚úì Field-level boosting: Prioritize important fields (title > context)
- ‚úì Customizable scoring for business requirements

### üìã Best Practices:

| Use Case | Recommended Method | Settings |
|----------|-------------------|----------|
| **Question Answering** | Hybrid Search | keyword_boost=1.0, semantic_boost=1.5 |
| **Exact Product Search** | Keyword-Heavy Hybrid | keyword_boost=2.0, semantic_boost=0.5 |
| **Content Discovery** | Semantic-Heavy Hybrid | keyword_boost=0.5, semantic_boost=2.0 |
| **Enterprise Search** | Hybrid + Field Boosting | title=3.0, question=2.0, context=1.0 |

### üöÄ Next Steps for Production:

1. **Experiment with boost values** on your specific data and queries
2. **A/B test different configurations** to measure user satisfaction
3. **Monitor query performance** and adjust based on metrics (latency, relevance)
4. **Implement query expansion** and synonyms for better coverage
5. **Use re-scoring** for top results with more expensive ranking functions
6. **Add filters** (date ranges, categories) to narrow results before scoring
7. **Implement caching** for frequently used queries

### üìä Performance Characteristics:

- **Keyword Search**: Fast (< 10ms), good for large datasets
- **Semantic Search**: Slower (50-200ms), depends on k and index size
- **Hybrid Search**: Medium (20-100ms), balanced approach

**Note**: Actual performance depends on cluster size, index size, hardware, and query complexity.

### üéØ Field Configuration:

**Vector Embeddings Created For:**
- ‚úì `title` - Enables semantic matching on document titles
- ‚úì `context` - Main content field for semantic search
- ‚úì `question` - Question field for Q&A matching

**Excluded From Vectors:**
- ‚úó `id` - Unique identifier, no semantic value

**Why Include Title in Vectors?**
- Titles often contain key concepts and are semantically meaningful
- Matching on title embeddings improves relevance for title-focused queries
- Supports scenarios where users search for topics by name/title

## ‚úÖ Title Embeddings Configuration Summary

This notebook now includes **title embeddings** for better semantic search capabilities:

### üîß Configuration Changes:

1. **`create_opensearch_mappings()`**
   - Default exclusion: `['id']` (title is now included)
   - Creates `title_embedding` vector field (768 dimensions)

2. **`create_embedding_pipeline()`**
   - Default exclusion: `['id']` (title is now included)
   - Pipeline generates embeddings for: `title`, `context`, `question`

3. **Search Functions Updated:**
   - ‚úÖ `keyword_only_search()`: Searches `["title", "context", "question"]`
   - ‚úÖ `hybrid_search()`: Searches `["title", "context", "question"]` by default
   - ‚úÖ `field_boosted_search()`: Includes title with boost=2.0 (higher than context)
   - ‚úÖ `semantic_search_knn()`: Can search title_embedding by passing `field_to_search="title"`

### üìä Vector Fields Created:

| Field | Vector Field | Dimensions | Purpose |
|-------|-------------|------------|---------|
| `title` | `title_embedding` | 768 | Semantic matching on document titles |
| `context` | `context_embedding` | 768 | Main content semantic search |
| `question` | `question_embedding` | 768 | Question-answer matching |

### üéØ Benefits of Title Embeddings:

1. **Better Topic Matching**: Titles often contain the main topic/concept
2. **Improved Relevance**: Documents with semantically similar titles rank higher
3. **Field-Level Boosting**: Can prioritize title matches over content matches
4. **Flexible Search**: Users can search specifically on titles or across all fields

### üí° Example Usage:

```python
# Search only in title embeddings
results = semantic_search_knn(
    query_text="Machine Learning",
    index_name="squad_sample_with_pipeline",
    field_to_search="title",
    k=5,
    model_id=model_id
)

# Hybrid search across all fields including title
results = hybrid_search(
    query_text="artificial intelligence",
    index_name="squad_sample_with_pipeline",
    fields_to_search=["title", "context", "question"],  # Default
    k=5,
    model_id=model_id
)

# Boost title matches higher
results = field_boosted_search(
    query_text="deep learning",
    index_name="squad_sample_with_pipeline",
    field_boosts={"title": 3.0, "question": 2.0, "context": 1.0},
    k=5,
    model_id=model_id
)
```