```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#8E44AD','primaryTextColor':'#fff','primaryBorderColor':'#6C3483','lineColor':'#E74C3C','secondaryColor':'#3498DB','tertiaryColor':'#F39C12'}}}%%
graph TB
    subgraph "VectorDBTool Workflow - Dense Vector Retrieval"
        A[üéì Start: Student Question] -->|Natural Language| B[üî¢ Text Embedding Model]
        B -->|Generate Query Vector| C[üìä VectorDBTool]
        C -->|k-NN Search| D[üóÑÔ∏è Vector Index]
        D -->|Top-k Similar Documents| E[üìù Ranked Results]
        E --> F[‚ú® Semantic Matches]
        
        style A fill:#8E44AD,stroke:#6C3483,stroke-width:3px,color:#fff
        style B fill:#E74C3C,stroke:#C0392B,stroke-width:3px,color:#fff
        style C fill:#3498DB,stroke:#2874A6,stroke-width:3px,color:#fff
        style D fill:#27AE60,stroke:#1E8449,stroke-width:3px,color:#fff
        style E fill:#F39C12,stroke:#D68910,stroke-width:3px,color:#fff
        style F fill:#16A085,stroke:#117A65,stroke-width:3px,color:#fff
    end
    
    subgraph "Key Concepts"
        G[üß† Semantic Understanding]
        H[üìê Cosine Similarity]
        I[‚ö° Fast HNSW Search]
        J[üéØ Relevance Ranking]
        
        style G fill:#9B59B6,stroke:#7D3C98,stroke-width:2px,color:#fff
        style H fill:#9B59B6,stroke:#7D3C98,stroke-width:2px,color:#fff
        style I fill:#9B59B6,stroke:#7D3C98,stroke-width:2px,color:#fff
        style J fill:#9B59B6,stroke:#7D3C98,stroke-width:2px,color:#fff
    end
```

# 🧠 VectorDBTool - Dense Vector Semantic Search

## 🎯 Learning Objectives

In this notebook, you will learn:
- ✅ How to use **VectorDBTool** for semantic search using dense vectors
- ✅ How to set up text embedding models for vectorization
- ✅ How to create k-NN vector indices for fast similarity search
- ✅ How to build agents that understand meaning, not just keywords

## 📖 What is VectorDBTool?

**VectorDBTool** performs dense vector retrieval, enabling semantic search that understands the **meaning** of queries rather than just matching keywords.

### How It Works:
1. **Text → Vector**: A text embedding model converts text into numerical vectors (e.g., 384 dimensions)
2. **Vector Storage**: Vectors are stored in a k-NN index for efficient similarity search
3. **Query Processing**: The user query is converted to a vector and compared to the stored vectors
4. **Similarity Ranking**: Documents are ranked by cosine similarity (or another metric)
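
The four steps above can be sketched in plain Python. This is only an illustration: the 3-dimensional vectors and document IDs are made up, whereas a real embedding model would produce vectors with hundreds of dimensions, and a k-NN index would avoid comparing the query against every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings" (assumption: not produced by any real model)
doc_vecs = {
    "seattle_growth": [0.9, 0.1, 0.0],
    "sf_tech": [0.1, 0.9, 0.2],
    "recipe": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

ranked = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

The brute-force scan is exact but linear in the number of documents; an HNSW index trades a little accuracy for much faster lookups on large collections.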

### Why Use Semantic Search?
- 🎯 Finds conceptually similar content, not just exact matches
- 🌍 Works across synonyms and paraphrases (and across languages when the embedding model is multilingual)
- 🧠 Understands context and intent
- 📈 Better results for natural language queries

---

In [1]:
# Import required libraries
import sys
import json
import time
sys.path.append('..')

from agent_helpers import (
    get_os_client,
    configure_cluster_for_openai,
    wait_for_model_deployment
)

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## 🔧 Step 1: Initialize OpenSearch Client

In [2]:
# Create OpenSearch client
client = get_os_client()
print("✅ OpenSearch client initialized")

# Configure cluster
configure_cluster_for_openai(client)
print("✅ Cluster configured")

✅ OpenSearch client initialized
   Configuring cluster settings for OpenAI connector...
   ✓ Cluster settings configured successfully
✅ Cluster configured


## 🤖 Step 2: Register and Deploy Text Embedding Model

We'll use the pretrained `all-MiniLM-L12-v2` sentence-transformers model from Hugging Face, which:
- Produces 384-dimensional dense vectors
- Works well for general text understanding
- Is compact and fast

In [3]:
print("üî¢ Registering text embedding model...\n")

# Register HuggingFace model for text embeddings
embedding_model_body = {
    "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
    "version": "1.0.2",
    "model_format": "TORCH_SCRIPT"
}

response = client.transport.perform_request(
    'POST',
    '/_plugins/_ml/models/_register?deploy=true',
    body=embedding_model_body
)

embedding_task_id = response['task_id']
print(f"üìù Registration Task ID: {embedding_task_id}")

# Wait for registration to complete
print("\n‚è≥ Waiting for model registration...")
while True:
    task_response = client.transport.perform_request(
        'GET',
        f'/_plugins/_ml/tasks/{embedding_task_id}'
    )
    state = task_response['state']
    print(f"   Status: {state}")
    
    if state == 'COMPLETED':
        embedding_model_id = task_response['model_id']
        print(f"\n‚úÖ Embedding model deployed: {embedding_model_id}")
        break
    elif state == 'FAILED':
        print("‚ùå Model registration failed")
        break
    
    time.sleep(10)

🔢 Registering text embedding model...

📝 Registration Task ID: 51tliZsBLQ1mV2UNIyhR

⏳ Waiting for model registration...
   Status: CREATED
   Status: CREATED
   Status: COMPLETED

✅ Embedding model deployed: 6FtliZsBLQ1mV2UNJiis
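
A polling loop like the one above never gives up if the task stalls. The sketch below wraps the same idea in a reusable function with a timeout. The name `wait_for_task`, its parameters, and the `fetch_task` callable are illustrative assumptions, not part of `agent_helpers` or the OpenSearch client.

```python
import time

def wait_for_task(fetch_task, timeout_s=600, poll_interval_s=10, sleep=time.sleep):
    """Poll fetch_task() until the ML task completes, fails, or times out.

    fetch_task is any zero-argument callable that returns the task
    document as a dict with a 'state' key (hypothetical helper design).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_task()
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":
            raise RuntimeError(f"ML task failed: {task.get('error', 'unknown error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ML task still in state {state!r} after {timeout_s}s")
        sleep(poll_interval_s)

# Usage with the notebook's client (not executed here):
# embedding_model_id = wait_for_task(
#     lambda: client.transport.perform_request(
#         'GET', f'/_plugins/_ml/tasks/{embedding_task_id}'
#     )
# )
```

Passing the fetch step in as a callable keeps the function independent of the client, which also makes it easy to exercise with canned task states.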


## 📚 Step 3: Create Vector Index with Sample Data

Let's create a knowledge base about major US cities with population data.

In [4]:
# Create ingest pipeline for automatic vectorization
pipeline_name = "text_embedding_pipeline"

print(f"üîß Creating ingest pipeline: {pipeline_name}")

pipeline_body = {
    "description": "Text embedding pipeline for semantic search",
    "processors": [
        {
            "text_embedding": {
                "model_id": embedding_model_id,
                "field_map": {
                    "text": "text_embedding"
                }
            }
        }
    ]
}

client.ingest.put_pipeline(id=pipeline_name, body=pipeline_body)
print("✅ Pipeline created")

🔧 Creating ingest pipeline: text_embedding_pipeline
✅ Pipeline created
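
Before indexing real data, it can help to confirm that the pipeline actually writes a vector of the expected size. The sketch below uses the ingest simulate API, which runs a document through the pipeline without storing it. The helper names and the `expected_dim` parameter are assumptions for illustration.

```python
def embedding_length_from_simulation(simulate_response, field="text_embedding"):
    """Return the length of the vector the pipeline wrote into the first simulated doc."""
    source = simulate_response["docs"][0]["doc"]["_source"]
    return len(source[field])

def check_pipeline(client, pipeline_name, sample_text, expected_dim=384):
    """Run one document through the pipeline without indexing it and check the vector size."""
    response = client.ingest.simulate(
        id=pipeline_name,
        body={"docs": [{"_source": {"text": sample_text}}]},
    )
    dim = embedding_length_from_simulation(response)
    if dim != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {dim}")
    return dim

# Usage with the notebook's objects (not executed here):
# check_pipeline(client, pipeline_name, "Seattle population growth")
```

A mismatch here would otherwise surface later as an indexing error, because the `knn_vector` field is declared with a fixed dimension.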


In [5]:
# Create k-NN vector index
index_name = "city_population_vectors"

print(f"\nüìä Creating vector index: {index_name}")

# Delete if exists
if client.indices.exists(index=index_name):
    client.indices.delete(index=index_name)

index_body = {
    "settings": {
        "index": {
            "knn": True,
            "default_pipeline": pipeline_name
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text"
            },
            "text_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene"
                }
            }
        }
    }
}

client.indices.create(index=index_name, body=index_body)
print("✅ Vector index created")


📊 Creating vector index: city_population_vectors
✅ Vector index created


In [6]:
# Index sample documents about city populations
print("\nüìù Indexing sample documents...")

documents = [
    {
        "text": "Chart and table of population level and growth rate for the Seattle metro area from 1950 to 2023. The current metro area population of Seattle in 2023 is 3,519,000, a 0.86% increase from 2022. The metro area population of Seattle in 2022 was 3,489,000, a 0.81% increase from 2021."
    },
    {
        "text": "Chart and table of population level and growth rate for the New York City metro area from 1950 to 2023. The current metro area population of New York City in 2023 is 18,937,000, a 0.37% increase from 2022. The metro area population of New York City in 2022 was 18,867,000."
    },
    {
        "text": "Chart and table of population level and growth rate for the Austin metro area from 1950 to 2023. The current metro area population of Austin in 2023 is 2,228,000, a 2.39% increase from 2022. Austin has experienced rapid growth in recent years."
    },
    {
        "text": "Chart and table of population level and growth rate for the Chicago metro area from 1950 to 2023. The current metro area population of Chicago in 2023 is 8,937,000, a 0.4% increase from 2022. Chicago is one of the largest cities in the United States."
    },
    {
        "text": "Chart and table of population level and growth rate for the Miami metro area from 1950 to 2023. The current metro area population of Miami in 2023 is 6,265,000, a 0.8% increase from 2022. Miami continues to attract residents from across the country."
    },
    {
        "text": "San Francisco Bay Area demographics and population trends. The San Francisco-Oakland-Berkeley metro area has a population of approximately 4.7 million people. Known for its tech industry and cultural diversity."
    }
]

for i, doc in enumerate(documents, 1):
    client.index(index=index_name, id=str(i), body=doc)

client.indices.refresh(index=index_name)
print(f"✅ Indexed {len(documents)} documents")


📝 Indexing sample documents...
✅ Indexed 6 documents
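
The loop above sends one request per document, which is fine for six documents. For larger datasets, a bulk request is the usual alternative. Here is a minimal sketch using `opensearchpy.helpers.bulk`; the function names `build_bulk_actions` and `bulk_index` are hypothetical, and the index's `default_pipeline` is assumed to apply to bulk requests just as it does to single-document indexing.

```python
def build_bulk_actions(index_name, documents):
    """Yield one bulk 'index' action per document, with string IDs starting at 1."""
    for i, doc in enumerate(documents, 1):
        yield {"_index": index_name, "_id": str(i), "_source": doc}

def bulk_index(client, index_name, documents):
    """Index all documents through the bulk helper; returns the number of successes."""
    # Imported here so the action builder above works without opensearch-py installed
    from opensearchpy import helpers
    success_count, _errors = helpers.bulk(
        client, build_bulk_actions(index_name, documents)
    )
    return success_count

# Usage with the notebook's objects (not executed here):
# bulk_index(client, index_name, documents)
# client.indices.refresh(index=index_name)
```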


## 🤖 Step 4: Create Flow Agent with VectorDBTool

In [7]:
print("\n🤖 Creating flow agent with VectorDBTool...")

agent_body = {
    "name": "Vector_Search_Agent",
    "type": "flow",
    "description": "Agent for semantic search using dense vectors",
    "tools": [
        {
            "type": "VectorDBTool",
            "parameters": {
                "model_id": embedding_model_id,
                "index": index_name,
                "embedding_field": "text_embedding",
                "source_field": ["text"],
                "input": "${parameters.question}",
                "doc_size": 3,
                "k": 10
            }
        }
    ]
}

response = client.transport.perform_request(
    'POST',
    '/_plugins/_ml/agents/_register',
    body=agent_body
)

agent_id = response['agent_id']
print(f"✅ Agent created: {agent_id}")


🤖 Creating flow agent with VectorDBTool...
✅ Agent created: 8VtliZsBLQ1mV2UN1yhD
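
To see what the tool parameters correspond to, it can be useful to run a comparable search without the agent. The sketch below builds a `neural` query body from the same settings. Treat the mapping as an assumption: the exact request VectorDBTool issues internally is not shown in this notebook, and `build_neural_query` is a hypothetical helper.

```python
def build_neural_query(question, model_id, embedding_field="text_embedding",
                       source_fields=("text",), doc_size=3, k=10):
    """Build a search body that embeds `question` with `model_id` and runs k-NN."""
    return {
        "size": doc_size,                 # number of hits returned (doc_size)
        "_source": list(source_fields),   # fields to return (source_field)
        "query": {
            "neural": {
                embedding_field: {
                    "query_text": question,
                    "model_id": model_id,
                    "k": k,               # nearest neighbors to consider
                }
            }
        },
    }

# Usage with the notebook's objects (not executed here):
# body = build_neural_query("Tell me about New York demographics", embedding_model_id)
# hits = client.search(index=index_name, body=body)["hits"]["hits"]
```

Comparing these hits with the agent's output is a quick way to check that the agent is wired to the intended index and embedding field.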


## 🧪 Step 5: Test Semantic Search

Let's test various queries to see how semantic search understands meaning!

In [8]:
# Test queries that demonstrate semantic understanding
test_queries = [
    "What's the population increase of Seattle?",
    "Which city has the fastest growth?",
    "Tell me about New York demographics",
    "Information about tech cities",
    "Cities with high population growth"
]

for i, query in enumerate(test_queries, 1):
    print(f"\n{'='*70}")
    print(f"üîç Query {i}: {query}")
    print('='*70)
    
    response = client.transport.perform_request(
        'POST',
        f'/_plugins/_ml/agents/{agent_id}/_execute',
        body={"parameters": {"question": query}}
    )
    
    # Parse and display results
    for result in response.get('inference_results', []):
        for output in result.get('output', []):
            results = output.get('result')
            if not isinstance(results, str):
                continue
            # The tool returns one JSON document per line; drop blank lines
            result_lines = [line for line in results.strip().split('\n') if line.strip()]
            print(f"\n📊 Found {len(result_lines)} relevant documents:\n")

            for j, line in enumerate(result_lines[:3], 1):
                try:
                    doc = json.loads(line)
                    score = doc.get('_score')
                    text = doc.get('_source', {}).get('text', '')[:150]
                    # Only apply float formatting when a numeric score is present
                    score_text = f"{score:.4f}" if isinstance(score, (int, float)) else "N/A"
                    print(f"   {j}. Score: {score_text}")
                    print(f"      {text}...\n")
                except (json.JSONDecodeError, AttributeError):
                    # Not a JSON object: show the raw line instead
                    print(f"   {j}. {line[:150]}...\n")


🔍 Query 1: What's the population increase of Seattle?

📊 Found 3 relevant documents:

   1. Score: 0.8517
      Chart and table of population level and growth rate for the Seattle metro area from 1950 to 2023. The current metro area population of Seattle in 2023...

   2. Score: 0.7223
      San Francisco Bay Area demographics and population trends. The San Francisco-Oakland-Berkeley metro area has a population of approximately 4.7 million...

   3. Score: 0.7035
      Chart and table of population level and growth rate for the New York City metro area from 1950 to 2023. The current metro area population of New York ...


🔍 Query 2: Which city has the fastest growth?

📊 Found 3 relevant documents:

   1. Score: 0.6792
      Chart and table of population level and growth rate for the Chicago metro area from 1950 to 2023. The current metro area population of Chicago in 2023...

   2. Score: 0.6779
      Chart and table of population level and growth rate for the New York Cit

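## 💡 Step 6: Understanding Semantic Search

### Why These Results Matter:

1. **"What's the population increase of Seattle?"**
   - The Seattle document ranks first (score 0.8517), well ahead of San Francisco (0.7223) and New York City (0.7035)
   - The query and the document share the city name and the vocabulary of "increase" and "growth rate", so their embeddings land close together

2. **"Which city has the fastest growth?"**
   - Matches growth-related content, but the top hit is Chicago (0.6792), not Austin, which has the highest growth rate in the data (2.39%)
   - Vector retrieval ranks by topical similarity and does not compare the numbers inside the documents, so answering "fastest" needs a further reasoning step over the retrieved text

3. **"Information about tech cities"**
   - The San Francisco document mentions its "tech industry", so it is the expected top match
   - No other document mentions technology, so any further hits reflect general similarity between city descriptions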
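
Retrieval returns text, so a question like "Which city has the fastest growth?" still needs a comparison over the numbers inside that text. The sketch below does this with a regular expression tailored to the wording of the sample documents. The pattern and the function name `extract_growth` are assumptions for illustration; in practice an LLM step (for example via RAGTool) would usually take this role.

```python
import re

# Matches e.g. "population of Austin in 2023 is 2,228,000, a 2.39% increase"
GROWTH_PATTERN = re.compile(
    r"population of (?P<city>[A-Z][\w ]*?) in 2023 is [\d,]+, "
    r"a (?P<rate>\d+(?:\.\d+)?)% increase"
)

def extract_growth(text):
    """Return (city, growth_rate_percent) for a document, or None if no match."""
    match = GROWTH_PATTERN.search(text)
    if match is None:
        return None
    return match.group("city"), float(match.group("rate"))

# Sentences taken from the sample documents indexed in Step 3
retrieved_texts = [
    "The current metro area population of Chicago in 2023 is 8,937,000, a 0.4% increase from 2022.",
    "The current metro area population of New York City in 2023 is 18,937,000, a 0.37% increase from 2022.",
    "The current metro area population of Austin in 2023 is 2,228,000, a 2.39% increase from 2022.",
    "The San Francisco-Oakland-Berkeley metro area has a population of approximately 4.7 million people.",
]

growth = [g for g in map(extract_growth, retrieved_texts) if g is not None]
fastest_city, fastest_rate = max(growth, key=lambda pair: pair[1])
print(f"Fastest growth: {fastest_city} ({fastest_rate}%)")
```

Documents without a 2023 growth figure, such as the San Francisco entry, are skipped rather than guessed at.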

### Key Advantages:
- 🎯 **No Exact Match Needed**: Finds relevant content with different wording
- 🌍 **Synonym Understanding**: Related terms such as "increase", "growth", and "rise" map to nearby vectors
- 🧠 **Contextual Relevance**: Associates "tech cities" with the San Francisco document, which describes its "tech industry"
- 📊 **Relevance Ranking**: Best matches appear first based on similarity

## 🎓 Step 7: Key Takeaways

### What You Learned:

1. **✅ Vector Embeddings**: Text is converted to numerical vectors for similarity comparison
2. **✅ k-NN Search**: Fast approximate nearest neighbor search using the HNSW algorithm
3. **✅ Semantic Understanding**: Search by meaning, not just keywords
4. **✅ VectorDBTool**: Agent-based semantic search integration

### Best Practices:

- 🎯 **Choose Good Models**: Use an embedding model suited to your domain and language
- 🎯 **Tune the k Parameter**: A larger `k` considers more candidate neighbors at a higher search cost
- 🎯 **Set doc_size Deliberately**: `doc_size` controls how many documents the tool returns (3 in this notebook)
- 🎯 **Combine with Filters or Keywords**: Add filters, or use hybrid search that mixes keyword and vector scoring, when semantic similarity alone is not precise enough

### Technical Details:

- **Vector Dimensions**: 384 (from all-MiniLM-L12-v2)
- **Similarity Metric**: Cosine similarity
- **Index Algorithm**: HNSW (Hierarchical Navigable Small World)
- **Search Parameter k**: Number of nearest neighbors to retrieve
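
The scores printed in Step 5 are not raw cosine similarities. Assuming the Lucene engine's `cosinesimil` scoring as described in the k-NN plugin documentation, the score is `(2 - d) / 2` with `d = 1 - cos`, which simplifies to `(1 + cos) / 2`. The sketch below converts between the two; verify the formula against your OpenSearch version and engine before relying on it.

```python
def score_to_cosine(score):
    """Invert score = (1 + cos) / 2 (assumed Lucene cosinesimil scoring)."""
    return 2.0 * score - 1.0

def cosine_to_score(cosine):
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    return (1.0 + cosine) / 2.0

# Top Seattle hit from the Step 5 output
seattle_score = 0.8517
print(f"score {seattle_score} -> cosine {score_to_cosine(seattle_score):.4f}")
```

Under this assumption a score of 0.5 corresponds to orthogonal vectors, so the gaps between the scores above are larger in cosine terms than they first appear.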

### Next Steps:

- 📘 Try **RAGTool** to combine VectorDBTool with an LLM for intelligent answers
- 📘 Explore **NeuralSparseSearchTool** for an alternative semantic search approach
- 📘 Build conversational agents with semantic understanding

---

## 🧹 Cleanup (Optional)

In [None]:
# Uncomment to clean up resources
# from agent_helpers import cleanup_resources

# cleanup_resources(
#     client=client,
#     model_ids=[embedding_model_id],
#     agent_ids=[agent_id],
#     index_names=[index_name]
# )

# # Delete pipeline
# client.ingest.delete_pipeline(id=pipeline_name)

# print("\n✅ Cleanup completed")

print("\n📝 Note: Uncomment the code above to clean up resources after the demo")