# ⚙️ Configuration Setup

Before running this notebook, ensure you have:

1. **Created a `.env` file** in the project root directory
2. **Copied from `.env.example`** and filled in your actual credentials
3. **Verified your `.gitignore`** includes `.env` to protect your secrets

## Quick Setup Commands

```bash
# Copy the example file
cp .env.example .env

# Edit with your actual credentials
# (Use your preferred text editor)
```

**🔐 Security Best Practices:**
- Never commit `.env` files to version control
- Use different `.env` files for different environments (dev, staging, prod)
- Rotate your API keys regularly
- Consider using Azure Key Vault for production deployments

---

# Movies Dataset Vector Database Demo with Azure Cosmos DB for MongoDB vCore

This notebook demonstrates how to:
1. Load the movies dataset into Azure Cosmos DB for MongoDB vCore
2. Create collections with vector search capabilities using MongoDB's native vector indexing
3. Use MongoDB vCore as a vector database for similarity search with the `$vectorSearch` aggregation stage
4. Demonstrate RAG (Retrieval Augmented Generation) patterns using MongoDB aggregation pipelines

## Prerequisites
- Azure Cosmos DB for MongoDB vCore account with vector search enabled
- Movies dataset files in the `data/moviesdataset` folder
- Required Python packages: `pandas`, `numpy`, `pymongo`, `openai`, `tiktoken`, `python-dotenv`
- **Environment Configuration**: Create a `.env` file in the project root with the following variables:
  ```
  # Azure OpenAI Configuration
  AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
  AZURE_OPENAI_API_KEY=your_azure_openai_api_key
  AZURE_OPENAI_API_VERSION=2024-06-01
  EMBEDDING_MODEL=text-embedding-ada-002
  GENERATION_MODEL=gpt-4o

  # Azure Cosmos DB for MongoDB vCore Configuration
  MONGODB_CONNECTION_STRING=mongodb+srv://username:password@cluster.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000
  MONGODB_DATABASE_NAME=MovieVectorDB
  MONGODB_COLLECTION_NAME=movies
  ```

**⚠️ Security Note**: Never commit the `.env` file to version control. Add it to your `.gitignore` file.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import json
import ast
import warnings
from typing import List, Dict, Any
import time
import os
from openai import AzureOpenAI
import tiktoken
from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
from pymongo.errors import OperationFailure, BulkWriteError
import uuid
from datetime import datetime

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

# Azure OpenAI Configuration from environment variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")
GENERATION_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o")

# MongoDB vCore Configuration from environment variables
MONGODB_CONNECTION_STRING = os.getenv("MONGODB_CONNECTION_STRING")
MONGODB_DATABASE_NAME = os.getenv("MONGODB_DATABASE_NAME", "MovieVectorDB")
MONGODB_COLLECTION_NAME = os.getenv("MONGODB_COLLECTION_NAME", "movies")

# Validate required configuration
required_vars = {
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_API_KEY": AZURE_OPENAI_API_KEY,
    "MONGODB_CONNECTION_STRING": MONGODB_CONNECTION_STRING
}
missing_vars = [k for k, v in required_vars.items() if not v]
if missing_vars:
    raise ValueError(f"Missing required configuration variables: {', '.join(missing_vars)}")

# Initialize Azure OpenAI client
openai_client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION
)

# Initialize MongoDB client
mongo_client = MongoClient(MONGODB_CONNECTION_STRING)

print("All libraries imported successfully!")
print(f"Azure OpenAI configured with embedding model: {EMBEDDING_MODEL}")
print(f"Generation model: {GENERATION_MODEL}")
print(f"MongoDB configured for database: {MONGODB_DATABASE_NAME}")
print("✅ Environment variables loaded from .env file")

## 1. Load and Inspect the Movies Dataset

Let's start by loading the movies dataset and examining its structure.
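The cleaning step later in this notebook treats the `genres` column as a stringified Python list of dictionaries and keeps only each entry's `name`. As a minimal sketch of that parsing step, assuming cells shaped like the hypothetical `sample` value below (the real values in your CSV may differ):

```python
import ast

def parse_genre_names(raw):
    """Return genre names from a stringified list of dicts, or [] if unparseable."""
    if not isinstance(raw, str):
        return []  # e.g. NaN for a missing cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only well-formed entries that actually carry a "name" key
    return [g["name"] for g in parsed if isinstance(g, dict) and "name" in g]

# Hypothetical cell value, for illustration only
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genre_names(sample))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.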

In [None]:
# Load the movies metadata (forward slashes work on Windows, macOS, and Linux)
movies_df = pd.read_csv('../data/moviesdataset/movies_metadata.csv', low_memory=False)

# Load ratings data (using smaller dataset for demo)
ratings_df = pd.read_csv('../data/moviesdataset/ratings_small.csv')

print("Movies Dataset Shape:", movies_df.shape)
print("Ratings Dataset Shape:", ratings_df.shape)
print("\nMovies Dataset Columns:")
print(movies_df.columns.tolist())
print("\nFirst few rows of movies dataset:")
movies_df.head()

In [None]:
# Clean and preprocess the data
def clean_movies_data(df):
    # Remove rows with missing essential data
    df_clean = df.dropna(subset=['title', 'overview', 'id']).copy()
    
    # Convert id to numeric, handling errors
    df_clean['id'] = pd.to_numeric(df_clean['id'], errors='coerce')
    df_clean = df_clean.dropna(subset=['id'])
    df_clean['id'] = df_clean['id'].astype(int)
    
    # Clean overview text
    df_clean['overview'] = df_clean['overview'].str.strip()
    df_clean = df_clean[df_clean['overview'].str.len() > 10]  # Remove very short descriptions
    
    # Parse genres safely
    def safe_parse_genres(genres_str):
        if pd.isna(genres_str):
            return []
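        try:
            genres_list = ast.literal_eval(genres_str)
            return [genre['name'] for genre in genres_list] if genres_list else []
        except (ValueError, SyntaxError, TypeError, KeyError):
            # Malformed or unexpected genre data: treat as "no genres"
            return []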
    
    df_clean['genres_list'] = df_clean['genres'].apply(safe_parse_genres)
    
    # Convert release_date to datetime
    df_clean['release_date'] = pd.to_datetime(df_clean['release_date'], errors='coerce')
    
    # Handle numeric columns
    numeric_columns = ['vote_average', 'vote_count', 'popularity', 'runtime']
    for col in numeric_columns:
        if col in df_clean.columns:
            df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
    
    return df_clean

# Clean the data
movies_clean = clean_movies_data(movies_df)
print(f"Original dataset size: {len(movies_df)}")
print(f"Cleaned dataset size: {len(movies_clean)}")
print(f"Data reduction: {((len(movies_df) - len(movies_clean)) / len(movies_df) * 100):.1f}%")

# For demo purposes, let's work with a subset
SAMPLE_SIZE = 100  # Adjust this based on your needs
movies_sample = movies_clean.head(SAMPLE_SIZE).copy()
print(f"\nUsing sample of {len(movies_sample)} movies for this demo")
movies_sample.head()

## 2. Set up MongoDB vCore Database and Collection

Now let's connect to MongoDB vCore and set up our database and collection for vector operations.
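When debugging connection problems it is tempting to print the connection string, which would leak the password into the notebook output. A small sketch of a redaction helper, using only the standard library; the function name and the example URI are hypothetical and not part of the notebook:

```python
from urllib.parse import urlsplit, urlunsplit

def redact_connection_string(uri):
    """Return the URI with the password replaced by '***' (unchanged if none)."""
    parts = urlsplit(uri)
    if parts.password is None:
        return uri
    # Rebuild the network location without the secret
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username or ''}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# Hypothetical URI, for illustration only
example_uri = "mongodb+srv://demo_user:demo_pass@example.invalid/?tls=true"
print(redact_connection_string(example_uri))
```

With this in place, a diagnostic line such as `print(redact_connection_string(MONGODB_CONNECTION_STRING))` shows the host and options without exposing the credential.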

In [None]:
# Connect to MongoDB and set up database/collection
try:
    # Test connection
    mongo_client.admin.command('ping')
    print("✅ Successfully connected to MongoDB vCore")
    
    # Get database
    database = mongo_client[MONGODB_DATABASE_NAME]
    print(f"✅ Using database: {MONGODB_DATABASE_NAME}")
    
    # Get collection
    collection = database[MONGODB_COLLECTION_NAME]
    print(f"✅ Using collection: {MONGODB_COLLECTION_NAME}")
    
    # Check if collection exists and get document count
    doc_count = collection.count_documents({})
    print(f"📊 Current document count: {doc_count}")
    
    # Optional: Drop existing collection for fresh start (uncomment if needed)
    # collection.drop()
    # print("🧹 Dropped existing collection for fresh start")
    
except Exception as e:
    print(f"❌ Error connecting to MongoDB: {e}")
    raise

# List existing indexes
try:
    indexes = list(collection.list_indexes())
    print(f"\nExisting indexes:")
    for idx in indexes:
        print(f"  - {idx.get('name', 'unnamed')}")
except Exception as e:
    print(f"Could not list indexes: {e}")

## 3. Generate Embeddings with Azure OpenAI

We'll create embeddings for movie descriptions using the Azure OpenAI embedding model configured in `EMBEDDING_MODEL` (`text-embedding-ada-002` by default, which returns 1536-dimensional vectors).
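Embedding calls can fail transiently, for example when the service throttles requests. The notebook uses a fixed `time.sleep(0.1)` between calls; a retry with exponential backoff is one possible complement. The helper below is a sketch under that assumption: `call_with_backoff` and its parameters are hypothetical names, and it retries on any exception for simplicity.

```python
import time

def call_with_backoff(fn, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the delay after each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call site could then look like `call_with_backoff(lambda: openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL))`. In real use it is usually better to catch only throttling errors (such as `openai.RateLimitError`) so that configuration mistakes fail immediately.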

In [None]:
# Function to get embeddings from Azure OpenAI
def get_embedding(text, model=EMBEDDING_MODEL):
    """Get embedding from Azure OpenAI"""
    try:
        response = openai_client.embeddings.create(
            input=text,
            model=model
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embedding for text: {text[:50]}... Error: {e}")
        return None

# Generate embeddings for our movie sample
print(f"Generating embeddings for {len(movies_sample)} movies...")
successful_embeddings = 0
failed_embeddings = 0

# Prepare documents for insertion
documents_to_insert = []

for idx, row in movies_sample.iterrows():
    # Create searchable text combining title and overview
    search_text = f"{row['title']} {row['overview']}"
    
    # Get embedding
    embedding = get_embedding(search_text)
    
    if embedding is not None:
        # Create document for MongoDB
        document = {
            "_id": str(row['id']),  # Use movie ID as document ID
            "movie_id": int(row['id']),
            "title": row['title'],
            "overview": row['overview'],
            "genres": row['genres_list'],
            "release_date": row['release_date'].isoformat() if pd.notna(row['release_date']) else None,
            "vote_average": float(row['vote_average']) if pd.notna(row['vote_average']) else None,
            "vote_count": int(row['vote_count']) if pd.notna(row['vote_count']) else None,
            "popularity": float(row['popularity']) if pd.notna(row['popularity']) else None,
            "search_text": search_text,
            "embedding": embedding,
            "embedding_model": EMBEDDING_MODEL,
            "created_at": datetime.utcnow().isoformat()
        }
        
        documents_to_insert.append(document)
        successful_embeddings += 1
        
        if successful_embeddings % 10 == 0:
            print(f"  Generated {successful_embeddings} embeddings...")
        
        # Rate limiting to avoid API throttling
        time.sleep(0.1)
    else:
        failed_embeddings += 1

print(f"\n✅ Successfully generated {successful_embeddings} embeddings")
print(f"❌ Failed to generate {failed_embeddings} embeddings")
print(f"📊 Success rate: {(successful_embeddings / len(movies_sample) * 100):.1f}%")

## 4. Insert Data into MongoDB vCore Collection

Now let's insert our movie data with embeddings into the MongoDB collection.
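With the 100-document sample a single `insert_many` call is fine, but each document carries a 1536-element embedding, so larger samples may be easier to handle in batches. A minimal sketch of a batching helper; `chunked` and the batch size of 50 are illustrative choices, not values from the notebook:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Illustrative usage against the notebook's objects (not executed here):
# for batch in chunked(documents_to_insert, 50):
#     collection.insert_many(batch, ordered=False)
```

Because the documents use the movie ID as `_id`, re-running the insert cell raises duplicate-key errors for documents that already exist. If re-runs should overwrite instead, `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)` per document is one option, at the cost of more round trips.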

In [None]:
# Insert documents into MongoDB collection
if documents_to_insert:
    try:
        print(f"Inserting {len(documents_to_insert)} documents into MongoDB...")
        
        # Use ordered=False for better performance and partial success
        result = collection.insert_many(documents_to_insert, ordered=False)
        
        print(f"✅ Successfully inserted {len(result.inserted_ids)} documents")
        print(f"📊 Insertion rate: {(len(result.inserted_ids) / len(documents_to_insert) * 100):.1f}%")
        
        # Show sample document structure
        sample_doc = collection.find_one()
        if sample_doc:
            print(f"\n📄 Sample document structure:")
            print(f"  _id: {sample_doc['_id']}")
            print(f"  title: {sample_doc['title']}")
            print(f"  genres: {sample_doc['genres']}")
            print(f"  embedding dimensions: {len(sample_doc['embedding'])}")
            print(f"  embedding_model: {sample_doc['embedding_model']}")
            
    except BulkWriteError as e:
        print(f"⚠️ Partial success during bulk insert:")
        print(f"  Inserted: {e.details.get('nInserted', 0)} documents")
        print(f"  Errors: {len(e.details.get('writeErrors', []))}")
        for error in e.details.get('writeErrors', [])[:3]:  # Show first 3 errors
            print(f"    Error: {error.get('errmsg', 'Unknown error')}")
    except Exception as e:
        print(f"❌ Error during insertion: {e}")

# Get final collection stats
final_count = collection.count_documents({})
print(f"\n📊 Final collection statistics:")
print(f"  Total documents: {final_count}")
print(f"  Documents with embeddings: {collection.count_documents({'embedding': {'$exists': True}})}")

## 5. Create Vector Search Index

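The next cell defines a vector index on the `embedding` field with PyMongo's `SearchIndexModel` and creates it with `create_search_index`. The index also registers `genres`, `vote_average`, and `release_date` as filter fields.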
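If your cluster rejects `create_search_index`, one alternative is the `cosmosSearch` index type described in the Azure Cosmos DB for MongoDB vCore documentation, created through a raw `createIndexes` command. The sketch below only builds the command document. The helper name is hypothetical, and the option values (`vector-ivf`, `numLists=1`, `COS`) are assumptions to check against the current Azure documentation for your cluster tier.

```python
def build_cosmos_search_index_command(collection_name, index_name="vector_index",
                                      path="embedding", dimensions=1536, num_lists=1):
    """Build a createIndexes command for a cosmosSearch vector index (assumed options)."""
    return {
        "createIndexes": collection_name,  # must stay the first key of the command
        "indexes": [
            {
                "name": index_name,
                "key": {path: "cosmosSearch"},
                "cosmosSearchOptions": {
                    "kind": "vector-ivf",      # assumed index kind
                    "numLists": num_lists,     # assumed; small values suit small demos
                    "similarity": "COS",       # cosine similarity
                    "dimensions": dimensions,  # must match the embedding length
                },
            }
        ],
    }

# Illustrative usage against the notebook's objects (not executed here):
# database.command(build_cosmos_search_index_command(MONGODB_COLLECTION_NAME))
```

An index created this way is queried with a `$search` stage that uses the `cosmosSearch` operator rather than `$vectorSearch`, so the search functions in this notebook would need to be adapted accordingly.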

In [None]:
# Create vector search index
vector_index_name = "vector_index"

# Define vector search index
vector_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1536,  # text-embedding-ada-002 dimensions
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "genres"
            },
            {
                "type": "filter",
                "path": "vote_average"
            },
            {
                "type": "filter",
                "path": "release_date"
            }
        ]
    },
    name=vector_index_name
)

try:
    # Check if index already exists
    existing_indexes = list(collection.list_search_indexes())
    index_names = [idx.get('name') for idx in existing_indexes if idx.get('name')]
    
    if vector_index_name in index_names:
        print(f"✅ Vector search index '{vector_index_name}' already exists")
    else:
        print(f"Creating vector search index '{vector_index_name}'...")
        collection.create_search_index(vector_index_model)
        print(f"✅ Vector search index '{vector_index_name}' created successfully")
        print("⏳ Index may take a few minutes to be fully ready for searches")
    
    # Wait for index to be ready (optional)
    print("\nWaiting 30 seconds for index to initialize...")
    time.sleep(30)
    
    # List all search indexes
    print("\n📋 Current search indexes:")
    search_indexes = list(collection.list_search_indexes())
    for idx in search_indexes:
        print(f"  - Name: {idx.get('name')}")
        print(f"    Status: {idx.get('status', 'unknown')}")
        print(f"    Type: {idx.get('type', 'unknown')}")
        
except Exception as e:
    print(f"⚠️ Vector index creation: {e}")
    print("Note: Some MongoDB vCore instances may require manual index creation")

## 6. Semantic Movie Search with Vector Similarity

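Now let's implement semantic search using MongoDB's `$vectorSearch` aggregation stage. The query text is embedded with the same model as the stored documents, and the stage returns the documents whose embeddings are closest to the query vector.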
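To make the idea of "closest" concrete: the index in this notebook is defined with `"similarity": "cosine"`, so ranking is based on the angle between vectors. The sketch below computes cosine similarity by brute force over a toy, hand-made set of 2-dimensional vectors. It illustrates the metric only; the names and vectors are made up, and the score reported by the database may be scaled differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and return the k best (name, score) pairs."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy data, for illustration only
toy_vectors = {"robots": [1.0, 0.0], "romance": [0.0, 1.0], "cyborg love": [1.0, 1.0]}
for name, score in brute_force_top_k([1.0, 0.1], toy_vectors, k=2):
    print(name, round(score, 3))
```

A vector index avoids scoring every document by examining only a candidate set, which is what the `numCandidates` parameter in the pipeline controls.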

In [None]:
# Function to perform semantic movie search
def semantic_movie_search(query_text, top_k=5, include_score=True):
    """
    Perform semantic search for movies using MongoDB vector search
    """
    try:
        # Generate embedding for the query
        query_embedding = get_embedding(query_text)
        
        if query_embedding is None:
            return []
        
        # MongoDB vector search aggregation pipeline
        pipeline = [
            {
                "$vectorSearch": {
                    "index": vector_index_name,
                    "path": "embedding",
                    "queryVector": query_embedding,
                    "numCandidates": 100,  # Number of candidates to consider
                    "limit": top_k
                }
            },
            {
                "$project": {
                    "_id": 0,
                    "title": 1,
                    "overview": 1,
                    "genres": 1,
                    "vote_average": 1,
                    "release_date": 1,
                    "score": {"$meta": "vectorSearchScore"} if include_score else None
                }
            }
        ]
        
        # Remove null score projection if not needed
        if not include_score:
            pipeline[1]["$project"].pop("score", None)
        
        # Execute the search
        results = list(collection.aggregate(pipeline))
        
        return results
        
    except Exception as e:
        print(f"Error performing semantic search: {e}")
        return []

# Test semantic search with various queries
test_queries = [
    "sci-fi action movie with robots",
    "romantic comedy about love",
    "superhero movie with powers",
    "thriller with mystery and suspense",
    "animated family-friendly adventure"
]

print("🔍 Testing Semantic Movie Search\n" + "="*50)

for query in test_queries:
    print(f"\n🎯 Query: '{query}'")
    print("-" * (len(query) + 10))
    
    results = semantic_movie_search(query, top_k=3)
    
    if results:
        for i, movie in enumerate(results, 1):
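            score = movie.get('score')
            # Format only numeric scores; a missing score would break the ':.4f' format spec
            score_text = f"{score:.4f}" if isinstance(score, (int, float)) else "N/A"
            print(f"{i}. {movie['title']} (Score: {score_text})")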
            print(f"   Genres: {', '.join(movie['genres']) if movie['genres'] else 'N/A'}")
            print(f"   Rating: {movie.get('vote_average', 'N/A')}/10")
            print(f"   Overview: {movie['overview'][:100]}...")
            print()
    else:
        print("   No results found or search error")
    print()

## 7. Advanced Filtered Vector Search

MongoDB vCore allows combining vector search with traditional filtering for more precise results.
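The filter is an ordinary MongoDB query document placed inside the vector search stage. As a sketch of how such a stage can be assembled, the helpers below build the filter and the `$vectorSearch` stage as plain dictionaries. The function names are hypothetical; the field names and operators mirror the ones used in this notebook.

```python
def build_filter(genre=None, min_rating=None):
    """Build the filter document; None means 'no constraint' for that field."""
    conditions = {}
    if genre is not None:
        conditions["genres"] = {"$in": [genre]}
    if min_rating is not None:  # explicit None check so that 0 is still a valid threshold
        conditions["vote_average"] = {"$gte": min_rating}
    return conditions

def build_vector_search_stage(query_vector, index_name, top_k=5, num_candidates=100,
                              genre=None, min_rating=None):
    """Assemble a $vectorSearch stage, adding 'filter' only when there is something to filter on."""
    stage = {
        "index": index_name,
        "path": "embedding",
        "queryVector": query_vector,
        "numCandidates": num_candidates,
        "limit": top_k,
    }
    conditions = build_filter(genre, min_rating)
    if conditions:
        stage["filter"] = conditions
    return {"$vectorSearch": stage}

print(build_filter(genre="Action", min_rating=7.0))
```

Keeping the stage construction separate from the database call makes the filter logic easy to check without a live cluster.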

In [None]:
# Function for filtered vector search
def filtered_movie_search(query_text, genre_filter=None, min_rating=None, top_k=5):
    """
    Perform semantic search with additional filters
    """
    try:
        # Generate embedding for the query
        query_embedding = get_embedding(query_text)
        
        if query_embedding is None:
            return []
        
        # Build filter conditions
        filter_conditions = {}
        
        if genre_filter:
            filter_conditions["genres"] = {"$in": [genre_filter]}
        
        if min_rating is not None:
            filter_conditions["vote_average"] = {"$gte": min_rating}
        
        # MongoDB vector search with filters
        pipeline = [
            {
                "$vectorSearch": {
                    "index": vector_index_name,
                    "path": "embedding",
                    "queryVector": query_embedding,
                    "numCandidates": 100,
                    "limit": top_k,
                    "filter": filter_conditions if filter_conditions else {}
                }
            },
            {
                "$project": {
                    "_id": 0,
                    "title": 1,
                    "overview": 1,
                    "genres": 1,
                    "vote_average": 1,
                    "release_date": 1,
                    "score": {"$meta": "vectorSearchScore"}
                }
            }
        ]
        
        # Execute the search
        results = list(collection.aggregate(pipeline))
        return results
        
    except Exception as e:
        print(f"Error performing filtered search: {e}")
        return []

# Test filtered searches
print("🎯 Testing Filtered Vector Search\n" + "="*40)

# Search 1: Action movies with good ratings
print("🔍 Search 1: Action movies with high ratings")
results = filtered_movie_search(
    query_text="action adventure explosive", 
    genre_filter="Action",
    min_rating=7.0,
    top_k=3
)

if results:
    for i, movie in enumerate(results, 1):
        print(f"{i}. {movie['title']} (Score: {movie['score']:.4f}, Rating: {movie.get('vote_average', 'N/A')})")
        print(f"   Genres: {', '.join(movie['genres'])}")
        print(f"   Overview: {movie['overview'][:80]}...")
else:
    print("   No results found")

print("\n" + "-"*50)

# Search 2: Sci-fi movies
print("🔍 Search 2: Science Fiction movies")
results = filtered_movie_search(
    query_text="space future technology",
    genre_filter="Science Fiction",
    top_k=3
)

if results:
    for i, movie in enumerate(results, 1):
        print(f"{i}. {movie['title']} (Score: {movie['score']:.4f})")
        print(f"   Genres: {', '.join(movie['genres'])}")
        print(f"   Overview: {movie['overview'][:80]}...")
else:
    print("   No results found")

print("\n" + "-"*50)

# Search 3: High-rated movies regardless of genre
print("🔍 Search 3: High-rated movies (>= 8.0)")
results = filtered_movie_search(
    query_text="great amazing movie",
    min_rating=8.0,
    top_k=5
)

if results:
    for i, movie in enumerate(results, 1):
        print(f"{i}. {movie['title']} (Score: {movie['score']:.4f}, Rating: {movie.get('vote_average', 'N/A')})")
        print(f"   Genres: {', '.join(movie['genres'])}")
else:
    print("   No high-rated movies found in sample")

## 8. Retrieval-Augmented Generation (RAG) for Movie Recommendations

Let's implement a complete RAG pipeline using MongoDB as our vector store and Azure OpenAI for generation.
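The retrieval step returns a list of dictionaries, which has to be flattened into text before it is placed in the prompt. A minimal sketch of that formatting step, assuming result documents with the fields projected by the search functions in this notebook; the helper name and the sample values are hypothetical:

```python
def format_movie_context(movies):
    """Turn search results into a plain-text context block, one paragraph per movie."""
    blocks = []
    for movie in movies:
        score = movie.get("score")
        score_text = f"{score:.4f}" if isinstance(score, (int, float)) else "N/A"
        rating = movie.get("vote_average")
        rating_text = f"{rating}/10" if rating is not None else "N/A"
        genres_text = ", ".join(movie.get("genres") or []) or "N/A"
        blocks.append(
            f"Title: {movie.get('title', 'N/A')}\n"
            f"Genres: {genres_text}\n"
            f"Rating: {rating_text}\n"
            f"Overview: {movie.get('overview', 'N/A')}\n"
            f"Relevance: {score_text}"
        )
    # A blank line between movies keeps the entries visually separate for the model
    return "\n\n".join(blocks)

# Made-up results, for illustration only
sample_results = [
    {"title": "Example A", "genres": ["Action"], "vote_average": 7.5,
     "overview": "A placeholder plot.", "score": 0.87654},
    {"title": "Example B"},
]
print(format_movie_context(sample_results))
```

Handling missing fields here means one incomplete document does not abort the whole recommendation request.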

In [None]:
# RAG Implementation for Movie Recommendations
def generate_movie_recommendations_rag(user_query, top_k=5):
    """
    Complete RAG pipeline: Retrieve relevant movies, then generate recommendations
    """
    try:
        # Step 1: Retrieve relevant movies using semantic search
        print(f"🔍 Searching for movies related to: '{user_query}'")
        relevant_movies = semantic_movie_search(user_query, top_k=top_k, include_score=True)
        
        if not relevant_movies:
            return "I couldn't find any movies matching your request. Please try a different query."
        
        print(f"✅ Found {len(relevant_movies)} relevant movies")
        
        # Step 2: Prepare context for the LLM
        context_movies = []
        for movie in relevant_movies:
            movie_info = {
                "title": movie["title"],
                "genres": ", ".join(movie["genres"]) if movie["genres"] else "N/A",
                "rating": movie.get("vote_average", "N/A"),
                "overview": movie["overview"],
                "relevance_score": movie.get("score", "N/A")
            }
            context_movies.append(movie_info)
        
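        # Step 3: Create prompt for GPT
        # Use real newlines ("\n") so the model receives line-separated context
        context_text = "\n\n".join([
            f"Title: {m['title']}\n"
            f"Genres: {m['genres']}\n"
            f"Rating: {m['rating']}/10\n"
            f"Overview: {m['overview']}\n"
            f"Relevance Score: "
            + (f"{m['relevance_score']:.4f}" if isinstance(m['relevance_score'], (int, float)) else "N/A")
            for m in context_movies
        ])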
        
        system_prompt = """You are a knowledgeable movie recommendation expert. Based on the provided movie information from a vector database search, create personalized and engaging movie recommendations.

Guidelines:
- Analyze the retrieved movies and their relevance scores
- Provide thoughtful recommendations with explanations
- Mention genres, ratings, and key themes
- Be conversational and engaging
- If movies seem unrelated to the query, acknowledge this and try to find connections"""

        user_prompt = f"""Based on my request: "{user_query}"

Here are the most relevant movies from our database:

{context_text}

Please provide detailed movie recommendations based on this information. Explain why these movies match my request and what makes them worth watching."""

        # Step 4: Generate response using Azure OpenAI
        print("🤖 Generating personalized recommendations...")
        
        completion = openai_client.chat.completions.create(
            model=GENERATION_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.7,
            max_tokens=1000
        )
        
        return completion.choices[0].message.content
        
    except Exception as e:
        print(f"Error in RAG pipeline: {e}")
        return f"Sorry, I encountered an error while generating recommendations: {str(e)}"

# Test RAG with different user queries
test_queries = [
    "I want to watch something exciting with lots of action",
    "Recommend me a good family movie for weekend",
    "I'm in the mood for a thought-provoking sci-fi film",
    "Can you suggest a romantic movie?",
    "What are some good thriller movies with plot twists?"
]

print("🎬 Testing RAG Movie Recommendation System\n" + "="*60)

for i, query in enumerate(test_queries, 1):
    print(f"\n{'='*60}")
    print(f"Test {i}: {query}")
    print('='*60)
    
    recommendation = generate_movie_recommendations_rag(query, top_k=3)
    print(f"\n🎯 AI Recommendation:\n{recommendation}")
    
    if i < len(test_queries):
        print("\n" + "-"*40 + " Next Query " + "-"*40)

## 9. Performance Analysis and Optimization

Let's analyze the performance of our vector operations and explore optimization strategies.
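For latency measurements, `time.perf_counter()` is generally a better fit than `time.time()` because it is monotonic and has higher resolution. The sketch below shows a small timing wrapper and a summary that includes a nearest-rank 95th percentile, which is often more informative than the mean when a few calls are slow. Both helper names are hypothetical.

```python
import math
import time
from statistics import mean, median

def time_call_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def summarize_latencies(samples_ms):
    """Return mean, median, nearest-rank p95, and max for a list of timings."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }

print(summarize_latencies([10, 20, 30, 40]))
```

Note that the search functions in this notebook embed the query before searching, so their timings include a round trip to Azure OpenAI as well as the database query. Timing the two steps separately gives a clearer picture of where the latency comes from.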

In [None]:
# Performance benchmarking and analysis
import time
from statistics import mean, median

def benchmark_vector_search(num_tests=5):
    """
    Benchmark vector search performance
    """
    print("🚀 Performance Benchmarking\n" + "="*40)
    
    test_queries = [
        "action adventure movie",
        "romantic comedy film",
        "sci-fi space opera",
        "thriller mystery movie",
        "animated family adventure"
    ]
    
    # Test 1: Simple vector search performance
    print("📊 Test 1: Simple Vector Search Performance")
    search_times = []
    
    for i in range(num_tests):
        query = test_queries[i % len(test_queries)]
        
        start_time = time.time()
        results = semantic_movie_search(query, top_k=5)
        end_time = time.time()
        
        search_time = (end_time - start_time) * 1000  # Convert to milliseconds
        search_times.append(search_time)
        
        print(f"  Query {i+1}: {search_time:.2f}ms ({len(results)} results)")
    
    print(f"\n  Average search time: {mean(search_times):.2f}ms")
    print(f"  Median search time: {median(search_times):.2f}ms")
    print(f"  Min/Max search time: {min(search_times):.2f}ms / {max(search_times):.2f}ms")
    
    # Test 2: Filtered search performance
    print("\n📊 Test 2: Filtered Vector Search Performance")
    filtered_times = []
    
    for i in range(num_tests):
        query = test_queries[i % len(test_queries)]
        
        start_time = time.time()
        results = filtered_movie_search(query, min_rating=6.0, top_k=5)
        end_time = time.time()
        
        search_time = (end_time - start_time) * 1000
        filtered_times.append(search_time)
        
        print(f"  Query {i+1}: {search_time:.2f}ms ({len(results)} results)")
    
    print(f"\n  Average filtered search time: {mean(filtered_times):.2f}ms")
    print(f"  Median filtered search time: {median(filtered_times):.2f}ms")
    
    # Test 3: Embedding generation performance
    print("\n📊 Test 3: Embedding Generation Performance")
    embedding_times = []
    
    for i in range(3):  # Fewer tests due to API rate limits
        query = test_queries[i]
        
        start_time = time.time()
        embedding = get_embedding(query)
        end_time = time.time()
        
        if embedding:
            embedding_time = (end_time - start_time) * 1000
            embedding_times.append(embedding_time)
            print(f"  Query {i+1}: {embedding_time:.2f}ms")
        
        time.sleep(0.1)  # Rate limiting
    
    if embedding_times:
        print(f"\n  Average embedding time: {mean(embedding_times):.2f}ms")
        print(f"  Note: Embedding generation includes network latency to Azure OpenAI")
    
    return {
        "search_times": search_times,
        "filtered_times": filtered_times,
        "embedding_times": embedding_times
    }

# Run performance benchmarks
benchmark_results = benchmark_vector_search()

# Collection statistics
print("\n📈 Collection Statistics\n" + "="*30)

try:
    # Basic collection stats
    total_docs = collection.count_documents({})
    docs_with_embeddings = collection.count_documents({"embedding": {"$exists": True}})
    
    print(f"Total documents: {total_docs}")
    print(f"Documents with embeddings: {docs_with_embeddings}")
    print(f"Embedding coverage: {(docs_with_embeddings/total_docs*100):.1f}%")
    
    # Sample document for size analysis
    sample_doc = collection.find_one({"embedding": {"$exists": True}})
    if sample_doc:
        # Rough size estimate based on the string representation, not the BSON size
        doc_size = len(str(sample_doc).encode('utf-8'))
        embedding_size = len(str(sample_doc['embedding']).encode('utf-8'))
        
        print(f"\nDocument size analysis:")
        print(f"  Sample document size: ~{doc_size} bytes")
        print(f"  Embedding size: ~{embedding_size} bytes ({embedding_size/doc_size*100:.1f}% of document)")
        print(f"  Embedding dimensions: {len(sample_doc['embedding'])}")
    
    # Index information
    print(f"\nIndex information:")
    indexes = list(collection.list_indexes())
    for idx in indexes:
        print(f"  - {idx.get('name', 'unnamed')}: {idx.get('key', {})}")
        
    # Search index information
    try:
        search_indexes = list(collection.list_search_indexes())
        print(f"\nSearch indexes:")
        for idx in search_indexes:
            print(f"  - {idx.get('name')}: {idx.get('status', 'unknown')} status")
    except Exception:
        print("  Search indexes: Unable to retrieve information")
        
except Exception as e:
    print(f"Error getting collection statistics: {e}")

## 10. Summary and Cleanup

Let's summarize what we've accomplished and clean up resources.

In [None]:
# Demo Summary and Cleanup
print("🎉 MongoDB vCore Vector Database Demo Summary\n" + "="*50)

print("✅ What we accomplished:")
print("  1. 📊 Loaded and preprocessed the movies dataset")
print("  2. 🔌 Connected to Azure Cosmos DB for MongoDB vCore")
print("  3. 🧠 Generated embeddings using Azure OpenAI text-embedding-ada-002")
print("  4. 💾 Stored movie data with 1536-dimensional embeddings in MongoDB")
print("  5. 🔍 Created vector search index using SearchIndexModel")
print("  6. 🎯 Implemented semantic search with $vectorSearch aggregation")
print("  7. 🎛️ Added filtered search combining vectors with traditional queries")
print("  8. 🤖 Built complete RAG pipeline for movie recommendations")
print("  9. 📈 Performed performance benchmarking and analysis")

print("\n🚀 Key MongoDB vCore Features Demonstrated:")
print("  • Native MongoDB vector search with $vectorSearch")
print("  • SearchIndexModel for optimized vector indexing")
print("  • HNSW/IVF indexing algorithms for high performance")
print("  • Seamless integration of vector and document queries")
print("  • Rich aggregation pipelines with vector operations")
print("  • Familiar MongoDB syntax and ecosystem compatibility")

print("\n💡 Production Considerations:")
print("  • Index optimization: Choose appropriate numCandidates values")
print("  • Batch operations: Use insert_many for bulk vector inserts")
print("  • Connection pooling: Reuse MongoDB connections efficiently")
print("  • Monitoring: Track vector search performance and index usage")
print("  • Scaling: Leverage MongoDB's horizontal sharding capabilities")
print("  • Security: Implement proper authentication and network security")

# Collection summary
try:
    final_stats = {
        "total_documents": collection.count_documents({}),
        "documents_with_embeddings": collection.count_documents({"embedding": {"$exists": True}}),
        "unique_genres": len(collection.distinct("genres")),
        "avg_rating": None
    }
    
    # Calculate average rating
    rating_pipeline = [
        {"$match": {"vote_average": {"$exists": True, "$ne": None}}},
        {"$group": {"_id": None, "avg_rating": {"$avg": "$vote_average"}}}
    ]
    avg_result = list(collection.aggregate(rating_pipeline))
    if avg_result:
        final_stats["avg_rating"] = avg_result[0]["avg_rating"]
    
    print(f"\n📊 Final Collection Statistics:")
    print(f"  Total documents: {final_stats['total_documents']}")
    print(f"  Documents with embeddings: {final_stats['documents_with_embeddings']}")
    print(f"  Unique genres represented: {final_stats['unique_genres']}")
    if final_stats["avg_rating"]:
        print(f"  Average movie rating: {final_stats['avg_rating']:.1f}/10")
    
except Exception as e:
    print(f"Could not get final statistics: {e}")

# Optional cleanup (uncomment to remove demo data)
print("\n🧹 Cleanup Options:")
print("  To remove demo data, uncomment and run the following:")
print("  # collection.drop()")
print("  # print('Demo collection dropped')")

print("\n  To remove vector search indexes:")
print("  # collection.drop_search_index('vector_index')")
print("  # print('Vector search index dropped')")

# Close connections
print("\n🔐 Closing Connections:")
try:
    mongo_client.close()
    print("✅ MongoDB connection closed")
except Exception as e:
    print(f"Note: {e}")

print("\n🎬 Thank you for exploring MongoDB vCore as a vector database!")
print("For production deployments, consider:")
print("  • Azure Key Vault for credential management")
print("  • Azure Monitor metrics and alerting")
print("  • Proper indexing strategies for your specific use case")
print("  • Load testing with your expected traffic patterns")
print("  • Regular backup and disaster recovery procedures")