# ⚙️ Configuration Setup

Before running this notebook, ensure you have:

1. **Created a `.env` file** in the project root directory
2. **Copied from `.env.example`** and filled in your actual credentials
3. **Verified your `.gitignore`** includes `.env` to protect your secrets

## Quick Setup Commands

```bash
# Copy the example file
cp .env.example .env

# Edit with your actual credentials
# (Use your preferred text editor)
```

**🔐 Security Best Practices:**
- Never commit `.env` files to version control
- Use different `.env` files for different environments (dev, staging, prod)
- Rotate your API keys regularly
- Consider using Azure Key Vault for production deployments

---

# Movies Dataset Vector Database Demo with Azure Cosmos DB NoSQL API

This notebook demonstrates how to:
1. Load the movies dataset into Azure Cosmos DB NoSQL API
2. Create containers with vector search capabilities
3. Use Cosmos DB as a vector database for similarity search
4. Demonstrate RAG (Retrieval Augmented Generation) patterns

## Prerequisites
- Azure Cosmos DB account with NoSQL API and vector search enabled
- Movies dataset files in the data/moviesdataset folder
- Required Python packages: pandas, azure-cosmos, numpy, openai, python-dotenv
- **Environment Configuration**: Create a `.env` file in the project root with the following variables:
  ```
  # Azure OpenAI Configuration
  AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
  AZURE_OPENAI_API_KEY=your_azure_openai_api_key
  AZURE_OPENAI_API_VERSION=2024-06-01
  EMBEDDING_MODEL=text-embedding-ada-002
  GENERATION_MODEL=gpt-4o

  # Azure Cosmos DB Configuration
  COSMOS_ENDPOINT=your_cosmos_endpoint
  COSMOS_KEY=your_cosmos_key
  COSMOS_DATABASE_NAME=MovieVectorDB
  COSMOS_CONTAINER_NAME=movies
  ```

**⚠️ Security Note**: Never commit the `.env` file to version control. Add it to your `.gitignore` file.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import json
import ast
import warnings
from typing import List, Dict, Any
import time
import os
from openai import AzureOpenAI
import tiktoken
from dotenv import load_dotenv
from azure.cosmos import CosmosClient, PartitionKey, exceptions
from azure.cosmos.container import ContainerProxy
from azure.cosmos.database import DatabaseProxy
from azure.identity import DefaultAzureCredential

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

# Azure OpenAI Configuration from environment variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")
GENERATION_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o")

# Azure Cosmos DB Configuration from environment variables or existing Python variables
COSMOS_ENDPOINT = os.getenv("COSMOS_ENDPOINT") or ""
COSMOS_KEY = os.getenv("COSMOS_KEY") or ""
COSMOS_DATABASE_NAME = os.getenv("COSMOS_DATABASE_NAME") or "MovieVectorDB"
COSMOS_CONTAINER_NAME = os.getenv("COSMOS_CONTAINER_NAME") or "movies"

# Validate required configuration (from env or Python variables)
required_vars = {
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_API_KEY": AZURE_OPENAI_API_KEY,
    "COSMOS_ENDPOINT": COSMOS_ENDPOINT,
    "COSMOS_KEY": COSMOS_KEY
}
missing_vars = [k for k, v in required_vars.items() if not v]
if missing_vars:
    raise ValueError(f"Missing required configuration variables: {', '.join(missing_vars)}")

# Initialize Azure OpenAI client
openai_client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION
)
# Initialize Cosmos DB client using AAD (token-based) authentication
aad_credential = DefaultAzureCredential()
cosmos_client = CosmosClient(COSMOS_ENDPOINT, COSMOS_KEY)

print("All libraries imported successfully!")
print(f"Azure OpenAI configured with embedding model: {EMBEDDING_MODEL}")
print(f"Generation model: {GENERATION_MODEL}")
print(f"Cosmos DB configured for database: {COSMOS_DATABASE_NAME}")
print("✅ Environment variables loaded from .env file")

All libraries imported successfully!
Azure OpenAI configured with embedding model: text-embedding-ada-002
Generation model: gpt-4o
Cosmos DB configured for database: MovieVectorDB
✅ Environment variables loaded from .env file


## 1. Load and Inspect the Movies Dataset

Let's start by loading the movies dataset and examining its structure.

In [35]:
# Load the movies metadata
movies_df = pd.read_csv(r'..\data\moviesdataset\movies_metadata.csv', low_memory=False)

# Load ratings data (using smaller dataset for demo)
ratings_df = pd.read_csv(r'..\data\moviesdataset\ratings_small.csv')

print("Movies Dataset Shape:", movies_df.shape)
print("Ratings Dataset Shape:", ratings_df.shape)
print("\nMovies Dataset Columns:")
print(movies_df.columns.tolist())
print("\nFirst few rows of movies dataset:")
movies_df.head()

Movies Dataset Shape: (45466, 24)
Ratings Dataset Shape: (100004, 4)

Movies Dataset Columns:
['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count']

First few rows of movies dataset:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [36]:
# Clean and preprocess the data
def clean_movies_data(df):
    # Remove rows with missing essential data
    df_clean = df.dropna(subset=['title', 'overview', 'id']).copy()
    
    # Convert id to numeric, handling errors
    df_clean['id'] = pd.to_numeric(df_clean['id'], errors='coerce')
    df_clean = df_clean.dropna(subset=['id'])
    df_clean['id'] = df_clean['id'].astype(int)
    
    # Fill missing overviews
    df_clean['overview'] = df_clean['overview'].fillna('')
    
    # Create a combined text field for embedding generation
    df_clean['combined_text'] = (
        df_clean['title'].fillna('') + ' ' + 
        df_clean['overview'].fillna('') + ' ' + 
        df_clean['genres'].fillna('')
    )
    
    return df_clean

# Clean the data
movies_clean = clean_movies_data(movies_df)

# Take a subset for demo (first 1000 movies)
movies_subset = movies_clean.head(1000).copy()

print(f"Cleaned dataset shape: {movies_subset.shape}")
print(f"Sample of combined text:")
print(movies_subset['combined_text'].head(3).tolist())

Cleaned dataset shape: (1000, 25)
Sample of combined text:
["Toy Story Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences. [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]", "Jumanji When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures. [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]", "Grumpier Old Men A f

## 2. Setup Azure Cosmos DB Database and Container

Create the database and container with vector search capabilities.

In [40]:
# Create or get Cosmos DB database and container
def setup_cosmos_db():
    """Setup Cosmos DB database and container with vector search capabilities"""
    
    try:
        # Create database if it doesn't exist
        database = cosmos_client.create_database_if_not_exists(id=COSMOS_DATABASE_NAME)
        print(f"✅ Database '{COSMOS_DATABASE_NAME}' ready")
        
        # Define vector embedding policy for 1536-dimensional vectors (text-embedding-ada-002)
        vector_embedding_policy = {
            "vectorEmbeddings": [
                {
                    "path": "/embedding",
                    "dataType": "float32",
                    "distanceFunction": "cosine",
                    "dimensions": 1536
                }
            ]
        }
        
        # Define indexing policy with vector index
        indexing_policy = {
            "indexingMode": "consistent",
            "automatic": True,
            "includedPaths": [
                {
                    "path": "/*"
                }
            ],
            "excludedPaths": [
                {
                    "path": "/embedding/*"
                }
            ],
            "vectorIndexes": [
                {
                    "path": "/embedding",
                    "type": "quantizedFlat"
                }
            ]
        }
        
        # Create container with vector search capabilities
        container = database.create_container_if_not_exists(
            id=COSMOS_CONTAINER_NAME,
            partition_key=PartitionKey(path="/movie_id"),
            indexing_policy=indexing_policy,
            vector_embedding_policy=vector_embedding_policy,
            offer_throughput=1000  # Set appropriate RU/s for your workload
        )
        
        print(f"✅ Container '{COSMOS_CONTAINER_NAME}' ready with vector search capabilities")
        print(f"   • Vector dimensions: 1536")
        print(f"   • Distance function: cosine")
        print(f"   • Vector index type: quantizedFlat")
        
        return database, container
        
    except exceptions.CosmosHttpResponseError as e:
        print(f"❌ Error setting up Cosmos DB: {e}")
        return None, None

# Setup the database and container
database, container = setup_cosmos_db()

if container:
    print("\n📊 Container Properties:")
    container_properties = container.read()
    print(f"   • Partition key: {container_properties['partitionKey']['paths'][0]}")
    print(f"   • RU/s provisioned: {container.read_offer()['content']['offerThroughput'] if container.read_offer() else 'Serverless'}")
    print("   • Vector search: Enabled")

✅ Database 'MovieVectorDB' ready
✅ Container 'movies' ready with vector search capabilities
   • Vector dimensions: 1536
   • Distance function: cosine
   • Vector index type: quantizedFlat

📊 Container Properties:
   • Partition key: /movie_id


TypeError: 'ThroughputProperties' object is not subscriptable

## 3. Generate Azure OpenAI Vector Embeddings

Create vector embeddings for movies using Azure OpenAI text-embedding-ada-002 model.

In [38]:
# Generate vector embeddings using Azure OpenAI
def get_azure_openai_embedding(text, model=EMBEDDING_MODEL):
    """Get embedding from Azure OpenAI"""
    try:
        response = openai_client.embeddings.create(
            input=text,
            model=model
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

def generate_movie_embeddings_batch(movies_df, batch_size=100):
    """Generate Azure OpenAI embeddings for movies in batches"""
    
    embeddings = []
    failed_count = 0
    
    print(f"Generating embeddings for {len(movies_df)} movies using {EMBEDDING_MODEL}...")
    print(f"Processing in batches of {batch_size}")
    
    for i in range(0, len(movies_df), batch_size):
        batch = movies_df.iloc[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(movies_df)-1)//batch_size + 1}")
        
        batch_texts = batch['combined_text'].tolist()
        
        try:
            # Get embeddings for the entire batch
            response = openai_client.embeddings.create(
                input=batch_texts,
                model=EMBEDDING_MODEL
            )
            
            # Extract embeddings from response
            batch_embeddings = [item.embedding for item in response.data]
            embeddings.extend(batch_embeddings)
            
            # Rate limiting - wait between batches
            time.sleep(1)
            
        except Exception as e:
            print(f"Error processing batch {i//batch_size + 1}: {e}")
            # Add None for failed embeddings
            embeddings.extend([None] * len(batch_texts))
            failed_count += len(batch_texts)
            
            # Wait longer on error
            time.sleep(5)
    
    print(f"Embedding generation completed!")
    print(f"Successfully generated: {len(embeddings) - failed_count} embeddings")
    print(f"Failed: {failed_count} embeddings")
    
    return embeddings

# Generate embeddings - using smaller subset for demo to manage API costs
print("Note: Using first 50 movies for demo to manage Azure OpenAI API costs")
movies_demo = movies_subset.head(50).copy()

# Generate embeddings
embeddings = generate_movie_embeddings_batch(movies_demo, batch_size=10)

# Add embeddings to dataframe, handling failed cases
movies_demo['embedding'] = embeddings
movies_demo = movies_demo.dropna(subset=['embedding'])  # Remove rows with failed embeddings

print(f"Final dataset size: {len(movies_demo)} movies with embeddings")
print(f"Embedding dimensions: {len(movies_demo['embedding'].iloc[0]) if len(movies_demo) > 0 else 'N/A'}")

if len(movies_demo) > 0:
    print("Sample embedding (first 10 dimensions):")
    print(movies_demo['embedding'].iloc[0][:10])

Note: Using first 50 movies for demo to manage Azure OpenAI API costs
Generating embeddings for 50 movies using text-embedding-ada-002...
Processing in batches of 10
Processing batch 1/5
Processing batch 2/5
Processing batch 2/5
Processing batch 3/5
Processing batch 3/5
Processing batch 4/5
Processing batch 4/5
Processing batch 5/5
Processing batch 5/5
Embedding generation completed!
Successfully generated: 50 embeddings
Failed: 0 embeddings
Final dataset size: 50 movies with embeddings
Embedding dimensions: 1536
Sample embedding (first 10 dimensions):
[-0.012333696708083153, -0.04674243927001953, -0.01232033409178257, -0.02140691503882408, -0.01333589293062687, -0.008692382834851742, 0.015179933980107307, -0.004546630661934614, -0.01769210584461689, -0.02005729079246521]
Embedding generation completed!
Successfully generated: 50 embeddings
Failed: 0 embeddings
Final dataset size: 50 movies with embeddings
Embedding dimensions: 1536
Sample embedding (first 10 dimensions):
[-0.012333696

## 4. Insert Movie Data and Vectors into Cosmos DB

Load the movie data and their vector embeddings into Cosmos DB.

In [42]:
# Insert movie data with Azure OpenAI embeddings into Cosmos DB
def insert_movie_data_cosmos(movies_df, container):
    """Insert movie data and Azure OpenAI vectors into Cosmos DB"""
    
    if not container:
        print("No Cosmos DB container")
        return
    
    try:
        print("Inserting movie data with Azure OpenAI embeddings into Cosmos DB...")
        
        movies_inserted = 0
        vectors_inserted = 0
        
        for idx, row in movies_df.iterrows():
            try:
                # Create document structure for Cosmos DB
                movie_doc = {
                    "id": str(int(row['id'])),  # Cosmos DB id must be string
                    "movie_id": int(row['id']),  # Partition key
                    "title": str(row['title'])[:500],  # Truncate if too long
                    "overview": str(row['overview']) if pd.notna(row['overview']) else None,
                    "genres": str(row['genres']) if pd.notna(row['genres']) else None,
                    "release_date": str(row['release_date']) if pd.notna(row['release_date']) else None,
                    "budget": int(row['budget']) if pd.notna(row['budget']) and str(row['budget']).replace('.', '').isdigit() else None,
                    "revenue": int(row['revenue']) if pd.notna(row['revenue']) and str(row['revenue']).replace('.', '').isdigit() else None,
                    "runtime": float(row['runtime']) if pd.notna(row['runtime']) else None,
                    "vote_average": float(row['vote_average']) if pd.notna(row['vote_average']) else None,
                    "vote_count": int(row['vote_count']) if pd.notna(row['vote_count']) else None,
                    "popularity": float(row['popularity']) if pd.notna(row['popularity']) else None,
                    "original_language": str(row['original_language'])[:10] if pd.notna(row['original_language']) else None,
                    "combined_text": str(row['combined_text']),
                    "embedding_model": EMBEDDING_MODEL,
                    "created_at": time.strftime('%Y-%m-%dT%H:%M:%SZ'),
                    "document_type": "movie"
                }
                
                # Add embedding if available
                if row['embedding'] is not None and len(row['embedding']) > 0:
                    movie_doc['embedding'] = row['embedding']
                    vectors_inserted += 1
                
                # Insert into Cosmos DB
                container.create_item(movie_doc)
                movies_inserted += 1
                
                if movies_inserted % 10 == 0:
                    print(f"Inserted {movies_inserted} movies, {vectors_inserted} vectors...")
                    
            except Exception as e:
                print(f"Error inserting movie {row.get('id', 'unknown')}: {e}")
                continue
        
        print(f"✅ Successfully inserted {movies_inserted} movies and {vectors_inserted} Azure OpenAI vectors")
        
    except Exception as e:
        print(f"❌ Error during insertion: {e}")

# Insert the data (only movies with successful embeddings)
if container:
    insert_movie_data_cosmos(movies_demo, container)
else:
    print("❌ Container not available for data insertion")

Inserting movie data with Azure OpenAI embeddings into Cosmos DB...
Inserted 10 movies, 10 vectors...
Inserted 10 movies, 10 vectors...
Inserted 20 movies, 20 vectors...
Inserted 20 movies, 20 vectors...
Inserted 30 movies, 30 vectors...
Inserted 30 movies, 30 vectors...
Inserted 40 movies, 40 vectors...
Inserted 40 movies, 40 vectors...
Inserted 50 movies, 50 vectors...
✅ Successfully inserted 50 movies and 50 Azure OpenAI vectors
Inserted 50 movies, 50 vectors...
✅ Successfully inserted 50 movies and 50 Azure OpenAI vectors


## 5. Query Movies Container from Python

Verify the data insertion and run basic queries.

In [43]:
# Query the container to verify data insertion
def run_basic_cosmos_queries(container):
    """Run basic queries to verify data"""
    
    if not container:
        print("❌ No container available")
        return
    
    try:
        # Check total count using aggregate query
        count_query = "SELECT VALUE COUNT(1) FROM c WHERE c.document_type = 'movie'"
        total_count = list(container.query_items(query=count_query, enable_cross_partition_query=True))[0]
        print(f"Total movies in container: {total_count}")
        
        # Count movies with embeddings
        vector_count_query = "SELECT VALUE COUNT(1) FROM c WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)"
        vector_count = list(container.query_items(query=vector_count_query, enable_cross_partition_query=True))[0]
        print(f"Total vectors in container: {vector_count}")
        
        # Show sample movie data
        print("\n📽️ Sample movies:")
        sample_query = "SELECT TOP 5 c.movie_id, c.title, c.vote_average, c.popularity FROM c WHERE c.document_type = 'movie' ORDER BY c.popularity DESC"
        
        results = list(container.query_items(query=sample_query, enable_cross_partition_query=True))
        for movie in results:
            print(f"ID: {movie['movie_id']}, Title: {movie['title']}, Rating: {movie.get('vote_average', 'N/A')}, Popularity: {movie.get('popularity', 'N/A')}")
        
        # Show genres distribution
        print("\n🎭 Sample movie with genres:")
        genre_query = "SELECT TOP 3 c.title, c.overview, c.genres FROM c WHERE c.document_type = 'movie' AND IS_DEFINED(c.genres) AND LENGTH(c.genres) > 10"
        
        genre_results = list(container.query_items(query=genre_query, enable_cross_partition_query=True))
        for movie in genre_results:
            print(f"Title: {movie['title']}")
            overview = movie.get('overview', 'No overview')
            print(f"Overview: {overview[:100] if overview else 'N/A'}...")
            print(f"Genres: {movie.get('genres', 'N/A')}")
            print("-" * 50)
            
        # Show embedding info
        print("\n🔢 Embedding information:")
        embedding_query = "SELECT TOP 1 c.title, c.embedding_model, ARRAY_LENGTH(c.embedding) as embedding_dimensions FROM c WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)"
        
        embedding_results = list(container.query_items(query=embedding_query, enable_cross_partition_query=True))
        if embedding_results:
            emb_info = embedding_results[0]
            print(f"Sample movie: {emb_info['title']}")
            print(f"Embedding model: {emb_info['embedding_model']}")
            print(f"Embedding dimensions: {emb_info['embedding_dimensions']}")
        
    except Exception as e:
        print(f"❌ Error querying container: {e}")

# Run basic queries
if container:
    run_basic_cosmos_queries(container)
else:
    print("❌ Container not available for querying")

Total movies in container: 50
Total vectors in container: 50

📽️ Sample movies:
ID: 862, Title: Toy Story, Rating: 7.7, Popularity: 21.946943
ID: 807, Title: Se7en, Rating: 8.1, Popularity: 18.45743
ID: 949, Title: Heat, Rating: 7.7, Popularity: 17.924927
ID: 8844, Title: Jumanji, Rating: 6.9, Popularity: 17.015539
ID: 629, Title: The Usual Suspects, Rating: 8.1, Popularity: 16.302466

🎭 Sample movie with genres:
Title: Toy Story
Overview: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto ...
Genres: [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
--------------------------------------------------
Title: Jumanji
Overview: When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world...
Genres: [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
--------------------------------------------------
Title: Gru

## 6. Vector Similarity Search with Cosmos DB

Demonstrate vector similarity search capabilities in Cosmos DB NoSQL API.

In [44]:
# Vector similarity search using Cosmos DB vector search
def vector_similarity_search_cosmos(container, query_vector, top_k=5):
    """Perform vector similarity search in Cosmos DB"""
    
    if not container:
        print("❌ No container available")
        return []
    
    try:
        # Vector search query using VectorDistance function
        vector_search_query = f"""
        SELECT TOP {top_k} 
            c.title, 
            c.overview, 
            c.genres, 
            c.vote_average, 
            c.popularity,
            VectorDistance(c.embedding, @queryVector) AS similarity_score
        FROM c 
        WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)
        ORDER BY VectorDistance(c.embedding, @queryVector)
        """
        
        # Execute vector search
        results = list(container.query_items(
            query=vector_search_query,
            parameters=[
                {"name": "@queryVector", "value": query_vector}
            ],
            enable_cross_partition_query=True
        ))
        
        return results
        
    except Exception as e:
        print(f"❌ Error during vector search: {e}")
        return []

# Advanced semantic search functionality using Azure OpenAI + Cosmos DB
def semantic_movie_search_cosmos(query_text, container, top_k=5):
    """Perform semantic search for movies using Azure OpenAI + Cosmos DB"""
    
    if not container:
        print("No container available")
        return []
    
    # Get query embedding from Azure OpenAI
    print(f"Getting embedding for query: '{query_text}'")
    query_embedding = get_azure_openai_embedding(query_text)
    
    if query_embedding is None:
        print("Failed to get query embedding")
        return []
    
    # Perform vector search
    results = vector_similarity_search_cosmos(container, query_embedding, top_k)
    
    return results

# Example searches using Azure OpenAI embeddings + Cosmos DB vector search
if container:
    print("🔍 Semantic Movie Search with Azure OpenAI + Cosmos DB Vector Search\n")
    
    # Search 1: Action movies
    print("1️⃣ Search: 'action adventure superhero movies with explosions'")
    results = semantic_movie_search_cosmos("action adventure superhero movies with explosions", container, top_k=3)
    for i, movie in enumerate(results, 1):
        print(f"  {i}. {movie['title']} (Similarity Score: {movie['similarity_score']:.4f})")
        overview = movie.get('overview', 'No overview')
        print(f"     Overview: {overview[:100] if overview else 'N/A'}...")
        print(f"     Rating: {movie.get('vote_average', 'N/A')}, Popularity: {movie.get('popularity', 'N/A')}")
        print()
    
    print("-" * 60)
    
    # Search 2: Romance movies
    print("2️⃣ Search: 'romantic love story with emotional drama'")
    results = semantic_movie_search_cosmos("romantic love story with emotional drama", container, top_k=3)
    for i, movie in enumerate(results, 1):
        print(f"  {i}. {movie['title']} (Similarity Score: {movie['similarity_score']:.4f})")
        overview = movie.get('overview', 'No overview')
        print(f"     Overview: {overview[:100] if overview else 'N/A'}...")
        print()
    
    print("-" * 60)
    
    # Search 3: Family movies
    print("3️⃣ Search: 'family friendly animated movie for children'")
    results = semantic_movie_search_cosmos("family friendly animated movie for children", container, top_k=3)
    for i, movie in enumerate(results, 1):
        print(f"  {i}. {movie['title']} (Similarity Score: {movie['similarity_score']:.4f})")
        overview = movie.get('overview', 'No overview')
        print(f"     Overview: {overview[:100] if overview else 'N/A'}...")
        print()
else:
    print("❌ Container not available for vector search")

🔍 Semantic Movie Search with Azure OpenAI + Cosmos DB Vector Search

1️⃣ Search: 'action adventure superhero movies with explosions'
Getting embedding for query: 'action adventure superhero movies with explosions'
  1. Sudden Death (Similarity Score: 0.8232)
     Overview: International action superstar Jean Claude Van Damme teams with Powers Boothe in a Tension-packed, s...
     Rating: 5.5, Popularity: 5.23158

  2. Mortal Kombat (Similarity Score: 0.7973)
     Overview: For nine generations an evil sorcerer has been victorious in hand-to-hand battle against his mortal ...
     Rating: 5.4, Popularity: 10.870138

  3. Heat (Similarity Score: 0.7966)
     Overview: Obsessive master thief, Neil McCauley leads a top-notch crew on various insane heists throughout Los...
     Rating: 7.7, Popularity: 17.924927

------------------------------------------------------------
2️⃣ Search: 'romantic love story with emotional drama'
Getting embedding for query: 'romantic love story with emotional

## 7. RAG with Azure OpenAI GPT-4o and Cosmos DB

Implement a complete RAG (Retrieval Augmented Generation) system using Cosmos DB as vector store and Azure OpenAI GPT-4o for generation.

In [45]:
# RAG Implementation using Azure OpenAI GPT-4o and Cosmos DB as Vector Store

def get_azure_openai_completion(messages, model=GENERATION_MODEL, max_tokens=1000):
    """Get completion from Azure OpenAI GPT-4o"""
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=0.7
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error getting completion: {e}")
        return None

def rag_movie_recommendation_cosmos(user_query, container, top_k=5):
    """Complete RAG pipeline for movie recommendations using Cosmos DB"""
    
    print(f"🎬 **RAG Movie Recommendation System (Cosmos DB)**")
    print(f"Query: '{user_query}'\n")
    
    # Step 1: Retrieve relevant movies using Cosmos DB vector similarity
    print("🔍 Step 1: Retrieving relevant movies from Cosmos DB...")
    retrieved_movies = semantic_movie_search_cosmos(user_query, container, top_k=top_k)
    
    if not retrieved_movies:
        return "Sorry, I couldn't find any relevant movies for your query."
    
    # Step 2: Format context for GPT-4o
    print("📝 Step 2: Formatting context for GPT-4o...")
    
    context_parts = []
    for i, movie in enumerate(retrieved_movies, 1):
        title = movie['title']
        overview = movie.get('overview', 'No overview available')
        genres = movie.get('genres', 'Unknown')
        rating = movie.get('vote_average', 'N/A')
        popularity = movie.get('popularity', 0)
        similarity = movie['similarity_score']
        
        context_parts.append(
            f"Movie {i}: {title}\n"
            f"Overview: {overview}\n"
            f"Genres: {genres}\n"
            f"Rating: {rating}/10, Popularity: {popularity:.1f}\n"
            f"Similarity Score: {similarity:.4f}\n"
        )
    
    context = "\n".join(context_parts)
    
    # Step 3: Generate response using GPT-4o
    print("🤖 Step 3: Generating recommendation with GPT-4o...")
    
    system_prompt = """You are a knowledgeable movie recommendation assistant. Based on the provided movie data from a Cosmos DB vector database search, give personalized movie recommendations that match the user's query. 
    
    Instructions:
    - Analyze the retrieved movies and their similarity scores
    - Recommend the most relevant movies based on the user's preferences
    - Explain why each movie matches their query
    - Include brief details about plot, genre, and ratings
    - Be conversational and helpful
    - If applicable, suggest similar themes or related movies"""
    
    user_prompt = f"""User Query: {user_query}
    
    Retrieved Movies from Cosmos DB Vector Database:
    {context}
    
    Please provide personalized movie recommendations based on this data."""
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    
    recommendation = get_azure_openai_completion(messages)
    
    return {
        "query": user_query,
        "retrieved_movies": retrieved_movies,
        "recommendation": recommendation,
        "context_used": context
    }

# Example RAG queries using Cosmos DB
if container:
    print("🎯 **RAG Demo with Azure OpenAI GPT-4o + Cosmos DB**\n")
    
    # RAG Example 1
    query1 = "I want to watch a thrilling action movie with great special effects"
    result1 = rag_movie_recommendation_cosmos(query1, container)
    
    if result1:
        print("\n" + "="*80)
        print("🎬 **RAG RECOMMENDATION 1**")
        print("="*80)
        print(f"**Query:** {result1['query']}")
        print(f"\n**GPT-4o Recommendation:**")
        print(result1['recommendation'])
        print("\n" + "="*80)
    
    # RAG Example 2
    query2 = "Recommend me a heartwarming family movie for weekend viewing"
    result2 = rag_movie_recommendation_cosmos(query2, container)
    
    if result2:
        print("\n🎬 **RAG RECOMMENDATION 2**")
        print("="*80)
        print(f"**Query:** {result2['query']}")
        print(f"\n**GPT-4o Recommendation:**")
        print(result2['recommendation'])
        print("\n" + "="*80)
    
    print("\n✅ **RAG System Components:**")
    print("• **Embedding Model:** text-embedding-ada-002 (1536 dimensions)")
    print("• **Vector Database:** Azure Cosmos DB NoSQL API with vector search")
    print("• **Generation Model:** GPT-4o")
    print("• **Similarity Function:** Cosine similarity (VectorDistance)")
    print("• **Retrieval Strategy:** Top-K similarity search")
    print("• **Index Type:** Quantized Flat vector index")
else:
    print("❌ Container not available for RAG demonstration")

🎯 **RAG Demo with Azure OpenAI GPT-4o + Cosmos DB**

🎬 **RAG Movie Recommendation System (Cosmos DB)**
Query: 'I want to watch a thrilling action movie with great special effects'

🔍 Step 1: Retrieving relevant movies from Cosmos DB...
Getting embedding for query: 'I want to watch a thrilling action movie with great special effects'
📝 Step 2: Formatting context for GPT-4o...
🤖 Step 3: Generating recommendation with GPT-4o...

🎬 **RAG RECOMMENDATION 1**
**Query:** I want to watch a thrilling action movie with great special effects

**GPT-4o Recommendation:**
Given your interest in thrilling action movies with great special effects, here are some personalized recommendations based on the retrieved movies:

1. **GoldenEye**
    - **Overview:** James Bond (Pierce Brosnan) must unmask the mysterious head of the Janus Syndicate and prevent the leader from utilizing the GoldenEye weapons system to inflict devastating revenge on Britain.
    - **Genres:** Action, Adventure, Thriller
    - **Ra

## 8. Performance Analysis and Cosmos DB Vector Features

Analyze the performance of vector operations and explore Cosmos DB vector capabilities.

In [46]:
# Performance Analysis and Cosmos DB Vector Features Exploration
import time

def analyze_cosmos_vector_performance(container):
    """Analyze Cosmos DB vector search performance"""
    
    if not container:
        print("❌ No container available")
        return
    
    print("⚡ **Cosmos DB Vector Performance Analysis**\n")
    
    try:
        # 1. Test query performance with 1536-dimensional vectors
        print("1️⃣ **Vector Similarity Query Performance (1536D):**")
        
        # Get a sample vector for testing
        sample_query = "SELECT TOP 1 c.embedding FROM c WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)"
        sample_result = list(container.query_items(query=sample_query, enable_cross_partition_query=True))
        
        if sample_result:
            target_vector = sample_result[0]['embedding']
            
            start_time = time.time()
            
            perf_query = """
            SELECT TOP 10
                c.title,
                VectorDistance(c.embedding, @targetVector) AS similarity
            FROM c
            WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)
            ORDER BY VectorDistance(c.embedding, @targetVector)
            """
            
            results = list(container.query_items(
                query=perf_query,
                parameters=[
                    {"name": "@targetVector", "value": target_vector}
                ],
                enable_cross_partition_query=True
            ))
            
            end_time = time.time()
            
            print(f"   • Query executed in: {(end_time - start_time)*1000:.2f} ms")
            print(f"   • Results returned: {len(results)}")
            if results:
                print(f"   • Top result: {results[0]['title']} (similarity: {results[0]['similarity']:.4f})")
        
        # 2. Check embedding storage efficiency
        print("\n2️⃣ **Embedding Storage in Cosmos DB:**")
        
        storage_query = """
        SELECT 
            COUNT(1) as total_vectors,
            c.embedding_model,
            ARRAY_LENGTH(c.embedding) as embedding_dimensions
        FROM c 
        WHERE c.document_type = 'movie' AND IS_DEFINED(c.embedding)
        GROUP BY c.embedding_model
        """
        
        storage_results = list(container.query_items(query=storage_query, enable_cross_partition_query=True))
        for stat in storage_results:
            print(f"   • Model: {stat['embedding_model']}")
            print(f"   • Total vectors: {stat['total_vectors']}")
            print(f"   • Embedding dimensions: {stat['embedding_dimensions']}")
        
        # 3. Container statistics
        print("\n3️⃣ **Container Statistics:**")
        
        # Get container properties
        container_properties = container.read()
        print(f"   • Partition key: {container_properties['partitionKey']['paths'][0]}")
        print(f"   • Vector embedding policy: {len(container_properties.get('vectorEmbeddingPolicy', {}).get('vectorEmbeddings', []))} vector paths")
        print(f"   • Vector indexes: {len(container_properties.get('indexingPolicy', {}).get('vectorIndexes', []))} configured")
        
        # Try to get RU consumption (if available)
        try:
            offer = container.read_offer()
            if offer:
                print(f"   • Provisioned RU/s: {offer['content']['offerThroughput']}")
        except:
            print(f"   • Throughput: Serverless or auto-scale")
        
        # 4. Vector index information
        print("\n4️⃣ **Vector Index Configuration:**")
        
        vector_indexes = container_properties.get('indexingPolicy', {}).get('vectorIndexes', [])
        for idx in vector_indexes:
            print(f"   • Path: {idx['path']}, Type: {idx['type']}")
        
        vector_embeddings = container_properties.get('vectorEmbeddingPolicy', {}).get('vectorEmbeddings', [])
        for emb in vector_embeddings:
            print(f"   • Path: {emb['path']}, Dimensions: {emb['dimensions']}, Distance: {emb['distanceFunction']}")
        
    except Exception as e:
        print(f"❌ Error during performance analysis: {e}")

# Run performance analysis
if container:
    analyze_cosmos_vector_performance(container)

    print("\n" + "="*80)
    print("\n🎯 **Cosmos DB Vector Integration Summary:**")
    print("\n✅ **Key Capabilities Demonstrated:**")
    print("   • Native 1536-dimensional vector support in Cosmos DB NoSQL API")
    print("   • Azure OpenAI text-embedding-ada-002 integration")
    print("   • GPT-4o for RAG generation")
    print("   • VectorDistance function for similarity search")
    print("   • Real-time semantic search capabilities")
    print("   • JSON document storage with vector embeddings")
    
    print("\n🚀 **Production Use Cases:**")
    print("   • Semantic movie search and recommendations")
    print("   • Content-based filtering systems")
    print("   • RAG applications for customer support")
    print("   • Multi-modal search combining text and metadata")
    print("   • Real-time personalization engines")
    print("   • Document similarity and clustering")
    
    print("\n💰 **Cost Optimization Tips:**")
    print("   • Use serverless for unpredictable workloads")
    print("   • Optimize partition key strategy")
    print("   • Cache embeddings to avoid repeated API calls")
    print("   • Use batch operations for bulk inserts")
    print("   • Monitor RU consumption for vector queries")
    
    print("\n📈 **Performance Best Practices:**")
    print("   • Configure appropriate vector index type")
    print("   • Use TOP N queries to limit result sets")
    print("   • Optimize partition key for query patterns")
    print("   • Enable cross-partition queries when needed")
    print("   • Monitor query RU charges")
    
    print("\n🔧 **Azure Services Configuration:**")
    print(f"   • Embedding Model: {EMBEDDING_MODEL} (1536 dimensions)")
    print(f"   • Generation Model: {GENERATION_MODEL}")
    print(f"   • Cosmos DB: {COSMOS_DATABASE_NAME}/{COSMOS_CONTAINER_NAME}")
    print("   • Vector Index: Quantized Flat")
    print("   • Distance Function: Cosine")
else:
    print("❌ Container not available for performance analysis")

⚡ **Cosmos DB Vector Performance Analysis**

1️⃣ **Vector Similarity Query Performance (1536D):**
   • Query executed in: 221.47 ms
   • Results returned: 10
   • Top result: Toy Story (similarity: 1.0000)

2️⃣ **Embedding Storage in Cosmos DB:**
❌ Error during performance analysis: (BadRequest) Message: {"errors":[{"severity":"Error","location":{"start":112,"end":123},"code":"SC2102","message":"Property reference 'c.embedding' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."}]}
ActivityId: 5918a1e9-7ab0-4d2a-b370-3143fe59f88f, Microsoft.Azure.Documents.Common/2.14.0
Code: BadRequest
Message: Message: {"errors":[{"severity":"Error","location":{"start":112,"end":123},"code":"SC2102","message":"Property reference 'c.embedding' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."}]}
ActivityId: 5918a1e9-7ab0-4d2a-b370-3143fe59f88f, Microsoft.Azure.Documents.Commo

## 9. Cleanup and Next Steps

### Cleanup Resources

In [None]:
# Cleanup resources (uncomment to run)
def cleanup_cosmos_resources(database, container):
    """Clean up Cosmos DB resources"""
    
    if not database:
        print("❌ No database connection")
        return
    
    print("🧹 Cleaning up Cosmos DB resources...")
    
    try:
        # Option 1: Delete all documents (keeps container)
        if container:
            print("Deleting all documents...")
            
            # Query all documents
            all_docs = list(container.query_items(
                query="SELECT c.id, c.movie_id FROM c WHERE c.document_type = 'movie'",
                enable_cross_partition_query=True
            ))
            
            for doc in all_docs:
                container.delete_item(item=doc['id'], partition_key=doc['movie_id'])
            
            print(f"✅ Deleted {len(all_docs)} documents")
        
        # Option 2: Delete entire container (uncomment if needed)
        # if container:
        #     database.delete_container(container)
        #     print("✅ Container deleted")
        
        # Option 3: Delete entire database (uncomment if needed)
        # cosmos_client.delete_database(database)
        # print("✅ Database deleted")
        
        print("✅ Cleanup completed")
        
    except Exception as e:
        print(f"❌ Error during cleanup: {e}")

# Uncomment the next line to clean up resources
# cleanup_cosmos_resources(database, container)

print("\n🎬 **Azure Cosmos DB Vector Database Demo Complete!** 🎬")