# Electronics Store Product Catalog Generator

This notebook generates a comprehensive electronics product catalog for Azure Cosmos DB with NoSQL API.

## Key Features
- **18 Electronics Categories** across 4 product types (Computers, Devices, Accessories, Peripherals)
- **Price-Review Correlations**: Different correlation patterns based on category type
  - Computers: No correlation (random ratings)
  - Devices: Inverse correlation (lower price = better reviews)
  - Accessories: Strong positive correlation (higher price = better reviews)
  - Peripherals: Moderate positive correlation (higher price = somewhat better reviews)
- **Shared Container Design**: Products and reviews in same container with `docType` field
- **Date Range**: January 2024 - October 2025
- **Vector Search Ready**: Optional embeddings for similarity search
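As a rough illustration of the shared-container design, the sketch below builds one product and one review document and separates them by `docType`. The field names follow the documents generated later in this notebook; the values and the `by_doc_type` helper are made up for the example.

```python
import uuid

# Hypothetical sample documents; all values are invented for illustration
product_id = str(uuid.uuid4())
product = {
    "id": product_id,
    "docType": "product",
    "productId": product_id,   # for products, productId equals id
    "name": "Example Laptop 15",
    "categoryName": "Computers, Laptops",
    "currentPrice": 999.99,
}
review = {
    "id": str(uuid.uuid4()),
    "docType": "review",
    "productId": product_id,   # links the review back to its product
    "categoryName": "Computers, Laptops",
    "stars": 4,
}

def by_doc_type(documents, doc_type):
    """Return only the documents of the given docType."""
    return [doc for doc in documents if doc.get("docType") == doc_type]

container_items = [product, review]
print(len(by_doc_type(container_items, "product")))
print(len(by_doc_type(container_items, "review")))
```

Both document types carry `productId` and `categoryName`, so either property could serve to relate reviews to their product.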

## Output Files
- `electronics_catalog.json` - Full catalog with vector embeddings
- `electronics_catalog_no_vectors.json` - Lightweight version without vectors

## 1. Install Required Packages

In [4]:
! pip install python-dotenv
! pip install python-dateutil
! pip install azure-core
! pip install azure-cosmos
! pip install azure-identity
! pip install openai

## 2. Import Libraries

In [16]:
import json
import re
import random
import string
import uuid
import datetime
from dotenv import dotenv_values
from dateutil.relativedelta import relativedelta

# Azure OpenAI imports
from openai import AzureOpenAI


# Azure Cosmos DB imports
from azure.cosmos import CosmosClient, PartitionKey

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## 3. Load Configuration

In [40]:
# Specify the name of the .env file
env_name = "my-config.env"  # Change to "config.env" if needed
config = dotenv_values(env_name)

# OpenAI configuration
OPENAI_API_ENDPOINT = config['openai_endpoint']
OPENAI_API_VERSION = config['openai_api_version']
COMPLETIONS_MODEL_DEPLOYMENT = config['openai_completions_deployment']
EMBEDDING_MODEL_DEPLOYMENT = config['openai_embeddings_deployment']
EMBEDDING_DIMENSIONS = int(config['openai_embeddings_dimensions'])

# Azure Cosmos DB configuration (optional - for upload)
COSMOS_ENDPOINT = config.get('cosmos_uri', '')
COSMOS_KEY = config.get('cosmos_key', '')
COSMOS_DATABASE = config.get('cosmos_database', '')
COSMOS_CONTAINER = config.get('cosmos_product_container', '')

print("✅ Configuration loaded successfully")

✅ Configuration loaded successfully
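Because `config[...]` raises a bare `KeyError` when a setting is absent, it can help to check the file up front. The sketch below is one way to do that; `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names, and the key list simply mirrors the settings read above.

```python
# Keys the notebook reads with config[...] (the Cosmos DB keys are optional)
REQUIRED_KEYS = [
    "openai_endpoint",
    "openai_api_version",
    "openai_completions_deployment",
    "openai_embeddings_deployment",
    "openai_embeddings_dimensions",
]

def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return required keys that are absent or empty in the config mapping."""
    return [key for key in required if not config.get(key)]

# Example with an incomplete, made-up configuration
sample_config = {
    "openai_endpoint": "https://example.invalid",
    "openai_api_version": "",
}
print(find_missing_keys(sample_config))
```

Calling `find_missing_keys(config)` right after `dotenv_values(...)` would surface every missing setting at once rather than one `KeyError` at a time.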


## 4. Initialize Azure OpenAI Client

In [41]:
# Create Azure OpenAI client using Entra ID (Azure Identity)
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), 
    "https://cognitiveservices.azure.com/.default"
)

AOAI_client = AzureOpenAI(
    azure_endpoint = OPENAI_API_ENDPOINT, 
    api_version = OPENAI_API_VERSION,
    azure_ad_token_provider = token_provider
)

print("✅ Azure OpenAI client initialized with Entra ID authentication")


✅ Azure OpenAI client initialized with Entra ID authentication


### 4.1 Test Embedding Generation

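This cell checks that real embeddings are being generated rather than the mock/random fallback. It calls `generate_embeddings`, which is defined in section 5.1, so run that cell first.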


In [None]:
# Test that we can generate REAL embeddings (not mock ones)
print("🧪 Testing embedding generation...")

test_text = "This is a test product for embedding generation."

try:
    test_embedding = generate_embeddings(test_text)
    
    # Check if embeddings are real (not all random)
    # Real embeddings from Azure OpenAI have specific patterns
    if test_embedding and len(test_embedding) == EMBEDDING_DIMENSIONS:
        # Check if values are not just random [-1, 1] range
        # Real embeddings are typically smaller values and more structured
        avg_value = sum(test_embedding) / len(test_embedding)
        max_value = max(test_embedding)
        min_value = min(test_embedding)
        
        print(f"✅ Embedding generated successfully!")
        print(f"   Dimensions: {len(test_embedding)}")
        print(f"   Average value: {avg_value:.6f}")
        print(f"   Min value: {min_value:.6f}")
        print(f"   Max value: {max_value:.6f}")
        print(f"   First 5 values: {test_embedding[:5]}")
        
        # Real embeddings typically have smaller absolute values and don't reach -1 or 1
        if abs(max_value) < 0.8 and abs(min_value) < 0.8:
            print("\n🎉 These appear to be REAL Azure OpenAI embeddings!")
        else:
            print("\n⚠️  Warning: These might be mock embeddings (values too large)")
    else:
        print(f"❌ Embedding generation failed or returned wrong dimensions")
        
except Exception as e:
    print(f"❌ Error testing embeddings: {e}")
    print("   Please check your Azure OpenAI configuration")


## 5. Define Core Functions

### 5.1 Azure OpenAI Completion and Embedding Functions

In [42]:
def generate_electronics_completion(user_prompt, max_tokens=100):
    """
    Generate AI completions for electronics product data.
    """
    system_prompt = '''
    You are a product manager for an electronics e-commerce website that sells computers, accessories, peripherals, and mobile devices.
    Your job is to create a comprehensive product catalog for electronics that will be used by the company's website.
    Focus on realistic product names, descriptions, and pricing for modern electronics.
    '''
    
    messages=[{"role": "system", "content": system_prompt}]
    messages.append({"role": "user", "content": user_prompt})
    
    response = AOAI_client.chat.completions.create(
        model = COMPLETIONS_MODEL_DEPLOYMENT,
        messages = messages,
        max_tokens = max_tokens
    )
    
    # Read the generated text directly from the typed response object
    return response.choices[0].message.content

def generate_embeddings(text):
    """
    Generate embeddings from string of text.
    This will be used to vectorize data for Azure Cosmos DB vector search.
    """
    try:
        response = AOAI_client.embeddings.create(
            input = text, 
            model = EMBEDDING_MODEL_DEPLOYMENT,
            dimensions = EMBEDDING_DIMENSIONS)
        
        
        # Read the vector directly from the typed response object
        return response.data[0].embedding
    except Exception as e:
        print(f"Warning: Could not generate embeddings: {e}")
        # Return mock embeddings if Azure OpenAI is not available
        return [random.uniform(-1, 1) for _ in range(EMBEDDING_DIMENSIONS)]

print("✅ Core AI functions defined")

✅ Core AI functions defined
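For context on what similarity search does with these embeddings, the sketch below computes cosine similarity between small hand-made vectors. It is a plain-Python illustration, not a description of how Cosmos DB evaluates distances internally; `cosine_similarity` is a hypothetical helper and the three-dimensional vectors are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real ones have EMBEDDING_DIMENSIONS entries
laptop = [0.9, 0.1, 0.0]
ultrabook = [0.8, 0.2, 0.1]
speaker = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score higher
print(cosine_similarity(laptop, ultrabook) > cosine_similarity(laptop, speaker))
```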


### 5.2 Electronics Categories and Countries

In [20]:
def get_electronics_categories():
    """
    Define 18 electronics categories with price-review correlation types:
    - 'none': No correlation (computers)
    - 'inverse': Lower price = better reviews (devices)
    - 'strong': Higher price = better reviews (accessories)
    - 'moderate': Moderate correlation (peripherals)
    """
    categories = [
        # Computers - no correlation
        {"name": "Computers, Laptops", "correlation": "none"},
        {"name": "Computers, Desktops", "correlation": "none"},
        {"name": "Computers, Gaming PCs", "correlation": "none"},
        {"name": "Computers, Workstations", "correlation": "none"},
        
        # Devices - inverse correlation (cheaper = better reviews)
        {"name": "Devices, Smartphones", "correlation": "inverse"},
        {"name": "Devices, Tablets", "correlation": "inverse"},
        {"name": "Devices, Smartwatches", "correlation": "inverse"},
        {"name": "Devices, E-readers", "correlation": "inverse"},
        
        # Accessories - strong positive correlation (expensive = better reviews)
        {"name": "Accessories, Premium Headphones", "correlation": "strong"},
        {"name": "Accessories, Luxury Cases", "correlation": "strong"},
        {"name": "Accessories, High-end Chargers", "correlation": "strong"},
        {"name": "Accessories, Designer Stands", "correlation": "strong"},
        
        # Peripherals - moderate correlation
        {"name": "Peripherals, Keyboards", "correlation": "moderate"},
        {"name": "Peripherals, Mice", "correlation": "moderate"},
        {"name": "Peripherals, Monitors", "correlation": "moderate"},
        {"name": "Peripherals, Webcams", "correlation": "moderate"},
        {"name": "Peripherals, Speakers", "correlation": "moderate"},
        {"name": "Peripherals, Microphones", "correlation": "moderate"}
    ]
    
    # Add IDs to categories
    electronics_categories = [
        {
            "id": str(uuid.uuid4()),
            "name": category["name"],
            "correlation": category["correlation"]
        }
        for category in categories
    ]
    
    return electronics_categories

def get_countries_of_origin():
    """Countries for electronics manufacturing"""
    return [
        "China", "South Korea", "Japan", "Taiwan", "USA", 
        "Germany", "Sweden", "Finland", "Canada", "Singapore",
        "India", "Vietnam", "Thailand", "Malaysia", "Mexico",
        "Brazil", "Ireland", "Netherlands", "Czech Republic", "Poland"
    ]

print("✅ Category and country functions defined")

✅ Category and country functions defined
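The "18 categories across 4 product types" breakdown can be checked by grouping the category names on the text before the comma. The sketch below copies the names from `get_electronics_categories()`; `product_type` is a hypothetical helper.

```python
from collections import Counter

# Category names copied from get_electronics_categories()
category_names = [
    "Computers, Laptops", "Computers, Desktops",
    "Computers, Gaming PCs", "Computers, Workstations",
    "Devices, Smartphones", "Devices, Tablets",
    "Devices, Smartwatches", "Devices, E-readers",
    "Accessories, Premium Headphones", "Accessories, Luxury Cases",
    "Accessories, High-end Chargers", "Accessories, Designer Stands",
    "Peripherals, Keyboards", "Peripherals, Mice", "Peripherals, Monitors",
    "Peripherals, Webcams", "Peripherals, Speakers", "Peripherals, Microphones",
]

def product_type(category_name):
    """Return the part of the category name before the first comma."""
    return category_name.split(",", 1)[0].strip()

type_counts = Counter(product_type(name) for name in category_names)
print(dict(type_counts))
print(sum(type_counts.values()))
```

Peripherals has six categories while the other three types have four each, so peripherals contribute proportionally more products at a fixed `products_per_category`.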



### 5.3 Product Data Generation Functions

In [21]:
def generate_electronics_product_name(category_name, existing_names=None):
    """Generate a unique product name for the category"""
    if existing_names is None:
        existing_names = set()
    
    max_attempts = 5
    for attempt in range(max_attempts):
        prompt = f"Generate a realistic and appealing product name for a product in the category '{category_name}'. "
        prompt += "Include brand-style naming and model identifiers typical for electronics. "
        prompt += "Make it unique and creative. "
        if attempt > 0:
            prompt += f"Try variation #{attempt + 1}. "
        prompt += "Return only the product name."
        
        # Strip surrounding whitespace so the uniqueness check compares clean names
        product_name = generate_electronics_completion(prompt, max_tokens=100).strip()
        
        # Check if name is unique
        if product_name not in existing_names:
            existing_names.add(product_name)
            return product_name
    
    # If all attempts failed, add a unique suffix
    base_name = generate_electronics_completion(prompt, max_tokens=100)
    unique_name = f"{base_name} - Model {random.randint(1000, 9999)}"
    existing_names.add(unique_name)
    return unique_name

def generate_electronics_description(product_name, category_name):
    prompt = f"Generate an engaging product description for '{product_name}' in category '{category_name}'. "
    prompt += "Include key technical specifications and features typical for this type of electronics product. "
    prompt += "Keep it under 255 characters and end with a complete sentence. "
    prompt += "Return only alphanumeric characters, spaces, and basic punctuation."
    
    description = generate_electronics_completion(prompt, max_tokens=200)
    # Ensure description is under 255 characters
    if len(description) > 255:
        sentences = description.split('.')
        result = ""
        for sentence in sentences:
            if len(result + sentence + ".") <= 255:
                result += sentence + "."
            else:
                break
        description = result if result else description[:252] + "..."
    
    return description

def generate_electronics_price(product_name, category_name, description):
    prompt = f"Generate a realistic retail price in USD for '{product_name}' in category '{category_name}'. "
    prompt += f"Consider the description: '{description}'. "
    prompt += "Return only the numeric price value without currency symbols."
    
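    price_str = generate_electronics_completion(prompt, max_tokens=20)
    # Take the first number in the reply; stripping characters alone can leave
    # unparseable strings such as "1299.99." when the model adds punctuation.
    match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
    if match is None:
        raise ValueError(f"Could not parse a price from: {price_str!r}")
    return round(float(match.group()), 2)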

def generate_customer_name():
    prompt = "Generate a realistic first and last name for a customer review. "
    prompt += "Return only the name with a space between first and last name."
    return generate_electronics_completion(prompt, max_tokens=50)

def generate_random_date_in_range(start_date=None, end_date=None):
    """Generate a random date between start_date and end_date"""
    if start_date is None:
        start_date = datetime.datetime(2024, 1, 1)  # January 2024
    if end_date is None:
        end_date = datetime.datetime(2025, 10, 31)  # October 2025
    
    time_between = end_date - start_date
    days_between = time_between.days
    # randrange(0) raises ValueError, so guard against start_date == end_date
    random_days = random.randrange(max(days_between, 1))
    
    return start_date + datetime.timedelta(days=random_days)

print("✅ Product data generation functions defined")

✅ Product data generation functions defined
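The 255-character limit in `generate_electronics_description` is enforced by keeping whole sentences. Splitting on every `.` also splits decimals such as "15.6", so the sketch below shows a variant that splits only where punctuation is followed by whitespace. `truncate_to_sentences` is a hypothetical helper, not the function used above.

```python
import re

def truncate_to_sentences(text, limit=255):
    """Keep whole sentences while the result fits within `limit` characters."""
    if len(text) <= limit:
        return text
    # Split after ., ! or ? only when followed by whitespace,
    # so decimals such as "15.6" stay intact
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    result = ""
    for sentence in sentences:
        candidate = f"{result} {sentence}".strip()
        if len(candidate) > limit:
            break
        result = candidate
    # Fall back to a hard cut when even the first sentence is too long
    return result if result else text[:limit - 3] + "..."

sample = "A 15.6 inch laptop with fast storage. " + "It has many extra features. " * 12
short = truncate_to_sentences(sample)
print(len(short), short.endswith("."))
```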


### 5.4 Price History Generation

In [22]:
def generate_price_history(initial_price, first_available_date, category_correlation):
    """Generate realistic price changes over time based on category type"""
    price_history = []
    current_price = initial_price
    current_date = first_available_date
    
    # Number of price changes (1-5)
    num_changes = random.randint(1, 5)
    
    for i in range(num_changes):
        # Add 2-8 months between price changes
        months_to_add = random.randint(2, 8)
        current_date = current_date + relativedelta(months=months_to_add)
        
        # Don't go beyond October 2025
        max_date = datetime.datetime(2025, 10, 31)
        if current_date > max_date:
            current_date = max_date
        
        # Generate price change percentage based on category type
        if category_correlation == "none":
            # Computers: random price changes
            price_change_percent = random.uniform(-0.30, 0.25)
        elif category_correlation == "inverse":
            # Devices: tend to decrease over time (technology gets cheaper)
            price_change_percent = random.uniform(-0.35, 0.10)
        elif category_correlation == "strong":
            # Accessories: premium items may increase or stay stable
            price_change_percent = random.uniform(-0.15, 0.30)
        elif category_correlation == "moderate":
            # Peripherals: moderate fluctuations
            price_change_percent = random.uniform(-0.20, 0.20)
        
        new_price = round(current_price * (1 + price_change_percent), 2)
        # Ensure price doesn't go below $5
        new_price = max(new_price, 5.00)
        
        price_history.append({
            "date": current_date.isoformat(),
            "price": new_price
        })
        
        current_price = new_price
        
        # Stop if we've reached the max date
        if current_date >= max_date:
            break
    
    return price_history

print("✅ Price history function defined")

✅ Price history function defined
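Each price change is applied to the previous price, not the initial one, so changes compound. The sketch below isolates that arithmetic with fixed percentages in place of random ones; `apply_price_changes` is a hypothetical helper that mirrors the rounding and the $5 floor used above.

```python
def apply_price_changes(initial_price, percent_changes, floor=5.00):
    """Apply successive fractional changes, rounding to cents and enforcing a floor."""
    prices = []
    current = initial_price
    for change in percent_changes:
        current = max(round(current * (1 + change), 2), floor)
        prices.append(current)
    return prices

# A 10% cut followed by a 20% rise compounds on the reduced price
print(apply_price_changes(100.00, [-0.10, 0.20]))
# A steep cut on a cheap item is held at the floor
print(apply_price_changes(6.00, [-0.50]))
```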


### 5.5 Correlated Review Generation

In [23]:
def generate_correlated_review(product_name, description, current_price, price_history, category_correlation, review_date):
    """Generate a review that correlates with price changes based on category type"""
    
    # Find the price in effect on the review date:
    # the latest history entry dated on or before it
    review_price = current_price
    for price_entry in price_history:
        price_date = datetime.datetime.fromisoformat(price_entry["date"])
        if price_date <= review_date:
            review_price = price_entry["price"]
    
    # Calculate if this is expensive/cheap relative to price history.
    # The caller passes a history that already ends with the current price,
    # so average the history alone and avoid counting that price twice.
    all_prices = [p["price"] for p in price_history] or [current_price]
    avg_price = sum(all_prices) / len(all_prices)
    price_ratio = review_price / avg_price  # >1 means expensive, <1 means cheap
    
    # Generate star rating based on correlation type
    if category_correlation == "none":
        # Computers: random rating (no correlation)
        stars = random.randint(1, 5)
        price_sentiment = ""
    elif category_correlation == "inverse":
        # Devices: cheaper = better reviews
        if price_ratio < 0.8:  # significantly cheaper
            stars = random.choices([4, 5], weights=[30, 70])[0]
            price_sentiment = " Great value for money! "
        elif price_ratio > 1.2:  # significantly more expensive
            stars = random.choices([1, 2, 3], weights=[40, 40, 20])[0]
            price_sentiment = " Overpriced for what you get. "
        else:
            stars = random.randint(2, 4)
            price_sentiment = ""
    elif category_correlation == "strong":
        # Accessories: expensive = better reviews
        if price_ratio > 1.2:  # significantly more expensive
            stars = random.choices([4, 5], weights=[30, 70])[0]
            price_sentiment = " Premium quality worth the price! "
        elif price_ratio < 0.8:  # significantly cheaper
            stars = random.choices([1, 2, 3], weights=[50, 30, 20])[0]
            price_sentiment = " You get what you pay for. "
        else:
            stars = random.randint(2, 4)
            price_sentiment = ""
    else:  # moderate correlation
        # Peripherals: moderate correlation
        if price_ratio > 1.1:
            stars = random.choices([3, 4, 5], weights=[20, 40, 40])[0]
            price_sentiment = " Good quality but pricey. "
        elif price_ratio < 0.9:
            stars = random.choices([2, 3, 4], weights=[30, 40, 30])[0]
            price_sentiment = " Decent value. "
        else:
            stars = random.randint(2, 4)
            price_sentiment = ""
    
    # Generate review text with price sentiment
    prompt = f"Write a customer review for '{product_name}' described as '{description}' "
    prompt += f"with {stars} stars out of 5. "
    if price_sentiment:
        prompt += f"Include this sentiment about pricing: '{price_sentiment.strip()}' "
    prompt += "Make it sound like a real customer review. Return only the review text."
    
    review_text = generate_electronics_completion(prompt, max_tokens=150)
    
    return {
        "stars": stars,
        "review_text": review_text
    }

print("✅ Correlated review function defined")

✅ Correlated review function defined
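The two steps that drive the rating, looking up the price in effect on the review date and comparing it with the history average, can be sketched in isolation. `price_on` and `price_bucket` are hypothetical helpers; the 0.8 and 1.2 thresholds match the ones used for devices and accessories above, and the history is invented.

```python
import datetime

def price_on(review_date, price_history):
    """Price in effect on review_date: the latest entry dated on or before it."""
    effective = None
    for entry in sorted(price_history, key=lambda e: e["date"]):
        if datetime.datetime.fromisoformat(entry["date"]) <= review_date:
            effective = entry["price"]
    return effective

def price_bucket(price, price_history, low=0.8, high=1.2):
    """Classify a price against the history average using ratio thresholds."""
    average = sum(e["price"] for e in price_history) / len(price_history)
    ratio = price / average
    if ratio < low:
        return "cheap"
    if ratio > high:
        return "expensive"
    return "typical"

# Made-up history: launch price, then a deep discount
history = [
    {"date": "2024-01-15T00:00:00", "price": 200.00},
    {"date": "2024-06-15T00:00:00", "price": 100.00},
]
review_date = datetime.datetime(2024, 7, 1)
paid = price_on(review_date, history)
print(paid, price_bucket(paid, history))
```

For an "inverse" category the "cheap" bucket would lean toward 4 and 5 stars, while for a "strong" category the same bucket would lean toward 1 to 3 stars.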


### 5.6 Product and Review Document Generation

In [31]:
def generate_electronics_product(category, existing_names=None):
    """Generate a complete electronics product with all required fields"""
    
    # Generate basic product info with uniqueness check
    product_name = generate_electronics_product_name(category['name'], existing_names)
    description = generate_electronics_description(product_name, category['name'])
    initial_price = generate_electronics_price(product_name, category['name'], description)
    
    # Generate random first available date (January 2024 to October 2025)
    first_available = generate_random_date_in_range(
        datetime.datetime(2024, 1, 1), 
        datetime.datetime(2025, 10, 31)
    )
    
    # Generate price history
    price_history = generate_price_history(initial_price, first_available, category['correlation'])
    
    # Current price is the last price in history or initial price
    current_price = price_history[-1]['price'] if price_history else initial_price
    
    # Generate other product fields
    countries = get_countries_of_origin()
    country_of_origin = random.choice(countries)
    inventory = random.randint(50, 1000)
    
    # Generate unique ID for this product
    product_id = str(uuid.uuid4())
    
    # Create the product document with docType (ordered properties)
    product = {
        "id": product_id,
        "docType": "product",  # Document type to distinguish from reviews
        "productId": product_id,  # productId must match id for products
        "name": product_name,
        "description": description,
        "categoryName": category['name'],
        "countryOfOrigin": country_of_origin,
        "inventory": inventory,
        "firstAvailable": first_available.isoformat(),
        "currentPrice": current_price,
        "priceHistory": [{"date": first_available.isoformat(), "price": initial_price}] + price_history
    }
    
    return product

def generate_customer_reviews(product, category, num_reviews=None):
    """Generate customer review documents timed with price changes to show correlation"""
    
    if num_reviews is None:
        num_reviews = random.randint(1, 6)  # 1-6 reviews per product
    
    reviews = []
    
    # Get all price change dates for correlation timing
    price_dates = []
    for price_entry in product['priceHistory']:
        price_dates.append(datetime.datetime.fromisoformat(price_entry["date"]))
    
    for _ in range(num_reviews):
        # Time reviews around price changes to show correlation
        if price_dates and random.random() < 0.7:  # 70% chance to align with price changes
            # Pick a random price change date and add 1-30 days after it
            base_date = random.choice(price_dates)
            review_date = base_date + datetime.timedelta(days=random.randint(1, 30))
        else:
            # Random date between first available and October 2025
            first_available = datetime.datetime.fromisoformat(product['firstAvailable'])
            max_date = datetime.datetime(2025, 10, 31)
            review_date = generate_random_date_in_range(first_available, max_date)
        
        # Ensure review date is after product was available and not in the future
        first_available = datetime.datetime.fromisoformat(product['firstAvailable'])
        if review_date < first_available:
            review_date = first_available + datetime.timedelta(days=random.randint(1, 30))
        
        max_review_date = min(datetime.datetime.now(), datetime.datetime(2025, 10, 31))
        if review_date > max_review_date:
            # Step back from the cap, but never to before the product's release date
            review_date = max(
                first_available,
                max_review_date - datetime.timedelta(days=random.randint(1, 30))
            )
        
        # Generate correlated review
        review_data = generate_correlated_review(
            product['name'], 
            product['description'], 
            product['currentPrice'],
            product['priceHistory'], 
            category['correlation'],
            review_date
        )
        
        # Create review document with docType and shared properties
        review = {
            "id": str(uuid.uuid4()),
            "docType": "review",  # Document type to distinguish from products
            "productId": product['id'],  # Shared property for relationship
            "categoryName": category['name'],  # Shared property for relationship
            "customerName": generate_customer_name(),
            "reviewDate": review_date.isoformat(),
            "stars": review_data['stars'],
            "reviewText": review_data['review_text']
        }
        
        reviews.append(review)
    
    return reviews

print("✅ Product and review generation functions defined")

✅ Product and review generation functions defined
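Since reviews reference their product through `productId`, the two document types can be related after loading a mixed list. The sketch below averages star ratings per product; `average_stars_by_product` is a hypothetical helper and the documents are invented, reusing the notebook's field names.

```python
from collections import defaultdict

def average_stars_by_product(documents):
    """Map each productId to the mean star rating of its review documents."""
    stars = defaultdict(list)
    for doc in documents:
        if doc.get("docType") == "review":
            stars[doc["productId"]].append(doc["stars"])
    return {pid: sum(values) / len(values) for pid, values in stars.items()}

# Made-up documents using the same field names as the notebook
docs = [
    {"id": "p1", "docType": "product", "productId": "p1", "name": "Example Mouse"},
    {"id": "r1", "docType": "review", "productId": "p1", "stars": 5},
    {"id": "r2", "docType": "review", "productId": "p1", "stars": 3},
    {"id": "p2", "docType": "product", "productId": "p2", "name": "Example Webcam"},
]
print(average_stars_by_product(docs))
```

Products without reviews, such as `p2` here, do not appear in the result.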


## 6. Generate Complete Electronics Catalog

In [25]:
def generate_electronics_catalog(products_per_category=10):
    """Generate complete electronics catalog with UNIQUE products and review documents"""
    
    categories = get_electronics_categories()
    all_documents = []  # Will contain both products and reviews
    
    print("🚀 Generating Electronics Store Catalog with UNIQUE Products...")
    print("=" * 50)
    
    for category in categories:
        print(f"\nGenerating {products_per_category} products for category: {category['name']}")
        print(f"Price-Review Correlation: {category['correlation']}")
        
        # Track unique names within this category
        category_names = set()
        
        for i in range(products_per_category):
            # Generate product with unique name
            product = generate_electronics_product(category, category_names)
            
            # Generate reviews for this product
            reviews = generate_customer_reviews(product, category)
            
            # NOTE: Embeddings are now generated separately using add_embeddings_to_catalog()
            # This improves performance and allows flexibility in when/if vectors are added
            # # Add vectors for product search
            # try:
            #     vector_product = {
            #         "name": product['name'],
            #         "description": product['description'],
            #         "categoryName": product['categoryName'],
            #         "countryOfOrigin": product['countryOfOrigin'],
            #         "currentPrice": product['currentPrice']
            #     }
            #     product['vectors'] = generate_embeddings(json.dumps(vector_product, ensure_ascii=False))
            # except Exception as e:
            #     print(f"Warning: Could not generate embeddings for {product['name']}: {e}")
            #     product['vectors'] = None
            
            # Add product document to the collection
            all_documents.append(product)
            
            # Add all review documents to the collection
            all_documents.extend(reviews)
            
            print(f"  [{i+1:2d}] {product['name']} (${product['currentPrice']:.2f}) - {len(reviews)} reviews")
    
    # Count products and reviews
    products_count = len([doc for doc in all_documents if doc['docType'] == 'product'])
    reviews_count = len([doc for doc in all_documents if doc['docType'] == 'review'])
    
    print(f"\n" + "=" * 50)
    print(f"Generated {products_count} products and {reviews_count} reviews")
    print(f"Total documents: {len(all_documents)}")
    
    return all_documents

def save_electronics_catalog(all_documents, filename="fabricSampleData.json"):
    """Save all documents (products and reviews) to a single JSON file"""
    
    with open(filename, 'w') as f:
        json.dump(all_documents, f, indent=4)
    
    # Count by document type
    products_count = len([doc for doc in all_documents if doc['docType'] == 'product'])
    reviews_count = len([doc for doc in all_documents if doc['docType'] == 'review'])
    
    print(f"\n✅ Electronics catalog saved to: {filename}")
    print(f"📦 Contains {products_count} products and {reviews_count} reviews")
    
    return filename

print("✅ Catalog generation functions defined")

✅ Catalog generation functions defined


## 7. Generate Full Catalog

The generation cell for this step comes after the embedding helper in 6.1. It generates 180 products (10 per category) with correlated reviews. **This may take several minutes.**

### 6.1 Add Embeddings to Existing Catalog

Use this function to add vector embeddings to a catalog that was generated without them.

In [43]:
def add_embeddings_to_catalog(input_filename, output_filename=None):
    """
    Load an existing catalog, add vector embeddings to product documents, and save to a new file.
    
    Args:
        input_filename: Path to the input catalog JSON file (without embeddings)
        output_filename: Path to save the catalog with embeddings (optional, defaults to input_filename with '_with_vectors' suffix)
    
    Returns:
        Path to the saved file with embeddings
    """
    
    # Default output filename if not provided
    if output_filename is None:
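        # Remove only a trailing '.json'; str.replace would also alter
        # any '.json' that appears earlier in the path
        base_name = re.sub(r'\.json$', '', input_filename)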
        output_filename = f"{base_name}_with_vectors.json"
    
    print(f"📂 Loading catalog from: {input_filename}")
    
    # Load the existing catalog
    with open(input_filename, 'r') as f:
        catalog = json.load(f)
    
    print(f"📊 Loaded {len(catalog)} documents")
    
    # Separate products and reviews
    products = [doc for doc in catalog if doc.get('docType') == 'product']
    reviews = [doc for doc in catalog if doc.get('docType') == 'review']
    
    print(f"   • {len(products)} products")
    print(f"   • {len(reviews)} reviews")
    print(f"\n🔄 Generating embeddings for products...")
    print("=" * 50)
    
    # Add embeddings to each product
    for i, product in enumerate(products, 1):
        try:
            # Create a dictionary with selected product fields
            vector_product = {
                "name": product['name'],
                "description": product['description'],
                "categoryName": product['categoryName']
            }
            
            # Convert to JSON string and pass to generate_embeddings
            product['vectors'] = generate_embeddings(json.dumps(vector_product, ensure_ascii=False))
            
            # Progress indicator every 10 products
            if i % 10 == 0:
                print(f"   [{i}/{len(products)}] Generated embeddings for {product['name']}")
        
        except Exception as e:
            print(f"❌ Error generating embeddings for {product['name']}: {e}")
            product['vectors'] = None
    
    print(f"\n✅ Completed embedding generation for all {len(products)} products")
    
    # Save the updated catalog
    print(f"\n💾 Saving catalog with embeddings to: {output_filename}")
    with open(output_filename, 'w') as f:
        json.dump(catalog, f, indent=4)
    
    print(f"✅ Catalog with embeddings saved successfully!")
    print(f"\n📊 SUMMARY:")
    print(f"   • Input file: {input_filename}")
    print(f"   • Output file: {output_filename}")
    print(f"   • Total documents: {len(catalog)}")
    print(f"   • Products with vectors: {len(products)}")
    print(f"   • Reviews (no vectors): {len(reviews)}")
    
    return output_filename

print("✅ Embedding addition function defined")

✅ Embedding addition function defined


In [None]:
# Generate the full catalog (this will take several minutes)
# NOTE: This generates products WITHOUT embeddings for faster generation
all_documents = generate_electronics_catalog(products_per_category=10)

# Save to JSON file WITHOUT vectors
catalog_file = save_electronics_catalog(all_documents, "fabricSampleData.json")

# Show final statistics
products_count = len([doc for doc in all_documents if doc['docType'] == 'product'])
reviews_count = len([doc for doc in all_documents if doc['docType'] == 'review'])

print(f"\n🎉 CATALOG GENERATION COMPLETE!")
print(f"📁 Saved to: {catalog_file}")
print(f"\n📈 FINAL STATISTICS:")
print(f"   • Products: {products_count}")
print(f"   • Reviews: {reviews_count}")
print(f"   • Total Documents: {len(all_documents)}")
print(f"   • Categories: 18 (with 4 correlation types)")
print(f"   • Date Range: January 2024 - October 2025")
print(f"\n💡 Note: This catalog does NOT include vector embeddings.")
print(f"💡 Use the next section to add embeddings if needed.")
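Once the catalog exists, the intended price-review patterns can be spot-checked with a correlation coefficient. The sketch below implements Pearson correlation in plain Python and runs it on invented (price, stars) pairs. `pearson` is a hypothetical helper, and pairing each review with the price in effect on its date is left to the reader.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up (price at review time, stars) pairs for an "inverse" category
prices = [100.0, 120.0, 150.0, 180.0, 200.0]
stars = [5, 5, 3, 2, 1]
print(round(pearson(prices, stars), 3))
```

A clearly negative value would be expected for device categories, a clearly positive one for accessories, and a value near zero for computers, given enough reviews.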

### 7.1 Add Embeddings to Generated Catalog

Run this cell to add vector embeddings to the catalog. This is done separately to improve generation speed and allow flexibility.

In [44]:
# Add embeddings to the catalog
# Without output_filename, the result would be written to fabricSampleData_with_vectors.json
# Here a custom output filename is specified instead

catalog_with_vectors = add_embeddings_to_catalog(
    input_filename="fabricSampleData.json",
    output_filename="fabricSampleDataVectors-text-3-large.json"  # Custom output name
)

print(f"\n✅ Catalog with embeddings is ready!")
print(f"📁 File: {catalog_with_vectors}")

📂 Loading catalog from: fabricSampleData.json
📊 Loaded 832 documents
   • 180 products
   • 652 reviews

🔄 Generating embeddings for products...
   [10/180] Generated embeddings for VertexPro X15 Ultra Lite
   [20/180] Generated embeddings for AeroCore VisionPro 9800X
   [30/180] Generated embeddings for Vertex Chronos G17 Elite RTX
   [40/180] Generated embeddings for QuantumTech ApexStation X9 Pro
   [50/180] Generated embeddings for Vertex Pro X5 Dual 5G
   [60/180] Generated embeddings for TechVerse TabPro X12
   [70/180] Generated embeddings for Apex Chrono S3 Smartwatch
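
The embedding step above can be pictured with a small sketch. This is not the notebook's `add_embeddings_to_catalog` implementation; it is a minimal illustration under stated assumptions: the helper name `attach_vectors`, the injected `embed_fn` callable, and the choice to embed only the product `name` are all hypothetical. The `vectors` property and the `docType` field come from the catalog itself.

```python
from typing import Callable


def attach_vectors(documents: list[dict],
                   embed_fn: Callable[[str], list[float]],
                   text_field: str = "name",   # assumption: which text gets embedded
                   log_every: int = 10) -> int:
    """Add a `vectors` property to every product document, in place.

    Returns the number of documents that received an embedding.
    """
    # Assumption: only product documents are embedded; reviews are left as-is.
    products = [doc for doc in documents if doc.get("docType") == "product"]
    for i, doc in enumerate(products, 1):
        doc["vectors"] = embed_fn(doc[text_field])
        if i % log_every == 0:
            print(f"   [{i}/{len(products)}] Generated embeddings for {doc[text_field]}")
    return len(products)


# Stand-in embedder so the sketch runs offline; a real run would call
# the configured embeddings deployment instead.
def fake_embed(text: str) -> list[float]:
    return [float(len(text)), 0.0, 1.0]


docs = [
    {"id": "p1", "docType": "product", "name": "Sample Laptop"},
    {"id": "r1", "docType": "review", "productId": "p1", "stars": 4},
]
embedded = attach_vectors(docs, fake_embed)
print(f"Embedded {embedded} of {len(docs)} documents")
```

Passing the embedder in as a function keeps the loop testable without network access and makes it easy to swap embedding models.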

## 8. Remove Embeddings from Catalog (Optional)

If you have a catalog WITH vectors and want to create a lightweight version WITHOUT them, use this cell.

In [None]:
# Create version without vectors for lighter file size
# Use this if you have a catalog WITH vectors and want to remove them

print("Creating catalog version without vectors...")

# Load the catalog with vectors
with open('electronics_catalog.json', 'r') as f:
    catalog = json.load(f)

# Remove vectors property from all documents
vectors_removed = 0
for doc in catalog:
    if 'vectors' in doc:
        del doc['vectors']
        vectors_removed += 1

# Save to new file
with open('electronics_catalog_no_vectors.json', 'w') as f:
    json.dump(catalog, f, indent=4)

print(f"✅ Created electronics_catalog_no_vectors.json")
print(f"📊 Removed vectors from {vectors_removed} documents")
print("\n📊 File Comparison:")
print("   • electronics_catalog.json: Full version with vectors (~10 MB)")
print("   • electronics_catalog_no_vectors.json: Lightweight version (~345 KB)")
print("\n💡 Use the version WITH vectors for similarity search capabilities")
print("💡 Use the version WITHOUT vectors for standard queries and lighter weight")
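
The file sizes printed above are fixed estimates. As a hedged alternative, the sketch below performs the same vector removal and reports the sizes actually measured on disk. The helper name `strip_vectors` and the temporary sample files are assumptions made for illustration; only the `vectors` property name comes from the catalog.

```python
import json
import os
import tempfile


def strip_vectors(src_path: str, dst_path: str) -> dict:
    """Copy a JSON catalog without its `vectors` properties and measure both files."""
    with open(src_path, "r", encoding="utf-8") as f:
        catalog = json.load(f)

    removed = 0
    for doc in catalog:
        if "vectors" in doc:
            del doc["vectors"]
            removed += 1

    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(catalog, f, indent=4)

    return {
        "removed": removed,
        "src_bytes": os.path.getsize(src_path),
        "dst_bytes": os.path.getsize(dst_path),
    }


# Demonstration on a tiny throwaway catalog rather than the real files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "with_vectors.json")
    dst = os.path.join(tmp, "no_vectors.json")
    sample = [
        {"id": "p1", "docType": "product", "vectors": [0.1] * 256},
        {"id": "r1", "docType": "review"},
    ]
    with open(src, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    report = strip_vectors(src, dst)

print(f"Removed vectors from {report['removed']} documents")
print(f"With vectors: {report['src_bytes']} bytes, without: {report['dst_bytes']} bytes")
```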

## 9. Upload to Cosmos DB (Optional)

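Uncomment and run this cell to upload the catalog to Azure Cosmos DB. The cell uploads the in-memory `all_documents` list, which, as produced by the generation step above, does not include embeddings; to populate the `/vectors` path, load a catalog file that contains vectors into `all_documents` first.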

In [None]:
# # Initialize Cosmos DB client
# cosmos_client = CosmosClient(url=COSMOS_ENDPOINT, credential=COSMOS_KEY)
# 
# # Create database if it doesn't exist
# db = cosmos_client.create_database_if_not_exists(id=COSMOS_DATABASE)
# 
# # Define vector embedding policy
# vector_embedding_policy = {
#     "vectorEmbeddings": [
#         {
#             "path": "/vectors",
#             "dataType": "float32",
#             "distanceFunction": "cosine",
#             "dimensions": EMBEDDING_DIMENSIONS
#         }
#     ]
# }
# 
# # Define indexing policy
# indexing_policy = {
#     "includedPaths": [{"path": "/*"}],
#     "excludedPaths": [
#         {"path": "/\"_etag\"/?" },
#         {"path": "/vectors/*"}
#     ],
#     "vectorIndexes": [{"path": "/vectors", "type": "quantizedFlat"}]
# }
# 
# # Create container with vector support
# container = db.create_container_if_not_exists(
#     id=COSMOS_CONTAINER,
#     partition_key=PartitionKey(path="/categoryName", kind='Hash'),
#     indexing_policy=indexing_policy,
#     vector_embedding_policy=vector_embedding_policy
# )
# 
# # Upload all documents
# print(f"📤 Uploading {len(all_documents)} documents to Cosmos DB...")
# for i, document in enumerate(all_documents, 1):
#     container.upsert_item(body=document)
#     if i % 50 == 0:
#         print(f"   Uploaded {i}/{len(all_documents)} documents...")
# 
# print(f"✅ All {len(all_documents)} documents uploaded to Cosmos DB!")
# print(f"📊 Database: {COSMOS_DATABASE}")
# print(f"📦 Container: {COSMOS_CONTAINER}")
# print(f"🔑 Partition Key: /categoryName")
# print(f"🔍 Vector Search: Enabled on /vectors property")
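
Before uploading, it can help to check the documents locally. The sketch below is an optional pre-flight check, not part of the upload cell: it flags documents without a string `id`, documents lacking the `categoryName` partition key field, and repeated `id` values within one category, which `upsert_item` would silently overwrite. The helper name `find_invalid_documents` and the sample data are hypothetical.

```python
def find_invalid_documents(documents, partition_key_field="categoryName"):
    """Return (index, reason) pairs for documents likely to cause upload problems."""
    problems = []
    seen = set()  # (partition value, id) pairs already encountered
    for index, doc in enumerate(documents):
        doc_id = doc.get("id")
        partition_value = doc.get(partition_key_field)

        if not isinstance(doc_id, str) or not doc_id:
            problems.append((index, "missing or non-string id"))
        if partition_key_field not in doc:
            problems.append((index, f"missing {partition_key_field}"))

        # An id repeated within one partition means a later upsert replaces the earlier item.
        if isinstance(doc_id, str) and isinstance(partition_value, str):
            key = (partition_value, doc_id)
            if key in seen:
                problems.append((index, "duplicate id in partition"))
            seen.add(key)
    return problems


sample_docs = [
    {"id": "p1", "categoryName": "Laptops"},
    {"id": "p1", "categoryName": "Laptops"},   # same id, same partition
    {"categoryName": "Tablets"},               # no id
    {"id": "r9"},                              # no partition key field
]
problems = find_invalid_documents(sample_docs)
for index, reason in problems:
    print(f"   Document {index}: {reason}")
```

Running a check like this on `all_documents` before the upload loop keeps data errors from surfacing halfway through a long upload.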

## 10. Analyze Generated Catalog

In [None]:
# Analyze the generated electronics catalog
print("📊 ANALYZING GENERATED ELECTRONICS CATALOG")
print("=" * 50)

# Load the generated catalog
with open("electronics_catalog.json", 'r') as f:
    catalog_data = json.load(f)

# Count documents by type
products = [doc for doc in catalog_data if doc['docType'] == 'product']
reviews = [doc for doc in catalog_data if doc['docType'] == 'review']

print(f"📦 Total Documents: {len(catalog_data)}")
print(f"🛍️  Products: {len(products)}")
print(f"⭐ Reviews: {len(reviews)}")
print(f"📱 Categories: {len({doc['categoryName'] for doc in products})}")

# Analyze by category
print(f"\n🏷️  PRODUCTS BY CATEGORY:")
category_counts = {}
for product in products:
    category = product['categoryName']
    category_counts[category] = category_counts.get(category, 0) + 1

for category, count in sorted(category_counts.items()):
    print(f"   {category}: {count} products")

# Sample documents
print(f"\n📋 SAMPLE PRODUCT:")
sample_product = products[0]
print(f"   Name: {sample_product['name']}")
print(f"   Category: {sample_product['categoryName']}")
print(f"   Price: ${sample_product['currentPrice']:.2f}")
print(f"   Inventory: {sample_product['inventory']}")
print(f"   Country: {sample_product['countryOfOrigin']}")

print(f"\n⭐ SAMPLE REVIEW:")
sample_review = reviews[0]
print(f"   Stars: {sample_review['stars']}⭐")
print(f"   Customer: {sample_review['customerName']}")
print(f"   Product ID: {sample_review['productId']}")
print(f"   Category: {sample_review['categoryName']}")

print(f"\n✅ Electronics catalog ready for Cosmos DB!")
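
Since the catalog is designed around price-review correlation patterns, the analysis can be extended to measure them. The sketch below computes, per category, the Pearson correlation between each product's `currentPrice` and its mean review `stars`. It rests on two assumptions: product documents carry an `id` field that reviews reference through `productId`, and the helper names `pearson` and `price_rating_correlation` are invented for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def price_rating_correlation(products, reviews, product_id_field="id"):
    """Map each category to the correlation of price vs. mean stars per product."""
    stars_by_product = defaultdict(list)
    for review in reviews:
        stars_by_product[review["productId"]].append(review["stars"])

    pairs_by_category = defaultdict(list)
    for product in products:
        stars = stars_by_product.get(product[product_id_field])
        if stars:  # skip products without reviews
            mean_stars = sum(stars) / len(stars)
            pairs_by_category[product["categoryName"]].append(
                (product["currentPrice"], mean_stars)
            )

    return {
        category: pearson([p for p, _ in pairs], [s for _, s in pairs])
        for category, pairs in pairs_by_category.items()
    }


# Tiny made-up sample: higher price, better reviews.
sample_products = [
    {"id": "p1", "categoryName": "Accessories", "currentPrice": 10.0},
    {"id": "p2", "categoryName": "Accessories", "currentPrice": 20.0},
    {"id": "p3", "categoryName": "Accessories", "currentPrice": 30.0},
]
sample_reviews = [
    {"productId": "p1", "stars": 2},
    {"productId": "p2", "stars": 3},
    {"productId": "p3", "stars": 4},
    {"productId": "p3", "stars": 4},
]
result = price_rating_correlation(sample_products, sample_reviews)
for category, r in sorted(result.items()):
    print(f"   {category}: r = {r:.2f}" if r is not None else f"   {category}: undefined")
```

Applied to the `products` and `reviews` lists from the analysis cell, a coefficient near zero, below zero, or above zero would correspond to the no-correlation, inverse, and positive patterns respectively.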