# Customer Reviews Data Generation

Generate realistic synthetic customer reviews for the Fashion Retail Intelligence Platform.

**Purpose**: Create the `gold_customer_reviews` table with ~5,000 synthetic reviews, each linked to an existing customer and product, as source text for Vector Search and RAG.

**Key Features**:
- Reviews linked to actual `customer_key` from `gold_customer_dim`
- Reviews linked to actual `product_key` from `gold_product_dim`
- Content varies by customer segment (VIP, Premium, Loyal, Regular, New)
- Content varies by product category (apparel, footwear, accessories)
- Realistic rating distribution (skewed positive but with variance)
- Three review types: product_review, purchase_experience, return_feedback
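
As a rough illustration of how the three review types could be split across the ~5,000 target rows, the sketch below turns fractional shares into whole-number counts. The 60/20/20 shares mirror the configuration cell later in this notebook; `allocate_counts` is a hypothetical helper and not part of the pipeline.

```python
def allocate_counts(total, shares):
    """Split `total` into integer counts proportional to `shares` (hypothetical helper)."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    counts = {name: round(total * share) for name, share in shares.items()}
    # Rounding can leave the sum off by a few rows; give the difference
    # to the largest share so the grand total is preserved.
    largest = max(shares, key=shares.get)
    counts[largest] += total - sum(counts.values())
    return counts


# Shares assumed to match PRODUCT_REVIEW_PCT, PURCHASE_EXPERIENCE_PCT, RETURN_FEEDBACK_PCT
review_type_shares = {
    "product_review": 0.60,
    "purchase_experience": 0.20,
    "return_feedback": 0.20,
}
review_type_counts = allocate_counts(5000, review_type_shares)
print(review_type_counts)
```

Fixing the counts up front, rather than sampling a type per review, keeps the mix exact for a given total; sampling per review would only approximate it.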


## Setup: Import Configuration


In [None]:
%pip install --quiet pyyaml

dbutils.library.restartPython()

In [None]:
import sys
import os

# Add the src directory to Python path for clean imports
sys.path.append('../src')

from fashion_retail.config import load_config

# Load configuration from project-level config.yaml
config = load_config()

CATALOG = config.catalog
SCHEMA = config.schema
TABLE_NAME = "gold_customer_reviews"
FULL_TABLE_NAME = f"{CATALOG}.{SCHEMA}.{TABLE_NAME}"

# Review generation settings
TOTAL_REVIEWS = 5000
PRODUCT_REVIEW_PCT = 0.60  # 60% product reviews
PURCHASE_EXPERIENCE_PCT = 0.20  # 20% purchase experience
RETURN_FEEDBACK_PCT = 0.20  # 20% return feedback

# Rating distribution (realistic - skewed positive)
RATING_DISTRIBUTION = {
    5: 0.45,  # 45% - Enthusiastic praise
    4: 0.25,  # 25% - Positive with minor notes
    3: 0.15,  # 15% - Mixed/neutral
    2: 0.10,  # 10% - Disappointed
    1: 0.05   # 5% - Strong complaints
}

RANDOM_SEED = config.random_seed

# Known valid values for validation
VALID_SEGMENTS = {"vip", "premium", "loyal", "regular", "new"}
VALID_CATEGORIES = {"apparel", "footwear", "accessories"}
VALID_PRICE_TIERS = {"premium", "mid-range", "budget", "value", "luxury"}

# Tracking for unexpected values (for logging)
UNEXPECTED_VALUES = {"segments": set(), "categories": set(), "materials": set(), "price_tiers": set()}

print("Configuration loaded from config.yaml")
print(f"  Catalog: {CATALOG}")
print(f"  Schema: {SCHEMA}")
print(f"Target table: {FULL_TABLE_NAME}")
print(f"Total reviews to generate: {TOTAL_REVIEWS:,}")


## Load Customer and Product Data
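
Before sampling from the dimension tables, it can help to compare the values that actually occur in a column with the set the templates expect. The sketch below shows that check on plain dictionaries so it runs without Spark. `find_unexpected` and the sample rows are illustrative assumptions, not part of the pipeline.

```python
def find_unexpected(rows, field, valid_values):
    """Return values of `field` that are missing from `valid_values` (hypothetical helper).

    None is skipped, and the comparison is case-insensitive.
    """
    found = {row[field] for row in rows if row[field] is not None}
    return {value for value in found if value.lower() not in valid_values}


# Made-up rows standing in for the result of customers_df.collect()
sample_customers = [
    {"customer_key": 1, "segment": "vip"},
    {"customer_key": 2, "segment": "Regular"},
    {"customer_key": 3, "segment": "wholesale"},
    {"customer_key": 4, "segment": None},
]
valid_segments = {"vip", "premium", "loyal", "regular", "new"}
unexpected = find_unexpected(sample_customers, "segment", valid_segments)
print(unexpected)
```

The cells below compare values case-sensitively and do not filter out `None`; the case folding and `None` handling here are added assumptions.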


In [None]:
# Load customers with their segments
customers_df = spark.sql(f"""
    SELECT 
        customer_key,
        customer_id,
        segment,
        loyalty_tier,
        geo_city,
        preferred_category
    FROM {CATALOG}.{SCHEMA}.gold_customer_dim
    WHERE is_current = TRUE
""")

customers = customers_df.collect()

# Validation: Ensure we have customers
if not customers:
    raise ValueError(f"No customers found in {CATALOG}.{SCHEMA}.gold_customer_dim with is_current = TRUE. "
                     "Please run the data generation pipeline first.")

print(f"Loaded {len(customers):,} customers")

# Validate segment values
found_segments = set(c["segment"] for c in customers)
unexpected_segments = found_segments - VALID_SEGMENTS
if unexpected_segments:
    print(f"  WARNING: Found unexpected segments: {unexpected_segments}")
    UNEXPECTED_VALUES["segments"].update(unexpected_segments)
else:
    print(f"  All segments valid: {found_segments}")

# Show segment distribution
spark.sql(f"""
    SELECT segment, COUNT(*) as count
    FROM {CATALOG}.{SCHEMA}.gold_customer_dim
    WHERE is_current = TRUE
    GROUP BY segment
    ORDER BY count DESC
""").show()


In [None]:
# Load products with their categories, materials, pricing, and seasons
products_df = spark.sql(f"""
    SELECT 
        product_key,
        product_id,
        product_name,
        brand,
        category_level_1,
        category_level_2,
        category_level_3,
        color_name,
        material_primary,
        base_price,
        price_tier,
        season_code,
        size_range
    FROM {CATALOG}.{SCHEMA}.gold_product_dim
    WHERE is_active = TRUE
""")

products = products_df.collect()

# Validation: Ensure we have products
if not products:
    raise ValueError(f"No products found in {CATALOG}.{SCHEMA}.gold_product_dim with is_active = TRUE. "
                     "Please run the data generation pipeline first.")

print(f"Loaded {len(products):,} products")

# Validate category values
found_categories = set(p["category_level_1"] for p in products)
unexpected_categories = found_categories - VALID_CATEGORIES
if unexpected_categories:
    print(f"  WARNING: Found unexpected categories: {unexpected_categories}")
    UNEXPECTED_VALUES["categories"].update(unexpected_categories)
else:
    print(f"  All categories valid: {found_categories}")

# Log unique materials for reference
found_materials = set(p["material_primary"] for p in products if p["material_primary"])
print(f"  Found {len(found_materials)} unique materials: {found_materials}")

# Log unique price tiers
found_price_tiers = set(p["price_tier"] for p in products if p["price_tier"])
print(f"  Found {len(found_price_tiers)} unique price tiers: {found_price_tiers}")

# Log unique brands
found_brands = set(p["brand"] for p in products if p["brand"])
print(f"  Found {len(found_brands)} unique brands: {found_brands}")

# Show category distribution
spark.sql(f"""
    SELECT category_level_1, COUNT(*) as count
    FROM {CATALOG}.{SCHEMA}.gold_product_dim
    WHERE is_active = TRUE
    GROUP BY category_level_1
    ORDER BY count DESC
""").show()


## Review Content Templates

Define review content that varies by:
- Customer segment (vocabulary and tone)
- Product category (topics and concerns)
- Rating level (positive vs negative themes)
- Material, brand, price tier, size, and season (optional supporting comments)
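
A minimal sketch of how these dimensions might combine into one review sentence: pick an intro by segment, then a category sentence whose polarity follows the rating. The dictionaries are small excerpts of the templates defined in the cells below, and `compose_review` is a hypothetical simplification (it treats every rating below 4 as negative and ignores the optional comments).

```python
import random

# Small excerpts of the template dictionaries defined in the cells below
intros = {
    "vip": ["As a platinum member,"],
    "regular": ["Just received my order and"],
}
positive = {"footwear": ["The cushioning is perfect."]}
negative = {"footwear": ["The sole started separating after a week."]}


def compose_review(segment, category, rating, rng):
    """Join a segment intro with a category sentence chosen by rating (illustrative only)."""
    # Unknown segments fall back to the "regular" intros
    intro = rng.choice(intros.get(segment, intros["regular"]))
    pool = positive if rating >= 4 else negative
    sentence = rng.choice(pool[category])
    # Lowercase the first letter so the sentence reads on from the intro
    return f"{intro} {sentence[0].lower()}{sentence[1:]}"


rng = random.Random(42)  # local generator so the sketch does not disturb global state
print(compose_review("vip", "footwear", 5, rng))
print(compose_review("unknown", "footwear", 1, rng))
```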


In [None]:
import random
from datetime import datetime, timedelta

random.seed(RANDOM_SEED)

# Review title templates by rating
TITLE_TEMPLATES = {
    5: [
        "Absolutely love it!",
        "Exceeded my expectations",
        "Perfect in every way",
        "Best purchase I've made",
        "Highly recommend!",
        "Outstanding quality",
        "Worth every penny",
        "A new favorite",
        "Stunning piece",
        "Exceptional craftsmanship"
    ],
    4: [
        "Great product, minor issues",
        "Really pleased overall",
        "Good quality, would buy again",
        "Almost perfect",
        "Very satisfied",
        "Solid purchase",
        "Happy with my choice",
        "Good value for money",
        "Nice addition to wardrobe",
        "Pleasantly surprised"
    ],
    3: [
        "It's okay, nothing special",
        "Mixed feelings",
        "Average quality",
        "Met expectations, barely",
        "Decent but not amazing",
        "Has pros and cons",
        "On the fence about this",
        "Not bad, not great",
        "Acceptable quality",
        "Room for improvement"
    ],
    2: [
        "Disappointed with purchase",
        "Not as described",
        "Expected better quality",
        "Wouldn't recommend",
        "Below expectations",
        "Quality issues",
        "Not worth the price",
        "Disappointing experience",
        "Had to return it",
        "Save your money"
    ],
    1: [
        "Complete waste of money",
        "Terrible quality",
        "Avoid this product",
        "Very disappointed",
        "Fell apart immediately",
        "Nothing like the photos",
        "Worst purchase ever",
        "Demand a refund",
        "Do not buy",
        "Extremely poor quality"
    ]
}

print("Title templates loaded")


In [None]:
# Review text components by segment (vocabulary and style)
SEGMENT_INTROS = {
    "vip": [
        "As a discerning customer who values quality,",
        "Having shopped here for years,",
        "As someone who appreciates fine craftsmanship,",
        "Given my extensive experience with luxury brands,",
        "As a platinum member,"
    ],
    "premium": [
        "As a regular customer,",
        "I've been shopping here for a while and",
        "Based on my previous purchases,",
        "I have high standards and",
        "As someone who values quality,"
    ],
    "loyal": [
        "I always come back to this brand because",
        "This is my go-to store and",
        "Been a loyal customer for years,",
        "I trust this brand and",
        "Love shopping here,"
    ],
    "regular": [
        "I bought this recently and",
        "Just received my order and",
        "I was looking for something like this and",
        "Decided to try this out and",
        "Needed something for the season and"
    ],
    "new": [
        "This was my first purchase here and",
        "New to this brand but",
        "First time ordering and",
        "Heard good things about this brand,",
        "Took a chance on this new-to-me brand and"
    ]
}

# Category-specific review content
CATEGORY_POSITIVE = {
    "apparel": [
        "The fabric feels amazing against my skin.",
        "Fits true to size, exactly as expected.",
        "The color is even more vibrant in person.",
        "Washes beautifully without losing shape.",
        "The stitching and construction are top-notch.",
        "So comfortable I could wear it all day.",
        "The drape and flow are elegant.",
        "Gets compliments every time I wear it."
    ],
    "footwear": [
        "Incredibly comfortable right out of the box.",
        "Great arch support for all-day wear.",
        "The cushioning is perfect.",
        "True to size, no break-in period needed.",
        "Stylish and functional.",
        "My feet don't hurt after hours of walking.",
        "The quality of the leather is exceptional.",
        "Perfect fit and very well made."
    ],
    "accessories": [
        "Looks exactly like the photos.",
        "The quality exceeded my expectations.",
        "Perfect for everyday use.",
        "Great gift idea.",
        "The craftsmanship is evident.",
        "Elevates any outfit.",
        "Well-made and stylish.",
        "The hardware is solid and doesn't look cheap."
    ]
}

CATEGORY_NEGATIVE = {
    "apparel": [
        "The sizing runs way too small.",
        "The fabric feels cheap and scratchy.",
        "Color faded after just one wash.",
        "Started pilling almost immediately.",
        "The stitching came undone within days.",
        "Nothing like what was shown online.",
        "Shrunk significantly after washing.",
        "The material is see-through."
    ],
    "footwear": [
        "Runs at least a full size small.",
        "Very uncomfortable, no arch support.",
        "The sole started separating after a week.",
        "Gave me blisters even with socks.",
        "The material creased badly.",
        "Not suitable for actual walking.",
        "Quality doesn't match the price.",
        "Had to return due to poor fit."
    ],
    "accessories": [
        "Looks nothing like the picture.",
        "The hardware tarnished immediately.",
        "Feels flimsy and cheap.",
        "Broke within the first week.",
        "Not worth the price at all.",
        "The strap came off.",
        "Smaller than expected.",
        "Poor quality materials."
    ]
}

print("Category templates loaded")


In [None]:
# Material-specific feedback templates
MATERIAL_POSITIVE = {
    "cotton": [
        "The cotton is soft and breathable.",
        "Washes well without shrinking.",
        "Perfect weight cotton for the season.",
        "The cotton quality is excellent."
    ],
    "silk": [
        "The silk feels absolutely luxurious.",
        "Elegant drape and beautiful sheen.",
        "Real silk that looks and feels premium.",
        "The silk is smooth and comfortable."
    ],
    "leather": [
        "Premium leather quality that will age beautifully.",
        "The leather smells and feels authentic.",
        "Buttery soft leather right out of the box.",
        "Beautiful patina developing already."
    ],
    "wool": [
        "Warm without being too heavy.",
        "Quality wool that doesn't itch.",
        "The wool is soft and cozy.",
        "No pilling after multiple wears."
    ],
    "linen": [
        "Perfect breathable linen for summer.",
        "Gets softer with each wash.",
        "Love the natural texture of the linen.",
        "Great quality linen at this price."
    ],
    "synthetic": [
        "The synthetic material is surprisingly comfortable.",
        "Easy care and quick drying.",
        "Holds its shape well.",
        "Good performance fabric."
    ],
    "denim": [
        "The denim quality is outstanding.",
        "Perfect weight denim.",
        "Comfortable stretch without losing shape.",
        "Great wash and color."
    ],
    "cashmere": [
        "Unbelievably soft cashmere.",
        "Worth the investment for real cashmere.",
        "Luxurious and warm.",
        "The cashmere is pill-resistant."
    ],
    "suede": [
        "Beautiful suede texture.",
        "The suede feels premium.",
        "Soft and supple suede.",
        "Lovely suede finish."
    ],
    "polyester": [
        "Easy to care for.",
        "Wrinkle-resistant which is great for travel.",
        "Good quality polyester blend.",
        "Lightweight and comfortable."
    ],
    "modal": [
        "The modal fabric is incredibly soft.",
        "Smooth and silky modal feels luxurious.",
        "Love how breathable the modal is.",
        "Modal drapes beautifully."
    ]
}

MATERIAL_NEGATIVE = {
    "cotton": [
        "The cotton shrunk after washing.",
        "Cheap, thin cotton that's see-through.",
        "Cotton wrinkles terribly.",
        "Feels like low-quality cotton."
    ],
    "silk": [
        "Shows every stain immediately.",
        "Too delicate for regular wear.",
        "Not real silk despite the label.",
        "High maintenance and snags easily."
    ],
    "leather": [
        "Feels like fake leather.",
        "Started peeling after a month.",
        "Stiff and uncomfortable leather.",
        "Smells like chemicals, not leather."
    ],
    "wool": [
        "Incredibly itchy even over a layer.",
        "Started pilling immediately.",
        "The wool feels scratchy and cheap.",
        "Not the quality wool I expected."
    ],
    "linen": [
        "Wrinkles beyond belief.",
        "The linen feels rough and stiff.",
        "Not breathable as expected.",
        "Poor quality linen."
    ],
    "synthetic": [
        "Synthetic material traps heat.",
        "Makes me sweat.",
        "Looks and feels cheap.",
        "Static cling is a problem."
    ],
    "denim": [
        "The denim is too stiff.",
        "Color bled onto everything.",
        "Cheap, thin denim.",
        "No stretch, very uncomfortable."
    ],
    "cashmere": [
        "Pills immediately.",
        "Not real cashmere despite the price.",
        "Very thin for cashmere.",
        "Lost shape after one wear."
    ],
    "suede": [
        "Suede stains at the slightest touch.",
        "Water spots won't come out.",
        "Cheap suede that wears quickly.",
        "The suede looks fake."
    ],
    "polyester": [
        "Feels like cheap polyester.",
        "Traps all the heat.",
        "Static nightmare.",
        "Looks cheap and shiny."
    ],
    "modal": [
        "Modal wrinkles easily.",
        "Shrinks in the wash despite being modal.",
        "Not as soft as expected for modal.",
        "Modal fabric feels flimsy."
    ]
}

# Default fallback for unknown materials
MATERIAL_POSITIVE["default"] = ["The material feels nice.", "Good fabric quality."]
MATERIAL_NEGATIVE["default"] = ["The material feels cheap.", "Poor fabric quality."]

print("Material-specific templates loaded")


In [None]:
# Brand-specific vocabulary and personality
BRAND_VOCABULARY = {
    "Luxe Label": {
        "positive_adj": ["exquisite", "impeccable", "refined", "sophisticated", "premium"],
        "positive_comments": [
            "This is what luxury should feel like.",
            "Worth the investment for quality like this.",
            "The attention to detail is remarkable.",
            "Luxe Label never disappoints."
        ],
        "negative_comments": [
            "Expected more for a luxury brand.",
            "Not living up to the Luxe Label name.",
            "The quality doesn't justify a premium price."
        ]
    },
    "Bold Basics": {
        "positive_adj": ["practical", "versatile", "reliable", "everyday"],
        "positive_comments": [
            "Perfect everyday essential.",
            "Bold Basics delivers great value.",
            "Exactly what I needed for my wardrobe staples.",
            "Can't beat this for the price."
        ],
        "negative_comments": [
            "Too basic, nothing special.",
            "You get what you pay for.",
            "Fine for basics but don't expect more."
        ]
    },
    "Eco Threads": {
        "positive_adj": ["sustainable", "eco-conscious", "ethical", "organic"],
        "positive_comments": [
            "Love supporting sustainable fashion!",
            "Eco Threads makes guilt-free shopping easy.",
            "Great to see recycled materials done right.",
            "Finally, eco-friendly fashion that looks good."
        ],
        "negative_comments": [
            "Sustainability shouldn't mean sacrificing quality.",
            "Eco claims feel more marketing than reality.",
            "Expected better from an eco-conscious brand."
        ]
    },
    "Modern Minimal": {
        "positive_adj": ["sleek", "clean", "contemporary", "timeless"],
        "positive_comments": [
            "The minimalist design is perfect.",
            "Modern Minimal nails the clean aesthetic.",
            "Love the understated elegance.",
            "Timeless piece that works with everything."
        ],
        "negative_comments": [
            "Too plain and boring.",
            "Minimal design, minimal quality.",
            "Nothing unique about this."
        ]
    },
    "Vintage Vibes": {
        "positive_adj": ["retro", "classic", "nostalgic", "timeless"],
        "positive_comments": [
            "Love the vintage-inspired design!",
            "Vintage Vibes captures that retro feel perfectly.",
            "Classic style that never goes out of fashion.",
            "Unique retro look I couldn't find elsewhere."
        ],
        "negative_comments": [
            "Trying too hard to be vintage.",
            "The retro look feels costume-y.",
            "Not authentic vintage style."
        ]
    },
    "Street Wear Co": {
        "positive_adj": ["trendy", "fresh", "urban", "cool"],
        "positive_comments": [
            "Street Wear Co gets the culture.",
            "Fresh streetwear that actually looks good.",
            "Finally, on-trend without trying too hard.",
            "Love the urban vibe."
        ],
        "negative_comments": [
            "Trying too hard to be cool.",
            "Already outdated streetwear.",
            "Not worth the hype."
        ]
    },
    "Urban Style": {
        "positive_adj": ["stylish", "trendy", "fashion-forward", "modern"],
        "positive_comments": [
            "Urban Style keeps me looking current.",
            "Love the fashion-forward designs.",
            "On-trend and well-made.",
            "My go-to for trendy pieces."
        ],
        "negative_comments": [
            "Too trendy, will be out of style soon.",
            "Style over substance.",
            "Looks dated already."
        ]
    },
    "Classic Comfort": {
        "positive_adj": ["comfortable", "traditional", "reliable", "cozy"],
        "positive_comments": [
            "Classic Comfort lives up to its name.",
            "So comfortable I bought multiple colors.",
            "Traditional style that never fails.",
            "Exactly as comfortable as expected."
        ],
        "negative_comments": [
            "Comfortable but looks frumpy.",
            "Too old-fashioned.",
            "Not stylish at all."
        ]
    }
}

# Default brand vocabulary for any unknown brands
BRAND_VOCABULARY["default"] = {
    "positive_adj": ["nice", "good", "quality"],
    "positive_comments": ["Happy with this brand.", "Good brand choice."],
    "negative_comments": ["Expected more from this brand.", "Not impressed."]
}

print(f"Brand vocabulary loaded for {len(BRAND_VOCABULARY) - 1} brands")


In [None]:
# Price-tier relative comments
PRICE_TIER_COMMENTS = {
    "luxury": {
        "positive": [
            "Worth every penny for this level of quality.",
            "You get what you pay for - true luxury.",
            "An investment piece I'll have for years.",
            "The price reflects the exceptional quality."
        ],
        "negative": [
            "Not worth the luxury price tag.",
            "Expected perfection at this price point.",
            "Disappointed for such an expensive item.",
            "Way overpriced for what you get."
        ]
    },
    "premium": {
        "positive": [
            "Great quality justifies the premium price.",
            "Worth the extra money for this quality.",
            "Premium price but premium product.",
            "Happy to pay more for this level of craftsmanship."
        ],
        "negative": [
            "Not worth the premium price.",
            "Expected more at this price point.",
            "Overpriced for the quality received.",
            "Can find similar for less elsewhere."
        ]
    },
    "mid-range": {
        "positive": [
            "Great value for a mid-range price.",
            "Good quality without breaking the bank.",
            "Fair price for fair quality.",
            "Solid purchase at this price."
        ],
        "negative": [
            "Expected more for this price.",
            "Should have spent more or less.",
            "Mediocre quality for a middling price.",
            "Not the best value at this price."
        ]
    },
    "value": {
        "positive": [
            "Amazing value for the price!",
            "Can't believe the quality at this price.",
            "Budget-friendly without feeling cheap.",
            "Great bang for your buck."
        ],
        "negative": [
            "You get what you pay for.",
            "Cheap price, cheap quality.",
            "Should have expected less at this price.",
            "Save up for something better."
        ]
    },
    "budget": {
        "positive": [
            "Fantastic deal!",
            "Exceeded expectations for a budget buy.",
            "Great find at this price point.",
            "Perfect for the price."
        ],
        "negative": [
            "Cheap in every sense of the word.",
            "Too good to be true at this price.",
            "Definitely feels like a budget item.",
            "Pay a little more, get a lot more."
        ]
    }
}

# "mid" is an alias for "mid-range" and shares the same comment lists
PRICE_TIER_COMMENTS["mid"] = PRICE_TIER_COMMENTS["mid-range"]

# Default price tier comments
PRICE_TIER_COMMENTS["default"] = {
    "positive": ["Good value.", "Fair price."],
    "negative": ["Not worth the price.", "Overpriced."]
}

# Size-specific feedback
SIZE_COMMENTS = {
    "positive": [
        "Fits exactly as the size chart indicated.",
        "True to size, no issues with fit.",
        "Perfect fit on the first try.",
        "Size guide was accurate.",
        "Ordered my usual size and it's perfect."
    ],
    "negative": [
        "Runs at least a size small, order up.",
        "Runs large, should have sized down.",
        "Size chart is completely inaccurate.",
        "Had to exchange for a different size.",
        "Inconsistent sizing - some parts fit, others don't.",
        "Very tight in the shoulders/hips/waist despite correct size.",
        "Way too long/short for the size."
    ],
    "neutral": [
        "Runs slightly small but manageable.",
        "A bit loose but still wearable.",
        "Between sizes, went up and it works.",
        "Slightly different fit than expected but okay."
    ]
}

# Seasonal context templates
SEASONAL_COMMENTS = {
    "SS": {  # Spring/Summer
        "positive": [
            "Perfect for spring and summer!",
            "Light and breezy for warm weather.",
            "Great seasonal piece for summer.",
            "Exactly what I needed for the warmer months."
        ],
        "negative": [
            "Too warm for summer despite being a summer piece.",
            "Not as lightweight as expected for summer.",
            "Won't work for hot weather."
        ]
    },
    "FW": {  # Fall/Winter
        "positive": [
            "Perfect for fall weather.",
            "Great transitional piece.",
            "Cozy enough for cooler temperatures.",
            "Ideal fall/winter staple."
        ],
        "negative": [
            "Not warm enough for actual winter.",
            "Too heavy for fall, too light for winter.",
            "Seasonal piece that's hard to layer."
        ]
    },
    "AW": {  # Autumn/Winter
        "positive": [
            "Warm and cozy for winter.",
            "Perfect cold weather essential.",
            "Keeps me warm in winter.",
            "Exactly what you need for cold days."
        ],
        "negative": [
            "Not as warm as expected for winter.",
            "Too bulky for indoor wear.",
            "Winter piece that doesn't actually keep you warm."
        ]
    },
    "CORE": {  # Year-round/Core items
        "positive": [
            "Can wear this year-round.",
            "Great versatile piece for any season.",
            "Works in every season.",
            "True wardrobe essential for all year."
        ],
        "negative": [
            "Too seasonal despite being called a 'core' item.",
            "Not as versatile as expected.",
            "Limited to certain seasons only."
        ]
    }
}

# Default seasonal comments
SEASONAL_COMMENTS["default"] = {
    "positive": ["Works well for the season.", "Seasonally appropriate."],
    "negative": ["Not right for the season.", "Seasonal misfit."]
}

print("Price-tier, size, and seasonal templates loaded")


In [None]:
# Purchase experience templates
PURCHASE_EXPERIENCE_POSITIVE = [
    "Shipping was incredibly fast - received within 2 days!",
    "The packaging was beautiful and secure.",
    "Love the handwritten thank you note included.",
    "Easy checkout process and quick confirmation.",
    "Customer service was helpful when I had questions.",
    "Free shipping made this an even better deal.",
    "The tracking updates were accurate and frequent.",
    "Package arrived earlier than expected.",
    "Everything was wrapped with care.",
    "Smooth ordering experience from start to finish."
]

PURCHASE_EXPERIENCE_NEGATIVE = [
    "Shipping took forever - over 2 weeks!",
    "Package arrived damaged.",
    "Tracking showed delivered but never received.",
    "Had to contact support multiple times.",
    "The checkout process was confusing.",
    "Charged for shipping even though it should be free.",
    "Item was crammed into too small a box.",
    "No communication about delays.",
    "Website kept crashing during checkout.",
    "Had to wait a week for a shipping confirmation."
]

# Return feedback templates
RETURN_REASONS = [
    "The sizing was completely off from the size chart.",
    "Color was different than shown online.",
    "Quality didn't match the price point.",
    "Changed my mind after seeing it in person.",
    "Ordered wrong size by mistake.",
    "Found a better option elsewhere.",
    "Didn't fit my body type as expected.",
    "Material wasn't what I expected.",
    "Arrived with defects.",
    "Just didn't work with my wardrobe."
]

RETURN_EXPERIENCE_POSITIVE = [
    "The return process was seamless.",
    "Got my refund within days.",
    "Free return shipping made it easy.",
    "No questions asked return policy is great.",
    "Exchange process was straightforward."
]

RETURN_EXPERIENCE_NEGATIVE = [
    "Return process was a nightmare.",
    "Still waiting for my refund after weeks.",
    "Had to pay for return shipping out of pocket.",
    "Customer service was unhelpful.",
    "Took multiple attempts to process the return."
]

print("Experience templates loaded")


## Review Generation Functions
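
One way to sanity-check a weighted rating sampler is to draw many ratings and compare the observed shares with the configured ones. The sketch below does this with `random.choices`; `sample_ratings` is a hypothetical stand-in, not the `get_rating` function defined in the next cell, and the weights are assumed to match `RATING_DISTRIBUTION`.

```python
import random
from collections import Counter

# Same weights as RATING_DISTRIBUTION in the configuration cell
rating_distribution = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}


def sample_ratings(n, rng):
    """Draw n ratings with random.choices (an alternative to a cumulative-sum loop)."""
    ratings = list(rating_distribution)
    weights = list(rating_distribution.values())
    return rng.choices(ratings, weights=weights, k=n)


rng = random.Random(42)
n = 20_000
observed = Counter(sample_ratings(n, rng))
for rating, expected_share in rating_distribution.items():
    share = observed[rating] / n
    print(f"{rating} stars: expected {expected_share:.2f}, observed {share:.3f}")
```

With 20,000 draws the observed shares should sit within about a percentage point of the targets; a larger gap would suggest the weights or the sampling loop are off.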


In [None]:
def get_rating():
    """Generate a rating based on realistic distribution."""
    r = random.random()
    cumulative = 0
    for rating, prob in RATING_DISTRIBUTION.items():
        cumulative += prob
        if r <= cumulative:
            return rating
    return 5  # Fallback if float round-off leaves the cumulative sum just below r


def get_category_key(category_level_1):
    """Map product category to template key with validation."""
    category_map = {
        "apparel": "apparel",
        "footwear": "footwear", 
        "accessories": "accessories"
    }
    if category_level_1 and category_level_1 not in category_map:
        UNEXPECTED_VALUES["categories"].add(category_level_1)
    return category_map.get(category_level_1, "accessories")


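def safe_get(row, field, default=None):
    """Safely get a field from a PySpark Row object with a default value."""
    try:
        value = row[field]
        return value if value is not None else default
    except (KeyError, ValueError, IndexError):
        # Row lookups by an unknown field name raise ValueError; the other
        # exception types cover dict-like or positional access.
        return default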


def get_material_key(material_primary):
    """Map material to template key with fallback."""
    if not material_primary:
        return "default"
    material_lower = material_primary.lower()
    if material_lower in MATERIAL_POSITIVE:
        return material_lower
    # Try partial matching for common variants
    for key in MATERIAL_POSITIVE.keys():
        if key in material_lower or material_lower in key:
            return key
    UNEXPECTED_VALUES["materials"].add(material_primary)
    return "default"


def get_brand_vocab(brand):
    """Get brand vocabulary with fallback."""
    if brand in BRAND_VOCABULARY:
        return BRAND_VOCABULARY[brand]
    return BRAND_VOCABULARY["default"]


def get_price_tier_comments(price_tier):
    """Get price tier comments with fallback."""
    if not price_tier:
        return PRICE_TIER_COMMENTS["default"]
    price_lower = price_tier.lower()
    if price_lower in PRICE_TIER_COMMENTS:
        return PRICE_TIER_COMMENTS[price_lower]
    UNEXPECTED_VALUES["price_tiers"].add(price_tier)
    return PRICE_TIER_COMMENTS["default"]


def get_season_key(season_code):
    """Extract season key from season code (e.g., 'SS24' -> 'SS')."""
    if not season_code:
        return "default"
    # Extract first two characters which typically represent season
    season_prefix = season_code[:2].upper() if len(season_code) >= 2 else "default"
    if season_prefix in SEASONAL_COMMENTS:
        return season_prefix
    # Check for CORE/year-round items
    if "CORE" in season_code.upper():
        return "CORE"
    return "default"


def generate_product_review(customer, product, rating):
    """Generate an enhanced product review with material, brand, price, size, and seasonal context."""
    segment = customer["segment"]
    category_key = get_category_key(product["category_level_1"])
    material_key = get_material_key(safe_get(product, "material_primary"))
    brand_vocab = get_brand_vocab(product["brand"])
    price_comments = get_price_tier_comments(safe_get(product, "price_tier"))
    season_key = get_season_key(safe_get(product, "season_code"))
    
    # Validate segment
    if segment not in SEGMENT_INTROS:
        UNEXPECTED_VALUES["segments"].add(segment)
    
    # Build review text
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    content_parts = []
    
    if rating >= 4:
        # Positive review: category + material + brand + price + size + seasonal
        content_parts.append(random.choice(CATEGORY_POSITIVE[category_key]))
        
        # 70% chance to add material comment
        if random.random() < 0.7:
            content_parts.append(random.choice(MATERIAL_POSITIVE.get(material_key, MATERIAL_POSITIVE["default"])))
        
        # 50% chance to add brand-specific comment
        if random.random() < 0.5:
            content_parts.append(random.choice(brand_vocab["positive_comments"]))
        
        # 40% chance to add price comment
        if random.random() < 0.4:
            content_parts.append(random.choice(price_comments["positive"]))
        
        # 30% chance to add size comment
        if random.random() < 0.3:
            content_parts.append(random.choice(SIZE_COMMENTS["positive"]))
        
        # 25% chance to add seasonal comment
        if random.random() < 0.25 and season_key != "default":
            season_comments = SEASONAL_COMMENTS.get(season_key, SEASONAL_COMMENTS["default"])
            content_parts.append(random.choice(season_comments["positive"]))
        
        conclusion = random.choice([
            "Would definitely recommend!",
            "Will be ordering more.",
            "Very happy with this purchase.",
            "Exceeded my expectations.",
            "Great addition to my collection."
        ])
        
    elif rating == 3:
        # Mixed review
        pos = random.choice(CATEGORY_POSITIVE[category_key])
        neg = random.choice(CATEGORY_NEGATIVE[category_key])
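        # Lowercase only the first character so words like "I" or brand names keep their capitalization
        content_parts = [pos, "However, " + neg[:1].lower() + neg[1:]]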
        
        # 40% chance to add neutral size comment
        if random.random() < 0.4:
            content_parts.append(random.choice(SIZE_COMMENTS["neutral"]))
        
        conclusion = random.choice([
            "It's acceptable but not exceptional.",
            "Might consider alternatives next time.",
            "Not sure if I'd buy again.",
            "It serves its purpose.",
            "On the fence about recommending."
        ])
        
    else:
        # Negative review: category + material + brand + price
        content_parts.append(random.choice(CATEGORY_NEGATIVE[category_key]))
        
        # 70% chance to add material complaint
        if random.random() < 0.7:
            content_parts.append(random.choice(MATERIAL_NEGATIVE.get(material_key, MATERIAL_NEGATIVE["default"])))
        
        # 50% chance to add brand-specific negative
        if random.random() < 0.5:
            content_parts.append(random.choice(brand_vocab["negative_comments"]))
        
        # 60% chance to add price complaint
        if random.random() < 0.6:
            content_parts.append(random.choice(price_comments["negative"]))
        
        # 50% chance to add size complaint
        if random.random() < 0.5:
            content_parts.append(random.choice(SIZE_COMMENTS["negative"]))
        
        conclusion = random.choice([
            "Would not recommend.",
            "Returning this item.",
            "Very disappointed overall.",
            "Save your money for something better.",
            "Expected much more for this price."
        ])
    
    # Use actual product name instead of just category
    product_name = safe_get(product, "product_name", f"{product['brand']} {product['category_level_3']}")
    color_name = safe_get(product, "color_name", "")
    color = color_name.lower() if color_name else ""
    
    if color:
        product_mention = f"The {product_name} in {color}"
    else:
        product_mention = f"The {product_name}"
    
    sentiment_phrase = 'exceeded expectations' if rating >= 4 else 'was a letdown' if rating <= 2 else 'was okay'
    review_text = f"{intro} {product_mention} {sentiment_phrase}. {' '.join(content_parts)} {conclusion}"
    
    return review_text


def generate_purchase_experience(customer, rating):
    """Generate a purchase experience review."""
    segment = customer["segment"]
    
    # Validate segment
    if segment not in SEGMENT_INTROS:
        UNEXPECTED_VALUES["segments"].add(segment)
    
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    if rating >= 4:
        experiences = random.sample(PURCHASE_EXPERIENCE_POSITIVE, min(3, len(PURCHASE_EXPERIENCE_POSITIVE)))
        conclusion = "Will definitely order again!"
    elif rating == 3:
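        pos = random.choice(PURCHASE_EXPERIENCE_POSITIVE)
        neg = random.choice(PURCHASE_EXPERIENCE_NEGATIVE)
        # Lowercase only the first character so the rest of the sentence keeps its capitalization
        experiences = [pos, "But " + neg[:1].lower() + neg[1:]]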
        conclusion = "Overall an okay experience."
    else:
        experiences = random.sample(PURCHASE_EXPERIENCE_NEGATIVE, min(3, len(PURCHASE_EXPERIENCE_NEGATIVE)))
        conclusion = "Very frustrating experience overall."
    
    sentiment = 'excellent' if rating >= 4 else 'disappointing' if rating <= 2 else 'mixed'
    review_text = f"{intro} my recent order experience was {sentiment}. {' '.join(experiences)} {conclusion}"
    
    return review_text


def generate_return_feedback(customer, product, rating):
    """Generate return feedback with enhanced product details."""
    segment = customer["segment"]
    
    # Validate segment
    if segment not in SEGMENT_INTROS:
        UNEXPECTED_VALUES["segments"].add(segment)
    
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    reason = random.choice(RETURN_REASONS)
    
    # Use actual product name
    product_name = safe_get(product, "product_name", f"{product['brand']} {product['category_level_3']}")
    product_mention = f"the {product_name}"
    
    if rating >= 3:
        return_exp = random.choice(RETURN_EXPERIENCE_POSITIVE)
        conclusion = "Despite the return, I'd shop here again."
    else:
        return_exp = random.choice(RETURN_EXPERIENCE_NEGATIVE)
        conclusion = "This experience has made me reconsider shopping here."
    
    review_text = f"{intro} I had to return {product_mention}. {reason} {return_exp} {conclusion}"
    
    return review_text


def calculate_helpful_votes(rating, review_length):
    """Calculate helpful votes based on rating extremes and review length."""
    # Extreme ratings (1 or 5) get more helpful votes
    base_helpful = random.randint(0, 50) if rating in [1, 5] else random.randint(0, 20)
    # Longer reviews get a bonus (up to +10)
    length_bonus = min(10, review_length // 100)
    return base_helpful + length_bonus


def extract_review_metadata(review_text, rating):
    """Extract metadata flags from review text for enhanced Vector Search filtering."""
    text_lower = review_text.lower()
    
    return {
        "mentions_sizing": any(word in text_lower for word in ["size", "fit", "tight", "loose", "small", "large", "runs"]),
        "mentions_quality": any(word in text_lower for word in ["quality", "material", "fabric", "cheap", "premium", "well-made", "poorly made"]),
        "mentions_delivery": any(word in text_lower for word in ["shipping", "delivery", "arrived", "package", "tracking"]),
        "mentions_price": any(word in text_lower for word in ["price", "worth", "value", "expensive", "cheap", "cost", "money"]),
        "mentions_comfort": any(word in text_lower for word in ["comfort", "soft", "scratchy", "itchy", "cozy", "breathable"]),
        "has_recommendation": any(word in text_lower for word in ["recommend", "would buy", "will order", "suggest"]),
        "word_count": len(review_text.split()),
        "sentiment_score": round((rating - 3) / 2, 2)  # -1.0 to 1.0 scale
    }


print("Enhanced review generation functions defined")


## Generate Reviews
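
The generation loop in this section relies on `get_rating()`, defined earlier in the notebook, to draw star ratings. As a rough illustration of what such a draw might look like, the sketch below samples from the `RATING_DISTRIBUTION` weights with `random.choices`. The name `sample_rating` and the local seed are assumptions made for illustration, not the notebook's actual implementation.

```python
import random
from collections import Counter

# Weights copied from the notebook configuration
RATING_DISTRIBUTION = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}


def sample_rating(rng):
    """Hypothetical stand-in for get_rating(): one weighted draw from RATING_DISTRIBUTION."""
    ratings = list(RATING_DISTRIBUTION.keys())
    weights = list(RATING_DISTRIBUTION.values())
    return rng.choices(ratings, weights=weights, k=1)[0]


rng = random.Random(42)  # local generator so the sketch does not disturb the global random state
draws = [sample_rating(rng) for _ in range(10_000)]
observed = Counter(draws)

# Print the observed share of each rating, highest rating first
for rating in sorted(observed, reverse=True):
    print(f"{rating} stars: {observed[rating] / len(draws):.1%}")
```

With enough draws, the observed shares should land close to the configured weights.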


In [None]:
from datetime import date, timedelta

def generate_all_reviews():
    """Generate all reviews with enhanced metadata for Vector Search."""
    reviews = []
    
    # Calculate counts per type
    product_review_count = int(TOTAL_REVIEWS * PRODUCT_REVIEW_PCT)
    purchase_exp_count = int(TOTAL_REVIEWS * PURCHASE_EXPERIENCE_PCT)
    return_feedback_count = TOTAL_REVIEWS - product_review_count - purchase_exp_count
    
    print(f"Generating {product_review_count:,} product reviews...")
    print(f"Generating {purchase_exp_count:,} purchase experience reviews...")
    print(f"Generating {return_feedback_count:,} return feedback reviews...")
    
    # Date range for reviews (last 90 days)
    end_date = date.today()
    start_date = end_date - timedelta(days=90)
    
    review_id = 1
    
    # Generate Product Reviews
    for i in range(product_review_count):
        customer = random.choice(customers)
        product = random.choice(products)
        rating = get_rating()
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        # Generate review text
        review_text = generate_product_review(customer, product, rating)
        
        # Extract metadata for Vector Search filtering
        metadata = extract_review_metadata(review_text, rating)
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": product["product_key"],
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            "review_title": random.choice(TITLE_TEMPLATES[rating]),
            "review_text": review_text,
            "review_type": "product_review",
            "verified_purchase": random.random() > 0.1,  # 90% verified
            "helpful_votes": calculate_helpful_votes(rating, len(review_text)),
            "source_channel": random.choice(["web", "web", "app", "email"]),  # Web weighted
            # New metadata columns for Vector Search filtering
            "mentions_sizing": metadata["mentions_sizing"],
            "mentions_quality": metadata["mentions_quality"],
            "mentions_delivery": metadata["mentions_delivery"],
            "mentions_price": metadata["mentions_price"],
            "mentions_comfort": metadata["mentions_comfort"],
            "has_recommendation": metadata["has_recommendation"],
            "word_count": metadata["word_count"],
            "sentiment_score": metadata["sentiment_score"],
            # Product context for richer filtering
            "product_category": product["category_level_1"],
            "product_brand": product["brand"],
            "customer_segment": customer["segment"]
        }
        reviews.append(review)
        review_id += 1
        
        # review_id has already been incremented, so report progress from the list length
        if len(reviews) % 1000 == 0:
            print(f"  Generated {len(reviews):,} reviews...")
    
    # Generate Purchase Experience Reviews
    for i in range(purchase_exp_count):
        customer = random.choice(customers)
        rating = get_rating()
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        # Generate review text
        review_text = generate_purchase_experience(customer, rating)
        
        # Extract metadata
        metadata = extract_review_metadata(review_text, rating)
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": None,  # No specific product for purchase experience
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            # Title is fully determined by the rating band
            "review_title": (
                "Great shopping experience" if rating >= 4 else
                "Average experience" if rating == 3 else
                "Poor customer service"
            ),
            "review_text": review_text,
            "review_type": "purchase_experience",
            "verified_purchase": True,
            "helpful_votes": calculate_helpful_votes(rating, len(review_text)),
            "source_channel": random.choice(["web", "app", "email", "in_store"]),
            # Metadata columns
            "mentions_sizing": metadata["mentions_sizing"],
            "mentions_quality": metadata["mentions_quality"],
            "mentions_delivery": metadata["mentions_delivery"],
            "mentions_price": metadata["mentions_price"],
            "mentions_comfort": metadata["mentions_comfort"],
            "has_recommendation": metadata["has_recommendation"],
            "word_count": metadata["word_count"],
            "sentiment_score": metadata["sentiment_score"],
            # No product context for purchase experience
            "product_category": None,
            "product_brand": None,
            "customer_segment": customer["segment"]
        }
        reviews.append(review)
        review_id += 1
    
    # Generate Return Feedback
    for i in range(return_feedback_count):
        customer = random.choice(customers)
        product = random.choice(products)
        # Return feedback tends to be more negative
        rating = random.choices([1, 2, 3, 4, 5], weights=[0.15, 0.25, 0.35, 0.20, 0.05])[0]
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        # Generate review text
        review_text = generate_return_feedback(customer, product, rating)
        
        # Extract metadata
        metadata = extract_review_metadata(review_text, rating)
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": product["product_key"],
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            # Title is fully determined by the rating band
            "review_title": (
                "Easy return process" if rating >= 4 else
                "Return was okay" if rating == 3 else
                "Frustrating return experience"
            ),
            "review_text": review_text,
            "review_type": "return_feedback",
            "verified_purchase": True,
            "helpful_votes": calculate_helpful_votes(rating, len(review_text)),
            "source_channel": random.choice(["web", "app", "email"]),
            # Metadata columns
            "mentions_sizing": metadata["mentions_sizing"],
            "mentions_quality": metadata["mentions_quality"],
            "mentions_delivery": metadata["mentions_delivery"],
            "mentions_price": metadata["mentions_price"],
            "mentions_comfort": metadata["mentions_comfort"],
            "has_recommendation": metadata["has_recommendation"],
            "word_count": metadata["word_count"],
            "sentiment_score": metadata["sentiment_score"],
            # Product context
            "product_category": product["category_level_1"],
            "product_brand": product["brand"],
            "customer_segment": customer["segment"]
        }
        reviews.append(review)
        review_id += 1
    
    print(f"\nGenerated {len(reviews):,} total reviews")
    
    # Report any unexpected values encountered
    if any(UNEXPECTED_VALUES.values()):
        print("\nWARNING: Unexpected values encountered during generation:")
        for key, values in UNEXPECTED_VALUES.items():
            if values:
                print(f"  {key}: {values}")
    
    return reviews

# Generate all reviews
all_reviews = generate_all_reviews()


In [None]:
# Preview sample reviews with new metadata
print("\n" + "="*80)
print("SAMPLE REVIEWS (WITH ENHANCED METADATA)")
print("="*80)

# Show one of each type
for review_type in ["product_review", "purchase_experience", "return_feedback"]:
    sample = next(r for r in all_reviews if r["review_type"] == review_type)
    print(f"\n--- {review_type.upper()} ---")
    print(f"Review ID: {sample['review_id']}")
    print(f"Customer Key: {sample['customer_key']} | Segment: {sample['customer_segment']}")
    # The keys always exist (set to None for purchase_experience), so fall back with `or`
    print(f"Product Key: {sample['product_key']} | Brand: {sample['product_brand'] or 'N/A'} | Category: {sample['product_category'] or 'N/A'}")
    print(f"Rating: {'*' * sample['rating']} | Sentiment: {sample['sentiment_score']}")
    print(f"Title: {sample['review_title']}")
    text = sample['review_text']
    excerpt = text[:300] + "..." if len(text) > 300 else text
    print(f"Text ({sample['word_count']} words): {excerpt}")
    print(f"Metadata: sizing={sample['mentions_sizing']}, quality={sample['mentions_quality']}, price={sample['mentions_price']}, comfort={sample['mentions_comfort']}")


## Create Delta Table
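
The DataFrame in this section is built from a list of Python dicts against an explicit schema, so a missing key or a value of the wrong Python type is easier to diagnose before the data reaches Spark. The sketch below shows one possible pre-check in plain Python. It is an illustration under assumptions: `EXPECTED_TYPES` covers only a few of the 23 columns, and `find_schema_problems` is a hypothetical helper that is not part of the notebook.

```python
from datetime import date

# Subset of the review columns and the Python types the schema expects (illustrative)
EXPECTED_TYPES = {
    "review_id": str,
    "customer_key": int,
    "review_date": date,
    "rating": int,
    "review_text": str,
    "verified_purchase": bool,
    "sentiment_score": float,
}


def find_schema_problems(record):
    """Hypothetical helper: return a list of readable problems for one review dict."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif record[column] is not None and not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    return problems


good = {
    "review_id": "REV_00000001",
    "customer_key": 17,
    "review_date": date(2025, 1, 15),
    "rating": 5,
    "review_text": "Great fit.",
    "verified_purchase": True,
    "sentiment_score": 1.0,
}
# Break the record in two ways: wrong type for rating, missing review_date
bad = dict(good, rating="5")
del bad["review_date"]

print(find_schema_problems(good))
print(find_schema_problems(bad))
```

In the notebook, a check like this could be run over `all_reviews` before `spark.createDataFrame` is called.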


In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DateType, DoubleType

# Define enhanced schema with metadata columns for Vector Search filtering
review_schema = StructType([
    # Core review fields
    StructField("review_id", StringType(), False),
    StructField("customer_key", IntegerType(), True),
    StructField("product_key", IntegerType(), True),
    StructField("order_number", StringType(), True),
    StructField("review_date", DateType(), True),
    StructField("rating", IntegerType(), True),
    StructField("review_title", StringType(), True),
    StructField("review_text", StringType(), True),  # Primary column for vector embedding
    StructField("review_type", StringType(), True),
    StructField("verified_purchase", BooleanType(), True),
    StructField("helpful_votes", IntegerType(), True),
    StructField("source_channel", StringType(), True),
    
    # Metadata flags for enhanced Vector Search filtering
    StructField("mentions_sizing", BooleanType(), True),
    StructField("mentions_quality", BooleanType(), True),
    StructField("mentions_delivery", BooleanType(), True),
    StructField("mentions_price", BooleanType(), True),
    StructField("mentions_comfort", BooleanType(), True),
    StructField("has_recommendation", BooleanType(), True),
    StructField("word_count", IntegerType(), True),
    StructField("sentiment_score", DoubleType(), True),  # -1.0 to 1.0 scale
    
    # Denormalized context for filtering without joins
    StructField("product_category", StringType(), True),
    StructField("product_brand", StringType(), True),
    StructField("customer_segment", StringType(), True)
])

# Create DataFrame
reviews_df = spark.createDataFrame(all_reviews, schema=review_schema)

print(f"Created DataFrame with {reviews_df.count():,} reviews")
print(f"\nSchema includes {len(review_schema.fields)} columns:")
reviews_df.printSchema()


In [None]:
# Write to Delta table
reviews_df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(FULL_TABLE_NAME)

print(f"Successfully created table: {FULL_TABLE_NAME}")


In [None]:
# Add table and column comments
spark.sql(f"""
    ALTER TABLE {FULL_TABLE_NAME}
    SET TBLPROPERTIES (
        'comment' = 'Customer reviews for Vector Search RAG. Contains product reviews, purchase experiences, and return feedback linked to gold_customer_dim and gold_product_dim. Enhanced with metadata flags for filtering.'
    )
""")

# Core column comments
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_id COMMENT 'Unique review identifier'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN customer_key COMMENT 'FK to gold_customer_dim'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN product_key COMMENT 'FK to gold_product_dim (NULL for purchase_experience)'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN order_number COMMENT 'Associated order number'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_date COMMENT 'Date the review was submitted'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN rating COMMENT 'Star rating 1-5'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_title COMMENT 'Review headline/title'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_text COMMENT 'Full review content - PRIMARY column for vector embedding'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_type COMMENT 'Type: product_review, purchase_experience, or return_feedback'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN verified_purchase COMMENT 'Whether the purchase was verified'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN helpful_votes COMMENT 'Number of helpful votes from other customers'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN source_channel COMMENT 'Channel where review was submitted: web, app, email, in_store'")

# Metadata flag comments for Vector Search filtering
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN mentions_sizing COMMENT 'Review mentions sizing/fit topics'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN mentions_quality COMMENT 'Review mentions quality/material topics'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN mentions_delivery COMMENT 'Review mentions shipping/delivery topics'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN mentions_price COMMENT 'Review mentions price/value topics'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN mentions_comfort COMMENT 'Review mentions comfort topics'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN has_recommendation COMMENT 'Review contains recommendation language'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN word_count COMMENT 'Number of words in review text'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN sentiment_score COMMENT 'Sentiment score derived from the star rating as (rating - 3) / 2, from -1.0 (negative) to 1.0 (positive)'")

# Denormalized context column comments
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN product_category COMMENT 'Denormalized product category for filtering (apparel, footwear, accessories)'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN product_brand COMMENT 'Denormalized product brand for filtering'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN customer_segment COMMENT 'Denormalized customer segment for filtering (vip, premium, loyal, regular, new)'")

print("Table and column comments added (23 columns)")


## Validation
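
When reading the rating distribution in this section, note that the table-wide percentages are not expected to match `RATING_DISTRIBUTION` exactly: return feedback (20% of reviews) is drawn with its own, more negative weights. Assuming `get_rating()` samples from `RATING_DISTRIBUTION` for the remaining 80%, the expected blend can be worked out as in the sketch below. The names `RETURN_WEIGHTS` and `expected_overall` are illustrative.

```python
# Weights copied from the notebook
RATING_DISTRIBUTION = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}
RETURN_WEIGHTS = {1: 0.15, 2: 0.25, 3: 0.35, 4: 0.20, 5: 0.05}
RETURN_FEEDBACK_PCT = 0.20

# Blend the two distributions by their share of the total review count
expected_overall = {
    rating: round(
        (1 - RETURN_FEEDBACK_PCT) * RATING_DISTRIBUTION[rating]
        + RETURN_FEEDBACK_PCT * RETURN_WEIGHTS[rating],
        4,
    )
    for rating in sorted(RATING_DISTRIBUTION, reverse=True)
}

for rating, share in expected_overall.items():
    print(f"{rating} stars: expected about {share:.0%}")
```

Under these assumptions the blend comes out at roughly 37%, 24%, 19%, 13%, and 7% for 5 through 1 stars, which is the baseline to compare the query output against.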


In [None]:
# Validate the generated data
print("VALIDATION RESULTS")
print("="*60)

# Total count
total = spark.sql(f"SELECT COUNT(*) as cnt FROM {FULL_TABLE_NAME}").collect()[0]["cnt"]
print(f"\nTotal reviews: {total:,}")

# By review type
print("\n--- Reviews by Type ---")
spark.sql(f"""
    SELECT review_type, COUNT(*) as count, ROUND(AVG(rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME}
    GROUP BY review_type
    ORDER BY count DESC
""").show()

# Rating distribution
print("\n--- Rating Distribution ---")
spark.sql(f"""
    SELECT rating, COUNT(*) as count, 
           ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as pct
    FROM {FULL_TABLE_NAME}
    GROUP BY rating
    ORDER BY rating DESC
""").show()


In [None]:
# Reviews by customer segment
print("--- Reviews by Customer Segment ---")
spark.sql(f"""
    SELECT c.segment, COUNT(*) as review_count, ROUND(AVG(r.rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_customer_dim c ON r.customer_key = c.customer_key
    GROUP BY c.segment
    ORDER BY review_count DESC
""").show()

# Reviews by product category
print("\n--- Reviews by Product Category ---")
spark.sql(f"""
    SELECT p.category_level_1, COUNT(*) as review_count, ROUND(AVG(r.rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_product_dim p ON r.product_key = p.product_key
    WHERE r.product_key IS NOT NULL
    GROUP BY p.category_level_1
    ORDER BY review_count DESC
""").show()


In [None]:
# Sample low-rated reviews (good for vector search testing)
print("--- Sample Low-Rated Reviews (Vector Search Candidates) ---")
spark.sql(f"""
    SELECT r.review_id, r.rating, r.review_type, 
           SUBSTRING(r.review_text, 1, 150) as review_excerpt,
           p.category_level_1
    FROM {FULL_TABLE_NAME} r
    LEFT JOIN {CATALOG}.{SCHEMA}.gold_product_dim p ON r.product_key = p.product_key
    WHERE r.rating <= 2
    LIMIT 5
""").show(truncate=False)


In [None]:
# Sample VIP customer reviews
print("--- Sample VIP Customer Reviews ---")
spark.sql(f"""
    SELECT r.review_id, r.rating, c.segment, c.loyalty_tier,
           SUBSTRING(r.review_text, 1, 200) as review_excerpt
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_customer_dim c ON r.customer_key = c.customer_key
    WHERE c.segment = 'vip'
    LIMIT 3
""").show(truncate=False)


In [None]:
# Validate new metadata columns
print("--- Metadata Flags Distribution ---")
spark.sql(f"""
    SELECT 
        SUM(CASE WHEN mentions_sizing THEN 1 ELSE 0 END) as sizing_mentions,
        SUM(CASE WHEN mentions_quality THEN 1 ELSE 0 END) as quality_mentions,
        SUM(CASE WHEN mentions_delivery THEN 1 ELSE 0 END) as delivery_mentions,
        SUM(CASE WHEN mentions_price THEN 1 ELSE 0 END) as price_mentions,
        SUM(CASE WHEN mentions_comfort THEN 1 ELSE 0 END) as comfort_mentions,
        SUM(CASE WHEN has_recommendation THEN 1 ELSE 0 END) as has_recommendations,
        ROUND(AVG(word_count), 1) as avg_word_count,
        ROUND(AVG(sentiment_score), 2) as avg_sentiment
    FROM {FULL_TABLE_NAME}
""").show()

print("\n--- Sentiment Score by Rating ---")
spark.sql(f"""
    SELECT 
        rating,
        COUNT(*) as count,
        ROUND(AVG(sentiment_score), 2) as avg_sentiment,
        ROUND(AVG(word_count), 0) as avg_words
    FROM {FULL_TABLE_NAME}
    GROUP BY rating
    ORDER BY rating DESC
""").show()

print("\n--- Reviews by Brand (Denormalized) ---")
spark.sql(f"""
    SELECT 
        COALESCE(product_brand, 'N/A - Purchase Experience') as brand,
        COUNT(*) as review_count,
        ROUND(AVG(rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME}
    GROUP BY product_brand
    ORDER BY review_count DESC
""").show()

print("\n--- Reviews by Customer Segment (Denormalized) ---")
spark.sql(f"""
    SELECT 
        customer_segment,
        COUNT(*) as review_count,
        ROUND(AVG(rating), 2) as avg_rating,
        ROUND(AVG(sentiment_score), 2) as avg_sentiment
    FROM {FULL_TABLE_NAME}
    GROUP BY customer_segment
    ORDER BY review_count DESC
""").show()


## Summary

### What was created:
- **Table**: `{CATALOG}.{SCHEMA}.gold_customer_reviews` (catalog and schema are read from `config.yaml`)
- **Reviews**: ~5,000 realistic customer reviews
- **Columns**: 23 total (12 core + 8 metadata flags + 3 denormalized context)
- **Linked to**: Actual customers and products from gold layer tables

### Review Types:
- **Product Reviews** (60%): Quality, fit, style, material, brand, price, size, seasonal feedback
- **Purchase Experience** (20%): Shipping, packaging, service
- **Return Feedback** (20%): Return reasons and experience
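
The percentages above translate into row counts as follows: the first two types are truncated with `int()`, and return feedback takes the remainder, so the three counts always sum to `TOTAL_REVIEWS`. A small worked sketch using the notebook's configured values:

```python
TOTAL_REVIEWS = 5000
PRODUCT_REVIEW_PCT = 0.60
PURCHASE_EXPERIENCE_PCT = 0.20

product_review_count = int(TOTAL_REVIEWS * PRODUCT_REVIEW_PCT)
purchase_exp_count = int(TOTAL_REVIEWS * PURCHASE_EXPERIENCE_PCT)
# The remainder goes to return feedback, absorbing any truncation from int()
return_feedback_count = TOTAL_REVIEWS - product_review_count - purchase_exp_count

print(product_review_count, purchase_exp_count, return_feedback_count)
```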

### Enhanced Features:
- **Material-specific feedback**: Cotton, silk, leather, wool, linen, cashmere, etc.
- **Brand-specific vocabulary**: Luxe Label (luxury), Eco Threads (sustainable), etc.
- **Price-tier context**: Luxury, premium, mid-range, value, budget
- **Size feedback**: Positive, negative, and neutral sizing comments
- **Seasonal context**: Spring/Summer, Fall/Winter, Autumn/Winter, Core items
- **Metadata flags**: `mentions_sizing`, `mentions_quality`, `mentions_delivery`, etc.
- **Sentiment score**: Derived from the star rating as `(rating - 3) / 2`, ranging from -1.0 to 1.0
- **Denormalized context**: `product_category`, `product_brand`, `customer_segment`
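
One caveat on the metadata flags: `extract_review_metadata` uses plain substring checks, so a keyword such as `fit` also matches inside `outfit` or `benefit`. If stricter flags are wanted, whole-word matching with a regular expression is one option. The sketch below is an illustration only; `mentions_any` is a hypothetical helper and is not part of the notebook.

```python
import re

# Sizing keywords copied from extract_review_metadata
SIZING_WORDS = ["size", "fit", "tight", "loose", "small", "large", "runs"]


def mentions_any(text, keywords):
    """Hypothetical helper: True if any keyword appears as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample_text = "This outfit is a great gift"
# Substring check, as in the notebook: "fit" is found inside "outfit"
substring_hit = any(word in sample_text.lower() for word in SIZING_WORDS)
# Whole-word check: no sizing keyword stands alone in the sentence
whole_word_hit = mentions_any(sample_text, SIZING_WORDS)

print(substring_hit, whole_word_hit)
print(mentions_any("The fit is perfect", SIZING_WORDS))
```

Whole-word matching trades false positives for misses on inflected forms such as `fits` or `sizing`, so the keyword lists would need to include those forms explicitly.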

### Next Steps:
1. Create Vector Search endpoint in Databricks UI
2. Create Delta Sync index on `review_text` column
3. Use metadata columns for pre-filtering in Vector Search queries
4. Integrate with multi-agent supervisor
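
For step 2, Delta Sync indexes in Databricks Vector Search require Change Data Feed to be enabled on the source table, which the write step in this notebook does not set explicitly. A minimal sketch of building that statement follows. `enable_cdf_sql` is a hypothetical helper and the table name is a placeholder; in a notebook the resulting statement would be passed to `spark.sql`.

```python
def enable_cdf_sql(full_table_name):
    """Hypothetical helper: build the ALTER TABLE statement that turns on Change Data Feed."""
    return (
        f"ALTER TABLE {full_table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


# Placeholder name; in the notebook this would be FULL_TABLE_NAME
statement = enable_cdf_sql("my_catalog.my_schema.gold_customer_reviews")
print(statement)
# In a Databricks notebook: spark.sql(statement)
```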


In [None]:
print("\n" + "="*60)
print("ENHANCED CUSTOMER REVIEWS GENERATION COMPLETE")
print("="*60)
print(f"\nTable: {FULL_TABLE_NAME}")
print(f"Total Reviews: {total:,}")
print("Schema: 23 columns (12 core + 8 metadata + 3 context)")

print("\nEnhanced Features:")
print("  - Material-specific feedback (cotton, silk, leather, wool, etc.)")
print("  - Brand-specific vocabulary (8 brands with unique personalities)")
print("  - Price-tier context (luxury, premium, mid-range, value, budget)")
print("  - Size feedback (positive, negative, neutral)")
print("  - Seasonal context (SS, FW, AW, CORE)")
print("  - Metadata flags for Vector Search filtering")
print("  - Sentiment scores (-1.0 to 1.0)")

print("\nReady for Vector Search index creation in Databricks UI:")
print("  1. Create endpoint (if needed)")
print("  2. Create Delta Sync index on 'review_text' column")
print("  3. Use managed embeddings (databricks-bge-large-en)")
print("  4. Add filter columns: product_category, product_brand, customer_segment")
print("  5. Use metadata flags for pre-filtering queries")


