# Customer Reviews Data Generation

Generate realistic synthetic customer reviews for the Fashion Retail Intelligence Platform.

**Purpose**: Create the `gold_customer_reviews` table with ~5,000 realistic reviews linked to actual customers and products, to serve as the source for Vector Search RAG.

**Key Features**:
- Reviews linked to actual `customer_key` from `gold_customer_dim`
- Reviews linked to actual `product_key` from `gold_product_dim`
- Content varies by customer segment (`vip`, `premium`, `loyal`, `regular`, `new`)
- Content varies by product category (apparel, footwear, accessories)
- Realistic rating distribution (skewed positive but with variance)
- Three review types: product_review, purchase_experience, return_feedback


## Configuration


In [None]:
# Configuration
CATALOG = "juan_dev"
SCHEMA = "retail"
TABLE_NAME = "gold_customer_reviews"
FULL_TABLE_NAME = f"{CATALOG}.{SCHEMA}.{TABLE_NAME}"

# Review generation settings
TOTAL_REVIEWS = 5000
PRODUCT_REVIEW_PCT = 0.60  # 60% product reviews
PURCHASE_EXPERIENCE_PCT = 0.20  # 20% purchase experience
RETURN_FEEDBACK_PCT = 0.20  # 20% return feedback

# Rating distribution (realistic - skewed positive)
RATING_DISTRIBUTION = {
    5: 0.45,  # 45% - Enthusiastic praise
    4: 0.25,  # 25% - Positive with minor notes
    3: 0.15,  # 15% - Mixed/neutral
    2: 0.10,  # 10% - Disappointed
    1: 0.05   # 5% - Strong complaints
}

RANDOM_SEED = 42

print(f"Target table: {FULL_TABLE_NAME}")
print(f"Total reviews to generate: {TOTAL_REVIEWS:,}")
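
The percentages above are entered by hand, so a quick check that each set sums to 1 can catch a typo before any data is generated. The sketch below is one minimal way to do that; `check_sums_to_one` and `REVIEW_TYPE_PCTS` are hypothetical names that are not part of this notebook, and the values are copied from the configuration cell.

```python
import math

# Values copied from the configuration cell above.
REVIEW_TYPE_PCTS = {
    "product_review": 0.60,
    "purchase_experience": 0.20,
    "return_feedback": 0.20,
}
RATING_DISTRIBUTION = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}


def check_sums_to_one(name, weights, tol=1e-9):
    """Raise ValueError if the weights do not sum to 1 within a tolerance."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"{name} sums to {total:.4f}, expected 1.0")
    return total


check_sums_to_one("review type split", REVIEW_TYPE_PCTS)
check_sums_to_one("rating distribution", RATING_DISTRIBUTION)
```

A tolerance is used instead of `== 1.0` because sums of decimal fractions are not always exact in floating point.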


## Load Customer and Product Data


In [None]:
# Load customers with their segments
customers_df = spark.sql(f"""
    SELECT 
        customer_key,
        customer_id,
        segment,
        loyalty_tier,
        geo_city,
        preferred_category
    FROM {CATALOG}.{SCHEMA}.gold_customer_dim
    WHERE is_current = TRUE
""")

customers = customers_df.collect()
print(f"Loaded {len(customers):,} customers")

# Show segment distribution
spark.sql(f"""
    SELECT segment, COUNT(*) as count
    FROM {CATALOG}.{SCHEMA}.gold_customer_dim
    WHERE is_current = TRUE
    GROUP BY segment
    ORDER BY count DESC
""").show()


In [None]:
# Load products with their categories
products_df = spark.sql(f"""
    SELECT 
        product_key,
        product_id,
        product_name,
        brand,
        category_level_1,
        category_level_2,
        category_level_3,
        color_name,
        material_primary,
        base_price,
        price_tier
    FROM {CATALOG}.{SCHEMA}.gold_product_dim
    WHERE is_active = TRUE
""")

products = products_df.collect()
print(f"Loaded {len(products):,} products")

# Show category distribution
spark.sql(f"""
    SELECT category_level_1, COUNT(*) as count
    FROM {CATALOG}.{SCHEMA}.gold_product_dim
    WHERE is_active = TRUE
    GROUP BY category_level_1
    ORDER BY count DESC
""").show()


## Review Content Templates

Define realistic review content that varies by:
- Customer segment (vocabulary and tone)
- Product category (topics and concerns)
- Rating level (positive vs negative themes)
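
As a rough illustration of how these three dimensions combine, the sketch below assembles a review from tiny stand-in templates. `compose_review` and the one-item lists are hypothetical simplifications, not the notebook's functions; the real generators in later cells also mix positive and negative content for 3-star ratings, which this sketch collapses into the negative pool.

```python
import random

# Tiny stand-in templates; the notebook's full lists are defined in the cells below.
INTROS = {"vip": ["As a platinum member,"], "new": ["First time ordering and"]}
POSITIVE = {"footwear": ["The cushioning is perfect."]}
NEGATIVE = {"footwear": ["Gave me blisters even with socks."]}


def compose_review(segment, category, rating, rng):
    """Tone comes from the segment, topic from the category, sentiment from the rating."""
    intro = rng.choice(INTROS[segment])
    pool = POSITIVE if rating >= 4 else NEGATIVE
    verdict = "exceeded expectations" if rating >= 4 else "was a letdown"
    body = rng.choice(pool[category])
    return f"{intro} the product {verdict}. {body}"


rng = random.Random(42)  # local generator so the example is repeatable
print(compose_review("vip", "footwear", 5, rng))
print(compose_review("new", "footwear", 1, rng))
```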


In [None]:
import random
from datetime import datetime, timedelta

random.seed(RANDOM_SEED)

# Review title templates by rating
TITLE_TEMPLATES = {
    5: [
        "Absolutely love it!",
        "Exceeded my expectations",
        "Perfect in every way",
        "Best purchase I've made",
        "Highly recommend!",
        "Outstanding quality",
        "Worth every penny",
        "A new favorite",
        "Stunning piece",
        "Exceptional craftsmanship"
    ],
    4: [
        "Great product, minor issues",
        "Really pleased overall",
        "Good quality, would buy again",
        "Almost perfect",
        "Very satisfied",
        "Solid purchase",
        "Happy with my choice",
        "Good value for money",
        "Nice addition to wardrobe",
        "Pleasantly surprised"
    ],
    3: [
        "It's okay, nothing special",
        "Mixed feelings",
        "Average quality",
        "Met expectations, barely",
        "Decent but not amazing",
        "Has pros and cons",
        "On the fence about this",
        "Not bad, not great",
        "Acceptable quality",
        "Room for improvement"
    ],
    2: [
        "Disappointed with purchase",
        "Not as described",
        "Expected better quality",
        "Wouldn't recommend",
        "Below expectations",
        "Quality issues",
        "Not worth the price",
        "Disappointing experience",
        "Had to return it",
        "Save your money"
    ],
    1: [
        "Complete waste of money",
        "Terrible quality",
        "Avoid this product",
        "Very disappointed",
        "Fell apart immediately",
        "Nothing like the photos",
        "Worst purchase ever",
        "Demand a refund",
        "Do not buy",
        "Extremely poor quality"
    ]
}

print("Title templates loaded")


In [None]:
# Review text components by segment (vocabulary and style)
SEGMENT_INTROS = {
    "vip": [
        "As a discerning customer who values quality,",
        "Having shopped here for years,",
        "As someone who appreciates fine craftsmanship,",
        "Given my extensive experience with luxury brands,",
        "As a platinum member,"
    ],
    "premium": [
        "As a regular customer,",
        "I've been shopping here for a while and",
        "Based on my previous purchases,",
        "I have high standards and",
        "As someone who values quality,"
    ],
    "loyal": [
        "I always come back to this brand because",
        "This is my go-to store and",
        "Been a loyal customer for years,",
        "I trust this brand and",
        "Love shopping here,"
    ],
    "regular": [
        "I bought this recently and",
        "Just received my order and",
        "I was looking for something like this and",
        "Decided to try this out and",
        "Needed something for the season and"
    ],
    "new": [
        "This was my first purchase here and",
        "New to this brand but",
        "First time ordering and",
        "Heard good things about this brand,",
        "Took a chance on this new-to-me brand and"
    ]
}

# Category-specific review content
CATEGORY_POSITIVE = {
    "apparel": [
        "The fabric feels amazing against my skin.",
        "Fits true to size, exactly as expected.",
        "The color is even more vibrant in person.",
        "Washes beautifully without losing shape.",
        "The stitching and construction are top-notch.",
        "So comfortable I could wear it all day.",
        "The drape and flow are elegant.",
        "Gets compliments every time I wear it."
    ],
    "footwear": [
        "Incredibly comfortable right out of the box.",
        "Great arch support for all-day wear.",
        "The cushioning is perfect.",
        "True to size, no break-in period needed.",
        "Stylish and functional.",
        "My feet don't hurt after hours of walking.",
        "The quality of the leather is exceptional.",
        "Perfect fit and very well made."
    ],
    "accessories": [
        "Looks exactly like the photos.",
        "The quality exceeded my expectations.",
        "Perfect for everyday use.",
        "Great gift idea.",
        "The craftsmanship is evident.",
        "Elevates any outfit.",
        "Well-made and stylish.",
        "The hardware is solid and doesn't look cheap."
    ]
}

CATEGORY_NEGATIVE = {
    "apparel": [
        "The sizing runs way too small.",
        "The fabric feels cheap and scratchy.",
        "Color faded after just one wash.",
        "Started pilling almost immediately.",
        "The stitching came undone within days.",
        "Nothing like what was shown online.",
        "Shrunk significantly after washing.",
        "The material is see-through."
    ],
    "footwear": [
        "Runs at least a full size small.",
        "Very uncomfortable, no arch support.",
        "The sole started separating after a week.",
        "Gave me blisters even with socks.",
        "The material creased badly.",
        "Not suitable for actual walking.",
        "Quality doesn't match the price.",
        "Had to return due to poor fit."
    ],
    "accessories": [
        "Looks nothing like the picture.",
        "The hardware tarnished immediately.",
        "Feels flimsy and cheap.",
        "Broke within the first week.",
        "Not worth the price at all.",
        "The strap came off.",
        "Smaller than expected.",
        "Poor quality materials."
    ]
}

print("Category templates loaded")


In [None]:
# Purchase experience templates
PURCHASE_EXPERIENCE_POSITIVE = [
    "Shipping was incredibly fast - received within 2 days!",
    "The packaging was beautiful and secure.",
    "Love the handwritten thank you note included.",
    "Easy checkout process and quick confirmation.",
    "Customer service was helpful when I had questions.",
    "Free shipping made this an even better deal.",
    "The tracking updates were accurate and frequent.",
    "Package arrived earlier than expected.",
    "Everything was wrapped with care.",
    "Smooth ordering experience from start to finish."
]

PURCHASE_EXPERIENCE_NEGATIVE = [
    "Shipping took forever - over 2 weeks!",
    "Package arrived damaged.",
    "Tracking showed delivered but never received.",
    "Had to contact support multiple times.",
    "The checkout process was confusing.",
    "Charged for shipping even though it should be free.",
    "Item was crammed into too small a box.",
    "No communication about delays.",
    "Website kept crashing during checkout.",
    "Had to wait a week for a shipping confirmation."
]

# Return feedback templates
RETURN_REASONS = [
    "The sizing was completely off from the size chart.",
    "Color was different than shown online.",
    "Quality didn't match the price point.",
    "Changed my mind after seeing it in person.",
    "Ordered wrong size by mistake.",
    "Found a better option elsewhere.",
    "Didn't fit my body type as expected.",
    "Material wasn't what I expected.",
    "Arrived with defects.",
    "Just didn't work with my wardrobe."
]

RETURN_EXPERIENCE_POSITIVE = [
    "The return process was seamless.",
    "Got my refund within days.",
    "Free return shipping made it easy.",
    "No questions asked return policy is great.",
    "Exchange process was straightforward."
]

RETURN_EXPERIENCE_NEGATIVE = [
    "Return process was a nightmare.",
    "Still waiting for my refund after weeks.",
    "Had to pay for return shipping out of pocket.",
    "Customer service was unhelpful.",
    "Took multiple attempts to process the return."
]

print("Experience templates loaded")


## Review Generation Functions


In [None]:
def get_rating():
    """Generate a rating based on realistic distribution."""
    r = random.random()
    cumulative = 0
    for rating, prob in RATING_DISTRIBUTION.items():
        cumulative += prob
        if r <= cumulative:
            return rating
    return 5  # Default fallback


def get_category_key(category_level_1):
    """Map product category to template key.

    Matching is case-insensitive; unknown or missing categories fall back
    to the accessories templates.
    """
    normalized = (category_level_1 or "").strip().lower()
    if normalized in ("apparel", "footwear"):
        return normalized
    return "accessories"


def generate_product_review(customer, product, rating):
    """Generate a product review based on customer segment and product category."""
    segment = customer["segment"]
    category_key = get_category_key(product["category_level_1"])
    
    # Build review text
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    if rating >= 4:
        content_parts = random.sample(CATEGORY_POSITIVE[category_key], min(3, len(CATEGORY_POSITIVE[category_key])))
        conclusion = random.choice([
            "Would definitely recommend!",
            "Will be ordering more.",
            "Very happy with this purchase.",
            "Exceeded my expectations.",
            "Great addition to my collection."
        ])
    elif rating == 3:
        pos = random.choice(CATEGORY_POSITIVE[category_key])
        neg = random.choice(CATEGORY_NEGATIVE[category_key])
        content_parts = [pos, "However, " + neg.lower()]
        conclusion = random.choice([
            "It's acceptable but not exceptional.",
            "Might consider alternatives next time.",
            "Not sure if I'd buy again.",
            "It serves its purpose.",
            "On the fence about recommending."
        ])
    else:
        content_parts = random.sample(CATEGORY_NEGATIVE[category_key], min(3, len(CATEGORY_NEGATIVE[category_key])))
        conclusion = random.choice([
            "Would not recommend.",
            "Returning this item.",
            "Very disappointed overall.",
            "Save your money for something better.",
            "Expected much more for this price."
        ])
    
    # Add product-specific details; "the" is lowercase because the mention follows the intro clause
    product_mention = f"the {product['brand']} {product['category_level_3']} in {product['color_name'].lower()}"
    verdict = "exceeded expectations" if rating >= 4 else "was a letdown" if rating <= 2 else "was okay"

    review_text = f"{intro} {product_mention} {verdict}. {' '.join(content_parts)} {conclusion}"
    
    return review_text


def generate_purchase_experience(customer, rating):
    """Generate a purchase experience review."""
    segment = customer["segment"]
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    if rating >= 4:
        experiences = random.sample(PURCHASE_EXPERIENCE_POSITIVE, min(3, len(PURCHASE_EXPERIENCE_POSITIVE)))
        conclusion = "Will definitely order again!"
    elif rating == 3:
        experiences = [random.choice(PURCHASE_EXPERIENCE_POSITIVE), "But " + random.choice(PURCHASE_EXPERIENCE_NEGATIVE).lower()]
        conclusion = "Overall an okay experience."
    else:
        experiences = random.sample(PURCHASE_EXPERIENCE_NEGATIVE, min(3, len(PURCHASE_EXPERIENCE_NEGATIVE)))
        conclusion = "Very frustrating experience overall."
    
    review_text = f"{intro} my recent order experience was {'excellent' if rating >= 4 else 'disappointing' if rating <= 2 else 'mixed'}. {' '.join(experiences)} {conclusion}"
    
    return review_text


def generate_return_feedback(customer, product, rating):
    """Generate return feedback."""
    segment = customer["segment"]
    intro = random.choice(SEGMENT_INTROS.get(segment, SEGMENT_INTROS["regular"]))
    
    reason = random.choice(RETURN_REASONS)
    product_mention = f"the {product['brand']} {product['category_level_3']}"
    
    if rating >= 3:
        return_exp = random.choice(RETURN_EXPERIENCE_POSITIVE)
        conclusion = "Despite the return, I'd shop here again."
    else:
        return_exp = random.choice(RETURN_EXPERIENCE_NEGATIVE)
        conclusion = "This experience has made me reconsider shopping here."
    
    review_text = f"{intro} I had to return {product_mention}. {reason} {return_exp} {conclusion}"
    
    return review_text


print("Review generation functions defined")
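
For comparison, weighted sampling can also be written with the standard library's `random.choices`, which accepts the weights directly. The sketch below is an alternative to `get_rating`, not the notebook's own method; `sample_ratings` is a hypothetical helper, and it uses a local `random.Random` instance so that the seed does not interact with other cells.

```python
import random
from collections import Counter

RATING_DISTRIBUTION = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}


def sample_ratings(n, rng):
    """Draw n ratings using the weights in RATING_DISTRIBUTION."""
    ratings = list(RATING_DISTRIBUTION.keys())
    weights = list(RATING_DISTRIBUTION.values())
    return rng.choices(ratings, weights=weights, k=n)


rng = random.Random(42)
sample = sample_ratings(10_000, rng)

# Compare the empirical share of each rating with its target weight.
observed = {r: c / len(sample) for r, c in Counter(sample).items()}
for rating in sorted(RATING_DISTRIBUTION, reverse=True):
    print(rating, round(observed.get(rating, 0.0), 3), "target", RATING_DISTRIBUTION[rating])
```

With a sample this large, the observed shares should land close to the targets, which makes this a convenient spot check when the weights are changed.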


## Generate Reviews


In [None]:
from datetime import date

def generate_all_reviews():
    """Generate all reviews based on distribution."""
    reviews = []
    
    # Calculate counts per type
    product_review_count = int(TOTAL_REVIEWS * PRODUCT_REVIEW_PCT)
    purchase_exp_count = int(TOTAL_REVIEWS * PURCHASE_EXPERIENCE_PCT)
    return_feedback_count = TOTAL_REVIEWS - product_review_count - purchase_exp_count
    
    print(f"Generating {product_review_count:,} product reviews...")
    print(f"Generating {purchase_exp_count:,} purchase experience reviews...")
    print(f"Generating {return_feedback_count:,} return feedback reviews...")
    
    # Date range for reviews (last 90 days)
    end_date = date.today()
    start_date = end_date - timedelta(days=90)
    
    review_id = 1
    
    # Generate Product Reviews
    for i in range(product_review_count):
        customer = random.choice(customers)
        product = random.choice(products)
        rating = get_rating()
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": product["product_key"],
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            "review_title": random.choice(TITLE_TEMPLATES[rating]),
            "review_text": generate_product_review(customer, product, rating),
            "review_type": "product_review",
            "verified_purchase": random.random() > 0.1,  # 90% verified
            "helpful_votes": random.randint(0, 50) if rating in [1, 5] else random.randint(0, 20),
            "source_channel": random.choice(["web", "web", "app", "email"])  # Web weighted
        }
        reviews.append(review)
        review_id += 1
        
        # Report progress from the list length; review_id is already one ahead here
        if len(reviews) % 1000 == 0:
            print(f"  Generated {len(reviews):,} reviews...")
    
    # Generate Purchase Experience Reviews
    for i in range(purchase_exp_count):
        customer = random.choice(customers)
        rating = get_rating()
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": None,  # No specific product for purchase experience
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            # Title is determined by the rating band alone
            "review_title": (
                "Great shopping experience" if rating >= 4 else
                "Average experience" if rating == 3 else
                "Poor customer service"
            ),
            "review_text": generate_purchase_experience(customer, rating),
            "review_type": "purchase_experience",
            "verified_purchase": True,
            "helpful_votes": random.randint(0, 30),
            "source_channel": random.choice(["web", "app", "email", "in_store"])
        }
        reviews.append(review)
        review_id += 1
    
    # Generate Return Feedback
    for i in range(return_feedback_count):
        customer = random.choice(customers)
        product = random.choice(products)
        # Return feedback tends to be more negative
        rating = random.choices([1, 2, 3, 4, 5], weights=[0.15, 0.25, 0.35, 0.20, 0.05])[0]
        review_date = start_date + timedelta(days=random.randint(0, 90))
        
        review = {
            "review_id": f"REV_{review_id:08d}",
            "customer_key": customer["customer_key"],
            "product_key": product["product_key"],
            "order_number": f"ORD_2025_{random.randint(100000, 999999)}",
            "review_date": review_date,
            "rating": rating,
            # Title is determined by the rating band alone
            "review_title": (
                "Easy return process" if rating >= 4 else
                "Return was okay" if rating == 3 else
                "Frustrating return experience"
            ),
            "review_text": generate_return_feedback(customer, product, rating),
            "review_type": "return_feedback",
            "verified_purchase": True,
            "helpful_votes": random.randint(0, 40),
            "source_channel": random.choice(["web", "app", "email"])
        }
        reviews.append(review)
        review_id += 1
    
    print(f"\nGenerated {len(reviews):,} total reviews")
    return reviews

# Generate all reviews
all_reviews = generate_all_reviews()
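
The per-type counts in `generate_all_reviews` truncate the first two shares with `int()` and give the remainder to return feedback, so the three counts always add up to `TOTAL_REVIEWS`. The sketch below isolates that arithmetic; `split_counts` is a hypothetical helper written only for illustration.

```python
def split_counts(total, product_pct, purchase_pct):
    """Return (product, purchase, return) counts that always sum to total."""
    product = int(total * product_pct)
    purchase = int(total * purchase_pct)
    returns = total - product - purchase  # remainder absorbs any truncation
    return product, purchase, returns


print(split_counts(5000, 0.60, 0.20))
print(split_counts(5001, 0.60, 0.20))
```

When the total does not divide evenly, the leftover reviews end up in the return feedback group, so its share can be slightly above the configured 20%.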


In [None]:
# Preview sample reviews
print("\n" + "="*80)
print("SAMPLE REVIEWS")
print("="*80)

# Show one of each type
for review_type in ["product_review", "purchase_experience", "return_feedback"]:
    sample = next(r for r in all_reviews if r["review_type"] == review_type)
    print(f"\n--- {review_type.upper()} ---")
    print(f"Review ID: {sample['review_id']}")
    print(f"Customer Key: {sample['customer_key']}")
    print(f"Product Key: {sample['product_key']}")
    print(f"Rating: {'*' * sample['rating']}")
    print(f"Title: {sample['review_title']}")
    text = sample["review_text"]
    print(f"Text: {text[:300]}..." if len(text) > 300 else f"Text: {text}")


## Create Delta Table


In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DateType

# Define schema
review_schema = StructType([
    StructField("review_id", StringType(), False),
    StructField("customer_key", IntegerType(), True),
    StructField("product_key", IntegerType(), True),
    StructField("order_number", StringType(), True),
    StructField("review_date", DateType(), True),
    StructField("rating", IntegerType(), True),
    StructField("review_title", StringType(), True),
    StructField("review_text", StringType(), True),
    StructField("review_type", StringType(), True),
    StructField("verified_purchase", BooleanType(), True),
    StructField("helpful_votes", IntegerType(), True),
    StructField("source_channel", StringType(), True)
])

# Create DataFrame
reviews_df = spark.createDataFrame(all_reviews, schema=review_schema)

print(f"Created DataFrame with {reviews_df.count():,} reviews")
reviews_df.printSchema()
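
Type mismatches between the generated dictionaries and `review_schema` tend to surface as hard-to-read Spark errors, so it can help to check the records in plain Python first. The sketch below is one possible pre-check; `EXPECTED_TYPES` and `validate_record` are hypothetical names, and the type mapping is an assumption derived from the schema above.

```python
from datetime import date

# Expected Python type per column, mirroring review_schema above.
EXPECTED_TYPES = {
    "review_id": str,
    "customer_key": int,
    "product_key": (int, type(None)),  # NULL for purchase_experience
    "order_number": str,
    "review_date": date,
    "rating": int,
    "review_title": str,
    "review_text": str,
    "review_type": str,
    "verified_purchase": bool,
    "helpful_votes": int,
    "source_channel": str,
}


def validate_record(record):
    """Return a list of problems found in one review dict (empty if none)."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            problems.append(f"{column}: unexpected type {type(record[column]).__name__}")
    rating = record.get("rating")
    if isinstance(rating, int) and not 1 <= rating <= 5:
        problems.append("rating outside 1-5")
    return problems


example = {
    "review_id": "REV_00000001", "customer_key": 10, "product_key": None,
    "order_number": "ORD_2025_123456", "review_date": date(2025, 1, 15),
    "rating": 4, "review_title": "Very satisfied", "review_text": "Sample text.",
    "review_type": "purchase_experience", "verified_purchase": True,
    "helpful_votes": 3, "source_channel": "web",
}
print(validate_record(example))
```

In this notebook the same function could be applied to every element of `all_reviews` before calling `spark.createDataFrame`.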


In [None]:
# Write to Delta table
reviews_df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(FULL_TABLE_NAME)

print(f"Successfully created table: {FULL_TABLE_NAME}")


In [None]:
# Add table and column comments
spark.sql(f"""
    ALTER TABLE {FULL_TABLE_NAME}
    SET TBLPROPERTIES (
        'comment' = 'Customer reviews for Vector Search RAG. Contains product reviews, purchase experiences, and return feedback linked to gold_customer_dim and gold_product_dim.'
    )
""")

# Add column comments
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_id COMMENT 'Unique review identifier'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN customer_key COMMENT 'FK to gold_customer_dim'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN product_key COMMENT 'FK to gold_product_dim (NULL for purchase_experience)'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_text COMMENT 'Full review content - primary column for vector embedding'")
spark.sql(f"ALTER TABLE {FULL_TABLE_NAME} ALTER COLUMN review_type COMMENT 'Type: product_review, purchase_experience, or return_feedback'")

print("Table comments added")
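
The column comments describe `customer_key` and `product_key` as foreign keys, but a comment does not enforce anything, and an inner join against the dimension tables would silently drop any review whose key does not match. The sketch below shows one way to look for such orphans in plain Python; `find_orphans` is a hypothetical helper and the sample data is made up for the example.

```python
def find_orphans(reviews, valid_customer_keys, valid_product_keys):
    """Return review_ids whose customer or product key has no matching dimension row."""
    orphans = []
    for review in reviews:
        bad_customer = review["customer_key"] not in valid_customer_keys
        product_key = review["product_key"]
        # A NULL product_key is allowed (purchase_experience reviews have no product).
        bad_product = product_key is not None and product_key not in valid_product_keys
        if bad_customer or bad_product:
            orphans.append(review["review_id"])
    return orphans


sample_reviews = [
    {"review_id": "REV_1", "customer_key": 1, "product_key": 100},
    {"review_id": "REV_2", "customer_key": 2, "product_key": None},
    {"review_id": "REV_3", "customer_key": 9, "product_key": 100},
    {"review_id": "REV_4", "customer_key": 1, "product_key": 999},
]
print(find_orphans(sample_reviews, {1, 2}, {100}))
```

With the notebook's variables, the key sets could be built as `{c["customer_key"] for c in customers}` and `{p["product_key"] for p in products}`; because reviews are generated from those same lists, the expected result is an empty list.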


## Validation
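
When reading the overall rating distribution, note that it is a blend: product reviews and purchase experiences draw from `RATING_DISTRIBUTION`, while return feedback uses its own, more negative weights. The sketch below works out the blended target to compare against; `blended_distribution` is a hypothetical helper, and the weights are copied from the configuration and generation cells.

```python
MAIN_WEIGHTS = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.10, 1: 0.05}    # RATING_DISTRIBUTION
RETURN_WEIGHTS = {1: 0.15, 2: 0.25, 3: 0.35, 4: 0.20, 5: 0.05}  # return feedback weights
RETURN_SHARE = 0.20  # RETURN_FEEDBACK_PCT


def blended_distribution(main, returns, return_share):
    """Mix two rating distributions according to the share of return feedback."""
    return {
        rating: (1 - return_share) * main[rating] + return_share * returns[rating]
        for rating in sorted(main, reverse=True)
    }


expected = blended_distribution(MAIN_WEIGHTS, RETURN_WEIGHTS, RETURN_SHARE)
for rating, share in expected.items():
    print(f"{rating}: {share:.1%}")
```

Under these weights the blended targets come out to roughly 37%, 24%, 19%, 13%, and 7% for ratings 5 through 1, so an overall 5-star share below the configured 45% is expected and does not indicate a bug.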


In [None]:
# Validate the generated data
print("VALIDATION RESULTS")
print("="*60)

# Total count
total = spark.sql(f"SELECT COUNT(*) as cnt FROM {FULL_TABLE_NAME}").collect()[0]["cnt"]
print(f"\nTotal reviews: {total:,}")

# By review type
print("\n--- Reviews by Type ---")
spark.sql(f"""
    SELECT review_type, COUNT(*) as count, ROUND(AVG(rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME}
    GROUP BY review_type
    ORDER BY count DESC
""").show()

# Rating distribution
print("\n--- Rating Distribution ---")
spark.sql(f"""
    SELECT rating, COUNT(*) as count, 
           ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as pct
    FROM {FULL_TABLE_NAME}
    GROUP BY rating
    ORDER BY rating DESC
""").show()


In [None]:
# Reviews by customer segment
print("--- Reviews by Customer Segment ---")
spark.sql(f"""
    SELECT c.segment, COUNT(*) as review_count, ROUND(AVG(r.rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_customer_dim c ON r.customer_key = c.customer_key
    GROUP BY c.segment
    ORDER BY review_count DESC
""").show()

# Reviews by product category
print("\n--- Reviews by Product Category ---")
spark.sql(f"""
    SELECT p.category_level_1, COUNT(*) as review_count, ROUND(AVG(r.rating), 2) as avg_rating
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_product_dim p ON r.product_key = p.product_key
    WHERE r.product_key IS NOT NULL
    GROUP BY p.category_level_1
    ORDER BY review_count DESC
""").show()


In [None]:
# Sample low-rated reviews (good for vector search testing)
print("--- Sample Low-Rated Reviews (Vector Search Candidates) ---")
spark.sql(f"""
    SELECT r.review_id, r.rating, r.review_type, 
           SUBSTRING(r.review_text, 1, 150) as review_excerpt,
           p.category_level_1
    FROM {FULL_TABLE_NAME} r
    LEFT JOIN {CATALOG}.{SCHEMA}.gold_product_dim p ON r.product_key = p.product_key
    WHERE r.rating <= 2
    LIMIT 5
""").show(truncate=False)


In [None]:
# Sample VIP customer reviews
print("--- Sample VIP Customer Reviews ---")
spark.sql(f"""
    SELECT r.review_id, r.rating, c.segment, c.loyalty_tier,
           SUBSTRING(r.review_text, 1, 200) as review_excerpt
    FROM {FULL_TABLE_NAME} r
    JOIN {CATALOG}.{SCHEMA}.gold_customer_dim c ON r.customer_key = c.customer_key
    WHERE c.segment = 'vip'
    LIMIT 3
""").show(truncate=False)


## Summary

### What was created:
- **Table**: `juan_dev.retail.gold_customer_reviews`
- **Reviews**: ~5,000 realistic customer reviews
- **Linked to**: Actual customers and products from gold layer tables

### Review Types:
- **Product Reviews** (60%): Quality, fit, style feedback
- **Purchase Experience** (20%): Shipping, packaging, service
- **Return Feedback** (20%): Return reasons and experience

### Next Steps:
1. Create a Vector Search endpoint in the Databricks UI
2. Create a Delta Sync index on the `review_text` column
3. Integrate with multi-agent supervisor
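
The first two steps can also be scripted instead of done in the UI. The sketch below assumes the `databricks-vectorsearch` Python package and an existing endpoint; the endpoint name `reviews_vs_endpoint` and the `_index` suffix are hypothetical placeholders, and Databricks documents that the source table needs Change Data Feed enabled for a Delta Sync index. Treat it as a starting point under those assumptions, not a tested deployment script.

```python
def build_index_spec(full_table_name, endpoint_name="reviews_vs_endpoint"):
    """Collect the arguments for a Delta Sync index over review_text."""
    return {
        "endpoint_name": endpoint_name,             # hypothetical endpoint name
        "source_table_name": full_table_name,
        "index_name": f"{full_table_name}_index",   # hypothetical index name
        "pipeline_type": "TRIGGERED",
        "primary_key": "review_id",
        "embedding_source_column": "review_text",
        "embedding_model_endpoint_name": "databricks-bge-large-en",
    }


def create_review_index(spec):
    """Create the index; needs databricks-vectorsearch and workspace credentials."""
    # Imported lazily so the spec can be built and inspected without the package.
    from databricks.vector_search.client import VectorSearchClient

    client = VectorSearchClient()
    return client.create_delta_sync_index(**spec)


spec = build_index_spec("juan_dev.retail.gold_customer_reviews")
print(spec["index_name"])
# create_review_index(spec)  # uncomment inside a Databricks workspace
```

Keeping the arguments in a plain dictionary makes it easy to review the index configuration before anything is created.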


In [None]:
print("\n" + "="*60)
print("CUSTOMER REVIEWS GENERATION COMPLETE")
print("="*60)
print(f"\nTable: {FULL_TABLE_NAME}")
print(f"Total Reviews: {total:,}")
print("\nReady for Vector Search index creation in Databricks UI")
print("   - Create endpoint (if needed)")
print("   - Create Delta Sync index on 'review_text' column")
print("   - Use managed embeddings (databricks-bge-large-en)")


In [None]:
# End of notebook
