### Day 14

## AI-Powered Analytics: Genie & Mosaic AI

# Databricks Genie (Natural Language to SQL)

To create a Genie space and try a few natural-language queries:

1. **Navigate:** Click "Genie" in the left sidebar of Databricks.  
2. **Create:** Click "New Genie Space".  
3. **Connect Data:** Select your catalog (`ecommerce_catalog`) and schema (`gold`). Add the table `category_performance`.  
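4. **Ask Questions:** Type the following prompts into the Genie chat box. `category_performance` holds only category-level aggregates, so the prompts about products, daily trends, and customers need additional tables in the space:  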
   - "Show me total revenue by category."  
   - "Which products have the highest conversion rate?"  
   - "What's the trend of daily purchases over time?"  
   - "Find customers who viewed but never purchased."  

**Value add:** Genie lets non-technical users ask questions in plain English instead of writing SQL, freeing their time for deeper analysis, visualization, and insights.
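
To make the first prompt concrete, here is a minimal sketch of the kind of SQL that "Show me total revenue by category." could translate to. The query shape is an assumption (Genie's actual output may differ), the sample rows are made up for illustration, and an in-memory SQLite table stands in for `ecommerce_catalog.gold.category_performance`.

```python
import sqlite3

# Hypothetical sample rows (category_code, total_revenue); not real data.
SAMPLE_ROWS = [
    ("electronics.video.tv", 1200.0),
    ("accessories.wallet", 150.5),
    ("kids.dolls", 100.0),
    ("kids.dolls", 200.0),
]

# Assumed shape of the SQL behind "Show me total revenue by category."
REVENUE_BY_CATEGORY_SQL = """
    SELECT category_code, SUM(total_revenue) AS revenue
    FROM category_performance
    GROUP BY category_code
    ORDER BY revenue DESC
"""


def revenue_by_category(rows):
    """Load rows into an in-memory table and return (category, revenue) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE category_performance "
            "(category_code TEXT, total_revenue REAL)"
        )
        conn.executemany("INSERT INTO category_performance VALUES (?, ?)", rows)
        return conn.execute(REVENUE_BY_CATEGORY_SQL).fetchall()
    finally:
        conn.close()


ranked = revenue_by_category(SAMPLE_ROWS)
for category, revenue in ranked:
    print(f"{category}: {revenue:.2f}")
```

In a Databricks notebook the same statement could be run with `spark.sql(...)` against the three-level table name, which is a handy way to sanity-check an answer Genie returns.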


# Mosaic AI Exploration and NLP Task (Balanced Synthetic Reviews)

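We will attach a **synthetic review** to every category in `category_performance` and run a simple **sentiment analysis** on those reviews. The classifier used below is a pre-trained Hugging Face `transformers` pipeline running inside the Databricks notebook, and the results are logged with MLflow.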



In [0]:
# Load the table from Databricks
spark_df = spark.table("ecommerce_catalog.gold.category_performance")

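# Convert to pandas; the table is small (about a hundred rows), so it fits on the driver
df = spark_df.toPandas()
# Drop rows that have a missing value in any column
df = df.dropna()
df.info()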

<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, 0 to 126
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   category_code           122 non-null    object 
 1   unique_views            122 non-null    int64  
 2   unique_carts            122 non-null    int64  
 3   unique_purchases        122 non-null    int64  
 4   total_revenue           122 non-null    float64
 5   cart_to_purchase_ratio  122 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 6.7+ KB


### Generate roughly equal proportions of positive and negative reviews

In [0]:
import random

# Expanded positive reviews (some with neutral follow-up lines)
positive_reviews = [
    "I love this product, highly recommend it!",
    "Excellent quality and very reliable. Works as expected.",
    "Amazing experience overall. Product does what it claims.",
    "Very satisfied with the purchase. No major complaints so far.",
    "Great value for money. The experience has been smooth.",
    "Fantastic product! Setup was straightforward and easy.",
    "Really happy with this item. Performance has been consistent."
]

# Expanded negative reviews (some with neutral follow-up lines)
negative_reviews = [
    "Terrible quality, very disappointed.",
    "Not worth the money. Product is average at best.",
    "Poor experience overall. It works, but not well.",
    "Extremely dissatisfied. Nothing special about this product.",
    "Waste of money. Expected much better performance.",
    "Very disappointing experience. Functionality is basic.",
    "Would not recommend this. It barely meets expectations."
]

# Combine only positive and negative reviews
all_reviews = positive_reviews + negative_reviews
num_reviews = len(df)

# Assign reviews in a round-robin fashion for balanced distribution
synthetic_reviews = []
for i in range(num_reviews):
    review = all_reviews[i % len(all_reviews)]
    synthetic_reviews.append([review])  # Keep as list for NLP pipeline

# Add synthetic reviews to dataframe
df["synthetic_reviews"] = synthetic_reviews

df[["category_code", "synthetic_reviews"]].head()


|   | category_code | synthetic_reviews |
|---|---|---|
| 0 | stationery.cartrige | [I love this product, highly recommend it!] |
| 1 | electronics.video.tv | [Excellent quality and very reliable. Works as... |
| 2 | accessories.wallet | [Amazing experience overall. Product does what... |
| 3 | appliances.kitchen.juicer | [Very satisfied with the purchase. No major co... |
| 4 | construction.tools.welding | [Great value for money. The experience has bee... |
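
Round-robin assignment only gives an exact 50/50 split when the number of rows is a multiple of the number of templates. A small sketch to check the split for this notebook's sizes (122 rows, 7 positive and 7 negative templates, positives listed first); the helper name `round_robin_label_counts` is hypothetical:

```python
from collections import Counter


def round_robin_label_counts(num_rows, num_positive, num_negative):
    """Count labels when templates (positives first) are assigned by i % len(templates)."""
    labels = ["positive"] * num_positive + ["negative"] * num_negative
    return Counter(labels[i % len(labels)] for i in range(num_rows))


counts = round_robin_label_counts(num_rows=122, num_positive=7, num_negative=7)
print(counts)  # Counter({'positive': 63, 'negative': 59})
```

Eight full cycles cover 112 rows evenly, and the remaining 10 rows take the first 10 templates (7 positive, 3 negative), so the data is close to balanced but slightly skewed toward positive.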


### Perform NLP Task (Sentiment Analysis)

In [0]:
%pip install transformers torch

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
from transformers import pipeline
import pandas as pd

# Load sentiment analysis model
sentiment_classifier = pipeline("sentiment-analysis")

# Flatten to (category_code, review) pairs, one pair per review in each row's list
all_reviews_flat = [
    (row["category_code"], review)
    for _, row in df.iterrows()
    for review in row["synthetic_reviews"]
]

results = []
for category, review in all_reviews_flat:
    pred = sentiment_classifier(review)[0]
    results.append({
        "category_code": category,
        "review": review,
        "label": pred["label"],
        "score": pred["score"]
    })

sentiment_df = pd.DataFrame(results)
sentiment_df.head()


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


|   | category_code | review | label | score |
|---|---|---|---|---|
| 0 | stationery.cartrige | I love this product, highly recommend it! | POSITIVE | 0.999885 |
| 1 | electronics.video.tv | Excellent quality and very reliable. Works as ... | POSITIVE | 0.999863 |
| 2 | accessories.wallet | Amazing experience overall. Product does what ... | POSITIVE | 0.999884 |
| 3 | appliances.kitchen.juicer | Very satisfied with the purchase. No major com... | POSITIVE | 0.996478 |
| 4 | construction.tools.welding | Great value for money. The experience has been... | POSITIVE | 0.999423 |


In [0]:
sentiment_df.sample(5)

|   | category_code | review | label | score |
|---|---|---|---|---|
| 72 | auto.accessories.videoregister | Amazing experience overall. Product does what ... | POSITIVE | 0.999884 |
| 86 | electronics.video.projector | Amazing experience overall. Product does what ... | POSITIVE | 0.999884 |
| 79 | kids.dolls | Poor experience overall. It works, but not well. | NEGATIVE | 0.99909 |
| 26 | apparel.shoes.moccasins | Very disappointing experience. Functionality i... | NEGATIVE | 0.997453 |
| 101 | construction.tools.drill | Very satisfied with the purchase. No major com... | POSITIVE | 0.996478 |
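
Because the reviews come from known positive and negative templates, the predicted labels can be checked against the intended ones. The sketch below is one possible way to do that on a list of result dicts shaped like `results` above; the helper name `template_accuracy` is hypothetical, and the two demo rows reuse reviews and scores from the sampled output.

```python
def template_accuracy(results, positive_templates, negative_templates):
    """Fraction of results whose predicted label matches the template's intent.

    Rows whose review is not a known template are skipped. Returns None
    when no row could be checked.
    """
    positive = set(positive_templates)
    negative = set(negative_templates)
    checked = 0
    correct = 0
    for row in results:
        if row["review"] in positive:
            expected = "POSITIVE"
        elif row["review"] in negative:
            expected = "NEGATIVE"
        else:
            continue  # unknown review text, nothing to compare against
        checked += 1
        if row["label"] == expected:
            correct += 1
    return correct / checked if checked else None


# Demo rows taken from the sampled output above.
demo_results = [
    {"category_code": "kids.dolls",
     "review": "Poor experience overall. It works, but not well.",
     "label": "NEGATIVE", "score": 0.99909},
    {"category_code": "electronics.video.projector",
     "review": "Amazing experience overall. Product does what it claims.",
     "label": "POSITIVE", "score": 0.999884},
]
demo_accuracy = template_accuracy(
    demo_results,
    positive_templates=["Amazing experience overall. Product does what it claims."],
    negative_templates=["Poor experience overall. It works, but not well."],
)
print(demo_accuracy)
```

In the notebook itself, the call would be `template_accuracy(results, positive_reviews, negative_reviews)`, using the lists defined earlier.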


#### Log Experiment in MLflow

In [0]:
import mlflow

with mlflow.start_run(run_name="mosaic_binary_sentiment"):
    # Hub id of the default model the pipeline resolved to
    mlflow.log_param(
        "model_name",
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
    )

    # Share of reviews classified as each label
    positive_ratio = (sentiment_df["label"] == "POSITIVE").mean()
    negative_ratio = (sentiment_df["label"] == "NEGATIVE").mean()

    mlflow.log_metric("positive_ratio", positive_ratio)
    mlflow.log_metric("negative_ratio", negative_ratio)
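
If more labels are added later, the per-label ratios can be computed generically and logged in one call. This is a sketch under assumptions: the helper names `label_ratios` and `log_label_ratios` are hypothetical, and the metric keys follow the `<label>_ratio` pattern used above. `mlflow.log_metrics` accepts a dict of metric names to values.

```python
from collections import Counter


def label_ratios(labels):
    """Map each distinct label to its share of the total, keyed as '<label>_ratio'."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f"{label.lower()}_ratio": count / total for label, count in counts.items()}


def log_label_ratios(labels):
    """Log all label ratios to the active MLflow run in a single call."""
    import mlflow  # imported lazily so label_ratios works without MLflow installed

    mlflow.log_metrics(label_ratios(labels))


demo_ratios = label_ratios(["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"])
print(demo_ratios)  # {'positive_ratio': 0.5, 'negative_ratio': 0.5}
```

Inside the `with mlflow.start_run(...)` block, `log_label_ratios(sentiment_df["label"])` would replace the two separate `log_metric` calls.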
