<a href="https://colab.research.google.com/github/lily-larson/MGMT-467-Analytics-Portfolio/blob/main/Labs/Unit2_Lab2_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio for AI-Assisted SQL + ML

**Date:** 2025-10-16  
**Objective:** Build and refine a complete ML pipeline for churn prediction in BigQuery, with **Gemini-style prompts** guiding SQL generation.

You'll learn to:
- Frame SQL goals as clear prompts
- Generate, test, and debug queries with an AI assistant
- Reflect on each modeling step and your prompt design


In [None]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "directed-bongo-471119-d1"  # <-- Replace with your actual project ID
!gcloud config set project $project_id


## Task 0: Connect to BigQuery

**🎯 Goal:** Verify BigQuery access from Colab.  
**📌 Requirements:** Use `%%bigquery` to return the current date and the session user.

---

### 🧠 Prompt Template  
> Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

---

### 👩‍🏫 Example Prompt  
> Write a SQL query using BigQuery syntax that returns today’s date and the current session user.

---

### ✅ Expected SQL Output
```sql
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
```

---

### 🔍 Checkpoint  
The query should return a single row with today's date and the email address of the account you authenticated with.


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;


## Task 1: Prepare ML Table

**🎯 Goal:** Create a clean features table for modeling churn.  
**📌 Requirements:** Use cleaned_features as source, select relevant columns, filter rows with churn_label IS NOT NULL.

---

### 🧠 Prompt Template  
> Write a query that creates a new table with columns: [region, plan_tier, age_band, ...] and churn_label from [source_table]. Filter to rows where churn_label IS NOT NULL.

---

### 👩‍🏫 Example Prompt  
> Create a BigQuery table named churn_features from cleaned_features with selected features and where churn_label IS NOT NULL.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT region, plan_tier, age_band, avg_rating, total_minutes, churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;
```

---

### 🔍 Checkpoint  
Table should appear in BigQuery and contain non-null labels.


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    country,
    subscription_plan,
    age,
    churn_next_month
FROM
    `directed-bongo-471119-d1.netflix.feat_churn_lite`
WHERE
    churn_next_month IS NOT NULL

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    AVG(rating) AS avg_rating
FROM
    `directed-bongo-471119-d1.netflix.reviews`
GROUP BY
    user_id

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    SUM(watch_duration_minutes) AS total_minutes,
    AVG(progress_percentage) AS avg_progress
FROM
    `directed-bongo-471119-d1.netflix.watch_history`
GROUP BY
    user_id

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    COUNT(DISTINCT session_id) AS num_sessions
FROM
    `directed-bongo-471119-d1.netflix.watch_history`
GROUP BY
    user_id

In [None]:
%%bigquery --project directed-bongo-471119-d1
CREATE OR REPLACE TABLE `directed-bongo-471119-d1.netflix.churn_features` AS
SELECT
    t1.user_id,
    t1.country,
    t1.subscription_plan,
    t1.age,
    t1.churn_next_month,
    t2.avg_rating,
    t3.total_minutes,
    t3.avg_progress,
    t3.num_sessions
FROM
    `directed-bongo-471119-d1.netflix.feat_churn_lite` AS t1
-- Average review rating per user (NULL for users with no reviews)
LEFT JOIN (
    SELECT user_id, AVG(rating) AS avg_rating
    FROM `directed-bongo-471119-d1.netflix.reviews`
    GROUP BY user_id
) AS t2
ON
    t1.user_id = t2.user_id
-- Viewing aggregates per user, computed in a single pass over watch_history
LEFT JOIN (
    SELECT
        user_id,
        SUM(watch_duration_minutes) AS total_minutes,
        AVG(progress_percentage) AS avg_progress,
        COUNT(DISTINCT session_id) AS num_sessions
    FROM `directed-bongo-471119-d1.netflix.watch_history`
    GROUP BY user_id
) AS t3
ON
    t1.user_id = t3.user_id
WHERE
    t1.churn_next_month IS NOT NULL
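
To see what the label filter and the `LEFT JOIN` on a per-user aggregate do to the output, the pattern can be tried locally before running it in BigQuery. The sketch below uses Python's built-in `sqlite3` as a stand-in for BigQuery; the `feat` and `reviews` tables and their rows are made up for illustration and are not the lab's data.

```python
import sqlite3

# In-memory toy database; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feat (user_id TEXT, churn_next_month INTEGER);
    CREATE TABLE reviews (user_id TEXT, rating REAL);
    INSERT INTO feat VALUES ('u1', 0), ('u2', 1), ('u3', NULL);
    INSERT INTO reviews VALUES ('u1', 4.0), ('u1', 2.0);
""")

rows = conn.execute("""
    SELECT f.user_id, f.churn_next_month, r.avg_rating
    FROM feat AS f
    LEFT JOIN (SELECT user_id, AVG(rating) AS avg_rating
               FROM reviews GROUP BY user_id) AS r
    ON f.user_id = r.user_id
    WHERE f.churn_next_month IS NOT NULL
    ORDER BY f.user_id
""").fetchall()

# u3 is dropped (no label); u2 is kept with a NULL avg_rating (no reviews).
print(rows)
```

Users without reviews stay in the table with a NULL feature, which is worth keeping in mind when checking for NULLs later in the lab.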


## Task 2: Train Logistic Regression Model

**🎯 Goal:** Train a basic BQML logistic regression model.  
**📌 Requirements:** Use churn_features table, predict churn_label from features.

---

### 🧠 Prompt Template  
> Write a CREATE MODEL SQL for logistic regression using churn_label as label and [features] as inputs.

---

### 👩‍🏫 Example Prompt  
> Train a logistic regression model to predict churn_label using region, plan_tier, total_minutes, avg_rating.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT region, plan_tier, total_minutes, avg_rating, churn_label
FROM `your_dataset.churn_features`;
```

---

### 🔍 Checkpoint  
Model appears in BigQuery under Models. Training completes.


In [None]:
%%bigquery --project directed-bongo-471119-d1
CREATE OR REPLACE MODEL `directed-bongo-471119-d1.netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_next_month']) AS
SELECT
    churn_next_month,
    country,
    subscription_plan,
    age,
    avg_rating,
    total_minutes,
    avg_progress,
    num_sessions
FROM
    `directed-bongo-471119-d1.netflix.churn_features`;
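
As a reminder of what a logistic regression model computes, here is a minimal Python sketch that turns a weighted sum of numeric features into a probability. The weights, intercept, and feature values are hypothetical, and the sketch ignores preprocessing that BigQuery ML may apply (such as scaling numeric inputs or encoding categorical ones).

```python
import math

def predict_probability(features, weights, intercept):
    """Apply the logistic (sigmoid) function to intercept + sum(weight * value)."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs for illustration only.
weights = {"total_minutes": -0.5, "num_sessions": 1.0}
neutral = predict_probability({"total_minutes": 2.0, "num_sessions": 1.0}, weights, 0.0)
heavy_viewer = predict_probability({"total_minutes": 6.0, "num_sessions": 1.0}, weights, 0.0)
print(neutral, heavy_viewer)
```

With these made-up weights, more viewing time pushes the weighted sum down and so lowers the predicted probability.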


## Task 3: Evaluate Model

**🎯 Goal:** Evaluate the logistic regression model.  
**📌 Requirements:** Use ML.EVALUATE.

---

### 🧠 Prompt Template  
> Write a query to evaluate my logistic regression model using ML.EVALUATE.

---

### 👩‍🏫 Example Prompt  
> Evaluate the churn_model using ML.EVALUATE to get accuracy, precision, recall.

---

### ✅ Expected SQL Output
```sql
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);
```

---

### 🔍 Checkpoint  
View performance metrics: accuracy, log_loss, precision, recall.


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT * FROM ML.EVALUATE(MODEL `directed-bongo-471119-d1.netflix.churn_model`);
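
To make the metrics easier to interpret, here is a minimal Python sketch of how accuracy, precision, and recall follow from confusion-matrix counts. The counts are made-up illustration values, not results from this lab's model, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the users predicted to churn, how many actually churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the users who actually churned, how many the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
print(metrics)
```

When churners are rare, accuracy alone can look good even if recall is poor, so it helps to read the metrics together.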


## Task 4: Predict Churn

**🎯 Goal:** Use ML.PREDICT to generate churn predictions.  
**📌 Requirements:** Apply model to same input table.

---

### 🧠 Prompt Template  
> Generate SQL to use ML.PREDICT on churn_model and return predictions by user_id.

---

### 👩‍🏫 Example Prompt  
> Predict churn using churn_model. Include user_id, predicted_churn_label, and prediction probability.

---

### ✅ Expected SQL Output
```sql
SELECT user_id, predicted_churn_label, predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
      (SELECT * FROM `your_dataset.churn_features`));
```

---

### 🔍 Checkpoint  
Inspect top churn risk users. Validate probabilities.


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    predicted_churn_next_month,
    -- OFFSET(0) takes the first class in the probs array; confirm its label is the churn class.
    predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
FROM
    ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
      (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`));
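
The `_probs` column holds one entry per class, each with a label and a probability, so picking an entry by position depends on the order of the classes. A minimal Python sketch of selecting by label instead is shown below; the rows are hypothetical, and treating `1` as the churn label is an assumption about how `churn_next_month` is encoded.

```python
def churn_probability(probs, positive_label=1):
    """Return the probability attached to the positive (churn) label, or None."""
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    return None

# Hypothetical prediction rows; the entry order differs on purpose.
row_a = [{"label": 1, "prob": 0.25}, {"label": 0, "prob": 0.75}]
row_b = [{"label": 0, "prob": 0.75}, {"label": 1, "prob": 0.25}]
print(churn_probability(row_a), churn_probability(row_b))
```

Both rows give the same churn probability, whereas reading the first entry would return different values for the two rows.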


## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    total_minutes,
    CASE
        WHEN total_minutes < 100 THEN 'low'
        WHEN total_minutes >= 100 AND total_minutes <= 300 THEN 'medium'
        WHEN total_minutes > 300 THEN 'high'
        ELSE 'unknown' -- Handle NULLs (for example, users with no watch history)
    END AS watch_time_bucket
FROM
    `directed-bongo-471119-d1.netflix.churn_features`
LIMIT 10; -- Adding a limit for a quick preview

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    CASE
        WHEN t1.total_minutes < 100 THEN 'low'
        WHEN t1.total_minutes >= 100 AND t1.total_minutes <= 300 THEN 'medium'
        WHEN t1.total_minutes > 300 THEN 'high'
        ELSE 'unknown'
    END AS watch_time_bucket,
    AVG(t2.churn_probability) AS average_churn_probability
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
JOIN
    (SELECT
        user_id,
        predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
    FROM
        ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
              (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`))) AS t2
ON
    t1.user_id = t2.user_id
GROUP BY
    watch_time_bucket
ORDER BY
    average_churn_probability DESC;
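
Boundary values are the easiest place for bucket logic to go wrong. The Python sketch below mirrors the thresholds from the prompt template (<100, 100–300, >300) so the edge cases can be checked by hand; `watch_time_bucket` is a hypothetical helper, not part of the lab's SQL.

```python
def watch_time_bucket(total_minutes):
    """Mirror the CASE expression: low (<100), medium (100-300), high (>300)."""
    if total_minutes is None:
        return "unknown"  # same role as the ELSE branch for NULLs
    if total_minutes < 100:
        return "low"
    if total_minutes <= 300:
        return "medium"
    return "high"

# Check the values on either side of each threshold.
for minutes in (None, 0, 99, 100, 300, 301):
    print(minutes, watch_time_bucket(minutes))
```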


## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    total_minutes,
    CASE
        WHEN total_minutes > 500 THEN 1
        ELSE 0
    END AS flag_binge
FROM
    `directed-bongo-471119-d1.netflix.churn_features`
LIMIT 10; -- Adding a limit for a quick preview

In [None]:
%%bigquery --project directed-bongo-471119-d1
-- Compare binge-watchers (flag_binge = 1) with everyone else (flag_binge = 0)
SELECT
    CASE WHEN t1.total_minutes > 500 THEN 1 ELSE 0 END AS flag_binge,
    COUNT(*) AS num_users,
    AVG(t2.churn_probability) AS average_churn_probability
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
JOIN
    (SELECT
        user_id,
        predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
    FROM
        ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
              (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`))) AS t2
ON
    t1.user_id = t2.user_id
GROUP BY
    flag_binge
ORDER BY
    flag_binge;
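
Answering "more or less likely" needs both groups, not only the binge-watchers. As a small sketch of that comparison, the Python below computes the observed churn rate for each value of `flag_binge`; the `(total_minutes, churned)` pairs are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (total_minutes, churned) pairs for illustration only.
rows = [(620, 0), (540, 1), (120, 1), (80, 1), (300, 0), (700, 0)]

totals = defaultdict(lambda: [0, 0])  # flag -> [churned_count, user_count]
for total_minutes, churned in rows:
    # A missing total_minutes falls into the 0 group, like the ELSE branch in SQL.
    flag_binge = 1 if total_minutes is not None and total_minutes > 500 else 0
    totals[flag_binge][0] += churned
    totals[flag_binge][1] += 1

churn_rate = {flag: churned / count for flag, (churned, count) in totals.items()}
print(churn_rate)
```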


## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have highest churn?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    subscription_plan,
    country,
    CONCAT(subscription_plan, '_', country) AS sub_region_combo
FROM
    `directed-bongo-471119-d1.netflix.churn_features`
LIMIT 10; -- Adding a limit for a quick preview

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    CONCAT(t1.subscription_plan, '_', t1.country) AS sub_region_combo,
    AVG(t2.churn_probability) AS average_churn_probability
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
JOIN
    (SELECT
        user_id,
        predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
    FROM
        ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
              (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`))) AS t2
ON
    t1.user_id = t2.user_id
GROUP BY
    sub_region_combo
ORDER BY
    average_churn_probability DESC
LIMIT 10; -- Limit to the top 10 combinations
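
One detail to watch with `CONCAT` in BigQuery is that a NULL argument makes the whole result NULL. The Python sketch below mimics that behavior so the effect on the combined column is easy to see; `plan_region_combo` is a hypothetical helper and the values are made up.

```python
def plan_region_combo(subscription_plan, country):
    """Mimic CONCAT(plan, '_', country): None if either part is missing."""
    if subscription_plan is None or country is None:
        return None
    return f"{subscription_plan}_{country}"

print(plan_region_combo("Premium", "USA"))
print(plan_region_combo("Basic", None))
```

Rows with a missing plan or country would end up grouped under a single NULL combination rather than under a partial label.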


## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    user_id,
    age,
    CASE
        WHEN age IS NULL THEN 1
        ELSE 0
    END AS is_missing_age
FROM
    `directed-bongo-471119-d1.netflix.churn_features`
LIMIT 10; -- Adding a limit for a quick preview

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    CASE
        WHEN t1.age IS NULL THEN 1
        ELSE 0
    END AS is_missing_age,
    AVG(t2.churn_probability) AS average_churn_probability
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
JOIN
    (SELECT
        user_id,
        predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
    FROM
        ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
              (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`))) AS t2
ON
    t1.user_id = t2.user_id
GROUP BY
    is_missing_age;
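
The same flag pattern applies to every column named in the goal. As a sketch, the Python below adds an `is_missing_<col>` flag for each listed column; the rows are hypothetical, and `add_missing_flags` is an illustrative helper rather than part of the lab's SQL.

```python
def add_missing_flags(row, columns):
    """Return a copy of row with an is_missing_<col> flag for each listed column."""
    flagged = dict(row)
    for col in columns:
        flagged[f"is_missing_{col}"] = 1 if row.get(col) is None else 0
    return flagged

# Hypothetical rows for illustration only.
rows = [
    {"user_id": "u1", "age": 34, "avg_rating": None},
    {"user_id": "u2", "age": None, "avg_rating": 4.2},
]
flagged_rows = [add_missing_flags(r, ["age", "avg_rating"]) for r in rows]
for r in flagged_rows:
    print(r)
```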


## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    CASE
        WHEN t2.last_login_date IS NULL THEN 'unknown' -- users with no watch history
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) < 10 THEN '<10'
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) < 20 THEN '10-19'
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) <= 30 THEN '20-30'
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) <= 50 THEN '31-50'
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) <= 80 THEN '51-80'
        WHEN DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) <= 100 THEN '81-100'
        ELSE '>100'
    END AS days_since_last_login_bucket,
    AVG(t3.churn_probability) AS average_churn_probability
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
LEFT JOIN
    (SELECT user_id, MAX(watch_date) AS last_login_date FROM `directed-bongo-471119-d1.netflix.watch_history` GROUP BY user_id) AS t2
ON
    t1.user_id = t2.user_id
JOIN
    (SELECT
        user_id,
        predicted_churn_next_month_probs[OFFSET(0)].prob AS churn_probability
    FROM
        ML.PREDICT(MODEL `directed-bongo-471119-d1.netflix.churn_model`,
              (SELECT * FROM `directed-bongo-471119-d1.netflix.churn_features`))) AS t3
ON
    t1.user_id = t3.user_id
GROUP BY
    days_since_last_login_bucket
ORDER BY
    average_churn_probability DESC;
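
One way to check that a set of recency buckets has no gaps is to run every day count through the bucketing logic and look for values that fall through. The Python sketch below does this for one gap-free set of thresholds chosen for illustration; the dates are made up and the helper names are hypothetical.

```python
from datetime import date

# (exclusive upper bound in days, label), in ascending order.
BUCKETS = [(10, "<10"), (20, "10-19"), (31, "20-30"),
           (51, "31-50"), (81, "51-80"), (101, "81-100")]

def days_since(last_login, today):
    """Whole days between last_login and today, or None if never seen."""
    return None if last_login is None else (today - last_login).days

def recency_bucket(days):
    """Map a day count to its bucket label; None maps to 'unknown'."""
    if days is None:
        return "unknown"
    for upper, label in BUCKETS:
        if days < upper:
            return label
    return ">100"

# Every day count from 0 to 365 should land in a named bucket.
unbucketed = [d for d in range(366) if recency_bucket(d) == "unknown"]
example = recency_bucket(days_since(date(2025, 10, 1), date(2025, 10, 16)))
print(unbucketed, example)
```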


## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?


In [None]:
%%bigquery --project directed-bongo-471119-d1
CREATE OR REPLACE TABLE `directed-bongo-471119-d1.netflix.churn_features_enhanced` AS
SELECT
    t1.*, -- Include all original columns from churn_features
    CASE
        WHEN t1.total_minutes < 100 THEN 'low'
        WHEN t1.total_minutes >= 100 AND t1.total_minutes <= 300 THEN 'medium'
        WHEN t1.total_minutes > 300 THEN 'high'
        ELSE 'unknown'
    END AS watch_time_bucket,
    CASE
        WHEN t1.total_minutes > 500 THEN 1
        ELSE 0
    END AS flag_binge,
    CONCAT(t1.subscription_plan, '_', t1.country) AS sub_region_combo,
    CASE
        WHEN t1.age IS NULL THEN 1
        ELSE 0
    END AS is_missing_age,
    DATE_DIFF(CURRENT_DATE(), t2.last_login_date, DAY) AS days_since_last_login
FROM
    `directed-bongo-471119-d1.netflix.churn_features` AS t1
LEFT JOIN
    (SELECT user_id, MAX(watch_date) AS last_login_date FROM `directed-bongo-471119-d1.netflix.watch_history` GROUP BY user_id) AS t2
ON
    t1.user_id = t2.user_id;

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    (SELECT COUNT(*) FROM `directed-bongo-471119-d1.netflix.churn_features`) AS original_row_count,
    (SELECT COUNT(*) FROM `directed-bongo-471119-d1.netflix.churn_features_enhanced`) AS enhanced_row_count;
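
Row counts stay stable here because the join target is aggregated to one row per user before joining. The sketch below illustrates the difference using Python's built-in `sqlite3` as a local stand-in for BigQuery; the `users` and `watch_history` tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory toy database; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT);
    CREATE TABLE watch_history (user_id TEXT, watch_date TEXT);
    INSERT INTO users VALUES ('u1'), ('u2'), ('u3');
    INSERT INTO watch_history VALUES
        ('u1', '2025-01-01'), ('u1', '2025-02-01'), ('u2', '2025-01-15');
""")

# Joining the raw table repeats users who have several watch rows.
raw_join = conn.execute(
    "SELECT COUNT(*) FROM users u LEFT JOIN watch_history w ON u.user_id = w.user_id"
).fetchone()[0]

# Joining a per-user aggregate keeps exactly one row per user.
agg_join = conn.execute("""
    SELECT COUNT(*) FROM users u
    LEFT JOIN (SELECT user_id, MAX(watch_date) AS last_watch
               FROM watch_history GROUP BY user_id) w
    ON u.user_id = w.user_id
""").fetchone()[0]

print(raw_join, agg_join)
```

If the two counts in the check above ever differ, a join against an unaggregated table is a likely cause.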


## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [None]:
%%bigquery --project directed-bongo-471119-d1
CREATE OR REPLACE MODEL `directed-bongo-471119-d1.netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_next_month']) AS
SELECT
    churn_next_month,
    country,
    subscription_plan,
    age,
    avg_rating,
    total_minutes,
    avg_progress,
    num_sessions,
    watch_time_bucket,
    flag_binge,
    sub_region_combo,
    is_missing_age,
    days_since_last_login
FROM
    `directed-bongo-471119-d1.netflix.churn_features_enhanced`;

SELECT * FROM ML.EVALUATE(MODEL `directed-bongo-471119-d1.netflix.churn_model_enhanced`);


## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT * FROM ML.EVALUATE(MODEL `directed-bongo-471119-d1.netflix.churn_model`);

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT * FROM ML.EVALUATE(MODEL `directed-bongo-471119-d1.netflix.churn_model_enhanced`);
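
Once both evaluation results are in hand, a side-by-side view with deltas makes the comparison easier to read. The Python sketch below shows one way to lay that out; the metric values are hypothetical and are not results from this lab.

```python
def compare_metrics(base, enhanced):
    """Return {metric: (base, enhanced, delta)} for metrics present in both."""
    return {
        name: (base[name], enhanced[name], enhanced[name] - base[name])
        for name in base
        if name in enhanced
    }

# Hypothetical values for illustration only.
base = {"accuracy": 0.5, "precision": 0.25, "recall": 0.5, "log_loss": 0.75}
enhanced = {"accuracy": 0.75, "precision": 0.5, "recall": 0.5, "log_loss": 0.5}

comparison = compare_metrics(base, enhanced)
for name, (b, e, delta) in comparison.items():
    # For log_loss a negative delta is an improvement; for the others, positive is.
    print(f"{name:<10} base={b:.3f} enhanced={e:.3f} delta={delta:+.3f}")
```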

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    *
FROM
    ML.WEIGHTS(MODEL `directed-bongo-471119-d1.netflix.churn_model_enhanced`);
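
A rough way to read the weights is to rank features by the absolute size of their weight. The Python sketch below does that for hypothetical feature and weight pairs, loosely shaped like the numeric rows of the weights output; magnitudes are only comparable when the features are on similar scales, so treat the ranking as a starting point rather than a verdict.

```python
# Hypothetical (feature, weight) pairs for illustration only.
weights = [
    ("total_minutes", -0.8),
    ("age", 0.1),
    ("num_sessions", -0.3),
    ("flag_binge", 0.5),
]

# Sort by magnitude so strong negative and strong positive effects both rise to the top.
ranked = sorted(weights, key=lambda item: abs(item[1]), reverse=True)
for feature, weight in ranked:
    print(f"{feature:<15} {weight:+.2f}")
```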