
# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features


In [3]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
client = bigquery.Client(project="mgmt467-472519")

# ✅ Create or reference your dataset (matches all lab tasks)
dataset_id = "mgmt467-472519.unit2_lab2_churn"
client.create_dataset(dataset_id, exists_ok=True)
print(f"Dataset confirmed: {dataset_id}")


Dataset confirmed: mgmt467-472519.unit2_lab2_churn



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?


In [4]:
%%bigquery df_watch_buckets --project mgmt467-472519

SELECT
  user_id,
  r3_min,
  CASE
    WHEN r3_min < 100 THEN 'low'
    WHEN r3_min BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket
FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [7]:
df_watch_buckets.head(10)

Unnamed: 0,user_id,r3_min,watch_time_bucket
0,user_00002,0.0,low
1,user_00002,0.0,low
2,user_00002,0.0,low
3,user_00002,0.0,low
4,user_00002,0.0,low
5,user_00003,0.0,low
6,user_00003,0.0,low
7,user_00005,0.0,low
8,user_00005,0.0,low
9,user_00006,0.0,low


In [6]:
%%bigquery df_churn_by_bucket --project mgmt467-472519

SELECT
  watch_time_bucket,
  COUNT(*) AS total_users,
  ROUND(AVG(churn_next_month), 3) AS churn_rate
FROM (
  SELECT
    *,
    CASE
      WHEN r3_min < 100 THEN 'low'
      WHEN r3_min BETWEEN 100 AND 300 THEN 'medium'
      ELSE 'high'
    END AS watch_time_bucket
  FROM `mgmt467-472519.netflix.feat_churn_lite`
)
GROUP BY watch_time_bucket
ORDER BY churn_rate DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [8]:
df_churn_by_bucket


Unnamed: 0,watch_time_bucket,total_users,churn_rate
0,low,715320,0.661
1,high,979328,0.66
2,medium,200552,0.653


**Exploration:**
Churn rate does not vary much by bucket, but the curn rate does decrease as watch time increases.


## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?


In [9]:
%%bigquery df_flag_binge --project mgmt467-472519

SELECT
  user_id,
  r3_min,
  IF(r3_min > 500, 1, 0) AS flag_binge
FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [10]:
df_flag_binge.head(10)

Unnamed: 0,user_id,r3_min,flag_binge
0,user_00001,0.0,0
1,user_00001,0.0,0
2,user_00001,0.0,0
3,user_00002,0.0,0
4,user_00002,0.0,0
5,user_00003,0.0,0
6,user_00003,0.0,0
7,user_00005,0.0,0
8,user_00006,0.0,0
9,user_00007,0.0,0


In [12]:
%%bigquery df_binge_churn --project mgmt467-472519

SELECT
  flag_binge,
  COUNT(*) AS total_users,
  ROUND(AVG(churn_next_month), 3) AS churn_rate
FROM (
  SELECT
    *,
    IF(r3_min > 500, 1, 0) AS flag_binge
  FROM `mgmt467-472519.netflix.feat_churn_lite`
)
GROUP BY flag_binge
ORDER BY flag_binge DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [13]:
df_binge_churn



Unnamed: 0,flag_binge,total_users,churn_rate
0,1,766280,0.661
1,0,1128920,0.659


**Exploration:** Binge watchers are slightly more likely to churn.


## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have highest churn?


In [14]:
%%bigquery df_plan_region_combo --project mgmt467-472519

SELECT
  user_id,
  subscription_plan,
  country AS region,
  CONCAT(subscription_plan, '_', country) AS plan_region_combo

FROM `mgmt467-472519.netflix.feat_churn_lite`;

Query is running:   0%|          |

Downloading:   0%|          |

In [15]:
# Display first 10 rows
df_plan_region_combo.head(10)


Unnamed: 0,user_id,subscription_plan,region,plan_region_combo
0,user_00008,Basic,Canada,Basic_Canada
1,user_00008,Basic,Canada,Basic_Canada
2,user_00015,Basic,Canada,Basic_Canada
3,user_00021,Basic,Canada,Basic_Canada
4,user_00023,Basic,Canada,Basic_Canada
5,user_00023,Basic,Canada,Basic_Canada
6,user_00023,Basic,Canada,Basic_Canada
7,user_00023,Basic,Canada,Basic_Canada
8,user_00023,Basic,Canada,Basic_Canada
9,user_00026,Basic,Canada,Basic_Canada


In [16]:
%%bigquery df_plan_region_churn --project mgmt467-472519

SELECT
  plan_region_combo,
  COUNT(*) AS total_users,
  ROUND(AVG(churn_next_month), 3) AS churn_rate
FROM (
  SELECT
    *,
    CONCAT(subscription_plan, '_', country) AS plan_region_combo
  FROM `mgmt467-472519.netflix.feat_churn_lite`
)
GROUP BY plan_region_combo
ORDER BY churn_rate DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [17]:
# Display top 10 plan/region combos by churn
df_plan_region_churn.head(10)


Unnamed: 0,plan_region_combo,total_users,churn_rate
0,Standard_Canada,202584,0.665
1,Premium+_Canada,51704,0.664
2,Basic_Canada,112976,0.662
3,Premium_Canada,202400,0.662
4,Premium_USA,463496,0.659
5,Basic_USA,258704,0.658
6,Premium+_USA,138920,0.658
7,Standard_USA,464416,0.657


**Exploration:** The tope three plan/region combos by churn rat eare Standard/Canada, Premium+/Canada, and Basic/Canada.


## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?


In [18]:
%%bigquery df_missing_flags --project mgmt467-472519

SELECT
  user_id,
  age,
  avg_watch_duration,
  IF(age IS NULL, 1, 0) AS is_missing_age,
  IF(avg_watch_duration IS NULL, 1, 0) AS is_missing_avg_watch
FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [19]:
df_missing_flags

Unnamed: 0,user_id,age,avg_watch_duration,is_missing_age,is_missing_avg_watch
0,user_00011,67.0,0.000000,0,0
1,user_00026,60.0,0.000000,0,0
2,user_00026,60.0,0.000000,0,0
3,user_00042,58.0,0.000000,0,0
4,user_00042,58.0,0.000000,0,0
...,...,...,...,...,...
1895195,user_04428,57.0,65.700000,0,0
1895196,user_04428,57.0,0.000000,0,0
1895197,user_06474,57.0,49.433333,0,0
1895198,user_07656,57.0,39.833333,0,0


In [20]:
%%bigquery df_missing_churn --project mgmt467-472519

SELECT
  is_missing_age,
  is_missing_avg_watch,
  COUNT(*) AS total_users,
  ROUND(AVG(churn_next_month), 3) AS churn_rate
FROM (
  SELECT
    *,
    IF(age IS NULL, 1, 0) AS is_missing_age,
    IF(avg_watch_duration IS NULL, 1, 0) AS is_missing_avg_watch
  FROM `mgmt467-472519.netflix.feat_churn_lite`
)
GROUP BY is_missing_age, is_missing_avg_watch
ORDER BY churn_rate DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [21]:
df_missing_churn

Unnamed: 0,is_missing_age,is_missing_avg_watch,total_users,churn_rate
0,1,0,226136,0.66
1,0,0,1669064,0.66


**Exploration:** No, missing values doesn't correlate with churn.


## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?


In [22]:
%%bigquery df_date_cols --project mgmt467-472519

SELECT column_name, data_type
FROM `mgmt467-472519`.netflix.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'feat_churn_lite'
  AND data_type IN ('DATE','DATETIME','TIMESTAMP')
ORDER BY column_name;


Query is running:   0%|          |

Downloading:   0%|          |

In [23]:
df_date_cols

Unnamed: 0,column_name,data_type
0,month,DATE


In [24]:
%%bigquery df_time_features --project mgmt467-472519

SELECT
  user_id,
  month,
  DATE_DIFF(CURRENT_DATE(), month, DAY) AS days_since_month
FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [25]:
# Display first 10 rows
df_time_features.head(10)


Unnamed: 0,user_id,month,days_since_month
0,user_00003,2025-11-01,-5
1,user_00011,2025-11-01,-5
2,user_00041,2025-11-01,-5
3,user_00050,2025-11-01,-5
4,user_00063,2025-11-01,-5
5,user_00068,2025-11-01,-5
6,user_00071,2025-11-01,-5
7,user_00071,2025-11-01,-5
8,user_00071,2025-11-01,-5
9,user_00099,2025-11-01,-5


In [26]:
%%bigquery df_login_churn --project mgmt467-472519

SELECT
  CASE
    WHEN DATE_DIFF(CURRENT_DATE(), month, DAY) < 30 THEN 'Last 30 Days'
    WHEN DATE_DIFF(CURRENT_DATE(), month, DAY) BETWEEN 30 AND 90 THEN '30–90 Days'
    ELSE '90+ Days'
  END AS recency_bucket,
  COUNT(*) AS total_users,
  ROUND(AVG(churn_next_month), 3) AS churn_rate
FROM `mgmt467-472519.netflix.feat_churn_lite`
GROUP BY recency_bucket
ORDER BY churn_rate DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [28]:
# Display churn rates by recency
df_login_churn


Unnamed: 0,recency_bucket,total_users,churn_rate
0,30–90 Days,164800,0.664
1,90+ Days,1565600,0.66
2,Last 30 Days,164800,0.655


**Exploration:** There is no linear relationship apparent between login and churn as the highest churn rate belongs to 30-90 days. This could signal that active users don't churn as often and customers with a last log-in over 90 days ago might have forgotten they are subscribed.


## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?


In [38]:
%%bigquery df_churn_features_enhanced --project mgmt467-472519

SELECT
  user_id,
  subscription_plan,
  country,
  age,
  avg_watch_duration,
  r3_min,
  r3_sess,
  churn_next_month,
  month,

  CASE
    WHEN r3_min < 100 THEN 'low'
    WHEN r3_min BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,

  CONCAT(subscription_plan, '_', country) AS plan_region_combo,
  IF(r3_min > 500, 1, 0) AS flag_binge,
  IF(age IS NULL, 1, 0) AS is_missing_age,
  IF(avg_watch_duration IS NULL, 1, 0) AS is_missing_avg_watch,
  DATE_DIFF(CURRENT_DATE(), month, DAY) AS days_since_month

FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [39]:
# Display first 10 rows
df_churn_features_enhanced.head(10)


Unnamed: 0,user_id,subscription_plan,country,age,avg_watch_duration,r3_min,r3_sess,churn_next_month,month,watch_time_bucket,plan_region_combo,flag_binge,is_missing_age,is_missing_avg_watch,days_since_month
0,user_00008,Basic,Canada,,0.0,0.0,0,1,2025-06-01,low,Basic_Canada,0,1,0,148
1,user_00008,Basic,Canada,,0.0,0.0,0,1,2025-05-01,low,Basic_Canada,0,1,0,179
2,user_00008,Basic,Canada,,0.0,0.0,0,0,2024-01-01,low,Basic_Canada,0,1,0,665
3,user_00021,Basic,Canada,38.0,0.0,0.0,0,1,2025-09-01,low,Basic_Canada,0,0,0,56
4,user_00021,Basic,Canada,38.0,0.0,0.0,0,0,2024-04-01,low,Basic_Canada,0,0,0,574
5,user_00021,Basic,Canada,38.0,0.0,0.0,0,1,2025-08-01,low,Basic_Canada,0,0,0,87
6,user_00021,Basic,Canada,38.0,0.0,0.0,0,1,2024-02-01,low,Basic_Canada,0,0,0,634
7,user_00021,Basic,Canada,38.0,0.0,0.0,0,1,2025-07-01,low,Basic_Canada,0,0,0,118
8,user_00021,Basic,Canada,38.0,0.0,0.0,0,1,2024-02-01,low,Basic_Canada,0,0,0,634
9,user_00023,Basic,Canada,40.0,0.0,0.0,0,1,2025-04-01,low,Basic_Canada,0,0,0,209


In [40]:
%%bigquery df_enhanced_summary --project mgmt467-472519

SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
  SUM(CASE WHEN subscription_plan IS NULL THEN 1 ELSE 0 END) AS null_plan,
  SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_country,
  SUM(CASE WHEN r3_min IS NULL THEN 1 ELSE 0 END) AS null_r3_min,
  SUM(CASE WHEN churn_next_month IS NULL THEN 1 ELSE 0 END) AS null_churn_label
FROM `mgmt467-472519.netflix.feat_churn_lite`;


Query is running:   0%|          |

Downloading:   0%|          |

In [41]:
# Summary of row counts and nulls
df_enhanced_summary


Unnamed: 0,total_rows,null_user_id,null_plan,null_country,null_r3_min,null_churn_label
0,1895200,0,0,0,0,0


**Exploration:** Yes, row counts are stable and there are no NULL values. This is because we feature engineering earlier in the labs.


## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [42]:
%%bigquery df_cols --project mgmt467-472519

SELECT column_name
FROM `mgmt467-472519`.netflix.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'churn_features_enhanced'
ORDER BY column_name;


Query is running:   0%|          |

Downloading:   0%|          |

In [43]:
df_cols

Unnamed: 0,column_name
0,age
1,avg_watch_duration
2,churn_next_month
3,country
4,days_since_month
5,flag_binge
6,is_missing_age
7,is_missing_avg_watch
8,month
9,plan_region_combo


In [48]:
%%bigquery df_train_model --project mgmt467-472519

CREATE OR REPLACE MODEL `mgmt467-472519.netflix.churn_model_enhanced`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label'],
  data_split_method = 'AUTO_SPLIT'
) AS

SELECT
  subscription_plan,
  country,
  age,
  avg_watch_duration
  watch_time_bucket,
  plan_region_combo,
  flag_binge,

  churn_next_month AS churn_label
FROM `mgmt467-472519.netflix.churn_features_enhanced`;


Query is running:   0%|          |

In [49]:
%%bigquery df_model_eval --project mgmt467-472519

SELECT
  *
FROM ML.EVALUATE(MODEL `mgmt467-472519.netflix.churn_model_enhanced`);


Query is running:   0%|          |

Downloading:   0%|          |

In [50]:
# Display model evaluation metrics
df_model_eval


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.655872,1.0,0.655872,0.792177,0.643715,0.503932


**Exploration:** Accuracy did not improve.


## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?


In [51]:
%%bigquery df_compare_models --project mgmt467-472519

# Compare base model vs enhanced model using ML.EVALUATE

WITH base AS (
  SELECT
    'Base Model' AS model_name,
    *
  FROM ML.EVALUATE(MODEL `mgmt467-472519.netflix.churn_model`)
),
enhanced AS (
  SELECT
    'Enhanced Model' AS model_name,
    *
  FROM ML.EVALUATE(MODEL `mgmt467-472519.netflix.churn_model_enhanced`)
)

SELECT
  model_name,
  precision,
  recall,
  accuracy,
  f1_score,
  roc_auc
FROM (
  SELECT * FROM base
  UNION ALL
  SELECT * FROM enhanced
)
ORDER BY roc_auc DESC;


Query is running:   0%|          |

Downloading:   0%|          |

In [53]:
# Display evaluation for both models
df_compare_models


Unnamed: 0,model_name,precision,recall,accuracy,f1_score,roc_auc
0,Base Model,0.662712,1.0,0.662712,0.797146,0.507805
1,Enhanced Model,0.655872,1.0,0.655872,0.792177,0.503932
