<a href="https://colab.research.google.com/github/louissiller/mgmt467-analytics-portfolio/blob/main/Lab2_Churn_Modeling_FeatureEngineering_Colab_Task5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [2]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt467-71800"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [3]:
# ✅ Verify BigQuery access
%%bigquery --project $project_id
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user


| today | user |
|---|---|
| 2025-10-27 | siller.louis@gmail.com |


In [4]:
%%bigquery --project $project_id
-- Creates (or replaces) an empty table with this schema; rows must be loaded separately.
CREATE OR REPLACE TABLE `netflix.cleaned_features` (
    user_id STRING,
    region STRING,
    plan_tier STRING,
    age_band STRING,
    avg_rating FLOAT64,
    total_minutes FLOAT64,
    avg_progress FLOAT64,
    num_sessions INT64,
    churn_label BOOL
);


In [5]:
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.cleaned_features`
WHERE churn_label IS NOT NULL;


In [6]:
# ✅ Train base logistic regression model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.churn_features`;

Executing query with job ID: 778f1a2d-2168-49cb-a28f-92aca9ba9689
Query executing: 0.44s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt467-71800/queries/778f1a2d-2168-49cb-a28f-92aca9ba9689?maxResults=0&location=US&prettyPrint=false: Missing 'label' column in query statement. Update OPTIONS(input_label_cols=['your_label_col']) to indicate the correct label column name.

Location: US
Job ID: 778f1a2d-2168-49cb-a28f-92aca9ba9689
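
The error above says BigQuery ML found no column named `label` and was not told which column to use instead. A minimal sketch of assembling the statement so the label column is always declared through `input_label_cols`; `build_create_model_sql` is a hypothetical helper written for this illustration, not part of any library, and it only builds a string:

```python
def build_create_model_sql(model_name, source_table, feature_cols, label_col):
    """Return a BigQuery ML CREATE MODEL statement that names its label column.

    Hypothetical helper for illustration; it builds the SQL text only and
    does not send anything to BigQuery.
    """
    select_list = ",\n  ".join(list(feature_cols) + [label_col])
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='logistic_reg', input_label_cols=['{label_col}']) AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM `{source_table}`;"
    )


base_sql = build_create_model_sql(
    model_name="netflix.churn_model",
    source_table="netflix.churn_features",
    feature_cols=["region", "plan_tier", "age_band", "avg_rating",
                  "total_minutes", "avg_progress", "num_sessions"],
    label_col="churn_label",
)
print(base_sql)
```

The printed statement can be pasted into a `%%bigquery` cell. Training still needs a non-empty source table.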



In [7]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

Executing query with job ID: 3c1da27c-e3fa-44d2-b461-4cbea09ba70b
Query executing: 0.34s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model

Location: US
Job ID: 3c1da27c-e3fa-44d2-b461-4cbea09ba70b



In [8]:
# ✅ Predict churn with base model
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model`,
                (SELECT * FROM `netflix.churn_features`));

Executing query with job ID: 72987ab1-cc6f-45c8-a1f7-044ff6d7a127
Query executing: 0.41s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model

Location: US
Job ID: 72987ab1-cc6f-45c8-a1f7-044ff6d7a127




## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [9]:

# ✅ Create enhanced feature set
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `netflix.churn_features`;



In [10]:

# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `netflix.churn_features_enhanced`;


Executing query with job ID: ab806dff-ad88-4726-95eb-ef428c286683
Query executing: 0.45s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt467-71800/queries/ab806dff-ad88-4726-95eb-ef428c286683?maxResults=0&location=US&prettyPrint=false: Missing 'label' column in query statement. Update OPTIONS(input_label_cols=['your_label_col']) to indicate the correct label column name.

Location: US
Job ID: ab806dff-ad88-4726-95eb-ef428c286683



In [11]:

# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);


Executing query with job ID: 5773d2c1-3ff9-4ce4-8bd6-aee4f8c9e36f
Query executing: 0.55s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model_enhanced; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model_enhanced

Location: US
Job ID: 5773d2c1-3ff9-4ce4-8bd6-aee4f8c9e36f



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.

The rationale behind each feature engineering choice:

1.  **Bucketing continuous values (e.g., watch time):**
    -   Logistic regression fits a single linear effect per numeric feature. Bucketing gives each range its own coefficient, so the model can capture non-linear patterns such as a watch-time threshold beyond which churn behavior changes.
    -   Buckets also limit the influence of outliers and give discrete segments that are easier to analyze and communicate.

2.  **Interaction terms (e.g., `plan_region_combo`):**
    -   An interaction term lets the effect of one feature depend on the value of another. For example, how well a given plan tier retains users might vary across regions because of local factors.
    -   Without it, the model assumes each plan tier has the same effect in every region.

3.  **Binary flags (e.g., `flag_binge`):**
    -   A binary flag isolates a specific behavior so the model can assign it its own effect. Here `flag_binge` marks users whose `total_minutes` exceeds 500, separating the heaviest viewers from everyone else.
    -   Because the flag is derived from `total_minutes`, it captures a threshold effect rather than new information; it says nothing about how that viewing is spread over time.

4.  **Upon evaluating the enhanced model:**
    -   Compare the enhanced model's metrics (e.g., AUC, precision, recall) against the baseline's. An improvement suggests the new features carry useful signal, but because several features are added at once, attributing the gain to a single one requires adding or removing features one at a time.
    -   The model coefficients show the direction and magnitude of each feature's influence on churn probability.
    -   Note which features turn out to be most influential and whether any result contradicts your initial hypotheses.
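
To make the point about coefficients concrete: in a logistic regression, a coefficient is a change in log-odds, so exponentiating it gives an odds ratio. A minimal sketch under that standard definition; the function names are illustrative and the inputs are not taken from any trained model:

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in churn odds for a one-unit increase in a feature."""
    return math.exp(coefficient)


def churn_probability(intercept, weighted_sum):
    """Logistic function applied to intercept + sum(coefficient * feature value)."""
    return 1.0 / (1.0 + math.exp(-(intercept + weighted_sum)))


# A coefficient of 0 leaves the odds unchanged; ln(2) doubles them.
print(odds_ratio(0.0))
print(odds_ratio(math.log(2)))
print(churn_probability(0.0, 0.0))
```

A positive coefficient (odds ratio above 1) pushes toward churn and a negative one pushes away from it. Magnitudes are only comparable across features that share a scale.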


# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?
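
One way to explore this question offline is to mirror the bucketing rule in Python and compute the churn share per bucket. This is a sketch on made-up rows, not lab results. It also makes one assumption that differs from the SQL `CASE`: a missing `total_minutes` is returned as `None`, whereas the `CASE` expression sends NULL to its `ELSE` branch (`'high'`).

```python
from collections import defaultdict


def watch_time_bucket(total_minutes):
    """Mirror the CASE expression: <100 low, 100-300 medium, otherwise high."""
    if total_minutes is None:
        return None  # assumption: keep missing watch time out of the 'high' bucket
    if total_minutes < 100:
        return "low"
    if total_minutes <= 300:
        return "medium"
    return "high"


def churn_rate_by(rows, key_fn):
    """Return {group: share of rows whose churn_label is True}."""
    totals, churned = defaultdict(int), defaultdict(int)
    for row in rows:
        group = key_fn(row)
        totals[group] += 1
        churned[group] += int(bool(row["churn_label"]))
    return {group: churned[group] / totals[group] for group in totals}


# Made-up rows for illustration only; not results from the lab dataset.
toy_rows = [
    {"total_minutes": 40.0, "churn_label": True},
    {"total_minutes": 80.0, "churn_label": True},
    {"total_minutes": 150.0, "churn_label": False},
    {"total_minutes": 300.0, "churn_label": True},
    {"total_minutes": 900.0, "churn_label": False},
]
rates = churn_rate_by(toy_rows, lambda r: watch_time_bucket(r["total_minutes"]))
print(rates)
```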



## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?
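
A small sketch of the flag and the comparison it enables, again on made-up data. `flag_binge` mirrors `IF(total_minutes > 500, 1, 0)`, including the SQL behavior that a NULL input yields 0; `churn_rate` is a helper introduced here for illustration.

```python
def flag_binge(total_minutes, threshold=500):
    """Mirror IF(total_minutes > 500, 1, 0); a NULL (None) input yields 0."""
    if total_minutes is None:
        return 0
    return 1 if total_minutes > threshold else 0


def churn_rate(labels):
    """Share of True values, or None when the group is empty."""
    labels = list(labels)
    return sum(labels) / len(labels) if labels else None


# Made-up (total_minutes, churn_label) pairs; not results from the lab dataset.
toy_users = [(620.0, False), (510.0, True), (480.0, True), (90.0, True), (None, False)]

binge_rate = churn_rate(c for m, c in toy_users if flag_binge(m) == 1)
other_rate = churn_rate(c for m, c in toy_users if flag_binge(m) == 0)
print(f"binge: {binge_rate}, others: {other_rate}")
```

Since the 500-minute threshold is above the 300-minute bucket boundary, every flagged user also falls in the `'high'` watch-time bucket. The flag therefore splits that bucket rather than defining a separate group.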



## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have the highest churn?
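
A sketch of ranking combinations by churn rate on made-up rows. `plan_region_combo` mirrors `CONCAT(plan_tier, '_', region)`, which returns NULL when either input is NULL. The tier and region values below are placeholders, not values from the lab data.

```python
from collections import defaultdict


def plan_region_combo(plan_tier, region):
    """Mirror CONCAT(plan_tier, '_', region), which is NULL if either part is NULL."""
    if plan_tier is None or region is None:
        return None
    return f"{plan_tier}_{region}"


def rank_combos_by_churn(rows):
    """Return [(combo, churn_rate, n_users)] sorted by churn rate, highest first."""
    totals, churned = defaultdict(int), defaultdict(int)
    for plan_tier, region, churn_label in rows:
        combo = plan_region_combo(plan_tier, region)
        totals[combo] += 1
        churned[combo] += int(churn_label)
    ranked = [(c, churned[c] / totals[c], totals[c]) for c in totals]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up rows; the tier and region values are placeholders, not lab data.
toy_rows = [
    ("basic", "US", True),
    ("basic", "US", True),
    ("basic", "EU", False),
    ("premium", "US", False),
    ("premium", "US", True),
]
ranking = rank_combos_by_churn(toy_rows)
print(ranking)
```

The user count is returned alongside each rate because a combination with only a handful of users can top the ranking by chance.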



## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?
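
A minimal sketch of the `is_missing_[col_name]` pattern applied to a row held as a Python dict, with `None` standing in for SQL NULL. `add_missing_flags` is a hypothetical helper, and it follows the template's naming literally, so the flag for `age_band` comes out as `is_missing_age_band` rather than the shorter `is_missing_age`.

```python
def add_missing_flags(row, columns):
    """Return a copy of row with an is_missing_<col> flag (1 or 0) per column."""
    flagged = dict(row)
    for col in columns:
        flagged[f"is_missing_{col}"] = 1 if row.get(col) is None else 0
    return flagged


# Made-up row for illustration; None stands in for SQL NULL.
toy_row = {"user_id": "u1", "age_band": None, "avg_rating": 4.2}
flagged_row = add_missing_flags(toy_row, ["age_band", "avg_rating"])
print(flagged_row)
```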


In [12]:
# ✅ Create enhanced feature set with bucket, interaction term, binge flag, and missingness flag
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  IF(age_band IS NULL, 1, 0) AS is_missing_age,
  IF(avg_rating IS NULL, 1, 0) AS is_missing_avg_rating,
  churn_label
FROM `netflix.churn_features`;



## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?
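
A sketch of the same day difference in Python, useful for checking expected values by hand. It assumes a `last_login_date` value is available per user; the `cleaned_features` schema defined earlier in this notebook has no such column, so the source of that date is left open here. The `today` parameter is an addition for reproducibility.

```python
from datetime import date


def days_since_last_login(last_login_date, today=None):
    """Mirror DATE_DIFF(CURRENT_DATE(), last_login_date, DAY); None stays None."""
    if last_login_date is None:
        return None
    today = today or date.today()
    return (today - last_login_date).days


# A fixed 'today' keeps the example reproducible; both dates are made up.
recency = days_since_last_login(date(2025, 10, 1), today=date(2025, 10, 16))
print(recency)
```

Because the value depends on the day the query runs, a table built with `CURRENT_DATE()` goes stale unless it is rebuilt before each training run.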



## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?
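
Both questions can be answered with a few counts. The sketch below runs them on rows held as Python dicts (for example, after downloading query results); `check_enhanced_table` and `null_counts` are hypothetical helpers, and the rows are made up. It also reports whether the table has any rows at all, since a model cannot be trained on an empty table.

```python
def null_counts(rows, columns):
    """Count None values per column across a list of dict rows."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def check_enhanced_table(base_rows, enhanced_rows, engineered_cols):
    """Summarize whether the enhanced table looks safe to train on."""
    return {
        "has_rows": len(enhanced_rows) > 0,
        "row_count_stable": len(base_rows) == len(enhanced_rows),
        "null_counts": null_counts(enhanced_rows, engineered_cols),
    }


# Made-up rows; None stands in for SQL NULL.
base = [{"user_id": "u1"}, {"user_id": "u2"}]
enhanced = [
    {"user_id": "u1", "watch_time_bucket": "low", "plan_region_combo": None},
    {"user_id": "u2", "watch_time_bucket": "high", "plan_region_combo": "basic_US"},
]
report = check_enhanced_table(base, enhanced, ["watch_time_bucket", "plan_region_combo"])
print(report)
```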


In [17]:
# ✅ Assemble Enhanced Feature Table
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  IF(age_band IS NULL, 1, 0) AS is_missing_age,
  IF(avg_rating IS NULL, 1, 0) AS is_missing_avg_rating,
  churn_label
FROM `netflix.churn_features`;

Executing query with job ID: c89f1d36-bee1-4a2c-83f7-eb793850ec90
Query executing: 0.53s


ERROR:
 400 Unrecognized name: watch_date at [19:29]; reason: invalidQuery, location: query, message: Unrecognized name: watch_date at [19:29]

Location: US
Job ID: c89f1d36-bee1-4a2c-83f7-eb793850ec90




## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [18]:
# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  is_missing_age,
  churn_label
FROM `netflix.churn_features_enhanced`;

Executing query with job ID: d1bb1e0f-e3af-4937-9aca-f2038d23b050
Query executing: 0.49s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt467-71800/queries/d1bb1e0f-e3af-4937-9aca-f2038d23b050?maxResults=0&location=US&prettyPrint=false: Input data doesn't contain any rows.

Location: US
Job ID: d1bb1e0f-e3af-4937-9aca-f2038d23b050




## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?
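
Once both models train, their `ML.EVALUATE` rows can be lined up metric by metric. A sketch assuming each result has been read into a dict keyed by metric name; the metric names follow the columns `ML.EVALUATE` returns for logistic regression, `compare_metrics` is a hypothetical helper, and the numbers are placeholders that neither model produced.

```python
METRICS = ("precision", "recall", "accuracy", "f1_score", "log_loss", "roc_auc")


def compare_metrics(base, enhanced):
    """Return [(metric, base, enhanced, enhanced - base)] for shared metrics."""
    return [
        (name, base[name], enhanced[name], enhanced[name] - base[name])
        for name in METRICS
        if name in base and name in enhanced
    ]


# Placeholder numbers for illustration only; neither model produced these.
base_eval = {"roc_auc": 0.5, "log_loss": 1.0}
enhanced_eval = {"roc_auc": 0.75, "log_loss": 0.5}

comparison = compare_metrics(base_eval, enhanced_eval)
for name, b, e, delta in comparison:
    print(f"{name:<10} base={b:.3f} enhanced={e:.3f} delta={delta:+.3f}")
```

A positive delta is an improvement for every metric except `log_loss`, where lower is better.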


In [15]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

Executing query with job ID: a1abf6f4-fb09-47ee-a904-85d2e237c3a0
Query executing: 0.32s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model

Location: US
Job ID: a1abf6f4-fb09-47ee-a904-85d2e237c3a0



In [16]:
# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);

Executing query with job ID: a722c822-4bc7-4fb9-9766-d4117df10bbe
Query executing: 0.26s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model_enhanced; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model_enhanced

Location: US
Job ID: a722c822-4bc7-4fb9-9766-d4117df10bbe

