<a href="https://colab.research.google.com/github/louissiller/mgmt467-analytics-portfolio/blob/main/Unit2_Lab2_Churn_Modeling_FeatureEngineering_Colab_Task1to4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [19]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt467-71800"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [20]:
%%bigquery --project $project_id
-- ✅ Verify BigQuery access (the cell magic must be the first line of the cell)
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user


Unnamed: 0,today,user
0,2025-10-27,siller.louis@gmail.com


In [21]:
%%bigquery --project $project_id
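-- Defines the schema only. CREATE OR REPLACE drops any existing `netflix.cleaned_features`
-- table and its rows, so the table must be loaded before the steps below can train a model.
CREATE OR REPLACE TABLE `netflix.cleaned_features` (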
    user_id STRING,
    region STRING,
    plan_tier STRING,
    age_band STRING,
    avg_rating FLOAT64,
    total_minutes FLOAT64,
    avg_progress FLOAT64,
    num_sessions INT64,
    churn_label BOOL
);


In [22]:
%%bigquery --project $project_id
-- ✅ Prepare base churn features
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.cleaned_features`
WHERE churn_label IS NOT NULL;


In [23]:
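%%bigquery --project $project_id
-- ✅ Train base logistic regression model
-- input_label_cols names the target; without it BigQuery ML looks for a column called `label`.
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS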
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.churn_features`;

Executing query with job ID: 232facdc-a332-4e72-949d-41f452ae30ed
Query executing: 0.48s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt467-71800/queries/232facdc-a332-4e72-949d-41f452ae30ed?maxResults=0&location=US&prettyPrint=false: Missing 'label' column in query statement. Update OPTIONS(input_label_cols=['your_label_col']) to indicate the correct label column name.

Location: US
Job ID: 232facdc-a332-4e72-949d-41f452ae30ed



In [24]:
%%bigquery --project $project_id
-- ✅ Evaluate base model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

Executing query with job ID: 1999977f-6064-45be-aad6-30a752fb296c
Query executing: 0.30s


ERROR:
 404 Not found: Model mgmt467-71800:netflix.churn_model; reason: notFound, message: Not found: Model mgmt467-71800:netflix.churn_model

Location: US
Job ID: 1999977f-6064-45be-aad6-30a752fb296c



In [25]:
%%bigquery --project $project_id
-- ✅ Predict churn with base model
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model`,
                (SELECT * FROM `netflix.churn_features`));

Executing query with job ID: c1e49689-12b8-49ed-837b-08f7a534dcd3
Query executing: 0.38s


ERROR:
 404 Not found: Dataset mgmt467-71800:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-71800:your_dataset was not found in location US

Location: US
Job ID: c1e49689-12b8-49ed-837b-08f7a534dcd3
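
Once the prediction query succeeds, `predicted_churn_label_probs` comes back as an array of `label`/`prob` pairs, one per class. The sketch below shows one way to pull out the churn probability and apply a decision threshold on the Python side. The rows are made up for illustration, the helper names `churn_probability` and `flag_at_risk` are hypothetical, and representing the positive class as `True` is an assumption based on `churn_label` being a BOOL column.

```python
from typing import Any, Dict, List, Optional


def churn_probability(probs: List[Dict[str, Any]],
                      positive_label: Any = True) -> Optional[float]:
    """Return the probability attached to positive_label, or None if absent."""
    for entry in probs:
        if entry["label"] == positive_label:
            return float(entry["prob"])
    return None


def flag_at_risk(rows: List[Dict[str, Any]], threshold: float = 0.5) -> List[str]:
    """Return user_ids whose churn probability is at or above threshold."""
    at_risk = []
    for row in rows:
        p = churn_probability(row["predicted_churn_label_probs"])
        if p is not None and p >= threshold:
            at_risk.append(row["user_id"])
    return at_risk


# Hypothetical rows, invented for illustration (not output from this notebook).
example_rows = [
    {"user_id": "u1",
     "predicted_churn_label_probs": [{"label": True, "prob": 0.72},
                                     {"label": False, "prob": 0.28}]},
    {"user_id": "u2",
     "predicted_churn_label_probs": [{"label": True, "prob": 0.35},
                                     {"label": False, "prob": 0.65}]},
]

print(flag_at_risk(example_rows, threshold=0.6))
```

Raising the threshold trades recall for precision, so it is worth choosing it from the evaluation metrics rather than defaulting to 0.5.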




## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags
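
Before running the SQL, it can help to check the three transformations on a single row in plain Python. The sketch below mirrors the thresholds used in the enhanced-features query (100 and 300 minutes for the buckets, 500 minutes for the binge flag). The function names and the sample row are hypothetical, and the `None` handling reflects how SQL treats NULL: comparisons with NULL are not true, so a NULL total falls through to the `ELSE` branch, and `CONCAT` returns NULL if any argument is NULL.

```python
from typing import Any, Dict, Optional


def watch_time_bucket(total_minutes: Optional[float]) -> str:
    """Mirror the CASE expression: <100 low, 100..300 inclusive medium, else high."""
    if total_minutes is None:
        # NULL fails both WHEN conditions in SQL, so it lands in ELSE.
        return "high"
    if total_minutes < 100:
        return "low"
    if 100 <= total_minutes <= 300:
        return "medium"
    return "high"


def plan_region_combo(plan_tier: Optional[str], region: Optional[str]) -> Optional[str]:
    """Mirror CONCAT(plan_tier, '_', region), which is NULL if either input is NULL."""
    if plan_tier is None or region is None:
        return None
    return f"{plan_tier}_{region}"


def flag_binge(total_minutes: Optional[float]) -> int:
    """Mirror IF(total_minutes > 500, 1, 0); a NULL total yields 0."""
    return 1 if total_minutes is not None and total_minutes > 500 else 0


def engineer_features(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of row with the three engineered columns added."""
    out = dict(row)
    out["watch_time_bucket"] = watch_time_bucket(row.get("total_minutes"))
    out["plan_region_combo"] = plan_region_combo(row.get("plan_tier"), row.get("region"))
    out["flag_binge"] = flag_binge(row.get("total_minutes"))
    return out


# Hypothetical row, invented for illustration.
print(engineer_features({"user_id": "u1", "plan_tier": "premium",
                         "region": "EU", "total_minutes": 250.0}))
```

One thing the sketch makes visible: users with a missing `total_minutes` end up in the `high` bucket, which may not be intended. An explicit `WHEN total_minutes IS NULL THEN 'unknown'` branch would be one way to separate them.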


In [26]:

%%bigquery --project $project_id
-- ✅ Create enhanced feature set
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `netflix.churn_features`;


Executing query with job ID: aba6e2be-df3e-474a-a3d5-346c32fbb0a2
Query executing: 0.35s


ERROR:
 404 Not found: Dataset mgmt467-71800:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-71800:your_dataset was not found in location US

Location: US
Job ID: aba6e2be-df3e-474a-a3d5-346c32fbb0a2



In [27]:

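%%bigquery --project $project_id
-- ✅ Train enhanced model
-- input_label_cols names the target; without it BigQuery ML looks for a column called `label`.
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS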
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `netflix.churn_features_enhanced`;


Executing query with job ID: b00b7e41-77bb-45db-9cfd-950cb80fc6ad
Query executing: 0.30s


ERROR:
 404 Not found: Dataset mgmt467-71800:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-71800:your_dataset was not found in location US

Location: US
Job ID: b00b7e41-77bb-45db-9cfd-950cb80fc6ad



In [28]:

%%bigquery --project $project_id
-- ✅ Evaluate enhanced model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);


Executing query with job ID: 8b645c7a-96a1-407e-b69d-1644a1735d3d
Query executing: 0.37s


ERROR:
 404 Not found: Dataset mgmt467-71800:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-71800:your_dataset was not found in location US

Location: US
Job ID: 8b645c7a-96a1-407e-b69d-1644a1735d3d



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.

The rationale behind these feature engineering choices:

1.  **Bucketing continuous values (e.g., watch time):**
    -   Logistic regression is linear in the log-odds, so a raw numeric feature can only push churn risk steadily up or down. Bucketing lets the model fit a separate effect per category, which can expose thresholds in watch time beyond which churn behavior changes.
    -   This approach can enhance model robustness to outliers and improve interpretability by providing discrete segments that are easier to analyze and communicate.

2.  **Interaction terms (e.g., `plan_region_combo`):**
    -   Interaction terms are crucial for modeling dependencies between features. They allow the model to account for situations where the effect of one variable on the target is contingent upon the value of another variable. For example, the efficacy of a specific plan tier in retaining users might vary considerably across different geographic regions due to localized factors.
    -   Ignoring such interactions can lead to an oversimplified model that fails to capture the nuances of user behavior.

3.  **Binary flags (e.g., `flag_binge`):**
    -   Binary flags isolate a specific behavior so the model can assign it its own weight. As defined here, `flag_binge` marks users with more than 500 total minutes; it is a threshold on the same total, so it captures heavy usage rather than how that usage is spread across sessions. Because the enhanced model uses `watch_time_bucket` in place of `total_minutes`, the flag is the only feature that separates these users from the rest of the "high" bucket.
    -   These flags can provide valuable insights into specific user segments and their propensity for churn.

4.  **Upon evaluating the enhanced model:**
    -   Compare the `ML.EVALUATE` metrics (e.g., `roc_auc`, `precision`, `recall`) of the enhanced model against the baseline. A consistent gain on held-out data suggests the new features carry signal; a small change may just be noise.
    -   Model coefficients show the direction and magnitude of each feature's influence on the churn log-odds.
    -   Note which features prove most influential and which results contradict your initial hypotheses; both point to the underlying drivers of churn.
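
The comparison in point 4 can be sketched in a few lines of Python once both evaluation queries return results. The metric values and the weight below are placeholders invented for illustration, not results from this lab, and the helper names `metric_deltas` and `odds_ratio` are hypothetical. The odds-ratio helper relies on the standard logistic regression identity: a coefficient `w` multiplies the odds of churn by `exp(w)` for a one-unit increase in that feature, holding the others fixed.

```python
import math
from typing import Dict


def metric_deltas(baseline: Dict[str, float], enhanced: Dict[str, float]) -> Dict[str, float]:
    """Return enhanced minus baseline for every metric present in both dicts."""
    return {name: enhanced[name] - baseline[name]
            for name in baseline if name in enhanced}


def odds_ratio(weight: float) -> float:
    """Convert a logistic regression coefficient into a multiplicative change in odds."""
    return math.exp(weight)


# Placeholder metrics, invented for illustration (not results from this lab).
baseline_metrics = {"roc_auc": 0.70, "precision": 0.60, "recall": 0.50}
enhanced_metrics = {"roc_auc": 0.75, "precision": 0.62, "recall": 0.48}

for name, delta in metric_deltas(baseline_metrics, enhanced_metrics).items():
    print(f"{name}: {delta:+.3f}")

# Placeholder weight: a positive coefficient raises the odds of churn, a negative one lowers them.
print(f"odds ratio for weight 0.4: {odds_ratio(0.4):.2f}")
```

For `roc_auc`, `precision`, and `recall` a positive delta is an improvement; for a loss metric such as `log_loss` the sign reads the other way.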