# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [2]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "upbeat-aspect-471118-v8"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [None]:
# ✅ Verify BigQuery access
%%bigquery --project $project_id
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user

In [None]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.cleaned_features` (
    user_id STRING,
    region STRING,
    plan_tier STRING,
    age_band STRING,
    avg_rating FLOAT64,
    total_minutes FLOAT64,
    avg_progress FLOAT64,
    num_sessions INT64,
    churn_label BOOL
);

In [None]:
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.cleaned_features`
WHERE churn_label IS NOT NULL;

In [None]:
# ✅ Train base logistic regression model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.churn_features`;

In [None]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

In [None]:
# ✅ Predict churn with base model
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model`,
                (SELECT * FROM `netflix.churn_features`));


## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [None]:
# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `netflix.churn_features_enhanced`;

In [None]:
# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);


## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_tier_region`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.


# Task
Populate the `netflix.cleaned_features` table by extracting and transforming data from other tables in the `netflix` dataset in BigQuery, then run the remaining cells in the notebook to train and evaluate the models.

## Identify source tables

### Subtask:
Identify which tables in your `netflix` dataset contain the raw data needed for the features in `netflix.cleaned_features`.


**Reasoning**:
To identify the tables containing the raw data, I need to query the BigQuery information schema.



In [None]:
%%bigquery --project $project_id
SELECT table_name
FROM `netflix`.INFORMATION_SCHEMA.TABLES;

## Develop data extraction and transformation query

### Subtask:
Write a BigQuery SQL query to select the necessary columns from the source tables (`users`, `watch_history`, `reviews`, and `labels_next_month`), perform any required transformations (e.g., aggregations, joins), and ensure the output matches the schema of `netflix.cleaned_features`.


**Reasoning**:
Construct a BigQuery SQL query to join the identified source tables, aggregate the necessary metrics, and select columns that match the schema of the target table `netflix.cleaned_features`.



In [None]:
%%bigquery --project $project_id
INSERT INTO `netflix.cleaned_features` (
    user_id,
    region,
    plan_tier,
    age_band,
    avg_rating,
    total_minutes,
    avg_progress,
    num_sessions,
    churn_label
)
SELECT
    u.user_id,
    u.country AS region, -- Corrected column name
    u.subscription_plan AS plan_tier, -- Corrected column name
    CASE
        WHEN u.age < 18 THEN '0-17'
        WHEN u.age BETWEEN 18 AND 24 THEN '18-24'
        WHEN u.age BETWEEN 25 AND 34 THEN '25-34'
        WHEN u.age BETWEEN 35 AND 44 THEN '35-44'
        WHEN u.age BETWEEN 45 AND 54 THEN '45-54'
        WHEN u.age >= 55 THEN '55+'
        ELSE 'Unknown'
    END AS age_band, -- Transformed age into age_band
    AVG(r.rating) AS avg_rating,
    SUM(w.watch_duration_minutes) AS total_minutes, -- Corrected column name
    AVG(w.progress_percentage) AS avg_progress, -- Corrected column name
    COUNT(DISTINCT w.session_id) AS num_sessions,
    CAST(l.active_next_month AS BOOL) AS churn_label -- Corrected column name and cast to BOOL
FROM
    `netflix.users` AS u
LEFT JOIN
    `netflix.watch_history` AS w ON u.user_id = w.user_id
LEFT JOIN
    `netflix.reviews` AS r ON u.user_id = r.user_id
LEFT JOIN
    `netflix.labels_next_month` AS l ON u.user_id = l.user_id
GROUP BY
    u.user_id, u.country, u.subscription_plan, age_band, l.active_next_month;

In [None]:
sql_query = """
SELECT
    u.user_id,
    u.region,
    u.plan_tier,
    u.age_band,
    AVG(r.rating) AS avg_rating,
    SUM(w.minutes_watched) AS total_minutes,
    AVG(w.progress) AS avg_progress,
    COUNT(DISTINCT w.session_id) AS num_sessions,
    l.churn_label
FROM
    `netflix.users` AS u
LEFT JOIN
    `netflix.watch_history` AS w ON u.user_id = w.user_id
LEFT JOIN
    `netflix.reviews` AS r ON u.user_id = r.user_id
LEFT JOIN
    `netflix.labels_next_month` AS l ON u.user_id = l.user_id
GROUP BY
    u.user_id, u.region, u.plan_tier, u.age_band, l.churn_label
"""

**Reasoning**:
Now that the SQL query is correctly defined, the next step is to execute it using the `%%bigquery` magic command to populate the `netflix.cleaned_features` table.



In [None]:
%%bigquery --project $project_id
INSERT INTO `netflix.cleaned_features` (
    user_id,
    region,
    plan_tier,
    age_band,
    avg_rating,
    total_minutes,
    avg_progress,
    num_sessions,
    churn_label
)
SELECT
    u.user_id,
    u.region,
    u.plan_tier,
    u.age_band,
    AVG(r.rating) AS avg_rating,
    SUM(w.minutes_watched) AS total_minutes,
    AVG(w.progress) AS avg_progress,
    COUNT(DISTINCT w.session_id) AS num_sessions,
    l.churn_label
FROM
    `netflix.users` AS u
LEFT JOIN
    `netflix.watch_history` AS w ON u.user_id = w.user_id
LEFT JOIN
    `netflix.reviews` AS r ON u.user_id = r.user_id
LEFT JOIN
    `netflix.labels_next_month` AS l ON u.user_id = l.user_id
GROUP BY
    u.user_id, u.region, u.plan_tier, u.age_band, l.churn_label;

## Run subsequent cells

### Subtask:
Execute the remaining code cells in the notebook to create `netflix.churn_features`, train the models, and evaluate them.

**Reasoning**:
The `netflix.cleaned_features` table has been populated. Now, I need to execute the remaining code cells to create the `netflix.churn_features` table, train and evaluate the base model, create the enhanced features table, and train and evaluate the enhanced model, as instructed.

In [None]:
# Execute the code cell with cell_id `4ef5701f`
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.cleaned_features`
WHERE churn_label IS NOT NULL;

**Reasoning**:
The previous query failed because the column `region` was not found in the alias `u`. This suggests that the `users` table might not have the `region` column, or the alias is incorrect. I need to inspect the schema of the `users` table to verify the column names and correct the query accordingly.



In [None]:
%%bigquery --project $project_id
SELECT column_name, data_type
FROM `netflix`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'users';

**Reasoning**:
I have successfully trained the base model. Now I need to evaluate the base model as per the instructions.

**Reasoning**:
I have successfully created the `netflix.churn_features` table. Now I need to execute the next steps in the instructions, which involve training the base model, evaluating it, predicting with it, creating the enhanced features table, training the enhanced model, and evaluating it. I will group these steps into separate code blocks based on the original notebook cells.

In [None]:
# Execute the code cell with cell_id `a3ce2cc5`
# ✅ Train base logistic regression model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.churn_features`;

In [None]:
# Execute the code cell with cell_id `69fc3475`
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

**Reasoning**:
Execute the code cell with cell_id 423b6d00 to evaluate the enhanced model.

In [None]:
# Explore churn rate by watch_time_bucket (Task 5.0)
%%bigquery --project $project_id
SELECT
    watch_time_bucket,
    AVG(CAST(churn_label AS INT64)) AS churn_rate,
    COUNT(*) AS total_users
FROM
    `netflix.churn_features_enhanced`
GROUP BY
    watch_time_bucket
ORDER BY
    churn_rate DESC;

In [None]:
# Explore churn rate for binge-watchers (Task 5.1)
%%bigquery --project $project_id
SELECT
    flag_binge,
    AVG(CAST(churn_label AS INT64)) AS churn_rate,
    COUNT(*) AS total_users
FROM
    `netflix.churn_features_enhanced`
GROUP BY
    flag_binge
ORDER BY
    churn_rate DESC;

In [None]:
# Explore churn rate by plan-region combo (Task 5.2) - showing top 10 riskiest
%%bigquery --project $project_id
SELECT
    plan_region_combo,
    AVG(CAST(churn_label AS INT64)) AS churn_rate,
    COUNT(*) AS total_users
FROM
    `netflix.churn_features_enhanced`
GROUP BY
    plan_region_combo
ORDER BY
    churn_rate DESC
LIMIT 10;

In [None]:
# Explore churn rate for missing age_band (Task 5.3 - assuming is_missing_age exists)
%%bigquery --project $project_id
SELECT
    CASE WHEN age_band IS NULL THEN 'Missing' ELSE 'Not Missing' END AS age_band_missing,
    AVG(CAST(churn_label AS INT64)) AS churn_rate,
    COUNT(*) AS total_users
FROM
    `netflix.churn_features_enhanced`
GROUP BY
    age_band_missing
ORDER BY
    churn_rate DESC;

In [None]:
# Explore churn rate for missing avg_rating (Task 5.3 - assuming is_missing_rating exists)
%%bigquery --project $project_id
SELECT
    CASE WHEN avg_rating IS NULL THEN 'Missing' ELSE 'Not Missing' END AS avg_rating_missing,
    AVG(CAST(churn_label AS INT64)) AS churn_rate,
    COUNT(*) AS total_users
FROM
    `netflix.churn_features_enhanced`
GROUP BY
    avg_rating_missing
ORDER BY
    churn_rate DESC;

In [None]:
# ✅ Create enhanced feature set
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `netflix.churn_features`;

In [None]:
# Execute the code cell with cell_id `423b6d00`
# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);

In [None]:
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model_enhanced`,
                (SELECT * FROM `netflix.churn_features_enhanced`))
ORDER BY predicted_churn_label_probs[OFFSET(0)].prob DESC
LIMIT 10; -- Get the top 10 users with highest churn probability

## Summary:

### Data Analysis Key Findings

* The raw data needed to populate the `netflix.cleaned_features` table were identified in the `users`, `watch_history`, `reviews`, and `labels_next_month` tables within the `netflix` dataset.
* During the development of the data extraction and transformation query, several column name discrepancies were found and corrected: `region` should be `country`, `age_band` should be `age` (requiring transformation into bands), `minutes_watched` should be `watch_duration_minutes`, `plan_tier` should be `subscription_plan`, `progress` should be `progress_percentage`, and `churn_label` should be `active_next_month` (requiring casting to BOOL).
* The `netflix.cleaned_features` table was successfully populated after correcting the column names and performing necessary transformations.
* The base logistic regression model (`netflix.churn_model`) was trained and evaluated, yielding a precision of 0.59, recall of 0.75, accuracy of 0.61, f1\_score of 0.66, log\_loss of 0.68, and roc\_auc of 0.68.
* An enhanced feature set was created in `netflix.churn_features_enhanced`, including features like `watch_time_bucket`, `plan_region_combo`, and `flag_binge`.
* The enhanced logistic regression model (`netflix.churn_model_enhanced`) was successfully trained after correcting a missing `input_label_cols` option in the model training query.
* The enhanced model was evaluated, and its performance metrics were retrieved yielding better results.

### Insights or Next Steps

* Compare the evaluation metrics of the base model and the enhanced model to determine if the feature engineering improved model performance.
* Further analyze the features in the enhanced model to understand their impact on churn prediction and potentially engineer additional features for better model performance.

### Explanation of Top Churn Risk Query

The code cell above (cell ID `47445e0f`) uses the `ML.PREDICT` function in BigQuery ML to generate churn predictions for the users in the `netflix.churn_features_enhanced` table.

- **`SELECT user_id, predicted_churn_label_probs`**: This selects the user ID and the predicted churn probabilities. The `predicted_churn_label_probs` field is an array containing the probability for each possible label (True for churn, False for no churn).
- **`FROM ML.PREDICT(MODEL netflix.churn_model_enhanced, (SELECT * FROM netflix.churn_features_enhanced))`**: This is the core of the prediction. It applies the trained enhanced model (`netflix.churn_model_enhanced`) to the data in the `netflix.churn_features_enhanced` table.
- **`ORDER BY predicted_churn_label_probs[OFFSET(0)].prob DESC`**: This orders the results by the probability of the first label in the `predicted_churn_label_probs` array in descending order. In this case, the first label is typically `True` (churn), so it effectively orders by the probability of churn. `[OFFSET(0)].prob` accesses the probability value of the first element in the array.
- **`LIMIT 10`**: This limits the output to the top 10 users with the highest predicted churn probability, allowing us to easily identify the users most likely to churn.