<a href="https://colab.research.google.com/github/louissiller/mgmt467-analytics-portfolio/blob/main/Unit2_Lab1_PromptPlusExamples_Colab_Kaggle_GCS_BQ_DQ_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [1]:
# Authenticates the Colab user to Google Cloud, allowing access to GCP services
from google.colab import auth
auth.authenticate_user()

import os
# Prompts the user to enter their Google Cloud Project ID
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# Sets the region for consistency and cost/latency considerations
REGION = "us-central1"  # keep consistent; change if instructed

# Exports the project ID as an environment variable for gcloud commands
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Sets the active project for gcloud/BigQuery CLI for subsequent commands
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt467-71800
Project: mgmt467-71800 | Region: us-central1
Updated property [core/project].
mgmt467-71800


### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [2]:
# Verify project and region
!gcloud config get-value project
print("REGION:", REGION)

mgmt467-71800
REGION: us-central1


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

Setting the project and region at the start keeps all your work in one place on Google Cloud, making it easier to manage costs and find your resources. If you don't, your resources might end up scattered, costing more and causing errors.

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [3]:
# Prompt the user to upload their kaggle.json file
# This file contains your Kaggle API credentials
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
# Create the .kaggle directory if it doesn't exist
os.makedirs('/root/.kaggle', exist_ok=True)

# Save the uploaded kaggle.json file to the correct location
# This makes the Kaggle CLI available
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set file permissions to owner-only read/write (0600) for security
# This protects your API key
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify the Kaggle installation by printing the version
# This confirms the CLI is ready to use
!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [4]:
# Verify Kaggle CLI is ready
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

Requiring strict 0600 permissions on API tokens means only the owner can read or write the file. This is crucial to prevent unauthorized access to your credentials, protecting you from potential misuse of your accounts and resources.

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

In [5]:
# Create the directory to store raw data
!mkdir -p /content/data/raw

# Download the dataset using the Kaggle CLI
# The dataset will be downloaded as a zip file to /content/data
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded dataset into the raw data directory
# -o flag means overwrite existing files
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This provides a clear inventory of the downloaded files
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 784MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [6]:
# Create the directory to store raw data if it doesn't exist
!mkdir -p /content/data/raw

# Download the dataset using the Kaggle CLI
# The dataset will be downloaded as a zip file to the /content/data directory
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded dataset into the raw data directory
# The -o flag ensures existing files are overwritten
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes in a human-readable format
# This provides a clear inventory of the downloaded and unzipped files
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
netflix-2025user-behavior-dataset-210k-records.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root root 1.6M Aug  2 1

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [7]:
import glob
import os

# List all CSV files in the raw data directory
csv_files = glob.glob('/content/data/raw/*.csv')

# Assert that there are exactly six CSV files
expected_files = 6
actual_files = len(csv_files)
assert actual_files == expected_files, f"Expected {expected_files} CSV files, but found {actual_files}"

# Print the names of the CSV files
print("Successfully found 6 CSV files:")
for csv_file in csv_files:
    print(os.path.basename(csv_file))

Successfully found 6 CSV files:
movies.csv
watch_history.csv
search_logs.csv
users.csv
recommendation_logs.csv
reviews.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

Keeping a clean file inventory helps you quickly see exactly what data files you have and their sizes, which is crucial for the next steps like uploading to cloud storage or loading into a database. It acts as a quick audit trail to confirm your download was successful and complete before moving on.

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [8]:
import uuid
import os

# Generate a unique bucket name with a random suffix
bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

# Create the GCS bucket in the specified location (using 'US' multi-region for BigQuery compatibility)
print(f"Creating bucket: {bucket_name} in US multi-region")
!gcloud storage buckets create gs://$BUCKET_NAME --location=US

# Upload all CSV files from the raw data directory to the bucket
print(f"Uploading CSV files to gs://{bucket_name}/netflix/")
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

# Print the bucket name and explain the benefits of staging
print("\nBucket:", bucket_name)
print("\nBenefits of staging in GCS:")
print("- Persistent and versionable storage for your data.")
print("- Easily accessible by various Google Cloud services like BigQuery.")
print("- Decouples data storage from compute, improving flexibility and scalability.")
print("- Can serve as a central data lake for different analytics and ML tasks.")

# Verify contents (optional)
!gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating bucket: mgmt467-netflix-3c5e4a91 in US multi-region
Creating gs://mgmt467-netflix-3c5e4a91/...
Uploading CSV files to gs://mgmt467-netflix-3c5e4a91/netflix/
Copying file:///content/data/raw/movies.csv to gs://mgmt467-netflix-3c5e4a91/netflix/movies.csv
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-netflix-3c5e4a91/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-netflix-3c5e4a91/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-netflix-3c5e4a91/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-netflix-3c5e4a91/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-netflix-3c5e4a91/netflix/watch_history.csv

Average throughput: 46.3MiB/s

Bucket: mgmt467-netflix-3c5e4a91

Benefits of staging in GCS:
- Persistent and versionable storage for your data.
- Easily accessible by various Google Cloud services like BigQ

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [9]:
# List objects in the netflix/ prefix with sizes
!gcloud storage ls --readable-sizes --recursive gs://$BUCKET_NAME/netflix/

gs://mgmt467-netflix-3c5e4a91/netflix/:
gs://mgmt467-netflix-3c5e4a91/netflix/movies.csv
gs://mgmt467-netflix-3c5e4a91/netflix/recommendation_logs.csv
gs://mgmt467-netflix-3c5e4a91/netflix/reviews.csv
gs://mgmt467-netflix-3c5e4a91/netflix/search_logs.csv
gs://mgmt467-netflix-3c5e4a91/netflix/users.csv
gs://mgmt467-netflix-3c5e4a91/netflix/watch_history.csv


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

1. Persistence and Accessibility: Data in GCS persists beyond the Colab session and is easily accessible by other Google Cloud services (like BigQuery, Dataproc, AI Platform), unlike data stored only locally in Colab which is temporary.

2. Scalability and Efficiency: GCS is designed for large-scale data. Loading from GCS to services like BigQuery is generally faster and more efficient than uploading directly from a local machine or Colab environment, especially for large datasets.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [10]:
# Cell A: Create BigQuery dataset (idempotently)
# Set the dataset name
DATASET = "netflix"

# Attempt to create the dataset in the US multi-region.
# The '|| echo "Dataset may already exist."' part makes it idempotent,
# printing a message instead of failing if the dataset already exists.
print(f"Attempting to create BigQuery dataset: {DATASET} in US multi-region")
!bq --location=US mk -d --description "MGMT467 Netflix dataset" {DATASET} || echo "Dataset may already exist."

Attempting to create BigQuery dataset: netflix in US multi-region
BigQuery error in mk operation: Dataset 'mgmt467-71800:netflix' already exists.
Dataset may already exist.


In [11]:
# Cell B: Load tables from GCS and get row counts

# Define the tables and their corresponding filenames in GCS
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

if not bucket_name:
    print("Error: BUCKET_NAME environment variable not set. Please run the GCS upload step first.")
else:
    # Load each table from the corresponding CSV file in GCS
    for tbl, fname in tables.items():
        src = f"gs://{bucket_name}/netflix/{fname}"
        print(f"\nLoading table: {tbl} from {src}")
        # Use bq load command with specified options
        !bq load --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}


Loading table: users from gs://mgmt467-netflix-3c5e4a91/netflix/users.csv
Waiting on bqjob_r15b5bbdf5c3d053d_0000019a228386ab_1 ... (1s) Current status: DONE   

Loading table: movies from gs://mgmt467-netflix-3c5e4a91/netflix/movies.csv
Waiting on bqjob_r69bb3319dd2ab81f_0000019a22839f43_1 ... (1s) Current status: DONE   

Loading table: watch_history from gs://mgmt467-netflix-3c5e4a91/netflix/watch_history.csv
Waiting on bqjob_r4491e6955d3acdc4_0000019a2283b594_1 ... (2s) Current status: DONE   

Loading table: recommendation_logs from gs://mgmt467-netflix-3c5e4a91/netflix/recommendation_logs.csv
Waiting on bqjob_r71bcf5ed6cb2fa4b_0000019a2283d197_1 ... (2s) Current status: DONE   

Loading table: search_logs from gs://mgmt467-netflix-3c5e4a91/netflix/search_logs.csv
Waiting on bqjob_r6c1b86991417dc2_0000019a2283ec34_1 ... (1s) Current status: DONE   

Loading table: reviews from gs://mgmt467-netflix-3c5e4a91/netflix/reviews.csv
Waiting on bqjob_r7d5efb63f68bd64_0000019a228403e0_1 .

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

`Autodetect` is fine for initial data exploration or when dealing with simple, consistent file formats for quick loading. However, you should enforce explicit schemas for production pipelines or critical data because it guarantees data types, prevents unexpected errors from schema changes in source files, and provides better control and documentation of your data structure.

In [13]:
%%bigquery --verbose

SELECT 'users' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.users`
UNION ALL
SELECT 'movies' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.movies`
UNION ALL
SELECT 'watch_history' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.watch_history`
UNION ALL
SELECT 'recommendation_logs' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.recommendation_logs`
UNION ALL
SELECT 'search_logs' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.search_logs`
UNION ALL
SELECT 'reviews' AS table_name, COUNT(*) AS row_count FROM `mgmt467-71800.netflix.reviews`

Executing query with job ID: 5dc2ed49-91a3-4961-8aa6-2e39b10fedaf
Query executing: 0.80s
Job ID 5dc2ed49-91a3-4961-8aa6-2e39b10fedaf successfully executed


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,table_name,row_count
0,users,133900
1,watch_history,1365000
2,reviews,200850
3,search_logs,344500
4,movies,13520
5,recommendation_logs,676000


## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [15]:
%%bigquery

-- Total rows and % missing in region, plan_tier, age_band from users
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(state_province IS NULL) miss_province,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `mgmt467-71800.netflix.users`
)
SELECT n,
       ROUND(100*miss_province/n,2) AS pct_missing_province,
       ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
       ROUND(100*miss_age/n,2)    AS pct_missing_age_band
FROM base;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,n,pct_missing_province,pct_missing_plan_tier,pct_missing_age_band
0,133900,0.0,0.0,11.93


In [16]:
%%bigquery
-- % plan_tier missing by region (potential MAR - Missing At Random)
SELECT state_province,
       COUNT(*) AS n,
       ROUND(100*COUNTIF(subscription_plan IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
FROM `mgmt467-71800.netflix.users`
GROUP BY state_province
ORDER BY pct_missing_plan_tier DESC;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,state_province,n,pct_missing_plan_tier
0,Georgia,4667,0.0
1,Massachusetts,4485,0.0
2,Prince Edward Island,4082,0.0
3,Tennessee,5109,0.0
4,Virginia,4641,0.0
5,Wisconsin,4667,0.0
6,Alberta,4030,0.0
7,Arizona,4693,0.0
8,British Columbia,4017,0.0
9,Manitoba,3887,0.0


In [17]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [18]:
%%bigquery

-- Query to display the calculated missingness percentages
WITH missingness_data AS (
    -- This CTE replicates the calculation from the previous cell (f4ff35b7)
    WITH base AS (
        SELECT COUNT(*) n,
               COUNTIF(state_province IS NULL) miss_province, -- Assuming 'country' is the correct column
               COUNTIF(subscription_plan IS NULL) miss_plan,
               COUNTIF(age IS NULL) miss_age
        FROM `mgmt467-71800.netflix.users`
    )
    SELECT n,
           ROUND(100*miss_province/n,2) AS pct_missing_province,
           ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
           ROUND(100*miss_age/n,2)    AS pct_missing_age_band
    FROM base
)
SELECT pct_missing_province, pct_missing_plan_tier, pct_missing_age_band
FROM missingness_data;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,pct_missing_province,pct_missing_plan_tier,pct_missing_age_band
0,0.0,0.0,11.93


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

Based on the output from the missingness query (cell `5ddedc90`), the `age_band` column has the highest percentage of missing values (11.93%). `country` and `plan_tier` have no missing values.

It's difficult to definitively say if the missingness is MCAR, MAR, or MNAR without more information about how the data was collected. However, if the missing `age_band` values are somehow related to the user's subscription plan or region (which are not missing), it could be MAR. If the missingness is related to the age itself (e.g., very young or very old users are less likely to provide age), it could be MNAR. If there's no discernible pattern, it's likely MCAR.

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, action, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [19]:
%%bigquery

-- Report duplicate groups on (user_id, movie_id, event_ts, device_type) with counts (top 20)
SELECT user_id, movie_id, action, device_type, COUNT(*) AS dup_count
FROM `mgmt467-71800.netflix.watch_history`
GROUP BY user_id, movie_id, action, device_type
HAVING COUNT(*) > 1
ORDER BY dup_count DESC
LIMIT 20;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,user_id,movie_id,action,device_type,dup_count
0,user_03310,movie_0640,stopped,Smart TV,52
1,user_00391,movie_0893,stopped,Laptop,52
2,user_06417,movie_0590,completed,Laptop,39
3,user_03348,movie_0688,paused,Desktop,39
4,user_01870,movie_0844,stopped,Laptop,39
5,user_00965,movie_0991,started,Desktop,39
6,user_00472,movie_0719,started,Laptop,39
7,user_06085,movie_0346,stopped,Desktop,39
8,user_06295,movie_0097,stopped,Desktop,39
9,user_03898,movie_0500,stopped,Desktop,39


In [20]:
%%bigquery

-- Create table watch_history_dedup that keeps one row per group (prefer higher progress_ratio, then minutes_watched)
CREATE OR REPLACE TABLE `mgmt467-71800.netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk) FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, movie_id, action, device_type
           ORDER BY progress_percentage DESC, watch_duration_minutes DESC
         ) AS rk
  FROM `mgmt467-71800.netflix.watch_history` h
)
WHERE rk = 1;

Query is running:   0%|          |

**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

Duplicates can pop up naturally from user behavior, like accidentally clicking something twice, or be system-generated due to errors in data pipelines or logging multiple events for a single action. These duplicates are bad news because they inflate counts, leading to inaccurate labels for training models (e.g., overstating engagement) and messing up key performance indicators (KPIs) like average watch time or conversion rates, making business decisions unreliable.

### 5.3 Outliers (minutes_watched) — What & Why
**Estimate extreme values via IQR; report % outliers; winsorize to P01/P99 for robustness while also flagging extremes.**

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [24]:
%%bigquery

-- Compute IQR bounds for minutes_watched on watch_history_dedup and report % outliers
WITH dist AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1, -- Changed column name
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3  -- Changed column name
  FROM `mgmt467-71800.netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers, -- Changed column name
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers -- Changed column name
FROM `mgmt467-71800.netflix.watch_history_dedup` h
CROSS JOIN bounds b;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,outliers,total,pct_outliers
0,3454,99958,3.46


In [25]:
%%bigquery

-- Create watch_history_robust with minutes_watched_capped capped at P01/P99; return quantile summaries before/after.
CREATE OR REPLACE TABLE `mgmt467-71800.netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01, -- Changed column name
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(98)] AS p99 -- Changed column name
  FROM `mgmt467-71800.netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS watch_duration_minutes_capped -- Changed column name
FROM `mgmt467-71800.netflix.watch_history_dedup` h, q;

-- Quantiles before vs after
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q -- Changed column name
  FROM `mgmt467-71800.netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(watch_duration_minutes_capped, 5) AS q -- Changed column name
  FROM `mgmt467-71800.netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,which,q
0,before,"[0.2, 24.8, 41.7, 61.3, 91.7, 799.3]"
1,after,"[4.4, 24.7, 41.6, 61.7, 92.2, 203.6]"


### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [26]:
%%bigquery

-- Show min/median/max before vs after capping
WITH before AS (
  SELECT
    'before_capping' AS source,
    MIN(watch_duration_minutes) AS min_duration,
    APPROX_QUANTILES(watch_duration_minutes, 2)[OFFSET(1)] AS median_duration,
    MAX(watch_duration_minutes) AS max_duration
  FROM `mgmt467-71800.netflix.watch_history_dedup`
),
after AS (
  SELECT
    'after_capping' AS source,
    MIN(watch_duration_minutes_capped) AS min_duration,
    APPROX_QUANTILES(watch_duration_minutes_capped, 2)[OFFSET(1)] AS median_duration,
    MAX(watch_duration_minutes_capped) AS max_duration
  FROM `mgmt467-71800.netflix.watch_history_robust`
)
SELECT * FROM before
UNION ALL
SELECT * FROM after;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,source,min_duration,median_duration,max_duration
0,after_capping,4.4,50.9,203.6
1,before_capping,0.2,50.9,799.3


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

Capping outliers can be harmful if the extreme values are legitimate and important for the analysis or model, as it can distort the true distribution and relationships in the data. Tree-based models like Decision Trees or Random Forests are generally less sensitive to outliers because they split data based on thresholds rather than relying on distance metrics that can be heavily influenced by extreme values.

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [27]:
%%bigquery

-- In watch_history_robust, compute and summarize flag_binge for sessions > 8 hours (480 minutes)
SELECT
  COUNTIF(watch_duration_minutes_capped > 8*60) AS sessions_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(watch_duration_minutes_capped > 8*60)/COUNT(*),2) AS pct_flag_binge
FROM `mgmt467-71800.netflix.watch_history_robust`;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sessions_over_8h,total,pct_flag_binge
0,0,99958,0.0


In [28]:
%%bigquery

-- In users, compute and summarize flag_age_extreme if age is <10 or >100.
-- Assuming age is directly available as a numeric column after autodetect or schema enforcement.
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_rows,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_flag_age_extreme
FROM `mgmt467-71800.netflix.users`;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,extreme_age_rows,total,pct_flag_age_extreme
0,2327,133900,1.74


In [29]:
%%bigquery

-- In movies, compute and summarize flag_duration_anomaly where duration_min < 15 or > 480 (8 hours).
-- Assuming duration is available as 'duration_minutes'.
SELECT
  COUNTIF(duration_minutes < 15) AS titles_under_15m,
  COUNTIF(duration_minutes > 8*60) AS titles_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 8*60)/COUNT(*), 2) AS pct_flag_duration_anomaly
FROM `mgmt467-71800.netflix.movies`;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,titles_under_15m,titles_over_8h,total,pct_flag_duration_anomaly
0,156,143,13520,2.21


### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [30]:
%%bigquery

-- Single compact summary query for anomaly flags
SELECT 'flag_binge' AS flag_name,
       ROUND(100*COUNTIF(watch_duration_minutes_capped > 8*60)/(SELECT COUNT(*) FROM `mgmt467-71800.netflix.watch_history_robust`), 2) AS pct_of_rows
FROM `mgmt467-71800.netflix.watch_history_robust`
UNION ALL
SELECT 'flag_age_extreme' AS flag_name,
       ROUND(100*COUNTIF(age < 10 OR age > 100)/(SELECT COUNT(*) FROM `mgmt467-71800.netflix.users`), 2) AS pct_of_rows
FROM `mgmt467-71800.netflix.users`
UNION ALL
SELECT 'flag_duration_anomaly' AS flag_name,
       ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 8*60)/(SELECT COUNT(*) FROM `mgmt467-71800.netflix.movies`), 2) AS pct_of_rows
FROM `mgmt467-71800.netflix.movies`;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,flag_name,pct_of_rows
0,flag_binge,0.0
1,flag_age_extreme,1.74
2,flag_duration_anomaly,2.21


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

Based on the summary query results, the `flag_duration_anomaly` is the most common flag at 2.21%. I would keep `flag_age_extreme` (1.74%) as a feature because extreme ages could be highly indicative of specific user behaviors or potential data entry issues that are valuable to model. While `flag_duration_anomaly` is slightly more frequent, age is often a more fundamental user characteristic for personalization models.

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
