# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [None]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

In [None]:
from google.colab import auth
auth.authenticate_user()

import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
# This ensures subsequent gcloud/bq commands use this project.
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
# Done: Auth + Project/Region set

Enter your GCP Project ID: my-project-mgmt-467
Project: my-project-mgmt-467 | Region: us-central1
Updated property [core/project].
my-project-mgmt-467


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

**Reflection:** Setting `PROJECT_ID` and `REGION` at the beginning is crucial for consistency and avoiding errors when interacting with Google Cloud services like GCS and BigQuery.

If we don't set them explicitly, subsequent `gcloud`, `gsutil`, or `bq` commands might use default values or rely on the environment setup in unpredictable ways. This can lead to:
- **Resource creation in the wrong project or region:** Incurring costs or violating compliance requirements.
- **Difficulty finding resources:** If resources are scattered across different projects or regions.
- **Inconsistent behavior:** Depending on where the commands are run or how the environment is configured.
- **Errors:** Commands might fail if the project or region is not set or is incorrect.

Setting them upfront ensures all operations are performed within the intended scope, making the pipeline more reliable and auditable.
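
One way to surface these failure modes early is to validate the settings before any cloud call is made. The sketch below is a minimal, hypothetical helper (`check_gcp_settings` is not part of any Google library). The project-ID pattern follows the documented format (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen). The region pattern is only a loose shape test and is an assumption, not a lookup against the real region list.

```python
import re

# Documented project-ID shape: 6-30 chars, lowercase letters/digits/hyphens,
# starts with a letter, does not end with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")
# Loose shape check only (e.g. "us-central1"); NOT a list of real regions.
_REGION_RE = re.compile(r"[a-z]+-[a-z]+[0-9]+")


def check_gcp_settings(project_id: str, region: str) -> None:
    """Raise ValueError early instead of failing later inside gcloud/bq."""
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"PROJECT_ID looks wrong: {project_id!r}")
    if not _REGION_RE.fullmatch(region):
        raise ValueError(f"REGION looks wrong: {region!r}")


# Passes silently for the values used in this notebook.
check_gcp_settings("my-project-mgmt-467", "us-central1")
```

Calling it right after the `input()` prompt would turn an empty or mistyped project ID into an immediate, readable error.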

### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [None]:
# Print the active project and echo the REGION set in section 0.
# REGION is a Python variable (it was never exported to the environment),
# so read it directly; os.environ.get("REGION") would return None.
!gcloud config get-value project
print("REGION:", REGION)

my-project-mgmt-467
REGION: us-central1


## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

In [None]:
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

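import os
kaggle_dir = os.path.expanduser('~/.kaggle')  # /root/.kaggle in Colab
os.makedirs(kaggle_dir, exist_ok=True)
token_path = os.path.join(kaggle_dir, 'kaggle.json')
with open(token_path, 'wb') as f:
    # Take the first uploaded file by value, so a renamed upload still works
    f.write(next(iter(uploaded.values())))
# Security: the token is a credential, so restrict it to owner read/write only
os.chmod(token_path, 0o600)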

!kaggle --version
# Done: Kaggle setup

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle (1).json to kaggle (1) (1).json
Kaggle API 1.7.4.5


### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [None]:
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

**Reflection:** Requiring strict `0600` permissions on API tokens (like `kaggle.json`) is a critical security measure. The `0600` permission means that only the owner of the file has read and write access, and no other user (including users in the same group or all other users) has any access.

The risks we are avoiding by setting these strict permissions include:
- **Unauthorized access and usage:** If other users or processes on the same machine could read your API token, they could use your Kaggle account to download data, make submissions, or access other resources linked to it. A Colab runtime is normally a single-user VM, so the exposure there is small, but the same file layout is used on shared servers and laptops, where it matters.
- **Credential compromise:** API tokens are essentially passwords that grant access to external services. Protecting them with strict permissions prevents them from being accidentally exposed or read by malicious scripts or users.
- **Data breaches:** Depending on the API the token is for, unauthorized access could lead to sensitive data being downloaded or exposed.

By limiting access to the token file to only the owner, we significantly reduce the attack surface and protect the associated account and data from unauthorized access.
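
The permission rule can also be checked in code. The sketch below assumes a POSIX system (as in Colab) and uses two hypothetical helpers, `write_owner_only` and `is_owner_only`; neither is part of the Kaggle CLI. It creates the file with mode `0600` from the start, rather than writing it with default permissions and tightening them afterwards.

```python
import os
import stat


def write_owner_only(path: str, data: bytes) -> None:
    """Create (or overwrite) a file that group/others can never read."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Passing 0o600 to os.open creates the file already restricted.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        # Also tightens a pre-existing file that had looser permissions.
        os.fchmod(f.fileno(), 0o600)
        f.write(data)


def is_owner_only(path: str) -> bool:
    """True if neither group nor others have any permission bit set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

A check such as `assert is_owner_only(os.path.expanduser("~/.kaggle/kaggle.json"))` could then sit at the end of the setup cell.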

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [None]:
# Create directory for raw data
!mkdir -p /content/data/raw

# Download the dataset using Kaggle CLI to /content/data
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded dataset into the raw data directory, overwriting if necessary
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes in a table format
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 765MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

In [None]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv


### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [None]:
import glob
csv_files = glob.glob('/content/data/raw/*.csv')
assert len(csv_files) == 6, f"Expected 6 CSV files, but found {len(csv_files)}"
print("CSV files found:", csv_files)

CSV files found: ['/content/data/raw/movies.csv', '/content/data/raw/watch_history.csv', '/content/data/raw/search_logs.csv', '/content/data/raw/users.csv', '/content/data/raw/recommendation_logs.csv', '/content/data/raw/reviews.csv']


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

**Reflection:** Keeping a clean file inventory with names and sizes is useful downstream for several reasons:

- **Auditing and reproducibility:** A clear inventory helps track exactly which files were used as input for subsequent steps (like loading into BigQuery). This is essential for auditing and ensuring that analyses or models can be reproduced with the same source data.
- **Troubleshooting:** If there are issues downstream (e.g., missing data, unexpected file sizes), the inventory provides a quick reference to verify the initial state of the raw files.
- **Automation and scripting:** Having a predictable and listed inventory makes it easier to automate downstream processes that need to read or process these files by name or size.
- **Data discovery:** For others using the pipeline or dataset, an inventory provides a clear overview of the available raw data files.
- **Detecting changes:** Comparing inventories over time can help detect unexpected changes or issues with the data source or download process.
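
An inventory of this kind can be captured as data rather than read off `ls` output. The sketch below is one possible shape, using a hypothetical `build_inventory` helper; adding a SHA-256 checksum is an assumption beyond the names and sizes the lab asks for, included because it makes content changes detectable even when the size is unchanged.

```python
import hashlib
from pathlib import Path


def build_inventory(folder: str, pattern: str = "*.csv") -> list[dict]:
    """Return one record per matching file, sorted by name for stable diffs."""
    records = []
    for path in sorted(Path(folder).glob(pattern)):
        data = path.read_bytes()  # fine for files of a few MB, as in this lab
        records.append({
            "name": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return records
```

Saving the result of `build_inventory("/content/data/raw")` next to the data gives later steps something concrete to compare against.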

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

In [None]:
import uuid, os
bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

# Create the GCS bucket in the US multi-region
!gcloud storage buckets create gs://$BUCKET_NAME --location=US

# Upload all CSVs to the specified GCS path
!gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/

print("Bucket:", bucket_name)
print("\nData staged in GCS for consistent, versionable source for BigQuery loads.")

# Verify contents
!gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating gs://mgmt467-netflix-1acc7cbb/...
Copying file:///content/data/raw/movies.csv to gs://mgmt467-netflix-1acc7cbb/netflix/movies.csv
Copying file:///content/data/raw/README.md to gs://mgmt467-netflix-1acc7cbb/netflix/README.md
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-netflix-1acc7cbb/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-netflix-1acc7cbb/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-netflix-1acc7cbb/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-netflix-1acc7cbb/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-netflix-1acc7cbb/netflix/watch_history.csv

Average throughput: 12.5MiB/s
Bucket: mgmt467-netflix-1acc7cbb

Data staged in GCS for consistent, versionable source for BigQuery loads.
gs://mgmt467-netflix-1acc7cbb/netflix/README.md
gs://mgmt467-netflix-1acc7cbb/netflix/movies.csv
g

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [None]:
# Verify contents by listing the netflix/ prefix.
# --long adds the size and timestamp columns; --readable-sizes formats sizes as KiB/MiB.
!gcloud storage ls --long --readable-sizes gs://$BUCKET_NAME/netflix/

gs://mgmt467-netflix-1acc7cbb/netflix/README.md
gs://mgmt467-netflix-1acc7cbb/netflix/movies.csv
gs://mgmt467-netflix-1acc7cbb/netflix/recommendation_logs.csv
gs://mgmt467-netflix-1acc7cbb/netflix/reviews.csv
gs://mgmt467-netflix-1acc7cbb/netflix/search_logs.csv
gs://mgmt467-netflix-1acc7cbb/netflix/users.csv
gs://mgmt467-netflix-1acc7cbb/netflix/watch_history.csv


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

**Reflection:** Two key benefits of staging data in Google Cloud Storage (GCS) versus loading directly from local Colab are:

1.  **Durability and Accessibility:** Data in GCS is highly durable and accessible from other Google Cloud services (such as BigQuery, Dataflow, and Vertex AI) and from external applications. Data on Colab's local disk is tied to the runtime, which is ephemeral: if the runtime restarts or the session ends, the local files are lost and must be downloaded again. GCS provides persistent storage.
2.  **Scalability and Performance for Cloud Services:** Loading data into BigQuery (or other cloud services) is significantly more scalable and often faster when the data source is in GCS. Cloud services are optimized to read data efficiently from cloud storage. Loading large datasets directly from a Colab instance's local disk can be slow and less reliable.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
DATASET="netflix"
# Attempt to create; ignore if exists
!bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

BigQuery error in mk operation: Dataset 'my-project-mgmt-467:netflix' already
exists.
Dataset may already exist.


In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --skip_leading_rows=1 --autodetect --source_format=CSV $DATASET.$tbl $src
# #
# # # Row counts
# # for tbl in tables.keys():
# #   !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM `${GOOGLE_CLOUD_PROJECT}.netflix.{tbl}`".format(tbl=tbl)

In [None]:
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}
import os
import subprocess

PROJECT_ID = os.environ['GOOGLE_CLOUD_PROJECT']
DATASET = "netflix"

for tbl, fname in tables.items():
  src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
  print(f"Loading {tbl} from {src}")
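  # --replace overwrites the table on every run; without it, bq load appends,
  # so re-running this cell would duplicate every row (not idempotent).
  !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}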

# Row counts
print("\nRow counts:")
for tbl in tables.keys():
  query = f"SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM `{PROJECT_ID}.{DATASET}.{tbl}`"
  print(f"Executing query for table: {tbl}")
  # Using subprocess to execute the bq query command
  process = subprocess.run(['bq', 'query', '--nouse_legacy_sql', query], capture_output=True, text=True)
  print(process.stdout)
  if process.stderr:
    print(f"Error for table {tbl}:")
    print(process.stderr)

Loading users from gs://mgmt467-netflix-1acc7cbb/netflix/users.csv
Waiting on bqjob_r6321c771afc91a8a_0000019a21df6cb4_1 ... (1s) Current status: DONE   
Loading movies from gs://mgmt467-netflix-1acc7cbb/netflix/movies.csv
Waiting on bqjob_r69170b402ec660e1_0000019a21df8dd6_1 ... (1s) Current status: DONE   
Loading watch_history from gs://mgmt467-netflix-1acc7cbb/netflix/watch_history.csv
Waiting on bqjob_r7a669297aa74d90d_0000019a21dfae5e_1 ... (4s) Current status: DONE   
Loading recommendation_logs from gs://mgmt467-netflix-1acc7cbb/netflix/recommendation_logs.csv
Waiting on bqjob_r11c5f4f25a3dd68d_0000019a21dfd83e_1 ... (1s) Current status: DONE   
Loading search_logs from gs://mgmt467-netflix-1acc7cbb/netflix/search_logs.csv
Waiting on bqjob_r636781598863e573_0000019a21dff919_1 ... (2s) Current status: DONE   
Loading reviews from gs://mgmt467-netflix-1acc7cbb/netflix/reviews.csv
Waiting on bqjob_r3f13c081331079bd_0000019a21e01da7_1 ... (1s) Current status: DONE   

Row counts:
E

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.


In [None]:
from google.cloud import bigquery
import os

project_id = os.environ.get('GOOGLE_CLOUD_PROJECT')
client = bigquery.Client(project=project_id)

query = """
SELECT 'users' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.users`
UNION ALL
SELECT 'movies' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.movies`
UNION ALL
SELECT 'watch_history' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.watch_history`
UNION ALL
SELECT 'recommendation_logs' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.recommendation_logs`
UNION ALL
SELECT 'search_logs' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.search_logs`
UNION ALL
SELECT 'reviews' AS table_name, COUNT(*) AS row_count FROM `{p}.netflix.reviews`
""".format(p=project_id)  # one named placeholder instead of six positional ones


query_job = client.query(query)
results = query_job.result()

for row in results:
    print(f"Table: {row.table_name}, Row Count: {row.row_count}")

Table: users, Row Count: 72100
Table: watch_history, Row Count: 735000
Table: reviews, Row Count: 108150
Table: movies, Row Count: 7280
Table: recommendation_logs, Row Count: 364000
Table: search_logs, Row Count: 185500
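
A quick sanity check on these counts: every one of them is an exact multiple of 7, and the total divided by 7 is 210,290, close to the "210k records" in the dataset name. That pattern would be consistent with the load cell having been run seven times in append mode (`bq load` appends to an existing table unless `--replace` is given), although the output alone does not prove it. The sketch below reproduces the arithmetic. `observed` copies the printed counts, the factor 7 is the hypothesis being tested, and `implied_single_load` is a hypothetical helper.

```python
# Row counts copied from the verification output above.
observed = {
    "users": 72100,
    "watch_history": 735000,
    "reviews": 108150,
    "movies": 7280,
    "recommendation_logs": 364000,
    "search_logs": 185500,
}


def implied_single_load(counts: dict, factor: int):
    """Per-table counts for one load, or None if any count is not divisible."""
    if any(n % factor for n in counts.values()):
        return None
    return {table: n // factor for table, n in counts.items()}


single = implied_single_load(observed, 7)
print(single)
print("total per load:", sum(single.values()))  # total per load: 210290
```

If the hypothesis holds, most of the duplicate groups found later in `watch_history` would be load artifacts rather than properties of the source data.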


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

**Reflection:** BigQuery's `autodetect` is acceptable for initial data exploration, or for well-structured, consistent sources whose schema is unlikely to change. It is a quick, convenient way to get data loaded.

However, you should enforce explicit schemas when:
- **Data consistency is critical:** Autodetect might infer incorrect data types, leading to data corruption or errors during queries. Explicit schemas ensure data conforms to expected types.
- **Schema evolution needs control:** When schemas change over time, explicit schemas allow for controlled updates and prevent unexpected issues.
- **Downstream queries depend on types:** Autodetect stores whatever types it infers, so a date or numeric ID loaded as STRING, or an integer loaded as FLOAT, forces casts in every later query. Declaring the types up front avoids that.
- **Complex data types are involved:** Autodetect might struggle with nested or repeated fields.
- **Data quality requires strict validation:** Explicit schemas act as a form of data validation, rejecting data that doesn't conform.

Enforcing explicit schemas provides more control, predictability, and robustness for production pipelines and critical datasets.
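
To make the trade-off concrete: the `users` schema that autodetect produced in this notebook (printed in section 5.1) types `age` and `household_size` as FLOAT. A small comparison against the types you intended catches that kind of drift. In the sketch below, the `EXPECTED` mapping is an assumption for illustration (the lab does not define a target schema), and `schema_drift` is a hypothetical helper, not a BigQuery API.

```python
# ASSUMED target types for illustration; the lab does not specify them.
EXPECTED = {
    "user_id": "STRING",
    "age": "INTEGER",
    "household_size": "INTEGER",
    "is_active": "BOOLEAN",
}

# Types reported by autodetect for the same columns in this notebook.
DETECTED = {
    "user_id": "STRING",
    "age": "FLOAT",
    "household_size": "FLOAT",
    "is_active": "BOOLEAN",
}


def schema_drift(expected: dict, detected: dict) -> dict:
    """Return {column: (expected_type, detected_type_or_None)} for each mismatch."""
    return {
        col: (typ, detected.get(col))
        for col, typ in expected.items()
        if detected.get(col) != typ
    }


print(schema_drift(EXPECTED, DETECTED))
# → {'age': ('INTEGER', 'FLOAT'), 'household_size': ('INTEGER', 'FLOAT')}
```

An empty result means the loaded table matches expectations; anything else is a prompt to load with an explicit schema.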

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.
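
The IQR rule in the outlier bullet can be sketched in a few lines. This is a minimal illustration, assuming the conventional multiplier `k = 1.5` and Python's "inclusive" quantile method; a SQL engine may interpolate quartiles differently, so the fences will not necessarily match a BigQuery result exactly. `iqr_fences` and `flag_and_cap` are hypothetical helper names.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_and_cap(values, k=1.5):
    """Flag each value outside the fences and also provide a capped copy."""
    lo, hi = iqr_fences(values, k)
    return [
        {"value": v, "is_outlier": v < lo or v > hi, "capped": min(max(v, lo), hi)}
        for v in values
    ]


# Toy data, not taken from the Netflix tables.
for row in flag_and_cap([1, 2, 3, 4, 5, 100]):
    print(row)
```

Keeping the original value, the flag, and the capped value side by side follows the "always flag and explain" advice: nothing is silently overwritten.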


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [None]:
# # EXAMPLE (from LLM) — Missingness profile (commented)
# # -- Users: % missing per column
# # WITH base AS (
# #   SELECT COUNT(*) n,
# #          COUNTIF(region IS NULL) miss_region,
# #          COUNTIF(plan_tier IS NULL) miss_plan,
# #          COUNTIF(age_band IS NULL) miss_age
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # )
# # SELECT n,
# #        ROUND(100*miss_region/n,2) AS pct_missing_region,
# #        ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
# #        ROUND(100*miss_age/n,2)    AS pct_missing_age_band
# # FROM base;

In [None]:
from google.cloud import bigquery
import os

project_id = os.environ.get('GOOGLE_CLOUD_PROJECT')
client = bigquery.Client(project=project_id)

# Get and print schema of the users table
# Fetch table metadata by fully qualified ID (client.dataset() is deprecated)
table = client.get_table(f"{project_id}.netflix.users")
print("Schema of the users table:")
for field in table.schema:
    print(f"- {field.name}: {field.field_type}")

# Users: % missing per column
query = """
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(country IS NULL) miss_country,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `{}.netflix.users`
)
SELECT n,
       ROUND(100*miss_country/n,2) AS pct_missing_country,
       ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
       ROUND(100*miss_age/n,2)    AS pct_missing_age
FROM base
""".format(project_id)

print("\nExecuting missingness query:")
query_job = client.query(query)
results = query_job.result()

for row in results:
    print(f"N: {row.n}, Pct Missing Country: {row.pct_missing_country}, Pct Missing Subscription Plan: {row.pct_missing_subscription_plan}, Pct Missing Age: {row.pct_missing_age}")

Schema of the users table:
- user_id: STRING
- email: STRING
- first_name: STRING
- last_name: STRING
- age: FLOAT
- gender: STRING
- country: STRING
- state_province: STRING
- city: STRING
- subscription_plan: STRING
- subscription_start_date: DATE
- is_active: BOOLEAN
- monthly_spend: FLOAT
- primary_device: STRING
- household_size: FLOAT
- created_at: TIMESTAMP

Executing missingness query:
N: 72100, Pct Missing Country: 0.0, Pct Missing Subscription Plan: 0.0, Pct Missing Age: 11.93


In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

In [None]:
%%bigquery
SELECT country,
       COUNT(*) AS n,
       ROUND(100*COUNTIF(subscription_plan IS NULL)/COUNT(*),2) AS pct_missing_subscription_plan
FROM `netflix.users`
GROUP BY country
ORDER BY pct_missing_subscription_plan DESC

|   | country | n | pct_missing_subscription_plan |
|---|---------|-------|-----|
| 0 | Canada  | 21672 | 0.0 |
| 1 | USA     | 50428 | 0.0 |


### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [None]:

%%bigquery
-- Verification query: print missingness percentages
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(country IS NULL) miss_country,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `netflix.users`
)
SELECT ROUND(100*miss_country/n,2) AS pct_missing_country,
       ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
       ROUND(100*miss_age/n,2)    AS pct_missing_age
FROM base;

|   | pct_missing_country | pct_missing_subscription_plan | pct_missing_age |
|---|-----|-----|-------|
| 0 | 0.0 | 0.0 | 11.93 |


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

**Reflection:** Based on the missingness analysis, the `age` column is the most missing with 11.93% of values missing. The `country` and `subscription_plan` columns have no missing values (0.0%).

Hypotheses about the missing data mechanisms:
- **Age:** The missingness in the `age` column could be **Missing At Random (MAR)** or **Missing Not At Random (MNAR)**. It might be MAR if the probability of age being missing depends on another observed variable, such as country or device type (e.g., users from certain regions or on certain devices are less likely to provide age). It could be MNAR if the probability of age being missing depends on the age itself (e.g., very young or very old users are less likely to provide their age). It's less likely to be **Missing Completely At Random (MCAR)**, where missingness is unrelated to any variable, observed or unobserved, given the specific nature of age data.

To determine the actual mechanism, further analysis would be needed to see if missingness in 'age' is correlated with other variables.
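
One simple form of that further analysis is to compare the missing rate of `age` across the values of another column. The sketch below shows the idea in plain Python on made-up rows (they are not taken from the `users` table); `missing_rate_by_group` is a hypothetical helper. A large gap between groups is evidence against MCAR and consistent with MAR. Similar rates do not prove MCAR, and MNAR cannot be confirmed from the observed data alone.

```python
from collections import defaultdict


def missing_rate_by_group(rows, target, group):
    """Percent of rows where `target` is None, for each value of `group`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row.get(group)
        totals[key] += 1
        if row.get(target) is None:
            missing[key] += 1
    return {key: round(100 * missing[key] / totals[key], 2) for key in totals}


# Toy rows for illustration only.
rows = [
    {"country": "USA", "age": 34},
    {"country": "USA", "age": None},
    {"country": "USA", "age": 51},
    {"country": "USA", "age": 27},
    {"country": "Canada", "age": None},
    {"country": "Canada", "age": 40},
]
print(missing_rate_by_group(rows, target="age", group="country"))
# → {'USA': 25.0, 'Canada': 50.0}
```

The same comparison could be repeated for `primary_device` or `subscription_plan` before choosing between imputing and dropping.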

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
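
To see what a deterministic keep-one policy means before writing any SQL, here is a small in-memory sketch. It mirrors the idea of ranking rows within each key group and keeping the top one: higher preferred values win, missing values lose, and exact ties keep the first row seen. `keep_best` is a hypothetical helper, the column names follow the `watch_history` columns used in this notebook's queries, and the rows are made up.

```python
KEY = ("user_id", "movie_id", "watch_date", "device_type")
PREFER = ("progress_percentage", "watch_duration_minutes")


def keep_best(rows, key_cols=KEY, prefer_cols=PREFER):
    """Keep one row per key; highest prefer_cols win, None ranks lowest."""
    def rank(row):
        # (is_present, value) pairs so that None always compares lowest.
        return tuple(
            (row[c] is not None, row[c] if row[c] is not None else 0)
            for c in prefer_cols
        )

    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Strict ">" means the first row seen wins an exact tie.
        if key not in best or rank(row) > rank(best[key]):
            best[key] = row
    return list(best.values())


# Toy rows for illustration only.
base = {"user_id": "u1", "movie_id": "m1", "watch_date": "2024-01-01", "device_type": "Laptop"}
rows = [
    {**base, "progress_percentage": 50.0, "watch_duration_minutes": 30.0},
    {**base, "progress_percentage": 90.0, "watch_duration_minutes": 20.0},
    {**base, "progress_percentage": None, "watch_duration_minutes": 99.0},
]
print(keep_best(rows))  # keeps the row with progress_percentage == 90.0
```

Whatever the policy, writing it down as an explicit ordering is what makes the result reproducible from run to run.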

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history`
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [None]:
%%bigquery
SELECT user_id,
       movie_id,
       watch_date,
       device_type,
       COUNT(*) AS duplicate_count
FROM `netflix.watch_history`
GROUP BY user_id, movie_id, watch_date, device_type
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC
LIMIT 20

|   | user_id | movie_id | watch_date | device_type | duplicate_count |
|---|------------|------------|------------|----------|----|
| 0 | user_03310 | movie_0640 | 2024-09-08 | Smart TV | 28 |
| 1 | user_00391 | movie_0893 | 2024-08-26 | Laptop   | 28 |
| 2 | user_06417 | movie_0590 | 2024-01-15 | Laptop   | 21 |
| 3 | user_04899 | movie_0142 | 2025-01-20 | Desktop  | 21 |
| 4 | user_08681 | movie_0332 | 2024-06-13 | Laptop   | 21 |
| 5 | user_05952 | movie_0893 | 2024-04-29 | Desktop  | 21 |
| 6 | user_07594 | movie_0133 | 2025-03-24 | Laptop   | 21 |
| 7 | user_05629 | movie_0697 | 2025-01-23 | Desktop  | 21 |
| 8 | user_06799 | movie_0458 | 2024-08-15 | Desktop  | 21 |
| 9 | user_03898 | movie_0500 | 2025-07-29 | Desktop  | 21 |


In [None]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history` h
# # )
# # WHERE rk = 1;

In [None]:
%%bigquery
-- Create watch_history_dedup table by keeping one row per duplicate group
CREATE OR REPLACE TABLE `netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk) FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, movie_id, watch_date, device_type
           ORDER BY progress_percentage DESC, watch_duration_minutes DESC
         ) AS rk
  FROM `netflix.watch_history` h
)
WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [None]:
%%bigquery
-- Verification: Compare row counts before and after deduplication
SELECT
  'raw_watch_history' AS table_name,
  COUNT(*) AS row_count
FROM `netflix.watch_history`
UNION ALL
SELECT
  'watch_history_dedup' AS table_name,
  COUNT(*) AS row_count
FROM `netflix.watch_history_dedup`;

Unnamed: 0,table_name,row_count
0,raw_watch_history,735000
1,watch_history_dedup,100000


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

**Reflection:** Duplicates in data can arise from several sources, broadly categorized as natural or system-generated:

- **Natural Duplicates:** These are genuine repeated events that look like duplicates but are distinct occurrences. For example, a user might genuinely watch the same movie more than once on the same day and device. Without a session ID or timestamp, such rows cannot be told apart from repeated log entries for a single watch session, which are system-generated.
- **System-Generated Duplicates:** These are artifacts of data collection, processing, or storage systems. Examples include:
    - **Retry mechanisms:** If a system fails to confirm a successful data write, it might retry, leading to duplicate entries.
    - **Parallel processing:** Data processed in parallel might be written multiple times if not coordinated correctly.
    - **Sensor/Event firing:** A single event might trigger multiple identical sensor readings or log entries due to system glitches or configuration issues.
    - **Joins or transformations:** Incorrect joins or data transformations can sometimes create duplicate rows.

Duplicates can significantly corrupt labels and KPIs:

- **Corrupted Labels:** If you're building a model where the label is derived from interaction counts (e.g., "number of movies watched"), duplicates will inflate these counts, leading to incorrect labels and a biased model. For example, a user who watched a movie once might appear to have watched it multiple times due to duplicate logs.
- **Corrupted KPIs:** Business metrics (KPIs) built from sums or row counts, such as "total watch hours", "sessions per user", or "recommendation click-through rate", will be inaccurate if based on data with duplicates, and averages such as "average session duration" can be skewed toward the duplicated rows. Metrics computed as distinct counts (e.g., "number of active users") are typically unaffected by exact duplicate rows. Inflated or skewed numbers can lead to flawed business decisions. For instance, duplicate watch history entries will inflate total watch hours, giving a false impression of user engagement.
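
To make the KPI inflation concrete, here is a minimal Python sketch with made-up rows (the user IDs, dates, and minutes are hypothetical, not taken from the dataset). It applies a keep-one policy per `(user_id, movie_id, watch_date, device_type)` group and compares total watch hours before and after.

```python
# Hypothetical watch-log rows: (user_id, movie_id, watch_date, device_type, minutes)
rows = [
    ("user_1", "movie_9", "2024-09-08", "Laptop", 90.0),
    ("user_1", "movie_9", "2024-09-08", "Laptop", 90.0),  # exact duplicate (e.g., a retry)
    ("user_1", "movie_9", "2024-09-08", "Laptop", 45.0),  # same key, shorter duration
    ("user_2", "movie_3", "2024-09-09", "Desktop", 30.0),
]


def dedup_keep_longest(rows):
    """Keep one row per (user, movie, date, device): the one with the most minutes."""
    best = {}
    for row in rows:
        key = row[:4]
        if key not in best or row[4] > best[key][4]:
            best[key] = row
    return list(best.values())


def total_hours(rows):
    """Sum the minutes column and convert to hours."""
    return sum(r[4] for r in rows) / 60


deduped = dedup_keep_longest(rows)
raw_hours = total_hours(rows)
clean_hours = total_hours(deduped)
print(f"raw: {raw_hours:.2f} h, deduplicated: {clean_hours:.2f} h")
```

In this toy data the raw total is more than double the deduplicated total, which is the same kind of distortion the before/after row counts point to.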

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.
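
As a rough illustration of the two ideas in this step, the sketch below computes IQR fences and then winsorizes to the 1st and 99th percentiles using only the Python standard library. The data is synthetic (durations of 1 to 100 minutes plus one 800-minute session), and the variable names are assumptions for this example, not part of the lab schema.

```python
import statistics

# Synthetic watch durations in minutes, with one extreme session.
durations = [float(m) for m in range(1, 101)] + [800.0]

# IQR fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] count as outliers.
q1, _, q3 = statistics.quantiles(durations, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [d for d in durations if d < lo or d > hi]
pct_outliers = round(100 * len(outliers) / len(durations), 2)

# Winsorize: clamp to P01/P99 instead of dropping rows.
# statistics.quantiles(n=100) returns 99 cut points, so the first is P01 and the last is P99.
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p01, p99 = cuts[0], cuts[-1]
capped = [min(max(d, p01), p99) for d in durations]

# Keep a flag so the information that a value was extreme is not lost.
is_extreme = [d < p01 or d > p99 for d in durations]

print(f"IQR fences: [{lo}, {hi}], outliers: {pct_outliers}%")
print(f"P01={p01}, P99={p99}, flagged rows: {sum(is_extreme)}")
```

Capping keeps every row, so row counts stay stable for downstream joins, while the flag column preserves which sessions were extreme.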

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [None]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h
# # CROSS JOIN bounds b;

In [None]:
%%bigquery
-- Watch history dedup: IQR outlier rate for watch_duration_minutes
WITH dist AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3
  FROM `netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers
FROM `netflix.watch_history_dedup` h
CROSS JOIN bounds b;

Unnamed: 0,outliers,total,pct_outliers
0,3472,100000,3.47


In [None]:
%%bigquery
-- Create watch_history_robust with watch_duration_minutes capped at P01/P99
-- NOTE: APPROX_QUANTILES(x, 100) returns 101 values (offset 0 = min, offset 100 = max),
-- so OFFSET(98) is the 98th percentile; a true P99 cap needs OFFSET(99).
-- The outputs recorded in this notebook reflect OFFSET(98).
CREATE OR REPLACE TABLE `netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(98)] AS p99
  FROM `netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS watch_duration_minutes_capped
FROM `netflix.watch_history_dedup` h, q;

In [None]:
%%bigquery
-- Quantiles before vs after capping (run after watch_history_robust has been created)
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q
  FROM `netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(watch_duration_minutes_capped, 5) AS q
  FROM `netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;

Unnamed: 0,which,q
0,before,"[0.2, 25.0, 41.8, 61.1, 91.9, 799.3]"
1,after,"[4.4, 24.6, 41.5, 61.5, 92.0, 204.0]"


In [None]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [None]:
%%bigquery
-- Verification: Min/Median/Max before vs after capping
WITH before AS (
  SELECT
    'before' AS which,
    MIN(watch_duration_minutes) AS min_duration,
    APPROX_QUANTILES(watch_duration_minutes, 2)[OFFSET(1)] AS median_duration, -- Median is the 50th percentile (OFFSET(1) for 2 quantiles)
    MAX(watch_duration_minutes) AS max_duration
  FROM `netflix.watch_history_dedup`
),
after AS (
  SELECT
    'after' AS which,
    MIN(watch_duration_minutes_capped) AS min_duration,
    APPROX_QUANTILES(watch_duration_minutes_capped, 2)[OFFSET(1)] AS median_duration, -- Median is the 50th percentile
    MAX(watch_duration_minutes_capped) AS max_duration
  FROM `netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;

Unnamed: 0,which,min_duration,median_duration,max_duration
0,before,0.2,50.9,799.3
1,after,4.4,51.4,204.0


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

**Reflection:** Capping (or winsorizing) outliers can sometimes be harmful if the extreme values are genuine and contain important information. For example, in a fraud detection dataset, extreme transaction amounts might be true outliers but are critical indicators of fraudulent activity. Capping these values would obscure the very signal you're trying to detect. It can also distort the underlying distribution of the data, which might negatively impact models that assume a certain distribution.

Model types less sensitive to outliers include **tree-based models** such as Decision Trees, Random Forests, and Gradient Boosting Machines (e.g., LightGBM, XGBoost). These models split the data at thresholds, and a split depends only on the ordering of a feature's values, not on their magnitude. An extreme value therefore lands on the same side of a split as other large values without pulling the threshold toward it, unlike models that compute distances or averages (such as linear regression or k-means clustering). This robustness applies to outliers in the input features; extreme values in a regression target can still shift leaf predictions under squared-error loss.
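
The point about split thresholds can be checked with a toy one-feature "decision stump". This is a hand-rolled sketch for illustration, not how any particular library implements trees, and the minutes and labels are invented. It compares the chosen threshold and the feature mean with and without an extreme value.

```python
def best_split(xs, ys):
    """Return the midpoint threshold with the fewest errors for the rule
    'predict 1 if x > threshold else 0'."""
    pairs = sorted(zip(xs, ys))
    best_thr, best_err = None, len(pairs) + 1
    for i in range(len(pairs) - 1):
        # Candidate thresholds sit halfway between neighbouring sorted values.
        thr = (pairs[i][0] + pairs[i + 1][0]) / 2
        err = sum((x > thr) != bool(y) for x, y in pairs)
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr


minutes = [10, 20, 30, 40, 50, 60]       # invented feature values
label = [0, 0, 0, 1, 1, 1]               # invented binary target
with_outlier = minutes[:-1] + [6000]     # replace the largest value with an extreme one

thr_clean = best_split(minutes, label)
thr_outlier = best_split(with_outlier, label)
mean_clean = sum(minutes) / len(minutes)
mean_outlier = sum(with_outlier) / len(with_outlier)

print(f"split threshold: {thr_clean} -> {thr_outlier}")
print(f"feature mean:    {mean_clean} -> {mean_outlier}")
```

The threshold stays where the labels change, while the mean moves by orders of magnitude. A threshold that happened to sit next to the extreme value would shift numerically, but it would still separate the same rows.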

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [None]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`;

In [None]:
%%bigquery
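-- Compute and summarize flag_binge for sessions > 8 hours
-- NOTE: the capped column tops out at the upper cap (204.0 min in the summary above), so this
-- count is 0 by construction; the raw watch_duration_minutes column is the one that can exceed 480.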
SELECT
  COUNTIF(watch_duration_minutes_capped > 8*60) AS sessions_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(watch_duration_minutes_capped > 8*60)/COUNT(*),2) AS pct_sessions_over_8h
FROM `netflix.watch_history_robust`;

Unnamed: 0,sessions_over_8h,total,pct_sessions_over_8h
0,0,100000,0.0


In [None]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`;

In [None]:
%%bigquery
-- Compute and summarize flag_age_extreme if age < 10 or > 100
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_users,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_extreme_age_users
FROM `netflix.users`;

Unnamed: 0,extreme_age_users,total,pct_extreme_age_users
0,1253,72100,1.74


In [None]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.movies`;

In [None]:
%%bigquery
-- Compute and summarize flag_duration_anomaly where duration_minutes < 15 or > 480
SELECT
  COUNTIF(duration_minutes < 15) AS titles_under_15m,
  COUNTIF(duration_minutes > 480) AS titles_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/COUNT(*),2) AS pct_duration_anomalies
FROM `netflix.movies`;

Unnamed: 0,titles_under_15m,titles_over_8h,total,pct_duration_anomalies
0,84,77,7280,2.21


### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [None]:
%%bigquery
-- Verification: Compact summary of anomaly flags (percentage of rows)
SELECT
  'flag_binge' AS flag_name,
  ROUND(100*COUNTIF(watch_duration_minutes_capped > 8*60)/(COUNT(*)),2) AS pct_of_rows
FROM `netflix.watch_history_robust`
UNION ALL
SELECT
  'flag_age_extreme' AS flag_name,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/(COUNT(*)),2) AS pct_of_rows
FROM `netflix.users`
UNION ALL
SELECT
  'flag_duration_anomaly' AS flag_name,
  ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/(COUNT(*)),2) AS pct_of_rows
FROM `netflix.movies`;

Unnamed: 0,flag_name,pct_of_rows
0,flag_binge,0.0
1,flag_age_extreme,1.74
2,flag_duration_anomaly,2.21


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

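**Reflection:** Based on the anomaly flag summary, `flag_duration_anomaly` is the most common flag, affecting 2.21% of movies. `flag_age_extreme` affects 1.74% of users, and `flag_binge` affects 0.00% of watch sessions. The zero for `flag_binge` is an artifact of computing it on `watch_duration_minutes_capped`: the capped maximum is 204.0 minutes, so no row can exceed the 480-minute threshold. The raw `watch_duration_minutes` column (maximum 799.3) is the one that can reveal binge sessions.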

Regarding which flag to keep as a feature, it depends on the specific business problem or ML task. However, the `flag_binge` (or a similar flag indicating unusually long watch sessions) could be particularly valuable as a feature for recommendation systems or user segmentation. Binge behavior can indicate high engagement or specific user preferences, which could be strong signals for predicting future behavior or tailoring content recommendations. While `flag_duration_anomaly` and `flag_age_extreme` might be useful for data cleaning or understanding data quality issues, `flag_binge` seems more directly relevant as a behavioral feature for modeling user engagement or preferences.
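
A short sketch of why the choice of column matters for `flag_binge`. The session values are hypothetical; the 204.0-minute cap is the maximum reported after capping earlier in this notebook, and the helper name `pct_flagged` is made up for this example.

```python
BINGE_MINUTES = 8 * 60  # 480-minute threshold from the build prompt


def pct_flagged(values, threshold=BINGE_MINUTES):
    """Percentage of values strictly above the threshold, rounded to 2 decimals."""
    flagged = sum(v > threshold for v in values)
    return round(100 * flagged / len(values), 2)


raw = [25.0, 41.8, 61.1, 91.9, 520.0, 799.3]  # hypothetical raw session minutes
cap = 204.0                                   # upper cap observed after winsorizing
capped = [min(v, cap) for v in raw]

pct_raw = pct_flagged(raw)
pct_capped = pct_flagged(capped)
print(f"flag_binge on raw minutes:    {pct_raw}%")
print(f"flag_binge on capped minutes: {pct_capped}%")
```

Any flag whose threshold lies above the cap will always be zero on the capped column, so behavioral flags of this kind are better derived from the raw column and stored alongside the capped one.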

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.


## 6) Save & submit Checklist

- [ ]  Save this notebook to the team Drive.
- [ ]  Export a `.sql` file with your DQ queries and save to repo.
- [ ]  Push notebook + SQL to the **team GitHub** with a descriptive commit.
- [ ]  Add a README with your `${PROJECT_ID}`, `${REGION}`, bucket, dataset, and today’s row counts.
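
One possible way to produce the `.sql` export and README from a notebook cell is sketched below. Everything here is a placeholder for illustration: the file names `dq_queries.sql` and `README.md`, the project and bucket values, and the single sample query. The row counts are the ones from the deduplication check in this notebook. The sketch writes to a temporary directory; a real run would write into your repo folder instead.

```python
import tempfile
from pathlib import Path

# Placeholder settings; replace with your own values.
settings = {
    "PROJECT_ID": "your-project-id",
    "REGION": "us-central1",
    "bucket": "gs://your-bucket",
    "dataset": "netflix",
}
row_counts = {"watch_history": 735000, "watch_history_dedup": 100000}

# Hypothetical collection of the DQ queries you want to keep, keyed by a short name.
dq_queries = {
    "dedup_row_count": "SELECT COUNT(*) AS row_count FROM `netflix.watch_history_dedup`;",
}

out_dir = Path(tempfile.mkdtemp())

# One .sql file with a comment header per query.
sql_text = "\n\n".join(f"-- {name}\n{sql}" for name, sql in dq_queries.items())
(out_dir / "dq_queries.sql").write_text(sql_text + "\n", encoding="utf-8")

# A small README listing settings and today's row counts.
lines = ["# MGMT 467 lab artifacts", ""]
lines += [f"- {key}: `{value}`" for key, value in settings.items()]
lines += ["", "## Row counts"]
lines += [f"- {table}: {count}" for table, count in row_counts.items()]
readme_text = "\n".join(lines) + "\n"
(out_dir / "README.md").write_text(readme_text, encoding="utf-8")

print(f"Wrote artifacts to {out_dir}")
```

Generating both files from the same dictionaries keeps the README and the SQL export in sync when you rerun the notebook.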

## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
