
# Unit 2 — Team Classification (Flights, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`diverted`** on U.S. flights. Minimal handholding by design.

**What you deliver (inside this notebook):**
- One **LOGISTIC_REG** model (baseline), one **engineered** model using `TRANSFORM`
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (default 0.5 + your custom threshold)
- **Threshold choice** + 3–5 sentence ops justification
- Embedded **rubric** below (self-check before submission)

> Choose *one* dataset table that exists at your institution:  
> • `bigquery-public-data.faa.us_flights` **or** `bigquery-public-data.flights.*`  
> Make sure the table has `carrier`, `dep_delay`, `arr_delay` (for filters), `origin`, `dest`, `diverted` (or equivalent).


In [65]:
import os
from google.cloud import bigquery
from google.colab import auth

auth.authenticate_user()

PROJECT_ID = "mgmt467-471819"  # e.g., mgmt-467-47888
REGION = "us"
TABLE_PATH = "mgmt467-471819.Flights.Flights"  # Use a valid public or custom project.dataset.table path

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"] = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)

BQ Project: mgmt467-471819
Source table: mgmt467-471819.Flights.Flights


### Quick sanity check

In [66]:

preview_sql = f"SELECT * FROM `{TABLE_PATH}` LIMIT 5"
bq.query(preview_sql).result().to_dataframe()


Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1999,4,10,17,7,1999-10-17,DL,19790,DL,N224DA,...,,,,,,,,,,
1,1995,4,11,26,7,1995-11-26,DL,19790,DL,N985DL,...,,,,,,,,,,
2,1993,3,8,3,2,1993-08-03,DL,19790,DL,,...,,,,,,,,,,
3,1993,2,5,27,4,1993-05-27,DL,19790,DL,,...,,,,,,,,,,
4,1994,3,8,19,5,1994-08-19,DL,19790,DL,,...,,,,,,,,,,



## 1) Canonical mapping (adjust as needed)
Map to a minimal schema used in the rest of the notebook:
- `flight_date` (DATE), `dep_delay` (NUM), `distance` (NUM), `carrier` (STRING), `origin` (STRING), `dest` (STRING), `diverted` (BOOL)


In [67]:
# Adjust ONLY if your table uses different column names.
CANONICAL_BASE_SQL = f'''
WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `{TABLE_PATH}`
  WHERE DepDelay IS NOT NULL
)
'''
print(CANONICAL_BASE_SQL[:600] + "\n...")


WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `mgmt467-471819.Flights.Flights`
  WHERE DepDelay IS NOT NULL
)

...


## 2) Split (80/20)

In [88]:

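# NOTE: RAND() is re-evaluated every time a query runs, so TRAIN/EVAL membership
# is not reproducible across separate training and evaluation queries.
SPLIT_CLAUSE = r'''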
, split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
'''
print(SPLIT_CLAUSE)



, split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
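
Because `RAND()` is re-evaluated each time a query runs, the TRAIN/EVAL membership produced by this clause changes from one query to the next. A minimal sketch of a reproducible alternative, assuming that hashing each row's content with `FARM_FINGERPRINT` is acceptable for this table (the name `DETERMINISTIC_SPLIT_CLAUSE` is hypothetical and is not used elsewhere in the notebook):

```python
# Hypothetical drop-in replacement for SPLIT_CLAUSE (an assumption, not the notebook's own code).
# Each row is hashed on its own content, so a given flight always lands in the same split,
# no matter which query computes it.
DETERMINISTIC_SPLIT_CLAUSE = r'''
, split AS (
  SELECT cf.*,
         CASE
           WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(cf)), 10)) < 8 THEN 'TRAIN'
           ELSE 'EVAL'
         END AS split
  FROM canonical_flights cf
)
'''
print(DETERMINISTIC_SPLIT_CLAUSE)
```

The split is only approximately 80/20, and rows that are identical in every canonical column share a bucket. Since the assignment depends only on row content, a flight keeps the same split in both the global query and any segment-filtered query.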



## Train Global Engineered Model (Model B)

In [86]:
DATASET_ID = TABLE_PATH.split('.')[1] # Extracts 'Flights' from 'mgmt467-471819.Flights.Flights'
print("Dataset ID:", DATASET_ID)

Dataset ID: Flights


In [90]:
# Create the dataset if it doesn't exist
dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset.location = REGION

try:
    bq.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {PROJECT_ID}.{DATASET_ID} created or already exists.")
except Exception as e:
    print(f"Error creating dataset {PROJECT_ID}.{DATASET_ID}: {e}")


MODEL_B_ENGINEERED = f"{PROJECT_ID}.{DATASET_ID}.model_b_engineered"

sql_create_model_b = f'''
CREATE OR REPLACE MODEL `{MODEL_B_ENGINEERED}`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
) AS
{CANONICAL_BASE_SQL.strip()}
{SPLIT_CLAUSE.strip()}
SELECT
  diverted,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date)     AS month,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_time'
    WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
    WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
    WHEN dep_delay > 120 THEN 'major'
    ELSE 'unknown'
  END AS dep_delay_bucket
FROM split AS s
WHERE s.split='TRAIN'
;
'''

print("Training Model B (Engineered)...")
job_model_b = bq.query(sql_create_model_b)
job_model_b.result()
print("Model B trained:", MODEL_B_ENGINEERED)


Dataset mgmt467-471819.Flights created or already exists.
Training Model B (Engineered)...
Model B trained: mgmt467-471819.Flights.model_b_engineered
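
Model B engineers its features in the training `SELECT`, so the same expressions must be repeated in every evaluation or prediction query. The deliverables ask for a `TRANSFORM`-based model; the following is a sketch of how the same features might be moved into a `TRANSFORM` clause. The helper name `build_transform_model_sql` and the model path in the demo call are hypothetical, and the statement has not been run against this dataset.

```python
def build_transform_model_sql(model_path: str, base_sql: str, split_clause: str) -> str:
    """Return a CREATE MODEL statement that keeps feature engineering inside TRANSFORM."""
    return f'''
CREATE OR REPLACE MODEL `{model_path}`
TRANSFORM (
  diverted,  -- the label passes through unchanged
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date)     AS month,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_time'
    WHEN dep_delay <= 30 THEN 'minor'
    WHEN dep_delay <= 120 THEN 'moderate'
    WHEN dep_delay > 120 THEN 'major'
    ELSE 'unknown'
  END AS dep_delay_bucket
)
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
) AS
{base_sql.strip()}
{split_clause.strip()}
SELECT diverted, dep_delay, distance, carrier, origin, dest, flight_date
FROM split AS s
WHERE s.split='TRAIN'
'''


# In the notebook this would be called with CANONICAL_BASE_SQL and SPLIT_CLAUSE;
# the short stand-ins below only keep the example self-contained.
demo_sql = build_transform_model_sql(
    "my-project.my_dataset.model_b_transform",  # hypothetical model path
    "WITH canonical_flights AS (SELECT ...)",   # stand-in for CANONICAL_BASE_SQL
    ", split AS (SELECT ...)",                  # stand-in for SPLIT_CLAUSE
)
print(demo_sql)
```

With this layout, evaluation queries only need to supply the raw canonical columns; BigQuery ML applies the stored transformations itself.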


## Define Segment for Localized Model (ATL, ORD, JFK)

### Subtask:
Define the SQL filter clause to include 'ATL', 'ORD', and 'JFK' airports as the segment for the localized model.


**Reasoning**:
To fulfill the subtask, I need to create a Python variable that holds the SQL filter clause for the specified airports ('ATL', 'ORD', 'JFK') in both origin and destination columns.



In [91]:
SEGMENT_FILTER_ATL_ORD_JFK = "AND (origin IN ('ATL', 'ORD', 'JFK') OR dest IN ('ATL', 'ORD', 'JFK'))"
print(SEGMENT_FILTER_ATL_ORD_JFK)

AND (origin IN ('ATL', 'ORD', 'JFK') OR dest IN ('ATL', 'ORD', 'JFK'))


## Prepare Segmented Data (ATL, ORD, JFK)

### Subtask:
Filter the canonical flights data to include only the ATL, ORD, and JFK segments. Apply the necessary feature engineering transformations (consistent with Model B) to this combined segmented data.


**Reasoning**:
I need to construct a SQL query string that combines the base canonical flights data, applies a train/eval split, filters for specific airports, and includes feature engineering steps for use in a localized model. I will then print a portion of this SQL query to verify its structure.



In [92]:
SEGMENTED_DATA_SQL = f'''
{CANONICAL_BASE_SQL.strip().replace('WHERE DepDelay IS NOT NULL', f"WHERE DepDelay IS NOT NULL {SEGMENT_FILTER_ATL_ORD_JFK}")}
{SPLIT_CLAUSE.strip()}
SELECT
  diverted,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date)     AS month,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_time'
    WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
    WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
    WHEN dep_delay > 120 THEN 'major'
    ELSE 'unknown'
  END AS dep_delay_bucket
FROM split AS s
WHERE s.split='TRAIN'
'''

print(SEGMENTED_DATA_SQL[:1000] + "\n...")


WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `mgmt467-471819.Flights.Flights`
  WHERE DepDelay IS NOT NULL AND (origin IN ('ATL', 'ORD', 'JFK') OR dest IN ('ATL', 'ORD', 'JFK'))
)
, split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
SELECT
  diverted,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date)     AS month,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_ti

## Train Localized Model (Model C - ATL, ORD, JFK)

### Subtask:
Train a new BigQuery ML LOGISTIC_REG model (Model C) using the segmented training data for ATL, ORD, and JFK. The features used will be consistent with Model B.


**Reasoning**:
I need to define a variable for the new localized model's name and construct the SQL query to create and train the BigQuery ML model using the segmented data and specified options, then execute it.



In [93]:
MODEL_C_ATL_ORD_JFK = f"{PROJECT_ID}.{DATASET_ID}.model_c_atl_ord_jfk_localized"

sql_create_model_c_atl_ord_jfk = f'''
CREATE OR REPLACE MODEL `{MODEL_C_ATL_ORD_JFK}`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
) AS
{SEGMENTED_DATA_SQL.strip()}
;
'''

print("Training Model C (Localized ATL, ORD, JFK)...")
job_model_c_atl_ord_jfk = bq.query(sql_create_model_c_atl_ord_jfk)
job_model_c_atl_ord_jfk.result()
print("Model C (Localized ATL, ORD, JFK) trained:", MODEL_C_ATL_ORD_JFK)

Training Model C (Localized ATL, ORD, JFK)...
Model C (Localized ATL, ORD, JFK) trained: mgmt467-471819.Flights.model_c_atl_ord_jfk_localized


## Evaluate Model C on Segment (ATL, ORD, JFK)

### Subtask:
Evaluate the trained Model C using `ML.EVALUATE` on the combined ATL, ORD, JFK segmented evaluation data. Calculate AUC and a confusion matrix at a 0.5 threshold.


**Reasoning**:
To evaluate Model C, I will construct a SQL query using `ML.EVALUATE` with the previously trained `MODEL_C_ATL_ORD_JFK` and the evaluation split of the segmented data. This query will then be executed, and the results converted to a pandas DataFrame for display.



In [94]:
EVALUATE_MODEL_C_ATL_ORD_JFK_SQL = f'''
SELECT
  *
FROM
  ML.EVALUATE(MODEL `{MODEL_C_ATL_ORD_JFK}`,
    (
      {SEGMENTED_DATA_SQL.strip().replace("WHERE s.split='TRAIN'", "WHERE s.split='EVAL'")}
    )
  );
'''

print("Evaluating Model C (Localized ATL, ORD, JFK) metrics...")
eval_c_atl_ord_jfk_job = bq.query(EVALUATE_MODEL_C_ATL_ORD_JFK_SQL)
eval_c_atl_ord_jfk_results = eval_c_atl_ord_jfk_job.result().to_dataframe()
print("Evaluation metrics for Model C (Localized ATL, ORD, JFK):")
print(eval_c_atl_ord_jfk_results)


CONFUSION_MATRIX_MODEL_C_ATL_ORD_JFK_SQL = f'''
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `{MODEL_C_ATL_ORD_JFK}`,
    (
      {SEGMENTED_DATA_SQL.strip().replace("WHERE s.split='TRAIN'", "WHERE s.split='EVAL'")}
    )
  );
'''

print("Evaluating Model C (Localized ATL, ORD, JFK) confusion matrix...")
confusion_matrix_c_atl_ord_jfk_job = bq.query(CONFUSION_MATRIX_MODEL_C_ATL_ORD_JFK_SQL)
confusion_matrix_c_atl_ord_jfk_results = confusion_matrix_c_atl_ord_jfk_job.result().to_dataframe()
print("Confusion matrix for Model C (Localized ATL, ORD, JFK):")
print(confusion_matrix_c_atl_ord_jfk_results)

Evaluating Model C (Localized ATL, ORD, JFK) metrics...
Evaluation metrics for Model C (Localized ATL, ORD, JFK):
   precision    recall  accuracy  f1_score  log_loss   roc_auc
0        0.5  0.004902  0.997783  0.009709  0.015451  0.766584
Evaluating Model C (Localized ATL, ORD, JFK) confusion matrix...
Confusion matrix for Model C (Localized ATL, ORD, JFK):
   expected_label  FALSE  TRUE
0           False  91234     0
1            True    235     0
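
The evaluation cell derives its EVAL query by calling `.replace(...)` on `SEGMENTED_DATA_SQL`. If the marker text ever changes (extra whitespace, for example), `str.replace` returns the string unchanged and the model would silently be evaluated on TRAIN rows. A small guard, sketched here with the hypothetical helper name `with_split`, would make that failure loud:

```python
def with_split(sql: str, split_name: str) -> str:
    """Return `sql` with its single TRAIN filter switched to `split_name`."""
    if split_name not in {"TRAIN", "EVAL"}:
        raise ValueError(f"unknown split: {split_name!r}")
    marker = "WHERE s.split='TRAIN'"
    # Refuse to continue unless the marker is present exactly once.
    if sql.count(marker) != 1:
        raise ValueError("expected exactly one TRAIN filter in the query")
    return sql.replace(marker, f"WHERE s.split='{split_name}'")


# Stand-in query text; in the notebook the argument would be SEGMENTED_DATA_SQL.
demo_sql = "SELECT * FROM split AS s\nWHERE s.split='TRAIN'"
print(with_split(demo_sql, "EVAL"))
```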


## Evaluate Model B on Segment (ATL, ORD, JFK)

### Subtask:
Evaluate the existing Model B (global engineered model) using `ML.EVALUATE` on the *same* combined ATL, ORD, JFK segmented evaluation data. Calculate AUC and a confusion matrix at a 0.5 threshold for direct comparison.


**Reasoning**:
To evaluate Model B on the segmented data, I need to construct SQL queries for `ML.EVALUATE` and `ML.CONFUSION_MATRIX`, using the `MODEL_B_ENGINEERED` and the `EVAL` split of the `SEGMENTED_DATA_SQL`, then execute them and display the results.



In [95]:
EVALUATE_MODEL_B_ATL_ORD_JFK_SQL = f'''
SELECT
  *
FROM
  ML.EVALUATE(MODEL `{MODEL_B_ENGINEERED}`,
    (
      {SEGMENTED_DATA_SQL.strip().replace("WHERE s.split='TRAIN'", "WHERE s.split='EVAL'")}
    )
  );
'''

print("Evaluating Model B (Global Engineered) metrics on segmented data...")
eval_b_atl_ord_jfk_job = bq.query(EVALUATE_MODEL_B_ATL_ORD_JFK_SQL)
eval_b_atl_ord_jfk_results = eval_b_atl_ord_jfk_job.result().to_dataframe()
print("Evaluation metrics for Model B (Global Engineered) on segmented data:")
print(eval_b_atl_ord_jfk_results)


CONFUSION_MATRIX_MODEL_B_ATL_ORD_JFK_SQL = f'''
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `{MODEL_B_ENGINEERED}`,
    (
      {SEGMENTED_DATA_SQL.strip().replace("WHERE s.split='TRAIN'", "WHERE s.split='EVAL'")}
    )
  );
'''

print("Evaluating Model B (Global Engineered) confusion matrix on segmented data...")
confusion_matrix_b_atl_ord_jfk_job = bq.query(CONFUSION_MATRIX_MODEL_B_ATL_ORD_JFK_SQL)
confusion_matrix_b_atl_ord_jfk_results = confusion_matrix_b_atl_ord_jfk_job.result().to_dataframe()
print("Confusion matrix for Model B (Global Engineered) on segmented data:")
print(confusion_matrix_b_atl_ord_jfk_results)

Evaluating Model B (Global Engineered) metrics on segmented data...
Evaluation metrics for Model B (Global Engineered) on segmented data:
   precision    recall  accuracy  f1_score  log_loss   roc_auc
0        0.5  0.004739    0.9977   0.00939  0.015695  0.779182
Evaluating Model B (Global Engineered) confusion matrix on segmented data...
Confusion matrix for Model B (Global Engineered) on segmented data:
   expected_label  FALSE  TRUE
0           False  91093     1
1            True    199     0


## Compare Model C vs Model B (on ATL, ORD, JFK segment)

### Subtask:
Compare the AUC, calibration metrics (if applicable from ML.EVALUATE output), and confusion matrices of Model C and Model B when both are evaluated on the identical combined ATL, ORD, JFK segment of data. Analyze observed changes to assess if specialization improves calibration.


**Reasoning**:
To compare the models, I will first extract and print the AUC scores for Model B and Model C from their respective evaluation results on the ATL, ORD, JFK segment to facilitate direct comparison.



In [96]:
print(f"AUC for Model C (Localized ATL, ORD, JFK): {eval_c_atl_ord_jfk_results['roc_auc'].iloc[0]:.4f}")
print(f"AUC for Model B (Global Engineered) on segmented data: {eval_b_atl_ord_jfk_results['roc_auc'].iloc[0]:.4f}")

AUC for Model C (Localized ATL, ORD, JFK): 0.7666
AUC for Model B (Global Engineered) on segmented data: 0.7792


**Reasoning**:
To compare the confusion matrices, I will display the `confusion_matrix_c_atl_ord_jfk_results` and `confusion_matrix_b_atl_ord_jfk_results` DataFrames.



In [97]:
print("Confusion matrix for Model C (Localized ATL, ORD, JFK) on segmented data:")
print(confusion_matrix_c_atl_ord_jfk_results)

print("\nConfusion matrix for Model B (Global Engineered) on segmented data:")
print(confusion_matrix_b_atl_ord_jfk_results)

Confusion matrix for Model C (Localized ATL, ORD, JFK) on segmented data:
   expected_label  FALSE  TRUE
0           False  91234     0
1            True    235     0

Confusion matrix for Model B (Global Engineered) on segmented data:
   expected_label  FALSE  TRUE
0           False  91093     1
1            True    199     0
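
Before comparing the two matrices, it helps to check that they describe the same rows. The sketch below recomputes totals, prevalence, and recall from the counts printed above; the dictionary layout and the `summarize` helper are illustrative only.

```python
# Counts copied from the two confusion matrices printed above (threshold 0.5).
matrices = {
    "Model C": {"tn": 91234, "fp": 0, "fn": 235, "tp": 0},
    "Model B": {"tn": 91093, "fp": 1, "fn": 199, "tp": 0},
}


def summarize(cm: dict) -> dict:
    """Derive row totals, positive prevalence, and recall from one matrix."""
    total = cm["tn"] + cm["fp"] + cm["fn"] + cm["tp"]
    positives = cm["fn"] + cm["tp"]
    recall = cm["tp"] / positives if positives else float("nan")
    return {
        "total": total,
        "positives": positives,
        "prevalence": positives / total,
        "recall": recall,
    }


summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, s in summary.items():
    print(f"{name}: rows={s['total']}, diverted={s['positives']}, "
          f"prevalence={s['prevalence']:.4%}, recall={s['recall']:.3f}")
```

The totals are 91,469 rows for Model C and 91,293 for Model B, so the two matrices were computed on different evaluation samples.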


### Analysis of Model C vs Model B on ATL, ORD, JFK Segment

**AUC Comparison:**
*   **Model C (Localized ATL, ORD, JFK):** AUC = 0.7666
*   **Model B (Global Engineered) on segmented data:** AUC = 0.7792

On AUC, the global Model B scores slightly higher (0.7792) than the localized Model C (0.7666) on the ATL, ORD, JFK segment, a gap of about 0.013. Either the global model already generalizes well to this segment, or the segment does not contain enough distinct patterns for a specialized model to help. The gap is small enough that it should be read cautiously: the split uses `RAND()`, so each evaluation query draws its own EVAL sample.

**Confusion Matrix Comparison (threshold 0.5):**

**Model C (Localized ATL, ORD, JFK):**
```
   expected_label  FALSE  TRUE
0           False  91234     0
1            True    235     0
```
**Model B (Global Engineered) on segmented data:**
```
   expected_label  FALSE  TRUE
0           False  91093     1
1            True    199     0
```

Let's break down the confusion matrices:

*   **Model C:**
    *   True Negatives (FALSE, predicted FALSE): 91234
    *   False Positives (FALSE, predicted TRUE): 0
    *   False Negatives (TRUE, predicted FALSE): 235
    *   True Positives (TRUE, predicted TRUE): 0
    
*   **Model B:**
    *   True Negatives (FALSE, predicted FALSE): 91093
    *   False Positives (FALSE, predicted TRUE): 1
    *   False Negatives (TRUE, predicted FALSE): 199
    *   True Positives (TRUE, predicted TRUE): 0

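At the 0.5 threshold neither model identifies a single diverted flight: both confusion matrices show 0 True Positives, so every actual diversion in these samples is a False Negative. Diversions are rare (roughly 0.2–0.3% of the evaluated flights), so this is a highly imbalanced problem and 0.5 is almost certainly too high a threshold; the accuracy of about 0.998 mostly reflects the majority class. The raw counts should not be compared cell by cell, because the two matrices cover different rows: Model C's matrix totals 91,469 flights (235 diverted) and Model B's totals 91,293 (199 diverted). The split is drawn with `RAND()`, which is re-evaluated in every query, so each `ML.EVALUATE` and `ML.CONFUSION_MATRIX` call sees a different EVAL sample. That also explains why `ML.EVALUATE` reports precision 0.5 and recall of about 0.005 (at least one true positive) while the confusion matrices show none. The only visible difference at this threshold is Model B's single False Positive.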
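
One possible response to the class imbalance noted above is BigQuery ML's `AUTO_CLASS_WEIGHTS` option for logistic regression, which weights classes inversely to their frequency during training. The sketch below only builds the `OPTIONS` text; the helper name `logistic_reg_options` is hypothetical, and whether class weighting actually helps on this dataset has not been tested here.

```python
def logistic_reg_options(label_col: str = "diverted", auto_class_weights: bool = True) -> str:
    """Build an OPTIONS clause for a BQML logistic regression model."""
    lines = [
        "MODEL_TYPE='LOGISTIC_REG'",
        f"INPUT_LABEL_COLS=['{label_col}']",
    ]
    if auto_class_weights:
        # Assumption: balancing classes is desirable for the rare `diverted` label.
        lines.append("AUTO_CLASS_WEIGHTS=TRUE")
    return "OPTIONS (\n  " + ",\n  ".join(lines) + "\n)"


print(logistic_reg_options())
```

Class weighting changes the predicted probabilities, so any decision threshold would need to be chosen again after retraining.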

**Observed Changes and Calibration Assessment:**

Based on these results, specializing on the ATL, ORD, JFK segment (Model C) does not show a clear improvement in calibration or overall predictive performance at the default 0.5 threshold. Model B (global) has the slightly higher AUC, and the log loss values, the closest proxy for calibration in the `ML.EVALUATE` output, are nearly identical (0.0155 for Model C vs 0.0157 for Model B). Both models fail to flag diverted flights at the current decision threshold, which points to class imbalance handling, threshold tuning, or further feature engineering as next steps. Two caveats apply to every number above: the evaluation samples differ between queries, and because `RAND()` is re-drawn each time, the EVAL rows are not guaranteed to be disjoint from the rows each model was trained on, so the metrics are likely optimistic. Re-running both evaluations on one fixed held-out set would make the comparison reliable.
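
The threshold tuning suggested above could be sketched as follows. `ML.CONFUSION_MATRIX` accepts a `STRUCT(... AS threshold)` argument and `ML.ROC_CURVE` accepts an array of thresholds; the helper names, the stand-in evaluation query, and the toy ROC rows are all hypothetical, and the minimum-recall rule is just one possible ops criterion.

```python
def confusion_matrix_sql(model_path: str, eval_query: str, threshold: float) -> str:
    """SQL for a confusion matrix at a custom decision threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    return f'''
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{model_path}`,
  ({eval_query}),
  STRUCT({threshold} AS threshold)
)
'''


def roc_curve_sql(model_path: str, eval_query: str, start: float, end: float, step: float) -> str:
    """SQL that tabulates recall and false positive rate over a threshold grid."""
    return f'''
SELECT threshold, recall, false_positive_rate
FROM ML.ROC_CURVE(
  MODEL `{model_path}`,
  ({eval_query}),
  GENERATE_ARRAY({start}, {end}, {step})
)
ORDER BY threshold
'''


def pick_threshold(roc_rows: list, min_recall: float) -> float:
    """Return the highest threshold whose recall still meets `min_recall`."""
    eligible = [row["threshold"] for row in roc_rows if row["recall"] >= min_recall]
    if not eligible:
        raise ValueError("no threshold reaches the requested recall")
    return max(eligible)


# Toy rows for illustration only; real rows would come from running roc_curve_sql.
toy_roc_rows = [
    {"threshold": 0.001, "recall": 0.9, "false_positive_rate": 0.40},
    {"threshold": 0.005, "recall": 0.6, "false_positive_rate": 0.10},
    {"threshold": 0.02, "recall": 0.2, "false_positive_rate": 0.01},
]
chosen = pick_threshold(toy_roc_rows, min_recall=0.5)

# "eval_rows" is a stand-in; in the notebook this would be the EVAL query text.
print(confusion_matrix_sql("mgmt467-471819.Flights.model_b_engineered",
                           "SELECT * FROM eval_rows", chosen))
```

Picking the highest threshold that still meets a recall floor keeps false alarms as low as possible for the required catch rate, which is one way to frame the ops justification.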