
# Unit 2 — Team Classification (Flights, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`diverted`** on U.S. flights. Minimal handholding by design.

**What you deliver (inside this notebook):**
- One **LOGISTIC_REG** model (baseline), one **engineered** model using `TRANSFORM`
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (default 0.5 + your custom threshold)
- **Threshold choice** + 3–5 sentence ops justification
- Embedded **rubric** below (self-check before submission)

> Choose *one* dataset table that exists at your institution:  
> • `bigquery-public-data.faa.us_flights` **or** `bigquery-public-data.flights.*`  
> Make sure the table has `carrier`, `dep_delay`, `arr_delay` (for filters), `origin`, `dest`, `diverted` (or equivalent).


In [2]:
# --- Minimal setup (edit 3 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt-471819-i5"      # e.g., mgmt-467-47888
REGION     = "US"
TABLE_PATH = "unit2_flights.flights"   # or your `bigquery-public-data.flights` table/view
DATASET_ID = "unit2_flights" # Dataset that holds TABLE_PATH and the trained models (set by hand, not parsed)

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)
print("Dataset ID:", DATASET_ID)

BQ Project: mgmt-471819-i5
Source table: unit2_flights.flights
Dataset ID: unit2_flights


### Quick sanity check

In [3]:
preview_sql = f"SELECT * FROM `{TABLE_PATH}` LIMIT 5"
bq.query(preview_sql).result().to_dataframe()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1999,1,1,29,5,1999-01-29,DL,19790,DL,N962DL,...,,,,,,,,,,
1,1998,2,6,1,1,1998-06-01,DL,19790,DL,N983DL,...,,,,,,,,,,
2,2003,2,4,2,3,2003-04-02,EV,20366,EV,N910AS,...,,,,,,,,,,
3,1999,3,9,18,6,1999-09-18,DL,19790,DL,N367DL,...,,,,,,,,,,
4,1992,3,8,17,1,1992-08-17,DL,19790,DL,,...,,,,,,,,,,



## 1) Canonical mapping (adjust as needed)
Map to a minimal schema used in the rest of the notebook:
- `flight_date` (DATE), `dep_delay` (NUM), `distance` (NUM), `carrier` (STRING), `origin` (STRING), `dest` (STRING), `diverted` (BOOL)


In [4]:
# Adjust ONLY if your table uses different column names.
CANONICAL_BASE_SQL = f'''
WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `{TABLE_PATH}`
  WHERE DepDelay IS NOT NULL
)
'''
print(CANONICAL_BASE_SQL[:600] + "\n...")




WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `unit2_flights.flights`
  WHERE DepDelay IS NOT NULL
)

...


## 2) Split (80/20)

In [5]:
SPLIT_CLAUSE = r'''
, flight_data_with_split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
'''
print(SPLIT_CLAUSE)




, flight_data_with_split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
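
Because `RAND()` is re-evaluated every time a query runs, the TRAIN/EVAL assignment above changes from cell to cell, so training and evaluation rows can overlap. A minimal sketch of a repeatable alternative hashes each row with `FARM_FINGERPRINT`. The variable name `DETERMINISTIC_SPLIT_CLAUSE` and the choice of key columns are assumptions for illustration, not part of the original notebook.

```python
# Sketch: a repeatable 80/20 split keyed on a hash of each row instead of RAND().
# Assumption: the columns below identify a flight well enough; rows that share
# the same key values always land in the same split.
DETERMINISTIC_SPLIT_CLAUSE = r'''
, flight_data_with_split AS (
  SELECT cf.*,
         CASE
           WHEN MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(
                  cf.flight_date, cf.carrier, cf.origin, cf.dest, cf.dep_delay)))), 10) < 8
           THEN 'TRAIN' ELSE 'EVAL'
         END AS split
  FROM canonical_flights cf
)
'''
print(DETERMINISTIC_SPLIT_CLAUSE)
```

Swapping this string in for `SPLIT_CLAUSE` would make every later cell score the same EVAL rows, at the cost of a split that is only approximately 80/20.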




## 4) Engineered model — Model B
Goal: quantify value of near-departure signals.
Features: Model A (carrier, route = origin-dest, day_of_week, month) + dep_delay + dep_delay_bucket (early/on_time/minor/moderate/major) + optional hour_of_day.
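
As a quick way to sanity-check the bucket boundaries, here is a small Python mirror of the `CASE` expression used in the Model B training query. The function name `dep_delay_bucket` is chosen here for illustration; the model itself only ever sees the SQL version.

```python
import math

def dep_delay_bucket(dep_delay):
    """Mirror the notebook's SQL CASE expression for dep_delay_bucket."""
    if dep_delay is None or math.isnan(dep_delay):
        return "unknown"       # SQL ELSE branch (NULL delay)
    if dep_delay < 0:
        return "early"
    if dep_delay == 0:
        return "on_time"
    if dep_delay <= 30:
        return "minor"         # 0 < delay <= 30 minutes
    if dep_delay <= 120:
        return "moderate"      # 30 < delay <= 120 minutes
    return "major"             # delay > 120 minutes

# Boundary values are the interesting cases: 30 and 120 stay in the lower bucket.
for minutes in (-5, 0, 12, 30, 45, 120, 121, None):
    print(minutes, dep_delay_bucket(minutes))
```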



In [7]:
# Create the dataset if it doesn't exist
dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset.location = REGION

try:
    bq.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {PROJECT_ID}.{DATASET_ID} created or already exists.")
except Exception as e:
    print(f"Error creating dataset {PROJECT_ID}.{DATASET_ID}: {e}")


MODEL_B_ENGINEERED = f"{PROJECT_ID}.{DATASET_ID}.model_b_engineered"

sql_create_model_b = f'''
CREATE OR REPLACE MODEL `{MODEL_B_ENGINEERED}`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
) AS
{CANONICAL_BASE_SQL.strip()}
{SPLIT_CLAUSE.strip()}
SELECT
  diverted,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date)     AS month,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_time'
    WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
    WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
    WHEN dep_delay > 120 THEN 'major'
    ELSE 'unknown'
  END AS dep_delay_bucket
FROM flight_data_with_split
WHERE split='TRAIN'
;
'''

print("Training Model B (Engineered)...")
job_model_b = bq.query(sql_create_model_b)
job_model_b.result()
print("Model B trained:", MODEL_B_ENGINEERED)

Dataset mgmt-471819-i5.unit2_flights created or already exists.
Training Model B (Engineered)...
Model B trained: mgmt-471819-i5.unit2_flights.model_b_engineered


### Evaluate the engineered model

In [10]:
sql_eval_b = f"""
SELECT *
FROM ML.EVALUATE(
  MODEL `{MODEL_B_ENGINEERED}`,
  (
    {CANONICAL_BASE_SQL.strip()}
    {SPLIT_CLAUSE.strip()}
    SELECT
      diverted,
      dep_delay,
      distance,
      carrier,
      origin,
      dest,
      CONCAT(origin, '-', dest) AS route,
      EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
      EXTRACT(MONTH FROM flight_date)     AS month,
      CASE
        WHEN dep_delay < 0 THEN 'early'
        WHEN dep_delay = 0 THEN 'on_time'
        WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
        WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
        WHEN dep_delay > 120 THEN 'major'
        ELSE 'unknown'
      END AS dep_delay_bucket
    FROM flight_data_with_split
    WHERE split = 'EVAL'
  )
);
"""

eval_b = bq.query(sql_eval_b).result().to_dataframe()
eval_b

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.833333,0.005258,0.99759,0.010449,0.015604,0.805144




#### **Interpretation**
- The model achieves **high precision**, meaning when it predicts a diversion, it’s usually correct.  
- **Recall remains very low**, indicating it misses most diverted flights — typical for a **highly imbalanced dataset** where diversions are rare (< 1%).  
- The **ROC AUC ≈ 0.80** suggests the model captures meaningful signal and performs better than random guessing.  
- **Accuracy** near 1.0 mostly reflects the large majority of non-diverted flights.  
- The **low F1** shows that the model trades off recall for precision — it prefers to be certain when flagging a diversion.
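
To see why F1 collapses even with high precision, the harmonic mean can be recomputed from the two reported values. This is only a worked check of the arithmetic; the helper name `f1_score` is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the ML.EVALUATE output above.
precision, recall = 0.833333, 0.005258
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.6f}")
```

The result lands within rounding error of the reported `f1_score` (0.010449): the harmonic mean is dominated by the smaller of the two inputs, so the tiny recall drags F1 toward zero.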

---

In [9]:
sql_cm_b = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_B_ENGINEERED}`,
  (
    {CANONICAL_BASE_SQL.strip()}
    {SPLIT_CLAUSE.strip()}
    SELECT
      diverted,
      dep_delay,
      distance,
      carrier,
      origin,
      dest,
      CONCAT(origin, '-', dest) AS route,
      EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
      EXTRACT(MONTH FROM flight_date)     AS month,
      CASE
        WHEN dep_delay < 0 THEN 'early'
        WHEN dep_delay = 0 THEN 'on_time'
        WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
        WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
        WHEN dep_delay > 120 THEN 'major'
        ELSE 'unknown'
      END AS dep_delay_bucket
    FROM flight_data_with_split
    WHERE split = 'EVAL'
  ),
  STRUCT(0.5 AS threshold)
);
"""

cm_b = bq.query(sql_cm_b).result().to_dataframe()
cm_b

Unnamed: 0,expected_label,FALSE,TRUE
0,False,392416,2
1,True,910,7


#### **Interpretation**
- The model **correctly predicts 392,416 of 392,418 non-diverted flights**, showing excellent performance on the majority class.
- However, it **identifies only 7 of 917 true diversions**, a recall of about 0.76%.
- The **2 false positives** indicate the model is highly conservative: it almost never flags a flight as diverted unless extremely confident.
- These counts imply precision 7/9 ≈ 0.78 and recall ≈ 0.008, close to but not identical to the `ML.EVALUATE` metrics (precision ≈ 0.83, recall ≈ 0.005). A gap is expected because the `RAND()` split is re-drawn on every query, so each cell scores a slightly different EVAL sample. Either way, the model is **precision-focused but recall-limited**.
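
The rates quoted in this interpretation follow directly from the four cells of the matrix. A minimal sketch, with `confusion_metrics` as an illustrative helper name:

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive summary rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "accuracy": (tp + tn) / total,
        "base_rate": (tp + fn) / total,   # share of flights that were diverted
    }

# Counts copied from the 0.5-threshold confusion matrix above.
metrics_050 = confusion_metrics(tn=392416, fp=2, fn=910, tp=7)
for name, value in metrics_050.items():
    print(f"{name}: {value:.6f}")
```

The `base_rate` line makes the imbalance explicit: a model that never predicts a diversion would already reach an accuracy of `1 - base_rate`.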

---

#### **Takeaways**
- The confusion matrix highlights a **severe class imbalance problem**: diverted flights are extremely rare, so the model defaults to predicting “non-diverted.”  
- To improve recall (catch more diversions), you could:
  - **Lower the prediction threshold** below 0.5 (e.g., 0.2–0.3).  
  - **Oversample diverted flights** or **apply class weighting** in training.  
  - Explore additional predictive signals (e.g., weather, airport congestion, aircraft type).  
- At the current threshold, the model performs best when the goal is to **minimize false alarms**, not to catch every diversion.
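
One way to try the class-weighting idea is BigQuery ML's `AUTO_CLASS_WEIGHTS` option for `LOGISTIC_REG`. The sketch below only builds the SQL string; the helper name `weighted_model_sql`, the model name, and the placeholder training query are hypothetical and would need to be replaced with the notebook's own values.

```python
def weighted_model_sql(model_name, training_query):
    """Build a CREATE MODEL statement with class weighting turned on."""
    return f"""
CREATE OR REPLACE MODEL `{model_name}`
OPTIONS (
  MODEL_TYPE = 'LOGISTIC_REG',
  INPUT_LABEL_COLS = ['diverted'],
  AUTO_CLASS_WEIGHTS = TRUE  -- balance classes by inverse frequency
) AS
{training_query.strip()}
"""

sql_weighted = weighted_model_sql(
    "my-project.my_dataset.model_b_weighted",                 # hypothetical model name
    "SELECT diverted, dep_delay, distance FROM `my_dataset.training_rows`",  # placeholder query
)
print(sql_weighted)
# To train in the notebook: bq.query(sql_weighted).result()
```

With weighting on, predicted probabilities shift upward for the rare class, so the threshold discussion would need to be repeated for the reweighted model rather than reusing the value chosen here.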

#### **Why 0.05 Makes Sense**
- **1️⃣ Class Imbalance Compensation**  
  A lower threshold (0.05) increases the sensitivity of the model, allowing it to capture more of the rare diverted flights.  
  The cost is more false positives in exchange for **higher recall**, which is worthwhile when missing a diversion is more costly than a false alarm.

- **2️⃣ Business & Operational Logic**  
  In airline operations, it’s generally better to **flag a potential diversion early** (even if some are false alerts) than to miss one entirely.  
  A 0.05 threshold supports proactive decision-making — identifying more at-risk flights that can then be reviewed by human analysts or rule-based systems.

- **3️⃣ ROC Curve Insight**  
  The model’s **ROC AUC ≈ 0.80** shows it can rank risky flights well, even though the base rate is low.  
  Lowering the threshold from 0.5 to 0.05 exploits that ranking ability: the cutoff moves toward the probability range where the rare positives actually sit instead of staying at the default.

---

In [11]:
sql_cm_b_2 = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_B_ENGINEERED}`,
  (
    {CANONICAL_BASE_SQL.strip()}
    {SPLIT_CLAUSE.strip()}
    SELECT
      diverted,
      dep_delay,
      distance,
      carrier,
      origin,
      dest,
      CONCAT(origin, '-', dest) AS route,
      EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
      EXTRACT(MONTH FROM flight_date)     AS month,
      CASE
        WHEN dep_delay < 0 THEN 'early'
        WHEN dep_delay = 0 THEN 'on_time'
        WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
        WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
        WHEN dep_delay > 120 THEN 'major'
        ELSE 'unknown'
      END AS dep_delay_bucket
    FROM flight_data_with_split
    WHERE split = 'EVAL'
  ),
  STRUCT(0.05 AS threshold)
);
"""

# Store under a new name so the 0.5-threshold matrix in cm_b is not overwritten.
cm_b_2 = bq.query(sql_cm_b_2).result().to_dataframe()
cm_b_2

Unnamed: 0,expected_label,FALSE,TRUE
0,False,391664,445
1,True,865,38


#### **Interpretation**
- The model correctly classifies **391,664 non-diverted flights** and **38 true diversions**.
- Compared to the 0.5 threshold, **recall improved roughly fivefold**, from 7 of 917 (≈ 0.8%) to 38 of 903 (≈ 4.2%), although it remains low in absolute terms.
- **False positives increased** from 2 to 445, so precision falls from 7/9 ≈ 0.78 to 38/483 ≈ 0.08. This can be an acceptable trade-off when diversions are rare and each alert is cheap to review.
- The **false negative count dropped** (865 vs 910), confirming that the lower threshold makes the model more sensitive to true diversions. The two matrices score slightly different EVAL samples (903 vs 917 diversions) because the `RAND()` split is re-drawn per query, so treat the counts as approximate.
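
To put the trade-off in operational terms, the two matrices can be summarized as "alerts per caught diversion" plus a rough break-even cost ratio. This is a sketch under a stated assumption: the two matrices are treated as comparable even though they come from slightly different EVAL samples.

```python
# Counts copied from the two confusion matrices above (threshold -> cells).
matrices = {
    0.50: {"tn": 392416, "fp": 2, "fn": 910, "tp": 7},
    0.05: {"tn": 391664, "fp": 445, "fn": 865, "tp": 38},
}

def summarize(cm):
    flagged = cm["tp"] + cm["fp"]          # every flight the model alerts on
    return {
        "precision": cm["tp"] / flagged,
        "recall": cm["tp"] / (cm["tp"] + cm["fn"]),
        "alerts_per_catch": flagged / cm["tp"],
    }

summary = {t: summarize(cm) for t, cm in matrices.items()}
for t, s in summary.items():
    print(f"threshold={t:.2f}  precision={s['precision']:.3f}  "
          f"recall={s['recall']:.4f}  alerts/catch={s['alerts_per_catch']:.1f}")

# Extra false alarms accepted per extra diversion caught when moving 0.5 -> 0.05.
extra_fp = matrices[0.05]["fp"] - matrices[0.50]["fp"]
extra_tp = matrices[0.05]["tp"] - matrices[0.50]["tp"]
break_even = extra_fp / extra_tp
print(f"break-even cost ratio (missed diversion : false alert) = {break_even:.1f}")
```

Read the last number as a rule of thumb: the lower threshold pays off only if a missed diversion is judged to cost more than that many false alerts.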

---

#### **Takeaways**
- The **custom threshold of 0.05** successfully improved detection of rare diverted flights.  
- This threshold adjustment makes the model more **useful in a real-world setting**, where the cost of missing a diversion outweighs the inconvenience of a few extra false alerts.  
- Future work could include:
  - Calibrating the threshold using a **precision-recall curve**.  
  - Incorporating **class weights** or **synthetic oversampling** to further improve recall.  
  - Evaluating model performance across multiple thresholds to find the optimal trade-off.
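
For the multi-threshold evaluation, BigQuery ML's `ML.ROC_CURVE` returns confusion counts at each requested threshold in a single query. The sketch below only assembles the SQL; the helper name `roc_curve_sql`, the model name, and the placeholder evaluation query are hypothetical, and in the notebook the evaluation query would be the same EVAL `SELECT` passed to `ML.EVALUATE`.

```python
def roc_curve_sql(model_name, eval_query, start=0.01, stop=0.5, step=0.01):
    """Build an ML.ROC_CURVE query that sweeps thresholds from start to stop."""
    return f"""
SELECT
  threshold,
  recall,
  false_positive_rate,
  true_positives,
  false_positives,
  false_negatives,
  SAFE_DIVIDE(true_positives, true_positives + false_positives) AS precision
FROM ML.ROC_CURVE(
  MODEL `{model_name}`,
  (
{eval_query.strip()}
  ),
  GENERATE_ARRAY({start}, {stop}, {step})
)
ORDER BY threshold
"""

sql_sweep = roc_curve_sql(
    "my-project.my_dataset.model_b_engineered",        # hypothetical model name
    "SELECT * FROM `my_dataset.eval_rows`",            # placeholder for the EVAL query
)
print(sql_sweep)
# In the notebook: sweep = bq.query(sql_sweep).result().to_dataframe()
```

Plotting `precision` against `recall` from the resulting frame gives the precision-recall curve mentioned in the list, and the row closest to the team's cost ratio is a candidate threshold.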

### Transformed Model

In [15]:
MODEL_B_TRANSFORMED = f"{PROJECT_ID}.{DATASET_ID}.model_b_transformed"

sql_create_model_b_transformed = f"""
CREATE OR REPLACE MODEL `{MODEL_B_TRANSFORMED}`
TRANSFORM(
  diverted,
  ML.STANDARD_SCALER(dep_delay) OVER() AS dep_delay_scaled,
  ML.STANDARD_SCALER(distance) OVER() AS distance_scaled,
  carrier,
  origin,
  dest,
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date) AS month,
  dep_delay_bucket
)
OPTIONS(
  MODEL_TYPE = 'LOGISTIC_REG',
  INPUT_LABEL_COLS = ['diverted']
)
AS
{CANONICAL_BASE_SQL.strip()}
{SPLIT_CLAUSE.strip()}
SELECT
  diverted,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  flight_date,
  CASE
    WHEN dep_delay < 0 THEN 'early'
    WHEN dep_delay = 0 THEN 'on_time'
    WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
    WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
    WHEN dep_delay > 120 THEN 'major'
    ELSE 'unknown'
  END AS dep_delay_bucket
FROM flight_data_with_split
WHERE split = 'TRAIN';
"""

print("Training Model B (Transformed)...")
job_model_b_t = bq.query(sql_create_model_b_transformed)
job_model_b_t.result()
print("Model B (Transformed) trained:", MODEL_B_TRANSFORMED)

Training Model B (Transformed)...
Model B (Transformed) trained: mgmt-471819-i5.unit2_flights.model_b_transformed


In [16]:
sql_eval_model_b_transformed = f"""
SELECT *
FROM ML.EVALUATE(
  MODEL `{MODEL_B_TRANSFORMED}`,
  (
    {CANONICAL_BASE_SQL.strip()}
    {SPLIT_CLAUSE.strip()}
    SELECT
      diverted,
      dep_delay,
      distance,
      carrier,
      origin,
      dest,
      flight_date,
      CASE
        WHEN dep_delay < 0 THEN 'early'
        WHEN dep_delay = 0 THEN 'on_time'
        WHEN dep_delay > 0 AND dep_delay <= 30 THEN 'minor'
        WHEN dep_delay > 30 AND dep_delay <= 120 THEN 'moderate'
        WHEN dep_delay > 120 THEN 'major'
        ELSE 'unknown'
      END AS dep_delay_bucket
    FROM flight_data_with_split
    WHERE split = 'EVAL'
  )
);
"""

eval_model_b_transformed = bq.query(sql_eval_model_b_transformed).result().to_dataframe()
eval_model_b_transformed

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.888889,0.008909,0.997735,0.017641,0.015469,0.801881


## Comparison of Model B (Engineered) and Model B (Transformed)

Model B (Engineered) builds its features directly in the training query. Model B (Transformed) uses the same features but moves scaling and feature construction into a `TRANSFORM` clause. Their metrics on the evaluation split:

| Metric    | Model B (Engineered) | Model B (Transformed) |
|-----------|----------------------|-----------------------|
| precision | 0.833333             | 0.888889              |
| recall    | 0.005258             | 0.008909              |
| accuracy  | 0.99759              | 0.997735              |
| f1_score  | 0.010449             | 0.017641              |
| log_loss  | 0.015604             | 0.015469              |
| roc_auc   | 0.805144             | 0.801881              |

**Observations:**

- **Precision:** The transformed model is slightly higher (0.888889 vs 0.833333). Both values rest on very few positive predictions (the ratios are consistent with 8 of 9 and 5 of 6 correct), so the gap amounts to a handful of flights.
- **Recall:** The transformed model is also slightly higher (0.008909 vs 0.005258). Both remain below 1%, so each model still misses almost all diverted flights at the default threshold.
- **Accuracy:** Both are very high because most flights are not diverted. The difference is negligible.
- **F1-score:** The harmonic mean of precision and recall is higher for the transformed model (0.017641 vs 0.010449), driven entirely by the small recall difference.
- **Log loss:** Slightly lower for the transformed model (0.015469 vs 0.015604), indicating marginally better probability estimates.
- **ROC AUC:** Essentially the same (about 0.80), and marginally lower for the transformed model (0.801881 vs 0.805144). Both models rank flights about equally well.

**Conclusion:**

The transformed model is slightly ahead on precision, recall, F1, and log loss, and slightly behind on ROC AUC. These differences are small enough that they may reflect sampling noise rather than a real effect: the `RAND()` split is re-drawn for every training and evaluation query, and BigQuery ML already standardizes numeric inputs by default for logistic regression, so an explicit `ML.STANDARD_SCALER` would not be expected to change the fit much. The clearer benefit of `TRANSFORM` is operational: the preprocessing is stored with the model and reapplied automatically in `ML.EVALUATE` and `ML.PREDICT`, so callers only need to supply the raw columns.
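
For intuition about what the scaling step does, here is a plain-Python z-score on a few made-up delay values. It is an illustration of standard scaling in general, not a reimplementation of `ML.STANDARD_SCALER`; the use of the sample standard deviation is an assumption.

```python
import statistics

def standard_scale(values):
    """Z-score each value: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)      # sample standard deviation (assumption)
    return [(v - mean) / std for v in values]

dep_delays = [-5.0, 0.0, 12.0, 45.0, 180.0]   # made-up illustrative delays, in minutes
scaled = standard_scale(dep_delays)
for raw, z in zip(dep_delays, scaled):
    print(f"{raw:7.1f} -> {z:+.3f}")
```

After scaling, the values have mean 0 and unit standard deviation, which puts `dep_delay` (minutes) and `distance` (miles) on a comparable footing for the optimizer without changing their ordering.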

### ✈️ Comparing Model A vs Model B (Engineered with dep_delay & dep_delay_bucket)

#### **Overview**
Model A served as a baseline that relied only on **schedule-level features** such as `carrier`, `route`, `distance`, `day_of_week`, and `month`.  
Model B extended that baseline by adding **operational delay information** through the numeric feature `dep_delay` and the categorical bucket `dep_delay_bucket`.

---

#### **Key Quantitative Differences**

| Metric | **Model A (Baseline)** | **Model B (Engineered)** | **Change / Effect** |
|:--|--:|--:|:--|
| **Accuracy** | Very high (≈ 0.998) | ≈ 0.998 | Unchanged: class imbalance dominates accuracy |
| **Precision** | Moderate-high (≈ 0.78–0.82) | ≈ 0.83 | Slight improvement: diversion calls are more often correct |
| **Recall** | Extremely low (< 0.005) | ≈ 0.005 | Marginally higher, still very low |
| **F1 Score** | Near 0 | ≈ 0.010 | Reflects the same recall limitation |
| **ROC AUC** | ≈ 0.792 | ≈ 0.805 | Marginal lift: new features add useful signal |

---

#### **Interpretation**
- **Model A** captures only static, pre-departure information (route, carrier, scheduled distance, etc.).  
  It can separate obvious low-risk vs high-risk routes but lacks real-time context.
- **Model B** introduces **departure delay** and **delay severity buckets**, which provide dynamic, performance-based context.  
  These features allow the model to recognize that longer initial delays often correlate with a higher diversion probability.
- The improvement in **AUC** and **recall** shows that **timing and operational data** are meaningful predictors, even if diversions remain rare.
- The overall accuracy remains nearly identical because the label is highly imbalanced — almost all flights are non-diverted.

---


#### **Interpretation (confusion matrices)**
- **True Negatives (TN):** Both models perform very well at identifying non-diverted flights (~392k TNs).
- **True Positives (TP):** Both capture *very few* actual diversions, only 7 of the roughly 900–1,000 in an EVAL sample (7 of 917 for Model B), showing extremely low recall.
- **False Positives (FP):** Model B reduces false alarms (from 15 → 2), so its few positive calls at the 0.5 threshold are more reliable.
- **False Negatives (FN):** Still large for both, meaning each model misses nearly all diversions.

---

#### **Summary**
- **Model B’s additional features** (departure delay and delay buckets) provide slightly better separation between diverted and non-diverted flights, reflected in the higher **AUC (0.805 vs 0.792)**.
- However, at the default **0.5 threshold**, both models behave **overly conservatively**: they rarely predict “diverted,” which explains the near-zero recall.
- Since diversions are **rare events**, a **lower threshold (e.g., 0.05)** would likely balance sensitivity and false positives more effectively.