<a href="https://colab.research.google.com/github/mnpoliakov/MGMT467_Team7/blob/main/Unit2_Flights/Individual/Louis/Unit2_Louis_BQML_Flights_Classification_Model_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Unit 2 — Team Classification (Flights, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`diverted`** on U.S. flights. Minimal handholding by design.

**What you deliver (inside this notebook):**
- One **LOGISTIC_REG** model (baseline), one **engineered** model using `TRANSFORM`
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (default 0.5 + your custom threshold)
- **Threshold choice** + 3–5 sentence ops justification
- Embedded **rubric** below (self-check before submission)

> Choose *one* dataset table that exists at your institution:  
> • `bigquery-public-data.faa.us_flights` **or** `bigquery-public-data.flights.*`  
> Make sure the table has `carrier`, `dep_delay`, `arr_delay` (for filters), `origin`, `dest`, `diverted` (or equivalent).


In [41]:
# --- Minimal setup (edit 3 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt467-71800"      # e.g., mgmt-467-47888
REGION     = "us-central1"   # Changed from "US" to "us-central1" to align with error message
TABLE_PATH = "carrier_on_time_performance_dataset.airline_data"   # or your `bigquery-public-data.flights` table/view

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)
print("Region:", REGION)

BQ Project: mgmt467-71800
Source table: carrier_on_time_performance_dataset.airline_data
Region: us-central1


### Quick sanity check

In [None]:

preview_sql = f"SELECT * FROM `{TABLE_PATH}` LIMIT 5"
bq.query(preview_sql).result().to_dataframe()


Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1999,1,1,3,7,1999-01-03,DL,19790,DL,N911DL,...,,,,,,,,,,
1,1992,4,12,5,6,1992-12-05,DL,19790,DL,,...,,,,,,,,,,
2,1989,4,12,10,7,1989-12-10,EA,19707,EA,,...,,,,,,,,,,
3,2007,2,6,20,3,2007-06-20,EV,20366,EV,N826AS,...,,,,,,,,,,
4,2007,4,10,2,2,2007-10-02,EV,20366,EV,N828AS,...,,,,,,,,,,



## 3) Baseline model — LOGISTIC_REG (`diverted`)
Use **only** a small set of signals for the baseline (keep it honest).


In [50]:
# Adjust ONLY if your table uses different column names.
CANONICAL_BASE_SQL = f'''
WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `{TABLE_PATH}`
  WHERE DepDelay IS NOT NULL
)
'''
print(CANONICAL_BASE_SQL[:600] + "\n...")


WITH canonical_flights AS (
  SELECT
    CAST(FlightDate AS DATE) AS flight_date,
    CAST(DepDelay AS FLOAT64) AS dep_delay,
    CAST(Distance  AS FLOAT64) AS distance,
    CAST(Reporting_Airline   AS STRING)  AS carrier,
    CAST(Origin    AS STRING)  AS origin,
    CAST(Dest AS STRING) AS dest,
    CAST((CASE WHEN SAFE_CAST(Diverted AS INT64)=1 OR LOWER(CAST(Diverted AS STRING))='true' THEN TRUE ELSE FALSE END) AS BOOL) AS diverted
  FROM `carrier_on_time_performance_dataset.airline_data`
  WHERE DepDelay IS NOT NULL
)

...


In [55]:
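# NOTE: RAND() is re-evaluated in every query that embeds this clause, so the
# TRAIN/EVAL assignment differs between the training, evaluation, and prediction queries.
SPLIT_CLAUSE = r'''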
, flight_data_with_split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
'''
print(SPLIT_CLAUSE)


, flight_data_with_split AS (
  SELECT cf.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM canonical_flights cf
)
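
Because `RAND()` is evaluated afresh in every query, the clause above assigns rows to `TRAIN` and `EVAL` differently each time it runs. A minimal sketch of a repeatable alternative, assuming a hash of the whole row is an acceptable split key (`DETERMINISTIC_SPLIT_CLAUSE` is a hypothetical name, not part of the original notebook):

```python
# Hypothetical drop-in alternative to SPLIT_CLAUSE (not used elsewhere in this notebook).
# FARM_FINGERPRINT hashes the JSON form of each row, so the same row always
# lands in the same split no matter which query embeds the clause.
DETERMINISTIC_SPLIT_CLAUSE = r'''
, flight_data_with_split AS (
  SELECT cf.*,
         CASE
           WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(cf)), 10)) < 8 THEN 'TRAIN'
           ELSE 'EVAL'
         END AS split
  FROM canonical_flights cf
)
'''
print(DETERMINISTIC_SPLIT_CLAUSE)
```

The hash buckets 0 to 7 go to `TRAIN`, which gives roughly an 80/20 split. Rows that are identical in every column hash to the same bucket and therefore share a split.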



In [76]:
MODEL_A_PREDEPARTURE = f"{PROJECT_ID}.carrier_on_time_performance_dataset.model_a_pre_departure"

# SQL for Model A (Pre-departure baseline - schedule-level features only)
sql_create_model_a = f'''
CREATE OR REPLACE MODEL `{MODEL_A_PREDEPARTURE}`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
) AS
{CANONICAL_BASE_SQL.strip()}
{SPLIT_CLAUSE.strip()}
SELECT
  diverted,
  carrier,
  CONCAT(origin, '-', dest) AS route,
  distance,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date) AS month
FROM flight_data_with_split
WHERE split='TRAIN'
;
'''

print("Training Model A (Pre-departure baseline)...")
job_model_a = bq.query(sql_create_model_a)
job_model_a.result() # Wait for the model training to complete
print("Model A trained:", MODEL_A_PREDEPARTURE)

Training Model A (Pre-departure baseline)...
Model A trained: mgmt467-71800.carrier_on_time_performance_dataset.model_a_pre_departure


In [77]:
# Evaluate Model A
sql_evaluate_model_a = f'''
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT * FROM ML.EVALUATE(
  MODEL `{MODEL_A_PREDEPARTURE}`,
  (SELECT
     diverted,
     carrier,
     CONCAT(origin, '-', dest) AS route,
     distance,
     EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
     EXTRACT(MONTH FROM flight_date) AS month
   FROM flight_data_with_split WHERE split='EVAL')
);
'''
print("Evaluating Model A...")
job_evaluate_model_a = bq.query(sql_evaluate_model_a)
eval_model_a_df = job_evaluate_model_a.result().to_dataframe()
print("Model A evaluation (AUC/log_loss):")
print(eval_model_a_df)

Evaluating Model A...
Model A evaluation (AUC/log_loss):
   precision    recall  accuracy  f1_score  log_loss   roc_auc
0   0.172414  0.005495  0.997636   0.01065  0.015863  0.791935


### Confusion matrix — default 0.5 threshold

In [78]:
# Confusion Matrix for Model A @ 0.5 threshold
sql_confusion_matrix_model_a = f'''
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT * FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_A_PREDEPARTURE}`,
  (SELECT
     diverted,
     carrier,
     CONCAT(origin, '-', dest) AS route,
     distance,
     EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
     EXTRACT(MONTH FROM flight_date) AS month
   FROM flight_data_with_split WHERE split='EVAL')
);
'''
print("Generating confusion matrix for Model A (default 0.5 threshold)...")
job_confusion_model_a = bq.query(sql_confusion_matrix_model_a)
confusion_model_a_df = job_confusion_model_a.result().to_dataframe()
print("Confusion Matrix for Model A (Default 0.5 Threshold):")
print(confusion_model_a_df)

# Using the display_confusion_matrix function for better readability
import pandas as pd
def display_confusion_matrix(df, title):
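    # Pivot long-format input (actual_label, predicted_label, count) into a 2x2 grid.
    # ML.CONFUSION_MATRIX returns a wide table (expected_label plus one column per
    # predicted label), which is already a grid and is printed unchanged by the else branch.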
    if 'predicted_label' in df.columns and 'actual_label' in df.columns:
        matrix = df.pivot(index='actual_label', columns='predicted_label', values='count').fillna(0)
        print(f"\n{title}:")
        print(matrix)
    else:
        print(f"\n{title} (raw data):\n{df}")

display_confusion_matrix(confusion_model_a_df, "Confusion Matrix for Model A (Default 0.5 Threshold)")

Generating confusion matrix for Model A (default 0.5 threshold)...
Confusion Matrix for Model A (Default 0.5 Threshold):
   expected_label   FALSE  TRUE
0           False  391908    15
1            True     938     7

Confusion Matrix for Model A (Default 0.5 Threshold) (raw data):
   expected_label   FALSE  TRUE
0           False  391908    15
1            True     938     7


### Confusion matrix — your custom threshold

### Reasoning for Custom Threshold of 0.05 (Model A)

The choice of a custom threshold of 0.05 for Model A was driven by an operational need to increase the detection rate of actual flight diversions, even at the cost of accepting more false alarms.

**Operational Context:**
In the context of pre-staging resources (crews, gates, buses, hotel blocks) for potential diversions, the costs associated with False Negatives (missing an actual diversion) can be substantial:

*   **High cost of False Negatives:** Missing a diversion means that resources are not pre-staged, leading to significant delays, passenger inconvenience, missed connections, crew hour violations, and potentially higher ad-hoc costs for last-minute resource allocation. This directly contributes to major operational disruption and brand damage.

Conversely, the costs associated with False Positives (pre-staging resources unnecessarily) are generally lower, though still undesirable:

*   **Moderate cost of False Positives:** Pre-staging resources for a flight that ultimately does not divert involves wasted resource allocation (e.g., a crew on standby, an unused gate, a hotel block that goes empty). While these costs add up, they are typically less disruptive than the scramble to manage an unpredicted diversion.

**Justification for 0.05 Threshold:**
Lowering the threshold to 0.05 raised Model A's True Positives from 7 (at the 0.5 threshold) to 34, so 27 more actual diversions would be flagged early enough to pre-stage resources. False Positives rose from 15 to 362, roughly 13 extra false alarms for each additional diversion caught. The operational decision accepts that volume of lower-cost false alarms in exchange for fewer missed diversions, on the expectation that this reduces the overall disruption cost of being caught unprepared for an actual diversion.
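
The trade-off above can be expressed as a break-even cost ratio. The sketch below is illustrative only: it assumes every caught diversion avoids the full cost of a miss, and it treats the two confusion matrices as comparable. The function name is hypothetical.

```python
def break_even_cost_ratio(tp_old, fp_old, tp_new, fp_new):
    """Return the (cost of a miss) / (cost of a false alarm) ratio above
    which moving to the new threshold reduces total expected cost."""
    extra_catches = tp_new - tp_old
    extra_alarms = fp_new - fp_old
    if extra_catches <= 0:
        raise ValueError("the new threshold must catch more positives")
    return extra_alarms / extra_catches

# Counts taken from the 0.5 and 0.05 confusion matrices in this notebook.
ratio = break_even_cost_ratio(tp_old=7, fp_old=15, tp_new=34, fp_new=362)
print(f"0.05 pays off if a missed diversion costs more than {ratio:.1f}x a false alarm")
```

Under these assumptions, the lower threshold is worthwhile only if a missed diversion costs more than about 13 times as much as an unnecessary pre-staging.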

In [80]:
from sklearn.metrics import confusion_matrix

CUSTOM_THRESHOLD_MODEL_A = 0.05 # Your desired custom threshold for Model A

# 1. Get predictions from Model A
sql_predict_model_a = f'''
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT
  t.diverted,
  (SELECT prob FROM UNNEST(t.predicted_diverted_probs) WHERE label = TRUE) AS predicted_probability
FROM
  ML.PREDICT(MODEL `{MODEL_A_PREDEPARTURE}`,
    (SELECT
       diverted,
       carrier,
       CONCAT(origin, '-', dest) AS route,
       distance,
       EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
       EXTRACT(MONTH FROM flight_date) AS month
     FROM flight_data_with_split WHERE split='EVAL')
  ) AS t
;'''

print(f"Generating predictions for Model A with custom {CUSTOM_THRESHOLD_MODEL_A} threshold...")
job_predict_model_a = bq.query(sql_predict_model_a)
predictions_model_a_df = job_predict_model_a.result().to_dataframe()

# 2. Apply custom threshold to predicted probabilities
predictions_model_a_df['predicted_label_custom'] = predictions_model_a_df['predicted_probability'] > CUSTOM_THRESHOLD_MODEL_A

# 3. Calculate confusion matrix manually using sklearn
actual_labels_model_a = predictions_model_a_df['diverted']
predicted_labels_custom_model_a = predictions_model_a_df['predicted_label_custom']

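# Pin the label order so the result is always a 2x2 matrix in [False, True] order.
conf_matrix_custom_model_a = confusion_matrix(actual_labels_model_a, predicted_labels_custom_model_a, labels=[False, True])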

# Convert to a DataFrame for consistent display
confusion_df_custom_model_a = pd.DataFrame(
    conf_matrix_custom_model_a,
    index=pd.Index([False, True], name='actual_label'),
    columns=pd.Index([False, True], name='predicted_label')
).stack().reset_index(name='count')

print(f"Confusion Matrix for Model A (Custom {CUSTOM_THRESHOLD_MODEL_A} Threshold):")
display(confusion_df_custom_model_a)
display_confusion_matrix(confusion_df_custom_model_a, f"Confusion Matrix for Model A (Custom {CUSTOM_THRESHOLD_MODEL_A} Threshold)")

Generating predictions for Model A with custom 0.05 threshold...
Confusion Matrix for Model A (Custom 0.05 Threshold):


Unnamed: 0,actual_label,predicted_label,count
0,False,False,390587
1,False,True,362
2,True,False,941
3,True,True,34



Confusion Matrix for Model A (Custom 0.05 Threshold):
predicted_label   False  True 
actual_label                  
False            390587    362
True                941     34
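
As an alternative to pulling every prediction into pandas, `ML.CONFUSION_MATRIX` accepts an optional threshold struct as a third argument. The sketch below only builds the SQL text; the helper name and its parameters are hypothetical, and the query is assumed to be run with the same `bq` client as the other cells.

```python
def build_threshold_confusion_sql(model_path, eval_query, threshold):
    """Build a BigQuery ML confusion-matrix query at a custom threshold.

    model_path: fully qualified model name (assumed to exist already)
    eval_query: SELECT statement returning the label and feature columns
    threshold:  probability cutoff for predicting the positive class
    """
    return f'''
SELECT * FROM ML.CONFUSION_MATRIX(
  MODEL `{model_path}`,
  ({eval_query}),
  STRUCT({threshold} AS threshold)
)
'''

example_sql = build_threshold_confusion_sql(
    "my-project.my_dataset.my_model",            # placeholder model path
    "SELECT * FROM my_dataset.my_eval_table",    # placeholder evaluation query
    0.05,
)
print(example_sql)
```

In this notebook the evaluation query would be the same `EVAL` subquery passed to `ML.PREDICT`, prefixed by the canonical and split CTEs.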


### Model A Interpretation (Pre-departure baseline - global)

Model A was built with schedule-level features (`carrier`, `route`, `distance`, `day_of_week`, `month`) to provide an early signal for diversion risk. Its performance metrics are as follows:

*   **ROC AUC (0.791935):** This is a clear improvement over the initial generic baseline model, indicating that these pre-departure features carry real signal for ranking flights by diversion risk.
*   **Log Loss (0.015863):** The value is low in absolute terms, but that is largely a consequence of class imbalance. With roughly 0.24% of flights diverted, a model that always predicts the base rate would score about 0.017, so the model improves on that reference only modestly. Log loss alone does not show that the predicted probabilities are well calibrated.
*   **Precision (0.172414), Recall (0.005495), F1-Score (0.01065) at Default 0.5 Threshold:** At the default 0.5 threshold, precision and recall for the minority class (diverted flights) are both very low. The confusion matrix below shows 7 correctly identified diversions and 15 false positives; its counts do not reproduce these exact metrics because each query re-draws the `RAND()` split and therefore scores a different EVAL sample. Either way, the model ranks flights reasonably well (high AUC), but the 0.5 threshold is far too high for detecting a rare event in an imbalanced dataset.

**Calibration Note:** A low `log_loss` is consistent with reasonable probability estimates but does not establish calibration by itself. Checking calibration means comparing predicted probabilities with observed diversion rates, for example by grouping flights into probability buckets. If calibration holds, about 1 in 100 flights given a 1% predicted risk would divert, which is more useful for operational decisions than a hard classification at an arbitrary threshold.
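
To put the log loss in context, it helps to compare it with a predictor that always outputs the base rate. The sketch below computes that reference value from the counts in the 0.5-threshold confusion matrix; the function name is hypothetical.

```python
import math

def base_rate_log_loss(positives, total):
    """Log loss of a constant predictor that always outputs the base rate."""
    p = positives / total
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Counts from the default-threshold confusion matrix: TN, FP, FN, TP.
tn, fp, fn, tp = 391908, 15, 938, 7
reference = base_rate_log_loss(positives=fn + tp, total=tn + fp + fn + tp)
print(f"Base-rate log loss: {reference:.4f}  vs. Model A: 0.015863")
```

A model whose log loss sits only slightly below this reference is extracting limited probability information beyond the base rate, even if its ranking (AUC) is useful.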

**Confusion Matrix for Model A (Default 0.5 Threshold):**
|                    | Predicted: Not Diverted | Predicted: Diverted |
| :----------------- | :---------------------- | :------------------ |
| **Actual: Not Diverted** | 391,908                 | 15                  |
| **Actual: Diverted**     | 938                     | 7                   |

This model establishes a more robust 'no-real-time' baseline, demonstrating the potential to identify diversion risks using only information available pre-departure. Further threshold tuning would be necessary to balance false positives and false negatives based on operational costs.

### Confusion Matrix Interpretation for Model A (Custom 0.05 Threshold)

**Confusion Matrix for Model A (Custom 0.05 Threshold):**
|                   | Predicted: Not Diverted | Predicted: Diverted |
| :---------------- | :---------------------- | :------------------ |
| **Actual: Not Diverted** | 390,587                 | 362                 |
| **Actual: Diverted**     | 941                     | 34                  |

By lowering the prediction threshold for Model A to 0.05, we observe the following changes compared to the default 0.5 threshold:

*   **Increased True Positives (from 7 to 34):** The model now catches 34 actual diverted flights, although that is still only about 3.5% of the 975 diversions in this sample.
*   **Increased False Positives (from 15 to 362):** The gain comes at a cost: the number of flights incorrectly predicted as diverted (false alarms) rises substantially, so more resources would be pre-staged unnecessarily.
*   **False Negatives essentially unchanged (938 vs. 941):** On a fixed evaluation set, lowering the threshold can only reduce False Negatives. The small increase here arises because the split uses `RAND()` and is re-drawn for each query, so the two matrices score different EVAL samples (945 vs. 975 actual diversions). Either way, most diversions are still missed at this threshold.
*   **Decreased True Negatives (from 391,908 to 390,587):** Part of the drop is non-diverted flights reclassified as false positives (347 more); the rest reflects the different EVAL sample sizes (392,868 vs. 391,924 flights).

**Operational Implications:**

Choosing a 0.05 threshold for Model A would mean 362 false alarms for 34 actual diversions identified, about 10.6 false alarms per correct alert (precision of roughly 8.6%). Identifying 34 diversions early is valuable for pre-staging resources, but the false alarms carry a real cost (wasted resources, operational re-planning, potential passenger inconvenience). With 941 False Negatives, roughly 96.5% of actual diversions would still occur without pre-staged resources.

This threshold selection highlights the critical trade-off between minimizing missed diversions (False Negatives) and avoiding costly false alarms (False Positives). An acceptable threshold needs to be determined based on the specific costs and benefits of each outcome for the airline's operations.
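
The figures behind this trade-off can be derived directly from the confusion-matrix counts. A small sketch follows; the helper name is hypothetical and the counts are copied from the two matrices above.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive alerting metrics from raw confusion-matrix counts."""
    flagged = tp + fp
    actual_positives = tp + fn
    return {
        "precision": tp / flagged if flagged else 0.0,
        "recall": tp / actual_positives if actual_positives else 0.0,
        "false_alarms_per_catch": fp / tp if tp else float("inf"),
    }

default_summary = summarize_confusion(tn=391908, fp=15, fn=938, tp=7)
custom_summary = summarize_confusion(tn=390587, fp=362, fn=941, tp=34)

for name, summary in [("0.5", default_summary), ("0.05", custom_summary)]:
    print(name, {k: round(v, 4) for k, v in summary.items()})
```

Reporting false alarms per catch alongside recall gives operations staff a direct sense of how many unnecessary pre-staging actions each correct alert would cost.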


## 4) Engineered model — `TRANSFORM` (same label, stricter bar)
Create **route**, extract **day_of_week**, and **bucketize dep_delay**. Compare metrics to baseline.
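
A `TRANSFORM` clause that includes the bucketized departure delay could look like the sketch below. The helper name and the split points `[0, 15, 60, 120]` (minutes) are illustrative assumptions, not values taken from this notebook, and `dep_delay` is only known once a flight has left the gate, so this variant is not a pre-departure model.

```python
def build_bucketized_model_sql(model_path, source_sql):
    """Build a CREATE MODEL statement whose TRANSFORM bucketizes dep_delay.

    model_path: fully qualified name for the new model (hypothetical)
    source_sql: the WITH clauses that define flight_data_with_split
    """
    return f'''
CREATE OR REPLACE MODEL `{model_path}`
TRANSFORM (
  diverted,
  carrier,
  CONCAT(origin, '-', dest) AS route,
  distance,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  -- Split points are illustrative; choose them from the data distribution.
  ML.BUCKETIZE(dep_delay, [0, 15, 60, 120]) AS dep_delay_bucket
)
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['diverted']) AS
{source_sql.strip()}
SELECT * FROM flight_data_with_split WHERE split='TRAIN'
'''

demo_sql = build_bucketized_model_sql(
    "my-project.my_dataset.diverted_bucketized",   # placeholder model path
    "WITH flight_data_with_split AS (SELECT 1)",   # placeholder source CTEs
)
print(demo_sql)
```

In this notebook, `source_sql` would be `CANONICAL_BASE_SQL` followed by `SPLIT_CLAUSE`. Because the transformation lives inside the model, evaluation and prediction queries only need to supply the raw columns.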


In [82]:
MODEL_XFORM = f"{PROJECT_ID}.carrier_on_time_performance_dataset.diverted_xform"

# 1. Create the engineered model
sql_create_xform_model = f'''
CREATE OR REPLACE MODEL `{MODEL_XFORM}`
TRANSFORM (
  diverted,
  carrier,
  CONCAT(origin, '-', dest) AS route,
  distance,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  EXTRACT(MONTH FROM flight_date) AS month
)
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['diverted']) AS
{CANONICAL_BASE_SQL.strip()}
{SPLIT_CLAUSE.strip()}
SELECT * FROM flight_data_with_split WHERE split='TRAIN'
;
'''

print("Training engineered model...")
job_create_xform = bq.query(sql_create_xform_model)
job_create_xform.result() # Wait for model creation
print("Engineered model trained:", MODEL_XFORM)

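# 2. Evaluate both models
# NOTE: MODEL_BASE is not defined in any cell shown here; it must already hold
# the fully qualified name of a trained baseline model before this cell runs.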
sql_evaluate_xform = f'''
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT 'baseline' AS model_version, * FROM ML.EVALUATE(
  MODEL `{MODEL_BASE}`,
  (SELECT
     diverted,
     dep_delay, distance, carrier, origin, dest,
     EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week
   FROM flight_data_with_split WHERE split='EVAL')
)
UNION ALL
SELECT 'engineered' AS model_version, * FROM ML.EVALUATE(
  MODEL `{MODEL_XFORM}`,
  (SELECT
    diverted,
    carrier,
    origin,
    dest,
    distance,
    flight_date
  FROM flight_data_with_split WHERE split='EVAL')
);
'''

print("Evaluating engineered model...")
job_evaluate_xform = bq.query(sql_evaluate_xform)
eval_xform_df = job_evaluate_xform.result().to_dataframe()
print("Model evaluation comparison:")
print(eval_xform_df)

Training engineered model...
Engineered model trained: mgmt467-71800.carrier_on_time_performance_dataset.diverted_xform
Evaluating engineered model...
Model evaluation comparison:
  model_version  precision    recall  accuracy  f1_score  log_loss   roc_auc
0      baseline        0.0  0.000000  0.997649  0.000000  0.015901  0.709653
1    engineered        1.0  0.003268  0.997672  0.006515  0.015708  0.787681


### Model Evaluation Comparison Interpretation

Here's an interpretation of the comparison table showing the evaluation metrics for the baseline and the engineered models:

**Model Evaluation Comparison:**
```
  model_version  precision    recall  accuracy  f1_score  log_loss   roc_auc
0      baseline        0.0  0.000000  0.997649  0.000000  0.015901  0.709653
1    engineered        1.0  0.003268  0.997672  0.006515  0.015708  0.787681
```

**Interpretation:**

1.  **ROC AUC (Area Under the Receiver Operating Characteristic Curve):** This is the most informative metric for comparing these models on an imbalanced dataset. The engineered model achieved an `ROC AUC` of **0.787681**, about 0.08 above the baseline model's `0.709653`. The engineered model uses `carrier`, `route`, `distance`, `day_of_week`, and `month`, whereas the baseline's evaluation query supplies `dep_delay`, `origin`, and `dest` instead of `route` and `month`. The engineered feature set therefore discriminates better between diverted and non-diverted flights without relying on departure delay. A higher AUC means the model more often ranks a diverted flight above a non-diverted one, regardless of the classification threshold.

2.  **Log Loss:** The `log_loss` for the engineered model is **0.015708**, slightly lower than the baseline's `0.015901`. The difference of about 0.0002 suggests the engineered model's predicted probabilities are marginally closer to the true outcomes, but it is too small to demonstrate better calibration by itself. Probability quality matters for operational decisions where the estimated diversion risk is more valuable than a binary classification.

3.  **Precision, Recall, and F1-Score (at Default 0.5 Threshold):**
    *   **Baseline Model:** All are `0.0`. This means at the default 0.5 threshold, the baseline model failed to correctly identify *any* actual diverted flights (0 True Positives). It essentially predicted 'not diverted' for all instances due to the overwhelming majority class.
    *   **Engineered Model:** `precision` is `1.0`, `recall` is `0.003268`, and `f1_score` is `0.006515`. A precision of `1.0` means every flight flagged as 'diverted' at the 0.5 threshold did divert, but a recall of `0.003268` (about 1 in 306 actual diversions) shows that only a handful of flights were flagged, so the perfect precision rests on very few predictions. The default 0.5 threshold is still too high for detecting the minority class in this imbalanced dataset, even with improved features.

4.  **Accuracy:** Both models show very high accuracy (around `0.997`). This metric is misleading on a highly imbalanced dataset: because fewer than 0.3% of flights divert, a model that always predicts 'not diverted' would score about the same.

**Conclusion:**

The engineered model, leveraging schedule-level features, shows better discriminatory power than the simple baseline (higher ROC AUC and slightly lower log loss); that is, it ranks flights by diversion risk more reliably. However, neither model is effective at identifying the rare 'diverted' class at the default 0.5 threshold, because of the severe class imbalance. Threshold tuning is needed to balance catching more actual diversions (higher recall) against limiting false alarms (higher precision), based on operational needs.
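
Threshold tuning of this kind can be sketched as a simple sweep over candidate cutoffs. The example below uses a tiny made-up sample rather than the notebook's predictions, and the function name is hypothetical; with real data, the labels and probabilities would come from the `ML.PREDICT` output.

```python
def sweep_thresholds(y_true, probs, thresholds):
    """Return TP/FP/FN, precision, and recall for each candidate threshold."""
    results = []
    for t in thresholds:
        preds = [p > t for p in probs]
        tp = sum(1 for y, yhat in zip(y_true, preds) if y and yhat)
        fp = sum(1 for y, yhat in zip(y_true, preds) if not y and yhat)
        fn = sum(1 for y, yhat in zip(y_true, preds) if y and not yhat)
        results.append({
            "threshold": t,
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return results

# Toy labels and predicted probabilities (illustrative only).
toy_labels = [False, False, False, True, False, True]
toy_probs = [0.01, 0.02, 0.06, 0.07, 0.30, 0.60]

sweep = sweep_thresholds(toy_labels, toy_probs, [0.5, 0.05])
for row in sweep:
    print(row)
```

Sweeping on a fixed evaluation set keeps the comparison between thresholds consistent, since every cutoff is scored against the same flights.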