# Flight Prediction Machine Learning Estimator Experiments 4-5
This notebook fits random forest and neural network models and prints evaluation metrics: F-beta (with beta = 2) and AUC. It works with the cleaned 2015-2018 training dataset, which has been downsampled to correct for class imbalance. It also selects neural network architecture #3 as the final model and uses that model to predict the held-out test dataset.

Our final model shows marked improvement over the baseline, owing to a more predictive set of features and downsampling of the majority class; it predicts the full 2015-2018 training set with an F-beta of 63%. We then used this model, built on the training set, to predict the unseen, held-out 2019 data, which is the best and final assessment of the model's predictive power. There it achieved a middling F-beta score of 54%. In its current form, the model's F-beta is insufficient to provide a useful tool for industry. This suggests that our features are not rich enough to represent the complex set of factors that cause flight delays. The report's conclusion discusses several directions for future work to improve this predictive tool.
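
For reference, the F-beta metric quoted above combines precision and recall, and beta = 2 weights recall more heavily than precision. The following is a minimal sketch in plain Python; the function name `fbeta` and the precision/recall values are illustrative assumptions, not figures from this project.

```python
def fbeta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision and recall, chosen only to show the effect of beta.
p, r = 0.40, 0.70
print(round(fbeta(p, r, beta=1.0), 4))  # → 0.5091
print(round(fbeta(p, r, beta=2.0), 4))  # → 0.6087
```

Because recall (0.70) is higher than precision (0.40) in this toy case, moving from beta = 1 to beta = 2 raises the score. This is why F2 suits a setting where missing a delayed flight is considered costlier than a false alarm.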

![Pipeline Image](https://i.imgur.com/wq62T0E.png)

### Project Description
This is a group project conducted for course w261: Machine Learning at Scale at the University of California, Berkeley, in Summer 2023. The project develops a machine learning model that predicts flight delays in the United States from five years (2015-2019) of historical flight, airport station, and weather data.

### Group members
Jessica Stockham, Chase Madison, Kisha Kim, Eric Danforth

Citation: Code written by Jessica Stockham

In [0]:
import numpy as np
import re
import pandas as pd
from collections import namedtuple
from datetime import datetime, timedelta, date
import holidays

from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import Window

from pyspark.sql.functions import udf, col,isnan,when,count
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler,StandardScaler, Imputer, Bucketizer
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
# BinaryClassificationMetrics is used for the AUC calculations below
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics

import xgboost as xgb
from xgboost.spark import SparkXGBClassifier

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.mllib.regression import LabeledPoint
from lightgbm import LGBMClassifier

from hyperopt import fmin, tpe, Trials, SparkTrials, hp, space_eval
import mlflow



In [0]:
## Place this cell in any team notebook that needs access to the team cloud storage
mids261_mount_path = '/mnt/mids-w261'  # 261 course blob storage is mounted here
secret_scope = 'sec5-team1-scope'  # Name of the secret scope Chase created in Databricks CLI
secret_key = 'sec5-team1-key'  # Name of the secret key Chase created in Databricks CLI
storage_account = 'sec5team1storage'  # Name of the Azure Storage Account Chase created
blob_container = 'sec5-team1-container'  # Name of the container Chase created in Azure Storage Account
team_blob_url = f'wasbs://{blob_container}@{storage_account}.blob.core.windows.net'  # Points to the root of your team storage bucket
spark.conf.set(  # SAS Token: Grant the team limited access to Azure Storage resources
  f'fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net',
  dbutils.secrets.get(scope=secret_scope, key=secret_key)
)

In [0]:
##### LOAD 60 MONTH DATASET ##########
timeInterval = '60mo'

# TRAIN DATASET: 2015-2018
fold_name_clean = 'train_clean_downsampled'
train = spark.read.format("parquet")\
    .option("path", (f"{team_blob_url}/{fold_name_clean}/rapid" + timeInterval))\
    .load().cache()

# TEST DATASET: 2019
fold_name_clean = 'test_clean_downsampled'
test = spark.read.format("parquet")\
    .option("path", (f"{team_blob_url}/{fold_name_clean}/rapid" + timeInterval))\
    .load().cache()

# Result List to Hold the Trained Models and Resulting Fscore
results = []
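
The datasets loaded above were downsampled upstream, and that step is not shown in this notebook. As a rough illustration of what majority-class downsampling does, the sketch below keeps every minority-class row and an equal-sized random sample of the majority class. It uses plain Python on toy data rather than Spark, and `downsample_majority`, its parameters, and the toy rows are all hypothetical; the team's actual procedure may differ.

```python
import random
from collections import Counter


def downsample_majority(rows, label_key="label", seed=42):
    """Keep all rows of the smallest class and an equal-sized random sample of every other class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    if not by_label:
        return []
    minority_size = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        if len(group) > minority_size:
            group = rng.sample(group, minority_size)
        balanced.extend(group)
    rng.shuffle(balanced)
    return balanced


# Toy data: 80 on-time flights (label 0.0) and 20 delayed flights (label 1.0).
toy = [{"label": 0.0, "id": i} for i in range(80)]
toy += [{"label": 1.0, "id": i} for i in range(80, 100)]

balanced = downsample_majority(toy)
print(Counter(row["label"] for row in balanced))  # both classes end up with 20 rows
```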

In [0]:
print(f'Job Start: {datetime.now()}')

with mlflow.start_run():
    estimator = RandomForestClassifier(featuresCol = 'features'
                                , labelCol = 'label'
                                , maxDepth = 12
                                , numTrees = 140
                                )    
  
    model = estimator.fit(train)

    print(f'Model built: {datetime.now()}')

    # Predict TRAIN
    pred_train = model.transform(train).cache()

    print(f'Predicted Train: {datetime.now()}')

    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="fMeasureByLabel", beta=2.0, metricLabel=1.0)
    fmeasure = evaluator.evaluate(pred_train, {evaluator.metricLabel: 1.0})
    print(fmeasure)

    results.append(['rf', model, fmeasure])

    # Area under ROC curve
    # Note: the scores passed here are the hard 0/1 predictions, not class
    # probabilities, so the ROC curve has a single operating point.
    pred_train_rdd = pred_train.select('prediction', 'label').rdd
    metrics = BinaryClassificationMetrics(pred_train_rdd)
    auc = metrics.areaUnderROC
    print("Area under ROC = %s" % auc)

    # Log Model and Metrics
    mlflow.spark.log_model(model, "rf_model")
    mlflow.log_metric("FULLTRAIN_rf_fbeta", fmeasure)
    mlflow.log_metric("FULLTRAIN_rf_AUC", auc)  # log the AUC itself, not the f-beta value

# End prior mlflow run
mlflow.end_run()

# Save prediction df on training to blob
fold_name_clean = 'final_train_results_downsampled'
pred_train.write.format("parquet").mode("overwrite")\
    .option("path", (f"{team_blob_url}/{fold_name_clean}/rf_train_pred_df" + timeInterval))\
    .save()

pred_train.unpersist()

Job Start: 2023-08-10 20:43:04.550172
Model built: 2023-08-10 21:07:12.287360
Predicted Train: 2023-08-10 21:07:12.403078
0.6038293179995597




Area under ROC = 0.7195318696947894


2023/08/10 21:11:48 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().


DataFrame[label: double, DISTANCE: double, ELEVATION: double, FE_PRIOR_DAILY_AVG_DEP_DELAY: double, FE_PRIOR_AVG_DURATION: double, FE_NUM_FLIGHT_SCHEDULED: bigint, DEP_DELAY_LAG: double, DAY_OF_WEEK: int, MONTH: int, YEAR: int, OP_UNIQUE_CARRIER: string, origin_type: string, dest_type: string, is_holiday_double: double, is_holiday_adjacent_double: double, IS_FIRST_FLIGHT_OF_DAY_double: double, DATE: timestamp, FL_DATE: date, OP_CARRIER_FL_NUM: int, DEP_DELAY: double, AIR_TIME: double, DEP_TIME_BLK: string, origin_iata_code: string, dest_iata_code: string, TAIL_NUM: string, sched_depart_date_time_UTC: timestamp, CRS_DEP_TIME: int, ORIGIN: string, DEST: string, CRS_DEP_BUCKET: double, DAY_OF_WEEK_ix: double, MONTH_ix: double, YEAR_ix: double, OP_UNIQUE_CARRIER_ix: double, origin_type_ix: double, dest_type_ix: double, is_holiday_double_ix: double, is_holiday_adjacent_double_ix: double, IS_FIRST_FLIGHT_OF_DAY_double_ix: double, CRS_DEP_BUCKET_ix: double, DAY_OF_WEEK_hot: vector, MONTH_hot:

In [0]:
# Count the input features from the metadata of the assembled 'features' vector
# (binary one-hot attributes plus numeric attributes); this sets the input layer size
attrs = train.schema["features"].metadata["ml_attr"]["attrs"]
num_features = len(attrs["binary"] + attrs["numeric"])
print(num_features)

with mlflow.start_run():

    # Estimator
    estimator = MultilayerPerceptronClassifier(layers=[num_features, 8, 4, 2], seed=42, labelCol='label', featuresCol='features')

    model = estimator.fit(train)

    print(f'Model built: {datetime.now()}')

    # Predict TRAIN
    pred_train = model.transform(train).cache()

    print(f'Predicted Train: {datetime.now()}')

    # f-beta
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="fMeasureByLabel", beta=2.0, metricLabel=1.0)
    fmeasure = evaluator.evaluate(pred_train, {evaluator.metricLabel: 1.0})
    print(fmeasure)

    results.append(['nn', model, fmeasure])

    # AUC (computed from the hard 0/1 predictions, not class probabilities)
    pred_train_rdd = pred_train.select('prediction', 'label').rdd
    metrics = BinaryClassificationMetrics(pred_train_rdd)
    auc = metrics.areaUnderROC
    print("Area under ROC = %s" % auc)

    # Log Model and Metrics
    mlflow.spark.log_model(model, "nn_model")
    mlflow.log_metric("FULLTRAIN_nn_fbeta", fmeasure)
    mlflow.log_metric("FULLTRAIN_nn_AUC", auc)  # log the AUC itself, not the f-beta value

# End prior mlflow run
mlflow.end_run()

# Save prediction df on training to blob
fold_name_clean = 'final_train_results'
pred_train.write.format("parquet").mode("overwrite")\
    .option("path", (f"{team_blob_url}/{fold_name_clean}/nn_train_pred_df" + timeInterval))\
    .save()

pred_train.unpersist()

81
Model built: 2023-08-10 21:38:05.744856
Predicted Train: 2023-08-10 21:38:06.051885
0.6307053558466028


2023/08/10 21:39:21 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().


DataFrame[label: double, DISTANCE: double, ELEVATION: double, FE_PRIOR_DAILY_AVG_DEP_DELAY: double, FE_PRIOR_AVG_DURATION: double, FE_NUM_FLIGHT_SCHEDULED: bigint, DEP_DELAY_LAG: double, DAY_OF_WEEK: int, MONTH: int, YEAR: int, OP_UNIQUE_CARRIER: string, origin_type: string, dest_type: string, is_holiday_double: double, is_holiday_adjacent_double: double, IS_FIRST_FLIGHT_OF_DAY_double: double, DATE: timestamp, FL_DATE: date, OP_CARRIER_FL_NUM: int, DEP_DELAY: double, AIR_TIME: double, DEP_TIME_BLK: string, origin_iata_code: string, dest_iata_code: string, TAIL_NUM: string, sched_depart_date_time_UTC: timestamp, CRS_DEP_TIME: int, ORIGIN: string, DEST: string, CRS_DEP_BUCKET: double, DAY_OF_WEEK_ix: double, MONTH_ix: double, YEAR_ix: double, OP_UNIQUE_CARRIER_ix: double, origin_type_ix: double, dest_type_ix: double, is_holiday_double_ix: double, is_holiday_adjacent_double_ix: double, IS_FIRST_FLIGHT_OF_DAY_double_ix: double, CRS_DEP_BUCKET_ix: double, DAY_OF_WEEK_hot: vector, MONTH_hot:
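
The `layers=[num_features, 8, 4, 2]` argument in the cell above describes a fully connected network: 81 inputs (the feature count printed in the output), hidden layers of 8 and 4 units, and 2 output classes. As a rough sanity check on model size, the sketch below counts the trainable parameters, assuming every layer is dense with one bias term per output unit. `mlp_param_count` is a hypothetical helper written for this illustration.

```python
def mlp_param_count(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    # Each pair of adjacent layers contributes n_in * n_out weights and n_out biases.
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers = [81, 8, 4, 2]  # 81 is the feature count printed by the cell above
print(mlp_param_count(layers))  # → 702
```

Under that assumption the network has only a few hundred parameters, which is very small relative to a multi-year flight dataset.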

In [0]:
# Area under ROC curve
pred_train_rdd=pred_train.select('prediction', 'label').rdd
metrics = BinaryClassificationMetrics(pred_train_rdd)
print("Area under ROC = %s" % metrics.areaUnderROC)


Area under ROC = 0.7179599425810118


In [0]:
# Predict TEST with model from Experiment 5 (Neural Network Architecture #3)
pred_test = model.transform(test).cache()

print(f'Predicted TEST: {datetime.now()}')

# f-beta
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="fMeasureByLabel", beta=2.0, metricLabel=1.0)
fmeasure = evaluator.evaluate(pred_test, {evaluator.metricLabel: 1.0})
print(fmeasure)

results.append(['nn', model, fmeasure])

# Save prediction df on heldout to blob
timeInterval = '60mo'
fold_name_clean = 'final_heldout_results_downsampled'
pred_test.write.format("parquet").mode("overwrite")\
    .option("path", (f"{team_blob_url}/{fold_name_clean}/nn_heldout_pred_df" + timeInterval))\
    .save()

# Release the cache only after the predictions have been written
pred_test.unpersist()

Predicted TEST: 2023-08-10 21:47:14.764984
0.5438489280208696


In [0]:
# AUC
pred_test_rdd=pred_test.select('prediction', 'label').rdd
metrics = BinaryClassificationMetrics(pred_test_rdd)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under ROC = 0.7093227111384619
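
One caveat when reading the AUC figures in this notebook: the `BinaryClassificationMetrics` calls receive the hard 0/1 `prediction` column rather than a continuous score. With a single operating point, the area under the piecewise-linear ROC curve reduces to the mean of the true positive rate and the true negative rate (balanced accuracy). The sketch below illustrates this in plain Python on hypothetical counts; `hard_prediction_auc` is an illustrative helper, not part of Spark.

```python
def hard_prediction_auc(pairs):
    """Return (trapezoid ROC area, balanced accuracy) for (prediction, label) pairs of 0.0/1.0."""
    tp = fp = tn = fn = 0
    for pred, label in pairs:
        if label == 1.0:
            if pred == 1.0:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 1.0:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate
    # ROC curve through (0, 0) -> (fpr, tpr) -> (1, 1), integrated with trapezoids.
    area = 0.5 * fpr * tpr + 0.5 * (1.0 - fpr) * (tpr + 1.0)
    balanced_accuracy = (tpr + (1.0 - fpr)) / 2.0
    return area, balanced_accuracy


# Hypothetical confusion counts: 6 TP, 4 FN, 2 FP, 8 TN.
toy_pairs = [(1.0, 1.0)] * 6 + [(0.0, 1.0)] * 4 + [(1.0, 0.0)] * 2 + [(0.0, 0.0)] * 8
area, bal_acc = hard_prediction_auc(toy_pairs)
print(round(area, 4), round(bal_acc, 4))  # → 0.7 0.7
```

A threshold-independent AUC would need a continuous score per row, for example the positive-class probability, which `pyspark.ml.evaluation.BinaryClassificationEvaluator` can consume through its `rawPredictionCol` parameter.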
