# Will it be delayed?

Everyone who has flown has experienced a delayed or cancelled flight. Both airlines and airports would like to improve their on-time performance and predict when a flight will be delayed or cancelled several days in advance. You are being hired to build a model that can predict if a flight will be delayed. To learn more, you must schedule a meeting with your client (me). To schedule an appointment with your client, send an event request through Google Calendar for a 15-minute meeting. Both you and your project partner must attend the meeting. Come prepared with questions to ask your client. Remember that your client is not a data scientist and you will need to explain things in a way that is easy to understand. Make sure that your communications are efficient, well thought out, and not redundant, as your client might get frustrated and "fire" you. This only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, and you should keep asking questions whenever you need help.

For this project you must go through almost all steps in the checklist. You must write responses for all items as done in the homeworks; however, sometimes the response will simply be "does not apply". Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Some changes to the checklist:

* Do not do the final part (launching the product).
* Your presentation will be done as information written in this document in a dedicated section (no slides or anything like that). It should include a high-level summary of your results (including what you learned about the data, the "accuracy" of your model, what features were important, etc). It should be written for your client, not your professor or teammates. It should include the best summary plots/graphics/data points.
* The models and hyperparameters you should consider during short-listing and fine-tuning will be released at a later time (dependent on how far we get over the next two weeks).
* Data retrieval must be automatic as part of the code (so it can easily be re-run and grab the latest data). Do not commit any data to the repository.
* Your submission must include a pickled final model along with this notebook.
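
The last two requirements (automatic data retrieval and a pickled final model) can be sketched with the standard library alone. This is a minimal sketch, not the required implementation: `fetch_if_missing`, `save_model`, and `load_model` are hypothetical helper names, and the real download URLs for the flight and weather files are not given here, so the URL is left as a parameter.

```python
import pickle
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists, then return `dest`.

    Hypothetical helper: the actual data URLs are an assumption left to the caller.
    """
    dest = Path(dest)
    if dest.exists():
        return dest  # re-runs reuse the local copy instead of downloading again
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)  # expose the file only after the download completed
    return dest


def save_model(model, path):
    """Pickle the final model so it can be submitted with the notebook."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Keeping downloads in a directory that is listed in `.gitignore` is one way to avoid committing data to the repository.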

Frame the Problem and Look at the Big Picture
=============================================

1. **Define the objective in business terms:** 
    - The objective of this machine learning model is to predict whether a flight will be delayed or cancelled.
2. **How will your solution be used?**
    - This model will be used to notify airlines a week in advance when a delay is expected, as a preventative measure that helps airlines maintain higher ratings and increase profits.
3. **What are the current solutions/workarounds (if any)?** 
    - Currently this is done by humans at each airport, which is less effective because of the massive amount of data involved.
4. **How should you frame this problem?** 
    - This is a supervised classification problem, since we are trying to predict whether a flight will run normally, be delayed, or be cancelled. It could be an online solution, since it would run in real time to predict the outcomes of future flights.
5. **How should performance be measured? Is the performance measure aligned with the business objective?** 
    - Our objective is to correctly predict at least 25% of the flights that will be delayed or cancelled without falsely predicting that any normal flight will be delayed or cancelled. In other words, we need a recall of at least 25% on delayed or cancelled flights with no false positives. This aligns with our business objective of predicting when a flight will be delayed or cancelled.
6. **What would be the minimum performance needed to reach the business objective?** 
    - Again, the minimum performance is to correctly predict 25% of the flights that will be delayed or cancelled without falsely predicting that a normal flight will be delayed.
7. **What are comparable problems? Can you reuse (personal or readily available) experience or tools?** 
    - We can reuse our work on the bike data, as it is also a supervised classification problem. We can also build on our other homeworks and in-class examples when setting up the model.
8. **Is human expertise available?** 
    - Yes, our client has experience with flight delays and has given us good insight and direction on where to look for our problem.
9. **How would you solve the problem manually?** 
    - To solve this problem manually, we would look through all the data to find what has caused the most delays and cancellations, work out which airports are affected the most, and use that to predict whether a given flight will be delayed or cancelled.
10. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** 
    - We have assumed that weather will play a massive role in whether there is a delay. We have also assumed that the size of the airport and the number of staff are important to whether an airport can operate properly, which could lead to delays.
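
The performance target in items 5 and 6 (catch at least 25% of delayed or cancelled flights while never flagging a normal flight) corresponds to recall of at least 0.25 at precision 1.0. The sketch below shows one way to measure that with NumPy. It assumes a binary view of the problem (1 = delayed or cancelled, 0 = normal) and a model that outputs a score per flight; both function names are hypothetical.

```python
import numpy as np


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = delayed or cancelled)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # disrupted flights that were flagged
    fp = np.sum(~y_true & y_pred)   # normal flights that were flagged (false alarms)
    fn = np.sum(y_true & ~y_pred)   # disrupted flights that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)


def recall_with_no_false_alarms(y_true, scores):
    """Highest recall reachable by a score threshold that flags no normal flight."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not y_true.any():
        return 0.0
    if y_true.all():
        return 1.0
    highest_normal_score = scores[~y_true].max()
    caught = np.sum(scores[y_true] > highest_normal_score)
    return float(caught / y_true.sum())


# Toy example: four disrupted flights followed by four normal flights.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.2, 0.5, 0.3, 0.1, 0.05]
print(precision_recall(y_true, np.array(scores) > 0.5))  # (1.0, 0.5)
print(recall_with_no_false_alarms(y_true, scores))       # 0.5
```

In the toy example the threshold sits just above the highest-scoring normal flight, so two of the four disrupted flights are caught, which would meet the 25% target.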

Get the Data
============

1. **List the data you need and how much you need:**
    - National flight data for 2023 and 2024
    - Weather data covering all of the same dates, preferably daily.
2. **Find and document where you can get that data:**
    - All of the weather data is available on the NOAA website. You must go through and make an order for each individual airport. The link is here: https://www.ncei.noaa.gov/cdo-web/ 
    - Flight data is from: https://www.transtats.bts.gov/tables.asp?QO_VQ=EFD&QO_anzr=Nv4yv0r 
3. **Get access authorizations**:
   - You must agree to the terms of use and make an order (which is free for digital use).
4. **Create a workspace**: This notebook.
5. **Get the data**: 
    - Download all of the CSV files from the websites mentioned above
6. **Convert the data to a format you can easily manipulate**:
   - The data is all in one parquet file.
7. **Ensure sensitive information is deleted or protected**: This is public data
8. **Check the size and type of data (time series, geographical, …)**:

<mark>TODO</mark>: report your information below. At this point, since you don't want to look at the data too closely, this is a quick evaluation about the number of features and their data types (note: remember that just because all values for a feature are a number doesn't mean that feature is numerical), the number of samples (including possible missing data), and any special considerations about the features such as:

   1. Is it a time series: 
      - Yes

   2. Are any of the features unusable for the business problem? Or are some not available for the business problem when the model will be used?: 
      - Yes. So far all of the features are usable except for diverted flights, because the client does not want those accounted for.

   3. Which feature(s) will be used as the target/label for the business problem? (including which are required to derive the correct label)
      

   4. Should any of the features be stratified during the train/test split to avoid sampling biases?
   

Do not look at the data too closely at this point since you have not yet split off the testing set. Basically, enough looking at it to understand *how* to split the test set off. It is likely you will have to review the website where the data came from to be able to understand some of the features.
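For the target and stratification questions above, the following is a minimal sketch of deriving a three-class label from the `Cancelled` and `DepDelayMinutes` columns and holding out a stratified test set. The 15-minute delay threshold is an assumption to confirm with the client, and `make_label` and `split_stratified` are hypothetical helper names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

DELAY_THRESHOLD_MIN = 15  # assumption: minutes late before a flight counts as delayed


def make_label(df, threshold=DELAY_THRESHOLD_MIN):
    """Derive a status label: 'cancelled', 'delayed', or 'on_time'."""
    cancelled = df["Cancelled"] == 1
    # Cancelled flights have no departure delay (NaN), so this comparison is False for them.
    delayed = df["DepDelayMinutes"] >= threshold
    status = pd.Series("on_time", index=df.index, name="status")
    status[delayed] = "delayed"
    status[cancelled] = "cancelled"  # applied last so cancellation takes priority
    return status


def split_stratified(df, status, test_size=0.2, seed=42):
    """Split so that each status keeps the same share in train and test."""
    return train_test_split(df, test_size=test_size, stratify=status, random_state=seed)


# Tiny synthetic frame standing in for the real flight data.
demo = pd.DataFrame({
    "Cancelled": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0] * 10,
    "DepDelayMinutes": [0, 20, np.nan, 5, 15, 0, np.nan, 0, 60, 3] * 10,
})
demo_status = make_label(demo)
train_set, test_set = split_stratified(demo, demo_status)
print(demo_status.loc[test_set.index].value_counts())
```

Because the data is a time series, splitting on a date cutoff (training on earlier flights and testing on later ones) is an alternative worth weighing against a random stratified split.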

In [1]:
#Imports
import numpy as np
import os
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the supported plotting interface; pylab is discouraged
import seaborn as sns
from sklearn.model_selection import train_test_split 

In [2]:
def load_all_data():
    if os.path.exists('final_data.parquet'):
        print('final_data.parquet already exists, skipping all merging')
        data = pd.read_parquet('final_data.parquet')
        return data
    
    # Load the data
    data = pd.read_parquet('combined.parquet')
    columns_to_keep = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate', 'OriginAirportID', 'Origin', 'OriginCityName', 'OriginStateName' ,'DestAirportID', 'Dest', 'DestCityName', 'DestStateName', 'DepTime', 'DepDelay', 'DepDelayMinutes', 'ArrTime', 'ArrDelayMinutes', 'Cancelled', 'CancellationCode', 'CarrierDelay', 'Tail_Number', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'AirTime', 'Flights', 'Distance']
    data = data[columns_to_keep].copy()  # copy so later column assignments do not act on a slice
    weather_df = pd.read_csv('3964079.csv')

    # Convert 'DATE' to datetime and extract date components
    weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
    weather_df['Year'] = weather_df['DATE'].dt.year
    weather_df['Month'] = weather_df['DATE'].dt.month
    weather_df['DayofMonth'] = weather_df['DATE'].dt.day

    # Rename 'STATION' to 'WeatherStation' for clarity
    weather_df.rename(columns={'STATION': 'WeatherStation'}, inplace=True)
   
    data['FlightDate'] = pd.to_datetime(data['FlightDate']) 
    # Map each origin airport code to its NOAA weather station ID
    airport_to_station = {
        'ATL': 'USW00013874',
        'ORD': 'USW00094846',
        'SEA': 'USW00024233',
        'MIA': 'USW00012839',
        'DFW': 'USW00003927',
        'LAX': 'USW00023174',
        'DEN': 'USW00003017',
    }
    data['WeatherStation'] = data['Origin'].map(airport_to_station)

    aircraft = pd.read_csv('aircrafts.csv')

    # Merge the data
    combined_df = pd.merge(
        data,
        weather_df,
        on=['Year', 'Month', 'DayofMonth', 'WeatherStation'],
        how='left',
    )

    combined_df['Tail_Number'] = combined_df['Tail_Number'].astype(str)
    aircraft['reg'] = aircraft['reg'].astype(str)

    final_data = pd.merge(
        combined_df,
        aircraft,
        left_on='Tail_Number',
        right_on='reg',
        how='left'
    )
    final_data.to_parquet('final_data.parquet')
    return final_data
    



In [3]:
data = load_all_data()
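
A left merge silently duplicates flight rows whenever the right-hand table has repeated keys, and it leaves NaN wherever no match exists (for example, origins outside the seven mapped airports). The sketch below is one way to guard against both, using the `validate` and `indicator` parameters of `pd.merge`. The wrapper name `checked_left_merge` is hypothetical and is not part of the loading code above.

```python
import pandas as pd


def checked_left_merge(left, right, **merge_kwargs):
    """Left-merge that rejects duplicate right-hand keys and reports match coverage.

    Returns the merged frame and the fraction of left rows that found a match.
    """
    merged = pd.merge(
        left,
        right,
        how="left",
        validate="many_to_one",  # raises MergeError if right-hand keys repeat
        indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
        **merge_kwargs,
    )
    matched_fraction = float((merged["_merge"] == "both").mean())
    return merged.drop(columns="_merge"), matched_fraction


# Tiny synthetic example with one tail number missing from the aircraft table.
flights = pd.DataFrame({"Tail_Number": ["N1", "N2", "N3", "N1"]})
aircraft = pd.DataFrame({"reg": ["N1", "N2"], "numSeats": [76, 150]})
merged, matched = checked_left_merge(
    flights, aircraft, left_on="Tail_Number", right_on="reg"
)
print(len(merged), matched)  # 4 0.75
```

A low matched fraction is worth recording in the write-up, since it determines how many rows will carry missing weather or aircraft features.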


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14825707 entries, 0 to 14825706
Data columns (total 78 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   Year                   int64         
 1   Month                  int64         
 2   DayofMonth             int64         
 3   DayOfWeek              int64         
 4   FlightDate             datetime64[ns]
 5   OriginAirportID        int64         
 6   Origin                 object        
 7   OriginCityName         object        
 8   OriginStateName        object        
 9   DestAirportID          int64         
 10  Dest                   object        
 11  DestCityName           object        
 12  DestStateName          object        
 13  DepTime                float64       
 14  DepDelay               float64       
 15  DepDelayMinutes        float64       
 16  ArrTime                float64       
 17  ArrDelayMinutes        float64       
 18  Cancelled           

In [20]:
data.describe()

|  | Year | Month | DayofMonth | DayOfWeek | FlightDate | OriginAirportID | DestAirportID | DepTime | DepDelay | DepDelayMinutes | ... | WT07 | WT08 | WT09 | WT10 | id | serial | numSeats | numEngines | ageYears | numRegistrations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 14825710.0 | 14825710.0 | 14825710.0 | 14825710.0 | 14825707 | 14825710.0 | 14825710.0 | 14636670.0 | 14636300.0 | 14636300.0 | ... | 5178.0 | 483125.0 | 5416.0 | 1047.0 | 14215690.0 | 14186520.0 | 9078130.0 | 13831260.0 | 13325070.0 | 14215690.0 |
| mean | 2023.509 | 6.586377 | 15.77153 | 3.983661 | 2024-01-06 02:34:45.840818944 | 12654.32 | 12654.33 | 1332.564 | 12.35834 | 15.65029 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 439763.2 | 2675482.0 | 143.1214 | 2.000781 | 14.56669 | 1.489753 |
| min | 2023.0 | 1.0 | 1.0 | 1.0 | 2023-01-01 00:00:00 | 10135.0 | 10135.0 | 1.0 | -99.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 339.0 | 24.0 | 1.0 | 2.0 | 0.3 | 1.0 |
| 25% | 2023.0 | 4.0 | 8.0 | 2.0 | 2023-07-08 00:00:00 | 11292.0 | 11292.0 | 912.0 | -6.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 11270.0 | 10124.0 | 100.0 | 2.0 | 7.6 | 1.0 |
| 50% | 2024.0 | 7.0 | 16.0 | 4.0 | 2024-01-07 00:00:00 | 12889.0 | 12889.0 | 1325.0 | -2.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 23260.0 | 31968.0 | 154.0 | 2.0 | 12.5 | 1.0 |
| 75% | 2024.0 | 10.0 | 23.0 | 6.0 | 2024-07-08 00:00:00 | 14027.0 | 14027.0 | 1746.0 | 9.0 | 9.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 153264.0 | 60122.0 | 179.0 | 2.0 | 21.2 | 2.0 |
| max | 2024.0 | 12.0 | 31.0 | 7.0 | 2024-12-31 00:00:00 | 16869.0 | 16869.0 | 2400.0 | 5764.0 | 5764.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 2145037.0 | 19000640.0 | 2370.0 | 3.0 | 1817.6 | 8.0 |
| std | 0.4999182 | 3.403419 | 8.781058 | 2.007278 |  | 1526.151 | 1526.147 | 507.7571 | 56.12929 | 55.07429 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 810945.2 | 6182807.0 | 63.8964 | 0.02794294 | 27.79567 | 0.8652182 |


In [21]:
data.head()

|  | Year | Month | DayofMonth | DayOfWeek | FlightDate | OriginAirportID | Origin | OriginCityName | OriginStateName | DestAirportID | ... | registrationDate | typeName | numEngines | engineType | isFreighter | productionLine | ageYears | verified | numRegistrations | firstRegistrationDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023 | 12 | 30 | 6 | 2023-12-30 | 12339 | IND | Indianapolis, IN | Indiana | 12953 | ... | 2013-06-26 | Canadair CRJ 900 | 2.0 | Jet | False | Canadair CRJ 900 | 17.1 | True | 3.0 | 2008-02-22 |
| 1 | 2023 | 12 | 30 | 6 | 2023-12-30 | 12953 | LGA | New York, NY | New York | 12339 | ... | 2013-06-26 | Canadair CRJ 900 | 2.0 | Jet | False | Canadair CRJ 900 | 17.1 | True | 3.0 | 2008-02-22 |
| 2 | 2023 | 12 | 1 | 5 | 2023-12-01 | 12953 | LGA | New York, NY | New York | 15016 | ... | 2020-03-25 | Canadair CRJ 900 | 2.0 | Jet | False | Canadair CRJ 900 |  | True | 3.0 | 2008-03-27 |
| 3 | 2023 | 12 | 3 | 7 | 2023-12-03 | 12953 | LGA | New York, NY | New York | 15016 | ... | 2018-02-13 | Canadair CRJ 900 | 2.0 | Jet | False | Canadair CRJ 900 | 17.4 | True | 3.0 | 2007-10-23 |
| 4 | 2023 | 12 | 4 | 1 | 2023-12-04 | 12953 | LGA | New York, NY | New York | 15016 | ... | 2018-03-12 | Canadair CRJ 900 | 2.0 | Jet | False | Canadair CRJ 900 | 17.3 | True | 3.0 | 2007-12-06 |


Explore the Data
================

Notes:
* I want lots of written information. The only code to keep when submitting is the code that outputs numbers, tables, or plots that you refer to in your writing.
* During exploration, it is reasonable to remove unreasonable outliers (and document that you are doing so and how you are classifying what an outlier is) before doing further analysis.
  * There are differences in outliers: ones that are real and ones that are errors. For example, if a height was entered as 7'1" for Shaq O'Neal, that is a real outlier; it has meaning. If a height was entered as 7'1" for a random person, that is an error. You should (try to) remove (only) the error/non-useful ones.
* You will need to explore how to work with date-times. Pandas has a very wide range of utilities for working with them. One particular thing to possibly use is extracting components of the date-time (like hours in the day or day-of-week).
* Document all important things, make sure to put headers for the separate steps, and keep everything organized
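
As an illustration of separating error outliers from real ones, the summary statistics earlier show a maximum `ageYears` of 1817.6 and a maximum `numSeats` of 2370, which look like data-entry errors, while a very long `DepDelay` might be a real event. The sketch below blanks out only values outside a plausibility range and reports how many were touched. The bounds are hypothetical and should be justified in the write-up; `mask_implausible` is not an existing function.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds (inclusive); adjust and document for the real data.
PLAUSIBLE_RANGES = {
    "ageYears": (0, 60),
    "numSeats": (1, 900),
}


def mask_implausible(df, ranges=PLAUSIBLE_RANGES):
    """Return a copy with out-of-range values set to NaN, plus a count per column."""
    cleaned = df.copy()
    counts = {}
    for col, (low, high) in ranges.items():
        # Existing NaN values are left alone and are not counted as errors.
        bad = cleaned[col].notna() & ~cleaned[col].between(low, high)
        counts[col] = int(bad.sum())
        cleaned[col] = cleaned[col].mask(bad)
    return cleaned, counts


sample = pd.DataFrame({
    "ageYears": [12.5, 1817.6, np.nan],
    "numSeats": [154.0, 2370.0, 100.0],
    "DepDelay": [5764.0, -2.0, 9.0],
})
cleaned, removed = mask_implausible(sample)
print(removed)  # {'ageYears': 1, 'numSeats': 1}
```

Columns without an entry in the bounds dictionary, such as `DepDelay` here, are left untouched, which keeps potentially real extreme values in the data.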

Reminder about the 9 steps (points in parentheses):
1. Copy the data for exploration, downsampling to a manageable size if necessary.
2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, …); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, …); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, …) (format as a nice markdown table!)
3. For supervised learning tasks, identify the target attribute(s)
4. Visualize the data
5. Study the correlations between attributes
6. Study how you would solve the problem manually (using the data you have)
7. Identify the promising transformations you may want to apply
8. Identify extra data that would be useful (discuss it, but don't actually go through with it)
9. Document what you have learned (included in the other steps - it is actually worth most of the points!)
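
For step 2, part of the attribute table (type, percentage missing, number of distinct values) can be generated rather than typed by hand. This is a minimal sketch with hypothetical helper names; the noise, usefulness, and distribution columns still need to be filled in from your own analysis. The markdown rendering is done manually so that no extra package is needed.

```python
import pandas as pd


def attribute_summary(df):
    """One row per attribute: dtype, percent missing, and number of distinct values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })
    summary.index.name = "attribute"
    return summary


def to_markdown_table(table):
    """Render a DataFrame (including its index) as a markdown table string."""
    flat = table.reset_index()
    header = "| " + " | ".join(flat.columns) + " |"
    rule = "|" + "|".join(" --- " for _ in flat.columns) + "|"
    rows = [
        "| " + " | ".join(str(value) for value in row) + " |"
        for row in flat.itertuples(index=False)
    ]
    return "\n".join([header, rule, *rows])


sample = pd.DataFrame({
    "Origin": ["ATL", "ORD", None, "ATL"],
    "DepDelay": [1.0, None, None, 4.0],
})
print(to_markdown_table(attribute_summary(sample)))
```

The printed string can be pasted into a markdown cell and extended with the remaining columns.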

# Prepare the Data

Note: the word *optional* simply means not all datasets will require it, it does not mean you can just choose not to do it if it is needed for a particular dataset.

1. Data cleaning: Fix/remove outliers (optional); Fill in missing values (with 0, mean, median…) or drop rows/columns
2. Feature selection (optional): Drop attributes that provide no useful information for the task
3. Feature engineering, where appropriate: Discretize continuous features; Decompose features (categorical, date/time, …), Add promising transformations of features ($\log(x)$, $\sqrt{x}$, $x^2$, …); Aggregate features into promising new features
4. Feature scaling: standardize or normalize features
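
A few of the feature-engineering ideas in step 3 can be sketched on the columns already loaded. This is an illustration under assumptions rather than a chosen design: `add_time_features` is a hypothetical helper, the new column names are made up, and `DepTime` is the actual departure time, so a model that predicts a week ahead would need the scheduled time instead.

```python
import numpy as np
import pandas as pd


def add_time_features(df):
    """Return a copy of df with decomposed time features and a log-scaled distance."""
    out = df.copy()
    # DepTime is stored as hhmm (1332.0 means 13:32); 2400 wraps around to hour 0.
    out["DepHour"] = (out["DepTime"] // 100) % 24
    # Cyclical encoding keeps December (12) next to January (1).
    angle = 2 * np.pi * (out["Month"] - 1) / 12
    out["Month_sin"] = np.sin(angle)
    out["Month_cos"] = np.cos(angle)
    # In this data DayOfWeek runs from 1 (Monday) to 7 (Sunday).
    out["IsWeekend"] = out["DayOfWeek"].isin([6, 7]).astype(int)
    # log1p compresses the long right tail of flight distances.
    out["LogDistance"] = np.log1p(out["Distance"])
    return out


sample = pd.DataFrame({
    "DepTime": [1332.0, 2400.0, np.nan],
    "Month": [1, 4, 7],
    "DayOfWeek": [6, 1, 7],
    "Distance": [0.0, 100.0, 2500.0],
})
engineered = add_time_features(sample)
print(engineered[["DepHour", "IsWeekend", "LogDistance"]])
```

Missing departure times stay missing in `DepHour`, so imputation or row removal still has to be decided in the data-cleaning step.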

Short-List Promising Models
======
1. Train many quick and dirty models from different categories (e.g. linear, naive Bayes, SVM, Random Forests, neural net, ...) using standard parameters
2. Measure and compare their performance. For each model, use $N$-fold cross-validation and compute the mean and standard deviation of the performance measure on the $N$ folds.
3. Analyze the most significant variables for each algorithm
4. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering
6. Have one or two more quick iterations of the five previous steps
7. Short-list the top three to five most promising models, preferring models that make different types of errors
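
Steps 1 and 2 can be sketched with scikit-learn's `cross_val_score`. Everything specific here is a placeholder: the data is synthetic, the three models are arbitrary examples from different categories, and precision is used as the score only because the business objective penalizes false alarms. `compare_models` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def compare_models(models, X, y, n_folds=5, scoring="precision"):
    """Return {name: (mean score, std of score)} over n_folds cross-validation folds."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=n_folds, scoring=scoring)
        results[name] = (float(scores.mean()), float(scores.std()))
    return results


# Synthetic, imbalanced stand-in for the prepared flight data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
results = compare_models(models, X, y)
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation alongside the mean shows whether a difference between two models is larger than the fold-to-fold variation.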

Fine-Tune the System
======
1. Fine-tune the hyperparameters using cross-validation. Treat your data transformation choices as hyperparameters, especially when you are not sure about them. Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach.
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
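
The random search in step 1 and the final test-set check in step 3 can be sketched with scikit-learn's `RandomizedSearchCV`. The model, the hyperparameter ranges, and the synthetic data are all placeholders, since the models and hyperparameters to consider for this project have not been specified yet.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the prepared flight data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder search space: each randint draws integers from [low, high).
param_distributions = {
    "n_estimators": randint(20, 80),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,             # number of random configurations to try
    cv=3,                 # cross-validation folds per configuration
    scoring="precision",  # placeholder metric tied to avoiding false alarms
    random_state=0,
)
search.fit(X_train, y_train)

# Touch the test set only once, after the final model has been chosen.
y_pred = search.best_estimator_.predict(X_test)
test_precision = precision_score(y_test, y_pred, zero_division=0)
test_recall = recall_score(y_test, y_pred, zero_division=0)
print(search.best_params_)
print(f"test precision={test_precision:.3f}, recall={test_recall:.3f}")
```

Data transformation choices can be searched the same way by wrapping the preprocessing and the model in a `Pipeline` and adding the transformer parameters to the search space.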

Present Your Solution
=====
1. Document what you have done
2. Create a nice presentation, highlighting the big picture first
3. Explain why your solution achieves the business objective
4. Don’t forget to present interesting points you noticed along the way: Describe what worked and what did not; List your assumptions and your system’s limitations
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g. “the median income is the number-one predictor of housing prices”)