# Will it be delayed?

Everyone who has flown has experienced a delayed or cancelled flight. Both airlines and airports would like to improve their on-time performance and predict when a flight will be delayed or cancelled several days in advance. You are being hired to build a model that can predict if a flight will be delayed. To learn more, you must schedule a meeting with your client (me) by sending an event request through Google Calendar for a 15-minute meeting. Both you and your project partner must attend the meeting. Come prepared with questions to ask your client. Remember that your client is not a data scientist, so you will need to explain things in a way that is easy to understand. Make sure that your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you. This only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, for which you should keep asking questions.

For this project you must go through almost all steps in the checklist. You must write responses for all items as done in the homeworks, although sometimes the response will simply be "does not apply". Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Some changes to the checklist:

* Do not do the final part (launching the product).
* Your presentation will be written in this document in a dedicated section (no slides or anything like that). It should include a high-level summary of your results (including what you learned about the data, the "accuracy" of your model, what features were important, etc). It should be written for your client, not your professor or teammates. It should include the best summary plots/graphics/data points.
* The models and hyperparameters you should consider during short-listing and fine-tuning will be released at a later time (depending on how far we get over the next two weeks).
* Data retrieval must be automatic as part of the code (so it can easily be re-run and grab the latest data). Do not commit any data to the repository.
* Your submission must include a pickled final model along with this notebook.
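
The pickling requirement above could be handled with a pair of small helpers. This is only a minimal sketch using the standard-library `pickle` module: the names `save_model` and `load_model`, the file name `final_model.pkl`, and the dictionary standing in for a fitted model are all hypothetical and not part of the assignment.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted estimator) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted model; in the notebook this would be the final estimator.
placeholder_model = {"name": "placeholder", "threshold": 0.5}

# Hypothetical file name, written to the system temp directory for this demo.
model_path = Path(tempfile.gettempdir()) / "final_model.pkl"
save_model(placeholder_model, model_path)
restored = load_model(model_path)
```

Loading the file back right after saving it is a cheap check that the submitted pickle is actually usable.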

# Framing the Problem

1. **Define the objective in business terms:** <mark> The objective of this machine learning model is to predict whether a flight will be delayed or cancelled. </mark>
2. **How will your solution be used?** <mark> The model will notify airlines a week in advance of a suspected delay, so they can take preventative measures that protect their ratings and profits. </mark>
3. **What are the current solutions/workarounds (if any)?** <mark> Currently, staff at each airport make these predictions by hand, which is less effective because of the massive amount of data involved. </mark>
4. **How should you frame this problem?** <mark> This is a supervised classification problem, since we are trying to predict whether a flight will run normally, be delayed, or be cancelled. It could be an online solution, since it would run in real time to predict the outcomes of future flights. </mark>
5. **How should performance be measured? Is the performance measure aligned with the business objective?** <mark> Our objective is to correctly predict at least 25% of the flights that will be delayed or cancelled, without falsely predicting that any normal flight will be delayed or cancelled. In classification terms, this is a recall of at least 25% on delayed or cancelled flights with no false positives. This aligns with our business objective of predicting when a flight will be delayed or cancelled. </mark>
6. **What would be the minimum performance needed to reach the business objective?** <mark> Again, the minimum performance is to correctly predict 25% of the flights that will be delayed or cancelled, without falsely predicting that a normal flight will be delayed. </mark>
7. **What are comparable problems? Can you reuse (personal or readily available) experience or tools?** <mark> We can reuse our work on the bike data, as it is also a supervised classification problem. We can also build on our other homeworks and in-class examples when setting up the model. </mark>
8. **Is human expertise available?** <mark> Yes, our client has experience with flight delays and has given us good insight and direction on where to look for our problem. </mark>
9. **How would you solve the problem manually?** <mark> To solve this problem manually, we would need to look through all the data to find what has caused the most delays and cancellations, and work out which airports are affected the most, in order to predict more accurately whether a delay or cancellation will happen. </mark>
10. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** <mark> We have assumed that weather will play a massive role in whether a flight is delayed. We have also assumed that the size of the airport and its number of staff determine whether the airport can operate properly, which could lead to delays. </mark>
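
The performance target in items 5 and 6 (catch at least 25% of delayed or cancelled flights while never flagging a normal flight) can be checked directly from model scores. The sketch below is one possible way to do it, under the assumption that the problem is collapsed to a binary label (1 for delayed or cancelled, 0 for normal) and that the model outputs a score where higher means more likely delayed or cancelled. The function name and the toy data are hypothetical.

```python
import numpy as np


def recall_at_zero_false_positives(y_true, scores):
    """Return (threshold, recall) for the lowest threshold that flags no normal flight.

    y_true: 1 for delayed/cancelled, 0 for normal (assumed binary framing).
    scores: model scores, higher means more likely delayed/cancelled.
    A flight is flagged when its score is strictly greater than the threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    negatives = scores[y_true == 0]
    positives = scores[y_true == 1]
    # Placing the threshold at the highest score among normal flights
    # guarantees that no normal flight is flagged.
    threshold = negatives.max() if negatives.size else -np.inf
    recall = float((positives > threshold).mean()) if positives.size else 0.0
    return threshold, recall


# Toy data for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
threshold, recall = recall_at_zero_false_positives(y_true, scores)
meets_target = recall >= 0.25  # the 25% minimum from items 5 and 6
```

On held-out data a strict zero-false-positive rule is sensitive to a single high-scoring normal flight, so it may be worth confirming with the client whether a very small false-alarm rate is acceptable.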

In [10]:
# Imports
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [None]:
# Load the combined flight records
data = pd.read_parquet('combined.parquet')
# Keep only the date, route, delay, and cancellation columns
columns_to_keep = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate', 'OriginAirportID', 'Origin', 'OriginCityName', 'OriginStateName', 'DestAirportID', 'Dest', 'DestCityName', 'DestStateName', 'DepTime', 'DepDelay', 'DepDelayMinutes', 'ArrTime', 'ArrDelayMinutes', 'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'AirTime', 'Flights', 'Distance']
data = data[columns_to_keep]
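
Since the problem is framed as classifying flights as normal, delayed, or cancelled, a target column has to be derived from the columns kept above. The following is a rough sketch of one way to do that. It assumes a flight counts as delayed when `ArrDelayMinutes` is at least 15 and that `Cancelled` equals 1 for cancelled flights. The 15-minute cutoff, the choice of arrival rather than departure delay, the `Status` column name, and the helper `add_status_label` are all assumptions, not something decided in this notebook.

```python
import numpy as np
import pandas as pd

DELAY_THRESHOLD_MINUTES = 15  # assumption: cutoff not specified in this notebook


def add_status_label(df, threshold=DELAY_THRESHOLD_MINUTES):
    """Return a copy of df with a 'Status' column: normal, delayed, or cancelled."""
    out = df.copy()
    cancelled = out["Cancelled"] == 1
    delayed = out["ArrDelayMinutes"] >= threshold  # NaN compares as False
    # Conditions are checked in order, so cancelled takes priority over delayed.
    out["Status"] = np.select(
        [cancelled, delayed], ["cancelled", "delayed"], default="normal"
    )
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Cancelled": [0.0, 0.0, 1.0, 0.0],
    "ArrDelayMinutes": [0.0, 42.0, np.nan, 14.0],
})
labelled = add_status_label(toy)
```

Note that any non-cancelled flight with a missing `ArrDelayMinutes` falls into `normal` under this sketch, which may need separate handling.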

In [15]:
data.describe()

statistic,Year,Month,DayofMonth,DayOfWeek,OriginAirportID,DestAirportID,DepTime,DepDelay,DepDelayMinutes,ArrTime,ArrDelayMinutes,Cancelled,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,AirTime,Flights,Distance
count,14825710.0,14825710.0,14825710.0,14825710.0,14825710.0,14825710.0,14636670.0,14636300.0,14636300.0,14625530.0,14592270.0,14825710.0,2995616.0,2995616.0,2995616.0,2995616.0,2995616.0,14592270.0,14825707.0,14825710.0
mean,2023.509,6.586377,15.77153,3.983661,12654.32,12654.33,1332.564,12.35834,15.65029,1460.146,15.57941,0.01327458,24.89971,4.090163,13.00443,0.131402,28.84685,111.8473,1.0,806.6236
std,0.4999182,3.403419,8.781058,2.007278,1526.151,1526.147,507.7571,56.12929,55.07429,544.1017,54.68914,0.1144481,76.5862,33.70453,31.39895,3.370272,64.58379,69.8695,0.0,592.269
min,2023.0,1.0,1.0,1.0,10135.0,10135.0,1.0,-99.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,11.0
25%,2023.0,4.0,8.0,2.0,11292.0,11292.0,912.0,-6.0,0.0,1045.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.0,1.0,373.0
50%,2024.0,7.0,16.0,4.0,12889.0,12889.0,1325.0,-2.0,0.0,1502.0,0.0,0.0,3.0,0.0,0.0,0.0,2.0,94.0,1.0,649.0
75%,2024.0,10.0,23.0,6.0,14027.0,14027.0,1746.0,9.0,9.0,1916.0,9.0,0.0,22.0,0.0,17.0,0.0,33.0,141.0,1.0,1045.0
max,2024.0,12.0,31.0,7.0,16869.0,16869.0,2400.0,5764.0,5764.0,2400.0,5780.0,1.0,5764.0,2419.0,2700.0,1460.0,3581.0,1338.0,1.0,5095.0
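
The `count` row in the summary above differs by column, which suggests that some columns have missing values. The arithmetic below is a small sketch of how to turn those counts into missing-value shares. The numbers are copied from the summary, where most counts appear rounded to seven significant digits, so the results are approximate. The variable names are hypothetical.

```python
# Counts copied from the describe() output above (displayed values, so approximate).
total_rows = 14_825_710  # count for Year
counts = {
    "DepTime": 14_636_670,
    "ArrDelayMinutes": 14_592_270,
    "CarrierDelay": 2_995_616,  # same count for the other four delay-cause columns
}

# Share of rows with no value in each column.
missing_share = {col: 1 - n / total_rows for col, n in counts.items()}

# Cancelled is a 0/1 column, so its mean is the share of cancelled flights.
cancelled_share = 0.01327458

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
print(f"Cancelled flights: {cancelled_share:.1%}")
```

The delay-cause columns are empty for roughly 80% of rows, and only about 1.3% of flights are cancelled, so the cancelled class will be heavily outnumbered in any classifier.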
