# Predicting Flight Delays with DataRobot AI platform

## Preparing Train and Test Data for DataRobot Model Deployment

[Predicting Flight Delays with DataRobot](https://60e895727e71a6eaa3a03fa3.apps2.datarobot.com/?token=8C8XQxDzhpww8nMnclVY2DMp-VqiOI-AtM8_Gu9aVQY)

- The objective of this project is to develop a model that predicts flight delays at take-off.

In [1]:
import sys

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Dependencies
# ==============================================================================
# The original notebook relied on pyforest to auto-import these on first use;
# they are listed explicitly here so the cells run without it.
# ==============================================================================

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

# Folder configuration
# ==============================================================================
new_path = '../scripts/'
if new_path not in sys.path:
    sys.path.append(new_path)

## Clean Dataset

In [2]:
path = '../data/'
file = 'raw/DelayedFlights.csv'

df_raw = pd.read_csv(path+file)

In [3]:
df = df_raw.copy()

In [4]:
df = df_raw.drop(labels = 'Unnamed: 0', axis = 1)

In [5]:
df = df.sample(frac=0.01, random_state=6858)  # Keep a 1% sample to stay under the 200 MB DataRobot file limit
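
Whether a given sampling fraction fits under the 200 MB limit can be checked before uploading. The following is a minimal sketch, assuming the limit applies to the size of the CSV file; `csv_size_mb` is a hypothetical helper that is not part of the notebook, and the toy frame only stands in for the flight data.

```python
import io

import pandas as pd


def csv_size_mb(frame: pd.DataFrame) -> float:
    """Return the size in MB (10**6 bytes) of the frame serialized as UTF-8 CSV without the index."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    return len(buffer.getvalue().encode("utf-8")) / 1e6


# Toy frame standing in for the flight data (values are illustrative only)
toy = pd.DataFrame({"DepDelay": range(1000), "Distance": range(1000)})
sample = toy.sample(frac=0.01, random_state=6858)

# Number of sampled rows and whether the sample would fit under the limit
print(len(sample), csv_size_mb(sample) < 200)
```

Measuring the serialized size rather than the in-memory size matters here because the upload is a CSV file, and the two can differ considerably.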

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
# Frequency encoding: replace each category with its share of rows (in percent)

labels = ['UniqueCarrier', 'TailNum', 'Origin', 'Dest', 'CancellationCode']

for category in labels: 
    
    cat_map = df.groupby(category).size() / len(df)*100
    
    df[category] = df[category].map(cat_map)
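
To make the effect of the loop concrete, here is the same transformation on a toy column. It is a small sketch with made-up airport codes; it also shows that a missing category is skipped by `groupby` and therefore maps to `NaN`, which has to be handled in a later step.

```python
import pandas as pd

# Toy column with one missing value (airport codes are illustrative only)
toy = pd.DataFrame({"Origin": ["ATL", "ATL", "ORD", "JFK", None]})

# Share of rows per category, in percent. groupby() skips missing values,
# so a missing category has no entry in the mapping.
freq = toy.groupby("Origin").size() / len(toy) * 100

# Each label is replaced by its frequency; the missing label becomes NaN
encoded = toy["Origin"].map(freq)

print(encoded.tolist())
```

Because the denominator is `len(toy)` and not the count of non-missing rows, the frequencies do not sum to 100 when values are missing.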

In [8]:
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
32721,2008,1,1,2,1402.0,1230,1417.0,1256,5.43164,7768,...,6.0,13.0,0,99.948368,0,0.0,0.0,0.0,0.0,81.0
480545,2008,3,20,4,1723.0,1715,1932.0,1906,3.893019,4358,...,12.0,17.0,0,99.948368,0,8.0,0.0,18.0,0.0,0.0
879313,2008,5,5,1,703.0,650,1031.0,1011,5.870508,1167,...,8.0,18.0,0,99.948368,0,0.0,0.0,20.0,0.0,0.0
1670776,2008,11,8,6,1628.0,1559,1731.0,1715,7.677613,1553,...,7.0,14.0,0,99.948368,0,0.0,0.0,0.0,0.0,16.0
1464310,2008,9,26,5,1526.0,1230,2322.0,2029,7.677613,52,...,6.0,10.0,0,99.948368,0,0.0,0.0,0.0,0.0,173.0


In [9]:
# Project script for handling missing data
# ===============================================================================

import missing

In [10]:
df[:] = missing.transform(df[:], 'mean')
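
The `missing` module is a project script whose source is not shown here. As a rough sketch of what a mean-based transform might look like, assuming it fills the missing values of each numeric column with that column's mean, consider the hypothetical stand-in `impute_mean` below. It is not the author's implementation.

```python
import numpy as np
import pandas as pd


def impute_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which NaNs in numeric columns are replaced by the column mean."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    # fillna with a Series of means fills each column with its own mean
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


# Toy data with one gap per column (values are illustrative only)
toy = pd.DataFrame({"ArrDelay": [10.0, np.nan, 30.0], "TaxiIn": [6.0, 12.0, np.nan]})
clean = impute_mean(toy)

# Total number of missing values left after imputation
print(clean.isnull().sum().sum())
```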

In [11]:
df.isnull().sum().sum()

0

In [12]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)

In [13]:
df.isna().any()

Year                 False
Month                False
DayofMonth           False
DayOfWeek            False
DepTime              False
CRSDepTime           False
ArrTime              False
CRSArrTime           False
UniqueCarrier        False
FlightNum            False
TailNum              False
ActualElapsedTime    False
CRSElapsedTime       False
AirTime              False
ArrDelay             False
DepDelay             False
Origin               False
Dest                 False
Distance             False
TaxiIn               False
TaxiOut              False
Cancelled            False
CancellationCode     False
Diverted             False
CarrierDelay         False
WeatherDelay         False
NASDelay             False
SecurityDelay        False
LateAircraftDelay    False
dtype: bool

### Train/Test Split

In [14]:
# Data matrix: all cleaned columns
X = df

In [15]:
# Call train_test_split on the data and capture the results
# ============================================================================

X_train, X_test = train_test_split(X, test_size = 0.2, random_state = 6858)
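
A quick sanity check of such a split is to confirm the sizes and that no row lands in both parts. The sketch below uses a toy frame with the same `test_size` and `random_state` as above; the column values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned flight data
toy = pd.DataFrame({"DepDelay": range(100), "Distance": range(100, 200)})

# Same call shape as in the notebook: one frame in, two frames out
train, test = train_test_split(toy, test_size=0.2, random_state=6858)

# Sizes of both parts and whether their row indices are disjoint
print(len(train), len(test), train.index.intersection(test.index).empty)
```

Fixing `random_state` makes the split reproducible, so the exported CSV files can be regenerated with identical rows.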

In [16]:
X_train.to_csv(path + "interim/X_train_DataRobot.csv", index = False, header = True)
X_test.to_csv(path + "interim/X_test_DataRobot.csv", index = False, header = True)

## Predictions

[Predicting Flight Delays with DataRobot](https://60e895727e71a6eaa3a03fa3.apps2.datarobot.com/?token=8C8XQxDzhpww8nMnclVY2DMp-VqiOI-AtM8_Gu9aVQY)

### Models

In [1]:
df_Leaderboard = pd.read_csv('../data/external/Leaderboard_Delay_Flights_10%_Data.csv')

In [2]:
df_Leaderboard

Unnamed: 0,Rank,Starred,Model ID,Feature List ID,Feature List,Model Features,Sample Size,Prediction Threshold,Modeling Results,Model Type,...,Sample %,Holdout Size,Predicted Ram Usage,Total Prediction API Requirements,Uses GPU,Scaleout,Monotonic Constraints,Monotonic Feature List,Smart Sampling,Smart Sample Size
0,11,False,60e9083057012779d9b8e8ee,60e907d443afcfa65317be00,Informative Features,Tree-based Algorithm Preprocessing v1,4958.0,0.5,"Gini Norm: [0.99196], MAE: [3.92078], R Square...",Light Gradient Boosted Trees Regressor with Ea...,...,31.99949,3099,,0,False,False,False,,,
1,10,False,60e9083057012779d9b8e8ed,60e907d443afcfa65317be00,Informative Features,"Missing Values Imputed, Generalized Additive2 ...",4958.0,0.5,"Gini Norm: [0.98874], MAE: [4.33415], R Square...",Generalized Additive2 Model,...,31.99949,3099,,0,False,False,True,,,
2,8,False,60e9083057012779d9b8e8eb,60e907d443afcfa65317be00,Informative Features,"Missing Values Imputed, Standardize, Elastic-N...",4958.0,0.5,"Gini Norm: [0.99537], MAE: [3.31263], R Square...",Elastic-Net Regressor (mixing alpha=0.5 / Leas...,...,31.99949,3099,,0,False,False,False,,,
3,30,False,60e908af6b71f57f2edecb14,60e907d443afcfa65317be00,Informative Features,"Missing Values Imputed, Standardize, Elastic-N...",9916.0,0.5,"Gini Norm: [0.99638, 0.995182], MAE: [2.78019,...",Elastic-Net Regressor (mixing alpha=0.5 / Leas...,...,63.99897,3099,,0,False,False,False,,,
4,12,False,60e9083057012779d9b8e8ef,60e907d443afcfa65317be00,Informative Features,"Numeric Data Cleansing, Standardize, Ridge Reg...",4958.0,0.5,"Gini Norm: [0.99693], MAE: [1.39666], R Square...",Light Gradient Boosting on ElasticNet Predicti...,...,31.99949,3099,,0,False,False,False,,,
5,22,False,60e908af6b71f57f2edecb12,60e907d443afcfa65317be00,Informative Features,"Numeric Data Cleansing, Standardize, Ridge Reg...",9916.0,0.5,"Gini Norm: [0.99783, 0.997946], MAE: [1.07274,...",Light Gradient Boosting on ElasticNet Predicti...,...,63.99897,3099,,0,False,False,False,,,
6,41,False,60e909d4fe41a16f880a1a5c,60e909d4fe41a16f880a1a54,DR Reduced Features M14,"Numeric Data Cleansing, Standardize, Ridge Reg...",9916.0,0.5,"Gini Norm: [0.99824, 0.9970519999999998], MAE:...",Light Gradient Boosting on ElasticNet Predicti...,...,63.99897,3099,,0,False,False,False,,,
7,47,False,60e90a04400de5b136decb1f,60e907d443afcfa65317be00,Informative Features,"Numeric Data Cleansing, Standardize, Ridge Reg...",12395.0,0.5,"Gini Norm: [0.99814, 0.99785], MAE: [1.06814, ...",Light Gradient Boosting on ElasticNet Predicti...,...,79.99871,3099,,0,False,False,False,,,
8,7,False,60e9083057012779d9b8e8ea,60e907d443afcfa65317be00,Informative Features,Regularized Linear Model Preprocessing v19,4958.0,0.5,"Gini Norm: [0.99817], MAE: [2.13992], R Square...",Ridge Regressor,...,31.99949,3099,,0,False,False,False,,,
9,26,False,60e908af6b71f57f2edecb13,60e907d443afcfa65317be00,Informative Features,Regularized Linear Model Preprocessing v19,9916.0,0.5,"Gini Norm: [0.99847, 0.9975759999999999], MAE:...",Ridge Regressor,...,63.99897,3099,,0,False,False,False,,,
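
The `Modeling Results` column packs several metrics into a single string, which makes the leaderboard hard to sort. The sketch below pulls out the first value listed for a metric, assuming the full strings follow the `Metric: [value, ...]` pattern visible in the truncated display. `first_metric` is a hypothetical helper, and the two sample strings are shortened copies of rows shown above.

```python
import re
from typing import Optional

import pandas as pd


def first_metric(results: str, metric: str) -> Optional[float]:
    """Return the first number listed for `metric` in a results string, or None if absent."""
    match = re.search(re.escape(metric) + r":\s*\[\s*(-?\d+(?:\.\d+)?)", results)
    return float(match.group(1)) if match else None


# Two rows copied from the display above (strings cut off after the MAE entry)
toy = pd.DataFrame({
    "Model Type": ["Ridge Regressor", "Generalized Additive2 Model"],
    "Modeling Results": [
        "Gini Norm: [0.99817], MAE: [2.13992]",
        "Gini Norm: [0.98874], MAE: [4.33415]",
    ],
})

# Extract the first MAE value of each model and rank by it (lower is better)
toy["MAE"] = toy["Modeling Results"].apply(first_metric, metric="MAE")
print(toy.sort_values("MAE")[["Model Type", "MAE"]])
```

Which data partition the first listed value refers to is not stated in the export, so the ranking should be read with that caveat.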


### Predictions

In [5]:
df_Predictions = pd.read_csv('../data/external/all_prediction_records_Delay_Flights_10%_Data.csv')

In [6]:
df_Predictions

Unnamed: 0,name,description,prediction,TailNum,ArrTime,Diverted,FlightNum,Distance,DayofMonth,NASDelay,...,WeatherDelay,DepDelay,ActualElapsedTime,AirTime,DayOfWeek,Month,DepTime,TaxiOut,CRSArrTime,Dest
0,New Record #1,A record from X_test_DataRobot.csv,311.857912,0.025816,1821.0,0,1857,1565,9,0.000000,...,0.000000,312,219.0,201.0,7,11,1342,11.0,1310,0.366584
1,New Record #2,A record from X_test_DataRobot.csv,4.189602,0.056795,1400.0,0,7110,401,21,14.870736,...,3.630539,9,80.0,61.0,3,5,1240,17.0,1357,0.191037
2,New Record #4,A record from X_test_DataRobot.csv,87.742531,0.056795,1603.0,0,3822,501,15,0.000000,...,0.000000,73,106.0,85.0,2,1,1317,16.0,1435,1.507641
3,New Record #3,A record from X_test_DataRobot.csv,33.947205,0.046468,1824.0,0,3046,358,21,0.000000,...,0.000000,35,74.0,63.0,1,1,1710,7.0,1750,0.758984
4,New Record #5,A record from X_test_DataRobot.csv,5.360596,0.030979,1947.0,0,3168,224,14,14.870736,...,3.630539,6,61.0,49.0,3,5,1846,8.0,1940,1.631557
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,New Record #97,A record from X_test_DataRobot.csv,-12.594051,0.041305,1449.0,0,1900,1171,9,14.870736,...,3.630539,6,163.0,147.0,1,6,1006,10.0,1505,3.727798
96,New Record #96,A record from X_test_DataRobot.csv,-2.815421,0.025816,2225.0,0,101,1036,29,14.870736,...,3.630539,21,144.0,133.0,3,10,2001,8.0,2230,0.681537
97,New Record #98,A record from X_test_DataRobot.csv,19.851505,0.030979,855.0,0,1851,90,10,3.000000,...,0.000000,17,58.0,36.0,4,1,757,18.0,835,1.471499
98,New Record #99,A record from X_test_DataRobot.csv,51.807298,0.046468,1907.0,0,4667,83,13,3.000000,...,49.000000,49,48.0,30.0,3,2,1819,6.0,1815,5.560719
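
In the rows displayed above, the predictions track the gap between `ArrTime` and `CRSArrTime` closely. The sketch below compares the two for the first five records. It assumes the clock columns are stored as HHMM numbers and that no flight crosses midnight; `hhmm_to_minutes` is a hypothetical helper, and reading the gap as the observed delay is an inference from the display, not something the notebook states.

```python
import pandas as pd


def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert clock times stored as HHMM numbers (e.g. 1821) to minutes after midnight."""
    hhmm = hhmm.astype(int)
    return (hhmm // 100) * 60 + hhmm % 100


# First five records from the display above
toy = pd.DataFrame({
    "prediction": [311.857912, 4.189602, 87.742531, 33.947205, 5.360596],
    "ArrTime": [1821.0, 1400.0, 1603.0, 1824.0, 1947.0],
    "CRSArrTime": [1310, 1357, 1435, 1750, 1940],
})

# Observed gap in minutes between actual and scheduled arrival (no midnight wraparound handled)
toy["observed_delay"] = hhmm_to_minutes(toy["ArrTime"]) - hhmm_to_minutes(toy["CRSArrTime"])

# Mean absolute error of the predictions against that gap
mae = (toy["prediction"] - toy["observed_delay"]).abs().mean()
print(toy[["prediction", "observed_delay"]])
print(round(mae, 3))
```

Five rows are far too few to judge the model; the same comparison over the full prediction file would give a more meaningful error estimate.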
