# Preparing Train and Test Data for DataRobot Model Deployment

[Predicting Flight Delays with DataRobot](https://60e81def7e71a6eaa3a03f8f.apps2.datarobot.com/?token=tv_yv4qxMP9LD4OI0MHf1r9a1-UA7RUkgksIOBBMjks)

- The objective of this project is to develop a model that predicts flight delays at take-off.

In [1]:
import numpy as np  # used later to replace infinite values
import pandas as pd
from sklearn.model_selection import train_test_split
import sys
# ^^^ pyforest auto-imports - don't write above this line
# ==============================================================================
# Auto Import Dependencies
# ==============================================================================
# pyforest imports dependencies according to use in the notebook
# ==============================================================================

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

# Folder configuration
# ==============================================================================
new_path = '../scripts/'
if new_path not in sys.path:
    sys.path.append(new_path)


## Clean Dataset

In [2]:
path = '../data/'
file = 'raw/DelayedFlights.csv'

df_raw = pd.read_csv(path+file)


In [3]:
df = df_raw.copy()

In [4]:
df = df_raw.drop(labels='Unnamed: 0', axis=1)

In [5]:
# Sample 25% of the rows to stay under the 200 MB DataRobot file limit
df = df.sample(frac=0.25, random_state=6858)
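
Whether a given sampling fraction actually fits under a size limit can be checked by serializing the sampled frame and measuring it. The following is a minimal sketch on toy data; `csv_size_mb` is a hypothetical helper that is not part of this notebook, and the toy columns only stand in for the flights table.

```python
import io

import pandas as pd


def csv_size_mb(frame: pd.DataFrame) -> float:
    """Hypothetical helper: size in MB of the frame written as UTF-8 CSV."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    return len(buffer.getvalue().encode("utf-8")) / 1_000_000


# Toy data standing in for the flights table
toy = pd.DataFrame({"DepDelay": range(1000), "Distance": range(1000, 2000)})
sampled = toy.sample(frac=0.25, random_state=6858)

full_mb = csv_size_mb(toy)
sampled_mb = csv_size_mb(sampled)
print(f"full: {full_mb:.4f} MB, sampled: {sampled_mb:.4f} MB")
```

Serializing in memory is only practical for small or already sampled frames; for a very large table, writing to disk and checking the file size is the cheaper route.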

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
# Frequency encoding: replace each category with the percentage of rows it appears in
labels = ['UniqueCarrier', 'TailNum', 'Origin', 'Dest', 'CancellationCode']

for category in labels:
    cat_map = df.groupby(category).size() / len(df) * 100
    df[category] = df[category].map(cat_map)
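
To make the effect of the encoding concrete, here is a small sketch that applies the same `groupby` / `map` pattern to a toy `Origin` column. The toy values and the `Origin_freq` column name are illustrative assumptions, not data from the flights file.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Origin": ["ATL", "ATL", "ORD", "ATL", "DFW", "ORD", "ATL", "DFW"]}
)

# Percentage of rows that belong to each category
cat_map = toy.groupby("Origin").size() / len(toy) * 100

# Replace each label by its percentage (kept in a new column here for comparison)
toy["Origin_freq"] = toy["Origin"].map(cat_map)
print(toy)
```

In this toy frame `ORD` and `DFW` both occur in 25% of the rows, so they receive the same encoded value. That collision is a general property of frequency encoding: categories with equal counts become indistinguishable.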

In [8]:
df.head()

| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32721 | 2008 | 1 | 1 | 2 | 1402.0 | 1230 | 1417.0 | 1256 | 5.358847 | 7768 | ... | 6.0 | 13.0 | 0 | 99.966129 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 81.0 |
| 480545 | 2008 | 3 | 20 | 4 | 1723.0 | 1715 | 1932.0 | 1906 | 4.189471 | 4358 | ... | 12.0 | 17.0 | 0 | 99.966129 | 0 | 8.0 | 0.0 | 18.0 | 0.0 | 0.0 |
| 879313 | 2008 | 5 | 5 | 1 | 703.0 | 650 | 1031.0 | 1011 | 5.978438 | 1167 | ... | 8.0 | 18.0 | 0 | 99.966129 | 0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 |
| 1670776 | 2008 | 11 | 8 | 6 | 1628.0 | 1559 | 1731.0 | 1715 | 7.297755 | 1553 | ... | 7.0 | 14.0 | 0 | 99.966129 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 1464310 | 2008 | 9 | 26 | 5 | 1526.0 | 1230 | 2322.0 | 2029 | 7.297755 | 52 | ... | 6.0 | 10.0 | 0 | 99.966129 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 173.0 |


In [9]:
# Custom script (from ../scripts/) for handling missing data
# ===============================================================================

import missing

In [10]:
df[:] = missing.transform(df[:], 'mean')
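
The `missing` module is a local script whose source is not shown here, so its exact behavior is unknown. As a rough illustration of what a `'mean'` strategy usually means, the sketch below fills missing numeric values with the column mean. `fill_with_mean` is a hypothetical stand-in, not the actual `missing.transform` implementation.

```python
import numpy as np
import pandas as pd


def fill_with_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: fill NaNs in numeric columns with the column mean."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    # fillna with a Series of means fills each column with its own mean
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


toy = pd.DataFrame(
    {"TaxiIn": [6.0, np.nan, 8.0], "TaxiOut": [13.0, 17.0, np.nan]}
)
imputed = fill_with_mean(toy)
print(imputed)
```

The helper returns a new frame and leaves its input untouched, which makes it easy to compare the data before and after imputation.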

In [11]:
df.isnull().sum().sum()

0

In [12]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
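
Infinite values are not counted by `isnull()`, which is why they are converted to `NaN` before dropping rows. A minimal sketch of the same two steps on toy data (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "ArrDelay": [10.0, np.inf, -np.inf, 25.0],
        "Distance": [300.0, 450.0, 500.0, np.nan],
    }
)

# isnull() sees only the real NaN, not the two infinities
nulls_before = int(toy.isnull().sum().sum())

# Turn infinities into NaN, then drop every row that contains a NaN
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(nulls_before, len(clean))
```

This version chains the calls and returns a new frame instead of using `inplace=True`; the resulting rows are the same either way.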

In [13]:
df.isna().any()

Year                 False
Month                False
DayofMonth           False
DayOfWeek            False
DepTime              False
CRSDepTime           False
ArrTime              False
CRSArrTime           False
UniqueCarrier        False
FlightNum            False
TailNum              False
ActualElapsedTime    False
CRSElapsedTime       False
AirTime              False
ArrDelay             False
DepDelay             False
Origin               False
Dest                 False
Distance             False
TaxiIn               False
TaxiOut              False
Cancelled            False
CancellationCode     False
Diverted             False
CarrierDelay         False
WeatherDelay         False
NASDelay             False
SecurityDelay        False
LateAircraftDelay    False
dtype: bool

### Train/Test Split

In [14]:
# Feature matrix: the full table, which still includes the ArrDelay column
X = df
# Target vector
y = df['ArrDelay']


In [15]:
# Call train_test_split on the data and capture the results
# ============================================================================

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 6858)
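
The sketch below repeats the split on a ten-row toy frame to show what the four returned objects contain. The toy columns are assumptions chosen for illustration; the `test_size` and `random_state` values mirror the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"DepDelay": range(10), "ArrDelay": range(10, 20)})

X = toy              # full table, target column included
y = toy["ArrDelay"]  # target on its own

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=6858
)

print(X_train.shape, X_test.shape)
print("ArrDelay" in X_train.columns)
```

Because `X` still contains `ArrDelay`, the split frames are complete tables that can be exported with the target column in place. Fitting a local scikit-learn estimator on `X_train` as-is would leak the target, so that workflow would use `X.drop(columns="ArrDelay")` instead.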


In [16]:
X_train.to_csv(path + "interim/X_train_DataRobot.csv", index = False, header = True)
X_test.to_csv(path + "interim/X_test_DataRobot.csv", index = False, header = True)