# Machine Learning

In this file, instructions how to approach the challenge can be found.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (will be).
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful which columns to use and which don't. For example, DEP_DELAY is going to be the perfect predictor, but we can't use it because in real-life scenario, we want to predict the delay before the flight takes of --> We can use average delay from earlier days but not the one from the actual flight we predict.  

For example, variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors as well. However, we can create various transformations from earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

## Feature Engineering

Feature engineering will play a crucial role in this problems. We have only very little attributes so we need to create some features that will have some predictive power.

- weather: we can use some weather API to look for the weather in time of the scheduled departure and scheduled arrival.
- statistics (avg, mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.

In [5]:
import pandas as pd
import numpy as np

In [6]:
flights = pd.read_csv('flights_big.csv')

* dropping massive null columns

In [7]:
#flights.isnull().sum()

In [8]:
flights1 = flights.drop(['carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay',
                          'first_dep_time', 'total_add_gtime', 'longest_add_gtime','cancellation_code', 
                          'no_name', 'dup', ], axis = 1)

In [9]:
flights1.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time',
       'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in',
       'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled', 'diverted',
       'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights',
       'distance'],
      dtype='object')

* dropping columns that mean the same thing and certain columns that don't mean anything

In [10]:
flights1 = flights1.drop(['mkt_carrier', 'op_unique_carrier', 'flights', 'mkt_carrier_fl_num', 
                          'tail_num', 'branded_code_share', 'op_carrier_fl_num', 'cancelled', 'diverted',
                         'wheels_on', 'wheels_off'], axis = 1)

In [11]:
#flights1.isnull().sum()

* filling Nan with modes for categorical data

In [12]:
dep_time_mode = flights1['dep_time'].mode() # one mode
arr_time_mode = flights1['arr_time'].mode() # one mode

In [13]:
flights1['dep_time'] = flights1['dep_time'].fillna(int(dep_time_mode))
flights1['arr_time'] = flights1['arr_time'].fillna(int(arr_time_mode))

* filling nan with means for continuous data

In [14]:
flights1['arr_delay'] = flights1['arr_delay'].fillna(flights1['arr_delay'].mean())
flights1['dep_delay'] = flights1['dep_delay'].fillna(flights1['dep_delay'].mean())
flights1['taxi_out'] = flights1['taxi_out'].fillna(flights1['taxi_out'].mean())
flights1['taxi_in'] = flights1['taxi_in'].fillna(flights1['taxi_in'].mean())
flights1['crs_elapsed_time'] = flights1['crs_elapsed_time'].fillna(flights1['crs_elapsed_time'].mean())
flights1['actual_elapsed_time'] = flights1['actual_elapsed_time'].fillna(flights1['actual_elapsed_time'].mean())
flights1['air_time'] = flights1['air_time'].fillna(flights1['air_time'].mean())

* adding month, day of the week, day of the month

In [15]:
flights1['fl_date'] = pd.to_datetime(flights1['fl_date'], errors='coerce')
flights1['month'] = flights1['fl_date'].dt.month
flights1['day_of_week'] = flights1['fl_date'].dt.dayofweek
flights1['day_of_month'] = flights1['fl_date'].dt.day
flights1['year'] = flights1['fl_date'].dt.year

* splitting origin_city_name and dest_city_name from its short version name

In [16]:
flights1[['origin_city_name_only', 'origin_city_name_short']] = flights1['origin_city_name'].str.split(',', expand = True)

In [17]:
flights1[['dest_city_name_only', 'dest_city_name_short']] = flights1['dest_city_name'].str.split(',', expand = True)

In [18]:
#dropping short hand version of city names
flights1 = flights1.drop(['origin_city_name_short', 'dest_city_name_short', 'origin_city_name', 'dest_city_name', 'dest', 'origin'], axis = 1)

In [19]:
flights1['origin_city_name_only'] = flights1['origin_city_name_only'].str.strip().str.lower()
flights1['dest_city_name_only'] = flights1['dest_city_name_only'].str.strip().str.lower()

* make hour departure and arrival feature

In [20]:
flights1['hour_departure'] = flights1['dep_time'].apply(
    lambda x: str(x)[:2] if len(str(x)) == 6 else (str(x)[:1]))

In [21]:
flights1['hour_arrival'] = flights1['arr_time'].apply(
    lambda x: str(x)[:2] if len(str(x)) == 6 else (str(x)[:1]))

In [22]:
# dropping original departure and arrival features
flights1 = flights1.drop(['dep_time', 'arr_time'], axis = 1)

* categorize long, medium and short flights

In [23]:
#Converting to air_time
def flight_duration(x):
    if x <=180:
        return 'Short'
    elif x >180 and x<360:
        return 'Medium'
    elif x>=360:
        return 'Long'

flights1['flight_duration_type']=flights1['air_time'].apply(lambda x: flight_duration(x))
flights1['flight_duration_type'].value_counts()

Short     434745
Medium     63415
Long        1840
Name: flight_duration_type, dtype: int64

In [24]:
#dropping original air_time feature
flights1 = flights1.drop(['air_time', 'actual_elapsed_time'], axis = 1)

* taxi-in categories

In [25]:
def taxi_in_duration(x):
    if x <=15:
        return 'short_taxi_in'
    elif x > 15 and x<50:
        return 'medium_taxi_in'
    else:
        return 'long_taxi_in'

flights1['taxi_in_duration']=flights1['taxi_in'].apply(lambda x: taxi_in_duration(x))
flights1['taxi_in_duration'].value_counts()

short_taxi_in     464354
medium_taxi_in     34623
long_taxi_in        1023
Name: taxi_in_duration, dtype: int64

* taxi-out categories

In [26]:
def taxi_out_duration(x):
    if x <=25:
        return 'short_taxi_out'
    elif x > 25 and x<70:
        return 'medium_taxi_out'
    else:
        return 'long_taxi_out'

flights1['taxi_out_duration']=flights1['taxi_out'].apply(lambda x: taxi_out_duration(x))
flights1['taxi_out_duration'].value_counts()

short_taxi_out     432318
medium_taxi_out     65567
long_taxi_out        2115
Name: taxi_out_duration, dtype: int64

* dropping taxi_in and taxi_out

In [27]:
flights1 = flights1.drop(['taxi_in', 'taxi_out'], axis = 1)

* arr_delay per unique_carrier

In [28]:
mean_arrdelay_carrier = flights1.groupby('mkt_unique_carrier')['arr_delay'].mean()

In [29]:
mean_arrdelay_carrier.name = 'mean_arrdelay_carrier'

In [30]:
flights1 = pd.merge(flights1, mean_arrdelay_carrier, how = 'left', on = ['mkt_unique_carrier'])

* arrival delay per dest_airport id

In [31]:
mean_arrdelay_dest_air = flights1.groupby('dest_airport_id')['arr_delay'].mean()

In [32]:
mean_arrdelay_dest_air.name = 'mean_arrdelay_dest_air'

In [33]:
flights1 = pd.merge(flights1, mean_arrdelay_dest_air, how = 'left', on = ['dest_airport_id'])

* arrival delay per origin_airport id

In [34]:
mean_arrdelay_origin_air = flights1.groupby('origin_airport_id')['arr_delay'].mean()

In [35]:
mean_arrdelay_origin_air.name = 'mean_arrdelay_origin_air'

In [36]:
flights1 = pd.merge(flights1, mean_arrdelay_origin_air, how = 'left', on = ['origin_airport_id'])

## make the types categories from feature engineering

In [37]:
flights1["mkt_unique_carrier"] = flights1["mkt_unique_carrier"].astype("category")
flights1["origin_airport_id"] = flights1["origin_airport_id"].astype("category")
flights1["dest_airport_id"] = flights1["dest_airport_id"].astype("category")
flights1["flight_duration_type"] = flights1["flight_duration_type"].astype("category")
flights1["taxi_in_duration"] = flights1["taxi_in_duration"].astype("category")
flights1["taxi_out_duration"] = flights1["taxi_out_duration"].astype("category")

* encoding categorical data

In [38]:
#encode arrival delay
flights1['arr_delay_cat'] = flights1['arr_delay'].apply(lambda x: 1 if x > 0 else 0)
# ENCODE AIRPORTS AND TAILNUM
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
flights1['mkt_unique_carrier'] = encoder.fit_transform(flights1[['mkt_unique_carrier']])
flights1['origin_airport_id'] = encoder.fit_transform(flights1[['origin_airport_id']])
flights1['dest_airport_id'] = encoder.fit_transform(flights1[['dest_airport_id']])
flights1['taxi_in_duration'] = encoder.fit_transform(flights1[['taxi_in_duration']])
flights1['taxi_out_duration'] = encoder.fit_transform(flights1[['taxi_out_duration']])
flights1['flight_duration_type'] = encoder.fit_transform(flights1[['flight_duration_type']])

* join on weather data

In [39]:
weather = pd.read_csv('weather.csv')

In [40]:
weather['City'] = weather['City'].str.strip().str.lower()

In [41]:
weather['date'] = pd.to_datetime(weather['StartTime(UTC)']).dt.date

In [42]:
weather['EndTime(UTC)'] = pd.to_datetime(weather['EndTime(UTC)'])

In [43]:
temp = weather.groupby(['City', 'date']).max()['EndTime(UTC)']

In [44]:
temp = temp.reset_index()

In [45]:
weather_merged = weather.merge(temp, on = ['City', 'date', 'EndTime(UTC)'], how = 'inner')

In [46]:
weather_merged = weather_merged.rename(columns = {'date': 'fl_date', 'City': 'origin_city_name_only'})

In [47]:
weather_merged = weather_merged.drop(['StartTime(UTC)', 'EndTime(UTC)'], axis = 1)

In [48]:
weather_merged['fl_date'] = pd.to_datetime(weather_merged['fl_date'])

In [49]:
flights_weather = flights1.merge(weather_merged, on = ['origin_city_name_only', 'fl_date'], how = 'left') 

* dropping flights_weather row null values, origin_city_name_only, dest_city_name_only

In [50]:
flights_weather = flights_weather.dropna()

In [51]:
flights_weather = flights_weather.drop(['fl_date', 'origin_city_name_only', 'dest_city_name_only'], axis = 1)

* Categorize and encode weather and severity

In [52]:
flights_weather["Type"] = flights_weather["Type"].astype("category")
flights_weather["Severity"] = flights_weather["Severity"].astype("category")

In [53]:
flights_weather['Type'] = encoder.fit_transform(flights_weather[['Type']])
flights_weather['Severity'] = encoder.fit_transform(flights_weather[['Severity']])

* dropping dep_delay

In [54]:
flights_weather = flights_weather.drop('dep_delay', axis = 1)

In [55]:
flights_weather.shape

(133948, 23)

In [52]:
flights_weather.columns

Index(['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'crs_arr_time', 'arr_delay', 'crs_elapsed_time',
       'distance', 'month', 'day_of_week', 'day_of_month', 'year',
       'hour_departure', 'hour_arrival', 'flight_duration_type',
       'taxi_in_duration', 'taxi_out_duration', 'mean_arrdelay_carrier',
       'mean_arrdelay_dest_air', 'mean_arrdelay_origin_air', 'arr_delay_cat',
       'Type', 'Severity'],
      dtype='object')

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA conponents?

* PCA testing

In [58]:
'''from sklearn.decomposition import PCA
import matplotlib.pyplot as plt'''

'from sklearn.decomposition import PCA\nimport matplotlib.pyplot as plt'

### Modeling

Use different ML techniques to predict each problem.

- linear
- Naive Bayes
- Random Forest Regressor
- SVM classification
- XGBoost regresspr
- The ensemble of your own choice

#### pickle module to save model

In [59]:
import pickle

* target and features

In [57]:
y = np.array(flights_weather.arr_delay)

In [58]:
y_cat = np.array(flights_weather.arr_delay_cat)

In [59]:
X_df = flights_weather.drop(['arr_delay', 'arr_delay_cat'], axis = 1)

In [62]:
X_df.shape

(133948, 21)

In [61]:
X_df.columns

Index(['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'crs_arr_time', 'crs_elapsed_time', 'distance', 'month',
       'day_of_week', 'day_of_month', 'year', 'hour_departure', 'hour_arrival',
       'flight_duration_type', 'taxi_in_duration', 'taxi_out_duration',
       'mean_arrdelay_carrier', 'mean_arrdelay_dest_air',
       'mean_arrdelay_origin_air', 'Type', 'Severity'],
      dtype='object')

In [63]:
X = np.array(X_df)

* train test split and making samples

In [64]:
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70,test_size=0.30, random_state=101, shuffle = True)

In [65]:
#creating y_train_cat and y_test_cat
X_train, X_test, y_train_cat, y_test_cat = model_selection.train_test_split(X, y_cat, train_size=0.70,test_size=0.30, random_state=101, shuffle = True)

In [66]:
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.1).values

In [67]:
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.1).values

In [68]:
y_trainCat_sample = pd.DataFrame(y_train_cat).sample(frac = 0.1).values

In [69]:
X_test_sample = pd.DataFrame(X_test).sample(frac = 0.1).values

* Scaling

In [70]:
'''scaler = StandardScaler()
X = scaler.fit_transform(X)'''

'scaler = StandardScaler()\nX = scaler.fit_transform(X)'

In [71]:
'''y = (y - y.mean()) / y.std()'''

'y = (y - y.mean()) / y.std()'

In [72]:
from sklearn.metrics import r2_score

* Scaling PIPE

In [73]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [74]:
pipe = Pipeline([('scaler', StandardScaler())])
pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler())])

#### Linear

In [75]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

In [76]:
#logistic regression
linear_model = LinearRegression()

In [77]:
linear_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('linear_model', LinearRegression())])

In [78]:
linear_pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_model', LinearRegression())])

In [79]:
y_pred_linear = linear_pipe.predict(X_test)

In [80]:
r2_score = linear_pipe.score(X_test, y_test)

In [81]:
r2_score

0.08445854757542315

* pickling

In [82]:
with open('model_linear_pickle', 'wb') as linear_file:
    pickle.dump(linear_pipe, linear_file)

In [83]:
with open('model_linear_pickle', 'rb') as linear_file:
    model_linear = pickle.load(linear_file)

### Linear_ElasticNet

In [93]:
from sklearn.linear_model import ElasticNet
elasnet_model = ElasticNet()

In [94]:
hyperparameters ={
    'elasnet_model__alpha': [1],
    'elasnet_model__l1_ratio': [0.0001]
}

In [95]:
elasnet_model_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('elasnet_model', ElasticNet())])
elasnet_model_grid = GridSearchCV(estimator=elasnet_model_pipe, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

* test with sample to find best hyperparameters

In [96]:
'''elasnet_model_grid.fit(X_train_sample, y_trainCat_sample.ravel())
y_pred_randFor_class = elasnet_model_grid.predict(X_test_sample)'''

'elasnet_model_grid.fit(X_train_sample, y_trainCat_sample.ravel())\ny_pred_randFor_class = elasnet_model_grid.predict(X_test_sample)'

In [97]:
'''elasnet_model_grid.best_estimator_'''

'elasnet_model_grid.best_estimator_'

* adjusted to best hyperparameters

In [98]:
elasnet_model_grid.fit(X_train, y_train_cat.ravel())

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('elasnet_model', ElasticNet())]),
             param_grid={'elasnet_model__alpha': [1],
                         'elasnet_model__l1_ratio': [0.0001]},
             scoring='r2')

In [99]:
y_pred_randFor_class = elasnet_model_grid.predict(X_test)

In [100]:
elasnet_model_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('elasnet_model', ElasticNet(alpha=1, l1_ratio=0.0001))])

In [101]:
r2_score = elasnet_model_grid.score(X_test, y_test_cat)

0.09565149757045965

* pickling

In [102]:
with open('model_linear_Ela_pickle', 'wb') as linear_Elas_file:
    pickle.dump(elasnet_model_grid, linear_Elas_file)

In [103]:
with open('model_linear_Ela_pickle', 'rb') as linear_Elas_file:
    model_elastic_linear = pickle.load(linear_Elas_file)

### Polyfeats

In [98]:
from sklearn.preprocessing import PolynomialFeatures

In [99]:
poly_model = PolynomialFeatures()

In [120]:
poly_model_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('poly_model', PolynomialFeatures(2)), ('linear_reg', LinearRegression())])

* test with sample to find best hyperparameters

In [122]:
poly_model_pipe.fit(X_train_sample, y_train_sample.ravel())
y_pred_poly = poly_model_pipe.predict(X_test_sample)

In [124]:
poly_model_pipe

Pipeline(steps=[('scaler', StandardScaler()),
                ('poly_model', PolynomialFeatures()),
                ('linear_reg', LinearRegression())])

In [125]:
r2_score = poly_model_pipe.score(X_test, y_test_cat)
r2_score

-514.8836521674519

#### Naives Bayes, GaussianNB Naive Bayes

In [92]:
from sklearn.naive_bayes import GaussianNB

In [93]:
NB_Gauss_model = GaussianNB()

In [94]:
NB_Gauss_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('NB_Gauss_model', GaussianNB())])

In [95]:
NB_Gauss_pipe.fit(X_train, y_train_cat)

Pipeline(steps=[('scaler', StandardScaler()), ('NB_Gauss_model', GaussianNB())])

In [96]:
y_pred_NBGauss = NB_Gauss_pipe.predict(X_test)

In [97]:
r2_score = NB_Gauss_pipe.score(X_test, y_test_cat)
r2_score

0.6662933930571109

* pickling

In [440]:
with open('model_Gaussian_pickle', 'wb') as Gaussian_file:
    pickle.dump(NB_Gauss_pipe, Gaussian_file)

In [441]:
with open('model_Gaussian_pickle', 'rb') as Gaussian_file:
    model_Gaussian = pickle.load(Gaussian_file)

#### Forrest Classifier

In [322]:
from sklearn.ensemble import RandomForestClassifier

In [361]:
random_forest_class = RandomForestClassifier()

In [362]:
hyperparameters ={
    'rand_forrestClass__n_estimators': [8000, 10000],
    'rand_forrestClass__max_depth': [10],
    'rand_forrestClass__bootstrap': [False]
}

In [363]:
Forest_class_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('rand_forrestClass', RandomForestClassifier())])
forest_class_grid = GridSearchCV(estimator=Forest_class_pipe, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

* test with sample to find best hyperparameters

In [154]:
rand_forrestClass_grid = GridSearchCV(estimator=random_forest_class, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

In [None]:
forest_class_grid.fit(X_train_sample, y_trainCat_sample.ravel())
y_pred_randFor_class = forest_class_grid.predict(X_test_sample)

In [None]:
forest_class_grid.best_estimator_

* adjusted to best hyperparameters

In [None]:
forest_class_grid.fit(X_train, y_train_cat.ravel())

In [368]:
y_pred_randFor_class = forest_class_grid.predict(X_test)

In [None]:
forest_class_grid.best_estimator_

In [375]:
r2_score = forest_class_grid.score(X_test, y_test_cat)
r2_score

-0.5616250137111254

* pickling

In [380]:
with open('model_Forrest_class_pickle', 'wb') as forest_class_file:
    pickle.dump(forest_class_grid, forest_class_file)

In [442]:
with open('model_Forrest_class_pickle', 'rb') as forest_class_file:
    model_forrestClass = pickle.load(forest_class_file)

#### Random Forest Regressor

In [390]:
from sklearn.ensemble import RandomForestRegressor

In [391]:
random_forest_regression_model = RandomForestRegressor()

In [392]:
hyperparameters = {
    'rand_forrestReg__n_estimators': [1000],
    'rand_forrestReg__max_depth': [2],
    'rand_forrestReg__min_samples_split':[6], 
    'rand_forrestReg__bootstrap':[True],
    'rand_forrestReg__criterion' :['mse']
}

In [393]:
Forest_reg_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('rand_forrestReg', RandomForestRegressor())])
forest_reg_grid = GridSearchCV(estimator=Forest_reg_pipe, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

* test with sample to find best hyperparameters

In [386]:
'''forest_reg_grid.fit(X_train_sample, y_train_sample.ravel())
y_pred_randFor_class = forest_reg_grid.predict(X_test_sample)'''

In [387]:
forest_reg_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('rand_forrestReg',
                 RandomForestRegressor(max_depth=2, min_samples_split=6,
                                       n_estimators=1000))])

* adjusted to best hyperparameters

In [394]:
forest_reg_grid.fit(X_train, y_train)
y_pred_randFo = forest_reg_grid.predict(X_test)

In [395]:
forest_reg_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('rand_forrestReg',
                 RandomForestRegressor(max_depth=2, min_samples_split=6,
                                       n_estimators=1000))])

In [398]:
r2_score = forest_reg_grid.score(X_test, y_test)

In [399]:
r2_score

-0.004472808329518907

* pickling

In [400]:
with open('model_Forrest_reg_pickle', 'wb') as forest_reg_file:
    pickle.dump(forest_reg_grid, forest_reg_file)

In [401]:
with open('model_Forrest_reg_pickle', 'rb') as forest_reg_file:
    model_Forrest_reg_pickle = pickle.load(forest_reg_file)

#### SVM classification

In [402]:
from sklearn.svm import SVC, SVR, LinearSVC

In [403]:
model_svm_class = SVC()

In [408]:
hyperparameters = {'model_svm_class__kernel': ['linear'],
                   'model_svm_class__C':[1],
                   'model_svm_class__degree':[1]
}
SVC_grid_class = GridSearchCV(estimator=model_svm_class, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

In [409]:
svm_class_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model_svm_class', SVC())])
smv_class_grid = GridSearchCV(estimator=svm_class_pipe, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

* test with sample to find best hyperparameters

In [406]:
smv_class_grid.fit(X_train_sample, y_trainCat_sample.ravel())
y_pred_svm = smv_class_grid.predict(X_test_sample)

In [407]:
smv_class_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('model_svm_class', SVC(C=1, degree=1, kernel='linear'))])

* adjusted to best hyperparameters

In [410]:
smv_class_grid.fit(X_train,y_train_cat)
y_pred_svc_cat = smv_class_grid.predict(X_test)

In [412]:
smv_class_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('model_svm_class', SVC(C=1, degree=1, kernel='linear'))])

In [414]:
r2_score = smv_class_grid.score(X_test, y_test_cat)

In [415]:
r2_score

-0.5673981191222572

* pickling

In [416]:
with open('model_smv_class_pickle', 'wb') as smv_class_file:
    pickle.dump(smv_class_grid, smv_class_file)

In [439]:
with open('model_smv_class_pickle', 'rb') as smv_class_file:
    model_smv_class_pickle = pickle.load(smv_class_file)

#### XGBoost regressor

In [423]:
import xgboost as xgb

In [419]:
xg_reg = xgb.XGBRegressor()

In [431]:
hyperparameters = {
    'xg_reg__objective' : ['reg:squarederror'],
    'xg_reg__colsample_bytree':[0.1],
    'xg_reg__n_estimators': [3000],
    'xg_reg__max_depth': [4],
    'xg_reg__learning_rate': [0.0001],
    'xg_reg__alpha': [5]
}

In [425]:
svm_class_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('xg_reg', xgb.XGBRegressor())])
xgb_grid = GridSearchCV(estimator=svm_class_pipe, param_grid=hyperparameters, scoring = 'r2', verbose=0, cv= 5)

* test with sample to find best hyperparameters

In [429]:
xgb_grid.fit(X_train_sample, y_train_sample.ravel())
y_pred_xbg = xgb_grid.predict(X_test_sample)

In [430]:
xgb_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('xg_reg',
                 XGBRegressor(alpha=5, base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.1, gamma=0, gpu_id=-1,
                              importance_type='gain',
                              interaction_constraints='', learning_rate=0.0001,
                              max_delta_step=0, max_depth=4, min_child_weight=1,
                              missing=nan, monotone_constraints='()',
                              n_estimators=3000, n_jobs=8, num_parallel_tree=1,
                              random_state=0, reg_alpha=5, reg_lambda=1,
                              scale_pos_weight=1, subsample=1,
                              tree_method='exact', validate_parameters=1,
                              verbosity=None))])

* adjusted to best hyperparameters

In [432]:
xgb_grid.fit(X_train,y_train)
y_pred_xbg = xgb_grid.predict(X_test)

In [434]:
xgb_grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('xg_reg',
                 XGBRegressor(alpha=30, base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.1, gamma=0, gpu_id=-1,
                              importance_type='gain',
                              interaction_constraints='', learning_rate=0.005,
                              max_delta_step=0, max_depth=2, min_child_weight=1,
                              missing=nan, monotone_constraints='()',
                              n_estimators=500, n_jobs=8, num_parallel_tree=1,
                              random_state=0, reg_alpha=30, reg_lambda=1,
                              scale_pos_weight=1, subsample=1,
                              tree_method='exact', validate_parameters=1,
                              verbosity=None))])

In [436]:
r2_score = xgb_grid.score(X_test, y_test)
r2_score

0.0011911026682224213

* pickling

In [437]:
with open('model_xgb_grid_pickle', 'wb') as xgb_reg_file:
    pickle.dump(xgb_grid, xgb_reg_file)

In [438]:
with open('model_xgb_grid_pickle', 'rb') as xgb_reg_file:
    model_xgb_grid_pickle = pickle.load(xgb_reg_file)

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. Sample submission can be found in the file **_sample_submission.csv_**

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuos. For each flight that was delayed, we need to have one of these variables as 1 and others 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be huge class imbalance. We have only very little cancelled flights with comparison to all flights. It is important to do the right sampling before training and to choose correct evaluation metrics.