# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane is delayed, we will predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [None]:
# Save intermediate tables (run only after X, df and df_transformed exist further down)
X.to_csv('X.csv', index=False)
df.to_csv('df.csv', index=False)
df_transformed.to_csv('df_transformed.csv', index=False)

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()

import statsmodels.api as sm
import sklearn

import xgboost as xgb
from sklearn.metrics import mean_squared_error

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
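
As a rough illustration of "transformations from earlier values", the sketch below builds a feature from delays that were already known one week before each flight. The toy table, the grouping by `origin` and the feature name `mean_delay_7d_ago` are assumptions for illustration, not part of the challenge data description.

```python
import pandas as pd

# Toy data standing in for the flights table (column names are assumptions).
flights = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-10", "2019-01-11"]
    ),
    "origin": ["JFK", "JFK", "JFK", "JFK", "JFK"],
    "arr_delay": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Daily mean arrival delay per origin airport.
daily = (
    flights.groupby(["origin", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .rename(columns={"arr_delay": "mean_delay_7d_ago"})
)

# Shift the dates forward by 7 days so that each flight is matched with the
# mean delay observed exactly one week before its own date.
daily["fl_date"] = daily["fl_date"] + pd.Timedelta(days=7)

# Left merge keeps every flight; flights without history get NaN.
flights = flights.merge(daily, on=["origin", "fl_date"], how="left")
print(flights)
```

Flights with no observation seven days earlier end up with a missing value, which would still need to be imputed before modeling.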

In [None]:
# Load table
df_numeric = pd.read_csv('../data/df_numeric_with_delays.csv')
df_weather = pd.read_csv('../data/df_weather_aux.csv')
df = df_numeric.copy()
df.head()

In [None]:
df_weather.head()

Drop the cheater-pants variables:

In [None]:
day_of_delays = [
    'carrier_delay',
    'weather_delay',
    'nas_delay',
    'security_delay',
    'crs_elapsed_time',
    'late_aircraft_delay',
    'dep_delay'
]

df.drop(columns=day_of_delays, inplace=True)

Assign the target variable `arr_delay` to `y`:

In [None]:
# Assign target variable
y = df.arr_delay

# Then drop it from the table:
df.drop(['arr_delay'], axis= 1, inplace= True)
df.head()

In [None]:
df.drop(['fl_date'], axis= 1, inplace= True)
df.head()

In [None]:
df.isna().sum()

### Feature Engineering

Feature engineering will play a crucial role in this problem. We have very few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
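
A few of the ideas above can be sketched on a toy table. This is only an illustration: it assumes `crs_dep_time` is an integer in HHMM format and that `origin` and `fl_date` columns are available; the new feature names are hypothetical.

```python
import pandas as pd

# Toy rows; crs_dep_time is assumed to be an integer in HHMM format.
flights = pd.DataFrame({
    "fl_date": ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02"],
    "origin": ["JFK", "JFK", "LAX", "JFK"],
    "crs_dep_time": [615, 1830, 945, 2359],
})

# Time of day: scheduled departure hour (0-23).
flights["dep_hour"] = flights["crs_dep_time"] // 100

# Airport traffic: number of scheduled departures per origin and day.
flights["origin_daily_flights"] = (
    flights.groupby(["origin", "fl_date"])["crs_dep_time"].transform("size")
)

# Airport encoding: share of all flights that leave from each origin.
origin_freq = flights["origin"].value_counts(normalize=True)
flights["origin_freq"] = flights["origin"].map(origin_freq)

print(flights)
```

Frequency encoding is just one option for the airports; one-hot or target encoding are alternatives worth comparing.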

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA components?
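
To get a feel for the PCA option, here is a minimal sketch on synthetic data (not the flights table). The 95% explained-variance target is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features; the second column nearly duplicates the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(scale=0.1, size=200),
    base[:, 1],
    rng.normal(size=200),
])

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps enough components to reach that share of variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X_demo.shape, "->", X_pca.shape)
print(pca.explained_variance_ratio_.round(3))
```

The same models can then be trained once on the original features and once on `X_pca` to compare their scores.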

##### REMOVING FEATURES WITH SMALL VARIANCE 

Remove columns with little variance, which are likely to have little predictive power.


(_Referenced: [W5Em13_Variable_selection](../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W5/W5Em13_Variable_selection.ipynb)_)

In [None]:
import sklearn
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(0.1)
df_transformed = vt.fit_transform(df)

To see how many columns were dropped:

In [None]:
print(df.shape)
print(df_transformed.shape)

In [None]:
# Columns we have selected:
# get_support() returns a boolean mask with one entry per original column.
selected_columns = df.columns[vt.get_support()]
print(selected_columns)

# transforming the np.array back to a DataFrame preserves column labels
df_transformed = pd.DataFrame(df_transformed, columns = selected_columns)
df_transformed.head()

##### REMOVING CORRELATED FEATURES:

In [None]:
# STEP 1: Correlation matrix
df_corr = df_transformed.corr().abs()

Using `0.8` as the correlation threshold:

In [None]:
# STEP 2: find pairs of highly correlated features
indices = np.where(df_corr > 0.8)
# x < y keeps each pair once and skips the diagonal
indices = [(df_corr.index[x], df_corr.columns[y])
           for x, y in zip(*indices)
           if x < y]

The try-except logic lets the code continue when a `KeyError` occurs, which happens when a feature appears in more than one highly correlated pair and has already been dropped.

In [None]:
# STEP 3: Removing the highly correlated columns
for idx in indices: #each pair
    try:
        df_transformed.drop(idx[1], axis = 1, inplace=True)
    except KeyError:
        pass
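
An alternative that avoids the try-except is to read the columns to drop from the upper triangle of the correlation matrix. The sketch below uses a synthetic frame (`demo` and its columns are made up for illustration) with the same `0.8` threshold.

```python
import numpy as np
import pandas as pd

# Toy frame: "b" is a near-copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})

corr = demo.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each
# pair of features is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
demo_reduced = demo.drop(columns=to_drop)

print(to_drop)
```

Because each column is listed at most once in `to_drop`, a single `drop` call is enough.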

In [None]:
# The correlated pairs are:
indices

In [None]:
# Recheck shape of table:
print(df_transformed.shape)
df_transformed.head()

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

### LINEAR REGRESSION

(_Referenced: [W3D3L Statistical Modeling Demo](../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W3/W3D3L-Statistical_Modeling_Demo.ipynb)_)

With `arr_delay` as our dependent variable ($y$) and `_____` and `_______` as our independent variables ($x_1$ and $x_2$), this multiple linear regression model uses the relationship:

$$
y=b_0 + b_1x_1 + b_2x_2
$$

> Note that if we want an intercept ($b_0$) in a `statsmodels` `OLS` model, we need to call the statsmodels `add_constant` function before fitting the model.

In [None]:
# Adds a column of 1's so the model will contain an intercept
X = df.copy()
X = sm.add_constant(X) 
X.head()

In [None]:
X = X.fillna(0)

In [None]:
# Instantiate linear regression
lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

The P>|t| value for `crs_arr_time` is 0.255, well above the conventional 0.05 significance level, so we can drop that variable and re-test:

In [None]:
X.drop('crs_arr_time', axis=1, inplace=True)

In [None]:
# Instantiate linear regression
lin_reg = sm.OLS(y,X)

model = lin_reg.fit()
print_model = model.summary()
print(print_model)

In [None]:
# Import our model:
from sklearn.linear_model import LinearRegression

Initialize the object and fit the model on our data:

In [None]:
regressor = LinearRegression()
regressor.fit(X, y)

In [None]:
# Check the beta coefficients:
print(regressor.coef_)

### NAIVE BAYES


In [None]:
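from sklearn.naive_bayes import GaussianNB

# GaussianNB is a classifier: it would treat every distinct arr_delay value as
# its own class. Assumption: label a flight as delayed when arr_delay > 0.
y_delayed = (y > 0).astype(int)

model = GaussianNB()
model.fit(X, y_delayed);

In [None]:
# Predict on rows that have the same columns the model was trained on
Xnew = X.head(2000)
ynew = model.predict(Xnew)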

### RANDOM FOREST
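
This section is still empty. As a starting point, here is a minimal sketch of a random forest regressor on synthetic data. The hyperparameters (`n_estimators=100`, `max_depth=8`) are placeholder assumptions, and `X_demo` / `y_demo` stand in for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(500, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Placeholder hyperparameters; tune them on the real data.
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=123)
rf_reg.fit(X_tr, y_tr)

rf_rmse = np.sqrt(mean_squared_error(y_te, rf_reg.predict(X_te)))
print(f"RMSE: {rf_rmse:.3f}")
```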

### SVM
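
This section is also empty. A hedged sketch of support vector regression follows, again on synthetic data. Features are scaled in a pipeline because SVMs are sensitive to feature scale; the values `C=10.0` and `epsilon=0.1` are assumptions, not tuned settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Simple hold-out: first 240 rows for training, last 60 for testing.
X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

# Scale the features, then fit an RBF-kernel SVR.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X_tr, y_tr)

svr_preds = svr.predict(X_te)
svr_rmse = np.sqrt(np.mean((y_te - svr_preds) ** 2))
print(f"RMSE: {svr_rmse:.3f}")
```

Kernel SVMs scale poorly with the number of rows, so on the full flights table a subsample may be needed.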

### XGBOOST


_(Referenced: [W6D5m15_Using_XGBoost](../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W6/W6D5m15_using_XGBoost.ipynb))_

In [None]:
# Generate Dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)


Using `train_test_split` to create the training and test sets for a hold-out evaluation:

- `test_size`: 20% of the rows are held out for testing
- `random_state`: fixed for reproducibility

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
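
Because the models are evaluated on flights one week in the future, a chronological split may give a more realistic estimate than a random one. The sketch below assumes a table that still contains `fl_date` (it is dropped earlier in this notebook, so the split would have to happen before that step); the `distance` feature and the 7-day hold-out are illustrative choices.

```python
import pandas as pd

# Toy table; fl_date and arr_delay mirror the notebook's column names.
flights = pd.DataFrame({
    "fl_date": pd.date_range("2019-12-01", periods=31, freq="D"),
    "distance": range(100, 131),
    "arr_delay": [float(i % 7) for i in range(31)],
})

# Hold out the final 7 days as the test set.
cutoff = flights["fl_date"].max() - pd.Timedelta(days=7)
train = flights[flights["fl_date"] <= cutoff]
test = flights[flights["fl_date"] > cutoff]

X_train_t, y_train_t = train[["distance"]], train["arr_delay"]
X_test_t, y_test_t = test[["distance"]], test["arr_delay"]

print(len(train), "training rows,", len(test), "test rows")
```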

Instantiating an XGBoost regressor:

In [None]:
xg_reg = xgb.XGBRegressor(
        objective = 'reg:squarederror'  # Loss function
      , colsample_bytree = 0.3  # Fraction of features sampled per tree
      , learning_rate = 0.1  # Step-size shrinkage to reduce overfitting, range [0, 1]
      , max_depth = 5  # Maximum depth of each tree
      , alpha = 10  # L1 regularization on leaf weights
      , n_estimators = 10  # Number of trees to build
)

>The class tutorial used `reg:linear`; the objective was changed to `reg:squarederror` because of this warning:
>```
>reg:linear is now deprecated in favor of reg:squarederror.
>```

In [None]:
# Fit the training set with .fit():
xg_reg.fit(X_train,y_train)

# Make predictions with .predict():
preds = xg_reg.predict(X_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

In [None]:
# Hyper-parameter dictionary
params = {
      "objective":"reg:squarederror"
    , 'colsample_bytree': 0.3
    , 'learning_rate': 0.1
    , 'max_depth': 5
    , 'alpha': 10
}

# 3-fold cross validation model:
cv_results = xgb.cv(
      dtrain = data_dmatrix
    , params = params
    , nfold = 3
    , num_boost_round = 50
    , early_stopping_rounds = 10
    , metrics = "rmse"
    , as_pandas = True
    , seed = 123
)



In [None]:
# Train and test RMSE metrics for each boosting round.
cv_results.head()

In [None]:
print((cv_results["test-rmse-mean"]).tail(1))

In [None]:
xg_reg = xgb.train(
      params = params
    , dtrain = data_dmatrix
    , num_boost_round = 10
)

In [None]:
# Set the figure size before plotting so that it takes effect
plt.rcParams['figure.figsize'] = [10, 10]
xgb.plot_importance(xg_reg)
plt.show()

In [None]:
fig = plt.scatter(x= y_test, y= preds, alpha=0.3)

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**

_(Referenced: [W6D4m9_Model_evaluation](../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W6/W6D4m9_model_evaluation.ipynb))_

In [None]:
# import MSE from sklearn
from sklearn.metrics import mean_squared_error

# compute MSE
MSE = mean_squared_error(y_test, preds)  

# print MSE
print(MSE)
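
For the regression problem it can help to report several metrics side by side. The sketch below uses toy arrays in place of `y_test` and `preds`; RMSE and MAE are in minutes, while R² is unitless.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values standing in for y_test and preds.
y_true = np.array([10.0, -5.0, 0.0, 30.0])
y_pred = np.array([8.0, -3.0, 4.0, 20.0])

# RMSE penalizes large errors more heavily than MAE does.
rmse_demo = np.sqrt(mean_squared_error(y_true, y_pred))
mae_demo = mean_absolute_error(y_true, y_pred)
r2_demo = r2_score(y_true, y_pred)

print(f"RMSE: {rmse_demo:.3f}  MAE: {mae_demo:.3f}  R2: {r2_demo:.3f}")
```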

In [None]:
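# import accuracy_score from sklearn
from sklearn.metrics import accuracy_score

# accuracy is a classification metric and cannot score continuous predictions.
# Assumption: compare "delayed" labels (more than 0 minutes late) instead.
accuracy = accuracy_score(y_test > 0, preds > 0)

# print accuracy
print(accuracy)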

In [None]:
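# import f1_score from sklearn
from sklearn.metrics import f1_score

# F1 is a classification metric too, so the same "delayed" labels are assumed.
# The result is stored in `f1` so that the imported function is not overwritten.
f1 = f1_score(y_test > 0, preds > 0)

# print F1-score
print(f1)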

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to have one of these variables as 1 and the others as 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
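
The rule above can be sketched with `idxmax` on a toy table. The lowercase column names follow the ones used earlier in this notebook; the example rows are made up.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy rows: a single-cause delay, a mixed delay, and an on-time flight.
delays = pd.DataFrame(
    [[20.0, 0.0, 0.0, 0.0, 0.0],
     [5.0, 0.0, 30.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Keep only flights with at least one positive delay component.
filled = delays[delay_cols].fillna(0)
delayed = filled[(filled > 0).any(axis=1)]

# The column with the largest value becomes the class label.
delay_type = delayed.idxmax(axis=1)

# One-hot version: 1 for the largest delay type, 0 for the others.
delay_onehot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)

print(delay_type)
```

On exact ties `idxmax` returns the first column in `delay_cols`, so the column order acts as a tie-breaker.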

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: we have very few cancelled flights in comparison with all flights. It is important to do the right sampling before training and to choose correct evaluation metrics.
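
One way to handle the imbalance without resampling is class weighting, combined with metrics that focus on the rare class. The sketch below uses synthetic data with a small share of positives; the logistic regression model and the choice of F1 and average precision are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with a small share of positives standing in for CANCELLED.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 3))
logits = 2.0 * X_demo[:, 0] - 5.0
y_demo = (rng.random(5000) < 1 / (1 + np.exp(-logits))).astype(int)

# stratify keeps the class ratio the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
f1_demo = f1_score(y_te, clf.predict(X_te))
ap_demo = average_precision_score(y_te, proba)

print("Positive rate:", y_te.mean())
print("F1:", f1_demo)
print("Average precision:", ap_demo)
```

Plain accuracy would look high here even for a model that never predicts a cancellation, which is why it is left out.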