Importing the <code>pandas</code> and <code>NumPy</code> libraries and loading the dataset.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/b_min/Desktop/Capstone/finallycleansedflights.csv')

Checking the column names using <code>head()</code>.

In [2]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,Date,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,...,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,Delayed or Cancelled,Delayed or Late
0,0,2015-01-01,4,AS,98,N407AS,ANC,SEA,00:05:00,23:54:00,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,-22.0


Defining the independent (predictor) variables x and the dependent (target) variable y.

In [3]:
x = df[['DEPARTURE_DELAY','TAXI_OUT','SCHEDULED_TIME','ELAPSED_TIME','AIR_TIME','DISTANCE','TAXI_IN',
        'DIVERTED','CANCELLED','CANCELLATION_REASON','AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY',
        'LATE_AIRCRAFT_DELAY','WEATHER_DELAY','WHEELS_ON','WHEELS_OFF']]

y = df['ARRIVAL_DELAY']

Splitting the dataset into train and test subsets using the <code>train_test_split</code> function, holding out 20% of the rows for testing.

In [4]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
print(xtrain.shape, xtest.shape)
print(ytrain.shape, ytest.shape)

(278623, 17) (69656, 17)
(278623,) (69656,)
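
As a quick sanity check on how the split divides rows, here is a minimal sketch on a small synthetic array (not the flights data). The array sizes and the names `x_demo`, `y_demo`, `xtr`, `xte`, `ytr` and `yte` are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flights data: 100 rows, 17 predictor columns.
x_demo = np.arange(100 * 17, dtype=float).reshape(100, 17)
y_demo = np.arange(100, dtype=float)

xtr, xte, ytr, yte = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

# 80% of the rows go to training and 20% to testing; the column count is unchanged.
print(xtr.shape, xte.shape)
print(ytr.shape, yte.shape)
```

Printing all four shapes makes it easy to confirm that the predictor and target subsets line up row for row.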


## MLR

Creating the linear regression object and fitting the model using the <code>fit()</code> method.

In [5]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model = lm.fit(xtrain, ytrain)

Retrieving the intercept b0 using <code>.intercept_</code>. Here it is effectively zero.

In [6]:
lm.intercept_

1.1102230246251565e-13

Retrieving the coefficients b1, b2, ..., b17 (one per predictor, in the column order of x) using <code>.coef_</code>

In [7]:
lm.coef_

array([ 1.00000000e+00,  2.50000000e-01, -1.00000000e+00,  7.50000000e-01,
        2.50000000e-01, -8.21672494e-17,  2.50000000e-01,  0.00000000e+00,
       -1.11022302e-16, -1.11022302e-16, -1.79533501e-16, -3.30269586e-16,
       -1.10561693e-15, -1.33679964e-15, -6.15991863e-16,  4.46031149e-15,
       -4.79750494e-15])
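
The raw array is easier to read once each coefficient is paired with its predictor name. The sketch below copies the column order used to build x and the coefficient values printed above, then drops terms that round to zero. The names `features`, `coef`, `coef_by_feature` and `nonzero` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Predictor names in the order used to build x.
features = ['DEPARTURE_DELAY', 'TAXI_OUT', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME',
            'DISTANCE', 'TAXI_IN', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON',
            'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
            'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY', 'WHEELS_ON', 'WHEELS_OFF']

# Coefficient values copied from the lm.coef_ output above.
coef = np.array([1.00000000e+00, 2.50000000e-01, -1.00000000e+00, 7.50000000e-01,
                 2.50000000e-01, -8.21672494e-17, 2.50000000e-01, 0.00000000e+00,
                 -1.11022302e-16, -1.11022302e-16, -1.79533501e-16, -3.30269586e-16,
                 -1.10561693e-15, -1.33679964e-15, -6.15991863e-16, 4.46031149e-15,
                 -4.79750494e-15])

# Label each coefficient and round away floating-point noise.
coef_by_feature = pd.Series(coef, index=features).round(2)

# Keep only the terms that are not numerically zero.
nonzero = coef_by_feature[coef_by_feature != 0]
print(nonzero)
```

Only six predictors carry a nonzero weight. If ELAPSED_TIME equals TAXI_OUT + AIR_TIME + TAXI_IN in this dataset (an assumption, though a natural one), these terms add up to DEPARTURE_DELAY + ELAPSED_TIME - SCHEDULED_TIME.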

Predicting the delays using the <code>predict()</code> function

In [8]:
b = lm.predict(xtest)
b

array([-19.,   9.,  13., ...,   5., -18.,  24.])

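Evaluating the model by calculating the R squared value using the <code>.score()</code> method. In this case we get 1 on the held-out test set. A perfect score on unseen data most likely points to target leakage rather than overfitting: the predictors include quantities such as DEPARTURE_DELAY, ELAPSED_TIME and SCHEDULED_TIME from which ARRIVAL_DELAY can be computed exactly, as the coefficients above suggest.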

In [9]:
lm.score(xtest,ytest)

1.0
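
A held-out R squared of exactly 1 is what a linear model produces when the target is an exact linear function of the predictors. The sketch below reproduces that situation on synthetic data. The column names mirror the flights dataset but the values are randomly generated, and the relationship `arrival_delay = departure_delay + elapsed_time - scheduled_time` is an assumption made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic, hypothetical columns; not the flights data.
departure_delay = rng.normal(10, 30, n)
scheduled_time = rng.uniform(60, 300, n)
elapsed_time = scheduled_time + rng.normal(0, 10, n)

# The target is an exact function of the predictors (target leakage).
arrival_delay = departure_delay + elapsed_time - scheduled_time

X = np.column_stack([departure_delay, scheduled_time, elapsed_time])
X_tr, X_te, y_tr, y_te = train_test_split(X, arrival_delay, test_size=0.2, random_state=0)

demo = LinearRegression().fit(X_tr, y_tr)
r2 = demo.score(X_te, y_te)

print(round(r2, 6))          # held-out R squared
print(demo.coef_.round(3))   # recovered weights for the three columns
```

The model recovers the weights 1, -1 and 1 and scores essentially perfectly on the test split, which is the signature of leakage rather than of a model memorising its training rows.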

Calculating RMSE using <code>mean_squared_error</code>. In this case it is effectively zero (about 1e-12, which is floating-point rounding error).

In [10]:
from sklearn import metrics

print(np.sqrt(metrics.mean_squared_error(ytest,b)))

1.0202468254084576e-12
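
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal NumPy-only sketch follows. The helper name `rmse` and the sample numbers are illustrative and not taken from the flights data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 2, 0 and -3, so the mean squared error is (4 + 0 + 9) / 3.
rmse_demo = rmse([3.0, -1.0, 2.0], [1.0, -1.0, 5.0])
print(rmse_demo)
```

Because the errors are squared before averaging, RMSE is in the same units as the target (minutes of delay here) and penalises large misses more heavily than small ones.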


<b>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</b>

Predicting the delays using the <code>predict()</code> function on the entire dataset.

In [11]:
a = lm.predict(x)
a

array([-22.,  -9.,   5., ...,  10., -21.,   6.])

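Assigning the original dataframe (df) to a new variable called mlr_df. This does not create a copy: both names refer to the same dataframe, so columns added to mlr_df also appear in df.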

In [12]:
mlr_df = df
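
Plain assignment binds a second name to the same dataframe, whereas <code>.copy()</code> produces an independent one. The sketch below shows the difference on a tiny hypothetical frame. The names `df_demo`, `alias` and `independent` are assumptions used only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'ARRIVAL_DELAY': [-22.0, -9.0, 5.0]})

alias = df_demo               # same object under a second name
independent = df_demo.copy()  # separate object with its own data

# Adding a column through the alias also changes df_demo.
alias['MLR_PREDICTION'] = [-22.0, -9.0, 5.0]

print(alias is df_demo)
print('MLR_PREDICTION' in df_demo.columns)
print('MLR_PREDICTION' in independent.columns)
```

If the original dataframe should stay untouched, `df.copy()` is the safer choice; otherwise the alias is harmless and avoids duplicating a large dataset in memory.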

Appending a column named MLR_PREDICTION containing all the predicted values for ARRIVAL_DELAY.

In [13]:
mlr_df['MLR_PREDICTION'] = a

Checking whether the column MLR_PREDICTION has been added correctly with the appropriate values.

In [14]:
mlr_df['MLR_PREDICTION'].head()

0   -22.0
1    -9.0
2     5.0
3    -9.0
4   -21.0
Name: MLR_PREDICTION, dtype: float64

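Creating a variable named rslt_df that stores the flight details alongside the predicted delay. Because the condition compares the column with the full array a, it is true for every row, so rslt_df contains all flights; comparing against a single value instead would select only the flights with that specific predicted delay.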

In [15]:
rslt_df = mlr_df.loc[(mlr_df['MLR_PREDICTION'] == a)]
rslt_df[['AIRLINE','ORIGIN_AIRPORT','DESTINATION_AIRPORT','MLR_PREDICTION']][0:3]

Unnamed: 0,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,MLR_PREDICTION
0,AS,ANC,SEA,-22.0
1,AA,LAX,PBI,-9.0
2,US,SFO,CLT,5.0
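
Comparing the column with the full array a is true for every row, so the cell above returns the whole dataframe. To look up flights with one specific predicted delay, the column can be compared against a single value instead. The sketch below uses the first five rows as they appear in the notebook outputs. The names `demo`, `target_delay`, `mask` and `matches` are hypothetical, and `np.isclose` is used because exact equality on floats can be fragile.

```python
import numpy as np
import pandas as pd

# First five rows as shown in the notebook outputs.
demo = pd.DataFrame({
    'AIRLINE': ['AS', 'AA', 'US', 'AA', 'AS'],
    'ORIGIN_AIRPORT': ['ANC', 'LAX', 'SFO', 'LAX', 'SEA'],
    'DESTINATION_AIRPORT': ['SEA', 'PBI', 'CLT', 'MIA', 'ANC'],
    'MLR_PREDICTION': [-22.0, -9.0, 5.0, -9.0, -21.0],
})

target_delay = -9.0  # hypothetical delay of interest, in minutes

# Boolean mask that is True only where the prediction matches the target.
mask = np.isclose(demo['MLR_PREDICTION'], target_delay)
matches = demo.loc[mask, ['AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'MLR_PREDICTION']]
print(matches)
```

The same pattern would apply to the full mlr_df by swapping `demo` for it.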


## Random Forest Regression

Since our target variable ARRIVAL_DELAY is continuous, we use Random Forest regression rather than Random Forest classification.

The approach mirrors the one used for MLR: the libraries are imported, <code>fit()</code> is used to train the model, <code>predict()</code> is used to make predictions, and finally the model is evaluated on its R squared and RMSE values.

n_estimators = the number of trees in the forest.<br>
random_state = any fixed integer; it seeds the random number generator so that the same output is produced on every run.
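
The effect of these two parameters can be checked on a small synthetic problem. In the sketch below, which is an illustration on made-up data rather than the flights dataset, two forests trained with the same <code>random_state</code> give identical predictions, and the fitted forest holds exactly <code>n_estimators</code> trees. The names `X_demo`, `y_demo` and `fit_forest` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on the first column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

def fit_forest(seed):
    """Train a 10-tree forest with the given random seed."""
    forest = RandomForestRegressor(n_estimators=10, random_state=seed)
    return forest.fit(X_demo, y_demo)

forest_a = fit_forest(0)
forest_b = fit_forest(0)

# Same seed, same data: the predictions match exactly.
same_a = forest_a.predict(X_demo[:5])
same_b = forest_b.predict(X_demo[:5])
print(np.array_equal(same_a, same_b))

# The fitted forest contains one tree per estimator.
n_trees = len(forest_a.estimators_)
print(n_trees)
```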

In [16]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(xtrain,ytrain)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [17]:
regressor_pred = regressor.predict(xtest)
regressor_pred

array([-10.2,   6.9,  11.6, ...,   4.3, -18.7,  23. ])

In [18]:
regressor.score(xtest,ytest)

0.9959789600576316

In [19]:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(ytest, regressor_pred))
print("RMSE: %f" % (rmse))

RMSE: 2.563279


<b>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</b>

Predicting the delays using the <code>predict()</code> function on the entire dataset.

In [20]:
regressor_pred_a = regressor.predict(x)
regressor_pred_a

array([-20.8,  -8.3,   5. , ...,   9.1, -20. ,   5.9])

In [21]:
mlr_df['RF_PREDICTION'] = regressor_pred_a

In [22]:
mlr_df['RF_PREDICTION'].head()

0   -20.8
1    -8.3
2     5.0
3    -9.5
4   -20.7
Name: RF_PREDICTION, dtype: float64

In [23]:
rf_rslt_df = mlr_df.loc[(mlr_df['RF_PREDICTION'] == regressor_pred_a)]
rf_rslt_df[['AIRLINE','ORIGIN_AIRPORT','DESTINATION_AIRPORT','RF_PREDICTION']][:2]

Unnamed: 0,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,RF_PREDICTION
0,AS,ANC,SEA,-20.8
1,AA,LAX,PBI,-8.3


## XGBOOST

Importing the <code>xgboost</code> library.

In [24]:
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error

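Converting the dataset into <code>DMatrix</code>, XGBoost's optimized data structure. The scikit-learn style <code>XGBRegressor</code> used in the following cells accepts the dataframes directly, so df_dmatrix is not used again in this notebook.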

In [25]:
df_dmatrix = xgb.DMatrix(data=x,label=y)

The next step is to create an XGBoost regressor object by calling the <code>XGBRegressor()</code> class from the XGBoost library with the hyper-parameters passed as arguments.

In [27]:
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1, 
                          max_depth = 5, alpha = 10, n_estimators = 10)
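
To get a feel for how <code>learning_rate</code> and <code>n_estimators</code> interact, the sketch below is an idealised illustration and not XGBoost's actual algorithm. It assumes each boosting round could fit the remaining residual perfectly and then applies only a learning_rate-sized fraction of that correction, ignoring regularisation and column subsampling.

```python
learning_rate = 0.1
n_estimators = 10

# Idealised boosting loop: every round's tree is assumed to fit the current
# residual exactly, but only a learning_rate-sized step of it is applied.
residual = 1.0  # start with one unit of unexplained error
for _ in range(n_estimators):
    residual -= learning_rate * residual

# Fraction of the initial residual left after all rounds: 0.9 ** 10.
print(round(residual, 4))  # about 0.3487
```

Under these assumptions roughly a third of the initial error would remain after 10 rounds even in the best case, so a small learning rate is usually paired with a larger number of estimators.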

Fit the regressor to the training set and make predictions on the test set using the familiar <code>.fit()</code> and <code>.predict()</code> methods.

In [28]:
xg_reg.fit(xtrain,ytrain)



XGBRegressor(alpha=10, base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.1, max_delta_step=0, max_depth=5,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=10, n_jobs=0, num_parallel_tree=1,
             objective='reg:linear', random_state=0, reg_alpha=10, reg_lambda=1,
             scale_pos_weight=1, subsample=1, tree_method=None,
             validate_parameters=False, verbosity=None)

In [29]:
preds = xg_reg.predict(xtest)
preds

array([-0.116889  , -5.766844  , -3.9621015 , ...,  0.46776566,
       -3.190921  , 19.45444   ], dtype=float32)

In [30]:
xg_reg.score(xtest,ytest)

0.6983329160619882

In [31]:
rmse = np.sqrt(mean_squared_error(ytest, preds))
print("RMSE: %f" % (rmse))

RMSE: 22.201925


<b>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</b>

Predicting the delays using the <code>predict()</code> function on the entire dataset.

In [32]:
preds_a = xg_reg.predict(x)
preds_a

array([ 6.067317 , -3.014333 , -1.4491961, ..., -3.717074 , -5.6355166,
       -4.495805 ], dtype=float32)

In [33]:
mlr_df['XGB_PREDICTION'] = preds_a

In [34]:
mlr_df['XGB_PREDICTION'].head()

0    6.067317
1   -3.014333
2   -1.449196
3   -2.979177
4    9.570683
Name: XGB_PREDICTION, dtype: float32

In [35]:
xgb_rslt_df = mlr_df.loc[(mlr_df['XGB_PREDICTION'] == preds_a)]
xgb_rslt_df[['AIRLINE','ORIGIN_AIRPORT','DESTINATION_AIRPORT','XGB_PREDICTION']][0:5]

Unnamed: 0,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,XGB_PREDICTION
0,AS,ANC,SEA,6.067317
1,AA,LAX,PBI,-3.014333
2,US,SFO,CLT,-1.449196
3,AA,LAX,MIA,-2.979177
4,AS,SEA,ANC,9.570683
