# Linear regression with multiple numeric features

This notebook contains code to train and evaluate a linear regression model using multiple numeric features and a single numeric label.

Import the required Python libraries:
- **pandas** provides the dataframe object and a number of useful methods for manipulating and querying the data it contains
- **numpy** provides a number of useful mathematical operations, including some that are helpful when evaluating the accuracy of trained models
- **scikit-learn train_test_split** splits data into training and test sets
- **scikit-learn LinearRegression** is used to train a linear regression model
- **scikit-learn metrics** is used to calculate metrics such as mean squared error, which are helpful when evaluating the accuracy of trained models
- **matplotlib pyplot** is used to plot graphs

In [18]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt

Read the flight data from the CSV file into a pandas dataframe

In [4]:
flight_df = pd.read_csv('all_flights.csv')  # Read csv file into flight_df dataframe
flight_df.shape                             # Display shape of dataframe to see how many rows and columns it contains

(616101, 17)

Display the first 5 rows in the dataset to make sure the data looks like it imported correctly

In [5]:
flight_df.head()                         # Displays the top 5 rows (the default) from flight_df dataframe

Unnamed: 0,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE,Unnamed: 16
0,2018-10-01,WN,N221WN,802,ABQ,BWI,905,903.0,-2.0,1450,1433.0,-17.0,225.0,210.0,197.0,1670.0,
1,2018-10-01,WN,N8329B,3744,ABQ,BWI,1500,1458.0,-2.0,2045,2020.0,-25.0,225.0,202.0,191.0,1670.0,
2,2018-10-01,WN,N920WN,1019,ABQ,DAL,1800,1802.0,2.0,2045,2032.0,-13.0,105.0,90.0,80.0,580.0,
3,2018-10-01,WN,N480WN,1499,ABQ,DAL,950,947.0,-3.0,1235,1223.0,-12.0,105.0,96.0,81.0,580.0,
4,2018-10-01,WN,N227WN,3635,ABQ,DAL,1150,1151.0,1.0,1430,1423.0,-7.0,100.0,92.0,80.0,580.0,


Display the names of the columns in the dataframe and their datatypes

In [7]:
flight_df.dtypes  # Displays the names and data types of the columns in the dataframe

FL_DATE                 object
OP_UNIQUE_CARRIER       object
TAIL_NUM                object
OP_CARRIER_FL_NUM        int64
ORIGIN                  object
DEST                    object
CRS_DEP_TIME             int64
DEP_TIME               float64
DEP_DELAY              float64
CRS_ARR_TIME             int64
ARR_TIME               float64
ARR_DELAY              float64
CRS_ELAPSED_TIME       float64
ACTUAL_ELAPSED_TIME    float64
AIR_TIME               float64
DISTANCE               float64
Unnamed: 16            float64
dtype: object

Create a new dataframe containing ONLY the columns we want to use as features and labels.
This saves us spending time cleaning up data we are not using to train our model.
This model will use DISTANCE and DEP_DELAY as features to predict the value of the label ARR_DELAY

In [11]:
min_flight_data_df = flight_df[['DISTANCE','DEP_DELAY','ARR_DELAY']]  # Create new dataframe containing only DISTANCE
                                                                      # DEP_DELAY and ARR_DELAY
min_flight_data_df.shape                                              # Display the shape of the dataframe as a quick check 
                                                                      # to ensure dataframe has expected number of columns
                                                                      # and rows

(616101, 3)

Get rid of rows containing NaN/missing values


In [12]:
no_missing_values_df = min_flight_data_df.dropna(axis=0,how='any')   # Use dropna to remove rows with NaN in any column
no_missing_values_df.shape                                           # Display shape to see how many rows are removed

(610334, 3)
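
To make the effect of `dropna(axis=0, how='any')` concrete, here is a minimal sketch on a small made-up dataframe. The values and the names `toy_df`, `missing_per_column` and `toy_clean_df` are hypothetical and are not taken from `all_flights.csv`. Counting missing values per column first shows which columns cause rows to be removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from all_flights.csv
toy_df = pd.DataFrame({
    'DISTANCE':  [580.0, 1670.0, 580.0, 1670.0],
    'DEP_DELAY': [2.0, np.nan, -3.0, 1.0],
    'ARR_DELAY': [-13.0, np.nan, -12.0, np.nan],
})

# Count missing values in each column before dropping anything
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Remove every row that has NaN in at least one column
toy_clean_df = toy_df.dropna(axis=0, how='any')
print(toy_clean_df.shape)   # two complete rows remain
```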

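Create one dataframe containing only the features (DISTANCE, DEP_DELAY) and one NumPy array containing only the label (ARR_DELAY).
scikit-learn expects the features as a 2D structure, so a single feature column would have to be reshaped to (-1,1). The label is reshaped to (-1,1) here as well, so that the actual and predicted values are both single-column 2D arrays that can be placed side by side later.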

In [13]:
# Create a dataframe containing the features
X = no_missing_values_df[['DISTANCE','DEP_DELAY']]

In [14]:
# Create a NumPy array containing the label
# reshape(-1,1) turns the single column into a 2D array with one column
y = no_missing_values_df['ARR_DELAY'].values.reshape(-1,1)
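
As a small illustration of the reshape step, the sketch below uses made-up values (the names `toy_labels`, `flat`, `column` and `toy_features` are hypothetical). It shows that `reshape(-1, 1)` turns a 1D array of shape `(n,)` into a 2D column of shape `(n, 1)`, where `-1` asks NumPy to infer the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical label column, values are illustrative only
toy_labels = pd.Series([-17.0, -25.0, -13.0], name='ARR_DELAY')

flat = toy_labels.values        # 1D array, shape (3,)
column = flat.reshape(-1, 1)    # 2D array, shape (3, 1); -1 means "infer this dimension"

print(flat.shape)
print(column.shape)

# Selecting with a list of column names keeps a dataframe 2D,
# so a single feature selected this way needs no reshape
toy_features = pd.DataFrame({'DISTANCE': [1670.0, 1670.0, 580.0]})
print(toy_features[['DISTANCE']].shape)
```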

Split the data into two datasets, one for training the model, one for testing the model

The scikit-learn function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) takes the following arguments:
- **X** is the dataframe containing our features
- **y** is the array containing our label
- **test_size** determines what fraction of the data is put into the test set
- **random_state** seeds the shuffling that happens before the split. It defaults to `None`, which generally gives a different split on every run. By specifying a fixed number I ensure the split I generate is reproducible. That way, if I make changes, I know that changes in accuracy are not caused by different rows being used as training or test data.

In [15]:
#Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=42)
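
The reproducibility that `random_state` provides can be checked with a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for the real features and label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, 1 label column
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10).reshape(-1, 1)

# Two splits with the same random_state select the same rows
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)

print(np.array_equal(X_te_a, X_te_b))   # same seed, same test rows
print(X_tr_a.shape, X_te_a.shape)       # 7 training rows and 3 test rows
```

Leaving `random_state` unset in both calls would usually produce two different splits.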

Check the size of your test and training datasets (i.e. how many rows in each)

In [16]:
X_train.shape

(427233, 2)

In [17]:
X_test.shape

(183101, 2)

Train the model using scikit-learn [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
Your features must be numeric. There are many techniques you can use to convert non-numeric data into numeric data for training.
Your training data cannot contain any missing values.

In [21]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let's see what arrival delays are predicted for our test data by using the **predict** method of our LinearRegression object

In [22]:
y_pred = regressor.predict(X_test)

Let's compare the actual arrival delay times and the predicted arrival delay times

In [23]:
#Combine the two single-column (n, 1) numpy arrays side by side into one (n, 2) array
combined = np.hstack((y_test,y_pred))
#Convert to a DataFrame
accuracy_df = pd.DataFrame(combined,columns=['Actual','Predicted'])
accuracy_df.head()

Unnamed: 0,Actual,Predicted
0,-13.0,-9.879774
1,-24.0,-8.296074
2,100.0,64.120357
3,-8.0,-11.497362
4,-19.0,-13.515499
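
The side-by-side comparison can be sketched on a few made-up values. The arrays `actual` and `predicted` below are hypothetical, and the extra `Error` column is an addition for illustration that the notebook itself does not compute.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted values, each as an (n, 1) column
actual = np.array([[-13.0], [-24.0], [100.0]])
predicted = np.array([[-10.0], [-8.0], [64.0]])

# hstack places the two columns next to each other, giving shape (3, 2)
side_by_side = np.hstack((actual, predicted))
toy_accuracy_df = pd.DataFrame(side_by_side, columns=['Actual', 'Predicted'])

# A per-row error column makes the size of each miss explicit
toy_accuracy_df['Error'] = toy_accuracy_df['Actual'] - toy_accuracy_df['Predicted']
print(toy_accuracy_df)
```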


We can do some calculations to get a sense of overall accuracy

In [24]:
print('Mean absolute error: ',metrics.mean_absolute_error(y_test,y_pred))
# For comparison, when we trained with only DEP_DELAY as a feature we got 
# Mean absolute error:  9.033505526602951

Mean absolute error:  9.01393034185557


In [25]:
print('Mean Squared Error:', metrics.mean_squared_error(y_test,y_pred))
# For comparison, when we trained with only DEP_DELAY as a feature we got 
# Mean Squared Error: 163.26825261343365

Mean Squared Error: 162.04368013600566


In [26]:
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
# For comparison, when we trained with only DEP_DELAY as a feature we got 
# Root Mean Squared Error: 12.777646599175986

Root Mean Squared Error: 12.729637863506003
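
To show what these three metrics measure, the sketch below computes them by hand with numpy on a few made-up values and compares the results with the scikit-learn functions. The arrays `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values
y_true_toy = np.array([3.0, -1.0, 2.0, 7.0])
y_pred_toy = np.array([2.0, 1.0, 2.0, 4.0])

errors = y_true_toy - y_pred_toy    # [1, -2, 0, 3]
mae = np.mean(np.abs(errors))       # mean of absolute errors: (1 + 2 + 0 + 3) / 4
mse = np.mean(errors ** 2)          # mean of squared errors:  (1 + 4 + 0 + 9) / 4
rmse = np.sqrt(mse)                 # square root puts the error back in the label's units

# The hand-computed values should agree with scikit-learn's functions
print(mae, metrics.mean_absolute_error(y_true_toy, y_pred_toy))
print(mse, metrics.mean_squared_error(y_true_toy, y_pred_toy))
print(rmse)
```

Because the errors are squared before averaging, the mean squared error and its root penalise large misses more heavily than the mean absolute error does.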


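Each coefficient is the change in the predicted value for a one-unit increase in that feature while the other feature stays fixed, which gives you a sense of how much each feature contributes (i.e. the weighting of the features)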

In [27]:
print(regressor.coef_)
print(X_train.columns)

[[-0.00188259  1.00574534]]
Index(['DISTANCE', 'DEP_DELAY'], dtype='object')


For each one-unit increase in DEP_DELAY, the predicted ARR_DELAY increases by about one unit (1.006).
For each one-unit increase in DISTANCE, the predicted ARR_DELAY decreases by only about 0.002 units.
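
A linear regression prediction is the intercept plus each coefficient multiplied by its feature value. The sketch below illustrates this on synthetic data generated from a known rule. The data, the rule and all names in it are assumptions made for illustration and are unrelated to the flight dataset. Because the label here is a 1D array, `coef_` is also 1D, unlike the 2D `coef_` printed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from a known linear rule (illustrative values, no noise):
# label = 5.0 - 0.002 * distance + 1.0 * dep_delay
rng = np.random.default_rng(0)
distance = rng.uniform(100, 2500, size=200)
dep_delay = rng.uniform(-10, 120, size=200)
X_syn = np.column_stack((distance, dep_delay))
y_syn = 5.0 - 0.002 * distance + 1.0 * dep_delay

model = LinearRegression().fit(X_syn, y_syn)

# A prediction is the intercept plus each coefficient times its feature value
row = np.array([[580.0, 15.0]])
by_hand = model.intercept_ + row @ model.coef_
print(by_hand, model.predict(row))

# Raising the second feature by one unit changes the prediction by its coefficient
bumped = row + np.array([[0.0, 1.0]])
print(model.predict(bumped) - model.predict(row), model.coef_[1])
```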

## Conclusion
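DEP_DELAY has far more impact on the predicted ARR_DELAY than DISTANCE. 
Adding DISTANCE as a feature only lowered the mean absolute error from about 9.03 to 9.01 and the root mean squared error from about 12.78 to 12.73, so it might not be worth including it as a feature.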
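
Because a coefficient is expressed per unit of its own feature, raw coefficients can understate a feature whose values span a wide range. As a rough check, the sketch below multiplies each coefficient reported above by an illustrative change in that feature. The size of each change is an assumption chosen for illustration, not a statistic computed from the dataset.

```python
# Coefficients copied from the regressor.coef_ output above
coef_distance = -0.00188259
coef_dep_delay = 1.00574534

# Hypothetical feature changes chosen for illustration:
# the gap between the two DISTANCE values visible in the head() output,
# and a 10-unit change in DEP_DELAY
distance_change = 1670.0 - 580.0
dep_delay_change = 10.0

effect_of_distance = coef_distance * distance_change
effect_of_dep_delay = coef_dep_delay * dep_delay_change

print('Change in predicted ARR_DELAY from DISTANCE: ', effect_of_distance)
print('Change in predicted ARR_DELAY from DEP_DELAY:', effect_of_dep_delay)
```

Even across a distance gap of more than a thousand units, the predicted arrival delay shifts by only about two units, which is consistent with the small improvement in the error metrics.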