# Mini-Project: SVM & Logistic Regression Classification

Matt Farrow, Amber Clark, Blake Freeman, Megan Ball

## **2015 Flight Delays and Cancellations**
Data Source: [Kaggle](https://www.kaggle.com/usdot/flight-delays?select=flights.csv)

## Logistic Regression & Support Vector Machine Models

[50 points] Create a logistic regression model and a support vector machine model for the
classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use.

### Prep Data

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import altair as alt

# Due to the way these columns are formatted, we want to keep the leading zeros during our import. Later on will convert them to a time format.
dtype_t = {'SCHEDULED_DEPARTURE': str,
           'DEPARTURE_TIME': str,
           'WHEELS_OFF': str,
           'SCHEDULED_TIME': str,
           'WHEELS_ON': str,
           'SCHEDULED_ARRIVAL': str,
           'ARRIVAL_TIME': str
          }

# Read in the data directly
# Read in the data using Pandas
airlines = pd.read_csv('../Data/airlines.csv')
airports = pd.read_csv('../Data/airports.csv')
flights  = pd.read_csv('../Data/flights.csv', dtype = dtype_t)

# Read in the data directly from GitHub
# airlines = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airlines.csv')
# airports = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airports.csv')
# flights  = pd.read_csv('https://media.githubusercontent.com/media/mattfarrow1/7331-machine-learning-1/main/Data/flights.csv', dtype = dtype_t)

  interactivity=interactivity, compiler=compiler, result=result)


In [2]:
# Rename columns in preparation for merge
airlines.rename(columns={'IATA_CODE': 'AIRLINE_CODE'}, inplace=True)
flights.rename(columns={'AIRLINE': 'AIRLINE_CODE'}, inplace=True)

# Merge data together
df = pd.merge(flights, airlines, on='AIRLINE_CODE', how = 'left')

# Convert string columns to datetime
cols = ["SCHEDULED_DEPARTURE", 
   "DEPARTURE_TIME", 
   "WHEELS_OFF",  
   "WHEELS_ON", 
   "SCHEDULED_ARRIVAL", 
   "ARRIVAL_TIME"]

df[cols] = df[cols].apply(pd.to_datetime, format = '%H%M', errors='coerce')

# Convert YMD into a single date
# Source: https://stackoverflow.com/questions/54487059/pandas-how-to-create-a-single-date-column-from-columns-containing-year-month
df['FLIGHT_DATE'] = pd.to_datetime([f'{y}-{m}-{d}' for y, m, d in zip(df.YEAR, df.MONTH, df.DAY)])

In [3]:
# Convert missing values to 'N' for 'N/A'
df['CANCELLATION_REASON'] = df['CANCELLATION_REASON'].fillna('N')

# Source: datagy.io/pandas-get-dummies/
# One hot encode - removing to save memory

#one_hot_columns = ['CANCELLATION_REASON']

#for column in one_hot_columns:
#  tempdf = pd.get_dummies(df[column], prefix=column)
#
#  df = pd.merge(
#      left = df,
#      right = tempdf,
#      left_index = True,
#      right_index = True,
#  )

#  df = df.drop(columns=column)

# Update missing values in times to 0. 
# Will be updating times to a binary (1 = yes action happened, 0 = no action happened)
df['DEPARTURE_TIME'] = df['DEPARTURE_TIME'].fillna(0)

# Change all non-null values to 1
df.loc[(df.DEPARTURE_TIME != '0'), 'DEPARTURE_TIME'] = 1

# Change column name to 'DEPARTED'
df.rename(columns={'DEPARTURE_TIME': 'DEPARTED'}, inplace=True)

# Update remaining columns using same logic
cols = ['WHEELS_OFF','WHEELS_ON','ARRIVAL_TIME']
df[cols] = df[cols].fillna(0)
df.loc[(df.WHEELS_OFF != '0'), 'WHEELS_OFF'] = 1
df.loc[(df.WHEELS_ON != '0'), 'WHEELS_ON'] = 1
df.loc[(df.ARRIVAL_TIME != '0'), 'ARRIVAL_TIME'] = 1
df.rename(columns={'ARRIVAL_TIME': 'ARRIVED'}, inplace=True)

# Fill missing values with 0
cols = ['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY']
df[cols] = df[cols].fillna(0)

# Change remaining null values to 0 if flight was cancelled
df.loc[(df.CANCELLED == 1), ('DEPARTURE_DELAY', 'TAXI_OUT', 'ELAPSED_TIME','AIR_TIME','TAXI_IN','ARRIVAL_DELAY')] = 0

# Remove remaining null values
df = df.dropna()

In [4]:
# log transformation keeping the 0 in the data sets 
df["DEPARTURE_DELAY_log"] = df["DEPARTURE_DELAY"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["ARRIVAL_DELAY_Log"]   = df["ARRIVAL_DELAY"].map(lambda i: np.log1p(i) if i > 0 else 0)
df["DISTANCE_log"]        = df["DISTANCE"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["TAXI_IN_Log"]         = df["TAXI_IN"].map(lambda i: np.log1p(i) if i > 0 else 0)
df["ELAPSED_TIME_log"]    = df["ELAPSED_TIME"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["AIR_TIME_log"]        = df["AIR_TIME"].map(lambda i: np.log1p(i) if i > 0 else 0) 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5789165 entries, 0 to 5819078
Data columns (total 39 columns):
YEAR                   int64
MONTH                  int64
DAY                    int64
DAY_OF_WEEK            int64
AIRLINE_CODE           object
FLIGHT_NUMBER          int64
TAIL_NUMBER            object
ORIGIN_AIRPORT         object
DESTINATION_AIRPORT    object
SCHEDULED_DEPARTURE    datetime64[ns]
DEPARTED               int64
DEPARTURE_DELAY        float64
TAXI_OUT               float64
WHEELS_OFF             int64
SCHEDULED_TIME         object
ELAPSED_TIME           float64
AIR_TIME               float64
DISTANCE               int64
WHEELS_ON              int64
TAXI_IN                float64
SCHEDULED_ARRIVAL      datetime64[ns]
ARRIVED                int64
ARRIVAL_DELAY          float64
DIVERTED               int64
CANCELLED              int64
CANCELLATION_REASON    object
AIR_SYSTEM_DELAY       float64
SECURITY_DELAY         float64
AIRLINE_DELAY          float64
LATE

In [6]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTED,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVED,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,1,-11.0,21.0,1,205,194.0,169.0,1448,1,4.0,1900-01-01 04:30:00,1,-22.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.609438,5.273,5.135798
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,1,-8.0,12.0,1,280,279.0,263.0,2330,1,4.0,1900-01-01 07:50:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.754053,1.609438,5.63479,5.575949
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,1,-2.0,16.0,1,286,293.0,266.0,2296,1,11.0,1900-01-01 08:06:00,1,5.0,0,0,N,0.0,0.0,0.0,0.0,0.0,US Airways Inc.,2015-01-01,0.0,1.791759,7.739359,2.484907,5.68358,5.587249
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,1,-5.0,15.0,1,285,281.0,258.0,2342,1,8.0,1900-01-01 08:05:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.759187,2.197225,5.641907,5.556828
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,1,-1.0,11.0,1,235,215.0,199.0,1448,1,5.0,1900-01-01 03:20:00,1,-21.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.791759,5.375278,5.298317


In [7]:
#scheduled time is not an object but should be numeric. 
df['SCHEDULED_TIME'] = pd.to_numeric(df['SCHEDULED_TIME'])

In [8]:
#create our response variable

#create new column where the arrival_delay > 0 means it's delayed(=1) and if <= 0 it's not delayed(=0)
get_delay = lambda x: 0 if x <= 0 else 1
df['DELAYED'] = df.ARRIVAL_DELAY.apply(get_delay)

In [9]:
df.shape

(5789165, 40)

In [10]:
df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTED,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVED,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,DELAYED
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,1,-11.0,21.0,1,205,194.0,169.0,1448,1,4.0,1900-01-01 04:30:00,1,-22.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.609438,5.273,5.135798,0
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,1,-8.0,12.0,1,280,279.0,263.0,2330,1,4.0,1900-01-01 07:50:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.754053,1.609438,5.63479,5.575949,0
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,1,-2.0,16.0,1,286,293.0,266.0,2296,1,11.0,1900-01-01 08:06:00,1,5.0,0,0,N,0.0,0.0,0.0,0.0,0.0,US Airways Inc.,2015-01-01,0.0,1.791759,7.739359,2.484907,5.68358,5.587249,1
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,1,-5.0,15.0,1,285,281.0,258.0,2342,1,8.0,1900-01-01 08:05:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.759187,2.197225,5.641907,5.556828,0
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,1,-1.0,11.0,1,235,215.0,199.0,1448,1,5.0,1900-01-01 03:20:00,1,-21.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.791759,5.375278,5.298317,0


### Scaling & OneHotEncoding

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
#from sklearn.compose import make_column_selector as selector

#all code below adapted from documentation example
#at https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html?highlight=onehotencoding

In [12]:
numeric_features = df.select_dtypes(include=np.number).columns.tolist()
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_features = ['AIRLINE_CODE', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 
                        'DESTINATION_AIRPORT', 'CANCELLATION_REASON', 'AIRLINE']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVC())])

### Test/Train Split

Our desired variable for classication will be our binary value for delay to determine whether or not a flight was delayed on arrival.

In [13]:
from sklearn.model_selection import train_test_split

y = df['DELAYED'].values # get the labels we want
sub = df.iloc[:,0:33]
X = sub.values # use everything except the log transform values to predict

In [14]:
sub.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTED,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVED,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,FLIGHT_DATE
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,1,-11.0,21.0,1,205,194.0,169.0,1448,1,4.0,1900-01-01 04:30:00,1,-22.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,1,-8.0,12.0,1,280,279.0,263.0,2330,1,4.0,1900-01-01 07:50:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,1,-2.0,16.0,1,286,293.0,266.0,2296,1,11.0,1900-01-01 08:06:00,1,5.0,0,0,N,0.0,0.0,0.0,0.0,0.0,US Airways Inc.,2015-01-01
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,1,-5.0,15.0,1,285,281.0,258.0,2342,1,8.0,1900-01-01 08:05:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,1,-1.0,11.0,1,235,215.0,199.0,1448,1,5.0,1900-01-01 03:20:00,1,-21.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01


In [15]:
y.shape

(5789165,)

In [16]:
X.shape

(5789165, 33)

In [None]:
#create test and train split with random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#save data to here locally
np.save('../Data/X_train.npy', X_train)
np.save('../Data/X_test.npy', X_test)
np.save('../Data/y_train.npy', y_train)
np.save('../Data/y_test.npy', y_test)

### SVM

Run 'out of the box' SVM with scaling and one-hot encoding from above pipeline.

In [None]:
#param_grid = {
#    'kernel':('linear', 'rbf'),
#    'C': [0.1, 1.0, 10, 100],
#}

#grid_search = GridSearchCV(clf, param_grid, cv=3)
#grid_search

In [62]:
from sklearn import metrics as mt
from sklearn.model_selection import cross_val_score

#run classifier
clf.fit(X_train, y_train)

#predict values
y_hat = clf.predict(X_test)

#calculate accuracy and confusion matrix (no cross val)
print("Accuracy:",mt.accuracy_score(y_test, y_hat))
print("Precision:",mt.precision_score(y_test, y_hat))
print("Recall:",mt.recall_score(y_test, y_hat))

ValueError: could not convert string to float: 'OO'

## Advantages of Each Model

[10 points] Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

## Feature Importance

[30 points] Use the weights from logistic regression to interpret the importance of different features for each classification task. Explain your interpretation in detail. Why do you think some variables are more important?

## Support Vectors

[10 points] Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain.