# Mini-Project: SVM & Logistic Regression Classification

Matt Farrow, Amber Clark, Blake Freeman, Megan Ball

## **2015 Flight Delays and Cancellations**
Data Source: [Kaggle](https://www.kaggle.com/usdot/flight-delays?select=flights.csv)

## Logistic Regression & Support Vector Machine Models

[50 points] Create a logistic regression model and a support vector machine model for the
classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use.

### Prep Data

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import altair as alt

# Due to the way these columns are formatted, we want to keep the leading zeros during our import. Later on will convert them to a time format.
dtype_t = {'SCHEDULED_DEPARTURE': str,
           'DEPARTURE_TIME': str,
           'WHEELS_OFF': str,
           'SCHEDULED_TIME': str,
           'WHEELS_ON': str,
           'SCHEDULED_ARRIVAL': str,
           'ARRIVAL_TIME': str
          }

# Read in the data directly
# Read in the data using Pandas
airlines = pd.read_csv('../Data/airlines.csv')
airports = pd.read_csv('../Data/airports.csv')
flights  = pd.read_csv('../Data/flights.csv', dtype = dtype_t)

# Read in the data directly from GitHub
# airlines = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airlines.csv')
# airports = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airports.csv')
# flights  = pd.read_csv('https://media.githubusercontent.com/media/mattfarrow1/7331-machine-learning-1/main/Data/flights.csv', dtype = dtype_t)

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [2]:
# Rename columns in preparation for merge
airlines.rename(columns={'IATA_CODE': 'AIRLINE_CODE'}, inplace=True)
flights.rename(columns={'AIRLINE': 'AIRLINE_CODE'}, inplace=True)

# Merge data together
df = pd.merge(flights, airlines, on='AIRLINE_CODE', how = 'left')

# Convert string columns to datetime
cols = ["SCHEDULED_DEPARTURE", 
   "DEPARTURE_TIME", 
   "WHEELS_OFF",  
   "WHEELS_ON", 
   "SCHEDULED_ARRIVAL", 
   "ARRIVAL_TIME"]

df[cols] = df[cols].apply(pd.to_datetime, format = '%H%M', errors='coerce')

# Convert YMD into a single date
# Source: https://stackoverflow.com/questions/54487059/pandas-how-to-create-a-single-date-column-from-columns-containing-year-month
df['FLIGHT_DATE'] = pd.to_datetime([f'{y}-{m}-{d}' for y, m, d in zip(df.YEAR, df.MONTH, df.DAY)])

In [3]:
# Convert missing values to 'N' for 'N/A'
df['CANCELLATION_REASON'] = df['CANCELLATION_REASON'].fillna('N')

# Source: datagy.io/pandas-get-dummies/
# One hot encode - removing to save memory

#one_hot_columns = ['CANCELLATION_REASON']

#for column in one_hot_columns:
#  tempdf = pd.get_dummies(df[column], prefix=column)
#
#  df = pd.merge(
#      left = df,
#      right = tempdf,
#      left_index = True,
#      right_index = True,
#  )

#  df = df.drop(columns=column)

# Update missing values in times to 0. 
# Will be updating times to a binary (1 = yes action happened, 0 = no action happened)
df['DEPARTURE_TIME'] = df['DEPARTURE_TIME'].fillna(0)

# Change all non-null values to 1
df.loc[(df.DEPARTURE_TIME != '0'), 'DEPARTURE_TIME'] = 1

# Change column name to 'DEPARTED'
df.rename(columns={'DEPARTURE_TIME': 'DEPARTED'}, inplace=True)

# Update remaining columns using same logic
cols = ['WHEELS_OFF','WHEELS_ON','ARRIVAL_TIME']
df[cols] = df[cols].fillna(0)
df.loc[(df.WHEELS_OFF != '0'), 'WHEELS_OFF'] = 1
df.loc[(df.WHEELS_ON != '0'), 'WHEELS_ON'] = 1
df.loc[(df.ARRIVAL_TIME != '0'), 'ARRIVAL_TIME'] = 1
df.rename(columns={'ARRIVAL_TIME': 'ARRIVED'}, inplace=True)

# Fill missing values with 0
cols = ['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY']
df[cols] = df[cols].fillna(0)

# Change remaining null values to 0 if flight was cancelled
df.loc[(df.CANCELLED == 1), ('DEPARTURE_DELAY', 'TAXI_OUT', 'ELAPSED_TIME','AIR_TIME','TAXI_IN','ARRIVAL_DELAY')] = 0

# Remove remaining null values
df = df.dropna()

In [4]:
# log transformation keeping the 0 in the data sets 
df["DEPARTURE_DELAY_log"] = df["DEPARTURE_DELAY"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["ARRIVAL_DELAY_Log"]   = df["ARRIVAL_DELAY"].map(lambda i: np.log1p(i) if i > 0 else 0)
df["DISTANCE_log"]        = df["DISTANCE"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["TAXI_IN_Log"]         = df["TAXI_IN"].map(lambda i: np.log1p(i) if i > 0 else 0)
df["ELAPSED_TIME_log"]    = df["ELAPSED_TIME"].map(lambda i: np.log1p(i) if i > 0 else 0) 
df["AIR_TIME_log"]        = df["AIR_TIME"].map(lambda i: np.log1p(i) if i > 0 else 0) 

In [5]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

In [6]:
df[['AIRLINE_CODE','TAIL_NUMBER','ORIGIN_AIRPORT','DESTINATION_AIRPORT','AIRLINE']] = df[['AIRLINE_CODE','TAIL_NUMBER','ORIGIN_AIRPORT','DESTINATION_AIRPORT','AIRLINE']].astype('str')

In [7]:
df['AIRLINE_CODE_encode'] = labelencoder.fit_transform(df['AIRLINE_CODE'])
df

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,WEATHER_DELAY,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,...,0.0,Alaska Airlines Inc.,2015-01-01,0.000000,0.000000,7.278629,1.609438,5.273000,5.135798,1
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,...,0.0,American Airlines Inc.,2015-01-01,0.000000,0.000000,7.754053,1.609438,5.634790,5.575949,0
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,...,0.0,US Airways Inc.,2015-01-01,0.000000,1.791759,7.739359,2.484907,5.683580,5.587249,11
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,...,0.0,American Airlines Inc.,2015-01-01,0.000000,0.000000,7.759187,2.197225,5.641907,5.556828,0
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,...,0.0,Alaska Airlines Inc.,2015-01-01,0.000000,0.000000,7.278629,1.791759,5.375278,5.298317,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819074,2015,12,31,4,B6,688,N657JB,LAX,BOS,1900-01-01 23:59:00,...,0.0,JetBlue Airways,2015-12-31,0.000000,0.000000,7.867871,1.609438,5.700444,5.609472,2
5819075,2015,12,31,4,B6,745,N828JB,JFK,PSE,1900-01-01 23:59:00,...,0.0,JetBlue Airways,2015-12-31,0.000000,0.000000,7.388946,1.386294,5.375278,5.278115,2
5819076,2015,12,31,4,B6,1503,N913JB,JFK,SJU,1900-01-01 23:59:00,...,0.0,JetBlue Airways,2015-12-31,0.000000,0.000000,7.377134,2.197225,5.407172,5.288267,2
5819077,2015,12,31,4,B6,333,N527JB,MCO,SJU,1900-01-01 23:59:00,...,0.0,JetBlue Airways,2015-12-31,0.000000,0.000000,7.081709,1.386294,5.062595,4.976734,2


In [8]:
df['ORIGIN_AIRPORT_encode'] = labelencoder.fit_transform(df['ORIGIN_AIRPORT'])
df

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,...,Alaska Airlines Inc.,2015-01-01,0.000000,0.000000,7.278629,1.609438,5.273000,5.135798,1,323
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,...,American Airlines Inc.,2015-01-01,0.000000,0.000000,7.754053,1.609438,5.634790,5.575949,0,482
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,...,US Airways Inc.,2015-01-01,0.000000,1.791759,7.739359,2.484907,5.683580,5.587249,11,584
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,...,American Airlines Inc.,2015-01-01,0.000000,0.000000,7.759187,2.197225,5.641907,5.556828,0,482
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,...,Alaska Airlines Inc.,2015-01-01,0.000000,0.000000,7.278629,1.791759,5.375278,5.298317,1,583
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819074,2015,12,31,4,B6,688,N657JB,LAX,BOS,1900-01-01 23:59:00,...,JetBlue Airways,2015-12-31,0.000000,0.000000,7.867871,1.609438,5.700444,5.609472,2,482
5819075,2015,12,31,4,B6,745,N828JB,JFK,PSE,1900-01-01 23:59:00,...,JetBlue Airways,2015-12-31,0.000000,0.000000,7.388946,1.386294,5.375278,5.278115,2,472
5819076,2015,12,31,4,B6,1503,N913JB,JFK,SJU,1900-01-01 23:59:00,...,JetBlue Airways,2015-12-31,0.000000,0.000000,7.377134,2.197225,5.407172,5.288267,2,472
5819077,2015,12,31,4,B6,333,N527JB,MCO,SJU,1900-01-01 23:59:00,...,JetBlue Airways,2015-12-31,0.000000,0.000000,7.081709,1.386294,5.062595,4.976734,2,499


In [9]:
df2 = df[['ORIGIN_AIRPORT','ORIGIN_AIRPORT_encode']]
df2= df2.drop_duplicates(subset=['ORIGIN_AIRPORT'], keep='last')

In [10]:
df2.rename(columns={'ORIGIN_AIRPORT': 'DESTINATION_AIRPORT'}, inplace=True)
df2.rename(columns={'ORIGIN_AIRPORT_encode': 'DESTINATION_AIRPORT_encode'}, inplace=True)

In [11]:
df = pd.merge(df, df2, on='DESTINATION_AIRPORT', how = 'left')

In [12]:
df.dropna(subset = ["DESTINATION_AIRPORT_encode"], inplace=True)

In [13]:
df

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode,DESTINATION_AIRPORT_encode
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,...,2015-01-01,0.000000,0.000000,7.278629,1.609438,5.273000,5.135798,1,323,583.0
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,...,2015-01-01,0.000000,0.000000,7.754053,1.609438,5.634790,5.575949,0,482,541.0
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,...,2015-01-01,0.000000,1.791759,7.739359,2.484907,5.683580,5.587249,11,584,372.0
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,...,2015-01-01,0.000000,0.000000,7.759187,2.197225,5.641907,5.556828,0,482,509.0
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,...,2015-01-01,0.000000,0.000000,7.278629,1.791759,5.375278,5.298317,1,583,323.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5789160,2015,12,31,4,B6,688,N657JB,LAX,BOS,1900-01-01 23:59:00,...,2015-12-31,0.000000,0.000000,7.867871,1.609438,5.700444,5.609472,2,482,345.0
5789161,2015,12,31,4,B6,745,N828JB,JFK,PSE,1900-01-01 23:59:00,...,2015-12-31,0.000000,0.000000,7.388946,1.386294,5.375278,5.278115,2,472,554.0
5789162,2015,12,31,4,B6,1503,N913JB,JFK,SJU,1900-01-01 23:59:00,...,2015-12-31,0.000000,0.000000,7.377134,2.197225,5.407172,5.288267,2,472,591.0
5789163,2015,12,31,4,B6,333,N527JB,MCO,SJU,1900-01-01 23:59:00,...,2015-12-31,0.000000,0.000000,7.081709,1.386294,5.062595,4.976734,2,499,591.0


In [14]:
df['TAIL_NUMBER_encode'] = labelencoder.fit_transform(df['TAIL_NUMBER'])
df

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode,DESTINATION_AIRPORT_encode,TAIL_NUMBER_encode
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,...,0.000000,0.000000,7.278629,1.609438,5.273000,5.135798,1,323,583.0,1622
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,...,0.000000,0.000000,7.754053,1.609438,5.634790,5.575949,0,482,541.0,1556
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,...,0.000000,1.791759,7.739359,2.484907,5.683580,5.587249,11,584,372.0,421
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,...,0.000000,0.000000,7.759187,2.197225,5.641907,5.556828,0,482,509.0,1516
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,...,0.000000,0.000000,7.278629,1.791759,5.375278,5.298317,1,583,323.0,2131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5789160,2015,12,31,4,B6,688,N657JB,LAX,BOS,1900-01-01 23:59:00,...,0.000000,0.000000,7.867871,1.609438,5.700444,5.609472,2,482,345.0,2885
5789161,2015,12,31,4,B6,745,N828JB,JFK,PSE,1900-01-01 23:59:00,...,0.000000,0.000000,7.388946,1.386294,5.375278,5.278115,2,472,554.0,3947
5789162,2015,12,31,4,B6,1503,N913JB,JFK,SJU,1900-01-01 23:59:00,...,0.000000,0.000000,7.377134,2.197225,5.407172,5.288267,2,472,591.0,4417
5789163,2015,12,31,4,B6,333,N527JB,MCO,SJU,1900-01-01 23:59:00,...,0.000000,0.000000,7.081709,1.386294,5.062595,4.976734,2,499,591.0,2132


In [15]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTED,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVED,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode,DESTINATION_AIRPORT_encode,TAIL_NUMBER_encode
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,1,-11.0,21.0,1,205,194.0,169.0,1448,1,4.0,1900-01-01 04:30:00,1,-22.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.609438,5.273,5.135798,1,323,583.0,1622
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,1,-8.0,12.0,1,280,279.0,263.0,2330,1,4.0,1900-01-01 07:50:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.754053,1.609438,5.63479,5.575949,0,482,541.0,1556
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,1,-2.0,16.0,1,286,293.0,266.0,2296,1,11.0,1900-01-01 08:06:00,1,5.0,0,0,N,0.0,0.0,0.0,0.0,0.0,US Airways Inc.,2015-01-01,0.0,1.791759,7.739359,2.484907,5.68358,5.587249,11,584,372.0,421
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,1,-5.0,15.0,1,285,281.0,258.0,2342,1,8.0,1900-01-01 08:05:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.759187,2.197225,5.641907,5.556828,0,482,509.0,1516
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,1,-1.0,11.0,1,235,215.0,199.0,1448,1,5.0,1900-01-01 03:20:00,1,-21.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.791759,5.375278,5.298317,1,583,323.0,2131


In [16]:
#created a window to state what is late and what is not late if the arrival was within 10 mins it was considered ontime.

conditions = [
    (df['ARRIVAL_DELAY'] <= 0),
    (df['ARRIVAL_DELAY'] > 0)
    ]


# 0 being on time while 1 being delayed. 
values = [0, 1]

df['Arrival_Delay_OT'] = np.select(conditions, values)

pd.set_option('display.max_columns', None)
df.head()


#https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTED,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVED,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,FLIGHT_DATE,DEPARTURE_DELAY_log,ARRIVAL_DELAY_Log,DISTANCE_log,TAXI_IN_Log,ELAPSED_TIME_log,AIR_TIME_log,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode,DESTINATION_AIRPORT_encode,TAIL_NUMBER_encode,Arrival_Delay_OT
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,1900-01-01 00:05:00,1,-11.0,21.0,1,205,194.0,169.0,1448,1,4.0,1900-01-01 04:30:00,1,-22.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.609438,5.273,5.135798,1,323,583.0,1622,0
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,1900-01-01 00:10:00,1,-8.0,12.0,1,280,279.0,263.0,2330,1,4.0,1900-01-01 07:50:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.754053,1.609438,5.63479,5.575949,0,482,541.0,1556,0
2,2015,1,1,4,US,840,N171US,SFO,CLT,1900-01-01 00:20:00,1,-2.0,16.0,1,286,293.0,266.0,2296,1,11.0,1900-01-01 08:06:00,1,5.0,0,0,N,0.0,0.0,0.0,0.0,0.0,US Airways Inc.,2015-01-01,0.0,1.791759,7.739359,2.484907,5.68358,5.587249,11,584,372.0,421,1
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,1900-01-01 00:20:00,1,-5.0,15.0,1,285,281.0,258.0,2342,1,8.0,1900-01-01 08:05:00,1,-9.0,0,0,N,0.0,0.0,0.0,0.0,0.0,American Airlines Inc.,2015-01-01,0.0,0.0,7.759187,2.197225,5.641907,5.556828,0,482,509.0,1516,0
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,1900-01-01 00:25:00,1,-1.0,11.0,1,235,215.0,199.0,1448,1,5.0,1900-01-01 03:20:00,1,-21.0,0,0,N,0.0,0.0,0.0,0.0,0.0,Alaska Airlines Inc.,2015-01-01,0.0,0.0,7.278629,1.791759,5.375278,5.298317,1,583,323.0,2131,0


In [37]:
df_log = df
df_log = df_log.drop(['YEAR','AIRLINE_CODE','TAIL_NUMBER','ORIGIN_AIRPORT','DESTINATION_AIRPORT','DEPARTED','CANCELLED','CANCELLATION_REASON','AIRLINE','FLIGHT_DATE','SCHEDULED_ARRIVAL','SCHEDULED_DEPARTURE','ARRIVAL_DELAY','ARRIVAL_DELAY_Log','AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY','DEPARTURE_DELAY_log','DISTANCE_log','TAXI_IN_Log','ELAPSED_TIME_log','AIR_TIME_log','ELAPSED_TIME','AIR_TIME'], 1)
pd.set_option('display.max_columns', None)
df_log

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,DISTANCE,WHEELS_ON,TAXI_IN,ARRIVED,DIVERTED,AIRLINE_CODE_encode,ORIGIN_AIRPORT_encode,DESTINATION_AIRPORT_encode,TAIL_NUMBER_encode,Arrival_Delay_OT
0,1,1,4,98,-11.0,21.0,1,205,1448,1,4.0,1,0,1,323,583.0,1622,0
1,1,1,4,2336,-8.0,12.0,1,280,2330,1,4.0,1,0,0,482,541.0,1556,0
2,1,1,4,840,-2.0,16.0,1,286,2296,1,11.0,1,0,11,584,372.0,421,1
3,1,1,4,258,-5.0,15.0,1,285,2342,1,8.0,1,0,0,482,509.0,1516,0
4,1,1,4,135,-1.0,11.0,1,235,1448,1,5.0,1,0,1,583,323.0,2131,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5789160,12,31,4,688,-4.0,22.0,1,320,2611,1,4.0,1,0,2,482,345.0,2885,0
5789161,12,31,4,745,-4.0,17.0,1,227,1617,1,3.0,1,0,2,472,554.0,3947,0
5789162,12,31,4,1503,-9.0,17.0,1,221,1598,1,8.0,1,0,2,472,591.0,4417,0
5789163,12,31,4,333,-6.0,10.0,1,161,1189,1,3.0,1,0,2,499,591.0,2132,0


In [38]:
from sklearn.model_selection import ShuffleSplit

# we want to predict the X and y data as follows:
if 'Arrival_Delay_OT' in df_log:
    y = df_log['Arrival_Delay_OT'].values # get the labels we want
    del df_log['Arrival_Delay_OT'] # get rid of the class label
    X = df_log.values # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,random_state=10,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(n_splits=3, random_state=10, test_size=0.2, train_size=None)


In [39]:
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

# first we create a reusable logisitic regression object
#   here we can setup the object with different learning parameters and constants
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) # get object

# now we can use the cv_object that we setup before to iterate through the 
#    different training and testing sets. Each time we will reuse the logisitic regression 
#    object, but it gets trained on different data each time we use it.

iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logisitc regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set precitions

    # now let's get the accuracy and confusion matrix for this iterations of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1
    
# Also note that every time you run the above code
#   it randomly creates a new training and testing set, 
#   so accuracy will be different each time

====Iteration 0  ====
accuracy 0.8782561906596201
confusion matrix
 [[698699  41966]
 [ 98993 318175]]
====Iteration 1  ====
accuracy 0.8781819139720495
confusion matrix
 [[699137  41663]
 [ 99382 317651]]
====Iteration 2  ====
accuracy 0.8782717369430652
confusion matrix
 [[698223  41739]
 [ 99202 318669]]


In [None]:
# this does the exact same thing as the above block of code, but with shorter syntax

#for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
  #  lr_clf.fit(X[train_indices],y[train_indices])  # train object
   # y_hat = lr_clf.predict(X[test_indices]) # get test set precitions

    # print the accuracy and confusion matrix 
    #print("====Iteration",iter_num," ====")
    #print("accuracy", mt.accuracy_score(y[test_indices],y_hat)) 
    #print("confusion matrix\n",mt.confusion_matrix(y[test_indices],y_hat))

In [None]:
# and here is an even shorter way of getting the accuracies for each training and test set
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(lr_clf, X, y=y, cv=cv_object) # this also can help with parallelism
print(accuracies)

In [None]:
# here we can change some of the parameters interactively
from ipywidgets import widgets as wd

def lr_explor(cost):
    lr_clf = LogisticRegression(penalty='l2', C=cost, class_weight=None,solver='liblinear') # get object
    accuracies = cross_val_score(lr_clf,X,y=y,cv=cv_object) # this also can help with parallelism
    print(accuracies)

wd.interact(lr_explor,cost=(0.001,5.0,0.05),__manual=True)

In [40]:
# interpret the weights

# iterate over the coefficients
weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = df_log.columns
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0])
    
# does this look correct? 

MONTH has weight of -0.04420535618259056
DAY has weight of -0.0039904190572692165
DAY_OF_WEEK has weight of -0.035576322826918735
FLIGHT_NUMBER has weight of 4.045583863575573e-06
DEPARTURE_DELAY has weight of 0.21194659755737422
TAXI_OUT has weight of 0.18112754425876898
WHEELS_OFF has weight of -0.3814078085781773
SCHEDULED_TIME has weight of -0.06770729614832054
DISTANCE has weight of 0.007475056593359719
WHEELS_ON has weight of -0.3814078085781773
TAXI_IN has weight of 0.17624679589375616
ARRIVED has weight of -0.3814078085781773
DIVERTED has weight of 0.0
AIRLINE_CODE_encode has weight of 0.01932050906654241
ORIGIN_AIRPORT_encode has weight of -0.0008732831224246109
DESTINATION_AIRPORT_encode has weight of 0.0008615945437542439
TAIL_NUMBER_encode has weight of -6.926540735946212e-05


## Advantages of Each Model

[10 points] Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

## Feature Importance

[30 points] Use the weights from logistic regression to interpret the importance of different features for each classification task. Explain your interpretation in detail. Why do you think some variables are more important?

## Support Vectors

[10 points] Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain.