# Lab Three: Clustering

Matt Farrow, Amber Clark, Blake Freeman, Megan Ball

## **2015 Flight Delays and Cancellations**
Data Source: [Kaggle](https://www.kaggle.com/usdot/flight-delays?select=flights.csv)

Our data set consists of over 5 million rows of domestic United States flight information for 2015. To keep modeling time manageable, we have narrowed the scope of our modeling tasks to the Dallas area only (flights departing Dallas Love Field and DFW airports).

## Rubric

### [Business Understanding](#Business-Understanding) (10 points total)

- [10 points] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?

### [Data Understanding](#Data-Understanding) (20 points total)

#### [Data Understanding 1](#Data-Understanding-1)

- [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?

#### [Data Understanding 2](#Data-Understanding-2)

- [10 points] Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.

### [Modeling and Evaluation](#Modeling-and-Evaluation) (50 points total)

Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results.

#### Option A: Cluster Analysis

- Perform cluster analysis using several clustering methods
- How did you determine a suitable number of clusters for each method?
- Use internal and/or external validation measures to describe and compare the clusterings and the clusters (some visual methods would be good).
- Describe your results. What findings are the most interesting and why?

#### [Modeling and Evaluation 1](#Modeling-and-Evaluation-1)

- Train and adjust parameters

#### [Modeling and Evaluation 2](#Modeling-and-Evaluation-2)

- Evaluate and compare

#### [Modeling and Evaluation 3](#Modeling-and-Evaluation-3)

- Visualize results

#### [Modeling and Evaluation 4](#Modeling-and-Evaluation-4)

- Summarise the ramifications

### [Deployment](#Deployment) (10 points total)

Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
- How would you deploy your model for interested parties?
- What other data should be collected?
- How often would the model need to be updated, etc.?

### [Exceptional Work](#Exceptional-Work) (10 points total)

You have free rein to provide additional analyses or combine analyses.

# Business Understanding
Jump to [top](#Rubric)

# Data Understanding
Jump to [top](#Rubric)

## Data Understanding 1
Jump to [top](#Rubric)

> Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?

The initial data pre-processing has already been covered in Labs 1, 2, and the Mini-Lab. Here we have collapsed our code as much as possible.

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#from datetime import datetime
import altair as alt
import datetime

# Machine learning
from sklearn import metrics 
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from imblearn.over_sampling import SMOTE 
from imblearn.pipeline import Pipeline
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## [Jump to Clean Data](#Final-Data-Set)

Clicking this link will skip over the cleanup work and let you get started with the final data set. 

In [2]:
import warnings
warnings.filterwarnings('ignore')

# These columns store times as HHMM values, so we import them as strings to keep the leading zeros.
# Later on we will convert them to a time format.
dtype_t = {'SCHEDULED_DEPARTURE': str,
           'DEPARTURE_TIME': str,
           'WHEELS_OFF': str,
           'SCHEDULED_TIME': str,
           'WHEELS_ON': str,
           'SCHEDULED_ARRIVAL': str,
           'ARRIVAL_TIME': str
          }

# Read in the data directly
airlines = pd.read_csv('../Data/airlines.csv')
airports = pd.read_csv('../Data/airports.csv')
flights  = pd.read_csv('../Data/flights.csv', dtype = dtype_t)

# Read in the data directly from GitHub
# airlines = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airlines.csv')
# airports = pd.read_csv('https://raw.githubusercontent.com/mattfarrow1/7331-machine-learning-1/main/Data/airports.csv')
# flights  = pd.read_csv('https://media.githubusercontent.com/media/mattfarrow1/7331-machine-learning-1/main/Data/flights.csv', dtype = dtype_t)

# Rename columns in preparation for merge
airlines.rename(columns={'IATA_CODE': 'AIRLINE_CODE'}, inplace=True)
flights.rename(columns={'AIRLINE': 'AIRLINE_CODE'}, inplace=True)

# Merge data together
df = pd.merge(flights, airlines, on='AIRLINE_CODE', how = 'left')

# Subset to DFW Area
df = df[(df.ORIGIN_AIRPORT == 'DFW') | (df.ORIGIN_AIRPORT == 'DAL')]
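
Reading the time columns as strings matters because a value such as `0005` (five past midnight) would otherwise be parsed as the integer 5. The following is a small illustrative sketch on made-up inline data, not the flights file itself; `sample_csv` and the two data frame names are hypothetical.

```python
import io
import pandas as pd

# Made-up two-row sample; the real flights.csv has many more columns.
sample_csv = "SCHEDULED_DEPARTURE,DISTANCE\n0005,1389\n0515,985\n"

# Default parsing infers an integer column and drops the leading zeros.
as_default = pd.read_csv(io.StringIO(sample_csv))

# Passing a dtype mapping keeps the column as text.
as_string = pd.read_csv(io.StringIO(sample_csv), dtype={"SCHEDULED_DEPARTURE": str})

print(as_default["SCHEDULED_DEPARTURE"].tolist())
print(as_string["SCHEDULED_DEPARTURE"].tolist())
```

The first list contains integers and the second contains the original four-character strings, which is what the later HHMM parsing relies on.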

#### Create New Variables

In [3]:
# Convert times into buckets for overnight, morning, afternoon, and evening, as most models cannot handle timestamps.
cut_labels = ['overnight', 'morning', 'afternoon', 'evening']
cut_bins = [0, 600, 1200, 1800, 2359]

df['SCHED_DEPARTURE_TIME'] = pd.cut(df['SCHEDULED_DEPARTURE'].astype(float), 
                                    bins=cut_bins, 
                                    labels=cut_labels, 
                                    include_lowest=True)
df['ACTUAL_DEPARTURE_TIME'] = pd.cut(df['DEPARTURE_TIME'].astype(float), 
                                     bins=cut_bins, 
                                     labels=cut_labels, 
                                     include_lowest=True)
df['SCHED_ARRIVAL_TIME'] = pd.cut(df['SCHEDULED_ARRIVAL'].astype(float), 
                                  bins=cut_bins, 
                                  labels=cut_labels, 
                                  include_lowest=True)
df['ACTUAL_ARRIVAL_TIME'] = pd.cut(df['ARRIVAL_TIME'].astype(float), 
                                  bins=cut_bins, 
                                  labels=cut_labels, 
                                  include_lowest=True)

# Bucket Flight Distance
distance_labels = ['Short', 'Medium', 'Long']
distance_bins   = [1, 100, 1000, np.inf]
df['DISTANCE_BUCKET'] = pd.cut(df['DISTANCE'],
                               bins=distance_bins,
                               labels=distance_labels)

# Create a new column where the arrival_delay > 0 means it's delayed(=1) and if <= 0 it's not delayed(=0)
get_delay = lambda x: 0 if x <= 0 else 1
df['DELAYED'] = df.ARRIVAL_DELAY.apply(get_delay)

# Look at our data with the buckets
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE,SCHED_DEPARTURE_TIME,ACTUAL_DEPARTURE_TIME,SCHED_ARRIVAL_TIME,ACTUAL_ARRIVAL_TIME,DISTANCE_BUCKET,DELAYED
70,2015,1,1,4,AA,1057,N3ASAA,DFW,MIA,515,703.0,108.0,15.0,718.0,161,155.0,133.0,1121,1031.0,7.0,856,1038.0,102.0,0,0,,0.0,0.0,0.0,0.0,102.0,American Airlines Inc.,overnight,morning,morning,morning,Long,1
124,2015,1,1,4,DL,1890,N377DA,DFW,ATL,545,603.0,18.0,13.0,616.0,124,104.0,86.0,731,842.0,5.0,849,847.0,-2.0,0,0,,,,,,,Delta Air Lines Inc.,overnight,morning,morning,morning,Medium,0
203,2015,1,1,4,AA,72,N5EKAA,DFW,MCO,600,606.0,6.0,18.0,624.0,145,142.0,120.0,985,924.0,4.0,925,928.0,3.0,0,0,,,,,,,American Airlines Inc.,overnight,morning,morning,morning,Medium,1
209,2015,1,1,4,AA,1100,N3GWAA,DFW,LGA,600,554.0,-6.0,33.0,627.0,190,191.0,154.0,1389,1001.0,4.0,1010,1005.0,-5.0,0,0,,,,,,,American Airlines Inc.,overnight,overnight,morning,morning,Long,0
310,2015,1,1,4,MQ,3015,N825MQ,DFW,BTR,600,,,,,78,,,383,,,718,,,0,1,B,,,,,,American Eagle Airlines Inc.,overnight,,morning,,Medium,1
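
To make the bucket boundaries concrete, here is a minimal sketch that runs the same `cut_bins` and `cut_labels` over a handful of made-up HHMM values sitting on or near the bin edges. The sample values are assumptions chosen for illustration; we have not checked here whether a value of 2400 actually occurs in these columns.

```python
import pandas as pd

cut_labels = ['overnight', 'morning', 'afternoon', 'evening']
cut_bins = [0, 600, 1200, 1800, 2359]

# Made-up HHMM values chosen to sit on or near the bin edges.
sample_times = pd.Series([0, 600, 601, 1200, 1801, 2359, 2400], dtype=float)

# Intervals are closed on the right, and include_lowest=True also admits the value 0.
buckets = pd.cut(sample_times, bins=cut_bins, labels=cut_labels, include_lowest=True)

for hhmm, bucket in zip(sample_times, buckets):
    print(int(hhmm), bucket)
```

Because the intervals are closed on the right, a 0600 departure lands in `overnight` rather than `morning` (consistent with the `df.head()` output above), and anything above 2359 falls outside the last bin and becomes `NaN`.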


#### Process Dates & Time

In [4]:
# Source: https://stackoverflow.com/questions/54487059/pandas-how-to-create-a-single-date-column-from-columns-containing-year-month
df['FLIGHT_DATE'] = pd.to_datetime([f'{y}-{m}-{d}' for y, m, d in zip(df.YEAR, df.MONTH, df.DAY)])

# Convert an HHMM value (e.g. 515 or 2359) into a datetime.time
def fun_format_time(hours):
    # datetime.time only accepts hours 0-23, so treat 2400 as midnight
    if hours == 2400:
        hours = 0
    hours = "{0:04d}".format(int(hours))
    return datetime.time(int(hours[0:2]), int(hours[2:4]))

In [5]:
# Define the time columns
cols = ["SCHEDULED_DEPARTURE", 
        "DEPARTURE_TIME", 
        "SCHEDULED_ARRIVAL", 
        "ARRIVAL_TIME",
        "WHEELS_ON",
        "WHEELS_OFF"]

# Convert times to float in order to correctly process them through the function
df[cols] = df[cols].astype(float)

# Run times through the new function
# Code adapted from: https://stackoverflow.com/questions/35232705/how-to-test-for-nans-in-an-apply-function-in-pandas
for col in cols:
    df[col] = df[col].apply(lambda x: fun_format_time(x) if pd.notnull(x) else x)

# Combine date & time for departure and arrival
# Source: https://stackoverflow.com/questions/17978092/combine-date-and-time-columns-using-python-pandas
df['SCHEDULED_DEPARTURE_DT'] = pd.to_datetime(df['FLIGHT_DATE'].astype(str) + ' ' + df['SCHEDULED_DEPARTURE'].astype(str))
df['SCHEDULED_ARRIVAL_DT']   = pd.to_datetime(df['FLIGHT_DATE'].astype(str) + ' ' + df['SCHEDULED_ARRIVAL'].astype(str))
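
The date and time combination above goes through strings. For a single value, the same idea can be sketched with the standard library alone. The dates and times below are made up, and the next-day adjustment is our assumption about how an arrival earlier than its departure should be read; the notebook itself attaches `FLIGHT_DATE` to both timestamps.

```python
import datetime

flight_date = datetime.date(2015, 1, 1)   # made-up flight date
sched_dep = datetime.time(22, 30)         # made-up 2230 departure
sched_arr = datetime.time(0, 40)          # made-up 0040 arrival

departure_dt = datetime.datetime.combine(flight_date, sched_dep)
arrival_dt = datetime.datetime.combine(flight_date, sched_arr)

# Assumption: an arrival that precedes the departure belongs to the following day.
if arrival_dt < departure_dt:
    arrival_dt += datetime.timedelta(days=1)

print(departure_dt)  # 2015-01-01 22:30:00
print(arrival_dt)    # 2015-01-02 00:40:00
```

Only `SCHEDULED_DEPARTURE_DT` is used for the weather merge, so this caveat would matter mainly if the arrival timestamp were used for durations or joins.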

#### Append Dallas-Area Weather

In [6]:
# Read in the data
import datetime
weather = pd.read_csv('../Data/dfw_weather.csv')
weather['dt_iso'] = weather['dt_iso'].astype(str)

# Remove "+0000 UTC"
weather['dt_iso_update'] = weather['dt_iso'].str.split('+').str[0]

# Convert new column to a datetime type
weather['date_time'] =  pd.to_datetime(weather['dt_iso_update'], format='%Y-%m-%d %H:%M')

weather['date_time'] = weather['date_time'].dt.round('30min')  
df['SCHEDULED_DEPARTURE_DT'] = df['SCHEDULED_DEPARTURE_DT'].dt.round('30min')

df = pd.merge(df, weather, left_on='SCHEDULED_DEPARTURE_DT', right_on='date_time')

# Remove unnecessary columns from weather data
col_to_drop = ['dt', 'dt_iso', 'timezone', 'city_name', 'lat', 'lon', 'feels_like', 'temp_min', 'temp_max',
              'sea_level', 'grnd_level', 'dt_iso_update', 'weather_icon', 'weather_description', 'date_time']
df = df.drop(columns = col_to_drop)
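
Because `pd.merge` defaults to an inner join, any flight whose rounded departure time has no matching weather row is dropped without warning. Whether that happens here depends on the sampling interval of the weather file, which is not shown in this notebook. The sketch below uses made-up flights and hourly readings (`flights_demo`, `weather_demo`, and the `temp` values are hypothetical) to show one way to count unmatched rows with a left join and `indicator=True`.

```python
import pandas as pd

# Made-up flights and hourly weather readings; column names mirror the notebook.
flights_demo = pd.DataFrame({
    "SCHEDULED_DEPARTURE_DT": pd.to_datetime(
        ["2015-01-01 05:10", "2015-01-01 05:50", "2015-01-01 06:20"]),
})
weather_demo = pd.DataFrame({
    "date_time": pd.to_datetime(["2015-01-01 05:00", "2015-01-01 06:00"]),
    "temp": [2.1, 1.8],
})

# Same rounding step as in the notebook.
flights_demo["SCHEDULED_DEPARTURE_DT"] = flights_demo["SCHEDULED_DEPARTURE_DT"].dt.round("30min")

# A left join keeps every flight; the _merge column records whether weather was found.
merged = pd.merge(flights_demo, weather_demo, how="left",
                  left_on="SCHEDULED_DEPARTURE_DT", right_on="date_time",
                  indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged[["SCHEDULED_DEPARTURE_DT", "temp", "_merge"]])
print("Flights without a weather match:", unmatched)
```

In this toy example the 06:20 flight rounds to 06:30 and finds no hourly reading, so an inner join would have removed it.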

#### Missing Values

In [7]:
# Remove non-critical columns WHEELS_ON and WHEELS_OFF
df = df.drop(['WHEELS_ON','WHEELS_OFF'], axis=1)

# Add category
df['ACTUAL_DEPARTURE_TIME'] = df['ACTUAL_DEPARTURE_TIME'].cat.add_categories(['N'])
df['ACTUAL_ARRIVAL_TIME'] = df['ACTUAL_ARRIVAL_TIME'].cat.add_categories(['N'])

# Fill missing values with 'N' for 'N/A'
df['ACTUAL_DEPARTURE_TIME'] = df['ACTUAL_DEPARTURE_TIME'].fillna('N')
df['ACTUAL_ARRIVAL_TIME'] = df['ACTUAL_ARRIVAL_TIME'].fillna('N')

# Convert missing values to 'N' for 'N/A'
df['CANCELLATION_REASON'] = df['CANCELLATION_REASON'].fillna('N')

# Update missing values in times to 0.
# Will be updating times to a binary (1 = yes action happened, 0 = no action happened)
df['DEPARTURE_TIME'] = df['DEPARTURE_TIME'].fillna(0)

# Change all non-null values to 1 (compare against the integer 0 used as the fill value)
df.loc[(df.DEPARTURE_TIME != 0), 'DEPARTURE_TIME'] = 1

# Change column name to 'DEPARTED'
df.rename(columns={'DEPARTURE_TIME': 'DEPARTED'}, inplace=True)

# Update remaining columns using same logic
cols = ['ARRIVAL_TIME']
df[cols] = df[cols].fillna(0)
df.loc[(df.ARRIVAL_TIME != 0), 'ARRIVAL_TIME'] = 1
df.rename(columns={'ARRIVAL_TIME': 'ARRIVED'}, inplace=True)

# Fill missing values with 0
cols = ['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY', 
       'rain_1h', 'rain_3h', 'snow_1h', 'snow_3h']
df[cols] = df[cols].fillna(0)

# Change remaining null values to 0 if flight was cancelled
df.loc[(df.CANCELLED == 1), ('DEPARTURE_DELAY', 'TAXI_OUT', 'ELAPSED_TIME','AIR_TIME','TAXI_IN','ARRIVAL_DELAY')] = 0

# Drop any rows that still contain missing values
df = df.dropna()

# Delete date columns ahead of modeling
df = df.drop(columns = ['FLIGHT_DATE', 'SCHEDULED_DEPARTURE_DT', 'SCHEDULED_ARRIVAL_DT'])

# Convert back to string
df.SCHEDULED_DEPARTURE = df.SCHEDULED_DEPARTURE.astype(str)
df.SCHEDULED_ARRIVAL = df.SCHEDULED_ARRIVAL.astype(str)

# Remove colons (regex=True is required for pattern matching in recent pandas versions)
df.SCHEDULED_DEPARTURE = df.SCHEDULED_DEPARTURE.str.replace(r'\D+', '', regex=True)
df.SCHEDULED_ARRIVAL = df.SCHEDULED_ARRIVAL.str.replace(r'\D+', '', regex=True)

# Convert to int
df.SCHEDULED_DEPARTURE = df.SCHEDULED_DEPARTURE.astype(int)
df.SCHEDULED_ARRIVAL = df.SCHEDULED_ARRIVAL.astype(int)
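
The `DEPARTED` and `ARRIVED` columns are meant to record whether an event happened (1) or not (0). A compact way to derive such a flag is to test for missingness directly, which avoids filling with a sentinel and comparing against it. This is a sketch on made-up values, not a drop-in replacement for the cell above.

```python
import datetime
import numpy as np
import pandas as pd

# Made-up departure times; NaN stands for a flight that never departed.
departure_time = pd.Series(
    [datetime.time(7, 3), np.nan, datetime.time(0, 0)], dtype=object)

# notna() is True wherever a time was recorded, including midnight.
departed = departure_time.notna().astype(int)
print(departed.tolist())  # [1, 0, 1]
```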

#### Log Transformations

In [8]:
print("Min DEPARTURE_DELAY", min(df["DEPARTURE_DELAY"]))
print("Min ARRIVAL_DELAY", min(df["ARRIVAL_DELAY"]))
print("Min DISTANCE", min(df["DISTANCE"]))
print("Min TAXI_IN", min(df["TAXI_IN"]))
print("Min ELAPSED_TIME", min(df["ELAPSED_TIME"]))
print("Min AIR_TIME", min(df["AIR_TIME"]))

Min DEPARTURE_DELAY -24.0
Min ARRIVAL_DELAY -56.0
Min DISTANCE 89
Min TAXI_IN 0.0
Min ELAPSED_TIME 0.0
Min AIR_TIME 0.0


In [9]:
# Log-transform the skewed variables. The delay columns contain negative values, so we shift each one by its
# minimum (24 and 56 minutes) so the smallest value becomes zero, and map that zero to 0 rather than log(0).
# DISTANCE is strictly positive, so np.log applies directly; the remaining columns have a minimum of 0, so we use np.log1p.
df["DEPARTURE_DELAY_log"] = df["DEPARTURE_DELAY"].map(lambda i: np.log(i + 24) if i != -24 else 0) 
df["ARRIVAL_DELAY_log"]   = df["ARRIVAL_DELAY"].map(lambda i: np.log(i + 56) if i != -56 else 0)
df["DISTANCE_log"]        = np.log(df["DISTANCE"])
df["TAXI_IN_log"]         = np.log1p(df["TAXI_IN"])
df["ELAPSED_TIME_log"]    = np.log1p(df["ELAPSED_TIME"])
df["AIR_TIME_log"]        = np.log1p(df["AIR_TIME"])
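
An alternative that avoids hard-coding the offsets is to shift by the observed minimum and apply `np.log1p`. This is a sketch on made-up delays, not the transformation used in this notebook. One difference worth noting: with `np.log(i + 24)`, a delay of -23 maps to `log(1) = 0`, the same value assigned to -24, whereas the shifted `log1p` keeps the two distinct.

```python
import numpy as np

delays = np.array([-24.0, -23.0, 0.0, 18.0, 108.0])  # made-up delays in minutes

# Shift so the smallest value becomes 0, then use log1p so that 0 maps to 0.
shifted_log = np.log1p(delays - delays.min())
print(shifted_log.round(3))
```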

#### Feature Removals

In [10]:
# Here we remove redundant columns to further reduce the data size. Columns that are being removed:
# `YEAR`: All rows are from 2015, no need to include this.
# `AIRLINE`: We have AIRLINE_CODE which is the same information
col_to_drop1 = ['YEAR','AIRLINE']
df = df.drop(columns = col_to_drop1)

#### Encoding

In [11]:
# Keep only flights whose tail number appears at least 5 times
df = df[df.groupby('TAIL_NUMBER').TAIL_NUMBER.transform(len) > 4]

# Encode Destination Airport & Tail Number
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['DESTINATION_AIRPORT_encode'] = labelencoder.fit_transform(df['DESTINATION_AIRPORT'])
df.dropna(subset = ["DESTINATION_AIRPORT_encode"], inplace=True)
df['TAIL_NUMBER_encode'] = labelencoder.fit_transform(df['TAIL_NUMBER'])

# Drop original columns
col_to_drop2 = ['TAIL_NUMBER','DESTINATION_AIRPORT']
df = df.drop(columns = col_to_drop2)

# One-hot encode categorical columns
categorical_columns = ['AIRLINE_CODE', 'CANCELLATION_REASON', 'SCHED_DEPARTURE_TIME', 
                       'ACTUAL_DEPARTURE_TIME','SCHED_ARRIVAL_TIME', 'ACTUAL_ARRIVAL_TIME',
                       'DISTANCE_BUCKET', 'weather_main', 'ORIGIN_AIRPORT']

# get_dummies encodes every listed column in a single call, so no loop is needed
tempdf = pd.get_dummies(df[categorical_columns], prefix = categorical_columns, drop_first = True)
df_OHE = pd.merge(
    left = df,
    right = tempdf,
    left_index=True,
    right_index=True
)
df_OHE = df_OHE.drop(columns = categorical_columns)
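
To show what `drop_first=True` does to the column set, here is a small sketch on a made-up frame (`demo` is hypothetical). The first level of each variable is dropped and acts as the reference category, which is why the final data set has distance columns for medium and long only.

```python
import pandas as pd

demo = pd.DataFrame({
    "AIRLINE_CODE": ["AA", "DL", "MQ", "AA"],
    # Ordered the same way as distance_labels in the notebook.
    "DISTANCE_BUCKET": pd.Categorical(
        ["Long", "Medium", "Medium", "Long"],
        categories=["Short", "Medium", "Long"]),
})

encoded = pd.get_dummies(
    demo, prefix=["AIRLINE_CODE", "DISTANCE_BUCKET"], drop_first=True)
print(list(encoded.columns))
```

A row with zeros in both `AIRLINE_CODE_DL` and `AIRLINE_CODE_MQ` therefore represents the reference airline, `AA` in this toy example.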

In [12]:
# Scheduled time needs to be int
df_OHE['SCHEDULED_TIME'] = df_OHE['SCHEDULED_TIME'].astype(int)

#### Flight Delay Response Variable

In [13]:
# Add response variable bucket for delay time for departure
# 0 is Early (negative time)
# 1 is On_Time or between 0 and 10 minutes late
# 2 is Late (between 11 and 30 min late)
# 3 is very late (between 31 and 60 min late)
# 4 is extremely late (61 min late or more)

delay_labels = ['0', '1', '2', '3', '4']
delay_bins   = [-np.inf, -1, 10, 30, 60, np.inf]
df_OHE['DELAY_BUCKET'] = pd.cut(df_OHE['DEPARTURE_DELAY'],
                               bins=delay_bins,
                               labels=delay_labels)

#check counts by bucket
df_OHE['DELAY_BUCKET'].value_counts()

0    99535
1    50033
2    24784
3    13854
4    13030
Name: DELAY_BUCKET, dtype: int64
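
As a quick check that the bin edges match the bucket definitions, the sketch below runs `delay_bins` over made-up delays sitting on each boundary. It passes `labels=False`, which returns the integer position of each bin; that is our shortcut for illustration and yields the same 0 to 4 codes as the string labels used above.

```python
import numpy as np
import pandas as pd

delay_bins = [-np.inf, -1, 10, 30, 60, np.inf]

# Made-up departure delays (whole minutes) on each side of every bin edge.
boundary_delays = pd.Series([-1, 0, 10, 11, 30, 31, 60, 61])

# labels=False returns the bin index instead of a categorical label.
bucket_codes = pd.cut(boundary_delays, bins=delay_bins, labels=False)
print(dict(zip(boundary_delays.tolist(), bucket_codes.tolist())))
```

The edges behave as described only because delays are recorded in whole minutes: a fractional delay such as -0.5 would fall into bucket 1.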

In [14]:
# Convert from category to int
df_OHE['DELAY_BUCKET'] = df_OHE['DELAY_BUCKET'].astype(int)

# Drop unnecessary columns
col_to_drop3 = ['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'DISTANCE', 'TAXI_IN', 'ELAPSED_TIME', 'AIR_TIME']
df_OHE = df_OHE.drop(columns = col_to_drop3)

In [15]:
# Create the delay data set
df_delay = df_OHE

In [16]:
# Filter out cancelled flights
df_delay = df_delay[df_delay.CANCELLED == 0]

col_to_drop4 = ['CANCELLED', 
                'CANCELLATION_REASON_B', 
                'CANCELLATION_REASON_C', 
                'CANCELLATION_REASON_N', 
                'ACTUAL_DEPARTURE_TIME_morning', 
                'ACTUAL_DEPARTURE_TIME_afternoon', 
                'ACTUAL_DEPARTURE_TIME_evening',
                'ACTUAL_DEPARTURE_TIME_N',
                'ACTUAL_ARRIVAL_TIME_morning',
                'ACTUAL_ARRIVAL_TIME_afternoon',
                'SCHEDULED_DEPARTURE',
                'SCHEDULED_ARRIVAL',
                'AIR_SYSTEM_DELAY',
                'SECURITY_DELAY', 
                'ACTUAL_ARRIVAL_TIME_evening',
                'ACTUAL_ARRIVAL_TIME_N',
                'AIRLINE_DELAY', 
                'LATE_AIRCRAFT_DELAY', 
                'WEATHER_DELAY', 
                'DELAYED', 
                'DEPARTURE_DELAY_log',
                'ARRIVAL_DELAY_log', 
                'ELAPSED_TIME_log', 
                'DEPARTED', 
                'ARRIVED',
                'TAXI_IN_log',
                'AIR_TIME_log']

df_delay = df_delay.drop(columns = col_to_drop4)

In [17]:
# Drop columns whose correlations exceeded 0.8 in our Lab 2 correlation matrix
col_to_drop7 = ['DISTANCE_log', 'DIVERTED']

df_delay = df_delay.drop(columns = col_to_drop7)

## Data Understanding 2
Jump to [top](#Rubric)

> Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.

### Final Data Set

The delay data set contains basic flight information from our original data plus weather data for the appropriate date & time of each flight, label-encoded variables for `DESTINATION_AIRPORT` and `TAIL_NUMBER`, and one-hot encoded airline codes.
Newly created variables include buckets for the flight’s scheduled departure and arrival times (morning, afternoon, and evening), distance (medium and long), and a response variable, `DELAY_BUCKET`, that groups flights by length of departure delay in minutes:

- **Early** is defined as 0 and is any value where `DEPARTURE_DELAY` < 0.
- **On-Time** is defined as 1 and is any value where 0 <= `DEPARTURE_DELAY` <= 10.
- **Late** is defined as 2 and is any value where 11 <= `DEPARTURE_DELAY` <= 30.
- **Very Late** is defined as 3 and is any value where 31 <= `DEPARTURE_DELAY` <= 60.
- **Extremely Late** is defined as 4 and is any value where `DEPARTURE_DELAY` >= 61.

In [None]:
# Save data
# df_delay.to_csv('../Data/df_delay.csv', index=False)

In [None]:
# Load data from here to save time
# df_delay = pd.read_csv('../Data/df_delay.csv')

# Modeling and Evaluation
Jump to [top](#Rubric)

## Modeling and Evaluation 1

> Train and adjust parameters

### Best Performing Classifier Model from Lab 2

In Lab 2, our best-performing model was KNN, tuned with grid search and trained on data oversampled with SMOTE. We've included it here as a baseline for our clustering models.

In [18]:
# Create X and y for delay data set
if 'DELAY_BUCKET' in df_delay:
    y_del = df_delay['DELAY_BUCKET'].values
    X_del = df_delay.iloc[:,:-1].values

In [19]:
# Oversample using SMOTE
oversample = SMOTE()
X_del_smote, y_del_smote = oversample.fit_resample(X_del, y_del)
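
For intuition about what the oversampling step adds, SMOTE creates synthetic minority samples by interpolating between an existing minority point and one of its nearest minority-class neighbours. The sketch below is a simplified illustration in NumPy, not imbalanced-learn's implementation: it always uses the single nearest neighbour, whereas the library picks at random among several. The function name `synthesize` and the points are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0]])

def synthesize(points, index, rng):
    """Create one synthetic sample between points[index] and its nearest neighbour."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    distances[index] = np.inf               # exclude the point itself
    neighbour = points[np.argmin(distances)]
    gap = rng.random()                      # uniform in [0, 1)
    return base + gap * (neighbour - base)

new_point = synthesize(minority, 0, rng)
print(new_point)
```

Since every synthetic point lies on a segment between two real points, oversampling before the train/test split can place close relatives of training rows in the test set, which is worth keeping in mind when reading the baseline's scores.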

In [20]:
from sklearn.model_selection import StratifiedShuffleSplit

sss=StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 

for train_index, test_index in sss.split(X_del_smote, y_del_smote):
    X_train_del_smote, X_test_del_smote = X_del_smote[train_index], X_del_smote[test_index]
    y_train_del_smote, y_test_del_smote = y_del_smote[train_index], y_del_smote[test_index]

print("Split on Oversampled Data:\n")
print('Training Features Shape:', X_train_del_smote.shape)
print('Training Labels Shape:', y_train_del_smote.shape)
print('Testing Features Shape:', X_test_del_smote.shape)
print('Testing Labels Shape:', y_test_del_smote.shape)

Split on Oversampled Data:

Training Features Shape: (398140, 49)
Training Labels Shape: (398140,)
Testing Features Shape: (99535, 49)
Testing Labels Shape: (99535,)


In [None]:
%%time
# Reference: https://realpython.com/knn-python/
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Create KNN Classifier
parameters = {
     "n_neighbors": list(range(1,20,2)),
     "weights": ["uniform", "distance"],
 }
gridsearch = GridSearchCV(KNeighborsClassifier(), 
                          parameters, 
                          cv = 10, 
                          scoring = 'f1_weighted')

gridsearch.fit(X_train_del_smote, y_train_del_smote)
gridsearch.best_params_
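
The cost of this search follows directly from the grid: every combination of settings is fitted once per fold. A short standard-library sketch of the arithmetic, reusing the same `parameters` dictionary (`cv_folds` is our name for the `cv = 10` argument):

```python
from itertools import product

parameters = {
    "n_neighbors": list(range(1, 20, 2)),
    "weights": ["uniform", "distance"],
}
cv_folds = 10

# Every combination of the listed values is one candidate model.
candidates = list(product(*parameters.values()))

print("Candidate settings:", len(candidates))                    # 10 * 2 = 20
print("Model fits during search:", len(candidates) * cv_folds)   # 20 * 10 = 200
```

With roughly 400,000 training rows, 200 cross-validated KNN fits (plus the final refit on the full training set) help explain why this cell is slow to run.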

In [None]:
test_preds_grid2 = gridsearch.predict(X_test_del_smote)

In [None]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test_del_smote, test_preds_grid2))
print('Weighted Precision: {:.2f}'.format(precision_score(y_test_del_smote, test_preds_grid2, average = 'weighted')))
print('Weighted Recall: {:.2f}'.format(recall_score(y_test_del_smote, test_preds_grid2, average = 'weighted')))
print('Weighted F1-score: {:.2f}'.format(f1_score(y_test_del_smote, test_preds_grid2, average = 'weighted')))
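
The weighted scores above average the per-class metric using each class's share of the true labels as its weight. The following is an illustrative re-implementation of weighted F1 in plain Python on made-up labels; `weighted_f1` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative re-implementation)."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count              # weight each class by its support
    return total / len(y_true)

y_true = [0, 0, 0, 1, 1, 4]   # made-up delay buckets
y_pred = [0, 0, 1, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because SMOTE balances the classes and the split is stratified, every class has nearly the same support in the test set, so the weighted and macro averages should be close to each other for this baseline.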

In [None]:
# Save and run model with K=1 and pull metrics
knn_delay = KNeighborsClassifier(n_neighbors = 1, weights = 'uniform')
knn_delay.fit(X_train_del_smote, y_train_del_smote)
y_pred_knn_del = knn_delay.predict(X_test_del_smote)
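
With `n_neighbors = 1`, each prediction simply copies the label of the closest training point. A minimal NumPy sketch of that rule on made-up data (Euclidean distance, matching the default metric of `KNeighborsClassifier`):

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # made-up features
y_train = np.array([0, 0, 4])                              # made-up delay buckets
X_new = np.array([[0.9, 1.2], [4.0, 4.5]])

# Pairwise Euclidean distances, shape (n_new, n_train).
distances = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)

# Each new point takes the label of its single nearest training point.
predictions = y_train[distances.argmin(axis=1)]
print(predictions)  # [0 4]
```

Every training point is its own nearest neighbour under this rule, so training accuracy is not informative for K=1; only held-out performance is.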

## Modeling and Evaluation 2
Jump to [top](#Rubric)

> Evaluate and compare

## Modeling and Evaluation 3
Jump to [top](#Rubric)

> Visualize results

## Modeling and Evaluation 4
Jump to [top](#Rubric)

> Summarise the ramifications

# Deployment
Jump to [top](#Rubric)

# Exceptional Work
Jump to [top](#Rubric)