## Importing Data

Before any analysis, the `taxi_clean_lg` dataset is imported. The pickup and dropoff times are then parsed as datetimes and converted to integers, and `ride_duration` is recomputed as the difference between them.

In [3]:
import warnings
warnings.simplefilter("ignore")

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk
import time

#import data 
data = pd.read_csv("taxi_clean_lg.csv")

#display data
print(data.shape)
      
data.head()

(80501, 19)


Unnamed: 0,trip_distance,fare_amount,winter,spring,summer,fall,PULongitude,PULatitude,DOLongitude,DOLatitude,pickup_datetime,dropoff_datetime,ride_duration,Early morning,Morning,Afternoon,Night,Holiday Proximity,label
0,5.9,41.5,0,1,0,0,-73.984176,40.759845,-73.961815,40.80957,2019-03-26 14:24:29,2019-03-26 15:26:27,0 days 01:01:58.000000000,0,0,1,0,1,J
1,7.31,28.0,0,0,1,0,-73.965572,40.78246,-73.853384,40.752316,2019-07-03 07:15:18,2019-07-03 07:49:08,0 days 00:33:50.000000000,0,1,0,0,0,J
2,0.99,5.5,0,1,0,0,-73.981352,40.773906,-73.987973,40.77577,2019-05-25 17:25:49,2019-05-25 17:30:21,0 days 00:04:32.000000000,0,0,1,0,0,B
3,1.91,9.0,0,0,1,0,-73.972145,40.756816,-73.956972,40.780491,2019-07-22 15:31:00,2019-07-22 15:41:36,0 days 00:10:36.000000000,0,0,1,0,0,C
4,1.18,7.5,0,1,0,0,-73.965691,40.768542,-73.954568,40.765507,2019-03-13 21:13:28,2019-03-13 21:21:42,0 days 00:08:14.000000000,0,0,0,0,0,C


In [4]:
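# parse the timestamp strings, then convert them to integer epoch values
# so the scaler and classifiers can treat them as numeric features
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['pickup_datetime'] = pd.to_numeric(data['pickup_datetime'])
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'])
data['dropoff_datetime'] = pd.to_numeric(data['dropoff_datetime'])

# ride duration as the difference of the two integer timestamps
data['ride_duration'] = data['dropoff_datetime'] - data['pickup_datetime']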
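
The duration can also be computed without going through integers. The sketch below is a minimal illustration on two rows copied from the `data.head()` output above; the column name `ride_duration_s` is an assumption introduced here and is not part of the original dataset. It subtracts the parsed timestamps and expresses the result in seconds.

```python
import pandas as pd

# Two rows copied from the data.head() output above
sample = pd.DataFrame({
    "pickup_datetime": ["2019-03-26 14:24:29", "2019-07-03 07:15:18"],
    "dropoff_datetime": ["2019-03-26 15:26:27", "2019-07-03 07:49:08"],
})

pickup = pd.to_datetime(sample["pickup_datetime"])
dropoff = pd.to_datetime(sample["dropoff_datetime"])

# Subtracting two datetime columns yields Timedelta values;
# total_seconds() turns them into plain floats
sample["ride_duration_s"] = (dropoff - pickup).dt.total_seconds()
print(sample["ride_duration_s"].tolist())
```

Seconds keep the feature on a readable scale. The integer route used in the cell above gives a difference whose unit depends on the datetime resolution pandas uses (nanoseconds for `datetime64[ns]`).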

## KNN Analysis

The libraries needed for the analysis are imported first. The feature values and class labels are then defined, and a pipeline of scaling, PCA, and KNN is built so that the best combination of PCA dimensions and `n_neighbors` can be searched for. A parameter grid is set up for the pipeline steps, the pipeline is passed into `GridSearchCV`, and the best score and parameters are printed at the end of the process.

In [7]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

featurevalues = data.drop(['label'], axis = 1)
classlabels = data['label']

# define a pipeline to search for best combination of PCA dimensions and n_neighbors
scaler = MinMaxScaler()
pca = PCA()
knn = KNeighborsClassifier()

# create a pipeline
pipe = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('knn', knn)])

# set up parameters to tune for each step in pipeline
param_grid = {
    'pca__n_components': list(range(1, 19)), # find how many principal components to keep
    'knn__n_neighbors': list(range(1, 30)),  # find the best value of k
}

# pass pipeline into gridsearchcv
grid_pipe = GridSearchCV(pipe,param_grid,cv=5)

# call fit on grid_pipe and pass in unscaled data
grid_pipe = grid_pipe.fit(featurevalues,classlabels)

# print out the best_score_ and best_params_ from the GridSearchCV
print("best_score",grid_pipe.best_score_)
print("best_params",grid_pipe.best_params_)

best_score 0.7165376827617048
best_params {'knn__n_neighbors': 5, 'pca__n_components': 12}
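
Only the winning combination is printed above. A rough sketch of how the full grid could be inspected through `cv_results_` follows. Synthetic data from `make_classification` stands in for the taxi features and the grid is deliberately reduced; both are assumptions made so the example runs in seconds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the taxi features (assumption: not the real data)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

pipe = Pipeline(steps=[("scaler", MinMaxScaler()),
                       ("pca", PCA()),
                       ("knn", KNeighborsClassifier())])

# Deliberately tiny grid so the sketch finishes quickly
param_grid = {"pca__n_components": [2, 4, 6],
              "knn__n_neighbors": [3, 5, 7]}

grid = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)[
    ["param_pca__n_components", "param_knn__n_neighbors",
     "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```

Looking at the whole table shows whether the best score is a clear winner or one of several near-equal combinations.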


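`cross_val_score` was used to estimate the accuracy of the model. Because the `GridSearchCV` object itself is passed in, the grid search is rerun inside every outer fold, which is why each fold takes about 24 minutes. The results were then examined further with a classification report and a confusion matrix.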

In [10]:
# display accuracy on model
scores = cross_val_score(grid_pipe,featurevalues,classlabels,cv=3,verbose=2)
print("Accuracy:", scores.mean()*100)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................
[CV] ................................................. , total=23.8min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 23.8min remaining:    0.0s


[CV] ................................................. , total=23.8min
[CV]  ................................................................
[CV] ................................................. , total=23.6min
Accuracy: 71.28976539592934


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 71.2min finished


In [6]:
# show results with classification report and confusion matrix
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
featurevalues = data.drop(['label'], axis = 1)
classlabels = data['label']
y_pred = cross_val_predict(knn,featurevalues,classlabels,cv=3)
print("Confusion matrix:",confusion_matrix(classlabels,y_pred))
print("Classification Report:",classification_report(classlabels,y_pred))

Confusion matrix: [[ 798 2642 1428  572  228   65   34    8    3   93]
 [2294 8312 4988 2021  759  256  124   53   27  308]
 [1887 7066 4338 1911  733  266  132   47   16  272]
 [1212 4559 3010 1428  581  230  106   39   18  236]
 [ 724 2782 1874  975  444  176   81   42   19  185]
 [ 427 1640 1200  691  301  136   66   34   12  170]
 [ 261 1091  818  485  228   92   47   19   10  139]
 [ 172  711  588  346  144   65   38   25    6  139]
 [ 138  496  419  217  118   52   28   19   10  117]
 [ 521 2193 1917 1280  714  339  189   86   47 1098]]
Classification Report:               precision    recall  f1-score   support

           A       0.09      0.14      0.11      5871
           B       0.26      0.43      0.33     19142
           C       0.21      0.26      0.23     16668
           D       0.14      0.13      0.13     11419
           E       0.10      0.06      0.08      7302
           F       0.08      0.03      0.04      4677
           G       0.06      0.01      0.02      

The F1 score was greatest for class B, while the lowest was for class I. In the confusion matrix, rows are true classes and columns are predicted classes. Class I was rarely recovered: only 10 of its 1614 entries were classified correctly, while 117 were assigned to class J. Many entries from other classes were also misclassified as class B. For example, 2642 entries from class A were predicted as class B, and 7066 entries from class C were predicted as class B, which exceeds the 4338 entries correctly classified as class C. Note that this report comes from a default `KNeighborsClassifier` on unscaled features rather than from the tuned pipeline, so it is not directly comparable to the 71.3% accuracy reported above.
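
As a worked check, the overall accuracy implied by the confusion matrix can be computed directly from the printed values. The sketch below copies the KNN matrix from the output above and assumes the usual scikit-learn layout (rows are true classes A to J, columns are predicted classes A to J).

```python
import numpy as np

# KNN confusion matrix copied from the output above
# (rows: true classes A..J, columns: predicted classes A..J)
cm = np.array([
    [ 798, 2642, 1428,  572,  228,   65,   34,    8,    3,   93],
    [2294, 8312, 4988, 2021,  759,  256,  124,   53,   27,  308],
    [1887, 7066, 4338, 1911,  733,  266,  132,   47,   16,  272],
    [1212, 4559, 3010, 1428,  581,  230,  106,   39,   18,  236],
    [ 724, 2782, 1874,  975,  444,  176,   81,   42,   19,  185],
    [ 427, 1640, 1200,  691,  301,  136,   66,   34,   12,  170],
    [ 261, 1091,  818,  485,  228,   92,   47,   19,   10,  139],
    [ 172,  711,  588,  346,  144,   65,   38,   25,    6,  139],
    [ 138,  496,  419,  217,  118,   52,   28,   19,   10,  117],
    [ 521, 2193, 1917, 1280,  714,  339,  189,   86,   47, 1098],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Column 1 holds every entry that was predicted as class B
predicted_as_b = cm[:, 1].sum()

print(f"accuracy: {accuracy:.3f}")
print(f"share predicted as B: {predicted_as_b / cm.sum():.3f}")
```

The diagonal sums to 16636 of 80501 entries, an accuracy of roughly 20.7%, which is far below the cross-validated score of the tuned pipeline because the report cell evaluates a different, untuned model.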

## SVM Analysis

The required libraries were imported first, then the pipeline was set up. Using the default hyperparameter values, the accuracy of the pipeline was estimated with `cross_val_score`.

In [11]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
svc = SVC()

# set up pipeline
pipe = Pipeline(steps=[('scaler',scaler),('svc',svc)])
pipeline = cross_val_score(pipe,featurevalues,classlabels,cv=5)
print("Accuracy:", pipeline.mean()*100)

Accuracy: 93.59635721561409


The `kernel` hyperparameter of the SVM pipeline was tuned next, and the best of the tested kernels was found and printed. To tune the model further, a range of values for `C` was added to the grid, and the accuracy of the resulting grid search was estimated with `cross_val_score`.

In [13]:
# tune 'svm' part of the pipeline, 'kernel' hyperparameter
param_grid = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

# find and print best parameter
gridsearch = GridSearchCV(pipe,param_grid,cv=3)
gridsearch = gridsearch.fit(featurevalues,classlabels)
print(gridsearch.best_params_)

# find best value of c and print accuracy
c = list(range(50, 110, 10))
param_grid = {'svc__kernel':['linear','rbf','poly','sigmoid'],'svc__C':c}
grid_search = GridSearchCV(pipe,param_grid,cv=3)
grid_accuracy = cross_val_score(grid_search,featurevalues,classlabels,cv=3,verbose=2)
print("Accuracy:",grid_accuracy.mean()*100)

{'svc__kernel': 'linear'}
[CV]  ................................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................................................. , total=15.1min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 15.1min remaining:    0.0s


[CV] ................................................. , total=15.3min
[CV]  ................................................................
[CV] ................................................. , total=14.9min
Accuracy: 99.99006221975525


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 45.4min finished
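
The cell above reports the accuracy of the nested search but does not print which `C` was selected. A minimal sketch of reading the chosen values from a fitted search is shown below. The synthetic data and the reduced kernel list are assumptions made for illustration, not the original setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the taxi features (assumption: not the real data)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# Same C range as the cell above, with fewer kernels to keep the run short
param_grid = {"svc__kernel": ["linear", "rbf"],
              "svc__C": list(range(50, 110, 10))}

# Fitting the search directly exposes the selected parameters
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Fitting the search once gives `best_params_`, while wrapping it in `cross_val_score` as above gives an accuracy estimate for the whole tuning procedure; the two answer different questions.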


In [5]:
# show results with classification report and confusion matrix
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

svc = SVC()
featurevalues = data.drop(['label'], axis = 1)
classlabels = data['label']
y_pred = cross_val_predict(svc,featurevalues,classlabels,cv=3)
print("Confusion matrix:",confusion_matrix(classlabels,y_pred))
print("Classification report:",classification_report(classlabels,y_pred))

Confusion matrix: [[    0  5871     0     0     0     0     0     0     0     0]
 [    0 19142     0     0     0     0     0     0     0     0]
 [    0 16668     0     0     0     0     0     0     0     0]
 [    0 11419     0     0     0     0     0     0     0     0]
 [    0  7302     0     0     0     0     0     0     0     0]
 [    0  4677     0     0     0     0     0     0     0     0]
 [    0  3190     0     0     0     0     0     0     0     0]
 [    0  2234     0     0     0     0     0     0     0     0]
 [    0  1614     0     0     0     0     0     0     0     0]
 [    0  8382     0     0     0     0     0     0     0     2]]
Classification report:               precision    recall  f1-score   support

           A       0.00      0.00      0.00      5871
           B       0.24      1.00      0.38     19142
           C       0.00      0.00      0.00     16668
           D       0.00      0.00      0.00     11419
           E       0.00      0.00      0.00      7302
   

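The accuracy reported by `cross_val_score` is very high at 99.99%. The classification report, however, shows nearly every entry being classified as B (recall of 1.00) at low precision (0.24), and all but 2 of the entries in J being misclassified. The two results describe different models: the report was produced by a default `SVC()` fit on the unscaled features, not by the scaled, tuned pipeline that was scored above. From the confusion matrix, that unscaled model classifies only 19144 of 80501 entries correctly, about 23.8%. The report therefore can neither confirm nor refute the 99.99% figure. Class imbalance may contribute to the collapse onto class B, which is the largest class.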
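
One way to make the report describe the same model that was scored is to pass the pipeline, rather than a bare `SVC()`, to `cross_val_predict`. The sketch below compares the two on synthetic data. The data, the three-class setup, and the appended epoch-scale column (meant to mimic the numeric datetime features) are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the taxi features (assumption: not the real data)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Append an uninformative column on an epoch-like scale (assumption: mimics
# the numeric datetime columns, which dwarf the other features)
rng = np.random.default_rng(0)
big = 1.5e18 + rng.normal(scale=1e15, size=(X.shape[0], 1))
X = np.hstack([X, big])

models = {
    "plain SVC": SVC(),
    "scaled pipeline": Pipeline(steps=[("scaler", StandardScaler()),
                                       ("svc", SVC())]),
}

# Out-of-fold predictions for each model, evaluated the same way
reports = {}
for name, model in models.items():
    y_pred = cross_val_predict(model, X, y, cv=3)
    reports[name] = confusion_matrix(y, y_pred)
    print(name)
    print(reports[name])
    print(classification_report(y, y_pred, zero_division=0))
```

Comparing the two matrices side by side shows how much of a collapse onto a single class comes from the missing scaling step rather than from the classifier itself.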