It costs 50 cents per 1/5 mile when traveling above 12 mph, or 50 cents per 60 seconds in slow traffic or when the vehicle is stopped. Because traffic varies throughout the day, we hope to come up with a better fare estimate.
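
As a rough illustration of the rate described above, here is a minimal sketch of the metered part of a fare. It assumes units are charged pro rata (a real meter charges whole units) and ignores any base fare or surcharges; the function name `metered_charge` is our own, not part of the dataset or any library.

```python
def metered_charge(distance_miles: float, slow_seconds: float, unit_price: float = 0.50) -> float:
    """Metered charge: one unit per 1/5 mile driven above 12 mph,
    plus one unit per 60 seconds spent in slow traffic or stopped.

    Assumption: fractional units are charged pro rata, no base fare.
    """
    distance_units = distance_miles * 5   # 5 units per mile
    time_units = slow_seconds / 60        # 1 unit per minute
    return unit_price * (distance_units + time_units)


# 2 miles at speed (10 units) plus 5 minutes stopped (5 units)
print(metered_charge(2.0, 300))  # 7.5
```

The same trip costs more when more of it is spent in slow traffic, which is why time of day matters for the estimate.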

Previously, we finished EDA and feature engineering and exported the final data. We binned the data by fare_amount into classes labeled A-J. Here, we try to classify fares using the Random Forest and Decision Tree algorithms, given variables like time of day, holiday proximity, and approximate pickup/dropoff location.
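
The binning itself was done in the earlier notebook, so the edges are not shown here. As a sketch of how such labels can be produced with `pd.cut`, the example below uses bin edges that are invented for illustration only; they are an assumption, not the edges used to build `taxi_clean_lg.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges (assumption): the real edges come from the earlier notebook
edges = [0, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, np.inf]
class_names = list("ABCDEFGHIJ")  # ten classes, A-J

fares = pd.Series([5.5, 7.5, 9.0, 41.5], name="fare_amount")

# right=False makes each bin closed on the left, e.g. [5, 7.5)
fare_labels = pd.cut(fares, bins=edges, labels=class_names, right=False)
print(fare_labels.astype(str).tolist())
```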

## Import taxi_clean

We import the libraries needed for Decision Trees and Random Forest. We also import our final data, compiled in Taxi_Exploaration.ipynb.

In [70]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_val_predict
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import mean_squared_error

# Import data
data = pd.read_csv("taxi_clean_lg.csv")

# Check data types
print(data.dtypes)

# Display the shape and the first rows
print('\n', data.shape)
data.head()

trip_distance        float64
fare_amount          float64
winter                 int64
spring                 int64
summer                 int64
fall                   int64
PULongitude          float64
PULatitude           float64
DOLongitude          float64
DOLatitude           float64
pickup_datetime       object
dropoff_datetime      object
ride_duration         object
Early morning          int64
Morning                int64
Afternoon              int64
Night                  int64
Holiday Proximity      int64
label                 object
dtype: object

 (80501, 19)


Unnamed: 0,trip_distance,fare_amount,winter,spring,summer,fall,PULongitude,PULatitude,DOLongitude,DOLatitude,pickup_datetime,dropoff_datetime,ride_duration,Early morning,Morning,Afternoon,Night,Holiday Proximity,label
0,5.9,41.5,0,1,0,0,-73.984176,40.759845,-73.961815,40.80957,2019-03-26 14:24:29,2019-03-26 15:26:27,0 days 01:01:58.000000000,0,0,1,0,1,J
1,7.31,28.0,0,0,1,0,-73.965572,40.78246,-73.853384,40.752316,2019-07-03 07:15:18,2019-07-03 07:49:08,0 days 00:33:50.000000000,0,1,0,0,0,J
2,0.99,5.5,0,1,0,0,-73.981352,40.773906,-73.987973,40.77577,2019-05-25 17:25:49,2019-05-25 17:30:21,0 days 00:04:32.000000000,0,0,1,0,0,B
3,1.91,9.0,0,0,1,0,-73.972145,40.756816,-73.956972,40.780491,2019-07-22 15:31:00,2019-07-22 15:41:36,0 days 00:10:36.000000000,0,0,1,0,0,C
4,1.18,7.5,0,1,0,0,-73.965691,40.768542,-73.954568,40.765507,2019-03-13 21:13:28,2019-03-13 21:21:42,0 days 00:08:14.000000000,0,0,0,0,0,C


Three columns (pickup_datetime, dropoff_datetime, and ride_duration) have dtype object, so we need to convert them to numeric values.

In [71]:
# Convert the datetime columns to numeric values so the models can use them
# (pd.to_numeric turns each timestamp into nanoseconds since the Unix epoch)
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['pickup_datetime'] = pd.to_numeric(data['pickup_datetime'])
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'])
data['dropoff_datetime'] = pd.to_numeric(data['dropoff_datetime'])

# Recompute the ride duration, now in nanoseconds
data['ride_duration'] = data['dropoff_datetime'] - data['pickup_datetime']

# Fill any empty NA cells with 0
data.fillna(0, inplace=True)
data.head()

Unnamed: 0,trip_distance,fare_amount,winter,spring,summer,fall,PULongitude,PULatitude,DOLongitude,DOLatitude,pickup_datetime,dropoff_datetime,ride_duration,Early morning,Morning,Afternoon,Night,Holiday Proximity,label
0,5.9,41.5,0,1,0,0,-73.984176,40.759845,-73.961815,40.80957,1553610269000000000,1553613987000000000,3718000000000,0,0,1,0,1,J
1,7.31,28.0,0,0,1,0,-73.965572,40.78246,-73.853384,40.752316,1562138118000000000,1562140148000000000,2030000000000,0,1,0,0,0,J
2,0.99,5.5,0,1,0,0,-73.981352,40.773906,-73.987973,40.77577,1558805149000000000,1558805421000000000,272000000000,0,0,1,0,0,B
3,1.91,9.0,0,0,1,0,-73.972145,40.756816,-73.956972,40.780491,1563809460000000000,1563810096000000000,636000000000,0,0,1,0,0,C
4,1.18,7.5,0,1,0,0,-73.965691,40.768542,-73.954568,40.765507,1552511608000000000,1552512102000000000,494000000000,0,0,0,0,0,C
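
The ride_duration values above are in nanoseconds (3718000000000 for the first row, which is 1 hour, 1 minute, 58 seconds). If a more readable unit is wanted, one option is to subtract the parsed datetimes and convert the result to seconds. This is a sketch on two rows copied from the table above; the column names `ride_seconds` and `ride_minutes` are our own.

```python
import pandas as pd

# Two rows copied from the output above
trips = pd.DataFrame({
    "pickup_datetime": ["2019-03-26 14:24:29", "2019-05-25 17:25:49"],
    "dropoff_datetime": ["2019-03-26 15:26:27", "2019-05-25 17:30:21"],
})

pickup = pd.to_datetime(trips["pickup_datetime"])
dropoff = pd.to_datetime(trips["dropoff_datetime"])

# Subtracting datetimes gives a timedelta; total_seconds() returns a float
trips["ride_seconds"] = (dropoff - pickup).dt.total_seconds()
trips["ride_minutes"] = trips["ride_seconds"] / 60

print(trips[["ride_seconds", "ride_minutes"]])
```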


## Random Forest

We begin by splitting the dataset into training and testing sets. We then create a Random Forest classifier from sklearn.ensemble and run GridSearchCV with 5-fold cross-validation to find the best hyperparameters, which helps us avoid overfitting the training data.

In [72]:
# Split dataset into training and test sets
# Note: test_size=0.8 holds out 80% of the rows for testing and trains on 20%
# Note: X keeps every other column, including fare_amount, which the labels were binned from
y = data['label']
X = data.drop(columns=['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)


# Create the Random Forest classifier (the grid search overrides max_depth and max_features)
random_forest = RandomForestClassifier(n_estimators=200, max_depth=5, max_features=4, random_state=42)

# Define the parameter grid and run the 5-fold grid search
param_grid = {'n_estimators': [200], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [4, 5, 6, 7, 8], 'criterion': ['gini', 'entropy']}
new_tune = GridSearchCV(estimator=random_forest, param_grid=param_grid, cv=5)
new_tune.fit(X_train, y_train)
# Nested 2-fold cross-validation: the whole grid search is rerun inside each outer fold
c_score3 = cross_val_score(new_tune, X_train, y_train, cv=2)
print('Accuracy of classifier: ', c_score3.mean())

print('\n Best value for each of the tested parameters:', new_tune.best_params_)
print('\n and accuracy of the model with these best values is', new_tune.best_score_)

Accuracy of classifier:  0.9983848614708346

 Best value for each of the tested parameters: {'criterion': 'entropy', 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 200}

 and accuracy of the model with these best values is 0.9990062111801242
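
As a smaller, self-contained sketch of the same tuning pattern, the example below runs a grid search on synthetic data from `make_classification` (an assumption standing in for the taxi features) with a reduced grid. It also shows that, with the default `refit=True`, the fitted search already holds the best model in `best_estimator_`, so the parameters do not have to be retyped by hand. The grid leaves out `'auto'`, which recent scikit-learn releases no longer accept for `max_features`; for classifiers it behaved like `'sqrt'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the taxi features (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Reduced grid so the sketch runs quickly
grid = {
    "max_depth": [4, 8],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42), param_grid=grid, cv=5
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best parameters:", search.best_score_)

# refit=True (the default) means best_estimator_ is already fit on all of X_tr
demo_test_accuracy = search.best_estimator_.score(X_te, y_te)
print("test accuracy:", demo_test_accuracy)
```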


Once we have the best parameters, we pass them to the classifier and fit it to our training data. We then measure its accuracy on our testing data.

In [73]:
# Refit with the best parameters from the grid search
# Note: n_estimators is 20 here, while the grid search used 200
rfc1 = RandomForestClassifier(random_state=42, criterion='entropy', max_depth=8, max_features='auto', n_estimators=20)
rfc1.fit(X_train, y_train)
pred = rfc1.predict(X_test)
print("Accuracy for Random Forest on test data: ", accuracy_score(y_test, pred))

Accuracy for Random Forest on test data:  0.9963044052110992
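
The feature matrix X above still contains fare_amount, the column the labels were binned from, so one sanity check worth considering is to refit without it and compare. The sketch below illustrates the idea on a synthetic toy dataset; the fare formula, the three-way binning, and the helper name `holdout_accuracy` are all assumptions made for illustration, not taken from the taxi data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 2000

# Toy data (assumption): fare grows with distance, plus noise standing in for traffic
demo = pd.DataFrame({"trip_distance": rng.uniform(0.5, 10.0, n_rows)})
demo["fare_amount"] = 2.5 + 2.5 * demo["trip_distance"] + rng.normal(0.0, 3.0, n_rows)

# Hypothetical three-way binning of the fare
demo["label"] = pd.cut(
    demo["fare_amount"], bins=[-np.inf, 10, 20, np.inf], labels=["A", "B", "C"]
).astype(str)


def holdout_accuracy(feature_columns):
    """Fit a small forest on the given columns and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_columns], demo["label"], test_size=0.25, random_state=42
    )
    model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_with_fare = holdout_accuracy(["trip_distance", "fare_amount"])
acc_without_fare = holdout_accuracy(["trip_distance"])
print("with fare_amount:", acc_with_fare)
print("without fare_amount:", acc_without_fare)
```

On the real data, the equivalent check would drop fare_amount from X along with label before splitting.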


Finally, we print the confusion matrix and classification report for analysis.

In [74]:
# Print confusion matrix 
print('\n Confusion Matrix \n',confusion_matrix(y_test,pred))

# Print classification report
print('\n Classification report \n',classification_report(y_test,pred))


 Confusion Matrix 
 [[ 4703     4     1     0     0     0     0     0     0     0]
 [    0 15256     3     0     0     0     0     0     0     0]
 [    0     0 13398     1     0     0     0     0     0     0]
 [    0     0     1  9139     4     0     0     0     0     1]
 [    0     0     0     3  5819     1     0     0     0     0]
 [    0     0     0     0     2  3707    43     0     0     0]
 [    0     0     1     0     2    21  2500     4     0     0]
 [    0     0     0     0     9     4   110  1665    10     0]
 [    0     0     0     0     1     0     1     2  1279     5]
 [    0     0     1     1     0     1     1     0     0  6697]]

 Classification report 
               precision    recall  f1-score   support

           A       1.00      1.00      1.00      4708
           B       1.00      1.00      1.00     15259
           C       1.00      1.00      1.00     13399
           D       1.00      1.00      1.00      9145
           E       1.00      1.00      1.00      58

We see that classes A to E have negligible false positives and false negatives. Class G has the highest number of false positives, while class H has the highest number of false negatives. This is reflected in the precision and recall scores: G has the lowest precision and H has the lowest recall. However, even these errors are borderline, because they fall almost entirely between neighboring classes: most of G's false positives are true F or H fares (43 and 110 rides), and most of H's false negatives were predicted as G. The f1-scores are high for all classes, so this is a good model for classifying fares given these variables.
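
To make the per-class numbers explicit, the sketch below recomputes precision, recall, and accuracy directly from the Random Forest confusion matrix printed above (rows are true classes, columns are predicted classes, as in scikit-learn). The variable names are our own. It gives a precision of about 0.94 for G and a recall of about 0.93 for H.

```python
import numpy as np

class_names = list("ABCDEFGHIJ")

# Random Forest confusion matrix copied from the output above
cm = np.array([
    [4703,     4,     1,    0,    0,    0,    0,    0,    0,    0],
    [   0, 15256,     3,    0,    0,    0,    0,    0,    0,    0],
    [   0,     0, 13398,    1,    0,    0,    0,    0,    0,    0],
    [   0,     0,     1, 9139,    4,    0,    0,    0,    0,    1],
    [   0,     0,     0,    3, 5819,    1,    0,    0,    0,    0],
    [   0,     0,     0,    0,    2, 3707,   43,    0,    0,    0],
    [   0,     0,     1,    0,    2,   21, 2500,    4,    0,    0],
    [   0,     0,     0,    0,    9,    4,  110, 1665,   10,    0],
    [   0,     0,     0,    0,    1,    0,    1,    2, 1279,    5],
    [   0,     0,     1,    1,    0,    1,    1,    0,    0, 6697],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums: everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # row sums: everything truly in the class
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}")
```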

## Decision Trees

We begin by creating a Decision Tree classifier from sklearn.tree. We then run GridSearchCV with 5-fold cross-validation to find the best hyperparameters, which helps us avoid overfitting the training data.

In [68]:
# Create a decision tree classifier
decision_tree = DecisionTreeClassifier(random_state=0)

# Define the parameter grid and run the 5-fold grid search
new_grid = {'max_depth': [5, 10, 15, 20], 'min_samples_leaf': [5, 10, 15, 20], 'max_features': [5, 10, 15]}
new_tune2 = GridSearchCV(estimator=decision_tree, param_grid=new_grid, cv=5)
new_tune2.fit(X_train, y_train)

# Run a 10-fold cross-validation loop on the untuned tree and print its accuracy
c_score = cross_val_score(decision_tree, X_train, y_train, cv=10)
print('Accuracy of classifier: ', c_score.mean())

print('\n Best value for each of the tested parameters:', new_tune2.best_params_)
print('\n and accuracy of the model with these best values is', new_tune2.best_score_)

Accuracy of classifier:  0.9999377722464219

 Best value for each of the tested parameters: {'max_depth': 10, 'max_features': 15, 'min_samples_leaf': 5}

 and accuracy of the model with these best values is 0.9999378881987577


Once we have the best parameters, we pass them to the classifier and fit it to our training data. We then measure its accuracy on our testing data.

In [66]:
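# Refit with the best parameters from the grid search
# Note: this cell instantiates RandomForestClassifier, so the results below come from
# a forest built with the tuned tree parameters, not from a single DecisionTreeClassifier
dt = RandomForestClassifier(random_state=42, min_samples_leaf=5, max_depth=10, max_features=15)
dt.fit(X_train, y_train)
pred1 = dt.predict(X_test)
print("Accuracy for Decision Trees on test data: ", accuracy_score(y_test, pred1))

Accuracy for Decision Trees on test data:  0.9999534168724088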
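
The cell above is labeled Decision Trees but constructs a `RandomForestClassifier`. For comparison, here is a minimal sketch of fitting a single `DecisionTreeClassifier` with the tuned parameters. It runs on synthetic data from `make_classification` with 18 features (an assumption standing in for the taxi features), so its accuracy says nothing about the taxi data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 18 features so that max_features=15 is valid
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=18, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# A single tree with the parameters selected by the grid search above
tree = DecisionTreeClassifier(
    random_state=0, max_depth=10, max_features=15, min_samples_leaf=5
)
tree.fit(X_tr, y_tr)
tree_pred = tree.predict(X_te)

print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
print("test accuracy:", accuracy_score(y_te, tree_pred))
```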


Finally, we print the confusion matrix and classification report for analysis.

In [67]:
# Print confusion matrix 
print('\n Confusion Matrix \n',confusion_matrix(y_test,pred1))

# Print classification report
print('\n Classification report \n',classification_report(y_test,pred1))


 Confusion Matrix 
 [[ 4708     0     0     0     0     0     0     0     0     0]
 [    0 15259     0     0     0     0     0     0     0     0]
 [    0     0 13399     0     0     0     0     0     0     0]
 [    0     0     0  9145     0     0     0     0     0     0]
 [    0     0     0     0  5822     1     0     0     0     0]
 [    0     0     0     0     0  3751     1     0     0     0]
 [    0     0     0     0     0     0  2528     0     0     0]
 [    0     0     0     0     0     0     0  1798     0     0]
 [    0     0     0     0     0     0     0     0  1287     1]
 [    0     0     0     0     0     0     0     0     0  6701]]

 Classification report 
               precision    recall  f1-score   support

           A       1.00      1.00      1.00      4708
           B       1.00      1.00      1.00     15259
           C       1.00      1.00      1.00     13399
           D       1.00      1.00      1.00      9145
           E       1.00      1.00      1.00      58

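We see that only three test rides are misclassified: one E ride predicted as F, one F ride predicted as G, and one I ride predicted as J. Classes E, F, and I therefore have one false negative each, and classes F, G, and J have one false positive each. Every error lands in the neighboring class, so these cases are also very borderline. These numbers are so small that precision, recall, and f1-score all round to 1.00 across all classes, and the accuracy of this model is about 99.995%.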