# Part 2. Choose the Best Model

In Part 1, we selected the most suitable features. Here I use those features to train different models and find the best one.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('cash or e-zpass train.csv', low_memory=False)

labelencoder = LabelEncoder()
df = df.drop('Date', axis=1)  # drop() returns a new DataFrame, so reassign to keep the change
df['Vehicle Class'] = labelencoder.fit_transform(df['Vehicle Class'])
df['Entrance'] = labelencoder.fit_transform(df['Entrance'])
df['Exit'] = labelencoder.fit_transform(df['Exit'])
df['Payment Type (Cash or E-ZPass)'] = labelencoder.fit_transform(df['Payment Type (Cash or E-ZPass)'])

X = df[['Interval Beginning Time', 'Vehicle Class', 'Entrance', 'Exit', 'Vehicle Count']]
Y = df['Payment Type (Cash or E-ZPass)']
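The encoding cell above reuses a single `LabelEncoder` for every column, so after the last `fit_transform` only the mapping for the payment column is still available. As a minimal sketch, assuming it is useful to inspect or reverse the codes later, one encoder per column can be kept in a dictionary. The toy DataFrame and its values below are made up for illustration and are not taken from the toll dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the toll data; the values are hypothetical.
toy = pd.DataFrame({
    'Vehicle Class': ['2L', '3H', '2L', '4H'],
    'Entrance': ['A', 'B', 'A', 'C'],
})

# One encoder per column, so each mapping can be inspected or reversed later.
encoders = {}
for col in ['Vehicle Class', 'Entrance']:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integer codes following the sorted order of the labels.
print(toy)
print(encoders['Vehicle Class'].classes_)

# The stored encoder can map the integer codes back to the original labels.
decoded = encoders['Entrance'].inverse_transform(toy['Entrance'])
print(decoded)
```

Note that these integer codes impose an arbitrary order on the categories. Tree-based models can usually work with that, while a linear model such as logistic regression treats the codes as numeric magnitudes.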

# Sampling Data

Even though this dataset contains more than 6 million rows, training a decision tree model is quite fast (less than 1 minute). However, ensemble algorithms such as random forest and gradient boosting take hours to finish training. To speed up the comparison, it is better to reduce the number of rows. Here I take only 8% of the 6 million rows, and even then the ensemble algorithms need about 5 minutes to complete the training process.


In [2]:
sample_df = df.sample(frac=0.08)
sample_X = sample_df[['Interval Beginning Time', 'Vehicle Class', 'Entrance', 'Exit', 'Vehicle Count']]
sample_Y = sample_df['Payment Type (Cash or E-ZPass)']
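The sampling cell draws a different 8% each time it runs, so the accuracies reported later will vary slightly between runs. A minimal sketch of a reproducible draw follows, assuming a fixed `random_state` is acceptable. The toy DataFrame, the shortened `Payment` column name, and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in with a 30/70 class split; the real data has about 6 million rows.
toy = pd.DataFrame({
    'Vehicle Count': range(1000),
    'Payment': ['Cash'] * 300 + ['E-ZPass'] * 700,
})

# A fixed random_state makes the 8% sample identical on every run.
sample = toy.sample(frac=0.08, random_state=42)
same_sample = toy.sample(frac=0.08, random_state=42)

print(len(sample))  # 8% of 1000 rows
# Compare class proportions to check that the sample resembles the full data.
print(toy['Payment'].value_counts(normalize=True))
print(sample['Payment'].value_counts(normalize=True))
```

Comparing the two `value_counts` outputs is a quick way to see whether the sampled class balance is close to that of the full data.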

# Classification Algorithms

Again, this is a classification problem, so I picked four appropriate algorithms: Decision Tree, Random Forest, Gradient Boosting and Logistic Regression. I will compare these four with default parameters and choose the best model in terms of speed and accuracy.

## Candidate 1: Decision Tree

In Part 1, we saw that a model trained with a decision tree reaches about 73% accuracy. For now, let's set that aside and see how it performs on the sampled data.

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train_x, test_x, train_y, test_y = train_test_split(sample_X, sample_Y, test_size=0.2)

clf = DecisionTreeClassifier()
clf.fit(train_x, train_y)
hyp = clf.predict(test_x)
accuracy_score(test_y, hyp)

0.6852437198469472

## Candidate 2: Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rfc_clf = RandomForestClassifier()
rfc_clf.fit(train_x, train_y)
hyp = rfc_clf.predict(test_x)
accuracy_score(test_y, hyp)

0.7134108301447346

## Candidate 3: Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(train_x, train_y)
hyp = lr_clf.predict(test_x)
accuracy_score(test_y, hyp)

0.6825923307270005

## Candidate 4: Gradient Boosting

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(train_x, train_y)
hyp = gbc_clf.predict(test_x)
accuracy_score(test_y, hyp)


0.6894443520212943
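Since the comparison is about both speed and accuracy, the four candidates can also be evaluated in one loop that records the fit time next to the accuracy. The following is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the toll data, and the `random_state` values are added only to make the sketch repeatable. In the notebook, `sample_X` and `sample_Y` would take the place of `X_demo` and `y_demo`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); not the toll dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
train_x, test_x, train_y, test_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The same four candidates as above, all with default hyperparameters.
candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(train_x, train_y)
    fit_seconds = time.perf_counter() - start  # training time only
    accuracy = accuracy_score(test_y, model.predict(test_x))
    results[name] = (accuracy, fit_seconds)
    print(f'{name:20s} accuracy={accuracy:.4f} fit_time={fit_seconds:.2f}s')
```

Because every model is trained and scored on the same split, the accuracies are directly comparable, and the recorded fit times give a rough idea of how training cost might scale to the full dataset.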

# Model Decision

As the cells above show, the random forest gets the best result on the sampled data. However, I am not going to train the final model with random forest: I tried training one and waited for more than 2 hours, but the training never finished. Therefore, I will choose between Logistic Regression and Decision Tree, because both train quickly and got similar results.

# Model Tuning 

To determine which model to use for the final prediction, I apply grid search to find the best parameters for both LR and DT. Because both algorithms train quickly, I expand the sample to the full dataset (`X`, `Y`) to get a more reliable comparison.


In [7]:
# Note: this process takes about one hour

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'criterion': ['gini', 'entropy'],
              'max_depth': [16, 17, 18, 19, 20, 21, 22]}

# 5-fold cross-validated search over every criterion / max_depth combination
clf = GridSearchCV(DecisionTreeClassifier(), parameters, cv=5)
clf.fit(X, Y)

print('Decision tree best parameters:', clf.best_params_)
print('Decision tree best score:', clf.best_score_)

Decision tree best parameters: {'criterion': 'gini', 'max_depth': 21}
Decision tree best score: 0.7485118452254621


In [8]:
from sklearn.linear_model import LogisticRegression
grid_values = {'penalty': ['l1','l2'], 'C': [1e-10, 1e-09, 1e-08, 1e-07,1]}
clf = GridSearchCV(LogisticRegression(solver='liblinear'), grid_values, cv=5)
clf.fit(X, Y)

print('Logistic Regression best_parameters:', clf.best_params_)
print('Logistic Regression score:', clf.best_score_)

Logistic Regression best_parameters: {'C': 1e-10, 'penalty': 'l2'}
Logistic Regression score: 0.6856422602220273
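`best_params_` and `best_score_` report only the winning combination. To see how close the other combinations were, the full results are available in `cv_results_`. The sketch below runs on synthetic data with a deliberately small grid so that it finishes in seconds. The data, the grid values, and the `random_state` are assumptions for illustration, not the settings used in the cells above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); not the toll dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# A small illustrative grid: 2 criteria x 3 depths = 6 combinations.
param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and read.
results = pd.DataFrame(search.cv_results_)
columns = ['param_criterion', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
table = results[columns].sort_values('rank_test_score')
print(table)
```

If the best value of a parameter sits at the edge of its grid, that can be a hint that the grid is worth extending in that direction.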


# Final Decision

From the results above, the accuracy of LR stays at around 68%. The decision tree with the gini criterion and a depth of 21, on the other hand, has the best accuracy (about 74.9%). Therefore, I will use this set of hyperparameters in the final prediction.

To confirm the result, let's train two models: the first is a default decision tree without tuned hyperparameters, and the second is the tuned decision tree.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)

default_clf = DecisionTreeClassifier()
default_clf.fit(train_x, train_y)
hyp = default_clf.predict(test_x)
accuracy_score(test_y, hyp)

0.7341294333526869

In [11]:
tuned_clf = DecisionTreeClassifier(criterion='gini', max_depth=21)
tuned_clf.fit(train_x, train_y)
hyp = tuned_clf.predict(test_x)
accuracy_score(test_y, hyp)

0.7485347622763048
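One plausible reading of the gap between the two trees, which the notebook itself does not verify, is that the default tree grows until it fits the training data very closely, while `max_depth` caps that growth. The sketch below shows how one might check this by comparing depth, leaf count, and train/test accuracy. It uses synthetic data with label noise (`flip_y`) and a cap of 5, both chosen as assumptions for the illustration rather than taken from the toll dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption); not the toll dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unrestricted tree versus one with a capped depth.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
capped_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

for name, tree in [('default', default_tree), ('capped', capped_tree)]:
    print(
        f'{name:8s} depth={tree.get_depth():3d} leaves={tree.get_n_leaves():5d} '
        f'train_acc={tree.score(X_tr, y_tr):.4f} test_acc={tree.score(X_te, y_te):.4f}'
    )
```

Running the same comparison on `default_clf` and `tuned_clf` from the cells above would show whether the default tree is much deeper than 21 levels and whether its training accuracy is far above its test accuracy.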