# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
- Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
- Separate features from target into X and y

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
df = pd.read_csv('../../assets/datasets/car.csv')

In [5]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
df1 = pd.get_dummies(df[['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']])

In [9]:
df.buying.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [13]:
map1 = {
    'vhigh': 3,
    'high' : 2,
    'med'  : 1,
    'low'  : 0
}

In [12]:
df.maint.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [14]:
df.doors.unique()

array(['2', '3', '4', '5more'], dtype=object)

In [20]:
map2 = {
    '5more' : 3,
    '4'     : 2,
    '3'     : 1,
    '2'     : 0
}

In [15]:
df.persons.unique()

array(['2', '4', 'more'], dtype=object)

In [19]:
map3 = {
    'more' : 2,
    '4'     : 1,
    '2'     : 0
}

In [18]:
df.lug_boot.unique()

array(['small', 'med', 'big'], dtype=object)

In [24]:
map4 = {
    'big' : 2,
    'med'     : 1,
    'small'     : 0
}

In [22]:
df.safety.unique()

array(['low', 'med', 'high'], dtype=object)

In [38]:
map5 = {
    'high' : 2,
    'med'     : 1,
    'low'     : 0
}

In [26]:
df.acceptability.unique()

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)

In [27]:
map6 = {
    'vgood' : 3,
    'good' : 2,
    'acc' : 1,
    'unacc' : 0
}

In [39]:
dfn = df.copy()

feat = [c for c in df.columns if c != 'acceptability']

dfn.acceptability = df.acceptability.map(map6)
dfn.safety = df.safety.map(map5)
dfn.lug_boot = df.lug_boot.map(map4)
dfn.persons = df.persons.map(map3)
dfn.doors = df.doors.map(map2)
dfn.buying = df.buying.map(map1)
dfn.maint = df.maint.map(map1)

In [40]:
X = dfn[feat]
y = dfn['acceptability']

In [41]:
dfn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying           1728 non-null int64
maint            1728 non-null int64
doors            1728 non-null int64
persons          1728 non-null int64
lug_boot         1728 non-null int64
safety           1728 non-null int64
acceptability    1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB


## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y between a train and test set, using 30% test set, random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model`, that trains the model on the train set, tests it on the test, calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval


In [57]:
from sklearn.model_selection import train_test_split
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3, stratify = y)

In [51]:
def evaluate_model(clf):
    clf.fit(X_train,y_train)
    pred = clf.predict(X_test)
    
    scores = accuracy_score(y_test, pred)
    conmat = confusion_matrix(y_test, pred)
    clf_report = classification_report(y_test, pred)
    
    print conmat
    print
    print clf_report
    print
    print "Accuracy scores =", scores
    
    return scores

In [48]:
clf_dict = {}

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
- Evaluate it's performance with the function you previously defined
- Find the optimal value of K using grid search
    - Be careful on how you perform the cross validation in the grid search

In [50]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

In [53]:
knn_score = evaluate_model(knn)

[[352   6   0   0]
 [ 21  94   2   1]
 [  2   5  11   1]
 [  0   5   4  15]]

             precision    recall  f1-score   support

          0       0.94      0.98      0.96       358
          1       0.85      0.80      0.82       118
          2       0.65      0.58      0.61        19
          3       0.88      0.62      0.73        24

avg / total       0.91      0.91      0.91       519


Accuracy scores = 0.909441233141


In [59]:
from sklearn.grid_search import GridSearchCV

params_dict = {
    "n_neighbors" : np.arange(1,30)
}

gsknn = GridSearchCV(KNeighborsClassifier(), params_dict, cv = KFold(len(y), n_folds= 5, shuffle=True))

In [60]:
gsknn.fit(X,y)

GridSearchCV(cv=sklearn.cross_validation.KFold(n=1728, n_folds=5, shuffle=True, random_state=None),
       error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [61]:
gsknn.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='uniform')

In [62]:
gsknn.best_params_

{'n_neighbors': 7}

In [63]:
gsknn.best_score_

0.9513888888888888

In [65]:
clf_dict['knn'] = knn_score
clf_dict['gsknn'] = evaluate_model(gsknn.best_estimator_)

[[355   3   0   0]
 [ 13 101   4   0]
 [  1   4  13   1]
 [  1   5   2  16]]

             precision    recall  f1-score   support

          0       0.96      0.99      0.98       358
          1       0.89      0.86      0.87       118
          2       0.68      0.68      0.68        19
          3       0.94      0.67      0.78        24

avg / total       0.93      0.93      0.93       519


Accuracy scores = 0.934489402697


## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
- Evaluate performance
- Do a grid search only on the bagging classifier params

In [66]:
from sklearn.ensemble import BaggingClassifier

In [67]:
knn_bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=7))

In [69]:
knn_bag_score = evaluate_model(knn_bag)

[[355   3   0   0]
 [ 15  99   3   1]
 [  0   2  16   1]
 [  0   4   3  17]]

             precision    recall  f1-score   support

          0       0.96      0.99      0.98       358
          1       0.92      0.84      0.88       118
          2       0.73      0.84      0.78        19
          3       0.89      0.71      0.79        24

avg / total       0.94      0.94      0.94       519


Accuracy scores = 0.938342967245


In [70]:
clf_dict['knn_bag'] = knn_bag_score

In [97]:
kbparams = {
    'n_estimators' : [10,20],
    'max_samples' : np.arange(0.7,1,0.1),
    'max_features' : np.arange(0.7,1,0.1),
    'bootstrap_features': [True, False]
}

knn_bag_grid = GridSearchCV(BaggingClassifier(KNeighborsClassifier()), kbparams, \
                           n_jobs=-1,
                            cv=KFold(len(y), n_folds=3, shuffle=True))

In [98]:
knn_bag_grid.fit(X,y)
#knn_bag_grid_score = evaluate_model(knn_bag_grid)

  **self._backend_args)
  **self._backend_args)
  **self._backend_args)
  **self._backend_args)


GridSearchCV(cv=sklearn.cross_validation.KFold(n=1728, n_folds=3, shuffle=True, random_state=None),
       error_score='raise',
       estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [10, 20], 'max_samples': array([ 0.7,  0.8,  0.9,  1. ]), 'bootstrap_features': [True, False], 'max_features': array([ 0.7,  0.8,  0.9,  1. ])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [100]:
knn_bag_grid_score = evaluate_model(knn_bag_grid.best_estimator_)

[[356   2   0   0]
 [ 34  82   1   1]
 [  3   2  14   0]
 [  2   4   3  15]]

             precision    recall  f1-score   support

          0       0.90      0.99      0.95       358
          1       0.91      0.69      0.79       118
          2       0.78      0.74      0.76        19
          3       0.94      0.62      0.75        24

avg / total       0.90      0.90      0.89       519


Accuracy scores = 0.899807321773


In [101]:
clf_dict['knn_bag_grid'] = knn_bag_grid_score

## 4. Logistic Regression

Let's see if logistic regression performs better

1. Initialize LR and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [102]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [104]:
clf_dict['LogReg'] = evaluate_model(logreg)


[[338  14   3   3]
 [ 57  61   0   0]
 [  2  15   2   0]
 [  0  21   0   3]]

             precision    recall  f1-score   support

          0       0.85      0.94      0.90       358
          1       0.55      0.52      0.53       118
          2       0.40      0.11      0.17        19
          3       0.50      0.12      0.20        24

avg / total       0.75      0.78      0.75       519


Accuracy scores = 0.778420038536


In [None]:
lr_params = {
    
}

gslr = GridSearchCV(LogisticRegression())

## 5. Decision Trees

Let's see if Decision Trees perform better

1. Initialize DT and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

## 6. Support Vector Machines

Let's see if SVM perform better

1. Initialize SVM and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better

1. Initialize RF and ET and test on Train/Test set
- Find optimal params with Grid Search

## 8. Model comparison

Let's compare the scores of the various models.

1. Do a bar chart of the scores of the best models. Who's the winner on the train/test split?
- Re-test all the models using a 3 fold stratified shuffled cross validation
- Do a bar chart with errorbars of the cross validation average scores. is the winner the same?


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had encoded the categorical data using `pd.get_dummies` or `OneHotEncoder`  to encode them as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
- Experiment with other models or other parameters, can you beat your classmates best score?