# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas DataFrame.
2. Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities).
3. Separate features from target into X and y.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

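# sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20.
# On newer versions, StratifiedKFold lives in sklearn.model_selection and takes
# n_splits (without y) instead of (y, n_folds).
from sklearn.cross_validation import StratifiedKFold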
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

In [4]:
df = pd.read_csv('../../assets/datasets/car.csv')
print(df.shape)
df.head()

(1728, 7)


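|   | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|--------|-------|-------|---------|----------|--------|---------------|
| 0 | vhigh  | vhigh | 2     | 2       | small    | low    | unacc         |
| 1 | vhigh  | vhigh | 2     | 2       | small    | med    | unacc         |
| 2 | vhigh  | vhigh | 2     | 2       | small    | high   | unacc         |
| 3 | vhigh  | vhigh | 2     | 2       | med      | low    | unacc         |
| 4 | vhigh  | vhigh | 2     | 2       | med      | med    | unacc         |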


In [145]:
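for col in df.columns.values:
    # list.append() mutates in place and returns None, so call it on its own line
    col_list = list(df[col].unique())
    col_list.append(col)
    print(col_list)

['vhigh', 'high', 'med', 'low', 'buying']
['vhigh', 'high', 'med', 'low', 'maint']
['2', '3', '4', '5more', 'doors']
['2', '4', 'more', 'persons']
['small', 'med', 'big', 'lug_boot']
['low', 'med', 'high', 'safety']
['unacc', 'acc', 'vgood', 'good', 'acceptability']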


In [142]:
make_lists(df)

In [58]:
p = list(df['buying'].unique())
p.index('low')

3

In [59]:
p[0]

'vhigh'

In [147]:
a = [1,2,3,4]
b = ['a', 'b', 'c', 'd']

q = pd.DataFrame(b, index=a)
q['hi'] = [2,4,6,8]
q

|   | 0 | hi |
|---|---|----|
| 1 | a | 2  |
| 2 | b | 4  |
| 3 | c | 6  |
| 4 | d | 8  |


In [154]:
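m = {}
# Loop over the values directly; indexing with n + 1 ran one past the end of the list.
# The constant key 'm' is overwritten on each pass, so only the last value is kept.
for value in q['hi'].unique():
    m.update({'m': value})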

In [153]:
m

{}

In [138]:
m = {}
n = 0
for i in df.columns.values:
    i_list = list(df[i].unique())
    # Guard against columns with fewer than n + 1 unique values ('persons' has only 3)
    if n < 4 and n < len(i_list):
        m.update({i: i_list[n]})
    n += 1

In [None]:
m

In [125]:
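q = {}
for name in df.columns.values:
    # list.append() returns None, so store the list of unique values directly
    q[name] = list(df[name].unique())

In [126]:
q

{'acceptability': ['unacc', 'acc', 'vgood', 'good'],
 'buying': ['vhigh', 'high', 'med', 'low'],
 'doors': ['2', '3', '4', '5more'],
 'lug_boot': ['small', 'med', 'big'],
 'maint': ['vhigh', 'high', 'med', 'low'],
 'persons': ['2', '4', 'more'],
 'safety': ['low', 'med', 'high']}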

In [93]:
for i in df.columns:
    print("{} {}".format(i, df[i].unique()))

buying ['vhigh' 'high' 'med' 'low']
maint ['vhigh' 'high' 'med' 'low']
doors ['2' '3' '4' '5more']
persons ['2' '4' 'more']
lug_boot ['small' 'med' 'big']
safety ['low' 'med' 'high']
acceptability ['unacc' 'acc' 'vgood' 'good']


In [None]:
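# Ordinal map for 'buying': smaller numbers for lower prices
buying_map = {}
buying_map.update({'low': 0, 'med': 1, 'high': 2, 'vhigh': 3})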
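The scale-preserving encoding asked for at the start of this section could be sketched as follows. This is only one possible approach: the small `sample` frame reproduces the five rows shown by `df.head()` as a stand-in for the full file, and the integer codes in `ordinal_maps` (for example `'5more' -> 5` and `'more' -> 5`) are assumptions chosen for illustration, not values prescribed by the lab.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows shown by df.head() above
sample = pd.DataFrame({
    'buying': ['vhigh'] * 5,
    'maint': ['vhigh'] * 5,
    'doors': ['2'] * 5,
    'persons': ['2'] * 5,
    'lug_boot': ['small', 'small', 'small', 'med', 'med'],
    'safety': ['low', 'med', 'high', 'low', 'med'],
    'acceptability': ['unacc'] * 5,
})

# Hypothetical ordinal codes: smaller numbers for smaller quantities
price_map = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
ordinal_maps = {
    'buying': price_map,
    'maint': price_map,
    'doors': {'2': 2, '3': 3, '4': 4, '5more': 5},
    'persons': {'2': 2, '4': 4, 'more': 5},
    'lug_boot': {'small': 0, 'med': 1, 'big': 2},
    'safety': {'low': 0, 'med': 1, 'high': 2},
    'acceptability': {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3},
}

# Apply each map to its column; unmapped values would become NaN
encoded = sample.copy()
for col, mapping in ordinal_maps.items():
    encoded[col] = encoded[col].map(mapping)

# Separate features from target
X_sample = encoded.drop('acceptability', axis=1)
y_sample = encoded['acceptability']
print(X_sample)
```

Checking `encoded.isnull().sum()` afterwards is a cheap way to catch a category that is missing from one of the maps.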

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Split X and y into a train set and a test set, using a 30% test set and `random_state=42`
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
cv = StratifiedKFold(y, n_folds=4, shuffle=True, random_state=21)

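def evaluate_model(model):
    # Fit on the training split, then predict on the held-out test split
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Mean cross-validated accuracy over the folds defined in `cv`
    score = cross_val_score(model, X, y, cv=cv, n_jobs=-1).mean()
    accuracy = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred)
    return score, accuracy, cm, cr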

In [45]:
X.shape

(1728, 21)

In [43]:
features = [c for c in df.columns.values if c != 'acceptability']

X = pd.get_dummies(df.drop('acceptability', axis=1))
y = df['acceptability']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
cv = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=21)

kn = KNeighborsClassifier()
kn = kn.fit(X_train, y_train)
y_pred = kn.predict(X_test)
print(cross_val_score(kn, X, y, cv=cv, n_jobs=-1).mean())
print(accuracy_score(y_test, y_pred))
cm = pd.DataFrame(confusion_matrix(y_test, y_pred))
cm



0.836221266299
0.897880539499


|   | 0  | 1 | 2   | 3  |
|---|----|---|-----|----|
| 0 | 97 | 0 | 18  | 0  |
| 1 | 11 | 8 | 2   | 0  |
| 2 | 12 | 0 | 351 | 0  |
| 3 | 7  | 1 | 2   | 10 |
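The three preparation steps in this section could be combined roughly as below. This is a sketch under stated assumptions: it uses the `sklearn.model_selection` API of scikit-learn 0.20 or later, `make_classification` data stands in for the encoded car features, and the `(name, model)` signature and the `all_models` dictionary are illustrative choices rather than a required design.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)

# Shuffled, stratified 70/30 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True, stratify=y_demo)

all_models = {}  # global registry: name -> fitted model and its metrics


def evaluate_model(name, model):
    """Fit on the train split, score on the test split, and store the results."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    all_models[name] = {
        'model': model,
        'accuracy': accuracy_score(y_te, y_pred),
        'confusion_matrix': confusion_matrix(y_te, y_pred),
        'report': classification_report(y_te, y_pred),
    }
    return all_models[name]


result = evaluate_model('knn', KNeighborsClassifier())
print(result['accuracy'])
print(result['confusion_matrix'])
print(result['report'])
```

Keeping the fitted model next to its metrics makes it easy to pull every entry back out when the scores are compared at the end of the lab.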


## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model.
2. Evaluate its performance with the function you previously defined.
3. Find the optimal value of K using grid search.
    - Be careful about how you perform the cross-validation in the grid search.
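One way to run the search for K is sketched below, assuming scikit-learn 0.20 or later and synthetic stand-in data. The point about cross-validation is handled here by searching on the training split only with a shuffled `StratifiedKFold`, so the test split never influences the choice of K; the range of K values is an arbitrary example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Stratified, shuffled folds built from the training split only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
knn_grid = GridSearchCV(KNeighborsClassifier(),
                        param_grid={'n_neighbors': list(range(1, 16, 2))},
                        cv=cv, scoring='accuracy')
knn_grid.fit(X_tr, y_tr)

print(knn_grid.best_params_)
print(knn_grid.best_score_)          # mean cross-validated accuracy of the best K
print(knn_grid.score(X_te, y_te))    # accuracy of the refit best model on the test split
```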

## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier.
2. Evaluate performance.
3. Do a grid search only on the bagging classifier params.
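A possible shape for the bagging step, again on synthetic stand-in data and assuming scikit-learn 0.20 or later. The value `n_neighbors=5` is a placeholder for whatever K the earlier search returned, and the grid values are arbitrary examples. The grid names only parameters of `BaggingClassifier` itself, so K stays fixed during the search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)

# The base estimator is passed positionally; K = 5 is a placeholder value
bagged_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), random_state=0)

# Only the bagging parameters are searched
bag_params = {
    'n_estimators': [10, 20],
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)
bag_grid = GridSearchCV(bagged_knn, bag_params, cv=cv)
bag_grid.fit(X_demo, y_demo)

print(bag_grid.best_params_)
print(bag_grid.best_score_)
```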

## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test it on the train/test split.
2. Find optimal params with grid search.
3. See if bagging improves the score.
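For logistic regression, the search and the bagging comparison might look like this. It is a sketch on synthetic stand-in data, assuming scikit-learn 0.20 or later; the grid over `C` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

# Search over the inverse regularization strength C
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                       {'C': [0.01, 0.1, 1, 10]}, cv=cv)
lr_grid.fit(X_demo, y_demo)

# Bag the best estimator and compare mean cross-validated accuracy on the same folds
bagged_lr = BaggingClassifier(lr_grid.best_estimator_, n_estimators=10, random_state=0)
bagged_score = cross_val_score(bagged_lr, X_demo, y_demo, cv=cv).mean()

print(lr_grid.best_params_)
print(lr_grid.best_score_)
print(bagged_score)
```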

## 5. Decision Trees

Let's see if decision trees perform better.

1. Initialize DT and test it on the train/test split.
2. Find optimal params with grid search.
3. See if bagging improves the score.
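A decision tree search could follow the same pattern, with tree-specific parameters in the grid. The data is a synthetic stand-in and the candidate values for `max_depth` and `min_samples_leaf` are assumptions picked for illustration; scikit-learn 0.20 or later is assumed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

# Depth and leaf size both limit how closely the tree can fit the training folds
dt_params = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]}
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=0), dt_params, cv=cv)
dt_grid.fit(X_demo, y_demo)

print(dt_grid.best_params_)
print(dt_grid.best_score_)
```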

## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test it on the train/test split.
2. Find optimal params with grid search.
3. See if bagging improves the score.
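For the SVM, a list of parameter dictionaries keeps kernel-specific settings apart, so `gamma` is only searched for the RBF kernel. This is a sketch on synthetic stand-in data with arbitrary candidate values, assuming scikit-learn 0.20 or later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

# One sub-grid per kernel: gamma has no effect on the linear kernel
svm_params = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]},
]
svm_grid = GridSearchCV(SVC(), svm_params, cv=cv)
svm_grid.fit(X_demo, y_demo)

print(svm_grid.best_params_)
print(svm_grid.best_score_)
```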

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test them on the train/test split.
2. Find optimal params with grid search.
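Because the two ensembles share most of their parameters, one grid can be reused for both. The sketch below assumes scikit-learn 0.20 or later, synthetic stand-in data, and illustrative grid values; the `searches` dictionary is a hypothetical container for the fitted searches.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

# The same grid is applied to both ensemble types
tree_params = {'n_estimators': [50, 100], 'max_depth': [None, 5]}

searches = {}
for name, Model in [('random_forest', RandomForestClassifier),
                    ('extra_trees', ExtraTreesClassifier)]:
    grid = GridSearchCV(Model(random_state=0), tree_params, cv=cv)
    grid.fit(X_demo, y_demo)
    searches[name] = grid
    print('{}: {:.3f} with {}'.format(name, grid.best_score_, grid.best_params_))
```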

## 8. Model comparison

Let's compare the scores of the various models.

1. Do a bar chart of the scores of the best models. Which model wins on the train/test split?
2. Re-test all the models using a 3-fold stratified shuffled cross-validation.
3. Do a bar chart with error bars of the average cross-validation scores. Is the winner the same?
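The cross-validated comparison with error bars could be drawn along these lines. The three candidate models and the synthetic data are placeholders for the tuned models from the earlier sections, and scikit-learn 0.20 or later is assumed. Each error bar is the standard deviation of the fold scores.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)

# Placeholder candidates; in the lab these would be the best models found so far
candidates = [
    ('KNN', KNeighborsClassifier()),
    ('LR', LogisticRegression(max_iter=1000)),
    ('RF', RandomForestClassifier(n_estimators=50, random_state=0)),
]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)
names, means, stds = [], [], []
for name, model in candidates:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    names.append(name)
    means.append(scores.mean())
    stds.append(scores.std())

# Bar heights are mean accuracies; error bars show one standard deviation across folds
fig, ax = plt.subplots()
positions = np.arange(len(names))
ax.bar(positions, means, yerr=stds, capsize=4)
ax.set_xticks(positions)
ax.set_xticklabels(names)
ax.set_ylabel('Mean CV accuracy')
ax.set_title('3-fold stratified cross-validation')
```

When the error bars of two models overlap, the ranking from a single train/test split is not strong evidence that one of them is better.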


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had used `pd.get_dummies` or `OneHotEncoder` to encode the categorical data as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
2. Experiment with other models or other parameters. Can you beat your classmates' best score?
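A side-by-side comparison of the two encodings might be set up as below. The category levels are the ones printed earlier in this notebook, but the target `toy_y` is a HYPOTHETICAL rule invented for this sketch and is not the real acceptability label, so the printed scores say nothing about the actual dataset. The code assumes scikit-learn 0.20 or later.

```python
import itertools
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Category levels from the notebook output, listed from smallest to largest
levels = {
    'buying': ['low', 'med', 'high', 'vhigh'],
    'maint': ['low', 'med', 'high', 'vhigh'],
    'doors': ['2', '3', '4', '5more'],
    'persons': ['2', '4', 'more'],
    'lug_boot': ['small', 'med', 'big'],
    'safety': ['low', 'med', 'high'],
}
cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

# Every combination of levels: 4 * 4 * 4 * 3 * 3 * 3 = 1728 rows
grid_df = pd.DataFrame(list(itertools.product(*[levels[c] for c in cols])), columns=cols)

# HYPOTHETICAL target rule, used only so the example has something to predict
toy_y = ((grid_df['safety'] != 'low') & (grid_df['persons'] != '2')).astype(int)

# Ordinal encoding: position of each value in its ordered level list
X_ordinal = grid_df.apply(lambda s: s.map({v: i for i, v in enumerate(levels[s.name])}))
# One-hot encoding: one binary column per (feature, level) pair
X_onehot = pd.get_dummies(grid_df).astype(int)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)
enc_scores = {}
for label, X_enc in [('ordinal', X_ordinal), ('one-hot', X_onehot)]:
    enc_scores[label] = cross_val_score(KNeighborsClassifier(), X_enc, toy_y, cv=cv)
    print('{}: {:.3f} +/- {:.3f}'.format(
        label, enc_scores[label].mean(), enc_scores[label].std()))
```

Swapping `grid_df` and `toy_y` for the real features and labels turns this into the comparison the bonus question asks for.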