# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the Titanic dataset.


## 1. Prepare the data

The [Titanic dataset](https://www.kaggle.com/c/titanic/data) is one we've talked about a lot. Load the training data from the assets folder.

1. Load the data into a pandas dataframe
2. Encode the categorical features properly
3. Separate features from target into X and y

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

titanic = pd.read_csv('assets/train.csv')
# Drop every row that has any missing value (this includes rows with no Cabin)
titanic.dropna(inplace=True)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [14]:
# Encode Sex as a binary feature: 1 for female, 0 for male
titanic['Sex'] = [1 if person == 'female' else 0 for person in titanic.Sex]

y = titanic.Survived
X = titanic[['Pclass', 'Sex', 'Age', 'Fare']]

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Split X and y into a train set and a test set, using a 30% test set and random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates the evaluation measures below.
  - accuracy score
  - confusion matrix
  - classification report
3. Initialize a global dictionary to store the various models for later retrieval (your dictionary will likely contain the estimator and best score)

In [24]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

def evaluate_model(model):
    # Fit on the training split, then predict on the held-out test split
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    matrix = confusion_matrix(y_test, predictions)
    class_report = classification_report(y_test, predictions)

    print(matrix)
    print(class_report)

    print(score)

all_models = {}
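
The cell above creates `all_models` but does not yet put anything into it. One possible way to fill such a dictionary is sketched below. The synthetic `make_classification` data, the dictionary name `model_store`, and the helper name `evaluate_and_store` are assumptions made for illustration; they are not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): swap in the Titanic X and y from section 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

model_store = {}  # plays the role of all_models


def evaluate_and_store(name, model):
    """Hypothetical helper: fit, score on the test split, record both."""
    model.fit(X_tr, y_tr)
    score = accuracy_score(y_te, model.predict(X_te))
    model_store[name] = {'model': model, 'score': score}
    return score


knn_score = evaluate_and_store('knn', KNeighborsClassifier())
```

Keeping the fitted estimator next to its score makes the later comparison step a simple loop over the dictionary.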

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
2. Evaluate its performance with the function you previously defined
3. Find the optimal value of parameter K using grid search
    - Be careful about how you perform the cross-validation in the grid search

In [26]:
from sklearn.neighbors import KNeighborsClassifier

evaluate_model(KNeighborsClassifier())

[[ 3 15]
 [ 3 34]]
             precision    recall  f1-score   support

          0       0.50      0.17      0.25        18
          1       0.69      0.92      0.79        37

avg / total       0.63      0.67      0.61        55

0.672727272727


In [33]:
from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': range(2,60)}

gridsearch = GridSearchCV(KNeighborsClassifier(),
                     params, n_jobs=-1,
                     cv=5)
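# Note: fitting on the full X, y lets the test rows influence the choice of K;
# fit on X_train, y_train instead for a stricter comparison.
gridsearch.fit(X, y)
print(gridsearch.best_params_)
print(gridsearch.best_score_)
evaluate_model(gridsearch.best_estimator_)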

{'n_neighbors': 23}
0.68306010929
[[ 3 15]
 [ 5 32]]
             precision    recall  f1-score   support

          0       0.38      0.17      0.23        18
          1       0.68      0.86      0.76        37

avg / total       0.58      0.64      0.59        55

0.636363636364


## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
2. Evaluate performance
3. Do a grid search only on the bagging classifier parameters

In [28]:
from sklearn.ensemble import BaggingClassifier

In [29]:
evaluate_model(BaggingClassifier(KNeighborsClassifier()))

[[ 3 15]
 [ 5 32]]
             precision    recall  f1-score   support

          0       0.38      0.17      0.23        18
          1       0.68      0.86      0.76        37

avg / total       0.58      0.64      0.59        55

0.636363636364


In [35]:
bagging_params = {'n_estimators': [10, 20],
                  'max_samples': [0.7, 1.0],
                  'max_features': [0.7, 1.0],
                  'bootstrap_features': [True, False]}


gridsearch_bag = GridSearchCV(BaggingClassifier(KNeighborsClassifier()),
                            bagging_params, n_jobs=-1,
                            cv=5)

gridsearch_bag.fit(X, y)
print(gridsearch_bag.best_params_)
print(gridsearch_bag.best_score_)
evaluate_model(gridsearch_bag.best_estimator_)

{'max_features': 0.7, 'max_samples': 1.0, 'n_estimators': 20, 'bootstrap_features': True}
0.737704918033
[[ 8 10]
 [ 4 33]]
             precision    recall  f1-score   support

          0       0.67      0.44      0.53        18
          1       0.77      0.89      0.82        37

avg / total       0.73      0.75      0.73        55

0.745454545455


## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
2. Find optimal parameters with Grid Search
3. See if Bagging improves the score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
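
The three steps for logistic regression could look roughly like the sketch below. It runs on synthetic stand-in data from `make_classification` (an assumption, so the scores mean nothing for Titanic), and the grid over `C` is an illustrative choice rather than the lab's intended answer. The grid search is fitted on the training split only, so the test rows play no part in choosing the parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): swap in the Titanic X and y from section 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Search the regularisation strength using the training split only.
lr_params = {'C': [0.01, 0.1, 1.0, 10.0]}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), lr_params, cv=5)
lr_search.fit(X_tr, y_tr)
lr_test_score = lr_search.score(X_te, y_te)

# Wrap the tuned model in a bagging ensemble and score it the same way.
best_c = lr_search.best_params_['C']
bagged_lr = BaggingClassifier(LogisticRegression(max_iter=1000, C=best_c),
                              n_estimators=10, random_state=42)
bagged_lr.fit(X_tr, y_tr)
bagged_lr_test_score = bagged_lr.score(X_te, y_te)
```

Comparing `lr_test_score` with `bagged_lr_test_score` answers the last step of the list for whatever data is plugged in.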



## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
2. Find optimal parameters with Grid Search
3. See if Bagging improves the score
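
A possible shape for the decision tree steps is sketched below, again on synthetic stand-in data (an assumption). The parameter grid over `max_depth` and `min_samples_leaf` is only an example of what might be searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): swap in the Titanic X and y from section 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Illustrative grid: limit tree depth and leaf size to control overfitting.
dt_params = {'max_depth': [2, 3, 5, None], 'min_samples_leaf': [1, 5, 10]}
dt_search = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5)
dt_search.fit(X_tr, y_tr)
dt_test_score = dt_search.score(X_te, y_te)

# Bag trees built with the best parameters found above.
bagged_dt = BaggingClassifier(
    DecisionTreeClassifier(random_state=42, **dt_search.best_params_),
    n_estimators=20, random_state=42)
bagged_dt.fit(X_tr, y_tr)
bagged_dt_test_score = bagged_dt.score(X_te, y_te)
```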

## 6. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
2. Find optimal parameters with Grid Search
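
Because `RandomForestClassifier` and `ExtraTreesClassifier` share most of their parameters, one grid can be reused for both. The sketch below assumes synthetic stand-in data and an illustrative grid; neither comes from the lab itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): swap in the Titanic X and y from section 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Illustrative grid shared by both ensembles.
forest_params = {'n_estimators': [20, 50], 'max_depth': [3, 5, None]}
forest_models = [
    ('random_forest', RandomForestClassifier(random_state=42)),
    ('extra_trees', ExtraTreesClassifier(random_state=42)),
]

forest_scores = {}
for name, estimator in forest_models:
    search = GridSearchCV(estimator, forest_params, cv=5)
    search.fit(X_tr, y_tr)                           # tune on the training split
    forest_scores[name] = search.score(X_te, y_te)   # accuracy on the test split
```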


## 7. Model comparison

Let's compare the scores of the various models.

1. Draw a bar chart of the scores of the best models. Which model wins on the train/test split?
2. Re-test all the models using a 3-fold stratified shuffled cross-validation
3. Draw a bar chart with error bars of the average cross-validation scores. Is the winner the same?
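
The cross-validated comparison can be sketched as follows. The three candidate models and the synthetic data are assumptions chosen to keep the example short; in the lab you would loop over the tuned estimators stored earlier. `StratifiedKFold` with `shuffle=True` gives the stratified shuffled folds the steps ask for.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): swap in the Titanic X and y from section 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)

# Illustrative candidates; replace with your tuned models.
candidates = {
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=42),
}

# 3-fold stratified, shuffled cross-validation, identical for every model.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_means, cv_stds = {}, {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_means[name] = fold_scores.mean()
    cv_stds[name] = fold_scores.std()

# Bar chart of mean accuracy, with the fold standard deviation as error bars.
names = list(candidates)
fig, ax = plt.subplots()
ax.bar(names, [cv_means[n] for n in names],
       yerr=[cv_stds[n] for n in names], capsize=4)
ax.set_ylabel('Mean CV accuracy')
ax.set_title('3-fold stratified shuffled cross-validation')
```

Using one `cv` object for every model means all of them are scored on the same folds, so differences in the bars are not caused by different splits.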


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had encoded the categorical data using `pd.get_dummies` or `OneHotEncoder` to encode them as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
2. Experiment with other models or other parameters. Can you beat your classmates' best score?
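
As a starting point for the bonus, the sketch below shows what `pd.get_dummies` does to categorical columns. The four rows are toy values invented for illustration (an assumption), and `Embarked` is included only as an example of a multi-valued category; it is not among the features used earlier in this lab.

```python
import pandas as pd

# Toy rows for illustration only (assumption), not taken from the dataset.
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Sex': ['female', 'male', 'female', 'male'],
    'Embarked': ['C', 'S', 'S', 'Q'],
    'Age': [38.0, 22.0, 27.0, 35.0],
})

# Each listed column is replaced by one binary column per category value.
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'])
encoded_columns = list(encoded.columns)
```

Note that `Pclass` becomes three separate indicator columns, so the model no longer sees class 3 as "three times" class 1. Passing `drop_first=True` would drop one indicator per feature, which can help linear models avoid redundant columns.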