<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Model Evaluation I Lab

_Authors: Matt Brems (DC), Matt Speck (DC), Ben Shaver (DC)_

In this lab we will compare the performance of a couple of models using the Titanic dataset.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Pre-lab: SQL

In Week 1 of the cohort we had two SQL lessons. We want you all to get regular SQL practice throughout the course, so from now on, some labs will include a few SQL components that you must complete. You'll do so by accessing datasets hosted on a Google Cloud PostgreSQL instance, similar to how we accessed the Northwind database in the [SQL II lesson](https://git.generalassemb.ly/DSI-EAST-2/1.06-sql-pandas-ii-lesson/blob/master/1.06-sql-pandas.ipynb).

**Note**: You must be connected to a WiFi network on a whitelisted IP address, so you will not be able to complete the SQL parts at home or anywhere other than your GA campus.

### Step 1.

We're going to load the SQL magic extension so that you can run SQL queries in the Jupyter notebook.

In [3]:
%load_ext sql

### Step 2.

Next, you have to connect to the instance housing the datasets for the course. The URL of the database is below:

```python
postgresql://dsi_student:yellowpencil@35.196.107.77/postgres
```

This URL specifies the following:
* user: `dsi_student`  
* password: `yellowpencil`  
* host (IP address): `35.196.107.77`  
* database: `postgres`  
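
To see how those pieces map onto the connection string, here is a small sketch that splits the URL with Python's standard library. The names `DB_URL` and `connection_info` are ours, introduced only for illustration.

```python
from urllib.parse import urlsplit

# Connection string copied from the lab text above.
DB_URL = "postgresql://dsi_student:yellowpencil@35.196.107.77/postgres"

parts = urlsplit(DB_URL)

# Collect the pieces that the bullet list names.
connection_info = {
    "dialect": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "database": parts.path.lstrip("/"),  # the path component is "/postgres"
}

for key, value in connection_info.items():
    print(f"{key}: {value}")
```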

To connect to this, run the following cell:

In [4]:
%sql postgresql://dsi_student:yellowpencil@35.196.107.77/postgres

'Connected: dsi_student@postgres'

You can now run SQL queries in any cell with the `%%sql` cell magic in it. Try running this next cell:

In [5]:
%%sql

SELECT * FROM titanic
LIMIT 5;

5 rows affected.


index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In order to see all the datasets in the database, you can run the following:

In [6]:
%%sql
SELECT table_name FROM information_schema.tables
WHERE "table_type" = 'BASE TABLE' AND "table_schema" = 'public'

3 rows affected.


table_name
sf_crime
titanic
default_data


### First SQL Ask:

Return a count of all men on the Titanic who paid fares under $50.00 and did not survive.

In [10]:
%%sql
SELECT count(*) FROM titanic WHERE sex = 'male' AND fare < 50 AND survived = 0;

1 rows affected.


count
605


### Expected Output:

|**count**|
|---|
|605|
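
For comparison, the same kind of filter can be written in `pandas` with boolean masks. This is only a sketch: the `sample` DataFrame below is made up for illustration (it reuses the column names but not the real rows), so its count is not the 605 above.

```python
import pandas as pd

# Tiny made-up sample that reuses the column names of the titanic table;
# these rows are illustrative and are not taken from the real data.
sample = pd.DataFrame({
    "sex": ["male", "male", "female", "male", "male"],
    "fare": [7.25, 71.28, 8.05, 13.00, 49.99],
    "survived": [0, 0, 0, 1, 0],
})

# Same three conditions as the WHERE clause, combined with & (logical AND).
condition = (
    (sample["sex"] == "male")
    & (sample["fare"] < 50)
    & (sample["survived"] == 0)
)

# Summing a boolean Series counts the True values, much like COUNT(*).
count = int(condition.sum())
print(count)  # prints 2 for this made-up sample
```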

### Loading Data from SQL

Now, we're going to load in the dataset to interact with in Python. This is pretty simple. We'll provide the code to demonstrate.

In [17]:
titanic = %sql SELECT * FROM titanic
titanic = titanic.DataFrame()
titanic.head()

1309 rows affected.


Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


You now have the data you need to complete the lab!

## 1. Prepare the Data

The [Titanic dataset](https://www.kaggle.com/c/titanic/data) is back!

1. <s>Load the data into a `pandas` DataFrame</s>
2. Encode the categorical features properly
3. Separate features from target into `X` and `y`.
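
As a rough illustration of step 2, a two-level column can be mapped to 0/1 directly, while a column with more levels can be one-hot encoded. The `sample` rows below are made up; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows using two categorical columns from the titanic table.
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "embarked": ["S", "C", "S"],
    "age": [29.0, 30.0, 2.0],
})

# Binary category: convert to 0/1 directly (1 = male).
sample["sex"] = (sample["sex"] == "male").astype(int)

# Multi-level category: one-hot encode, dropping the first level so the
# remaining dummy columns are not redundant.
encoded = pd.get_dummies(sample, columns=["embarked"], drop_first=True, dtype=int)
print(encoded)
```

Whether to keep `embarked` at all is a modeling choice; the solution cells below drop it instead.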

In [None]:
# If you were not able to read in the data from SQL, then uncomment the last line of this cell and run it.

# Get the titanic data from some random online source:
# titanic = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

In [13]:
titanic.head(1)

Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"


In [19]:
# Encode sex as 1 for male and 0 for female, then check dtypes and null counts.
titanic['sex'] = titanic['sex'].map(lambda x: 1 if x == 'male' else 0)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 15 columns):
index        1309 non-null int64
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null int64
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(6), object(6)
memory usage: 153.5+ KB


In [40]:
y = titanic['survived']
X = titanic.drop(['body', 'boat', 'name', 'index', 'home.dest', 'embarked', 'cabin', 'ticket', 'survived'], axis=1)

In [41]:
# Drop rows with a missing age from both X and y so they stay aligned.
missing_age = X.index[X['age'].isnull()]
X = X.drop(missing_age)
y = y.drop(missing_age)

In [42]:
# Drop rows with a missing fare from both X and y so they stay aligned.
missing_fare = X.index[X['fare'].isnull()]
X = X.drop(missing_fare)
y = y.drop(missing_fare)

In [47]:
len(X), len(y)

(1045, 1045)
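
The two drop steps above could also be collapsed into a single `dropna` call. Here is a minimal sketch on made-up data; `X_demo` and `y_demo` are stand-ins, not the lab's `X` and `y`.

```python
import numpy as np
import pandas as pd

# Made-up feature matrix and target with a few missing values.
X_demo = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.34, 151.55, np.nan, 151.55],
})
y_demo = pd.Series([1, 1, 0, 0], name="survived")

# Drop rows missing either column, then keep only the matching targets.
X_clean = X_demo.dropna(subset=["age", "fare"])
y_clean = y_demo.loc[X_clean.index]

print(len(X_clean), len(y_clean))
```

Indexing `y` by the surviving index of `X` keeps the two objects aligned row for row.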

## 2. Model Evaluation Function

Since we will compare several models, let's write a function that reproduces the metrics we're interested in.

First, separate `X` and `y` into training and test sets. Use a 30% test set and `random_state=42`. Make sure that the data is shuffled and stratified.
    

In [48]:
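from sklearn.model_selection import train_test_split
# Note: random_state=48 is used here, not the 42 requested above; the scores below come from this split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=48, stratify=y)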
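
To check what stratifying buys you, here is a small sketch on a synthetic target (60 negatives, 40 positives; not the Titanic data) that compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both splits should keep roughly the same
# share of positives as the full target.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```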

Define a function called `evaluate_model()` that trains the model on the training data and then evaluates the model on the test set by returning the following four methods of evaluation:
  - confusion matrix
  - accuracy score
  - sensitivity score
  - specificity score

In [57]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

def evaluate_model(model, Xtrain, Xtest, ytrain, ytest):
    """Fit `model` on the training data and print test-set metrics."""
    modelT = model.fit(Xtrain, ytrain)
    predictions = modelT.predict(Xtest)
    accuracy = accuracy_score(ytest, predictions)
    print('Accuracy: ', accuracy)
    # Rows are actual classes, columns are predicted classes.
    confusion = confusion_matrix(ytest, predictions)
    print('Confusion Matrix: ', '\n', confusion)
    # The report's recall for class 1 is sensitivity; recall for class 0 is specificity.
    print(classification_report(ytest, predictions))
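
If you would rather have the four results returned than printed, one possible variant is sketched below. The name `evaluate_model_metrics` and the synthetic data are our own assumptions for illustration; they are not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_model_metrics(model, X_train, X_test, y_train, y_test):
    """Fit on the training split and return test-set metrics as a dict."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_test, predictions),
        "accuracy": accuracy_score(y_test, predictions),
        # Sensitivity is recall of the positive class (label 1).
        "sensitivity": recall_score(y_test, predictions, pos_label=1),
        # Specificity is recall of the negative class (label 0).
        "specificity": recall_score(y_test, predictions, pos_label=0),
    }


# Synthetic stand-in data so the sketch runs without the Titanic table.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
results = evaluate_model_metrics(KNeighborsClassifier(), X_tr, X_te, y_tr, y_te)
print(results)
```

Returning a dict makes it easier to collect scores from several models into one table later on.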
    

## 3. KNN

Let's start with $k$-Nearest Neighbors.

1. Initialize a $k$-NN model.
2. Evaluate its performance with the function you previously defined.

In [50]:
from sklearn.neighbors import KNeighborsClassifier
knn5new = KNeighborsClassifier(n_neighbors=5, weights='uniform')

In [58]:
# Recall of the positive class is sensitivity (true positives / all actual positives);
# recall of the negative class is specificity (true negatives / all actual negatives).
# Precision is true positives / predicted positives.
evaluate_model(knn5new, X_train, X_test, y_train, y_test)

Accuracy:  0.687898089172
Confusion Matrix:  
 [[150  36]
 [ 62  66]]
             precision    recall  f1-score   support

          0       0.71      0.81      0.75       186
          1       0.65      0.52      0.57       128

avg / total       0.68      0.69      0.68       314
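
As a worked check, sensitivity and specificity can be read straight off the confusion matrix printed above. The sketch below only reuses those four counts.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes (0, 1),
# columns are predicted classes (0, 1).
confusion = np.array([[150, 36],
                      [62, 66]])

tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()   # 216 / 314
sensitivity = tp / (tp + fn)             # 66 / 128
specificity = tn / (tn + fp)             # 150 / 186

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
# prints: accuracy=0.688 sensitivity=0.516 specificity=0.806
```

These agree with the recall column of the classification report (0.52 for class 1, 0.81 for class 0).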



## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize a logistic regression model.
2. Evaluate its performance with the function you previously defined.

In [60]:
from sklearn.linear_model import LogisticRegression
# Use scikit-learn's estimator (not statsmodels' GLM) so it plugs into evaluate_model().
logreg = LogisticRegression(max_iter=1000)
evaluate_model(logreg, X_train, X_test, y_train, y_test)

## 5. Model Comparison

Let's compare the scores of the various models.

Use your `evaluate_model` function to compare the performance of three or more distinct models. Which performs the best?
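
One way to organize such a comparison is to loop over a dictionary of models and collect the scores in a DataFrame. This is a sketch under assumptions: the three candidates are an arbitrary choice, and the data is synthetic, so the numbers say nothing about the Titanic task.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in the lab's X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Hypothetical candidate set: any three or more classifiers would do.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

rows = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "sensitivity": recall_score(y_te, preds, pos_label=1),
        "specificity": recall_score(y_te, preds, pos_label=0),
    }

# One row per model, one column per metric.
comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```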

## 6. Exploratory Visualization

Create a visualization that depicts the performance of each of your models across all three metrics. This visualization should portray, *for your eyes only*, which model performs best.
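
A grouped bar chart is one quick option. In the sketch below, `plot_metric_comparison` is a hypothetical helper of ours. The k-NN scores come from the confusion matrix reported earlier in this lab, and the "majority baseline" is what always predicting class 0 would score on the same 186/128 test split; your other models' scores would be added as further entries.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_metric_comparison(scores):
    """Draw grouped bars: one group per metric, one bar per model."""
    model_names = list(scores)
    metric_names = list(next(iter(scores.values())))
    x = np.arange(len(metric_names))
    width = 0.8 / len(model_names)

    fig, ax = plt.subplots(figsize=(7, 4))
    for i, name in enumerate(model_names):
        heights = [scores[name][m] for m in metric_names]
        ax.bar(x + i * width, heights, width, label=name)

    # Centre the tick labels under each group of bars.
    ax.set_xticks(x + width * (len(model_names) - 1) / 2)
    ax.set_xticklabels(metric_names)
    ax.set_ylim(0, 1)
    ax.set_ylabel("score")
    ax.legend()
    return fig, ax


scores = {
    # From the k-NN confusion matrix reported earlier in this lab.
    "knn (k=5)": {"accuracy": 216 / 314, "sensitivity": 66 / 128, "specificity": 150 / 186},
    # Always predicting class 0 on the same 186/128 test split.
    "majority baseline": {"accuracy": 186 / 314, "sensitivity": 0.0, "specificity": 1.0},
}
fig, ax = plot_metric_comparison(scores)
```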

## 7. Explanatory Visualization

Choose a metric that you feel is the most important to optimize for. Justify your choice.

Using 3-fold, stratified, shuffled cross-validation, run your models and evaluate them according to your chosen metric.

Create a bar chart with error bars, where the error bars reflect the range of the cross-validated scores. This should be an explanatory visualization which explains the performance of your top models on the metric you have chosen. Your stakeholders are business analysts at White Star Line, a British shipping company concerned with minimizing loss of life aboard their *Olympic* class ocean liners.
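
The cross-validation and error-bar mechanics might look something like the sketch below. It assumes sensitivity (`scoring="recall"`) as the chosen metric and uses synthetic data and two arbitrary models; the choice of metric and its justification are still yours to make.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; swap in the lab's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# 3-fold, stratified, shuffled, as the prompt specifies.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

# scoring="recall" is sensitivity for the positive class (assumed metric).
fold_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
    for name, model in models.items()
}

names = list(fold_scores)
means = np.array([fold_scores[n].mean() for n in names])
mins = np.array([fold_scores[n].min() for n in names])
maxs = np.array([fold_scores[n].max() for n in names])

# Asymmetric error bars spanning the min-to-max fold score; clip guards
# against tiny negative values from floating-point rounding.
lower = np.clip(means - mins, 0, None)
upper = np.clip(maxs - means, 0, None)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(names, means, yerr=[lower, upper], capsize=6)
ax.set_ylabel("sensitivity (3-fold CV)")
ax.set_ylim(0, 1)
```

For a stakeholder-facing chart you would still want a title, plain-language labels, and a sentence on what the metric means for passengers.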

## BONUS

Remember that we created a function called `model_picker()` in the Regression Metrics lab that allowed us to compare two *regression* models. Consider building a similar function that compares two binary classification models. This might help you in upcoming projects when you want to decide which classification model to use.
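
As a starting point, here is one possible shape for such a function. The name `classifier_picker`, its signature, and the tie-breaking rule are all our assumptions; they are not taken from the Regression Metrics lab's `model_picker()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def classifier_picker(model_a, model_b, X, y, scoring="accuracy", cv=3):
    """Return (winner, mean_a, mean_b); ties go to model_a.

    cross_val_score fits clones, so the returned winner is still unfitted.
    """
    mean_a = cross_val_score(model_a, X, y, cv=cv, scoring=scoring).mean()
    mean_b = cross_val_score(model_b, X, y, cv=cv, scoring=scoring).mean()
    winner = model_a if mean_a >= mean_b else model_b
    return winner, mean_a, mean_b


# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
winner, mean_knn, mean_logreg = classifier_picker(
    KNeighborsClassifier(), LogisticRegression(max_iter=1000), X_demo, y_demo
)
print(type(winner).__name__, round(mean_knn, 3), round(mean_logreg, 3))
```

Passing a different `scoring` string (for example `"recall"`) lets the same function pick on sensitivity instead of accuracy.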