### Model Diagnostics in Python

In this notebook, you will try out some of the model diagnostics you saw from Sebastian. In this case the response has only two classes: admitted or not admitted.

First let's read in the necessary libraries and the dataset.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

df = pd.read_csv('./admissions.csv')
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


`1.` Change `prestige` to dummy variable columns that are added to `df`.  Then divide your data into training and test data.  Create your test set as 20% of the data, and use a random state of 0.  Your response should be the `admit` column.  [Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) are the docs, which you can also find with a quick Google search if you get stuck.

In [3]:
df[['level1', 'level2', 'level3', 'level4']] = pd.get_dummies(df['prestige'])
df.head()

Unnamed: 0,admit,gre,gpa,prestige,level1,level2,level3,level4
0,0,380,3.61,3,0,0,1,0
1,1,660,3.67,3,0,0,1,0
2,1,800,4.0,1,1,0,0,0
3,1,640,3.19,4,0,0,0,1
4,0,520,2.93,4,0,0,0,1
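
The assignment above relies on `pd.get_dummies` returning its columns in sorted level order (1, 2, 3, 4). As a hedged alternative, the sketch below lets pandas name the columns itself through the `prefix` argument and requests integer output with `dtype=int`. The `toy` frame is made-up illustration data, not the admissions dataset.

```python
import pandas as pd

# Made-up prestige values for illustration only (not the admissions data).
toy = pd.DataFrame({'prestige': [3, 3, 1, 4, 4, 2]})

# prefix + prefix_sep='' yields column names level1..level4 directly,
# so the names cannot drift out of sync with the levels.
dummies = pd.get_dummies(toy['prestige'], prefix='level', prefix_sep='', dtype=int)

# Attach the dummy columns next to the original column.
toy_with_dummies = pd.concat([toy, dummies], axis=1)
print(toy_with_dummies)
```

With this approach a missing level simply produces no column, instead of raising an error about mismatched column counts.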


In [4]:
y = df['admit']
X = df[['gre', 'gpa', 'level1', 'level2', 'level3']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

`2.` Now use [sklearn's Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to fit a logistic model using `gre`, `gpa`, and 3 of your `prestige` dummy variables.  For now, fit the logistic regression model without changing any of the hyperparameters.  

The usual steps are:
* Instantiate
* Fit (on train)
* Predict (on test)
* Score (compare predict to test)

As a first score, obtain the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).  Then answer the first question below about how well your model performed on the test data.

In [7]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [10]:
confusion_matrix(y_test, y_pred)

array([[56,  0],
       [22,  2]])

`3.` Now try out a few additional metrics. [Precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), and [accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) are all popular metrics, which you saw with Sebastian.  You could compute these directly from the confusion matrix, but you can also use the built-in functions in sklearn.

Another very popular set of metrics is [ROC curves and AUC](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).  These use the predicted probabilities from the logistic regression model, not just the predicted labels.  [This](http://blog.yhat.com/posts/roc-curves.html) is also a great resource for understanding ROC curves and AUC.
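
To see why ROC curves need probabilities rather than labels, here is a minimal sketch that sweeps a decision threshold by hand and records one (false positive rate, true positive rate) point per threshold. The labels, probabilities, and the helper `roc_point` are all made up for illustration; they are not part of sklearn or of this notebook's data.

```python
import numpy as np

# Toy labels and predicted probabilities (invented for illustration).
y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

def roc_point(y_true, y_prob, threshold):
    """Return (fpr, tpr) when predicting 1 for probabilities >= threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Each threshold gives one point; many thresholds trace out the curve.
thresholds = (0.0, 0.5, 1.01)
points = [roc_point(y_true, y_prob, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t:.2f}  fpr={fpr:.3f}  tpr={tpr:.3f}")
```

A hard label corresponds to a single threshold, so it yields only one point on this curve.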

Try out these metrics to answer the second quiz question below.  I have also provided the code for the ROC plot below.  The ideal curve shoots all the way to the upper left-hand corner.  Again, these are discussed in more detail in the Machine Learning Udacity program.

In [11]:
precision_score(y_test, y_pred)

1.0

In [12]:
recall_score(y_test, y_pred)

0.083333333333333329

In [13]:
accuracy_score(y_test, y_pred)

0.72499999999999998
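
As a cross-check, the same three values can be recovered by hand from the confusion matrix reported earlier. This is a small sketch that hard-codes that matrix and assumes sklearn's layout: true labels on rows, predicted labels on columns, class 0 first.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true labels, columns are predicted labels, class 0 first.
cm = np.array([[56, 0],
               [22, 2]])

# For a 2x2 matrix, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of predicted admits, how many were admitted
recall = tp / (tp + fn)           # of actual admits, how many were found
accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct

print(precision, recall, accuracy)
```

The model predicted "admitted" only twice, which explains how precision can be perfect while recall is very low.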

In [15]:
!pip install ggplot

Collecting ggplot
  Downloading https://files.pythonhosted.org/packages/48/04/5c88cc51c6713583f2dc78a5296adb9741505348c323d5875bc976143db2/ggplot-0.11.5-py2.py3-none-any.whl (2.2MB)
Collecting brewer2mpl (from ggplot)
  Downloading https://files.pythonhosted.org/packages/84/57/00c45a199719e617db0875181134fcb3aeef701deae346547ac722eaaf5e/brewer2mpl-1.4.1-py2.py3-none-any.whl
Installing collected packages: brewer2mpl, ggplot
Successfully installed brewer2mpl-1.4.1 ggplot-0.11.5


In [18]:
### Unless you install the ggplot library in the workspace, you will 
### get an error when running this code!

from ggplot import *
from sklearn.metrics import roc_curve, auc
%matplotlib inline

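# Predicted probability of the positive class (admitted) from the model fit above
preds = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, preds)

# Use a separate name so the admissions DataFrame `df` is not overwritten
roc_df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(roc_df, aes(x='fpr', y='tpr')) +\
    geom_line() +\
    geom_abline(linetype='dashed')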

ImportError: cannot import name 'Timestamp'
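
Because the `ggplot` import fails in this environment, one possible workaround is to draw the same plot with matplotlib. The sketch below is an assumption on my part rather than the notebook's original approach, and it uses made-up labels and probabilities so that it runs on its own. In the notebook you would substitute `y_test` and `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Toy labels and predicted probabilities (invented for illustration).
y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# roc_curve sweeps the thresholds; auc integrates the resulting curve.
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle='--', label="chance")  # dashed diagonal
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```

The dashed diagonal plays the same role as `geom_abline(linetype='dashed')` in the ggplot version: a curve close to it means the model is doing little better than chance.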