<img src='img/logo.png'>
<img src='img/title.png'>

# Table of Contents
* [Exercises](#Exercises)


# Exercises

In an earlier exercise with `LogisticRegression` and the bank campaign data set, we used `GridSearchCV`.

In this exercise we'll do a similar grid search, but with the adult dataset from ``data/adult.csv``, attempting to use `LogisticRegression` to predict income class (greater than $50,000/year or not).
 * Split `adult.csv` into training and test sets.
 * Apply grid search to the training set to find the best `C` for `LogisticRegression`. You may also search over the L1 vs. L2 penalty.
 * Plot the ROC curve of the best model on the test set.
 * Experiment with other scoring options if you have time.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import os
import pandas as pd
data = pd.read_csv(os.path.join('data', 'adult.csv'), index_col=0)

In [None]:
data.head() 

<button data-toggle="collapse" data-target="#soln1" class='btn btn-primary'>Show solution</button>

<div id="soln1" class="collapse">

First create the binary categories from the data:

```python
# get dummy variables, needed for scikit-learn models on categorical data:
X = pd.get_dummies(data.drop("income", axis=1))
y = data.income == " >50K"
X.head()
```
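As a small illustration of what `pd.get_dummies` does, here is a sketch on a hypothetical toy frame (the column values below are made up for the example, not taken from `adult.csv`). Numeric columns pass through unchanged, while each distinct value of a categorical column becomes its own indicator column:

```python
import pandas as pd

# Hypothetical toy frame, not the adult data: one numeric and one categorical column.
toy = pd.DataFrame({"age": [25, 40, 31],
                    "workclass": ["Private", "State-gov", "Private"]})

# Numeric columns are kept as-is; each category gets an indicator column
# named <column>_<value>.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

This is why `X` ends up with many more columns than `data`: every categorical feature is expanded into one column per category.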

Verify the shape of the binary target:

```python
y.shape
```
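Besides the shape, it can be worth checking the fraction of positive labels. The comparison above matches the string `" >50K"` including its leading space, so a mismatch in whitespace would silently produce an all-`False` target. A minimal sketch on a hypothetical toy series (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical labels that mimic the leading space used in the comparison above.
toy_income = pd.Series([" >50K", " <=50K", " <=50K", " <=50K"])
toy_y = toy_income == " >50K"

# The mean of a boolean series is the fraction of True values.
positive_fraction = toy_y.mean()
print(toy_y.shape, positive_fraction)

# Without the leading space nothing matches, so the target is all False.
print((toy_income == ">50K").sum())
```

On the real data, the same check would be `y.mean()`; a value of exactly 0 would suggest the label string does not match.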

Run some general imports, then perform the train/test split:

```python
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
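The call above relies on the `train_test_split` default, which holds out 25% of the rows for testing. One optional variation, not part of the original solution, is to pass `stratify` so that both splits keep the same class proportions. A sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 5 positives (25% positive class).
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([True] * 5 + [False] * 15)

# stratify=toy_y asks for the same class proportions in both splits.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

print(ytr.mean(), yte.mean())
```

For the exercise this would read `train_test_split(X, y, stratify=y, random_state=0)`.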

Define a custom scoring function then fit the grid:

```python
def my_scoring(fitted_estimator, X_test, y_test):
    # fraction of correct predictions, i.e. plain accuracy
    return (fitted_estimator.predict(X_test) == y_test).mean()

# setting the tolerance higher than default 1e-4 for demo purposes
tol = 0.01
param_grid = {'C': [0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
# 'liblinear' supports both penalties; the default 'lbfgs' solver rejects 'l1'
grid = GridSearchCV(LogisticRegression(tol=tol, solver='liblinear'),
                    param_grid, scoring=my_scoring, cv=3)
grid.fit(X_train, y_train)
```
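After fitting, the per-combination results are available in the `cv_results_` attribute of the search object. The following is a minimal sketch of how one might inspect them, using a synthetic dataset from `make_classification` as a stand-in for the adult data; the names `demo_X`, `demo_y`, and `demo_grid` are placeholders for this example only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own.
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

demo_grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear handles both l1 and l2
    {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]},
    cv=3,
)
demo_grid.fit(demo_X, demo_y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["param_C", "param_penalty", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

The same pattern applied to `grid` from the solution shows how much the score actually varies across `C` and the penalty, rather than only the single best combination.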

Explore several scoring functions across grid parameters and graph results:

```python
auc = roc_auc_score(y_test, grid.decision_function(X_test))
fpr, tpr, _ = roc_curve(y_test, grid.decision_function(X_test))
print('fpr', fpr, 'tpr', tpr, 'auc', auc)
plt.plot(fpr, tpr, linewidth=4)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.xlim(-0.01, 1)
plt.ylim(0, 1.02);


def try_scorer(scorer, label):
    tol = 0.01
    param_grid = { 'C': [0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
    # 'liblinear' supports both penalties; the default 'lbfgs' solver rejects 'l1'
    grid = GridSearchCV(LogisticRegression(tol=tol, solver='liblinear'), param_grid, scoring=scorer, cv=3)
    grid.fit(X_train, y_train)
    auc = roc_auc_score(y_test, grid.decision_function(X_test))
    fpr, tpr, _ = roc_curve(y_test, grid.decision_function(X_test))
    plt.plot(fpr, tpr, linewidth=4, label=label)
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.xlim(-0.01, 1)
    plt.ylim(0, 1.02);
    return grid, fpr, tpr, auc

scorers = (make_scorer(f1_score, greater_is_better=True), 
           make_scorer(accuracy_score, greater_is_better=True), 
           my_scoring)
labels = ('f1_score', 'accuracy', 'my_scoring')
for label, scorer in zip(labels, scorers):
    fitted, fpr, tpr, auc = try_scorer(scorer, label)
    # report on the grid fitted with this scorer, not the earlier global `grid`
    print('With scorer', scorer,
          '\n\tBest Params:', fitted.best_params_,
          '\n\tBest score:', fitted.best_score_)
plt.legend();
```
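As an alternative to `make_scorer`, `GridSearchCV` also accepts the names of built-in scorers as strings. The sketch below compares three of them on a synthetic dataset standing in for the adult data; `demo_X`, `demo_y`, and `best_by_scorer` are placeholder names for this example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own.
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

best_by_scorer = {}
for name in ("accuracy", "f1", "roc_auc"):
    search = GridSearchCV(
        LogisticRegression(solver="liblinear"),
        {"C": [0.1, 1, 10]},
        scoring=name,  # built-in scorer selected by its string name
        cv=3,
    )
    search.fit(demo_X, demo_y)
    best_by_scorer[name] = (search.best_params_["C"], search.best_score_)

for name, (best_C, score) in best_by_scorer.items():
    print(f"{name}: best C={best_C}, mean CV score={score:.3f}")
```

Note that `"roc_auc"` scores the model's decision values rather than its hard predictions, so it can prefer a different `C` than accuracy or F1 does.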

</div>

<img src='img/copyright.png'>