# Machine Learning Template
-----


# Introduction
This notebook adapts the class notebook `machine_learning.ipynb` as a basic template for running machine learning models of interest. The models come from the `scikit-learn` package, and there is also light data wrangling in `pandas`.

**To adjust the features of interest first go to "Data Exploration and Preparation > Feature Generation" and then "Creating Training and Validation Sets > Split into Features and Labels."** 

## Setup
---
The initial packages include [`scikit-learn`](http://scikit-learn.org), which is used to fit the models. Note that the original tutorial uses `psycopg2` to connect to the database, but we use `sqlalchemy` instead.

In [None]:
%pylab inline
import pandas as pd
pd.set_option('display.max_columns', 300)

#import psycopg2
from sqlalchemy import create_engine

import seaborn as sns
sns.set_style("white")

import matplotlib
import sklearn
from sklearn.metrics import precision_recall_curve, auc,accuracy_score, precision_score, recall_score,cohen_kappa_score,confusion_matrix
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, 
AdaBoostClassifier,BaggingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.dummy import DummyClassifier

from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


### Global Variables

The following are some basic values used throughout the notebook. Objects `db_name`, `hostname`, and `schema` are database address variables. Objects `train_date`, `test_date`, and `train_horizon` are in a way holdovers from the original tutorial, which uses a very particular form of cross-validation.

In [None]:
train_date = 2010
test_date = 2013
#prediction_horizon = 3 #originally 5
train_horizon = 3

### Connect to the database

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
schema = 'M3'
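# build the connection string from the variables above instead of repeating the literals
pgsql_engine = create_engine("postgresql://{host}/{db}".format(host=hostname, db=db_name))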

In [None]:
sql_string = " SELECT *"
sql_string +=" FROM M3.cleaned_data"

full_data = pd.read_sql(sql_string, con = pgsql_engine)    
full_data.head()

## Feature Generation


What do we want our features to be? Let's make one list containing the variables from which we will derive our preliminary features.

In [None]:
DrugVars=['drugalcf','drugampf','drugcocf','drugherf','drugmarf','drugothf','drugpcpf','drugunkf']
good_time_vars=['meritorious_good_time','education_in_prison','substanceabuse_treatment','working_in_prison']
descriptive_vars=['race', 'sex','HasKids','birthdecade','birthdecade1950orprior','active_gang_member','anypriorwage']
prison_vars=['release_year', 'hclass','sexoff','sexreg','lstsclvl','prisontime']

feat_source=DrugVars+good_time_vars+descriptive_vars+prison_vars
feat_source

Another issue is that some features will not have the data type necessary for running the machine learning models.

In [None]:
dt = full_data.filter(items = feat_source).dtypes
print(dt)

In general, variables of type `object` correspond to strings, the names of which are now collected in `feat_obj`. The `get_dummies()` method treats these variables as categorical and makes a dummy for each. **Run the following cell one time only.**

In [None]:
feat_obj = list(dt[dt == "object"].index) #what columns are of type object?
temp = pd.get_dummies(full_data.filter(items = feat_obj))
feat_get_dummies = list(temp)
full_data = full_data.merge(right = temp, how = 'left', left_index = True, right_index = True)
full_data.shape

In [None]:
feat_model = list(set(feat_source) - set(feat_obj)) + feat_get_dummies
print(feat_model)

## Split into features and labels
Here we can decide which features/predictors to use in our model and what variable to use as the label. The object `feat_model` more or less has this specified for us, but by adjusting `feat_ref` we can remove some variables to serve as reference points.

In [None]:
feat_ref = ['race_WHI', 'sex_F','hclass_M','sexoff_N','sexreg_N','vetf_N',
           'drugalcf_N','drugampf_N','drugcocf_N','drugherf_N','drugmarf_N','drugothf_N','drugpcpf_N','drugunkf_N',
            'lstsclvl_P','HasKids_N','birthdecade_1920.0','release_year_2010','prisontime_0.0'] 

In [None]:
race_vars = [e for e in feat_model if e.startswith('race_')]
print(race_vars)

In [None]:
#feat_model = list(set(feat_model) - set(race_vars))
sel_features = list(set(feat_model) - set(feat_ref))
print(sel_features)
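
To make the dummy-coding and reference-category steps concrete, here is a minimal sketch on a made-up data frame. The toy values and the names `toy`, `dummies`, `ref_cols`, and `kept` are assumptions for illustration only and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({"race": ["WHI", "BLK", "HIS", "BLK"],
                    "sex": ["M", "F", "M", "M"]})

# get_dummies makes one indicator column per category of each string column.
dummies = pd.get_dummies(toy)
print(list(dummies.columns))

# Drop one dummy per variable so that it serves as the reference category.
ref_cols = ["race_WHI", "sex_F"]
kept = sorted(set(dummies.columns) - set(ref_cols))
print(kept)
```

Leaving out one dummy per variable means the remaining coefficients are read relative to the omitted category, which is the role `feat_ref` plays above.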

In [None]:
sel_label= 'employed'

# Create Training and Validation Data Sets


Time to make training and validation sets. Here are the ranges of years for testing and training currently.

In [None]:
# wrap in list() so the years print and format as explicit lists in Python 2 and 3
train_years = list(range(train_date, train_date+train_horizon))
test_years = list(range(test_date, test_date+1))

In [None]:
print("The test years include: {yrs}".format(yrs = test_years))
print("The training years include: {yrs}".format(yrs = train_years))

The function below simply subsets `full_data` on `exityr`, the year a prisoner was released. The cells after it build `train_data` and `test_data` from the ranges of years above.

In [None]:
def create_test_or_train(yrs):
    # isin() accepts a list or a range, so this works in Python 2 and 3
    return full_data[full_data['exityr'].isin(list(yrs))]

In [None]:
train_data = create_test_or_train(train_years)
test_data = create_test_or_train(test_years)
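
As a quick illustration of the year-based split, the sketch below applies the same idea to a made-up data frame. Only the column name `exityr` comes from this notebook; the rows and the names `toy`, `toy_train`, and `toy_test` are hypothetical.

```python
import pandas as pd

# Hypothetical release years; only the column name `exityr` comes from the notebook.
toy = pd.DataFrame({"exityr": [2009, 2010, 2011, 2012, 2013, 2013],
                    "employed": [0, 1, 0, 1, 1, 0]})

train_years = list(range(2010, 2010 + 3))  # 2010, 2011, 2012
test_years = list(range(2013, 2013 + 1))   # 2013 only

# Keep the rows whose release year falls in each set of years.
toy_train = toy[toy["exityr"].isin(train_years)]
toy_test = toy[toy["exityr"].isin(test_years)]
print(len(toy_train), len(toy_test))
```

Rows released outside both ranges (2009 here) land in neither set, so it is worth checking the row counts after splitting.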

Let's take a look at our training set. 

In [None]:
train_data.head()

Let's take a look at our testing set. 

In [None]:
test_data.head()

## Comparing the training and testing data

The following makes some summary-statistic comparisons between `train_data` and `test_data`. This might be a good place to do some visualization.

What proportion of individuals in the training set were ever employed?

In [None]:
print('Number of rows: {}'.format(train_data.shape[0]))
train_data[sel_label].value_counts(normalize=True)

How does that compare to the test data? 

In [None]:
print('Number of rows: {}'.format(test_data.shape[0]))
test_data[sel_label].value_counts(normalize=True)

It appears that employment happens for more individuals in the training data set. This is not too surprising because the labor market, especially for Illinois, was in bad shape in 2010 and 2011 compared to 2013.

The following grabs the mean and standard error for the `race_` dummies to see how the proportion of different races varies, if at all, across the train and test data.

In [None]:
race_mean_train = train_data.filter(regex = "^race_").mean()
race_se_train = train_data.filter(regex = "^race_").std()/np.sqrt(train_data.shape[0])
race_mean_test = test_data.filter(regex = "^race_").mean()
race_se_test = test_data.filter(regex = "^race_").std()/np.sqrt(test_data.shape[0])
race_summ = pd.DataFrame({'train_mean' : race_mean_train,
                          'train_se' : race_se_train,
                          'test_mean' : race_mean_test,
                          'test_se' : race_se_test})

In [None]:
print(race_summ)
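
For a 0/1 dummy, the mean is the share of rows in that category and the standard error is the standard deviation divided by the square root of the sample size. The sketch below shows that calculation on invented data; the helper `mean_and_se` and the toy lists are assumptions for illustration.

```python
import numpy as np

def mean_and_se(x):
    """Return the sample mean and its standard error (std / sqrt(n))."""
    x = np.asarray(x, dtype=float)
    # ddof=1 matches the default of pandas' .std()
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

# Hypothetical 0/1 dummies, standing in for an indicator such as a race_ column.
toy_train = [1, 0, 1, 1, 0, 1, 0, 1]
toy_test = [0, 0, 1, 0, 1, 0]

m_train, se_train = mean_and_se(toy_train)
m_test, se_test = mean_and_se(toy_test)
print(m_train, se_train)
print(m_test, se_test)
```

A gap between two means that is small relative to their standard errors is weak evidence of a real difference between the train and test populations.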

It appears that the proportion of Black releasees may be slightly higher and the proportion of White and Hispanic releasees may be slightly lower in the training data compared to the test data.

The last cell looks at birth year. The distribution of ages appears to be roughly the same.

In [None]:
summ_birth_year = pd.DataFrame({'train' : train_data['birth_year'].describe(),
                                'test' : test_data['birth_year'].describe()})
print(summ_birth_year)                               

This splits the train and test data into features and labels. They are float arrays, so basically matrices.

In [None]:
# use conventions typically used in python scikitlearn
X_train = train_data[sel_features].values
y_train = train_data[sel_label].values
X_test = test_data[sel_features].values
y_test = test_data[sel_label].values

# Model Selection

Very little is edited here compared to the original.

## Model Evaluation 


In this phase, you take the predictors from your test set and apply your model to them, then assess the quality of the model by comparing the *predicted values* to the *actual values* for each record in your testing data set. 

- **Performance Estimation**: How well will our model do once it is deployed and applied to new data?

The evaluation code later in this notebook uses each fitted model to make predictions on the test dataset and reports its accuracy, among other metrics.

Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score (that can sometimes be interpreted as a probability) between 0 and 1, which lets you more finely rank all of the examples from *most likely* to *least likely* to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 (1/2) as positive (1), but there's no reason that has to be the cutoff.
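
The effect of the threshold can be seen in a minimal sketch with made-up scores. The helper `label_at` is hypothetical; like `calc_threshold` later in this notebook, it labels a score as positive when it is at or above the cutoff.

```python
import numpy as np

scores = np.array([0.10, 0.40, 0.50, 0.80])  # hypothetical model scores

def label_at(scores, threshold):
    """Label a score 1 when it is at or above the threshold, else 0."""
    return (scores >= threshold).astype(int)

print(label_at(scores, 0.5))  # cutoff at one half
print(label_at(scores, 0.3))  # a lower cutoff labels more examples positive
```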

This function makes plots for the normal train-test approach and cross-validation. Couldn't think of a better place to put it.

This function takes the data frame generated in `cv_year_metrics()` and creates a bar chart summarizing a chosen model metric for each year for all models, or all model metrics averaged across years.

In [None]:
def cv_year_metrics_plot(df, chosen_metric = 'accuracy', time_agg = False, aspect=1):
    # sns.catplot is the current name of sns.factorplot (seaborn >= 0.9)
    if not time_agg: #for each model plot the given metric by year
        pp = sns.catplot(x = 'year', y = 'value', hue = 'model',
                         data = df[df['metric'] == chosen_metric],
                         kind='bar', aspect = aspect)
    else: #for each model plot all metrics avged over year
        dfx = df.groupby(['metric', 'model'], as_index = False)['value'].mean()
        pp = sns.catplot(x = 'metric', y = 'value', hue = 'model',
                         data = dfx, kind = 'bar', aspect = aspect)

## Confusion Matrix

Once we have turned our scores into 0 or 1 for classification, we create a *confusion matrix*, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.

The count of true negatives is `conf_matrix[0,0]`, false negatives `conf_matrix[1,0]`, true positives `conf_matrix[1,1]`, and false positives `conf_matrix[0,1]`.
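
The cell layout can be checked by hand with a small sketch that tallies the four cells directly. The labels are invented, and the matrix is built with plain NumPy (rows are true labels, columns are predicted labels) rather than with `confusion_matrix`, so that each count is easy to trace.

```python
import numpy as np

y_true = [0, 0, 1, 1, 1, 0]  # hypothetical ground truth
y_pred = [0, 1, 1, 0, 1, 0]  # hypothetical predicted labels

# Rows index the true label, columns index the predicted label.
conf = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1

tn, fp, fn, tp = conf[0, 0], conf[0, 1], conf[1, 0], conf[1, 1]
precision = tp / float(tp + fp)  # share of predicted positives that are correct
recall = tp / float(tp + fn)     # share of actual positives that are found
print(conf)
print(precision, recall)
```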

If we care about our whole precision-recall space, we can optimize for a metric known as the **area under the curve (AUC-PR)**, which is the area under the precision-recall curve. The maximum AUC-PR is 1. 

In [None]:
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: array-like
        ground truth labels
    y_score: array-like
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:1f}'.format(auc_val))
    plt.show()
    #plt.clf()
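
As a rough check on how AUC-PR behaves, the sketch below computes it for two made-up score vectors using the same `precision_recall_curve` and `auc` calls as the function above. The helper name `auc_pr` and the toy data are assumptions for illustration.

```python
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 1, 1]                  # hypothetical labels
ranked_well = [0.1, 0.2, 0.8, 0.9]     # every positive outscores every negative
ranked_mixed = [0.1, 0.4, 0.35, 0.8]   # one negative outscores one positive

def auc_pr(y_true, y_score):
    """Area under the precision-recall curve via the trapezoidal rule."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

print(auc_pr(y_true, ranked_well))
print(auc_pr(y_true, ranked_mixed))
```

The perfectly ranked scores reach the maximum of 1, while the mixed ranking falls below it.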

## Precision and Recall at k%

If we only care about a specific part of the precision-recall curve we can focus on more fine-grained metrics. For instance, say there is a special program for people likely to be recidivists, but only 5% can be admitted. In that case, we would want to prioritize the 5% who were *most likely* to end up back in jail, and it wouldn't matter too much how accurate we were on the 80% or so who weren't very likely to end up back in jail. 

Let's say that, out of the approximately 200,000 prisoners, we can intervene on 5% of them, or the "top" 10,000 prisoners (where "top" means highest predicted risk of recidivism). We can then focus on optimizing our **precision at 5%**.

In [None]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: array-like
        ground truth labels
    y_prob: numpy array
        predicted probabilities from the model
    model_name: str
        model name, used as the plot title (e.g., LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    #plt.show()
    #plt.clf()

In [None]:
def precision_at_k(y_true, y_scores,k):
    
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)
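
To see what precision at k% measures, here is a small worked sketch on ten made-up examples. The helper `precision_in_top_share` is a hypothetical variant written for this illustration: it counts the top `int(k * n)` scores directly, whereas `precision_at_k` above picks a score threshold, so the two can differ by an example or when scores tie.

```python
import numpy as np

def precision_in_top_share(y_true, y_scores, k):
    """Precision among the top k share of examples ranked by score."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_top = max(1, int(k * len(y_scores)))    # how many examples to act on
    top = np.argsort(y_scores)[::-1][:n_top]  # indices of the highest scores
    return y_true[top].mean()

# Hypothetical labels and scores, already listed from highest to lowest score.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(precision_in_top_share(y_true, y_scores, 0.2))  # top 2 examples
print(precision_in_top_share(y_true, y_scores, 0.4))  # top 4 examples
```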

In [None]:
calc_threshold = lambda x,y: 0 if x < y else 1

### Cross validation approach

### Helper functions

One or more functions that perform tasks in the user-defined machine learning functions.

The function `expand_grid()` creates a `DataFrame` object with a row for every possible combination of the values in the dictionary passed as its argument. It recreates R's `expand.grid()`.

In [None]:
import itertools
def expand_grid(data_dict):
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns = data_dict.keys())
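
A short usage sketch shows the shape of the result. The function is repeated here, with the iterators wrapped in `list()`, so that the block runs on its own; the input dictionary is made up.

```python
import itertools
import pandas as pd

def expand_grid(data_dict):
    """One row per combination of the values in data_dict."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))

# Hypothetical input: two years crossed with two metrics gives four rows.
grid = expand_grid({"year": [2010, 2011],
                    "metric": ["accuracy", "recall"]})
print(grid)
```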

### Machine Learning Functions

In [None]:
def cv_params(X, Y, year_group, class_fire, tuning_parameters):
    #what are the scores that we care about?
    scores = ['roc_auc', 'precision']
    
    #classifiers of interest
    c_f = clfs[class_fire]
    
    #split the data into our preferred train-test combo
    X_train = X[np.in1d(year_group, [2011, 2012, 2013])]
    X_test = X[year_group == 2010]
    Y_train = Y[np.in1d(year_group, [2011, 2012, 2013])]
    Y_test = Y[year_group == 2010]
                                                      
    #create the cross-validation indices by year
    #gkf = GroupKFold(n_splits = 4)
    #gkf.split(X, Y, groups=year_group)
    
    for score in scores:
        print("Tuning hyperparameters for %s" % score)
        print()
        
        cxx = GridSearchCV(c_f, tuning_parameters, cv = 5, \
                          scoring = '%s' % score)
        cxx.fit(X_train, Y_train)
        
        print("Best parameters set found on development set:")
        print()
        print(cxx.best_params_)
        print()
        means = cxx.cv_results_['mean_test_score']
        stds = cxx.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, cxx.cv_results_['params']):
            print("%f (+/- %f) for %r" % (mean, 2*std, params))
        print()
        
        print("Detailed Classification Report:")
        print("The model is trained on the full development set.")
        print("The scores are evaluated on the full evaluation set.")
        print()
        Y_true, Y_pred = Y_test, cxx.predict(X_test)
        print(classification_report(Y_true, Y_pred))
        print()

This function returns cross-validated (grouped by `exityr`) model assessment scores (e.g., accuracy, recall, etc.) for every classifier listed in `sel_clfs`. All data is stored in a pandas data frame.

Adjust which score measures you want by modifying the list `metrics` within the below function.

In [None]:
def cv_year_metrics(X, Y, year_group, class_fires):
    #create objects to loop over, store results
    yrs = range(2010, 2014)
    metrics = ['accuracy', 'recall', 'precision', 'roc_auc'] #can add more
    df = expand_grid({'year' : yrs,
                     'metric' : metrics,
                     'model' : class_fires}) 
    df['value'] = np.nan
    
    #create the cross-validation splitter; cross_val_score below receives the
    #year groups and builds the by-year folds itself
    gkf = GroupKFold(n_splits = 4)
    
    #compute cross-validation metrics over each model
    for fires in class_fires: 
        for mm in metrics:
            temp_fire = clfs[fires]
            df_ind = (df['model'] == fires) & (df['metric'] == mm)
            df.loc[df_ind, 'value'] = cross_val_score(temp_fire, X, np.ravel(Y), \
                                                      groups=year_group, scoring=mm, cv=gkf)
    
    return(df)
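
The grouping behaviour can be checked on a tiny made-up data set: with four years and four splits, `GroupKFold` holds out one whole year per fold. The arrays `years`, `X_toy`, and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: three records for each release year.
years = np.repeat([2010, 2011, 2012, 2013], 3)
X_toy = np.arange(len(years)).reshape(-1, 1)
y_toy = np.tile([0, 1, 0], 4)

gkf = GroupKFold(n_splits=4)
held_out = []
for train_idx, test_idx in gkf.split(X_toy, y_toy, groups=years):
    test_years = sorted(set(years[test_idx].tolist()))
    train_years = sorted(set(years[train_idx].tolist()))
    held_out.append(test_years)
    print("held out:", test_years, "trained on:", train_years)
```

The order in which years are held out is decided by `GroupKFold` and need not follow the calendar, so it is worth printing the folds like this before attaching a year to each fold's score.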

This function fits a logistic regression model for every train-test year combination and returns the estimated model coefficients.

In [None]:
def cv_lr_coef(X, Y, year_group, feat_names):
    #create cv indices
    gkf = GroupKFold(n_splits = 4)
    cv_inds = list(gkf.split(X, Y, groups=year_group))
    
    #initialize data set
    df = pd.DataFrame({'feat' : feat_names})
    
    #obtain coefs for each training fold
    for x in range(0,4):
        lm = clfs['LogisticReg']
        # GroupKFold returns positional indices, so select rows with .iloc
        fitx = lm.fit(X.iloc[cv_inds[x][0]], Y.iloc[cv_inds[x][0]])
        df['yr201{xx}'.format(xx = x)] = fitx.coef_[0]
    
    return(df)

Similar to `cv_lr_coef`, this function returns the feature importance rankings obtained from a random forest classifier by year.

In [None]:
def cv_rf_feat_import(X, Y, year_group, feat_names):
    #create cv indices
    gkf = GroupKFold(n_splits = 4)
    cv_inds = list(gkf.split(X, Y, groups=year_group))
    
    #initialize data set
    df = pd.DataFrame({'feat' : feat_names})
    
    #obtain feature importances for each training fold
    for x in range(0,4):
        rf = clfs['RandomForest']
        # GroupKFold returns positional indices, so select rows with .iloc
        rf.fit(X.iloc[cv_inds[x][0]], Y.iloc[cv_inds[x][0]])
        importances = rf.feature_importances_
        df['yr201{xx}'.format(xx = x)] = importances
    
    return(df)

## Machine Learning Pipeline

When working on machine learning projects, it is a good idea to structure your code as a modular **pipeline**, which contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. This has many advantages:
- **Reproducibility**. It's important that your work be reproducible. This means that someone else should be able
to see what you did, follow the exact same process, and come up with the exact same results. It also means that
someone else can follow the steps you took and see what decisions you made, whether that person is a collaborator, 
a reviewer for a journal, or the agency you are working with. 
- **Ease of model evaluation and comparison**.
- **Ability to make changes.** If you receive new data and want to go through the process again, or if there are 
updates to the data you used, you can easily substitute new data and reproduce the process without starting from scratch.

In [None]:
clfs = {'RandomForest': RandomForestClassifier(n_estimators=500, n_jobs=-1),
        'ExtraTrees': ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy'),
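        'LogisticReg': LogisticRegression(penalty='l1', C=1e5, solver='liblinear'), # liblinear supports the l1 penalty
        'StochasticGradientDescent':SGDClassifier(loss='log'), # named loss='log_loss' in scikit-learn >= 1.1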
        'GradientBoosting': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=10),
        'NaiveBayes': GaussianNB(),
        'SupportVectorMachine': SVC(probability=True),
        'LinearSupportVectorMachine': LinearSVC(),
        'NearestNeighbor': KNeighborsClassifier(),
        'BaggedNearestNeighbor':BaggingClassifier(KNeighborsClassifier(),max_samples=0.5,max_features=0.5),
        'MostFreqDummy':DummyClassifier(strategy='most_frequent'),
        'StratifiedDummy':DummyClassifier(strategy='stratified'),
        'UniformDummy':DummyClassifier(strategy='uniform')}

In [None]:
sel_clfs = ['RandomForest', 'ExtraTrees', 'LogisticReg', 'StochasticGradientDescent', 'GradientBoosting', 'NaiveBayes','NearestNeighbor','BaggedNearestNeighbor','MostFreqDummy','StratifiedDummy','UniformDummy']
#sel_clfs = ['RandomForest']#,'LogisticReg']

### print(sel_clfs)

In [None]:
model_name_set=[]
accuracy_set = []
precision_set = []
recall_set = []
kappa_set = []
p_at_1_set=[]
p_at_5_set=[]
max_p_at_k = 0
for clfNM in sel_clfs:
    model_name_set.append(clfNM)
    clf = clfs[clfNM]
    clf.fit( X_train, y_train )
    print(clf)
    if clfNM=='LogisticReg':
        print("The coefficients and standardized coefficients for each of the features are")
        std_coef = np.std(X_test,0)*clf.coef_
        # print the (feature, coefficient, standardized coefficient) triples
        print(list(zip(sel_features, clf.coef_[0].round(3),std_coef[0].round(3))))
    y_score = clf.predict_proba(X_test)[:,1]
    hist, bin_edges = np.histogram(y_score, bins='fd')
    safe_ind = np.where(hist > 10)[0] #returns indices where count > 10
    safe_bins = bin_edges[safe_ind]
    print("Here are the bin counts:")
    print(np.histogram(y_score, bins = safe_bins)[0])
    plt.figure()
    sns.distplot(y_score, kde=True, rug=False, bins=safe_bins, norm_hist=True)
    #save the distribution plot
    savefig("/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/images/with_race/score_dist_{x}.pdf".format(x = clfNM), \
           bbox_inches = "tight")
    #close the plot
    plt.clf()
    predicted = np.array(y_score)
    predictedVal = np.array( [calc_threshold(score,0.5) for score in y_score] )
    expected = np.array(y_test)
    #evaluation metrics
    conf_matrix = confusion_matrix(expected,predictedVal)
    accuracy = accuracy_score(expected, predictedVal)
    accuracy_set.append(accuracy)
    precision = precision_score(expected, predictedVal)
    precision_set.append(precision)
    recall = recall_score(expected, predictedVal)
    recall_set.append(recall)
    kappa = cohen_kappa_score(expected, predictedVal)
    kappa_set.append(kappa)
    print(conf_matrix)
    print( "Accuracy = " + str( accuracy))
    print( "Precision = " + str(precision))
    print( "Recall= " + str(recall))
    print( "Kappa= " + str(kappa))
    p_at_1 = precision_at_k(expected,y_score, 0.01)
    p_at_5 = precision_at_k(expected,y_score, 0.05)
    p_at_1_set.append(p_at_1)
    p_at_5_set.append(p_at_5)
    print('Precision at 1%: {:.2f}'.format(p_at_1))
    print('Precision at 5%: {:.2f}'.format(p_at_5))
    if max_p_at_k < p_at_5:
        max_p_at_k = p_at_5

    #plot_precision_recall(expected, y_score)
    plt.figure()
    plot_precision_recall_n(expected,predicted, clfNM)
    savefig("/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/images/with_race/roc_plot_{x}.pdf".format(x = clfNM), \
           bbox_inches = "tight")
    plt.clf()
    
print(max_p_at_k)
evaluation_metrics_dict={'Model': model_name_set,
                         'Accuracy': accuracy_set,
                         'Precision': precision_set,
                         'Recall': recall_set,
                         'Kappa': kappa_set,
                         'Precision_at_1_pct': p_at_1_set,
                         'Precision_at_5_pct': p_at_5_set}
evaluation_metrics=pd.DataFrame.from_dict(evaluation_metrics_dict)
evaluation_metrics
evaluation_metrics.to_csv(path_or_buf = "/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/holdout_metrics_with_race.csv")


In [None]:
evm_melt = pd.melt(evaluation_metrics, id_vars = ['Model'], var_name = 'metric')
evm_melt = evm_melt.rename(index = str, columns = {'Model': 'model'})
cv_year_metrics_plot(evm_melt, time_agg=True, aspect = 2)

## Data for sklearn

This breaks out the feature variables as `X` and stores `exityr` as a grouping variable.

In [None]:
X = full_data[sel_features]#.select_dtypes(exclude=['object']).values
group = full_data['exityr'].values
#temp_feat = list(full_data[sel_features].select_dtypes(exclude=['object']))
employed = full_data['employed']

In [None]:
print(shape(X))
print(shape(group))
print(shape(employed))

## Analysis

First, model evaluation measures are presented individually across years and then averaged across years. Next, coefficients and feature importance for the logistic regression and random forest classifiers, respectively, are presented.

### Model Evaluation by Year

In [None]:
model_stats = cv_year_metrics(X, employed, group, sel_clfs)

In [None]:
model_stats.head()

In [None]:
model_stats.to_csv(path_or_buf = "/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/cv_metrics_without_race.csv")

In [None]:
cv_year_metrics_plot(model_stats, chosen_metric = 'accuracy', aspect=2)
plt.show()

In [None]:
cv_year_metrics_plot(model_stats, chosen_metric = 'precision', aspect=2)
plt.show()

In [None]:
cv_year_metrics_plot(model_stats, chosen_metric = 'recall', aspect=2)
plt.show()

In [None]:
cv_year_metrics_plot(model_stats, chosen_metric = 'roc_auc', aspect=2)
plt.show()

### Model Evaluation Averaged Over Years


In [None]:
cv_year_metrics_plot(model_stats, time_agg = True, aspect = 2)
plt.show()

### Logistic Regression by Year

**Keep in mind that `employed==1` means unemployed**

For the most part, it seems that the sign and magnitude of the LR coefficients are robust to whatever chosen year is the holdout. A few exceptions:

* `meritorious_good_time` has a much smaller magnitude for the 2012 and 2013 holdouts than for the 2010 and 2011 holdouts.
* `drugothf_Y` and `hclass_U` have positive coefficients for the 2011 holdout but negative coefficients in all others.
* `working_in_prison` has a magnitude that varies widely across holdouts.
* `drugunkf_Y` has a much higher magnitude for the 2010 holdout than for all other years.
* `lstsclvl_1` has a negative coefficient for the 2012 holdout.

In [None]:
lr_assess = cv_lr_coef(X, employed, group, feat_names = sel_features)

In [None]:
lr_assess.to_csv(path_or_buf = "/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/cv_lr_with_race.csv")

In [None]:
print(lr_assess)

### Random Forest Feature Importance by Year

The basic relative importance of features appears robust across years. The variables that seem to matter the most are `birthdecade`, `prisontime`, and `anypriorwage`.

In [None]:
rf_assess = cv_rf_feat_import(X, employed, group, feat_names = sel_features)

In [None]:
rf_assess.to_csv(path_or_buf = "/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/cv_rf_with_race.csv")

In [None]:
print(rf_assess)

# Baseline 

It is important to check our model against a reasonable **baseline** to know how well our model is doing. Without any context, 83% accuracy can sound really great... but it's not so great when you remember that you could do almost that well by declaring everyone a non-recidivist, which would be a stupid (not to mention useless) model.

A good place to start is checking against a *random* baseline, assigning every example a label (positive or negative) completely at random. 
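
A random baseline can be simulated in a few lines. The sketch below uses invented labels with roughly 20% positives and assigns predictions by coin flip; the 20% rate, the sample size, and the seed are assumptions chosen only for illustration. The `UniformDummy` entry in `clfs` plays a similar role within the pipeline above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 100000

# Hypothetical labels with roughly 20% positives.
y_true = (rng.random(n) < 0.2).astype(int)
# Random baseline: label each example positive or negative by a coin flip.
y_random = rng.integers(0, 2, size=n)

base_rate = y_true.mean()
precision = y_true[y_random == 1].mean()  # share of predicted positives that are positive
recall = y_random[y_true == 1].mean()     # share of actual positives that were flagged
print(base_rate, precision, recall)
```

Under random labeling, precision sits near the base rate of positives and recall near one half, which gives a floor that any useful model should beat.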

**The remaining cells are not run because they do not apply yet.**

Another good practice is checking against an "expert" or rule of thumb baseline. For example, say that talking to people at the DOC, you find that they think it's much more likely that someone who has been in prison multiple times already will reoffend. Then you should check that your classifier does better than just labeling everyone who has had multiple past admits as positive.

In [None]:
#recidivism_predicted = np.array([ 1 if nadmit > 1 else 0 for nadmit in test_data.nadmits.values ])
#recidivism_p_at_5 = precision_at_k(expected,recidivism_predicted,0.05)
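
The rule-of-thumb baseline described above can be sketched on made-up data as follows. The name `nadmits` follows the commented-out cell above, but the values and outcome labels here are hypothetical.

```python
import numpy as np

# Hypothetical number of prior admissions and observed outcomes.
nadmits = np.array([1, 3, 2, 1, 1, 4])
y_true = np.array([0, 1, 0, 0, 1, 1])

# Rule of thumb: flag anyone with more than one past admission as positive.
y_rule = (nadmits > 1).astype(int)

true_pos = int(((y_rule == 1) & (y_true == 1)).sum())
precision = true_pos / float(y_rule.sum())  # flagged cases that are positive
recall = true_pos / float(y_true.sum())     # positives that the rule catches
print(precision, recall)
```

A classifier is only worth deploying if it beats a simple rule like this on the metric that matters for the intervention.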

In [None]:
#all_non_recidivist = np.array([0 for nadmit in df_testing.nadmits.values])
#all_non_recidivist_p_at_5 = precision_at_k(expected, all_non_recidivist,0.05)

In [None]:
#sns.set_style("whitegrid")
#fig, ax = plt.subplots(1, figsize=(22,12))
#sns.set_context("poster", font_scale=1.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
#sns.barplot(['Random','All Non-Recidivist', 'Recidivism','Model'],
#            [random_p_at_5, all_non_recidivist_p_at_5, recidivism_p_at_5, max_p_at_k],
#            palette=['#6F777D','#6F777D','#6F777D','#800000'])
#plt.ylim(0,1)
#plt.ylabel('precision at 5%')

## Let's explore some of the models we just built

In [None]:
clfs

In [None]:
# explore random forest RF
sel_clfs
clf = clfs['RandomForest']
#clf = clfs[clfNM]
print(clf)
clf.fit( X_train, y_train )
print(clf.feature_importances_)

### Let's see if we can make this look a little better

In [None]:
importances = clf.feature_importances_
std = np.std ([tree.feature_importances_ for tree in clf.estimators_],
       axis=0)
indices = np.argsort(importances)[::-1]

print ("Feature ranking")
for f in range(X_test.shape[1]):
    print ("%d. %s (%f)" % (f + 1, sel_features[indices[f]], importances[indices[f]]))

# plot 
plt.figure()
plt.title ("Feature Importances")
plt.bar(range(X_test.shape[1]), importances[indices], color='r',
      yerr=std[indices], align = "center")
# label each bar with its own feature, in ranked order
plt.xticks(range(X_test.shape[1]), [sel_features[i] for i in indices], rotation=90)
plt.xlim([-1, X_test.shape[1]])
# save before showing so the saved figure is not blank
savefig("/nfshome/pmclaughlin/Projects/M3/user/pmclaughlin/project-repository/to export/images/with_race/feat_import_plt.pdf", \
       bbox_inches = "tight")
plt.show()



# Exercise 

Our model has just scratched the surface. Try the following: 
    
- Create more features
- Try more models
- Try different parameters for your model

## Resources

- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), also available online, includes less mathematics and is more approachable.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).