# Final Project Submission
* Student name: James M. Irving, Ph.D.
* Student pace: full time
* Scheduled project review date/time: 05/15/19 2:30 pm
* Instructor name: Jeff Herman / Brandon Lewis
* Blog post URL:


# Iowa Prisoner Recidivism

<img src="images/LSA_map_with_counties_districts_and_B54A5BBCE4156.jpg" width=80%>

## Data Source: Iowa Department of Corrections 

- Source: https://www.kaggle.com/slonnadube/recidivism-for-offenders-released-from-prison
- **Statistics about recidivism in prisoners over a 3-year tracking period**
- **Target:**
    - Recidivism - Return to Prison
- **Features:**
    - Fiscal Year Released
    - Recidivism Reporting Year
    - Race - Ethnicity
    - Age At Release
    - Convicting Offense Classification
    - Convicting Offense Type
    - Convicting Offense Subtype
    - Main Supervising District
    - Release Type
    - Release type: Paroled to Detainder united
    - Part of Target Population

### Detailed variable descriptions:

- **Fiscal Year Released**
    - Fiscal year (year ending June 30) for which the offender was released from prison.

- **Recidivism Reporting Year**
    - Fiscal year (year ending June 30) that marks the end of the 3-year tracking period. For example, offenders who exited prison in FY 2012 are found in recidivism reporting year FY 2015.

- **Race - Ethnicity**
    - Offender's Race and Ethnicity

- **Convicting Offense Classification**
    - Maximum penalties: A Felony = Life; B Felony = 25 or 50 years; C Felony = 10 years; D Felony = 5 years; Aggravated Misdemeanor = 2 years; Serious Misdemeanor = 1 year; Simple Misdemeanor = 30 days

- **Convicting Offense Type**
    - General category for the most serious offense for which the offender was placed in prison.

- **Convicting Offense Subtype**
    - Further classification of the most serious offense for which the offender was placed in prison.

- **Release Type**
    - Reasoning for Offender's release from prison.

- **Main Supervising District**
    - The Judicial District supervising the offender for the longest time during the tracking period.

- **Recidivism - Return to Prison**
    - No = No Recidivism; Yes = Prison admission for any reason within the 3-year tracking period
    
- **Part of Target Population** 
    - The Department of Corrections has undertaken specific strategies to reduce recidivism rates for prisoners who are on parole and are part of the target population.
    ___
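
The fiscal-year definitions above can be sketched in a few lines. This is a minimal illustration, assuming a fiscal year is labeled by the calendar year in which it ends (June 30) and that the reporting year is always the release year plus the 3-year tracking period; the function names `fiscal_year` and `reporting_year` are hypothetical and are not part of the dataset or the notebook.

```python
from datetime import date


def fiscal_year(d: date) -> int:
    """Fiscal year label for a date, assuming FY N runs from July 1 of N-1 through June 30 of N."""
    return d.year + 1 if d.month >= 7 else d.year


def reporting_year(fy_released: int, tracking_years: int = 3) -> int:
    """Recidivism reporting year: the fiscal year that ends the tracking period."""
    return fy_released + tracking_years


# A release on 1 July 2011 falls in FY 2012, which is reported in FY 2015
release_fy = fiscal_year(date(2011, 7, 1))
print(release_fy, reporting_year(release_fy))
```

This matches the example in the variable description: offenders released in FY 2012 appear in reporting year FY 2015.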

# USING THE OSEMN MODEL TO GUIDE WORKFLOW

1. **OBTAIN:**
    - [x] Import data, inspect, check for datatypes to convert and null values
<br><br>

2. **SCRUB: cast data types, identify outliers, check for multicollinearity, normalize data**<br>
    - Check and cast data types
    - [x] Check for missing values 
    - [x] Check for multicollinearity
    - [x] Normalize data (may want to do after some exploring)   
    <br><br>
            
3. **EXPLORE: Check distributions, outliers, etc.**
    - [x] Check scales, ranges (df.describe())
    - [x] Check histograms to get an idea of distributions (df.hist()) and data transformations to perform
    - [x] Use scatterplots to check for linearity and possible categorical variables (df.plot(kind='scatter'))
    <br><br>

   
4. **FIT AN INITIAL MODEL:** 
    - [x] Assess the model.
        <br><br>
5. **REVISE THE FITTED MODEL**
    - [x] Adjust chosen model and hyper-parameters
    <br><br>
6. **HOLDOUT VALIDATION**
    - [ ] Perform cross-validation
___
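
One way the "check for multicollinearity" step in the SCRUB stage could be carried out is sketched below. It is only an illustration on a toy DataFrame: the helper name `high_correlation_pairs` and the 0.75 cutoff are assumptions, not something taken from this notebook.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.75):
    """Return pairs of numeric columns whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes('number').corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: 'b' is almost a multiple of 'a', while 'c' is only weakly related to either
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 11],
                    'c': [5, 1, 4, 2, 3]})
flagged = high_correlation_pairs(toy)
print(flagged)
```

Columns flagged this way are candidates for dropping or combining before fitting a linear model; tree-based models are generally less sensitive to them.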

# OBTAIN:

### Using Custom PyPi Package - `fsds`


In [None]:
import bs_ds_local as bs

In [None]:
# !pip install -U fsds
from fsds.imports import *

In [None]:
## Set Pandas Options
pd_options = {
    'display.max_rows'    : 200,
    'display.max_info_rows':200,
    'display.max_columns' : 0,
#     'display.float_format':'${:,.2f}'.format
}
for option, setting in pd_options.items():
    pd.set_option(option, setting)


## Set Plot Style
plt.style.use('dark_background')

## Suppress Warnings
import warnings
warnings.filterwarnings('ignore')


## Loading the dataset and removing unrelated columns

In [None]:
ls data/

In [None]:
df = pd.read_csv('data/3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa.csv')
df

**Any columns about new convictions or days to recidivism should be dropped for our initial model predicting recidivism.**
- "New..", "Days to Recidivism"

In [None]:
## Drop cols related to recidivism details
## (note: "Year" also matches 'Fiscal Year Released' and 'Recidivism Reporting Year')
drop_expr = ['New',"Days","Recidivism Type","Year"]

drop_cols = []
for exp in drop_expr:
    drop_cols.extend([col for col in df.columns if exp in col])
    
df.drop(columns=drop_cols,inplace=True)
df.head()

### Save original names vs short names in column_legend
- then map names onto columns

In [None]:
## Replacing columns with short names
rename_map = {
    'Fiscal Year Released': 'yr_released',
    'Recidivism Reporting Year': 'report_year' ,
    'Main Supervising District': 'supervising_dist' ,
    'Release Type': 'release_type' ,
    'Race - Ethnicity': 'race_ethnicity'  ,
    'Age At Release ':  'age_released' ,
    'Sex':'sex'   ,
    'Offense Classification': 'crime_class' ,
    'Offense Type': 'crime_type'  ,
    'Offense Subtype':  'crime_subtype' ,
    'Return to Prison': 'recidivist'  ,
    'Target Population':  'target_pop'
}

df = df.rename(rename_map,axis=1)
df

In [None]:
df.to_csv('data/iowa_recidivism_renamed_2020.csv', index=False)  # index=False avoids saving the row index as an extra column

# SCRUB / EXPLORE


In [None]:
## Explore Dtypes and info
df.info()

In [None]:
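import missingno as ms
from IPython.display import display


def nulls_report(df):
    """Display a table with the count (#) and percent (%) of nulls for each column that has any."""
    nulls = df.isna().sum()
    nulls_only = nulls[nulls>0].to_frame('#')
    nulls_only['%'] = (nulls_only['#']/len(df))*100
    nulls_only = nulls_only.round(2)
    capt = 'Columns with Null Values:'
    display(nulls_only.style.set_caption(capt))


## Visualize the pattern of missing values, then show the report
ms.matrix(df)
plt.show()

nulls_report(df)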


**Results of Null Check**
- race_ethnicity has 30 nulls (0.12% of data)
    - drop
- age_released has 3 nulls (0.01% of data)
    - drop
- sex has 3 nulls (0.01% of data)
    - drop
- supervising_dist has 9581 nulls (36.82% of data)
    - replace with "unknown"
- release_type has 1762 nulls (6.77% of data)
    - drop
    
**Dropping all null values from age_released, race_ethnicity, sex, and release_type.**
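
The null-handling plan above could be expressed roughly as follows. This is a sketch on a small made-up DataFrame, not the notebook's actual code: the helper name `handle_nulls` and the toy values are assumptions, and later cells in this notebook rely on imputation inside the preprocessing pipelines instead.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the renamed recidivism columns (values are made up)
toy = pd.DataFrame({
    'race_ethnicity':   ['White', np.nan, 'Black', 'White', 'Black'],
    'age_released':     ['25-34', '35-44', np.nan, 'Under 25', '45-54'],
    'sex':              ['Male', 'Female', 'Male', 'Male', 'Female'],
    'release_type':     ['Parole', 'Parole', 'Discharged', np.nan, 'Parole'],
    'supervising_dist': ['5JD', np.nan, np.nan, np.nan, np.nan],
})


def handle_nulls(df):
    """Fill missing districts with 'unknown', then drop rows missing any other listed column."""
    out = df.copy()
    out['supervising_dist'] = out['supervising_dist'].fillna('unknown')
    drop_subset = ['race_ethnicity', 'age_released', 'sex', 'release_type']
    return out.dropna(subset=drop_subset).reset_index(drop=True)


clean = handle_nulls(toy)
print(clean)
```

Filling `supervising_dist` before dropping matters here: with roughly a third of the rows missing a district, dropping on that column would discard far more data than the other four columns combined.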

## SIMPLIFYING CATEGORICAL FEATURES

### Making `age_released` numerical

In [None]:
def value_counts(col,dropna=False,normalize=True):
    """Convenience function for displaying value counts with default params"""
    return col.value_counts(dropna=dropna,normalize=normalize)

In [None]:
value_counts(df['age_released'])#.value_counts(dropna=False)

In [None]:
# Encoding age groups as ordinal numbers (one representative age per group),
# to be mapped onto 'age_released'
age_ranges = ('Under 25','25-34', '35-44','45-54','55 and Older')
age_numbers = (20,30,40,50,70) 
age_num_map = dict(zip(age_ranges,age_numbers))
age_num_map

In [None]:
df['age_released'] = df['age_released'].map(age_num_map)
value_counts(df['age_released'])

### df['race_ethnicity']

In [None]:
value_counts(df['race_ethnicity'])

- **Remapping race_ethnicity**
    - Due to the low numbers for several of the race_ethnicity types, reducing and combining Hispanic and Non-Hispanic groups
    - Alternative approach of separating race and ethnicity into 2 separate features was rejected after modeling
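
For reference, the rejected alternative (separate race and ethnicity features) could look something like the sketch below. It assumes the raw labels follow the `'<race> - <ethnicity>'` pattern seen in the mapping keys in the next cell, with an empty ethnicity part for labels such as `'White -'`; the variable names `raw`, `split`, and `out` are illustrative only.

```python
import numpy as np
import pandas as pd

# A few raw labels in the same format as the race_ethnicity column
raw = pd.Series(['White - Non-Hispanic', 'Black - Hispanic', 'White -', 'N/A -'])

# Split once on ' -' so the hyphen inside 'Non-Hispanic' is left alone
split = raw.str.split(' -', n=1, expand=True)

out = pd.DataFrame({
    'race': split[0].str.strip().replace({'N/A': np.nan}),
    'ethnicity': split[1].str.strip().replace({'': np.nan}),
})
print(out)
```

Rows whose ethnicity part is empty end up as missing values, so this approach would also need an imputation or "unknown" category for the new `ethnicity` column.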

In [None]:
# Defining Dictionary Map for race_ethnicity categories
race_ethnicity_map = {'White - Non-Hispanic':'White',
                        'Black - Non-Hispanic': 'Black',
                        'White - Hispanic' : 'Hispanic',
                        'American Indian or Alaska Native - Non-Hispanic' : 'American Native',
                        'Asian or Pacific Islander - Non-Hispanic' : 'Asian or Pacific Islander',
                        'Black - Hispanic' : 'Black',
                        'American Indian or Alaska Native - Hispanic':'American Native',
                        'White -' : 'White',
                        'Asian or Pacific Islander - Hispanic' : 'Asian or Pacific Islander',
                        'N/A -' : np.nan,
                        'Black -':'Black'}

# Replacing original race_ethnicity column with remapped one.
df['race_ethnicity'] = df['race_ethnicity'].map(race_ethnicity_map)
value_counts(df['race_ethnicity'])

### df['crime_class']

- **Remapping crime_class**
    - 'Other Felony (Old Code)' and 'Felony - Mandatory Minimum' -> 'Other Felony'
    - 'Special Sentence 2005' and 'Sexual Predator Community Supervision' -> 'Sex Offender'
    - 'Felony - Enhancement to Original Penalty' -> 'Felony - Enhanced'
    - All other classes (including 'Other Felony' and 'Other Misdemeanor') are kept as-is

In [None]:
value_counts(df['crime_class'])

In [None]:
# Remapping
crime_class_map = {'Other Felony (Old Code)':'Other Felony' ,#or other felony
                  'Other Misdemeanor':'Other Misdemeanor',
                   'Felony - Mandatory Minimum':'Other Felony',#np.nan, # if minimum then lowest sentence ==  D Felony
                   'Special Sentence 2005': 'Sex Offender',
                   'Other Felony' : 'Other Felony' ,
                   'Sexual Predator Community Supervision' : 'Sex Offender',
                   'D Felony': 'D Felony',
                   'C Felony' :'C Felony',
                   'B Felony' : 'B Felony',
                   'A Felony' : 'A Felony',
                   'Aggravated Misdemeanor':'Aggravated Misdemeanor',
                   'Felony - Enhancement to Original Penalty':'Felony - Enhanced',
                   'Felony - Enhanced':'Felony - Enhanced' ,
                   'Serious Misdemeanor':'Serious Misdemeanor',
                   'Simple Misdemeanor':'Simple Misdemeanor'}

df['crime_class'] = df['crime_class'].map(crime_class_map)
value_counts(df['crime_class'])

### Remapping target

In [None]:
# Recidivist
df['recidivist'] = df['recidivist'].map( {'No':0,'Yes':1})
value_counts(df['recidivist'])

In [None]:
df.head()

___
## FEATURE ENGINEERING
- **Engineering a simple 'felony' true false category**
- **Combining crime_type and crime_subtype into types_combined**

### Creating a simple 'felony' feature

In [None]:
# Engineering a simple 'felony' true false category
df['felony'] = df['crime_class'].str.contains('felony',case=False)
value_counts(df['felony'])

In [None]:
# df['crime_types_combined'] = df['crime_type']+'_'+df['crime_subtype']
# value_counts(df['crime_types_combined'])

In [None]:
# Combining crime_type and crime_subtype into types_combined
# df['crime_class_type_subtype']= df['crime_class']+'_'+df['crime_type']+'_'+df['crime_subtype']
# value_counts(df['crime_class_type_subtype'])
df.nunique()

### Creating a 'max_sentence' feature based on crime class max penalties
   

In [None]:
# Mapping the maximum sentence (in years) onto each crime class
# Classes without an entry here (e.g. 'Other Felony') map to NaN
crime_class_max_sentence_map = {'A Felony': 75,  # Life (approximated as 75 years)
                                'Aggravated Misdemeanor': 2, # 2 years
                                'B Felony': 50, # 25 or 50 years
                                'C Felony': 10, # 10 years
                                'D Felony': 5,  # 5 years
                                'Felony - Enhanced': 10, # Add on to class C and D felonies, hard to approximate. 
                                'Serious Misdemeanor': 1, # 1 year
                                'Sex Offender': 10, # 10 years
                                'Simple Misdemeanor': 0.083} # 30 days, roughly 1/12 of a year

# Mapping max_sentence_column
df['max_sentence'] =df['crime_class'].map(crime_class_max_sentence_map)
value_counts(df['max_sentence'])

### Checking for values replaced with np.nan

In [None]:
nulls_report(df)

## Checking Final Dtypes

In [None]:
df.info()
dtypes = {'yr_released':str,
         'report_year':str}

# BOOKMARK

## Preprocessing with Pipelines and ColumnTransformer

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler,OneHotEncoder

from sklearn.model_selection import train_test_split

In [None]:
from sklearn import set_config
set_config(display='text')

In [None]:
## Make x and y
target = 'recidivist'
X = df.drop(columns=target).copy()
y = df[target].copy()
value_counts(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y)

## BOOKMARK 

In [None]:
## Get a list of columns to be run as numeric data
num_cols = X_train.select_dtypes('number').columns
num_cols

In [None]:
nulls_report(df)

In [None]:
## Make a num_transformer pipeline
set_config(display='diagram')
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',StandardScaler())])
num_transformer

In [None]:
# ## vis as diagram
# set_config(display='text')
# num_transformer

In [None]:
## Get a list of columns to be run as categorical data
cat_cols = X_train.select_dtypes('O').columns
cat_cols

In [None]:
## Create a cat_transformer pipeline 
## (note: `sparse` was renamed `sparse_output` in scikit-learn 1.2)
cat_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='constant',fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore',sparse=False))])
cat_transformer

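> **The numeric pipeline imputes with the median and then calculates z-scores; the categorical pipeline fills nulls with `'missing'` and then one-hot encodes.**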


In [None]:
# TO DO: make another cat encoder with drop='if_binary'

### Combine Preprocessing into one ColumnTransformer

In [None]:
## COMBINE BOTH PIPELINES INTO ONE WITH COLUMN TRANSFORMER
from sklearn.compose import ColumnTransformer
preprocessing = ColumnTransformer(transformers=[
    ('num',num_transformer,num_cols),
    ('cat',cat_transformer,cat_cols)])
preprocessing

In [None]:
## Get X_train and X_test from column transformer
X_train_tf = preprocessing.fit_transform(X_train)
X_train_tf

> **One downside of Pipelines is that it's harder to get the individual info we need to re-form our dataset as a df**

In [None]:
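## Get the one-hot encoded column names from the fitted encoder
## (get_feature_names_out needs scikit-learn >= 1.0; older versions used get_feature_names)
cat_features = preprocessing.named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(cat_cols)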
X_train_tf = pd.DataFrame(X_train_tf,columns=[*num_cols, *cat_features])
X_train_tf.head()

In [None]:
X_test_tf = pd.DataFrame( preprocessing.transform(X_test),
                         columns=[*num_cols, *cat_features])
X_test_tf.head()

# MODELING

In [None]:
import sklearn.metrics as metrics

def evaluate_classification(model,X_test,y_test,classes=['Non Recid','Recidivist'],
                           normalize='true',cmap='Purples',label='',
                           return_report=False):
    """Accepts an sklearn-compatible classification model + test data 
    and displays several sklearn.metrics outputs: 
    - classification report
    - confusion matrix
    - ROC curve
    """
     
    ## Get Predictions
    y_hat_test = model.predict(X_test)
    
    ## Classification Report / Scores 
    table_header = "[i] CLASSIFICATION REPORT"
    
    ## Add Label if given
    if len(label)>0:
        table_header += f":\t{label}"
        
    
    ## PRINT CLASSIFICATION REPORT
    dashes = '---'*20
    print(dashes,table_header,dashes,sep='\n')

    print(metrics.classification_report(y_test,y_hat_test,
                                    target_names=classes))
    
    report = metrics.classification_report(y_test,y_hat_test,
                                               target_names=classes,
                                          output_dict=True)
    print(dashes+"\n\n")
    
    

    ## MAKE FIGURE
    fig, axes = plt.subplots(figsize=(10,4),ncols=2)
    
    ## Plot Confusion Matrix 
    ## (the Display.from_estimator methods need scikit-learn >= 1.0; they replace
    ##  plot_confusion_matrix / plot_roc_curve, which were removed in 1.2)
    metrics.ConfusionMatrixDisplay.from_estimator(model, X_test,y_test,
                                  display_labels=classes,
                                  normalize=normalize,
                                 cmap=cmap,ax=axes[0])
    axes[0].set(title='Confusion Matrix')
    
    ## Plot ROC Curve
    roc_plot = metrics.RocCurveDisplay.from_estimator(model, X_test, y_test,ax=axes[1])
    axes[1].legend()
    axes[1].plot([0,1],[0,1],ls=':')
    axes[1].grid()
    axes[1].set_title('Receiver Operating Characteristic (ROC) Curve') 
    fig.tight_layout()
    plt.show()
    
    if return_report:
        return report #fig,axes

## Baseline DummyClassifier

In [None]:
from sklearn.dummy import DummyClassifier
dummy= DummyClassifier(strategy='stratified')
dummy.fit(X_train_tf,y_train)
evaluate_classification(dummy,X_test_tf,y_test,
                       label='Dummy Classifier')

### Vanilla RandomForest

In [None]:
from sklearn.ensemble import RandomForestClassifier,StackingClassifier
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV

## Fit a RandomForest with default hyperparameters
clf = RandomForestClassifier()
clf.fit(X_train_tf,y_train)
evaluate_classification(clf,X_test_tf,y_test,label="Vanilla Random Forest")

In [None]:
def get_feature_importance(clf,X_train_tf):
    """Return the model's feature importances as a Series, sorted from most to least important."""
    importances = pd.Series(clf.feature_importances_,index=X_train_tf.columns)
    return importances.sort_values(ascending=False)

def plot_importance(clf,X_train_tf,n=25):
    """Plot a horizontal bar chart of the n most important features."""
    importances = get_feature_importance(clf,X_train_tf)
    ax = importances.sort_values().tail(n).plot(kind='barh')
    ax.set(title=f"Top {n} Most Important Features",xlabel='importance')

In [None]:
plot_importance(clf, X_test_tf,n=20)

### RandomForest - `class_weight="balanced"`

In [None]:
clf = RandomForestClassifier(class_weight='balanced')
clf.fit(X_train_tf,y_train)
evaluate_classification(clf,X_test_tf,y_test,label= "Random Forest (class_weight='balanced')")
plot_importance(clf,X_test_tf)

In [None]:
# get_feature_importance(clf,X_test_tf).to_frame('importance').style.bar()

### SMOTENC

In [None]:
## Getting cat features index
cat_col_index = [False for col in num_cols]
cat_col_index.extend([True for col in cat_features])
cat_col_index[:5]

In [None]:
from imblearn.over_sampling import SMOTENC
smote = SMOTENC(cat_col_index,n_jobs=-1)

In [None]:
X_train_smote,y_train_smote = smote.fit_resample(X_train_tf,y_train)
y_train_smote.value_counts()

In [None]:
X_train_smote[:5]

### RandomForest with SMOTE

In [None]:
clf = RandomForestClassifier()#class_weight='balanced')
clf.fit(X_train_smote,y_train_smote)
evaluate_classification(clf,X_test_tf,y_test,label='RandomForest - SMOTE')
plot_importance(clf,X_test_tf)

# GridSearch RF

In [None]:
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

clf = RandomForestClassifier()
params ={'max_depth':[None,5,7,10,20,30,],
         'min_samples_leaf':[1,2,3],
         'criterion':['gini','entropy'],        
        }


grid = GridSearchCV(clf,params,scoring='recall_macro', n_jobs=-1)

grid.fit(X_train_smote,y_train_smote)
print(grid.best_params_)

print(grid.best_score_)
evaluate_classification(grid.best_estimator_,X_test_tf,y_test)

In [None]:
# scores =['recall','recall_macro','accuracy']
GRIDS={}

In [None]:
## Build loop to make dict of grids for each score method
scores =['f1','f1_macro','roc_auc','recall','recall_macro','accuracy','precision']

reports = {}
for score in scores:
    line = '==='*30
    print(line)
    print(f'[i] Starting {score}',end='\n'+line)
    
    GRIDS[score] = GridSearchCV(clf,params,cv=3,scoring=score, n_jobs=-1)
    GRIDS[score].fit(X_train_smote,y_train_smote)
    
    print(f"\nFor scoring={score}:" )
    print(GRIDS[score].best_params_)
    print('\n\n')
    
    reports[score] = evaluate_classification(GRIDS[score].best_estimator_,
                                    X_test_tf,y_test,label=score,return_report=True)

In [None]:
dfs=[]
for metric,result in reports.items():
    
    result['scoring_param'] = metric
    dfs.append(pd.DataFrame(result))
    
RESULTS = pd.concat(dfs).reset_index().set_index(['scoring_param','index'])
# RESULTS.drop('scoring param',inplace=True)
RESULTS

# BOOKMARK 10/03 7:40PM

# LogisticRegression

In [None]:
X_train_smote.describe()

In [None]:
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# X_train_logreg = scaler.fit_transform(X_train_smote)
# X_test_logreg = scaler.transform(X_test_tf)

In [None]:
logregCV = LogisticRegressionCV(scoring='recall',penalty='l1',cv=3,
                                solver='liblinear',max_iter=250,n_jobs=-1)

logregCV.fit(X_train_smote,y_train_smote)
logregCV

In [None]:
evaluate_classification(logregCV,X_test_tf,y_test)

In [None]:
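def get_coeffs(logregCV, X_train_smote):
    """Return the model's coefficients (plus the intercept) as a Series indexed by feature name."""
    coeffs = pd.Series(logregCV.coef_[0],index=X_train_smote.columns)
    coeffs['Intercept'] = logregCV.intercept_[0]  # intercept_ is a length-1 array for a binary target
    coeffs = coeffs.astype(float)
    return coeffs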

coeffs = get_coeffs(logregCV,X_train_smote)
coeffs.sort_values().plot(kind='barh',figsize=(5,10))

In [None]:
# logregCV.C_, logregCV.

## TO DO: Try tune-sklearn
- [Blog Post](https://towardsdatascience.com/5x-faster-scikit-learn-parameter-tuning-in-5-lines-of-code-be6bdd21833c)
- [Documentation](https://github.com/ray-project/tune-sklearn)

## StackingClassifier

In [None]:
from sklearn.ensemble import StackingClassifier

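## Work in progress: only a random forest base estimator so far
## (a default RandomForestClassifier is assumed here)
stack = StackingClassifier(estimators=[
    ('rf', RandomForestClassifier())
])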
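
As a rough sketch of how a `StackingClassifier` could be assembled and evaluated, the example below stacks a random forest and a logistic regression on synthetic data. Everything here is an assumption made for illustration: the synthetic dataset from `make_classification`, the choice of base estimators, the logistic regression meta-learner, and the variable names. It is not the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary data standing in for the preprocessed features
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

# Each base estimator is a (name, estimator) tuple; their cross-validated
# predictions become the inputs of the final estimator
stack_sketch = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('logreg', LogisticRegression(max_iter=250))],
    final_estimator=LogisticRegression(),
    cv=3)

stack_sketch.fit(X_tr, y_tr)
stack_accuracy = stack_sketch.score(X_te, y_te)
print(f"Stacked accuracy on the toy test split: {stack_accuracy:.3f}")
```

A fitted stack exposes `predict` and `predict_proba` like any other scikit-learn classifier, so it should be usable with an evaluation helper that only relies on those methods.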

# CONCLUSIONS
- **After adjusting for imbalanced classes, the most important factors for determining recidivism are:**
    - **Age at Release**
    - **Supervising Judicial District**
    - **Release Type**
    - **Crime Subtype**
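
Because one-hot encoding spreads each categorical feature over many columns, a ranking at the level of the original features needs the per-column importances to be summed. The sketch below shows one way this could be done; the helper `group_importances` and the toy importance values are hypothetical, and the notebook itself does not show how its feature-level ranking was derived.

```python
import pandas as pd


def group_importances(importances, original_cols):
    """Sum one-hot column importances back onto the original feature names."""
    # Try longer names first so a short name never captures a longer name's columns
    ordered = sorted(original_cols, key=len, reverse=True)

    def parent(name):
        for col in ordered:
            if name == col or name.startswith(col + '_'):
                return col
        return name  # leave unmatched columns under their own name

    grouped = importances.groupby(importances.index.map(parent)).sum()
    return grouped.sort_values(ascending=False)


# Made-up importances for a handful of encoded columns
toy_importances = pd.Series({'age_released': 0.30,
                             'release_type_Parole': 0.10,
                             'release_type_Discharged': 0.25,
                             'crime_subtype_Theft': 0.20,
                             'crime_type_Property': 0.15})
grouped = group_importances(
    toy_importances,
    ['age_released', 'release_type', 'crime_type', 'crime_subtype'])
print(grouped)
```

Summed importances favor features with many categories, so a ranking built this way is best read alongside the per-column plots rather than in place of them.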
    
    
## Recommendations
- This model could be used to predict which prisoners due for release may be at the greatest risk for recidivism.<br><br>
    - Using this knowledge, the state of Iowa could put new programs into action that target those at high risk for recidivism and provide additional assistance and guidance following release.<br><br>
    - Additionally, there could be additional counseling or education _prior_ to release to supply the inmate with tools and options to avoid returning to a life of crime.
    
# FUTURE DIRECTIONS
- With more time and reliable performance, we would perform cross-validation of our final model.<br><br>
- Additional visuals summarizing the underlying features' effects on recidivism.<br><br>
- Adapting more of the available visualization tools to better display the underpinnings of the model.
<br><br>
- Exploration of the predictability of the crime types committed by recidivists.
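
The cross-validation step listed above might look roughly like the following. This is a sketch on synthetic data, not the project's final model: the dataset, the 5-fold setup, and the use of `class_weight='balanced'` in place of SMOTE are all assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced binary data standing in for the preprocessed features
X_toy, y_toy = make_classification(n_samples=400, n_features=10,
                                   weights=[0.7, 0.3], random_state=42)

# Stratified folds preserve the class ratio in every train/validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=42)

cv_results = cross_validate(model, X_toy, y_toy, cv=cv,
                            scoring=['recall_macro', 'roc_auc'])

for name in ('test_recall_macro', 'test_roc_auc'):
    scores = cv_results[name]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If SMOTE is kept instead of class weights, the resampling should happen inside each training fold (for example with `imblearn.pipeline.Pipeline`) so that synthetic samples never leak into the validation fold.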

### POST-REVIEW SUGGESTIONS / IDEAS:
- [ ] Try using reduction instead of SMOTE.
- [ ] seaborn catplot bar graphs
- [ ] Add tree or other visuals
    - Try Mike's SHAP plots

# APPENDIX

In [None]:
STOP

In [None]:
from bs_ds import viz_tree

In [None]:
viz_tree(cb_clf)

In [None]:
import sklearn.tree
compare_tree = sklearn.tree.DecisionTreeClassifier()
dir(compare_tree)

In [None]:
compare_tree.fit(X_train_tf, y_train)  # use the encoded features; the raw X_train still holds string columns

In [None]:
dir(compare_tree)

In [None]:
# This is the tree object that sklearn generates and is looking for 
help(compare_tree.tree_)

In [None]:
dir(cb_clf)

In [None]:
help(cb_clf.get_metadata())

In [None]:
test = cb_clf.get_metadata()

In [None]:
help(cb_clf)

### SHAP values
https://github.com/jirvingphd/shap


In [None]:
import shap
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(cb_clf)

In [None]:
shap_vals = explainer.shap_values(train_pool)

In [None]:
shap.force_plot(explainer.expected_value, shap_vals[:1000],X_train[:1000])

In [None]:
shap.summary_plot(shap_vals, X_train)

In [None]:
shap.summary_plot(shap_vals, X_train, plot_type="bar")