# Final Project Submission
* Student name: James M. Irving, Ph.D.
* Student pace: full time
* Scheduled project review date/time: 05/15/19 2:30 pm
* Instructor name: Jeff Herman / Brandon Lewis
* Blog post URL:


## Iowa Prisoner Recidivism Data

- Source: https://www.kaggle.com/slonnadube/recidivism-for-offenders-released-from-prison
- **Statistics about recidivism among prisoners over a 3-year tracking period**
- **Target:**
    - Recidivism - Return to Prison
- **Features:**
    - Fiscal Year Released
    - Recidivism Reporting Year
    - Race - Ethnicity
    - Age At Release
    - Convicting Offense Classification
    - Convicting Offense Type
    - Convicting Offense Subtype
    - Main Supervising District
    - Release Type
    - Release type: Paroled to Detainer united
    - Part of Target Population

<img src="LSA_map_with_counties_districts_and_B54A5BBCE4156.jpg" width=800>

### Detailed variable descriptions:

- **Fiscal Year Released**
    - Fiscal year (year ending June 30) for which the offender was released from prison.

- **Recidivism Reporting Year**
    - Fiscal year (year ending June 30) that marks the end of the 3-year tracking period. For example, offenders who exited prison in FY 2012 are found in recidivism reporting year FY 2015.

- **Race - Ethnicity**
    - Offender's Race and Ethnicity

- **Convicting Offense Classification**
    - Maximum penalties: A Felony = Life; B Felony = 25 or 50 years; C Felony = 10 years; D Felony = 5 years; Aggravated Misdemeanor = 2 years; Serious Misdemeanor = 1 year; Simple Misdemeanor = 30 days

- **Convicting Offense Type**
    - General category for the most serious offense for which the offender was placed in prison.

- **Convicting Offense Subtype**
    - Further classification of the most serious offense for which the offender was placed in prison.

- **Release Type**
    - Reasoning for Offender's release from prison.

- **Main Supervising District**
    - The Judicial District supervising the offender for the longest time during the tracking period.

- **Recidivism - Return to Prison**
    - No = No Recidivism; Yes = Prison admission for any reason within the 3-year tracking period
    
- **Part of Target Population** 
    - The Department of Corrections has undertaken specific strategies to reduce recidivism rates for prisoners who are on parole and are part of the target population.
    ___
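
As a quick illustration of the reporting-year definition above, the sketch below derives the expected reporting year from the release year. The toy values and the `expected_reporting_year` column are assumptions made for the example, not part of the dataset.

```python
import pandas as pd

# Toy rows; the column name follows the dataset's lower-cased header.
demo = pd.DataFrame({"fiscal_year_released": [2010, 2012, 2015]})

# The 3-year tracking period ends three fiscal years after release.
demo["expected_reporting_year"] = demo["fiscal_year_released"] + 3
print(demo)
```

A comparison of this derived column against `recidivism_reporting_year` could serve as a sanity check on the raw file.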

# USING THE OSEMN MODEL TO GUIDE WORKFLOW

1. **OBTAIN:**
    - [x] Import data, inspect, check for datatypes to convert and null values
<br><br>

2. **SCRUB: cast data types, identify outliers, check for multicollinearity, normalize data**<br>
    - Check and cast data types
    - [x] Check for missing values 
    - [x] Check for multicollinearity
    - [x] Normalize data (may want to do after some exploring)   
    <br><br>
            
3. **EXPLORE: Check distributions, outliers, etc.**
    - [x] Check scales, ranges (df.describe())
    - [x] Check histograms to get an idea of distributions (df.hist()) and data transformations to perform
    - [x] Use scatterplots to check for linearity and possible categorical variables (df.plot(kind='scatter'))
    <br><br>

   
4. **FIT AN INITIAL MODEL:** 
    - [x] Assess the model.
        <br><br>
5. **REVISE THE FITTED MODEL**
    - [x] Adjust chosen model and hyper-parameters
    <br><br>
6. **HOLDOUT VALIDATION**
    - [ ] Perform cross-validation
___

# OBTAIN:
## Importing Packages


<!--- ### Using Custom PyPi Package - BroadSteel DataScience (bs_ds)

<img src="https://bs-ds.readthedocs.io/en/latest/_images/bs_ds_logo.png" width=200>

- **Used several EDA functions from bs_ds.bamboo module:**
    - inspect_df
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.inspect_df
    - check_null
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.check_null
    - check_unique
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.check_unique
    - check_column
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.check_column
    - check_numeric
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.check_numeric
    - big_pandas
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.big_pandas
    - ignore_warnings
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.ignore_warnings
    - drop_cols
        - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bamboo.drop_cols --->

In [1]:
!pip install -U fsds_100719
from fsds_100719.imports import *

fsds_1007219  v0.6.6 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds_100719,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


In [2]:
### Style
plt.style.use('seaborn-notebook')
# # Import custom python package BroadSteel DataScience (bs_ds_)
# from bs_ds.imports import *
# from bs_ds.bamboo import  inspect_df, check_null, check_unique, check_column, check_numeric, big_pandas, ignore_warnings
# from bs_ds import ihelp, ihelp_menu

# import bs_ds as bs

# from bs_ds.imports import *

# # Enabling full-sized dataframes and info rows
# big_pandas()

# # Turning off warnings for function deprecations
# ignore_warnings()



## Loading the dataset and removing unrelated columns

In [3]:
# Dataset Links
full_all_prisoners_file = "datasets/FULL_3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa.csv"
# only_repeat_criminals_w_new_crime_file= "datasets/prison_recidivists_with_recidivism_type_only.csv"

# Will be using the all_prisoners file to predict recidivism
df = pd.read_csv(full_all_prisoners_file)

df.columns = [col.lower().replace(' ','_') for col in df.columns]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 17 columns):
fiscal_year_released                     26020 non-null int64
recidivism_reporting_year                26020 non-null int64
race_-_ethnicity                         25990 non-null object
age_at_release_                          26017 non-null object
convicting_offense_classification        26020 non-null object
convicting_offense_type                  26020 non-null object
convicting_offense_subtype               26020 non-null object
release_type                             24258 non-null object
main_supervising_district                16439 non-null object
recidivism_-_return_to_prison            26020 non-null object
days_to_recidivism                       8681 non-null float64
new_conviction_offense_classification    6718 non-null object
new_conviction_offense_type              6718 non-null object
new_conviction_offense_sub_type          6699 non-null object
part_of_target

In [4]:
df.columns = [col.replace('_-_','_') for col in df.columns]
df

Unnamed: 0,fiscal_year_released,recidivism_reporting_year,race_ethnicity,age_at_release_,convicting_offense_classification,convicting_offense_type,convicting_offense_subtype,release_type,main_supervising_district,recidivism_return_to_prison,days_to_recidivism,new_conviction_offense_classification,new_conviction_offense_type,new_conviction_offense_sub_type,part_of_target_population,recidivism_type,sex
0,2010,2013,Black - Non-Hispanic,25-34,C Felony,Violent,Robbery,Parole,7JD,Yes,433.0,C Felony,Drug,Trafficking,Yes,New,Male
1,2010,2013,White - Non-Hispanic,25-34,D Felony,Property,Theft,Discharged – End of Sentence,,Yes,453.0,,,,No,Tech,Male
2,2010,2013,White - Non-Hispanic,35-44,B Felony,Drug,Trafficking,Parole,5JD,Yes,832.0,,,,Yes,Tech,Male
3,2010,2013,White - Non-Hispanic,25-34,B Felony,Other,Other Criminal,Parole,6JD,No,,,,,Yes,No Recidivism,Male
4,2010,2013,Black - Non-Hispanic,35-44,D Felony,Violent,Assault,Discharged – End of Sentence,,Yes,116.0,,,,No,Tech,Male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26015,2015,2018,White - Hispanic,Under 25,C Felony,Violent,Assault,Paroled to Detainer - INS,,No,,,,,Yes,No Recidivism,Male
26016,2015,2018,White - Non-Hispanic,35-44,C Felony,Violent,Sex,Released to Special Sentence,6JD,No,,,,,No,No Recidivism,Male
26017,2015,2018,White - Non-Hispanic,25-34,Aggravated Misdemeanor,Public Order,Traffic,Parole Granted,5JD,No,,,,,No,No Recidivism,Female
26018,2015,2018,White - Non-Hispanic,25-34,D Felony,Property,Theft,Paroled w/Immediate Discharge,5JD,No,,,,,Yes,No Recidivism,Male


In [5]:
from pandas_profiling import ProfileReport
ProfileReport(df)



# 01/18/2020 TO DO:

1. Use KNN to impute Supervising Judicial District
2. Use Udemy Feature Engineering's Monotonic-Relationship Based Categorical Encoding Methods
    - Helpful for Tree-based algorithms
3. Address Rare Labels? 
    - one hot frequent only 
    - grouping infrequent into one group
4. Try BaggingClassifier, SVC
5. Add SHAP values to understand the model

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 17 columns):
fiscal_year_released                     26020 non-null int64
recidivism_reporting_year                26020 non-null int64
race_ethnicity                           25990 non-null object
age_at_release_                          26017 non-null object
convicting_offense_classification        26020 non-null object
convicting_offense_type                  26020 non-null object
convicting_offense_subtype               26020 non-null object
release_type                             24258 non-null object
main_supervising_district                16439 non-null object
recidivism_return_to_prison              26020 non-null object
days_to_recidivism                       8681 non-null float64
new_conviction_offense_classification    6718 non-null object
new_conviction_offense_type              6718 non-null object
new_conviction_offense_sub_type          6699 non-null object
part_of_target

In [7]:
df =df.rename({'recidivism_return_to_prison':'recividist'},axis=1)
df

Unnamed: 0,fiscal_year_released,recidivism_reporting_year,race_ethnicity,age_at_release_,convicting_offense_classification,convicting_offense_type,convicting_offense_subtype,release_type,main_supervising_district,recividist,days_to_recidivism,new_conviction_offense_classification,new_conviction_offense_type,new_conviction_offense_sub_type,part_of_target_population,recidivism_type,sex
0,2010,2013,Black - Non-Hispanic,25-34,C Felony,Violent,Robbery,Parole,7JD,Yes,433.0,C Felony,Drug,Trafficking,Yes,New,Male
1,2010,2013,White - Non-Hispanic,25-34,D Felony,Property,Theft,Discharged – End of Sentence,,Yes,453.0,,,,No,Tech,Male
2,2010,2013,White - Non-Hispanic,35-44,B Felony,Drug,Trafficking,Parole,5JD,Yes,832.0,,,,Yes,Tech,Male
3,2010,2013,White - Non-Hispanic,25-34,B Felony,Other,Other Criminal,Parole,6JD,No,,,,,Yes,No Recidivism,Male
4,2010,2013,Black - Non-Hispanic,35-44,D Felony,Violent,Assault,Discharged – End of Sentence,,Yes,116.0,,,,No,Tech,Male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26015,2015,2018,White - Hispanic,Under 25,C Felony,Violent,Assault,Paroled to Detainer - INS,,No,,,,,Yes,No Recidivism,Male
26016,2015,2018,White - Non-Hispanic,35-44,C Felony,Violent,Sex,Released to Special Sentence,6JD,No,,,,,No,No Recidivism,Male
26017,2015,2018,White - Non-Hispanic,25-34,Aggravated Misdemeanor,Public Order,Traffic,Parole Granted,5JD,No,,,,,No,No Recidivism,Female
26018,2015,2018,White - Non-Hispanic,25-34,D Felony,Property,Theft,Paroled w/Immediate Discharge,5JD,No,,,,,Yes,No Recidivism,Male


**Any columns about new convictions or days to recidivism should be dropped for our initial model predicting recidivism, since they are only known after the outcome.**
- "New..", "Days to Recidivism"

In [8]:
drop_cols = list(filter(lambda x: 'new' in x ,df.columns))
drop_cols

['new_conviction_offense_classification',
 'new_conviction_offense_type',
 'new_conviction_offense_sub_type']

In [9]:
drop_cols.extend(list(filter(lambda x: 'recidivism' in x ,df.columns)))
drop_cols

['new_conviction_offense_classification',
 'new_conviction_offense_type',
 'new_conviction_offense_sub_type',
 'recidivism_reporting_year',
 'days_to_recidivism',
 'recidivism_type']

In [10]:
df = df.drop(columns=drop_cols)
df

Unnamed: 0,fiscal_year_released,race_ethnicity,age_at_release_,convicting_offense_classification,convicting_offense_type,convicting_offense_subtype,release_type,main_supervising_district,recividist,part_of_target_population,sex
0,2010,Black - Non-Hispanic,25-34,C Felony,Violent,Robbery,Parole,7JD,Yes,Yes,Male
1,2010,White - Non-Hispanic,25-34,D Felony,Property,Theft,Discharged – End of Sentence,,Yes,No,Male
2,2010,White - Non-Hispanic,35-44,B Felony,Drug,Trafficking,Parole,5JD,Yes,Yes,Male
3,2010,White - Non-Hispanic,25-34,B Felony,Other,Other Criminal,Parole,6JD,No,Yes,Male
4,2010,Black - Non-Hispanic,35-44,D Felony,Violent,Assault,Discharged – End of Sentence,,Yes,No,Male
...,...,...,...,...,...,...,...,...,...,...,...
26015,2015,White - Hispanic,Under 25,C Felony,Violent,Assault,Paroled to Detainer - INS,,No,Yes,Male
26016,2015,White - Non-Hispanic,35-44,C Felony,Violent,Sex,Released to Special Sentence,6JD,No,No,Male
26017,2015,White - Non-Hispanic,25-34,Aggravated Misdemeanor,Public Order,Traffic,Parole Granted,5JD,No,No,Female
26018,2015,White - Non-Hispanic,25-34,D Felony,Property,Theft,Paroled w/Immediate Discharge,5JD,No,Yes,Male
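
The keyword filter used above can be sketched in one pass. The column list below is a shortened, illustrative subset, and `leak_keywords` is a name introduced only for this example. Note that the target survives the `'recidivism'` filter only because it was renamed beforehand.

```python
# Illustrative subset of the column names at this point in the notebook.
columns = [
    "fiscal_year_released",
    "recidivism_reporting_year",
    "recividist",                  # renamed target (spelling as in the notebook)
    "days_to_recidivism",
    "new_conviction_offense_type",
    "recidivism_type",
    "sex",
]

# Substrings that flag columns recorded after, or derived from, the outcome.
leak_keywords = ("new", "recidivism")

drop_cols = [col for col in columns if any(key in col for key in leak_keywords)]
print(drop_cols)
```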


### Save original names vs short names in column_legend
- then map names onto columns

In [11]:
print(df.columns)

Index(['fiscal_year_released', 'race_ethnicity', 'age_at_release_',
       'convicting_offense_classification', 'convicting_offense_type',
       'convicting_offense_subtype', 'release_type',
       'main_supervising_district', 'recividist', 'part_of_target_population',
       'sex'],
      dtype='object')


In [12]:
# New short-hand names to use
colnames_short = ('yr_released','race_ethnicity',
                  'age_released','crime_class','crime_type',
                  'crime_subtype','release_type','super_dist',
                  'recidivist','target_pop','sex')

# Zipping the original and new into a renaming dictionary
column_legend = dict(zip(df.columns,colnames_short))
# Rename df with shorter names
df.rename(mapper=column_legend, axis=1, inplace=True)
df.head()

Unnamed: 0,yr_released,race_ethnicity,age_released,crime_class,crime_type,crime_subtype,release_type,super_dist,recidivist,target_pop,sex
0,2010,Black - Non-Hispanic,25-34,C Felony,Violent,Robbery,Parole,7JD,Yes,Yes,Male
1,2010,White - Non-Hispanic,25-34,D Felony,Property,Theft,Discharged – End of Sentence,,Yes,No,Male
2,2010,White - Non-Hispanic,35-44,B Felony,Drug,Trafficking,Parole,5JD,Yes,Yes,Male
3,2010,White - Non-Hispanic,25-34,B Felony,Other,Other Criminal,Parole,6JD,No,Yes,Male
4,2010,Black - Non-Hispanic,35-44,D Felony,Violent,Assault,Discharged – End of Sentence,,Yes,No,Male


In [13]:
# df.to_csv('iowa_recidivism_renamed.csv')

# SCRUB / EXPLORE
## EDA with Pandas_Profiling

In [None]:
# import pandas_profiling as pp

In [None]:
# pp.ProfileReport(df)
import missingno as miss
miss.matrix(df)

In [None]:
df['super_dist'].value_counts(normalize=True,dropna=False)

In [None]:
df['super_dist'] = df['super_dist'].fillna('missing')

In [None]:
miss.matrix(df)

In [None]:
df['release_type'].value_counts(normalize=True, dropna=False)

In [None]:
df['release_type'] = df['release_type'].fillna('missing')

In [None]:
help(df['race_ethnicity'].value_counts)

## ADDRESSING NULL VALUES

## 🎗BOOKMARK: KNN SUPERDIST

In [None]:
# check_null(df)
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# px.scatter_matrix(df)

**Results of Null Check**
- race_ethnicity has 30 (0.12% of data)
    -  drop
- age_released has 3 (0.01% of data)
    - drop
- sex has 3 (0.01% of data)
    - drop
- super_dist has 9581 (36.82% of data)
    - replace with "unknown"
- release_type has 1762 (6.77% of data)
    - drop
    
**Dropping all null values from age_released, race_ethnicity, sex, and release_type.**

In [None]:
# Dropping rows with NA's in age_released, race_ethnicity, sex, and release_type
df = df.dropna(subset=['age_released','race_ethnicity','sex','release_type'])
# Filling NA's in super_dist with "unknown"
df['super_dist'] = df['super_dist'].fillna("unknown")
check_null(df)
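
`check_null` comes from the bs_ds package, whose import is commented out earlier in this notebook. If that helper is unavailable, a rough stand-in can be written with plain pandas. The function name `null_summary` and the toy data are hypothetical, and the output format is not claimed to match bs_ds.

```python
import pandas as pd

def null_summary(frame):
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = frame.isna().sum()
    pct = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"n_null": counts, "pct_null": pct})

# Toy data for illustration only.
demo = pd.DataFrame({
    "super_dist": ["5JD", None, None, "7JD"],
    "sex": ["Male", "Female", "Male", None],
})
summary = null_summary(demo)
print(summary)
```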

___
## COMBINING AND REMAPPING CLASSES

### df['race_ethnicity']

In [None]:
check_unique(df,['race_ethnicity'])

- **Remapping race_ethnicity**
    - Because several race_ethnicity categories contain very few prisoners, the Hispanic and Non-Hispanic groups are combined into broader categories
    - Alternative approach of separating race and ethnicity into 2 separate features was rejected after modeling

In [None]:
# Defining Dictionary Map for race_ethnicity categories
race_ethnicity_map = {'White - Non-Hispanic':'White',
                        'Black - Non-Hispanic': 'Black',
                        'White - Hispanic' : 'Hispanic',
                        'American Indian or Alaska Native - Non-Hispanic' : 'American Native',
                        'Asian or Pacific Islander - Non-Hispanic' : 'Asian or Pacific Islander',
                        'Black - Hispanic' : 'Black',
                        'American Indian or Alaska Native - Hispanic':'American Native',
                        'White -' : 'White',
                        'Asian or Pacific Islander - Hispanic' : 'Asian or Pacific Islander',
                        'N/A -' : np.nan,
                        'Black -':'Black'}

# Replacing original race_ethnicity column with remapped one.
df['race_ethnicity'] = df['race_ethnicity'].map(race_ethnicity_map)

In [None]:
df.head()

In [None]:
check_unique(df,['race_ethnicity'])
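
One thing worth keeping in mind with dictionary remapping: `Series.map` returns NaN for any label missing from the dictionary, so an unlisted category silently becomes a null. The sketch below, using a made-up label and a shortened map, shows one way to list labels that would be lost.

```python
import pandas as pd

# Shortened, illustrative version of the remapping dictionary.
demo_map = {"White - Non-Hispanic": "White", "Black - Hispanic": "Black"}

raw = pd.Series(["White - Non-Hispanic", "Black - Hispanic", "Unlisted Label"])
mapped = raw.map(demo_map)

# Labels that were present in the raw data but have no entry in the map.
unmapped_labels = raw[mapped.isna() & raw.notna()].unique().tolist()
print(unmapped_labels)
```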

### df['crime_class']

- **Remapping crime_class**
    - 'Other Felony' and 'Other Felony (Old Code)' -> np.nan
    - 'Other Misdemeanor' -> np.nan
    - 'Felony - Mandatory Minimum' -> np.nan
    - 'Special Sentence 2005' -> 'Sex Offender'
    - 'Sexual Predator Community Supervision' -> 'Sex Offender'
    - 'Felony - Enhancement to Original Penalty' -> 'Felony - Enhanced'

In [None]:
check_unique(df,['crime_class'])

In [None]:
# Remapping
crime_class_map = {'Other Felony (Old Code)': np.nan ,#or other felony
                  'Other Misdemeanor':np.nan,
                   'Felony - Mandatory Minimum':np.nan, # if minimum then lowest sentence ==  D Felony
                   'Special Sentence 2005': 'Sex Offender',
                   'Other Felony' : np.nan ,
                   'Sexual Predator Community Supervision' : 'Sex Offender',
                   'D Felony': 'D Felony',
                   'C Felony' :'C Felony',
                   'B Felony' : 'B Felony',
                   'A Felony' : 'A Felony',
                   'Aggravated Misdemeanor':'Aggravated Misdemeanor',
                   'Felony - Enhancement to Original Penalty':'Felony - Enhanced',
                   'Felony - Enhanced':'Felony - Enhanced' ,
                   'Serious Misdemeanor':'Serious Misdemeanor',
                   'Simple Misdemeanor':'Simple Misdemeanor'}

df['crime_class'] = df['crime_class'].map(crime_class_map)

### df['age_released']

- **Encoding age groups as ordinal**

In [None]:
# Encoding age groups as ordinal
age_ranges = ('Under 25','25-34', '35-44','45-54','55 and Older')
age_codes = (0,1,2,3,4) 
# Zipping into Dictionary to Map onto Column
age_map = dict(zip(age_ranges,age_codes))

# Mapping age_map onto 'age_released'
df['age_released'] = df['age_released'].map(age_map)
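
An equivalent way to get the same ordinal codes, shown here only as an alternative sketch on toy values, is an ordered `pd.Categorical`, which keeps the category order explicit.

```python
import pandas as pd

age_ranges = ["Under 25", "25-34", "35-44", "45-54", "55 and Older"]
ages = pd.Series(["25-34", "Under 25", "55 and Older"])  # toy values

# Codes follow the position of each label in age_ranges.
age_codes = pd.Categorical(ages, categories=age_ranges, ordered=True).codes.tolist()
print(age_codes)
```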

### Remapping binary categories df[['recidivist','target_pop','sex']]

In [None]:
## Remapping binary categories

# Recidivist
recidivist_map = {'No':0,'Yes':1}
df['recidivist'] = df['recidivist'].map(recidivist_map)

# Target_pop
target_pop_map = {'No':0,'Yes':1}
df['target_pop'] = df['target_pop'].map(target_pop_map)

#sex_map
sex_map = {'Male':0,'Female':1}
df['sex'] = df['sex'].map(sex_map)

___
## ENGINEERING FEATURES
- **Engineering a simple 'felony' true false category**
- **Combining crime_type and crime_subtype into types_combined**

### Creating a simple 'felony' feature

In [None]:
# Engineering a simple 'felony' true/false category
df['felony'] = df['crime_class'].str.contains('felony',case=False)
# Combining crime_type and crime_subtype into crime_types_combined
df['crime_types_combined'] = df['crime_type']+'_'+df['crime_subtype']
# Combining crime_class, crime_type, and crime_subtype into a single label
df['crime_class_type_subtype']= df['crime_class']+'_'+df['crime_type']+'_'+df['crime_subtype']
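
Because some crime classes were remapped to NaN, `str.contains` yields a missing value for those rows rather than False. The toy sketch below shows the optional `na=False` argument, which is one way to keep the flag strictly boolean; whether that is preferable to dropping those rows later is a judgment call.

```python
import pandas as pd

crime_class = pd.Series(["D Felony", "Aggravated Misdemeanor", None])  # toy values

# Case-insensitive match; missing classes are treated as "not a felony".
felony_flag = crime_class.str.contains("felony", case=False, na=False)
print(felony_flag.tolist())
```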

### Creating a 'max_sentence' feature based on crime class max penalties
   

In [None]:
# Mapping years onto crime class
crime_class_max_sentence_map = {'A Felony': 75,  # Life (approximated as 75 years)
                                'Aggravated Misdemeanor': 2, # 2 years
                                'B Felony': 50, # 25 or 50 years
                                'C Felony': 10, # 10 years
                                'D Felony': 5,  # 5 years
                                'Felony - Enhanced': 10, # Add-on to class C and D felonies, hard to approximate.
                                'Serious Misdemeanor': 1, # 1 year
                                'Sex Offender': 10, # 10 years
                                'Simple Misdemeanor': 0.082} # 30 days (30/365 of a year)

# Mapping max_sentence_column
df['max_sentence'] =df['crime_class'].map(crime_class_max_sentence_map)
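
Since the map expresses every penalty in years, the 30-day maximum for a simple misdemeanor needs a unit conversion. A small check of that arithmetic, assuming a 365-day year:

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap years

# 30 days expressed as a fraction of a year.
simple_misdemeanor_years = round(30 / DAYS_PER_YEAR, 3)
print(simple_misdemeanor_years)
```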

### Dropping all  values replaced with np.nan

In [None]:
check_null(df)

In [None]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)  # drop=True avoids adding the old index as a new column
check_null(df)

In [None]:
df.to_csv('Iowa_recidivism_features_pre-processing.csv')

## Processing Chosen Feature Columns

In [None]:
# List of features to be analyzed as categories
category_cols = ['yr_released','race_ethnicity', 'crime_class',
                 'release_type','crime_type','crime_subtype',
                 'target_pop','sex','super_dist','felony']

# List of features to be analyzed as numbers
number_cols = ['max_sentence','age_released']

# Target feature
target_col = ['recidivist']

In [None]:
# Creating new dataframe ('df_to_split') to contain processed features for train_test_split
df_to_split=pd.DataFrame()

# MinMaxing Numerical Columns
from sklearn.preprocessing import MinMaxScaler
sca = MinMaxScaler()

for header in number_cols:
    print(header)
    data = np.array(df[header])
    res = sca.fit_transform(data.reshape(-1,1))
    df_to_split[header] = res.ravel()    

In [None]:
# Convert categories to cat.codes
for header in category_cols:
    df_to_split[header] = df[header].astype('category')
    df_to_split[header] = df_to_split[header].cat.codes
    
df_to_split.info()
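
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The NumPy sketch below reproduces that formula on toy sentence lengths; `min_max_scale` is a name made up for the example and is not the scikit-learn implementation.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    return (values - low) / (high - low)

# Toy max-sentence values in years.
scaled = min_max_scale([1, 2, 5, 10, 50, 75])
print(scaled.round(3))
```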

# FITTING AN INITIAL MODEL
## Surveying Potential Algorithms with bs_ds
- select_pca:
    - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bs_ds.select_pca
- thick_pipe:
    - https://bs-ds.readthedocs.io/en/latest/bs_ds.html#bs_ds.bs_ds.thick_pipe


In [None]:
# from bs_ds.bs_ds import select_pca, thick_pipe

In [None]:
# X =pd.get_dummies(df_to_split, columns=category_cols, drop_first=True)
# y = df['recidivist']

In [None]:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

#### Running select_pca to identify # of components that still explains 80% of variance

In [None]:
# bs.ihelp(select_pca)

In [None]:
# select_pca(X_train) #,n_components_list=[range(10, X_train.shape[1]-1)])

In [None]:
# # Running thick_pipe to test algorithms
# thick_pipe(X_train, y_train, n_components=17)

### >>> Fast-Forwarding through trial and error:
- Regardless of changes to preprocessing and feature engineering, accuracy scores never increased above 0.68
- One major concern was the vast majority of our features are categorical.
    - Therefore, we investigated using another Machine Learning package, **CatBoost**

## FITTING AN INITIAL MODEL USING CatBoostClassifier

In [None]:
df_to_split.head()

In [None]:
# Target feature
target_col = ['recidivist']
# Define X and y to split
y = df[target_col]
X = df_to_split#.drop(target_col)
# y = pd.Series(df[target_col].to_numpy().ravel())
# y.name = 'recidivist'

# List of features to be analyzed as categories
category_cols = ['yr_released','race_ethnicity', 'crime_class',
                 'release_type','crime_type','crime_subtype',
                 'target_pop','sex','super_dist','felony']

# List of features to be analyzed as numbers
number_cols = ['max_sentence','age_released']



In [None]:
# Split into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
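
The split above is random and unseeded. As a possible refinement (an assumption, not what the notebook ran), `stratify` keeps the recidivist proportion equal in both splits and `random_state` makes the split repeatable. The toy data below uses a 30% positive rate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30 of them labelled as recidivists.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 30 + [0] * 70, name="recidivist")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```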

In [None]:
# Import catboost Pool to create training and testing pools
from catboost import Pool, CatBoostClassifier

train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)

In [None]:
# Instantiating CatBoostClassifier 
cb_base = CatBoostClassifier(iterations=300, depth=12,
                            boosting_type='Ordered',
                            learning_rate=0.03,
                            thread_count=-1,
                            eval_metric='AUC',
                            silent=True,
                            allow_const_label=True)#,
#                             task_type='GPU')

In [None]:
# Fitting Initial CatBoost Model
cb_base.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=2)
cb_base.best_score_

### VISUAL SUMMARY OF BASE MODEL

In [None]:
# Plotting Feature Importances
important_feature_names = cb_base.feature_names_
important_feature_scores = cb_base.feature_importances_

important_features = pd.Series(important_feature_scores, index = important_feature_names)
important_features.sort_values().plot(kind='barh')

#### Defining Roc_Auc Curve

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve

# Define plot_auc_roc_curve
def plot_auc_roc_curve(y_test, y_test_pred):
    """ Takes y_test and y_test_pred from a ML model and plots the AUC-ROC curve."""
    auc = roc_auc_score(y_test, y_test_pred[:,1])

    FPr, TPr, thresh  = roc_curve(y_test, y_test_pred[:,1])
    plt.plot(FPr, TPr,label=f"AUC for CatboostClassifier:\n{round(auc,2)}" )

    plt.plot([0, 1], [0, 1],  lw=2,linestyle='--')
    plt.xlim([-0.01, 1.0])
    plt.ylim([0.0, 1.05])

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()

# Plot roc_auc_curve
y_test_pred = cb_base.predict_proba(X_test)
plot_auc_roc_curve(y_test, y_test_pred)

import itertools
from bs_ds.bs_ds import plot_confusion_matrix
y_test_pred = cb_base.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_test_pred)

plot_confusion_matrix(conf_matrix, classes=['Non-Recidivist', 'Recidivist'], normalize=True, cmap='Reds',
                      title='Confusion Matrix:\n CatBoost Recidivist Classification\n')

### Notes Following Initial Modeling:
- The ROC-AUC curve shows that our model performs better than chance.
- HOWEVER, there is a major issue with our confusion matrix.
    - There is an extremely high # of False Negatives (prisoners predicted to be "Non-Recidivist" who were actually "Recidivist").
    - This is a serious flaw in the model and seriously hinders real-world applicability.
- This may be due to the imbalance between recidivists and non-recidivists in our dataset.
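
The false-negative problem described above can be quantified directly from a confusion matrix. The counts below are invented for illustration and are not the notebook's results. The layout follows scikit-learn's convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Made-up counts: [[TN, FP], [FN, TP]]
conf = np.array([[900, 100],
                 [300, 200]])
tn, fp, fn, tp = conf.ravel()

recall = tp / (tp + fn)               # share of actual recidivists that were caught
false_negative_rate = fn / (tp + fn)  # share of actual recidivists that were missed
print(recall, false_negative_rate)
```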

# REVISING THE MODEL
## Balancing Target Classes Using Synthetic Minority Oversampling

In [None]:
# pause  # undefined name, previously used as a manual stop when running all cells

In [None]:
# # Define X and y 
# X = df_to_split
# y = pd.Series(df[target_col].to_numpy().ravel())
# y.name = 'recidivist'

In [None]:
# df=pd.concat([y,X], axis=1)

In [None]:
# import pandas_profiling as pp
# pp.ProfileReport(df)

### Addressing the Imbalanced Class Issue
- Adding Synthetic Minority Oversampling Technique to balance out the # of recidivists(1) and non-recidivists(0)

In [None]:
from imblearn.over_sampling import SMOTE

# print(pd.Series(y).value_counts())
print(y.value_counts())

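# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
X_resampled, y_resampled = SMOTE().fit_resample(X,y)

# np.ravel handles y_resampled whether it comes back as an array, Series, or one-column DataFrame
print(pd.Series(np.ravel(y_resampled)).value_counts())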
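
To make the oversampling step less of a black box: SMOTE creates a synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration with toy vectors and a single fixed neighbor; it is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class rows: [category code, scaled numeric feature]
sample = np.array([2.0, 0.25])
neighbor = np.array([5.0, 0.75])

# Pick a random point on the line segment between the two rows.
lam = rng.uniform(0.0, 1.0)
synthetic = sample + lam * (neighbor - sample)
print(synthetic)
```

Interpolation of this kind can produce fractional category codes, which is why the category columns are cast back to integers after resampling.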

In [None]:
# Reformatting SMOTE transformed data

# X_resampled back to a dataframe
X_resampled = pd.DataFrame(X_resampled, columns = X.columns)

# X_resampled category columns back to integers
for header in category_cols:
    X_resampled[header] = X_resampled[header].astype('int')
    
# y_resampled back to a named series    
y_resampled = pd.Series(np.ravel(y_resampled), name='recidivist')

### Fitting a Revised Model with Balanced Classes

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2)#0.4)

# List of features to be analyzed as categories
category_cols = ['yr_released','race_ethnicity', 'crime_class',
                 'release_type','crime_type','crime_subtype',
                 'target_pop','sex','super_dist','felony']

# List of features to be analyzed as numbers
number_cols = ['max_sentence','age_released']

# Target feature
target_col = ['recidivist']

In [None]:
from catboost import Pool, CatBoostClassifier
train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)

In [None]:
cb_clf = CatBoostClassifier(iterations=300, depth=10,
                            boosting_type='Ordered',
                            learning_rate=0.03,
                            thread_count=-1,
                            eval_metric='AUC',
                            silent=True,
                            allow_const_label=True)#,
#                             task_type='GPU')


In [None]:
cb_clf.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=5)
cb_clf.best_score_

_____________________________________________

In [None]:
# Plotting Feature Importances
important_feature_names = cb_clf.feature_names_
important_feature_scores = cb_clf.feature_importances_

important_features = pd.Series(important_feature_scores, index = important_feature_names)

important_features.sort_values().plot(kind='barh')

## Visual Summary

In [None]:
import shap 
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(cb_clf)
shap_values = explainer.shap_values(X_train)

In [None]:
import cufflinks as cf
cf.go_offline()

important_features.iplot(kind='bar',theme='ggplot')

### AUC-ROC Curve

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
y_test_pred = cb_clf.predict_proba(X_test)
plot_auc_roc_curve(y_test, y_test_pred)

### Confusion Matrix


In [None]:
# import itertools
# from bs_ds.bs_ds import plot_confusion_matrix
y_test_pred = cb_clf.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_test_pred)

plot_confusion_matrix(conf_matrix, classes=['Not-Recidivist', 'Recidivist'], normalize=True, cmap='Blues',
                      title='Confusion Matrix: CatBoost Recidivist Classification\n')

## Adding XGB and SHAP 09/05/19

In [None]:
display(X_resampled.head())
X_train.dtypes

In [None]:
# bs.column_report(X)
# X

In [None]:
import shap
shap.initjs()

In [None]:
X_train.dtypes

In [None]:
# for col in category_cols:
#     X[col] = X[col].astype('category')

In [None]:
X.head()

In [None]:
# explainer = shap.TreeExplainer(cb_clf)
# shap_values  = explainer.shap_values(X_train.drop('yr_released',axis=1))

In [None]:
from pandas_profiling import ProfileReport
ProfileReport(df)

In [None]:
from xgboost import XGBClassifier,plot_importance, plot_tree

xgb_clf = XGBClassifier()
xgb_clf.fit(X=X_train, y=y_train)

In [None]:
#catboost importance
important_features.sort_values().plot(kind='barh',title='Catboost')

In [None]:
xgb_clf.score(X_test,y_test)
plot_importance(xgb_clf,title="XGBoost")

In [None]:
# bs.evaluate_classification_model(xgb_clf, X_train=X_train, X_test=X_test, y_train=y_train, y_test = y_test)

# CONCLUSIONS

In [None]:
pause

- **After adjusting for imbalanced classes, the most important factors for determining recidivism are:**
    - **Age at Release**
    - **Supervising Judicial District**
    - **Release Type**
    - **Crime Subtype**
    
    
## Recommendations
- This model could be used to predict which prisoners due for release may be at the greatest risk for recidivism.<br><br>
    - Using this knowledge, the state of Iowa could put new programs into action that target those at high risk for recidivism and provide additional assistance and guidance following release.<br><br>
    - Counseling or education could also be offered _prior_ to release to give inmates tools and options for avoiding a return to crime.
    
# FUTURE DIRECTIONS
- With more time, and once performance is reliable, we would cross-validate the final model.<br><br>
- Additional visuals summarizing the underlying features' effects on recidivism.<br><br>
- Adapting more of the available visualization tools to better display the underpinnings of the model.
<br><br>
- Exploration of the predictability of the crime types committed by recidivists.
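
The cross-validation mentioned in the first bullet matters here because the classes are imbalanced: each fold should keep roughly the same share of recidivists as the full data. A minimal, dependency-free sketch of that stratified splitting follows. The function name and the toy labels are hypothetical. In scikit-learn this is the job of `StratifiedKFold`, which `cross_val_score` uses for classifiers when `cv` is an integer.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign row indices to folds so every fold keeps the overall class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Made-up target: 10 recidivists (1) and 20 non-recidivists (0)
toy_labels = [1] * 10 + [0] * 20
toy_folds = stratified_folds(toy_labels, n_splits=5)

for fold in toy_folds:
    positives = sum(toy_labels[i] for i in fold)
    print(len(fold), positives)
```

Each fold would serve once as the held-out test set while the model is refit on the remaining folds.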

## Adding SHAP

In [None]:
import shap
shap.initjs()
explainer = shap.TreeExplainer(xgb_clf)

In [None]:
shap_values  = explainer.shap_values(X=X_train,y=y_train)
shap_interaction_values = explainer.shap_interaction_values(X_train)

In [None]:
# visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
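
A force plot is read as a sum: the base value plus the per-feature SHAP values equals the model output for that row (for `TreeExplainer` on an XGBoost classifier, that output is in log-odds units by default). The sketch below illustrates this additivity with exact Shapley values for a tiny hand-written scoring function. It is a simplification, not SHAP's tree algorithm: it uses a single hypothetical baseline row in place of the averaged `expected_value`, and every name and number in it is made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row; 'absent' features take their baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi += weight * (value(present | {i}) - value(present))
        phis.append(phi)
    return phis


# Hypothetical scoring function with an interaction between the first and third features
def toy_model(row):
    age, district, release = row
    return 0.5 * age + 2.0 * district + 1.5 * age * release


toy_row = [1.0, 3.0, 2.0]
toy_baseline = [0.0, 1.0, 0.0]
toy_phis = shapley_values(toy_model, toy_row, toy_baseline)

# Additivity: base value + sum of contributions reproduces the prediction
reconstructed = toy_model(toy_baseline) + sum(toy_phis)
print(toy_phis, reconstructed, toy_model(toy_row))
```

The interaction term is split evenly between the two features that share it, which is the same idea `shap_interaction_values` exposes pair by pair.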

In [None]:
import xgboost as xgb  # imported here so this cell runs before the later import cell
xgb.to_graphviz(xgb_clf)

In [None]:
ihelp(xgb.to_graphviz)


In [None]:
import graphviz as gv
gv.Digraph()

In [None]:
import xgboost as xgb

ax = xgb.plot_tree(xgb_clf,**{'dpi':'300'})


In [None]:
plot_tree(xgb_clf)


In [None]:
print(shap_interaction_values.shape)#[0].shape
print(shap_interaction_values[0].shape)
shap_interaction_values[0]

In [None]:
shap_values.reshape(2,-1)


In [None]:
df_shap = pd.DataFrame(shap_values, columns=X_train.columns, index=X_train.index)
# df_shap = pd.Series([x for x in shap_values])
df_shap.head()

In [None]:
shap.summary_plot(shap_values,features=X_train)
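
The summary plot orders features by the overall magnitude of their SHAP values. A minimal sketch of that ranking, using the mean absolute SHAP value per feature, is shown below. The helper name and the numbers are hypothetical, and the column names are only borrowed from this notebook as labels.

```python
def rank_by_mean_abs(shap_rows, feature_names):
    """Rank features by mean |SHAP value| across rows, largest first."""
    n_rows = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, contribution in enumerate(row):
            totals[j] += abs(contribution)
    means = {name: total / n_rows for name, total in zip(feature_names, totals)}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up SHAP values: one row per prisoner, one column per feature
toy_features = ['age_released', 'super_dist', 'release_type']
toy_shap = [
    [0.4, -0.1, 0.0],
    [-0.6, 0.2, 0.1],
    [0.2, -0.3, -0.1],
]

toy_ranking = rank_by_mean_abs(toy_shap, toy_features)
for name, score in toy_ranking:
    print(name, round(score, 3))
```

With the `df_shap` frame built above, the pandas equivalent would be `df_shap.abs().mean().sort_values(ascending=False)`.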

In [None]:
from ipywidgets import interact, interactive, interactive_output

#explainer.shap_interaction_values(X_train)
list_cols = list(X_train.columns)

@interact(X=list_cols, y=list_cols)
def show__dependence_plot(X='super_dist', y='age_released'):
    shap.dependence_plot(X,shap_values=shap_values,features=X_train, interaction_index=y)

In [None]:
shap.summary_plot(shap_interaction_values, X_train)

In [None]:
from ipywidgets import interact, interactive, interactive_output

#explainer.shap_interaction_values(X_train)
list_cols = list(X_train.columns)

@interact(X=list_cols, y=list_cols)
def show_interaction_plot(X='age_released', y='super_dist'):
    # Plot the SHAP interaction values between the two selected features
    shap.dependence_plot((X, y), shap_interaction_values, X_train)

### Notable comparisons

**1. y='age_released', X=supervising district**

In [None]:
from ipywidgets import interact
col_list = list(X_train.columns)
@interact(plot_column=col_list)
def plot_shap_values(plot_column=col_list):
    # dependence_plot draws on its own figure, so no separate plt.figure() is needed
    shap.dependence_plot(plot_column, shap_values, X_train)

In [None]:
# plot_tree(xgb_clf)
bs.viz_tree(xgb_clf)

# 🧧 BOOKMARK Raw Dataset 

In [None]:
try: 
    df
    print("df exists, renaming.")
    df_to_model = df.copy()
except NameError:
    print('No "df" currently exists')

In [None]:
import bs_ds as bs
from bs_ds.imports import *

In [None]:
# Dataset Links
full_all_prisoners_file = "datasets/FULL_3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa.csv"
df = pd.read_csv(full_all_prisoners_file)


## DROPPING IRRELEVANT COLUMNS
drop_cols = ['Fiscal Year Released', 'Days to Recidivism', 'New Conviction Offense Classification',
 'New Conviction Offense Type', 'New Conviction Offense Sub Type', 'Recidivism Type']
df.drop(drop_cols,axis=1,inplace=True)

display(df.head())

In [None]:
from feature_remapping import *
bs.dict_dropdown(remapping_dict)

In [None]:
df.rename(mapper=remapping_dict['columns'],
         axis=1, inplace=True)

from pandas_profiling import ProfileReport
ProfileReport(df)

In [None]:
remapping_dict.keys()

In [None]:
## REMAPPING FEATURE VALUES
for col in ['race_ethnicity', 'crime_class', 'age_released', 'recidivist',
           'target_pop', 'sex','report_year']:
    if col in remapping_dict.keys():
        df[col] = df[col].map(remapping_dict[col])
df.head()

In [None]:
## CREATING FEATURES
df['max_sentence'] =df['crime_class'].map(remapping_dict['max_sentence'])

df['felony'] = df['crime_class'].str.contains('felony',case=False)
# Combining crime_type and crime_subtype into crime_types_combined
df['crime_types_combined'] = df['crime_type']+'_'+df['crime_subtype']
# Combining crime_class, crime_type, and crime_subtype into crime_class_type_subtype
df['crime_class_type_subtype']= df['crime_class']+'_'+df['crime_type']+'_'+df['crime_subtype']


In [None]:
df['crime_types_combined'].value_counts()

In [None]:
qdf = bs.column_report(df)
qdf

### Using python package `category_encoders`

In [None]:
df.head()

In [None]:
df.info()

In [None]:
category_cols = df.dtypes=='object'
category_cols

In [None]:
# ordinal_cols is needed by the ordinal-encoder cell below
ordinal_cols = ['age_released','report_year','sex','target_pop','recidivist']
# category_cols = ['race_ethnicity', 'crime_class',
onehot_cols = ['crime_class','crime_type', 'crime_subtype', 'release_type', 'super_dist', 'race_ethnicity',]
# binary_cols = [



In [None]:
import category_encoders as ce
## Make ordinal map for the encoder 
ordinal_map =[]
for col in ordinal_cols:
    ordinal_map.append({'col':col, 'mapping':remapping_dict[col]})
    
ord_encoder = ce.OrdinalEncoder(cols=ordinal_cols,mapping=ordinal_map)
ordinal_map

In [None]:
df_to_model = ord_encoder.fit_transform(df)
df_to_model.head()

In [None]:
onehot_encoder = ce.OneHotEncoder(cols=onehot_cols,use_cat_names=True)

In [None]:
df_to_model = onehot_encoder.fit_transform(df_to_model)
df_to_model.head()
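
The two encoders above do different things: the ordinal encoder replaces each category with an integer from an explicit mapping, while the one-hot encoder expands a column into one 0/1 column per category, named `<column>_<category>` when `use_cat_names=True`. A minimal, dependency-free sketch of both ideas follows. The helper names, category labels, and mapping are hypothetical, and the column order (first appearance) is an assumption about how the output is laid out.

```python
def ordinal_encode(values, mapping):
    """Replace each category with its integer code from an explicit mapping."""
    return [mapping[value] for value in values]


def one_hot_encode(values, column):
    """Expand one categorical column into 0/1 columns named '<column>_<category>'."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)  # keep order of first appearance
    names = [f"{column}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return names, rows


# Hypothetical ordered age bands and their codes
toy_age_map = {'Under 25': 0, '25-34': 1, '35-44': 2}
toy_ages = ['25-34', 'Under 25', '35-44']
toy_age_codes = ordinal_encode(toy_ages, toy_age_map)

# Hypothetical unordered release types
toy_release = ['Parole', 'Discharge', 'Parole']
toy_names, toy_rows = one_hot_encode(toy_release, 'release_type')

print(toy_age_codes)
print(toy_names)
print(toy_rows)
```

Ordinal codes suit columns with a natural order such as age bands, while one-hot columns avoid implying an order among categories such as release types or districts.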

In [None]:
df_to_model.info()

In [None]:
## preprocessing notes:
#1) for ordinal categories(like age_released), 

In [None]:
y= df_to_model.pop('recidivist')
X= df_to_model

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(max_depth=4)
scores = cross_val_score(xgb_clf,X=X, y=y,cv=5)
scores

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
xgb_clf.fit(X_train, y_train)

In [None]:
fig,ax  = plt.subplots(figsize=(12,8))
xgb.plot_importance(xgb_clf,ax=ax)

In [None]:
# ## Using ColumnTransformer
# from sklearn.model_selection import TimeSeriesSplit,train_test_split, GridSearchCV,cross_val_score,KFold
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler, OneHotEncoder,MinMaxScaler
# from sklearn.compose import ColumnTransformer, make_column_transformer 
# from sklearn.pipeline import make_pipeline
# from sklearn.model_selection import train_test_split
# from sklearn.impute import SimpleImputer
# df_to_model = df_full.copy()

# import xgboost as xgb
# target_col= 'recidivist'#'Recidivism - Return to Prison'
# ## Sort_index
# cols_to_drop =[]
# # cols_to_drop.append(target_col)

# # df_to_model.drop(cols_to_drop,axis=1,inplace=True)

# features = df_to_model.drop(cols_to_drop, axis=1)
# target = df_to_model[target_col]



# ## Get boolean masks for which columns to use
# numeric_cols = features.dtypes=='float'
# category_cols = ~numeric_cols
# # target_col = df_to_model.columns=='price_shifted'


# price_transformer = Pipeline(steps=[
#     ('scaler',MinMaxScaler())
# ])


# ## define pipeline for preparing numeric data
# numeric_transformer = Pipeline(steps=[
# #     ('imputer',SimpleImputer(strategy='median')),
#     ('scaler',MinMaxScaler())
# ])

# category_transformer = Pipeline(steps=[
# #     ('imputer',SimpleImputer(missing_values=np.nan,
# #                              strategy='constant',fill_value='missing')),
#     ('onehot',OneHotEncoder(handle_unknown='ignore'))
# ])


# ## define pipeline for preparing categorical data
# preprocessor = ColumnTransformer(remainder='passthrough',
#                                  transformers=[
#                                      ('num',numeric_transformer, numeric_cols),
#                                      ('cat',category_transformer,category_cols)])

In [None]:
# ### ADDING MY OWN TRANSFORMATION SO FEATURE IMPORTANCE CAN BE USED
# df_tf =pd.DataFrame()
# num_cols_list = numeric_cols[numeric_cols==True]
# num_cols_list = [x for x in num_cols_list if 'recidivist' not in x]
# cat_cols_list = category_cols[category_cols==True]
# # num_cols = df_to_model.columns
# print(num_cols_list)
# print(cat_cols_list)

In [None]:
for col in df_to_model.columns:
    
    if col in num_cols_list:
        print(f'{col} is numeric')
        vals = df_to_model[col].values
        tf_num = numeric_transformer.fit_transform(vals.reshape(-1,1))
        
        try:
            df_tf[col] = tf_num.flatten()
            print(f"{col} added")
        except Exception as err:
            # Report what went wrong instead of silently swallowing every error
            print(f'Error: {err}')
            print(tf_num.shape)
#             print(tf_num[:10])
        
    if col in cat_cols_list:
        print(f'{col} is categorical')
#         colnames=[]
#         vals = df_to_model[col].values
#         print(vals.shape)
#         tf_cats = category_transformer.fit_transform(vals.reshape(-1,1))
#         print(tf_cols.shape)
#         print(col,'\n',tf_cats)
        
#         [colnames.append(f"{col}_{i}") for i in range(tf_cats.shape[1])]
#         print(colnames)
        
        df_temp = pd.get_dummies(df_to_model[col])#DataFrame(data=tf_cats[:],index=df_to_model.index)
#         display(df_temp.head())
#         df_temp.columns = 
#         colnames = [for i in range(tf_cols.shape[1])]
        df_tf = pd.concat([df_tf,df_temp],axis=1)

#     ('target',price_transformer,target_col)])
    

# reg = Pipeline(steps=[('preprocessor',preprocessor),
#                      ('regressor',xgb.XGBRegressor(random_state=42))])
df_tf.head()

In [None]:
df_full.index