# Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# Iowa Prisoner Recidivism Data

- Source: https://www.kaggle.com/slonnadube/recidivism-for-offenders-released-from-prison
- **Statistics about recidivism among prisoners over a 3-year tracking period**
- **Target:**
    - Recidivism - Return to Prison
- **Features:**
    - Fiscal Year Released
    - Recidivism Reporting Year
    - Race - Ethnicity
    - Age At Release
    - Convicting Offense Classification
    - Convicting Offense Type
    - Convicting Offense Subtype
    - Main Supervising District
    - Release Type
    - Release type: Paroled to Detainer united
    - Part of Target Population

# OSEMN Model

1. **OBTAIN:**
    - **Import data, inspect, check for datatypes to convert and null values**<br>
        - Display header and info
        - Drop any unneeded columns (df.drop(['col1','col2'],axis=1))

2. **SCRUB: cast data types, identify outliers, check for multicollinearity, normalize data**<br>
    - Check and cast data types
        - [x] Check for numbers that are stored as objects (df.info())
            - when converting to numbers, look for odd values (like many 0's), or strings that can't be converted
            - Decide how to deal with weird/null values (df['col'].unique(), df.isna().sum(), df.describe() min/max, etc.)
        - [x]  Check for categorical variables stored as integers
    - [x] Check for missing values  (df.isna().sum())
        - Can drop rows or columns
        - For missing numeric data: fill with the median, or bin/convert to categorical
        - For missing categorical data: make NaN own category OR replace with most common category
    - [x] Check for multicollinearity
        - A good rule of thumb: a correlation above 0.75 is high; remove the variable that is highly correlated with the largest number of other variables
    - [ ] Normalize data (may want to do after some exploring)
        - Most popular is Z-scoring (but won't fix skew) 
        - Can log-transform to fix skewed data
    
            
3. **EXPLORE: Check distributions, outliers, etc.**
    - [ ] Check scales, ranges (df.describe())
    - [ ] Check histograms to get an idea of distributions (df.hist()) and data transformations to perform
        - Can also do kernel density estimates
    - [ ] Use scatterplots to check for linearity and possible categorical variables (df.plot(kind='scatter', x='col1', y='col2'))
        - categoricals will look like vertical lines
    - [ ] Use pd.plotting.scatter_matrix to visualize possible relationships
    - [ ] Check for linearity

   
4. **FIT AN INITIAL MODEL:** 
    - Various forms, detail later...
    - **Assessing the model:**
        - Assess parameters (slope,intercept)
        - Check if the model explains the variation in the data (RMSE, F, R_square)
        - *Are the coeffs, slopes, intercepts in appropriate units?*
        - *What's the impact of collinearity? Can we ignore it?*
5. **Revise the fitted model**
    - Multicollinearity is a big issue for linear regression, and it cannot be fully removed
    - Use the predictive ability of model to test it (like R2 and RMSE)
    - Check for missed non-linearity
6. **Holdout validation / Train/test split**
    - use sklearn train_test_split 
___
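
The multicollinearity rule of thumb from the SCRUB step (flag pairs whose absolute correlation exceeds 0.75) can be sketched with plain pandas. This is only an illustration: the toy DataFrame and the helper name `high_corr_pairs` are hypothetical and are not part of the notebook or of `bs_ds`.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.75):
    """Return (col_a, col_b, corr) for pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Toy data (hypothetical): 'b' is nearly a rescaled copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Only the (`a`, `b`) pair should be flagged here; which of the two columns to drop is still a judgment call.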

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-3-final-project-online-ds-ft-021119/master/districtmap09122014.jpg" width=800>

### **The variables in the data set:**

- Fiscal Year Released
    - Fiscal year (year ending June 30) for which the offender was released from prison.

- Recidivism Reporting Year 
    - Fiscal year (year ending June 30) that marks the end of the 3-year tracking period. For example, offenders exited prison in FY 2012 are found in recidivism reporting year FY 2015.

- Race - Ethnicity 
    - Offender's Race and Ethnicity

- Convicting Offense Classification 
    - Maximum penalties: A Felony = Life; B Felony = 25 or 50 years; C Felony = 10 years; D Felony = 5 years; Aggravated Misdemeanor = 2 years; Serious Misdemeanor = 1 year; Simple Misdemeanor = 30 days

- Convicting Offense Type
    - General category for the most serious offense for which the offender was placed in prison.

- Convicting Offense Subtype 
    - Further classification of the most serious offense for which the offender was placed in prison.

- Release Type 
    - Reasoning for Offender's release from prison.

- Main Supervising District 
    - The Judicial District supervising the offender for the longest time during the tracking period.

- Recidivism - Return to Prison 
    - No = No Recidivism; Yes = Prison admission for any reason within the 3-year tracking period

- Days to Recidivism 
    - Number of days it took before the offender returned to prison.

- New Conviction Offense Classification
    - The same as the initial offense classification.

- New Conviction Offense Type
    - The same as the initial offense type.

- New Conviction Offense Sub Type
    - The same as the initial offense subtype.

- Part of Target Population 
    - The Department of Corrections has undertaken specific strategies to reduce recidivism rates for prisoners who are on parole and are part of the target population.
    ___

# Importing Packages and Loading in the Dataset

In [None]:
# Import custom python package BroadSteel DataScience (bs_ds_)
from bs_ds.imports import *
from bs_ds.bamboo import inspect_df, check_null, check_unique, check_column, check_numeric, big_pandas, ignore_warnings

In [None]:
# Enabling full-sized dataframes and info rows
big_pandas()

# Turning off warnings for function deprecations
ignore_warnings()

In [None]:
# Dataset Links
# mike_csv ='3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa.csv'
# mike_csv = 'No_numbers.csv'
# mike_enhanced_df = 'Updated_class.csv'

# all_prisoners_url = 'https://raw.githubusercontent.com/jirvingphd/dsc-3-final-project-online-ds-ft-021119/master/dataset/3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa_elaborated.csv'
# all_prisoners_file = "datasets/3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa_elaborated.csv"
full_all_prisoners_file = "datasets/FULL_3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa.csv"
# only_repeat_criminals_w_new_crime_url = "https://raw.githubusercontent.com/jirvingphd/dsc-3-final-project-online-ds-ft-021119/master/dataset/prison_recidivists_with_recidivism_type_only.csv"
only_repeat_criminals_w_new_crime_file= "datasets/prison_recidivists_with_recidivism_type_only.csv"

In [None]:
# Will be using the all_prisoners file to predict recidivism
df = pd.read_csv(full_all_prisoners_file)
# df_enhanced_col = pd.read_csv(mike_enhanced_df, index_col=0)

In [None]:
# df_enhanced_col.columns

In [None]:

inspect_df(df)

**Any columns about new convictions or days to recidivism should be dropped for our initial model predicting recidivism, since they are only known after the outcome.**
- "New...", "Days to Recidivism", "Recidivism Type"

In [None]:
from bs_ds.bamboo import drop_cols
df = drop_cols(df, ['New','Days','Recidivism Type'])

In [None]:
df.info()

In [None]:
inspect_df(df)

### Save original names vs short names in column_legend
- then map names onto columns

In [None]:
print(df.columns)

In [None]:
# New short-hand names to use
colnames_short = ('yr_released','report_year','race_ethnicity','age_released','crime_class','crime_type','crime_subtype','release_type','super_dist','recidivist','target_pop','sex')

# Zipping the original and new into a renaming dictionary
column_legend = dict(zip(df.columns,colnames_short))
# Rename df with shorter names
df.rename(mapper=column_legend, axis=1, inplace=True)
df.head()

## ADDRESSING NULL VALUES

In [None]:
check_null(df)

**Results of Null Check**
- race_ethnicity has 30 (0.12% of data)
    -  drop
- age_released has 3 (0.01% of data)
    - drop
- sex has 3 (0.01% of data)
    - drop
- super_dist has 9581 (36.82% of data)
    - replace
    - with...
- release_type has 1762 (6.77% of data)
    - drop

- **Dropping all null values from age_released, race_ethnicity, and release_type.**
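
The counts and percentages listed above come from the custom `check_null` helper. A rough pandas-only equivalent is sketched below; `null_summary` is a hypothetical name, the toy values are made up, and the real helper's output format may differ.

```python
import pandas as pd


def null_summary(frame):
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_null": counts,
        "pct_null": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["n_null"] > 0]


# Toy frame (hypothetical values) with the same kind of gaps as the real data
toy = pd.DataFrame({
    "race_ethnicity": ["White", None, "Black", "White"],
    "super_dist": [None, None, "5JD", "1JD"],
    "recidivist": ["Yes", "No", "No", "Yes"],
})
summary = null_summary(toy)
print(summary)
```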

In [None]:
# Dropping null values from 'age_released', 'race_ethnicity', 'sex', and 'release_type' (currently commented out)
# df.dropna(subset=['age_released','race_ethnicity','sex','release_type'],inplace=True)

# pause
# For mike's data
# df['super_dist'].fillna("unknown", inplace=True)
# df.dropna(inplace=True)

In [None]:
check_null(df)

**Results of Null Check**
- super_dist has 9549 (36.75% of data)
    - replace
    - with 'unknown'
- release_type has 1762 (6.78% of data)
    - replace  (_could_ consider dropping)
    - with 'unknown'    

In [None]:
# Investigating values of classes with large # of NaN's
check_unique(df, columns=['super_dist']) #,'target_pop','crime_type','crime_class'])

- **Filling super_dist NaNs with "unknown"**

In [None]:
# Filling NAs in super_dist with "unknown" (release_type is not filled in this cell)
df['super_dist'] = df['super_dist'].fillna("unknown")
check_null(df)

## COMBINING AND REMAPPING CLASSES

- **Remapping race_ethnicity**
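    - Combining the Hispanic and Non-Hispanic variants of each race; 'White - Hispanic' is the exception and becomes 'Hispanic', and 'N/A -' becomes NaN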

In [None]:
# Remapping race_ethnicity
race_ethnicity_map = {'White - Non-Hispanic':'White',
                        'Black - Non-Hispanic': 'Black',
                        'White - Hispanic' : 'Hispanic',
                        'American Indian or Alaska Native - Non-Hispanic' : 'American Native',
                        'Asian or Pacific Islander - Non-Hispanic' : 'Asian or Pacific Islander',
                        'Black - Hispanic' : 'Black',
                        'American Indian or Alaska Native - Hispanic':'American Native',
                        'White -' : 'White',
                        'Asian or Pacific Islander - Hispanic' : 'Asian or Pacific Islander',
                        'N/A -' : np.nan,
                        'Black -':'Black'}
df['race_ethnicity'] = df['race_ethnicity'].map(race_ethnicity_map)
df.info()

- **Remapping crime_class**
    - 'Other Felony' and 'Other Felony (Old Code)' -> np.nan
    - 'Other Misdemeanor' -> np.nan
    - 'Felony - Mandatory Minimum' -> np.nan
    - 'Special Sentence 2005' -> 'Sex Offender'
    - 'Sexual Predator Community Supervision' -> 'Sex Offender'
    - 'Felony - Enhancement to Original Penalty' -> 'Felony - Enhanced'

In [None]:
crime_class_map = {'Other Felony (Old Code)': np.nan ,#or other felony
                  'Other Misdemeanor':np.nan,
                   'Felony - Mandatory Minimum':np.nan, # if minimum then lowest sentence ==  D Felony
                   'Special Sentence 2005': 'Sex Offender',
                   'Other Felony' : np.nan ,
                   'Sexual Predator Community Supervision' : 'Sex Offender',
                   'D Felony': 'D Felony',
                   'C Felony' :'C Felony',
                   'B Felony' : 'B Felony',
                   'A Felony' : 'A Felony',
                   'Aggravated Misdemeanor':'Aggravated Misdemeanor',
                   'Felony - Enhancement to Original Penalty':'Felony - Enhanced',
                   'Felony - Enhanced':'Felony - Enhanced' ,
                   'Serious Misdemeanor':'Serious Misdemeanor',
                   'Simple Misdemeanor':'Simple Misdemeanor'}

df['crime_class'] = df['crime_class'].map(crime_class_map)
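
The map above lists every class, including those that map to themselves, because `Series.map` with a dictionary returns NaN for any value that is not a key. A minimal sketch with made-up values (the label `'Not In Map'` is hypothetical):

```python
import numpy as np
import pandas as pd

toy_class = pd.Series(["D Felony", "Special Sentence 2005", "Other Felony", "Not In Map"])
toy_map = {
    "D Felony": "D Felony",                   # identity entry keeps the value
    "Special Sentence 2005": "Sex Offender",  # merged into another class
    "Other Felony": np.nan,                   # explicitly marked for dropping later
}

# 'Not In Map' has no key, so it silently becomes NaN as well
mapped = toy_class.map(toy_map)
print(mapped.tolist())
```

Comparing `value_counts()` before and after a remap is a quick way to catch categories that were dropped unintentionally.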

- **Encoding age groups as ordinal**

In [None]:
# Encoding age groups as ordinal
age_ranges = ('Under 25','25-34', '35-44','45-54','55 and Older')
age_codes = (0,1,2,3,4) 
# Zipping into Dictionary to Map onto Column
age_map = dict(zip(age_ranges,age_codes))

# Mapping age_map onto 'age_released'
df['age_released'] = df['age_released'].map(age_map)

- **Remapping binary categories ( recidivist, target_pop, sex)**

In [None]:
## Remapping binary categories

# Recidivist
recidivist_map = {'No':0,'Yes':1}
df['recidivist'] = df['recidivist'].map(recidivist_map)

# Target_pop
target_pop_map = {'No':0,'Yes':1}
df['target_pop'] = df['target_pop'].map(target_pop_map)

# Sex
sex_map = {'Male':0,'Female':1}
df['sex'] = df['sex'].map(sex_map)

#### Remapping release_type
**DECIDED NOT TO SINCE THIS COULD BE AN IMPORTANT LEVEL OF NUANCE**
- Combine Parole Grant and Parole and Paroled w/ Immediate Discharge [?]
- Combine Discharged - End of Sentence and Discharged - Expiration of Sentence
- **unknown...?** (keeping but consider dropping)
- Combine Released to Special Sentence and Special Sentence
- Combine Paroled to Detainer - Out of State, Paroled to Detainer - INS, Paroled to Detainer - U.S. Marshall, Paroled to Detainer - Iowa. 

In [None]:
release_type_map = {'Parole': 'Paroled',
                    'Discharged – End of Sentence': 'Discharged - End of Sentence',
                    'Special Sentence':'Special Sentence',
                    'Parole Granted': 'Paroled',
                    'Discharged - Expiration of Sentence' : 'Discharged - End of Sentence',
                    'Paroled w/Immediate Discharge': 'Paroled',
                    'Paroled to Detainer - Iowa':'Paroled to Detainer',
                    'Paroled to Detainer - U.S. Marshall':'Paroled to Detainer',
                    'Paroled to Detainer - Out of State':'Paroled to Detainer',
                    'Released to Special Sentence':'Special Sentence',
                    'Paroled to Detainer - INS':'Paroled to Detainer',
                    'unknown':np.nan}

In [None]:
df['release_type_map'] = df['release_type'].map(release_type_map)
df['release_type_map'].value_counts()

## Engineering Features
- **Engineering a simple 'felony' True/False category**
- **Combining crime_type and crime_subtype into types_combined**

In [None]:
df.dtypes

In [None]:
# Engineering a simple 'felony' True/False category
df['felony'] = df['crime_class'].str.contains('felony',case=False)
# Combining crime_type and crime_subtype into crime_types_combined
df['crime_types_combined'] = df['crime_type']+'_'+df['crime_subtype']
# Combining crime_class, crime_type, and crime_subtype into crime_class_type_subtype
df['crime_class_type_subtype']= df['crime_class']+'_'+df['crime_type']+'_'+df['crime_subtype']

- **Creating a 'max_sentence' feature based on crime class**
   

In [None]:
# Mapping years onto crime class
crime_class_max_sentence_map = {'A Felony': 75,  # Life, approximated as 75 years
                                'Aggravated Misdemeanor': 2, # 2 years
                                'B Felony': 50, # 25 or 50 years; using the upper bound
                                'C Felony': 10, # 10 years
                                'D Felony': 5,  # 5 years
                                'Felony - Enhanced': 10, # Add-on to class C and D felonies; hard to approximate
                                'Serious Misdemeanor': 1, # 1 year
                                'Sex Offender': 10, # 10 years
                                'Simple Misdemeanor': 30/365} # 30 days, expressed in years (~0.08)

# Mapping max_sentence_column
df['max_sentence'] =df['crime_class'].map(crime_class_max_sentence_map)
# df['max_sentence'].value_counts().sort_index()

# Display new value counts
for col in ['recidivist','target_pop','sex']:
    print(f"{col} value counts:\n {df[col].value_counts()}")

#### Dropping all values replaced with np.nan

In [None]:
check_null(df)

In [None]:
# df.to_csv('iowa_recidivism_df_with_james_features_all_rows.csv')

In [None]:
df.dropna(inplace=True)
# drop=True avoids adding the old index as a new 'index' column
df.reset_index(drop=True, inplace=True)
check_null(df)

## effectml.com using grid search to optimize CatBoost

In [None]:
category_cols = ['yr_released','race_ethnicity',
                 'crime_class','release_type','crime_type','crime_subtype',
                 'target_pop','sex','super_dist','felony'] #,'release_type_map','crime_class_type_subtype',]
number_cols = ['max_sentence','age_released']
target_col = ['recidivist']

In [None]:
df_to_split=pd.DataFrame()
#adding in scaling for numeric data
from sklearn.preprocessing import MinMaxScaler
sca = MinMaxScaler()

for header in number_cols:
    print(header)
    data = np.array(df[header])
    res = sca.fit_transform(data.reshape(-1,1))
    df_to_split[header] = res.ravel()
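
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as (x - min) / (max - min). The NumPy sketch below reproduces that formula on made-up sentence lengths, purely to show what the loop above does to each numeric column:

```python
import numpy as np

# Hypothetical max_sentence values, in years
max_sentence = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1
col_min, col_max = max_sentence.min(), max_sentence.max()
scaled = (max_sentence - col_min) / (col_max - col_min)
print(scaled)
```

Because the scaler is refit on each pass through the loop, every column is scaled against its own minimum and maximum.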
    

In [None]:
# Convert categories to codes
for header in category_cols:
    df_to_split[header] = df[header].astype('category').cat.codes
    
# df_to_split=pd.concat([df_to_split, df[number_cols]],axis=1)
df_to_split.info()
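
As a quick illustration of what `.astype('category').cat.codes` produces: for string labels, pandas sorts the distinct values and each code is the position of the label in that sorted list. The labels below are made up for the example.

```python
import pandas as pd

toy = pd.Series(["Parole", "Discharged", "Parole", "Special Sentence"])
as_cat = toy.astype("category")
codes = as_cat.cat.codes

# Distinct labels in sorted order; a code is an index into this list
categories = list(as_cat.cat.categories)
print(categories)
print(codes.tolist())

# The original labels can be recovered from the codes
recovered = list(as_cat.cat.categories.take(codes.to_numpy()))
print(recovered)
```

The codes carry no meaningful order for nominal features such as release type, which is one reason to pass them to CatBoost as categorical features rather than as numbers.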

### Add in SMOTE to balance out the target variable

In [None]:
check_unique(df_to_split)

In [None]:
X = df_to_split
y = pd.Series(df[target_col].to_numpy().ravel())

In [None]:
df=[]

In [None]:
from imblearn.over_sampling import SMOTE, ADASYN
print(pd.Series(y).value_counts())
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_resampled, y_resampled = SMOTE().fit_resample(X,y)
print(pd.Series(y_resampled).value_counts())

In [None]:
X_resampled =pd.DataFrame(X_resampled, columns = X.columns)
y_resampled =pd.Series(y_resampled)
y_resampled.name ='recidivist'

In [None]:
# inspect_df(X_resampled)

In [None]:
# Convert X_resampled columns back to int
for header in category_cols:
    X_resampled[header] = X_resampled[header].astype('int')
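
The cast back to int is needed because SMOTE builds synthetic rows by interpolating between a minority-class row and one of its nearest minority-class neighbours, so integer category codes generally come back as fractional values. The NumPy sketch below illustrates only that interpolation step; it is not imbalanced-learn's implementation, and `synth_point` is a hypothetical helper.

```python
import numpy as np


def synth_point(x, neighbor, rng):
    """Interpolate a synthetic sample on the segment between x and neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment, in [0, 1)
    return x + gap * (neighbor - x)


rng = np.random.default_rng(42)
x = np.array([2.0, 0.0])         # e.g. two category codes of one minority-class row
neighbor = np.array([5.0, 1.0])  # a nearby minority-class row

new_row = synth_point(x, neighbor, rng)
print(new_row)              # fractional values between the two rows
print(new_row.astype(int))  # astype(int) truncates toward zero
```

Note that truncation can assign a synthetic row a category code that neither parent row had. imbalanced-learn also provides `SMOTENC` for data that mixes categorical and continuous features, which may be worth evaluating as an alternative.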

In [None]:
X_resampled.dtypes

### Using CatBoostClassifier and Pool

In [None]:
from catboost import Pool, CatBoostClassifier
from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(df_to_split, df[target_col])
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.4)

In [None]:
train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)

In [None]:
cb_clf = CatBoostClassifier(iterations=3000, depth=12, boosting_type='Ordered',
                            learning_rate=0.01, thread_count=-1, eval_metric='AUC',
                            silent=True, allow_const_label=True)


In [None]:
cb_clf.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=20)
cb_clf.best_score_

_____________________________________________