# Table of contents

This cleaning/exploration script is structured as follows:
1. [Pre-amble](#pre-amble)

2. [Basic dataset information](#basic-info)

3. [Data cleaning](#data-cleaning)

    * [Basic demographics](#basic-demographics)
    * [Date-time variables](#datetime-variables)
    * [Sentencing variables](#sentencing-vars)
    * [Filtering to prepare analysis-ready datasets](#filtering)
    * [Regrouping offenses using CJARS tool](#cjars)

# Pre-amble<a class="anchor" id="pre-amble"></a>

Prior to the exploration, we first load some basic packages:

In [1]:
# loading the required packages
import pandas as pd
import numpy as np
import datetime
import random
import re
import os
from dateutil import relativedelta
import math
#import plotnine
#from plotnine import *

# to print the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# to customize the display of row/column df printouts
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 120)

# Basic dataset information<a class="anchor" id="basic-info"></a>

In [2]:
# loading the raw sentencing data 
sentencing_raw = pd.read_csv('../Data/sentencing.csv')

# printing the dataset characteristics
sentencing_raw.shape
sentencing_raw.info()
sentencing_raw.dtypes

# taking a look at the data head
sentencing_raw.head(n=10)


(248146, 41)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248146 entries, 0 to 248145
Data columns (total 41 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   CASE_ID                            248146 non-null  int64  
 1   CASE_PARTICIPANT_ID                248146 non-null  int64  
 2   RECEIVED_DATE                      248146 non-null  object 
 3   OFFENSE_CATEGORY                   248146 non-null  object 
 4   PRIMARY_CHARGE_FLAG                248146 non-null  bool   
 5   CHARGE_ID                          248146 non-null  int64  
 6   CHARGE_VERSION_ID                  248146 non-null  int64  
 7   DISPOSITION_CHARGED_OFFENSE_TITLE  248146 non-null  object 
 8   CHARGE_COUNT                       248146 non-null  int64  
 9   DISPOSITION_DATE                   248146 non-null  object 
 10  DISPOSITION_CHARGED_CHAPTER        248146 non-null  object 
 11  DISPOSITION_CHARGED_ACT            2427

CASE_ID                                int64
CASE_PARTICIPANT_ID                    int64
RECEIVED_DATE                         object
OFFENSE_CATEGORY                      object
PRIMARY_CHARGE_FLAG                     bool
CHARGE_ID                              int64
CHARGE_VERSION_ID                      int64
DISPOSITION_CHARGED_OFFENSE_TITLE     object
CHARGE_COUNT                           int64
DISPOSITION_DATE                      object
DISPOSITION_CHARGED_CHAPTER           object
DISPOSITION_CHARGED_ACT               object
DISPOSITION_CHARGED_SECTION           object
DISPOSITION_CHARGED_CLASS             object
DISPOSITION_CHARGED_AOIC              object
CHARGE_DISPOSITION                    object
CHARGE_DISPOSITION_REASON             object
SENTENCE_JUDGE                        object
SENTENCE_COURT_NAME                   object
SENTENCE_COURT_FACILITY               object
SENTENCE_PHASE                        object
SENTENCE_DATE                         object
SENTENCE_T

Unnamed: 0,CASE_ID,CASE_PARTICIPANT_ID,RECEIVED_DATE,OFFENSE_CATEGORY,PRIMARY_CHARGE_FLAG,CHARGE_ID,CHARGE_VERSION_ID,DISPOSITION_CHARGED_OFFENSE_TITLE,CHARGE_COUNT,DISPOSITION_DATE,DISPOSITION_CHARGED_CHAPTER,DISPOSITION_CHARGED_ACT,DISPOSITION_CHARGED_SECTION,DISPOSITION_CHARGED_CLASS,DISPOSITION_CHARGED_AOIC,CHARGE_DISPOSITION,CHARGE_DISPOSITION_REASON,SENTENCE_JUDGE,SENTENCE_COURT_NAME,SENTENCE_COURT_FACILITY,SENTENCE_PHASE,SENTENCE_DATE,SENTENCE_TYPE,CURRENT_SENTENCE_FLAG,COMMITMENT_TYPE,COMMITMENT_TERM,COMMITMENT_UNIT,LENGTH_OF_CASE_in_Days,AGE_AT_INCIDENT,RACE,GENDER,INCIDENT_CITY,INCIDENT_BEGIN_DATE,INCIDENT_END_DATE,LAW_ENFORCEMENT_AGENCY,LAW_ENFORCEMENT_UNIT,ARREST_DATE,FELONY_REVIEW_DATE,FELONY_REVIEW_RESULT,ARRAIGNMENT_DATE,UPDATED_OFFENSE_CATEGORY
0,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50510112469,116304211997,FIRST DEGREE MURDER,2,12/17/2014 12:00:00 AM,38,-,9-1(a)(2),X,1607,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
1,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50510213021,98265074680,HOME INVASION,14,12/17/2014 12:00:00 AM,38-12-11-A(2),,,X,1847,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
2,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516447217,131972895911,FIRST DEGREE MURDER,4,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,1608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
3,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516497493,131966356472,FIRST DEGREE MURDER,5,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,1608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
4,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516648320,98059642859,HOME INVASION,13,12/17/2014 12:00:00 AM,38-12-11-A(1),,,X,1846,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,30.0,Year(s),10982.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
5,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516648320,98059642859,HOME INVASION,13,12/17/2014 12:00:00 AM,38-12-11-A(1),,,X,1846,Plea Of Guilty,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,False,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
6,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516698596,112267602827,ARMED ROBBERY,15,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,30.0,Year(s),10982.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
7,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516698596,112267602827,ARMED ROBBERY,15,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Plea Of Guilty,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,False,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
8,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516748872,112265921257,ARMED ROBBERY,16,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide
9,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,True,50510062193,112898098217,FIRST DEGREE MURDER,1,12/17/2014 12:00:00 AM,38,-,9-1(a)(1),X,1606,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,62.0,Year(s),10982.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide


# Data cleaning<a class="anchor" id="data-cleaning"></a>

In [3]:
# initializing a df for the cleaned version of the sentencing_raw data 
sentencing_cleaned = sentencing_raw.copy()

## Cleaning demographic variables<a class="anchor" id="basic-demographics"></a>

Here, we clean up some important demographic characteristics that can be used in the analysis. 
For this, we adapt some of the approaches that were used in the second and third problem sets.

* Defining the defendant's race groups
* Defining the defendant's gender group
* Cleaning up defendant's age via winsorizing

In [4]:
# printing the original distribution of RACE variable
print("Distribution of original `RACE` variable:")
sentencing_cleaned.RACE.value_counts()

# defining some important race groups
sentencing_cleaned['is_black'] = np.where(sentencing_raw.RACE.isin(['Black', 'White/Black [Hispanic or Latino]']), 
                                          True, False)
sentencing_cleaned['is_hisp'] = np.where(sentencing_raw.RACE.isin(['HISPANIC', 'White [Hispanic or Latino]']), 
                                          True, False)
sentencing_cleaned['is_white'] = np.where(sentencing_raw.RACE.isin(['White']), True, False)

# for the RACE columns, replace value with np.nan if RACE == 'Unknown' or RACE == 'Biracial'
cond = sentencing_cleaned.RACE.isin(['Unknown', 'Biracial']) # defining the condition
sentencing_cleaned.loc[cond, ['is_black', 'is_hisp', 'is_white']] = np.nan

Distribution of original `RACE` variable:


Black                               164423
White [Hispanic or Latino]           37880
White                                35361
HISPANIC                              5941
Asian                                 1453
White/Black [Hispanic or Latino]      1238
Unknown                                358
American Indian                        125
ASIAN                                   65
Biracial                                35
Name: RACE, dtype: int64

In [5]:
# original GENDER distribution:
print("Distribution of original `GENDER` distribution:")
sentencing_cleaned.GENDER.value_counts()

# defining gender groups: False for male, True for female, NaN for unknown or missing
sentencing_cleaned['is_female'] = np.where(sentencing_cleaned.GENDER.isin(['Male', 'Male name, no gender given']), 
                                           False, np.where(sentencing_cleaned.GENDER.str.contains('Unknown', na=True), 
                                                           np.nan, True))

# final look at the gender variable 
print("Distribution of cleaned `is_female` distribution:")
sentencing_cleaned.is_female.value_counts()

Distribution of original `GENDER` distribution:


Male                          217610
Female                         29714
Unknown                            7
Male name, no gender given         3
Unknown Gender                     3
Name: GENDER, dtype: int64

Distribution of cleaned `is_female` distribution:


0.0    217613
1.0     29714
Name: is_female, dtype: int64

In [6]:
# original summary stat of age variable:
print("Summary statistics of original AGE_AT_INCIDENT variable:")
sentencing_cleaned.AGE_AT_INCIDENT.describe()

# there is an outlier (137 y.o. obs), winsorizing age column to 99.99th percentile
sentencing_cleaned['age_cleaned'] = np.where(sentencing_cleaned.AGE_AT_INCIDENT >= 
                                             sentencing_cleaned.AGE_AT_INCIDENT.quantile(0.9999), 
                                             sentencing_cleaned.AGE_AT_INCIDENT.quantile(0.9999), 
                                             sentencing_cleaned.AGE_AT_INCIDENT)

# printing the summary stat of new age variable
print("Summary statistics of cleaned age variable:")
sentencing_cleaned.age_cleaned.describe()

Summary statistics of original AGE_AT_INCIDENT variable:


count    238359.000000
mean         32.304260
std          11.788915
min          17.000000
25%          23.000000
50%          29.000000
75%          40.000000
max         137.000000
Name: AGE_AT_INCIDENT, dtype: float64

Summary statistics of cleaned age variable:


count    238359.000000
mean         32.302611
std          11.779161
min          17.000000
25%          23.000000
50%          29.000000
75%          40.000000
max          81.000000
Name: age_cleaned, dtype: float64
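
The capping step above can also be written with `Series.clip`. The sketch below uses toy ages (not the sentencing data) and an 80th-percentile cap chosen only so that the small example visibly caps a value; the notebook itself uses the 99.99th percentile.

```python
import numpy as np
import pandas as pd

# toy ages (not from the sentencing data): one implausible value and one missing value
ages = pd.Series([17.0, 23.0, 29.0, 40.0, 137.0, np.nan])

# upper cap; 0.8 is an assumption made for this toy example only
cap = ages.quantile(0.8)

# clip(upper=...) replaces values above the cap and leaves NaN untouched
age_capped = ages.clip(upper=cap)
```

Like the `np.where` version, this only caps the upper tail, so the minimum age is unchanged.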

<u>**Cleaning flags (if any)**</u>:

1. `RACE`: 
    - How should we categorize *biracial* race group?
    - I recoded `Unknown` and `Biracial` as NaN for each race definition
    - What does `[Hispanic or Latino]` actually mean? In pset2, why did we not categorize `White/Black [Hispanic or Latino]` into the `is_hisp` definition?
    
    
2. `GENDER`:
    - I recoded rows containing `Unknown` as NaN
    - `Male name, no gender given` is coded as `Male` (reasonable?).
    

3. `AGE_AT_INCIDENT`:
    - As with pset2, I winsorized the age variable at the 99.99th percentile (81 years), which caps the 137-year-old outlier
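
One more point on the `RACE` flags: in the printed data frame, a row with a missing `RACE` (index 248141) has `is_black`, `is_hisp` and `is_white` all equal to `False` rather than NaN. Whether missing values should be treated like `Unknown` is a judgment call. If they should, a minimal sketch on toy data could look like this:

```python
import numpy as np
import pandas as pd

# toy RACE values (labels taken from the value counts above, plus a missing value)
race = pd.Series(["Black", "White", "HISPANIC", "Unknown", "Biracial", np.nan])

# rows whose race cannot be determined: missing, Unknown or Biracial
undetermined = race.isna() | race.isin(["Unknown", "Biracial"])

# isin() returns booleans; where() keeps them only for determined rows
# and sets the remaining rows to NaN
is_black = (
    race.isin(["Black", "White/Black [Hispanic or Latino]"])
    .astype("float")
    .where(~undetermined)
)
```

The same pattern would apply to `is_hisp` and `is_white`.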

## Cleaning datetime variables<a class="anchor" id="datetime-variables"></a>

Here, we:

* Clean up the defendant's date of sentencing (`SENTENCE_DATE`). We'll create a datetime object out of the field and extract the year, month, day, and year-month components of the date.

* Add several key treatment variables that will be of interest for the analysis

* Add relative-time indicators for the event study estimation

In [7]:
# do all rows have 12:00:00 AM time?
set([date[-11:] for date in sentencing_cleaned['SENTENCE_DATE']])

# since all rows end with 12:00:00 AM, we can strip that component 
sentencing_cleaned['sentence_date'] = sentencing_cleaned.SENTENCE_DATE.str.replace(" 12:00:00 AM", "", regex=False)

# we clean up the SENTENCE_DATE that have out-of-bound years
sentencing_cleaned['sentence_date'] = [re.sub(r'2[1-9]([0-9]+)', r'20\1', str(date)) 
                                       if bool(re.search(r'2[1-9]([0-9]+)', str(date)))
                                       else str(date) 
                                       for date in sentencing_cleaned.sentence_date]

# converting to datetime
sentencing_cleaned['sentence_date'] = pd.to_datetime(sentencing_cleaned["sentence_date"])

# creating year, month, day, and year-month columns 
sentencing_cleaned['sentence_year'] = pd.DatetimeIndex(sentencing_cleaned['sentence_date']).year
sentencing_cleaned['sentence_month'] = pd.DatetimeIndex(sentencing_cleaned['sentence_date']).month
sentencing_cleaned['sentence_day'] = pd.DatetimeIndex(sentencing_cleaned['sentence_date']).day
sentencing_cleaned['sentence_ym'] = sentencing_cleaned['sentence_date'].dt.to_period('M')

{'12:00:00 AM'}
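
The date cleaning in the cell above can also be expressed with vectorized string methods. This is a sketch on two toy strings in the `SENTENCE_DATE` format, reusing the same year pattern; the explicit `format` argument is an addition that assumes every date is month-first.

```python
import pandas as pd

# toy strings in the SENTENCE_DATE format; the second has an out-of-bounds year
raw = pd.Series(["6/2/1986 12:00:00 AM", "3/19/2921 12:00:00 AM"])

# drop the constant time component (regex=False treats the pattern literally)
dates = raw.str.replace(" 12:00:00 AM", "", regex=False)

# rewrite years such as 2921 to 2021, using the same pattern as the cell above
dates = dates.str.replace(r"2[1-9]([0-9]+)", r"20\1", regex=True)

# an explicit format avoids ambiguity between month-first and day-first dates
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
```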

In [8]:
sentencing_cleaned

Unnamed: 0,CASE_ID,CASE_PARTICIPANT_ID,RECEIVED_DATE,OFFENSE_CATEGORY,PRIMARY_CHARGE_FLAG,CHARGE_ID,CHARGE_VERSION_ID,DISPOSITION_CHARGED_OFFENSE_TITLE,CHARGE_COUNT,DISPOSITION_DATE,DISPOSITION_CHARGED_CHAPTER,DISPOSITION_CHARGED_ACT,DISPOSITION_CHARGED_SECTION,DISPOSITION_CHARGED_CLASS,DISPOSITION_CHARGED_AOIC,CHARGE_DISPOSITION,CHARGE_DISPOSITION_REASON,SENTENCE_JUDGE,SENTENCE_COURT_NAME,SENTENCE_COURT_FACILITY,SENTENCE_PHASE,SENTENCE_DATE,SENTENCE_TYPE,CURRENT_SENTENCE_FLAG,COMMITMENT_TYPE,COMMITMENT_TERM,COMMITMENT_UNIT,LENGTH_OF_CASE_in_Days,AGE_AT_INCIDENT,RACE,GENDER,INCIDENT_CITY,INCIDENT_BEGIN_DATE,INCIDENT_END_DATE,LAW_ENFORCEMENT_AGENCY,LAW_ENFORCEMENT_UNIT,ARREST_DATE,FELONY_REVIEW_DATE,FELONY_REVIEW_RESULT,ARRAIGNMENT_DATE,UPDATED_OFFENSE_CATEGORY,is_black,is_hisp,is_white,is_female,age_cleaned,sentence_date,sentence_year,sentence_month,sentence_day,sentence_ym
0,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50510112469,116304211997,FIRST DEGREE MURDER,2,12/17/2014 12:00:00 AM,38,-,9-1(a)(2),X,0000001607,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide,True,False,False,0.0,27.0,1986-06-02,1986,6,2,1986-06
1,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50510213021,98265074680,HOME INVASION,14,12/17/2014 12:00:00 AM,38-12-11-A(2),,,X,0000001847,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide,True,False,False,0.0,27.0,1986-06-02,1986,6,2,1986-06
2,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516447217,131972895911,FIRST DEGREE MURDER,4,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,0000001608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide,True,False,False,0.0,27.0,1986-06-02,1986,6,2,1986-06
3,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516497493,131966356472,FIRST DEGREE MURDER,5,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,0000001608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,6/2/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide,True,False,False,0.0,27.0,1986-06-02,1986,6,2,1986-06
4,149765331439,175691153649,8/15/1984 12:00:00 AM,PROMIS Conversion,False,50516648320,98059642859,HOME INVASION,13,12/17/2014 12:00:00 AM,38-12-11-A(1),,,X,0000001846,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,30.0,Year(s),10982.0,27.0,Black,Male,,8/9/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,8/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,9/21/1984 12:00:00 AM,Homicide,True,False,False,0.0,27.0,2014-10-16,2014,10,16,2014-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248141,445516248775,905289187184,2/3/2021 12:00:00 AM,Home Invasion,True,447867029248,836341344232,AGGRAVATED UNLAWFUL USE OF WEAPON,1,3/19/2021 12:00:00 AM,720,5,24-1.6(a)(1),4,17869,Plea Of Guilty,,Anjana Hansen,District 2 - Skokie,Skokie Courthouse,Original Sentencing,3/19/2021 12:00:00 AM,Probation,True,Probation,2.0,Year(s),0.0,19.0,,Female,Des Plaines,2/2/2021 12:00:00 AM,,DES PLAINES PD,,2/2/2021 2:20:00 PM,02/04/2021 12:00:00 AM,Approved,3/19/2021 12:00:00 AM,UUW - Unlawful Use of Weapon,False,False,False,1.0,19.0,2021-03-19,2021,3,19,2021-03
248142,445527409730,905322500577,2/4/2021 12:00:00 AM,Domestic Battery,True,447970697900,836537247156,DOMESTIC BATTERY,1,3/23/2021 12:00:00 AM,720,5,12-3.2(a)(1),4,12238,Plea Of Guilty,,Terence MacCarthy,District 4 - Maywood,Maywood Courthouse,Original Sentencing,3/23/2021 12:00:00 AM,Probation,True,Probation,24.0,Months,12.0,32.0,Black,Male,Franklin Park,2/3/2021 12:00:00 AM,,COOK COUNTY SHERIFF'S POLICE PATROL MAYWOOD (I...,,2/9/2021 11:25:00 AM,02/10/2021 12:00:00 AM,Approved,3/11/2021 12:00:00 AM,Domestic Battery,True,False,False,0.0,32.0,2021-03-23,2021,3,23,2021-03
248143,445587767000,905518794790,2/9/2021 12:00:00 AM,Driving With Suspended Or Revoked License,True,447955866546,837760215766,DRIVING ON SUSPENDED LICENSE,1,3/11/2021 12:00:00 AM,625,5,6-303(a),A,5880000,Plea Of Guilty,,Gregory Paul Vazquez,District 4 - Maywood,Maywood Courthouse,Original Sentencing,3/11/2021 12:00:00 AM,Jail,True,Cook County Department of Corrections,45.0,Days,2.0,29.0,White,Male,Berwyn,7/28/2020 12:00:00 AM,,BERWYN PD,,2/4/2021 4:35:00 PM,,,3/9/2021 12:00:00 AM,Driving With Suspended Or Revoked License,False,False,True,0.0,29.0,2021-03-11,2021,3,11,2021-03
248144,445592613204,905533705601,2/9/2021 12:00:00 AM,Driving With Suspended Or Revoked License,True,447966223356,837758347354,DRIVING ON SUSPENDED LICENSE,1,3/11/2021 12:00:00 AM,625,5,6-303(a),A,5880000,Plea Of Guilty,,Gregory Paul Vazquez,District 4 - Maywood,Maywood Courthouse,Original Sentencing,3/11/2021 12:00:00 AM,Jail,True,Cook County Department of Corrections,45.0,Days,2.0,29.0,White,Male,Berwyn,6/17/2020 12:00:00 AM,,BERWYN PD,,2/4/2021 4:35:00 PM,,,3/9/2021 12:00:00 AM,Driving With Suspended Or Revoked License,False,False,True,0.0,29.0,2021-03-11,2021,3,11,2021-03


In [9]:
# defining the treatment variables
sentencing_cleaned['sa_office_period'] = np.where(sentencing_cleaned.sentence_ym >= "2016-12", # SA Foxx assumed office on Dec 1, 2016
                                                  True, False)   

sentencing_cleaned['sa_timedelta'] = (sentencing_cleaned.sentence_year - 2016)*12 + (sentencing_cleaned.sentence_month - 12)

sentencing_cleaned['sa_timedelta_days'] = (sentencing_cleaned['sentence_date'] - pd.to_datetime("2016-12-01")).dt.days

sentencing_cleaned['sa_timedelta_wk'] = [math.floor(delta_days/7) if delta_days >= 0       # 2.14 weeks as 2 weeks
                                         else math.ceil(delta_days/7) if delta_days < 0    # -3.14 weeks as -3 weeks
                                         else np.nan
                                         for delta_days in sentencing_cleaned.sa_timedelta_days]

# Bail Reform Act
sentencing_cleaned['BRA_period'] = np.where(sentencing_cleaned.sentence_ym >= "2017-06", # Bail Reform Act
                                            True, False)

sentencing_cleaned['BRA_timedelta'] = (sentencing_cleaned.sentence_year - 2017)*12 + (sentencing_cleaned.sentence_month - 6)

sentencing_cleaned['BRA_timedelta_days'] = (sentencing_cleaned['sentence_date'] - pd.to_datetime("2017-06-12")).dt.days

sentencing_cleaned['BRA_timedelta_wk'] = [math.floor(delta_days/7) if delta_days >= 0
                                          else math.ceil(delta_days/7) if delta_days < 0    # -3.14 weeks as -3 weeks
                                          else np.nan
                                          for delta_days in sentencing_cleaned.BRA_timedelta_days]

<u>**Cleaning flags (if any)**</u>:

1. `SENTENCE_DATE`: 
    - All out-of-bounds years (e.g. 29XX or 22XX instead of 20XX) are converted into 20XX. 
    - How do we clean up 2023-2066 though? (Currently, we're filtering against values > 2022 in the filtering stage)
    

2. Key treatment-related time variables:
    - SA Kim Foxx entry: Value = 1 if December 2016 onwards, 0 if otherwise
    - Bail reform act: Value = 1 if June 2017 onwards, 0 if otherwise
    - For both indicators, the relative-time variables (in months, weeks, and days) are measured relative to the start of the period in which the indicator equals 1
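
A note on the weekly relative-time variables: the floor/ceil logic truncates toward zero, so week 0 covers day offsets -6 through 6 (13 days) while every other week covers 7 days. The toy sketch below compares that rule with plain floor division. Which one suits the event study is a modelling choice, and nothing here is taken from the data.

```python
import numpy as np
import pandas as pd

# toy day offsets relative to an event date (negative = before the event)
delta_days = pd.Series([-8.0, -7.0, -6.0, -1.0, 0.0, 6.0, 7.0, np.nan])

# truncation toward zero, equivalent to the floor/ceil list comprehensions above
wk_trunc = np.trunc(delta_days / 7)

# floor division: every bin, including those before the event, spans 7 days
wk_floor = np.floor(delta_days / 7)
```

With floor division, the last 7 days before the event fall into week -1 instead of being pooled with week 0.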

## Cleaning sentencing-related variables<a class="anchor" id="sentencing-vars"></a>

Here, we:

* Clean up sentencing term data, based on their commitment units
* Define the defendant's incarceration status
* Recategorize offense types

First up, we clean the sentencing term data, storing them in both days and years format:

In [10]:
# what sentence units are available in the sentencing_cleaned data?
print("Commitment units in the original data:")
sentencing_cleaned.COMMITMENT_UNIT.value_counts()

# first, we convert the commitment_term into a numeric
# before that, because we have multiple categories to be cleaned, we'll first create a dictionary for the 
# string correspondences
replace_dict = {'wrap': "", 
                "two": "2", 
                "months": "", 
                "1,154.00": "1154", 
                "`" : ""}

# replacing w/ the string correspondence 
sentencing_cleaned['sentencing_num'] = pd.to_numeric(sentencing_cleaned['COMMITMENT_TERM'].replace(replace_dict, 
                                                                                                   regex=True))

# we're going to use np.select. as such, we store the criteria (units) and codes (days_equiv) in different objects
units = [sentencing_cleaned.COMMITMENT_UNIT == "Year(s)",
        sentencing_cleaned.COMMITMENT_UNIT == "Months", 
        sentencing_cleaned.COMMITMENT_UNIT == "Days", 
        sentencing_cleaned.COMMITMENT_UNIT == "Weeks", 
        sentencing_cleaned.COMMITMENT_UNIT == "Hours", 
        sentencing_cleaned.COMMITMENT_UNIT == "Natural Life",
        sentencing_cleaned.COMMITMENT_UNIT.isin(['Term', 'Dollars', 'Pounds', 'Ounces', 'Kilos'])]

# assigning the codes - nan to the excluded units
days_equiv = [(sentencing_cleaned.sentencing_num * 365), 
              (sentencing_cleaned.sentencing_num * 30.5), 
              (sentencing_cleaned.sentencing_num * 1), 
              (sentencing_cleaned.sentencing_num * 7), 
              (sentencing_cleaned.sentencing_num * 1/24), 
              (100 - sentencing_cleaned.age_cleaned)*365, 
              np.nan]

# generating the days units of each COMMITMENT_UNIT type. 
sentencing_cleaned['sentencing_term_d'] = np.select(units, days_equiv)
sentencing_cleaned.loc[(pd.isnull(sentencing_cleaned.COMMITMENT_TERM)) & 
                       (pd.isnull(sentencing_cleaned.COMMITMENT_UNIT)), 
                       'sentencing_term_d'] = np.nan       # assigning NaN to these rows because they got coded as 0
sentencing_cleaned['sentencing_term_y'] = sentencing_cleaned.sentencing_term_d / 365

# summary statistics of the sentencing term in year. 
sentencing_cleaned['sentencing_term_y'].describe()

Commitment units in the original data:


Year(s)         178836
Months           57099
Days              7323
Term              2342
Natural Life       722
Dollars             73
Hours               19
Weeks               16
Pounds               2
Ounces               1
Kilos                1
Name: COMMITMENT_UNIT, dtype: int64

count    2.439920e+05
mean     2.060792e+01
std      5.817713e+03
min      0.000000e+00
25%      1.504110e+00
50%      2.000000e+00
75%      3.000000e+00
max      2.032012e+06
Name: sentencing_term_y, dtype: float64
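
`np.select` falls back to a default of 0 when no condition matches, which is why the cell above has to reset some rows to NaN afterwards. A possible alternative, sketched here on toy rows, is to map each unit to a multiplier so that unmapped or missing units become NaN directly. `Natural Life` is left out of the sketch because its conversion depends on age.

```python
import numpy as np
import pandas as pd

# toy rows; the multipliers mirror the days_equiv list above
toy = pd.DataFrame({
    "COMMITMENT_UNIT": ["Year(s)", "Months", "Days", "Weeks", "Hours", "Dollars", None],
    "sentencing_num": [2.0, 6.0, 45.0, 2.0, 48.0, 100.0, np.nan],
})

days_per_unit = {"Year(s)": 365, "Months": 30.5, "Days": 1, "Weeks": 7, "Hours": 1 / 24}

# map() returns NaN for units that are missing or absent from the dictionary,
# so excluded units never pick up a default of 0
toy["sentencing_term_d"] = toy["sentencing_num"] * toy["COMMITMENT_UNIT"].map(days_per_unit)
toy["sentencing_term_y"] = toy["sentencing_term_d"] / 365
```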

Next, we define whether the defendant is incarcerated or on probation:

In [11]:
# defining whether incarcerated (if COMMITMENT_TYPE == Illinois Department of Corrections)
sentencing_cleaned['is_incarcerated'] = np.where(sentencing_cleaned['COMMITMENT_TYPE'] == "Illinois Department of Corrections", 
                                                 True, False)

# defining whether is_on_probation - based on the sentencing_data_glossary
sentencing_cleaned['is_on_probation'] = np.where(sentencing_cleaned['COMMITMENT_TYPE'].isin(["Probation", 
                                                                                             "710/410 Probation", 
                                                                                             "Intensive Probation Services", 
                                                                                             "Mental Health Probation", 
                                                                                             "Intensive Drug Probation Services", 
                                                                                             "Drug Court Probation", 
                                                                                             "Sex Offender Probation", 
                                                                                             "Gang Probation", 
                                                                                             "2nd Chance Probation", 
                                                                                             "Veteran's Court Probation", 
                                                                                             "Repeat Offender Probation", 
                                                                                             "Domestic Violence Probation"]), 
                                                 True, False)




Next, we clean up/regroup the offense categories:

In [12]:
# stripping the offense category of "Aggravated" keyword
sentencing_cleaned['regrouped_offense'] = sentencing_cleaned.UPDATED_OFFENSE_CATEGORY.str.replace("Aggravated ", "")

# printing the number of unique offense categories in the raw + cleaned data
print("Number of unique `UPDATED_OFFENSE_CATEGORY`: " + str(len(sentencing_cleaned.UPDATED_OFFENSE_CATEGORY.unique()))) 
print("Number of unique `regrouped_offense`: " + str(len(sentencing_cleaned.regrouped_offense.unique()))) 

Number of unique `UPDATED_OFFENSE_CATEGORY`: 79
Number of unique `regrouped_offense`: 75


In [13]:
# all misdemeanors (classes A, B, C) and Class 4 felonies are considered eligible for bail reform act.
sentencing_cleaned['eligible_offense'] = np.where(sentencing_cleaned.DISPOSITION_CHARGED_CLASS.isin(['A', 'B', 'C', '4']),
                                                 True, False)

<u>**Cleaning flags (if any)**</u>:

1. `sentencing_term`: 
    * In pset3, we did a `fillna(20)` on the cleaned age variable to calculate the terms for `Natural Life` units. I'm not sure whether we should do this or what the rationale behind it is, so I'm skipping that step.
    * There are some rows with `COMMITMENT_TERM == 0` and non-null `COMMITMENT_UNIT` (e.g. 0 years, 0 months, etc.). Should we filter for these rows when preparing the analysis-ready dataset?


2. Incarceration statuses:
    * Added `is_on_probation` status, which includes most `COMMITMENT_TYPE` values that contain the word "PROBATION" 
    
    
3. `DISPOSITION_CHARGED_CLASS`: 
    * These codes could be important for the Bail Reform Act policy. However, not sure what codes Z, O, P and U mean
    
    
4. How should we ideally group the offense types? How granular should the offense groupings be?


5. `eligible_offense`: 
    * Eligible offenses for bail reform act: misdemeanors type A, B, C and felony level 4 (reasonable?). I couldn't find the exact formal definition of "non-violent", low-level offenses and misdemeanors.
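
To see which probation-like `COMMITMENT_TYPE` values the explicit list leaves out (flag 2 above), one option is to compare it against a keyword match. This sketch uses toy values; `"Probation Terminated Instanter"` is a hypothetical label used only for illustration, and the list is shortened.

```python
import pandas as pd

# toy COMMITMENT_TYPE values; "Probation Terminated Instanter" is a hypothetical
# label showing a type that the keyword catches but the explicit list does not
commitment_type = pd.Series([
    "Probation", "Drug Court Probation", "Illinois Department of Corrections",
    "Probation Terminated Instanter", None,
])

explicit = ["Probation", "Drug Court Probation"]  # shortened version of the list above
in_list = commitment_type.isin(explicit)

# case-insensitive keyword match; na=False codes missing types as False
has_keyword = commitment_type.str.contains("probation", case=False, na=False)

# types that contain the keyword but are not in the explicit list
excluded = commitment_type[has_keyword & ~in_list].unique().tolist()
```

Running the same comparison on the real column would list the types to review by hand.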

## Filtering to prepare analysis-ready dataset<a class="anchor" id="filtering"></a>

Here, we:

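* Filter out sentence dates after September 2022 (the time of the last data update)
* Keep only rows with a non-zero `COMMITMENT_TERM` and a non-null `COMMITMENT_UNIT`; this drops zero-length terms (e.g. 0 years, 0 months, etc.) as well as rows with a missing unit
* Filter to current sentences on primary charges (`PRIMARY_CHARGE_FLAG == True` and `CURRENT_SENTENCE_FLAG == True`), which leaves one row per `CASE_PARTICIPANT_ID`. Note that these flags do not by themselves restrict the data to cases where only one participant is charged; cases with >1 participant might have complications like plea bargains/informing from other participants affecting the sentencing of the focal participant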

In [14]:
# Printing shape prior to filters
print("Shape of data frame prior to filters: " + str(sentencing_cleaned.shape))

# filtering out sentence dates after September 2022 (the time of last update) 
sentencing_analysis = sentencing_cleaned[sentencing_cleaned.sentence_ym <= "2022-09"].copy()

# keeping rows with a non-zero COMMITMENT_TERM and a non-null COMMITMENT_UNIT (drops 0 years, 0 months, etc., and rows with a missing unit)
sentencing_analysis = sentencing_analysis[(sentencing_analysis.sentencing_num != 0) & 
                                          (~pd.isnull(sentencing_analysis.COMMITMENT_UNIT))].copy()

# filtering for cases where primary charge flag == True and current sentence flag == True
sentencing_analysis = sentencing_analysis[(sentencing_analysis.PRIMARY_CHARGE_FLAG == True) & 
                                        (sentencing_analysis.CURRENT_SENTENCE_FLAG == True)].copy()

# Printing shape after filters
print("Shape of data frame after filters: " + str(sentencing_analysis.shape))

# Is CASE_PARTICIPANT_ID unique within the data frame?
print("Number of unique `CASE_PARTICIPANT_ID` in the dataframe: " + 
      str(len(pd.unique(sentencing_analysis.CASE_PARTICIPANT_ID))))

Shape of data frame prior to filters: (248146, 66)
Shape of data frame after filters: (171333, 66)
Number of unique `CASE_PARTICIPANT_ID` in the dataframe: 171333
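
The commitment-term filter in the cell above keeps rows with a non-zero term and a non-null unit, so it also drops every row whose unit is missing. The toy sketch below contrasts that mask with a narrower one that drops only zero terms with a non-null unit. Which behavior is intended is left open here.

```python
import numpy as np
import pandas as pd

# toy rows: (term, unit) combinations that the two masks treat differently
toy = pd.DataFrame({
    "sentencing_num": [2.0, 0.0, np.nan, 3.0],
    "COMMITMENT_UNIT": ["Year(s)", "Months", "Natural Life", None],
})

# mask used in the cell above: keep non-zero terms AND non-null units
kept_by_cell = (toy.sentencing_num != 0) & toy.COMMITMENT_UNIT.notna()

# narrower reading: drop only rows with a zero term and a non-null unit
kept_narrow = ~((toy.sentencing_num == 0) & toy.COMMITMENT_UNIT.notna())
```

The two masks differ only on the last toy row, where the unit is missing.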


In [19]:
# exporting the data to csv
sentencing_analysis.to_csv('../data/csv/sentencing_analysis.csv', index = False)

<u>**Cleaning flags (if any)**</u>:

1. Filtering for `sentence_year`: 
    - Do we need to filter year from below e.g. set minimum year? 
   

2. Here, I'm not dropping any nonsensical judge names 'cos the main analysis does not necessarily relate to the judges' names.


3. When we filter for `PRIMARY_CHARGE_FLAG == True & CURRENT_SENTENCE_FLAG == True`, we're effectively removing **past** and **non-primary** charge rows. This leaves us only with **current, primary charges**. 


4. For the CJARS code classification: 
    - Do we need to set some sort of threshold for the UCCS probability score? We're currently using the `DISPOSITION_CHARGED_OFFENSE_TITLE` for the CJARS merging. 
    - Should we use `UPDATED_OFFENSE_CATEGORY` instead? We were thinking of using the `DISPOSITION_CHARGED_OFFENSE_TITLE` because the column contains longer descriptions than the `UPDATED_OFFENSE_CATEGORY` column.
    - Our current CJARS code classification results in 119 unique charges. But the distribution is pretty sparse (it has even more categories than the `regrouped_offense` column). Do we need to further recategorize the smaller groups into one larger category? 
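
If we do decide to pool the sparse groups, one simple approach is a count threshold. The sketch below uses toy codes; the column name `cjars_code`, the cutoff `min_count` and the label `"other"` are all hypothetical and would need to be chosen from the real distribution.

```python
import pandas as pd

# toy offense codes; `cjars_code` is a hypothetical column name
codes = pd.Series(
    ["drug", "drug", "drug", "theft", "theft", "arson", "fraud"],
    name="cjars_code",
)

min_count = 2  # assumed cutoff, for illustration only
counts = codes.value_counts()
rare = counts[counts < min_count].index

# keep frequent codes as they are and pool the rest into a single "other" group
codes_lumped = codes.where(~codes.isin(rare), "other")
```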