# Table of contents

This cleaning/exploration script is structured as follows:
1. [Pre-amble](#pre-amble)

2. [Basic dataset information](#basic-info)

3. [Data cleaning](#data-cleaning)

    * [Basic demographics](#basic-demographics)
    * [Date-time variables](#datetime-variables)
    * [Sentencing variables](#sentencing-vars)
    * [Filtering to prepare analysis-ready datasets](#filtering)

# Pre-amble<a class="anchor" id="pre-amble"></a>

Prior to the exploration, we first load some basic packages:

In [1]:
# loading the required packages
import pandas as pd
import numpy as np
import datetime
import random
import re
import os
import plotnine
from plotnine import *

# for repeated printouts 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# to display all columns instead of truncated cols
pd.set_option('display.max_columns', None)

# Basic dataset information<a class="anchor" id="basic-info"></a>

In [2]:
# loading the raw sentencing data 
sentencing_raw = pd.read_csv('../data/csv/sentencing.csv')

# printing the dataset characteristics
sentencing_raw.shape
sentencing_raw.info()
sentencing_raw.dtypes

# taking a look at the data head
sentencing_raw.head(n=10)



(272294, 41)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272294 entries, 0 to 272293
Data columns (total 41 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   CASE_ID                            272294 non-null  int64  
 1   CASE_PARTICIPANT_ID                272294 non-null  int64  
 2   RECEIVED_DATE                      272294 non-null  object 
 3   OFFENSE_CATEGORY                   272294 non-null  object 
 4   PRIMARY_CHARGE_FLAG                272294 non-null  bool   
 5   CHARGE_ID                          272294 non-null  int64  
 6   CHARGE_VERSION_ID                  272294 non-null  int64  
 7   DISPOSITION_CHARGED_OFFENSE_TITLE  272294 non-null  object 
 8   CHARGE_COUNT                       272294 non-null  int64  
 9   DISPOSITION_DATE                   272294 non-null  object 
 10  DISPOSITION_CHARGED_CHAPTER        272294 non-null  object 
 11  DISPOSITION_CHARGED_ACT            2668

CASE_ID                                int64
CASE_PARTICIPANT_ID                    int64
RECEIVED_DATE                         object
OFFENSE_CATEGORY                      object
PRIMARY_CHARGE_FLAG                     bool
CHARGE_ID                              int64
CHARGE_VERSION_ID                      int64
DISPOSITION_CHARGED_OFFENSE_TITLE     object
CHARGE_COUNT                           int64
DISPOSITION_DATE                      object
DISPOSITION_CHARGED_CHAPTER           object
DISPOSITION_CHARGED_ACT               object
DISPOSITION_CHARGED_SECTION           object
DISPOSITION_CHARGED_CLASS             object
DISPOSITION_CHARGED_AOIC              object
CHARGE_DISPOSITION                    object
CHARGE_DISPOSITION_REASON             object
SENTENCE_JUDGE                        object
SENTENCE_COURT_NAME                   object
SENTENCE_COURT_FACILITY               object
SENTENCE_PHASE                        object
SENTENCE_DATE                         object
SENTENCE_T

Unnamed: 0,CASE_ID,CASE_PARTICIPANT_ID,RECEIVED_DATE,OFFENSE_CATEGORY,PRIMARY_CHARGE_FLAG,CHARGE_ID,CHARGE_VERSION_ID,DISPOSITION_CHARGED_OFFENSE_TITLE,CHARGE_COUNT,DISPOSITION_DATE,DISPOSITION_CHARGED_CHAPTER,DISPOSITION_CHARGED_ACT,DISPOSITION_CHARGED_SECTION,DISPOSITION_CHARGED_CLASS,DISPOSITION_CHARGED_AOIC,CHARGE_DISPOSITION,CHARGE_DISPOSITION_REASON,SENTENCE_JUDGE,SENTENCE_COURT_NAME,SENTENCE_COURT_FACILITY,SENTENCE_PHASE,SENTENCE_DATE,SENTENCE_TYPE,CURRENT_SENTENCE_FLAG,COMMITMENT_TYPE,COMMITMENT_TERM,COMMITMENT_UNIT,LENGTH_OF_CASE_in_Days,AGE_AT_INCIDENT,RACE,GENDER,INCIDENT_CITY,INCIDENT_BEGIN_DATE,INCIDENT_END_DATE,LAW_ENFORCEMENT_AGENCY,LAW_ENFORCEMENT_UNIT,ARREST_DATE,FELONY_REVIEW_DATE,FELONY_REVIEW_RESULT,ARRAIGNMENT_DATE,UPDATED_OFFENSE_CATEGORY
0,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242195814523,155656315869,FIRST DEGREE MURDER,2,12/17/2014 12:00:00 AM,38,-,9-1(a)(2),X,1607,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
1,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242198287388,131513547452,HOME INVASION,14,12/17/2014 12:00:00 AM,38-12-11-A(2),,,X,1847,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,True,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
2,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242351605056,176626576281,FIRST DEGREE MURDER,4,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,1608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
3,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242352841488,176617824190,FIRST DEGREE MURDER,5,12/17/2014 12:00:00 AM,38,-,9-1(a)(3),X,1608,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,True,Natural Life,,,619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
4,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242356550787,131238606761,HOME INVASION,13,12/17/2014 12:00:00 AM,38-12-11-A(1),,,X,1846,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,30.0,Year(s),10982.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
5,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242356550787,131238606761,HOME INVASION,13,12/17/2014 12:00:00 AM,38-12-11-A(1),,,X,1846,Plea Of Guilty,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,False,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
6,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242357787220,150253900073,ARMED ROBBERY,15,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,30.0,Year(s),10982.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
7,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242357787220,150253900073,ARMED ROBBERY,15,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Plea Of Guilty,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,False,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
8,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,False,1242359023652,150251649536,ARMED ROBBERY,16,12/17/2014 12:00:00 AM,38-18-2-A,,,X,2150,Nolle On Remand,,John Mannion,District 6 - Markham,Markham Courthouse,Original Sentencing,06/02/1986 12:00:00 AM,Conversion,True,Illinois Department of Corrections,30.0,Year(s),619.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion
9,198055620664,85937621020,08/15/1984 12:00:00 AM,PROMIS Conversion,True,1242194578090,151097726688,FIRST DEGREE MURDER,1,12/17/2014 12:00:00 AM,38,-,9-1(a)(1),X,1606,Plea Of Guilty,,Clayton Jay Crane,District 6 - Markham,Markham Courthouse,Amended/Corrected Sentencing,10/16/2014 12:00:00 AM,Prison,True,Illinois Department of Corrections,62.0,Year(s),10982.0,27.0,Black,Male,,08/09/1984 12:00:00 AM,,CHICAGO POLICE DEPT,,08/15/1984 12:00:00 AM,08/15/1984 12:00:00 AM,Charge(S) Approved,09/21/1984 12:00:00 AM,PROMIS Conversion


# Data cleaning<a class="anchor" id="data-cleaning"></a>

In [3]:
# initializing a df for the cleaned version of the sentencing_raw data 
sentencing_cleaned = sentencing_raw.copy()

## Cleaning demographic variables<a class="anchor" id="basic-demographics"></a>

Here, we clean up some important demographic characteristics that can be used in the analysis. 
For this, we adapt some of the approaches that were used in the second and third problem sets.

* Defining the defendant's race groups
* Defining the defendant's gender group
* Cleaning up defendant's age via winsorizing

In [4]:
# printing the original distribution of RACE variable
print("Distribution of original `RACE` variable:")
sentencing_cleaned.RACE.value_counts()

# defining some important race groups
sentencing_cleaned['is_black'] = sentencing_cleaned.RACE.isin(['Black', 'White/Black [Hispanic or Latino]'])
sentencing_cleaned['is_hisp'] = sentencing_cleaned.RACE.isin(['HISPANIC', 'White [Hispanic or Latino]'])
sentencing_cleaned['is_white'] = sentencing_cleaned.RACE.isin(['White'])

# for the RACE columns, replace value with np.nan if RACE == 'Unknown' or RACE == 'Biracial'
cond = sentencing_cleaned.RACE.isin(['Unknown', 'Biracial']) # defining the condition
sentencing_cleaned.loc[cond, ['is_black', 'is_hisp', 'is_white']] = np.nan

Distribution of original `RACE` variable:


Black                               181219
White [Hispanic or Latino]           41976
White                                37983
HISPANIC                              6098
Asian                                 1602
White/Black [Hispanic or Latino]      1408
Unknown                                395
American Indian                        132
ASIAN                                   68
Biracial                                36
Name: RACE, dtype: int64

In [5]:
# original GENDER distribution:
print("Distribution of original `GENDER` variable:")
sentencing_cleaned.GENDER.value_counts()

# defining gender groups 
sentencing_cleaned['is_female'] = np.where(sentencing_cleaned.GENDER.isin(['Male', 'Male name, no gender given']),
                                          False, np.where(sentencing_cleaned.GENDER.str.contains('Unknown'), 
                                                          np.nan, True))

# final look at the gender variable 
print("Distribution of cleaned `is_female` variable:")
sentencing_cleaned.is_female.value_counts()

Distribution of original `GENDER` variable:


Male                          239279
Female                         32138
Unknown                           12
Unknown Gender                     4
Male name, no gender given         3
Name: GENDER, dtype: int64

Distribution of cleaned `is_female` variable:


0.0    239282
1.0     32138
Name: is_female, dtype: int64

In [6]:
# original summary stat of age variable:
print("Summary statistics of original AGE_AT_INCIDENT variable:")
sentencing_cleaned.AGE_AT_INCIDENT.describe()

# there is an outlier (a 137-year-old observation), so we cap the age column at its 99.99th percentile
age_cap = sentencing_cleaned.AGE_AT_INCIDENT.quantile(0.9999)
sentencing_cleaned['age_cleaned'] = np.where(sentencing_cleaned.AGE_AT_INCIDENT >= age_cap,
                                             age_cap,
                                             sentencing_cleaned.AGE_AT_INCIDENT)

# printing the summary stat of new age variable
print("Summary statistics of cleaned age variable:")
sentencing_cleaned.age_cleaned.describe()

Summary statistics of original AGE_AT_INCIDENT variable:


count    268527.000000
mean         32.360753
std          11.754197
min          17.000000
25%          23.000000
50%          29.000000
75%          40.000000
max         137.000000
Name: AGE_AT_INCIDENT, dtype: float64

Summary statistics of cleaned age variable:


count    268527.00000
mean         32.35875
std          11.74226
min          17.00000
25%          23.00000
50%          29.00000
75%          40.00000
max          81.00000
Name: age_cleaned, dtype: float64

<u>**Cleaning notes/questions (if any)**</u>:

1. `RACE`: 
    - How should we categorize the *biracial* race group?
    - I recoded `Unknown` and `Biracial` as NaN for each race definition
    - What does `[Hispanic or Latino]` actually mean? In pset2, why did we not categorize `White/Black [Hispanic or Latino]` into the `is_hisp` definition?
    
    
2. `GENDER`:
    - I recoded rows containing `Unknown` as NaN
    - `Male name, no gender given` is coded as `Male` (reasonable?).
    

3. `AGE_AT_INCIDENT`:
    - As with pset2, I winsorized the age variable

## Cleaning datetime variables<a class="anchor" id="datetime-variables"></a>

Here, we clean up:

* The defendant's date of sentencing (`SENTENCE_DATE`). We'll convert the field to a datetime and split out its year, month, day, and year-month components.

In [7]:
# do all rows have 12:00:00 AM time?
set([date[-11:] for date in sentencing_cleaned['SENTENCE_DATE']])

# since all rows end with 12:00:00 AM, we can strip that component 
sentencing_cleaned['sentence_date'] = sentencing_cleaned.SENTENCE_DATE.str.replace(" 12:00:00 AM", "", regex=False)

# we repair SENTENCE_DATE values whose year is out of bounds (e.g. 2914 -> 2014);
# strings without a match are left unchanged, so no separate search step is needed
sentencing_cleaned['sentence_date'] = sentencing_cleaned.sentence_date.astype(str).str.replace(
    r'2[1-9]([0-9]+)', r'20\1', regex=True)

# converting to datetime
sentencing_cleaned['sentence_date'] = pd.to_datetime(sentencing_cleaned["sentence_date"])

# creating year, month, day, and year-month columns 
sentencing_cleaned['sentence_year'] = sentencing_cleaned['sentence_date'].dt.year
sentencing_cleaned['sentence_month'] = sentencing_cleaned['sentence_date'].dt.month
sentencing_cleaned['sentence_day'] = sentencing_cleaned['sentence_date'].dt.day
sentencing_cleaned['sentence_ym'] = sentencing_cleaned['sentence_date'].dt.to_period('M')

{'12:00:00 AM'}

<u>**Cleaning notes/questions (if any)**</u>:

1. `SENTENCE_DATE`: 
    - All out-of-bounds years (29XX, 22XX, etc. instead of 20XX) are converted into 20XX. 
    - How do we clean up 2023-2066 though? (Currently, we filter out values > 2022 in the filtering stage.)
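
As a small illustration of the year repair described above, the sketch below applies a stricter pattern to two toy strings. The date values, the lookbehind pattern, and the explicit `format=` string are all assumptions made for this example; the notebook's own regex is looser. Years in the 2023-2066 range are left alone here, since they are still handled at the filtering stage.

```python
import pandas as pd

# toy date strings (hypothetical values): one valid, one with a mistyped century
raw = pd.Series(["06/02/1986 12:00:00 AM", "10/16/2914 12:00:00 AM"])

# rewrite only a four-digit year of the form 2[1-9]XX that directly follows a slash
fixed = raw.str.replace(r"(?<=/)2[1-9](\d{2})\b", r"20\1", regex=True)

# anything still unparseable becomes NaT instead of raising
parsed = pd.to_datetime(fixed, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
years = parsed.dt.year.tolist()

print(fixed.tolist())
print(years)
```

Anchoring the pattern to the slash and to exactly four digits keeps it from touching day or month values.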

## Cleaning sentencing-related variables<a class="anchor" id="sentencing-vars"></a>

Here, we:

* Clean up the sentencing term data based on the commitment units
* Define whether the defendant is incarcerated
* Recategorize offense types

In [None]:
# what sentence units are available in the sentencing_cleaned data?
print("Commitment units in the original data:")
sentencing_cleaned.COMMITMENT_UNIT.value_counts()

# first, we convert COMMITMENT_TERM to numeric; non-numeric entries (e.g. "two") become NaN rather than raising
sentencing_cleaned['sentencing_num'] = pd.to_numeric(sentencing_cleaned.COMMITMENT_TERM.str.replace("wrap", ""),
                                                     errors="coerce")

# we're going to use np.select. as such, we store the criteria (units) and codes (days_equiv) in different objects
units = [sentencing_cleaned.COMMITMENT_UNIT == "Year(s)",
        sentencing_cleaned.COMMITMENT_UNIT == "Months", 
        sentencing_cleaned.COMMITMENT_UNIT == "Days", 
        sentencing_cleaned.COMMITMENT_UNIT == "Weeks", 
        sentencing_cleaned.COMMITMENT_UNIT == "Hours", 
        sentencing_cleaned.COMMITMENT_UNIT == "Natural Life",
        sentencing_cleaned.COMMITMENT_UNIT.isin(['Term', 'Dollars', 'Pounds', 'Ounces', 'Kilos'])]

# assigning the codes - nan to the excluded units
days_equiv = [(sentencing_cleaned.sentencing_num * 365), 
              (sentencing_cleaned.sentencing_num * 30.5), 
              (sentencing_cleaned.sentencing_num * 1), 
              (sentencing_cleaned.sentencing_num * 7), 
              (sentencing_cleaned.sentencing_num * 1/24), 
              (100 - sentencing_cleaned.age_cleaned)*365, 
              np.nan]

# generating the day-equivalent of each COMMITMENT_UNIT type; unmatched or missing units stay NaN
sentencing_cleaned['sentencing_term_d'] = np.select(units, days_equiv, default=np.nan)
sentencing_cleaned['sentencing_term_y'] = sentencing_cleaned.sentencing_term_d / 365

# summary statistics of the sentencing term in year. 
sentencing_cleaned['sentencing_term_y'].describe()

In [None]:
# inspecting the rows where COMMITMENT_TERM is the non-numeric string "two"
sentencing_cleaned[['COMMITMENT_TERM', 'COMMITMENT_UNIT']].loc[sentencing_cleaned.COMMITMENT_TERM == "two"]

<u>**Cleaning notes/questions (if any)**</u>:

1. `sentencing_term`: 
    * In pset3, we did a `fillna(20)` on the cleaned age variable to calculate the terms for `Natural Life` units. I'm not sure whether we should do this here, or what the rationale for it is.
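
To see what is at stake in the `Natural Life` question, here is a minimal sketch on toy rows. The data values are hypothetical, and only two of the units are included to keep it short. It shows that a life sentence with a missing age comes out as NaN, which may be what the pset3 `fillna(20)` was guarding against (that reading is my assumption, not something the problem set states).

```python
import numpy as np
import pandas as pd

# toy rows (hypothetical values): a numeric term, a non-numeric term,
# a life sentence with missing age, and a row with no unit at all
toy = pd.DataFrame({
    "COMMITMENT_TERM": ["30", "two", "1", None],
    "COMMITMENT_UNIT": ["Year(s)", "Year(s)", "Natural Life", None],
    "age_cleaned": [27.0, 35.0, np.nan, 40.0],
})

# non-numeric terms such as "two" become NaN instead of raising
term = pd.to_numeric(toy["COMMITMENT_TERM"], errors="coerce")

conditions = [toy["COMMITMENT_UNIT"] == "Year(s)",
              toy["COMMITMENT_UNIT"] == "Natural Life"]
choices = [term * 365,
           (100 - toy["age_cleaned"]) * 365]  # life sentence: days until age 100

# default=np.nan keeps unmatched units missing (np.select otherwise fills 0)
toy["term_days"] = np.select(conditions, choices, default=np.nan)

print(toy)
```

Whether to impute an age for these rows or leave the term missing is a modeling choice we still need to settle.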

## Filtering to prepare analysis-ready dataset<a class="anchor" id="filtering"></a>

Here, we:

* Filter out observations whose sentence year is above 2022

<u>**Cleaning notes/questions (if any)**</u>:

1. Filtering for `sentence_year`: 
    - Do we also need a lower bound on the year, e.g., a minimum `sentence_year`?
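
The filtering step is not implemented yet, so here is a minimal sketch of what it could look like. The toy frame stands in for `sentencing_cleaned`, and `MAX_YEAR`, `MIN_YEAR`, and `analysis_ready` are hypothetical names introduced for this example; the lower bound is left off until we decide whether we need one.

```python
import pandas as pd

# toy frame (hypothetical values) standing in for sentencing_cleaned
toy = pd.DataFrame({"sentence_year": [1986, 2014, 2022, 2023, 2066]})

MAX_YEAR = 2022   # upper bound stated in the text
MIN_YEAR = None   # hypothetical lower bound; None means no lower filter

# build the row mask step by step so the lower bound is easy to switch on
keep = toy["sentence_year"] <= MAX_YEAR
if MIN_YEAR is not None:
    keep &= toy["sentence_year"] >= MIN_YEAR

# copy so later edits do not write back into the unfiltered frame
analysis_ready = toy.loc[keep].copy()

print(analysis_ready)
```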