# Generating Recidivism Data

The purpose of this notebook is to transform and clean data from the North Carolina Department of Corrections into a dataset for predicting recidivism of individual inmates. The scripts to download the raw, publicly available data can be found in this repository. This notebook is best run on a server with ample memory, as the data is fairly large. At the end, it exports a pickle of a pandas DataFrame. For use in another script, the pickle is highly recommended, as import time and disk space are much lower. The export can be changed to a CSV for cross-compatibility.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 500)

In [2]:
court_commit = pd.read_csv('OFNT3BB1.csv')


In [7]:
court_commit.head(10)

Unnamed: 0,OFFENDER_NC_DOC_ID_NUMBER,COMMITMENT_PREFIX,COMMITTED_LAST_NAME,COMMITTED_FIRST_NAME,COMMITTED_MIDDLE_NAME,COMMITTED_NAME_SUFFIX,OFFENDER_ADMISSION/INTAKE_DATE,P&P_CASE_INTAKE_DATE,INMATE_COMMITMENT_STATUS_FLAG,COMMITMENT_STATUS_DATE,EARLIEST_SENTENCE_EFFECTIVE_DT,NEW_PERIOD_OF_INCARCERATION_FL,MOST_SERIOUS_OFFENSE_CODE,CO_OF_CONV_MOST_SERIOUS_OFFNSE,TOTAL_SENTENCE_LENGTH,TOTAL_JAIL_CREDITS_(IN_DAYS),NO_RESTITUTION_FLAG,P&P_COMMITMENT_STATUS_FLAG,P&P_COMMITMENT_STATUS_DATE,TOTAL_LENGTH_OF_SUPERVISION,PED_PRIOR_TO_1995_CONVERSION,DATE_OF_LAST_UPDATE,TIME_OF_LAST_UPDATE,NEW_PERIOD_OF_SUPERVISION_FLAG,TYPE_OF_OLD_PE_DATE_CODE
0,1,01,AAL ANUBIA,RACHELL,,,0001-01-01,1992-12-14,,0001-01-01,1992-12-14,,,,,0.0,,EARLY TERM EARLY,1995-05-11,,0001-01-01,0001-01-01,01:00:00,,
1,3,01,AARHUS,STEVEN,CHARLES,,0001-01-01,1988-10-21,,0001-01-01,1988-10-21,,,,,0.0,,EARLY TERM EARLY,1991-08-20,,0001-01-01,0001-01-01,01:00:00,,
2,3,02,AARHUS,STEVEN,CHARLES,,0001-01-01,2015-02-06,,0001-01-01,2015-02-06,N,DWI LEVEL 2,GUILFORD,,0.0,,UNSUPERVED UNSUP,2015-04-29,,0001-01-01,2015-05-05,17:20:07,Y,
3,4,AA,AARON,DAVID,CLETIS,,1983-07-13,1984-04-17,ACTIVE,1983-07-13,1983-07-12,Y,,,,,,NORMAL NORM,1984-04-17,,0001-01-01,0001-01-01,01:00:00,,
4,5,01,AARON,GENE,ALEXANDER,,0001-01-01,1989-08-01,,0001-01-01,1989-08-01,,,,,0.0,,EARLY TERM EARLY,1995-04-17,,0001-01-01,0001-01-01,01:00:00,,
5,5,02,AARON,GENE,ALEXANDER,,0001-01-01,1989-08-01,,0001-01-01,1990-11-05,,,,,0.0,,EARLY TERM EARLY,1995-04-17,,0001-01-01,0001-01-01,01:00:00,,
6,6,AA,AARON,GERALD,,,1973-01-30,0001-01-01,COURT OR,1973-03-28,1973-01-30,Y,,,,,,,0001-01-01,,0001-01-01,0001-01-01,01:00:00,,
7,6,AB,AARON,GERALD,,,1973-04-15,1974-01-14,ACTIVE,1973-04-15,1973-04-11,Y,,,,,,NORMAL NORM,1974-01-14,,1973-08-05,0001-01-01,01:00:00,,REG.PAROLE
8,7,01,AARON,HATTIE,MICHELLE,,0001-01-01,1991-05-22,,0001-01-01,1991-05-22,,,,,0.0,,EARLY TERM EARLY,1994-02-14,,0001-01-01,0001-01-01,01:00:00,,
9,7,02,AARON,HATTIE,MICHELLE,,0001-01-01,1991-05-22,,0001-01-01,1991-05-22,,,,,0.0,,EARLY TERM EARLY,1994-02-14,,0001-01-01,0001-01-01,01:00:00,,


In [3]:
inmates = pd.read_csv('INMT4AA1.csv')


`sentence_compute` contains information on the sentences of each individual. Sentences served consecutively for a given inmate share the same `INMATE_COMMITMENT_PREFIX` and have successive `INMATE_SENTENCE_COMPONENT` values. For our purposes, we need the earliest begin date and the latest end date of each commitment. The final end date will come from `sentence_compute`.

In [4]:
sentence_compute = pd.read_csv('INMT4BB1.csv')


In [10]:
sentence_compute.head(10)

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_COMMITMENT_PREFIX,INMATE_SENTENCE_COMPONENT,INMATE_COMPUTATION_STATUS_FLAG,SENTENCE_BEGIN_DATE_(FOR_MAX),ACTUAL_SENTENCE_END_DATE,PROJECTED_RELEASE_DATE_(PRD),PAROLE_DISCHARGE_DATE,PAROLE_SUPERVISION_BEGIN_DATE
0,4,AA,1,EXPIRED,1983-07-12,1984-07-11,1984-07-11,0001-01-01,0001-01-01
1,4,AA,2,EXPIRED,0001-01-01,1984-07-11,1984-07-11,0001-01-01,0001-01-01
2,6,AA,1,EXPIRED,1973-01-30,1973-03-28,0001-01-01,0001-01-01,0001-01-01
3,6,AB,1,EXPIRED,1973-04-11,1975-08-18,1974-08-10,0001-01-01,0001-01-01
4,6,AB,2,EXPIRED,1973-04-24,1975-08-18,1974-08-10,0001-01-01,0001-01-01
5,6,AB,3,EXPIRED,1973-05-07,1975-08-18,1974-08-10,0001-01-01,0001-01-01
6,6,AB,4,EXPIRED,1973-05-20,1975-08-18,1974-08-10,0001-01-01,0001-01-01
7,6,AB,5,EXPIRED,1973-06-03,1975-08-18,1974-08-10,0001-01-01,0001-01-01
8,6,AB,6,EXPIRED,1973-06-16,1975-08-18,1974-08-10,0001-01-01,0001-01-01
9,6,AB,7,EXPIRED,1973-06-29,1975-08-18,1974-08-10,0001-01-01,0001-01-01


In [5]:
sentence_compute["SENTENCE_BEGIN_DATE_(FOR_MAX)"] = pd.to_datetime(sentence_compute["SENTENCE_BEGIN_DATE_(FOR_MAX)"], errors = "coerce")
sentence_compute["ACTUAL_SENTENCE_END_DATE"] = pd.to_datetime(sentence_compute["ACTUAL_SENTENCE_END_DATE"], errors = "coerce")
sentence_compute["PROJECTED_RELEASE_DATE_(PRD)"] = pd.to_datetime(sentence_compute["PROJECTED_RELEASE_DATE_(PRD)"], errors = "coerce")

In [6]:
# If the actual sentence end date is missing, replacing with the projected date.

end_dates = []

for row in sentence_compute.itertuples():
    actual = row[6]
    projected = row[7]
    
    if pd.isnull(actual):
        end_dates.append(projected)
    else:
        end_dates.append(actual)
        
sentence_compute['SENTENCE_END'] = end_dates
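
The same fallback can be written without iterating row by row. The following is a minimal sketch on a toy frame, assuming the same two column names as `sentence_compute`; the frame `toy` and its values are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical stand-in for sentence_compute, with one missing actual end date
toy = pd.DataFrame({
    "ACTUAL_SENTENCE_END_DATE": pd.to_datetime(["1984-07-11", None, "1975-08-18"]),
    "PROJECTED_RELEASE_DATE_(PRD)": pd.to_datetime(["1984-07-11", "1990-01-01", "1974-08-10"]),
})

# Keep the actual end date where present, otherwise fall back to the projected date
toy["SENTENCE_END"] = toy["ACTUAL_SENTENCE_END_DATE"].fillna(
    toy["PROJECTED_RELEASE_DATE_(PRD)"]
)
print(toy["SENTENCE_END"])
```

Rows where both dates are missing stay `NaT`, which matches the behaviour of the loop.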

In [7]:
# For each commitment, getting the earliest begin date and the latest end date for the full sentence term.
sentence_subset = sentence_compute.groupby(['INMATE_DOC_NUMBER', 'INMATE_COMMITMENT_PREFIX']).agg({'SENTENCE_BEGIN_DATE_(FOR_MAX)': 'min', 'SENTENCE_END': 'max'}).reset_index()
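
To make the aggregation concrete, here is a small sketch that collapses several sentence components into one row per commitment. The rows are copied from the `sentence_compute.head(10)` output for inmate 6; the frame name `toy` is hypothetical.

```python
import pandas as pd

# Four sentence components for inmate 6: one under prefix AA, three under AB
toy = pd.DataFrame({
    "INMATE_DOC_NUMBER": [6, 6, 6, 6],
    "INMATE_COMMITMENT_PREFIX": ["AA", "AB", "AB", "AB"],
    "SENTENCE_BEGIN_DATE_(FOR_MAX)": pd.to_datetime(
        ["1973-01-30", "1973-04-11", "1973-04-24", "1973-05-07"]),
    "SENTENCE_END": pd.to_datetime(
        ["1973-03-28", "1975-08-18", "1975-08-18", "1975-08-18"]),
})

# One row per (inmate, commitment): earliest begin date and latest end date
collapsed = (
    toy.groupby(["INMATE_DOC_NUMBER", "INMATE_COMMITMENT_PREFIX"])
       .agg({"SENTENCE_BEGIN_DATE_(FOR_MAX)": "min", "SENTENCE_END": "max"})
       .reset_index()
)
print(collapsed)
```

The three AB components collapse to a single term running from 1973-04-11 to 1975-08-18.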

In [8]:
# Null end dates encode life sentences, so I will set the sentence_end to 2230-1-1 (near top of pandas date range)
sentence_subset.loc[sentence_subset.SENTENCE_END.isnull(), "SENTENCE_END"] = pd.to_datetime('2230-1-1')
sentence_subset.loc[sentence_subset.SENTENCE_END == pd.to_datetime('2230-1-1')].head()

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_COMMITMENT_PREFIX,SENTENCE_END,SENTENCE_BEGIN_DATE_(FOR_MAX)
241,289,BA,2230-01-01,1982-11-08
298,353,BA,2230-01-01,1991-02-14
345,397,BA,2230-01-01,1994-05-09
354,400,BA,2230-01-01,1984-05-22
543,538,BA,2230-01-01,NaT
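
The sentinel date has to stay inside the range that a nanosecond-resolution pandas `Timestamp` can represent. A minimal sketch of that check and of the fill, on a hypothetical two-element series:

```python
import pandas as pd

# Upper bound of the nanosecond-resolution Timestamp range
print(pd.Timestamp.max)

sentinel = pd.to_datetime("2230-1-1")

# Hypothetical end dates: one known, one missing (life sentence)
ends = pd.Series(pd.to_datetime(["1984-07-11", None]))
ends = ends.fillna(sentinel)
print(ends)
```

Using a far-future date rather than leaving `NaT` keeps later date arithmetic, such as days served, free of missing values.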


Turning to the admissions dataset for information on the sentence served.

In [9]:
admitted_subset = court_commit[["OFFENDER_NC_DOC_ID_NUMBER", "COMMITMENT_PREFIX", "COMMITTED_LAST_NAME", "COMMITTED_FIRST_NAME", "OFFENDER_ADMISSION/INTAKE_DATE", "MOST_SERIOUS_OFFENSE_CODE", "COMMITMENT_STATUS_DATE"]]
admitted_subset.head()

Unnamed: 0,OFFENDER_NC_DOC_ID_NUMBER,COMMITMENT_PREFIX,COMMITTED_LAST_NAME,COMMITTED_FIRST_NAME,OFFENDER_ADMISSION/INTAKE_DATE,MOST_SERIOUS_OFFENSE_CODE,COMMITMENT_STATUS_DATE
0,1,01,AAL ANUBIA,RACHELL,0001-01-01,,0001-01-01
1,3,01,AARHUS,STEVEN,0001-01-01,,0001-01-01
2,3,02,AARHUS,STEVEN,0001-01-01,DWI LEVEL 2,0001-01-01
3,4,AA,AARON,DAVID,1983-07-13,,1983-07-13
4,5,01,AARON,GENE,0001-01-01,,0001-01-01


In [10]:
admissions = admitted_subset.merge(sentence_subset, how="inner",
                                   left_on=['OFFENDER_NC_DOC_ID_NUMBER', 'COMMITMENT_PREFIX'],
                                   right_on=["INMATE_DOC_NUMBER", "INMATE_COMMITMENT_PREFIX"])

cols_to_use = admissions.columns.difference(["OFFENDER_NC_DOC_ID_NUMBER", "INMATE_COMMITMENT_PREFIX"])
admissions = admissions[cols_to_use]
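
The merge above joins on key columns that are named differently in the two tables, then drops one copy of each key. A small sketch of that pattern follows; the frames `left` and `right` are hypothetical and carry only a few of the real columns.

```python
import pandas as pd

left = pd.DataFrame({
    "OFFENDER_NC_DOC_ID_NUMBER": [4, 5, 6],
    "COMMITMENT_PREFIX": ["AA", "01", "AA"],
    "COMMITTED_LAST_NAME": ["AARON", "AARON", "AARON"],
})
right = pd.DataFrame({
    "INMATE_DOC_NUMBER": [4, 6],
    "INMATE_COMMITMENT_PREFIX": ["AA", "AA"],
    "SENTENCE_END": pd.to_datetime(["1984-07-11", "1973-03-28"]),
})

# An inner join keeps only commitments present in both tables
merged = left.merge(
    right, how="inner",
    left_on=["OFFENDER_NC_DOC_ID_NUMBER", "COMMITMENT_PREFIX"],
    right_on=["INMATE_DOC_NUMBER", "INMATE_COMMITMENT_PREFIX"],
)

# Index.difference drops the duplicate keys and returns the rest in sorted order
keep = merged.columns.difference(["OFFENDER_NC_DOC_ID_NUMBER", "INMATE_COMMITMENT_PREFIX"])
merged = merged[keep]
print(list(merged.columns))
```

Because `Index.difference` sorts its result, the remaining columns come out in alphabetical order, which is why the later code cells refer to columns by their sorted positions.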

In [11]:
# Converting the intake date into a datetime type
admissions['OFFENDER_ADMISSION/INTAKE_DATE'] = pd.to_datetime(admissions['OFFENDER_ADMISSION/INTAKE_DATE'], errors = "coerce")

Now `admissions` contains further information from the commitments table. As with the end dates above, if `SENTENCE_BEGIN_DATE_(FOR_MAX)` is missing, it is replaced with the intake date, which is usually very close.

In [12]:
start_dates = []

for tup in admissions.itertuples():
    begin_date = tup[8]
    intake_date = tup[7]
    
    if pd.isnull(begin_date): start_dates.append(intake_date)
    else: start_dates.append(begin_date)

admissions['SENTENCE_START'] = start_dates

In [13]:
# This takes care of all null dates for sentence end and sentence start
admissions[(admissions['SENTENCE_START'].isnull()) | (admissions['SENTENCE_END'].isnull())]

Unnamed: 0,COMMITMENT_PREFIX,COMMITMENT_STATUS_DATE,COMMITTED_FIRST_NAME,COMMITTED_LAST_NAME,INMATE_DOC_NUMBER,MOST_SERIOUS_OFFENSE_CODE,OFFENDER_ADMISSION/INTAKE_DATE,SENTENCE_BEGIN_DATE_(FOR_MAX),SENTENCE_END,SENTENCE_START


Now we would like to incorporate information about the inmates themselves from the inmates dataset.

In [14]:
cols = ['INMATE_DOC_NUMBER',
'INMATE_GENDER_CODE',
'INMATE_RACE_CODE',
'INMATE_BIRTH_DATE',
'INMATE_FACILITY_CODE',
'OLDEST_CONVICTION_DATE',
'TOTAL_SENTENCE_COUNT',
'MOST_SERIOUS_OFFNSE_CURR_INCAR',
'INMATE_IS_FELON/MISDEMEANANT',
'TOTAL_DISCIPLINE_INFRACTIONS',
'ESCAPE_HISTORY_FLAG',
'PRIOR_INCARCERATIONS_FLAG']

inmate_subset = inmates[cols]

all_info = admissions.merge(inmate_subset, how='inner', on="INMATE_DOC_NUMBER")

In [21]:
all_info.head()

Unnamed: 0,COMMITMENT_PREFIX,COMMITMENT_STATUS_DATE,COMMITTED_FIRST_NAME,COMMITTED_LAST_NAME,INMATE_DOC_NUMBER,MOST_SERIOUS_OFFENSE_CODE,OFFENDER_ADMISSION/INTAKE_DATE,SENTENCE_BEGIN_DATE_(FOR_MAX),SENTENCE_END,SENTENCE_START,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_FACILITY_CODE,OLDEST_CONVICTION_DATE,TOTAL_SENTENCE_COUNT,MOST_SERIOUS_OFFNSE_CURR_INCAR,INMATE_IS_FELON/MISDEMEANANT,TOTAL_DISCIPLINE_INFRACTIONS,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG
0,AA,1983-07-13,DAVID,AARON,4,,1983-07-13,1983-07-12,1984-07-11,1983-07-12,MALE,WHITE,1961-10-15,UNKNOWN AT CONVERSION,0001-01-01,0,,FELON,0,N,Y
1,AA,1973-03-28,GERALD,AARON,6,,1973-01-30,1973-01-30,1973-03-28,1973-01-30,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,N,Y
2,AB,1973-04-15,GERALD,AARON,6,,1973-04-15,1973-04-11,1975-08-18,1973-04-11,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,N,Y
3,AA,1990-04-23,JAMES,AARON,8,,1990-04-23,1990-04-09,1990-05-17,1990-04-09,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,N,Y
4,AB,1993-09-03,JAMES,AARON,8,,1993-09-03,1993-08-30,1994-01-26,1993-08-30,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,N,Y


In [15]:
del court_commit
del inmates
del sentence_compute

In [16]:
# Discarding the 223 inmates with unknown birthdays (needed to calculate age at time of release)

all_info['INMATE_BIRTH_DATE'] = pd.to_datetime(all_info['INMATE_BIRTH_DATE'], errors = "coerce")
all_info = all_info.loc[all_info['INMATE_BIRTH_DATE'].notnull()]

In [17]:
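# Sentence length in whole days
all_info['DAYS_SERVED'] = (all_info['SENTENCE_END'] - all_info['SENTENCE_START']).dt.days

# Age at release in whole years. 365.2425 days per year is the length np.timedelta64(1, 'Y') stood for;
# recent pandas versions no longer accept the 'Y' unit in timedelta division.
all_info['AGE_AT_RELEASE'] = ((all_info['SENTENCE_END'] - all_info['INMATE_BIRTH_DATE']).dt.days / 365.2425).astype(int)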
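
A small worked check of the two derived columns, using the dates of the first row shown earlier (inmate 4). This sketch uses the `.dt.days` accessor and assumes 365.2425 days per year; the frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "SENTENCE_START": pd.to_datetime(["1983-07-12"]),
    "SENTENCE_END": pd.to_datetime(["1984-07-11"]),
    "INMATE_BIRTH_DATE": pd.to_datetime(["1961-10-15"]),
})

# Whole days between sentence start and sentence end
toy["DAYS_SERVED"] = (toy["SENTENCE_END"] - toy["SENTENCE_START"]).dt.days

# Whole years between birth and release (assumed year length: 365.2425 days)
age_days = (toy["SENTENCE_END"] - toy["INMATE_BIRTH_DATE"]).dt.days
toy["AGE_AT_RELEASE"] = (age_days / 365.2425).astype(int)

print(toy[["DAYS_SERVED", "AGE_AT_RELEASE"]])
```

This gives 365 days served and an age at release of 22, in line with the first row of the `all_info` output.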


In [25]:
all_info.reset_index(drop=True, inplace=True)
all_info.head()

Unnamed: 0,COMMITMENT_PREFIX,COMMITMENT_STATUS_DATE,COMMITTED_FIRST_NAME,COMMITTED_LAST_NAME,INMATE_DOC_NUMBER,MOST_SERIOUS_OFFENSE_CODE,OFFENDER_ADMISSION/INTAKE_DATE,SENTENCE_BEGIN_DATE_(FOR_MAX),SENTENCE_END,SENTENCE_START,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_FACILITY_CODE,OLDEST_CONVICTION_DATE,TOTAL_SENTENCE_COUNT,MOST_SERIOUS_OFFNSE_CURR_INCAR,INMATE_IS_FELON/MISDEMEANANT,TOTAL_DISCIPLINE_INFRACTIONS,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG,DAYS_SERVED,AGE_AT_RELEASE
0,AA,1983-07-13,DAVID,AARON,4,,1983-07-13,1983-07-12,1984-07-11,1983-07-12,MALE,WHITE,1961-10-15,UNKNOWN AT CONVERSION,0001-01-01,0,,FELON,0,N,Y,365,22
1,AA,1973-03-28,GERALD,AARON,6,,1973-01-30,1973-01-30,1973-03-28,1973-01-30,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,N,Y,57,21
2,AB,1973-04-15,GERALD,AARON,6,,1973-04-15,1973-04-11,1975-08-18,1973-04-11,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,N,Y,859,24
3,AA,1990-04-23,JAMES,AARON,8,,1990-04-23,1990-04-09,1990-05-17,1990-04-09,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,N,Y,38,26
4,AB,1993-09-03,JAMES,AARON,8,,1993-09-03,1993-08-30,1994-01-26,1993-08-30,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,N,Y,149,30


In [26]:
# Looping over the dataframe to create the target "RECITIVATED": 1 if the inmate was imprisoned again within 3 years of the end of this term.
# Also converts the prior-incarceration and escape-history flags to 0/1 dummy variables, since I was already looping over the rows.

escapes = []
priors = []
recidivated = [0] * len(all_info)

prior_end_date = pd.to_datetime("1800-1-1")
prior_id = 0

for row in all_info.itertuples():
    index = row[0]
    ID = row[5]
    start_date = row[10]
    end_date = row[9]
    escape = row[20]
    prior = row[21]
    
    if ID == prior_id:
        if (start_date - prior_end_date).days <= 1095:
            # if this start date is within 3 years (1095 days) of the prior end date, the prior term gets the positive RECITIVATED flag
            recidivated[index - 1] = 1
                
    
    if escape == "Y": escapes.append(1)
    else: escapes.append(0)
    
    if prior == "Y": priors.append(1)
    else: priors.append(0)  
        
    prior_end_date = end_date
    prior_id = ID

In [27]:
all_info["ESCAPE_HISTORY_FLAG"] = escapes
all_info['PRIOR_INCARCERATIONS_FLAG'] = priors
all_info["RECITIVATED"] = recidivated
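
The target can also be computed without an explicit loop. The sketch below is an alternative formulation, not the method used above: it assumes that rows for the same inmate are adjacent and in chronological order, as in the output tables. The dates are taken from inmates 4 and 8; the frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "INMATE_DOC_NUMBER": [4, 8, 8, 8],
    "SENTENCE_START": pd.to_datetime(
        ["1983-07-12", "1990-04-09", "1993-08-30", "1995-01-02"]),
    "SENTENCE_END": pd.to_datetime(
        ["1984-07-11", "1990-05-17", "1994-01-26", "1995-09-14"]),
})

# Start date of the same inmate's next term (NaT for the inmate's last term)
next_start = toy.groupby("INMATE_DOC_NUMBER")["SENTENCE_START"].shift(-1)

# Days between the end of this term and the start of the next one
gap_days = (next_start - toy["SENTENCE_END"]).dt.days

# Comparisons with NaN are False, so a term with no successor gets 0
toy["RECITIVATED"] = (gap_days <= 1095).astype(int)
print(toy)
```

Only the second term of inmate 8 is flagged: the next term starts 341 days after it ends, while the gap after the first term is 1201 days.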

In [28]:
# Fraction of sentences followed by another imprisonment within 3 years
sum(recidivated)/len(all_info)

0.30464688913090626

In [18]:
# Cleaning the TOTAL_DISCIPLINE_INFRACTIONS column: the string codes '0-1' and '0-2' become 1 and 2 before casting to int
all_info.loc[all_info['TOTAL_DISCIPLINE_INFRACTIONS'] == '0-1', 'TOTAL_DISCIPLINE_INFRACTIONS'] = 1
all_info.loc[all_info['TOTAL_DISCIPLINE_INFRACTIONS'] == '0-2', 'TOTAL_DISCIPLINE_INFRACTIONS'] = 2
all_info["TOTAL_DISCIPLINE_INFRACTIONS"] = all_info['TOTAL_DISCIPLINE_INFRACTIONS'].astype(int)

In [30]:
all_info.head(10)

Unnamed: 0,COMMITMENT_PREFIX,COMMITMENT_STATUS_DATE,COMMITTED_FIRST_NAME,COMMITTED_LAST_NAME,INMATE_DOC_NUMBER,MOST_SERIOUS_OFFENSE_CODE,OFFENDER_ADMISSION/INTAKE_DATE,SENTENCE_BEGIN_DATE_(FOR_MAX),SENTENCE_END,SENTENCE_START,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_FACILITY_CODE,OLDEST_CONVICTION_DATE,TOTAL_SENTENCE_COUNT,MOST_SERIOUS_OFFNSE_CURR_INCAR,INMATE_IS_FELON/MISDEMEANANT,TOTAL_DISCIPLINE_INFRACTIONS,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG,DAYS_SERVED,AGE_AT_RELEASE,RECITIVATED
0,AA,1983-07-13,DAVID,AARON,4,,1983-07-13,1983-07-12,1984-07-11,1983-07-12,MALE,WHITE,1961-10-15,UNKNOWN AT CONVERSION,0001-01-01,0,,FELON,0,0,1,365,22,0
1,AA,1973-03-28,GERALD,AARON,6,,1973-01-30,1973-01-30,1973-03-28,1973-01-30,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,0,1,57,21,1
2,AB,1973-04-15,GERALD,AARON,6,,1973-04-15,1973-04-11,1975-08-18,1973-04-11,MALE,WHITE,1951-07-17,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,0,1,859,24,0
3,AA,1990-04-23,JAMES,AARON,8,,1990-04-23,1990-04-09,1990-05-17,1990-04-09,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,0,1,38,26,0
4,AB,1993-09-03,JAMES,AARON,8,,1993-09-03,1993-08-30,1994-01-26,1993-08-30,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,0,1,149,30,1
5,BA,1995-01-13,JAMES,AARON,8,,1995-01-13,1995-01-02,1995-09-14,1995-01-02,MALE,WHITE,1963-12-29,LOCATION UNKNOWN,1994-12-13,0,HABITUAL IMPAIRED DRIVING,FELON,0,0,1,255,31,0
6,AA,1977-03-17,KENNETH,AARON,10,,1975-06-20,1975-06-11,1977-03-17,1975-06-11,MALE,BLACK,1953-05-18,UNKNOWN AT CONVERSION,0001-01-01,0,,FELON,0,0,1,645,23,1
7,AB,1977-03-17,KENNETH,AARON,10,,1977-03-17,1975-06-11,1983-06-27,1975-06-11,MALE,BLACK,1953-05-18,UNKNOWN AT CONVERSION,0001-01-01,0,,FELON,0,0,1,2938,30,0
8,AA,1975-08-21,MOYER,AARON,14,,1975-08-21,1975-08-18,1976-07-06,1975-08-18,MALE,WHITE,1921-08-26,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,0,1,323,54,1
9,AB,1977-06-21,MOYER,AARON,14,,1977-06-21,1977-06-17,1978-01-23,1977-06-17,MALE,WHITE,1921-08-26,UNKNOWN AT CONVERSION,0001-01-01,0,,MISD.,0,0,1,220,56,0


Creating dummy variables for `MOST_SERIOUS_OFFENSE_CODE`, `INMATE_RACE_CODE`, `INMATE_GENDER_CODE`, `INMATE_FACILITY_CODE`, `MOST_SERIOUS_OFFNSE_CURR_INCAR`, and `INMATE_IS_FELON/MISDEMEANANT`.

For space considerations, I will drop many of the rarer categories, leaving roughly the top couple hundred for each variable.

In [35]:
with_dummies = pd.get_dummies(all_info, columns=['MOST_SERIOUS_OFFENSE_CODE', 'INMATE_RACE_CODE', 'INMATE_GENDER_CODE', 'INMATE_FACILITY_CODE', 'MOST_SERIOUS_OFFNSE_CURR_INCAR', 'INMATE_IS_FELON/MISDEMEANANT'])

In [60]:
# Finding rare categories (counts at or below each threshold) whose dummy columns will be dropped
offense_counts = all_info['MOST_SERIOUS_OFFENSE_CODE'].value_counts()
offenses_to_del2 = offense_counts[offense_counts <= 220]
facility_counts = all_info['INMATE_FACILITY_CODE'].value_counts()
facility_to_del2 = facility_counts[facility_counts <= 590]
curr_off_counts = all_info['MOST_SERIOUS_OFFNSE_CURR_INCAR'].value_counts()
curr_off_to_del2 = curr_off_counts[curr_off_counts <= 266]

In [73]:
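# more_offenses, more_facilities and more_curr_off are not defined anywhere in this notebook.
# Assumption: the rare-category Series computed above are the intended source; their index holds the category labels.
offenses = ['MOST_SERIOUS_OFFENSE_CODE_' + str(i) for i in offenses_to_del2.index]
facility = ['INMATE_FACILITY_CODE_' + str(i) for i in facility_to_del2.index]
curr_offenses = ['MOST_SERIOUS_OFFNSE_CURR_INCAR_' + str(i) for i in curr_off_to_del2.index]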

In [85]:
with_dummies.drop(columns = offenses + facility + curr_offenses, inplace=True)
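
The rare-category pruning can be illustrated end to end on a tiny example. This is a sketch with hypothetical data and a hypothetical threshold of 1; it shows that the dummy column names are built from the category labels, which are the index of the `value_counts()` result rather than its values.

```python
import pandas as pd

# Hypothetical data: facility "C" appears only once
toy = pd.DataFrame({
    "INMATE_FACILITY_CODE": ["A", "A", "A", "B", "B", "C"],
    "DAYS_SERVED": [365, 57, 859, 38, 149, 255],
})

dummies = pd.get_dummies(toy, columns=["INMATE_FACILITY_CODE"])

# Categories at or below the threshold are treated as rare
counts = toy["INMATE_FACILITY_CODE"].value_counts()
rare = counts[counts <= 1]

# The category labels live in the index of the counts Series
rare_cols = ["INMATE_FACILITY_CODE_" + str(label) for label in rare.index]
dummies = dummies.drop(columns=rare_cols)
print(list(dummies.columns))
```

Rows belonging to a dropped category keep all zeros in the remaining dummy columns for that variable, so rare categories are effectively pooled together.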

In [101]:
with_dummies.to_pickle("recid_data.pickle")
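
For use in another script, the exported file can be read back with `pd.read_pickle`, and the export can be switched to CSV with `to_csv`. A minimal round-trip sketch, using a hypothetical two-row frame and a temporary directory instead of the real output file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for with_dummies
toy = pd.DataFrame({"DAYS_SERVED": [365, 57], "RECITIVATED": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "recid_data.pickle")
    csv_path = os.path.join(tmp, "recid_data.csv")

    # Pickle keeps dtypes exactly; CSV is plain text and more portable
    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.equals(toy))
```

With the real dataset, datetime columns survive the pickle round trip unchanged, whereas a CSV stores them as text and they would need to be parsed again on import.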