# 1. Surrogate Model

***Your information:***
* Name:
* UBT ID:
* E-Mail:
<br>
I confirm the solution in this notebook is my own work.

**Background:**<br>
COMPAS is proprietary software, so we do not have access to the underlying algorithm. To better understand how the algorithm might work, you will build a surrogate model that is trained on historical data and produces risk scores similar to those of COMPAS.

**Objective:**<br>
Create a surrogate model for the COMPAS algorithm by learning how to compute the COMPAS risk scores for violent recidivism (`v_decile_score`) from the historical COMPAS cases we provide you in `compas_scoring.csv`. Make use of different data types (numerical features, categorical features, and text data) and justify why you selected the final surrogate model version to resemble the COMPAS algorithm. Your surrogate model will serve as the basis for subsequent analysis tasks.

**Deliverables:**<br>
1. Explore and plot relevant data characteristics.
2. Highlight potential challenges for building a surrogate model inherent in the data.
3. Preprocess the data to increase performance of the surrogate model.
4. Train two sub models, one on the machine-readable data and one on textual features from `c_charge_desc`, and evaluate their performance separately.
5. Combine the sub models into one meta model and evaluate its performance.
6. Explain how you make sure to avoid data leakage.
7. Explain whether your model is a suitable surrogate model for the subsequent analysis tasks.
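
One possible way to approach deliverables 4 to 6 is sketched below. It is a minimal illustration under stated assumptions, not the expected solution: the toy data frame, the choice of `Ridge` and `TfidfVectorizer`, and all variable names are assumptions made for this example. The point it shows is that the meta model is trained on out-of-fold predictions of the sub models, so no sub model ever scores a row it was fitted on, and the held-out test split is touched only once at the end.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for compas_scoring.csv (made-up rows, NOT the real data).
rng = np.random.default_rng(0)
n = 200
charges = np.array(["Battery", "Possession of Cocaine",
                    "Grand Theft", "Aggravated Assault"])
toy = pd.DataFrame({
    "priors_count": rng.integers(0, 15, size=n),
    "juv_fel_count": rng.integers(0, 3, size=n),
    "c_charge_desc": rng.choice(charges, size=n),
})
# Hypothetical target that loosely depends on priors_count, clipped to 1..10.
toy["v_decile_score"] = np.clip(
    1 + toy["priors_count"] // 2 + rng.integers(0, 2, size=n), 1, 10
)

# Split first, so every later fitting step only ever sees the training part.
train, test = train_test_split(toy, test_size=0.25, random_state=0)
num_cols = ["priors_count", "juv_fel_count"]
y_train = train["v_decile_score"]

# Sub model 1: machine-readable features. Sub model 2: charge description text.
tab_model = Ridge(alpha=1.0)
text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))

# Out-of-fold predictions: each training row is predicted by a model
# that was fitted without that row.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_tab = cross_val_predict(tab_model, train[num_cols], y_train, cv=cv)
oof_text = cross_val_predict(text_model, train["c_charge_desc"], y_train, cv=cv)

# Meta model learns how to weight the two sub model predictions.
meta_X_train = np.column_stack([oof_tab, oof_text])
meta_model = Ridge(alpha=1.0).fit(meta_X_train, y_train)

# Refit the sub models on the full training split, then score the test split.
tab_model.fit(train[num_cols], y_train)
text_model.fit(train["c_charge_desc"], y_train)
meta_X_test = np.column_stack([
    tab_model.predict(test[num_cols]),
    text_model.predict(test["c_charge_desc"]),
])
test_pred = meta_model.predict(meta_X_test)
mae = mean_absolute_error(test["v_decile_score"], test_pred)
print(f"Meta model MAE on held-out toy data: {mae:.2f}")
```

Wrapping the vectorizer and the regressor in one pipeline means the TF-IDF vocabulary is re-learned inside each fold, which is one way to keep test information out of preprocessing. Any preprocessing step that learns from the data (imputation, scaling, encoding) would need the same treatment.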

## Data Description

In [32]:
import pandas as pd
dataset = pd.read_csv('data/compas_scoring.csv')
dataset.head()

Unnamed: 0,id,sex,dob,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,c_jail_in,c_jail_out,c_offense_date,c_arrest_date,c_charge_degree,screening_date,days_b_screening_arrest,c_days_from_compas,in_custody,out_custody,c_charge_desc,v_decile_score
0,1,Male,1947-04-18,Other,0,0,0,0,2013-08-13 06:03:42,2013-08-14 05:41:20,2013-08-13,,F,2013-08-14,-1.0,1.0,2014-07-07,2014-07-14,Aggravated Assault w/Firearm,1
1,3,Male,1982-01-22,African-American,0,0,0,0,2013-01-26 03:45:27,2013-02-05 05:36:53,2013-01-26,,F,2013-01-27,-1.0,1.0,2013-01-26,2013-02-05,Felony Battery w/Prior Convict,1
2,4,Male,1991-05-14,African-American,0,0,1,4,2013-04-13 04:58:34,2013-04-14 07:02:04,2013-04-13,,F,2013-04-14,-1.0,1.0,2013-06-16,2013-06-16,Possession of Cocaine,3
3,5,Male,1993-01-21,African-American,0,1,0,1,,,2013-01-12,,F,2013-01-13,,1.0,,,Possession of Cannabis,6
4,6,Male,1973-01-22,Other,0,0,0,2,,,,2013-01-09,F,2013-03-26,,76.0,,,arrest case no charge,1
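
The preview already shows empty date fields (for example `c_jail_in` and `c_jail_out` in the last two rows). The following sketch illustrates one way to parse the date columns, count missing values, and derive an age feature. It re-creates a few columns of the five preview rows inline instead of reading the file; the derived column names `age_at_screening` and `jail_days` are assumptions introduced for this example.

```python
import io
import pandas as pd

# A subset of the columns from the five preview rows shown above.
sample_csv = """id,dob,c_jail_in,c_jail_out,screening_date,c_charge_desc,v_decile_score
1,1947-04-18,2013-08-13 06:03:42,2013-08-14 05:41:20,2013-08-14,Aggravated Assault w/Firearm,1
3,1982-01-22,2013-01-26 03:45:27,2013-02-05 05:36:53,2013-01-27,Felony Battery w/Prior Convict,1
4,1991-05-14,2013-04-13 04:58:34,2013-04-14 07:02:04,2013-04-14,Possession of Cocaine,3
5,1993-01-21,,,2013-01-13,Possession of Cannabis,6
6,1973-01-22,,,2013-03-26,arrest case no charge,1
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Convert date strings to datetime; empty fields become NaT.
date_cols = ["dob", "c_jail_in", "c_jail_out", "screening_date"]
for col in date_cols:
    sample[col] = pd.to_datetime(sample[col])

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Approximate age in whole years at the time of screening (hypothetical feature).
age_days = (sample["screening_date"] - sample["dob"]).dt.days
sample["age_at_screening"] = (age_days / 365.25).astype(int)

# Length of the jail stay in days; stays NaN where either date is missing.
jail_duration = sample["c_jail_out"] - sample["c_jail_in"]
sample["jail_days"] = jail_duration.dt.total_seconds() / 86400

print(sample[["id", "age_at_screening", "jail_days"]])
```

Using the screening date rather than today's date as the reference point keeps the age feature tied to what was known when the score was computed.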


In [33]:
# --- Features to resemble COMPAS screening model ---

# Metadata and Demographics (Identification and basic data)

demographics = [
    'id', # Unique identifier for each defendant
    'sex', # binary sex category
    'dob', # date of birth
    'race' # race category
]

# Prior Criminal History (Juvenile and adult priors)
criminal_history = [
    'juv_fel_count', # juvenile felony count
    'juv_misd_count', # juvenile misdemeanor count
    'juv_other_count', # juvenile other offenses count (e.g., status offenses)
    'priors_count' # total number of prior offenses
]

# Current Case Information (Details of the offense that led to screening)
current_case = [
    'c_offense_date', # date of the offense
    'c_arrest_date', # date of arrest for current offense
    'c_charge_degree', # charge degree for current offense
    'screening_date', # date of COMPAS screening
    'c_jail_in', # date of jail intake for current offense
    'c_jail_out', # date of jail release for current offense
]

# Timing and Custody Info (Time-related features and custody status at screening)
custody = [
    'days_b_screening_arrest', # days between screening and arrest (negative if the screening took place after the arrest)
    'c_days_from_compas', # days from COMPAS screening to current offense
    'in_custody', # date the defendant entered custody
    'out_custody' # date the defendant was released from custody
]

# Textual feature: Charge Description
charge_description = [
    'c_charge_desc' # textual description of the current charge
]

# --- Target Variable ---
# COMPAS computes raw risk scores (between 0 and 1) and maps them to relative decile scores from 1 to 10, which hides the raw scores
# Decile 1 represents the lowest 10% of all risk scores in a base population
# As we have access to neither the raw risk scores nor the base population, we use the decile scores as our target variable
y_compas = [
    'v_decile_score' # COMPAS risk score for violent recidivism from 1 (low risk) to 10 (high risk)
]
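
To make the decile mapping described in the comments above concrete, here is a small sketch. It assumes synthetic raw scores drawn from a beta distribution and uses the sample itself as the base population. Both are assumptions for illustration only, since neither the real raw scores nor the real base population are available.

```python
import numpy as np
import pandas as pd

# Hypothetical raw risk scores in (0, 1); NOT actual COMPAS output.
rng = np.random.default_rng(42)
raw_scores = pd.Series(rng.beta(2, 5, size=1000), name="raw_score")

# Cut the scores into ten equally populated bins and label them 1..10.
# Decile 1 then holds the lowest 10% of the scores in this sample.
deciles = pd.qcut(raw_scores, q=10, labels=list(range(1, 11))).astype(int)

# Each decile contains the same number of cases by construction.
counts = deciles.value_counts().sort_index()
print(counts)

# The mapping is monotone: a higher raw score never gets a lower decile.
order = raw_scores.sort_values().index
is_monotone = deciles.loc[order].is_monotonic_increasing
print("Monotone mapping:", is_monotone)
```

Because the mapping only preserves rank within the base population, very different raw scores can share a decile, which is one reason a surrogate trained on deciles cannot recover the raw scores exactly.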

## Data Exploration

In [34]:
# --- YOUR CODE ---

## ...

In [35]:
# ...