# 02 – Preprocessing Pipeline

In this notebook, we implement the full preprocessing strategy for the COMPAS dataset used in our case study on risk assessment and bias in automated systems. The workflow handles missing values, encodes categorical features, scales numerical values, and deduplicates records, all while adhering to the modeling constraints from Dressel & Farid (2018). Column and row drops and deduplication run in pandas before the scikit-learn pipeline; imputation, encoding, and scaling run inside it.

We aim to:
- Keep the raw dataset unchanged (so it stays available for exploration).
- Preprocess the cleaned dataset using a scikit-learn `Pipeline`.
- Save the final pipeline object for later reuse during model training and evaluation.

## Imports and Configurations

In [52]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os

pd.set_option("display.max_columns", 100)

df = pd.read_csv("../data/cox-violent-parsed.csv")
print(f"Raw dataset shape: {df.shape}")
df.head()


Raw dataset shape: (18316, 52)


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,juv_fel_count,decile_score,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_days_from_compas,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event
0,1.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,0,1,0,0,0,-1.0,13/08/2013 6:03,14/08/2013 5:41,13011352CF10A,13/08/2013,,1.0,(F3),Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,0,327,0
1,2.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,0,1,0,0,0,-1.0,13/08/2013 6:03,14/08/2013 5:41,13011352CF10A,13/08/2013,,1.0,(F3),Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,334,961,0
2,3.0,michael ryan,michael,ryan,31/12/2014,Male,06/02/1985,31,25 - 45,Caucasian,0,5,0,0,0,,,,,,,,,,-1,,,,,,,,,0,,,,,Risk of Recidivism,5,Medium,31/12/2014,Risk of Violence,2,Low,31/12/2014,30/12/2014,03/01/2015,0,3,457,0
3,4.0,kevon dixon,kevon,dixon,27/01/2013,Male,22/01/1982,34,25 - 45,African-American,0,3,0,0,0,-1.0,26/01/2013 3:45,05/02/2013 5:36,13001275CF10A,26/01/2013,,1.0,(F3),Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,05/07/2013,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),05/07/2013,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,27/01/2013,Risk of Violence,1,Low,27/01/2013,26/01/2013,05/02/2013,0,9,159,1
4,5.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,0,4,0,1,4,-1.0,13/04/2013 4:58,14/04/2013 7:02,13005330CF10A,13/04/2013,,1.0,(F3),Possession of Cocaine,1,13011511MM10A,(M1),0.0,16/06/2013,Driving Under The Influence,16/06/2013,16/06/2013,,0,,,,,Risk of Recidivism,4,Low,14/04/2013,Risk of Violence,3,Low,14/04/2013,16/06/2013,16/06/2013,4,0,63,0


## Handle Data Issues (outside pipeline)

- Filter Unknown Targets & Track Initial Size
- Handle missing values


In [53]:
# Filter unknown target values (-1 = unknown recidivism)
df = df[df["is_recid"] != -1].copy()
print(f"Dataset shape after removing unknown 'is_recid': {df.shape}")

# Separate target and features
y = df["is_recid"]
X = df.drop(columns=["is_recid"])


Dataset shape after removing unknown 'is_recid': (17496, 52)


## Handle Missing Values outside the pipeline

We apply threshold-based missing value strategies outside the sklearn pipeline:

| Missing ratio                       | Action                    |
|-------------------------------------|---------------------------|
| > 15%                               | Drop feature              |
| 0–15% (excluding selected features) | Impute later via pipeline |
| < 5% in selected features           | Drop rows                 |


In [54]:
# Define selected modeling features that we handle differently
selected_modeling_features = [
    "age", "sex", "juv_misd_count", "juv_fel_count",
    "priors_count", "c_charge_degree", "c_charge_desc"
]

# Store full column list and initial row count
initial_cols = df.columns.tolist()
initial_row_count = len(df)

# Calculate missing value ratios
missing_ratio = df.isna().mean()

# Drop columns with >15% missing
cols_to_drop = missing_ratio[missing_ratio > 0.15].index.tolist()
df.drop(columns=cols_to_drop, inplace=True)

# For each selected modeling feature with <5% missing values, drop the rows where it is missing
rows_before = len(df)
for col in selected_modeling_features:
    if col in df.columns and df[col].isna().mean() < 0.05:
        df = df[df[col].notna()]
rows_after = len(df)
dropped_rows = rows_before - rows_after

# Summary print
print("=== Missing Value Handling Summary ===")
print(f"- Dropped {len(cols_to_drop)} features with >15% missing values: {cols_to_drop}")
print(f"- Dropped {dropped_rows} rows due to <5% missing values in selected modeling features")
print(f"- Remaining rows: {len(df)}")

=== Missing Value Handling Summary ===
- Dropped 15 features with >15% missing values: ['id', 'c_offense_date', 'c_arrest_date', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc']
- Dropped 62 rows due to <5% missing values in selected modeling features
- Remaining rows: 17434
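
The threshold rules applied in the cell above can also be written as a small helper that only reports which action applies to which column. This is a minimal sketch on a hypothetical toy frame: the function name `plan_missing_handling`, its parameter names, and the toy values are assumptions for illustration, while the 15% and 5% cutoffs mirror the code cell.

```python
import numpy as np
import pandas as pd


def plan_missing_handling(frame, selected, drop_col_above=0.15, drop_row_below=0.05):
    """Return (columns to drop, columns whose missing rows are dropped, columns left for imputation)."""
    ratio = frame.isna().mean()
    # Columns above the upper threshold are dropped entirely
    drop_cols = ratio[ratio > drop_col_above].index.tolist()
    kept = [c for c in frame.columns if c not in drop_cols]
    # Selected features with a small share of missing values lose the affected rows
    drop_rows_for = [c for c in kept if c in selected and 0 < ratio[c] < drop_row_below]
    # Every other column that still has gaps is left for the pipeline imputer
    impute_later = [c for c in kept if ratio[c] > 0 and c not in drop_rows_for]
    return drop_cols, drop_rows_for, impute_later


# Hypothetical toy frame with 100 rows
toy = pd.DataFrame({
    "age": [30.0] * 98 + [np.nan] * 2,                  # 2% missing, selected feature
    "r_jail_in": [np.nan] * 60 + ["x"] * 40,            # 60% missing
    "c_days_from_compas": [1.0] * 90 + [np.nan] * 10,   # 10% missing, not selected
    "sex": ["Male"] * 100,                              # complete
})

drop_cols, drop_rows_for, impute_later = plan_missing_handling(toy, selected=["age", "sex"])
print(drop_cols)      # ['r_jail_in']
print(drop_rows_for)  # ['age']
print(impute_later)   # ['c_days_from_compas']
```

Separating the plan from its execution makes it easy to inspect the decisions before any rows or columns are actually removed.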


## Deduplication Based on Selected Modeling Features

In [55]:
# Add target back for deduplication logic
X["is_recid"] = y.loc[X.index]

dedup_cols = selected_modeling_features + ["name", "dob", "is_recid"]
dedup_subset = X[dedup_cols].copy()

duplicates = dedup_subset.duplicated(keep="first")
print(f"Found {duplicates.sum()} partial duplicates — these will be removed.")

X = X[~duplicates].copy()
print(f"Dataset shape after deduplication: {X.shape}")


Found 7165 partial duplicates — these will be removed.
Dataset shape after deduplication: (10331, 52)
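
To illustrate what "partial duplicate" means here, the sketch below uses a hypothetical toy frame in which two rows agree on every column in the subset but differ in a column outside it (similar to the first two rows of the raw data, which differ only in `start` and `end`). The names, dates, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy records: rows 0 and 1 describe the same person and outcome
# and differ only in "start", which is not part of the deduplication subset.
toy = pd.DataFrame({
    "name": ["a b", "a b", "c d"],
    "dob": ["01/01/1980", "01/01/1980", "02/02/1990"],
    "priors_count": [0, 0, 2],
    "is_recid": [0, 0, 1],
    "start": [0, 334, 9],
})
subset = ["name", "dob", "priors_count", "is_recid"]

full_dupes = toy.duplicated(keep="first")                    # compares all columns
partial_dupes = toy.duplicated(subset=subset, keep="first")  # compares only the subset
deduped = toy[~partial_dupes]

print(full_dupes.sum())     # 0
print(partial_dupes.sum())  # 1
print(len(deduped))         # 2
```

With `keep="first"`, the first occurrence of each group is retained, so the surviving row carries the `start` value of the earliest record.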


## Preprocessing Pipeline (Remaining Missing Values, Encoding, Scaling)

All remaining preprocessing steps are included in the sklearn pipeline:
- Numerical columns: imputed with **mean** and scaled with **StandardScaler**
- Categorical columns: imputed with **most_frequent** and encoded with **OrdinalEncoder**

In [None]:
# Identify feature types after dropping
num_features = df.select_dtypes(include=["number"]).drop(columns=["is_recid"]).columns.tolist()
cat_features = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Pipeline steps
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])

# Full preprocessing pipeline
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

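# Example fit on the cleaned frame; separate names keep the deduplicated X intact
X_fit = df.drop(columns="is_recid")
y_fit = df["is_recid"]
preprocessor.fit(X_fit)

# (Optional) Save to file (joblib and os are imported at the top of the notebook)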
os.makedirs("../models", exist_ok=True)
joblib.dump(preprocessor, "../models/pipeline_preprocessing.pkl")
print("Preprocessing pipeline saved.")
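
As a rough illustration of how a saved preprocessor of this kind behaves when it is reloaded, the sketch below fits the same two sub-pipelines on a hypothetical two-column toy frame, round-trips the object through `joblib`, and transforms a row containing a category that was never seen during fitting. The toy data, the temporary file name, and the variable names are assumptions and not part of the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical training data with one missing value per column
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "sex": ["Male", "Female", "Male", np.nan],
})
# New row whose category ("Other") did not appear during fitting
new = pd.DataFrame({"age": [30.0], "sex": ["Other"]})

toy_num = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
toy_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
toy_preprocessor = ColumnTransformer([
    ("num", toy_num, ["age"]),
    ("cat", toy_cat, ["sex"]),
])
toy_preprocessor.fit(train)

# Round-trip through joblib in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_preprocessor.pkl")
    joblib.dump(toy_preprocessor, path)
    reloaded = joblib.load(path)

encoded = reloaded.transform(new)
print(encoded)  # age 30 equals the training mean, so it scales to 0; "Other" maps to -1
```

The reloaded object applies the statistics learned at fit time (mean, standard deviation, category list), which is why the unseen category falls back to `unknown_value=-1` instead of raising an error.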


## Numerical Feature Summary

In [56]:
X[num_features].describe().T

| feature                 | count   | mean      | std        | min    | 25%  | 50%  | 75%  | max    |
|-------------------------|---------|-----------|------------|--------|------|------|------|--------|
| age                     | 10331.0 | 35.036492 | 11.959852  | 18.0   | 25.0 | 32.0 | 43.0 | 96.0   |
| juv_fel_count           | 10331.0 | 0.065047  | 0.465279   | 0.0    | 0.0  | 0.0  | 0.0  | 20.0   |
| decile_score            | 10331.0 | 4.403736  | 2.871312   | -1.0   | 2.0  | 4.0  | 7.0  | 10.0   |
| juv_misd_count          | 10331.0 | 0.080341  | 0.46497    | 0.0    | 0.0  | 0.0  | 0.0  | 13.0   |
| juv_other_count         | 10331.0 | 0.100571  | 0.490619   | 0.0    | 0.0  | 0.0  | 0.0  | 17.0   |
| priors_count            | 10331.0 | 3.279934  | 4.757207   | 0.0    | 0.0  | 1.0  | 4.0  | 43.0   |
| days_b_screening_arrest | 9902.0  | -0.654312 | 73.806028  | -597.0 | -1.0 | -1.0 | -1.0 | 1057.0 |
| c_days_from_compas      | 10308.0 | 64.506985 | 348.634775 | 0.0    | 1.0  | 1.0  | 2.0  | 9485.0 |
| is_violent_recid        | 10331.0 | 0.079276  | 0.270182   | 0.0    | 0.0  | 0.0  | 0.0  | 1.0    |
| decile_score.1          | 10331.0 | 4.403736  | 2.871312   | -1.0   | 2.0  | 4.0  | 7.0  | 10.0   |


Finally, save the preprocessed data to a new .csv file:

In [58]:
# === Save Preprocessed Dataset (Final X with target) ===
# Recreate target column aligned with cleaned dataset
df_final_cleaned = X.copy()

# Reload original full dataset
df_orig = pd.read_csv("../data/cox-violent-parsed.csv")

# Select the rows of the original data that survive in X, matched by index label
# (this relies on the index never having been reset since loading)
df_matched = df_orig.loc[X.index, :].copy()
df_final_cleaned["is_recid"] = df_matched["is_recid"].values

# Save to disk
df_final_cleaned.to_csv("../data/cox-violent-preprocessed.csv", index=False)
print("Preprocessed dataset saved as: '../data/cox-violent-preprocessed.csv'")
print(f"Final shape: {df_final_cleaned.shape}")


Preprocessed dataset saved as: '../data/cox-violent-preprocessed.csv'
Final shape: (10331, 52)
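
The save step relies on index labels surviving every filter, so that rows of the original frame can be looked up with `.loc`. The sketch below shows that behaviour on a hypothetical four-row frame; the values and variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the raw data, with the default RangeIndex 0..3
orig = pd.DataFrame({"is_recid": [0, -1, 1, 0], "age": [69, 31, 34, 24]})

# Boolean filtering keeps the original labels (0, 2, 3); nothing is renumbered
filtered = orig[orig["is_recid"] != -1]
features = filtered.drop(columns="is_recid")

# Look the target up again by label, then attach it positionally
target = orig.loc[features.index, "is_recid"]
restored = features.assign(is_recid=target.values)
print(restored)
```

If `reset_index(drop=True)` were called anywhere between loading and saving, the labels would no longer point at the same rows of the original file, and this lookup would silently attach the wrong targets.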
