## Implementation of a Preprocessing Pipeline

This notebook constructs a fully reproducible scikit-learn pipeline to preprocess the COMPAS dataset before model training. The pipeline will be stored and reused for model evaluation and fairness auditing.

Key preprocessing tasks include:
- Filtering unknown targets
- Handling missing values
- Feature encoding
- Feature scaling
- Deduplication strategy
- Statistical summary of final dataset



Load Dataset

In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the raw dataset
df = pd.read_csv("../data/cox-violent-parsed.csv")

# Quick look
print(f"Initial shape: {df.shape}")
df.head()


Initial shape: (18316, 52)


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event
0,1.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,...,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,0,327,0
1,2.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,...,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,334,961,0
2,3.0,michael ryan,michael,ryan,31/12/2014,Male,06/02/1985,31,25 - 45,Caucasian,...,Risk of Violence,2,Low,31/12/2014,30/12/2014,03/01/2015,0,3,457,0
3,4.0,kevon dixon,kevon,dixon,27/01/2013,Male,22/01/1982,34,25 - 45,African-American,...,Risk of Violence,1,Low,27/01/2013,26/01/2013,05/02/2013,0,9,159,1
4,5.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,16/06/2013,16/06/2013,4,0,63,0


Step 1: Filter Unknown Targets (is_recid == -1)

In [70]:
df = df[df["is_recid"] != -1]
print(f"After filtering unknown targets: {df.shape}")

After filtering unknown targets: (17496, 52)


Step 2: Handling Missing Values: 
- Drop features with >20% Missing Values
- Impute features with 0%-20% Missing Values BUT: With exception of selected 7 features, "c_charge_degree", "c_charge_desc", here we only drop rows of features with <5% Missing Values

In [78]:
# Separate target and features
y = df["is_recid"]
X = df.drop(columns=["is_recid"])

# Temporary selection of 7 modeling features
selected_modeling_features = [
    "age", "sex", "juv_misd_count", "juv_fel_count",
    "priors_count", "c_charge_degree", "c_charge_desc"
]

# Separate numerical and categorical columns
num_cols = X.select_dtypes(include=["number"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Step 1: Drop features with >20% missing values
missing_ratio = X.isnull().mean()
features_to_drop = missing_ratio[missing_ratio > 0.20].index.tolist()
X.drop(columns=features_to_drop, inplace=True)

# Step 2: Impute features with 0–20% missing values (except selected modeling features <5%)
features_to_impute = missing_ratio[(missing_ratio > 0) & (missing_ratio <= 0.20)].index.tolist()

imputed_features = []
for col in features_to_impute:
    if col in selected_modeling_features and missing_ratio[col] < 0.05:
        continue  # Drop these rows later
    elif col in num_cols:
        mean_val = X[col].mean()
        n_missing = X[col].isna().sum()
        X[col] = X[col].fillna(mean_val)
        imputed_features.append((col, n_missing, "mean"))
    elif col in cat_cols:
        n_missing = X[col].isna().sum()
        hot_deck_sample = X[col].dropna().sample(n_missing, replace=True, random_state=42).values
        X.loc[X[col].isna(), col] = hot_deck_sample
        imputed_features.append((col, n_missing, "hot-deck"))

# Step 3: Drop rows for selected modeling features with <5% missing values
rows_before_dropping = len(X)
for col in selected_modeling_features:
    if col in X.columns and missing_ratio[col] < 0.05 and X[col].isna().sum() > 0:
        X = X[X[col].notna()]
rows_after_dropping = len(X)

# Final check for remaining missing values
missing_remaining = X.isnull().sum()
missing_remaining = missing_remaining[missing_remaining > 0]

# Final check for any remaining missing values
missing_summary = X.isnull().sum()
missing_summary = missing_summary[missing_summary > 0]

# Output summaries
imputation_summary = []
imputation_summary = pd.DataFrame(imputed_features, columns=["Feature", "Missing_Count", "Imputation_Method"])
dropped_row_count = rows_before_dropping - rows_after_dropping

# # Output Block
print("\n=== Imputation Summary ===")
if not imputation_summary.empty:
    for row in imputation_summary.itertuples(index=False):
        print(f"- {row.Feature}: {row.Missing_Count} values → {row.Imputation_Method}")
else:
    print("No features were imputed.")

print("\n=== Summary ===")
print(f"- Dropped {len(features_to_drop)} features with >20% missing values.")
print(f"- Imputed {len(imputation_summary)} features (with mean or hot-deck).")
print(f"- Dropped {dropped_row_count} rows due to <5% missing values in selected modeling features.")
print(f"- Final number of rows retained: {len(X)}")
print("\n=== Remaining Missing Values Check ===")
if X.isnull().sum().sum() == 0:
    print("No missing values remain in the dataset.")
else:
    print("There are still missing values.")




=== Imputation Summary ===
- days_b_screening_arrest: 478 values → mean
- c_jail_in: 478 values → hot-deck
- c_jail_out: 478 values → hot-deck
- c_case_number: 48 values → hot-deck
- c_offense_date: 3132 values → hot-deck
- c_days_from_compas: 48 values → mean
- score_text: 17 values → hot-deck
- v_score_text: 5 values → hot-deck
- in_custody: 326 values → hot-deck
- out_custody: 326 values → hot-deck

=== Summary ===
- Dropped 14 features with >20% missing values.
- Imputed 10 features (with mean or hot-deck).
- Dropped 62 rows due to <5% missing values in selected modeling features.
- Final number of rows retained: 17434

=== Remaining Missing Values Check ===
No missing values remain in the dataset.


Step 3: Feature Encoding 

In [77]:
# Categorical Feature Encoding using OrdinalEncoder

# --- Identify all categorical columns ---
all_categorical_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Exclude target column if still present
if "is_recid" in all_categorical_cols:
    all_categorical_cols.remove("is_recid")

# Print how many categorical columns we found
print(f"\n=== Categorical Encoding Summary ===")
print(f"Encoding {len(all_categorical_cols)} categorical features using OrdinalEncoder:")

# Apply Ordinal Encoding
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X[all_categorical_cols] = encoder.fit_transform(X[all_categorical_cols])

# Show feature names
for col in all_categorical_cols:
    print(f"- {col}")





=== Categorical Encoding Summary ===
Encoding 0 categorical features using OrdinalEncoder:


Step 4: Construct Full Pipeline

In [79]:
# === Step 4: Construct Full Preprocessing Pipeline ===

# 1. Identify numeric and categorical features again from processed X
num_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_features = X.select_dtypes(include=["object", "category"]).columns.tolist()

# Sanity check: Remove features already encoded manually
cat_features = [col for col in cat_features if col in X.columns and X[col].dtype == "object"]
num_features = [col for col in num_features if col in X.columns and X[col].dtype in ["int64", "float64"]]

# 2. Create preprocessing pipeline steps (note: imputation already handled, so we skip that)
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([
        ("scaler", StandardScaler())
    ]), num_features),
    
    ("cat", Pipeline([
        ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
    ]), cat_features)
])

# Apply the pipeline (fit_transform for training data)
X_processed = preprocessor.fit_transform(X)

# Optional: Save the pipeline for reuse
import joblib
joblib.dump(preprocessor, "../models/preprocessing_pipeline.pkl")

# 5. Final printout
print("Preprocessing pipeline constructed and saved.")
print(f"Processed feature matrix shape: {X_processed.shape}")



Preprocessing pipeline constructed and saved.
Processed feature matrix shape: (17434, 37)


Step 4: Deduplication – Preparation
Remove duplicated rows based on all columns (exact duplicates).
And, after feature selection (e.g., the 7 COMPAS features), we will check for duplicates and inspect what differs between them.


In [84]:
# === Step 4: Deduplication – Preparation ===
print("=== Deduplication ===")

# 4.1 Remove exact duplicates across all columns
initial_rows = len(df)
df_dedup = df.drop_duplicates()
exact_duplicates_removed = initial_rows - len(df_dedup)
print(f"Removed {exact_duplicates_removed} exact duplicate rows.")

# 4.2 Remove duplicates based on selected modeling features + target
selected_features = ["name", "dob",
    "age", "sex", "juv_misd_count", "juv_fel_count",
    "priors_count", "c_charge_degree", "c_charge_desc", "is_recid"
]

missing_feats = [feat for feat in selected_features if feat not in df_dedup.columns]
if missing_feats:
    print(f"Skipping modeling-feature deduplication. Missing features: {missing_feats}")
    df_final = df_dedup.copy()
else:
    duplicated_mask = df_dedup.duplicated(subset=selected_features, keep='first')
    duplicated_rows = df_dedup[duplicated_mask]

    print(f"Found {duplicated_rows.shape[0]} duplicate rows based on selected features (including target).")

    if not duplicated_rows.empty:
        print("\n Sample duplicate group (first 3 duplicates):")
        for idx in duplicated_rows.head(3).index:
            match = df_dedup.loc[(df_dedup[selected_features] == df_dedup.loc[idx, selected_features]).all(axis=1)]
            display(match)

    df_final = df_dedup.drop_duplicates(subset=selected_features, keep="first")
    partial_dupes_removed = len(df_dedup) - len(df_final)
    print(f"Removed {partial_dupes_removed} partial duplicates based on modeling features.")

# Final reassignment to X/y after deduplication
X = df_final.drop("is_recid", axis=1)
y = df_final["is_recid"]

# Final summary
print(f"\nFinal number of rows after deduplication: {len(df_final)}")



=== Deduplication ===
Removed 0 exact duplicate rows.
Found 7165 duplicate rows based on selected features (including target).

 Sample duplicate group (first 3 duplicates):


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event
0,1.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,...,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,0,327,0
1,2.0,miguel hernandez,miguel,hernandez,14/08/2013,Male,18/04/1947,69,Greater than 45,Other,...,Risk of Violence,1,Low,14/08/2013,07/07/2014,14/07/2014,0,334,961,0


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event
4,5.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,16/06/2013,16/06/2013,4,0,63,0
5,6.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,30/07/2013,08/11/2013,4,63,107,0
6,7.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,27/03/2014,02/05/2014,4,208,347,0
7,8.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,08/01/2016,09/01/2016,4,383,999,0
8,9.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,08/01/2016,09/01/2016,4,1000,1083,0


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event
4,5.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,16/06/2013,16/06/2013,4,0,63,0
5,6.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,30/07/2013,08/11/2013,4,63,107,0
6,7.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,27/03/2014,02/05/2014,4,208,347,0
7,8.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,08/01/2016,09/01/2016,4,383,999,0
8,9.0,ed philo,ed,philo,14/04/2013,Male,14/05/1991,24,Less than 25,African-American,...,Risk of Violence,3,Low,14/04/2013,08/01/2016,09/01/2016,4,1000,1083,0


Removed 7165 partial duplicates based on modeling features.

Final number of rows after deduplication: 10331


Step 10: Statistical Summary of Numerical Features

In [85]:
print("Statistical Summary of Processed Numerical Features:")
X_processed_df[num_impute].describe()

Statistical Summary of Processed Numerical Features:


Unnamed: 0,age,juv_fel_count,decile_score,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_days_from_compas,is_recid,is_violent_recid,decile_score.1,v_decile_score,priors_count.1,start,end,event
count,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0,17496.0
mean,-1.283331e-16,-3.2489380000000006e-17,5.766865e-17,-1.786916e-17,6.903993e-18,-3.2489380000000006e-17,4.0611719999999994e-19,-1.2995750000000001e-17,-5.2389120000000004e-17,-3.736279e-17,5.766865e-17,-1.033568e-16,-3.2489380000000006e-17,-2.3757860000000003e-17,6.091759e-17,3.4926080000000005e-17
std,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029
min,-1.37478,-0.1695262,-2.057839,-0.1947439,-0.2418741,-0.7653231,-7.512198,-0.1816924,-0.9628523,-0.287879,-2.057839,-1.930038,-0.7653231,-0.668977,-1.839701,-0.2216068
25%,-0.7714453,-0.1695262,-1.035077,-0.1947439,-0.2418741,-0.5785037,-0.06624892,-0.1785433,-0.9628523,-0.287879,-1.035077,-0.7832629,-0.5785037,-0.668977,-0.7938805,-0.2216068
50%,-0.254301,-0.1695262,-0.01231492,-0.1947439,-0.2418741,-0.3916843,-0.06624892,-0.1785433,-0.9628523,-0.287879,-0.01231492,-0.01874587,-0.3916843,-0.630344,0.02316682,-0.2216068
75%,0.6076061,-0.1695262,1.010447,-0.1947439,-0.2418741,0.3555932,-0.05375571,-0.1753943,1.038581,-0.287879,1.010447,0.7457711,0.3555932,0.4548917,0.8105033,-0.2216068
max,5.348095,41.9616,1.692288,24.37215,31.50618,7.26791,13.15156,29.68688,1.038581,3.473682,1.692288,2.274805,7.26791,3.534994,1.686972,4.512498


Step 11: Save Pipeline

In [None]:
import joblib

joblib.dump(preprocessor, "../models/preprocessing_pipeline.pkl")
print("Preprocessing pipeline saved.")