NAME-JAGTAP KAUSTUBH

ROLL NO-09

PRN-22SC114501064

TITLE-Impact of Data Quality on AI Fairness

Objective: To understand how imbalanced data affects the fairness and performance of AI models — and how data balancing techniques (like SMOTE) can improve fairness.


In [5]:
pip install fairlearn

Collecting fairlearn
  Downloading fairlearn-0.12.0-py3-none-any.whl.metadata (7.0 kB)
Downloading fairlearn-0.12.0-py3-none-any.whl (240 kB)
Installing collected packages: fairlearn
Successfully installed fairlearn-0.12.0


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import MetricFrame, true_positive_rate, false_positive_rate

In [8]:
df = pd.read_csv('/content/recidivism_full.csv')

In [9]:
df.columns

Index(['ID', 'Gender', 'Race', 'Age_at_Release', 'Residence_PUMA',
       'Gang_Affiliated', 'Supervision_Risk_Score_First',
       'Supervision_Level_First', 'Education_Level', 'Dependents',
       'Prison_Offense', 'Prison_Years', 'Prior_Arrest_Episodes_Felony',
       'Prior_Arrest_Episodes_Misd', 'Prior_Arrest_Episodes_Violent',
       'Prior_Arrest_Episodes_Property', 'Prior_Arrest_Episodes_Drug',
       'Prior_Arrest_Episodes_PPViolationCharges',
       'Prior_Arrest_Episodes_DVCharges', 'Prior_Arrest_Episodes_GunCharges',
       'Prior_Conviction_Episodes_Felony', 'Prior_Conviction_Episodes_Misd',
       'Prior_Conviction_Episodes_Viol', 'Prior_Conviction_Episodes_Prop',
       'Prior_Conviction_Episodes_Drug',
       'Prior_Conviction_Episodes_PPViolationCharges',
       'Prior_Conviction_Episodes_DomesticViolenceCharges',
       'Prior_Conviction_Episodes_GunCharges', 'Prior_Revocations_Parole',
       'Prior_Revocations_Probation', 'Condition_MH_SA', 'Condition_Cog_Ed',
     

In [10]:
# Keep only rows for the two racial groups being compared, then drop rows with missing values in the columns used below
df = df[df['Race'].isin(['BLACK', 'WHITE'])]
df = df.dropna(subset=['Age_at_Release', 'Prior_Arrest_Episodes_Felony', 'Prison_Offense', 'Recidivism_Within_3years'])

In [11]:
# One-hot encode the categorical column
df = pd.get_dummies(df, columns=['Prison_Offense'], drop_first=True)

In [12]:
# Replace '10 or more' with 10 and convert to numeric
df['Prior_Arrest_Episodes_Felony'] = df['Prior_Arrest_Episodes_Felony'].replace('10 or more', 10).astype(int)

# One-hot encode 'Age_at_Release'
df = pd.get_dummies(df, columns=['Age_at_Release'], drop_first=True)


# Select all required columns
features = ['Prior_Arrest_Episodes_Felony'] + [col for col in df.columns if 'Prison_Offense' in col or 'Age_at_Release' in col]
X = df[features]
y = df['Recidivism_Within_3years'].astype(int)
race = df['Race']

# Split into train/test sets (70/30), stratified by race so both sets keep the same group proportions
X_train, X_test, y_train, y_test, race_train, race_test = \
    train_test_split(X, y, race, test_size=0.3, stratify=race)

In [14]:
# One-hot encode the Gender column (X was already built in an earlier cell, so Gender is not among the model's features)
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)

In [15]:
# Train a Logistic Regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate fairness metrics
metric_frame = MetricFrame(metrics={"True positive rate": true_positive_rate,
                                    "False positive rate": false_positive_rate},
                           y_true=y_test,
                           y_pred=y_pred,
                           sensitive_features=race_test)

# Display the fairness metrics
display(metric_frame.by_group)

       True positive rate  False positive rate
Race
BLACK            0.850000             0.631157
WHITE            0.806624             0.532374
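To make the disparity between the two groups explicit, the by-group rates can be reduced to one gap per metric. The sketch below copies the values from the output above; the helper name `group_gap` is hypothetical and is not part of fairlearn.

```python
# By-group rates copied from the MetricFrame output above.
rates = {
    "BLACK": {"tpr": 0.85, "fpr": 0.631157},
    "WHITE": {"tpr": 0.806624, "fpr": 0.532374},
}


def group_gap(rates, metric):
    """Largest minus smallest value of `metric` across groups."""
    values = [group[metric] for group in rates.values()]
    return max(values) - min(values)


tpr_gap = group_gap(rates, "tpr")
fpr_gap = group_gap(rates, "fpr")
print(f"TPR gap: {tpr_gap:.3f}")  # TPR gap: 0.043
print(f"FPR gap: {fpr_gap:.3f}")  # FPR gap: 0.099
```

A gap of zero on both metrics would mean the two groups are treated alike by this measure. fairlearn's `MetricFrame.difference()` reports the same kind of between-group gap directly from `metric_frame`.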


True Positives (TP): These are cases where the model correctly predicts a positive outcome. In the context of this dataset, a true positive would be when the model predicts that an individual will recidivate (positive outcome), and they actually do recidivate.

False Positives (FP): These are cases where the model incorrectly predicts a positive outcome. In this dataset, a false positive would be when the model predicts that an individual will recidivate (positive outcome), but they do not actually recidivate. These are often referred to as Type I errors.
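The two rates reported by `MetricFrame` follow from these counts: the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN), each computed separately per group. A minimal sketch of that calculation is given here, assuming binary 0/1 labels; the function name `rates_by_group` and the toy data are invented for illustration and do not come from the recidivism dataset.

```python
def rates_by_group(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} from binary labels and predictions."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1

    result = {}
    for group, c in counts.items():
        # Assumes every group has at least one actual positive and one
        # actual negative; otherwise these divisions would fail.
        tpr = c["tp"] / (c["tp"] + c["fn"])  # share of actual positives flagged
        fpr = c["fp"] / (c["fp"] + c["tn"])  # share of actual negatives flagged
        result[group] = (tpr, fpr)
    return result


# Toy data, invented for illustration only.
toy_true = [1, 1, 0, 0, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

toy_rates = rates_by_group(toy_true, toy_pred, toy_groups)
print(toy_rates)  # {'A': (0.5, 0.5), 'B': (1.0, 0.5)}
```

In the toy data both groups have the same false positive rate, but group B's actual positives are flagged twice as often as group A's.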


In [16]:
from sklearn.metrics import accuracy_score

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

# Display the accuracy score
display(f"Accuracy Score: {accuracy}")

'Accuracy Score: 0.6570138150903294'

In [17]:
# View fairness metrics by race for the baseline model (no mitigation has been applied, so these match the values above)
print("Fairness Metrics by Race (Baseline, No Mitigation Applied):")
print(metric_frame.by_group)

Fairness Metrics by Race (Baseline, No Mitigation Applied):
       True positive rate  False positive rate
Race                                          
BLACK            0.850000             0.631157
WHITE            0.806624             0.532374
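No reweighing step appears in this notebook, so the values printed here are the same as the baseline. As a sketch of what such a step might look like, the code below computes per-sample weights in the style of Kamiran and Calders' reweighing, w(g, y) = P(g) · P(y) / P(g, y), so that group and label look statistically independent once weighted. The function name `reweighing_weights` and the toy data are hypothetical; this is not the notebook author's method.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each sample by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy data, invented for illustration: group A has a higher positive rate.
toy_groups = ["A", "A", "A", "B", "B", "B", "B", "B"]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

weights = reweighing_weights(toy_groups, toy_labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    rows = [(w, y) for w, g, y in zip(weights, toy_groups, toy_labels) if g == group]
    return sum(w for w, y in rows if y == 1) / sum(w for w, _ in rows)


print(weighted_positive_rate("A"), weighted_positive_rate("B"))  # 0.375 0.375
```

With real data, weights like these could be passed to scikit-learn through `model.fit(X_train, y_train, sample_weight=...)`, and the fairness metrics would then need to be recomputed from the new predictions before any "after reweighing" comparison is made.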


In this project, we have:

1.  **Installed the `fairlearn` library**: This library is used for assessing and improving the fairness of machine learning models.
2.  **Imported necessary libraries**: We imported pandas for data manipulation, `train_test_split` from scikit-learn for splitting the data, `LogisticRegression` for building the model, and `MetricFrame`, `true_positive_rate`, `false_positive_rate` from `fairlearn.metrics` for evaluating fairness.
3.  **Loaded and Preprocessed the Data**: We loaded the `recidivism_full.csv` dataset, kept only rows where 'Race' is 'BLACK' or 'WHITE', dropped rows with missing values in the columns used for modeling, and one-hot encoded the categorical features 'Prison_Offense' and 'Age_at_Release'. We also converted 'Prior_Arrest_Episodes_Felony' to an integer type, mapping '10 or more' to 10.
4.  **Split the Data**: We split the data into training (70%) and testing (30%) sets, stratifying on 'Race' so that both sets keep the same racial proportions.
5.  **Trained a Logistic Regression Model**: We trained a logistic regression model on the training data.
6.  **Evaluated Fairness**: We used `fairlearn.metrics.MetricFrame` to calculate fairness metrics, specifically True Positive Rate and False Positive Rate, across different racial groups ('BLACK' and 'WHITE') using the test set predictions.
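7.  **Displayed Fairness Metrics**: We displayed the metrics by group. On the test set, the model had both a higher True Positive Rate (0.850 vs. 0.807) and a higher False Positive Rate (0.631 vs. 0.532) for the 'BLACK' group than for the 'WHITE' group. No mitigation step (such as reweighing or SMOTE) was applied, so these are baseline values.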
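The objective names SMOTE, but the steps listed above do not include a balancing step. A simplified sketch of SMOTE's core idea is given here for orientation: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function `smote_like_oversample`, its parameters, and the toy points are hypothetical; this is not the implementation in the imbalanced-learn library.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


# Toy minority-class points with two numeric features (illustration only).
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
print(len(new_points))  # 5
```

Interpolation like this assumes continuous features. Most features in this notebook are one-hot encoded, so a variant designed for categorical data (imbalanced-learn provides `SMOTENC`, for example) would likely be a better fit, and any resampling should be applied to the training split only.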