In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

# COMPAS Dataset Error Analysis and Bias in Predictive Models

## Exercise 1: Loading and Preprocessing (COMPAS Data)
Load, clean, and preprocess the dataset to prepare it for subsequent analysis with a focus on error distribution.

#### Exercise 1.1
Load the COMPAS dataset ('compas-scores-two-years2.csv') using pandas.

In [None]:
# Load dataset with read_csv


#### Exercise 1.2
Print out the dimensions, column names, and datatypes of this dataset! Then print out the head to understand the dataset.


*  Understanding the shape, columns, and types helps determine the size of the
dataset and understand which fields are relevant for our analysis.
*  This is crucial for identifying which columns might be useful in understanding bias.
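
As a rough sketch of the inspection calls involved, the snippet below uses a tiny made-up DataFrame named `toy` (an assumption for illustration, not the COMPAS file) to show what `shape`, `columns`, `dtypes`, and `head()` return.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({"age": [23, 41], "race": ["Caucasian", "Hispanic"]})

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # column names as a plain list
print(toy.dtypes)            # one dtype per column
print(toy.head())            # first rows (up to 5 by default)
```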





In [None]:
# Explore the data structure


#### Exercise 1.3a
Filter the data to remove irrelevant records:


*   Remove rows with `days_b_screening_arrest` outside of the range [-30, 30].
    *   The column `days_b_screening_arrest` represents the number of days between a person's arrest and their assessment by the COMPAS system.
    *   The COMPAS assessment is only relevant if it is done within a certain time window of the arrest, ideally around the time of the offense.
    *   If the assessment happens more than 30 days before or after the arrest, it may not reflect the circumstances of the offense, leading to unreliable or incorrect risk scores.
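
One possible way to express a range filter like this is `Series.between`, sketched here on a made-up DataFrame (`toy` and its values are assumptions, not COMPAS records). A boolean mask with `>=` and `<=` would work equally well.

```python
import pandas as pd

# Toy stand-in for the COMPAS table; these values are made up for illustration.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "days_b_screening_arrest": [-45.0, -1.0, 30.0, 120.0],
})

# between() is inclusive on both ends by default, so -30 and 30 are kept.
# Missing values compare as False, so rows with NaN are dropped as well.
in_window = toy["days_b_screening_arrest"].between(-30, 30)
filtered = toy[in_window]
print(filtered)
```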



In [None]:
# Filter the data


#### Exercise 1.3b
Filter the data to remove irrelevant records:
*   Drop the rows where `is_recid` equals -1.
    *   The column `is_recid` is a flag indicating whether the individual reoffended (recidivated) within two years. A value of `-1` indicates that the data for this person is missing or could not be found by ProPublica (the source of the data).
    *   For records with `is_recid = -1`, we do not know whether the individual reoffended. Using incomplete or missing information can skew results, and these cases add uncertainty to the analysis.

After filtering, how many rows are there?

In [None]:
# Filter the data


#### Exercise 1.3c
Filter the data to remove irrelevant records:
*   Remove rows where `c_charge_degree` is 'O' (ordinary traffic).
    * The `c_charge_degree` column indicates the severity of the charge. It contains values like 'F' (Felony), 'M' (Misdemeanor), and 'O' (Ordinary traffic offenses).
    * **Ordinary traffic offenses** do not usually result in jail time and are less likely to lead to reoffending that would justify the use of a complex risk assessment model like COMPAS.

In [None]:
# Filter the data


#### Exercise 1.4
Filter the columns to only include ['name', 'age', 'c_charge_degree', 'race', 'score_text', 'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']

In [None]:
# Filter the DataFrame to only include the desired columns

# Select only the specified columns


#### Exercise 1.5
Check that you have no null values in the dataset.
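
A minimal sketch of a null check, assuming a small made-up DataFrame `toy` with one deliberately missing value: `isnull().sum()` counts missing entries per column, and `isnull().values.any()` answers the yes/no question for the whole table.

```python
import pandas as pd

# Toy data with one missing entry, invented for illustration.
toy = pd.DataFrame({
    "age": [23, 41, 35],
    "c_jail_in": ["2013-01-01", None, "2013-03-05"],
})

null_counts = toy.isnull().sum()       # missing values per column
has_nulls = toy.isnull().values.any()  # True if anything is missing anywhere
print(null_counts)
print(has_nulls)
```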


In [None]:
# Check for null values in the data set


#### Exercise 1.6
Create a column named 'length_of_stay' that is the length a defendant stayed in jail. Then calculate the correlation of length of stay with COMPAS score. What does the correlation mean?
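
The date arithmetic can be sketched as follows on made-up timestamps (the `toy` rows are assumptions, not real records). Subtracting two datetime columns gives a timedelta column, and `.dt.days` turns it into whole days; `Series.corr` then computes the Pearson correlation by default.

```python
import pandas as pd

# Toy jail-in / jail-out timestamps and scores, invented for illustration.
toy = pd.DataFrame({
    "c_jail_in": ["2013-01-01 08:00:00", "2013-02-10 12:00:00", "2013-03-05 00:00:00"],
    "c_jail_out": ["2013-01-02 08:00:00", "2013-02-20 12:00:00", "2013-04-04 00:00:00"],
    "decile_score": [2, 7, 9],
})

# Convert the strings to datetimes, then take the difference in whole days.
jail_in = pd.to_datetime(toy["c_jail_in"])
jail_out = pd.to_datetime(toy["c_jail_out"])
toy["length_of_stay"] = (jail_out - jail_in).dt.days

# Pearson correlation between length of stay and the score, in [-1, 1].
corr = toy["length_of_stay"].corr(toy["decile_score"])
print(toy["length_of_stay"].tolist(), corr)
```

A value near 0 would suggest little linear relationship, while values near 1 or -1 would suggest a strong one; correlation alone does not say why the two move together.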

In [None]:
# Hint: use pd.to_datetime() to convert strings into datetime format


#### Exercise 1.7
Create a column 'age_cat' that splits age into three categories: age < 25, 25 <= age <= 45, and age > 45. Give the categories easy-to-read names.
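
One way to bin a numeric column is to apply a small function to each value, sketched below on made-up ages. The helper name `age_category` and the three label strings are hypothetical choices; a lambda or `pd.cut` would be reasonable alternatives.

```python
import pandas as pd

def age_category(age):
    """Map an age to one of three labels (label text is an arbitrary choice)."""
    if age < 25:
        return "Less than 25"
    if age <= 45:
        return "25 - 45"
    return "Greater than 45"

# Toy ages chosen to hit both boundaries (25 and 45).
toy = pd.DataFrame({"age": [19, 25, 45, 46, 70]})
toy["age_cat"] = toy["age"].apply(age_category)
print(toy)
```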

In [None]:
# Try using lambda functions!


#### Exercise 1.8
Transform the risk score column into a new, simplified format. Create a new column that categorizes the risk score into binary outcomes: 'Low': 0, 'Medium': 1, 'High': 1
*   Transforming `score_text` into a binary value allows us to perform more straightforward predictive analysis and assess whether the model is properly identifying high-risk individuals.
*   Hint: create a column called 'score_binary'. The `.replace()` function replaces specific values in the DataFrame with new values.
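
The mapping itself can be sketched on a made-up column as below. This sketch swaps in `Series.map` with a dictionary rather than `.replace()`; both take the same `{old: new}` dictionary, but `map` returns NaN for any label missing from the dictionary, which makes unexpected values easy to spot.

```python
import pandas as pd

# Toy risk labels, invented for illustration.
toy = pd.DataFrame({"score_text": ["Low", "Medium", "High", "Low"]})

# Low -> 0, Medium and High -> 1, following the exercise's definition.
toy["score_binary"] = toy["score_text"].map({"Low": 0, "Medium": 1, "High": 1})
print(toy)
```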

In [None]:
# Create a binary column for the risk score


#### Exercise 1.9
Print out the percentage of each race as well as the percentage of each sex.
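
As a sketch of the percentage calculation, `value_counts(normalize=True)` returns shares that sum to 1, which is equivalent to dividing `value_counts()` by its `sum()`. The `toy` rows here are made up.

```python
import pandas as pd

# Toy column, invented for illustration.
toy = pd.DataFrame({"sex": ["Male", "Male", "Female", "Male"]})

# normalize=True gives proportions; multiply by 100 for percentages.
shares = toy["sex"].value_counts(normalize=True) * 100
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```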

In [None]:
# Hint: use value_counts() and sum()
# fstrings, .2f to display up to 2 decimal places


#### Exercise 1.10
What is the percentage of people who have recidivated?
* Calculate the percentage of individuals in the dataset who have recidivated ('two_year_recid' column equals 1).

In [None]:
# Calculate the percentage of people who have recidivated
# Hint: The mean gives the proportion of people who recidivated.

# Print the recidivism rate


#### Exercise 1.11
Using seaborn, plot charts of Black defendants' COMPAS scores, then plot charts of White defendants' COMPAS scores. Is this evidence of bias?

In [None]:
# Filter the dataset for Black defendants

# Plot the COMPAS scores for Black defendants


## Exercise 2: Confusion Matrix, Contingency Table, and Performance Metrics
Use COMPAS predictions to create a confusion matrix. Calculate accuracy, precision, recall, specificity, and F1 score using actual vs. predicted values.

#### Exercise 2.1:
Create a Confusion Matrix. Import the `confusion_matrix` function from the `sklearn.metrics` module.
*   Then extract the actual labels (`y_actual`) and the predicted labels (`y_pred`) from your DataFrame.
    * `y_actual`: This should be the actual outcome, which tells if a person actually recidivated (`two_year_recid` column).
    * `y_pred`: This should be the predicted outcome from the model (`score_binary` column).
*   Then plot this using seaborn's heatmap function and matplotlib.pyplot.
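
For intuition about the layout, here is a sketch that builds the same 2x2 table as a contingency table with `pd.crosstab` on made-up labels (the `toy` values are assumptions). With actual outcomes as rows and predictions as columns, it matches the `[[TN, FP], [FN, TP]]` arrangement that `confusion_matrix` returns for 0/1 labels.

```python
import pandas as pd

# Toy actual vs. predicted labels, invented for illustration.
toy = pd.DataFrame({
    "two_year_recid": [0, 0, 1, 1, 1, 0],  # actual outcome
    "score_binary":   [0, 1, 1, 0, 1, 0],  # predicted outcome
})

# Rows are actual labels, columns are predicted labels.
table = pd.crosstab(
    toy["two_year_recid"], toy["score_binary"],
    rownames=["actual"], colnames=["predicted"],
)
print(table)
```

A table like this can be handed to `sns.heatmap(table, annot=True, fmt="d")` to get an annotated plot.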

In [None]:
from sklearn.metrics import confusion_matrix

# Define actual and predicted values

# Create a confusion matrix using the function you've imported

# Plotting the confusion matrix using seaborn's heatmap

# Show the plot


#### Exercise 2.2:
Plot confusion matrices for Black defendants, White defendants, and another demographic of your choice.

In [None]:
# Filter the DataFrame by race (e.g. race == 'Caucasian')

# Define the actual and predicted labels for those defendants

# Create a confusion matrix

# Convert the confusion matrix to a DataFrame for labeling

# Plot the confusion matrix for White defendants using Seaborn's heatmap


#### Exercise 2.3:
Calculate Performance Metrics. Import functions to calculate accuracy, precision, recall, and F1 score from scikit-learn.
*   Calculate the **accuracy** of the model by comparing `y_actual` and `y_pred`.
    * Accuracy gives a general overview of the model's performance, indicating how often it correctly predicted recidivism versus non-recidivism.
*   Calculate the **precision** and **recall** to understand the quality of the model’s positive predictions.
    * Precision is critical when we want to minimize false positives, for example, when misclassifying someone as high-risk unjustly (which could lead to harsher treatment).
    * Recall is important when we want to ensure we identify as many actual positives as possible, such as correctly identifying individuals who are likely to reoffend.
* Use the values from the confusion matrix to calculate **specificity** (`True Negative` / (`True Negative` + `False Positive`)).
    * Specificity tells us how well the model identifies actual non-recidivists, minimizing the risk of wrongly labeling people as high-risk when they are not.
* Calculate the **F1 score** for a balanced measure of the model’s performance.
    * The F1 score is useful when the balance between precision and recall is crucial. For instance, if false positives and false negatives have serious real-world consequences, this metric helps to assess overall balance.
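
The five metrics above can all be written directly in terms of the four confusion-matrix counts. The sketch below uses hypothetical counts (not COMPAS results) to show the formulas; the scikit-learn functions should give the same numbers when fed the matching labels.

```python
# Hypothetical counts for illustration only.
tn, fp, fn, tp = 50, 10, 20, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many were caught
specificity = tn / (tn + fp)                # of real negatives, how many were cleared
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f}")
```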

**For specificity:** Run the cell below to extract values for True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP) from the confusion matrix defined in Exercise 2.1.

In [None]:
# RUN ME!
# The confusion matrix is originally a 2x2 array, with the following structure:
# [[True Negative (TN), False Positive (FP)],
#  [False Negative (FN), True Positive (TP)]]
# .ravel() makes it easier to extract these four values (left to right) at once.
# Where cm is the name of the confusion matrix (can name it anything)

tn, fp, fn, tp = cm.ravel()

In [None]:
# Calculate accuracy

# Calculate precision, recall

# Calculate specificity

# Calculate F1 Score


## Exercise 3: Error Distribution Analysis Across Demographics

#### Exercise 3.1
Create new columns in your DataFrame to identify False Positives and False Negatives.
*   **False Positive (FP)**: When the model predicts someone will reoffend (high-risk), but they do not actually reoffend.
*   **False Negative (FN)**: When the model predicts someone will not reoffend (low-risk), but they do actually reoffend.
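
These two definitions translate into boolean conditions on the predicted and actual columns. A minimal sketch on made-up rows (the `toy` values are assumptions), casting each condition to `int` so the flags are 0/1:

```python
import pandas as pd

# Toy predictions and outcomes, invented for illustration.
toy = pd.DataFrame({
    "score_binary":   [1, 1, 0, 0],  # predicted
    "two_year_recid": [0, 1, 1, 0],  # actual
})

# False positive: predicted 1, actual 0. False negative: predicted 0, actual 1.
toy["false_positive"] = ((toy["score_binary"] == 1) & (toy["two_year_recid"] == 0)).astype(int)
toy["false_negative"] = ((toy["score_binary"] == 0) & (toy["two_year_recid"] == 1)).astype(int)
print(toy)
```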

In [None]:
# Add columns to identify False Positives and False Negatives
# Hint: think of what y_pred and y_actual should be for FP, FN (as type int)


#### Exercise 3.2
How much more likely are Black defendants to get a false positive than White defendants?

How much more likely are White defendants to get a false negative than Black defendants?
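
A "how much more likely" question comes down to a ratio of two rates. The sketch below assumes the standard definition of the false positive rate, FP / (FP + TN), computed among people who did not reoffend; the group labels `"A"` and `"B"`, the helper `false_positive_rate`, and all values are hypothetical. The false negative rate follows the same pattern among people who did reoffend.

```python
import pandas as pd

# Toy data with placeholder group labels, invented for illustration.
toy = pd.DataFrame({
    "race":           ["A", "A", "A", "A", "B", "B", "B", "B"],
    "score_binary":   [1, 1, 0, 1, 1, 0, 0, 0],
    "two_year_recid": [0, 0, 0, 1, 0, 0, 0, 1],
})

def false_positive_rate(group):
    """FP / (FP + TN): share of actual negatives that were predicted positive."""
    negatives = group[group["two_year_recid"] == 0]
    return (negatives["score_binary"] == 1).mean()

fpr_a = false_positive_rate(toy[toy["race"] == "A"])
fpr_b = false_positive_rate(toy[toy["race"] == "B"])
ratio = fpr_a / fpr_b  # how many times more likely group A is to get a false positive
print(fpr_a, fpr_b, ratio)
```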

In [None]:
# Filter the dataset for Black and White defendants

# Calculate the False Positive Rate (FPR) for Black defendants

# Calculate the False Positive Rate (FPR) for White defendants

# Calculate how much more likely Black defendants are to get a false positive than White defendants

# Calculate the False Negative Rate (FNR) for Black defendants

# Calculate the False Negative Rate (FNR) for White defendants

# Calculate how much more likely White defendants are to get a false negative than Black defendants


#### Exercise 3.3
Use the newly created `'false_positive'` and `'false_negative'` columns to calculate the average false positive and false negative rates for each demographic group.
*   Group the data by `'race'` and calculate the mean of `'false_positive'` and `'false_negative'`.
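
The grouping step can be sketched as below on made-up flags with placeholder group labels `"A"` and `"B"` (all assumptions). Note that the mean of a 0/1 flag over a group is the share of all rows in that group that are flagged, which is not the same quantity as FP / (FP + TN).

```python
import pandas as pd

# Toy flags with placeholder group labels, invented for illustration.
toy = pd.DataFrame({
    "race":           ["A", "A", "A", "A", "B", "B", "B", "B"],
    "false_positive": [1, 1, 0, 0, 1, 0, 0, 0],
    "false_negative": [0, 0, 1, 0, 0, 1, 1, 0],
})

# One row per group, one column per flag; each cell is the mean of the 0/1 flag.
rates = toy.groupby("race")[["false_positive", "false_negative"]].mean()
print(rates)
```

The resulting table can be passed to a bar plot after `rates.reset_index()`, which turns the group label back into an ordinary column.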

In [None]:
# Calculate false positive and false negative rates by race

# Print the rates to understand the differences


#### Exercise 3.4
Use Seaborn's barplot to create a bar chart of the false positive rates by race.


In [None]:
# Calculate the False Positive Rate for each race
# Hint: The mean gives the rate of false positives for each race.



#### Exercise 3.5
Create a bar chart using Seaborn to visualize false negative rates by race, similar to how false positive rates were plotted.

In [None]:
# Calculate the False Negative Rate for each race
# Hint: The mean gives the rate of false negatives for each race.

#### Exercise 3.6
Calculate accuracy, precision, recall, specificity, and F1 score for 2 chosen groups. What does this tell you?

In [None]:
# Filter the dataset for Black defendants

# Filter the dataset for White defendants

# Define the actual and predicted labels for Black defendants

# Define the actual and predicted labels for White defendants

# Create a confusion matrix for Black defendants

# Calculate metrics for Black defendants

# Create a confusion matrix for White defendants

# Calculate metrics for White defendants


#### Exercise 3.7
What if we defined `score_binary` wrong? Define a new column that is 1 if `'score_text'` == 'High' and 0 otherwise. Then compare false positives and false negatives across races using this new column.



## Exercise 4: Gender (if you have extra time)

Repeat Exercises 2.1 through 3.6, but compare across gender (the `sex` column) instead of race.