Als Erstes sollten alle nötigen Bibliotheken importiert werden. Gleichzeitig wird die CSV-Datei mit den zu verwendeten Datensets eingelesen.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:

df_compas = pd.read_csv('compas-scores-two-years.csv')

Das Datenset besteht aus 7214 Einträgen und besitzt insgesamt 53 Merkmale, die als Spalten verwendet werden.

In [4]:
df_compas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7214 entries, 0 to 7213
Data columns (total 53 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       7214 non-null   int64  
 1   name                     7214 non-null   object 
 2   first                    7214 non-null   object 
 3   last                     7214 non-null   object 
 4   compas_screening_date    7214 non-null   object 
 5   sex                      7214 non-null   object 
 6   dob                      7214 non-null   object 
 7   age                      7214 non-null   int64  
 8   age_cat                  7214 non-null   object 
 9   race                     7214 non-null   object 
 10  juv_fel_count            7214 non-null   int64  
 11  decile_score             7214 non-null   int64  
 12  juv_misd_count           7214 non-null   int64  
 13  juv_other_count          7214 non-null   int64  
 14  priors_count            

Einen genaueren Einblick innerhalb des Aufbaus erhält man durch die Untersuchung der ersten paar Einträgen.

In [5]:
df_compas.head(6)

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,...,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,...,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,...,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,...,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,...,1,Low,2013-03-26,,,2,0,1102,0,0
5,7,marsha miles,marsha,miles,2013-11-30,Male,1971-08-22,44,25 - 45,Other,...,1,Low,2013-11-30,2013-11-30,2013-12-01,0,1,853,0,0


Ebenso kann man die verwendeten Bezeichnungen für die Spalten ausgeben lassen.

In [6]:
print(df_compas.columns)

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


Viele dieser Variablen (auch Prädikatoren genannt) werden in der Bestimmung des Kriterium bei Dressel und Farid (2018) nicht verwendet. Hier sollen dieselben genutzt werden: age, sex, number of juvenile misdemeanors, number of juvenile felonies, number of prior (nonjuvenile) crimes, crime degree, und crime charge. Dafür sollen die nicht nötigen Variablen aus dem Datensatz entfernt werden.

In [7]:
df = df_compas.drop(columns=['id', 'name', 'first', 'last', 'compas_screening_date', 'dob',
       'age_cat', 'race', 'decile_score', 'juv_other_count', 
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event'])

Um diesen Datensatz zu verwenden müssen zusätzlich die mit Text beschriebenen Variablen in einen numerischen Wert umgewandelt werden.

In [8]:
df['sex'] = df['sex'].astype('category')
df['sex'] = df['sex'].cat.codes

In [9]:

df['c_charge_degree'] = df['c_charge_degree'].astype('category')
df['c_charge_degree'] = df['c_charge_degree'].cat.codes

In [10]:

df['c_charge_desc'] = df['c_charge_desc'].astype('category')
df['c_charge_desc'] = df['c_charge_desc'].cat.codes

In [11]:
df

Unnamed: 0,sex,age,juv_fel_count,juv_misd_count,priors_count,c_charge_degree,c_charge_desc,two_year_recid
0,1,69,0,0,0,0,17,0
1,1,34,0,0,0,0,172,1
2,1,24,0,0,4,0,318,1
3,1,23,0,1,1,0,317,0
4,1,43,0,0,2,0,436,0
...,...,...,...,...,...,...,...,...
7209,1,23,0,0,0,0,123,0
7210,1,23,0,0,0,0,213,0
7211,1,57,0,0,0,0,21,0
7212,0,33,0,0,3,1,50,0


Nun sollte noch geprüft werden, ob Nullwerte in den Daten enthalten sind und diese wenn nötig entfernen.

In [12]:
df.isnull().sum()

sex                0
age                0
juv_fel_count      0
juv_misd_count     0
priors_count       0
c_charge_degree    0
c_charge_desc      0
two_year_recid     0
dtype: int64

Mit diesen Schritten sind die Daten vorbereitet, um diese nutzen zu können.
Folgend sollen die Spalten in die abhängigen und unabhängigen Variablen, auch die Ziel-Variable und Merkmal-Variablen genannt, aufgeteilt werden, sodass X und y für eine logistische Regression festgelegt werden können.

In [13]:
X_overall_LR7 = df.drop(columns='two_year_recid')

In [14]:
y_overall_LR7 = df['two_year_recid']

Anders als bei Dressel und Farid (2018) soll, anstatt einer 80:20 Trainings- zu Testdaten Aufteilung, eine 10-fache Kreuzvalidierung durchgeführt werden. Dies soll zum selben Ergebnis führen bei einer Einsparung an Ressourcen und dem zeitlichen Aufwand. um die Studie nachzuvollziehen soll deren Tabelle 2 'Algorithmic predictions from 7214 defendants' nachgestellt werden. 

##### Genauigkeit des gesamten Datensatzes bei einer logistischen Regression mit sieben Prädikatoren

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
lr = LogisticRegression(max_iter=400)

In [17]:
from sklearn.model_selection import KFold

In [18]:
kf = KFold(n_splits=10, shuffle=True, random_state=28)

In [19]:
from sklearn.model_selection import cross_val_score

In [20]:
kfscore_accuracy_overall_LR7 = cross_val_score(lr, X_overall_LR7, y_overall_LR7, cv=kf, scoring='accuracy')

In [21]:
accuracy_overall_LR7 = np.average(kfscore_accuracy_overall_LR7)
accuracy_overall_LR7

0.6761780537188653

##### Genauigkeit des Datensatzes für Schwarze Menschen bei einer logistischen Regression mit sieben Prädikatoren

In [22]:
df_black = pd.read_csv('compas-scores-two-years.csv')
df_black = df_black.query('race == "African-American"')
df_black = df_black.drop(columns=['id', 'name', 'first', 'last', 'compas_screening_date', 'dob',
       'age_cat', 'race', 'decile_score', 'juv_other_count', 
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event'])
df_black['sex'] = df_black['sex'].astype('category')
df_black['sex'] = df_black['sex'].cat.codes
df_black['c_charge_degree'] = df_black['c_charge_degree'].astype('category')
df_black['c_charge_degree'] = df_black['c_charge_degree'].cat.codes
df_black['c_charge_desc'] = df_black['c_charge_desc'].astype('category')
df_black['c_charge_desc'] = df_black['c_charge_desc'].cat.codes

X_black_LR7 = df_black.drop(columns='two_year_recid')
y_black_LR7 = df_black['two_year_recid']

kfscore_accuracy_black_LR7 = cross_val_score(lr, X_black_LR7, y_black_LR7, cv=kf, scoring='accuracy')
accuracy_black_LR7 = np.average(kfscore_accuracy_black_LR7)
accuracy_black_LR7

0.6674701530799092

##### Falsch Positive für Schwarze Menschen bei einer logistischen Regression mit sieben Prädikatoren

In [47]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix



y_pred_black = cross_val_predict(lr, X_black_LR7, y_black_LR7, cv=10)
conf_mat_black = confusion_matrix(y_black_LR7, y_pred_black)
FP_black = conf_mat_black[0,1]
actual_negatives = conf_mat_black[1,1] + conf_mat_black[1,0]
FP_black_percent = (FP_black / actual_negatives) * 100
FP_black_percent

35.718043135192005

##### Genauigkeit des Datensatzes für Weißen Menschen bei einer logistischen Regression mit sieben Prädikatoren

In [24]:
df_white = pd.read_csv('compas-scores-two-years.csv')
df_white = df_white.query('race == "Caucasian"')
df_white = df_white.drop(columns=['id', 'name', 'first', 'last', 'compas_screening_date', 'dob',
       'age_cat', 'race', 'decile_score', 'juv_other_count', 
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event'])

df_white['sex'] = df_white['sex'].astype('category')
df_white['sex'] = df_white['sex'].cat.codes
df_white['c_charge_degree'] = df_white['c_charge_degree'].astype('category')
df_white['c_charge_degree'] = df_white['c_charge_degree'].cat.codes
df_white['c_charge_desc'] = df_white['c_charge_desc'].astype('category')
df_white['c_charge_desc'] = df_white['c_charge_desc'].cat.codes

X_white_LR7 = df_white.drop(columns='two_year_recid')
y_white_LR7 = df_white['two_year_recid']

kfscore_accuracy_white_LR7 = cross_val_score(lr, X_white_LR7, y_white_LR7, cv=kf, scoring='accuracy')
accuracy_white_LR7 = np.average(kfscore_accuracy_white_LR7)
accuracy_white_LR7

0.6747867927658868

Quellen:

RegenerativeToday. (16.06.2024). Step by Step Tutorial on Logistic Regression in Python | sklearn |Jupyter Notebook [Video]. Youtube. https://www.youtube.com/watch?v=bSXIbCZNBw0

Dressel, A. & Farid, H. (2018). The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4 (1), eaao5580. https://doi.org/10.1126/sciadv.aao5580

Ryan Nolan Data. (17.06.2024). A Comprehensive Guide to Cross-Validation with Scikit-Learn and Python [Video]. Youtube. https://www.youtube.com/watch?v=glLNo1ZnmPA&list=PLcQVY5V2UY4LNmObS0gqNVyNdVfXnHwu8&index=14

