# 1. Is COMPAS fair? (48pt)

In [1]:
import numpy as np
import pandas as pd

## 1. (1pt) Load the COMPAS data, and perform the basic checks.

In [2]:
compas = pd.read_csv("../data/compas-score-data.csv.bz2", sep="\t")
print("shape: ", compas.shape)
compas.head()

shape:  (6172, 8)


Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
0,69,F,Other,Greater than 45,Male,0,1,0
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
3,44,M,Other,25 - 45,Male,0,1,0
4,41,F,Caucasian,25 - 45,Male,14,6,1


In [3]:
compas.isna().sum()

age                0
c_charge_degree    0
race               0
age_cat            0
sex                0
priors_count       0
decile_score       0
two_year_recid     0
dtype: int64

In [4]:
compas.dtypes

age                 int64
c_charge_degree    object
race               object
age_cat            object
sex                object
priors_count        int64
decile_score        int64
two_year_recid      int64
dtype: object

## 2. (1pt) Filter the data to keep only Caucasian and African-Americans. There are just too few offenders of other races.

In [5]:
compas_aa_c = compas[(compas.race == 'Caucasian') | (compas.race == 'African-American')]
compas_aa_c.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
4,41,F,Caucasian,25 - 45,Male,14,6,1
6,39,M,Caucasian,25 - 45,Female,0,1,0
7,27,F,Caucasian,25 - 45,Male,0,4,0


## 3. (2pt) Create a new dummy variable based off of COMPAS risk score (decile_score), which indicates if an individual was classified as low risk (score 1-4) or high risk (score 5-10). Hint: you can proceed in different ways but for technical reasons related to the tasks below, the best way to do it is to create a variable “high score”, that takes values 1 (decile score 5 and above) and 0 (decile score 1-4).

In [6]:
df = compas_aa_c.copy()
df["high_score"] = pd.cut(df.decile_score, bins=[0,4, np.inf], labels=[0, 1])
df.head()
print(df.size)

47502
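
`pd.cut` with `labels=[0, 1]` returns a categorical column. If a plain integer dummy is preferred, one possible alternative is a direct comparison. This is a minimal sketch on a few made-up scores (the `scores` frame is hypothetical, not the COMPAS data):

```python
import pandas as pd

# Hypothetical scores covering the boundary cases; not taken from the data set
scores = pd.DataFrame({"decile_score": [1, 4, 5, 10]})

# 1 when the decile score is 5 or above, 0 otherwise
scores["high_score"] = (scores["decile_score"] >= 5).astype(int)
print(scores)
```

An integer column can be averaged and passed to sklearn metrics without any conversion.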


## 4. (6pt) Now analyze the offenders across this new risk category:
### (a) What is the recidivism rate (percentage of offenders who re-commit the crime) for low-risk and high-risk individuals?

In [7]:
# Recidivism rate of a group = share of that group with two_year_recid == 1
high_risk = df[df.high_score == 1] #high risk
print(f"recidivism rate for high risk: {high_risk.two_year_recid.mean():.3f}")

low_risk = df[df.high_score == 0] #low risk
print(f"recidivism rate for low risk: {low_risk.two_year_recid.mean():.3f}")

recidivism rate for high risk: 0.634
recidivism rate for low risk: 0.320


### (b) What are the recidivism rates for African-Americans and Caucasians?

In [8]:
aa_rate = df[df.race == "African-American"].two_year_recid.mean() #African-American
print(f"recidivism rate for African-Americans: {aa_rate:.3f}")

c_rate = df[df.race == "Caucasian"].two_year_recid.mean() #Caucasian
print(f"recidivism rate for Caucasians: {c_rate:.3f}")

recidivism rate for African-Americans: 0.523
recidivism rate for Caucasians: 0.391
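
The recidivism rate of a group is the share of that group that re-offends, which is not the same as the group's share among all re-offenders. The sketch below illustrates the difference with `groupby`. It uses a toy frame rebuilt from the four (high_score, two_year_recid) cell counts reported by the confusion matrix in question 5, not the original CSV, and the variable names are made up for illustration:

```python
import pandas as pd

# Toy frame rebuilt from the (high_score, two_year_recid) cell counts of the
# question 5 confusion matrix; an assumption for illustration, not the CSV.
counts = {(0, 0): 1872, (0, 1): 881, (1, 0): 923, (1, 1): 1602}
toy = pd.DataFrame(
    [key for key, n in counts.items() for _ in range(n)],
    columns=["high_score", "two_year_recid"],
)

# P(recidivism | risk group): mean of the 0/1 outcome within each group
rate_by_group = toy.groupby("high_score")["two_year_recid"].mean()

# P(risk group | recidivism): composition of the recidivists, a different quantity
share_among_recid = toy.loc[toy.two_year_recid == 1, "high_score"].value_counts(normalize=True)

print(rate_by_group.round(3))
print(share_among_recid.round(3))
```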


## 5. (8 pt) Now create a confusion matrix comparing COMPAS predictions for recidivism (low risk/high risk) and the actual two-year recidivism and interpret the results. In order to be on the same page, let’s call recidivists “positives”.

In [9]:
from sklearn.metrics import confusion_matrix

# sklearn expects (y_true, y_pred): rows are actual outcomes, columns are COMPAS predictions
cm = confusion_matrix(df.two_year_recid, df.high_score)
cm

array([[1872,  923],
       [ 881, 1602]])

In [10]:
from sklearn.metrics import precision_score, recall_score, f1_score

print("accuracy: ", np.mean(df.high_score == df.two_year_recid)) #accuracy
print("precision: ", precision_score(df.two_year_recid, df.high_score)) #precision
print("recall: ", recall_score(df.two_year_recid, df.high_score)) #recall
# F1 is the harmonic mean of precision and recall, not their arithmetic mean
print("F1-score: ", round(f1_score(df.two_year_recid, df.high_score), 4))

accuracy:  0.6582038651004168
precision:  0.6344554455445545
recall:  0.6451872734595248
F1-score:  0.6398


* COMPAS classified 65.8% of individuals correctly (accuracy).
* 63.4% of individuals classified as high risk recidivated within 2 years (precision), so 36.6% of them did not.
* 64.5% of individuals who recidivated were classified as high risk (recall), so 35.5% of them were classified as low risk.
* 32.0% of individuals classified as low risk recidivated within 2 years.
* F1-score (harmonic mean of precision and recall): 64.0%
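
All of these figures follow from the four cells of the confusion matrix. Treating recidivists as positives and a high COMPAS score as a positive prediction, the cells are TN = 1872 (low score, no recidivism), FP = 923 (high score, no recidivism), FN = 881 (low score, recidivism) and TP = 1602 (high score, recidivism). A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# Cell counts from the confusion matrix, recidivists as positives
metrics = classification_metrics(tn=1872, fp=923, fn=881, tp=1602)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```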

## 6. (8pt) Find the accuracy of the COMPAS classification, and also how its errors were distributed. Would you feel comfortable having a judge to use COMPAS to inform sentencing guidelines? At what point would the error/misclassification risk be acceptable for you? What do you think, how well can judges perform the same task without COMPAS’s help? Remember: human judges are not perfect either!

In [11]:
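# Recidivists are positives. FP = non-recidivists classified high risk (923), TN = 1872
fpr = 923/(923+1872)
print("FPR: ", fpr)
# FN = recidivists classified low risk (881), TP = 1602
fnr = 881/(881+1602)
print("FNR: ", fnr)

FPR:  0.3302325581395349
FNR:  0.35481272654047524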


Accuracy is reported in question 5 (about 66%). Personally, I would not be comfortable with a judge using COMPAS to inform sentencing guidelines. If the accuracy were at least 90%, I think it would be more acceptable. I would love the accuracy to be 100%, but I believe that is impossible. COMPAS has relatively high false positive and false negative rates: roughly a third of the people who did not recidivate within 2 years were classified as "high risk", and roughly a third of the people who did recidivate were classified as "low risk". With error rates this large, I believe judges can make better decisions without COMPAS, so I would not trust the system.

## 7. (10pt) Now repeat your confusion matrix calculation and analysis from 5. But this time do it separately for African-Americans and for Caucasians:

In [12]:
african_american = df[df.race == "African-American"]
# sklearn expects (y_true, y_pred): rows are actual outcomes, columns are COMPAS predictions
cm = confusion_matrix(african_american.two_year_recid, african_american.high_score)
print(cm)
print("African-American accuracy: ", np.mean(african_american.high_score == african_american.two_year_recid)) #accuracy
print("precision: ", precision_score(african_american.two_year_recid, african_american.high_score)) #precision
print("recall: ", recall_score(african_american.two_year_recid, african_american.high_score)) #recall

[[ 873  641]
 [ 473 1188]]
African-American accuracy:  0.6491338582677165
precision:  0.6495352651722253
recall:  0.7152317880794702


In [13]:
Caucasian = df[df.race == "Caucasian"]
cm = confusion_matrix(Caucasian.two_year_recid, Caucasian.high_score)
print(cm)
print("Caucasian accuracy: ", np.mean(Caucasian.high_score == Caucasian.two_year_recid)) #accuracy
print("precision: ", precision_score(Caucasian.two_year_recid, Caucasian.high_score)) #precision
print("recall: ", recall_score(Caucasian.two_year_recid, Caucasian.high_score)) #recall

[[999 282]
 [408 414]]
Caucasian accuracy:  0.6718972895863052
precision:  0.5948275862068966
recall:  0.5036496350364964


### (a) How accurate is the COMPAS classification for African-American individuals? For Caucasians?

COMPAS correctly classified about 65% of African-American individuals and about 67% of Caucasian individuals.

### (b) What are the false positive rates (false recidivism rates) FPR = FP/N = FP/(FP + TN)?

In [14]:
# FP = non-recidivists classified as high risk, TN = non-recidivists classified as low risk
aa_fpr = 641/(641+873)
print("African-American FPR: ", round(aa_fpr, 4))
c_fpr = 282/(282+999)
print("Caucasian FPR: ", round(c_fpr, 4))

African-American FPR:  0.4234
Caucasian FPR:  0.2201


### (c) The false negative rates (false no-recidivism rates) FNR = FN/P = FN/(FN + TP)?

In [15]:
# FN = recidivists classified as low risk, TP = recidivists classified as high risk
aa_fnr = 473/(473+1188)
print("African-American FNR: ", round(aa_fnr, 4))
c_fnr = 408/(408+414)
print("Caucasian FNR: ", round(c_fnr, 4))

African-American FNR:  0.2848
Caucasian FNR:  0.4964
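
The per-race error rates can also be computed in one pass rather than by typing cell counts by hand. The sketch below is one way to do it, assuming recidivists are the positives and a high score is the positive prediction. It runs on a toy frame rebuilt from the per-race cell counts of the confusion matrices above (not the original CSV), and `error_rates` is a hypothetical helper:

```python
import pandas as pd

# Toy frame rebuilt from the per-race confusion-matrix cell counts:
# (race, two_year_recid, high_score) -> number of individuals.
cells = {
    ("African-American", 0, 0): 873, ("African-American", 0, 1): 641,
    ("African-American", 1, 0): 473, ("African-American", 1, 1): 1188,
    ("Caucasian", 0, 0): 999, ("Caucasian", 0, 1): 282,
    ("Caucasian", 1, 0): 408, ("Caucasian", 1, 1): 414,
}
toy = pd.DataFrame(
    [key for key, n in cells.items() for _ in range(n)],
    columns=["race", "two_year_recid", "high_score"],
)

def error_rates(group):
    """FPR and FNR for one group, with recidivists as positives."""
    negatives = group[group.two_year_recid == 0]
    positives = group[group.two_year_recid == 1]
    return pd.Series({
        "FPR": (negatives.high_score == 1).mean(),  # FP / (FP + TN)
        "FNR": (positives.high_score == 0).mean(),  # FN / (FN + TP)
    })

# One row per race, one column per error rate
rates = pd.DataFrame({race: error_rates(g) for race, g in toy.groupby("race")}).T
print(rates.round(3))
```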


## 8. (12pt) If you have done this correctly, you will find that COMPAS’s percentage of correctly categorized individuals (accuracy) is fairly similar for African-American and Caucasian individuals, but that false positive rates and false negative rates are different. Look again at the overall recidivism rates in the dataset for Black and White individuals. In your opinion, is the COMPAS algorithm fair? Justify your answer.

Personally, I don't think the COMPAS algorithm is fair. Overall accuracy is already unsatisfactory at about 66% (looking only at race, decile_score, and two_year_recid), and the errors fall unevenly. From the confusion matrices in question 7, 641 of the 1,514 African-Americans who did not recidivate (42%) were classified as high risk, compared with 282 of the 1,281 Caucasians (22%), while 408 of the 822 Caucasians who did recidivate (50%) were classified as low risk, compared with 473 of the 1,661 African-Americans (28%). Systems like this tend to be biased against one race, partly for reasons that cannot be captured in a dataset. I believe that we should not be using technology to predict and identify those who we think will commit a crime in the future. Knowing how much racism still exists also makes me doubt that the COMPAS algorithm is fair in practice. In the end, it is the judges and the police officers who decide whether or not an individual goes to jail.
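
The prompt's pointer to the overall recidivism rates can be made concrete. From the definitions of the rates, a group's false positive rate is tied to its base rate p, its precision (PPV) and its false negative rate by FPR = p/(1-p) · (1-PPV)/PPV · (1-FNR). Groups with different base rates therefore cannot share the same PPV, FPR and FNR unless the classifier is perfect. The sketch below checks the identity on the per-race cell counts from question 7 (recidivists as positives). The function name is hypothetical and this is an illustration, not part of the assignment's required analysis:

```python
def fpr_from_base_rate(base_rate, ppv, fnr):
    """FPR implied by a group's base rate, precision (PPV) and FNR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Per-race cell counts (TN, FP, FN, TP) from the question 7 confusion matrices
groups = {
    "African-American": (873, 641, 473, 1188),
    "Caucasian": (999, 282, 408, 414),
}

results = {}
for race, (tn, fp, fn, tp) in groups.items():
    base_rate = (tp + fn) / (tn + fp + fn + tp)
    ppv = tp / (tp + fp)
    fnr = fn / (fn + tp)
    direct = fp / (fp + tn)                            # FPR from its definition
    implied = fpr_from_base_rate(base_rate, ppv, fnr)  # FPR from the identity
    results[race] = (direct, implied)
    print(f"{race}: base rate {base_rate:.3f}, PPV {ppv:.3f}, "
          f"FPR {direct:.3f} (implied {implied:.3f})")
```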

# 2. Can you beat COMPAS? (40pt)

## 1. (8pt) Before we start: what do you think, what is an appropriate model performance measure here? A, P, R, F or something else? Maybe you want to report multiple measures? Explain!

I think that accuracy and the F-score are good performance measures. The F-score balances precision and recall, each of which can look misleadingly good or bad on its own depending on the mix of false positives and false negatives. Accuracy is also a good overall measure, since we want to know how often the model classifies an individual correctly. For cross-validation with sklearn I report accuracy (scoring="accuracy"), because it is easy to understand and can be compared directly with the COMPAS accuracy above.
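
Several measures can be reported from a single cross-validation run. The sketch below uses sklearn's `cross_validate` with a list of scorers. It runs on synthetic data from `make_classification`, which is an assumption for illustration and has nothing to do with the COMPAS features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the COMPAS features are not used here
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y, cv=10,
    scoring=metric_names,
)

# cross_validate stores each metric under the key "test_<name>"
for metric in metric_names:
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```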

## 2. (6pt) Now it is time to do the modeling. Create a logistic regression model that contains all explanatory variables you have in data into the model. (Some of these you have to convert to dummies). Do not include the variables discussed above, do not include race and gender in this model either to avoid explicit gender/racial bias.

In [16]:
compas["prior_cat"] = pd.cut(compas.priors_count, bins=[-np.inf,6,12,20,27,np.inf], labels=["1", "2", "3", "4", "5"])
x = compas[["c_charge_degree", "age_cat", "prior_cat", "race", "sex"]]
x1 = x.drop(["race", "sex"], axis=1)
y = compas["two_year_recid"]
x_dummy = pd.get_dummies(x1)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

0.6429046487597887

## 3. (6pt) Experiment with different models to find the best model according to your performance indicator. (Include/exclude different variables, you may also do feature engineering, e.g. create different age groups, include variables like age², age³, interaction effects, etc. But do not include race and gender.) Report what you tried (but no need to report the full results of all unsuccessful models you tried), and your best model’s performance. Is it better or worse than for the COMPAS model? Please do not spend too much on tiny differences, e.g. your accuracy is better by 0.001 and F-score worse by 0.0005. Cross-validation is a random process and these figures jump up and down a bit.

In [17]:
x_dummy = x1.drop(["age_cat"], axis=1)
x_dummy = pd.get_dummies(x_dummy)

cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

0.6207059422091443

In [18]:
x_dummy = x1.drop(["c_charge_degree"], axis=1)
x_dummy = pd.get_dummies(x_dummy)

cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

0.6419324636905792

In [19]:
x_dummy = x1.drop(["prior_cat"], axis=1)
x_dummy = pd.get_dummies(x_dummy)

cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

0.5771249337802186

In [20]:
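# Caution: pd.cut intervals are right-closed, so the first bin is (18, 30] and age 18 itself maps to NaN.
# The labels are shifted by one bin relative to the edges ("below 20" actually covers ages 19-30), and the
# adjacent literals "71-80" "81-90" concatenate into the single label "71-8081-90".
compas["new_age_cat"] = pd.cut(compas.age, bins=[18, 30, 40, 50, 60, 70, 80, 90, np.inf], labels=["below 20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80" "81-90", "91+"])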
x1 = compas[["c_charge_degree", "new_age_cat", "prior_cat"]]
y = compas["two_year_recid"]
x_dummy = pd.get_dummies(x1)
cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6369105128164781

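I tried excluding individual variables from the model, but each exclusion lowered the cross-validated accuracy: about 0.621 without age_cat, 0.642 without c_charge_degree, and 0.577 without prior_cat, compared with 0.643 for the model with all three. These scores are accuracies, so higher is better. I also re-binned age into narrower categories than the original COMPAS age_cat, which gave about 0.637. That is slightly below the baseline, but the gap is within cross-validation noise, so the two specifications perform about the same. I carry the re-binned model forward in the questions below.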
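
When re-binning with `pd.cut`, the number of labels must be exactly one less than the number of edges, and intervals are right-closed by default. The sketch below shows one way to keep labels and edges aligned. The ages and label names are made up for illustration, and using this binning on the COMPAS data could change the scores reported above:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit the bin edges; not taken from the data set
ages = pd.Series([18, 19, 30, 31, 45, 90, 91])
edges = [18, 30, 40, 50, 60, 70, 80, 90, np.inf]
labels = ["18-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90", "91+"]
assert len(labels) == len(edges) - 1  # pd.cut needs one label per interval

# include_lowest=True closes the first interval on the left so that age 18 is kept
age_group = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(pd.DataFrame({"age": ages, "group": age_group}))
```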

## 4. (4pt) Now add sex to the model. Does it help to improve the performance?

In [21]:
x2 = compas[["new_age_cat", "sex", "prior_cat", "c_charge_degree"]]
x_dummy = pd.get_dummies(x2)
cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6450092576565802

It improved the model slightly (accuracy of about 0.645 versus 0.637 without sex), but I don't think the difference is large enough to say that it matters.
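
The cells above print lbfgs convergence warnings. As the warning text itself suggests, one common remedy is to raise `max_iter`. The sketch below shows the idea on synthetic data from `make_classification` (an assumption for illustration). Whether it would change the accuracies reported in this notebook is not known:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the COMPAS dummies are not used here
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The default is max_iter=100; a larger budget gives the solver room to converge
model = LogisticRegression(max_iter=1000).fit(X, y)
print("iterations used:", model.n_iter_[0])

cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")
print("mean CV accuracy:", round(cv_scores.mean(), 3))
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.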

## 5. (4pt) And finally add race. Does the model improve? Again, let’s not talk about tiny differences here.

In [22]:
x3 = compas[["new_age_cat", "sex", "prior_cat", "c_charge_degree", "race"]]
x_dummy = pd.get_dummies(x3)
cv = cross_val_score(m, x_dummy, y, cv=10, scoring="accuracy")
cv.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.6411173703010181

Accuracy decreased slightly relative to the previous model (about 0.641 versus 0.645), but the difference is too small to say that adding race changes the model's performance.

## 6. (12pt) Discuss the results. Did you manage to be equally good as COMPAS? Did you create a better model? Do gender and race help to improve your predictions? What should judges do when having access to such models? Should they use such models?

Compared with COMPAS, my model produced essentially the same results. Its cross-validated accuracy (about 0.64) is slightly below the COMPAS accuracy from question 1.5 (about 0.66), but the gap is too small to call either one the better model. Adding gender and race did not improve the predictions in any meaningful way. If judges have access to models like these, they should not be focusing on race and sex. Personally, I am against judges using these models because the accuracy is still relatively low.

# 3. Is your model more fair? (12pt)
## 1. (6pt) Replicate 1.7 using your best model: pick the best model from question 2.3, predict recidivism for everyone in data (ie only African-Americans and Caucasians), and compute FPR and FNR.

In [23]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
compas = compas[(compas.race == 'Caucasian') | (compas.race == 'African-American')].copy()
compas["high_score"] = pd.cut(compas.decile_score, bins=[0,4, np.inf], labels=[0, 1])

# Split by race before the race column is dropped
aa = compas[compas.race == "African-American"]
c = compas[compas.race == "Caucasian"]
drop_cols = ["age", "race", "age_cat", "decile_score", "priors_count"]
compas = compas.drop(drop_cols, axis=1)
aa = aa.drop(drop_cols, axis=1)
c = c.drop(drop_cols, axis=1)

compas

Unnamed: 0,c_charge_degree,sex,two_year_recid,prior_cat,new_age_cat,high_score
1,F,Male,1,1,21-30,0
2,F,Male,1,1,below 20,0
4,F,Male,1,3,31-40,1
6,M,Female,0,1,21-30,0
7,F,Male,0,1,below 20,0
...,...,...,...,...,...,...
6165,M,Male,1,1,below 20,0
6166,F,Male,0,1,below 20,1
6167,F,Male,0,1,below 20,1
6168,F,Male,0,1,below 20,0


In [37]:
cm = confusion_matrix(compas.two_year_recid, compas.high_score)
cm

array([[1872,  923],
       [ 881, 1602]])

In [38]:
print("FPR: ", (923/(923+1872))) 
print("FNR: ", (881/(881+1602)))

FPR:  0.3302325581395349
FNR:  0.35481272654047524



In [39]:
# Same (y_true, y_pred) order as In [37]: rows are actual outcomes, columns are high_score
cm = confusion_matrix(aa.two_year_recid, aa.high_score)
print(cm)
print("FPR: ", round(641/(641+873), 4))
print("FNR: ", round(473/(473+1188), 4))

[[ 873  641]
 [ 473 1188]]
FPR:  0.4234
FNR:  0.2848


In [40]:
cm = confusion_matrix(c.two_year_recid, c.high_score)
print(cm)
print("FPR: ", round(282/(282+999), 4))
print("FNR: ", round(408/(408+414), 4))

[[999 282]
 [408 414]]
FPR:  0.2201
FNR:  0.4964


## 2. (6pt) Explain what do you get. Are your results different from COMPAS in any significant way?

My results were pretty much the same as the ones from COMPAS. I computed three confusion matrices: one for both races together, one for African-Americans only, and one for Caucasians only. Note that these cells cross-tabulate high_score, which is derived from the COMPAS decile_score, against two_year_recid, so they describe the COMPAS classification itself rather than predictions from my logistic regression. Any difference from the question 1 figures comes only from the order of the arguments passed to confusion_matrix. The results are therefore not different from COMPAS in any significant way.
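
To compare a fitted model with COMPAS, the confusion matrices need the model's own predictions. One possible approach is `cross_val_predict`, which returns an out-of-fold prediction for every row. The sketch below runs on synthetic data with hypothetical column names (`group`, `x1`, `x2`, `outcome`), not on the COMPAS file, and leaves the group variable out of the features, as the assignment requires for race:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data with hypothetical column names; not the COMPAS file
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
data["outcome"] = (data.x1 + 0.5 * data.x2 + rng.normal(size=n) > 0).astype(int)

# Out-of-fold predictions: every row is predicted by a model that never saw it
model = LogisticRegression(max_iter=1000)
data["predicted"] = cross_val_predict(model, data[["x1", "x2"]], data["outcome"], cv=10)

group_rates = {}
for name, part in data.groupby("group"):
    # labels=[0, 1] keeps the matrix 2x2 even if a class is missing in a group
    tn, fp, fn, tp = confusion_matrix(part.outcome, part.predicted, labels=[0, 1]).ravel()
    group_rates[name] = {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
    print(name, "FPR:", round(group_rates[name]["FPR"], 3), "FNR:", round(group_rates[name]["FNR"], 3))
```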

# Hours spent: 6