# 03 Data Preparation

The next part examines whether NamSor performs equally well across all groups. To do so, we create two data sets from the one we already have, one for each API endpoint (gender and ethnicity). In each set, the score column is derived from NamSor's prediction probability (the cells below store it as 1 minus that probability), and label_value records whether NamSor's prediction was correct (1.0) or not (0.0).

In [1]:
# >>> Import Libraries

print("Importing necessary libraries... ")

import pandas as pd

print("Libraries imported.")

Importing necessary libraries... 
Libraries imported.


In [2]:
# >>> Import COMPAS data set

print("Importing COMPAS data set... ")

df = pd.read_csv("data/compas_with_predictions_cleaned.csv")

print("Data set imported. It has {} entries and looks like this:".format(df.shape[0]))
df.head()

Importing COMPAS data set... 
Data set imported. It has 7214 entries and looks like this:


Unnamed: 0.1,Unnamed: 0,entity_id,level_0,index,first,last,score,label_value,race,sex,age_cat,sex_pred,sex_pred_prob,race_pred,race_pred_prob
0,0,1,0,0,miguel,hernandez,0.0,0,Other,Male,Greater than 45,Male,0.999286,Hispanic,0.975499
1,1,3,1,1,kevon,dixon,0.0,1,African-American,Male,25 - 45,Male,0.95672,African-American,0.857965
2,2,4,2,2,ed,philo,0.0,1,African-American,Male,Less than 25,Male,0.968813,Asian,0.611053
3,3,5,3,3,marcu,brown,1.0,0,African-American,Male,Less than 25,Male,0.622665,African-American,0.764072
4,4,6,4,4,bouthy,pierrelouis,0.0,0,Other,Male,25 - 45,Male,0.509131,African-American,0.800832


In [3]:
def mapToNumber(b):
    """Map a truthy value to 1.0 and a falsy value to 0.0."""
    return 1.0 if b else 0.0
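
As an alternative to mapping each value through a Python function, a boolean pandas Series can be converted in one vectorized step with `astype(float)`. The following is a minimal sketch on made-up values; the Series `matches` is hypothetical and not part of the notebook.

```python
import pandas as pd

# hypothetical boolean column: True where a prediction matched the recorded value
matches = pd.Series([True, False, True])

# element-wise mapping, in the style of mapToNumber
mapped = matches.apply(lambda b: 1.0 if b else 0.0)

# vectorized conversion: True becomes 1.0 and False becomes 0.0
converted = matches.astype(float)

print(converted.tolist())  # [1.0, 0.0, 1.0]
```

Both variants yield the same values; the vectorized form simply avoids a Python-level call per row.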

In [4]:
# >>> create a data set for gender
# .copy() yields an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df_gender = df[['entity_id', 'first', 'last', 'sex', 'sex_pred', 'race']].copy()
# note: score is stored as 1 - NamSor's prediction probability
df_gender['score'] = 1 - df['sex_pred_prob']
df_gender['label_value'] = (df['sex'] == df['sex_pred'])

# convert True/False to 1.0/0.0
df_gender['label_value'] = df_gender['label_value'].apply(mapToNumber)

df_gender


Unnamed: 0,entity_id,first,last,sex,sex_pred,race,score,label_value
0,1,miguel,hernandez,Male,Male,Other,0.000714,1.0
1,3,kevon,dixon,Male,Male,African-American,0.043280,1.0
2,4,ed,philo,Male,Male,African-American,0.031187,1.0
3,5,marcu,brown,Male,Male,African-American,0.377335,1.0
4,6,bouthy,pierrelouis,Male,Male,Other,0.490869,1.0
5,7,marsha,miles,Male,Female,Other,0.001543,0.0
6,8,edward,riddle,Male,Male,Caucasian,0.000248,1.0
7,9,steven,stewart,Male,Male,Other,0.008355,1.0
8,10,elizabeth,thieme,Female,Female,Caucasian,0.001108,1.0
9,13,bo,bradac,Male,Male,Caucasian,0.071431,1.0
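
Adding columns to a column selection such as `df[[...]]` can make pandas emit a `SettingWithCopyWarning`. One way to sidestep this, sketched here under the assumption that a new frame is acceptable, is `DataFrame.assign`, which returns a new DataFrame with the extra columns. The frame `toy` below is a hypothetical three-row stand-in for `df`, with illustrative values.

```python
import pandas as pd

# hypothetical stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "entity_id": [1, 3, 7],
    "sex": ["Male", "Male", "Male"],
    "sex_pred": ["Male", "Male", "Female"],
    "sex_pred_prob": [0.999286, 0.95672, 0.998457],
})

# assign() builds a new frame, so nothing is written into a slice of toy
toy_gender = toy[["entity_id", "sex", "sex_pred"]].assign(
    score=1 - toy["sex_pred_prob"],
    label_value=(toy["sex"] == toy["sex_pred"]).astype(float),
)

print(toy_gender)
```

The original `toy` frame is left unchanged, which makes the cell safe to re-run.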


In [5]:
# >>> create a data set for ethnicity
# .copy() yields an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df_ethnicity = df[['entity_id', 'first', 'last', 'race', 'race_pred', 'sex']].copy()
# note: score is stored as 1 - NamSor's prediction probability
df_ethnicity['score'] = 1 - df['race_pred_prob']
df_ethnicity['label_value'] = (df['race'] == df['race_pred'])

# convert True/False to 1.0/0.0
df_ethnicity['label_value'] = df_ethnicity['label_value'].apply(mapToNumber)

df_ethnicity


Unnamed: 0,entity_id,first,last,race,race_pred,sex,score,label_value
0,1,miguel,hernandez,Other,Hispanic,Male,0.024501,0.0
1,3,kevon,dixon,African-American,African-American,Male,0.142035,1.0
2,4,ed,philo,African-American,Asian,Male,0.388947,0.0
3,5,marcu,brown,African-American,African-American,Male,0.235928,1.0
4,6,bouthy,pierrelouis,Other,African-American,Male,0.199168,0.0
5,7,marsha,miles,Other,African-American,Male,0.278000,0.0
6,8,edward,riddle,Caucasian,Caucasian,Male,0.238652,1.0
7,9,steven,stewart,Other,Caucasian,Male,0.345223,0.0
8,10,elizabeth,thieme,Caucasian,Caucasian,Female,0.233335,1.0
9,13,bo,bradac,Caucasian,Asian,Male,0.141849,0.0
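
Because label_value is 1.0 for a correct prediction and 0.0 otherwise, its mean within a group is that group's share of correct predictions. The sketch below illustrates the idea on the ten rows displayed above only; the names `sample` and `accuracy_by_race` are introduced for illustration, and ten rows are far too few to draw any conclusion about NamSor.

```python
import pandas as pd

# race and label_value of the ten df_ethnicity rows displayed above
sample = pd.DataFrame({
    "race": ["Other", "African-American", "African-American", "African-American",
             "Other", "Other", "Caucasian", "Other", "Caucasian", "Caucasian"],
    "label_value": [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# mean of label_value per group = share of correct predictions in that group
accuracy_by_race = sample.groupby("race")["label_value"].mean()

print(accuracy_by_race)
```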


In [6]:
# Saving results 
print("Saving compas dataframe with predictions for gender to CSV... ")
df_gender.to_csv("data/compas_gender_predictions.csv")
print("CSV saved!")

Saving compas dataframe with predictions for gender to CSV... 
CSV saved!


In [7]:
print("Saving compas dataframe with predictions for ethnicity to CSV... ")
df_ethnicity.to_csv("data/compas_ethnicity_predictions.csv")
print("CSV saved!")

Saving compas dataframe with predictions for ethnicity to CSV... 
CSV saved!
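
The `Unnamed: 0` columns visible in the imported data set are what pandas typically produces when a CSV that was written with its index is read back without declaring an index column. If that extra column is not wanted in the files saved here, passing `index=False` to `to_csv` is one option. The sketch below demonstrates the difference in memory with a hypothetical two-row frame, so no files are touched.

```python
import io
import pandas as pd

# hypothetical two-row frame, for illustration only
toy = pd.DataFrame({"entity_id": [1, 3], "score": [0.000714, 0.043280]})

# default: the index is written as an unnamed first column
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
reread_default = pd.read_csv(buf_default)

# index=False: only the data columns are written
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reread_no_index = pd.read_csv(buf_no_index)

print(list(reread_default.columns))   # ['Unnamed: 0', 'entity_id', 'score']
print(list(reread_no_index.columns))  # ['entity_id', 'score']
```

Whether to drop the index depends on whether later notebooks expect that column, so this is a choice rather than a required change.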
