# Holistic AI - Data Science assessment


In machine learning, we say that an algorithm is biased if it systematically disadvantages a group of people. For example, a recruitment algorithm would be biased if it produced a higher success rate for male candidates than for female candidates.



In this assessment, we will ask you to build a recruitment model that predicts whether or not a candidate is hired. We will provide you with a dataset composed of N samples (rows) and 503 variables (columns). The columns are:
* a binary target variable ('Label'), which indicates whether the candidate was hired or not
* 500 different features, which will be used to fit a predictive model of your choice
* the Ethnicity and Gender of the candidates, which we will use to estimate bias in the dataset and in the algorithm. 



We will ask you to: 
1. Explore and pre-process the data
2. Calculate the success rate in the dataset
3. Fit a machine learning model to the data and calculate its generalization performance
4. Calculate a simple measure of model bias. 

## **1 - Data exploration and pre-processing**

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pickle5 backports pickle protocol 5 to Python 3.5-3.7; install it before importing
!pip3 install pickle5
import pickle5 as pickle



Please download the data from the following link: https://hai-data.s3.eu-west-2.amazonaws.com/roadmaps/data.pickle. If running in Colab, please upload the data to the local folder. Otherwise, place the data in the same folder as the notebook. Load the file using pickle and then convert the result into a pandas DataFrame.
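As a minimal sketch of the loading step, assuming the file has been saved locally as `data.pickle` and that its contents can be passed directly to `pd.DataFrame`: the helper name `load_dataframe` is hypothetical, and the standard-library `pickle` module is used here because it supports protocol 5 natively from Python 3.8 onwards (on Python 3.7, `pickle5` would take its place).

```python
import pickle

import pandas as pd


def load_dataframe(path="data.pickle"):
    """Unpickle the file at `path` and return its contents as a DataFrame.

    `load_dataframe` and the default file name are illustrative assumptions.
    """
    with open(path, "rb") as f:
        raw = pickle.load(f)
    # Works whether the pickled object is already a DataFrame
    # or something DataFrame-like such as a dict of columns.
    return pd.DataFrame(raw)
```

Calling `df = load_dataframe()` and then `df.shape` is a quick way to confirm that the number of columns matches the 503 described above.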

In [None]:
## TO DO: Load the data

## /TO DO

Please feel free to do any type of data exploration and pre-processing here (e.g. have you checked for missing values? Is the dataset balanced?)
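The two checks mentioned above could look roughly like the following sketch. It runs on a small toy DataFrame rather than the real data, and the feature column names `f0` and `f1` are made up for illustration; only `Label` comes from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (feature names are hypothetical)
toy = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "f1": [1.0, 2.0, 3.0, 4.0],
    "Label": [1, 0, 0, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Share of each class in the target; very unequal shares indicate imbalance
class_balance = toy["Label"].value_counts(normalize=True)

print(missing_per_column)
print(class_balance)
```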


In [None]:
## TO DO: Data exploration / Pre-processing 
## /TO DO

## **2. Bias in the data**



Algorithmic bias can be a result of bias in the data. Please check for bias in the data by calculating and displaying the proportion of successful candidates (success rate) in each group. Do it for both Gender (Female/Male) and Ethnicity (Asian/Black/Hispanic/White). 

For a group $g$:
$$sr_g=\frac{\text{Number of successful outcomes in } g}{\text{Number of individuals in } g}$$
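For a 0/1 target, the success rate of a group is simply the mean of the target within that group, so one possible sketch is a `groupby` followed by `mean`. The toy data and the helper name `success_rate` are illustrative, and we assume the group column is literally named `Gender` (the same call would apply to an `Ethnicity` column).

```python
import pandas as pd

# Toy data: 2 Female candidates (1 hired), 3 Male candidates (2 hired)
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Label":  [1, 0, 1, 1, 0],
})


def success_rate(df, group_col, label_col="Label"):
    # Mean of a 0/1 label within a group = successes / individuals in the group
    return df.groupby(group_col)[label_col].mean()


rates = success_rate(toy, "Gender")
print(rates)
```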



If we use this data for training, can you guess which groups the model will be biased against?


In [None]:
## TO DO: Calculate success rates for each group  
## /TO DO

## **3. Model fitting**

Fit a model of your choice to the data. Please note: 
1. We do not want to hire people based on gender and ethnicity, so do **not** include these features when training the model. 
2. You should calculate and print the performance of the model in terms of accuracy, precision and recall. 
3. Make sure to evaluate the **generalization** performance of the model.
4. Do not worry too much about the performance of the model (for reference, an accuracy of 0.7 is a good value)
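One way to satisfy the points above is sketched below on synthetic data, not the assessment dataset. The choice of logistic regression, the 75/25 hold-out split, the column names `Gender` and `Ethnicity`, and the feature names `f0`...`f4` are all assumptions made for illustration; any model and any sound validation scheme would do.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
demo["Label"] = (demo["f0"] + 0.5 * rng.normal(size=n) > 0).astype(int)
demo["Gender"] = rng.choice(["Female", "Male"], size=n)
demo["Ethnicity"] = rng.choice(["Asian", "Black", "Hispanic", "White"], size=n)

# Exclude the protected attributes and the target from the model inputs
feature_cols = [c for c in demo.columns if c not in ("Gender", "Ethnicity", "Label")]

# Hold out a test set so that the metrics reflect generalization performance
X_train, X_test, y_train, y_test = train_test_split(
    demo[feature_cols], demo["Label"],
    test_size=0.25, random_state=0, stratify=demo["Label"],
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions on unseen data only

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(metrics)
```

Keeping the test-set rows of `Gender` and `Ethnicity` aside (without feeding them to the model) makes it possible to measure bias on the same unseen data later.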

In [None]:
## TO DO: Fit the model and display performance 
## /TO DO

## **4. Bias in the model** 

To check whether the model is biased, we compare the success rate in the unprivileged group with the success rate in the privileged group (e.g. Female/Male, Black/White, Asian/White, ...). 

For a group $g$:
$$sr_g=\frac{\text{Number of successful outcomes in } g}{\text{Number of individuals in } g}$$





We can compare these success rates by taking the ratio (Disparate Impact) or subtracting them (Statistical Parity). If we call the unprivileged group $u$, and the privileged group $p$, we have:

$Disparate\ impact=\frac{sr_u}{sr_p}$, with reference value 1. 

$Statistical\ parity=sr_u - sr_p$, with reference value 0.
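As a small worked sketch of both metrics, consider eight made-up predictions split across two groups. Treating Female as the unprivileged group and Male as the privileged group is an assumption for the example, as are the helper name `group_success_rate` and the toy arrays.

```python
import numpy as np

# Toy model predictions on held-out data, with each candidate's group
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 1])
group = np.array(["Female"] * 4 + ["Male"] * 4)


def group_success_rate(pred, groups, value):
    # Share of positive predictions among members of the given group
    return pred[groups == value].mean()


sr_u = group_success_rate(y_pred, group, "Female")  # 2 of 4 predicted hired
sr_p = group_success_rate(y_pred, group, "Male")    # 3 of 4 predicted hired

disparate_impact = sr_u / sr_p     # reference value 1
statistical_parity = sr_u - sr_p   # reference value 0

print(f"Disparate impact:   {disparate_impact:.3f}")
print(f"Statistical parity: {statistical_parity:.3f}")
```

Here the disparate impact is below 1 and the statistical parity is negative, both pointing to a lower predicted success rate for the group treated as unprivileged.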



Take any unprivileged group and compare it to a privileged one. Calculate and print the Disparate Impact and Statistical Parity metrics. Is the model biased? Please make sure to compute these metrics on the model's predictions for unseen data.

In [None]:
## TO DO: Calculate and display disparate impact and statistical parity for a group of your choice
## /TO DO