# Bias in ML

We are using a job recruitment dataset. The data is synthetic and has been deliberately biased: it contains both sexism and racism. In the interests of time, we will look only at the sexism. In your own time, you could extend this notebook to explore the other "isms".

Keep in mind that this is a huge and complex area of study. Our work here is quite shallow.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.utils import resample

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
rng = np.random.RandomState(2)

## Read in dataset

In [None]:
import os
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
    base_dir = "."
dataset_dir = os.path.join(base_dir, "datasets")

In [None]:
df = pd.read_csv(os.path.join(dataset_dir, "recruiting.csv"))

## Take a cheeky look

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe(include="all")

What proportion of the whole dataset was employed?

In [None]:
df["employed_yes"].sum() / len(df)

What proportion was male?

In [None]:
df["sex_male"].sum() / len(df)

## Split into training set and test set

Ordinarily, we stratify using the class label only. But, for this analysis, I will stratify by both the class label and the person's sex.

In [None]:
stratify_var = list(zip(df["employed_yes"], df["sex_male"]))
dev, test = train_test_split(df, test_size=0.2, stratify=stratify_var, random_state=rng)

We can see that the proportions from the whole dataset are preserved in the training and test sets.

In [None]:
dev["employed_yes"].sum() / len(dev), test["employed_yes"].sum() / len(test)

In [None]:
dev["sex_male"].sum() / len(dev), test["sex_male"].sum() / len(test)
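The two checks above compare each proportion separately. Since we stratified on the pair (label, sex), we could also compare the joint proportions. Here is a minimal sketch on made-up data: the `toy` DataFrame and the string stratification key are assumptions for illustration, not part of the recruiting dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset
toy_rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "employed_yes": toy_rng.randint(0, 2, size=1000),
    "sex_male": toy_rng.randint(0, 2, size=1000),
})

# One stratification key per (label, sex) combination, e.g. "1_0"
toy_key = toy["employed_yes"].astype(str) + "_" + toy["sex_male"].astype(str)
toy_dev, toy_test = train_test_split(toy, test_size=0.2, stratify=toy_key, random_state=0)

# Joint proportions of each (label, sex) cell in the two splits
joint_dev = pd.crosstab(toy_dev["employed_yes"], toy_dev["sex_male"], normalize="all")
joint_test = pd.crosstab(toy_test["employed_yes"], toy_test["sex_male"], normalize="all")

# Largest difference between the two tables; it should be close to zero
max_gap = (joint_dev - joint_test).abs().to_numpy().max()
print(joint_dev, joint_test, max_gap, sep="\n\n")
```

On the real data, the same `pd.crosstab` calls could be applied to `dev` and `test`.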

In [None]:
features = ["sex_male", "race_white", "years_experience", "referred", "gcse", "a_level", "russell_group", 
            "honours", "years_volunteer", "income", "it_skills", "years_gaps", "quality_cv"]

numeric_features = ["years_experience", "gcse", "a_level", "years_volunteer", "income", "it_skills", 
                    "years_gaps", "quality_cv"]
boolean_features = ["sex_male", "race_white", "referred", "russell_group", "honours"]

X_dev = dev[features]
y_dev = dev["employed_yes"]
X_test = test[features]
y_test = test["employed_yes"]

## Exploratory Data Analysis

In [None]:
dev_copy = dev.copy()

Let's see whether there is evidence of historical sex discrimination.

In [None]:
sns.barplot(dev_copy, x="sex_male", y="employed_yes", estimator="mean", formatter=lambda x: ["female", "male"][x], errorbar=None)
plt.show()

We see that there may be existing bias in the dataset: the proportion of applicants who are employed is greater for male applicants than for female applicants.

We can see below that applicants with more years of experience are more likely to be employed. But that is not necessarily unfair: deciding to employ on the basis of experience seems reasonable.

In [None]:
dev_copy["binned_years_experience"] = dev["years_experience"].map(lambda e: 0 if e <= 2 else 1 if e <= 5 else 2 if e <= 9 else 3)

In [None]:
sns.barplot(dev_copy, x="binned_years_experience", y="employed_yes", estimator="mean", formatter=lambda x: ["0-2 years", "3-5 years", "6-9 years", "10+ years"][x], errorbar=None)
plt.show()

But we can combine the two perspectives. The plots below show that, for the same level of experience, females are less likely to be employed than males. The effect is more pronounced at the lower levels of experience.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.barplot(dev_copy[dev_copy["sex_male"] == 0], x="binned_years_experience", y="employed_yes", estimator="mean", formatter=lambda x: ["0-2 years", "3-5 years", "6-9 years", "10+ years"][x], errorbar=None, ax=axes[0])
axes[0].set_title("Female")
sns.barplot(dev_copy[dev_copy["sex_male"] == 1], x="binned_years_experience", y="employed_yes", estimator="mean", formatter=lambda x: ["0-2 years", "3-5 years", "6-9 years", "10+ years"][x], errorbar=None, ax=axes[1])
axes[1].set_title("Male")
plt.show()

## Logistic Regression

In [None]:
def build_model():
    preprocessor = ColumnTransformer([
        ("scaler", StandardScaler(), numeric_features),
        ("encoder", OneHotEncoder(drop="if_binary"), boolean_features)],
        remainder="drop")
    return Pipeline([
                    ("preprocessor", preprocessor),
                    ("predictor", LogisticRegression(penalty=None, random_state=rng))])

In [None]:
logistic_model = build_model()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=rng)

In [None]:
logistic_model.fit(X_train, y_train)

### Validation accuracy

In [None]:
predictions = logistic_model.predict(X_val)
accuracy_score(y_val, predictions)

In [None]:
cm = confusion_matrix(y_val, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

### Validation accuracy by group (males/females)

In [None]:
X_val_males = X_val[X_val["sex_male"] == 1]
y_val_males = y_val[X_val["sex_male"] == 1]
X_val_females = X_val[X_val["sex_male"] != 1]
y_val_females = y_val[X_val["sex_male"] != 1]

In [None]:
predictions_males = logistic_model.predict(X_val_males)
predictions_females = logistic_model.predict(X_val_females)
accuracy_score(y_val_males, predictions_males), accuracy_score(y_val_females, predictions_females)

It all seems very fair so far. Overall accuracy, accuracy for males and accuracy for females are all around 85%. But this may not be the right evaluation metric to be using.
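To see why equal accuracy can hide a disparity, here is a small made-up example. The two groups below are hypothetical and the labels are invented purely for illustration: both groups get the same accuracy, but the positives in group B are missed more often.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical group A: one false negative and one false positive
y_true_a = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred_a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

# Hypothetical group B: two false negatives and no false positives
y_true_b = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred_b = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

acc_a = accuracy_score(y_true_a, y_pred_a)   # 8 of 10 correct
acc_b = accuracy_score(y_true_b, y_pred_b)   # 8 of 10 correct
recall_a = recall_score(y_true_a, y_pred_a)  # 4 of 5 positives found
recall_b = recall_score(y_true_b, y_pred_b)  # 3 of 5 positives found

print(acc_a, acc_b)
print(recall_a, recall_b)
```

Accuracy counts both kinds of error equally, so it cannot tell us which kind of error each group suffers.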

### Recall by group

Which evaluation metric should we use?

Ideally, we would discuss this both within our team and more widely.

Here, instead, let's consult the *Fairness Tree*.

**Are your interventions punitive or assistive?** Assistive.

**Can you intervene with most people with need or only a small fraction?** Small fraction.

The tree proposes Recall as the metric:
$$\text{Recall} = \text{TPR} = \frac{TP}{FN + TP}$$

We want parity of Recall between the groups.

In [None]:
cm_males = confusion_matrix(y_val_males, predictions_males)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_males)
disp.plot()
plt.show()

In [None]:
cm_females = confusion_matrix(y_val_females, predictions_females)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_females)
disp.plot()
plt.show()

In [None]:
recall_males = cm_males[1, 1] / (cm_males[1, 0] + cm_males[1, 1])
recall_females = cm_females[1, 1] / (cm_females[1, 0] + cm_females[1, 1])

recall_males, recall_females

So, among the applicants who were actually employed (the positive class), the model makes more mistakes for females than for males: recall is lower for females.
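The same per-group recall can be computed without building the confusion matrices by hand. Below is a minimal sketch using `recall_score` from scikit-learn. The helper name `recall_by_group` and the toy arrays are my own inventions for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

def recall_by_group(y_true, y_pred, group):
    """Hypothetical helper: return {group value: recall within that group}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    return {g: recall_score(y_true[group == g], y_pred[group == g])
            for g in np.unique(group)}

# Made-up labels: the first six people are male (1), the last six female (0)
toy_group  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
toy_y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]

toy_recalls = recall_by_group(toy_y_true, toy_y_pred, toy_group)
print(toy_recalls)
```

In this notebook, a call such as `recall_by_group(y_val, predictions, X_val["sex_male"])` should give the same two numbers as the confusion-matrix calculation, since both compute TP / (FN + TP).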

### Probability scores

In [None]:
probs_males = logistic_model.predict_proba(X_val_males)[:, 1]
probs_females = logistic_model.predict_proba(X_val_females)[:, 1]

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.boxplot(data=probs_males, ax=axes[0])
axes[0].set_title("Male")
sns.boxplot(data=probs_females, ax=axes[1])
axes[1].set_title("Female")
plt.show()

So females are systematically given lower scores.

### Controlling for experience

We'll put the experience values into bins and pair with the probabilities.

In [None]:
probs_males_experience = pd.DataFrame(
    np.vstack((probs_males, 
               X_val_males["years_experience"].map(lambda e: 0 if e <= 2 else 1 if e <= 5 else 2 if e <= 9 else 3))).T,
    columns=["prob", "experience"])

probs_females_experience = pd.DataFrame(
    np.vstack((probs_females, 
               X_val_females["years_experience"].map(lambda e: 0 if e <= 2 else 1 if e <= 5 else 2 if e <= 9 else 3))).T,
    columns=["prob", "experience"])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.boxplot(data=probs_males_experience, x="experience", y="prob", ax=axes[0])
axes[0].set_title("Male")
sns.boxplot(data=probs_females_experience, x="experience", y="prob", ax=axes[1])
axes[1].set_title("Female")
plt.show()

We see that, even for the same level of experience, females receive lower scores than males.
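As an aside, the binning lambda used in this notebook could also be written with `pd.cut`. The sketch below uses a made-up `years` Series and checks that both approaches give the same bins. The bin labels are the ones used in the earlier bar plots.

```python
import numpy as np
import pandas as pd

# Made-up experience values, including the bin boundaries
years = pd.Series([0, 2, 3, 5, 6, 9, 10, 25])

# The lambda used in the notebook, given a name here for comparison
def bin_with_lambda(e):
    return 0 if e <= 2 else 1 if e <= 5 else 2 if e <= 9 else 3

# Right-closed intervals: (-inf, 2], (2, 5], (5, 9], (9, inf)
bins = [-np.inf, 2, 5, 9, np.inf]
labels = ["0-2 years", "3-5 years", "6-9 years", "10+ years"]
binned = pd.cut(years, bins=bins, labels=labels)

# Integer codes 0..3 correspond to the values produced by the lambda
codes = binned.cat.codes
print(pd.DataFrame({"years": years, "bin": binned, "code": codes}))
```

One advantage of `pd.cut` is that the bins carry readable labels, so plots do not need a separate formatter.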

## Mitigation

### Fairness through unawareness

What happens if we don't include the sensitive feature? Then, when we train, the model is unaware of it: in this case, of the sex of the applicant.

In [None]:
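# build_model() reads the global boolean_features when called, so sex_male is now dropped (remainder="drop")
boolean_features = ["race_white", "referred", "russell_group", "honours"]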

In [None]:
logistic_model_unaware_sex = build_model()

In [None]:
logistic_model_unaware_sex.fit(X_train, y_train)

In [None]:
predictions = logistic_model_unaware_sex.predict(X_val)
accuracy_score(y_val, predictions)

In [None]:
predictions_males = logistic_model_unaware_sex.predict(X_val_males)
predictions_females = logistic_model_unaware_sex.predict(X_val_females)
accuracy_score(y_val_males, predictions_males), accuracy_score(y_val_females, predictions_females)

In [None]:
cm_males = confusion_matrix(y_val_males, predictions_males)
cm_females = confusion_matrix(y_val_females, predictions_females)

recall_males = cm_males[1, 1] / (cm_males[1, 0] + cm_males[1, 1])
recall_females = cm_females[1, 1] / (cm_females[1, 0] + cm_females[1, 1])

recall_males, recall_females

This didn't help very much.
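One common explanation, which is an assumption here and not something we have checked on this dataset, is that other features act as proxies for the sensitive feature. A rough way to probe for this is to see how well the remaining features predict sex. The sketch below uses made-up data: `proxy_feature` and `neutral_feature` are hypothetical columns, with the first deliberately built to correlate with sex.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

toy_rng = np.random.RandomState(0)
n = 2000
sex_male = toy_rng.randint(0, 2, size=n)

# Hypothetical features: one correlated with sex, one unrelated to it
toy_X = pd.DataFrame({
    "proxy_feature": 1.5 * sex_male + toy_rng.normal(0, 1, size=n),
    "neutral_feature": toy_rng.normal(0, 1, size=n),
})

# How well can sex be predicted from the non-sensitive features alone?
proxy_accuracy = cross_val_score(
    LogisticRegression(), toy_X, sex_male, cv=5, scoring="accuracy").mean()

# Baseline: always predict the more frequent sex
baseline = max(sex_male.mean(), 1 - sex_male.mean())
print(proxy_accuracy, baseline)
```

If the accuracy is well above the baseline, the model can partly reconstruct the sensitive feature even though it was never given it.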

### Upsampling

Upsampling means sampling from the training data, with replacement, and adding the sampled examples to the training set as extra examples. We could, for example, upsample the minority class, to reduce the imbalance between the numbers of applicants who get employment and those who do not. Or, if there were fewer female applicants than male applicants, we could upsample the females to reduce the imbalance.

In this dataset, the numbers of males and females are roughly equal. It would seem that upsampling is not relevant.

But why don't we try it anyway? By upsampling females, there will be more of them than males, and so this gives females greater weight during training. I will double the number of females.
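Here is a tiny sketch of what upsampling does, on a made-up dataset of six applicants. The `toy_X` and `toy_y` objects are hypothetical. The female rows are replaced by a sample, drawn with replacement, that is twice as large.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up data: three females (sex_male == 0) and three males
toy_X = pd.DataFrame({"sex_male": [0, 0, 0, 1, 1, 1],
                      "years_experience": [1, 4, 7, 2, 5, 8]})
toy_y = pd.Series([0, 1, 1, 0, 1, 1], name="employed_yes")

is_female = toy_X["sex_male"] == 0
toy_X_females, toy_y_females = toy_X[is_female], toy_y[is_female]

# Draw twice as many female rows as we started with, with replacement.
# Passing X and y together keeps each row paired with its own label.
toy_X_females_up, toy_y_females_up = resample(
    toy_X_females, toy_y_females,
    replace=True, n_samples=2 * len(toy_X_females), random_state=0)

# Males unchanged, females upsampled
toy_X_up = pd.concat([toy_X[~is_female], toy_X_females_up])
toy_y_up = pd.concat([toy_y[~is_female], toy_y_females_up])
print(toy_X_up.assign(employed_yes=toy_y_up))
```

Because the sampling is random, some female rows may appear several times and others not at all.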

In [None]:
X_train_males = X_train[X_train["sex_male"] == 1]
y_train_males = y_train[X_train["sex_male"] == 1]
X_train_females = X_train[X_train["sex_male"] != 1]
y_train_females = y_train[X_train["sex_male"] != 1]

In [None]:
X_train_upsampled_females, y_train_upsampled_females = \
    resample(X_train_females, y_train_females, 
             replace=True, n_samples=2*len(X_train_females), random_state=rng)

In [None]:
# Note: this refits logistic_model in place, replacing the model trained earlier
logistic_model.fit(pd.concat([X_train_males, X_train_upsampled_females]),
                   pd.concat([y_train_males, y_train_upsampled_females]))

In [None]:
predictions = logistic_model.predict(X_val)
accuracy_score(y_val, predictions)

In [None]:
predictions_males = logistic_model.predict(X_val_males)
predictions_females = logistic_model.predict(X_val_females)
accuracy_score(y_val_males, predictions_males), accuracy_score(y_val_females, predictions_females)

In [None]:
cm_males = confusion_matrix(y_val_males, predictions_males)
cm_females = confusion_matrix(y_val_females, predictions_females)

recall_males = cm_males[1, 1] / (cm_males[1, 0] + cm_males[1, 1])
recall_females = cm_females[1, 1] / (cm_females[1, 0] + cm_females[1, 1])

recall_males, recall_females

This didn't help much either :(
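A related idea, which is not what we did above, is to leave the data alone and give each female example a weight of 2 during training. The sketch below uses made-up data and a simplified pipeline; `toy_X`, `toy_y` and the pipeline steps are assumptions for illustration. The weights are passed to the final step using scikit-learn's `step__parameter` convention.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_rng = np.random.RandomState(0)
n = 400

# Made-up applicants, with a label that deliberately favours males
toy_X = pd.DataFrame({"sex_male": toy_rng.randint(0, 2, size=n),
                      "years_experience": toy_rng.randint(0, 15, size=n)})
toy_score = toy_X["years_experience"] + 3 * toy_X["sex_male"] + toy_rng.normal(0, 2, size=n)
toy_y = (toy_score > 8).astype(int)

# Weight 2 for females, 1 for males: a deterministic analogue of doubling
weights = np.where(toy_X["sex_male"] == 0, 2.0, 1.0)

toy_model = Pipeline([("scaler", StandardScaler()),
                      ("predictor", LogisticRegression())])

# The "predictor__" prefix routes sample_weight to the LogisticRegression step
toy_model.fit(toy_X, toy_y, predictor__sample_weight=weights)
toy_predictions = toy_model.predict(toy_X)
```

This is not exactly equivalent to upsampling: the weights are fixed instead of random, and here the scaler does not see them.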

Perhaps we would have more luck if we tried more advanced techniques, such as those in IBM's AI Fairness 360 library.