# Chapter 6 Tutorial: Measuring and Mitigating Bias

In section 6.2 of the book, we discussed the impact of bias in LLMs and some of the techniques developed to mitigate it. In this tutorial, we will apply one of these debiasing techniques and observe the corresponding shifts in model behavior. This tutorial has three main goals:

* Analyze how the CrowS benchmark is designed to measure bias.

* Test the use of one potential bias mitigation technique on RoBERTa and evaluate the improvement.

* Apply the debiased model to a downstream task to assess whether its capabilities as a language model have degraded.

This tutorial utilizes the tools located at https://github.com/mcgill-nlp/bias-bench.

## Installation and Imports


In [1]:
## Download bias-bench
!git clone https://github.com/mcgill-nlp/bias-bench.git

## Install bias-bench
import os

os.chdir("bias-bench")
!pip install --no-deps -e .


In [4]:
## Install dependencies (don't worry about compatibility errors)

!pip install -q transformers==4.16.2 datasets==1.18.3 pandas==1.5.3

In [5]:
import pandas as pd
import numpy as np
import torch
import nltk

from transformers import (
    AutoTokenizer,
    RobertaForMaskedLM,
    AutoConfig,
    TrainingArguments,
    Trainer
)

from bias_bench.benchmark.crows import CrowSPairsRunner
from bias_bench.model import models
from bias_bench.dataset import load_sentence_debias_data
from bias_bench.debias import compute_gender_subspace

from datasets import load_dataset, load_metric



### Get Wikipedia data

In [6]:
## This notebook requires some text from Wikipedia to learn a representation of model bias.
## Download the dataset here:
##
##   https://drive.google.com/file/d/1JSlm8MYDbNjpMPnKbb91T-xZnlWAZmZl/view
##
## and upload it into /content/ on Colab.

In [7]:
## Make a folder and unzip the wikipedia data into it.
!mkdir -p data/text
!mv ../wikipedia-2.5.txt.zip data/text/
!unzip data/text/wikipedia-2.5.txt.zip -d data/text/

## Measure bias with CrowS data

In [None]:
## We will measure bias using the CrowS dataset. Each pair consists of two versions of the same sentence
## that differ only in the words identifying the social group, so that one version is more stereotypical
## than the other, as in the example below.

df = pd.read_csv("./data/crows/crows_pairs_anonymized.csv")
print("Stereotype text:", df.sent_more[0])
print("Anti-stereotype text:", df.sent_less[0])
print("Bias types:", df.bias_type.unique())

In [None]:
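## To calculate the CrowS bias score, each sentence in a pair is scored with a pseudo-log-likelihood:
## the tokens the two sentences share are masked one at a time, and the model's log-probabilities for
## them are summed, conditioned on the words that differ (e.g., the group being described). The score is
## the percentage of pairs in which the model assigns the higher likelihood to the more stereotypical
## sentence. A score near 50 means the model shows no systematic preference for either sentence, i.e.
## minimal bias by this measure.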

## Download RoBERTa for the evaluations.
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RobertaForMaskedLM.from_pretrained(model_name)

## Run the pairwise sentence bias comparison.
runner = CrowSPairsRunner(
    model=model,
    tokenizer=tokenizer,
    input_file="./data/crows/crows_pairs_anonymized.csv",
    bias_type='gender',
    is_generative=False
)
results = runner()
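`CrowSPairsRunner` hides the scoring details. The sketch below illustrates the CrowS pseudo-log-likelihood comparison: for each sentence, the tokens shared by both sentences of a pair are masked one at a time and their log-probabilities are summed, and the score is the percentage of pairs where the more stereotypical sentence scores higher. It makes simplifying assumptions: sentences are pre-split into words rather than subword tokens, and `toy_log_prob` is a hypothetical stand-in for a masked language model, not part of bias-bench.

```python
import difflib
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: the log-probability a masked LM assigns to
# tokens[i] when position i is masked and all other tokens are visible.
MaskedLogProb = Callable[[Sequence[str], int], float]


def shared_token_indices(a: Sequence[str], b: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Positions (in a and in b) of the tokens the two sentences have in common."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    idx_a: List[int] = []
    idx_b: List[int] = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            idx_a.extend(range(a0, a1))
            idx_b.extend(range(b0, b1))
    return idx_a, idx_b


def pseudo_log_likelihood(tokens: Sequence[str], indices: List[int], log_prob: MaskedLogProb) -> float:
    """Sum of log-probabilities of the shared tokens, masking one at a time."""
    return sum(log_prob(tokens, i) for i in indices)


def crows_score(pairs: List[Tuple[List[str], List[str]]], log_prob: MaskedLogProb) -> float:
    """Percentage of (sent_more, sent_less) pairs where sent_more scores higher."""
    wins = 0
    for sent_more, sent_less in pairs:
        idx_more, idx_less = shared_token_indices(sent_more, sent_less)
        more = pseudo_log_likelihood(sent_more, idx_more, log_prob)
        less = pseudo_log_likelihood(sent_less, idx_less, log_prob)
        if more > less:
            wins += 1
    return 100.0 * wins / len(pairs)


def toy_log_prob(tokens: Sequence[str], i: int) -> float:
    # Hypothetical "biased" scorer: any masked token looks more likely
    # when the visible context contains "he".
    context = [t for j, t in enumerate(tokens) if j != i]
    return -1.0 if "he" in context else -2.0


pairs = [
    ("he is a doctor".split(), "she is a doctor".split()),
    ("he fixed the car".split(), "she fixed the car".split()),
]
print(crows_score(pairs, toy_log_prob))  # 100.0
```

With a real model, `log_prob` would run the masked LM on the sentence with position `i` replaced by the mask token and return the log-softmax score of the original token at that position.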

## Use Sent-Debias to compute the bias subspace

This step uses the Wikipedia data to learn a bias subspace. The text is split into sentences, and in each sentence that contains a word from a predefined list of gendered word pairs (e.g., boy/girl, father/mother), that word is swapped with its counterpart to produce two opposing versions of the sentence, similar to the CrowS pairs. Sent-Debias encodes both versions with the model, centers each pair of sentence embeddings on its mean, and applies principal component analysis (PCA) to the result; the leading principal component(s) span a subspace that captures how the swapped words shift the model's representations. This is referred to as the bias direction. During the model's forward pass, the projection of each representation onto this direction is then subtracted, countering the effect of bias on the outputs.
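The sentence-pair construction described above amounts to a word swap. The following is a minimal sketch of that idea, assuming a tiny hypothetical word list (`GENDER_PAIRS`) and simple letter-based word matching; bias-bench's `load_sentence_debias_data` uses its own word lists and processing, which may differ.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical, abbreviated word list for illustration only.
GENDER_PAIRS = [("boy", "girl"), ("father", "mother"), ("he", "she"), ("man", "woman")]

# Words are runs of letters, so punctuation is left untouched.
WORD_RE = re.compile(r"[A-Za-z]+")


def build_swap_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every listed word to its counterpart, in both directions."""
    swap: Dict[str, str] = {}
    for a, b in pairs:
        swap[a] = b
        swap[b] = a
    return swap


def swap_gendered_words(sentence: str, swap: Dict[str, str]) -> Optional[str]:
    """Return the counterfactual sentence, or None if no listed word occurs."""
    found = False

    def repl(match: "re.Match[str]") -> str:
        nonlocal found
        word = match.group(0)
        target = swap.get(word.lower())
        if target is None:
            return word
        found = True
        # Keep an initial capital, e.g. "He" -> "She".
        return target.capitalize() if word[0].isupper() else target

    swapped = WORD_RE.sub(repl, sentence)
    return swapped if found else None


swap = build_swap_map(GENDER_PAIRS)
print(swap_gendered_words("He said the boy was tall.", swap))  # She said the girl was tall.
print(swap_gendered_words("The weather is nice.", swap))       # None
```

Sentences for which the function returns `None` contribute no pair. A real word list also needs care with ambiguous words (e.g., "her" can correspond to "his" or "him"), which is one reason curated lists are used.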

In [None]:
nltk.download('punkt')
## On recent NLTK releases, sentence splitting may also require: nltk.download('punkt_tab')

In [None]:
data = load_sentence_debias_data(
    persistent_dir=".", bias_type="gender"
)

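## No padding-token override is needed here: RoBERTa's tokenizer already defines <pad>.
## (Setting tokenizer.pad_token = tokenizer.eos_token is only needed for models such as GPT-2.)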
model = getattr(models, "RobertaModel")(model_name)
model.eval()

bias_direction = compute_gender_subspace(
    data, model, tokenizer, batch_size=32
)
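`compute_gender_subspace` hides the linear algebra. The sketch below shows the Sent-Debias subspace estimate and projection on toy NumPy arrays, assuming the paired sentence embeddings have already been computed. `compute_bias_subspace` and `remove_bias` are hypothetical names, not bias-bench's API, and bias-bench's implementation (which encodes the sentences with RoBERTa first) may differ in details such as how many components it keeps.

```python
import numpy as np


def compute_bias_subspace(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 1) -> np.ndarray:
    """Return a (k, dim) orthonormal basis for the bias subspace.

    emb_a[i] and emb_b[i] embed the two versions of sentence i.
    """
    means = (emb_a + emb_b) / 2.0
    # Centering each pair on its own mean removes the shared sentence content,
    # leaving mostly the variation caused by the swapped words.
    centered = np.concatenate([emb_a - means, emb_b - means], axis=0)
    # The stacked rows already have zero mean, so the top right singular
    # vectors are the top principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def remove_bias(h: np.ndarray, subspace: np.ndarray) -> np.ndarray:
    """Subtract each row's projection onto the (orthonormal) bias subspace."""
    return h - (h @ subspace.T) @ subspace


# Toy data: 8-dim "embeddings" whose paired versions differ along axis 0.
rng = np.random.default_rng(0)
content = rng.normal(size=(50, 8))
gender_axis = np.eye(8)[0]
emb_a = content + gender_axis + rng.normal(scale=0.01, size=(50, 8))
emb_b = content - gender_axis + rng.normal(scale=0.01, size=(50, 8))

subspace = compute_bias_subspace(emb_a, emb_b, k=1)
print(np.round(np.abs(subspace[0]), 2))  # close to the unit vector along axis 0
```

After `remove_bias`, every representation has zero component along the learned direction, which is the operation the debiased RoBERTa models apply to their hidden states.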

In [None]:
## Make sure to save the bias vectors, since they can take several hours to compute. They will then be
## incorporated into the forward pass of the RoBERTa model, resulting in debiased outputs.

bias_direction_path = "./sent-debias-gender.pt"

if "bias_direction" in globals():
    ## Save the vectors computed in the previous cell.
    torch.save(bias_direction, bias_direction_path)
else:
    ## The session restarted: reload the saved vectors instead of recomputing them.
    bias_direction = torch.load(bias_direction_path)

In [None]:
## Apply the learned bias direction to the masked language model and rerun the CrowS evaluation

model = getattr(models, "SentenceDebiasRobertaForMaskedLM")(
    model_name, bias_direction=bias_direction
)

runner = CrowSPairsRunner(
    model=model,
    tokenizer=tokenizer,
    input_file="./data/crows/crows_pairs_anonymized.csv",
    bias_type="gender",
    is_generative=False,
    is_self_debias=False
)
results = runner()

The Sent-Debias technique substantially improved RoBERTa's CrowS score, moving it closer to the ideal value of 50. However, this is an incomplete picture, since the debiasing process could also have caused the model to forget useful information. (Consider that a severely undertrained model with poor overall performance could score very well on this metric, simply because it hasn't seen enough data to learn any biases or stereotypes yet.)

In the final step of the tutorial, we'll evaluate whether the debiased RoBERTa model can match the utility of the original model.

## Compare performance on a downstream task

### Collect the data

In [None]:
## Now that we have a debiased RoBERTa model, we will fine-tune it on a downstream task to see whether its
## knowledge still transfers as effectively as before. We'll run a brief training cycle on the SST-2
## sentiment analysis data for both the original and the debiased model.

task_name = "sst2"
raw_datasets = load_dataset("glue", task_name)
metric = load_metric("glue", task_name)

label_list = raw_datasets["train"].features["label"].names
num_labels = len(label_list)

## Process the dataset

def preprocess_function(examples):
    # Tokenize the texts
    result = tokenizer(
        examples['sentence'],
        padding=True,
        max_length=128,
        truncation=True
    )
    return result

raw_datasets = raw_datasets.map(
    preprocess_function,
    batched=True
)

train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["validation"]

In [None]:
train_dataset

### Train and evaluate the baseline model

In [None]:
## Create a metric for evaluating accuracy of SST-2 predictions
def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    result = metric.compute(predictions=preds, references=p.label_ids)
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    print(result)
    return result

## Define a common set of configurations for both models
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task=task_name
)

In [None]:
## Configure and instantiate the baseline model.
model = getattr(models, "RobertaForSequenceClassification")(
    model_name, config=config
)

## Set the training arguments
training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir='baseline',
    save_strategy='no',
    per_device_train_batch_size=16
)

## Define a training loop
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    args=training_args
)

## Run the training loop
train_result = trainer.train()
baseline_metrics = trainer.evaluate(eval_dataset=eval_dataset)

### Train and evaluate the debiased model

In [None]:
## Configure and instantiate the debiased model.
model = getattr(models, "SentenceDebiasRobertaForSequenceClassification")(
    model_name, config=config, bias_direction=bias_direction
)

## Set the training arguments
training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir='debiased',
    save_strategy='no',
    per_device_train_batch_size=16
)

## Define a training loop
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    args=training_args
)

## Run the training loop
train_result = trainer.train()
debiased_metrics = trainer.evaluate(eval_dataset=eval_dataset)
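To judge whether the debiased model's SST-2 accuracy is meaningfully lower, it helps to compare the gap with sampling noise. The helper below is a rough sketch, not part of bias-bench: it treats the two accuracies as independent binomial proportions over `n` validation examples and returns the difference with an approximate standard error.

```python
import math
from typing import Tuple


def accuracy_gap(baseline_acc: float, debiased_acc: float, n: int) -> Tuple[float, float]:
    """Return (debiased - baseline, approximate standard error of that difference).

    Assumes the two accuracies are independent binomial proportions over n examples.
    """
    diff = debiased_acc - baseline_acc
    se = math.sqrt(
        baseline_acc * (1 - baseline_acc) / n + debiased_acc * (1 - debiased_acc) / n
    )
    return diff, se
```

Pass the `eval_accuracy` values reported by the two `trainer.evaluate` calls and `n=len(eval_dataset)`. A gap within roughly two standard errors is hard to distinguish from noise, especially after a single one-epoch run with one seed. Because both models are scored on the same sentences, the independence assumption is conservative; a paired test on per-example predictions (e.g., McNemar's test) would be tighter.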