# Fairness evaluation of `bert-base-multilingual-uncased`
This notebook runs the fairness metrics in the Biased Rulers package against a pretrained BERT checkpoint. We start with the imports.

In [2]:
import os
os.chdir("../")  # move up one directory, presumably so that `biased_rulers` is importable
from biased_rulers.metrics import seat, lpbs
import numpy as np
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel
import torch

## Define and download model

In [3]:

model_type = "bert-base-multilingual-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_type)
model = AutoModel.from_pretrained(model_type)
print(f"Loaded {model_type}")

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loaded bert-base-multilingual-uncased


## WEAT-based tests
In this section, we run the WEAT-based metrics. The Biased Rulers package supports SEAT (May et al., 2019) and two variants, by Lauscher et al. (2021) and Tan et al. (2019). Each test returns a mapping of results, whose values we average into a single score.
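For orientation, the effect size that WEAT-style tests are built on can be sketched in a few lines. This is a minimal illustration of the standard formula (Caliskan et al., 2017) on toy 2-D vectors, not the code inside `biased_rulers`; the function names and the toy vectors are assumptions made for this example, and SEAT would apply the same statistic to sentence embeddings produced by the model.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)


def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of target sets X and Y,
    normalised by the sample standard deviation over X and Y together."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings (hypothetical): X leans towards A, Y leans towards B.
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]

effect = weat_effect_size(X, Y, A, B)
print(effect)
```

A positive value means the X targets sit closer to the A attributes than the Y targets do; swapping X and Y flips the sign.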

In [5]:

attribute_template = "This is the _."
target_template = "This is the _."

results = seat.seat_test(attribute_template, target_template, tokenizer, model)
score = np.fromiter(results.values(), dtype=float).mean()
print(score)

0.43653131324269195


In [6]:
results = seat.lauscher_et_al_test(attribute_template, target_template, tokenizer, model)
score = np.fromiter(results.values(), dtype=float).mean()
print(score)

0.4501284455471183


In [45]:
results = seat.tan_et_al_test(attribute_template, target_template, tokenizer, model)
score = np.fromiter(results.values(), dtype=float).mean()
print(score)

0.8573153343423693


## Log probability bias score
In this section, we run the log probability bias score (LPBS) experiments. LPBS reads the model's masked-token predictions, so we reload the checkpoint with its masked language modeling head via `AutoModelForMaskedLM`.

In [7]:
model = AutoModelForMaskedLM.from_pretrained(model_type)  # LPBS needs the masked-LM head; the WEAT-based tests above used the bare encoder (AutoModel)

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
results = lpbs.lpbs_test("", "", tokenizer, model)
print(results)

(0.4847451760247023, 0.809558354722996)
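As a rough guide to the quantity behind this test, the sketch below computes a log probability bias score in the sense of Kurita et al. (2019) for a single template. It is an illustration under assumptions: the probabilities are made-up numbers standing in for masked-LM outputs, the function names are hypothetical, and it does not show how `lpbs.lpbs_test` aggregates its inputs into the tuple printed above.

```python
import math


def increased_log_prob(p_target, p_prior):
    """log(p_target / p_prior): how much the attribute raises the target's
    probability relative to the prior with the attribute masked out."""
    return math.log(p_target / p_prior)


def log_prob_bias_score(p_tgt_a, p_prior_a, p_tgt_b, p_prior_b):
    """Difference in increased log probability between two targets."""
    return increased_log_prob(p_tgt_a, p_prior_a) - increased_log_prob(p_tgt_b, p_prior_b)


# Hypothetical probabilities for a template such as "[MASK] is a programmer":
# p_tgt   = P([MASK] = target | attribute visible)
# p_prior = P([MASK] = target | attribute also masked)
score = log_prob_bias_score(
    p_tgt_a=0.6, p_prior_a=0.3,   # e.g. target "he"
    p_tgt_b=0.1, p_prior_b=0.2,   # e.g. target "she"
)
print(score)
```

A score of zero would mean the attribute shifts both targets equally; the sign shows which target the attribute favours.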


## CrowS-Pairs test
Finally, we test an extrinsic measure.

In [9]:
from biased_rulers.metrics import crowspairs

In [10]:
crows_score = crowspairs.evaluate(tokenizer, model)

100%|██████████| 1508/1508 [08:26<00:00,  2.98it/s]

Total examples: 1508
Metric score: 55.31
Stereotype score: 55.83
Anti-stereotype score: 53.21
Num. neutral: 4 0.27
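To help read the summary above, here is a simplified sketch of how such numbers can be derived from per-pair sentence scores. It follows the general CrowS-Pairs recipe (count how often the more stereotypical sentence gets the higher score) but is not the `crowspairs.evaluate` implementation; the tuple layout, function name, and toy scores are assumptions. The second value on the `Num. neutral` line appears to be a percentage, since 4 / 1508 ≈ 0.27%.

```python
def crows_pairs_summary(pairs):
    """Summarise scored sentence pairs.

    pairs: list of (score_more, score_less, direction) tuples, where
    score_more / score_less are the model's scores for the more and less
    stereotypical sentence and direction is "stereo" or "antistereo".
    (Hypothetical layout chosen for this sketch.)
    """
    n = len(pairs)
    neutral = 0
    hits = {"stereo": 0, "antistereo": 0}
    totals = {"stereo": 0, "antistereo": 0}

    for score_more, score_less, direction in pairs:
        if score_more == score_less:
            neutral += 1          # ties count towards neither direction
            continue
        totals[direction] += 1
        if score_more > score_less:
            hits[direction] += 1  # model prefers the more stereotypical sentence

    def pct(num, den):
        return round(100.0 * num / den, 2) if den else 0.0

    return {
        "metric_score": pct(hits["stereo"] + hits["antistereo"], n),
        "stereotype_score": pct(hits["stereo"], totals["stereo"]),
        "anti_stereotype_score": pct(hits["antistereo"], totals["antistereo"]),
        "num_neutral": neutral,
        "pct_neutral": pct(neutral, n),
    }


# Toy scores (hypothetical log-likelihoods, higher = more likely).
toy_pairs = [
    (-10.0, -12.0, "stereo"),      # prefers the stereotypical sentence
    (-15.0, -14.0, "stereo"),      # prefers the less stereotypical sentence
    (-9.0, -11.0, "antistereo"),   # prefers the more stereotypical sentence
    (-8.0, -8.0, "stereo"),        # tie, counted as neutral
]
summary = crows_pairs_summary(toy_pairs)
print(summary)
```

Under this reading, a metric score of 50 would indicate no systematic preference, so the 55.31 reported above would correspond to a mild preference for the more stereotypical sentences.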




