# Detecting Gender Bias in English-German Translations using Natural Language Processing

This notebook presents the bias detection model used for the demonstration. Minimal documentation and interpretation are provided, as it is intended to be read alongside the thesis.  

The notebook performs the following steps:  
1. Reads the created dataset.  
2. Loads and trains a multilingual BERT model.  
3. Evaluates the model on the held-out test set and on a set of handcrafted test cases.


## Import Libraries
Standard Python libraries are imported for data handling, computation, and plotting. PyTorch and the Transformers library are used for model training and inference. Evaluation metrics and dataset splitting functions are imported from scikit-learn.  

In [None]:
# standard libraries
import os
import random

# data handling
import pandas as pd
import numpy as np

# torch and dataset utils
import torch
from torch.utils.data import Dataset

# transformers library
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)

# evaluation and data split
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_recall_fscore_support,
    confusion_matrix,
    classification_report,
)
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

## Set Seed
A fixed seed is set for the Python, NumPy, and PyTorch random number generators so that results are reproducible across runs. cuDNN is additionally switched to deterministic mode, with benchmark autotuning disabled, to keep GPU computations consistent.  


In [None]:
seed = 10

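def set_seed(seed):
    # seed Python, NumPy and PyTorch (CPU and all GPUs)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # make cuDNN pick deterministic kernels and disable autotuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # note: only affects subprocesses; the running interpreter reads this at startup
    os.environ["PYTHONHASHSEED"] = str(seed)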

set_seed(seed)

## Load Dataset
The dataset is loaded from a CSV file. This dataset was created using the `join_datasets.py` script located in the `/datasets` directory. It contains sentence pairs with labels indicating whether the translation exhibits gender bias. The first few rows are displayed for inspection.

In [None]:
df = pd.read_csv("datasets/dataset.csv")
df.head(5)

# Initialize Model

The computational device is determined, prioritizing GPU if available. Device information is printed to confirm the configuration.


In [None]:
# set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# print device info
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU: None (using CPU)")

## Load Tokenizer
The tokenizer corresponding to the selected mBERT model is loaded. This tokenizer is used to convert input text into the token IDs required by the model.


In [None]:
# model name
model_path = "bert-base-multilingual-cased"

# load tokenizer
tokenizer = BertTokenizer.from_pretrained(model_path)

## Load Model and Send to Device
The pre-trained mBERT model is loaded with a sequence classification head configured for binary classification. The model is transferred to the previously selected device for training and evaluation.


In [None]:
# load model with classification head
model = BertForSequenceClassification.from_pretrained(
    model_path,
    num_labels=2,
    id2label={0: "neutral", 1: "biased"},
    label2id={"neutral": 0, "biased": 1}
)

# move model to device
model.to(device)

## Freeze Encoder Layers and Count Trainable Parameters
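The first eight encoder layers (layers 0 to 7) of BERT are frozen to reduce training time, while the four higher encoder layers (8 to 11) and the classification head remain trainable and adapt to the task. The embedding layer is not frozen by this code either. Freezing fewer layers increases the number of trainable parameters and the model's flexibility; this was tested in additional experiments.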

In [None]:
# freeze encoder layers 0 to 7
for name, param in model.named_parameters():
    if name.startswith("bert.encoder.layer."):
        layer_num = int(name.split(".")[3])
        if layer_num < 8:
            param.requires_grad = False

# count and print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable_params}")
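
The freezing rule depends on how parameter names are structured. As a minimal sketch, the snippet below applies the same name-parsing logic to a few hand-written parameter names. The names follow the pattern used by `BertForSequenceClassification`; the helper `is_frozen` and its `first_trainable_layer` argument are hypothetical and introduced only for illustration.

```python
# Hypothetical helper mirroring the freezing rule: a parameter is frozen
# when it belongs to an encoder layer whose index is below the cutoff.
def is_frozen(name: str, first_trainable_layer: int = 8) -> bool:
    prefix = "bert.encoder.layer."
    if not name.startswith(prefix):
        return False  # embeddings, pooler and classifier are left trainable
    layer_num = int(name.split(".")[3])  # "bert.encoder.layer.<n>..." -> n
    return layer_num < first_trainable_layer

# Example parameter names (hand-written for illustration)
example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.7.output.dense.bias",
    "bert.encoder.layer.8.attention.self.query.weight",
    "bert.encoder.layer.11.output.LayerNorm.weight",
    "classifier.weight",
]

frozen_flags = {name: is_frozen(name) for name in example_names}
for name, frozen in frozen_flags.items():
    status = "frozen" if frozen else "trainable"
    print(f"{status:<10}{name}")
```

Lowering `first_trainable_layer` in this sketch corresponds to freezing fewer layers.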

# Data Pre-processing
The EN-DE dataset is prepared for training, validation, and testing. Minimal preprocessing is applied, as the data has already been cleaned and labeled for bias detection.

## Custom Dataset Class
A custom `BiasDataset` class is defined to handle EN-DE sentence pairs. Each sample is tokenized using the mBERT tokenizer, with the German translation provided as the `text_pair`. The label tensor is included for supervised training.

In [None]:
# custom dataset for bias detection
class BiasDataset(Dataset):
    # init with dataframe and tokenizer
    def __init__(self, dataframe, tokenizer):
        self.data = dataframe
        self.tokenizer = tokenizer

    # return number of samples
    def __len__(self):
        return len(self.data)

    # return one encoded sample
    def __getitem__(self, idx):
        english = self.data.iloc[idx]["english"]
        german = self.data.iloc[idx]["german"]
        label = int(self.data.iloc[idx]["label"])

        # tokenize EN-DE sentence pair
        encoded = self.tokenizer(
            text=english,
            text_pair=german,
            padding="max_length",
            truncation=True,
            max_length=256,
            return_tensors="pt"
        )

        # encoded outputs have extra batch dimension, remove it with squeeze(0) to get plain tensors
        item = {key: val.squeeze(0) for key, val in encoded.items()}
        # add label tensor to the dict under key 'labels'
        item["labels"] = torch.tensor(label)
        return item
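
To make the sentence-pair encoding more concrete, the following toy sketch lays out the three tensors the tokenizer returns for a pair. It is only an illustration: whitespace splitting stands in for WordPiece tokenization, truncation is left out, and the function name `layout_pair` is hypothetical.

```python
# Toy illustration of a BERT sentence-pair input. Whitespace splitting stands
# in for WordPiece, so the tokens differ from what the real tokenizer produces.
def layout_pair(text, text_pair, max_length=12):
    tokens_a = text.split()
    tokens_b = text_pair.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length (truncation not sketched)")

    # segment 0: [CLS] + first sentence + first [SEP]; segment 1: the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # real tokens get attention, padding does not
    attention_mask = [1] * len(tokens)

    # pad all three lists up to max_length
    pad = max_length - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    token_type_ids = token_type_ids + [0] * pad
    attention_mask = attention_mask + [0] * pad
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = layout_pair("How are you?", "Wie geht es dir?")
print(tokens)
print(token_type_ids)
print(attention_mask)
```

In the actual dataset class the same layout is produced with `max_length=256`, and the tokens are replaced by their vocabulary IDs.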


## Train Test Split
The dataset is split into train (80%), validation (10%), and test (10%) sets. Stratified sampling is applied to maintain the label distribution across all splits.

In [None]:
# split data into train (80%) and temp (20%), keeping label distribution same with stratify
train_df, temp_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=seed
)

# split temp into validation (10%) and test (10%) sets, stratified by label
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    stratify=temp_df["label"],
    random_state=seed
)
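
The effect of the two-stage stratified split can be checked on toy data. The sketch below assumes a made-up label list with 70 neutral and 30 biased samples (not the real dataset) and applies the same two calls to `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70 labelled 0 and 30 labelled 1 (made-up proportions)
samples = list(range(100))
labels = [0] * 70 + [1] * 30

# first split: 80% train, 20% temp
train_x, temp_x, train_y, temp_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=10
)
# second split: temp is halved, giving 10% validation and 10% test overall
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.5, stratify=temp_y, random_state=10
)

for name, y in [("train", train_y), ("val", val_y), ("test", test_y)]:
    counts = Counter(y)
    print(f"{name}: n={len(y)}, share of label 1 = {counts[1] / len(y):.2f}")
```

Because `test_size=0.5` is applied to the 20% remainder, each of the two final parts holds 10% of the original data, and stratification keeps the share of label 1 the same in all three sets.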


## Create Dataset Objects
`BiasDataset` objects are created for the train, validation, and test sets. These objects provide tokenized EN-DE sentence pairs and labels to the model during training and evaluation.

In [None]:
# create train, validation and test datasets
train_dataset = BiasDataset(train_df, tokenizer)  
val_dataset = BiasDataset(val_df, tokenizer)    
test_dataset = BiasDataset(test_df, tokenizer)  

# Training
The mBERT model is trained on the EN-DE bias dataset. The training procedure follows the hyperparameters and setup described below.

## Training Parameters
Hyperparameters are defined for learning rate, batch size, number of epochs, and output directory. TrainingArguments are configured to perform evaluation, logging, and model saving at the end of each epoch. The best model is loaded automatically based on macro F1 score.


In [None]:
# hyperparameters
lr = 2e-5
batch_size = 16
num_epochs = 8

output_dir = "./model_output"

training_args = TrainingArguments(
    seed=seed,
    output_dir=output_dir,       
    num_train_epochs=num_epochs,   
    per_device_train_batch_size=batch_size, 
    per_device_eval_batch_size=batch_size,   
    learning_rate=lr,             
    warmup_ratio=0.1,
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",       
    load_best_model_at_end=True,  
    metric_for_best_model="eval_f1_macro",  
    greater_is_better=True   
)
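
As a rough sketch of what `warmup_ratio=0.1` implies, the snippet below derives the number of warmup steps and the learning rate at a few points in training. It assumes a hypothetical training-set size of 8,000 pairs, a single device, no gradient accumulation, and the linear schedule that `Trainer` uses by default; the function `lr_at` is illustrative and not part of the library.

```python
import math

# values taken from the configuration above
lr = 2e-5
batch_size = 16
num_epochs = 8
warmup_ratio = 0.1

# ASSUMPTION: hypothetical training-set size, single device, no gradient accumulation
n_train = 8000

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return lr * step / warmup_steps
    return lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With early stopping enabled, training may end before `total_steps` is reached, in which case the decay phase is simply cut short.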

## Evaluation Metrics Calculation for Classification
A custom metric function is defined to calculate precision, recall, F1 score, and support for each class. Macro averages and overall accuracy are also computed, so that the same metrics are reported for every dataset the `Trainer` evaluates.

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=1)

    precision, recall, f1, support = precision_recall_fscore_support(
        labels, predictions, average=None, zero_division=0
    )

    metrics = {
        # per-class
        **{f"precision_class_{i}": precision[i] for i in range(len(precision))},
        **{f"recall_class_{i}":    recall[i]    for i in range(len(recall))},
        **{f"f1_class_{i}":        f1[i]        for i in range(len(f1))},
        **{f"support_class_{i}":   support[i]   for i in range(len(support))},
        # overall
        "precision_macro": np.mean(precision),
        "recall_macro":    np.mean(recall),
        "f1_macro":        np.mean(f1),
        "accuracy":        (predictions == labels).mean(),
    }

    return metrics
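
To show what the macro average means in practice, the following sketch recomputes per-class F1 and macro F1 by hand with NumPy. The five labels and predictions are made up for illustration.

```python
import numpy as np

# made-up labels and predictions for five samples
labels = np.array([0, 0, 0, 1, 1])
predictions = np.array([0, 0, 1, 1, 0])

f1_per_class = []
for cls in (0, 1):
    tp = np.sum((predictions == cls) & (labels == cls))
    fp = np.sum((predictions == cls) & (labels != cls))
    fn = np.sum((predictions != cls) & (labels == cls))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1_per_class.append(2 * precision * recall / denom if denom > 0 else 0.0)

# macro F1 is the unweighted mean of the per-class F1 scores
f1_macro = float(np.mean(f1_per_class))
accuracy = float((predictions == labels).mean())

print("F1 per class:", [round(float(f), 4) for f in f1_per_class])
print("macro F1:", round(f1_macro, 4))
print("accuracy:", round(accuracy, 4))
```

Since each class contributes equally to the macro average, a weak minority class lowers the score even when overall accuracy is high, which makes macro F1 a sensible criterion for selecting the best checkpoint on imbalanced data.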

## Run trainer
A `Trainer` object is instantiated with the model, datasets, training arguments, metric function, and an early stopping callback with a patience of two evaluations. Training runs inside a `try`/`except` block that prints the error and re-raises it. After completion, the trained model and tokenizer are saved to the specified output directory.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
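
The interaction between early stopping and best-model selection can be sketched in a few lines. This is a simplified stand-in for the behaviour of `EarlyStoppingCallback` with `load_best_model_at_end=True`, not the library code; the function name and the validation scores are made up.

```python
# Simplified sketch of patience-based early stopping (not the library code).
def run_early_stopping(scores, patience=2):
    best_score, best_epoch, bad_evals = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        if best_score is None or score > best_score:
            # improvement: remember this epoch and reset the counter
            best_score, best_epoch, bad_evals = score, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                # stop here; the checkpoint from best_epoch would be restored
                return epoch, best_epoch
    return len(scores), best_epoch

# made-up validation macro F1 per epoch
val_f1 = [0.70, 0.75, 0.74, 0.73, 0.80, 0.81, 0.79, 0.78]
stopped_at, best_epoch = run_early_stopping(val_f1)
print(f"stopped after epoch {stopped_at}, best epoch was {best_epoch}")
```

In this made-up run, training stops before the later, higher scores are reached, which illustrates the trade-off of a small patience value.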

In [None]:
print("Starting training...")

try:
    train_results = trainer.train()
except Exception as e:
    print("Training failed:", e)
    raise

print("Training complete. Saving model...")

trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Evaluate
The trained mBERT model is evaluated on the validation and test sets. Both overall performance metrics and detailed error analyses are computed.


## Evaluation Metrics and Confusion Rates

A function is defined to compute confusion matrix components and false positive/negative rates. The `evaluate_and_print` function performs evaluation on a given dataset, prints the macro F1 score, confusion matrix, error rates, and a detailed classification report.


In [None]:
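def compute_rates(y_true, y_pred):
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # false positive rate: neutral pairs predicted as biased
    fp_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
    # false negative rate: biased pairs predicted as neutral
    fn_rate = fn / (fn + tp) if (fn + tp) > 0 else 0
    return tn, fp, fn, tp, fp_rate, fn_rate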

def evaluate_and_print(trainer, dataset, name):
    print(f"\nEvaluating on {name} set...")

    # Evaluate with compute_metrics
    results = trainer.evaluate(eval_dataset=dataset)
    print(f"{name} F1:", round(results["eval_f1_macro"], 3))

    # Get raw predictions and labels
    output = trainer.predict(dataset)
    logits = output.predictions
    labels = output.label_ids
    preds = np.argmax(logits, axis=1)

    # Confusion matrix breakdown
    tn, fp, fn, tp, fp_rate, fn_rate = compute_rates(labels, preds)
    print(f"{name} Confusion Matrix:")
    print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
    print(f"False Positive Rate={fp_rate:.3f}, False Negative Rate={fn_rate:.3f}")

    # Classification report
    print(f"\n{name} Classification Report:\n",
          classification_report(labels, preds, zero_division=0, digits=4))

    return output  # returned so the predictions can be reused in the error analysis

# ---- Main Evaluation ----
print("Evaluating model...")

val_output  = evaluate_and_print(trainer, val_dataset,  "Validation")
test_output = evaluate_and_print(trainer, test_dataset, "Test")
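
A short worked example may help when reading the printed error rates. The counts below are made up; the snippet also shows how the two rates relate to the per-class recall values in the classification report.

```python
# made-up confusion-matrix counts for illustration
tn, fp, fn, tp = 90, 10, 5, 45

fp_rate = fp / (fp + tn)   # share of neutral pairs flagged as biased
fn_rate = fn / (fn + tp)   # share of biased pairs that were missed

# recall of each class is the complement of the corresponding error rate
recall_neutral = tn / (tn + fp)
recall_biased = tp / (tp + fn)

print(f"False Positive Rate={fp_rate:.3f}, False Negative Rate={fn_rate:.3f}")
print(f"Recall neutral={recall_neutral:.3f}, Recall biased={recall_biased:.3f}")
```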


## Detailed Error Analysis

Predictions on the test set are combined with the corresponding EN-DE sentence pairs and true labels. False positives and false negatives are identified to allow inspection of model errors.

In [None]:
test_texts_en = test_dataset.data["english"].tolist()
test_texts_de = test_dataset.data["german"].tolist()

test_labels = test_output.label_ids
test_preds  = np.argmax(test_output.predictions, axis=1)

analysis_df = pd.DataFrame({
    "text_en":    test_texts_en,
    "text_de":    test_texts_de,
    "true_label": test_labels,
    "pred_label": test_preds
})

fp_df = analysis_df[(analysis_df.true_label == 0) & (analysis_df.pred_label == 1)]
fn_df = analysis_df[(analysis_df.true_label == 1) & (analysis_df.pred_label == 0)]

In [None]:
pd.set_option('display.max_colwidth', None)

combined_df = pd.concat([
    fp_df.assign(Error='False Positive'),
    fn_df.assign(Error='False Negative')
], ignore_index=True)[['Error', 'text_en', 'text_de']]

combined_df.columns = ['Error Type', 'English Text', 'German Text']

display(combined_df)


## Plotting

A confusion matrix is visualized with both counts and percentages. This provides an overview of the model's performance on the test set, highlighting misclassifications.

In [None]:
def plot_confusion_matrix_counts_percent(y_true, y_pred, title="Confusion Matrix"):
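    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    # row-normalize: percentages are relative to the number of true samples per class
    cm_percent = cm / cm.sum(axis=1, keepdims=True) * 100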
    labels = ['Negative (0)', 'Positive (1)']

    annot = [[f"{cm[i, j]}\n({cm_percent[i, j]:.1f}%)" for j in range(cm.shape[1])] for i in range(cm.shape[0])]
    
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=annot, fmt='', cmap='Reds',
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.title(title)
    plt.show()

plot_confusion_matrix_counts_percent(test_labels, test_preds, title="Test Set Confusion Matrix")
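
The percentages in the plot are row-normalized, so each row sums to 100% of the true samples of that class. The sketch below shows the same computation on a made-up matrix and adds one possible guard for a class with no true samples, which would otherwise cause a division by zero. The guard is an optional variation, not part of the plotting function above.

```python
import numpy as np

# made-up confusion matrix: rows are true labels, columns are predictions;
# the second row is empty to illustrate the zero-division case
cm = np.array([[12, 3],
               [0, 0]])

row_sums = cm.sum(axis=1, keepdims=True)

# divide only where the row has samples; empty rows stay at 0%
cm_percent = np.divide(
    cm,
    row_sums,
    out=np.zeros(cm.shape, dtype=float),
    where=row_sums > 0,
) * 100

print(cm_percent)
```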



# Test cases
The trained mBERT model is evaluated on handcrafted test sentences. These sentences cover various bias patterns and scenarios, allowing inspection of the model's behavior beyond the standard test set.


## Load trained model
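The model and tokenizer are loaded from the saved output directory, and the model is set to evaluation mode. Evaluation mode disables dropout so that predictions are deterministic; gradient tracking is switched off separately with `torch.no_grad()` during inference.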


In [None]:
model_path = "./model_output"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.to(device)
model.eval()

In [None]:
# shorten text helper function
def shorten(text, max_len=40):
    if len(text) <= max_len:
        return text
    return text[:max_len-3] + "..."

## Handcrafted Test Sentences

A set of EN-DE sentence pairs is defined. Each pair is labeled as:  
- 0 = neutral (no gender bias)  
- 1 = biased (translation contains gender bias)  

The test cases include neutral sentences, neutral sentences with gendered subjects, stereotypically biased sentences, gender-fair language (GFL) examples, and job posting examples.

In [None]:
test_cases = [
    # neutral
    ("Today the weather is beautiful", "Heute ist das Wetter schön", 0), 
    ("I like this guy.", "Ich mag diesen Kerl.", 0),
    ("How are you?", "Wie geht es dir?", 0),
    ("BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion.","BERT ist ein Transformer-Modell, das auf selbstüberwachte Weise auf einem großen Korpus mehrsprachiger Daten vorab trainiert wird.",0),
    ("Knowledge is love and light and vision.","Wissen ist Liebe und Licht und Vision.",0),

    # neutral with gendered subjects
    ("My mother is an engineer.", "Meine Mutter ist Ingenieurin.", 0),
    ("Is this boy a good artist?", "Ist dieser Junge ein guter Künstler?", 0),
    ("I am living with my sister, who is also my best friend", "Ich lebe mit meiner Schwester, die auch meine beste Freundin ist", 0),
    ("My father was an excellent cook.", "Mein Vater war ein ausgezeichneter Koch.", 0),
    ("The girls went hiking.", "Die Mädchen gingen wandern.", 0),


    # biased
    ("Do you like our maths teacher?", "Mögen Sie unsere Mathelehrerin?", 1),
    ("The doctor was late to work today.", "Der Arzt kam heute zu spät zur Arbeit.", 1),
    ("Tomorrow the students are leaving for a class trip.", "Morgen gehen die Studenten zu einer Klassenfahrt.", 1),
    ("This nurse does not work hard.", "Diese Krankenschwester arbeitet nicht hart.", 1),
    ("Athletes earn a lot of money.", "Sportler verdienen viel Geld.", 1),

    # test GFL
    ("Engineers are kind and hard-working", "Ingenieur*innen sind lieb und fleißig", 0),
    ("Engineers are kind and hard-working", "Ingenieure und Ingenieurinnen sind lieb und fleißig", 0),
    ("Teachers are kind and hard-working", "Lehrende sind lieb und fleißig", 0),
    ("Teachers are kind and hard-working", "Lehrer:innen sind lieb und fleißig", 0),
    ("Teachers are kind and hard-working", "Lehrerinnen und Lehrer sind lieb und fleißig", 0),
    ("Teachers are kind and hard-working", "Lehrer sind lieb und fleißig", 1),
    ("Teachers are kind and hard-working", "Lehrerinnen sind lieb und fleißig", 1),

    # job posting morgan stanley
    ("We’re seeking someone to join our team Office 365 squads to lead the design, development, and integration of Gen AI apps and integration using Microsoft Copilot Studio.","Wir suchen jemanden für unser Office 365-Team, der die Konzeption, Entwicklung und Integration von Gen AI-Apps und die Integration mithilfe von Microsoft Copilot Studio leitet.",0),
    ("The ideal candidate should have a solid technical foundation with a focus on Custom agent development and Copilot integrations, strategic thinking, excellent communication skills, and the ability to collaborate within a global team.", "Der ideale Kandidat sollte über solide technische Grundlagen mit Schwerpunkt auf der Entwicklung kundenspezifischer Agenten und Copilot-Integrationen, strategisches Denken, ausgezeichnete Kommunikationsfähigkeiten und die Fähigkeit zur Zusammenarbeit in einem globalen Team verfügen.", 1),
    ("In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities.", "Im Bereich Technologie nutzen wir Innovationen, um die Verbindungen und Fähigkeiten aufzubauen, die unser Unternehmen voranbringen, und unseren Kunden und Kollegen zu ermöglichen, Märkte neu zu definieren und die Zukunft unserer Gemeinschaften zu gestalten.",1),
    ("This is a Lead Workplace Engineering position at VP level, which is part of the job family responsible for managing and optimizing the technical environment and end-user experience across various workplace technologies, ensuring seamless operations and user satisfaction across the organization.","Dies ist eine Position als Lead Workplace Engineering auf VP-Ebene, die Teil der Jobfamilie ist, die für die Verwaltung und Optimierung der technischen Umgebung und der Endbenutzererfahrung für verschiedene Arbeitsplatztechnologien verantwortlich ist und einen reibungslosen Betrieb sowie die Zufriedenheit der Benutzer im gesamten Unternehmen sicherstellt.",1),
]


## Prepare Dataset

The handcrafted sentences are converted into a `BiasDataset` instance. This allows tokenization and structured input to the model for inference.

In [None]:
# convert list of test cases into a dataframe
# (this reuses the names test_df and test_dataset, replacing the earlier test split)
test_df = pd.DataFrame(test_cases, columns=["english", "german", "label"])

# create BiasDataset instance from dataframe
test_dataset = BiasDataset(test_df, tokenizer)

## Run Inference on Each Test Case

Each handcrafted sentence is passed through mBERT. Predictions, probabilities, and correctness indicators are collected for further analysis.

In [None]:
results = []

for i in range(len(test_dataset)):
    item = test_dataset[i]
    
    # prepare inputs for model, add batch dimension and move to device
    inputs = {key: val.unsqueeze(0).to(device) for key, val in item.items() if key != "labels"}
    
    # run model in evaluation mode without gradients
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        pred_label = torch.argmax(logits, dim=1).item()
        prob = torch.softmax(logits, dim=1)[0].cpu().numpy()
    
    # collect results
    results.append({
        "english": test_df.iloc[i]["english"],
        "german": test_df.iloc[i]["german"],
        "true_label": test_df.iloc[i]["label"],
        "predicted_label": pred_label,
        "neutral_prob": prob[0],
        "biased_prob": prob[1],
        "correct": test_df.iloc[i]["label"] == pred_label
    })
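
The step from logits to class probabilities can be illustrated without the model. The following sketch applies a numerically stable softmax to a made-up pair of logits, mirroring what `torch.softmax` and `torch.argmax` do in the loop above.

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum avoids overflow and does not change the result
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# made-up logits in the order (neutral, biased)
logits = np.array([1.0, 3.0])
probs = softmax(logits)
pred_label = int(np.argmax(probs))

print(f"neutral_prob={probs[0]:.4f}, biased_prob={probs[1]:.4f}, predicted_label={pred_label}")
```

With two classes, the predicted label depends only on which logit is larger, and the probabilities depend only on the difference between the two logits.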


## Display Results
Results are summarized in a table showing English and German texts, true and predicted labels, probabilities for each class, and correctness. Overall model accuracy on the handcrafted test cases is reported.

In [None]:
results_df = pd.DataFrame(results)

print("\nBias detection test results:")
display(results_df)

accuracy = results_df["correct"].mean()
print(f"\nModel accuracy on test cases: {accuracy:.1%}")
