# COMP0173: Coursework 2

The paper *HEARTS: A Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection* by Theo King, Zekun Wu et al. (2024) presents a comprehensive approach to analysing and detecting stereotypes in text [1]. The authors introduce the HEARTS framework, which integrates model explainability, carbon-efficient training, and accurate evaluation across multiple bias-sensitive datasets. Using transformer-based models such as ALBERT-V2, BERT, and DistilBERT, the paper demonstrates that stereotype detection performance varies significantly across dataset sources, underlining the need for diverse evaluation benchmarks. The paper provides publicly available datasets and code [2], allowing full reproducibility and offering a standardised methodology for future research on bias and stereotype detection in Natural Language Processing (NLP).

While the HEARTS framework evaluates stereotype detection in English, this project adapts the methodology to the Russian context. Russian stereotypes often rely on grammatical gender, morphology, and culture-specific tropes. Although Russian is not classified as a low-resource language and many high-performing NLP models are available, there is currently no publicly accessible model specifically designed to detect stereotypes in the Russian language. Existing models that detect toxicity or sentiment identify stereotypical and biased sentences only when they include specific patterns, such as insults, slurs, or identity-specific hate speech [8].

To address this gap, I introduce two fine-tuned classifiers, `AI-Forever-RuBert` [10] and `XLM-RoBERTa` [11], trained on the `RBSA` and `RBS` datasets, respectively. Understanding stereotypical patterns is essential for applications such as content moderation, ensuring the safety of Russian-language LLMs, and monitoring harmful narratives across demographic groups and underrepresented societies. Adapting the HEARTS framework to this new sociolinguistic context illustrates its transferability beyond the English-speaking context and enables a more culturally grounded approach to bias detection, thereby promoting SDG 5: Gender Equality, SDG 10: Reduced Inequalities, and SDG 16: Peace, Justice, and Strong Institutions [5].

# Instructions

All figures produced in this notebook are stored in the project’s `COMP0173_Figures` directory.
The corresponding LaTeX-formatted performance comparison tables and Jupyter notebooks are stored in `/COMP0173_PDF`.
The compiled documents are available as `COMP0173-CW2-TABLES.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-XX.pdf`.
All prompts used for data augmentation are stored in `COMP0173_Prompts`, and the manually collected stereotypes (with English translations) are provided in `COMP0173_Stereotypes`.
The datasets used for model training and evaluation are stored in `COMP0173_Data`, which contains:

- rubias.tsv — RuBias dataset [6, 7]
- ruster.csv — RuSter dataset (see Part 2 of the notebook for source websites)
- rubist.csv — RBS dataset: RuBias + RuSter augmented with LLM-generated samples (Claude Sonnet), using a zero-shot prompt with examples
- rubist_second.csv — RBSA dataset: RuBias + RuSter augmented with LLM-generated samples using a second prompt version without examples

The notebooks `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P3.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P5.pdf` are replications of `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P2.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P4.pdf`, where P2 provides the new `RBSA` dataset built with the second prompt (without examples) and P5 demonstrates the models running on GPU (the saved results are from GPU fine-tuning).

# Technical Implementation (70%)

In [1]:
# %%capture
# pip install -r requirements.txt
# pip install transformers
# pip install --upgrade transformers
# pip install --upgrade tokenizers
# pip install -U sentence-transformers
# pip install natasha
# pip install datasets
# pip install --user -U nltk
# conda install -c anaconda nltk
# pip install --upgrade openai pandas tqdm
# pip install dotenv

In [2]:
# pip install -U pip setuptools wheel
# pip install -U spacy
# python -m spacy download en_core_web_trf
# python -m spacy download en_core_web_sm
# python -m spacy download ru_core_news_lg

# # GPU
# pip install -U 'spacy[cuda12x]'
# # GPU - Train Models
# pip install -U 'spacy[cuda12x,transformers,lookups]'

In [3]:
# Import the libraries 
import random, numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
sns.set(color_codes=True)
plt.style.use('seaborn-v0_8')

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
np.random.seed(23)

warnings.filterwarnings(
    "ignore",
    message="pkg_resources is deprecated as an API"
)

In [4]:
# Import libraries (standard library first, then third-party; duplicates removed)
import difflib
import importlib.util
import json
import os
import pathlib
import re
import string
import sys
import warnings
from collections import defaultdict
from importlib import reload
from importlib.machinery import SourceFileLoader
from pathlib import Path

import pandas as pd
from IPython.display import display

In [5]:
import torch
import transformers
from transformers import AutoModelForMaskedLM, XLMWithLMHeadModel
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer
from sentence_transformers import SentenceTransformer, util
import platform
from datasets import Dataset
# import spacy 
import requests
from tqdm import tqdm
import yaml

In [6]:
sys.path.append("Exploratory Data Analysis")
sys.path.append("Model Training and Evaluation")

In [7]:
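# Check the GPU host (UCL access)
print("CUDA available:", torch.cuda.is_available())
# Query the device name only when a GPU is present; get_device_name(0) raises otherwise
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))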

# # Path
# import os
# os.chdir("/tmp/HEARTS-Text-Stereotype-Detection")
# os.getcwd()

CUDA available: True
Device: Tesla T4


## Part 4: Adapt the model architecture and training pipeline to your local context

### $\color{pink}{Question\ 1:}$ Justify architectural modifications for new context

To adapt the HEARTS framework to the Russian context, I kept the original fine-tuning pipeline while substituting ALBERT-V2, BERT, and DistilBERT with encoder models optimised for Russian text. The specific models I fine-tuned include: 

- DeepPavlov/RuBERT [9]
- AI-Forever/RuBERT [10]
- XLM-RoBERTa (multilingual) [11]
- Logistic Regression baselines using TF-IDF and spaCy embeddings

Each model was configured as a binary stereotype classifier and trained separately on the RBS and RBSA datasets using the Hugging Face AutoModelForSequenceClassification architecture, with an 80/20 train-test split. To support sustainability goals, CodeCarbon was integrated into the pipeline to monitor emissions during fine-tuning. All models finished training in under 10 minutes per dataset, with total estimated emissions of less than 2 grams of CO₂ for each run.
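As a rough check on the split described above, the following sketch reproduces the row counts that appear in the training logs. It assumes the held-out part is rounded up (the convention of scikit-learn's `train_test_split`) and that a second 80/20 split carves a validation set out of the training portion. Both points, the dataset totals (4,216 and 2,920 rows, obtained by adding the logged train and test counts) and the helper name `holdout_split` are inferred for illustration rather than taken from the project code.

```python
import math


def holdout_split(n_rows: int, test_fraction: float = 0.2) -> tuple[int, int]:
    """Hypothetical helper: return (n_kept, n_held_out), rounding the held-out part up."""
    # round() guards against floating-point noise such as 843.2000000000001
    n_held_out = math.ceil(round(n_rows * test_fraction, 9))
    return n_rows - n_held_out, n_held_out


# Totals are assumptions derived from the logged train + test counts.
dataset_totals = {"RBS": 4216, "RBSA": 2920}

for name, total in dataset_totals.items():
    n_train_full, n_test = holdout_split(total)      # 80/20 train-test split
    n_train, n_val = holdout_split(n_train_full)     # assumed 80/20 train-validation split
    print(f"{name}: train={n_train}, validation={n_val}, test={n_test}")
```

Under these assumptions the sketch yields 2,697 / 675 / 844 rows for RBS and 1,868 / 468 / 584 rows for RBSA, which agrees with the `Num examples` lines in the logs.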

![Hyperparameters](COMP0173_Figures/hyperparameters.png)

![Model Configuration](COMP0173_Figures/configuration.png)

In [8]:
# Load final version 
rubist = pd.read_csv("COMP0173_Data/rubist.csv", encoding="utf-8")
rubist_second = pd.read_csv("COMP0173_Data/rubist_second.csv", encoding="utf-8")

#### Train models

In [9]:
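import gc
import os

# Redirect the Hugging Face cache to /tmp/hf.
# Note: these variables are typically read when the Hugging Face libraries are first
# imported, so they should be set before those imports to take effect.
os.environ["HF_HOME"] = "/tmp/hf"
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf"
os.makedirs("/tmp/hf", exist_ok=True)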

In [10]:
from Logistic_Regression_Russian import (data_loader, train_model, evaluate_model)

gc.collect()
torch.cuda.empty_cache()

# Load and combine relevant datasets
train_data_rubist, test_data_rubist = data_loader(csv_file_path='COMP0173_Data/rubist.csv', labelling_criteria='stereotype', dataset_name='rubist', sample_size=1000000, num_examples=5)
train_data_rubist_second, test_data_rubist_second = data_loader(csv_file_path='COMP0173_Data/rubist_second.csv', labelling_criteria='stereotype', dataset_name='rubist_second', sample_size=1000000, num_examples=5)


# Execute full pipeline for logistic regression tfidf model
train_model(train_data_rubist, model_output_base_dir='model_output_LR_tfidf', dataset_name='rubist_trained', feature_type='tfidf', seed=42)
evaluate_model(test_data_rubist, model_output_dir='model_output_LR_tfidf/rubist_trained', result_output_base_dir='result_output_LR_tfidf', dataset_name='rubist', feature_type='tfidf', seed=42)

gc.collect()
torch.cuda.empty_cache()

train_model(train_data_rubist_second, model_output_base_dir='model_output_LR_tfidf', dataset_name='rubist_second_trained', feature_type='tfidf', seed=42)
evaluate_model(test_data_rubist_second, model_output_dir='model_output_LR_tfidf/rubist_second_trained', result_output_base_dir='result_output_LR_tfidf', dataset_name='rubist_second', feature_type='tfidf', seed=42)

gc.collect()
torch.cuda.empty_cache()

# Execute full pipeline for logistic regression embedding model
train_model(train_data_rubist, model_output_base_dir='model_output_LR_embedding', dataset_name='rubist_trained', feature_type='embedding', seed=42)
evaluate_model(test_data_rubist, model_output_dir='model_output_LR_embedding/rubist_trained', result_output_base_dir='result_output_LR_embedding', dataset_name='rubist', feature_type='embedding', seed=42)

gc.collect()
torch.cuda.empty_cache()

train_model(train_data_rubist_second, model_output_base_dir='model_output_LR_embedding', dataset_name='rubist_second_trained', feature_type='embedding', seed=42)
evaluate_model(test_data_rubist_second, model_output_dir='model_output_LR_embedding/rubist_second_trained', result_output_base_dir='result_output_LR_embedding', dataset_name='rubist_second', feature_type='embedding', seed=42)

First few examples from the training data:
     stereotype_type                                               text  \
168       profession              футболисты топ клубов следят за модой   
2883     nationality  британцы танцуют с радужными пельменями на кры...   
2919           lgbtq  бисексуалы на самом деле просто сексуально рас...   
3412      profession  программирование доступно людям с разным уровн...   
732            lgbtq        пансексуалы просто до конца не определились   

      category data_name  
168          1    rubist  
2883         0    rubist  
2919         1    rubist  
3412         0    rubist  
732          1    rubist  
First few examples from the testing data:
     stereotype_type                                               text  \
2726      profession  бедные люди питаются звездным светом и варят с...   
1066          gender  женщины водят машины распевая серенады с говор...   
2547     nationality  евреи как и представители других национальност...   
18

Computing embeddings: 100%|██████████| 3372/3372 [00:33<00:00, 100.47it/s]


Testing C=0.01, penalty=l1 => F1 Score: 0.39785905441570024
Testing C=0.01, penalty=l2 => F1 Score: 0.8377403846153846
Testing C=0.01, penalty=None => F1 Score: 0.9196396682408032
Testing C=0.1, penalty=l1 => F1 Score: 0.8833912566306933
Testing C=0.1, penalty=l2 => F1 Score: 0.894211324570273
Testing C=0.1, penalty=None => F1 Score: 0.9196396682408032
Testing C=1, penalty=l1 => F1 Score: 0.9212241604072258
Testing C=1, penalty=l2 => F1 Score: 0.9172978203631145
Testing C=1, penalty=None => F1 Score: 0.9196396682408032
Best model parameters: {'C': 1, 'penalty': 'l1'}
Model and vectorizer saved to model_output_LR_embedding/rubist_trained
Estimated total emissions: 0.0001950465552109412 kg CO2
Number of unique labels: 2


Computing embeddings: 100%|██████████| 844/844 [00:08<00:00, 100.36it/s]


Number of unique labels: 2


Computing embeddings: 100%|██████████| 2336/2336 [00:20<00:00, 112.08it/s]


Testing C=0.01, penalty=l1 => F1 Score: 0.4
Testing C=0.01, penalty=l2 => F1 Score: 0.4068829055705911
Testing C=0.01, penalty=None => F1 Score: 0.627725258253562
Testing C=0.1, penalty=l1 => F1 Score: 0.4052597071464996
Testing C=0.1, penalty=l2 => F1 Score: 0.5424300867888139
Testing C=0.1, penalty=None => F1 Score: 0.627725258253562
Testing C=1, penalty=l1 => F1 Score: 0.6112426035502958
Testing C=1, penalty=l2 => F1 Score: 0.6005917159763313
Testing C=1, penalty=None => F1 Score: 0.627725258253562
Best model parameters: {'C': 0.01, 'penalty': None}
Model and vectorizer saved to model_output_LR_embedding/rubist_second_trained
Estimated total emissions: 0.00013900726188074115 kg CO2
Number of unique labels: 2


Computing embeddings: 100%|██████████| 584/584 [00:05<00:00, 112.28it/s]


Unnamed: 0,precision,recall,f1-score,support
0,0.763547,0.796915,0.779874,389.0
1,0.55618,0.507692,0.530831,195.0
accuracy,0.700342,0.700342,0.700342,0.700342
macro avg,0.659863,0.652304,0.655353,584.0
weighted avg,0.694306,0.700342,0.696718,584.0
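The grid-search log above can be read as a simple arg-max over (C, penalty) pairs. The sketch below replays the logged F1 scores for the RBSA embedding model and picks the winner. The `pick_best` helper and the first-wins tie-break are assumptions made for illustration, not taken from `Logistic_Regression_Russian`. The three `penalty=None` rows are identical because scikit-learn's `LogisticRegression` ignores `C` when no penalty is applied, so a strict `>` comparison keeps the first of them, which matches the logged `{'C': 0.01, 'penalty': None}`.

```python
# F1 scores copied from the RBSA embedding grid-search log (rounded to 4 d.p.).
logged_scores = [
    ((0.01, "l1"), 0.4000),
    ((0.01, "l2"), 0.4069),
    ((0.01, None), 0.6277),
    ((0.1, "l1"), 0.4053),
    ((0.1, "l2"), 0.5424),
    ((0.1, None), 0.6277),
    ((1, "l1"), 0.6112),
    ((1, "l2"), 0.6006),
    ((1, None), 0.6277),
]


def pick_best(scores):
    """Hypothetical helper: return the first configuration with the highest F1."""
    best_config, best_f1 = None, float("-inf")
    for config, f1 in scores:
        if f1 > best_f1:  # strict '>' keeps the earliest configuration on ties
            best_config, best_f1 = config, f1
    return best_config, best_f1


best_config, best_f1 = pick_best(logged_scores)
print({"C": best_config[0], "penalty": best_config[1]}, best_f1)
```

On RBSA the unregularised model wins regardless of `C`, which suggests that the regularisation strengths in this grid only hurt the embedding baseline on that dataset.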


### $\color{pink}{Question\ 2:}$ Document hyperparameter tuning process - GPU

Hyperparameter tuning followed the structure of the original HEARTS pipeline but was adapted to Russian-language models and the two augmented datasets (RBS and RBSA). All experiments were run in a GPU-enabled environment to support efficient fine-tuning of transformer models. Before each run, GPU memory was cleared by calling `gc.collect()` followed by `torch.cuda.empty_cache()`.

The tuning process began by loading the two datasets (`rubist.csv` and `rubist_second.csv`) with the customised `data_loader()` function. For each model, the same training configuration was used to enable a fair comparison. Because of a disk quota limit, it was not possible to run the `XLM-RoBERTa` model on the GPU machine.
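To make the training budget concrete, the sketch below estimates how many optimizer updates a run performs. It assumes the settings passed to `train_model` in the following cells (batch size 64, 6 epochs), a single device and no gradient accumulation. `total_optimization_steps` is a hypothetical helper, not part of the HEARTS code.

```python
import math


def total_optimization_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Hypothetical helper: optimizer updates for one run without gradient accumulation."""
    # The last, partially filled batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# Training-set sizes as reported in the "Num examples" lines of the training logs.
for name, n_train in {"RBS": 2697, "RBSA": 1868}.items():
    print(name, total_optimization_steps(n_train, batch_size=64, epochs=6))
```

This gives 258 steps for RBS and 180 for RBSA, matching the `Total optimization steps` values in the logs, and 43 or 30 steps per epoch, matching the first checkpoint names (`checkpoint-43`, `checkpoint-30`).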

In [12]:
from BERT_Models_Fine_Tuning_Russian import (data_loader, train_model, evaluate_model)

gc.collect()
torch.cuda.empty_cache()

# Load and combine relevant datasets
train_data_rubist, test_data_rubist = data_loader(csv_file_path='COMP0173_Data/rubist.csv', labelling_criteria='stereotype', dataset_name='rubist', sample_size=1000000, num_examples=5)
train_data_rubist_second, test_data_rubist_second = data_loader(csv_file_path='COMP0173_Data/rubist_second.csv', labelling_criteria='stereotype', dataset_name='rubist_second', sample_size=1000000, num_examples=5)

# Execute full pipeline for the DeepPavlov RuBERT model
train_model(train_data_rubist, model_path='DeepPavlov/rubert-base-cased', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_deeppavlov_rubert', dataset_name='rubist_trained', seed=42)
evaluate_model(test_data_rubist, model_output_dir='model_output_deeppavlov_rubert/rubist_trained', result_output_base_dir='result_output_deeppavlov_rubert', dataset_name='rubist_trained', seed=42)

gc.collect()
torch.cuda.empty_cache()

train_model(train_data_rubist_second, model_path='DeepPavlov/rubert-base-cased', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_deeppavlov_rubert', dataset_name='rubist_second_trained', seed=42)
evaluate_model(test_data_rubist_second, model_output_dir='model_output_deeppavlov_rubert/rubist_second_trained', result_output_base_dir='result_output_deeppavlov_rubert', dataset_name='rubist_second_trained', seed=42)


loading configuration file config.json from cache at /home/ec2-user/.cache/huggingface/hub/models--DeepPavlov--rubert-base-cased/snapshots/4036cab694767a299f2b9e6492909664d9414229/config.json
Model config BertConfig {
  "_name_or_path": "DeepPavlov/rubert-base-cased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 2,
  "use_ca

Map:   0%|          | 0/2697 [00:00<?, ? examples/s]

Sample tokenized input from train: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [101, 12528, 23558, 29466, 12938, 21264, 57041, 869, 1516, 33165, 27585, 9210, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map:   0%|          | 0/675 [00:00<?, ? examples/s]

PyTorch: setting up devices


Sample tokenized input from validation: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [101, 12528, 23558, 29466, 12938, 21264, 57041, 869, 1516, 33165, 27585, 9210, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,697
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 258
  Number of trainable parameters = 177,854,978


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.3057,0.148108,0.958421,0.929174,0.941696,0.929174
2,0.0903,0.071734,0.967716,0.965641,0.966667,0.965641
3,0.0424,0.087904,0.970066,0.970066,0.970066,0.970066
4,0.0169,0.06984,0.982161,0.977847,0.979955,0.977847
5,0.0107,0.121155,0.974393,0.96236,0.968007,0.96236
6,0.0081,0.123942,0.974393,0.96236,0.968007,0.96236


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 675
  Batch size = 64
Saving model checkpoint to model_output_deeppavlov_rubert/rubist_trained/checkpoint-43
Configuration saved in model_output_deeppavlov_rubert/rubist_trained/checkpoint-43/config.json
Model weights saved in model_output_deeppavlov_rubert/rubist_trained/checkpoint-43/model.safetensors
tokenizer config file saved in model_output_deeppavlov_rubert/rubist_trained/checkpoint-43/tokenizer_config.json
Special tokens file saved in model_output_deeppavlov_rubert/rubist_trained/checkpoint-43/special_tokens_map.json
The following columns in the evaluation set don'

loading weights file model_output_deeppavlov_rubert/rubist_trained/model.safetensors


Estimated total emissions: 0.0012824234634039179 kg CO2
Number of unique labels: 2


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at model_output_deeppavlov_rubert/rubist_trained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Map:   0%|          | 0/844 [00:00<?, ? examples/s]

Disabling tokenizer parallelism, we're using DataLoader multithreading already


Sample tokenized input from test: {'stereotype_type': 'profession', 'text': 'бедные люди питаются звездным светом и варят суп из радуги', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 2726, 'input_ids': [101, 94818, 11894, 38663, 13904, 2010, 54564, 851, 84652, 868, 10508, 1703, 13158, 23726, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Number of unique labels: 2



Map:   0%|          | 0/1868 [00:00<?, ? examples/s]

Sample tokenized input from train: {'stereotype_type': 'nationality', 'text': 'все эстонцы очень пушистые', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 1567, 'input_ids': [101, 4752, 92660, 4402, 7805, 14741, 23939, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map:   0%|          | 0/468 [00:00<?, ? examples/s]

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Sample tokenized input from validation: {'stereotype_type': 'nationality', 'text': 'все эстонцы очень пушистые', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 1567, 'input_ids': [101, 4752, 92660, 4402, 7805, 14741, 23939, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1,868
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 180
  Number of trainable parameters = 177,854,978


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.6525,0.646596,0.319444,0.5,0.389831,0.5
2,0.6334,0.644738,0.319444,0.5,0.389831,0.5
3,0.5766,0.556053,0.685612,0.699511,0.685977,0.699511
4,0.4752,0.486084,0.769841,0.739002,0.748674,0.739002
5,0.3881,0.478719,0.767994,0.748392,0.755586,0.748392
6,0.3426,0.478887,0.774481,0.75045,0.758837,0.75045
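The first two epochs in the table above report a recall and balanced accuracy of exactly 0.5, which is the signature of a model that assigns every example to one class. The sketch below works through that case. It assumes 299 of the 468 validation examples are non-stereotypes (a count inferred from the logged precision, not stated in the logs) and that metrics for the never-predicted class are set to zero. `constant_prediction_metrics` is a hypothetical helper.

```python
def constant_prediction_metrics(n_majority: int, n_total: int) -> dict:
    """Hypothetical helper: macro-averaged metrics when every example gets the majority label."""
    # Predicted class: all of its examples are recovered, but precision equals its share.
    precision_pred = n_majority / n_total
    recall_pred = 1.0
    f1_pred = 2 * precision_pred * recall_pred / (precision_pred + recall_pred)
    # Never-predicted class: precision, recall and F1 are all taken as 0.
    return {
        "precision": (precision_pred + 0.0) / 2,
        "recall": (recall_pred + 0.0) / 2,
        "f1": (f1_pred + 0.0) / 2,
        # Balanced accuracy is the mean of per-class recalls, i.e. macro recall.
        "balanced_accuracy": (recall_pred + 0.0) / 2,
    }


metrics = constant_prediction_metrics(n_majority=299, n_total=468)
for key, value in metrics.items():
    print(f"{key}: {value:.4f}")
```

Under these assumptions the values agree with the logged row for epochs 1 and 2, so the model appears to start separating the two classes only from epoch 3 onward.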


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 468
  Batch size = 64
Saving model checkpoint to model_output_deeppavlov_rubert/rubist_second_trained/checkpoint-30
Configuration saved in model_output_deeppavlov_rubert/rubist_second_trained/checkpoint-30/config.json
Model weights saved in model_output_deeppavlov_rubert/rubist_second_trained/checkpoint-30/model.safetensors
tokenizer config file saved in model_output_deeppavlov_rubert/rubist_second_trained/checkpoint-30/tokenizer_config.json
Special tokens file saved in model_output_deeppavlov_rubert/rubist_second_trained/checkpoint-30/special_tokens_map.json
The following

loading weights file model_output_deeppavlov_rubert/rubist_second_trained/model.safetensors


Estimated total emissions: 0.001094457420633383 kg CO2
Number of unique labels: 2


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at model_output_deeppavlov_rubert/rubist_second_trained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Map:   0%|          | 0/584 [00:00<?, ? examples/s]

Sample tokenized input from test: {'stereotype_type': 'profession', 'text': 'чтобы быть музыкантом нужно иметь шкаф', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 80, 'input_ids': [101, 5247, 6345, 44670, 15411, 16038, 74989, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Unnamed: 0,precision,recall,f1-score,support
0,0.808458,0.835476,0.821745,389.0
1,0.648352,0.605128,0.625995,195.0
accuracy,0.758562,0.758562,0.758562,0.758562
macro avg,0.728405,0.720302,0.72387,584.0
weighted avg,0.754998,0.758562,0.756383,584.0
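The `macro avg` and `weighted avg` rows of the report above can be recomputed from the two per-class rows, which makes the gap between them easier to interpret. The sketch below does this for F1, using the values copied from the report. The function names are illustrative and not part of the evaluation code.

```python
# Per-class F1 and support copied from the classification report above.
per_class = {
    0: {"f1": 0.821745, "support": 389},
    1: {"f1": 0.625995, "support": 195},
}


def macro_f1(rows: dict) -> float:
    """Unweighted mean: every class counts equally, whatever its size."""
    return sum(r["f1"] for r in rows.values()) / len(rows)


def weighted_f1(rows: dict) -> float:
    """Support-weighted mean: larger classes contribute proportionally more."""
    total = sum(r["support"] for r in rows.values())
    return sum(r["f1"] * r["support"] for r in rows.values()) / total


print(f"macro F1    = {macro_f1(per_class):.4f}")
print(f"weighted F1 = {weighted_f1(per_class):.4f}")
```

Because the non-stereotype class is about twice as large and scores higher, the weighted average sits above the macro average. The macro figure is the more conservative summary of performance on the stereotype class.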


In [13]:
from BERT_Models_Fine_Tuning_Russian import (data_loader, train_model, evaluate_model)

gc.collect()
torch.cuda.empty_cache()

# Load and combine relevant datasets
train_data_rubist, test_data_rubist = data_loader(csv_file_path='COMP0173_Data/rubist.csv', labelling_criteria='stereotype', dataset_name='rubist', sample_size=1000000, num_examples=5)
train_data_rubist_second, test_data_rubist_second = data_loader(csv_file_path='COMP0173_Data/rubist_second.csv', labelling_criteria='stereotype', dataset_name='rubist_second', sample_size=1000000, num_examples=5)

# Execute full pipeline for the AI-Forever ruBert model
train_model(train_data_rubist, model_path='ai-forever/ruBert-base', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_ruberta_base', dataset_name='rubist_trained', seed=42)
evaluate_model(test_data_rubist, model_output_dir='model_output_ruberta_base/rubist_trained', result_output_base_dir='result_output_ruberta_base', dataset_name='rubist_trained', seed=42)

gc.collect()
torch.cuda.empty_cache()

train_model(train_data_rubist_second, model_path='ai-forever/ruBert-base', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_ruberta_base', dataset_name='rubist_second_trained', seed=42)
evaluate_model(test_data_rubist_second, model_output_dir='model_output_ruberta_base/rubist_second_trained', result_output_base_dir='result_output_ruberta_base', dataset_name='rubist_second_trained', seed=42)


config.json:   0%|          | 0.00/590 [00:00<?, ?B/s]

loading configuration file config.json from cache at /home/ec2-user/.cache/huggingface/hub/models--ai-forever--ruBert-base/snapshots/05f37a2ca9e333fd18f30cd0c96c68d274793c69/config.json
Model config BertConfig {
  "_name_or_path": "ai-forever/ruBert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 1

loading weights file pytorch_model.bin from cache at /home/ec2-user/.cache/huggingface/hub/models--ai-forever--ruBert-base/snapshots/05f37a2ca9e333fd18f30cd0c96c68d274793c69/pytorch_model.bin
Attempting to create safetensors variant
Attempting to convert .bin model on the fly to safetensors.


Some weights of the model checkpoint at ai-forever/ruBert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not ini

loading file vocab.txt from cache at /home/ec2-user/.cache/huggingface/hub/models--ai-forever--ruBert-base/snapshots/05f37a2ca9e333fd18f30cd0c96c68d274793c69/vocab.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
Sample tokenized input from train: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [101, 4269, 12924, 15663, 3822, 16375, 52066, 699, 110, 102821, 5897, 921, 82766, 1306, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Sample tokenized input from validation: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [101, 4269, 12924, 15663, 3822, 16375, 52066, 699, 110, 102821, 5897, 921, 82766, 1306, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,697
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 258
  Number of trainable parameters = 178,308,866
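
The step count reported in this log follows from the dataset size, batch size, and number of epochs. A minimal sketch of that arithmetic, assuming a single device, no gradient accumulation, and that the final partial batch of each epoch is kept (the helper name `total_optimization_steps` is hypothetical and not part of any library):

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Estimate optimizer steps, assuming the last partial batch is not dropped."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Values taken from the training log: 2,697 examples, batch size 64, 6 epochs.
steps = total_optimization_steps(num_examples=2697, batch_size=64, num_epochs=6)
print(steps)  # 258
```

Under the same assumptions, one epoch is 43 steps, which is consistent with the `checkpoint-43` directory that appears later in the output.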


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.2941,0.085449,0.957973,0.958975,0.958471,0.958975
2,0.0555,0.067194,0.972724,0.977802,0.97519,0.977802
3,0.0183,0.088004,0.973177,0.96677,0.969865,0.96677
4,0.0083,0.087146,0.973392,0.973392,0.973392,0.973392
5,0.007,0.097585,0.974311,0.968982,0.971571,0.968982
6,0.0048,0.096686,0.975449,0.971195,0.973274,0.971195
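
Training loss keeps falling across all six epochs, while validation loss is lowest at epoch 2, which suggests mild overfitting afterwards. A small sketch, using only the numbers from the table above, of how the best epoch could be picked by validation loss or by F1 (the helper `load_metrics` is hypothetical; whether the training script reloads the best checkpoint is not shown in the log):

```python
import csv
import io

# Per-epoch metrics copied from the table above.
METRICS_CSV = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.2941,0.085449,0.957973,0.958975,0.958471,0.958975
2,0.0555,0.067194,0.972724,0.977802,0.97519,0.977802
3,0.0183,0.088004,0.973177,0.96677,0.969865,0.96677
4,0.0083,0.087146,0.973392,0.973392,0.973392,0.973392
5,0.007,0.097585,0.974311,0.968982,0.971571,0.968982
6,0.0048,0.096686,0.975449,0.971195,0.973274,0.971195
"""


def load_metrics(text):
    """Parse the CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


rows = load_metrics(METRICS_CSV)
best_by_loss = min(rows, key=lambda row: row["Validation Loss"])
best_by_f1 = max(rows, key=lambda row: row["F1"])
print(int(best_by_loss["Epoch"]), int(best_by_f1["Epoch"]))  # 2 2
```

Both criteria point to epoch 2 for this run.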


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 675
  Batch size = 64
Saving model checkpoint to model_output_ruberta_base/rubist_trained/checkpoint-43
Configuration saved in model_output_ruberta_base/rubist_trained/checkpoint-43/config.json
Model weights saved in model_output_ruberta_base/rubist_trained/checkpoint-43/model.safetensors
tokenizer config file saved in model_output_ruberta_base/rubist_trained/checkpoint-43/tokenizer_config.json
Special tokens file saved in model_output_ruberta_base/rubist_trained/checkpoint-43/special_tokens_map.json
loading weights file model_output_ruberta_base/rubist_trained/model.safetensors


Estimated total emissions: 0.0013583508792769448 kg CO2
Number of unique labels: 2


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at model_output_ruberta_base/rubist_trained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Sample tokenized input from test: {'stereotype_type': 'profession', 'text': 'бедные люди питаются звездным светом и варят суп из радуги', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 2726, 'input_ids': [101, 42890, 1950, 47937, 10088, 815, 16465, 107, 84625, 18777, 734, 74394, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Number of unique labels: 2


Sample tokenized input from train: {'stereotype_type': 'nationality', 'text': 'все эстонцы очень пушистые', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 1567, 'input_ids': [101, 780, 34330, 1048, 1179, 94517, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


PyTorch: setting up devices


Sample tokenized input from validation: {'stereotype_type': 'nationality', 'text': 'все эстонцы очень пушистые', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 1567, 'input_ids': [101, 780, 34330, 1048, 1179, 94517, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


***** Running training *****
  Num examples = 1,868
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 180
  Number of trainable parameters = 178,308,866


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.6256,0.594298,0.729487,0.638153,0.639753,0.638153
2,0.4984,0.457805,0.750443,0.743504,0.74655,0.743504
3,0.3164,0.453952,0.777977,0.788011,0.781933,0.788011
4,0.223,0.489346,0.782447,0.792642,0.786476,0.792642
5,0.1645,0.49328,0.799342,0.795343,0.79723,0.795343
6,0.1385,0.515137,0.79869,0.797916,0.798299,0.797916


***** Running Evaluation *****
  Num examples = 468
  Batch size = 64
Saving model checkpoint to model_output_ruberta_base/rubist_second_trained/checkpoint-30
Configuration saved in model_output_ruberta_base/rubist_second_trained/checkpoint-30/config.json
Model weights saved in model_output_ruberta_base/rubist_second_trained/checkpoint-30/model.safetensors
tokenizer config file saved in model_output_ruberta_base/rubist_second_trained/checkpoint-30/tokenizer_config.json
Special tokens file saved in model_output_ruberta_base/rubist_second_trained/checkpoint-30/special_tokens_map.json
loading weights file model_output_ruberta_base/rubist_second_trained/model.safetensors


Estimated total emissions: 0.0010754149601394114 kg CO2
Number of unique labels: 2


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at model_output_ruberta_base/rubist_second_trained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Sample tokenized input from test: {'stereotype_type': 'profession', 'text': 'чтобы быть музыкантом нужно иметь шкаф', 'category': 0, 'data_name': 'rubist_second', '__index_level_0__': 80, 'input_ids': [101, 1015, 1202, 61810, 1885, 4821, 22860, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


class,precision,recall,f1-score,support
0,0.820896,0.848329,0.834387,389.0
1,0.675824,0.630769,0.65252,195.0
accuracy,0.775685,0.775685,0.775685,0.775685
macro avg,0.74836,0.739549,0.743453,584.0
weighted avg,0.772456,0.775685,0.773661,584.0
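
The summary rows of this report can be recomputed from the two per-class rows. A short sketch, assuming the standard definitions (macro average is the unweighted mean over classes, weighted average weights each class by its support, and accuracy equals the support-weighted recall); the variable and function names are illustrative only:

```python
# Per-class rows copied from the report above:
# label -> (precision, recall, f1-score, support)
per_class = {
    0: (0.820896, 0.848329, 0.834387, 389),
    1: (0.675824, 0.630769, 0.65252, 195),
}

total_support = sum(row[3] for row in per_class.values())


def macro_average(index):
    """Unweighted mean of one metric column over the classes."""
    return sum(row[index] for row in per_class.values()) / len(per_class)


def weighted_average(index):
    """Mean of one metric column, weighted by class support."""
    return sum(row[index] * row[3] for row in per_class.values()) / total_support


macro = [macro_average(i) for i in range(3)]        # precision, recall, f1
weighted = [weighted_average(i) for i in range(3)]  # precision, recall, f1
accuracy = weighted_average(1)                      # support-weighted recall

print(total_support)
print([round(value, 4) for value in macro])
print([round(value, 4) for value in weighted])
print(round(accuracy, 4))
```

The recomputed values agree with the reported rows up to rounding. The gap between class 0 and class 1 recall (about 0.85 against 0.63) is what pulls the macro F1 below the accuracy.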


In [None]:
import gc

import torch

from BERT_Models_Fine_Tuning_Russian import data_loader, train_model, evaluate_model

# Release memory left over from the previous run
gc.collect()
torch.cuda.empty_cache()

# Load the rubist and rubist_second datasets
train_data_rubist, test_data_rubist = data_loader(csv_file_path='COMP0173_Data/rubist.csv', labelling_criteria='stereotype', dataset_name='rubist', sample_size=1000000, num_examples=5)
train_data_rubist_second, test_data_rubist_second = data_loader(csv_file_path='COMP0173_Data/rubist_second.csv', labelling_criteria='stereotype', dataset_name='rubist_second', sample_size=1000000, num_examples=5)

# Execute the full pipeline for the XLM-RoBERTa model on rubist
train_model(train_data_rubist, model_path='FacebookAI/xlm-roberta-base', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_xlm_roberta_base', dataset_name='rubist_trained', seed=42)
evaluate_model(test_data_rubist, model_output_dir='model_output_xlm_roberta_base/rubist_trained', result_output_base_dir='result_output_xlm_roberta_base', dataset_name='rubist_trained', seed=42)

# Release memory before the second run
gc.collect()
torch.cuda.empty_cache()

# Repeat the pipeline for the XLM-RoBERTa model on rubist_second
train_model(train_data_rubist_second, model_path='FacebookAI/xlm-roberta-base', batch_size=64, epoch=6, learning_rate=2e-5, model_output_base_dir='model_output_xlm_roberta_base', dataset_name='rubist_second_trained', seed=42)
evaluate_model(test_data_rubist_second, model_output_dir='model_output_xlm_roberta_base/rubist_second_trained', result_output_base_dir='result_output_xlm_roberta_base', dataset_name='rubist_second_trained', seed=42)

First few examples from the training data:
     stereotype_type                                               text  \
168       profession              футболисты топ клубов следят за модой   
2883     nationality  британцы танцуют с радужными пельменями на кры...   
2919           lgbtq  бисексуалы на самом деле просто сексуально рас...   
3412      profession  программирование доступно людям с разным уровн...   
732            lgbtq        пансексуалы просто до конца не определились   

      category data_name  
168          1    rubist  
2883         0    rubist  
2919         1    rubist  
3412         0    rubist  
732          1    rubist  
First few examples from the testing data:
     stereotype_type                                               text  \
2726      profession  бедные люди питаются звездным светом и варят с...   
1066          gender  женщины водят машины распевая серенады с говор...   
2547     nationality  евреи как и представители других национальност...   
18

loading configuration file config.json from cache at /home/ec2-user/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/config.json
Model config XLMRobertaConfig {
  "_name_or_path": "FacebookAI/xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}



loading weights file model.safetensors from cache at /home/ec2-user/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/model.safetensors
Some weights of the model checkpoint at FacebookAI/xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClass

loading file sentencepiece.bpe.model from cache at /home/ec2-user/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/sentencepiece.bpe.model
loading file tokenizer.json from cache at /home/ec2-user/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /home/ec2-user/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/tokenizer_config.json
Sample tokenized input from train: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [0, 81939, 440, 14276, 4684, 92354, 103, 19816, 2791, 42678, 174783, 4401, 135, 129, 104335, 55533, 86783, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
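
In this sample the padding positions hold the id 1 rather than 0, matching `"pad_token_id": 1` in the XLMRobertaConfig, and the attention mask is 0 exactly at those positions. The sample also has no `token_type_ids` field, unlike the BERT samples. A minimal sketch of how the mask relates to the pad id (the tokenizer produces the mask itself; `attention_mask_from_ids` is a hypothetical helper for illustration, and the padding is shortened to five positions):

```python
from typing import List

XLMR_PAD_TOKEN_ID = 1  # "pad_token_id": 1 in the XLMRobertaConfig
BERT_PAD_TOKEN_ID = 0  # "pad_token_id": 0 in the BertConfig


def attention_mask_from_ids(input_ids: List[int], pad_token_id: int) -> List[int]:
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if token_id == pad_token_id else 1 for token_id in input_ids]


# Non-padding ids copied from the XLM-RoBERTa sample above (0 = <s>, 2 = </s>).
real_tokens = [0, 81939, 440, 14276, 4684, 92354, 103, 19816, 2791, 42678,
               174783, 4401, 135, 129, 104335, 55533, 86783, 2]

# Padding shortened to five positions for the example.
input_ids = real_tokens + [XLMR_PAD_TOKEN_ID] * 5
mask = attention_mask_from_ids(input_ids, XLMR_PAD_TOKEN_ID)
print(sum(mask))  # 18
```

The same sentence takes 15 tokens with the ruBert vocabulary and 18 with the XLM-RoBERTa SentencePiece vocabulary, so padded sequence lengths are not directly comparable between the two models.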


PyTorch: setting up devices


Sample tokenized input from validation: {'stereotype_type': 'gender', 'text': 'женщины принимают важные решения консультируясь с поющими чайниками', 'category': 0, 'data_name': 'rubist', '__index_level_0__': 1317, 'input_ids': [0, 81939, 440, 14276, 4684, 92354, 103, 19816, 2791, 42678, 174783, 4401, 135, 129, 104335, 55533, 86783, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the training set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,697
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 258
  Number of trainable parameters = 278,045,186


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced accuracy
1,0.5698,0.370748,0.900178,0.752212,0.779849,0.752212


The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: data_name, __index_level_0__, category, text, stereotype_type. If data_name, __index_level_0__, category, text, stereotype_type are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 675
  Batch size = 64
Saving model checkpoint to model_output_xlm_roberta_base/rubist_trained/checkpoint-43
Configuration saved in model_output_xlm_roberta_base/rubist_trained/checkpoint-43/config.json
Model weights saved in model_output_xlm_roberta_base/rubist_trained/checkpoint-43/model.safetensors
tokenizer config file saved in model_output_xlm_roberta_base/rubist_trained/checkpoint-43/tokenizer_config.json
Special tokens file saved in model_output_xlm_roberta_base/rubist_trained/checkpoint-43/special_tokens_map.json


# References 

[1] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. 2024.
HEARTS: A holistic framework for explainable, sustainable and robust text stereotype detection.
arXiv preprint arXiv:2409.11579.
https://doi.org/10.48550/arXiv.2409.11579
Available at: https://arxiv.org/abs/2409.11579
(Accessed: 4 December 2025).

[2] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. 2024.
HEARTS-Text-Stereotype-Detection (GitHub Repository).
Available at: https://github.com/holistic-ai/HEARTS-Text-Stereotype-Detection
(Accessed: 4 December 2025).

[3] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. Holistic AI. 2024.
EMGSD: Expanded Multi-Group Stereotype Dataset (HuggingFace Dataset).
Available at: https://huggingface.co/datasets/holistic-ai/EMGSD
(Accessed: 4 December 2025).

[4] University College London Technical Support Group (TSG). 2025.
GPU Access and Usage Documentation.
Available at: https://tsg.cs.ucl.ac.uk/gpus/
(Accessed: 6 December 2025).

[5] United Nations. 2025. The 2030 Agenda for Sustainable Development. 
Available at: https://sdgs.un.org/2030agenda 
(Accessed: 6 December 2025).

[6] Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, and Ekaterina Artemova. 2024.
RuBia: A Russian Language Bias Detection Dataset.
Available at: https://arxiv.org/abs/2403.17553
(Accessed: 9 December 2025).

[7] Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, and Ekaterina Artemova. 2024.
RuBia-Dataset (GitHub Repository).
Available at: https://github.com/vergrig/RuBia-Dataset
(Accessed: 9 December 2025).

[8] Sismetanin. 2020. Toxic Comments Detection in Russian (GitHub Repository).
Available at: https://github.com/sismetanin/toxic-comments-detection-in-russian
(Accessed: 9 December 2025).

[9] DeepPavlov. 2019. RuBERT-base-cased (Hugging Face Model).
Available at: https://huggingface.co/DeepPavlov/rubert-base-cased
(Accessed: 9 December 2025).

[10] AI-Forever. 2023. RuBERT-base (Hugging Face Model).
Available at: https://huggingface.co/ai-forever/ruBert-base
(Accessed: 9 December 2025).

[11] Hugging Face. 2024. XLM-RoBERTa: Model Documentation.
Available at: https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta
(Accessed: 9 December 2025).

[12] DeepPavlov. 2020. ruBERT-base-cased-sentence (Hugging Face Model).
Available at: https://huggingface.co/DeepPavlov/rubert-base-cased-sentence
(Accessed: 9 December 2025).