# Z2 - Strategy 3: Multilingual BERT (mBERT)

## Description
Leverage the multilingual capabilities of mBERT:
1. Fine-tune mBERT on English data (BBC News)
2. Evaluate directly on Basque data (BasqueGLUE)

## Hypothesis
Because mBERT was pretrained on 104 languages, including Basque,
it should transfer what it learns from English to Basque without any translation step.

## Approach
- **Model**: google-bert/bert-base-multilingual-cased
- **Training**: BBC News (English)
- **Evaluation**: BasqueGLUE (Basque)

In [None]:
!pip install transformers datasets
!pip install accelerate


In [None]:
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import wandb
import os
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import numpy as np

In [None]:
model_id = "google-bert/bert-base-multilingual-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(model_id)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
dataset = load_dataset("SetFit/bbc-news")

README.md:   0%|          | 0.00/880 [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/2.87M [00:00<?, ?B/s]

test.jsonl:   0%|          | 0.00/2.28M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1225 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1225 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    f1 = f1_score(labels, predictions, average='weighted')
    return {"f1": f1}
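The metric above is a support-weighted F1, so frequent classes count for more than rare ones. The following is a minimal sketch on made-up logits (the arrays are illustrative values, not model output) showing how the weighted score can differ from the macro average:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy logits for 4 examples over 3 classes (illustrative values only).
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.2, 0.4, 0.3],
])
toy_labels = np.array([0, 1, 2, 1])

# Same reduction as compute_metrics: pick the highest-scoring class.
toy_preds = toy_logits.argmax(axis=-1)

# Weighted F1 averages per-class F1 by class support; macro F1 treats classes equally.
weighted_f1 = f1_score(toy_labels, toy_preds, average="weighted")
macro_f1 = f1_score(toy_labels, toy_preds, average="macro")

print(f"weighted={weighted_f1:.4f} macro={macro_f1:.4f}")
```

On imbalanced label sets it may be worth reporting both values, since a high weighted score can hide poor performance on small classes.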

## 1. Training on English
Fine-tune mBERT on the BBC News dataset (5 categories).

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="no",
    report_to="none",
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics
)



In [None]:
os.environ["WANDB_DISABLED"] = "true"
wandb.init(mode="disabled")

trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.178755 | 0.961951 |
| 2 | No log | 0.117172 | 0.970087 |
| 3 | No log | 0.084409 | 0.979960 |


TrainOutput(global_step=231, training_loss=0.37473680240251284, metrics={'train_runtime': 132.3485, 'train_samples_per_second': 27.768, 'train_steps_per_second': 1.745, 'total_flos': 966959173555200.0, 'train_loss': 0.37473680240251284, 'epoch': 3.0})
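The `global_step=231` value can be reproduced by hand from the split size and batch size. The sketch below assumes a single device and no gradient accumulation. It also assumes that `logging_steps` was left at the `TrainingArguments` default of 500, which would explain why the training loss column shows `No log`. The helper name is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation (assumed)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch still counts
    return steps_per_epoch * epochs

# 1225 BBC News training examples, batch size 16, 3 epochs.
english_steps = total_optimizer_steps(num_examples=1225, batch_size=16, epochs=3)
print(english_steps)  # 231

# Assumption: logging_steps was not overridden and keeps the default of 500.
LOGGING_STEPS = 500
logged_points = english_steps // LOGGING_STEPS
print(logged_points)  # 0, so no training loss is ever logged in this run
```

If per-epoch training loss is wanted, a smaller `logging_steps` value would be one way to get it.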

In [None]:
trainer.save_model("./trained_model")

In [None]:
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl -O test_eu.jsonl

--2024-12-26 17:26:32--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 585079 (571K) [text/plain]
Saving to: ‘test_eu.jsonl’


2024-12-26 17:26:32 (19.8 MB/s) - ‘test_eu.jsonl’ saved [585079/585079]



In [None]:
test_eu_df = pd.read_json("test_eu.jsonl", lines=True)

## 2. Cross-Lingual Evaluation with Label Mapping
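Evaluate the English-trained model directly on the Basque test set. Because the model predicts the 5 BBC News classes, the 12 Basque categories are first collapsed onto those 5 labels.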

**Result**: F1 ≈ 0.3161

In [None]:
# Collapse the 12 Basque categories onto the 5 BBC News label ids
# (0 = tech, 1 = business, 2 = sport, 3 = entertainment, 4 = politics).
# Several pairings are loose approximations because BBC News has no close equivalent.
label_mapping_eu = {
    "Ekonomia": 1,         # economy -> business
    "Gizartea": 4,         # society -> politics
    "Iritzia": 4,          # opinion -> politics
    "Kultura": 3,          # culture -> entertainment
    "Euskara": 3,          # Basque language -> entertainment
    "Ingurumena": 0,       # environment -> tech
    "Nazioartea": 2,       # international -> sport (weak match)
    "Historia": 3,         # history -> entertainment
    "Politika": 4,         # politics -> politics
    "Euskal Herria": 4,    # Basque Country -> politics
    "Komunikazioa": 3,     # communication -> entertainment
    "Zientzia": 0          # science -> tech
}

test_eu_df["label"] = test_eu_df["label"].map(label_mapping_eu)
print("Unique labels after mapping:", test_eu_df["label"].unique())

print("\nFirst 5 rows after mapping:")
print(test_eu_df.head())

Unique labels after mapping: [4 3 0 2 1]

First 5 rows after mapping:
   idx  label                                               text
0    0      4  Genero berdintasunaz, hezkuntzaz eta klase giz...
1    1      4  Etxauzia Gaztelua ezagutarazi zuen iraganeko l...
2    2      3  1692an, Herbehereetan, “A. Boogert” sinatzen z...
3    3      3  Ixiar Pagoaga Hernanin bizi da, Saioa Larruska...
4    4      0  Amaia Ezpeldoi nola hilko dugun barrundatzen d...
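`Series.map` with a dictionary silently returns `NaN` for any category missing from the mapping, so a quick check before tokenizing can catch gaps. A minimal sketch on toy data follows; the frame, the reduced mapping, and the `"Kirola"` category are hypothetical and only serve to trigger the check:

```python
import pandas as pd

# Reduced, hypothetical mapping and frame for illustration only.
toy_mapping = {"Ekonomia": 1, "Kultura": 3, "Zientzia": 0}
toy_df = pd.DataFrame({"label": ["Ekonomia", "Kultura", "Kirola", "Zientzia"]})

# Categories absent from the dictionary come back as NaN.
mapped = toy_df["label"].map(toy_mapping)

# Collect the original category names that failed to map.
unmapped = sorted(toy_df.loc[mapped.isna(), "label"].unique())
print("Unmapped categories:", unmapped)  # ['Kirola']
```

In the notebook output above all five target ids appear and no `nan` is printed, which suggests the real mapping covers every category in the test file.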


In [None]:
test_eu_dataset = Dataset.from_pandas(test_eu_df)
tokenized_test_eu_dataset = test_eu_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
predictions = trainer.predict(tokenized_test_eu_dataset)

In [None]:
predicted_labels = predictions.predictions.argmax(axis=-1)
true_labels = predictions.label_ids

f1 = f1_score(true_labels, predicted_labels, average='weighted')
print(f"F1 Score on Test Dataset: {f1}")

F1 Score on Test Dataset: 0.3161941918542712
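A single weighted F1 does not show which collapsed classes fail to transfer. One way to look deeper is a confusion matrix with per-class recall. The sketch below uses made-up label arrays (not the notebook's predictions), with the class order taken from the mapping comments above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Class order as used in the label mapping comments (0..4).
bbc_names = ["tech", "business", "sport", "entertainment", "politics"]

# Hypothetical true/predicted ids, for illustration only.
toy_true = np.array([4, 4, 3, 3, 0, 2, 1, 4])
toy_pred = np.array([4, 1, 3, 4, 0, 4, 1, 4])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=list(range(len(bbc_names))))

# Recall per class = correct predictions / number of true examples of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

for name, recall in zip(bbc_names, per_class_recall):
    print(f"{name:<14} recall={recall:.2f}")
```

Applied to `true_labels` and `predicted_labels`, the same two calls would show whether the errors concentrate in the loosely mapped categories.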


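## 3. Fine-Tuning with Basque Data

Reload the English-trained checkpoint with a new 12-way classification head and train it on the Basque training data. The encoder is frozen, so only the classification head is updated.

**Result**: F1 ≈ 0.2540 (worst of Z2)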

In [None]:
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl -O train_eu.jsonl

--2024-12-26 17:26:49--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2727771 (2.6M) [text/plain]
Saving to: ‘train_eu.jsonl’


2024-12-26 17:26:49 (58.5 MB/s) - ‘train_eu.jsonl’ saved [2727771/2727771]



In [None]:
train_eu_df = pd.read_json("train_eu.jsonl", lines=True)

In [None]:
label_mapping_eu = {
    "Ekonomia": 0,
    "Euskal Herria": 1,
    "Euskara": 2,
    "Gizartea": 3,
    "Historia": 4,
    "Ingurumena": 5,
    "Iritzia": 6,
    "Komunikazioa": 7,
    "Kultura": 8,
    "Nazioartea": 9,
    "Politika": 10,
    "Zientzia": 11
}

train_eu_df["label"] = train_eu_df["label"].map(label_mapping_eu)
print("Unique labels after mapping:", train_eu_df["label"].unique())

print("\nFirst 5 rows after mapping:")
print(train_eu_df.head())

Unique labels after mapping: [ 3  8 10 11  2  9  1  4  5  7  0  6]

First 5 rows after mapping:
   idx  label                                               text
0    0      3  Diru-Sarrerak Bermatzeko Errenta (DSBE, gaztel...
1    1      3  Inma Ruiz de Lezana naiz, Gasteizko EMAIZE sex...
2    2      8  “Batzuetan iruditzen zait lerro hauetan aurkit...
3    3     10  Apirilaren 8aren biharamunean, hots, ETAren ar...
4    4     11  Londres, 1928ko uztailaren amaiera. Alexander ...


In [None]:
train_eu_split_df, val_eu_split_df = train_test_split(train_eu_df, test_size=0.2, random_state=42)

train_eu_split = Dataset.from_pandas(train_eu_split_df)
val_eu_split = Dataset.from_pandas(val_eu_split_df)
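The split above is purely random. With 12 categories of uneven size, a stratified split is one possible alternative that keeps class proportions the same in both parts. A minimal sketch on a toy frame (hypothetical data, not the Basque set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an 80/20 class imbalance (hypothetical data).
toy_df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

# stratify=... keeps the label distribution identical in both splits.
train_part, val_part = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=42,
    stratify=toy_df["label"],
)

print(val_part["label"].value_counts().to_dict())
```

Whether this changes the reported scores is not something the notebook measures; it mainly makes the validation F1 less sensitive to the random seed.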

In [None]:
tokenized_val_eu = val_eu_split.map(tokenize_function, batched=True)
tokenized_train_eu = train_eu_split.map(tokenize_function, batched=True)

Map:   0%|          | 0/1717 [00:00<?, ? examples/s]

Map:   0%|          | 0/6868 [00:00<?, ? examples/s]

In [None]:
model_path = "./trained_model"
config = AutoConfig.from_pretrained(model_path, num_labels=12)
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, ignore_mismatched_sizes=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./trained_model and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([5]) in the checkpoint and torch.Size([12]) in the model instantiated
- classifier.weight: found shape torch.Size([5, 768]) in the checkpoint and torch.Size([12, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
for name, param in model.named_parameters():
    if 'classifier' not in name:
        param.requires_grad = False
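Freezing every parameter whose name lacks `classifier` leaves only the new head trainable. The size of that head follows from the shapes in the mismatch warning above. The sketch below reproduces the arithmetic in plain Python; the `bert.encoder.placeholder` entry is a hypothetical stand-in for the frozen encoder weights, not a real parameter name.

```python
from math import prod

# Head shapes come from the size-mismatch warning; the first entry is a
# hypothetical placeholder standing in for the frozen encoder parameters.
param_shapes = {
    "bert.encoder.placeholder": (768, 768),
    "classifier.weight": (12, 768),
    "classifier.bias": (12,),
}

def is_trainable(name):
    # Same rule as the freezing loop: only names containing 'classifier' stay trainable.
    return "classifier" in name

trainable_params = sum(
    prod(shape) for name, shape in param_shapes.items() if is_trainable(name)
)
print(trainable_params)  # 12 * 768 + 12 = 9228
```

With so few trainable weights, a learning rate of 2e-5 over 3 epochs may simply be too small for the head to converge, which is one plausible reading of the flat validation loss in the run that follows.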

In [None]:
training_args_finetune = TrainingArguments(
    output_dir="./results_finetune",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",
    fp16=True,
    logging_dir='./logs_finetune',
    report_to="none"
)

trainer_finetune = Trainer(
    model=model,
    args=training_args_finetune,
    train_dataset=tokenized_train_eu,
    eval_dataset=tokenized_val_eu,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [None]:
trainer_finetune.train()

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 2.128109 | 0.226487 |
| 2 | 2.277000 | 2.026958 | 0.233529 |
| 3 | 2.087400 | 2.003871 | 0.233714 |


TrainOutput(global_step=1290, training_loss=2.153413190028464, metrics={'train_runtime': 235.7717, 'train_samples_per_second': 87.39, 'train_steps_per_second': 5.471, 'total_flos': 5421626926940160.0, 'train_loss': 2.153413190028464, 'epoch': 3.0})

In [None]:
data_path = "test_eu.jsonl"
data = pd.read_json(data_path, lines=True)

data["label"] = data["label"].map(label_mapping_eu)

test_dataset = Dataset.from_pandas(data)

encoded_test_dataset = test_dataset.map(tokenize_function, batched=True, batch_size=16)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
predictions = trainer_finetune.predict(encoded_test_dataset)
predicted_labels = predictions.predictions.argmax(axis=-1)
true_labels = predictions.label_ids

f1 = f1_score(true_labels, predicted_labels, average='weighted')
print(f"F1 Score on encoded Test Dataset: {f1}")

F1 Score on encoded Test Dataset: 0.254002111422445
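To judge a low F1 like this one, it can help to compare it against a trivial baseline that always predicts the most frequent class. A minimal sketch with made-up labels (the distribution is hypothetical, not the Basque test set):

```python
from collections import Counter
from sklearn.metrics import f1_score

# Hypothetical imbalanced label list: 6 of class 0, 3 of class 1, 1 of class 2.
toy_true = [0] * 6 + [1] * 3 + [2] * 1

# Always predict the most frequent class.
majority_class = Counter(toy_true).most_common(1)[0][0]
baseline_pred = [majority_class] * len(toy_true)

# zero_division=0 silences warnings for classes that are never predicted.
baseline_f1 = f1_score(toy_true, baseline_pred, average="weighted", zero_division=0)
print(f"Majority-class weighted F1: {baseline_f1:.2f}")
```

Running the same calculation on `true_labels` would show how far the fine-tuned head sits above a model that has learned nothing beyond the class frequencies.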
