# Z2 - Strategy 2: BERTeus + English → Basque Translation

## Description
This strategy reverses the flow of Strategy 1:
1. Translate the BBC News dataset from English to Basque using Google Translate
2. Train BERTeus (BERT model pretrained on Basque) with the translated data
3. Evaluate directly on the native Basque BasqueGLUE dataset

## Approach
- **Model**: ixa-ehu/berteus-base-cased
- **Training**: BBC News translated to Basque
- **Evaluation**: BasqueGLUE (native Basque)

This was the **best Z2 strategy**, with a weighted F1 of 0.3624 on the BasqueGLUE test set.

In [None]:
!pip install transformers datasets torch sentencepiece
!pip install googletrans==3.1.0a0

In [None]:
import os
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import pipeline
from googletrans import Translator
import torch
import wandb
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import numpy as np

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
MODEL_NAME = "ixa-ehu/berteus-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)
model.to(device)

tokenizer_config.json:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/450 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/422k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ixa-ehu/berteus-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50099, 768, padding_idx=3)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
bbc_dataset = load_dataset("SetFit/bbc-news")

README.md:   0%|          | 0.00/880 [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/2.87M [00:00<?, ?B/s]

test.jsonl:   0%|          | 0.00/2.28M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1225 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

## 1. Training Dataset Translation
Translate BBC News (English) to Basque using Google Translate.
This process may take several minutes.
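The translation cell sends each article to the translator in a single call. If the translation service rejects or truncates long inputs (an assumption here; the notebook does not report hitting such a limit), one option is to split each article into smaller chunks first. A minimal sketch, where `split_into_chunks`, `translate_long_text`, and the 4,000-character budget are hypothetical, and `translate_fn` stands in for any string-to-string translator:

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 4000) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text: str, translate_fn: Callable[[str], str],
                        max_chars: int = 4000) -> str:
    """Translate chunk by chunk and rejoin the pieces with single spaces."""
    return " ".join(translate_fn(chunk) for chunk in split_into_chunks(text, max_chars))


# Stand-in translator (upper-casing) so the sketch runs without network access.
print(translate_long_text("a b c d", str.upper, max_chars=3))
```

With googletrans, `translate_fn` could wrap the same call this notebook uses, for example `lambda s: translator.translate(s, src="en", dest="eu").text`. Splitting on whitespace can cut sentences in half, which may hurt translation quality; splitting on sentence boundaries would be a natural refinement.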

In [None]:
translator = Translator()

def translate_text(example):
    translated = translator.translate(example['text'], src="en", dest="eu")
    example['text'] = translated.text
    return example

translated_dataset = bbc_dataset.map(translate_text)



Map:   0%|          | 0/1225 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
df = translated_dataset['train'].to_pandas()
df.to_csv('bbc_news_train_eu.csv', index=False)

df2 = translated_dataset['test'].to_pandas()
df2.to_csv('bbc_news_test_eu.csv', index=False)

In [None]:
train_df = translated_dataset['train'].to_pandas()

train_df, eval_df = train_test_split(train_df, test_size=0.2, random_state=42)

train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

In [None]:
encoded_train_dataset = train_dataset.map(preprocess_function, batched=True, batch_size=16)
encoded_eval_dataset = eval_dataset.map(preprocess_function, batched=True, batch_size=16)

Map:   0%|          | 0/980 [00:00<?, ? examples/s]

Map:   0%|          | 0/245 [00:00<?, ? examples/s]

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    f1 = f1_score(labels, predictions, average='weighted')
    return {"f1": f1}
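`average='weighted'` computes one F1 score per class and averages them weighted by class support, so frequent classes dominate the result. A pure-Python sketch of that computation on made-up labels (the helper name `weighted_f1` and the toy data are illustrative and not part of the notebook):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class F1 by its support
    return total / len(y_true)


y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Here the per-class F1 scores are 0.5, 2/3, and 1.0 with supports 2, 3, and 1, giving (1 + 2 + 1) / 6 = 2/3.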

## 2. BERTeus Training
Train the BERTeus model with the BBC data translated to Basque.
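- Epochs: 3
- Batch size: 16
- Learning rate: not set in the `TrainingArguments` for this run, so the library default of 5e-5 applies (2e-5 is only set explicitly in the later fine-tuning stage)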
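As a sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper name. The 980 figure is the size of the training portion after the 80/20 split of the 1,225-example BBC train split.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch (the last partial batch counts) times the number of epochs.

    Assumes one device and no gradient accumulation.
    """
    return math.ceil(num_examples / batch_size) * epochs


print(total_optimizer_steps(980, 16, 3))  # 186
```

This agrees with the `global_step=186` that `trainer.train()` reports for this stage.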

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch",
    save_strategy="no",
    report_to="none",
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_eval_dataset,
    compute_metrics=compute_metrics
)



In [None]:
os.environ["WANDB_DISABLED"] = "true"
wandb.init(mode="disabled")

trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----------|
| 1 | No log | 0.1776 | 0.967506 |
| 2 | No log | 0.141834 | 0.963545 |
| 3 | No log | 0.175385 | 0.967506 |


TrainOutput(global_step=186, training_loss=0.17828350682412425, metrics={'train_runtime': 82.4033, 'train_samples_per_second': 35.678, 'train_steps_per_second': 2.257, 'total_flos': 773567338844160.0, 'train_loss': 0.17828350682412425, 'epoch': 3.0})

In [None]:
trainer.save_model("./eu_bbc_model")

## 3. Evaluation with Label Mapping (12 → 5 categories)
Evaluate on the BasqueGLUE test set by mapping its 12 topic categories to the 5 BBC News classes the model was trained on.

**Result**: weighted F1 ≈ 0.3624 (best of Z2)

# Mapping

In [None]:
label_mapping = {
    "Ekonomia": 1,         # Ekonomia -> Business
    "Gizartea": 4,         # Gizartea -> Politics
    "Iritzia": 4,          # Iritzia -> Politics
    "Kultura": 3,          # Kultura -> Entertainment
    "Euskara": 3,          # Euskara -> Entertainment
    "Ingurumena": 0,       # Ingurumena -> Tech
    "Nazioartea": 2,       # Nazioartea -> Sports
    "Historia": 3,         # Historia -> Entertainment
    "Politika": 4,         # Politika -> Politics
    "Euskal Herria": 4,    # Euskal Herria -> Politics
    "Komunikazioa": 3,     # Komunikazioa -> Entertainment
    "Zientzia": 0          # Zientzia -> Tech
}
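To see at a glance which BasqueGLUE topics collapse into each BBC class, the mapping can be inverted. A small sketch; `bbc_names` is taken from the comments in the mapping cell rather than read from the dataset, so treat it as an assumption:

```python
from collections import defaultdict

# Same mapping as in the notebook cell, repeated so the sketch is self-contained.
label_mapping = {
    "Ekonomia": 1, "Gizartea": 4, "Iritzia": 4, "Kultura": 3,
    "Euskara": 3, "Ingurumena": 0, "Nazioartea": 2, "Historia": 3,
    "Politika": 4, "Euskal Herria": 4, "Komunikazioa": 3, "Zientzia": 0,
}

# Assumed id-to-name table, copied from the comments next to the mapping.
bbc_names = {0: "Tech", 1: "Business", 2: "Sports", 3: "Entertainment", 4: "Politics"}

# Group the Basque labels by the BBC class id they are mapped to.
grouped = defaultdict(list)
for basque_label, bbc_id in label_mapping.items():
    grouped[bbc_id].append(basque_label)

for bbc_id in sorted(grouped):
    print(f"{bbc_id} {bbc_names[bbc_id]}: {', '.join(grouped[bbc_id])}")
```

The inversion makes lopsided groupings easy to spot: four topics map to Politics and four to Entertainment, two to Tech, while Business and Sports each receive a single topic.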

In [None]:
data_url = "https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl"
data_path = "test.jsonl"
!wget {data_url} -O {data_path}

--2024-12-26 16:32:19--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 585079 (571K) [text/plain]
Saving to: ‘test.jsonl’


2024-12-26 16:32:19 (17.1 MB/s) - ‘test.jsonl’ saved [585079/585079]



In [None]:
data = pd.read_json(data_path, lines=True)

In [None]:
data["label"] = data["label"].map(label_mapping)
if data["label"].isna().any():
    raise ValueError("The mapping does not cover all labels in the dataset.")

In [None]:
data.to_csv("dataset_eu_test.csv", index=False)
print("Label-mapped dataset saved as 'dataset_eu_test.csv'.")

Label-mapped dataset saved as 'dataset_eu_test.csv'.


In [None]:
test_dataset = Dataset.from_pandas(data)
encoded_test_dataset = test_dataset.map(preprocess_function, batched=True, batch_size=16)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
predictions = trainer.predict(encoded_test_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

In [None]:
f1 = f1_score(true_labels, predicted_labels, average='weighted')
print(f"F1 score on the Basque test set: {f1}")

F1 score on the Basque test set: 0.3624060219733226


# Fine-Tuning: Adjustment to 12 Categories

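Replace the 5-way classification head with a 12-way head for the original BasqueGLUE categories, freeze the rest of the model, and train only the new head on the BasqueGLUE training data.

**Result**: weighted F1 ≈ 0.3206 on the BasqueGLUE test set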

In [None]:
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl -O train_eu.jsonl

--2024-12-26 16:32:35--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2727771 (2.6M) [text/plain]
Saving to: ‘train_eu.jsonl’


2024-12-26 16:32:35 (63.7 MB/s) - ‘train_eu.jsonl’ saved [2727771/2727771]



In [None]:
train_eu_df = pd.read_json("train_eu.jsonl", lines=True)

In [None]:
label_mapping_eu = {
    "Ekonomia": 0,
    "Euskal Herria": 1,
    "Euskara": 2,
    "Gizartea": 3,
    "Historia": 4,
    "Ingurumena": 5,
    "Iritzia": 6,
    "Komunikazioa": 7,
    "Kultura": 8,
    "Nazioartea": 9,
    "Politika": 10,
    "Zientzia": 11
}

train_eu_df["label"] = train_eu_df["label"].map(label_mapping_eu)
print("Unique labels after mapping:", train_eu_df["label"].unique())

print("\nFirst 5 rows after mapping:")
print(train_eu_df.head())

Unique labels after mapping: [ 3  8 10 11  2  9  1  4  5  7  0  6]

First 5 rows after mapping:
   idx  label                                               text
0    0      3  Diru-Sarrerak Bermatzeko Errenta (DSBE, gaztel...
1    1      3  Inma Ruiz de Lezana naiz, Gasteizko EMAIZE sex...
2    2      8  “Batzuetan iruditzen zait lerro hauetan aurkit...
3    3     10  Apirilaren 8aren biharamunean, hots, ETAren ar...
4    4     11  Londres, 1928ko uztailaren amaiera. Alexander ...


In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",  # Pad all sequences to the max length
        truncation=True,       # Truncate sequences longer than max_length
        max_length=512,        # Explicitly set the maximum token length
        add_special_tokens=True  # Include special tokens like [CLS] and [SEP]
    )

In [None]:
train_eu_df, eval_eu_df = train_test_split(train_eu_df, test_size=0.2, random_state=42)

train_eu_dataset = Dataset.from_pandas(train_eu_df)
eval_eu_dataset = Dataset.from_pandas(eval_eu_df)

encoded_train_eu_dataset = train_eu_dataset.map(tokenize_function, batched=True, batch_size=16)
encoded_eval_eu_dataset = eval_eu_dataset.map(tokenize_function, batched=True, batch_size=16)

Map:   0%|          | 0/6868 [00:00<?, ? examples/s]

Map:   0%|          | 0/1717 [00:00<?, ? examples/s]

In [None]:
model_path = "./eu_bbc_model"
config = AutoConfig.from_pretrained(model_path, num_labels=12)
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, ignore_mismatched_sizes=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./eu_bbc_model and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([5]) in the checkpoint and torch.Size([12]) in the model instantiated
- classifier.weight: found shape torch.Size([5, 768]) in the checkpoint and torch.Size([12, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
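# Freeze every parameter outside the classification head, so only the
# newly initialized 12-way classifier is updated during training.
for name, param in model.named_parameters():
    if 'classifier' not in name:
        param.requires_grad = False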
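With everything except the classifier frozen, very few parameters remain trainable. The count can be worked out from the shapes in the load warning (`classifier.weight` of shape `[12, 768]` and `classifier.bias` of shape `[12]`). The sketch assumes the head is that single linear layer, which is what the warning suggests; `linear_param_count` is a hypothetical helper name.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of parameters in a dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


hidden_size = 768  # encoder output width, from the shapes in the load warning

print(linear_param_count(hidden_size, 12))  # 9228 trainable parameters in the 12-way head
print(linear_param_count(hidden_size, 5))   # 3845 in the original 5-way head
```

So this stage fits a linear classifier with 9,228 parameters on top of fixed encoder features.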

In [None]:
training_args = TrainingArguments(
    output_dir="./results_finetuned",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir='./logs_finetuned',
    evaluation_strategy="epoch",
    save_strategy="no",
    report_to="none",
    fp16=True
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset= encoded_train_eu_dataset,
    eval_dataset= encoded_eval_eu_dataset,
    compute_metrics=compute_metrics
)



In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----------|
| 1 | No log | 1.967098 | 0.26027 |
| 2 | 2.169300 | 1.835995 | 0.293413 |
| 3 | 1.866000 | 1.805532 | 0.303173 |


TrainOutput(global_step=1290, training_loss=1.9724081349927325, metrics={'train_runtime': 245.1117, 'train_samples_per_second': 84.06, 'train_steps_per_second': 5.263, 'total_flos': 5421626926940160.0, 'train_loss': 1.9724081349927325, 'epoch': 3.0})

In [None]:
data_path = "test.jsonl"
data = pd.read_json(data_path, lines=True)

data["label"] = data["label"].map(label_mapping_eu)

test_dataset = Dataset.from_pandas(data)

encoded_test_dataset = test_dataset.map(tokenize_function, batched=True, batch_size=16)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
predictions = trainer.predict(encoded_test_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids
f1 = f1_score(true_labels, predicted_labels, average='weighted')
print(f"F1 score on the Basque test set: {f1}")

F1 score on the Basque test set: 0.3206504781726515
