# Z2 - Strategy 1: English BERT + Basque → English Translation

## Description
This strategy consists of:
1. Translating the Basque test dataset (BasqueGLUE) to English using Google Translate
2. Training an English BERT model (`bert-base-uncased`) with the BBC News dataset
3. Evaluating the trained model on the translated Basque dataset

## Approach
- **Model**: bert-base-uncased
- **Training**: BBC News dataset (5 categories)
- **Evaluation**: BasqueGLUE translated to English

In [None]:
!pip install transformers datasets torch sentencepiece
!pip install googletrans==3.1.0a0



In [None]:
import os
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import pipeline
from googletrans import Translator
import torch
import wandb
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

## 1. Test Data Preparation
Download the Basque test dataset (BasqueGLUE BHTC) and translate it to English with Google Translate.

In [None]:
data_url = "https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl"
data_path = "test.jsonl"
!wget {data_url} -O {data_path}

--2024-12-25 23:04:52--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 585079 (571K) [text/plain]
Saving to: ‘test.jsonl’


2024-12-25 23:04:52 (17.2 MB/s) - ‘test.jsonl’ saved [585079/585079]



In [None]:
data = pd.read_json(data_path, lines=True)
data.head()

| | idx | label | text |
|---|-----|-------|------|
| 0 | 0 | Gizartea | Genero berdintasunaz, hezkuntzaz eta klase giz... |
| 1 | 1 | Iritzia | Etxauzia Gaztelua ezagutarazi zuen iraganeko l... |
| 2 | 2 | Kultura | 1692an, Herbehereetan, “A. Boogert” sinatzen z... |
| 3 | 3 | Euskara | Ixiar Pagoaga Hernanin bizi da, Saioa Larruska... |
| 4 | 4 | Ingurumena | Amaia Ezpeldoi nola hilko dugun barrundatzen d... |
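
`pd.read_json(..., lines=True)` treats each line of the file as one JSON object. As a minimal sketch of that format, the snippet below parses two made-up records with the standard library. The field names match the columns shown above; the sample texts are placeholders, not rows from BHTC.

```python
import io
import json

# Two illustrative JSONL records (hypothetical texts, same fields as BHTC).
sample = io.StringIO(
    '{"idx": 0, "label": "Gizartea", "text": "lehen testua"}\n'
    '{"idx": 1, "label": "Kultura", "text": "bigarren testua"}\n'
)

# One JSON object per non-empty line.
records = [json.loads(line) for line in sample if line.strip()]
labels = [record["label"] for record in records]
print(labels)
```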


In [None]:
translator = Translator()
translated_texts = []
# Translate each Basque text to English, one request per text.
for text in data["text"]:
    translated = translator.translate(text, src="eu", dest="en")
    translated_texts.append(translated.text)
print("Translation finished!")

Translation finished!
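
The loop above sends one request per text, so a single network or rate-limit error stops the whole run. A possible safeguard, sketched here under the assumption that a failed call raises an exception, is a small retry wrapper. `translate_with_retry` and its parameters are hypothetical names for this example, not part of `googletrans`.

```python
import time

def translate_with_retry(translate_fn, text, retries=3, delay=1.0):
    """Call translate_fn(text), retrying on any exception up to `retries` times."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return translate_fn(text)
        except Exception as err:  # e.g. a network or rate-limit error
            last_error = err
            if attempt < retries:
                time.sleep(delay * attempt)  # linear back-off before the next try
    raise RuntimeError(f"translation failed after {retries} attempts") from last_error
```

In the notebook it could wrap the existing call, for example `translate_with_retry(lambda t: translator.translate(t, src="eu", dest="en").text, text)`.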


In [None]:
translated_data = pd.DataFrame({
    "label": data["label"],
    "text": translated_texts
})
translated_data.to_csv("translated_test.csv", index=False)

print("Original labels in the translated dataset:")
print(translated_data["label"].unique())

Original labels in the translated dataset:
['Gizartea' 'Iritzia' 'Kultura' 'Euskara' 'Ingurumena' 'Nazioartea'
 'Historia' 'Ekonomia' 'Politika' 'Euskal Herria' 'Komunikazioa'
 'Zientzia']


## 2. Label Mapping (12 → 5)
BasqueGLUE has 12 categories while BBC News has 5.
We map Basque categories to the most similar English ones:

| Basque | → | English |
|--------|---|---------|
| Ekonomia | → | Business (1) |
| Gizartea, Iritzia, Politika, Euskal Herria | → | Politics (4) |
| Kultura, Euskara, Historia, Komunikazioa | → | Entertainment (3) |
| Ingurumena, Zientzia | → | Tech (0) |
| Nazioartea | → | Sports (2) |

### Option 1: Map the Basque labels to the English labels
The Basque dataset has 12 labels and the English dataset used for training has 5, so we map the 12 classes onto those 5 in the way that seems most appropriate to us.

In [None]:
label_mapping = {
    "Ekonomia": 1,         # Ekonomia -> Business
    "Gizartea": 4,         # Gizartea -> Politics
    "Iritzia": 4,          # Iritzia -> Politics
    "Kultura": 3,          # Kultura -> Entertainment
    "Euskara": 3,          # Euskara -> Entertainment
    "Ingurumena": 0,       # Ingurumena -> Tech
    "Nazioartea": 2,       # Nazioartea -> Sports
    "Historia": 3,         # Historia -> Entertainment
    "Politika": 4,         # Politika -> Politics
    "Euskal Herria": 4,    # Euskal Herria -> Politics
    "Komunikazioa": 3,     # Komunikazioa -> Entertainment
    "Zientzia": 0          # Zientzia -> Tech
}
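
To double-check the dictionary against the table, it can help to invert it and list the Basque labels that end up in each BBC class. The sketch below repeats the mapping so that it runs on its own; the `bbc_names` dictionary is an assumption taken from the ids shown in the table.

```python
from collections import defaultdict

# Same 12 -> 5 mapping as in the cell above.
label_mapping = {
    "Ekonomia": 1, "Gizartea": 4, "Iritzia": 4, "Kultura": 3,
    "Euskara": 3, "Ingurumena": 0, "Nazioartea": 2, "Historia": 3,
    "Politika": 4, "Euskal Herria": 4, "Komunikazioa": 3, "Zientzia": 0,
}
# Class names as given in the mapping table (assumed to match the BBC ids).
bbc_names = {0: "Tech", 1: "Business", 2: "Sports", 3: "Entertainment", 4: "Politics"}

# Group the Basque labels by the BBC class they are mapped to.
grouped = defaultdict(list)
for basque_label, bbc_id in label_mapping.items():
    grouped[bbc_names[bbc_id]].append(basque_label)

for name in sorted(grouped):
    print(f"{name}: {sorted(grouped[name])}")
```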


In [None]:
translated_data["label"] = translated_data["label"].map(label_mapping)
if translated_data["label"].isna().any():
    raise ValueError("The mapping does not cover every label in the translated dataset.")

In [None]:
translated_data.to_csv("translated_mapped_test.csv", index=False)
print("Translated dataset saved as 'translated_mapped_test.csv'.")

Translated dataset saved as 'translated_mapped_test.csv'.


In [None]:
bbc_dataset = load_dataset("SetFit/bbc-news")

In [None]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_bbc = bbc_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1225 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
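
With `padding="max_length"` and `truncation=True`, every example ends up with the same number of token ids: longer inputs are cut and shorter ones are padded. The toy function below illustrates that effect on plain lists. It is not the tokenizer's implementation and it ignores special tokens such as `[CLS]` and `[SEP]`; `max_length=8` and `pad_id=0` are chosen only for the example (for `bert-base-uncased` the limit is 512 positions).

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop everything past max_length
    mask = [1] * len(ids)                # 1 marks real tokens
    padding = max_length - len(ids)      # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_and_truncate([101, 7592, 102], max_length=8)
long_ids, long_mask = pad_and_truncate(list(range(1, 20)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```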

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
model.to("cuda")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    f1 = f1_score(labels, predictions, average='weighted')
    return {"f1": f1}
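
`average='weighted'` computes one F1 per class and averages them with weights equal to the number of true examples of each class. The following pure-Python sketch reproduces that calculation on a toy example; `weighted_f1` is a hypothetical helper written for illustration, not a replacement for `sklearn.metrics.f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1          # weight the class F1 by its support
    return total / len(y_true)

# Toy labels: class 0 has F1 = 0.8, class 1 has F1 = 2/3.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_f1(y_true, y_pred))
```

Here the result is (3 × 0.8 + 1 × 2/3) / 4, so the larger class dominates the average.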

## 3. Training with BBC News Dataset
Train the English BERT model on the English BBC News dataset for 6 epochs.

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=0.01,
    save_strategy="no",
    fp16=True,
    logging_dir='./logs',
    report_to="none",
    logging_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_bbc["train"],
    eval_dataset=tokenized_bbc["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [None]:
os.environ["WANDB_DISABLED"] = "true"
wandb.init(mode="disabled")
trainer.train()

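| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----|
| 1 | 0.8649 | 0.2029 | 0.975895 |
| 2 | 0.1077 | 0.076498 | 0.981122 |
| 3 | 0.0307 | 0.082283 | 0.976996 |
| 4 | 0.0199 | 0.070743 | 0.982054 |
| 5 | 0.0118 | 0.07327 | 0.982046 |
| 6 | 0.009 | 0.07135 | 0.984046 |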


TrainOutput(global_step=462, training_loss=0.17399762654717352, metrics={'train_runtime': 254.6998, 'train_samples_per_second': 28.858, 'train_steps_per_second': 1.814, 'total_flos': 1933918347110400.0, 'train_loss': 0.17399762654717352, 'epoch': 6.0})
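
The `global_step=462` in the output can be checked by hand from the training configuration. The sketch below assumes a single device and no gradient accumulation, and takes the train split size (1225) from the tokenization progress output.

```python
import math

train_examples = 1225   # size of the BBC News train split
batch_size = 16         # per_device_train_batch_size
epochs = 6              # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This gives 77 steps per epoch and 462 steps in total, matching the reported value.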

## 4. Evaluation
Evaluate the trained model on the test dataset translated from Basque to English and compute the weighted F1 score.

**Expected result**: F1 ≈ 0.3382 (Label Mapping)

In [None]:
trainer.save_model("./bbc_trained_model")

In [None]:
tokenized_translated_data = Dataset.from_pandas(translated_data).map(tokenize_function, batched=True)
raw_predictions = trainer.predict(tokenized_translated_data).predictions
predicted_labels = raw_predictions.argmax(axis=-1)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
true_labels = translated_data["label"].tolist()

In [None]:
f1_translated = f1_score(true_labels, predicted_labels, average="weighted")
print(f"F1 score on the translated dataset: {f1_translated}")

F1 score on the translated dataset: 0.3382100030877283
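
A single weighted F1 does not show which of the five classes absorb the errors. One way to look closer, sketched here with toy labels rather than the notebook's results, is to count (true, predicted) pairs. `confusion_counts` is a hypothetical helper; in the notebook it would be called with `true_labels` and `predicted_labels`.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Toy labels for illustration only.
toy_true = [4, 4, 3, 2, 0]
toy_pred = [4, 3, 3, 4, 0]
counts = confusion_counts(toy_true, toy_pred)
for (true_id, pred_id), n in sorted(counts.items()):
    print(f"true={true_id} predicted={pred_id}: {n}")
```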


In [None]:
translated_data["predicted_label"] = predicted_labels
translated_data.to_csv("resultados_clasificacion.csv", index=False)
print("Results saved to 'resultados_clasificacion.csv'.")

Results saved to 'resultados_clasificacion.csv'.


# Fine-Tuning: Adjustment with Translated Basque Data

Now we fine-tune the model further:
1. Take the model fine-tuned on BBC News
2. Replace the classification layer (5 → 12 classes)
3. Freeze all layers except the classifier
4. Fine-tune on the Basque training dataset translated to English

**Expected result**: F1 ≈ 0.2896 (worse than direct mapping)

### Option 2: Fine-Tuning

In [None]:
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl -O train_eu.jsonl

--2024-12-25 23:16:08--  https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2727771 (2.6M) [text/plain]
Saving to: ‘train_eu.jsonl’


2024-12-25 23:16:08 (47.9 MB/s) - ‘train_eu.jsonl’ saved [2727771/2727771]



In [None]:
train_eu_df = pd.read_json("train_eu.jsonl", lines=True)

In [None]:
label_mapping_eu = {
    "Ekonomia": 0,
    "Euskal Herria": 1,
    "Euskara": 2,
    "Gizartea": 3,
    "Historia": 4,
    "Ingurumena": 5,
    "Iritzia": 6,
    "Komunikazioa": 7,
    "Kultura": 8,
    "Nazioartea": 9,
    "Politika": 10,
    "Zientzia": 11
}

train_eu_df["label"] = train_eu_df["label"].map(label_mapping_eu)
print("Unique labels after mapping:", train_eu_df["label"].unique())

print("\nFirst 5 rows after mapping:")
print(train_eu_df.head())

Unique labels after mapping: [ 3  8 10 11  2  9  1  4  5  7  0  6]

First 5 rows after mapping:
   idx  label                                               text
0    0      3  Diru-Sarrerak Bermatzeko Errenta (DSBE, gaztel...
1    1      3  Inma Ruiz de Lezana naiz, Gasteizko EMAIZE sex...
2    2      8  “Batzuetan iruditzen zait lerro hauetan aurkit...
3    3     10  Apirilaren 8aren biharamunean, hots, ETAren ar...
4    4     11  Londres, 1928ko uztailaren amaiera. Alexander ...
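
The ids in `label_mapping_eu` follow the alphabetical order of the label names, so the same dictionary can be derived from the label list instead of being typed out by hand. A minimal sketch, using the 12 label names printed earlier in the notebook; `derived_mapping` and `id_to_label` are names introduced only for this example.

```python
# The 12 BHTC labels, in the order they were printed for the test set.
labels = [
    "Gizartea", "Iritzia", "Kultura", "Euskara", "Ingurumena", "Nazioartea",
    "Historia", "Ekonomia", "Politika", "Euskal Herria", "Komunikazioa", "Zientzia",
]

# Alphabetical order reproduces the ids used in label_mapping_eu.
derived_mapping = {label: i for i, label in enumerate(sorted(labels))}

# Reverse lookup, handy for turning predicted ids back into label names.
id_to_label = {i: label for label, i in derived_mapping.items()}
print(derived_mapping)
```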


In [None]:
translator = Translator()
translated_texts = []
for text in train_eu_df["text"]:
    translated = translator.translate(text, src="eu", dest="en")
    translated_texts.append(translated.text)

train_en_df = pd.DataFrame({
    "label": train_eu_df["label"],
    "text": translated_texts
})
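
The training set is several times larger than the test set, so an interrupted translation run is costly to repeat. One option, sketched below as an assumption rather than what the notebook does, is to append each translation to a JSONL cache and skip cached items on restart. `translate_all` and the cache file name are hypothetical, and `str.upper` stands in for the real translation call.

```python
import json
import os
import tempfile

def translate_all(texts, translate_fn, cache_path):
    """Translate texts in order, appending each result to a JSONL cache file."""
    done = []
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            done = [json.loads(line)["text"] for line in f if line.strip()]
    with open(cache_path, "a", encoding="utf-8") as f:
        for text in texts[len(done):]:   # skip what is already cached
            translated = translate_fn(text)
            f.write(json.dumps({"text": translated}, ensure_ascii=False) + "\n")
            f.flush()
            done.append(translated)
    return done

# Demo: the second call only translates the one new item.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.jsonl")
    first = translate_all(["bat", "bi"], str.upper, path)
    second = translate_all(["bat", "bi", "hiru"], str.upper, path)
print(first, second)
```

In the notebook the texts would be passed as a list, for example `train_eu_df["text"].tolist()`.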

In [None]:
# Now perform the train-test split on the DataFrame
train_en_split_df, val_en_split_df = train_test_split(train_en_df, test_size=0.2, random_state=42)

# Convert the split DataFrames back to Hugging Face Datasets
train_en_split = Dataset.from_pandas(train_en_split_df)
val_en_split = Dataset.from_pandas(val_en_split_df)

In [None]:
train_en_df.to_csv("train_en.csv", index=False)
print("Translations saved to 'train_en.csv'")

Translations saved to 'train_en.csv'


In [None]:
tokenized_val_en = val_en_split.map(tokenize_function, batched=True)
tokenized_train_en = train_en_split.map(tokenize_function, batched=True)

Map:   0%|          | 0/1717 [00:00<?, ? examples/s]

Map:   0%|          | 0/6868 [00:00<?, ? examples/s]

In [None]:
config = AutoConfig.from_pretrained("./bbc_trained_model", num_labels=12)

# Initialize a new model with the updated configuration
model = AutoModelForSequenceClassification.from_pretrained(
    "./bbc_trained_model", config=config, ignore_mismatched_sizes=True
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bbc_trained_model and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([5]) in the checkpoint and torch.Size([12]) in the model instantiated
- classifier.weight: found shape torch.Size([5, 768]) in the checkpoint and torch.Size([12, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
for name, param in model.named_parameters():
    if 'classifier' not in name:
        param.requires_grad = False
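
After this loop only the parameters whose names contain `classifier` remain trainable. Their count follows from the shapes in the load message (`[12, 768]` for the weight, `[12]` for the bias), as this small calculation shows.

```python
hidden_size = 768   # second dimension of classifier.weight
num_labels = 12     # number of BHTC classes

classifier_weight = num_labels * hidden_size   # entries in the weight matrix
classifier_bias = num_labels                   # one bias per class
trainable = classifier_weight + classifier_bias
print(trainable)
```

In the notebook the same number can be checked with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.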

In [None]:
training_args_finetune = TrainingArguments(
    output_dir="./results_finetune",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",
    fp16=True,
    logging_dir='./logs_finetune',
    report_to="none"
)

trainer_finetune = Trainer(
    model=model,
    args=training_args_finetune,
    train_dataset=tokenized_train_en,
    eval_dataset=tokenized_val_en,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer_finetune.train()
trainer_finetune.save_model("./finetuned_model")



| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----|
| 1 | No log | 2.051743 | 0.25994 |
| 2 | 2.213200 | 1.95753 | 0.277446 |
| 3 | 2.009800 | 1.936623 | 0.273728 |
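
One way to read these losses is to compare them with the cross-entropy of a classifier that assigns equal probability to all 12 classes, which is ln(12). The snippet below computes that reference value; it is a rough yardstick, not a baseline measured in the notebook.

```python
import math

num_classes = 12
# Cross-entropy of a uniform prediction over 12 classes.
chance_loss = math.log(num_classes)
print(round(chance_loss, 4))
```

The value is about 2.4849, so a validation loss of 1.936623 after three epochs is better than uniform guessing, though not by a wide margin.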


In [None]:
translated_data = pd.read_csv("translated_test.csv")

label_mapping_eu = {
    "Ekonomia": 0,
    "Euskal Herria": 1,
    "Euskara": 2,
    "Gizartea": 3,
    "Historia": 4,
    "Ingurumena": 5,
    "Iritzia": 6,
    "Komunikazioa": 7,
    "Kultura": 8,
    "Nazioartea": 9,
    "Politika": 10,
    "Zientzia": 11
}

translated_data["label"] = translated_data["label"].map(label_mapping_eu)

if translated_data["label"].isna().any():
    raise ValueError("The mapping does not cover every label in the translated dataset.")

In [None]:
tokenized_translated_data = Dataset.from_pandas(translated_data).map(tokenize_function, batched=True)

Map:   0%|          | 0/1854 [00:00<?, ? examples/s]

In [None]:
raw_predictions = trainer_finetune.predict(tokenized_translated_data).predictions
predicted_labels = raw_predictions.argmax(axis=-1)

In [None]:
true_labels = translated_data["label"].tolist()
print("Original labels in the translated dataset:")
print(true_labels)
# Compute the weighted F1 score
f1_translated = f1_score(true_labels, predicted_labels, average="weighted")
print(f"F1 score on the translated dataset: {f1_translated}")

Original labels in the translated dataset:
[3, 6, 8, 2, 5, 9, 4, 9, 2, 8, 3, 0, 8, 3, 0, 3, 2, 4, 9, 8, 10, 8, 8, 3, 3, 3, 5, 8, 10, 3, 3, 9, 9, 5, 2, 1, 0, 8, 8, 8, 5, 8, 3, 7, 8, 10, 9, 10, 8, 8, 4, 9, 0, 2, 9, 3, 8, 8, 3, 8, 8, 10, 9, 8, 8, 4, 5, 3, 4, 3, 5, 3, 5, 3, 9, 4, 3, 3, 3, 0, 4, 3, 5, 3, 8, 2, 5, 10, 5, 10, 10, 8, 3, 2, 8, 8, 8, 5, 8, 9, 3, 2, 9, 9, 2, 9, 8, 3, 8, 5, 5, 0, 9, 2, 8, 3, 8, 5, 4, 0, 3, 5, 3, 8, 7, 4, 8, 0, 2, 5, 3, 3, 8, 8, 8, 0, 0, 8, 5, 0, 9, 10, 8, 9, 9, 4, 9, 2, 0, 2, 8, 11, 8, 10, 10, 3, 3, 10, 0, 11, 4, 9, 4, 9, 0, 2, 5, 5, 8, 4, 2, 3, 2, 5, 2, 0, 8, 4, 8, 3, 4, 2, 8, 3, 3, 5, 3, 3, 10, 3, 3, 8, 9, 8, 9, 0, 3, 2, 4, 9, 9, 8, 8, 8, 8, 10, 9, 3, 3, 0, 3, 9, 5, 3, 4, 3, 2, 8, 9, 8, 4, 8, 0, 10, 5, 0, 9, 0, 1, 0, 9, 3, 3, 8, 10, 5, 2, 5, 10, 4, 0, 2, 0, 0, 8, 3, 8, 0, 5, 4, 2, 5, 8, 3, 3, 3, 8, 8, 3, 9, 9, 8, 3, 4, 3, 3, 9, 9, 3, 2, 3, 3, 5, 3, 5, 5, 8, 10, 10, 8, 8, 2, 7, 9, 0, 5, 8, 9, 3, 3, 3, 10, 6, 5, 9, 8, 0, 10, 2, 9, 3, 9, 0, 8, 3, 2, 5, 10, 4, 3,