# IberLEF 2023 Task - PoliticEs. Political ideology detection in Spanish texts

Data sources:

[https://portal.odesia.uned.es/en/dataset/politices-2023](https://portal.odesia.uned.es/en/dataset/politices-2023)

[https://codalab.lisn.upsaclay.fr/competitions/10173#learn_the_details-get_starting_kit](https://codalab.lisn.upsaclay.fr/competitions/10173#learn_the_details-get_starting_kit)

This notebook fine-tunes the model with the Hugging Face Trainer class on a CUDA GPU.

RoBERTuito is well suited to this task because it was pre-trained on 500 million Spanish tweets, so it is attuned to the informal language, slang, and structures found in the PoliticES dataset.

To check the installed torch and CUDA versions:

In [7]:
import torch

print(f"{torch.version.cuda=}\n{torch.__version__=}")


torch.version.cuda='12.8'
torch.__version__='2.9.0+cu128'


## 1. Load Dataset

In [8]:
from datasets import load_dataset, Dataset, DatasetDict

In [9]:
import pandas as pd
from datasets import Dataset, DatasetDict

# --- Configuration ---
FILE_PATH = 'data/politicES_phase_2_train_public.csv'
RANDOM_SEED = 42 # Use a fixed seed for reproducibility

# Define the splitting ratios (must sum to 1.0)
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1
TEST_RATIO = 0.1 

# --- 1. Load Data and Initial Shuffle ---
df = pd.read_csv(FILE_PATH)

# 1.a) Shuffle the DataFrame rows
# The 'sample' method with frac=1.0 shuffles all rows.
# Use the same random state for consistent shuffling.
df_shuffled = df.sample(frac=1.0, random_state=RANDOM_SEED).reset_index(drop=True)
print(f"Data loaded and shuffled. Total rows: {len(df_shuffled)}")


# 2. Convert to Hugging Face Dataset
raw_dataset = Dataset.from_pandas(df_shuffled)


# 3. Create Train, Validation, and Test Splits

# 3.a) Split the initial dataset into a temporary training set (80%) and a holdout set (20%)
# The datasets library uses 'train' and 'test' keys for the split
# The size of the test split here is 1.0 - TRAIN_RATIO (which is 0.2)
train_test_split = raw_dataset.train_test_split(
    test_size=(VAL_RATIO + TEST_RATIO), # 0.1 + 0.1 = 0.2
    seed=RANDOM_SEED
)

# Rename the splits for clarity
train_ds = train_test_split['train']
holdout_ds = train_test_split['test']

# 3.b) Split the holdout set (20%) into validation (10%) and test (10%)
# test_size is the share of the holdout that becomes the final test set (0.1 / 0.2 = 0.5)
val_test_split = holdout_ds.train_test_split(
    test_size=TEST_RATIO / (VAL_RATIO + TEST_RATIO), # 0.1 / 0.2 = 0.5
    seed=RANDOM_SEED
)

# 4. Combine into the final DatasetDict
politic_dataset = DatasetDict({
    'train': train_ds,
    'validation': val_test_split['train'], # The 'train' part of this split is the validation set
    'test': val_test_split['test']        # The 'test' part of this split is the final test set
})

# --- Verification ---
print("\n--- Final Dataset Split Sizes ---")
print(f"Train size:      {len(politic_dataset['train'])}")
print(f"Validation size: {len(politic_dataset['validation'])}")
print(f"Test size:       {len(politic_dataset['test'])}")
print(f"Total rows:      {len(politic_dataset['train']) + len(politic_dataset['validation']) + len(politic_dataset['test'])}")


# The 'politic_dataset' object is ready to be used with your tokenizer and Trainer

Data loaded and shuffled. Total rows: 180000

--- Final Dataset Split Sizes ---
Train size:      144000
Validation size: 18000
Test size:       18000
Total rows:      180000
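
The two-stage split arithmetic can be checked by hand. The sketch below is only an illustration: the helper name `two_stage_split_sizes` is hypothetical, and the use of `math.ceil` for the held-out part is an assumption about how fractional sizes are resolved (with these numbers nothing actually needs rounding).

```python
import math


def two_stage_split_sizes(n_rows, val_ratio, test_ratio):
    """Return (train, validation, test) sizes for a two-stage split.

    Hypothetical helper: it mirrors the arithmetic of the cell above,
    not the internals of the datasets library.
    """
    # Stage 1: hold out val + test together.
    holdout_fraction = val_ratio + test_ratio
    n_holdout = math.ceil(n_rows * holdout_fraction)  # rounding rule is an assumption
    n_train = n_rows - n_holdout

    # Stage 2: the test share is expressed relative to the holdout set.
    test_fraction_of_holdout = test_ratio / holdout_fraction
    n_test = math.ceil(n_holdout * test_fraction_of_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


sizes = two_stage_split_sizes(180_000, val_ratio=0.1, test_ratio=0.1)
print(sizes)
```

With 180000 rows and a 0.8 / 0.1 / 0.1 split, this reproduces the sizes printed above.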


In [10]:
politic_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass', 'tweet'],
        num_rows: 144000
    })
    validation: Dataset({
        features: ['label', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass', 'tweet'],
        num_rows: 18000
    })
    test: Dataset({
        features: ['label', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass', 'tweet'],
        num_rows: 18000
    })
})

## 2. Load Model and Tokenizer

In [11]:
# Load model directly
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from pysentimiento.preprocessing import preprocess_tweet

# Define the new RoBERTuito checkpoint
checkpoint = "pysentimiento/robertuito-base-cased"
model_name = checkpoint.split("/")[-1]

# Load Tokenizer and Model
# Note: The original authors recommend using the cased version.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length = 128 

In [12]:
from dotenv import load_dotenv
import wandb

load_dotenv(".env")  # Load environment variables WANDB_API_KEY 
wandb.init(project="transformers-fine-tuning", name=f"politicES-{model_name}-large-dataset")

[34m[1mwandb[0m: Currently logged in as: [33mlozanojm65[0m ([33mlozanojm65-home[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## 3. Define the Classification Task and Labels
The target column is **'ideology_binary'**. We need to map the labels to IDs,
where 'ideology_binary' has two classes: 'left' and 'right'.


In [13]:
# Define the label mappings 
label_to_id = {"left": 0, "right": 1}
id_to_label = {v: k for k, v in label_to_id.items()}
num_labels = len(label_to_id)

# Load the model with the correct number of labels
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=num_labels,
    id2label=id_to_label,
    label2id=label_to_id
)


The **'label'** column is the one the Trainer uses by default for the target, so we map the target column **'ideology_binary'** to **'label'**. In the PoliticES dataset the existing **'label'** column holds an identifier rather than the class, so we first rename it to **'id'**.

Moreover, because the **remove_unused_columns** option of TrainingArguments defaults to True, we don't need to remove the columns that the model's forward method does not use.
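
To make the column filtering more concrete, here is a rough sketch of the idea, not the library's code. It assumes that the filter keeps the columns whose names match arguments of the model's forward method, plus the usual label names. `toy_forward` and `columns_kept` are hypothetical names introduced only for this illustration.

```python
import inspect


def toy_forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a sequence-classification forward method."""
    return None


def columns_kept(dataset_columns, forward_fn):
    """Keep columns matching forward() arguments, plus the usual label names."""
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids"}  # assumption: label columns are always kept
    return [col for col in dataset_columns if col in accepted]


columns = ["id", "gender", "profession", "label", "ideology_multiclass",
           "tweet", "input_ids", "token_type_ids", "attention_mask"]
kept = columns_kept(columns, toy_forward)
print(kept)
```

Under these assumptions, the string columns ('id', 'gender', 'profession', 'ideology_multiclass', 'tweet') are dropped before batches reach the model, which is why we can leave them in the dataset.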

In [14]:
# dataset = politic_dataset.remove_columns(['label', 'gender', 'profession',  'ideology_multiclass'])
politic_dataset = politic_dataset.rename_column("label", "id")
politic_dataset = politic_dataset.rename_column("ideology_binary", "label")

In [15]:
politic_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'gender', 'profession', 'label', 'ideology_multiclass', 'tweet'],
        num_rows: 144000
    })
    validation: Dataset({
        features: ['id', 'gender', 'profession', 'label', 'ideology_multiclass', 'tweet'],
        num_rows: 18000
    })
    test: Dataset({
        features: ['id', 'gender', 'profession', 'label', 'ideology_multiclass', 'tweet'],
        num_rows: 18000
    })
})

To get the most out of the model, we apply the preprocess_tweet function from the pysentimiento library before tokenizing the text. This preprocessor is designed for tweet classification with transformer-based models.
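
As a rough idea of what tweet normalization involves, the toy function below replaces user handles and URLs with placeholder tokens and shortens long character runs. It is a hypothetical stand-in written for illustration: the function name, the regular expressions, and the placeholder tokens are all assumptions, and the real preprocess_tweet may behave differently.

```python
import re

# Illustrative patterns only; they are not taken from pysentimiento.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{3,}")  # a character repeated 4 or more times


def toy_normalize_tweet(text, user_token="@usuario", url_token="url"):
    """Toy tweet normalizer (hypothetical, for illustration)."""
    text = URL_RE.sub(url_token, text)      # replace links first
    text = USER_RE.sub(user_token, text)    # then user mentions
    text = REPEAT_RE.sub(r"\1\1\1", text)   # cap character runs at 3
    return text


example = "@maria_92 mira esto https://example.com/x goooool!!!!!"
normalized = toy_normalize_tweet(example)
print(normalized)
```

The point of this kind of step is that handles, links, and elongated words collapse to a small set of tokens, so the tokenizer sees fewer rare strings.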

In [16]:
# Preprocessing function
def tokenize_and_encode_labels(examples):
    # Step A: Apply tweet-specific cleaning/normalization
    preprocessed_tweets = [preprocess_tweet(text) for text in examples["tweet"]]
    
    # Step B: Tokenize the preprocessed text
    tokenized_inputs = tokenizer(
        preprocessed_tweets, 
        truncation=True
    )
    
    # Step C: Map the political ideology string label to an integer ID
    tokenized_inputs["label"] = [label_to_id[label] for label in examples["label"]]
    
    return tokenized_inputs

tokenized_dataset = politic_dataset.map(tokenize_and_encode_labels, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [17]:
tokenized_dataset["train"]

Dataset({
    features: ['id', 'gender', 'profession', 'label', 'ideology_multiclass', 'tweet', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 144000
})

In [18]:
tokenized_dataset["train"].features

{'id': Value('string'),
 'gender': Value('string'),
 'profession': Value('string'),
 'label': Value('int64'),
 'ideology_multiclass': Value('string'),
 'tweet': Value('string'),
 'input_ids': List(Value('int32')),
 'token_type_ids': List(Value('int8')),
 'attention_mask': List(Value('int8'))}

In [19]:
tokenized_dataset.set_format("torch")

## 4. Define Evaluation Metrics

* Precision measures how many of the samples predicted as positive are actually positive.
* Recall measures how many of the positive samples are captured by the positive predictions.
* The F1 score summarizes both metrics as the harmonic mean of precision and recall (TP, FP, and FN are the numbers of true positives, false positives, and false negatives):
<center>

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$

$F1 = 2 \times \large \frac{Precision \times Recall}{Precision + Recall}$
</center>
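
As a small worked example of these formulas with macro averaging, the sketch below computes each metric per class and then takes the unweighted mean over classes. The helper names and the toy labels are made up for illustration. Note that the macro F1 here is the mean of the per-class F1 scores, not the F1 of the macro precision and macro recall.

```python
def per_class_counts(y_true, y_pred, cls):
    """Count TP, FP, and FN when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    return tp, fp, fn


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels: 0 = left, 1 = right
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"{macro_p:.4f} {macro_r:.4f} {macro_f1:.4f}")
```

For class 0 the counts are TP=3, FP=2, FN=1 (precision 0.6, recall 0.75), and for class 1 they are TP=2, FP=1, FN=2 (precision 2/3, recall 0.5), so both classes contribute equally to the averages regardless of their size.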

In [20]:
import numpy as np
import evaluate

# Load the necessary metrics from the 'evaluate' library
# We can load the standard 'precision', 'recall', and 'f1' metrics once
metric_f1 = evaluate.load("f1")
metric_precision = evaluate.load("precision")
metric_recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    """
    Compute precision, recall, and F1 macro score for a Hugging Face Trainer.

    Args:
        eval_pred (EvalPrediction): A tuple (predictions, labels) provided by Trainer.

    Returns:
        dict: A dictionary with 'precision', 'recall', and 'f1-macro' metrics.
    """
    # The EvalPrediction object contains (predictions, label_ids)
    logits, labels = eval_pred 

    # 1. Convert logits to class predictions
    # Predictions are the index of the highest logit value across the class axis (-1)
    predictions = np.argmax(logits, axis=-1)

    # 2. Compute the metrics using the macro average
    
    f1_result = metric_f1.compute(predictions=predictions, references=labels, average="macro")
    precision_result = metric_precision.compute(predictions=predictions, references=labels, average="macro")
    recall_result = metric_recall.compute(predictions=predictions, references=labels, average="macro")
    
    # 3. Return the results dictionary
    return {
        "precision": precision_result["precision"],
        "recall": recall_result["recall"],
        "f1-macro": f1_result["f1"],
    }


## 5. Configure Trainer



In [21]:
from transformers import TrainingArguments, EarlyStoppingCallback


output_dir = f"./results_{model_name}"

training_args = TrainingArguments(
    output_dir=output_dir,                # Output directory
    num_train_epochs=5,                   # Total number of training epochs
    per_device_train_batch_size=32,       # Batch size per device during training
    per_device_eval_batch_size=32,        # Batch size for evaluation
    fp16=True,                            # Use mixed precision
    eval_strategy="epoch",                # Evaluate at the end of each epoch
    save_strategy="epoch",                # Save a checkpoint at the end of each epoch
    load_best_model_at_end=True,          # Load the best model found during training
    metric_for_best_model="eval_loss",    # Metric to track for best model
    disable_tqdm=False,                   # Enable tqdm progress bars
    report_to="wandb"                     # Report metrics to Weights & Biases
)

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)]
)


In [23]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1-macro |
|---|---|---|---|---|---|
| 1 | 0.5144 | 0.492431 | 0.745295 | 0.743844 | 0.744483 |
| 2 | 0.3894 | 0.505684 | 0.760346 | 0.75959 | 0.759945 |
| 3 | 0.2549 | 0.602437 | 0.759324 | 0.7578 | 0.758474 |
| 4 | 0.1427 | 0.820451 | 0.755374 | 0.75015 | 0.75196 |


TrainOutput(global_step=18000, training_loss=0.3327179616292318, metrics={'train_runtime': 2986.4978, 'train_samples_per_second': 241.085, 'train_steps_per_second': 7.534, 'total_flos': 2.519653421661312e+16, 'train_loss': 0.3327179616292318, 'epoch': 4.0})
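
Training stopped after epoch 4 even though 5 epochs were configured. The sketch below replays the validation losses from the run through a simplified patience rule to show why. It is a hand-written approximation of early stopping under the settings used here (patience 3, threshold 0.001, lower loss is better), not the library's implementation, and `simulate_early_stopping` is a hypothetical name.

```python
def simulate_early_stopping(eval_losses, patience, threshold):
    """Return (stop_epoch, best_loss); stop_epoch is None if never triggered.

    Simplified rule: an evaluation counts as an improvement only if the loss
    beats the best value so far by more than `threshold`.
    """
    best = None
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        improved = best is None or (loss < best and best - loss > threshold)
        bad_evals = 0 if improved else bad_evals + 1
        if best is None or loss < best:
            best = loss  # track the best loss seen so far
        if bad_evals >= patience:
            return epoch, best
    return None, best


# Validation losses per epoch, copied from the run above
val_losses = [0.492431, 0.505684, 0.602437, 0.820451]
stop_epoch, best_loss = simulate_early_stopping(val_losses, patience=3, threshold=0.001)
print(stop_epoch, best_loss)
```

The validation loss never improves on epoch 1, so three consecutive evaluations without improvement trigger the stop at epoch 4. Since load_best_model_at_end=True with eval_loss as the tracked metric, the model kept for the final evaluation should be the epoch 1 checkpoint, while the training loss keeps falling, which suggests overfitting after the first epoch.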

In [None]:
# 'tokenized_dataset' is the DatasetDict with 'train', 'validation', and 'test' splits,
# and 'trainer' is the Hugging Face Trainer instantiated above

# 1. Select the dedicated test split
test_dataset = tokenized_dataset['test']

print("Starting final evaluation on the dedicated test set...")

# 2. Call the evaluate method
# The Trainer will automatically use the GPU (if available) and batch processing
evaluation_results = trainer.evaluate(eval_dataset=test_dataset)

print("\n--- Final Test Set Evaluation Results ---")
# 3. Print the results
# The output will include your custom metrics ('precision', 'recall', 'f1-macro') 
# along with standard Trainer metrics ('eval_loss', 'eval_runtime', etc.)
for key, value in evaluation_results.items():
    print(f"{key}: {value:.4f}")


Starting final evaluation on the dedicated test set...



--- Final Test Set Evaluation Results ---
eval_loss: 0.4955
eval_precision: 0.7464
eval_recall: 0.7448
eval_f1-macro: 0.7454
eval_runtime: 19.1011
eval_samples_per_second: 942.3520
eval_steps_per_second: 29.4750
epoch: 4.0000


Finally, we can free the memory that was used (uncomment the lines below to run):

In [25]:
# import gc

# del trainer
# del model
# del tokenized_dataset
# gc.collect()
# torch.cuda.empty_cache()
# torch.cuda.ipc_collect()