## RoBERTuito for Text Classification

This notebook shows how to use [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased) for text classification tasks.

First, let's install the required packages:

In [1]:
!pip install pysentimiento transformers datasets accelerate evaluate




[notice] A new release of pip is available: 23.3 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip



Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl.metadata (29 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
   ---------------------------------------- 0.0/84.1 kB ? eta -:--:--
   ---------------------------------------- 84.1/84.1 kB 4.6 MB/s eta 0:00:00
Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


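Let's load a dataset. The original version of this notebook used a Spanish sentiment analysis dataset from CardiffNLP (see the commented-out lines below); here we load a local CSV file instead and split it into train, validation and test sets, stratifying on the `Análisis General` column.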

In [16]:
#from datasets import load_dataset
from load_data import load_data
from sklearn.model_selection import train_test_split


#ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")
data = "data/BBDD_SeAcabo.csv"
df = load_data(data)

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['Análisis General'])
df_train, df_val = train_test_split(df_train, test_size=0.1, random_state=42, stratify=df_train['Análisis General'])
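The two calls above take 20% of the rows for testing and then 10% of the remainder for validation, which gives roughly a 72/8/20 split. A small sketch on synthetic data illustrates this; the toy DataFrame and its `label` column are made up for the example and stand in for the real CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real CSV: 100 rows with a 70/30 class balance.
toy = pd.DataFrame({
    "full_text": [f"tweet {i}" for i in range(100)],
    "label": [0] * 70 + [1] * 30,
})

# Same two-step split as above: 20% test, then 10% of the rest for validation.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
toy_train, toy_val = train_test_split(
    toy_train, test_size=0.1, random_state=42, stratify=toy_train["label"]
)

# Split sizes, and the share of class 1 in the test set (kept by stratification).
print(len(toy_train), len(toy_val), len(toy_test))
print(toy_test["label"].mean())
```

Stratifying keeps the class proportions of the full data in each split, which matters when one class is much rarer than the other.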



## Load models

For this task, we use `robertuito-base-uncased` (there are two other versions: `robertuito-base-cased` and `robertuito-base-deacc`).

In [17]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "pysentimiento/robertuito-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128
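`num_labels=2` means the model expects integer class ids 0 and 1. If the label column holds strings, they have to be mapped to ids first. A minimal sketch of such a mapping follows; the label names are hypothetical, since the actual values of `Análisis General` are not shown here.

```python
# Hypothetical label names; replace with the values found in 'Análisis General'.
raw_labels = ["Positivo", "Negativo", "Negativo", "Positivo"]

# Sort the unique names so the mapping is the same on every run.
label_names = sorted(set(raw_labels))
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Integer ids in the original row order.
encoded = [label2id[name] for name in raw_labels]
print(label2id)
print(encoded)
```

Both dictionaries can also be passed to `from_pretrained` as `id2label=` and `label2id=`, so that the saved model config carries readable label names.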

## Preprocessing

Before tokenizing our data, we have to apply the `preprocess_tweet` function to it.


In [23]:
from pysentimiento.preprocessing import preprocess_tweet
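#preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})
# Apply `preprocess_tweet` to the 'full_text' column of every split.
# The splits are separate DataFrames, so preprocessing only `df` would not reach them.
def clean(frame):
    return frame.assign(full_text=frame['full_text'].apply(lambda x: preprocess_tweet(x, lang="es")))
df_train, df_val, df_test = clean(df_train), clean(df_val), clean(df_test)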
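To give an idea of what this kind of preprocessing does, here is a rough, simplified sketch of tweet normalization. It is not pysentimiento's implementation: `simple_normalize` is a hypothetical helper, and the placeholder tokens and the three rules it applies are assumptions chosen for illustration.

```python
import re

def simple_normalize(text: str) -> str:
    """Toy tweet normalizer (hypothetical helper, not part of pysentimiento)."""
    text = re.sub(r"https?://\S+", "url", text)   # URLs -> placeholder token
    text = re.sub(r"@\w+", "@usuario", text)      # user handles -> placeholder token
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)  # cap repeated characters at 3
    return text

normalized = simple_normalize("@maria miraaaaaa esto https://t.co/abc")
print(normalized)
```

Collapsing handles, links and elongated words into a few stable tokens keeps the vocabulary the model sees close to what it was pre-trained on.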

## Tokenization

In [24]:
"""
tokenized_ds = ds['full_text'].map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)

"""

# Tokenize the 'full_text' column of a DataFrame.
# Truncation uses tokenizer.model_max_length (set to 128 above); padding is left
# to the data collator, so each batch is padded only to its longest sequence.
def tokenize_data(df, tokenizer):
    return tokenizer(df['full_text'].tolist(), truncation=True, padding=False)

# Tokenize the training and validation data
train_encodings = tokenize_data(df_train, tokenizer)
val_encodings = tokenize_data(df_val, tokenizer)

## Training

In [13]:
!pip install ipdb

Collecting ipdb
  Downloading ipdb-0.13.13-py3-none-any.whl.metadata (14 kB)
Downloading ipdb-0.13.13-py3-none-any.whl (12 kB)
Installing collected packages: ipdb
Successfully installed ipdb-0.13.13





In [25]:
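# Convert the encodings to Hugging Face Datasets
from datasets import Dataset

# The Trainer needs a `labels` column to compute the loss. This assumes that
# 'Análisis General' already holds integer class ids (0 or 1, matching num_labels=2).
train_dataset = Dataset.from_dict({**train_encodings, "labels": df_train['Análisis General'].tolist()})
val_dataset = Dataset.from_dict({**val_encodings, "labels": df_val['Análisis General'].tolist()})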

In [26]:
import numpy as np
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="test_trainer",
    do_eval=True,
    evaluation_strategy="epoch",
    num_train_epochs=5,
    logging_dir='./logs',  # Directory for training logs, if needed
)

# Metric function the Trainer calls at evaluation time: plain accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Use DataCollatorWithPadding to pad dynamically, batch by batch, during training
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()


Note: `trainer.train()` raises the following error when the datasets passed to the `Trainer` have no `labels` column, because the model then returns only logits and no loss:

ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.
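`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to a global maximum. A minimal pure-Python sketch of that idea follows, with made-up token ids and an assumed `pad_id` of 1 (the real collator takes the padding id from the tokenizer); `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=1):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (token ids are made up).
batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Dynamic padding only pays off if the tokenizer call leaves the sequences unpadded (`padding=False`); otherwise everything has already been padded to a common length before the collator sees it.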

In [None]:
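# Evaluate on the held-out test split, tokenized the same way as train/val.
# Assumes 'Análisis General' holds integer class ids.
test_dataset = Dataset.from_dict({**tokenize_data(df_test, tokenizer), "labels": df_test['Análisis General'].tolist()})
trainer.evaluate(test_dataset)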

{'eval_loss': 1.590761423110962,
 'eval_f1': 0.7098014929759741,
 'eval_recall': 0.7126436781609194,
 'eval_runtime': 2.1041,
 'eval_samples_per_second': 413.479,
 'eval_steps_per_second': 51.804,
 'epoch': 5.0}
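The output above reports `eval_f1` and `eval_recall`, which an accuracy-only `compute_metrics` does not produce. A sketch of a metric function that would return such values is shown below; macro averaging is an assumption, since the averaging mode behind the numbers above is not stated, and `compute_metrics_f1` is a name chosen for this example.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

def compute_metrics_f1(eval_pred):
    """Macro-averaged F1 and recall (the averaging mode is an assumption)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "f1": f1_score(labels, predictions, average="macro"),
        "recall": recall_score(labels, predictions, average="macro"),
    }

# Toy logits and labels to exercise the function outside the Trainer.
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics_f1((toy_logits, toy_labels))
print(metrics)
```

When passed as `compute_metrics=` to the `Trainer`, the returned keys are reported with an `eval_` prefix, which is how `f1` becomes `eval_f1`.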