## BETO for Text Classification

This notebook shows how to fine-tune [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased), a BERT model pre-trained on Spanish text, for text classification tasks.

First, let's install the required packages:

In [1]:
!pip install pysentimiento transformers datasets accelerate evaluate



Let's load a dataset -- in this case, a Spanish sentiment analysis dataset from CardiffNLP.

In [2]:
from datasets import load_dataset

ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")

ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})

In [3]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None)}

In [4]:
ds["test"]["text"][:10]

['@user jajajaja dale, hacete la boluda vos jajaja igual a vos nunca se te puede tomar en serio te mando un abrazo desde Perú!',
 'cada vez que cito un tweet se va la ubicación sin tampoco poder ponerla en el momento a uds les pasa? TE VEO Y ME PICA VICICONTE',
 '@user MAAAAE RAJADO! Pero lo bueno es q uno se va independizando!y logrando metas',
 'Bueno hoy fui a almorzar a Nanay con otras 3 dras xq la capacitación mal organizada no nos dió almuerzo y encima nos mandan a comer 2pm',
 'Necesito seguir a mas cuentas camren shippers y fans de las armonías. Me recomendais alguna?',
 '@user ¡Hola Tomás! ¿Habéis visto los nuevos #dinos de #TierraMagna? Es normal que haya colas antes de que comience el espectáculo',
 '@user la hijueputa tela se me salió. yo quería volver a quedar acostada.',
 'Parce yo estoy igual @user',
 '@user pues no está nada mal',
 '@user quizá para profesionales no sea mucho,pero hay no remunerados principalmente femenino para quienes es un sueño, pasa en mi país']

In [5]:
ds["test"]["label"][:10]

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
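The integer labels index into the `ClassLabel` names printed above. A minimal sketch of that mapping, using a plain Python list copied from the output rather than the `datasets` API (the variable names are illustrative):

```python
# Class names copied from the ds["train"].features output above.
label_names = ["negative", "neutral", "positive"]

# First ten test labels, as printed by the previous cell.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

# Each integer id is simply an index into label_names.
readable = [label_names[i] for i in labels]
print(readable[:3])
```

Read this way, the first test tweet is labeled negative, the second neutral, and the third positive.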

## Load models

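For this task, we use the uncased BETO checkpoint, `dccuchile/bert-base-spanish-wwm-uncased` (a cased version, `dccuchile/bert-base-spanish-wwm-cased`, is also available). The classification head is created with `num_labels=3`, one output per sentiment class.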

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "dccuchile/bert-base-spanish-wwm-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128

RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
libnvJitLink.so.12: cannot open shared object file: No such file or directory

## Preprocessing

Before tokenizing the text, we run the `preprocess_tweet` function on every example in the dataset.


In [None]:
from pysentimiento.preprocessing import preprocess_tweet
preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})
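To give a feel for what tweet normalization involves, here is a hypothetical, heavily simplified stand-in. It is not the implementation of `preprocess_tweet`; the function name `simple_normalize`, the rules, and the placeholder tokens are assumptions chosen for illustration only.

```python
import re


def simple_normalize(text: str) -> str:
    """Hypothetical, simplified tweet normalization (not pysentimiento's code)."""
    # Replace links with a single placeholder token.
    text = re.sub(r"https?://\S+", "url", text)
    # Collapse every user handle to the same placeholder.
    text = re.sub(r"@\w+", "@usuario", text)
    # Cap runs of one repeated character at three occurrences.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    return text


print(simple_normalize("@user MAAAAE RAJADO! https://example.com"))
```

The idea behind rules like these is to shrink the number of distinct surface forms the tokenizer has to handle, so that handles, links, and elongated words do not fragment into many rare subword tokens.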

## Tokenization

In [None]:
tokenized_ds = preprocessed_ds.map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)
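With `padding=False`, the tokenized sequences keep their individual lengths, and padding is left to the `DataCollatorWithPadding` passed to the `Trainer` later in this notebook. The sketch below illustrates the idea of truncation followed by per-batch padding in plain Python. The helper `pad_batch`, the token ids, and `pad_id=0` are made up for illustration; the real padding id comes from `tokenizer.pad_token_id`, and real truncation also preserves special tokens.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest length."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


max_length = 128  # mirrors tokenizer.model_max_length set earlier

# Made-up token ids of different lengths; real ids come from the tokenizer.
sequences = [[4, 17, 9, 5], [4, 8, 5], list(range(200))]

# Simplified truncation: cut anything beyond max_length.
truncated = [seq[:max_length] for seq in sequences]

# Pad only the two short sequences, as if they formed one batch.
ids, mask = pad_batch(truncated[:2])
print(ids)
print(mask)
```

Padding per batch rather than to a global maximum means a batch of short tweets is only as wide as its longest member, which saves computation.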

In [None]:
tokenized_ds

## Training

In [None]:
!pip install ipdb

In [None]:
import numpy as np
import evaluate

# Macro averaging weights the three sentiment classes equally.
f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")


def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the gold labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    preds = np.argmax(logits, axis=-1)

    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="macro"))
    return results
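To make the macro-averaged metrics concrete, the following sketch recomputes them by hand on a toy example, without the `evaluate` library. The logits and labels are invented, and the helper names `argmax` and `macro_recall_f1` are illustrative.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def macro_recall_f1(preds, labels, num_classes=3):
    """Per-class recall and F1, averaged with equal weight per class."""
    recalls, f1s = [], []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return sum(recalls) / num_classes, sum(f1s) / num_classes


# Invented logits for six examples; the fifth one is misclassified.
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.0, 1.2],
    [1.1, 0.4, 0.2],
    [0.1, 0.3, 0.9],
    [0.0, 0.2, 0.8],
]
labels = [0, 1, 2, 0, 1, 2]

preds = [argmax(row) for row in logits]
recall, f1 = macro_recall_f1(preds, labels)
print(preds, round(recall, 4), round(f1, 4))
```

Because each class contributes equally to the average, the single mistake on the neutral class lowers the score more than it would under plain accuracy-style weighting.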

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

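training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="beto_test_trainer",
    do_eval=True,
    # Evaluate on the validation split after every epoch. Recent transformers
    # releases name this argument `eval_strategy`.
    evaluation_strategy="epoch",
    num_train_epochs=5,
)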

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
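As a rough sanity check on the training length, the number of optimizer steps follows from the training set size, batch size, and epoch count shown above. This back-of-the-envelope sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

train_rows = 1839   # size of the train split printed earlier
batch_size = 32     # per_device_train_batch_size
epochs = 5          # num_train_epochs

# The last batch of each epoch is smaller than 32 but still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under those assumptions the run takes 58 steps per epoch and 290 steps in total, with an evaluation pass on the validation split after each epoch.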

In [None]:
trainer.evaluate(tokenized_ds["test"])
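`trainer.evaluate` reports aggregate metrics only. To inspect an individual prediction, a row of logits (for example, one obtained via `trainer.predict`) can be turned into class probabilities with a softmax. The sketch below does this in plain Python; the logits are invented and the helper `softmax` is illustrative.

```python
import math


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Class names copied from the dataset features shown earlier.
label_names = ["negative", "neutral", "positive"]

logits = [0.2, 1.5, 0.3]  # invented logits for a single tweet
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(label_names[best], round(probs[best], 3))
```

The softmax does not change which class wins; it only rescales the logits so the winning score can be read as a confidence.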