## RoBERTuito for Text Classification

This notebook shows how to use [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased) for text classification tasks.

First, let's install the required packages:

In [1]:
!pip install pysentimiento transformers datasets accelerate evaluate


Let's load a dataset -- in this case, a Spanish sentiment analysis dataset from CardiffNLP.

In [2]:
from datasets import load_dataset

ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")

ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})

In [3]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None)}

In [4]:
ds["test"]["text"][:10]

['@user jajajaja dale, hacete la boluda vos jajaja igual a vos nunca se te puede tomar en serio te mando un abrazo desde Perú!',
 'cada vez que cito un tweet se va la ubicación sin tampoco poder ponerla en el momento a uds les pasa? TE VEO Y ME PICA VICICONTE',
 '@user MAAAAE RAJADO! Pero lo bueno es q uno se va independizando!y logrando metas',
 'Bueno hoy fui a almorzar a Nanay con otras 3 dras xq la capacitación mal organizada no nos dió almuerzo y encima nos mandan a comer 2pm',
 'Necesito seguir a mas cuentas camren shippers y fans de las armonías. Me recomendais alguna?',
 '@user ¡Hola Tomás! ¿Habéis visto los nuevos #dinos de #TierraMagna? Es normal que haya colas antes de que comience el espectáculo',
 '@user la hijueputa tela se me salió. yo quería volver a quedar acostada.',
 'Parce yo estoy igual @user',
 '@user pues no está nada mal',
 '@user quizá para profesionales no sea mucho,pero hay no remunerados principalmente femenino para quienes es un sueño, pasa en mi país']

In [5]:
ds["test"]["label"][:10]

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
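
The integer labels index into the `ClassLabel` names shown earlier. As a minimal sketch in plain Python, the ten labels above can be decoded as follows. The `names` list is copied from the `features` output, and `id2label` and `label2id` are illustrative variable names rather than anything defined by the notebook:

```python
from collections import Counter

# Class names in the order reported by ds["train"].features above.
names = ["negative", "neutral", "positive"]

# Illustrative lookup tables between integer ids and label names.
id2label = dict(enumerate(names))
label2id = {name: idx for idx, name in id2label.items()}

# First ten test labels, copied from the cell output above.
first_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
decoded = [id2label[i] for i in first_labels]
counts = Counter(decoded)

print(decoded)
print(counts)
```

Dictionaries of this shape can also be handed to `from_pretrained` through its `id2label` and `label2id` arguments, so that predictions carry readable names instead of `LABEL_0`, `LABEL_1`, and so on.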

## Load models

For this task, we use `robertuito-base-uncased`. Two other versions are available: `robertuito-base-cased` and `robertuito-base-deacc`.

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "pysentimiento/robertuito-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128

Some weights of the model checkpoint at pysentimiento/robertuito-base-uncased were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-base-uncased and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_pr

## Preprocessing

Before tokenizing the data, we have to apply the `preprocess_tweet` function to it.


In [7]:
from pysentimiento.preprocessing import preprocess_tweet
preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})
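
To give a feel for the kind of cleanup this step performs, here is a rough, simplified sketch in plain Python. It is not the implementation of `preprocess_tweet`: the function name `toy_normalize`, the regular expressions, and the placeholder tokens (`@usuario`, `url`) are assumptions chosen for illustration, and the real function handles more cases (for example emojis and hashtags).

```python
import re

def toy_normalize(text: str) -> str:
    """Hypothetical, simplified stand-in; NOT pysentimiento's preprocess_tweet."""
    # Replace user handles with a single placeholder token.
    text = re.sub(r"@\w+", "@usuario", text)
    # Replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "url", text)
    # Collapse characters repeated four or more times down to three.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # Shorten long laughter such as "jajajaja" to "jaja".
    text = re.sub(r"\b(?:ja){2,}j?\b", "jaja", text, flags=re.IGNORECASE)
    return text

print(toy_normalize("@user jajajaja mira https://example.com MAAAAE"))
```

Normalizing these patterns keeps the input closer to the text the model saw during pre-training, which is why the step runs before tokenization.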

## Tokenization

In [8]:
tokenized_ds = preprocessed_ds.map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)

In [9]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 870
    })
})
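
The tokenization call uses `truncation=True` with `padding=False`, so sequences are cut at `model_max_length` (128) but left at their natural length; padding is deferred to the `DataCollatorWithPadding` passed to the `Trainer`, which pads each batch only to its own longest sequence. A minimal sketch of that idea in plain Python follows. The helper names and the padding id are hypothetical, and real tokenizers also take care of special tokens when truncating:

```python
from typing import Dict, List

MAX_LENGTH = 128  # mirrors tokenizer.model_max_length set above
PAD_ID = 1        # hypothetical padding id, for illustration only

def truncate(ids: List[int], max_length: int = MAX_LENGTH) -> List[int]:
    """Keep at most max_length token ids (simplified: ignores special tokens)."""
    return ids[:max_length]

def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up sequences of different lengths.
batch = [truncate(list(range(10, 10 + n))) for n in (3, 5, 200)]
padded = pad_batch(batch[:2])
print(padded["input_ids"])
print(len(batch[2]))
```

Padding per batch rather than to a fixed length avoids spending compute on padding tokens when most tweets are short.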

## Training

In [10]:
!pip install ipdb

In [11]:
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

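def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the gold label ids
    logits, labels = eval_pred
    # Predicted class = index of the highest logit
    preds = np.argmax(logits, axis=-1)

    # Macro averaging weights each of the three classes equally
    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="macro"))
    return results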
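
To make `average="macro"` concrete, the following sketch computes macro F1 and macro recall by hand on a tiny made-up example. The function name `macro_scores` and the toy labels are illustrative only; the notebook itself relies on the `evaluate` metrics loaded above.

```python
from typing import List, Tuple

def macro_scores(preds: List[int], labels: List[int], num_classes: int) -> Tuple[float, float]:
    """Return (macro_f1, macro_recall): per-class scores averaged with equal weight."""
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        recalls.append(recall)
    return sum(f1s) / num_classes, sum(recalls) / num_classes

# Made-up gold labels and predictions for three classes.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
macro_f1, macro_recall = macro_scores(preds, labels, num_classes=3)
print(round(macro_f1, 4), round(macro_recall, 4))
```

Because every class contributes equally to the average, a model that ignores one of the three sentiment classes is penalized even if that class is rare.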

In [12]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="test_trainer",
    do_eval=True,
    evaluation_strategy="epoch",
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()



{'eval_loss': 0.671050488948822, 'eval_f1': 0.6597503202507132, 'eval_recall': 0.6790123456790123, 'eval_runtime': 0.4678, 'eval_samples_per_second': 692.612, 'eval_steps_per_second': 87.645, 'epoch': 1.0}
{'eval_loss': 0.6977064609527588, 'eval_f1': 0.7134784152105151, 'eval_recall': 0.7129629629629631, 'eval_runtime': 0.4844, 'eval_samples_per_second': 668.871, 'eval_steps_per_second': 84.641, 'epoch': 2.0}
{'eval_loss': 0.7892211675643921, 'eval_f1': 0.7138250101586121, 'eval_recall': 0.7160493827160493, 'eval_runtime': 0.4714, 'eval_samples_per_second': 687.288, 'eval_steps_per_second': 86.972, 'epoch': 3.0}
{'eval_loss': 0.8946099281311035, 'eval_f1': 0.7191358024691358, 'eval_recall': 0.7191358024691358, 'eval_runtime': 0.497, 'eval_samples_per_second': 651.944, 'eval_steps_per_second': 82.499, 'epoch': 4.0}
{'eval_loss': 0.9800084233283997, 'eval_f1': 0.7125949478890655, 'eval_recall': 0.7160493827160495, 'eval_runtime': 0.4876, 'eval_samples_per_second': 664.442, 'eval_steps_pe

TrainOutput(global_step=290, training_loss=0.34459494229020743, metrics={'train_runtime': 46.6246, 'train_samples_per_second': 197.214, 'train_steps_per_second': 6.22, 'train_loss': 0.34459494229020743, 'epoch': 5.0})
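
As a sanity check on this output, the sketch below reproduces the step count from the dataset size and batch size, and then looks at which epoch scored best on validation. It assumes a single device and no gradient accumulation; the metric values are copied from the log above and rounded to four decimals.

```python
import math

train_rows = 1839   # size of the train split reported earlier
batch_size = 32     # per_device_train_batch_size
epochs = 5          # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Validation metrics per epoch, copied (rounded) from the log above.
val_log = [
    {"epoch": 1, "eval_loss": 0.6711, "eval_f1": 0.6598},
    {"epoch": 2, "eval_loss": 0.6977, "eval_f1": 0.7135},
    {"epoch": 3, "eval_loss": 0.7892, "eval_f1": 0.7138},
    {"epoch": 4, "eval_loss": 0.8946, "eval_f1": 0.7191},
    {"epoch": 5, "eval_loss": 0.9800, "eval_f1": 0.7126},
]
best_f1 = max(val_log, key=lambda row: row["eval_f1"])
best_loss = min(val_log, key=lambda row: row["eval_loss"])
print(best_f1["epoch"], best_loss["epoch"])
```

The 290 total steps match `global_step=290`. Validation loss rises every epoch while macro F1 levels off after epoch 2, which suggests the model starts to overfit this small training set. The `Trainer` as configured here keeps the final-epoch weights; selecting a checkpoint by validation metric (for instance with `load_best_model_at_end` in `TrainingArguments`) is an option this notebook does not use.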

In [13]:
trainer.evaluate(tokenized_ds["test"])

{'eval_loss': 0.9861928820610046,
 'eval_f1': 0.7190584342944698,
 'eval_recall': 0.7218390804597701,
 'eval_runtime': 1.2913,
 'eval_samples_per_second': 673.723,
 'eval_steps_per_second': 84.409,
 'epoch': 5.0}