## RoBERTuito for Text Classification

This notebook shows how to use [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased) for text classification tasks.

First, let's install the required packages.

In [1]:
!pip install pysentimiento transformers datasets accelerate evaluate



Let's load a dataset. The commented-out lines load the Spanish sentiment analysis dataset from CardiffNLP; this run instead reads local CSV files with `text` and `label` columns.

In [2]:
from datasets import load_dataset

# ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")
# ds
data_files = {"train": "pilot-test/train.csv", "validation": "pilot-test/val.csv", "test": "pilot-test/test.csv"}
ds = load_dataset("csv", data_files=data_files)
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 333
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 42
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 37
    })
})
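
The `csv` loader takes its column names from each file's header row, which is where the `text` and `label` features above come from. As a minimal sketch of that file layout, the snippet below writes a tiny CSV with the standard library and reads it back. The temporary path is hypothetical, and the three rows are borrowed from the test examples printed in this notebook; they are not the actual `pilot-test` files.

```python
import csv
import os
import tempfile

# Three illustrative rows (texts and labels copied from the test split printed in this notebook).
rows = [
    {"text": "No seas cabro", "label": 0},
    {"text": "Te caería bien", "label": 2},
    {"text": "Que no deje resaca", "label": 2},
]

# Hypothetical location; the notebook itself reads from pilot-test/*.csv.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "train.csv")

# Write a header row followed by one line per example.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back. The csv module returns every value as a string,
# whereas load_dataset("csv", ...) inferred int64 labels in the run above.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0])
```

A file shaped like this could then be passed through `data_files`, as in the cell above.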

In [3]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [4]:
ds["test"]["text"][:10]

['No seas cabro',
 'Te caería bien',
 'Que no deje resaca',
 'Me lancé a la carrera',
 'Todos son chibolos',
 '—Creo que mejor zafo —dijo Peter',
 '-Te ha dejado plantado',
 'Ese fue el momento más jodido',
 'Unos negros asquerosos, amor',
 'A diez minutos a pata']

In [5]:
ds["test"]["label"][:10]

[0, 2, 2, 2, 1, 1, 0, 0, 0, 1]
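
With only a few hundred examples, it can help to check how the labels are distributed before training, since the macro-averaged metrics used later weight every class equally. A minimal sketch using only the ten labels printed above:

```python
from collections import Counter

# The ten test labels printed in the previous cell.
first_ten_labels = [0, 2, 2, 2, 1, 1, 0, 0, 0, 1]

# Count how often each class id occurs.
counts = Counter(first_ten_labels)
for label_id in sorted(counts):
    print(label_id, counts[label_id])
```

The same call should work on a full split, for example `Counter(ds["train"]["label"])`.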

## Load models

For this task, we use `robertuito-base-uncased`. Two other versions are available: `robertuito-base-cased` and `robertuito-base-deacc`.

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "pysentimiento/robertuito-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-base-uncased and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
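
The warning above is expected: the pretrained checkpoint has no classification head, so the `classifier.*` weights start from random values and are learned during fine-tuning. One optional refinement is to store human-readable label names in the model config. The sketch below assumes the order NEGATIVE, NEUTRAL, POSITIVE, taken from the print statement near the end of this notebook; `load_classifier` is a hypothetical helper, not part of any library.

```python
# Assumed label order (taken from the final inference cell of this notebook).
ID2LABEL = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def load_classifier(model_name="pysentimiento/robertuito-base-uncased"):
    """Hypothetical helper: load the model with label names stored in its config."""
    # Imported here so the mappings above can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )


print(LABEL2ID)
```

With the mappings in the config, `model.config.id2label[predicted_id]` can turn a predicted class id back into a name.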


## Preprocessing

Before tokenizing our data, we have to apply the `preprocess_tweet` function to it.


In [7]:
from pysentimiento.preprocessing import preprocess_tweet
preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})
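
To give a rough idea of what this kind of normalization does, here is a simplified stand-in written with regular expressions. It is not pysentimiento's implementation: the placeholder tokens `@usuario` and `url` and the limit of three repeated characters are assumptions about the library's behavior, and the real function also handles hashtags, emojis and laughter, which this sketch ignores.

```python
import re


def simplified_preprocess(text: str) -> str:
    """Illustrative stand-in for tweet normalization (NOT the library's code)."""
    # Replace user handles with a single placeholder (assumed token).
    text = re.sub(r"@\w+", "@usuario", text)
    # Replace URLs with a single placeholder (assumed token).
    text = re.sub(r"https?://\S+", "url", text)
    # Shorten runs of four or more identical characters to three.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    return text


example = "@juan mira https://t.co/abc jajaaaaaa"
print(simplified_preprocess(example))
```

For real data, keep using `preprocess_tweet` itself so that the inputs match what the model saw during pretraining.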

## Tokenization

In [8]:
tokenized_ds = preprocessed_ds.map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)


In [9]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 333
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 42
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 37
    })
})
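
Because the tokenizer was called with `padding=False`, each example keeps its own length; padding is deferred to batch time, which is the job of a collator such as `DataCollatorWithPadding`. The sketch below illustrates the idea in plain Python. The `pad_batch` helper is hypothetical, the second sequence is made up, and `pad_id=1` is an assumption (the real value is `tokenizer.pad_token_id`).

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch(
    [
        [0, 464, 3220, 521, 2919, 2],  # ids of 'No seas cabro' from this notebook
        [0, 464, 2],                   # made-up shorter sequence
    ],
    pad_id=1,  # assumption; use tokenizer.pad_token_id in practice
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed length keeps batches of short texts small, which matters for tweet-length inputs.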

## Training

In [10]:
!pip install ipdb


In [11]:
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    """Compute macro-averaged F1 and recall from the Trainer's evaluation output."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    preds = np.argmax(logits, axis=-1)

    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="macro"))
    return results
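
To make `average="macro"` concrete, the following sketch computes both metrics by hand: each class gets its own recall and F1, and the per-class scores are averaged with equal weight. The `macro_scores` helper is hypothetical, the references are the ten test labels printed earlier, and the predictions are invented for illustration.

```python
def macro_scores(predictions, references):
    """Macro-averaged F1 and recall over all classes seen in either list."""
    labels = sorted(set(references) | set(predictions))
    pairs = list(zip(predictions, references))
    recalls, f1s = [], []
    for label in labels:
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        # Per-class recall and F1, defined as 0 when the denominator is 0.
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return {"f1": sum(f1s) / len(f1s), "recall": sum(recalls) / len(recalls)}


references = [0, 2, 2, 2, 1, 1, 0, 0, 0, 1]   # first ten test labels
predictions = [0, 2, 2, 1, 1, 1, 0, 0, 2, 1]  # hypothetical model output
scores = macro_scores(predictions, references)
print(scores)
```

Because every class counts equally, a model that ignores a rare class is penalized more heavily than it would be under plain accuracy.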

In [12]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

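training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="test_trainer",
    do_eval=True,
    # Evaluate on the validation split after every epoch. Recent transformers
    # releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    num_train_epochs=5,
)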

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()

{'eval_loss': 0.866065502166748, 'eval_f1': 0.24875621890547261, 'eval_recall': 0.3333333333333333, 'eval_runtime': 0.076, 'eval_samples_per_second': 552.377, 'eval_steps_per_second': 78.911, 'epoch': 1.0}
{'eval_loss': 0.7639896869659424, 'eval_f1': 0.47933177933177934, 'eval_recall': 0.4640740740740741, 'eval_runtime': 0.0688, 'eval_samples_per_second': 610.675, 'eval_steps_per_second': 87.239, 'epoch': 2.0}
{'eval_loss': 0.6877947449684143, 'eval_f1': 0.591100076394194, 'eval_recall': 0.5751851851851851, 'eval_runtime': 0.0754, 'eval_samples_per_second': 556.747, 'eval_steps_per_second': 79.535, 'epoch': 3.0}
{'eval_loss': 0.6557431221008301, 'eval_f1': 0.6415329768270945, 'eval_recall': 0.6168518518518519, 'eval_runtime': 0.0685, 'eval_samples_per_second': 613.074, 'eval_steps_per_second': 87.582, 'epoch': 4.0}
{'eval_loss': 0.6470615863800049, 'eval_f1': 0.6415329768270945, 'eval_recall': 0.6168518518518519, 'eval_runtime': 0.0699, 'eval_samples_per_second': 600.653, 'eval_steps_p

TrainOutput(global_step=55, training_loss=0.6189532886851917, metrics={'train_runtime': 4.6165, 'train_samples_per_second': 360.664, 'train_steps_per_second': 11.914, 'train_loss': 0.6189532886851917, 'epoch': 5.0})
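
Two numbers in this log can be sanity-checked with a little arithmetic. First, `global_step=55` should equal the number of batches per epoch times the number of epochs, assuming a single device and no gradient accumulation. Second, the epoch 1 recall of exactly 1/3 is what a model that predicts a single class for every example would score on three classes. The notebook does not state that this is what happened; it is an inference, and the sketch below merely searches for the class size that would reproduce the logged F1 under that assumption.

```python
import math

# Step count: 333 training rows, batch size 32, 5 epochs (values from the cells above).
train_rows, batch_size, epochs = 333, 32, 5
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)


def constant_macro_f1(k, n, num_classes=3):
    """Macro F1 when one class (k of n examples) is predicted for everything."""
    precision = k / n   # only k of the n predictions are correct
    recall = 1.0        # every example of the predicted class is found
    f1_predicted_class = 2 * precision * recall / (precision + recall)
    # The other classes are never predicted, so their F1 is 0.
    return f1_predicted_class / num_classes


logged_f1 = 0.24875621890547261  # eval_f1 at epoch 1
val_rows = 42

# Find the class size that best explains the logged value (an inference, not a given).
best_k = min(
    range(1, val_rows + 1),
    key=lambda k: abs(constant_macro_f1(k, val_rows) - logged_f1),
)
print(best_k, constant_macro_f1(best_k, val_rows))
```

If that reading is right, the model only started separating the classes from epoch 2 onward, which matches the jump in both metrics.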

In [13]:
trainer.evaluate(tokenized_ds["test"])

{'eval_loss': 0.7562307119369507,
 'eval_f1': 0.6454706640876854,
 'eval_recall': 0.6239177489177489,
 'eval_runtime': 0.0418,
 'eval_samples_per_second': 886.102,
 'eval_steps_per_second': 119.744,
 'epoch': 5.0}

## Inference

Let's run the fine-tuned model on a single test example and compare its prediction with the gold label.

In [18]:
ds["test"]["text"][:10]

['No seas cabro',
 'Te caería bien',
 'Que no deje resaca',
 'Me lancé a la carrera',
 'Todos son chibolos',
 '—Creo que mejor zafo —dijo Peter',
 '-Te ha dejado plantado',
 'Ese fue el momento más jodido',
 'Unos negros asquerosos, amor',
 'A diez minutos a pata']

In [22]:
text = ds["test"]["text"][0]
text

'No seas cabro'

In [24]:
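# Keep only the token ids and move them to the device the model is on
# (hard-coding 'cuda' would fail on a CPU-only machine).
encoded_input = tokenizer(text, return_tensors='pt').input_ids
encoded_input = encoded_input.to(model.device)
encoded_input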

tensor([[   0,  464, 3220,  521, 2919,    2]], device='cuda:0')

In [25]:
logits = model(encoded_input).logits
logits

tensor([[ 2.7639, -1.2501, -1.7805]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
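
The two cells above pass only `input_ids` and keep gradient tracking on, which is why the output carries a `grad_fn`. For repeated predictions, a small helper along the following lines may be more convenient. It is a sketch, not part of pysentimiento or transformers: `predict_proba` and its optional `preprocess` argument are hypothetical names. Passing `preprocess_tweet` there would keep inference inputs consistent with the training data.

```python
def predict_proba(text, model, tokenizer, preprocess=None):
    """Hypothetical helper: return class probabilities for a single text."""
    # Imported here so the helper can be defined before torch is loaded.
    import torch

    if preprocess is not None:
        # Apply the same normalization used for the training data.
        text = preprocess(text)

    model.eval()  # disable dropout for deterministic predictions
    # Keep input_ids and attention_mask together and move both to the model's device.
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():  # no gradients are needed for inference
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0].tolist()
```

It could then be called as `predict_proba(text, model, tokenizer)`, or with `preprocess=lambda t: preprocess_tweet(t, lang="es")`.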

In [26]:
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [NEGATIVE, NEUTRAL, POSITIVE]: {probabilities}')

probabilities [NEGATIVE, NEUTRAL, POSITIVE]: [0.9721140265464783, 0.017555944621562958, 0.010330010205507278]
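
The softmax step can be reproduced by hand from the logits printed above. This sketch uses only the standard library; the logits are the rounded values shown in the output, so the results agree with the printed probabilities to about three decimal places.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.7639, -1.2501, -1.7805]  # rounded values from the output above
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The largest probability belongs to class 0, which matches the gold label of this example.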


In [28]:
ds["test"]["label"][0]

0