# Fine-tuning XLM-RoBERTa for EN Sentiment Analysis (Text Classification)



The base code is available in the Hugging Face documentation:

https://huggingface.co/docs/transformers/tasks/sequence_classification

In [1]:
# Install all necessary libraries

!pip install transformers datasets evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Login to Hugging Face account

from huggingface_hub import notebook_login

notebook_login()


## Load IMDb dataset

In [3]:
from datasets import load_dataset

imdb = load_dataset("imdb")




In [4]:
# Take a look at an example

imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [5]:
imdb["test"][0]['text']

'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have

In [6]:
print(type(imdb))

<class 'datasets.dataset_dict.DatasetDict'>


In [7]:
print(len(imdb['test']))

25000


## Load translator

In [8]:
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="judithrosell/t5-mt-en-fr")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [9]:
# Test translator

text = imdb["test"][0]['text']
translation = translator(text, max_length=512)

In [10]:
print(translation)

[{'translation_text': "Je veux m'en tenir à beaucoup. Les films et les télévisions de sci-fi sont généralement sous-financés, sous-appréhendés et mal compris; je m'efforçai de le voir, et c'est à bon sci-fi de télévision, comme Babylon 5 est à Star Trek (l'original); des prothèses silencieuses, des jeux de carton bon marché, des dialogues d'un caractère qui ne correspond pas au fond, et ils ne peuvent pas s'épanouissements d'un caractère à l'autre, et ils sont d'un caractère à l'écouter, et ils ne s'en aperçus à l'origine; et ils ne sont pas à l'origine de Babylones à l'étape de carton à l'origine, des dialogues semblés, des personnages d'un caractère à l'air à l'air à l'origine, et ils ne peuvent pas être en rien à l'uni-fi, et ils sont d'un caractère à l'origine, et ils ne s'épanouissent à l'origine, et ils ne s'étalonnent à l'origine, et ils sont d'un à l'autre, et ils ne s'étalonnent à l'air à l'un à l'autre, et ils ne peuvent pas être entraînés, et ils sont d'un point à l'origine 

## Translate 500 IMDb TEST samples and reduce TRAIN set to 1500 samples

In [13]:
import nltk
nltk.download('punkt')
from nltk import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [11]:
# Generate a list of 1500 random ids

import random

# set a seed for reproducibility
random.seed(42)

# select 1500 random ids from 0 to 24999 (the first 500 are reused for the test split)
random_ids = random.sample(range(0, 25000), 1500)
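As a quick sanity check on the sampling step, the sketch below confirms that `random.sample` draws without replacement and that the same seed reproduces the same ids. The helper name `sample_ids` is hypothetical and not part of the notebook; it uses a local `random.Random` generator instead of the global seed.

```python
import random

def sample_ids(n_total, n_samples, seed=42):
    """Return n_samples distinct ids drawn from range(n_total)."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(range(n_total), n_samples)

ids_a = sample_ids(25000, 1500)
ids_b = sample_ids(25000, 1500)

print(len(ids_a), len(set(ids_a)))  # number of ids and number of distinct ids
print(ids_a == ids_b)               # same seed, same ids
```

Because the ids are distinct, slicing the first 500 for the test split never selects the same test review twice.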

In [38]:
# Build the reduced TRAIN split as a list of dicts (1500 random samples)
new_imdb = {}
new_imdb['train'] = []
for i in random_ids:
    example = imdb['train'][i]  # fetch the row once
    new_imdb['train'].append({'label': example['label'], 'text': example['text']})

In [39]:
# Build the TEST split as a list of dicts (500 samples translated to FRENCH)
# Each review is translated sentence by sentence so that every translator input stays short
new_imdb['test'] = []
for i in random_ids[:500]:
    example = imdb['test'][i]  # fetch the row once
    sents = nltk.sent_tokenize(example['text'])
    content = ''
    for sent in sents:
        c = translator(sent, max_length=512)
        content += c[0]['translation_text'] + ' '
    new_imdb['test'].append({'label': example['label'], 'text': content})
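The whole-review translation printed earlier degenerates into repetition, which is presumably why this cell translates one sentence at a time. The sketch below isolates that pattern with the translator passed in as a callable, so it runs without downloading a model. `translate_review`, `split_sentences`, and `fake_translate` are hypothetical stand-ins: the regex splitter is a naive substitute for `nltk.sent_tokenize`, and the stub replaces the real `translator` pipeline.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    Stand-in for nltk.sent_tokenize, which handles many more cases."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_review(text, translate_sentence, splitter=split_sentences):
    """Translate a review sentence by sentence and rejoin with single spaces."""
    return " ".join(translate_sentence(s) for s in splitter(text))

def fake_translate(sentence):
    # Hypothetical stub; in the notebook this would be
    # translator(sentence, max_length=512)[0]["translation_text"]
    return f"<fr>{sentence}</fr>"

review = "I love sci-fi. I tried to like this! Did it work?"
print(translate_review(review, fake_translate))
```

Joining with `" ".join` avoids the trailing space that the `content += ... + ' '` loop leaves at the end of each translated review.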

In [40]:
# Check new lengths

print(len(new_imdb['train']))
print(len(new_imdb['test']))

1500
500


In [41]:
print(type(new_imdb['train']))
print(type(new_imdb['test']))

<class 'list'>
<class 'list'>


In [43]:
print(imdb['train'])
print(imdb['test'])

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


In [44]:
print(type(imdb['train']))
print(type(imdb['test']))

<class 'datasets.arrow_dataset.Dataset'>
<class 'datasets.arrow_dataset.Dataset'>


In [45]:
# Alias new_imdb for TESTING purposes (to see if the code works); this does not copy the data

trial_imdb = new_imdb

In [46]:
print(len(trial_imdb['train']))
print(len(trial_imdb['test']))
print(type(trial_imdb['train']))
print(type(trial_imdb['test']))

1500
500
<class 'list'>
<class 'list'>


In [61]:
print(len(imdb['train']))
print(len(imdb['test']))
print(type(imdb['train']))
print(type(imdb['test']))

25000
25000
<class 'datasets.arrow_dataset.Dataset'>
<class 'datasets.arrow_dataset.Dataset'>


In [62]:
print(type(trial_imdb))

<class 'dict'>


In [63]:
from datasets import Dataset

# Assuming that each item in the 'train' list is a dictionary
train_data = {"text": [item["text"] for item in trial_imdb['train']],
              "label": [item["label"] for item in trial_imdb['train']]}

train_dataset = Dataset.from_dict(train_data)

In [64]:
# Assuming that each item in the 'test' list is a dictionary
test_data = {"text": [item["text"] for item in trial_imdb['test']],
             "label": [item["label"] for item in trial_imdb['test']]}

test_dataset = Dataset.from_dict(test_data)

In [65]:
from datasets import DatasetDict

new_imdb_dataset = DatasetDict({'train': train_dataset, 'test': test_dataset})
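The two conversion cells above do the same thing for each split: they turn a list of row dicts into a dict of columns, which is the shape `Dataset.from_dict` expects. A minimal, dependency-free sketch of that step follows; the helper name `rows_to_columns` is hypothetical.

```python
def rows_to_columns(rows, keys=("text", "label")):
    """Convert [{'text': ..., 'label': ...}, ...] into {'text': [...], 'label': [...]}."""
    return {key: [row[key] for row in rows] for key in keys}

rows = [
    {"label": 1, "text": "A good family film."},
    {"label": 0, "text": "Wooden and predictable."},
]

columns = rows_to_columns(rows)
print(columns)
```

With such a helper, each split could be built as `Dataset.from_dict(rows_to_columns(trial_imdb['train']))`. Recent versions of `datasets` may also offer `Dataset.from_list` for the same purpose; check your installed version before relying on it.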

In [66]:
print(type(new_imdb_dataset))

<class 'datasets.dataset_dict.DatasetDict'>


In [67]:
print(type(new_imdb_dataset['train']))

<class 'datasets.arrow_dataset.Dataset'>


In [68]:
print(type(new_imdb_dataset['train'][0]))

<class 'dict'>


In [69]:
print(new_imdb_dataset['train'][0])

{'text': 'Arguably this is a very good "sequel", better than the first live action film 101 Dalmatians. It has good dogs, good actors, good jokes and all right slapstick! <br /><br />Cruella DeVil, who has had some rather major therapy, is now a lover of dogs and very kind to them. Many, including Chloe Simon, owner of one of the dogs that Cruella once tried to kill, do not believe this. Others, like Kevin Shepherd (owner of 2nd Chance Dog Shelter) believe that she has changed. <br /><br />Meanwhile, Dipstick, with his mate, have given birth to three cute dalmatian puppies! Little Dipper, Domino and Oddball...<br /><br />Starring Eric Idle as Waddlesworth (the hilarious macaw), Glenn Close as Cruella herself and Gerard Depardieu as Le Pelt (another baddie, the name should give a clue), this is a good family film with excitement and lots more!! One downfall of this film is that is has a lot of painful slapstick, but not quite as excessive as the last film. This is also funnier than the 

In [70]:
print(new_imdb_dataset['test'][0])

{'text': "Très sourd ! Un concours de t-shirts humides! Un raid de l'équipage ! Cette «SheAnimal House» a tout. Sauf un complot et de bons acteurs. Le film comprend plusieurs pignons et des chats entre le H.O.T.S. et la maison d'infanterie bâtonneuse. La dialouge est un cri, et la musique sonore est sublimement brillante. Au cours des deux derniers mois, il s'est éteinte tout à l'échelon du câble, et prenez soin de l'en assurer. H.O.T.S. est si mauvais qu'il est bon! ", 'label': 1}


The rebuilt dataset has the same structure as the original `imdb` DatasetDict.

## Preprocess

In [71]:
# Load XLM-RoBERTa tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

In [72]:
# Create a preprocessing function

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [73]:
# Use Datasets map function to apply the preprocessing function over the entire dataset

tokenized_imdb = new_imdb_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [74]:
# Create a data collator that pads each batch to its longest sequence (dynamic padding)

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
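To make the idea of dynamic padding concrete, here is a simplified sketch of what a padding collator does to one batch: it pads every sequence to the longest one in that batch and builds a matching attention mask. `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=1` is an assumption (check `tokenizer.pad_token_id` for the real value). The actual `DataCollatorWithPadding` also handles tensors and other fields.

```python
def pad_batch(batch_input_ids, pad_id=1):
    """Pad every sequence to the longest one in this batch.

    Returns padded input ids plus an attention mask (1 = real token, 0 = padding).
    pad_id=1 is an assumption; use tokenizer.pad_token_id in practice.
    """
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_input_ids]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[0, 581, 1346, 2], [0, 87, 2]]  # made-up token ids
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch instead of to a global maximum keeps short batches short, which is why the tokenizer call above only truncates and leaves padding to the collator.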

## Evaluate

In [75]:
# Import evaluate library

import evaluate

accuracy = evaluate.load("accuracy")

In [76]:
# Create a function that takes the argmax of the logits and passes predictions and labels to accuracy.compute

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
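As a worked example of what `compute_metrics` calculates, the dependency-free sketch below takes the argmax of each row of logits and compares it with the reference label. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical helper, not the `evaluate` implementation.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compute the share of correct picks."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]  # made-up model outputs
labels = [0, 1, 0, 0]                                        # made-up references

print(accuracy_from_logits(logits, labels))  # 3 of 4 correct -> {'accuracy': 0.75}
```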

## Train

In [78]:
# Create a map of the expected ids to their labels with id2label and label2id

id2label = {0: "NEGATIF", 1: "POSITIF"}
label2id = {"NEGATIF": 0, "POSITIF": 1}

In [79]:
# Load XLM-RoBERTa with AutoModelForSequenceClassification

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

In [80]:
# Define training hyperparameters

training_args = TrainingArguments(
    output_dir="sa_french_tr_test",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

# Pass the training arguments to Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/judithrosell/sa_french_tr_test into local empty directory.


In [81]:
# Call train() to fine-tune the model

trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.469802 | 0.824 |
| 2 | No log | 0.422872 | 0.836 |
| 3 | No log | 0.497434 | 0.838 |
| 4 | No log | 0.708058 | 0.828 |
| 5 | No log | 0.78127 | 0.82 |
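
The log shows validation loss bottoming out at epoch 2 while accuracy peaks at epoch 3, so the choice of selection metric matters. To my understanding, `load_best_model_at_end=True` without `metric_for_best_model` falls back to validation loss. The sketch below only replays the numbers from the log to show which epoch each criterion would pick.

```python
# (epoch, validation_loss, accuracy) copied from the training log
history = [
    (1, 0.469802, 0.824),
    (2, 0.422872, 0.836),
    (3, 0.497434, 0.838),
    (4, 0.708058, 0.828),
    (5, 0.781270, 0.820),
]

best_by_loss = min(history, key=lambda row: row[1])  # lowest validation loss
best_by_acc = max(history, key=lambda row: row[2])   # highest accuracy

print("best epoch by loss:", best_by_loss[0])
print("best epoch by accuracy:", best_by_acc[0])
```

The rising validation loss after epoch 2 suggests overfitting on the 1500 training samples. Adding `metric_for_best_model="accuracy"` to `TrainingArguments` would be one way to select the epoch 3 checkpoint instead.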


Adding files tracked by Git LFS: ['tokenizer.json']. This may take a bit of time if the files are large.


TrainOutput(global_step=470, training_loss=0.28354024684175533, metrics={'train_runtime': 639.1174, 'train_samples_per_second': 11.735, 'train_steps_per_second': 0.735, 'total_flos': 1966212472264320.0, 'train_loss': 0.28354024684175533, 'epoch': 5.0})
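
The figures in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly, and recomputes the step count and throughput from the dataset size, batch size, and runtime.

```python
import math

train_samples = 1500   # size of the reduced train split
batch_size = 16        # per_device_train_batch_size
epochs = 5             # num_train_epochs
runtime_s = 639.1174   # train_runtime from TrainOutput

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs

samples_per_second = train_samples * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The result matches `global_step=470`. It also most likely explains the "No log" entries in the Training Loss column: with the default `logging_steps` of 500, training finishes before the first loss is logged. Setting a smaller `logging_steps` should fill that column.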

In [82]:
# Save (push) model to HF hub

trainer.push_to_hub()

Upload file runs/May09_20-19-33_5e51c8e54aff/events.out.tfevents.1683663600.5e51c8e54aff.28682.0:   0%|       …

To https://huggingface.co/judithrosell/sa_french_tr_test
   7692f7c..f31ecf6  main -> main


To https://huggingface.co/judithrosell/sa_french_tr_test
   f31ecf6..66e0398  main -> main




'https://huggingface.co/judithrosell/sa_french_tr_test/commit/f31ecf6b961b17ae2c773ebccb58c13c34c62baa'

## Inference

In [83]:
text = "C'était un chef-d'œuvre. Pas tout à fait fidèle aux livres, mais passionnant du début à la fin. C'est peut-être mon préféré des trois."

In [84]:
# Instantiate a pipeline() for sentiment analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="judithrosell/sa_french_tr_test")
classifier(text)

Downloading (…)lve/main/config.json:   0%|          | 0.00/886 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

[{'label': 'POSITIF', 'score': 0.9112136960029602}]
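
The `score` in the pipeline output is a probability derived from the model's two logits. As a rough sketch of that post-processing, assuming the pipeline applies a softmax for this two-label model, the code below turns logits into a label and score using the notebook's `id2label` mapping. The logits are made up and `to_prediction` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIF", 1: "POSITIF"}  # same mapping as in the training section

def to_prediction(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

prediction = to_prediction([-1.2, 1.1])  # made-up logits
print(prediction)
```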