## **Multi Lingual Reviews Classification with dynamic padding for faster training**

In this notebook, we train an `xlm-roberta-base` model on the multilingual Amazon Reviews dataset. The model attains an accuracy comparable to the published baseline. Furthermore, we implement dynamic padding to speed up model training.


If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the install cells below.

In [None]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=c86aea9a11b

## Loading the dataset

In [None]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/3e/73/742d17d8a9a1c639132affccc9250f0743e484cbf263ede6ddcbe34ef212/datasets-1.4.1-py3-none-any.whl (186kB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/e7/27/1c0b37c53a7852f1c190ba5039404d27b3ae96a55f48203a74259f8213c9/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting huggingface-hub==0.0.2
  Downloading https://files.pythonhosted.org/packages/b5/93/7cb0755c62c36cdadc70c79a95681df685b52cbaf76c724facb6ecac3272/huggingface_hub-0.0.2-py3-none-any.whl
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/91/0d/a6bfee0ddf47b254286b9bd574e6f50978c69897647ae15b14230711806e/fsspec-0.8.7-py3-none-any.whl (103kB)
Installing collected packages: xxhash, huggingface-h

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

In [None]:
dataset = load_dataset('amazon_reviews_multi')


No config specified, defaulting to: amazon_reviews_multi/all_languages



Downloading and preparing dataset amazon_reviews_multi/all_languages (download: 610.66 MiB, generated: 364.83 MiB, post-processed: Unknown size, total: 975.49 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/all_languages/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd...



Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/all_languages/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd. Subsequent calls will reuse this data.


The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets.

As you can see below, the dataset has 1.2M training examples and 30K examples each for validation and test.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

{'language': 'de',
 'product_category': 'sports',
 'product_id': 'product_de_0865382',
 'review_body': 'Armband ist leider nach 1 Jahr kaputt gegangen',
 'review_id': 'de_0203609',
 'review_title': 'Leider nach 1 Jahr kaputt',
 'reviewer_id': 'reviewer_de_0267719',
 'stars': 1}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,language,product_category,product_id,review_body,review_id,review_title,reviewer_id,stars
0,de,other,product_de_0105655,"Problemloser Download, einwandfreie Funktion",de_0332512,Einfachstes Handling,reviewer_de_0496867,5
1,ja,beauty,product_ja_0491474,コロンによっては、つけると自分の体臭との混ざり具合でか、嫌な臭いになるケースも多々ありましたが、バーバリーブリットはとても心地よくつけています。また、購入する予定です。,ja_0043198,心地よくつけています,reviewer_ja_0239103,4
2,es,sports,product_es_0863441,"Pulsera para actividad deportiva buenisima la e comprado para mi novia y la verdad que esta encantada con ella, ya que tiene un diseño fino y elegante y cumple sus funciones de maravilla, te mide los pasos que das al dia, la frecuencia cardiaca, si sales a caminar a correr o con la bicicleta es muy completa. Y encima tiene un facilisimo funcionamiento y una interfaz sencilla y facil de comprender y algo muy importante instrucciones en español. La verdad que es una pulsera muy buena en relacion calidad precio.",es_0747103,Pulsera de actividad que recomiendo,reviewer_es_0354975,5
3,en,pet_products,product_en_0981058,Didn't really do much for my dog.,en_0658982,Two Stars,reviewer_en_0563462,2
4,zh,sports,product_zh_0200054,收到了，外包装上建议零售价只有50元嘛～是不是给我发错货了？做工还算细致，但就价格而言，性价比低。,zh_0439594,价格有没有搞错？,reviewer_zh_0577333,3
5,zh,digital_ebook_purchase,product_zh_0991859,整本书没有一个清晰的脉络，相比之前看过的许多自我管理类书籍来说，不成体系，似乎哪家的论点都提了一嘴但都浮于表面，说服力不足的情况下只会产生『道理我都懂』的情况，其实是道理没讲清楚。,zh_0977974,不成体系，过于简单说教,reviewer_zh_0164772,2
6,ja,home,product_ja_0195507,ちょっと硬めの下敷きくらいの柔らかさです 厚物の刃を研ぐのにはちょっと役不足かな、切れ味を持続させるようなメンテ的使い方ならいいかも,ja_0658539,しなる程度の柔らかさじゃなかったｗ,reviewer_ja_0770605,3
7,de,home,product_de_0441403,Die. Lampe ist genial. Leider hat die Lampe nach kurzer Zeit nicht mehr auf die Fernbedienung reagiert,de_0434846,Ist okay,reviewer_de_0400930,3
8,en,wireless,product_en_0285679,I like this because I can clip it right on my jeans or slacks instead of putting in purse. Easy access. Have bought a couple times now over 7 years.,en_0428953,fits well,reviewer_en_0027757,5
9,es,drugstore,product_es_0013663,El producto ya lo había probado y de echo repito compra. El problema que a llegado el envoltorio roto y el producto esparcido en la caja de cartón.,es_0624344,Envoltorio en mal estado.,reviewer_es_0571957,3


As can be seen, the dataset has reviews in several languages. The `review_body` column has the review text and the `stars` column has the rating. Ratings range from 1 to 5, with an equal share of reviews in each class.

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric = load_metric('accuracy')
metric


Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions: Predicted labels, as returned by a model.
    references: Ground truth labels.
    normalize: If False, return the number of correctly classified samples.
        Otherwise, return the fraction of correctly classified samples.
    sample_weight: Sample weights.
Returns:
    accuracy: Accuracy score.
Examples:

    >>> accuracy_metric = datasets.load_metric("accuracy")
    >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
    >>> print(results)
    {'accuracy': 1.0}
""", stored examples: 0)

In [None]:
f1_metric = load_metric('f1')
f1_metric


Metric(name: "f1", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions: Predicted labels, as returned by a model.
    references: Ground truth labels.
    labels: The set of labels to include when average != 'binary', and
        their order if average is None. Labels present in the data can
        be excluded, for example to calculate a multiclass average ignoring
        a majority negative class, while labels not present in the data will
        result in 0 components in a macro average. For multilabel targets,
        labels are column indices. By default, all labels in y_true and
        y_pred are used in sorted order.
    average: This parameter is required for multiclass/multilabel targets.
        If None, the scores for each class are returned. Otherwise, this
        determines the type of averaging performed on the data:
            binary: Only report results for the class specified by pos

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [None]:
import numpy as np

fake_preds = np.random.randint(1, 6, size=(64,))
fake_labels = np.random.randint(1, 6, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.234375}

In [None]:
f1_metric.compute(predictions=fake_preds, references=fake_labels, average='weighted')

{'f1': 0.2285631925928209}
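
For intuition, accuracy is simply the fraction of positions where the prediction and the label agree, so random guesses over five equally likely classes should land near 0.2, which is in line with the value obtained above on 64 samples. A minimal pure-Python sketch (the helper name `accuracy` and the sample size are illustrative and not part of 🤗 Datasets):

```python
import random

def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    assert len(predictions) == len(references)
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000     # many more samples than the 64 used above, to reduce noise
preds = [random.randint(1, 5) for _ in range(n)]   # star ratings 1..5
labels = [random.randint(1, 5) for _ in range(n)]
chance_accuracy = accuracy(preds, labels)
print(round(chance_accuracy, 2))
```

Any fine-tuned model should therefore be judged against this chance level of roughly 20%.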

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

Since we are dealing with multilingual data, we will use the `xlm-roberta-base` checkpoint of XLM-RoBERTa, a model pretrained on text in 100 languages. For more model options, see the [multilingual models documentation](https://huggingface.co/transformers/multilingual.html).

To speed up the training process, we are going to train on a fraction of the data: after shuffling, we keep one tenth of the training set and one fifth of the validation set.
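
To make the resulting sizes concrete, the sketch below mimics `shard(index, num_shards)` on plain row indices. It assumes the strided selection (every `num_shards`-th row starting at `index`) that we understand to have been the default in the 🤗 Datasets version used here; other releases may take a contiguous block instead, which makes little practical difference after a shuffle. The helper `shard_indices` is hypothetical.

```python
def shard_indices(num_rows, index, num_shards):
    """Row indices kept by a strided shard: index, index + num_shards, ..."""
    return list(range(index, num_rows, num_shards))

train_rows = len(shard_indices(1_200_000, index=1, num_shards=10))
val_rows = len(shard_indices(30_000, index=1, num_shards=5))
print(train_rows, val_rows)                        # 120000 6000
print(shard_indices(20, index=1, num_shards=10))   # [1, 11]
```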

In [None]:
do_shard = True
if do_shard:
    dataset = dataset.shuffle(seed=123)
    train_dataset = dataset["train"].shard(index=1, num_shards=10) 
    val_dataset = dataset['validation'].shard(index=1, num_shards=5) 
else:
    train_dataset = dataset['train']
    val_dataset = dataset['validation']

In [None]:
train_dataset[0:5]

{'language': ['fr', 'en', 'ja', 'es', 'es'],
 'product_category': ['pc',
  'beauty',
  'electronics',
  'apparel',
  'digital_ebook_purchase'],
 'product_id': ['product_fr_0404142',
  'product_en_0971943',
  'product_ja_0119647',
  'product_es_0814248',
  'product_es_0013336'],
 'review_body': ["Magnifique sac à bandoulière pour y loger mon MAC Book air 13.3 pouces, qui de ce fait, se retrouve très bien protégé. Une tablette de 10 pouces trouve aussi sa place à côté du Mac...Poche avant profonde pouvant être utilisée pour ranger divers câbles. Pour ma part, j'ai pu ranger en plus des câbles, mes 2 petits disques durs externes...souris... Rangement aussi possible pour carnets, notes diverses sur format A4. Ensemble très classe. Suis très heureuse de mon achat, que je recommande vivement.",
  'I’ve had good luck with this product when I’ve bought it in stores, but this particular one I bought though Amazon was watered down! The texture was much more viscous than any of the tubes I bought

In [None]:
val_dataset

Dataset({
    features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
    num_rows: 6000
})

In [None]:
from transformers import AutoTokenizer
model_checkpoint = 'xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!", "为什么一个620卖1988，一个卖4299？都一样的吗？")

{'input_ids': [0, 35378, 4, 903, 1632, 149357, 38, 2, 2, 6, 23543, 1860, 910, 1549, 21633, 109332, 4, 1860, 21633, 13023, 5046, 32, 1198, 13326, 43, 9131, 32, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We can then write the function that will preprocess our samples. We feed them to the `tokenizer` with a `max_length`, so that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Passing `truncation=True` would make this explicit; without it, the tokenizer falls back to the `longest_first` strategy and prints a warning, as the output further down shows.

We concatenate the `review_body`, `review_title` and `product_category` into a single string and pass that to the tokenizer. Concatenating the title and product category with the body results in a significant increase in accuracy.
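
As a small illustration of this concatenation, and of the label shift applied in the next cell, the sketch below builds the model input for the first training example shown earlier. `build_text` and `to_label` are illustrative helper names, not part of the notebook.

```python
def build_text(example):
    """Join body, title and category with single spaces, in that order."""
    return " ".join(
        [example["review_body"], example["review_title"], example["product_category"]]
    )

def to_label(stars):
    """Map a 1-5 star rating to a 0-4 class index."""
    return stars - 1

example = {
    "review_body": "Armband ist leider nach 1 Jahr kaputt gegangen",
    "review_title": "Leider nach 1 Jahr kaputt",
    "product_category": "sports",
    "stars": 1,
}
text = build_text(example)
label = to_label(example["stars"])
print(text)
print(label)
```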

In [None]:
import torch
max_len = 512
pad_to_max = False  # leave padding to the data collator (dynamic padding)
def tokenize_data(example):
    # Concatenate the review body, title and product category into one string
    text_ = example['review_body'] + " " + example['review_title'] + " " + example['product_category']
    encodings = tokenizer.encode_plus(text_, pad_to_max_length=pad_to_max, max_length=max_len,
                                      add_special_tokens=True,
                                      return_token_type_ids=False,
                                      return_attention_mask=True,
                                      return_overflowing_tokens=False,
                                      return_special_tokens_mask=False,
                                      )

    # Subtract 1 from labels to have them in range 0-4
    targets = torch.tensor(example['stars']-1, dtype=torch.long)

    encodings.update({'labels': targets})
    return encodings



In [None]:
tokenize_data(dataset['train'][0]).keys()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


dict_keys(['input_ids', 'attention_mask', 'labels'])

To apply this function to all the examples, we use the `map` method of the `train_dataset` and `val_dataset` objects we created earlier. This applies the function to every element of each split, so our training and validation data are each preprocessed with a single command.

In [None]:
encoded_train_dataset = train_dataset.map(tokenize_data)
encoded_val_dataset = val_dataset.map(tokenize_data)


In [None]:
encoded_train_dataset.column_names

['attention_mask',
 'input_ids',
 'labels',
 'language',
 'product_category',
 'product_id',
 'review_body',
 'review_id',
 'review_title',
 'reviewer_id',
 'stars']

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change `tokenize_data` and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

We can also pass `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently. To do so, the preprocessing function must accept and return lists of values rather than a single example.
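
With `batched=True`, `map` passes a dictionary of lists (one list per column) instead of a single example. The following is a sketch of what a batched variant of the preprocessing could look like; it is not the function used in this notebook. `tokenize_batch` is a hypothetical name, and the tokenizer is passed in explicitly so that the function does not rely on a global variable.

```python
def tokenize_batch(batch, tokenizer, max_len=512):
    """Batched preprocessing: `batch` maps column names to lists of values."""
    texts = [
        f"{body} {title} {category}"
        for body, title, category in zip(
            batch["review_body"], batch["review_title"], batch["product_category"]
        )
    ]
    # Truncate explicitly; padding is left to the data collator.
    encodings = dict(tokenizer(texts, truncation=True, max_length=max_len))
    # Shift star ratings 1-5 to class indices 0-4.
    encodings["labels"] = [stars - 1 for stars in batch["stars"]]
    return encodings
```

It could then be applied with something like `train_dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)`.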

## Fine-tuning the model

In [None]:
def pad_seq(seq, max_batch_len, pad_value):
    # Right-pad `seq` with `pad_value` up to `max_batch_len`
    return seq + (max_batch_len - len(seq)) * [pad_value]

In [None]:
from dataclasses import dataclass

@dataclass
class SmartCollator:
    """Pad every batch to the length of its longest sequence (dynamic padding)."""
    pad_token_id: int

    def __call__(self, batch):
        batch_inputs = list()
        batch_attention_masks = list()
        labels = list()
        # Longest sequence in this batch; everything is padded up to it
        max_size = max([len(ex['input_ids']) for ex in batch])
        for item in batch:
            batch_inputs += [pad_seq(item['input_ids'], max_size, self.pad_token_id)]
            # Padded positions get attention mask 0 so the model ignores them
            batch_attention_masks += [pad_seq(item['attention_mask'], max_size, 0)]
            labels.append(item['labels'])

        return {"input_ids": torch.tensor(batch_inputs, dtype=torch.long),
                "attention_mask": torch.tensor(batch_attention_masks, dtype=torch.long),
                "labels": torch.tensor(labels, dtype=torch.long)
                }
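
The collator above pads each batch only to the length of its longest sequence (dynamic padding) rather than to a fixed `max_len`. To get a rough feel for the saving, the sketch below counts how many token positions each strategy would produce. The sequence lengths are made up for illustration, and `padded_tokens` is a hypothetical helper.

```python
def padded_tokens(lengths, batch_size, fixed_len=None):
    """Total token positions after padding, batch by batch.

    With fixed_len=None each batch is padded to its own longest sequence
    (dynamic padding); otherwise every sequence is padded to fixed_len.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max(batch) if fixed_len is None else fixed_len
        total += target * len(batch)
    return total

# Made-up token counts for 16 short reviews (illustrative only).
lengths = [12, 40, 25, 18, 60, 33, 21, 15, 90, 30, 27, 45, 22, 19, 38, 50]
dynamic = padded_tokens(lengths, batch_size=8)
fixed = padded_tokens(lengths, batch_size=8, fixed_len=512)
print(dynamic, fixed)  # 1200 8192
```

Fewer padded positions mean less wasted computation per step, which is where the speed-up is expected to come from when most reviews are much shorter than 512 tokens.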

In [None]:
# # a very simple accuracy function, nothing fancy
# def compute_metrics(p: EvalPrediction) -> Dict:
#     preds = np.argmax(p.predictions, axis=1)
#     return {"acc": (preds == p.label_ids).mean()}

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. We set `num_labels` to 5 (one class per star rating) and use a batch size of 8.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
batch_size = 8
num_labels = 5

resume_training = False
if resume_training:
    model_checkpoint = 'test-results/checkpoint-20000'
else:
    model_checkpoint = 'xlm-roberta-base'
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

The warning is telling us we are throwing away some weights (the `lm_head` and `roberta.pooler` layers) and randomly initializing some others (the `classifier` layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = "test-results-concat",
    seed = 123, 
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    eval_steps = 5000,
    save_steps = 5000,
    fp16 = True

)

Here we set the evaluation to be done every 5,000 steps (checkpoints are saved at the same interval), tweak the learning rate, use the `batch_size` defined above and customize the number of epochs for training, as well as the weight decay. We also enable mixed-precision training with `fp16=True`. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
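
As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps and evaluations to expect. It assumes a single GPU and no gradient accumulation, which is our assumption here, although it is consistent with the `nvidia-smi` output and the training log further down.

```python
import math

train_examples = 120_000   # one tenth of the 1.2M training examples
batch_size = 8
num_train_epochs = 3
eval_steps = 5_000

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps
print(steps_per_epoch, total_steps, num_evaluations)  # 15000 45000 9
```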

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
 
    predictions = np.argmax(predictions, axis=1)

    return metric.compute(predictions=predictions, references=labels)

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "validation"
trainer = Trainer(
    model,
    args,
    train_dataset= encoded_train_dataset, 
    eval_dataset=encoded_val_dataset,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. By default, the `Trainer` uses it one last time to make all the samples in a batch the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that does all of this. In this notebook we customize that part by passing our own `data_collator`, the `SmartCollator` defined above, which receives the samples as dictionaries like the ones seen above and returns a dictionary of tensors padded to the longest sequence in the batch.

We can now finetune our model by just calling the `train` method:

In [None]:
!nvidia-smi

Tue Mar 16 14:30:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    33W / 250W |   2011MiB / 16280MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | Runtime | Samples Per Second |
|---|---|---|---|---|---|
| 5000 | 1.0472 | 1.043684 | 0.5475 | 26.6467 | 225.169 |
| 10000 | 0.9971 | 0.980655 | 0.5695 | 26.6897 | 224.806 |
| 15000 | 0.948 | 0.938941 | 0.591833 | 26.6755 | 224.925 |
| 20000 | 0.8789 | 0.954426 | 0.588667 | 26.7162 | 224.583 |
| 25000 | 0.8805 | 0.944202 | 0.593167 | 26.6773 | 224.91 |
| 30000 | 0.8484 | 0.922876 | 0.597333 | 26.6712 | 224.962 |
| 35000 | 0.7748 | 0.976955 | 0.5945 | 26.6262 | 225.342 |
| 40000 | 0.7576 | 0.989129 | 0.597667 | 26.6673 | 224.995 |
| 45000 | 0.7747 | 0.977039 | 0.5915 | 26.5874 | 225.671 |


TrainOutput(global_step=45000, training_loss=0.8923014499240451, metrics={'train_runtime': 8425.3376, 'train_samples_per_second': 5.341, 'total_flos': 73615102406258928, 'epoch': 3.0})
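
With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the `Trainer` is expected to keep the checkpoint with the highest validation accuracy. The sketch below picks that step from the log above; it only re-reads the logged numbers and does not call the `Trainer`.

```python
# (step, validation accuracy) pairs copied from the training log above.
eval_log = [
    (5000, 0.5475), (10000, 0.5695), (15000, 0.591833),
    (20000, 0.588667), (25000, 0.593167), (30000, 0.597333),
    (35000, 0.5945), (40000, 0.597667), (45000, 0.5915),
]
best_step, best_accuracy = max(eval_log, key=lambda row: row[1])
print(best_step, best_accuracy)  # 40000 0.597667
```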

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

Our model gets an accuracy score of **59.9%**, which is comparable to the accuracy score of 59.2% reported in the [paper](https://arxiv.org/abs/2010.02573).

## Hyperparameter search

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/). For this last section you will need one of those libraries installed; keep only the line you want in the next cell and run it.

In [None]:
! pip install optuna
! pip install ray[tune]



During hyperparameter search, the `Trainer` will run several trainings, so it needs to have the model defined via a function (so it can be reinitialized at each new run) instead of just having it passed. We reuse the same settings as before (with a single training epoch) and wrap the `from_pretrained` call in a `model_init` function:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
batch_size = 8
num_labels = 5

In [None]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = "test-results-concat",
    seed = 123, 
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    eval_steps = 5000,
    save_steps = 5000,
    fp16 = True

)

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
 
    predictions = np.argmax(predictions, axis=1)

    return metric.compute(predictions=predictions, references=labels)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset= encoded_train_dataset, 
    eval_dataset=encoded_val_dataset,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


The method we call this time is `hyperparameter_search`. Note that it can take a long time to run on the full dataset. You can try to find some good hyperparameters on a portion of the training dataset by replacing the `train_dataset` line above by:
```python
train_dataset = encoded_train_dataset.shard(index=1, num_shards=10)
```
for 1/10th of the training data used here. Then you can run a full training on the best hyperparameters picked by the search.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=5, direction="maximize")

[I 2021-01-27 21:24:48,534] A new study created in memory with name: no-name-14608e42-17f5-42c4-95b9-d546cf4ca8a9

| Step | Training Loss | Validation Loss | Accuracy | Runtime | Samples Per Second |
|---|---|---|---|---|---|
| 5000 | 1.0049 | 1.000911 | 0.575 | 15.9744 | 375.601 |
| 10000 | 0.9319 | 0.959235 | 0.591333 | 15.4717 | 387.806 |
| 15000 | 0.9378 | 0.935104 | 0.593 | 15.4812 | 387.567 |
| 20000 | 0.8733 | 0.936651 | 0.597667 | 15.627 | 383.952 |
| 25000 | 0.8601 | 0.935514 | 0.598167 | 16.186 | 370.69 |
| 30000 | 0.861 | 0.929796 | 0.601333 | 15.6816 | 382.613 |


[I 2021-01-27 22:27:06,422] Trial 0 finished with value: 398.89593333333335 and parameters: {'learning_rate': 8.898353327747936e-06, 'num_train_epochs': 2, 'seed': 33, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 398.89593333333335.

Step,Training Loss,Validation Loss


[W 2021-01-27 22:29:07,305] Trial 1 failed because of the following error: RuntimeError('CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.75 GiB total capacity; 14.14 GiB already allocated; 58.88 MiB free; 14.48 GiB reserved in total by PyTorch)',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 198, in _run_trial
    value_or_values = func(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 134, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 888, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1259, in training_step
    self.scaler.scale(loss).backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, re

RuntimeError: ignored

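The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective that was maximized (by default the sum of all metrics) and the hyperparameters used for that run. In the log above, Trial 0's value of 398.896 is the final accuracy (0.601333) plus the evaluation runtime (15.6816) and the samples per second (382.613), so it is not a pure accuracy score.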

In [None]:
best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
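
As a minimal sketch of both customizations, the block below defines an objective that returns only the validation accuracy and a narrower search space. It rests on several assumptions: the Optuna backend is in use (as the study log above suggests), the accuracy is reported under the key `eval_accuracy`, and the names `accuracy_objective` and `small_batch_hp_space` as well as the value ranges are hypothetical and not tuned. Restricting the batch size is only a guess at what might avoid the out-of-memory failure in Trial 1, since the log does not show that trial's parameters.

```python
def accuracy_objective(metrics):
    """Return only the validation accuracy as the value to maximize.

    Assumes the evaluation metrics dict uses the key "eval_accuracy";
    adjust the key if your metrics are named differently.
    """
    return metrics["eval_accuracy"]


def small_batch_hp_space(trial):
    """Hypothetical Optuna search space; ranges are illustrative, not tuned."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        # Small batch sizes only, to lower the risk of CUDA out-of-memory errors.
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8, 16]
        ),
    }


# Metrics from the last evaluation of Trial 0 in the table above.
trial_0_metrics = {
    "eval_accuracy": 0.601333,
    "eval_runtime": 15.6816,
    "eval_samples_per_second": 382.613,
}
print(accuracy_objective(trial_0_metrics))  # prints 0.601333

# In the notebook, both functions would be passed to the search like this:
# best_run = trainer.hyperparameter_search(
#     n_trials=5,
#     direction="maximize",
#     hp_space=small_batch_hp_space,
#     compute_objective=accuracy_objective,
# )
```

With an objective like this, the value reported for each trial is directly comparable to the Accuracy column instead of being dominated by throughput.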

To reproduce the best run, set its hyperparameters on the `TrainingArguments` of your `Trainer` and train again:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
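
The loop above sets each attribute blindly, so a misspelled hyperparameter name would silently create a new, unused attribute. A slightly more defensive variant is sketched below. The helper `apply_hyperparameters` and the `ToyArgs` class are hypothetical names introduced for this illustration; `ToyArgs` merely stands in for `TrainingArguments` so the example runs without a `Trainer`.

```python
from dataclasses import dataclass


def apply_hyperparameters(args, hyperparameters):
    """Copy each hyperparameter onto args, rejecting names args does not have."""
    for name, value in hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"unknown training argument: {name!r}")
        setattr(args, name, value)
    return args


@dataclass
class ToyArgs:
    """Hypothetical stand-in for TrainingArguments, used only for this demo."""
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    seed: int = 42
    per_device_train_batch_size: int = 16


# Hyperparameters reported for Trial 0 in the log above.
trial_0 = {
    "learning_rate": 8.898353327747936e-06,
    "num_train_epochs": 2,
    "seed": 33,
    "per_device_train_batch_size": 8,
}
toy_args = apply_hyperparameters(ToyArgs(), trial_0)
print(toy_args)
```

In the notebook the equivalent call would be `apply_hyperparameters(trainer.args, best_run.hyperparameters)` followed by `trainer.train()`.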

Don't forget to [upload your model](https://huggingface.co/transformers/model_sharing.html) to the [🤗 Model Hub](https://huggingface.co/models). You can then use it to generate results like the one shown in the first picture of this notebook!