If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [1]:
# ! pip install datasets transformers

Collecting datasets
  Downloading datasets-1.14.0-py3-none-any.whl (290 kB)
Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.10.1-py3-none-any.whl (125 kB)
Collecting huggingface-hub<0.1.0,>=0.0.19
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). To do so, execute the following cell and input your username and password:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS. Uncomment the following instructions:

In [3]:
# !apt install git-lfs

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [4]:
import transformers

print(transformers.__version__)

4.11.3


# Fine-tuning a model on a multi-label text classification task

This notebook is heavily inspired by and adapted from the [text_classification.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb) notebook, which contains a detailed example of text classification on the GLUE Benchmark.

This notebook focuses on fine-tuning one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a **multi-label** text classification task, that is, a task where each example is annotated with one or more labels (each label is independent of the other labels).

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has:
- a version with a classification head,
- the `problem_type` configuration attribute (see [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig)).

Setting `problem_type="multi_label_classification"` configures the model for the multi-label task by using the appropriate loss function for training, namely the binary cross-entropy between the targets and the sigmoid of the logits ([BCEWithLogitsLoss](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)).
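To make the loss more concrete, here is a minimal NumPy sketch of a binary cross-entropy computed directly from logits. The function name `bce_with_logits` is ours (it is not a library function), and we assume a mean reduction over all elements, which is the default of `BCEWithLogitsLoss`. The naive formula is only included to check the stable one on a tiny example.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (illustrative helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x):
    # max(x, 0) - x*y + log(1 + exp(-|x|))
    per_element = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return per_element.mean()

# One example with three independent labels.
logits = np.array([[2.0, -1.0, 0.0]])
targets = np.array([[1.0, 0.0, 1.0]])
loss = bce_with_logits(logits, targets)

# Same quantity via the textbook formula (fine here, but less stable for large |logits|).
probs = 1.0 / (1.0 + np.exp(-logits))
naive = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)).mean()
print(loss, naive)
```

Each label contributes its own independent binary term, which is why the targets must be a multi-hot float array rather than a single class id.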

Here we picked the [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) checkpoint. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors.

In [5]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data. A list of datasets suitable for multi-label classification is available in the [Dataset Hub](https://huggingface.co/datasets?task_ids=task_ids:multi-label-classification). 

We use the [GoEmotions](https://huggingface.co/datasets/go_emotions) dataset, which contains 58k carefully curated Reddit comments in English labelled for 27 emotion categories or Neutral. Let's use the `simplified` version of the dataset, which includes train/validation/test splits with 43,410, 5,426, and 5,427 examples respectively.

For a quick execution of the whole notebook, let's only load a fraction of the dataset, for example 20%, by setting `dataset_pct=20`. Also, let's load only the train and test split (let's exclude the validation split). More info on how to tweak datasets loading in the [docs](https://huggingface.co/docs/datasets/loading.html). 
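As a small illustration of the split slicing syntax, the helper below builds the split string from a percentage. The name `split_spec` is hypothetical (it is not part of 🤗 Datasets); the returned string is what would be passed as the `split` argument of `load_dataset`.

```python
def split_spec(split, pct=None):
    """Build a split string such as 'train[:20%]'; a falsy pct selects the whole split."""
    if not pct:
        return split
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    return f"{split}[:{pct}%]"

print(split_spec("train", 20))  # train[:20%]
print(split_spec("test"))       # test
```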


In [6]:
# Percentage of each split to load, e.g. dataset_pct=20 loads the first 20%.
dataset_pct = 20
# dataset_pct = None  # uncomment to load the full splits

In [7]:
from datasets import load_dataset, DatasetDict

if not dataset_pct:
    dataset_pct = 100

raw_datasets = DatasetDict(
    {
        "train": load_dataset("go_emotions", "simplified", split=f"train[:{dataset_pct}%]"),
        "test": load_dataset("go_emotions", "simplified", split=f"test[:{dataset_pct}%]"),
    }
)
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 8682
    })
    test: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 1085
    })
})

[`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) behaves like a dictionary and contains one key for each split, here the training and test sets.

To get a sense of what the data looks like, the following function shows some examples picked randomly from the dataset.

In [8]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace class ids by their names for plain ClassLabel columns.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])

| | text | labels | id |
|---|---|---|---|
| 0 | Be careful of ticks tho | [5] | eedqh24 |
| 1 | I've never been so proud of humanity. | [21, 27] | ed0cja3 |
| 2 | Which is bizarre because [NAME] actually dubbed those scenes back in the day, but they apparently never used them ever. | [6] | ednn0zw |
| 3 | Guilty of doing this tbph | [24] | efgbkf4 |
| 4 | Links above suggest a "polio-like" outbreak occurred. Perhaps on this issue like many others she is a bit confused. | [6] | edp15s2 |
| 5 | I think he’s toast in 2020 with a solid candidate | [27] | ed7dq9g |
| 6 | Power trip blindness is scary; how someone can't see the obvious aftermath of acting like this with their employees? | [6, 14] | ef3gg6j |
| 7 | I think you both lose because they're at .500. I hope you both enjoy your Leafs pint glasses! Edit: thanks for Reddit Silver mysterious stranger! | [27] | eeg2uis |
| 8 | What do you mean strap around the bulls nuts?!,!? Sorry not into rodeo things | [24] | ee7ay1h |
| 9 | Omw. When my man sends me that I get all warm and excited cuz I know I get to spends the next couple hours or days with him | [13] | ee00v0l |


We will also use the 🤗 Datasets library to get the metric we need to use for evaluation (to compare our model to the benchmark). 

Let's use the F1 score for this. We call the `load_metric` function, which returns an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric).

In [9]:
from datasets import load_metric

f1_metric = load_metric("f1", "multilabel")
f1_metric

Metric(name: "f1", features: {'predictions': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'references': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None)}, usage: """
Args:
    predictions: Predicted labels, as returned by a model.
    references: Ground truth labels.
    labels: The set of labels to include when average != 'binary', and
        their order if average is None. Labels present in the data can
        be excluded, for example to calculate a multiclass average ignoring
        a majority negative class, while labels not present in the data will
        result in 0 components in a macro average. For multilabel targets,
        labels are column indices. By default, all labels in y_true and
        y_pred are used in sorted order.
    average: This parameter is required for multiclass/multilabel targets.
        If None, the scores for each class are returned. Otherwise, this
        determines the type of averaging performed on the 

Note that:

* when loading the metric, we specified that we want the `multilabel` configuration. This way, the metric instance expects the correct type for the predictions and references (`Sequence(Value(dtype='int32'))` instead of `Value(dtype='int32')`).


You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [10]:
import numpy as np

np.random.seed(0)
fake_preds = np.random.randint(0, 2, size=(64, 8))  # bs=64, multilabel with 8 labels per example
fake_labels = np.random.randint(0, 2, size=(64, 8))

f1_metric.compute(predictions=fake_preds, references=fake_labels, average="macro")

{'f1': 0.49895691050188085}

Note that:
* when calling the metric's `compute` method, we specified `average="macro"` to indicate that we want the macro-averaged F1 score (see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) for more details on F1 averaging).
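To see what macro averaging means in practice, here is a minimal NumPy sketch that computes both macro- and micro-averaged F1 on a multi-hot matrix. The helper `f1_scores` is ours, written for illustration only, and it assumes that a label with no true and no predicted positives gets an F1 of 0.

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for multi-hot arrays of shape (examples, labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0)    # true positives per label
    fp = (~y_true & y_pred).sum(axis=0)   # false positives per label
    fn = (y_true & ~y_pred).sum(axis=0)   # false negatives per label
    denom = 2 * tp + fp + fn
    # Per-label F1 = 2*TP / (2*TP + FP + FN), set to 0 when the denominator is 0.
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    macro = per_label.mean()              # every label weighs the same
    micro = 2 * tp.sum() / denom.sum() if denom.sum() > 0 else 0.0  # pooled counts
    return macro, micro

# A frequent label predicted perfectly and a rare label that is always missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
macro, micro = f1_scores(y_true, y_pred)
print(macro, micro)  # macro = 0.5, micro = 6/7 (about 0.857)
```

Because every label counts equally, a rare label that the model never predicts pulls the macro average down sharply, while the micro average barely moves.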

The [paper](https://arxiv.org/pdf/2005.00547v2.pdf) introducing the dataset reports results in macro-averaged F1. If we wanted to calculate more metrics at once (e.g. macro-F1, micro-F1, [nDCG](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html), [LRAP](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.label_ranking_average_precision_score.html), ...) we could create a custom Metric class as explained in the [docs](https://huggingface.co/docs/datasets/how_to_metrics.html).

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

The tokenizer can be called on one or several examples. In the case of several examples, it returns a list of lists for each key. For example, let's tokenize the text of the first 3 examples in the training set:

In [12]:
tokenizer(raw_datasets["train"][:3]["text"])

{'input_ids': [[101, 2026, 8837, 2833, 2003, 2505, 1045, 2134, 1005, 1056, 2031, 2000, 5660, 2870, 1012, 102], [101, 2085, 2065, 2002, 2515, 2125, 2370, 1010, 3071, 2097, 2228, 2002, 2015, 2383, 1037, 4756, 29082, 2007, 2111, 2612, 1997, 2941, 2757, 102], [101, 2339, 1996, 6616, 2003, 3016, 3238, 11163, 2075, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

So far so good, but how about the labels? 

In [13]:
raw_datasets["train"][10:13]["labels"]

[[6], [1, 4], [27]]

To train a multi-label model, we need to convert the `labels` dataset feature from a list of label ids per example into an (examples x labels) binary matrix indicating the presence of each label. Why? The model needs a tensor to calculate the training loss and the evaluation metrics.

For this we will use the [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from the `scikit-learn` library. As we already know that there are 28 classes in the data, we simply pass their ids `np.arange(0,28)` to the binarizer:

In [14]:
classes = raw_datasets['train'].features['labels'].feature
classes

ClassLabel(num_classes=28, names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], names_file=None, id=None)

In [15]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=np.arange(0,28), sparse_output=False).fit(None)  # need to call the fit method at least once to use transform method

Let's pass two dummy examples to the binarizer using its `transform` method. This transforms the list of class ids of each example into a binary array of length 28:

In [16]:
example_labels = mlb.transform([[0, 1, 123456], [27]])
print(example_labels.shape)
example_labels

(2, 28)



array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1]])

Note that the binarizer warns us that we passed a class (123456) that is not among the classes it was configured with. This is harmless, as the unknown class simply gets ignored.
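For intuition, the same multi-hot encoding can be sketched in a few lines of NumPy. The helper `multi_hot` below is hypothetical and only mimics the behaviour observed above, including the silent dropping of unknown class ids (without emitting a warning); it is not how `MultiLabelBinarizer` is implemented.

```python
import numpy as np

def multi_hot(label_lists, num_classes):
    """Turn lists of class ids into a (examples x num_classes) float32 binary matrix."""
    out = np.zeros((len(label_lists), num_classes), dtype=np.float32)
    for row, label_ids in enumerate(label_lists):
        # Keep only ids that are valid column indices; unknown ids are ignored.
        known = [i for i in label_ids if 0 <= i < num_classes]
        if known:
            out[row, known] = 1.0
    return out

encoded = multi_hot([[0, 1, 123456], [27]], num_classes=28)
print(encoded.shape)  # (2, 28)
```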

We can then write the function that will preprocess our samples for training. We need to both tokenize the text and create the binary label array.

* tokenizer: We just feed the example texts to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator), so that we pad examples to the longest length in the batch and not in the whole dataset.

* labels: Let's call `labels` the binary array created with the multi-label binarizer. This is the dataset feature that will be used later by the `Trainer` class.
    - *Important*: we cast this array to `np.float32` (rather than integer). Why? This is required by BCEWithLogitsLoss, the loss function used by models for the multi-label task.
    - We rename the original label lists to `labels_list` and keep them just for completeness.

To apply this preprocessing function, we just use the `map` method of the `raw_datasets` object we created earlier. This applies the function to all the elements of all the splits in `raw_datasets`, so our training and testing data are preprocessed in a single command.

In [17]:
raw_datasets = raw_datasets.rename_column("labels", "labels_list")

def preprocess_function(examples):
    out = tokenizer(examples["text"], truncation=True)
    out["labels"] = mlb.transform(examples["labels_list"]).astype(np.float32)  # must be float
    return out

processed_datasets = raw_datasets.map(preprocess_function, batched=True)

In [18]:
processed_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'labels_list', 'text'],
        num_rows: 8682
    })
    test: Dataset({
        features: ['attention_mask', 'id', 'input_ids', 'labels', 'labels_list', 'text'],
        num_rows: 1085
    })
})

Note that `processed_datasets` has, as expected, the new features `attention_mask`, `input_ids` and `labels`, while the original label lists are kept as `labels_list`.

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data must not be used). For instance, it will properly detect if you change the model checkpoint in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files. You can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
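To clarify what a batched function receives, here is a toy, pure-Python stand-in for a batched `map`. The helper `map_batched` is hypothetical and much simpler than the real `Dataset.map`; it only illustrates that the function is called on a dictionary of lists (one list per column) and must return a dictionary of lists. We assume a batch size of 1000, which is the default of `map`.

```python
import math

def map_batched(columns, function, batch_size=1000):
    """Toy batched map: call `function` on dict-of-lists slices and merge the new columns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict mapping column names to a slice of the column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    result.update(new_columns)
    return result

toy = {"text": ["a b", "c", "d e f"]}
out = map_batched(toy, lambda batch: {"n_words": [len(t.split()) for t in batch["text"]]}, batch_size=2)
print(out["n_words"])  # [2, 1, 3]

# With 8682 training rows and batches of 1000, the function is called 9 times.
num_batches = math.ceil(8682 / 1000)
```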

Also, let's double check that `labels` and `labels_list` are equivalent. For one example:

In [19]:
# check that the labels are correct
example_idx = 7
assert processed_datasets["train"][example_idx]['labels_list'] == list(np.array(processed_datasets["train"][example_idx]['labels']).nonzero()[0])

Finally, let's see how often an example has more than one label. First we create a dataset feature with the number of labels per example, and then we use pandas for basic stats:

In [20]:
processed_datasets = processed_datasets.map(lambda example: {'labels_len': [len(labels) for labels in example['labels_list']]}, batched=True)

In [21]:
pd.Series(processed_datasets['train']['labels_len']).value_counts(normalize=True).round(3)*100

1    83.4
2    15.2
3     1.3
4     0.1
5     0.0
dtype: float64

We note that 83% of the examples have a single label and that the maximum number of labels is 5, although examples with 5 labels are rare enough that their share rounds to 0.0%.

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sentence classification, we use the `AutoModelForSequenceClassification` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [22]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=classes.num_classes,
    problem_type="multi_label_classification",
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

The warning is telling us that we are throwing away some weights (the `vocab_transform`, `vocab_layer_norm` and `vocab_projector` layers) and randomly initializing some others (the `pre_classifier` and `classifier` layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modelling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

Note that when creating the model we specify:
* `num_labels=classes.num_classes` (28, as seen above)
* `problem_type="multi_label_classification"`

Note: while we do need to specify the number of labels in our problem, `problem_type="multi_label_classification"` is actually inferred when the `labels` feature is of dtype=float. However, for clarity it is best to set it explicitly.
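The inference rule can be sketched as follows. This is an illustration of the rule as we understand it, not the library's actual code: the function `infer_problem_type` is hypothetical, and it works on NumPy dtypes rather than on tensors.

```python
import numpy as np

def infer_problem_type(num_labels, labels_dtype):
    """Guess the problem type from the number of labels and the dtype of the labels."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and np.issubdtype(labels_dtype, np.integer):
        # Integer labels are read as one class id per example.
        return "single_label_classification"
    # Anything else (e.g. float multi-hot arrays) is treated as multi-label.
    return "multi_label_classification"

print(infer_problem_type(28, np.float32))  # multi_label_classification
print(infer_problem_type(28, np.int64))    # single_label_classification
```

This also shows why the cast to `np.float32` in the preprocessing step matters: with integer labels the problem type would be guessed differently.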

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [23]:
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-GoEmotions",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=f1_metric.name,
    # push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_for_best_model`) at the end of training.

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `f1_metric` we loaded earlier.

However, to calculate the F1 metric we need to apply a threshold to the logits and transform them into binary predictions. For example, one can apply the sigmoid function to squeeze the logits into the range [0, 1] and then optimise the threshold. To speed things up, let's skip the sigmoid and simply set prediction=1 when logit>0. Since sigmoid(0)=0.5, this is equivalent to applying the sigmoid and using threshold=0.5.
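The equivalence between thresholding logits at 0 and thresholding probabilities at 0.5 can be checked numerically. The sketch below uses random logits, and the helper `logit_threshold` is ours: it converts any probability threshold `t` into the matching logit threshold `log(t / (1 - t))`, which would be handy if we tuned the threshold later.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit_threshold(t):
    """Logit value at which sigmoid(logit) equals the probability threshold t."""
    return np.log(t / (1.0 - t))

rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 28))  # fake logits: 64 examples, 28 labels

preds_from_logits = np.where(logits > 0, 1, 0)
preds_from_probs = np.where(sigmoid(logits) > 0.5, 1, 0)
same = np.array_equal(preds_from_logits, preds_from_probs)
print(same, logit_threshold(0.5))  # True 0.0
```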

Remember that we have to set `average="macro"` to specify that we want the macro-averaged F1.

In [24]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.where(logits > 0, 1, 0)
    return f1_metric.compute(predictions=predictions, references=labels, average="macro")

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [25]:
trainer = Trainer(
    model,
    args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that does all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which receives samples like the dictionaries seen above and needs to return a dictionary of tensors.
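To illustrate what padding to the longest sample in a batch means, here is a pure-Python sketch. The function `pad_batch` is hypothetical; it assumes right padding and a pad token id of 0, and it returns plain lists instead of tensors. In the notebook this job is left to the tokenizer and the `Trainer`.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build the attention mask."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)   # pad on the right (assumption)
        attention_mask.append([1] * len(ids) + [0] * n_pad)    # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 2026, 102], [101, 2339, 1996, 6616, 102]])
print(padded["input_ids"][0])       # [101, 2026, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than over the whole dataset keeps batches of short texts short, which saves computation.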

We can now finetune our model by just calling the `train` method:

In [26]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_list, labels_len, id, text.
***** Running training *****
  Num examples = 8682
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1629


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.1998 | 0.136874 | 0.011721 |
| 2 | 0.1261 | 0.113641 | 0.105033 |
| 3 | 0.108 | 0.107329 | 0.130109 |


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_list, labels_len, id, text.
***** Running Evaluation *****
  Num examples = 1085
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-finetuned-GoEmotions/checkpoint-543
Configuration saved in distilbert-base-uncased-finetuned-GoEmotions/checkpoint-543/config.json
Model weights saved in distilbert-base-uncased-finetuned-GoEmotions/checkpoint-543/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-GoEmotions/checkpoint-543/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-GoEmotions/checkpoint-543/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_list, labels_len, id, text.
***** Running Evaluation *****
  Num example

TrainOutput(global_step=1629, training_loss=0.14461414531522765, metrics={'train_runtime': 257.0411, 'train_samples_per_second': 101.33, 'train_steps_per_second': 6.338, 'total_flos': 230402907031536.0, 'train_loss': 0.14461414531522765, 'epoch': 3.0})
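The step counts reported in the training log follow directly from the dataset size, the batch size and the number of epochs. Here is the arithmetic as a short sketch, assuming a single device and no gradient accumulation (as in the log above):

```python
import math

num_examples = 8682   # training examples (20% of the train split)
batch_size = 16
num_epochs = 3

# The last, smaller batch of an epoch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 543 1629
```

This matches the first checkpoint name (`checkpoint-543`, saved at the end of epoch 1) and the total of 1629 optimization steps.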

With the parameters set in this notebook we achieve macro-F1=0.13. This is not a huge score, but remember that we have been training on only a small fraction of the dataset and for only 3 epochs. Also, the training parameters have not been optimised.

For reference, in the dataset [paper](https://arxiv.org/pdf/2005.00547v2.pdf), the authors fine-tune BERT and obtain macro-F1=0.46 (Table 4).

Finally, we can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [27]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_list, labels_len, id, text.
***** Running Evaluation *****
  Num examples = 1085
  Batch size = 16


{'epoch': 3.0,
 'eval_f1': 0.13010872180230298,
 'eval_loss': 0.10732914507389069,
 'eval_runtime': 3.1906,
 'eval_samples_per_second': 340.062,
 'eval_steps_per_second': 21.313}

As expected, this shows the same F1 as the last epoch of the training loop above.