Make certain you are on a GPU Runtime first, by going to Runtime and selecting "Change Runtime Type", and then choosing Hardware Accelerator as GPU.

In this notebook, you will build an emotion classifier based on the Hugging Face GoEmotions dataset (`go_emotions`).

You will need to install 🤗 Transformers, numpy and 🤗 Datasets. Run the following three cells.

In [1]:
pip install datasets==1.3.0



In [2]:
pip install transformers==4.3.2



In [3]:
pip install numpy==1.20.1



In [4]:
from datasets import load_dataset
emotions_dataset = load_dataset('go_emotions', 'simplified')



Reusing dataset go_emotions (/root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)


To get a sense of what the data looks like, the following function 
will show some examples picked randomly in the dataset.

In [5]:

import datasets
import random
import pandas as pd
from IPython.display import display, HTML



In [6]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
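
The loop in `show_random_elements` keeps drawing until it finds an index it has not used yet. A shorter way to draw distinct indices is `random.sample`. The sketch below shows the idea on a bare index range; the helper name `pick_distinct_indices` and the `seed` parameter are illustrative and not part of the notebook.

```python
import random

def pick_distinct_indices(dataset_len, num_examples=10, seed=None):
    """Return num_examples distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(range(dataset_len), num_examples)

# 43410 is the size of the unfiltered training split
picks = pick_distinct_indices(43410, num_examples=10, seed=0)
print(picks)
```

The resulting list could be used in place of `picks` when indexing the dataset.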



In [7]:
# First, look at the full dataset before any filtering.
restricted_dataset = emotions_dataset
print(len(restricted_dataset['train']))
show_random_elements(restricted_dataset['train'])

# For simplicity, train on examples that have exactly one label, and drop 'neutral' (index 27).
restricted_dataset = emotions_dataset.filter(lambda x: len(x["labels"]) == 1 and 27 not in x['labels'])
print(len(restricted_dataset['train']))
show_random_elements(restricted_dataset['train'])


# The emotions are provided as numeric labels. These are the actual orderings, beginning at 0 for admiration:

labels = ["admiration", "amusement", "anger", "annoyance", "approval", "caring", 
          "confusion", "curiosity", "desire", "disappointment", "disapproval",
          "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
          "joy", "love", "nervousness", "optimism", "pride", "realization",
          "relief", "remorse", "sadness", "surprise", "neutral"]

          
index_to_labels = {index: label for index, label in enumerate(labels)}

print(index_to_labels)

43410


Unnamed: 0,id,labels,text
0,edybewm,[27],Were there actually tunnels though? I’ve been there and it was just the lower part of the foundation
1,ed4pzlr,[13],This is where the run starts boys!
2,eeq9vgg,[27],"Why bother getting upset when straight people stereotype gays, when gays willingly stereotype themselves. ;)"
3,efcx6lo,"[0, 18]",This is amazing. Love these edits.
4,edsr9m3,[27],Speaker hitting that juul
5,edqq77d,[4],I would have a similar reaction if she committed neck rope.
6,edn6y20,[1],Lol he said it was the driver's fault and everyone called him out.
7,ef7y1w9,[27],I think they're getting desperate from all of us cord cutters converting everyone to Hulu.
8,edlefk6,"[0, 15]",Great to hear something positive for once! Was half expecting a negative rant. Yay for [NAME] and congrats to you.
9,edw1c98,[7],So i'm intrigued how the fusion would be


Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-cfd0024a2b2436ff.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-568b10a7b4beb734.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-b4af6fb2549a8d0a.arrow


23485


Unnamed: 0,id,labels,text
0,efcv6jl,[0],"THIS is a good reply, and I have kids."
1,ef71i8d,[10],For her? Nah
2,edqsk5t,[20],I wished my mom protected me from my grandma. She was a horrible person who was so mean to me and my mom.
3,eeknrt3,[11],Ok but what if you hire him into a management position. He makes women feel uncomfortable and a discrimination suit waiting to happen.
4,ed36ryt,[7],How did that tree not grow?!
5,ees7pel,[2],Fuck you my fortnite k/d is better bitch
6,edzu8l6,[20],I’ll bet the Phillies owner had a beer with [NAME] high school baseball coach.
7,ee8tcks,[15],Thank you! I’ll give it a listen tonight😊
8,edpeyiw,[0],He was such a pet!! Beautiful looking dog
9,ed5judt,[2],How the fuck is that not ticketable.


{0: 'admiration', 1: 'amusement', 2: 'anger', 3: 'annoyance', 4: 'approval', 5: 'caring', 6: 'confusion', 7: 'curiosity', 8: 'desire', 9: 'disappointment', 10: 'disapproval', 11: 'disgust', 12: 'embarrassment', 13: 'excitement', 14: 'fear', 15: 'gratitude', 16: 'grief', 17: 'joy', 18: 'love', 19: 'nervousness', 20: 'optimism', 21: 'pride', 22: 'realization', 23: 'relief', 24: 'remorse', 25: 'sadness', 26: 'surprise', 27: 'neutral'}
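
To make the filter concrete, here is a small sketch that applies the same predicate to a few rows copied from the first table above and maps the surviving label indices to names. It works on plain dictionaries rather than a 🤗 `Dataset`, and the helper name `is_single_non_neutral` is only illustrative.

```python
NEUTRAL = 27  # index of "neutral" in the label list above

def is_single_non_neutral(example):
    # Same condition as the lambda passed to `filter` in the notebook
    return len(example["labels"]) == 1 and NEUTRAL not in example["labels"]

# A few (id, labels) pairs taken from the unfiltered sample shown earlier
rows = [
    {"id": "edybewm", "labels": [27]},
    {"id": "ed4pzlr", "labels": [13]},
    {"id": "efcx6lo", "labels": [0, 18]},
    {"id": "edqq77d", "labels": [4]},
]
# Only the label names needed for these rows
names = {0: "admiration", 4: "approval", 13: "excitement", 18: "love", 27: "neutral"}

kept = [r for r in rows if is_single_non_neutral(r)]
readable = [(r["id"], names[r["labels"][0]]) for r in kept]
print(readable)
```

After this filter, the remaining label indices run from 0 to 26, which is why the model below is configured with `num_labels=27`.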


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.
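
As a rough picture of what "tokenize and convert to IDs" means, here is a toy sketch. It is not the real tokenizer: it splits on whitespace instead of using subwords, and every ID in `TOY_VOCAB` is made up for illustration.

```python
# Toy vocabulary: all IDs here are invented, not taken from any checkpoint.
TOY_VOCAB = {"[CLS]": 2, "[SEP]": 3, "<unk>": 1, "i": 10, "am": 11, "sad": 12}

def toy_tokenize(text):
    # Wrap the lowercased words in start/end markers, as many models expect
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the unknown-token ID
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]
    # One attention-mask entry per real (non-padding) token
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_tokenize("I am very sad")
print(encoded)
```

The real tokenizer returns the same kind of dictionary, with additional keys such as `token_type_ids` depending on the model.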

In [8]:
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# model_name = 'distilbert-base-uncased'
# config = AutoConfig.from_pretrained(model_name, num_labels=27)
# model = AutoModelForSequenceClassification.from_config(config=config)
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)



In [9]:
# from transformers import AutoConfig, BertTokenizer, BertForSequenceClassification
# import torch
# model_name = 'bert-base-uncased'
# config = AutoConfig.from_pretrained(model_name, num_labels=27)
# model = AutoModelForSequenceClassification.from_config(config=config)
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


In [18]:
# from transformers import XLNetTokenizer, XLNetForSequenceClassification
import torch
model_name = 'albert-base-v2'
config = AutoConfig.from_pretrained(model_name, num_labels=27)
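# Note: from_config builds the architecture with randomly initialized weights.
# AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
# would load the pretrained checkpoint instead.
model = AutoModelForSequenceClassification.from_config(config=config)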
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error from the previous call, remove that argument.


In [19]:
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)

def convert_labels_to_int(example):
    example['labels'] = example['labels'][0]
    return example

encoded_dataset = restricted_dataset.map(preprocess_function, batched=True)
encoded_dataset = encoded_dataset.map(convert_labels_to_int)





Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-6dec2d21a9370f80.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-3ec9ff2db52ff860.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-e6af35d09e6f7de1.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-272e2a83000be2be.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e/cache-a04241fdfde37972.arrow
Loading cached processed dataset at




In [20]:
show_random_elements(encoded_dataset['train'])

Unnamed: 0,attention_mask,id,input_ids,labels,text,token_type_ids
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",ef3sq5i,"[2, 13, 1, 82, 1890, 14, 2794, 16, 98, 31, 107, 20, 164, 21, 695, 19, 14, 5190, 48, 25, 1026, 20251, 3]",10,> its exactly the opposite of what I do to get a girl in the mood This is super creepy,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",ed5q1op,"[2, 14, 7405, 12147, 50, 51, 954, 15, 2907, 242, 3]",2,"The shipping container are my friends, damnit","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",efblc1k,"[2, 32, 550, 1364, 55, 1026, 1700, 76, 31, 196, 3217, 101, 48, 9, 162, 515, 154, 246, 201, 695, 187, 3]",17,It always makes me super happy when I see stuff like this. Go live your best life girl!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
3,"[1, 1, 1, 1, 1, 1, 1, 1]",ef013vg,"[2, 48, 1244, 1059, 22509, 18, 9, 3]",1,This guy sports entertains.,"[0, 0, 0, 0, 0, 0, 0, 0]"
4,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",eep58aj,"[2, 86, 98, 60, 14, 13, 23281, 1437, 22, 38, 9, 259, 9, 91, 9, 391, 9, 3]",3,So what? The LW doesn't. want. more. children.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
5,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",ede9i41,"[2, 30, 1, 18, 32, 2776, 15, 95, 1, 99, 19941, 130, 3]",9,"That’s it guys, we’re racist now","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
6,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",ee9nc9z,"[2, 13, 1, 2558, 7259, 500, 53, 16, 14, 127, 4523, 139, 10874, 1650, 31, 22, 195, 462, 752, 16, 45, 14, 1641, 378, 44, 19, 15903, 781, 17, 14, 1687, 6027, 19, 5734, 13, 11947, 9, 3]",14,>[NAME] One of the most horrifying stories I've ever heard of: the mom should be in psychiatric care and the doctor thrown in jail IMO.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
7,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",eczt3ye,"[2, 442, 27, 109, 683, 20883, 5156, 103, 115, 946, 24740, 18, 15, 30, 22, 211, 767, 14, 1144, 29, 452, 20, 1623, 510, 9309, 2346, 3]",12,"put on some real filthy porn before running errands, that'll leave the parents with having to answer really awkward questions","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
8,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",eehez9u,"[2, 83, 42, 1, 195, 174, 330, 29, 36, 100, 42, 1, 43, 167, 115, 3203, 60, 90, 17872, 15, 114, 7686, 9, 3]",7,"Would you’ve still got with her if you’d known beforehand? No judgement, just curious.","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
9,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",edm92vo,"[2, 2973, 42, 494, 184, 2950, 39, 25, 37, 36, 276, 60, 3]",25,Cant you tell how sad she is from her face?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## Fine-tuning the model

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:


In [21]:
from transformers import Trainer, TrainingArguments
metric_name = "accuracy"

args = TrainingArguments(
      "test-emotions",
      evaluation_strategy = "epoch",
      learning_rate=0.00001,
      per_device_train_batch_size=32,
      per_device_eval_batch_size=32,
      num_train_epochs=5,
      weight_decay=0.01,
      load_best_model_at_end=True,
      metric_for_best_model=metric_name,
  )
# args = TrainingArguments(
#     "test-emotions",
#     evaluation_strategy = "epoch",
#     learning_rate=2e-5,
#     per_device_train_batch_size=32,
#     per_device_eval_batch_size=32,
#     num_train_epochs=2,
#     weight_decay=0.01,
#     load_best_model_at_end=True,
#     metric_for_best_model=metric_name,
# )

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, set the per-device training and evaluation batch sizes to 32, and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
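
The idea behind loading the best model can be sketched in a few lines: record the metric after each epoch and keep the epoch with the best value. This is only an illustration of the selection rule, not how `Trainer` is implemented; the function name `best_checkpoint` and the accuracy values are invented.

```python
def best_checkpoint(metrics_per_epoch, greater_is_better=True):
    """Return the epoch whose metric is best (highest or lowest)."""
    chooser = max if greater_is_better else min
    # Iterating a dict yields its keys; compare them by their metric value
    return chooser(metrics_per_epoch, key=metrics_per_epoch.get)

# Made-up accuracies where the last epoch is not the best one
accuracy_per_epoch = {1: 0.41, 2: 0.47, 3: 0.49, 4: 0.48}
best_epoch = best_checkpoint(accuracy_per_epoch)
print(best_epoch)
```

For a metric such as loss, where lower is better, the same helper would be called with `greater_is_better=False`.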


In [22]:

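import numpy as np  # needed by compute_metrics below

metric = datasets.load_metric('accuracy')
def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the reference labels
    predictions, labels = eval_pred
    # the predicted class is the index of the largest logit
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)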

# Then we just need to pass all of this along with our datasets to the `Trainer`:

validation_key = "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
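
To see what `compute_metrics` calculates, here is a dependency-free sketch of the same computation: take the argmax of each row of logits and compare it with the reference label. The logits and labels are toy values, and the helper names are illustrative.

```python
def argmax(values):
    # Index of the largest entry in a list
    return max(range(len(values)), key=values.__getitem__)

def accuracy(logits, labels):
    # Predicted class per example is the position of its largest logit
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four toy examples over three classes
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 2, 1]
result = accuracy(toy_logits, toy_labels)
print(result)
```

Three of the four predictions match their labels here, so the accuracy is 0.75.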



You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples as dictionaries like the ones seen above and will need to return a dictionary of tensors.
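
As a rough sketch of what padding a batch involves, the function below pads every sample to the length of the longest one and extends the attention mask with zeros. It returns plain lists rather than tensors, and both the name `pad_batch` and the default `pad_token_id=0` are assumptions made for the illustration; a real collator would take these from the tokenizer.

```python
def pad_batch(samples, pad_token_id=0, padding_side="right"):
    """Pad input_ids and attention_mask of each sample to a common length."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for s in samples:
        n_pad = max_len - len(s["input_ids"])
        ids_pad, mask_pad = [pad_token_id] * n_pad, [0] * n_pad
        if padding_side == "right":
            batch["input_ids"].append(s["input_ids"] + ids_pad)
            batch["attention_mask"].append(s["attention_mask"] + mask_pad)
        else:
            batch["input_ids"].append(ids_pad + s["input_ids"])
            batch["attention_mask"].append(mask_pad + s["attention_mask"])
    return batch

# Two encoded samples copied from the table above (8 and 11 tokens long)
samples = [
    {"input_ids": [2, 48, 1244, 1059, 22509, 18, 9, 3], "attention_mask": [1] * 8},
    {"input_ids": [2, 14, 7405, 12147, 50, 51, 954, 15, 2907, 242, 3], "attention_mask": [1] * 11},
]
batch = pad_batch(samples)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

The zeros in the attention mask tell the model to ignore the padded positions.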

We can now fine-tune our model by just calling the `train` method:

In [23]:
import numpy as np
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
1,2.8158,2.322605,0.348444,7.5795,390.002
2,2.3417,2.135118,0.420501,7.6144,388.213
3,2.0118,1.941611,0.469215,7.6344,387.195
4,1.9112,1.886793,0.483424,7.641,386.861
5,1.7992,1.859946,0.493234,7.6669,385.553


TrainOutput(global_step=3670, training_loss=2.1124680105931755, metrics={'train_runtime': 877.0401, 'train_samples_per_second': 4.185, 'total_flos': 302871959388666, 'epoch': 5.0})
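
As a sanity check on `global_step=3670`, the number of optimizer steps can be estimated from the training-set size, the batch size, and the number of epochs. The sketch assumes a single GPU and no gradient accumulation.

```python
import math

train_examples = 23485  # size of the filtered training split printed earlier
batch_size = 32         # per_device_train_batch_size, assuming a single device
epochs = 5              # num_train_epochs

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the estimate is 734 steps per epoch and 3670 steps in total, which matches the reported `global_step`.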

In [24]:


# We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one). We also run a sample prediction to demonstrate the API:

trainer.evaluate()

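# Tokenize one sentence with the regular tokenizer call;
# prepare_seq2seq_batch is meant for sequence-to-sequence models.
prepared_input = tokenizer(["I am  sad"], return_tensors='pt')
model = model.to('cpu')
model.eval()
with torch.no_grad():  # inference only, so skip gradient tracking
    model_output = model(**prepared_input)
prediction = int(np.argmax(model_output.logits[0].numpy()))
index_to_labels[prediction]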



'sadness'
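
The argmax only gives the single most likely label. To see how confident the model is, the logits can be turned into probabilities with a softmax and ranked. The sketch below does this in plain Python on made-up logits for three labels; the helper names `softmax` and `top_k` are illustrative.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, names, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(names[i], probs[i]) for i in ranked[:k]]

# Invented logits for three labels, not produced by the trained model
toy_names = ["joy", "sadness", "anger"]
toy_logits = [0.2, 2.1, -0.5]
ranking = top_k(toy_logits, toy_names, k=2)
print(ranking)
```

With the real model, `model_output.logits[0].tolist()` and the `labels` list would take the place of the toy values.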