In this notebook we repeat the sentiment classification task, this time using a transformer.

We will use the well-known BERT model (Bidirectional Encoder Representations from Transformers) to perform sentiment classification.

The model is available in the Hugging Face transformers library, which provides state-of-the-art pretrained models for NLP tasks.

For this notebook we will use the model "bert-base-cased", a case-sensitive model that was pretrained on a large corpus of English text.

In [1]:
# !pip install -U transformers accelerate datasets

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
import torch
from tqdm import tqdm
import numpy as np

# Dataset

We are going to use the same dataset as in the previous notebook: IMDB.

In [3]:
dados_imdb = load_dataset('imdb')

In [4]:
dados_imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

When working with Hugging Face models, we need to tokenize the data with the tokenizer that belongs to the model we are going to use.

We are going to use the pretrained model bert-base-cased for our classifier.

We can use AutoTokenizer to automatically download the matching tokenizer.

In [5]:
# get the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

The tokenizer from the model bert-base-cased is a WordPiece tokenizer, which means that it will split the words into subwords.

It has already been trained on a large corpus of text and it has a fixed vocabulary.
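
To make the idea of subword splitting concrete, here is a minimal sketch of the greedy longest-match-first procedure that WordPiece-style tokenizers apply to a single word. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and written only for illustration; the real tokenizer uses BERT's full vocabulary and also handles casing, punctuation, and whitespace.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than BERT's real one.
toy_vocab = {"token", "##izer", "##s", "un", "##believ", "##able", "input"}

print(wordpiece_tokenize("tokenizers", toy_vocab))    # ['token', '##izer', '##s']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because the vocabulary is fixed, a word that is missing from it is not rejected outright: it is broken into known pieces where possible, and only falls back to the unknown token when no split exists.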

We can use the tokenizer to tokenize the data, convert the tokens to ids, add the special tokens [CLS] and [SEP] to the data, and pad the data to the same length.

In [6]:
tokenizer.tokenize('This is an input example')

['This', 'is', 'an', 'input', 'example']

In [7]:
tokenizer.encode('This is an input example')

[101, 1188, 1110, 1126, 7758, 1859, 102]

In [8]:
print(tokenizer.convert_ids_to_tokens(101))
print(tokenizer.convert_ids_to_tokens(102))

[CLS]
[SEP]


Remember that now the tokens might not be words. They can be subwords or even characters. Let's see the tokens for the example we saw before.

In [9]:
inputs = tokenizer('This is an input example', return_tensors='pt')
print(inputs)

{'input_ids': tensor([[ 101, 1188, 1110, 1126, 7758, 1859,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


In [10]:
for tk_id in inputs.input_ids:
    tk_str = tokenizer.convert_ids_to_tokens(tk_id)
    print(tk_str)

['[CLS]', 'This', 'is', 'an', 'input', 'example', '[SEP]']


Note that the sentence is already framed between the special tokens [CLS] and [SEP]. The tokenizer also defines a special padding token [PAD] and a special token for unknown words [UNK].
We can see the special tokens already defined in the tokenizer.

In [11]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [12]:
inputs = tokenizer("This is a test.", truncation=True, max_length=10, padding="max_length")
for tk_id in inputs.input_ids:
    tk_str = tokenizer.convert_ids_to_tokens(tk_id)
    print(tk_str)

[CLS]
This
is
a
test
.
[SEP]
[PAD]
[PAD]
[PAD]


Now we can create the tokenized train and test sets by applying the tokenizer to all the reviews.

In [13]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=500)

In [14]:
dados_imdb = dados_imdb.map(tokenize_function, batched=True)
dados_imdb = dados_imdb.with_format("torch")


In [15]:
dados_imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [16]:
batch_size = 8 # number of sequences in each batch
train_dataloader = DataLoader(dados_imdb['train'], shuffle=True, batch_size=batch_size)
test_dataloader = DataLoader(dados_imdb['test'], batch_size=batch_size)

# Model and training

We can now load the model. We are going to use BertModel from the transformers library, but we could also use AutoModel.

In [17]:
from transformers import BertModel

In [18]:
model = BertModel.from_pretrained('bert-base-cased').to('cuda')
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

Get the last hidden state of the [CLS] token from the model and pass it to a linear layer to get the logits.

In [22]:
input = dados_imdb['train'][0]
model.eval()
with torch.no_grad():
    output = model(
                    input_ids=input['input_ids'].unsqueeze(0).to('cuda'), 
                    attention_mask=input['attention_mask'].unsqueeze(0).to('cuda'),
                    output_hidden_states=True # we set this to True to get the hidden states
                    )

When "output_hidden_states" is set to True, the model returns the hidden states of every layer. In this case that is 13 tensors: the output of the embedding layer plus one for each of the 12 encoder layers.

We can use this to get the hidden state of the last layer and pass it to a general classifier.
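
The relation between the number of layers and the number of hidden states can be checked on a much smaller model. The sketch below builds a tiny, randomly initialized BERT from a hypothetical configuration (2 layers, hidden size 32), so nothing is downloaded and the values themselves are meaningless. Only the shapes and counts are of interest.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration, chosen only so the model builds instantly.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config)  # random weights, for illustration only
tiny_model.eval()

input_ids = torch.tensor([[1, 5, 6, 7, 2]])  # arbitrary ids inside the toy vocabulary
with torch.no_grad():
    out = tiny_model(input_ids=input_ids, output_hidden_states=True)

# One entry for the embedding output plus one per encoder layer.
print(len(out.hidden_states))
# Each entry has shape (batch size, sequence length, hidden size).
print(out.hidden_states[-1].shape)
# The final entry is the same tensor that last_hidden_state exposes.
print(torch.equal(out.hidden_states[-1], out.last_hidden_state))
```

With bert-base-cased the same reasoning gives 12 + 1 = 13 entries, each of hidden size 768.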

In [None]:
len(output.hidden_states)

13

In [None]:
print(output.hidden_states[-1].shape)
print(output.hidden_states[-1])

torch.Size([1, 500, 768])
tensor([[[ 0.4786, -0.2031, -0.3735,  ..., -0.2058,  0.1644,  0.2879],
         [ 0.4285, -0.6476,  0.4748,  ..., -0.1825, -0.0327,  0.2230],
         [ 0.5163,  0.1242,  0.0434,  ...,  0.4581, -0.0937, -0.0633],
         ...,
         [ 0.0302, -0.1422, -0.0431,  ...,  0.3726, -0.2235,  0.0446],
         [-0.1261, -0.1953, -0.0077,  ...,  0.3068, -0.2196, -0.0858],
         [-0.0613, -0.1139, -0.0212,  ...,  0.3339, -0.2237,  0.0156]]])


The last hidden state is also available through the attribute "last_hidden_state".

In [None]:
print(output.last_hidden_state.shape)
output.last_hidden_state

torch.Size([1, 500, 768])


tensor([[[ 0.4786, -0.2031, -0.3735,  ..., -0.2058,  0.1644,  0.2879],
         [ 0.4285, -0.6476,  0.4748,  ..., -0.1825, -0.0327,  0.2230],
         [ 0.5163,  0.1242,  0.0434,  ...,  0.4581, -0.0937, -0.0633],
         ...,
         [ 0.0302, -0.1422, -0.0431,  ...,  0.3726, -0.2235,  0.0446],
         [-0.1261, -0.1953, -0.0077,  ...,  0.3068, -0.2196, -0.0858],
         [-0.0613, -0.1139, -0.0212,  ...,  0.3339, -0.2237,  0.0156]]])

In [None]:
input['input_ids'][0]

tensor(101)

In [None]:
tokenizer.convert_ids_to_tokens(101)

'[CLS]'

We can now use the last hidden state of the model to train a classifier. Let's train a simple logistic regression classifier on it.

We start by collecting the last hidden state for every example. We take the vector at the [CLS] position and use it as the representation of the whole review.
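
Selecting the [CLS] vector is plain array slicing along the sequence axis. The sketch below uses a random NumPy array as a stand-in for a batch of last hidden states (the array is an assumption made for illustration, not real model output) to show how the shape changes.

```python
import numpy as np

batch_size, seq_len, hidden_size = 8, 500, 768

# Random stand-in for output.last_hidden_state of one batch.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 of every sequence is the [CLS] token.
cls_vectors = last_hidden_state[:, 0, :]
print(cls_vectors.shape)  # (8, 768)
```

Each review is therefore reduced from 500 token vectors to a single 768-dimensional vector, which is what a classical classifier can consume.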

In [19]:
# Let's start from the train dataset
model.eval()
last_h_list = []
with torch.no_grad():
    for i,batch in enumerate(tqdm(train_dataloader)):
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        output = model(
                    input_ids=input_ids, 
                    attention_mask=attention_mask,
                    output_hidden_states=True # we set this to True to get the hidden states
                    )
        last_h = output.last_hidden_state[:,0,:].detach().cpu().numpy()
        last_h_list.extend(last_h)
last_h_train_array = np.array(last_h_list)

100%|██████████| 3125/3125 [03:21<00:00, 15.52it/s]


In [20]:
last_h_train_array.shape

(25000, 768)

In [21]:
# Now the test dataset
model.eval()
last_h_list = []
with torch.no_grad():
    for i,batch in enumerate(tqdm(test_dataloader)):
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        output = model(
                    input_ids=input_ids, 
                    attention_mask=attention_mask,
                    output_hidden_states=True # we set this to True to get the hidden states
                    )
        last_h = output.last_hidden_state[:,0,:].detach().cpu().numpy()
        last_h_list.extend(last_h)
last_h_test_array = np.array(last_h_list)

100%|██████████| 3125/3125 [03:21<00:00, 15.49it/s]


In [22]:
last_h_test_array.shape

(25000, 768)

In [23]:
# Labels for train and test
y_train = dados_imdb['train']['label'].detach().cpu().numpy()
y_test = dados_imdb['test']['label'].detach().cpu().numpy()

We can now train a simple Logistic Regression classifier on the last hidden state of the model.

In [None]:
#####
## Exercise: Code a logistic regression model using the last hidden state of the BERT model
#####
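
One possible way to approach this exercise is with scikit-learn's `LogisticRegression`. The sketch below is self-contained, so it trains on synthetic features that stand in for `last_h_train_array` and `last_h_test_array`. The synthetic data, the names ending in `_toy`, and the dimension of 16 are all assumptions for illustration. In the notebook you would pass the real arrays and the labels `y_train` and `y_test` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_per_class, dim = 200, 16  # toy sizes; the real features are (25000, 768)


def make_split():
    """Two Gaussian blobs standing in for negative and positive reviews."""
    neg = rng.normal(loc=0.0, size=(n_per_class, dim))
    pos = rng.normal(loc=1.0, size=(n_per_class, dim))
    features = np.vstack([neg, pos])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return features, labels


X_train_toy, y_train_toy = make_split()
X_test_toy, y_test_toy = make_split()

# max_iter is raised because the default may not converge on high-dimensional features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_toy, y_train_toy)

toy_accuracy = accuracy_score(y_test_toy, clf.predict(X_test_toy))
print(f"toy accuracy: {toy_accuracy:.3f}")
```

Note that BERT itself stays frozen in this setup: only the logistic regression weights are learned, which is why this approach is much cheaper than end-to-end training.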






# Train end-to-end

Let's now train the model end-to-end. We are going to use BertForSequenceClassification, which is a BertModel with a classification head on top for sequence classification. We use AutoModelForSequenceClassification to load it automatically from the checkpoint name.
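
To see what the classification head looks like without downloading anything, the sketch below instantiates BertForSequenceClassification from a hypothetical tiny configuration with random weights. The configuration values and input ids are assumptions chosen for speed; only the structure and output shape carry over to bert-base-cased.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Hypothetical tiny configuration; num_labels=2 matches binary sentiment.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
tiny_clf = BertForSequenceClassification(config)  # random weights, illustration only
tiny_clf.eval()

# Two toy sequences; the second one ends with a padded position.
input_ids = torch.tensor([[1, 5, 6, 7, 2], [1, 8, 9, 2, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]])

with torch.no_grad():
    out = tiny_clf(input_ids=input_ids, attention_mask=attention_mask)

print(tiny_clf.classifier)  # the linear layer added on top of the encoder
print(out.logits.shape)     # one row per sequence, one column per label
```

The head is a single linear layer from the hidden size to the number of labels, which is why the loading message for the real checkpoint reports `classifier.weight` and `classifier.bias` as newly initialized.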

In [25]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2).to('cuda')
model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [26]:
dados_imdb['train']

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 25000
})

We will use Trainer and TrainingArguments from the transformers library to train the model, which makes everything easier.

In [27]:
batch_size = 32
epochs = 4
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=epochs,              # total number of training epochs
    per_device_train_batch_size=batch_size,  # batch size per device during training
    per_device_eval_batch_size=batch_size,   # batch size for evaluation
    warmup_ratio=0.01,                # fraction of total training steps used for learning-rate warmup
    learning_rate = 1e-4,
    weight_decay=0.01,               # strength of weight decay
    evaluation_strategy='steps',
    eval_steps=500,
    save_strategy='epoch',
    disable_tqdm=False,
)
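
The arithmetic behind these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the number of warmup steps is rounded up from `warmup_ratio` times the total number of steps.

```python
import math

n_train = 25000       # size of the IMDB training split
batch_size = 32
epochs = 4
warmup_ratio = 0.01

steps_per_epoch = math.ceil(n_train / batch_size)      # the last batch is smaller
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)   # assumed to round up

print(steps_per_epoch, total_steps, warmup_steps)  # 782 3128 32
```

The total of 3128 steps agrees with the `global_step=3128` reported at the end of training, and with `eval_steps=500` it corresponds to six evaluations (at steps 500 through 3000).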

In [28]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision = precision_score(labels, preds, average='micro')
    recall = recall_score(labels, preds, average='micro')
    f1 = f1_score(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
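
A note on `average='micro'`: when every example has exactly one label, micro-averaged precision, recall, and F1 all reduce to accuracy, so the four reported metrics will be identical. The sketch below illustrates this on a small hand-made set of labels and predictions (both invented for the example) and contrasts it with `average='binary'`, which scores the positive class only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made toy labels and predictions: 5 of the 8 predictions are correct.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(labels, preds)

micro = {
    "precision": precision_score(labels, preds, average="micro"),
    "recall": recall_score(labels, preds, average="micro"),
    "f1": f1_score(labels, preds, average="micro"),
}

# Positive class only: TP=1, FP=1, FN=2.
binary = {
    "precision": precision_score(labels, preds, average="binary"),
    "recall": recall_score(labels, preds, average="binary"),
    "f1": f1_score(labels, preds, average="binary"),
}

print("accuracy:", acc)
print("micro:   ", micro)
print("binary:  ", binary)
```

If separate precision and recall figures for the positive class are wanted, `average='binary'` is one option; the choice depends on what the evaluation is meant to show.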

In [31]:
train = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dados_imdb['train'],         # training dataset
    eval_dataset=dados_imdb['test'],             # evaluation dataset
    compute_metrics=compute_metrics,
    # data_collator = data_collator
)

In [32]:
train.train()

Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
500,0.2988,0.324059,0.88008,0.88008,0.88008,0.88008
1000,0.2173,0.270654,0.91048,0.91048,0.91048,0.91048
1500,0.1585,0.29209,0.90388,0.90388,0.90388,0.90388
2000,0.09,0.31444,0.9174,0.9174,0.9174,0.9174
2500,0.0627,0.349181,0.91736,0.91736,0.91736,0.91736
3000,0.0297,0.421547,0.91988,0.91988,0.91988,0.91988


TrainOutput(global_step=3128, training_loss=0.1381366890867043, metrics={'train_runtime': 3290.5036, 'train_samples_per_second': 30.39, 'train_steps_per_second': 0.951, 'total_flos': 2.5694439e+16, 'train_loss': 0.1381366890867043, 'epoch': 4.0})

In [33]:
#####
## Exercise: Code a prediction function that receives a text and returns the predicted label
#####
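
One possible shape for such a function is sketched below. It is not the notebook's own solution: the helper `label_from_logits` and the explicit `model` and `tokenizer` parameters are assumptions made so that the example is self-contained, whereas the calls in the following cells pass only the text and presumably rely on the global model and tokenizer. The label mapping follows the IMDB convention of 0 for negative and 1 for positive.

```python
import torch

# IMDB label ids: 0 = negative, 1 = positive.
ID2LABEL = {0: "Negative review", 1: "Positive review"}


def label_from_logits(logits):
    """Map a (1, 2) logits tensor to a readable label via argmax."""
    return ID2LABEL[int(logits.argmax(dim=-1).item())]


def predict(text, model, tokenizer, max_length=500):
    """Tokenize one text, run the classifier, and return the predicted label."""
    model.eval()
    inputs = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    # Move every input tensor to the device the model lives on.
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return label_from_logits(logits)


# The mapping step can be checked on hand-written logits without any model.
print(label_from_logits(torch.tensor([[-1.2, 2.3]])))  # Positive review
print(label_from_logits(torch.tensor([[0.8, -0.4]])))  # Negative review
```

No padding is needed here because a single text forms a batch of one.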



In [35]:
predict("I enjoyed this movie!")

Positive review


In [36]:
predict("This movie quite bad!")

Negative review
