# Machine Learning for NLP : lab session 5

### Lecture takeaways (5 and 6)

- Language Models 
- The Transformer Architecture
- Use a Pretrained Language Model on specific tasks (focus on BERT)

### Lab session outline 

1. Playing with BERT and the **transformers** library
  1. Experimenting with the CamemBERT language model
2. Fine-tuning BERT for task-specific use cases

### Resources : 


- Library doc: https://huggingface.co/transformers/quickstart.html
- Adam optimizer: https://mlfromscratch.com/optimizers-explained/
- Transformer architecture: http://jalammar.github.io/illustrated-transformer/
- BERT: https://arxiv.org/pdf/1810.04805.pdf
- CamemBERT: https://arxiv.org/pdf/1911.03894.pdf

source : https://medium.com/towards-artificial-intelligence/cross-lingual-language-model-56a65dba9358

 




In [None]:
  !pip install transformers
#!pip install torch torchvision

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/13/33/ffb67897a6985a7b7d8e5e7878c3628678f553634bd3836404fef06ef19b/transformers-2.5.1-py3-none-any.whl (499kB)

## About the Transformers library 

Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformer models.

The library was designed with two strong goals in mind:

- be as easy and fast to use as possible
- provide state-of-the-art models with performances as close as possible to the original models

For more details, see https://huggingface.co/transformers/quickstart.html 

**In other words, <font color='red'>the Transformers library is currently one of the best open-source tools (if not the best) to experiment with the best NLP models</font>**


In the context of this course, the Transformers library will be a great tool to:
- play with **SOTA pretrained language models** (BERT and others)
- **fine-tune on specific tasks**

You can find the list of all available pretrained models in the library here: https://huggingface.co/transformers/pretrained_models.html 

In short, the Transformers library is a collection of wrappers built with PyTorch or TensorFlow that provides model loading, prediction, training, and fine-tuning. 

### What will we do with the Transformers library ?

- We will load a pretrained language model for English: BERT 
- We will visualize sentence embeddings
- We will fine-tune BERT for sentiment analysis






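### What's a masked language model again?

BERT is a masked language model: some input tokens are replaced by a special mask token, and the model is trained to predict the original tokens from the surrounding context.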
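
To make the masking step concrete, here is a minimal pure-Python sketch. The function name `mask_tokens`, the `<mask>` string and the masking probability are illustrative assumptions; this is not the exact procedure used to pretrain BERT or CamemBERT.

```python
import random

# Assumption: placeholder string; real tokenizers expose it as tokenizer.mask_token
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy masked-LM example.

    targets maps each masked position to the original token that the
    model would have to predict back.
    """
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    masked_tokens = list(tokens)
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[position] = token
            masked_tokens[position] = MASK_TOKEN
    return masked_tokens, targets


tokens = ["▁Les", "▁feux", "▁de", "▁b", "rousse", "▁en", "▁Australie"]
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # input given to the model
print(targets)  # what the model has to recover
```

In the cells below we do the same thing by hand: a single position (`masked_index`) is replaced by `tokenizer.mask_token`.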

![Alternative text…](https://drive.google.com/uc?id=1d6TMVu6G8azV07wJN02igAWAVzClPdVH)

... 

### Loading the model 

#### 1- Tokenizer : 

As seen during the lectures, tokenization is a model-specific stage. 
In the context of masked language models, tokenization works at the sub-word level (cf. Byte-Pair Encoding, Lecture 6); we therefore need to load a model-specific BPE tokenizer. 
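
As a reminder of how sub-word splitting works, here is a toy sketch of BPE encoding in pure Python. The merge table `TOY_MERGES` is made up for this illustration (it is not CamemBERT's real vocabulary), and `bpe_encode` is a hypothetical helper, not a function of the Transformers library.

```python
def bpe_encode(word, merges):
    """Split a word into sub-words by applying BPE merges in priority order."""
    # Lower rank = merge learned earlier = applied first
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Made-up merge table, ordered from highest to lowest priority (assumption)
TOY_MERGES = [
    ("▁", "b"), ("s", "e"), ("s", "se"),
    ("u", "sse"), ("o", "usse"), ("r", "ousse"),
]

# "▁" marks the beginning of a word, as in the tokenizer output shown in this notebook
print(bpe_encode("▁brousse", TOY_MERGES))
```

With this toy table, `▁brousse` is split into `▁b` and `rousse`; a word for which no merge is known falls back to single characters.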

#### 2- Loading the pretrained weights 




In [None]:
### Loading a model 
import pdb
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

## OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
#import logging
#logging.basicConfig(level=logging.)

MODEL_NAME = "camembert-base"
# Load pre-trained model tokenizer (vocabulary)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Tokenize input
text = tokenizer.bos_token+" Les feux de brousse qui sévissent depuis septembre en Australie, favorisés par des températures exceptionnelles, dépassent tous les records. "+tokenizer.eos_token
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with the masked language model
masked_index = 13 # e.g. mask the word 'Australie', which is at index 13
tokenized_text[masked_index] = tokenizer.mask_token
#assert tokenized_text == ['<s>', '▁Les', '▁feux', '▁de', '▁b', 'rousse', '▁qui', '▁s', 'év', 'issent', '▁depuis', '▁septembre', '▁en', '<mask>', ',', '▁favorisé', 's', '▁par', '▁des', '▁températures', '▁exceptionnelles', ',', '▁dépassent', '▁tous', '▁les', '▁records', '.', '</s>'], "ERROR {}".format(tokenized_text)
print("Input text is {}".format(tokenized_text))
# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0 for _ in indexed_tokens]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
tokens_tensor, segments_tensors

Input text is ['<s>', '▁Les', '▁feux', '▁de', '▁b', 'rousse', '▁qui', '▁s', 'év', 'issent', '▁depuis', '▁septembre', '▁en', '<mask>', ',', '▁favorisé', 's', '▁par', '▁des', '▁températures', '▁exceptionnelles', ',', '▁dépassent', '▁tous', '▁les', '▁records', '.', '</s>']


(tensor([[    5,    74,  7795,     8,  1011, 15380,    31,    52,  5632,  5999,
            176,   652,    22, 32004,     7, 21438,    10,    37,    20,  6350,
          15039,     7, 18851,   117,    19, 18588,     9,     6]]),
 tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0]]))

Questions : 

1. Why is tokenization required? Why are the special tokens bos_token and eos_token required? 
2. Why are those tokens special? 
3. Why do some tokens start with the ▁ symbol while others do not? 


### About GPU 

GPUs perform large matrix operations much faster than CPUs. 
In Colab: go to _Edit -> Notebook settings -> Hardware accelerator_


GPU computing in PyTorch relies on a _CUDA backend_. To move your model to the GPU, simply apply the following commands. 

NB: to compute on the GPU, the model and all the tensors involved must be on the same device! Move them all to the GPU with the `.to('cuda')` method.


In [None]:
# to check the GPU activity (cf. Volatile GPU-Util %)
!nvidia-smi

Wed Mar 18 06:11:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    56W / 149W |    791MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [None]:
# Load pre-trained model (weights)

model = AutoModel.from_pretrained(MODEL_NAME)
# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)






**Questions** 

4. Why are GPUs helpful in some cases when we do Deep Learning for NLP?
5. What are the constraints of GPUs? 


### Experimenting with BERT

Now that we have loaded the model and the tokenizer, let's experiment with them. 

This lab session is focused on BERT. BERT is a masked language model based 
on the Transformer architecture. 

We start with some qualitative experiments: we first analyze BERT as a language model, then we will use it to produce sentence embeddings. 

We will first play with CamemBERT, the French version of BERT. We use the language modelling wrapper of the Transformers library to perform language modelling with it. 






#### Language Modelling with **CamemBERT**

In [None]:
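# Load pre-trained model (weights) with its language modelling head
# NB: in recent versions of the library, AutoModelWithLMHead is deprecated in favour of AutoModelForMaskedLM
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained('camembert-base')
model.eval()
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)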

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

predictions.size(), predictions

(torch.Size([1, 28, 32005]),
 tensor([[[ 24.2970,  -4.9570,   7.4591,  ...,  -6.7556,  -3.8422,   1.5580],
          [  0.8888,  -5.1120,  20.0042,  ..., -10.3810,  -2.8260,   1.2345],
          [  0.7079,  -4.8593,   5.9372,  ...,  -2.2902,  -9.3024,  -4.2323],
          ...,
          [  3.1483,  -5.8761,   9.6652,  ...,   0.6660, -14.9198,  -1.6405],
          [  6.5931,  -9.3456,   6.4012,  ...,  -6.1560,  -7.7579,   1.6097],
          [  9.2209,  -6.0294,  27.4842,  ...,  -9.5553,  -6.6236,   2.0224]]],
        device='cuda:0'))

**Questions :**

7. What does each dimension of predictions.size() correspond to?
8. Fill in the cell below to compute the masked language model prediction for our example  
9. Same question, but with the top-5 predictions 

In [None]:
# Question 8 
#predicted_index =  # f(predictions[0, masked_index])

tensor([11046, 22552,  2971,  6278,   184], device='cuda:0')

In [None]:
def detokenized_text(tokenized_sequence, masked_index, special_char="▁"):
  """
  Reconstruct readable text from a sequence of sub-word tokens and
  highlight the token at masked_index with ** markers.
  input: list of sub-word tokens (strings), index of the masked token
  return: the detokenized string
  """
  text = ""
  for ind, token in enumerate(tokenized_sequence):
    if ind==masked_index:
      text+=" **"
    if token.startswith(special_char):
      # special_char marks the start of a new word: replace it with a space
      text+=" "+token[len(special_char):]
    else:
      text+=token
    if ind==masked_index:
      text+="**"
  return text

In [None]:
#predicted_index = 

predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)[0]
pred_text = tokenized_text.copy()
pred_text[masked_index] = predicted_token
print("PREDICTION TOKENIZED TEXT : {}".format(pred_text))
print("PREDICTION DETOKENIZED TEXT : {}".format(detokenized_text(pred_text, masked_index,special_char="▁")))

PREDICTION TOKENIZED TEXT : ['<s>', '▁Les', '▁feux', '▁de', '▁b', 'rousse', '▁qui', '▁s', 'év', 'issent', '▁depuis', '▁septembre', '▁en', '▁Guinée', ',', '▁favorisé', 's', '▁par', '▁des', '▁températures', '▁exceptionnelles', ',', '▁dépassent', '▁tous', '▁les', '▁records', '.', '</s>']
PREDICTION DETOKENIZED TEXT : <s> Les feux de brousse qui sévissent depuis septembre en ** Guinée**, favorisés par des températures exceptionnelles, dépassent tous les records.</s>


In [None]:
# Question 9. 
# predicted_index  = list/tensor of the top-5 predictions

## Questions 

- Now do the same with your own text 
- (if time) Do the same with another pretrained model (e.g. bert-base-multilingual-cased, the multilingual version of BERT)

# Fine-Tuning for Sequence Classification: Sentiment Analysis

We will now fine-tune the original version of BERT (bert-base-uncased).


## Download data 

1- First download the data from https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8 

2- Unzip it on your laptop


3- Upload each tsv file (train.tsv, dev.tsv and test.tsv): in the left panel, click on the bottom-most folder symbol, then on the "Import" button

## Preprocessing 

Then, we introduce a few preprocessing functions to help you get to the model. 

**SSTDataset is a class that handles loading, tokenization and padding of the sentences.**


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import BertTokenizer, AutoModel, AutoTokenizer
import pandas as pd
from torch.utils.data import DataLoader

class SSTDataset(Dataset):

    def __init__(self, filename, maxlen, model_name='bert-base-uncased'):

        #Store the contents of the file in a pandas dataframe
        self.df = pd.read_csv(filename, delimiter = '\t')

        #Initialize the BERT tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.maxlen = maxlen

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        #Selecting the sentence and label at the specified index in the data frame
        sentence = self.df.loc[index, 'sentence']
        label = self.df.loc[index, 'label']

        #Preprocessing the text to be suitable for BERT
        tokens = self.tokenizer.tokenize(sentence) #Tokenize the sentence
        if self.tokenizer.cls_token is None:
          bos_token = self.tokenizer.bos_token
        else:
          bos_token = self.tokenizer.cls_token
          
        if self.tokenizer.sep_token is None:
          eos_token = self.tokenizer.eos_token
        else:
          eos_token = self.tokenizer.sep_token
        
        tokens = [bos_token] + tokens + [eos_token] #Inserting the CLS and SEP tokens at the beginning and end of the sentence
        if len(tokens) < self.maxlen:
            tokens = tokens + [self.tokenizer.pad_token for _ in range(self.maxlen - len(tokens))] #Padding sentences
        else:
            tokens = tokens[:self.maxlen-1] + [eos_token] #Pruning the list to the specified max length

        tokens_ids = self.tokenizer.convert_tokens_to_ids(tokens) #Obtaining the indices of the tokens in the BERT Vocabulary
        tokens_ids_tensor = torch.tensor(tokens_ids) #Converting the list to a pytorch tensor
        #Obtaining the attention mask i.e a tensor containing 1s for non-padded tokens and 0s for padded ones
        #We compare with the tokenizer's own pad id (0 for BERT, but not for every model)
        attn_mask = (tokens_ids_tensor != self.tokenizer.pad_token_id).long()

        return tokens_ids_tensor, attn_mask, label

In [None]:

#Creating instances of training and validation set
train_set = SSTDataset(filename = 'train.tsv', maxlen = 30, model_name='bert-base-uncased')
val_set = SSTDataset(filename = 'dev.tsv', maxlen = 30, model_name='bert-base-uncased')

#Creating instances of training and validation dataloaders
train_loader = DataLoader(train_set, batch_size = 12, num_workers = 5)
val_loader = DataLoader(val_set, batch_size = 12, num_workers = 5)

## Data 

We define the SSTDataset class. It is a compact wrapper to:
- access the data of the sentiment analysis dataset more easily
- tokenize sentences into subwords for our language model
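
The padding and attention-mask logic of `SSTDataset` can be sketched on plain Python lists. This is a simplified illustration: `pad_and_mask` is a hypothetical helper, and it assumes that the pad id is 0 and that 102 is the end-of-sequence id, as for `bert-base-uncased` in the output shown in this notebook.

```python
def pad_and_mask(token_ids, maxlen, pad_id, eos_id):
    """Pad or truncate token_ids to maxlen and build the attention mask."""
    if len(token_ids) < maxlen:
        n_real = len(token_ids)
        ids = token_ids + [pad_id] * (maxlen - n_real)
    else:
        # Keep the first maxlen-1 ids and close the sequence with the end token
        ids = token_ids[:maxlen - 1] + [eos_id]
        n_real = maxlen
    # 1 for real tokens, 0 for padding positions
    attn_mask = [1] * n_real + [0] * (maxlen - n_real)
    return ids, attn_mask


# Token ids copied from the validation example printed in this notebook
sentence_ids = [101, 4895, 10258, 2378, 8450, 2135, 21657, 1998, 7143, 102]
ids, attn_mask = pad_and_mask(sentence_ids, maxlen=12, pad_id=0, eos_id=102)
print(ids)
print(attn_mask)
```

Here the mask is built from the number of real tokens rather than by comparing ids with the pad id, so the sketch does not depend on which id the tokenizer uses for padding.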


In [None]:
# get the number of sentences
print(len(val_set))
# get the tokenized sentence at index 1 
val_set[1]

872


(tensor([  101,  4895, 10258,  2378,  8450,  2135, 21657,  1998,  7143,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]),
 0)

## Define the Sentiment Analysis model using pytorch 

- As we have seen in the last lab session, all PyTorch models follow the same template 
  - One class, with an `__init__()` method to instantiate the model 
  - A `forward()` method to define the forward pass

- Here, we will use the pretrained masked language model as one module of our sentiment analysis model 
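
To see what the extra classification layer does with the encoder output, here is a toy sketch in pure Python: a linear layer maps the vector of the first token to a single logit, and the sign of the logit gives the predicted class. The vector, the weights and the bias are made-up numbers, and `linear_head` is a hypothetical helper standing in for `nn.Linear(hidden_size, 1)`.

```python
import math


def linear_head(cls_rep, weights, bias):
    """Map one [CLS] vector to a single logit: w . h + b."""
    return sum(w * h for w, h in zip(weights, cls_rep)) + bias


def sigmoid(x):
    """Turn a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy numbers (assumptions): a 4-dimensional "[CLS]" vector and head parameters
cls_rep = [0.5, -1.0, 0.25, 2.0]
weights = [0.1, 0.2, -0.4, 0.3]
bias = -0.05

logit = linear_head(cls_rep, weights, bias)
probability = sigmoid(logit)
predicted_label = int(logit > 0)  # logit > 0 is equivalent to probability > 0.5
print(logit, probability, predicted_label)
```

This is why the training loop further down can compute the accuracy with `logits > 0` without applying a sigmoid first.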

In [None]:
class SentimentClassifier(nn.Module):

    def __init__(self, pretrained_model_name='bert-base-uncased'):
        super(SentimentClassifier, self).__init__()
        
        #Loading Mask Language Model 
        self.encoder = AutoModel.from_pretrained(pretrained_model_name)
        #we append an extra layer for Classification (it will be randomly initialized)
        self.cls_layer = nn.Linear(self.encoder.pooler.dense.out_features, 1)

    def forward(self, seq, attn_masks):
        '''
        Inputs:
            -seq : Tensor of shape [B, T] containing token ids of sequences
            -attn_masks : Tensor of shape [B, T] containing attention masks to be used to avoid contribution of PAD tokens
        '''

        #Feeding the input to BERT model to obtain contextualized representations
        #TODO: see in the Hugging Face doc what to input; cont_reps should have shape [B, T, hidden size]
        #cont_reps = #  self.encoder(..)..

        #Obtaining the representation of the [CLS] token (first position)
        cls_rep = cont_reps[:, 0]

        #Feeding cls_rep to the classifier layer
        logits = self.cls_layer(cls_rep)

        return logits


In [None]:
# we now instantiate the model 
sentiment_model = SentimentClassifier('bert-base-uncased')
# if gpu mode
sentiment_model = sentiment_model.to("cuda")
# to check whether the weights of the model are on the gpu: 
# sentiment_model.cls_layer.weight.is_cuda
# you can check out all the layers by running 
#sentiment_model

## Define Training Process

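- a loss 

We are doing binary classification, so we use a binary cross-entropy loss computed directly on the logits (`BCEWithLogitsLoss`).

- an optimizer 

Here we will use a variant of Stochastic Gradient Descent called Adam (cf. the reference at the top).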
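
Both ingredients can be written out for a single scalar in pure Python. This is only a sketch for intuition, not PyTorch's implementation: `bce_with_logits` and `adam_step` are hypothetical helpers, and the default values of `beta1`, `beta2` and `eps` are assumed to match the defaults of `torch.optim.Adam`.

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def adam_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction of the first moment
    v_hat = v / (1 - beta2 ** t)               # bias correction of the second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Loss of an undecided prediction (logit 0) on a positive example: log(2)
print(bce_with_logits(0.0, 1.0))

# One update of a toy parameter with gradient 0.5
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)
```

On the very first step the update has a magnitude close to `lr` whatever the size of the gradient, which is one reason why Adam is less sensitive to the gradient scale than plain SGD.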




In [None]:
import torch.nn as nn
import torch.optim as optim
# define the loss and optimizer 
criterion = nn.BCEWithLogitsLoss()
opti = optim.Adam(sentiment_model.parameters(), lr = 2e-5)

## Training loop



In [None]:
import pdb
def train(model, criterion, opti, train_loader, val_loader, max_eps=1, gpu=False, print_every=1,validate_every=1, break_training_after=None):
    if gpu:
      model = model.to("cuda")
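    for ep in range(max_eps):
        # from_pretrained() returns the encoder in evaluation mode: switch dropout back on for training
        model.train()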
        for it, (seq, attn_masks, labels) in enumerate(train_loader):
            #Clear gradients
            opti.zero_grad()  
            #Converting these to cuda tensors
            if gpu:
              seq, attn_masks, labels = seq.cuda(), attn_masks.cuda(), labels.cuda()
            #Obtaining the logits from the model
            logits = model(seq, attn_masks)

            #Computing loss
            loss = criterion(logits.squeeze(-1), labels.float())

            #Backpropagating the gradients
            loss.backward()

            #Optimization step
            opti.step()
            if (it + 1) % print_every == 0:
                accuracy = torch.sum((logits>0).int().squeeze(1)==labels)/float(labels.size(0))
                print("Iteration {} of epoch {} complete. Loss : {}, Accuracy {} ".format(it+1, ep+1, loss.item(),accuracy))
            if break_training_after is not None and it>break_training_after:
              print("Early breaking : did not cover a full epoch but only {} iteration ".format(it))
              break
        if ep % validate_every==0:
          # Evaluation on the validation set: switch off dropout and gradient tracking
          model.eval()
          n_batch_validation = 0
          loss_validation = 0
          accuracy_validation = 0
          with torch.no_grad():
            for it, (seq, attn_masks, labels) in enumerate(val_loader):
              if gpu:
                seq, attn_masks, labels = seq.cuda(), attn_masks.cuda(), labels.cuda()
              #Obtaining the logits from the model
              logits_val = model(seq, attn_masks)
              n_batch_validation+=1
              #Computing loss and accuracy on this batch
              _loss = float(criterion(logits_val.squeeze(-1), labels.float()))
              _accu = float(torch.sum((logits_val>0).int().squeeze(1)==labels)/float(labels.size(0)))
              loss_validation += _loss
              accuracy_validation += _accu
          print("EVALUATION Validation set : mean loss {}, mean accuracy {}".format(loss_validation/n_batch_validation, accuracy_validation/n_batch_validation))
          # Back to training mode for the next epoch
          model.train()

          

In [None]:
train(sentiment_model, criterion, opti, train_loader, val_loader,max_eps=5, print_every=100, gpu=True)

Iteration 100 of epoch 1 complete. Loss : 0.6502393484115601, Accuracy 0.6666666865348816 


## Questions 

10- Plot the loss and the accuracy   
11- Compare different values of the learning rate in the Adam optimizer (between 1e-6 and 5e-5)  
12- Conclude on the performance of BERT on sentiment analysis 
13- Now do the same with another pretrained model 

- e.g. bert-large-uncased (a much larger version of BERT) or bert-base-multilingual-cased (the multilingual version of BERT)

14- Conclude on which pretrained model is best for sentiment analysis
