## Classification with a recurrent network, embeddings, and an attention mechanism

This example is very similar to the previous one, except that an attention mechanism is used to determine the relative importance of the words in the classification process.

Refer to the notebook "Classification avec réseau récurrent" for details on the preparation of the dataset, the vocabulary, and the *embeddings*, and on training the model.

## 1. Creating the training and validation datasets

In [10]:
train_dataset_path = "./data_rnn/questions-t3.txt"
from sklearn.model_selection import train_test_split

def load_dataset(filename):
    with open(filename) as f:
        lines = f.read().splitlines()
        labels, questions = zip(*[tuple(s.split(' ', 1)) for s in lines])
    return questions, labels

questions, labels = load_dataset(train_dataset_path)

X_train, X_valid, y_train, y_valid = train_test_split(questions, labels, test_size=0.2, shuffle=True,random_state=42)

# Convert the textual labels to numeric indices (sorted so the mapping is the same on every run)
id2lable = {label_id:value for label_id, value in enumerate(sorted(set(labels)))}
label2id = {value:label_id for label_id, value in id2lable.items()}

y_train = [label2id[label] for label in y_train]
y_valid = [label2id[label] for label in y_valid]

nb_class = len(id2lable)
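
As an illustration of the line format that `load_dataset` expects, the sketch below parses a few lines and builds the two label mappings. The sample lines are made up for the example (the real file is not shown here), and the `sample_*` names are illustrative only.

```python
# Hypothetical sample lines in the "LABEL question" format assumed by load_dataset.
sample_lines = [
    "DEFINITION What is cryogenics ?",
    "TEMPORAL What year did Canada join the United Nations ?",
    "LOCATION Where is Prince Edward Island ?",
    "DEFINITION What is a micron ?",
]

# split(' ', 1) cuts only at the first space: label on the left, question on the right.
sample_labels, sample_questions = zip(*[tuple(line.split(' ', 1)) for line in sample_lines])

# Sorting the distinct labels gives each label the same index on every run.
sample_id2label = dict(enumerate(sorted(set(sample_labels))))
sample_label2id = {label: label_id for label_id, label in sample_id2label.items()}

encoded = [sample_label2id[label] for label in sample_labels]
print(sample_questions[0])  # What is cryogenics ?
print(encoded)              # [0, 2, 1, 0]
```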




## 2. Managing the vocabulary and the word vectors


In [3]:
import spacy
import numpy as np
nlp = spacy.load('en_core_web_lg')
embedding_size = nlp.meta['vectors']['width']


In [4]:
word2id = {}
id2embedding = {}
id2word = {}

# Index 1 is reserved for unknown words; their embedding is a zero vector
word2id["<unk>"] = 1
id2word[1] = "<unk>"

id2embedding[1] = np.zeros(embedding_size, dtype=np.float32)

word_index = 2

for question in X_train:
    for word in nlp(question):
        if word.text not in word2id.keys():
            word2id[word.text] = word_index
            id2embedding[word_index] = word.vector
            id2word[word_index] = word.text
            word_index += 1
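
The effect of reserving index 1 for unknown words can be seen in a small sketch. It replaces the spaCy tokenizer with `str.split` to stay self-contained, which is a simplifying assumption; `build_vocab`, `encode` and `toy_word2id` are illustrative names, not part of the notebook.

```python
UNK_ID = 1  # convention used in this notebook: 0 is kept for padding, 1 for unknown words

def build_vocab(sentences):
    """Assign an index (starting at 2) to each new token, in order of first appearance."""
    toy_word2id = {"<unk>": UNK_ID}
    for sentence in sentences:
        for token in sentence.split():  # whitespace split stands in for spaCy
            if token not in toy_word2id:
                toy_word2id[token] = len(toy_word2id) + 1
    return toy_word2id

def encode(sentence, toy_word2id):
    """Map tokens to indices, falling back to UNK_ID for unseen tokens."""
    return [toy_word2id.get(token, UNK_ID) for token in sentence.split()]

toy_word2id = build_vocab(["What is cryogenics ?", "What is a micron ?"])
print(encode("What is neurology ?", toy_word2id))  # [2, 3, 1, 5]
```

Because only the training questions feed the vocabulary, any word that first appears in the validation or test set is encoded as 1 and receives the zero embedding.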


In [5]:
import torch

from torch import LongTensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from typing import List, Dict, Tuple

class TokenisedDataset(Dataset):
    
    def __init__(self, dataset: List[str] , target: np.array, word2id: Dict[str, int], nlp_model):
        self.tokenized_dataset = [None for _ in range(len(dataset))]
        self.dataset = dataset
        self.target = target
        self.word2id = word2id
        self.nlp_model = nlp_model
    
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if self.tokenized_dataset[index] is None:
            self.tokenized_dataset[index] = self.tokenize(self.dataset[index])
        
        return LongTensor(self.tokenized_dataset[index]), LongTensor([self.target[index]]).squeeze(0)

    def tokenize(self, sentence):
        return [ self.word2id.get(word.text, 1) for word in self.nlp_model(sentence)]
    
    
train_dataset = TokenisedDataset(X_train, y_train, word2id, nlp)
valid_dataset = TokenisedDataset(X_valid, y_valid, word2id, nlp)


## 3. Building the neural architecture
The recurrent network architecture still includes:

* an input layer that takes the spaCy word embeddings. The size of the input layer matches the spaCy embedding size.
* a recurrent hidden layer that takes as input a word embedding and the previous hidden state. The neurons in this layer are LSTM units, a neuron structure that makes it easier to propagate information over longer sequences. Note that the layer is bidirectional (see the course notes).
* a classification layer that outputs a score for each class (question type).

This time, however, we add an attention layer: a linear layer that assigns an attention weight to each word of the question. These weights serve both to classify the questions and to assess the relative importance of the words.

The important part is the `_handle_rnn_output` method, which computes the attention weights and builds the vector used for classification: a sum of the hidden states weighted by the attention weights of the words.
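
Written out, the computation performed by `_handle_rnn_output` can be summarised as follows. Here $h_t$ denotes the bidirectional LSTM output for word $t$ of a question of length $T$, and $w$ and $b$ are the weight vector and bias of the attention layer. These symbols are notation introduced for this summary, not taken from the course notes.

```latex
\begin{aligned}
e_t      &= w^\top h_t + b, \\
\alpha_t &= \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \\
c        &= \sum_{t=1}^{T} \alpha_t \, h_t .
\end{aligned}
```

The vector $c$ is what the classification layer receives, and the weights $\alpha_t$ are the per-word values inspected in section 6. Padded positions are excluded from the sum over $k$ by the mask.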

In [6]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
NEG_INF = -1e6

class AttentionRNNWithEmbeddingLayer(nn.Module):
    
    def __init__(self, embedding, hidden_state_size, nb_class) :
        super().__init__()
        self.embedding_layer = nn.Embedding.from_pretrained(embedding)
        embedding_size = embedding.size()[1]
        self.rnn = nn.LSTM(embedding_size, hidden_state_size, 1, bidirectional=True)        
        self.attention_layer = nn.Linear(2 * hidden_state_size, 1) # One scalar attention score per input position
        self.classification_layer = nn.Linear(2 * hidden_state_size, nb_class) # 2 * -> one hidden state per direction
    
    def forward(self, x, x_lenghts):
        x = self.embedding_layer(x)
        x = self._handle_rnn_output(x, x_lenghts)
        x = self.classification_layer(x)
                
        return x
    
    def _handle_rnn_output(self, x, x_lenghts):
        
        # Pack the batch before sending it through the RNN
        packed_batch = pack_padded_sequence(x, x_lenghts, batch_first=True, enforce_sorted=False)
        
        # This time we are interested in the outputs after every word
        rnn_output, _ = self.rnn(packed_batch)
        
        # Re-pad the outputs to get them back into a usable shape
        unpacked_rnn_output, _ = pad_packed_sequence(rnn_output, batch_first=True)

        # Build a mask so that no attention weight is assigned to the padding
        sequence_mask = self.make_sequence_mask(x_lenghts)
        
        # Compute the attention scores for the RNN outputs
        attention = self.attention_layer(unpacked_rnn_output)
        
        # Normalize the attention scores into weights
        soft_maxed_attention = self.mask_softmax(attention.squeeze(-1), sequence_mask)
        
        # Weight the RNN outputs by the attention weights and sum over the sequence
        attention_weighted_rnn_output = torch.sum(soft_maxed_attention.unsqueeze(-1) * unpacked_rnn_output, dim=1)

        return attention_weighted_rnn_output
        
    def calculate_attention_for_input(self, x, x_lenghts):
        x = self.embedding_layer(x)
        packed_batch = pack_padded_sequence(x, x_lenghts, batch_first=True, enforce_sorted=False)
        rnn_output, _ = self.rnn(packed_batch)
        unpacked_rnn_output, _ = pad_packed_sequence(rnn_output, batch_first=True)
        sequence_mask = self.make_sequence_mask(x_lenghts)
        attention = self.attention_layer(unpacked_rnn_output)
        soft_maxed_attention = self.mask_softmax(attention.squeeze(-1), sequence_mask)
        return soft_maxed_attention
        
        
    @staticmethod
    def make_sequence_mask(sequence_lengths):
        maximum_length = torch.max(sequence_lengths)

        idx = torch.arange(maximum_length).to(sequence_lengths).repeat(sequence_lengths.size(0), 1)
        mask = torch.gt(sequence_lengths.unsqueeze(-1), idx).to(sequence_lengths)

        return mask
    
    @staticmethod
    def mask_softmax(matrix, mask=None):
        if mask is None:
            result = nn.functional.softmax(matrix, dim=-1)
        else:
            mask_norm = ((1 - mask) * NEG_INF).to(matrix)
            for i in range(matrix.dim() - mask_norm.dim()):
                mask_norm = mask_norm.unsqueeze(1)
            result = nn.functional.softmax(matrix + mask_norm, dim=-1)

        return result
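
To see why the mask matters, here is a dependency-free sketch of the same idea on plain Python lists: padded positions receive a large negative offset before the softmax, so their weight is numerically zero. The function bodies and the toy scores are assumptions made for the illustration; the model itself uses the tensor version above.

```python
import math

NEG_INF = -1e6  # same constant as in the model

def make_sequence_mask(lengths):
    """Return one row per sequence with 1 for real tokens and 0 for padding."""
    max_len = max(lengths)
    return [[1 if i < length else 0 for i in range(max_len)] for length in lengths]

def masked_softmax(scores, mask):
    """Softmax over one row of scores, ignoring positions where mask == 0."""
    shifted = [s + (1 - m) * NEG_INF for s, m in zip(scores, mask)]
    top = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Toy batch: two sequences padded to length 4, with true lengths 4 and 2.
mask = make_sequence_mask([4, 2])
weights = masked_softmax([0.5, 0.5, 3.0, 3.0], mask[1])
print(mask)     # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(weights)  # [0.5, 0.5, 0.0, 0.0]
```

Without the mask, the two padded positions (score 3.0) would dominate the softmax even though they carry no word.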


In [7]:
def pad_batch(batch : List[Tuple[LongTensor, LongTensor]]) -> Tuple[LongTensor, LongTensor]:
    x = [x for x,y in batch]
    x_true_length = [len(x) for x,y in batch]
    y = torch.stack([y for x,y in batch], dim=0)
    
    return ((pad_sequence(x, batch_first=True), LongTensor(x_true_length)), y)

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=pad_batch)
valid_dataloader = DataLoader(valid_dataset, batch_size=16, shuffle=True, collate_fn=pad_batch)
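
What `pad_batch` produces can be sketched without PyTorch: every sequence is right-padded with 0 up to the longest one in the batch, and the true lengths are kept so that the model can ignore the padding. `pad_sequences` below is a hypothetical stand-in for `pad_sequence`, written only for illustration.

```python
def pad_sequences(sequences, padding_value=0):
    """Right-pad lists of token ids to a common length and return the true lengths."""
    true_lengths = [len(seq) for seq in sequences]
    max_len = max(true_lengths)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    return padded, true_lengths

# Toy batch of three tokenized questions of different lengths.
batch = [[2, 3, 4, 5], [2, 3], [6, 7, 5]]
padded, true_lengths = pad_sequences(batch)
print(padded)        # [[2, 3, 4, 5], [2, 3, 0, 0], [6, 7, 5, 0]]
print(true_lengths)  # [4, 2, 3]
```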

In [8]:
id2embedding[0] = np.zeros(embedding_size, dtype=np.float32)
embedding_layer = np.zeros((len(id2embedding), embedding_size), dtype=np.float32)
for token_index, embedding in id2embedding.items():
    embedding_layer[token_index,:] = embedding
    
embedding_layer = torch.from_numpy(embedding_layer)


## 4. Training the model

Nothing really new here under the Poutyne sun. You are on familiar ground.

In [11]:
from poutyne.framework import Experiment
from poutyne import set_seeds
import numpy as np

set_seeds(42)
hidden_size = 100

model = AttentionRNNWithEmbeddingLayer(embedding_layer, hidden_size, nb_class)
experiment = Experiment('model/attention_embeddings_rnn', 
                        model, 
                        optimizer = "SGD", 
                        task="classification")

In [12]:
logging = experiment.train(train_dataloader, valid_dataloader, epochs=50, disable_tensorboard=True)


Epoch: 1/50 Step: 278/278 100.00% |█████████████████████████|92.86s loss: 2.103834 acc: 19.801980 fscore_micro: 0.198020 val_loss: 2.027687 val_acc: 22.571942 val_fscore_micro: 0.225719
Epoch 1: val_acc improved from -inf to 22.57194, saving file to model/attention_embeddings_rnn\checkpoint_epoch_1.ckpt
Epoch: 2/50 Step: 278/278 100.00% |█████████████████████████|16.07s loss: 2.009135 acc: 22.479748 fscore_micro: 0.224797 val_loss: 1.974789 val_acc: 22.571942 val_fscore_micro: 0.225719
Epoch: 3/50 Step: 278/278 100.00% |█████████████████████████|16.66s loss: 1.975864 acc: 22.479748 fscore_micro: 0.224797 val_loss: 1.948941 val_acc: 22.571942 val_fscore_micro: 0.225719
[...]

Epoch: 21/50 Step: 278/278 100.00% |█████████████████████████|15.24s loss: 0.947490 acc: 70.049505 fscore_micro: 0.700495 val_loss: 0.953427 val_acc: 70.053957 val_fscore_micro: 0.700540
Epoch 21: val_acc improved from 68.88489 to 70.05396, saving file to model/attention_embeddings_rnn\checkpoint_epoch_21.ckpt
Epoch: 22/50 Step: 278/278 100.00% |█████████████████████████|17.06s loss: 0.907809 acc: 71.354635 fscore_micro: 0.713546 val_loss: 0.911133 val_acc: 72.661871 val_fscore_micro: 0.726619
Epoch 22: val_acc improved from 70.05396 to 72.66187, saving file to model/attention_embeddings_rnn\checkpoint_epoch_22.ckpt
Epoch: 23/50 Step: 278/278 100.00% |█████████████████████████|16.43s loss: 0.869179 acc: 73.199820 [...]

Epoch: 43/50 Step: 278/278 100.00% |█████████████████████████|14.69s loss: 0.445752 acc: 85.801080 fscore_micro: 0.858011 val_loss: 0.579640 val_acc: 81.205036 val_fscore_micro: 0.812050
Epoch: 44/50 Step: 278/278 100.00% |█████████████████████████|14.28s loss: 0.429443 acc: 86.273627 fscore_micro: 0.862736 val_loss: 0.601460 val_acc: 81.924460 val_fscore_micro: 0.819245
Epoch: 45/50 Step: 278/278 100.00% |█████████████████████████|13.91s loss: 0.418254 acc: 86.858686 fscore_micro: 0.868587 val_loss: 0.542912 val_acc: 82.464029 val_fscore_micro: 0.824640
Epoch 45: val_acc improved from 82.01439 to 82.46403, saving file to model/attention_embeddings_rnn\checkpoint_epoch_45.ckpt
[...]

## 5. Prediction with the model


In [15]:
test_dataset_path = "./data_rnn/test-questions-t3.txt"
x_test, test_labels = load_dataset(test_dataset_path)
from numpy import argmax

def obtain_prediction(sentence, label=None):
    tokenized_sentence = [word2id.get(word.text,1) for word in nlp(sentence)]
    sentence_length = len(tokenized_sentence)
    class_score = model(LongTensor(tokenized_sentence).unsqueeze(0), LongTensor([sentence_length])).detach().numpy()
    return id2lable[argmax(class_score)]

In [16]:
test_index = 101

print("Q: {}. Pred:{}, Truth:{}".
      format(x_test[test_index], obtain_prediction(x_test[test_index]), test_labels[test_index]))


Q: What was the last year that the Chicago Cubs won the World Series ?. Pred:TEMPORAL, Truth:TEMPORAL


In [17]:
new_sentence = "Will Bernie Sanders ever become president"
print("Q: {}. Pred:{}".format(new_sentence, obtain_prediction(new_sentence)))

Q: Will Bernie Sanders ever become president. Pred:ENTITY


In [18]:
def evaluate(x, y):
    prediction = obtain_prediction(x)
    print("\nQ: {}. \nPred: {}, Truth: {}".format(x, prediction, y))

for test_index in range(80, 120):
    x = x_test[test_index]
    y = test_labels[test_index]
    evaluate(x, y)


Q: What is desktop publishing ?. 
Pred: DEFINITION, Truth: DEFINITION

Q: What is the temperature of the sun 's surface ?. 
Pred: QUANTITY, Truth: QUANTITY

Q: What year did Canada join the United Nations ?. 
Pred: TEMPORAL, Truth: TEMPORAL

Q: Where is Prince Edward Island ?. 
Pred: LOCATION, Truth: LOCATION

Q: Mercury , what year was it discovered ?. 
Pred: TEMPORAL, Truth: TEMPORAL

Q: What is cryogenics ?. 
Pred: DEFINITION, Truth: DEFINITION

Q: What are coral reefs ?. 
Pred: DEFINITION, Truth: DEFINITION

Q: What is neurology ?. 
Pred: DEFINITION, Truth: DEFINITION

Q: Who invented the calculator ?. 
Pred: PERSON, Truth: PERSON

Q: How do you measure earthquakes ?. 
Pred: DESCRIPTION, Truth: DEFINITION

Q: Who is Duke Ellington ?. 
Pred: PERSON, Truth: DEFINITION

Q: What county is Phoenix , AZ in ?. 
Pred: LOCATION, Truth: LOCATION

Q: What is a micron ?. 
Pred: DEFINITION, Truth: DEFINITION

Q: The sun 's core , what is the temperature ?. 
Pred: DESCRIPTION, Truth: QUANTITY



## 6. Explaining the prediction with attention


In [19]:
def obtain_attention(sentence, label=None):
    tokenized_sentence = [word2id.get(word.text,1) for word in nlp(sentence)]
    sentence_length = len(tokenized_sentence)
    attention = model.calculate_attention_for_input(LongTensor(tokenized_sentence).unsqueeze(0), LongTensor([sentence_length])).squeeze(0).detach().numpy()
    return list(zip(nlp(sentence), attention))

In [20]:
test_index = 85
obtain_attention(x_test[test_index])


[(What, 0.17246982),
 (is, 0.32196993),
 (cryogenics, 0.45657015),
 (?, 0.04899015)]

In [21]:
test_index = 89
obtain_attention(x_test[test_index])


[(How, 0.29325584),
 (do, 0.13895506),
 (you, 0.08447854),
 (measure, 0.19434215),
 (earthquakes, 0.24090087),
 (?, 0.048067585)]

In [22]:
test_index = 103
obtain_attention(x_test[test_index])


[(What, 0.006417403),
 (year, 0.9671106),
 (did, 0.01588893),
 (WWII, 0.0077822166),
 (begin, 0.002256962),
 (?, 0.00054381165)]
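
One simple way to read these outputs is to rank the tokens by weight. The sketch below does so on the values printed above for the question about WWII; `top_tokens` is a hypothetical helper, and plain strings stand in for the spaCy tokens.

```python
# Attention weights copied from the output above (test_index = 103).
attention = [
    ("What", 0.006417403),
    ("year", 0.9671106),
    ("did", 0.01588893),
    ("WWII", 0.0077822166),
    ("begin", 0.002256962),
    ("?", 0.00054381165),
]

def top_tokens(pairs, k=2):
    """Return the k tokens with the largest attention weight, highest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

for token, weight in top_tokens(attention):
    print(f"{token}: {weight:.1%}")
# year: 96.7%
# did: 1.6%

# The weights come from a softmax, so they sum to 1 up to rounding.
print(round(sum(weight for _, weight in attention), 4))  # 1.0
```

Here almost all of the weight falls on "year", which matches the intuition that this word signals a question about time. In the two earlier examples the weights are spread over several words.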