<a href="https://colab.research.google.com/github/nicolashernandez/teaching_nlp/blob/main/M2-ATAL-2021-22_02_NER_with_BiLSTM_CRF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Recent Advances in Sequence Labeling from Deep Learning Models

Approaches to sequence labeling based on deep neural networks comprise three stages:
1. The embedding module maps words into their distributed representations (pretrained word embeddings, character-level representations, hand-crafted features and sentence-level representations).
2. The context encoder module extracts contextual features (e.g. RNN/Bi-LSTM, CNN).
3. The inference module predicts labels and generates the optimal label sequence as the output of the model (e.g. softmax, CRF, RNN).

[Zhiyong He, Zanbo Wang, Sheng Jiang. A Survey on Recent Advances in Sequence Labeling from Deep Learning Models. Published 13 November 2020. Computer Science. ArXiv](https://arxiv.org/pdf/2011.06727.pdf)


---
# A brief history of neural NER systems

You are not required to read the following papers, but you should at least read this brief history and skim sections 2.2 to 2.5 of (Huang et al., 2015) to understand the Bi-LSTM-CRF model.

* The "SENNA" architecture, which pioneered the idea of solving NLP tasks with a neural language model (including a method for building pretrained word embeddings): R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research (JMLR), 2011. ; [[paper]](http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf) ; [[implementation]](https://ronan.collobert.com/senna/)
* The first paper to apply BiLSTM-CRF to NER: Zhiheng Huang, Wei Xu, Kai Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, Arxiv, Computation and Language, Submitted on 9 Aug 2015 ; [[paper]](https://arxiv.org/pdf/1508.01991.pdf) ; [[implementation 1]](https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html) (advanced PyTorch tutorial) ; [[implementation 2]](https://github.com/ZubinGou/NER-BiLSTM-CRF-PyTorch) (also includes a Bi-LSTM-CNN-CRF model) ; [[implementation 3]](https://github.com/jidasheng/bi-lstm-crf)  ; [[implementation 4]](http://www.gabormelli.com/RKB/index.php?title=Bidirectional_LSTM/CRF_(BiLTSM-CRF)_Training_System) ; [[implementation 5]](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html) (with TensorFlow)
* BiLSTM-CNN-CRF implementation for sequence tagging (extended with ELMo representations): Reimers, Nils, and Gurevych, Iryna, Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), September 2017, Copenhagen, Denmark, 338-348 ; [[paper]](http://aclweb.org/anthology/D17-1035) ; [[implementation]](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf)
* The 3rd best-performing model in 2020 on the NER task without external resources: Ying Luo, Fengshun Xiao, and Hai Zhao. Hierarchical contextualized representation for named entity recognition. In AAAI, pages 8441–8448, 2020 ; [[implementation]](https://github.com/cslydia/Hire-NER) ; uses [NCRF++: An Open-source Neural Sequence Labeling Toolkit](https://github.com/jiesutd/NCRFpp)


---
# Bidirectional LSTM-CRF: implementation of (Huang et al., 2015)

The code in the following cells comes from [implementation 3](https://github.com/jidasheng/bi-lstm-crf/) of (Huang et al., 2015). It relies on the PyTorch library.



### YOUR TASK
* Run the cells without spending too much time on the implementation details. Answer the questions when prompted.
* Switch the runtime type to "GPU". Later you will run a test with the "None" type, i.e. CPU, to get an idea of the training times of the architecture.


## Installing the dependencies

💡 The following cell needs to be run twice.

In [2]:
# !pip install torch # 1.10.0
# RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor
# https://stackoverflow.com/questions/54358280/packed-padded-sequence-gives-error-when-used-with-gpu
!pip install torch==1.6.0 # torchvision==0.7.0
#!pip install torchtext
# The torchtext package consists of data processing utilities and popular datasets for natural language.



Check that your machine has a GPU and that the installed version of torch is the expected one.

⚠️ Warning: if the version is not the expected one, restart the runtime.

In [3]:
# GPU info
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

import torch
# Get cpu or gpu device for training (somewhat redundant with the code above, but shows a variant via torch)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print (device)

# torch version
print(torch.__version__)

Sun Jan 16 21:42:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

On Google Colab Pro:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```



Implementation of utility methods

In [31]:
# utilities 
def flatten(t):
  # flattens a list of lists into a single list
  # [[a, b], [c], [d, e, f]] -> [a, b, c, d, e, f]
  return [item for sublist in t for item in sublist]


## Impl. of the _CRF_ layer
Definition of the CRF layer, which returns the most probable tag sequence for a given word sequence.

* Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/model/crf.py
* Using a words x tags trellis, the _Viterbi algorithm_ returns the most probable tag sequence for a given word sequence (a sentence).
* _Transition probabilities_ are conditional probabilities: the probability of a tag given a history of one tag, `P(t_i | t_i-1)` (here, a bigram model). _Emission probabilities_ are the probabilities `P(w_i | t_i)` of the words being generated by their own tag. In the CRF implemented below, both are unnormalized scores in log space rather than true probabilities.
* Learn more about ["Sequence Labeling"](https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture07.pdf). 
* Learn more about ["CRF"](http://www.cs.columbia.edu/~mcollins/crf.pdf).
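
To make the trellis idea concrete, here is a minimal pure-Python sketch of Viterbi decoding on log-scores. The two tags, the start/transition/emission values and the two-word sentence are made-up toy values (assumptions for illustration only); the function `viterbi` is a hypothetical helper, not part of the notebook's implementation, which works on batched tensors.

```python
import math

def viterbi(emissions, transitions, start, tags):
    """Return (best_log_score, best_path) for one sentence.

    emissions:   one dict per word, {tag: log-score of the word under that tag}
    transitions: {(prev_tag, tag): log-score of moving from prev_tag to tag}
    start:       {tag: log-score of starting the sentence with tag}
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: start[tag] + emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] = previous tag on that best path
    for emit in emissions[1:]:
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emit[tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # follow the back pointers from the best final tag
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return best[-1][last], path[::-1]

# toy model (assumed values)
tags = ["O", "PER"]
start = {"O": math.log(0.8), "PER": math.log(0.2)}
transitions = {("O", "O"): math.log(0.7), ("O", "PER"): math.log(0.3),
               ("PER", "O"): math.log(0.6), ("PER", "PER"): math.log(0.4)}
emissions = [  # scores for the two words of "Marie sleeps"
    {"O": math.log(0.1), "PER": math.log(0.9)},
    {"O": math.log(0.9), "PER": math.log(0.1)},
]

score, path = viterbi(emissions, transitions, start, tags)
print(path)  # ['PER', 'O']
```

The cost is linear in the sentence length and quadratic in the number of tags, whereas enumerating every tag sequence would be exponential in the sentence length.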



In [4]:
import torch
import torch.nn as nn

def log_sum_exp(x):
    """calculate log(sum(exp(x))) = max(x) + log(sum(exp(x - max(x))))
    """
    max_score = x.max(-1)[0]
    return max_score + (x - max_score.unsqueeze(-1)).exp().sum(-1).log()


IMPOSSIBLE = -1e4

class CRF(nn.Module):
    """General CRF module.
    The CRF module contains an inner Linear layer which transforms the input from feature space to tag space.

    :param in_features: number of features for the input
    :param num_tags: number of tags. DO NOT include START, STOP tags, they are added internally.
    """

    def __init__(self, in_features, num_tags):
        super(CRF, self).__init__()

        self.num_tags = num_tags + 2
        self.start_idx = self.num_tags - 2
        self.stop_idx = self.num_tags - 1

        self.fc = nn.Linear(in_features, self.num_tags)

        # transition scores: T[i, j] is the score of the transition from tag j to tag i
        self.transitions = nn.Parameter(torch.randn(self.num_tags, self.num_tags), requires_grad=True)
        self.transitions.data[self.start_idx, :] = IMPOSSIBLE
        self.transitions.data[:, self.stop_idx] = IMPOSSIBLE

    def forward(self, features, masks):
        """decode tags

        :param features: [B, L, C], batch of unary scores
        :param masks: [B, L] masks
        :return: (best_score, best_paths)
            best_score: [B]
            best_paths: [B, L]
        """
        features = self.fc(features)
        return self.__viterbi_decode(features, masks[:, :features.size(1)].float())

    def loss(self, features, ys, masks):
        """negative log likelihood loss
        B: batch size, L: sequence length, D: dimension

        :param features: [B, L, D]
        :param ys: tags, [B, L]
        :param masks: masks for padding, [B, L]
        :return: loss
        """
        features = self.fc(features)

        L = features.size(1)
        masks_ = masks[:, :L].float()

        forward_score = self.__forward_algorithm(features, masks_)
        gold_score = self.__score_sentence(features, ys[:, :L].long(), masks_)
        loss = (forward_score - gold_score).mean()
        return loss

    def __score_sentence(self, features, tags, masks):
        """Gives the score of a provided tag sequence

        :param features: [B, L, C]
        :param tags: [B, L]
        :param masks: [B, L]
        :return: [B] score in the log space
        """
        B, L, C = features.shape

        # emission score
        emit_scores = features.gather(dim=2, index=tags.unsqueeze(-1)).squeeze(-1)

        # transition score
        start_tag = torch.full((B, 1), self.start_idx, dtype=torch.long, device=tags.device)
        tags = torch.cat([start_tag, tags], dim=1)  # [B, L+1]
        trans_scores = self.transitions[tags[:, 1:], tags[:, :-1]]

        # last transition score to STOP tag
        last_tag = tags.gather(dim=1, index=masks.sum(1).long().unsqueeze(1)).squeeze(1)  # [B]
        last_score = self.transitions[self.stop_idx, last_tag]

        score = ((trans_scores + emit_scores) * masks).sum(1) + last_score
        return score

    def __viterbi_decode(self, features, masks):
        """decode to tags using viterbi algorithm
        B: batch size, L: sequence length, D: dimension

        :param features: [B, L, C], batch of unary scores
        :param masks: [B, L] masks
        :return: (best_score, best_paths)
            best_score: [B]
            best_paths: [B, L]
        """
        B, L, C = features.shape

        bps = torch.zeros(B, L, C, dtype=torch.long, device=features.device)  # back pointers

        # Initialize the viterbi variables in log space
        max_score = torch.full((B, C), IMPOSSIBLE, device=features.device)  # [B, C]
        max_score[:, self.start_idx] = 0

        for t in range(L):
            mask_t = masks[:, t].unsqueeze(1)  # [B, 1]
            emit_score_t = features[:, t]  # [B, C]

            # [B, 1, C] + [C, C]
            acc_score_t = max_score.unsqueeze(1) + self.transitions  # [B, C, C]
            acc_score_t, bps[:, t, :] = acc_score_t.max(dim=-1)
            acc_score_t += emit_score_t
            max_score = acc_score_t * mask_t + max_score * (1 - mask_t)  # max_score or acc_score_t

        # Transition to STOP_TAG
        max_score += self.transitions[self.stop_idx]
        best_score, best_tag = max_score.max(dim=-1)

        # Follow the back pointers to decode the best path.
        best_paths = []
        bps = bps.cpu().numpy()
        for b in range(B):
            best_tag_b = best_tag[b].item()
            seq_len = int(masks[b, :].sum().item())

            best_path = [best_tag_b]
            for bps_t in reversed(bps[b, :seq_len]):
                best_tag_b = bps_t[best_tag_b]
                best_path.append(best_tag_b)
            # drop the last tag (the START tag) and reverse the rest
            best_paths.append(best_path[-2::-1])

        return best_score, best_paths

    def __forward_algorithm(self, features, masks):
        """calculate the partition function with forward algorithm.
        TRICK: log_sum_exp([x1, x2, x3, x4, ...]) = log_sum_exp([log_sum_exp([x1, x2]), log_sum_exp([x3, x4]), ...])

        :param features: features. [B, L, C]
        :param masks: [B, L] masks
        :return:    [B], score in the log space
        """
        B, L, C = features.shape

        scores = torch.full((B, C), IMPOSSIBLE, device=features.device)  # [B, C]
        scores[:, self.start_idx] = 0.
        trans = self.transitions.unsqueeze(0)  # [1, C, C]

        # Iterate through the sentence
        for t in range(L):
            emit_score_t = features[:, t].unsqueeze(2)  # [B, C, 1]
            score_t = scores.unsqueeze(1) + trans + emit_score_t  # [B, 1, C] + [1, C, C] + [B, C, 1] => [B, C, C]
            score_t = log_sum_exp(score_t)  # [B, C]

            mask_t = masks[:, t].unsqueeze(1)  # [B, 1]
            scores = score_t * mask_t + scores * (1 - mask_t)
        scores = log_sum_exp(scores + self.transitions[self.stop_idx])
        return scores
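
The `log_sum_exp` helper and the forward algorithm above can be checked by hand on a tiny case. The following is a minimal pure-Python sketch under stated assumptions: the emission and transition scores are arbitrary toy values, there are only 2 tags and 3 positions, and the START/STOP transitions of the notebook's CRF are left out for brevity. It shows why subtracting the maximum avoids overflow, and that the forward recursion gives the same log-partition value as enumerating every tag sequence.

```python
import itertools
import math

def log_sum_exp(xs):
    """log(sum(exp(x))) computed stably by factoring out max(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# the naive formula overflows on large scores, the stable form does not
big = [1000.0, 1000.0]
try:
    math.log(sum(math.exp(x) for x in big))
    naive_ok = True
except OverflowError:
    naive_ok = False
stable = log_sum_exp(big)  # equals 1000 + log(2)

# toy CRF scores (assumed values)
emit = [[0.5, 1.0], [2.0, -1.0], [0.0, 0.3]]  # emit[t][tag]
trans = [[0.2, -0.5], [1.0, 0.1]]             # trans[tag][prev_tag], same convention as the notebook

# forward algorithm: alpha[tag] = log-sum of the scores of all paths ending in tag
alpha = list(emit[0])
for t in range(1, len(emit)):
    alpha = [log_sum_exp([alpha[p] + trans[c][p] for p in range(2)]) + emit[t][c]
             for c in range(2)]
log_z_forward = log_sum_exp(alpha)

def path_score(path):
    """Unnormalized log-score of one complete tag sequence."""
    s = emit[0][path[0]]
    for t in range(1, len(path)):
        s += trans[path[t]][path[t - 1]] + emit[t][path[t]]
    return s

# brute force over the 2**3 possible tag sequences
log_z_brute = log_sum_exp([path_score(p) for p in itertools.product(range(2), repeat=3)])

# negative log-likelihood of a gold sequence, as in CRF.loss: log Z - gold score
nll = log_z_forward - path_score((1, 0, 1))
print(naive_ok, abs(log_z_forward - log_z_brute) < 1e-9, nll > 0)
```

The last line mirrors `loss = (forward_score - gold_score).mean()` for a single sentence: the loss is always positive because the gold path is only one of the terms summed in the partition function.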


### YOUR TASK

In Google Colab, go to Tools > Settings and check "Show line numbers".

* What is the name of the _loss function_? On which line is it specified?
* In a few words, what is the Viterbi algorithm used for? Search the web...

To go further, learn more about some [_loss functions_](https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/).

## Impl. of the _Bi-LSTM CRF_ layer

The following class implements a Bi-LSTM CRF model:
- Building the embeddings of the sequence
- Capturing the context with an RNN cell
- Predicting the tag sequence with the CRF cell, which takes the RNN output as input

Source : https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/model/model.py


In [5]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiRnnCrf(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim, num_rnn_layers=1, rnn="lstm"):
        super(BiRnnCrf, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tagset_size = tagset_size

        # Declare an embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Declare a bidirectional RNN layer
        RNN = nn.LSTM if rnn == "lstm" else nn.GRU
        self.rnn = RNN(embedding_dim, hidden_dim // 2, num_layers=num_rnn_layers,
                       bidirectional=True, batch_first=True)
        
        # Declare a CRF layer
        self.crf = CRF(hidden_dim, self.tagset_size)

    def __build_features(self, sentences):
        """
        sentences holds a batch of sentences;
        each sentence has dimension max_seq_len
        and contains the word indices
        type(sentences): <class 'torch.Tensor'>
        sentences.shape: torch.Size([1000, 100]) # default value
        More details on Tensors: https://pytorch.org/docs/stable/tensors.html
        """
        #print ('__build_features')
        #print("type sentences {}".format(type(sentences))) # <class 'torch.Tensor'>
        #print("shape sentences {}".format(sentences.shape)) # torch.Size([1000, 100])
        #print ('sentences[0]:', sentences[0]) 
        """
        sentences[0]: tensor([27996, 34171, 38501, 49310, 75077, 94514,  7381, 80031, 70853, 80031,
        56648, 41074, 75077, 51013, 83722, 91893, 70882,  7213, 55591, 30448,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        device='cuda:0')"""
        
        # > identify positions in sentences where there are words
        masks = sentences.gt(0) 
        #print("type(masks):{}".format(type(masks))) # <class 'torch.Tensor'>
        
        #print ('masks[0]:', masks[0])
        """
        masks[0]: tensor([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False],
        device='cuda:0')"""

        # print("type(sentences.long()):{}".format(type(sentences.long()))) # <class 'torch.Tensor'>
        # sentences.long() converts the data type of the Tensor to long
        # > then look up the embedding vector of each word in a sentence
        # > vectors are initialized randomly; a given index always maps to the same vector
        embeds = self.embedding(sentences.long())
        #print("type(embeds):{}".format(type(embeds))) # <class 'torch.Tensor'>
        #print ('embeds[0]:', embeds[0])
        """ 
        embeds[0]: tensor([[ 1.6529, -0.9046,  0.9322,  ..., -0.8712, -1.1555, -1.5031],
        [-0.6852,  0.2939, -0.8784,  ..., -0.7400, -0.2376, -1.7276],
        [-0.8087,  0.4498, -1.7856,  ..., -1.3986,  0.2591,  0.0371],
        ...,
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603],
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603],
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603]],
        device='cuda:0', grad_fn=<SelectBackward>)"""

        # Returns the sum of each row of the input tensor in the given dimension dim.
        # > Summing True and False gives the number of actual words in each sentence
        seq_length = masks.sum(1) 
        # print("type(seq_length):{}".format(type(seq_length))) # <class 'torch.Tensor'>
        #print ('seq_length[0]:', seq_length[0])
        # seq_length[0]: tensor(20, device='cuda:0')

        # Sorts the elements of the input tensor along a given dimension in descending order by value.
        # A namedtuple of (values, indices) is returned, where the values are the sorted values and indices are the indices of the elements in the original input tensor.
        # > Sort the sentences by their length (descending order)
        sorted_seq_length, perm_idx = seq_length.sort(descending=True)
        #print ('sorted_seq_length[0]:', sorted_seq_length[0])
        # sorted_seq_length[0]: tensor(100, device='cuda:0')
        #print ('perm_idx[0]:', perm_idx[0])
        # perm_idx[0]: tensor(630, device='cuda:0')

        # > reorder the embeddings following the sentence length for further processing: packing
        # embeds[0] has
        embeds = embeds[perm_idx, :]
        #print ('embeds[0]:', embeds[0])
        """
        embeds[0]: tensor([[ 0.1470,  1.3863,  0.2156,  ..., -0.1568, -1.1045, -0.1400],
        [ 0.3537,  0.2269, -1.4778,  ..., -1.0272, -0.7349,  1.0088],
        [-0.4989, -0.1096, -0.6463,  ...,  1.2627,  0.0907,  0.1922],
        ...,
        [-1.0912,  1.1962, -1.9826,  ..., -0.4356, -1.2736, -1.4505],
        [ 0.6587, -1.1465,  1.1382,  ...,  1.4149, -0.6422,  0.2377],
        [-0.6448,  1.1332,  1.4744,  ..., -0.7169, -1.2447, -0.5358]],
        device='cuda:0', grad_fn=<SelectBackward>) """

        # Packs a Tensor containing padded sequences of variable length.
        # input can be of size T x B x * where T is the length of the longest sequence (equal to lengths[0]), B is the batch size, and * is any number of dimensions (including 0). 
        # If batch_first is True, B x T x * input is expected.
        # https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence
        #
        # > The problem is that the sentences in the current batch do not all have the same length.
        # > If every sentence were simply padded to max_seq_len,
        # > the RNN would also process the padding positions, i.e. do more computation than the sentence lengths require.
        # > PyTorch can pack the batch using the true sentence lengths
        # > and pass this information to the RNN, which then skips the padding internally.
        # https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
        # https://stackoverflow.com/questions/59938530/why-do-we-need-pack-padded-sequence-when-we-have-pack-sequence
        # TODO use enforce_sorted=False and remove the previous sorting
        pack_sequence = pack_padded_sequence(embeds,  lengths=sorted_seq_length,  batch_first=True)
        #print("type(pack_sequence):{}".format(type(pack_sequence))) # <class 'torch.nn.utils.rnn.PackedSequence'>
        #print ('pack_sequence[0]:', pack_sequence[0])
        """
        pack_sequence[0]: tensor([[ 0.1470,  1.3863,  0.2156,  ..., -0.1568, -1.1045, -0.1400],
        [ 0.5597,  2.0953, -0.7236,  ..., -1.4103, -1.6798,  1.3055],
        [-0.1927, -0.9563, -0.0153,  ...,  1.2662, -0.6017, -0.1576],
        ...,
        [-1.6244,  1.0199, -0.1681,  ..., -0.7570, -0.9435, -0.4870],
        [-2.3151, -2.2364, -0.4231,  ...,  0.5323, -0.0363, -0.5891],
        [ 0.0935, -0.1610, -0.5200,  ...,  0.1851,  0.2965, -0.6004]],
       device='cuda:0', grad_fn=<PackPaddedSequenceBackward>)"""

        packed_output, _ = self.rnn(pack_sequence)
        #print("type(packed_output):{}".format(type(packed_output))) # <class 'torch.nn.utils.rnn.PackedSequence'>
        #print ('packed_output[0]:', packed_output[0])
        """
        packed_output[0]: tensor([[ 2.0873e-02,  1.0921e-01, -2.1166e-01,  ...,  4.7033e-02,
         -2.1772e-01, -6.2811e-01],
        [ 3.9952e-03,  1.4725e-01, -9.1979e-02,  ..., -2.1016e-01,
         -1.8077e-01, -1.3867e-01],
        [ 1.2813e-02,  4.0637e-02, -2.2237e-01,  ..., -1.9843e-01,
          4.1468e-02, -7.4167e-03],
        ...,
        [ 4.2123e-04,  1.6109e-01, -2.3425e-02,  ..., -7.5010e-02,
         -5.0942e-02,  2.3539e-04],
        [-3.6322e-01,  1.0884e-01, -1.7367e-01,  ..., -6.3288e-02,
         -3.7179e-02, -9.8569e-02],
        [ 1.0113e-02,  1.3696e-01, -3.8002e-02,  ..., -2.1368e-01,
         -7.6481e-02,  1.1498e-01]], device='cuda:0',
       grad_fn=<CudnnRnnBackward>)"""

        # Pads a packed batch of variable length sequences.
        # It is an inverse operation to pack_padded_sequence().
        # The returned Tensor's data will be of size T x B x *, where T is the length of the longest sequence and B is the batch size. 
        # If batch_first is True, the data will be transposed into B x T x * format.
        # https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html
        lstm_out, _ = pad_packed_sequence(packed_output, batch_first=True)
        #print("type(lstm_out):{}".format(type(lstm_out))) # <class 'torch.Tensor'>
        #print ('lstm_out[0]:', lstm_out[0])
        """
        lstm_out[0]: tensor([[ 0.0209,  0.1092, -0.2117,  ...,  0.0470, -0.2177, -0.6281],
        [ 0.1364,  0.2313, -0.1493,  ...,  0.1805, -0.1467, -0.2619],
        [ 0.0967,  0.1473, -0.0139,  ...,  0.0378, -0.2664, -0.3387],
        ...,
        [-0.3804,  0.0290,  0.0695,  ..., -0.0445,  0.1460,  0.1356],
        [-0.2751,  0.2326, -0.0762,  ..., -0.0467,  0.0303,  0.0645],
        [-0.2196,  0.1945,  0.0911,  ..., -0.1548, -0.1706,  0.0325]],
       device='cuda:0', grad_fn=<SelectBackward>)"""
        
        # invert the permutation: sorting perm_idx yields the indices that restore the original sentence order
        _, unperm_idx = perm_idx.sort()
        # print ('unperm_idx[0]:', unperm_idx[0])
        # unperm_idx[0]: tensor(644, device='cuda:0')
        lstm_out = lstm_out[unperm_idx, :]
        #print ('lstm_out[0]:', lstm_out[0])
        """
        lstm_out[0]: tensor([[-0.3566, -0.0670, -0.0603,  ..., -0.0220, -0.1860,  0.2224],
        [-0.1167, -0.1348,  0.0326,  ..., -0.1276, -0.2458, -0.1165],
        [-0.0043,  0.0479,  0.2782,  ...,  0.0799, -0.0694, -0.4641],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],
        device='cuda:0', grad_fn=<SelectBackward>)
        """
        return lstm_out, masks

    def loss(self, xs, tags):
        # compute the loss (delegates to the CRF loss)
        features, masks = self.__build_features(xs)
        loss = self.crf.loss(features, tags, masks=masks)
        return loss

    def forward(self, xs):
        # build the features (BiLSTM outputs) from the batch of sentences
        features, masks = self.__build_features(xs)
        # decode the best score and tag sequence with the CRF layer (Viterbi)
        scores, tag_seq = self.crf(features, masks)
        return scores, tag_seq
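
The TODO in `__build_features` mentions `enforce_sorted=False`. Here is a minimal sketch of what that variant could look like, assuming a toy batch of three sentences, a 10-word vocabulary and small layer sizes (all made-up values). With `enforce_sorted=False`, PyTorch sorts the batch and restores the original order itself, so the manual `perm_idx` / `unperm_idx` bookkeeping is no longer needed. The `.cpu()` call is there because recent PyTorch versions expect `lengths` on the CPU, which is the error quoted in the installation cell.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# toy batch of word indices, 0 is the padding index (assumed values)
sentences = torch.tensor([[4, 7, 0, 0],
                          [3, 9, 5, 2],
                          [8, 0, 0, 0]])
masks = sentences.gt(0)
seq_length = masks.sum(1)  # true length of each sentence: 2, 4, 1

embedding = nn.Embedding(10, 6)
rnn = nn.LSTM(6, 4, bidirectional=True, batch_first=True)

embeds = embedding(sentences)  # [3, 4, 6]

# no manual sorting: PyTorch records the permutation inside the PackedSequence
packed = pack_padded_sequence(embeds, seq_length.cpu(),
                              batch_first=True, enforce_sorted=False)
packed_out, _ = rnn(packed)

# the output comes back in the original batch order, padded with zeros
lstm_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(lstm_out.shape)  # torch.Size([3, 4, 8])
print(out_lengths)     # tensor([2, 4, 1])
```

Positions beyond each sentence's true length are filled with zeros in `lstm_out`, which matches the zero rows shown in the `lstm_out[0]` trace above.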

### YOUR TASK

In Google Colab, go to Tools > Settings and check "Show line numbers". Note that the next section, "Visualization", can help you better understand the network architecture through the `print` visualization.

* An [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer is a table that maps a vocabulary word (actually its numerical index) to an embedding vector. The embedding values are initially drawn at random. They can be overridden by loading embeddings pretrained with Word2Vec, GloVe or FastText, for example. By default (`embedding.weight.requires_grad = True`), these vectors are treated as model parameters and are fine-tuned during training (`train`) by backpropagation. Give the number of the line that defines the embedding layer and that of the line where the embeddings are initialized. 
More information on [embedding-in-pytorch](https://stackoverflow.com/questions/50747947/embedding-in-pytorch) (stackoverflow).
* The implementation offers two possible types of RNN cell. Give the line where this choice is made. Give the line that specifies the default choice.
* After the sentences are represented as embeddings and before they are passed to the RNN cell, what kind of processing is performed? Give the number of the line where this processing is specified. 
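
As an illustration of the first point, here is a minimal sketch of loading pretrained vectors into an embedding layer with `nn.Embedding.from_pretrained`. The 3 x 3 matrix is a made-up stand-in for real Word2Vec, GloVe or FastText vectors (an assumption for illustration), and the notebook's `BiRnnCrf` class does not do this itself.

```python
import torch
import torch.nn as nn

# hypothetical pretrained vectors: 3 words, dimension 3 (row i = vector of word index i)
pretrained = torch.tensor([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6]])

# freeze=False keeps the vectors trainable, so they are fine-tuned by backpropagation
fine_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
# freeze=True (the default of from_pretrained) keeps them fixed during training
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

word_ids = torch.tensor([[2, 1, 0]])  # one sentence of three word indices
vectors = fine_tuned(word_ids)        # shape [1, 3, 3]

print(fine_tuned.weight.requires_grad)  # True
print(frozen.weight.requires_grad)      # False
```

Whether to freeze is a design choice: freezing keeps the pretrained geometry intact and reduces the number of trained parameters, while fine-tuning lets the vectors adapt to the tagging task.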


## Visualizing a network in PyTorch

Two ways of visualizing a network are briefly presented here: `print` and the torchviz module. 

Visualization via a `print` of an object of the class. Here a test object is instantiated with test values.

In [6]:
dummy_vocab_size = 20000
dummy_tagset_size = 9
dummy_embedding_dim = 100 #100
dummy_hidden_dim = 128
dummy_num_rnn_layers = 1
dummy_rnn = "lstm"
dummy_batch_size = 1
dummy_max_seq_len = 100
dummy_model = BiRnnCrf(dummy_vocab_size, dummy_tagset_size, dummy_embedding_dim, dummy_hidden_dim, dummy_num_rnn_layers, dummy_rnn)
print (dummy_model)


BiRnnCrf(
  (embedding): Embedding(20000, 100)
  (rnn): LSTM(100, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)


Visualization of an embedding layer

In [7]:
dummy_device = 'cpu'
dummy_x = torch.randint(0, dummy_vocab_size, (dummy_batch_size, dummy_max_seq_len))
print (dummy_x)
#dummy_x = dummy_x.to(dummy_device).long()
dummy_x = dummy_x.long() 
print (dummy_x) 
print ('embedding layer:', dummy_model.embedding(dummy_x))
#print (dummy_model.parameters())

tensor([[ 5485,  3326, 16993, 13009, 18517,  2791,  8920, 14258,  7382, 11825,
         18780, 19377,  2827, 13978,  6567, 17210, 18112,  6251, 14769, 12898,
         11782, 18073, 19237,  9052, 12133,  3654, 17462,  8081,  7565, 10178,
          5267,  5240,  4077, 14258, 16145, 15828,  2134,  7583,   353, 17364,
         14090,  8934, 18676, 15896,  3222,  5077, 11283,   579,  5078, 18231,
          4049,  8922,   312, 18273, 15863,   665, 19537, 15753,  6760, 12053,
         12352, 16911, 17395, 19319,  5938, 13159,  8864,  1901,  2763, 13822,
         10850,   763, 12923,  5401,  2778,  3534,  8995, 19229, 19858,  2551,
         15993,  1525, 17059, 19567,  3277,  8811, 10362, 18099, 19650,  5774,
         15984,  4835, 19030,  1383, 16496,  3452,  7229, 12546, 10725,  4778]])
tensor([[ 5485,  3326, 16993, 13009, 18517,  2791,  8920, 14258,  7382, 11825,
         18780, 19377,  2827, 13978,  6567, 17210, 18112,  6251, 14769, 12898,
         11782, 18073, 19237,  9052, 12133,  3654,

**torchviz** requires a "forward" execution of the model with data (possibly dummy data) to produce a graph of the network.

The figure is unforgiving for the uninitiated. Do not waste time trying to understand it... Just remember that tools exist to help with visualization.


In [8]:
# generate a batch of instances to process (each of fixed dimension, truncated or padded if needed)
# torch.rand(s, d) returns a tensor of shape (s, d) filled with random numbers drawn uniformly from [0, 1);
# the model casts them to long, so every position maps to word index 0, which is enough to trace the graph
dummy_batch_size = 1
dummy_x = torch.rand(dummy_batch_size, 10)
print (dummy_x.shape)

# forward pass of the model
dummy_hyp = dummy_model(dummy_x)
print (type(dummy_hyp), len(dummy_hyp))
print (dummy_hyp[0].shape, len(dummy_hyp[1]))

# visualization (a png file is generated in the current directory)
!pip install torchviz
from torchviz import make_dot
make_dot(dummy_hyp[0], params=dict(list(dummy_model.named_parameters()))).render("rnn_torchviz", format="png")

torch.Size([1, 10])
<class 'tuple'> 2
torch.Size([1]) 1


'rnn_torchviz.png'

Further reading: [how-do-i-visualize-a-net-in-pytorch (stackoverflow)](https://stackoverflow.com/questions/52468956/how-do-i-visualize-a-net-in-pytorch).

---
## Implementation: data preprocessing


First, the "utils" methods for the preprocessing phase: saving and loading configuration files such as the vocabulary, the tagset, the parameters of the neural model (embedding dimension...), the generated data partitions...

Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/app/preprocessing/utils.py

In [13]:
import json

START_TAG = "<START>"
STOP_TAG = "<STOP>"

PAD = "<PAD>"
OOV = "<OOV>"


def save_json_file(obj, file_path):
    with open(file_path, "w", encoding="utf8") as f:
        f.write(json.dumps(obj, ensure_ascii=False))


def load_json_file(file_path):
    with open(file_path, encoding="utf8") as f:
        return json.load(f)

Next, the data preprocessing class, which is initialized with the directory holding the vocabulary and tagset files; the annotated data (sentences split into words, with their tags) are read from the corpus directory by `load_dataset`. Besides loading these configuration and data files, the class splits the data into training, validation and test sets (according to the default parameters or those given when calling the system). The data are also "vectorized": essentially, each word of a sentence is replaced by its numeric identifier, i.e. its entry in the given vocabulary.

Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/app/preprocessing/preprocess.py
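The vectorization step can be sketched on a toy vocabulary. The words and the helper name `to_ids` are hypothetical; the sketch only mirrors the general idea of the class (unknown words fall back to an `<OOV>` id, short sentences are right-padded with the `<PAD>` id 0, long ones are truncated).

```python
PAD, OOV = "<PAD>", "<OOV>"

# Toy vocabulary (hypothetical); PAD takes id 0 and OOV the last id
toy_words = ["Paris", "est", "grande"]
toy_vocab = [PAD] + toy_words + [OOV]
word_to_id = {w: i for i, w in enumerate(toy_vocab)}


def to_ids(sentence, max_seq_len):
    """Map tokens to ids, truncate to max_seq_len, then right-pad with PAD."""
    ids = [word_to_id.get(w, word_to_id[OOV]) for w in sentence[:max_seq_len]]
    return ids + [word_to_id[PAD]] * (max_seq_len - len(ids))


print(to_ids(["Paris", "est", "belle"], 5))                   # [1, 2, 4, 0, 0]
print(to_ids(["Paris", "est", "grande", "Paris", "est"], 4))  # [1, 2, 3, 1]
```

Every sentence ends up with exactly `max_seq_len` ids, which is what allows the whole corpus to be stored as one rectangular array.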

In [14]:
from os.path import join, exists
import numpy as np
from tqdm import tqdm
import torch

#FILE_VOCAB = "vocab.json"
#FILE_TAGS = "tags.json"
#FILE_DATASET = "dataset.txt"
#FILE_DATASET_CACHE = "dataset_cache_{}.npz"

class Preprocessor:
    def __init__(self, config_dir, save_config_dir=None, verbose=True):
        self.config_dir = config_dir
        self.verbose = verbose

        self.vocab, self.vocab_dict = self.__load_list_file(FILE_VOCAB, offset=1, verbose=verbose)
        print ('Debug: Preprocessor - __init__ - len(self.vocab):', len(self.vocab))
        print ('Debug: Preprocessor - __init__ - len(self.vocab_dict):', len(self.vocab_dict))

        self.tags, self.tags_dict = self.__load_list_file(FILE_TAGS, verbose=verbose)
        if save_config_dir:
            self.__save_config(save_config_dir)

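        self.PAD_IDX = 0
        self.__adjust_vocab()
        # Take the OOV id after __adjust_vocab so that it is the id actually assigned to <OOV>.
        # Before the adjustment, len(self.vocab) is also the id of the last vocabulary word
        # (word ids start at 1 because of offset=1), so unknown words would share its embedding.
        self.OOV_IDX = self.vocab_dict[OOV]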


    def __load_list_file(self, file_name, offset=0, verbose=False):
        file_path = join(self.config_dir, file_name)
        if not exists(file_path):
            raise ValueError('"{}" file does not exist.'.format(file_path))
        else:
            elements = load_json_file(file_path)
            elements_dict = {w: idx + offset for idx, w in enumerate(elements)}
            if verbose:
                print("config {} loaded".format(file_path))
            return elements, elements_dict


    def __adjust_vocab(self):
        self.vocab.insert(0, PAD)
        self.vocab_dict[PAD] = 0

        self.vocab.append(OOV)
        self.vocab_dict[OOV] = len(self.vocab) - 1
        print ('Debug: Preprocessor - __adjust_vocab - len(self.vocab):', len(self.vocab))
        print ('Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict):', len(self.vocab_dict))

    def __save_config(self, dst_dir):
        char_file = join(dst_dir, FILE_VOCAB)
        save_json_file(self.vocab, char_file)

        tag_file = join(dst_dir, FILE_TAGS)
        save_json_file(self.tags, tag_file)

        if self.verbose:
            print("tag dict file => {}".format(tag_file))
            print("tag dict file => {}".format(char_file))


    @staticmethod
    def __cache_file_path(corpus_dir, max_seq_len):
        return join(corpus_dir, FILE_DATASET_CACHE.format(max_seq_len))


    def load_dataset(self, corpus_dir, val_split, test_split, max_seq_len):
        """Load the corpus and split it into train, validation and test sets.

        With
        B the number of sentences in a given split
        L max_seq_len
        :return: [(xs_train, ys_train), (xs_val, ys_val), (xs_test, ys_test)]
            xs: [B, L] word ids
            ys: [B, L] tag ids
        """
        ds_path = self.__cache_file_path(corpus_dir, max_seq_len)
        if not exists(ds_path):
            print("building dataset {} ...".format(ds_path))
            xs, ys = self.__build_corpus(corpus_dir, max_seq_len)
        else:
            print("loading dataset {} ...".format(ds_path))
            dataset = np.load(ds_path)
            xs, ys = dataset["xs"], dataset["ys"]

        #print ('load_dataset')
        #print("type xs {}, ys {}".format(type(xs), type(ys)))
        # type xs <class 'numpy.ndarray'>, ys <class 'numpy.ndarray'>
        #print("shape xs {}, ys {}".format(xs.shape, ys.shape))
        # shape xs (132257, 100), ys (132257, 100)

        #  print ('load_dataset map torch.tensor')
        xs, ys = map(
            torch.tensor, (xs, ys)
        )
        #print("type xs {}, ys {}".format(type(xs), type(ys)))
        # type xs <class 'torch.Tensor'>, ys <class 'torch.Tensor'>
        #print("shape xs {}, ys {}".format(xs.shape, ys.shape))
        # shape xs torch.Size([132257, 100]), ys torch.Size([132257, 100])

        # split the dataset
        total_count = len(xs)
        assert total_count == len(ys)
        val_count = int(total_count * val_split)
        test_count = int(total_count * test_split)
        train_count = total_count - val_count - test_count
        assert train_count > 0 and val_count > 0

        indices = np.cumsum([0, train_count, val_count, test_count])
        datasets = [(xs[s:e], ys[s:e]) for s, e in zip(indices[:-1], indices[1:])]
        print("datasets loaded:")
        for (xs_, ys_), name in zip(datasets, ["train", "val", "test"]):
            print("\t{}: {}, {}".format(name, xs_.shape, ys_.shape))
        return datasets


    def decode_tags(self, batch_tags):
        batch_tags = [
            [self.tags[t] for t in tags]
            for tags in batch_tags
        ]
        return batch_tags


    def sent_to_vector(self, sentence, max_seq_len=0):
        max_seq_len = max_seq_len if max_seq_len > 0 else len(sentence)
        # debug
        #for c in sentence[:max_seq_len]:
        #  if not(c in self.vocab_dict):
        #    print ('c {} not in vocab'.format(c))
        vec = [self.vocab_dict.get(c, self.OOV_IDX) for c in sentence[:max_seq_len]]

        # padding happens here
        return vec + [self.PAD_IDX] * (max_seq_len - len(vec))


    def tags_to_vector(self, tags, max_seq_len=0):
        max_seq_len = max_seq_len if max_seq_len > 0 else len(tags)
        vec = [self.tags_dict[c] for c in tags[:max_seq_len]]

        # padding happens here (with tag id 0)
        return vec + [0] * (max_seq_len - len(vec))


    def __build_corpus(self, corpus_dir, max_seq_len):
        # remove cache files !!!
        file_path = join(corpus_dir, FILE_DATASET)
        xs, ys = [], []
        with open(file_path, encoding="utf8") as f:
            for idx, line in tqdm(enumerate(f), desc="parsing {}".format(file_path)):
                fields = line.strip().split("\t")
                if len(fields) != 2:
                    raise ValueError("format error in line {}, tabs count: {}".format(idx + 1, len(fields) - 1))

                sentence, tags = fields

                try:
                    if sentence[0] == "[":
                        sentence = json.loads(sentence)
                    tags = json.loads(tags)

                    #print ('Debug: sentence', sentence)
                    #print ('Debug: tags', tags)

                    xs.append(self.sent_to_vector(sentence, max_seq_len=max_seq_len))
                    ys.append(self.tags_to_vector(tags, max_seq_len=max_seq_len))
                    if len(sentence) != len(tags):
                        raise ValueError('"sentence length({})" != "tags length({})" in line {}"'.format(
                            len(sentence), len(tags), idx + 1))
                except Exception as e:
                    raise ValueError("exception raised when parsing line {}\n\t{}\n\t{}".format(idx + 1, line, e))

        #print ('__build_corpus')
        #print("type(xs):{}, type(ys):{}".format(type(xs), type(ys)))
        #print("len(xs):{}, len(ys):{}".format(len(xs), len(ys)))

        #print("shape xs {}, ys {}".format(xs.shape, ys.shape))

        #print ('__build_corpus asarray')
        xs, ys = np.asarray(xs), np.asarray(ys)
        #print("type xs {}, ys {}".format(type(xs), type(ys)))
        #print("shape xs {}, ys {}".format(xs.shape, ys.shape))

        # save train set
        cache_file = self.__cache_file_path(corpus_dir, max_seq_len)
        np.savez(cache_file, xs=xs, ys=ys)
        print("dataset cache({}, {}) => {}".format(xs.shape, ys.shape, cache_file))
        print ('xs', xs)
        print ('ys', ys)

        return xs, ys
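The split arithmetic of `load_dataset` can be checked by hand. The sketch below reproduces it without NumPy (`itertools.accumulate` stands in for `np.cumsum`); `split_bounds` is a hypothetical helper, not part of the original code. With the 132257 WikiNER sentences and 20% each for validation and test, it yields the 79355 training sentences reported in the shape comments.

```python
from itertools import accumulate


def split_bounds(total_count, val_split, test_split):
    """Return the (start, end) slice bounds of the train, val and test parts."""
    val_count = int(total_count * val_split)
    test_count = int(total_count * test_split)
    train_count = total_count - val_count - test_count
    bounds = list(accumulate([0, train_count, val_count, test_count]))
    return list(zip(bounds[:-1], bounds[1:]))


train_b, val_b, test_b = split_bounds(132257, 0.2, 0.2)
print(train_b, val_b, test_b)
# (0, 79355) (79355, 105806) (105806, 132257)
```

The corpus is sliced in order, without shuffling, so the three parts follow the order of the input file. With a training batch size of 1000, 79355 sentences give 80 batches per epoch, and the 26451 validation sentences in batches of 2000 give 14, which matches the `80/80` and `14/14` counters of the training progress bars.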

### YOUR TASK

* On which line are the sentences padded?

## Implementation: training the network

First, the implementation of the `build_model` method, which instantiates the network. The associated files (_model_ and _arguments_) are saved to _model_dir_ (or loaded from it if a previous training run has already taken place).

Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/app/utils.py

In [15]:
from os.path import exists, join
import torch

#FILE_ARGUMENTS = "arguments.json"
#FILE_MODEL = "model.pth"

def arguments_filepath(model_dir):
    return join(model_dir, FILE_ARGUMENTS)


def model_filepath(model_dir):
    return join(model_dir, FILE_MODEL)


def build_model(args, processor, load=True, verbose=False):

    print ('Debug: build_model - len(processor.vocab):', len(processor.vocab))
    # NH fix: rnn_type was not taken into account; added rnn=args['rnn_type']
    print ('Debug: build_model - BiRnnCrf')
    model = BiRnnCrf(len(processor.vocab), len(processor.tags), embedding_dim=args['embedding_dim'], hidden_dim=args['hidden_dim'], num_rnn_layers=args['num_rnn_layers'], rnn=args['rnn_type'])
    print ('Debug: build_model - model:', model)

    # weights
    model_path = model_filepath(args['model_dir'])
    if exists(model_path) and load:
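        # map_location lets weights saved on a GPU be loaded on a CPU-only machine;
        # the caller moves the model to its target device afterwards
        state_dict = torch.load(model_path, map_location="cpu")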
        model.load_state_dict(state_dict)
        if verbose:
            print("load model weights from {}".format(model_path))
    return model


def running_device(device):
    if torch.cuda.is_available():
        print ('running_device gpu')
    else:
        print ('running_device cpu')
    # use the requested device if any, otherwise the GPU when one is available
    return device if device else torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Next, the methods dedicated to training the model.

Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/app/train.py
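The evaluation helper defined in the next cell weights each batch loss by the batch size before averaging, so that a smaller final batch does not count as much as a full one. A minimal sketch with made-up loss values, assuming that `model.loss` returns a mean over the batch (an assumption, since the model code is not shown here); `weighted_mean_loss` is a hypothetical name:

```python
def weighted_mean_loss(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Made-up values: two full batches of 1000 and a last batch of 500
losses = [4.0, 6.0, 8.0]
sizes = [1000, 1000, 500]

print(weighted_mean_loss(losses, sizes))  # 5.6
print(sum(losses) / len(losses))          # 6.0 (unweighted: the small batch counts too much)
```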

In [16]:
from os import mkdir
from tqdm import tqdm
import pandas as pd
import numpy as np
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

def __eval_model(model, device, dataloader, desc):
    model.eval()
    with torch.no_grad():
        # eval
        losses, nums = zip(*[
            (model.loss(xb.to(device), yb.to(device)), len(xb))
            for xb, yb in tqdm(dataloader, desc=desc)])
        return np.sum(np.multiply(losses, nums)) / np.sum(nums)


def __save_loss(losses, file_path):
    pd.DataFrame(data=losses, columns=["epoch", "batch", "train_loss", "val_loss"]).to_csv(file_path, index=False)


def __save_model(model_dir, model):
    model_path = model_filepath(model_dir)
    torch.save(model.state_dict(), model_path)
    print("save model => {}".format(model_path))


def train(args):
    model_dir = args['model_dir']
    if not exists(model_dir):
        mkdir(model_dir)
#    save_json_file(vars(args), arguments_filepath(model_dir))
    save_json_file(args, arguments_filepath(model_dir))

    print ('Debug: train - Preprocessor')
    preprocessor = Preprocessor(config_dir=args['corpus_dir'], save_config_dir=args['model_dir'], verbose=True)
    
    print ('Debug: train - build_model')
    model = build_model(args, preprocessor, load=args['recovery'], verbose=True)

    print ('Debug: train - model:', model)

    # loss
    loss_path = join(args['model_dir'], "loss.csv")
    losses = pd.read_csv(loss_path).values.tolist() if args['recovery'] and exists(loss_path) else []

    # datasets
    (x_train, y_train), (x_val, y_val), (x_test, y_test) = preprocessor.load_dataset(
        args['corpus_dir'], args['val_split'], args['test_split'], max_seq_len=args['max_seq_len'])
    
    print ('train')
    print("type x_train {}, y_train {}".format(type(x_train), type(y_train)))
    print("shape x_train {}, y_train {}".format(x_train.shape, y_train.shape))
    # shape x_train torch.Size([79355, 100]), y_train torch.Size([79355, 100])
    train_dl = DataLoader(TensorDataset(x_train, y_train), batch_size=args['batch_size'], shuffle=True)
    valid_dl = DataLoader(TensorDataset(x_val, y_val), batch_size=args['batch_size'] * 2)
    test_dl = DataLoader(TensorDataset(x_test, y_test), batch_size=args['batch_size'] * 2)

    # initialize the optimizer specifying what parameters (tensors) of the model should be updated (through the backward process)
    optimizer = optim.Adam(model.parameters(), lr=args['lr'], weight_decay=args['weight_decay'])

    device = running_device(args['device'])

    # FIXME
    model.to(device)

    val_loss = 0
    best_val_loss = 1e4
    for epoch in range(args['num_epoch']):
        # train
        model.train()
        bar = tqdm(train_dl)
        for bi, (xb, yb) in enumerate(bar):

            # PyTorch accumulates (i.e. sums) gradients across backward passes, i.e. each time .backward() is called on the loss tensor.
            # Resetting the gradients of all the model parameters (W, b) at the start of each iteration keeps the gradients of previous batches out of the update;
            # otherwise the update would point in a direction other than the intended one towards the minimum (or the maximum, for a maximization objective).
            model.zero_grad()

            # Compute the loss
            loss = model.loss(xb.to(device), yb.to(device))
            
            # Backpropagation
            # Compute the gradients of the loss w.r.t. the parameters (tensors)
            # The gradients are stored on the tensors themselves (in their grad attribute, for tensors whose requires_grad is True)
            #print ('before loss.backward()', xb.grad)
            loss.backward()
            # print ('after loss.backward()', xb.grad)

            # Update the parameters
            # The optimizer iterates over all the parameters (tensors) it was given and uses their stored grad to update their values.
            optimizer.step()
            
            bar.set_description("{:2d}/{} loss: {:5.2f}, val_loss: {:5.2f}".format(
                epoch+1, args['num_epoch'], loss, val_loss))
            losses.append([epoch, bi, loss.item(), np.nan])

        # evaluation
        val_loss = __eval_model(model, device, dataloader=valid_dl, desc="eval").item()
        # save losses
        losses[-1][-1] = val_loss
        __save_loss(losses, loss_path)

        # save model
        if not args['save_best_val_model'] or val_loss < best_val_loss:
            best_val_loss = val_loss
            __save_model(args['model_dir'], model)
            print("save model(epoch: {}) => {}".format(epoch, loss_path))

    # test
    test_loss = __eval_model(model, device, dataloader=test_dl, desc="test").item()
    last_loss = losses[-1][:]
    last_loss[-1] = test_loss
    losses.append(last_loss)
    __save_loss(losses, loss_path)
    print("training completed. test loss: {:.2f}".format(test_loss))
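The comment about gradient accumulation in the training loop can be verified on a single scalar parameter. This is only an illustrative sketch with hypothetical names, not part of the training code:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # gradient of 3*w w.r.t. w

(3 * w).backward()       # no reset: the new gradient is added to the stored one
second = w.grad.item()

w.grad.zero_()           # reset, the role played by model.zero_grad() in the loop
(3 * w).backward()
third = w.grad.item()

print(first, second, third)  # 3.0 6.0 3.0
```

Without the reset, the second update would use a gradient twice as large as the one computed for the current batch.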



## Implementation: prediction with the network


The WordsTagger class performs the actual tagging of a new sequence of words. It requires the path to a model directory.

Source: https://github.com/jidasheng/bi-lstm-crf/blob/master/bi_lstm_crf/app/predict.py

In [17]:
import numpy as np

class WordsTagger:
    def __init__(self, model_dir, device=None):
        args = load_json_file(arguments_filepath(model_dir))
        #args = dict()
        args['model_dir'] = model_dir
        self.args = args

        self.preprocessor = Preprocessor(config_dir=model_dir, verbose=False)
        self.model = build_model(self.args, self.preprocessor, load=True, verbose=False)
        self.device = running_device(device)
        self.model.to(self.device)

        self.model.eval()

    def __call__(self, sentences, begin_tags="BS"):
        """predict texts
        :param sentences: a list of sentences, each given as a list of tokens (or as a string for a character-based model)
        :param begin_tags: begin tags for the beginning of a span
        :return:
        """
        if not isinstance(sentences, (list, tuple)):
            raise ValueError("sentences must be a list of sentence")

        try:
            sent_tensor = np.asarray([self.preprocessor.sent_to_vector(s) for s in sentences])
            sent_tensor = torch.from_numpy(sent_tensor).to(self.device)
            with torch.no_grad():
                _, tags = self.model(sent_tensor)
            tags = self.preprocessor.decode_tags(tags)
        except RuntimeError as e:
            print("*** runtime error: {}".format(e))
            raise e
        return tags, self.tokens_from_tags(sentences, tags, begin_tags=begin_tags)

    @staticmethod
    def tokens_from_tags(sentences, tags_list, begin_tags):
        """extract entities from tags
        :param sentences: a list of sentence
        :param tags_list: a list of tags
        :param begin_tags:
        :return:
        """
        if not tags_list:
            return []

        def _tokens(sentence, ts):
            # begins: [(idx, label), ...]
            all_begin_tags = begin_tags + "O"
            begins = [(idx, t[2:]) for idx, t in enumerate(ts) if t[0] in all_begin_tags]
            begins = [
                         (idx, label)
                         for idx, label in begins
                         if ts[idx] != "O" or (idx > 0 and ts[idx - 1] != "O")
                     ] + [(len(ts), "")]

            tokens_ = [(sentence[s:e], label) for (s, label), (e, _) in zip(begins[:-1], begins[1:]) if label]
            return [((t, tag) if tag else t) for t, tag in tokens_]

        tokens_list = [_tokens(sentence, ts) for sentence, ts in zip(sentences, tags_list)]
        return tokens_list
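To make the role of `tokens_from_tags` more concrete, here is a simplified, standalone sketch of span extraction from BIO tags. `bio_spans` is a hypothetical helper and is not equivalent to the method above (it ignores the `S` begin tag and only handles lists of tokens). Its convention is that a span starts at `B-X`, or at an `I-X` that does not continue a span of the same type, and extends over the following `I-X` tags.

```python
def bio_spans(tokens, tags):
    """Group tokens into (tokens, label) spans according to their BIO tags."""
    spans = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        # close the open span when the current tag does not continue it
        if current and not continues:
            spans.append((current, label))
            current, label = [], None
        if tag != "O":
            current.append(token)
            label = tag_label
    if current:
        spans.append((current, label))
    return spans


toy_tokens = ["George", "W.", "Bush", "fut", "président", "des", "États-Unis"]
toy_tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_spans(toy_tokens, toy_tags))
# [(['George', 'W.', 'Bush'], 'PER'), (['États-Unis'], 'LOC')]
```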



## Implementation: evaluation method

In [18]:
# Measures definition
from sklearn.metrics import classification_report

def results_per_class(labels, y_ref, y_hyp):
  # Inspect per-class results in more detail:
  sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
  )
  # print ('y_ref', len(y_ref), 'y_hyp', len(y_hyp), 'sorted_labels', len(sorted_labels))
  return classification_report(flatten(y_ref), flatten(y_hyp), labels=sorted_labels, digits=3)

import re 
def normalise_labels(sentences):
  # normalize the NER labels output by the different
  # systems so that they can be compared
  new_sentences = list()
  for sentence in sentences:
    new_sentence = list()
    for label in sentence:
      if label != 'O':
        label = re.sub('^[A-Z]-','', label)
      new_sentence.append(label)
    new_sentences.append(new_sentence)
  return new_sentences
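`results_per_class` relies on a `flatten` helper that is not defined in this cell. Assuming it simply concatenates the per-sentence label lists into one flat list (an assumption; the actual definition may differ), it could be sketched as follows, together with the effect of stripping the `B-`/`I-` prefixes as `normalise_labels` does. `strip_bio_prefix` is a hypothetical name.

```python
import re
from itertools import chain


def flatten(list_of_lists):
    """Concatenate a list of lists into a single flat list (assumed behaviour)."""
    return list(chain.from_iterable(list_of_lists))


def strip_bio_prefix(label):
    """Remove a leading 'B-' or 'I-' style prefix, leaving 'O' untouched."""
    return label if label == "O" else re.sub(r"^[A-Z]-", "", label)


hyp = [["I-PER", "I-PER", "O"], ["B-LOC", "O"]]
normalized = [[strip_bio_prefix(label) for label in sentence] for sentence in hyp]

print(normalized)           # [['PER', 'PER', 'O'], ['LOC', 'O']]
print(flatten(normalized))  # ['PER', 'PER', 'O', 'LOC', 'O']
```

Flat lists are what `classification_report` expects: one reference label and one predicted label per token.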


---
# Preparing the training and test corpora



## WikiNER training data and required configuration files

Download the WikiNER training data and prepare the configuration files: vocabulary, tags and data in the format expected by the code used here.

After running the cell, look in the `data` directory for the tagset, vocab and txt files produced for the NER system defined above.
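Each line of the exported corpus file holds two JSON arrays separated by a tab: the tokens and their tags. A minimal round trip on one toy sentence (the sentence itself is made up) shows the format written by the export loop in the cell below and parsed by `__build_corpus`:

```python
import json

tokens = ["Il", "vit", "à", "Nantes", "."]
tags = ["O", "O", "O", "I-LOC", "O"]

# Writing: JSON tokens, a tab, JSON tags, a newline
line = "{}\t{}\n".format(json.dumps(tokens), json.dumps(tags))

# Reading: split on the tab, then decode both JSON arrays
sentence_field, tags_field = line.strip().split("\t")
parsed_tokens = json.loads(sentence_field)
parsed_tags = json.loads(tags_field)

print(parsed_tokens == tokens, parsed_tags == tags)  # True True
```

Because `json.dumps` escapes any tab inside a token, the tab separator stays unambiguous.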

In [25]:
!mkdir -p data 
!wget -nc https://github.com/nicolashernandez/teaching_nlp/raw/main/data/wikiner_ud.joblib.bz2 -P data
!bzip2 -dk data/wikiner_ud.joblib.bz2

# Loading the corpus 
from joblib import load
try:
    wikiner_corpus
except NameError:
    # load the corpus only once per session
    wikiner_corpus = load('data/wikiner_ud.joblib')

# Overview: the number of sentences, then one annotated sentence (a list of tokens, each made of the surface form, the part of speech and the BIO tag of the named entity).
print ('#sentences: ', len(wikiner_corpus))
# 132257

# Build the word vocabulary and the tagset
vocab = set()
tagset = set()
for s in wikiner_corpus:
  for w,p,n in s:
    # case normalization_to_lower()
    #vocab.add(w.lower())
    # case no normalization
    vocab.add(w)
    tagset.add(n)
print ('#vocab: ', len(vocab))
print ('#tags in tagset: ', len(tagset))
print ('first sentence:', wikiner_corpus[0]) 
print ('tagset: ', tagset)
# {'B-LOC', 'B-ORG', 'I-ORG', 'B-MISC', 'I-MISC', 'I-LOC', 'B-PER', 'I-PER', 'O'}


# export the corpus (sentences and tags) in the format expected by the data loader of the bi_lstm_crf application
import json
with open('data/wikiner_corpus.txt', 'w', encoding='utf-8') as f:
    for i, line in enumerate(wikiner_corpus):
      sentence = list()
      tags = list()    
      for w,p,n in line:
        # case normalization_to_lower()
        #sentence.append(w.lower())
        # case no normalization
        sentence.append(w) 

        tags.append(n)
      f.write('{}\t{}\n'.format(json.dumps(sentence), json.dumps(tags)))
      # NH small corpus
      # if (i>2): break


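# export the tagset in the bi_lstm_crf format
# (sorted so that tag ids stay the same across sessions: set order is not stable,
# and a dataset cache built in a previous session would otherwise be mislabeled)
with open('data/wikiner_corpus_tagset.json', 'w', encoding='utf-8') as f:
    json.dump(sorted(tagset), f, ensure_ascii=False)


# export the vocab in the bi_lstm_crf format (sorted for the same reason)
with open('data/wikiner_corpus_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(sorted(vocab), f, ensure_ascii=False)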


File ‘data/wikiner_ud.joblib.bz2’ already there; not retrieving.

bzip2: Output file data/wikiner_ud.joblib already exists.
#sentences:  132257
#vocab:  108023
#tags in tagset:  9
first sentence: [('Il', 'PRON', 'O'), ('assure', 'VERB', 'O'), ('à', 'ADP', 'O'), ('la', 'DET', 'O'), ('suite', 'NOUN', 'O'), ('de', 'ADP', 'I-PER'), ('Saussure', 'NOUN', 'I-PER'), ('le', 'DET', 'O'), ('cours', 'NOUN', 'O'), ('de', 'ADP', 'O'), ('grammaire', 'ADJ', 'O'), ('comparée', 'VERB', 'O'), (',', 'PUNCT', 'O'), ("qu'", 'SCONJ', 'O'), ('il', 'PRON', 'O'), ('complète', 'VERB', 'O'), ('à', 'ADP', 'O'), ('partir', 'VERB', 'O'), ('de', 'ADP', 'O'), ('1894', 'NUM', 'O'), ('par', 'ADP', 'O'), ('une', 'DET', 'O'), ('conférence', 'NOUN', 'O'), ('sur', 'ADP', 'O'), ("l'", 'DET', 'O'), ('iranien', 'NOUN', 'O'), ('.', 'PUNCT', 'O')]
tagset:  {'I-MISC', 'O', 'I-LOC', 'I-PER', 'I-ORG', 'B-PER', 'B-LOC', 'B-ORG', 'B-MISC'}


Declaration of the file names for the vocabulary, the tagset and the annotated corpus. The data and model directories themselves are set a little further down, in `args`.

In [26]:
FILE_VOCAB = "wikiner_corpus_vocab.json"
FILE_TAGS = "wikiner_corpus_tagset.json"
FILE_DATASET = "wikiner_corpus.txt"

Declaration of the names of the files that will be generated: dataset cache, training arguments and model weights.

In [27]:
FILE_DATASET_CACHE = "dataset_cache_{}.npz"
FILE_ARGUMENTS = "arguments.json"
FILE_MODEL = "model.pth"

## WiNER test data

Data preparation, in particular restricting the set of labels considered in the evaluation.

In [32]:
# load the test corpus
!mkdir -p data
!wget -nc https://github.com/nicolashernandez/teaching_nlp/raw/main/data/winer_dev.joblib -P data
from joblib import load
winer_corpus = load('data/winer_dev.joblib')

# get the tokens of each text,
# i.e. the surface form of every word of every sentence
# case normalization_to_lower()
#winer_tokens = [[token.lower() for token, pos, label in text] for text in winer_corpus]
# case no normalization
winer_tokens = [[token for token, pos, label in text] for text in winer_corpus]

# list the label of every word of every sentence
winer_ref = [[label for token, pos, label in text] for text in winer_corpus]
labels = list(set(flatten(winer_ref)))

#
print ('#texts:', len(winer_corpus))
print ('labels:', labels)

print ('sample of annotated texts:', winer_corpus[0])   
print ('sample of tokenized text:', winer_tokens[0])   

# 'O' labels are far more frequent than the others in the corpus,
# but we are more interested in the other entities.
# To avoid biasing the averaged scores, we remove the labels we are not interested in.
print ("before removing:", labels)
labels_to_remove = ['O', 'Event', 'Date', 'Hour']
for l in labels_to_remove:
  if l in labels: labels.remove(l)
print ("after removing:", labels)

File ‘data/winer_dev.joblib’ already there; not retrieving.

#texts: 600
labels: ['ORG', 'Event', 'LOC', 'O', 'PER', 'MISC', 'Hour', 'Date']
sample of annotated texts: [('Catch', 'NOUN', 'O'), (':', 'PUNCT', 'O'), ('décès', 'NOUN', 'O'), ('de', 'ADP', 'O'), ('Bobby', 'PROPN', 'PER'), ('Heenan', 'NOUN', 'PER'), ('18', 'NUM', 'Date'), ('septembre', 'NOUN', 'Date'), ('2017', 'NUM', 'Date'), ('.', 'PUNCT', 'O'), ('–', 'PUNCT', 'O'), ('Le', 'DET', 'O'), ('manager', 'NOUN', 'O'), ('et', 'CCONJ', 'O'), ('commentateur', 'NOUN', 'O'), ('de', 'ADP', 'O'), ('catch', 'NOUN', 'O'), ('Bobby', 'PROPN', 'PER'), ('Heenan', 'NOUN', 'PER'), ('est', 'AUX', 'O'), ('mort', 'VERB', 'O'), ('hier', 'ADV', 'O'), ('à', 'ADP', 'O'), ("l'âge", 'ADV', 'O'), ('de', 'ADP', 'O'), ('73', 'NUM', 'O'), ('ans.', 'NOUN', 'O'), ('Il', 'PRON', 'O'), ('est', 'AUX', 'O'), ('célèbre', 'ADJ', 'O'), ('pour', 'ADP', 'O'), ('son', 'DET', 'O'), ('travail', 'NOUN', 'O'), ('en', 'ADP', 'O'), ('tant', 'ADV', 'O'), ('que', '

# Experiment: a network with random embeddings

The terms trained network and model are used interchangeably here.


## Bi-LSTM CRF layer

This experiment relies on the Bi-LSTM CRF layer defined above, which does not load any pretrained embedding model and therefore starts from randomly initialized embeddings for every word of the vocabulary.

## Training the model

⚠️ Warning: the implementation will try to load an existing configuration from the specified model directory. If you change the settings, either delete the model-specific files or specify a new directory for the new model.
The error `RuntimeError: Error(s) in loading state_dict for BiRnnCrf:` is raised when you launch a training run (`train`) after changing parameters so that they no longer match a configuration already present in the `model_dir` directory.
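The error can be reproduced in isolation. In this sketch (hypothetical sizes, with a bare `nn.Embedding` standing in for the full model), weights saved with an embedding dimension of 4 are loaded into a layer built with dimension 8:

```python
import torch.nn as nn

saved_state = nn.Embedding(10, 4).state_dict()  # "old" configuration
new_layer = nn.Embedding(10, 8)                 # "new" configuration

try:
    new_layer.load_state_dict(saved_state)
    error_message = None
except RuntimeError as e:
    error_message = str(e)

# The message names the module and reports a size mismatch for its weight
print(error_message)
```

The same happens with `BiRnnCrf` when, for instance, `embedding_dim` or `hidden_dim` differs from the values used to produce the saved `model.pth`.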

In [18]:
args = dict()
args['corpus_dir'] = "data"  # the corpus directory
args['model_dir'] = "model_wikiner_vanilla"       # the output directory for model files
args['num_epoch'] = 5 # 5 25 50 500                # number of epochs to train
args['lr'] = 1e-3                     # learning rate
args['weight_decay'] = 0.             # the L2 regularization (weight decay) coefficient
args['batch_size'] = 1000             # batch size for training
args['device'] = None                 # the training device: "cuda:0", "cpu". It will be auto-detected by default
args['max_seq_len'] = 100 # 100              # max sequence length within training
args['val_split'] = 0.2                  # the split for the validation dataset
args['test_split'] = 0.2                 # the split for the testing dataset
args['recovery'] = True               # continue to train from the saved model in model_dir
args['save_best_val_model'] = True    # save the model whose validation loss is smallest
args['embedding_dim'] = 200 # 100 300 500           # the dimension of the embedding layer

args['hidden_dim'] = 128              # the dimension of the RNN hidden state
args['num_rnn_layers'] = 1 # 1            # the number of RNN layers
args['rnn_type'] = "lstm"              # RNN type, choice: "lstm", "gru"
# print(args)

#
import time
start_time = time.time()

!rm -r model_*
train(args)

print("--- %s seconds ---" % (time.time() - start_time))
# --- 162.1955807209015 seconds --- gpu 5 epochs max_seq_len 100 embedding_dim 100 num_rnn_layers 1 val_loss:  4.47 test_loss: 4.27
# 1056/10000 loss: -0.00, val_loss: 20.14:  90%|█████████ | 72/80 [00:12<00:01,  5.97it/s] ~ 4 h 08 min 35 s, then the session disconnected
# --- 1344.2705941200256 seconds --- gpu 100 epochs max_seq_len 100 embedding_dim 300 num_rnn_layers 1 val_loss:   10.35 test_loss: 11.29 with the training loss at 0.02 as early as epoch 80



Debug: train - Preprocessor
config data/wikiner_corpus_vocab.json loaded
Debug: Preprocessor - __init__ - len(self.vocab): 108023
Debug: Preprocessor - __init__ - len(self.vocab_dict): 108023
config data/wikiner_corpus_tagset.json loaded
tag dict file => model_wikiner_vanilla/wikiner_corpus_tagset.json
tag dict file => model_wikiner_vanilla/wikiner_corpus_vocab.json
Debug: Preprocessor - __adjust_vocab - len(self.vocab): 108025
Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict): 108025
Debug: train - build_model
Debug: build_model - len(processor.vocab): 108025
Debug: build_model - BiRnnCrf
Debug: build_model - model: BiRnnCrf(
  (embedding): Embedding(108025, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)
Debug: train - model: BiRnnCrf(
  (embedding): Embedding(108025, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=

 1/5 loss: 12.69, val_loss:  0.00: 100%|██████████| 80/80 [00:12<00:00,  6.54it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 15.03it/s]


save model => model_wikiner_vanilla/model.pth
save model(epoch: 0) => model_wikiner_vanilla/loss.csv


 2/5 loss:  9.55, val_loss: 12.31: 100%|██████████| 80/80 [00:11<00:00,  6.69it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 14.91it/s]


save model => model_wikiner_vanilla/model.pth
save model(epoch: 1) => model_wikiner_vanilla/loss.csv


 3/5 loss:  8.35, val_loss:  9.42: 100%|██████████| 80/80 [00:12<00:00,  6.53it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 14.10it/s]


save model => model_wikiner_vanilla/model.pth
save model(epoch: 2) => model_wikiner_vanilla/loss.csv


 4/5 loss:  6.06, val_loss:  7.41: 100%|██████████| 80/80 [00:12<00:00,  6.57it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 19.61it/s]


save model => model_wikiner_vanilla/model.pth
save model(epoch: 3) => model_wikiner_vanilla/loss.csv


 5/5 loss:  4.84, val_loss:  6.43: 100%|██████████| 80/80 [00:12<00:00,  6.53it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 14.74it/s]


save model => model_wikiner_vanilla/model.pth
save model(epoch: 4) => model_wikiner_vanilla/loss.csv


test: 100%|██████████| 14/14 [00:00<00:00, 14.99it/s]

training completed. test loss: 6.25
--- 73.49695992469788 seconds ---





#### YOUR TASK

* If you have time, compare "cpu" and "gpu" execution by looking at the approximate time reported for training over 1 epoch.
* With the default parameters, what loss do you get on the validation data after the last training epoch? And on the test data?



## Running the model for prediction

Prediction on an example sentence

In [20]:
bilstmcrf_model = WordsTagger(model_dir="model_wikiner_vanilla")
tags, sequences = bilstmcrf_model([['George', 'W.', 'Bush', 'fut', 'président', 'des', 'États-Unis', "d'", 'Amérique', '.']])  # word-level input: one list of tokens per sentence
print(tags)  

Debug: Preprocessor - __init__ - len(self.vocab): 108023
Debug: Preprocessor - __init__ - len(self.vocab_dict): 108023
Debug: Preprocessor - __adjust_vocab - len(self.vocab): 108025
Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict): 108025
Debug: build_model - len(processor.vocab): 108025
Debug: build_model - BiRnnCrf
Debug: build_model - model: BiRnnCrf(
  (embedding): Embedding(108025, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)
running_device gpu
[['I-LOC', 'I-PER', 'I-ORG', 'B-MISC', 'I-ORG', 'B-MISC', 'I-ORG', 'B-MISC', 'I-ORG', 'B-MISC']]
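
A flat list of tags is hard to read next to the tokens. As a minimal sketch, tokens and tags can be grouped into entity spans. The helper name `iob_to_spans` is hypothetical (it is not part of the notebook or of `bi_lstm_crf`), and the sketch assumes the usual IOB convention in which `B-X` or a change of entity type starts a new entity and `O` marks a non-entity token. The tags in the usage example are made up for illustration; they are not the model output shown above.

```python
def iob_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans.

    Hypothetical helper, assuming IOB-style tags such as 'B-PER', 'I-PER', 'O'.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or not entity_type or starts_new:
            # close the entity being built, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I") and entity_type:
            current_type = entity_type
            current_tokens.append(token)
    # close a trailing entity at the end of the sentence
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


example_tokens = ['George', 'W.', 'Bush', 'fut', 'président', 'des', 'États-Unis', "d'", 'Amérique', '.']
# illustrative tags only (not produced by the model)
example_tags = ['I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'I-LOC', 'I-LOC', 'I-LOC', 'O']
example_spans = iob_to_spans(example_tokens, example_tags)
print(example_spans)
```

Applied to the actual prediction above, such a grouping makes it immediately visible that a model trained for 5 epochs with default settings produces mostly spurious entities.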


Prediction on WiNER

In [21]:
# predict
#from bi_lstm_crf.app import WordsTagger

import time
start_time = time.time()

bilstmcrf_model = WordsTagger(model_dir="model_wikiner_vanilla") #_vanilla

bilstmcrf_hyp = []
# for each WiNER sentence
for text in winer_tokens:
    tags, sequences = bilstmcrf_model([text])    
    bilstmcrf_hyp.append(tags[0])
    #print (tags)
    #break

#
print("--- %s seconds ---" % (time.time() - start_time))
# --- 40.24239158630371 seconds ---
# --- 144.3440752029419 seconds ---
# --- 33.92570495605469 seconds ---

# normalize the hyp labels
print()
print ('bilstmcrf_hyp', bilstmcrf_hyp[0])
normalized_bilstmcrf_hyp = normalise_labels(bilstmcrf_hyp)
print ('normalized_bilstmcrf_hyp', normalized_bilstmcrf_hyp[0])
print ('winer_ref', winer_ref[0])
print()

# Evaluate on data 
print (args)
print()
print (results_per_class(labels, winer_ref, normalized_bilstmcrf_hyp))

Debug: Preprocessor - __init__ - len(self.vocab): 108023
Debug: Preprocessor - __init__ - len(self.vocab_dict): 108023
Debug: Preprocessor - __adjust_vocab - len(self.vocab): 108025
Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict): 108025
Debug: build_model - len(processor.vocab): 108025
Debug: build_model - BiRnnCrf
Debug: build_model - model: BiRnnCrf(
  (embedding): Embedding(108025, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)
running_device gpu
--- 18.573519468307495 seconds ---

bilstmcrf_hyp ['B-MISC', 'B-MISC', 'I-ORG', 'B-MISC', 'I-ORG', 'B-MISC', 'B-ORG', 'B-ORG', 'B-ORG', 'B-ORG', 'B-ORG', 'B-ORG', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-MISC', 'B-ORG', 'B-ORG', 'B-ORG', 'B-MISC', 'B-MISC', 'I-ORG', 'B-MISC', 'I-ORG', 'B-MISC', 'B-MISC', 'I-ORG', 'B-MISC', 'B-MI

---
### YOUR TASK

* Record the performance measures of the prediction on the test data obtained with the model trained with the default hyperparameters. Without changing the hyperparameters, re-run the cells from training through prediction a second time. Do you get the same results? Why? What would you recommend to make your observations more reliable?
* By default, the words of the WikiNER training corpus, the vocabulary derived from it, and the words of the WiNER test corpus are not normalized. Lowercase these three kinds of data. Train and evaluate this new model. What can be said about performance with and without lowercasing?
* Experiment with the training hyperparameters such as the number of epochs, the embedding dimension (embedding_dim), the number of RNN layers (num_rnn_layers), and the RNN cell type (rnn_type). Determine the contribution of each parameter. Discuss performance in terms of precision, recall and micro/macro-F1. If you have time, you can also vary the maximum sentence length (max_seq_len), the dimension of the RNN hidden state (hidden_dim), and the learning rate (lr).
* In your experiments, do you hit any limits of the hardware provided by Google Colab?
* Report on the different models you have implemented (including the pure CRF one).

Below are a few pointers on how to use pretrained models with PyTorch:
* https://stackoverflow.com/questions/49710537/pytorch-gensim-how-to-load-pre-trained-word-embeddings
* https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76 
* https://towardsdatascience.com/deep-learning-for-nlp-with-pytorch-and-torchtext-4f92d69052f


## Execution results

This section reports execution outputs for a few hyperparameter configurations. In that sense, it (partially) answers some of the questions above. Look at it if you have time. Note that, for a given configuration, three runs were generally performed to consolidate the observation.
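
Runs with identical hyperparameters differ because the network weights (and, here, the embedding vectors) are initialized at random. A minimal sketch of fixing the seeds before building the model is given below. It assumes that Python's `random` module and PyTorch are the only sources of randomness in the pipeline, which is not verified for `bi_lstm_crf`; if NumPy is involved, `numpy.random.seed` would be needed as well. Some GPU kernels remain non-deterministic even with fixed seeds. The function name `set_seed` is chosen here for illustration.

```python
import random

import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    # silently ignored when CUDA is not available
    torch.cuda.manual_seed_all(seed)


# two draws after the same seed give the same values
set_seed(0)
first_draw = torch.randn(3)
set_seed(0)
second_draw = torch.randn(3)
print(torch.equal(first_draw, second_draw))
```

Fixing a seed makes a single run repeatable; it does not replace reporting several runs (with different seeds) when comparing configurations.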

`running_device gpu`

**Default hyperparameters; training, vocabulary and evaluation without normalization.**

{'corpus_dir': 'data', 'model_dir': 'model_wikiner_vanilla', 'num_epoch': 5, 'lr': 0.001, 'weight_decay': 0.0, 'batch_size': 1000, 'device': None, 'max_seq_len': 100, 'val_split': 0.2, 'test_split': 0.2, 'recovery': 'store_true', 'save_best_val_model': 'store_true', 'embedding_dim': 100, 'hidden_dim': 128, 'num_rnn_layers': 1, 'rnn_type': 'lstm'}

--- 19.388370513916016 seconds ---

```
Run 1
              precision    recall  f1-score   support

         PER      0.026     0.141     0.044      4483
        MISC      0.003     0.056     0.006       443
         LOC      0.028     0.047     0.035      4724
         ORG      0.024     0.011     0.015      3816

   micro avg      0.022     0.068     0.033     13466
   macro avg      0.020     0.064     0.025     13466
weighted avg      0.025     0.068     0.032     13466


Run 2
              precision    recall  f1-score   support

         PER      0.034     0.323     0.062      4483
        MISC      0.008     0.183     0.015       443
         LOC      0.056     0.029     0.039      4724
         ORG      0.032     0.017     0.022      3816

   micro avg      0.030     0.129     0.049     13466
   macro avg      0.033     0.138     0.034     13466
weighted avg      0.040     0.129     0.041     13466


Run 3
              precision    recall  f1-score   support

         PER      0.040     0.657     0.075      4483
        MISC      0.004     0.002     0.003       443
         LOC      0.034     0.018     0.023      4724
         ORG      0.015     0.001     0.001      3816

   micro avg      0.040     0.225     0.067     13466
   macro avg      0.023     0.169     0.026     13466
weighted avg      0.029     0.225     0.034     13466

```
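
To see how the `micro avg` and `macro avg` rows relate to the per-class rows, the underlying counts can be approximately reconstructed from the rounded figures of the first report above (true positives ≈ recall × support, predicted entities ≈ true positives / precision). This is only a sketch on rounded values, so it matches the table to about two decimals.

```python
# (precision, recall, support) per class, copied from the first report above
rows = {
    "PER": (0.026, 0.141, 4483),
    "MISC": (0.003, 0.056, 443),
    "LOC": (0.028, 0.047, 4724),
    "ORG": (0.024, 0.011, 3816),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# macro average: unweighted mean of the per-class scores
macro_p = sum(p for p, _, _ in rows.values()) / len(rows)
macro_r = sum(r for _, r, _ in rows.values()) / len(rows)
macro_f1 = sum(f1(p, r) for p, r, _ in rows.values()) / len(rows)

# micro average: pool the (reconstructed) counts over all classes
true_pos = {c: r * s for c, (p, r, s) in rows.items()}
predicted = {c: true_pos[c] / p for c, (p, r, s) in rows.items()}
support = sum(s for _, _, s in rows.values())
micro_p = sum(true_pos.values()) / sum(predicted.values())
micro_r = sum(true_pos.values()) / support
micro_f1 = f1(micro_p, micro_r)

print(f"macro: P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"micro: P={micro_p:.3f} R={micro_r:.3f} F1={micro_f1:.3f}")
```

The macro average gives the small MISC class (443 entities) the same weight as PER or LOC, whereas the micro average is dominated by the frequent classes and by the total number of predicted entities.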

**Default hyperparameters; training corpus, vocabulary and test data lowercased**


{'corpus_dir': 'data', 'model_dir': 'model_wikiner_vanilla', 'num_epoch': 5, 'lr': 0.001, 'weight_decay': 0.0, 'batch_size': 1000, 'device': None, 'max_seq_len': 100, 'val_split': 0.2, 'test_split': 0.2, 'recovery': 'store_true', 'save_best_val_model': 'store_true', 'embedding_dim': 100, 'hidden_dim': 128, 'num_rnn_layers': 1, 'rnn_type': 'lstm'}


```
--- 19.28504467010498 seconds predict ---

              precision    recall  f1-score   support
         PER      0.255     0.690     0.373      4483
        MISC      0.014     0.018     0.016       443
         LOC      0.485     0.279     0.354      4724
         ORG      0.199     0.246     0.220      3816

   micro avg      0.267     0.398     0.319     13466
   macro avg      0.239     0.308     0.241     13466
weighted avg      0.312     0.398     0.311     13466


--- 19.466615200042725 seconds predict ---

              precision    recall  f1-score   support
         PER      0.280     0.650     0.392      4483
        MISC      0.010     0.251     0.019       443
         LOC      0.349     0.036     0.066      4724
         ORG      0.539     0.087     0.150      3816

   micro avg      0.156     0.262     0.195     13466
   macro avg      0.295     0.256     0.157     13466
weighted avg      0.369     0.262     0.197     13466


--- 19.388370513916016 seconds predict---

              precision    recall  f1-score   support

         PER      0.549     0.631     0.587      4483
        MISC      0.038     0.205     0.065       443
         LOC      0.411     0.713     0.521      4724
         ORG      0.451     0.186     0.263      3816

   micro avg      0.404     0.519     0.455     13466
   macro avg      0.362     0.434     0.359     13466
weighted avg      0.456     0.519     0.455     13466
```
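
With three runs per configuration, a mean and a standard deviation summarize the comparison better than any single run. The sketch below uses the `micro avg` F1 values copied from the six reports above; the dictionary keys are labels chosen here for readability.

```python
from statistics import mean, stdev

# micro-averaged F1 of each run, copied from the reports above
micro_f1 = {
    "default, no normalization": [0.033, 0.049, 0.067],
    "default, lowercased": [0.319, 0.195, 0.455],
}

# (mean, sample standard deviation) per configuration
summary = {name: (mean(values), stdev(values)) for name, values in micro_f1.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f} stdev={spread:.3f} over {len(micro_f1[name])} runs")
```

The spread between lowercased runs is large compared with their mean, which is a reminder that three runs of five epochs give only a rough estimate.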

**25 and 100 epochs, with lowercasing**

{'corpus_dir': 'data', 'model_dir': 'model_wikiner_vanilla', 'num_epoch': 100, 'lr': 0.001, 'weight_decay': 0.0, 'batch_size': 1000, 'device': None, 'max_seq_len': 100, 'val_split': 0.2, 'test_split': 0.2, 'recovery': 'store_true', 'save_best_val_model': 'store_true', 'embedding_dim': 100, 'hidden_dim': 128, 'num_rnn_layers': 1, 'rnn_type': 'lstm'}


```
--- 322.24554419517517 seconds --- train
--- 29.44636583328247 seconds --- predict

              precision    recall  f1-score   support

         PER      0.742     0.566     0.642      4483
        MISC      0.016     0.609     0.031       443
         LOC      0.627     0.622     0.624      4724
         ORG      0.360     0.367     0.363      3816

   micro avg      0.249     0.530     0.339     13466
   macro avg      0.436     0.541     0.415     13466
weighted avg      0.569     0.530     0.537     13466


--- 323.0248258113861 seconds to train ---
--- 19.44614005088806 seconds to predict ---

              precision    recall  f1-score   support

         PER      0.846     0.541     0.660      4483
        MISC      0.054     0.427     0.096       443
         LOC      0.571     0.754     0.650      4724
         ORG      0.733     0.247     0.370      3816

   micro avg      0.513     0.529     0.521     13466
   macro avg      0.551     0.492     0.444     13466
weighted avg      0.691     0.529     0.556     13466


100 epochs

--- 1502.3446514606476 seconds --- train
--- 29.44636583328247 seconds --- predict

              precision    recall  f1-score   support

         PER      0.345     0.774     0.478      4483
        MISC      0.011     0.634     0.021       443
         LOC      0.797     0.510     0.622      4724
         ORG      0.506     0.318     0.391      3816

   micro avg      0.179     0.548     0.269     13466
   macro avg      0.415     0.559     0.378     13466
weighted avg      0.538     0.548     0.489     13466
```


            

**Embedding dimension 500 (embedding_dim), other hyperparameters at their defaults**

--- 19.915265560150146 seconds predict ---


{'corpus_dir': 'data', 'model_dir': 'model_wikiner_vanilla', 'num_epoch': 5, 'lr': 0.001, 'weight_decay': 0.0, 'batch_size': 1000, 'device': None, 'max_seq_len': 100, 'val_split': 0.2, 'test_split': 0.2, 'recovery': 'store_true', 'save_best_val_model': 'store_true', 'embedding_dim': 500, 'hidden_dim': 128, 'num_rnn_layers': 1, 'rnn_type': 'lstm'}

```

              precision    recall  f1-score   support

         PER      0.539     0.690     0.605      4483
        MISC      0.030     0.260     0.054       443
         LOC      0.638     0.752     0.690      4724
         ORG      0.248     0.393     0.304      3816

   micro avg      0.390     0.613     0.477     13466
   macro avg      0.364     0.523     0.413     13466
weighted avg      0.474     0.613     0.531     13466


              precision    recall  f1-score   support

         PER      0.380     0.837     0.523      4483
        MISC      0.043     0.214     0.072       443
         LOC      0.734     0.646     0.687      4724
         ORG      0.603     0.239     0.342      3816

   micro avg      0.440     0.580     0.501     13466
   macro avg      0.440     0.484     0.406     13466
weighted avg      0.557     0.580     0.515     13466



              precision    recall  f1-score   support

         PER      0.776     0.548     0.642      4483
        MISC      0.021     0.291     0.039       443
         LOC      0.735     0.685     0.709      4724
         ORG      0.635     0.203     0.308      3816

   micro avg      0.439     0.490     0.463     13466
   macro avg      0.542     0.432     0.425     13466
weighted avg      0.697     0.490     0.551     13466


```

---
# Experimenting with pretrained embeddings

The following code redefines the Bi-LSTM CRF layer. The previous Bi-LSTM layer used an embedding layer whose vectors were drawn at random for each word of the vocabulary (the default behaviour). In the new definition, we load embeddings and define the Bi-LSTM CRF model so that its embedding layer is initialized from a pretrained embedding model.

Apart from the earlier cells titled "training run" and "prediction on the test data", all preceding cells must be executed. The one defining Bi-LSTM CRF will be overwritten.



## Getting an embedding model

Jean-Philippe Fauconnier provides [pretrained embedding models for French](https://fauconnier.github.io/#data).

In [33]:
!wget -nc https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin -P embeddings
#!wget -nc https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin -P embeddings
#!wget -nc https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWac_non_lem_no_postag_no_phrase_500_skip_cut200.bin -P embeddings
#!wget -nc https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWac_no_postag_phrase_500_cbow_cut10.bin -P embeddings  # exhausts the RAM during training
#!wget -nc https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWac_postag_no_phrase_1000_skip_cut100.bin -P embeddings # ['</s>', 'le_d', 'de_p', 'et_c', 'de_p+d', 'un_d', 'être_v', 'à_p', 'son_d', 'en_p']

w2v_pretrained_embeddings_path = "embeddings/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin"
#w2v_pretrained_embeddings_path = "embeddings/frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin"
#w2v_pretrained_embeddings_path = "embeddings/frWac_non_lem_no_postag_no_phrase_500_skip_cut200.bin"
#w2v_pretrained_embeddings_path = "embeddings/frWac_no_postag_phrase_500_cbow_cut10.bin"
#w2v_pretrained_embeddings_path = "embeddings/frWac_postag_no_phrase_1000_skip_cut100.bin" 

File ‘embeddings/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin’ already there; not retrieving.



## From the model to a tensor to the neural embedding layer

Initialize an Embedding layer from a pretrained model (see [Loading pretrained embeddings in PyTorch](https://stackoverflow.com/questions/49710537/pytorch-gensim-how-to-load-pre-trained-word-embeddings))
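
Before applying this to the full frWac model, the mechanism can be sketched on a toy matrix. The values below are arbitrary and only illustrate that `nn.Embedding.from_pretrained` turns row `i` of a weight matrix into the vector returned for index `i`, and that the weights are frozen by default.

```python
import torch
import torch.nn as nn

# toy "pretrained" matrix: 4 words, 3 dimensions (arbitrary values)
toy_weights = torch.tensor([[0.0, 0.0, 0.0],
                            [0.1, 0.2, 0.3],
                            [0.4, 0.5, 0.6],
                            [0.7, 0.8, 0.9]])

# freeze=True is the default: the vectors are not updated during training
toy_layer = nn.Embedding.from_pretrained(toy_weights)

# a batch with one "sentence" made of the word indices 1, 3, 3
toy_indices = torch.tensor([[1, 3, 3]])
toy_vectors = toy_layer(toy_indices)

print(toy_vectors.shape)               # (batch, sentence length, embedding dim)
print(toy_layer.weight.requires_grad)  # whether the layer is trainable
```

Passing `freeze=False` to `from_pretrained` would instead let training fine-tune the pretrained vectors.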

In [34]:
# Load the pretrained embedding model
from gensim.models import KeyedVectors

w2v_pretrained_embeddings = KeyedVectors.load_word2vec_format(w2v_pretrained_embeddings_path, binary=True, unicode_errors="ignore")
#print (w2v_pretrained_embeddings.vectors[0])

import torch
import torch.nn as nn
# convert the embedding vectors into a tensor
w2v_pretrained_embeddings_tensors = torch.FloatTensor(w2v_pretrained_embeddings.vectors) 

# create an embedding layer initialized from the pretrained embedding model
torch_embedding = nn.Embedding.from_pretrained(w2v_pretrained_embeddings_tensors)

Return the sequence of embeddings corresponding to a randomly drawn sequence of word indices (a dummy sentence).

In [8]:
torch_embedding

dummy_vocab_size = 20000
dummy_batch_size = 1
dummy_max_seq_len = 100
dummy_device = 'cpu'
dummy_x = torch.randint(0, dummy_vocab_size, (dummy_batch_size, dummy_max_seq_len))
print (dummy_x)
#dummy_x = dummy_x.to(dummy_device).long()
dummy_x = dummy_x.long() 
print (dummy_x) 
print (torch_embedding(dummy_x))

tensor([[ 8465,  7470, 14144, 17723,  5317, 13371,  5805, 13258, 13149,  7162,
          8444, 15853,  7460,  6567, 15170,  8524, 18101,  3741, 17938, 10595,
         10700, 10883, 12560, 16727, 19874, 15235, 12576, 17770, 15838,  9250,
         13318, 12324,  7622,  7773, 13472,   355,    78, 17608,  8518,  8690,
          8435, 10500,   748,  2601,  4822,  6635, 18656, 17234, 19605,  7977,
         13427, 19636, 12845,  6950,  7600, 16329, 11356,  9931,  4196, 14934,
          4139, 16959,  3260,  2826, 19087, 12938, 15177, 10062,  5315,  3574,
          4517,  7723, 10338, 17495,  4528, 12723,  2706,  1292, 17979,   653,
          4044, 11993,  1856,  8050, 14756,  1124,  4935, 14325, 10475,  4882,
         10892,  5428, 12558,  9029, 13008, 14390,  7441,  7889, 16424, 17586]])
tensor([[ 8465,  7470, 14144, 17723,  5317, 13371,  5805, 13258, 13149,  7162,
          8444, 15853,  7460,  6567, 15170,  8524, 18101,  3741, 17938, 10595,
         10700, 10883, 12560, 16727, 19874, 15235,

Accessing the vocabulary

In [35]:
print ('#dimensions', w2v_pretrained_embeddings.vector_size)
print ('#vocab',len(w2v_pretrained_embeddings.vectors))
# gensim < 4.0 API; with gensim >= 4.0, use w2v_pretrained_embeddings.index_to_key instead of .vocab
w2v_pretrained_embeddings_vocab = list(w2v_pretrained_embeddings.vocab)
print ('first n words of the embedding model vocabulary:', w2v_pretrained_embeddings_vocab[:10])
#print (weights[0])
# export the vocabulary in the bi_lstm_crf format
!mkdir data
import json
with open('data/w2v_pretrained_embeddings_vocab.json', 'w', encoding='utf-8') as f:
  json.dump(w2v_pretrained_embeddings_vocab, f, ensure_ascii=False)

FILE_VOCAB = "w2v_pretrained_embeddings_vocab.json"

#dimensions 200
#vocab 155562
first n words of the embedding model vocabulary: ['</s>', 'de', 'la', 'et', 'le', "l'", 'les', 'à', 'des', "d'"]
mkdir: cannot create directory ‘data’: File exists


Inspect the quality of the model by searching for the embeddings most similar to the embedding of a given word (here, the word at index 100).

In [22]:
# gives an idea of the content of the embeddings (and of their normalization)
print ('the words most similar to "{}" are: {}'.format(w2v_pretrained_embeddings_vocab[100], w2v_pretrained_embeddings.most_similar(w2v_pretrained_embeddings_vocab[100])))
#print (w2v_pretrained_embeddings.most_similar("intéressant_à"))

the words most similar to "voir" are: [('ici', 0.5783172845840454), ('regarder', 0.5010303258895874), ('aperçu', 0.48547330498695374), ('connaitre', 0.45781832933425903), ('visiter', 0.45318037271499634), ('visionner', 0.451629638671875), ('revoir', 0.42287424206733704), ('consulter', 0.41887110471725464), ('cliquant', 0.41748952865600586), ('voici', 0.4161854386329651)]
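
The scores returned by `most_similar` are cosine similarities between embedding vectors. A minimal pure-Python sketch of that ranking follows. The three-dimensional vectors are made up for illustration and are not taken from the frWac model, and the local function `most_similar` only mimics the idea; it is not the gensim implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank the other words by decreasing cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up toy vectors, for illustration only
toy_vectors = {
    "voir": [1.0, 0.2, 0.0],
    "regarder": [0.9, 0.3, 0.1],
    "table": [0.0, 0.1, 1.0],
}
neighbours = most_similar("voir", toy_vectors)
print(neighbours)
```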


#### YOUR TASK
* Check that the vocabulary follows the same normalization as the training and test corpora.
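
One way to approach this check is to measure which fraction of the corpus tokens is found in the embedding vocabulary, with and without lowercasing. The sketch below uses a toy vocabulary and a toy sentence; the function name `coverage` and the data are hypothetical, and the real check would use `w2v_pretrained_embeddings_vocab` and the corpus tokens instead.

```python
def coverage(tokens, vocab, lowercase=False):
    """Fraction of token occurrences that are present in the vocabulary."""
    vocab = set(vocab)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


# toy data, for illustration only
toy_vocab = ['</s>', 'de', 'la', 'et', 'le', 'des', 'fut']
toy_tokens = ['George', 'W.', 'Bush', 'fut', 'président', 'Des', 'La']

raw_coverage = coverage(toy_tokens, toy_vocab)
lower_coverage = coverage(toy_tokens, toy_vocab, lowercase=True)
print(f"as is: {raw_coverage:.2f}  lowercased: {lower_coverage:.2f}")
```

A large gap between the two figures on the real data would suggest that the corpus and the embedding vocabulary do not follow the same casing convention; every token that is not covered ends up mapped to the single OOV vector.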

## Implementation of the BiRnnCrf layer with pretrained embeddings

In [40]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Used to load the pretrained embedding model
from gensim.models import KeyedVectors

class BiRnnCrf(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim, num_rnn_layers=1, rnn="lstm"):
        super(BiRnnCrf, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size # +2
        self.tagset_size = tagset_size

        print ('Debug: BiRnnCrf - __init__ - self.vocab_size:',  self.vocab_size)

        # Embedding layer
        # (the randomly initialized version used previously was:)
        # self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        print ('Debug: BiRnnCrf - __init__ - w2v_pretrained_embeddings_tensors.size:', len(w2v_pretrained_embeddings_tensors))
        # w2v_pretrained_embeddings_tensors is the global tensor built in a previous cell.
        # At this point vocab_size includes the PAD and OOV words but w2v_pretrained_embeddings_tensors does not,
        # so if necessary we generate dedicated vectors and add them to w2v_pretrained_embeddings_tensors
        # (at the positions where they are expected: PAD first, OOV last)
        if self.vocab_size != len(w2v_pretrained_embeddings_tensors):
            pad_tensor = torch.randn(1,  self.embedding_dim)
            oov_tensor = torch.randn(1,  self.embedding_dim)
            extended_w2v_pretrained_embeddings_tensors = torch.cat ((pad_tensor,w2v_pretrained_embeddings_tensors,oov_tensor),0)
        else:
            # sizes already match: use the pretrained vectors as they are
            extended_w2v_pretrained_embeddings_tensors = w2v_pretrained_embeddings_tensors
        self.embedding = nn.Embedding.from_pretrained(extended_w2v_pretrained_embeddings_tensors)
        # from_pretrained already freezes the weights by default; made explicit here
        self.embedding.weight.requires_grad = False
        # Get embeddings for index 1
        #input = torch.LongTensor([1])
        #embedding(input)
        
        # Bidirectional RNN layer (hidden_dim is split between the two directions)
        RNN = nn.LSTM if rnn == "lstm" else nn.GRU
        self.rnn = RNN(embedding_dim, hidden_dim // 2, num_layers=num_rnn_layers,
                       bidirectional=True, batch_first=True)
        
        # CRF layer
        self.crf = CRF(hidden_dim, self.tagset_size)

    def __build_features(self, sentences):
        """
        sentences holds one batch of sentences;
        each sentence has length max_seq_len
        and contains word indices
        type(sentences): <class 'torch.Tensor'>
        sentences.shape: torch.Size([1000, 100]) # default values
        More details on Tensors: https://pytorch.org/docs/stable/tensors.html
        """
        #print ('__build_features')
        #print("type sentences {}".format(type(sentences))) # <class 'torch.Tensor'>
        #print("shape sentences {}".format(sentences.shape)) # torch.Size([1000, 100])
        #print ('sentences[0]:', sentences[0]) 
        """
        sentences[0]: tensor([27996, 34171, 38501, 49310, 75077, 94514,  7381, 80031, 70853, 80031,
        56648, 41074, 75077, 51013, 83722, 91893, 70882,  7213, 55591, 30448,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        device='cuda:0')"""
        
        # > identify positions in sentences where there are words (index 0 is padding)
        masks = sentences.gt(0) 
        #print("type(masks):{}".format(type(masks))) # <class 'torch.Tensor'>
        
        #print ('masks[0]:', masks[0])
        """
        masks[0]: tensor([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False],
        device='cuda:0')"""

        # print("type(sentences.long()):{}".format(type(sentences.long()))) # <class 'torch.Tensor'>
        # sentences.long() converts the data type of the Tensor to long
        # > then look up the embedding vector of each word of each sentence
        # > here the vectors come from the pretrained model (plus the PAD and OOV vectors), not from a random initialization

        embeds = self.embedding(sentences.long())
        # embeds.requires_grad=False
        #print("type(embeds):{}".format(type(embeds))) # <class 'torch.Tensor'>
        #print ('embeds[0]:', embeds[0])
        """ 
        embeds[0]: tensor([[ 1.6529, -0.9046,  0.9322,  ..., -0.8712, -1.1555, -1.5031],
        [-0.6852,  0.2939, -0.8784,  ..., -0.7400, -0.2376, -1.7276],
        [-0.8087,  0.4498, -1.7856,  ..., -1.3986,  0.2591,  0.0371],
        ...,
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603],
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603],
        [ 0.1250,  0.4386,  1.4527,  ..., -0.2274,  1.7671, -0.3603]],
        device='cuda:0', grad_fn=<SelectBackward>)"""

        # masks.sum(1) sums each row of the mask along dimension 1.
        # > Summing True and False gives the number of actual words in each sentence
        seq_length = masks.sum(1) 
        # print("type(seq_length):{}".format(type(seq_length))) # <class 'torch.Tensor'>
        #print ('seq_length[0]:', seq_length[0])
        # seq_length[0]: tensor(20, device='cuda:0')

        # sort() orders the elements of the input tensor along a given dimension, here in descending order by value.
        # A namedtuple of (values, indices) is returned, where the values are the sorted values and indices are the indices of the elements in the original input tensor.
        # > Sort the sentences by their length (descending order)
        sorted_seq_length, perm_idx = seq_length.sort(descending=True)
        #print ('sorted_seq_length[0]:', sorted_seq_length[0])
        # sorted_seq_length[0]: tensor(100, device='cuda:0')
        #print ('perm_idx[0]:', perm_idx[0])
        # perm_idx[0]: tensor(630, device='cuda:0')

        # > reorder the embeddings by decreasing sentence length, as required by the packing step below
        embeds = embeds[perm_idx, :]
        #print ('embeds[0]:', embeds[0])
        """
        embeds[0]: tensor([[ 0.1470,  1.3863,  0.2156,  ..., -0.1568, -1.1045, -0.1400],
        [ 0.3537,  0.2269, -1.4778,  ..., -1.0272, -0.7349,  1.0088],
        [-0.4989, -0.1096, -0.6463,  ...,  1.2627,  0.0907,  0.1922],
        ...,
        [-1.0912,  1.1962, -1.9826,  ..., -0.4356, -1.2736, -1.4505],
        [ 0.6587, -1.1465,  1.1382,  ...,  1.4149, -0.6422,  0.2377],
        [-0.6448,  1.1332,  1.4744,  ..., -0.7169, -1.2447, -0.5358]],
        device='cuda:0', grad_fn=<SelectBackward>) """

        # Packs a Tensor containing padded sequences of variable length.
        # input can be of size T x B x * where T is the length of the longest sequence (equal to lengths[0]), B is the batch size, and * is any number of dimensions (including 0). 
        # If batch_first is True, B x T x * input is expected.
        # https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence
        #
        # > the problem is that not all the sentences in the current batch have the same length. 
        # > If the padded batch were fed to the RNN as is, 
        # > it would run batch_size * max_seq_len steps, padding positions included, even though fewer are needed given the actual sentence lengths.
        # > Packing keeps only the real (non-padded) time steps, together with the batch size at each step, 
        # > so that the RNN can skip the padding.
        # https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
        # https://stackoverflow.com/questions/59938530/why-do-we-need-pack-padded-sequence-when-we-have-pack-sequence
        # TODO use enforce_sorted=False and remove the previous sorting
        pack_sequence = pack_padded_sequence(embeds,  lengths=sorted_seq_length,  batch_first=True)
        #print("type(pack_sequence):{}".format(type(pack_sequence))) # <class 'torch.nn.utils.rnn.PackedSequence'>
        #print ('pack_sequence[0]:', pack_sequence[0])
        """
        pack_sequence[0]: tensor([[ 0.1470,  1.3863,  0.2156,  ..., -0.1568, -1.1045, -0.1400],
        [ 0.5597,  2.0953, -0.7236,  ..., -1.4103, -1.6798,  1.3055],
        [-0.1927, -0.9563, -0.0153,  ...,  1.2662, -0.6017, -0.1576],
        ...,
        [-1.6244,  1.0199, -0.1681,  ..., -0.7570, -0.9435, -0.4870],
        [-2.3151, -2.2364, -0.4231,  ...,  0.5323, -0.0363, -0.5891],
        [ 0.0935, -0.1610, -0.5200,  ...,  0.1851,  0.2965, -0.6004]],
       device='cuda:0', grad_fn=<PackPaddedSequenceBackward>)"""

        packed_output, _ = self.rnn(pack_sequence)
        #print("type(packed_output):{}".format(type(packed_output))) # <class 'torch.nn.utils.rnn.PackedSequence'>
        #print ('packed_output[0]:', packed_output[0])
        """
        packed_output[0]: tensor([[ 2.0873e-02,  1.0921e-01, -2.1166e-01,  ...,  4.7033e-02,
         -2.1772e-01, -6.2811e-01],
        [ 3.9952e-03,  1.4725e-01, -9.1979e-02,  ..., -2.1016e-01,
         -1.8077e-01, -1.3867e-01],
        [ 1.2813e-02,  4.0637e-02, -2.2237e-01,  ..., -1.9843e-01,
          4.1468e-02, -7.4167e-03],
        ...,
        [ 4.2123e-04,  1.6109e-01, -2.3425e-02,  ..., -7.5010e-02,
         -5.0942e-02,  2.3539e-04],
        [-3.6322e-01,  1.0884e-01, -1.7367e-01,  ..., -6.3288e-02,
         -3.7179e-02, -9.8569e-02],
        [ 1.0113e-02,  1.3696e-01, -3.8002e-02,  ..., -2.1368e-01,
         -7.6481e-02,  1.1498e-01]], device='cuda:0',
       grad_fn=<CudnnRnnBackward>)"""

        #¬†Pads a packed batch of variable length sequences.
        #¬†It is an inverse operation to pack_padded_sequence().
        #¬†The returned Tensor‚Äôs data will be of size T x B x *, where T is the length of the longest sequence and B is the batch size. 
        #¬†If batch_first is True, the data will be transposed into B x T x * format.
        #¬†https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html
        lstm_out, _ = pad_packed_sequence(packed_output, batch_first=True)
        #print("type(lstm_out):{}".format(type(lstm_out))) #¬†<class 'torch.Tensor'>
        #print ('lstm_out[0]:', lstm_out[0])
        """
        lstm_out[0]: tensor([[ 0.0209,  0.1092, -0.2117,  ...,  0.0470, -0.2177, -0.6281],
        [ 0.1364,  0.2313, -0.1493,  ...,  0.1805, -0.1467, -0.2619],
        [ 0.0967,  0.1473, -0.0139,  ...,  0.0378, -0.2664, -0.3387],
        ...,
        [-0.3804,  0.0290,  0.0695,  ..., -0.0445,  0.1460,  0.1356],
        [-0.2751,  0.2326, -0.0762,  ..., -0.0467,  0.0303,  0.0645],
        [-0.2196,  0.1945,  0.0911,  ..., -0.1548, -0.1706,  0.0325]],
       device='cuda:0', grad_fn=<SelectBackward>)"""
        
        # sorting perm_idx in ascending order gives the inverse permutation,
        # which restores the original order of the sentences in the batch
        _, unperm_idx = perm_idx.sort()
        # print ('unperm_idx[0]:', unperm_idx[0])
        # unperm_idx[0]: tensor(644, device='cuda:0')
        lstm_out = lstm_out[unperm_idx, :]
        #print ('lstm_out[0]:', lstm_out[0])
        """
        lstm_out[0]: tensor([[-0.3566, -0.0670, -0.0603,  ..., -0.0220, -0.1860,  0.2224],
        [-0.1167, -0.1348,  0.0326,  ..., -0.1276, -0.2458, -0.1165],
        [-0.0043,  0.0479,  0.2782,  ...,  0.0799, -0.0694, -0.4641],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],
        device='cuda:0', grad_fn=<SelectBackward>)
        """
        return lstm_out, masks

    def loss(self, xs, tags):
        # compute the loss (refers to the crf loss)
        features, masks = self.__build_features(xs)
        loss = self.crf.loss(features, tags, masks=masks)
        return loss

    def forward(self, xs):
        # Build the features from the batch of sentences
        features, masks = self.__build_features(xs)
        # Decode with the CRF layer: returns the scores and the predicted tag sequences
        scores, tag_seq = self.crf(features, masks)
        return scores, tag_seq
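
The pack, run, pad and un-sort steps used in `__build_features` can be reproduced on a toy batch. The sketch below is only an illustration under assumed toy dimensions (3 sentences, embedding size 5, hidden size 3); it is not the notebook's own code, and the variable names other than `perm_idx`, `unperm_idx` and `lstm_out` are hypothetical.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch (assumed sizes): 3 padded sequences of embeddings, B x T x E,
# whose true lengths are 2, 4 and 3.
batch = torch.randn(3, 4, 5)
lengths = torch.tensor([2, 4, 3])

# Sort by decreasing length before packing.
sorted_lengths, perm_idx = lengths.sort(descending=True)
packed = pack_padded_sequence(batch[perm_idx], sorted_lengths, batch_first=True)

rnn = nn.LSTM(input_size=5, hidden_size=3, batch_first=True, bidirectional=True)
packed_output, _ = rnn(packed)

# Back to a padded tensor of shape B x T x (2 * hidden_size); padded steps are zeros.
lstm_out, out_lengths = pad_packed_sequence(packed_output, batch_first=True)

# Sorting perm_idx gives the inverse permutation, which restores the input order.
_, unperm_idx = perm_idx.sort()
lstm_out = lstm_out[unperm_idx]
restored_lengths = out_lengths[unperm_idx]

print(lstm_out.shape, restored_lengths.tolist())
```

After un-sorting, row `i` of `lstm_out` corresponds again to sentence `i` of the input batch, which is what the CRF layer needs in order to align features with the gold tags.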

## Running the training

In [41]:
args = dict()
args['corpus_dir'] = "data"                    # the corpus directory
args['model_dir'] = "model_wikiner_vanilla"    # the output directory for model files
args['num_epoch'] = 5                          # number of epochs to train (other values tried: 25, 50, 500)
args['lr'] = 1e-3                              # learning rate
args['weight_decay'] = 0.                      # the L2 regularization (weight decay) coefficient
args['batch_size'] = 1000                      # batch size for training
args['device'] = None                          # the training device: "cuda:0", "cpu:0"; auto-detected when None
args['max_seq_len'] = 100                      # max sequence length during training
args['val_split'] = 0.2                        # the split for the validation dataset
args['test_split'] = 0.2                       # the split for the testing dataset
# The two "store_true" values below are left over from an argparse definition;
# a non-empty string is truthy, so it behaves like True in a truthiness test.
args['recovery'] = "store_true"                # continue training from the model saved in model_dir
args['save_best_val_model'] = "store_true"     # save the model whose validation loss is lowest
args['embedding_dim'] = w2v_pretrained_embeddings.vector_size  # the dimension of the embedding layer (e.g. 100, 300, 500)

args['hidden_dim'] = 128                       # the dimension of the RNN hidden state
args['num_rnn_layers'] = 1                     # the number of RNN layers
args['rnn_type'] = "lstm"                      # RNN type, choice: "lstm", "gru"
# print(args)

import time
start_time = time.time()

# Remove the models saved by previous runs (-f avoids an error when none exist)
!rm -rf model_*
train(args)
# vocabulary size: 155562 vs 155564

print("--- %s seconds ---" % (time.time() - start_time))
# --- 162.1955807209015 seconds --- gpu 5 epochs max_seq_len 100 embedding_dim 100 num_rnn_layers 1 val_loss:  4.47 test_loss: 4.27
# 1056/10000 loss: -0.00, val_loss: 20.14:  90%|█████████ | 72/80 [00:12<00:01,  5.97it/s] ~ 4 h 08 min 35 s, but the session disconnected
# --- 1344.2705941200256 seconds --- gpu 100 epochs max_seq_len 100 embedding_dim 300 num_rnn_layers 1 val_loss:   10.35 test_loss: 11.29 with the training loss at 0.02 from epoch 80 onwards



Debug: train - Preprocessor
config data/w2v_pretrained_embeddings_vocab.json loaded
Debug: Preprocessor - __init__ - len(self.vocab): 155562
Debug: Preprocessor - __init__ - len(self.vocab_dict): 155562
config data/wikiner_corpus_tagset.json loaded
tag dict file => model_wikiner_vanilla/wikiner_corpus_tagset.json
tag dict file => model_wikiner_vanilla/w2v_pretrained_embeddings_vocab.json
Debug: Preprocessor - __adjust_vocab - len(self.vocab): 155564
Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict): 155564
Debug: train - build_model
Debug: build_model - len(processor.vocab): 155564
Debug: build_model - BiRnnCrf
Debug: BiRnnCrf - __init__ - self.vocab_size: 155564
Debug: BiRnnCrf - __init__ - w2v_pretrained_embeddings_tensors.size: 155562
Debug: build_model - model: BiRnnCrf(
  (embedding): Embedding(155564, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)
Debug: train - model: B

 1/5 loss: 13.76, val_loss:  0.00: 100%|██████████| 80/80 [00:11<00:00,  6.94it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 20.96it/s]

save model => model_wikiner_vanilla/model.pth
save model(epoch: 0) => model_wikiner_vanilla/loss.csv

 2/5 loss: 12.60, val_loss: 13.22: 100%|██████████| 80/80 [00:11<00:00,  6.96it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 21.11it/s]

save model => model_wikiner_vanilla/model.pth
save model(epoch: 1) => model_wikiner_vanilla/loss.csv

 3/5 loss: 10.85, val_loss: 11.35: 100%|██████████| 80/80 [00:11<00:00,  7.00it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 20.97it/s]

save model => model_wikiner_vanilla/model.pth
save model(epoch: 2) => model_wikiner_vanilla/loss.csv

 4/5 loss:  9.36, val_loss: 10.11: 100%|██████████| 80/80 [00:11<00:00,  7.02it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 20.24it/s]

save model => model_wikiner_vanilla/model.pth
save model(epoch: 3) => model_wikiner_vanilla/loss.csv

 5/5 loss:  9.51, val_loss:  9.17: 100%|██████████| 80/80 [00:11<00:00,  6.95it/s]
eval: 100%|██████████| 14/14 [00:00<00:00, 20.99it/s]

save model => model_wikiner_vanilla/model.pth
save model(epoch: 4) => model_wikiner_vanilla/loss.csv

test: 100%|██████████| 14/14 [00:00<00:00, 14.36it/s]

training completed. test loss: 8.64
--- 64.3680202960968 seconds ---





## Running the prediction

In [42]:
# predict
#from bi_lstm_crf.app import WordsTagger

import time
start_time = time.time()

bilstmcrf_model = WordsTagger(model_dir="model_wikiner_vanilla") #_vanilla

bilstmcrf_hyp = []
# for each wikiner sentence
for text in winer_tokens:
    tags, sequences = bilstmcrf_model([text])    
    bilstmcrf_hyp.append(tags[0])
    #print (tags)
    #break

#
print("--- %s seconds ---" % (time.time() - start_time))
# --- 40.24239158630371 seconds ---
# --- 144.3440752029419 seconds ---
# --- 33.92570495605469 seconds ---

# normalize the hyp labels
print()
print ('bilstmcrf_hyp', bilstmcrf_hyp[0])
normalized_bilstmcrf_hyp = normalise_labels(bilstmcrf_hyp)
print ('normalized_bilstmcrf_hyp', normalized_bilstmcrf_hyp[0])
print ('winer_ref', winer_ref[0])
print()

# Evaluate on data 
print (args)
print()
print (results_per_class(labels, winer_ref, normalized_bilstmcrf_hyp))

Debug: Preprocessor - __init__ - len(self.vocab): 155562
Debug: Preprocessor - __init__ - len(self.vocab_dict): 155562
Debug: Preprocessor - __adjust_vocab - len(self.vocab): 155564
Debug: Preprocessor - __adjust_vocab - len(self.vocab_dict): 155564
Debug: build_model - len(processor.vocab): 155564
Debug: build_model - BiRnnCrf
Debug: BiRnnCrf - __init__ - self.vocab_size: 155564
Debug: BiRnnCrf - __init__ - w2v_pretrained_embeddings_tensors.size: 155562
Debug: build_model - model: BiRnnCrf(
  (embedding): Embedding(155564, 200)
  (rnn): LSTM(200, 64, batch_first=True, bidirectional=True)
  (crf): CRF(
    (fc): Linear(in_features=128, out_features=11, bias=True)
  )
)
running_device gpu
--- 17.62063455581665 seconds ---

bilstmcrf_hyp ['B-ORG', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', 'B-LOC', '

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         PER      0.000     0.000     0.000      4483
        MISC      0.000     0.000     0.000       443
         LOC      0.035     0.968     0.067      4724
         ORG      0.029     0.048     0.036      3816

   micro avg      0.034     0.353     0.063     13466
   macro avg      0.016     0.254     0.026     13466
weighted avg      0.020     0.353     0.034     13466



  _warn_prf(average, modifier, msg_start, len(result))


### YOUR TASK

* Do you have an idea of where to look to understand these results?
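
One possible starting point, given that the first hypothesis printed above is almost entirely `B-LOC` while LOC has high recall and very low precision, is to compare the distribution of predicted tags with the distribution of reference tags. The sketch below uses toy data; `tag_distribution` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

def tag_distribution(tag_sequences):
    """Return the relative frequency of each tag over a list of tag sequences."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}

# Toy sequences (hypothetical), shaped like bilstmcrf_hyp and winer_ref.
toy_hyp = [["B-ORG", "B-LOC", "B-LOC", "B-LOC"]]
toy_ref = [["O", "O", "B-LOC", "O"]]

hyp_dist = tag_distribution(toy_hyp)
ref_dist = tag_distribution(toy_ref)
print("hypothesis:", hyp_dist)
print("reference: ", ref_dist)
```

A large gap between the two distributions (for instance a single tag dominating the hypothesis) points to a model that has collapsed onto one label rather than to a scoring problem.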


## Execution results

**pretrained embeddings 200 (see below)**


--- 64.1135082244873 seconds --- pretrained 200


```
              precision    recall  f1-score   support

         PER      0.577     0.119     0.197      4483
        MISC      0.030     0.018     0.022       443
         LOC      0.718     0.297     0.421      4724
         ORG      0.872     0.057     0.107      3816

   micro avg      0.637     0.161     0.257     13466
   macro avg      0.549     0.123     0.187     13466
weighted avg      0.692     0.161     0.244     13466


              precision    recall  f1-score   support

         PER      0.000     0.000     0.000      4483
        MISC      0.012     0.002     0.004       443
         LOC      0.036     0.831     0.069      4724
         ORG      0.025     0.193     0.045      3816

   micro avg      0.034     0.346     0.062     13466
   macro avg      0.018     0.257     0.029     13466
weighted avg      0.020     0.346     0.037     13466
```

## Comparing the vocabularies

To explain the results obtained with the pretrained embedding models, we can ask a few questions about the vocabulary shared between the different resources.


In [39]:
print(len(vocab))                              # size of `vocab`, defined earlier in the notebook
print(len(w2v_pretrained_embeddings.vocab))    # number of words in the pretrained embeddings
winer_vocab = set(flatten(winer_tokens))       # word types found in winer_tokens
print(len(winer_vocab))
# word types of winer_tokens that have no pretrained embedding
winer_not_in_w2v_pretrained_embeddings = [
    w for w in winer_vocab if w not in w2v_pretrained_embeddings.vocab
]
print(len(winer_not_in_w2v_pretrained_embeddings), winer_not_in_w2v_pretrained_embeddings)

108023
155562
19911
7414 ['indlala', 'seigneur,', 'pièce.', 'urvoas.', '(entre', "(qu'il", "l'autobus", 'graciosa', 'hawi', 'viennent,', 'a)', 'décidée.', 'retard.', "s'emparant", 'locale)', 'sciurus', 'moments.', '%.', "l'autocar", 'youtubeurs', 'civils,', 'agathonisi', '7h20', 'manifesté,', 'mobilisation.', '32,', 'quasi-parfaits', 'patronats', "o'riordan", '(format', 'difficile.', 'européennes,', '554', 'eux,', "l'allemande.", 'depaul', 'toujours.', 'l’intérieur.', 'deux]', 'poupe,', 'iar-conicet', "l'attaquant", 'court.', 'victoire,', 'attribution,', 'déjouaient', 'interdites;', 'misogynes,', 'libyennes.', '7h00', 'caillassages', 's’emparant', 'lacrymogènes.', '(station)', 'fermées.', '55,3', 'mobilisations,', '34ème', '5,2.', 'voie,', 'écroués.', 'humain.', "d'instances", 'réalité.', 'berkel', '(leurs', 'kenshu', 'quitté.', 'étoile,', 'avions.', 'procédure.', 'mig-29', '25,7', 'témoin,', "d'année,", 'crime.', 'kellyanne', 'dartout', 'divisions.', 'nous,', 'i
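
The counts above show 7414 of the 19911 word types from `winer_tokens` missing from the pretrained embeddings, roughly 37%. Many of the listed items carry attached punctuation (`'seigneur,'`, `'pièce.'`), which suggests that a tokenization mismatch may account for part of the gap. The sketch below, on toy data, measures the out-of-vocabulary (OOV) rate before and after stripping surrounding punctuation; `oov_report` and the toy sets are hypothetical and only illustrate the idea.

```python
def oov_report(corpus_vocab, embedding_vocab):
    """Return (number of OOV types, OOV rate, sorted list of OOV types)."""
    corpus_vocab = set(corpus_vocab)
    oov = sorted(w for w in corpus_vocab if w not in embedding_vocab)
    rate = len(oov) / len(corpus_vocab) if corpus_vocab else 0.0
    return len(oov), rate, oov

# Toy vocabularies (hypothetical), with punctuation glued to some tokens.
toy_corpus_vocab = {"seigneur,", "seigneur", "pièce.", "paris"}
toy_embedding_vocab = {"seigneur", "pièce", "paris"}

n_oov, rate, oov = oov_report(toy_corpus_vocab, toy_embedding_vocab)
print(n_oov, rate, oov)

# Stripping surrounding punctuation shows how much of the gap comes from tokenization.
stripped = {w.strip(".,;:()[]") for w in toy_corpus_vocab}
n_oov_stripped, rate_stripped, _ = oov_report(stripped, toy_embedding_vocab)
print(n_oov_stripped, rate_stripped)
```

The same function can be applied in turn to the training vocabulary and to the test vocabulary to compare their coverage by the embedding model.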

#### YOUR TASK
* The model is built on the words of the training corpus. Is the vocabulary of the training corpus present in the embedding model? Is the vocabulary of the test corpus present in the embedding model? Do the training and test corpora share the same vocabulary?
* Is the shared vocabulary a possible explanation for the quality of the results obtained with the pretrained embedding models?
* The current implementation also fine-tunes the embeddings. If you do not want gradients to be computed for the embeddings tensor, set its `.requires_grad` property to `False` (for a layer's parameters it defaults to `True`).
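
As an illustration of the last point, here is a minimal sketch of two common ways to freeze an embedding layer in PyTorch. The tensor `pretrained_weights` and the tiny model are hypothetical stand-ins for the notebook's pretrained matrix and `BiRnnCrf`; only `nn.Embedding.from_pretrained` and `requires_grad` are actual PyTorch features.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the pretrained matrix (vocab_size x embedding_dim).
pretrained_weights = torch.randn(10, 4)

# Option 1: build the layer already frozen.
frozen = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)

# Option 2: copy the weights into an existing layer, then switch gradients off.
embedding = nn.Embedding(10, 4)
with torch.no_grad():
    embedding.weight.copy_(pretrained_weights)
embedding.weight.requires_grad = False

# Hand the optimizer only the parameters that still require gradients.
model = nn.Sequential(embedding, nn.Linear(4, 2))
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

print(frozen.weight.requires_grad, embedding.weight.requires_grad, len(trainable))
```

Freezing keeps the pretrained vectors intact, so words seen only at test time stay in the same space as the words seen during training.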

# Experimenting with other architectures
...

Depending on your progress,
- other word embeddings can be tested (e.g. GloVe),
- another architecture can be tried, [BERT-CRF](https://github.com/jidasheng/bi-lstm-crf) (see the end of the README),
- features based on the surface form of the words can be added.
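
For the last item, surface features typically describe the shape of a token (capitalization, digits, hyphens, and so on). The sketch below shows one possible feature extractor; the function name and the choice of features are assumptions, and such features should be computed on the raw token, before any lowercasing.

```python
def word_shape_features(token):
    """Return a dict of simple surface features for one raw token."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "has_apostrophe": "'" in token or "’" in token,
        "ends_with_punct": token.endswith(tuple(".,;:!?)]")),
    }

for tok in ["Mig-29", "ONU", "l'autobus", "pièce."]:
    print(tok, word_shape_features(tok))
```

Each boolean can be turned into a 0/1 value and concatenated to the word embedding before the BiLSTM, which gives the model a signal even for words missing from the embedding vocabulary.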

