<a href="https://colab.research.google.com/github/neohack22/IASD/blob/NLP/NLP/project/3_sequence_tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence models and recurrent networks 

## Preliminary remarks

Recurrent networks in *pytorch* expects as input a Tensor in 3 dimensions (*3D tensor*). The axes carry an important semantic: 
- the first axis is "the time" 
- the second one corresponds to the mini-batch
- the third corresponds to the dimension of input vectors (typically the embedding size)


Therefore, a sequence of 5 vectors of 4 features (size 4) is represented as a Tensor of dimensions (5,1,4). If we have 7 sequences of 5 vectors, all of size 4, we get (5,7,4). 

Lets start with some simple code with synthetic data. 

In [None]:
#@title
import pickle # for the real data 
import torch  # Torch + shortcuts
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1) # To reproduce the experiments

<torch._C.Generator at 0x108ee42d0>

In [14]:
import pickle # for the real data
import torch # Torch + shortcuts
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1) # To reproduce the experiments

<torch._C.Generator at 0x7f1fa31ace50>

In [None]:
#@title
inputs = torch.randn((5,1,4))
print("input sequence :", inputs)
print("The shape : ", inputs.shape)


input sequence : tensor([[[-1.5256, -0.7502, -0.6540, -1.6095]],

        [[ 0.8657,  0.2444, -0.6629,  0.8073]],

        [[ 0.4391,  1.1712,  1.7674, -0.0954]],

        [[ 0.0612, -0.6177, -0.7981, -0.1316]],

        [[-0.7984,  0.3357,  0.2753,  1.7163]]])
The shape :  torch.Size([5, 1, 4])


In [2]:
inputs = torch.randn((5, 1, 4)) # = torch.randn((5, 1, 4))
print("input sequence :", inputs) #input sequence :", inputs")
print("The shape : ", inputs.shape)

input sequence : tensor([[[-1.5256, -0.7502, -0.6540, -1.6095]],

        [[ 0.8657,  0.2444, -0.6629,  0.8073]],

        [[ 0.4391,  1.1712,  1.7674, -0.0954]],

        [[ 0.0612, -0.6177, -0.7981, -0.1316]],

        [[-0.7984,  0.3357,  0.2753,  1.7163]]])
The shape :  torch.Size([5, 1, 4])


## A simple recurrent model and  LSTM

A simple recurrent network is for instance of the thpe **nn.RNN**. 
To build it, we must specify: 
- the input size (this implies the size of the Linear Layer that will process input vectors);
- the size of the hidden layer (this implies the size of the Linear Layer that will process the time transition). 

Other options are available and useful, like:
- nonlinearity 
- bias
- batch_first 


The forward function of a recurrent net can handle two types of input and therefore acts in two ways. 

### One step forward
The first one corresponds to one time step: the neural networks reads one input symbol and update the hidden layer. The forward function therefore returns a tuple of two Tensors: the output and the updated hidden layer. 




In [None]:
#@title
recNN = nn.RNN(input_size=4, hidden_size=3)  # Input dim is 4, hidden layer size  is 3

# initialize the hidden state.
h0 = torch.randn(1, 1, 3) # 
print("h0 : ",h0,h0.shape)

# One step 
out, hn = recNN(inputs[0].view(1,1,-1), h0)
print("##################")
print("One step returns: ")
print("  1/  output : ", out, out.shape)
print("  2/  hidden : ", hn, hn.shape)
print("##################")

h0 :  tensor([[[-0.8737, -0.2693, -0.5124]]]) torch.Size([1, 1, 3])
##################
One step returns: 
  1/  output :  tensor([[[-0.6307, -0.0205,  0.0848]]], grad_fn=<StackBackward>) torch.Size([1, 1, 3])
  2/  hidden :  tensor([[[-0.6307, -0.0205,  0.0848]]], grad_fn=<StackBackward>) torch.Size([1, 1, 3])
##################


In [3]:
recNN = nn.RNN(
    input_size=4, hidden_size=3
) # Input dim is 4, hidden layer is 3

# initialize the hidden state.
h0 = torch.randn(1, 1, 3)
print("h0 : ", h0, h0.shape)

# One step
out, hn = recNN(inputs[0].view(1, 1, -1), h0)
print("############")
print("One step returns: ")
print("   1/  output  : ", out, out.shape)
print("   2/  hidden  : ", hn, hn.shape)
print("############")

h0 :  tensor([[[-0.8737, -0.2693, -0.5124]]]) torch.Size([1, 1, 3])
############
One step returns: 
   1/  output  :  tensor([[[-0.6307, -0.0205,  0.0848]]], grad_fn=<StackBackward0>) torch.Size([1, 1, 3])
   2/  hidden  :  tensor([[[-0.6307, -0.0205,  0.0848]]], grad_fn=<StackBackward0>) torch.Size([1, 1, 3])
############


We can observe that both vectors are the same. Indeed, in a simple recurrent network there is no distinction between the output and the hidden layers.  A prediction can be done by taking into account at each time step this hidden layer: 

$$ h_t = f_1(x_t,h_{t-1})$$
$$ y_t = f_2(h_t)$$

For one step forward, the recurrent net only needs to keep track of the hidden layer. Some more advanced architectures, like **LSTM** use  two kinds of hidden layers: one for the memory managment  named **cell state** (or $c_t$), and the other to make the prediction named  **hidden state** (or $h_t$). The API is generic for all the recurrent nets et returns a tuple at each time step. This tuple gathers the sufficient data to unfold the network. 

### Sequence forward (unfold)
The second "style" of the forward function consists in taking as input a sequence and to unfold the network on this input sequence. It is equivalent to a for loop. 


In [None]:
#@title
# The whole the sequence in one call: unfolding the network 
outputs, hn = recNN(inputs, h0)
print("* outputs:\n",outputs, "\n  shape:",outputs.shape,"\n")
print("* hn:\n",hn, "\n  shape:",hn.shape)

* outputs:
 tensor([[[ 0.6385, -0.3399, -0.2212]],

        [[ 0.3047,  0.3983, -0.0730]],

        [[ 0.3188, -0.4135, -0.8151]],

        [[ 0.8521, -0.2961, -0.7123]],

        [[ 0.8090,  0.9473, -0.8750]]], grad_fn=<StackBackward>) 
  shape: torch.Size([5, 1, 3]) 

* hn:
 tensor([[[ 0.8090,  0.9473, -0.8750]]], grad_fn=<StackBackward>) 
  shape: torch.Size([1, 1, 3])


In [4]:
# The whole sequence in one call: unfolding the network
outputs, hn = recNN(inputs, h0)
print("* outputs:\n", outputs, "\n  shape:", outputs.shape, "\n")
print(". hn:\n", hn, "\n  shape:", hn.shape) #n, hn, "\n shape:", hn.shape)

* outputs:
 tensor([[[-0.6307, -0.0205,  0.0848]],

        [[-0.5812,  0.7743,  0.2956]],

        [[-0.2936,  0.9483,  0.1993]],

        [[-0.7406,  0.7238,  0.6722]],

        [[-0.9548,  0.5780,  0.7488]]], grad_fn=<StackBackward0>) 
  shape: torch.Size([5, 1, 3]) 

. hn:
 tensor([[[-0.9548,  0.5780,  0.7488]]], grad_fn=<StackBackward0>) 
  shape: torch.Size([1, 1, 3])


in this case, the forward function returns: 
- the sequence of the hidden layers associated to each input vector;
- and the last hidden layer. 
The previous code is equivalent to this one: 

In [None]:
#@title
hn=h0 # init 
for t in range(len(inputs)): 
    out, hn = recNN(inputs[t].view(1,1,-1), hn)
    print("at time ",t, " out = ", out)

at time  0  out =  tensor([[[ 0.6385, -0.3399, -0.2212]]], grad_fn=<StackBackward>)
at time  1  out =  tensor([[[ 0.3047,  0.3983, -0.0730]]], grad_fn=<StackBackward>)
at time  2  out =  tensor([[[ 0.3188, -0.4135, -0.8151]]], grad_fn=<StackBackward>)
at time  3  out =  tensor([[[ 0.8521, -0.2961, -0.7123]]], grad_fn=<StackBackward>)
at time  4  out =  tensor([[[ 0.8090,  0.9473, -0.8750]]], grad_fn=<StackBackward>)


In [5]:
hn=h0 #init
for t in range(len(inputs)):
  out, hn = recNN(inputs[t].view(1, 1, -1), hn)
  print("at time ", t, " out = ", out)

at time  0  out =  tensor([[[-0.6307, -0.0205,  0.0848]]], grad_fn=<StackBackward0>)
at time  1  out =  tensor([[[-0.5812,  0.7743,  0.2956]]], grad_fn=<StackBackward0>)
at time  2  out =  tensor([[[-0.2936,  0.9483,  0.1993]]], grad_fn=<StackBackward0>)
at time  3  out =  tensor([[[-0.7406,  0.7238,  0.6722]]], grad_fn=<StackBackward0>)
at time  4  out =  tensor([[[-0.9548,  0.5780,  0.7488]]], grad_fn=<StackBackward0>)


## Usage of LSTM

To illustrate the previous section, the following code replace a simple recurrent network by a LSTM. Look at the differences ! 

In [None]:
#@title
recNN = nn.LSTM(input_size=4, hidden_size=3)  # Input dim is 4, hidden layer size  is 3
h0 =  torch.randn(1, 1, 3) # 
c0 =  torch.randn(1, 1, 3) # 
# One step 

# One step 
out, (hn,cn) = recNN(inputs[0].view(1,1,-1), (h0,c0))
print("##################")
print("One step returns: ")
print("  1/  output : ", out, out.shape)
print("  2/  hidden : ", hn, hn.shape)
print("  3/  cell   : ", cn, cn.shape)
print("##################")


##################
One step returns: 
  1/  output :  tensor([[[-0.4350,  0.0462, -0.3104]]], grad_fn=<StackBackward>) torch.Size([1, 1, 3])
  2/  hidden :  tensor([[[-0.4350,  0.0462, -0.3104]]], grad_fn=<StackBackward>) torch.Size([1, 1, 3])
  3/  cell   :  tensor([[[-0.9659,  0.0889, -0.6284]]], grad_fn=<StackBackward>) torch.Size([1, 1, 3])
##################


In [7]:
recNN = nn.LSTM(
    input_size=4, hidden_size=3) # Input dim is 4, hidden layer is 3
h0 = torch.randn(1, 1, 3)
c0 = torch.randn(1, 1, 3)    

# One step
out, (
    hn, cn) = recNN(
        inputs[0].view(
            1, 1, -1), (h0, c0)) #= = ercNN(inputs[0].view(1, 1, 1), (h0, c0)))
print("##################")
print("One step returns: ")
print("   1/  output  : ", out, out.shape)
print("   2/  hidden  : ", hn, hn.shape)
print("   3/  cell    : ", cn, cn.shape)
print("##################")

##################
One step returns: 
   1/  output  :  tensor([[[0.1313, 0.2055, 0.1265]]], grad_fn=<StackBackward0>) torch.Size([1, 1, 3])
   2/  hidden  :  tensor([[[0.1313, 0.2055, 0.1265]]], grad_fn=<StackBackward0>) torch.Size([1, 1, 3])
   3/  cell    :  tensor([[[0.2867, 0.6155, 1.2126]]], grad_fn=<StackBackward0>) torch.Size([1, 1, 3])
##################


It is important to understand these examples and more specifically :
* the parameters "input dimension" and "output dimension" set to 3 ? 
* why we initialize the hidden layer ?
* the *-1* when we call *view* ? 
* ... 
If we unfold the LSTM along the sequence of inputs: 


In [None]:
#@title


out, (hn, cn) = recNN(inputs, (h0,c0))
print("##################")
print("Unfolding the net: ")
print("  1/ out:\n ", out, "\n",out.shape)
print("  2/ hn :\n", hn, "\n",hn.shape )
print("  3/ cn :\n", cn,"\n", cn.shape )
print("##################")

##################
Unfolding the net: 
  1/ out:
  tensor([[[-0.4350,  0.0462, -0.3104]],

        [[-0.0676,  0.0663, -0.2225]],

        [[-0.0674, -0.1932,  0.0767]],

        [[-0.1022, -0.3144,  0.0342]],

        [[-0.0244, -0.1595, -0.0578]]], grad_fn=<StackBackward>) 
 torch.Size([5, 1, 3])
  2/ hn :
 tensor([[[-0.0244, -0.1595, -0.0578]]], grad_fn=<StackBackward>) 
 torch.Size([1, 1, 3])
  3/ cn :
 tensor([[[-0.0828, -0.3908, -0.1168]]], grad_fn=<StackBackward>) 
 torch.Size([1, 1, 3])
##################


In [8]:
out, (hn, cn) = recNN(inputs, (h0, c0))
print("##################")
print("Unfolding the net: ")
print(" 1/ out:\n ", out, "\n", out.shape)
print(" 2/ hn :\n", hn, "\n", hn.shape)
print(" 3/ cn :\n", cn, "\n", cn.shape)
print("##################")

##################
Unfolding the net: 
 1/ out:
  tensor([[[0.1313, 0.2055, 0.1265]],

        [[0.0454, 0.3733, 0.2033]],

        [[0.1518, 0.3830, 0.2379]],

        [[0.0556, 0.2590, 0.1208]],

        [[0.0629, 0.2052, 0.0781]]], grad_fn=<StackBackward0>) 
 torch.Size([5, 1, 3])
 2/ hn :
 tensor([[[0.0629, 0.2052, 0.0781]]], grad_fn=<StackBackward0>) 
 torch.Size([1, 1, 3])
 3/ cn :
 tensor([[[0.0957, 0.4264, 0.4603]]], grad_fn=<StackBackward0>) 
 torch.Size([1, 1, 3])
##################


# Sequence tagging  


The task of *sequence tagging* consists in the attribution of a tag (or a class) to each element  (or words ) of a sequence (a sentence): 
* An observation is a sentence represented as a word sequence;
* A tag sequence is associated to this sentence, one tag per word. 

If the input is sequence of symbols : 
$w_1, \dots, w_M$, with $w_i \in V$, the vocabulary or the finite set of the known words. Assume we have a tagset $T$ le *tagset* which is the set of all possible tags (the output space). At time $i$,  $y_i$ is the tag associated to the word  $w_i$.
The prediction of the model is  $\hat{y}_i$. 
Our goal is to predict the sequence $\hat{y}_1, \dots, \hat{y}_M$, with $\hat{y}_i \in T$.

## A recurrent tagger
We can use a recurrent model to create a sequence tagger. The recurrent network "reads" the sentence and predict the tag sequence. We denote the hidden state of the recurrent network at time $i$ as  $h_i$. The prediction rule is to select   $\hat{y}_i$ as : 

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

The softmax function gives us a probability distribution over the tagset ($\in T$). The softmax is applied to a linear transformation of the hidden state $h_i$. In the following we can use the logsoftmax associated to the adapted loss. 

## A first (toy) dataset

Let us build our first dataset and define some useful function. 

In [None]:
#@title

# Convert the input sequence into an integer one.
# The mapping is recorded in the dictionnary to_ix
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return tensor

# Toy dataset
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# The dictionnary : word -> index
word_to_ix = {}
# The other : tag -> index
tag_to_ix = {}
# Build them 
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
    for tag in tags:
        if tag not in tag_to_ix:
            tag_to_ix[tag] = len(tag_to_ix)
##
print("Words dict: ", word_to_ix)
print("Tags  dict: ",tag_to_ix)

print("The sentence : ", training_data[0][0])
print("The tag seq. : ", training_data[0][1])
print("#### in the prepared version")
print("The sentence : ", prepare_sequence(training_data[0][0],word_to_ix))
print("The tag seq. : ", prepare_sequence(training_data[0][1],tag_to_ix))

Words dict:  {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
Tags  dict:  {'DET': 0, 'NN': 1, 'V': 2}
The sentence :  ['The', 'dog', 'ate', 'the', 'apple']
The tag seq. :  ['DET', 'NN', 'V', 'DET', 'NN']
#### in the prepared version
The sentence :  tensor([0, 1, 2, 3, 4])
The tag seq. :  tensor([0, 1, 2, 0, 1])


In [10]:
# Convert the input sequence into an integer one.
# The mapping is recorded in the dictionnary to_ix
def prepare_sequence(seq, to_ix):
  idxs = [to_ix[w] for w in seq]
  tensor = torch.LongTensor(idxs)
  return tensor

# Toy dataset
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]  

# The dictionnary : word -> index
word_to_ix = {}
# The other : tag -> index
tag_to_ix = {}
# Build them
for sent, tags in training_data:
  for word in sent:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)
  for tag in tags:
    if tag not in tag_to_ix:
      tag_to_ix[tag] = len(word_to_ix) #_to_ix[tag] = len(tag_to_ix)
##
print("Words dict; ", word_to_ix)    
print("Tags dict: ",tag_to_ix)

print("The sentence : ", training_data[0][0])
print("The tag seq. : ", training_data[0][1])
print("#### in the prepared version")
print("The sentence : ", prepare_sequence(training_data[0][0], word_to_ix))
print("The tag seq. : ", prepare_sequence(training_data[0][1], tag_to_ix))

Words dict;  {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
Tags dict:  {'DET': 5, 'NN': 5, 'V': 5}
The sentence :  ['The', 'dog', 'ate', 'the', 'apple']
The tag seq. :  ['DET', 'NN', 'V', 'DET', 'NN']
#### in the prepared version
The sentence :  tensor([0, 1, 2, 3, 4])
The tag seq. :  tensor([5, 5, 5, 5, 5])


## Build our first model
Fill the following class. We can use a LSTM our tagger, with 3 components: 
- a LSTM us unfolded on the word sequence to be processed
- an Embedding layer to project words
- A linear layer to feed the log-softmax for prediction purpose. 

These three modules must be created in the constructor of the class. The forward function requires your full attention: 
- the model takes an input sequence: a tensor of word idx
- the embedding layers will generate a new tensor, what is the dimensions ? 
- what is expected by the LSTM module ? 
- what is the dimensions of the LSTM ? 
- what is expected by the final Linear module ? 

Try to write it : 


In [None]:

class RecurrentTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(RecurrentTagger, self).__init__()
        # TODO: 

    def init_hidden(self):
        # This function is given: understand it. 
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        # Your work
        return None # of course it is not None

In [44]:
import torch.nn.functional
class RecurrentTagger(nn.Module):

  def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
    super(RecurrentTagger, self).__init__()
    #DONE:
    # store the value of the parameter hidden_dim
    self.hidden_dim = hidden_dim
    # create an Embedding layer which is the same size as vocab_size w/ a dim equal to embedding_dim
    self.init_emb = nn.Embedding(vocab_size, embedding_dim)
    # embedding_dim as inputs and hidden_dim as hidden states dimensionality
    self.init_lstm = nn.LSTM(embedding_dim, hidden_dim)

    self.hidden2tag = nn.Linear(hidden_dim, tagset_size)


  def init_hidden(self): # maps hidden layer to tags?
    # This function is given: understand it.
    return (torch.zeros(1, 1, self.hidden_dim),
            torch.zeros(1, 1, self.hidden_dim))#)
    
  def forward(self, sentence):
    # work
    out = self.init_emb(sentence)
    #out = out.init_hidden()
    #out = self.init_hidden()
    out, _ = self.init_lstm(out) # = self.init_lstm(out) #, None) #init_hidden()) #None) #.view(len(sentence), 1, -1))
    #out = out.init_hidden #self.nn.Linear(hidden_dim, tagset_size) #init_hidden(out[0]) #, None) #.view(len(sentence), -1))#)
    #out = self.init_hidden()
    out = nn.functional.log_softmax(out) #.init_hidden()) #, dim=1)
    return out #None # of course it is not None

# Training 
Now write the code to train this model

In [None]:
#@title
EMBEDDING_DIM = 6
HIDDEN_DIM = 6


model = RecurrentTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Look at the scores 
# The output element i,j concerns the  tag j pour le mot i.
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Get our inputs ready for the network
 
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        

        # Step2: Also, we need to clear out the hidden state of the recurrent net,
        # detaching it from its history on the last instance.
        
        # Step 3. Run our forward pass.
        

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        
        
# Les mêmes scores à la fin
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

In [45]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6


model = RecurrentTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Look at the scores 
# The output element i,j concerns the  tag j pour le mot i.
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Get our inputs ready for the network
 
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step2: Also, we need to clear out the hidden state of the recurrent net,
        # detaching it from its history on the last instance.
        features = prepare_sequence(sentence, word_to_ix)
        labels = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        forward_pass = model(features)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        computed_loss = loss_function(forward_pass, labels)
        computed_loss.backward()
        optimizer.step()
        
# Les mêmes scores à la fin
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)



tensor([[-1.7701, -1.7429, -1.8724, -1.7876, -1.7646, -1.8183],
        [-1.7024, -1.7382, -1.7277, -1.9235, -1.8762, -1.8020],
        [-1.8099, -1.7306, -1.8665, -1.7003, -1.7894, -1.8656],
        [-1.8391, -1.7378, -1.8824, -1.6147, -1.8712, -1.8324],
        [-1.8258, -1.7425, -1.9805, -1.7247, -1.8857, -1.6303]],
       grad_fn=<LogSoftmaxBackward0>)
tensor([[-1.9954, -2.1883, -2.2324, -2.0922, -2.2553, -0.8761],
        [-2.2973, -2.2931, -2.3455, -2.3880, -2.4151, -0.6509],
        [-2.4356, -2.4547, -2.4549, -2.4191, -2.4114, -0.5763],
        [-2.4731, -2.4869, -2.4826, -2.4722, -2.4695, -0.5448],
        [-2.4462, -2.4847, -2.5094, -2.4903, -2.4666, -0.5431]],
       grad_fn=<LogSoftmaxBackward0>)


# A real task
Load the following dataset

In [8]:
!pip install sklearn
from sklearn.model_selection import train_test_split #cross_validation import train_test_split

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pickle
mydata = pickle.load( open( "/Le/chemin/vers/brown.save.p", "rb" ) )


In [None]:
#@title
from google.colab import drive
drive.mount('/content/drive')

In [46]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import pickle
mydata = pickle.load(open("/content/brown.save.p", "rb")) #))

Look at the data, process it as required and then: 
* Spit the dataset in 3 sets:  train / validation / test (80%,10%,10%)
* Learn the model and test it 
* Tune the hyperparameters. What is the best score you can obtain ? 
* Start again with a bi-lstm


In [53]:
#@title
#summary(my_data)
#my_data.info()
#mydata.info()
#summary(mydata)
mydata

[[('The', 'DET'),
  ('Fulton', 'NOUN'),
  ('County', 'NOUN'),
  ('Grand', 'ADJ'),
  ('Jury', 'NOUN'),
  ('said', 'VERB'),
  ('Friday', 'NOUN'),
  ('an', 'DET'),
  ('investigation', 'NOUN'),
  ('of', 'ADP'),
  ("Atlanta's", 'NOUN'),
  ('recent', 'ADJ'),
  ('primary', 'NOUN'),
  ('election', 'NOUN'),
  ('produced', 'VERB'),
  ('``', '.'),
  ('no', 'DET'),
  ('evidence', 'NOUN'),
  ("''", '.'),
  ('that', 'ADP'),
  ('any', 'DET'),
  ('irregularities', 'NOUN'),
  ('took', 'VERB'),
  ('place', 'NOUN'),
  ('.', '.')],
 [('The', 'DET'),
  ('jury', 'NOUN'),
  ('further', 'ADV'),
  ('said', 'VERB'),
  ('in', 'ADP'),
  ('term-end', 'NOUN'),
  ('presentments', 'NOUN'),
  ('that', 'ADP'),
  ('the', 'DET'),
  ('City', 'NOUN'),
  ('Executive', 'ADJ'),
  ('Committee', 'NOUN'),
  (',', '.'),
  ('which', 'DET'),
  ('had', 'VERB'),
  ('over-all', 'ADJ'),
  ('charge', 'NOUN'),
  ('of', 'ADP'),
  ('the', 'DET'),
  ('election', 'NOUN'),
  (',', '.'),
  ('``', '.'),
  ('deserves', 'VERB'),
  ('the', 'DET'),

In [2]:
# texts, labels = mydata
import numpy as np
my_data = np.array(mydata, dtype=object)

In [3]:
# my_data.describe()
#from scipy import stats
#stats.describe(my_data)

import pandas as pd
df_describe = pd.DataFrame(my_data)
df_describe.describe()

Unnamed: 0,0
count,57340
unique,56426
top,"[(), .)]"
freq,58


In [9]:
# X_train, X_test, y_train, y_test = train_test_split()
X = [] #features
Y = [] #labels
x_train, y_train=[], []
for sentence in mydata:
  for (feature, label) in sentence:
    X.append(feature)
    Y.append(label)

# Split X, Y to train, test and test
x_train, x_val_and_test, y_train, y_val_and_test = train_test_split(
    X, Y, train_size=0.8)
x_val, x_test, y_val, y_test = train_test_split(
    x_val_and_test, y_val_and_test, train_size=0.5)

## Learn the model and test it

### Learn

In [21]:
# Convert the input sequence into an integer one.
# The mapping is recorded in the dictionnary to_ix
def prepare_sequence(seq, to_ix):
  idxs = [to_ix[w] for w in seq]
  tensor = torch.LongTensor(idxs)
  return tensor

# Toy dataset
#training_data = [
#    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
#    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
#]  

# The dictionnary : word -> index
word_to_ix = {}
# The other : tag -> index
tag_to_ix = {}
# Build them
for sent in mydata: #, tags in mydata: #x_train: #training_data:
  for word, tag in sent:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)
  #for tag in tags:
    if tag not in tag_to_ix:
      tag_to_ix[tag] = len(word_to_ix) #_to_ix[tag] = len(tag_to_ix)
##
print("Words dict; ", word_to_ix)    
print("Tags dict: ",tag_to_ix)
#"""
#print("The sentence : ", mydata[0][0]) #training_data[0][0])
#print("The tag seq. : ", mydata[0][1]) #training_data[0][1])
#print("#### in the prepared version")
#print(
#    "A sentence : ", prepare_sequence( #The sentence : ", prepare_sequence(
#        mydata[1], word_to_ix)) #0][0], word_to_ix)) #training_data[0][0], word_to_ix))
#print(
#    "A tag seq. : ", prepare_sequence( #The tag seq. : ", prepare_sequence(
#        mydata[1], tag_to_ix)) #0][1], tag_to_ix)) #training_data[0][1], tag_to_ix))
#"""        

Tags dict:  {'DET': 1, 'NOUN': 2, 'ADJ': 4, 'VERB': 6, 'ADP': 10, '.': 16, 'ADV': 27, 'CONJ': 42, 'PRT': 59, 'PRON': 83, 'NUM': 130, 'X': 2061}


In [15]:
#import torch.nn.functional
class RecurrentTagger(nn.Module):

  def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
    super(RecurrentTagger, self).__init__()
    #DONE:
    # store the value of the parameter hidden_dim
    self.hidden_dim = hidden_dim
    # create an Embedding layer which is the same size as vocab_size w/ a dim equal to embedding_dim
    self.init_emb = nn.Embedding(vocab_size, embedding_dim)
    # embedding_dim as inputs and hidden_dim as hidden states dimensionality
    self.init_lstm = nn.LSTM(embedding_dim, hidden_dim)

    self.hidden2tag = nn.Linear(hidden_dim, tagset_size)


  def init_hidden(self): # maps hidden layer to tags?
    # This function is given: understand it.
    return (torch.zeros(1, 1, self.hidden_dim),
            torch.zeros(1, 1, self.hidden_dim))#)
    
  def forward(self, sentence):
    # work
    out = self.init_emb(sentence)
    #out = out.init_hidden()
    #out = self.init_hidden()
    out, hn = self.init_lstm(out) # = self.init_lstm(out) #, None) #init_hidden()) #None) #.view(len(sentence), 1, -1))
    #out = out.init_hidden #self.nn.Linear(hidden_dim, tagset_size) #init_hidden(out[0]) #, None) #.view(len(sentence), -1))#)
    #out = self.init_hidden()
    out = F.log_softmax(out) #nn.functional.log_softmax(out) #.init_hidden()) #, dim=1)
    return out #None # of course it is not None

In [25]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6


model1 = RecurrentTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model1.parameters(), lr=0.1)

# Look at the scores 
# The output element i,j concerns the  tag j pour le mot i.
#inputs = prepare_sequence(training_data[0][0], word_to_ix)
#tag_scores = model1(inputs)
#print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence in mydata:
      for word, tag in sentence: #, tags in x_train, y_train: #training_data:
        # Get our inputs ready for the network
 
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model1.zero_grad()

        # Step2: Also, we need to clear out the hidden state of the recurrent net,
        # detaching it from its history on the last instance.
        
        features = prepare_sequence(word, word_to_ix) #sentence, word_to_ix)
        labels = prepare_sequence(tag, tag_to_ix) #s, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model1(features)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        computed_loss = loss_function(tag_scores, labels)
        computed_loss.backward()
        optimizer.step()
        
# Les mêmes scores à la fin
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model1(inputs)
print(tag_scores)

KeyError: ignored

In [None]:
0 0 x1 x2 x3 0 0 / ks = 5
-> z1 , z2 ,z3 