## Sequence Models and Long Short-Term Memory Networks

**Simple Feed-Forward Networks** - no state is maintained by the network.

**Sequence Models** - models in which there is some sort of dependence through time between the inputs. E.g. Hidden Markov Models for part-of-speech tagging, Conditional Random Fields.

**Recurrent Neural Network** - a network that maintains some kind of state. E.g. its output can be used as part of the next input => information can propagate along as the network passes through the sequence.
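
As a rough illustration of what "maintaining state" means, here is a minimal sketch of a recurrence in plain Python. The function name run_rnn and the scalar weights w_x and w_h are made up for this example; a real RNN uses learned weight matrices rather than fixed scalars.

```python
import math

def run_rnn(xs, w_x=0.5, w_h=0.8, h0=0.0):
    """Toy scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in xs:
        # The new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Same final input (0.0) but a different history -> a different final state
a = run_rnn([1.0, 0.0])
b = run_rnn([-1.0, 0.0])
print(a[-1], b[-1])
```

A feed-forward network would map the final input 0.0 to the same output in both cases; here the two final states differ because the first input is still carried in the state.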

**LSTM** - for each element in the sequence there is a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, etc.


### LSTMs in PyTorch


* PyTorch's LSTM expects all of its inputs to be 3D tensors.
* The first axis is the sequence itself.
* The second axis indexes instances in the mini-batch.
* The third axis indexes elements of the input.

Using a mini-batch size of 1 for now => the second axis has size 1.

E.g. for the sentence "The cow jumped":

$$ \text{input} = \begin{bmatrix}
\overbrace{q_\text{The}}^\text{row vector} \\
q_\text{cow} \\
q_\text{jumped}
\end{bmatrix} $$

Also, when going through the sequence one element at a time, the first axis has size 1 as well.
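
To make the three axes concrete, here is a small sketch that uses random vectors in place of real word embeddings. The sizes (3 words, 4 features per word) are arbitrary choices for illustration.

```python
import torch

# Hypothetical sizes for illustration: 3 words, each a 4-dimensional vector
seq_len, batch_size, input_size = 3, 1, 4
sentence = torch.randn(seq_len, input_size)   # one row vector per word

# Whole sentence as a single LSTM input: (seq_len, batch, input_size)
whole = sentence.view(seq_len, batch_size, input_size)
print(whole.shape)      # torch.Size([3, 1, 4])

# A single word fed on its own: (1, batch, input_size)
one_step = sentence[0].view(1, batch_size, input_size)
print(one_step.shape)   # torch.Size([1, 1, 4])
```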

In [1]:
!pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
Installing collected packages: torch
Successfully installed torch-0.4.1


In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [3]:
torch.manual_seed(1)

<torch._C.Generator at 0x7f24bd0d9070>

In [7]:
# LSTM input dimension is 3, output (hidden) dimension is 3
lstm = nn.LSTM(3,3)
# a random sequence of length 5
inputs = [torch.randn(1,3) for _ in range(5)]

# initializing the hidden state and the cell state
hidden = (torch.randn(1,1,3),
          torch.randn(1,1,3))

print(inputs)
print("\n")
for i in inputs:
  # Stepping through the sequence one element at a time 
  # after each step, hidden contains the hidden state
  out, hidden = lstm(i.view(1,1,-1),hidden)
  print(i,out,hidden)
  print("\n")

[tensor([[-0.8996,  0.5313,  0.4034]]), tensor([[ 1.4521, -2.4182, -1.1906]]), tensor([[0.6964, 1.1296, 0.2214]]), tensor([[-0.0558,  1.2057,  1.9486]]), tensor([[-0.0766, -0.8562, -0.7870]])]


tensor([[-0.8996,  0.5313,  0.4034]]) tensor([[[ 0.0774, -0.1934,  0.0383]]], grad_fn=<CatBackward>) (tensor([[[ 0.0774, -0.1934,  0.0383]]], grad_fn=<ViewBackward>), tensor([[[ 0.1577, -0.9874,  0.0760]]], grad_fn=<ViewBackward>))


tensor([[ 1.4521, -2.4182, -1.1906]]) tensor([[[-0.0414, -0.2163, -0.2406]]], grad_fn=<CatBackward>) (tensor([[[-0.0414, -0.2163, -0.2406]]], grad_fn=<ViewBackward>), tensor([[[-0.1984, -0.4736, -0.4137]]], grad_fn=<ViewBackward>))


tensor([[0.6964, 1.1296, 0.2214]]) tensor([[[-0.0332, -0.0876, -0.1996]]], grad_fn=<CatBackward>) (tensor([[[-0.0332, -0.0876, -0.1996]]], grad_fn=<ViewBackward>), tensor([[[-0.1296, -0.1687, -0.4776]]], grad_fn=<ViewBackward>))


tensor([[-0.0558,  1.2057,  1.9486]]) tensor([[[ 0.1046, -0.1599, -0.0232]]], grad_fn=<CatBackward>) (tens

In [8]:
# The entire sequence can also be processed at once.
# The first value returned by the LSTM (out) holds the hidden states for
# every step of the sequence.
# The second (hidden) is just the most recent hidden state (plus the cell state).
# "out" gives access to all the hidden states in the sequence.
# "hidden" allows continuing the sequence and backpropagating,
# by passing it as an argument to the LSTM at a later time.

# adding the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs),1,-1)
print(inputs)
# re-initializing the hidden state
hidden = (torch.randn(1,1,3),
          torch.randn(1,1,3))

out , hidden = lstm(inputs,hidden)

print(out)
print("\n")
print(hidden)

tensor([[[-0.8996,  0.5313,  0.4034]],

        [[ 1.4521, -2.4182, -1.1906]],

        [[ 0.6964,  1.1296,  0.2214]],

        [[-0.0558,  1.2057,  1.9486]],

        [[-0.0766, -0.8562, -0.7870]]])
tensor([[[-0.0376, -0.0257,  0.3951]],

        [[-0.0604,  0.0050,  0.1696]],

        [[-0.0310, -0.0039, -0.0257]],

        [[ 0.1014, -0.1452,  0.0700]],

        [[-0.0491, -0.0798,  0.0753]]], grad_fn=<CatBackward>)


(tensor([[[-0.0491, -0.0798,  0.0753]]], grad_fn=<ViewBackward>), tensor([[[-0.1431, -0.1807,  0.1322]]], grad_fn=<ViewBackward>))
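
In the cells above the two runs start from different random initial states, so their printed outputs differ. The sketch below starts both routes from the same zero state, as an assumption for illustration, and checks that stepping through the sequence and processing it in one call give the same hidden states. It also checks that the last slice of out equals the returned hidden state for a single-layer, unidirectional LSTM. The seed and the variable names are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
lstm = nn.LSTM(3, 3)
seq = torch.randn(5, 1, 3)                        # (seq_len, batch, input_size)
h0 = (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))  # (hidden state, cell state)

# Route 1: the whole sequence in one call
out_all, (h_n, c_n) = lstm(seq, h0)

# Route 2: one element at a time, carrying the state forward
state = h0
step_outs = []
for x in seq:                                     # x has shape (1, 3)
    out, state = lstm(x.view(1, 1, -1), state)
    step_outs.append(out)
out_steps = torch.cat(step_outs)                  # back to shape (5, 1, 3)

# Both routes should agree up to floating-point error
print(torch.allclose(out_all, out_steps, atol=1e-6))
# The last output slice should equal the final hidden state
print(torch.allclose(out_all[-1], h_n[0]))
```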


### LSTM for part-of-speech tagging

Model:

* Input - a sentence $w_1,\dots,w_m$, where $w_i \in V$ (the vocabulary)
* $T$ - the tag set, and $y_i$ - the tag of word $w_i$
* Denote the prediction of the tag of word $w_i$ as $\hat{y_i}$
* Output - a sequence $\hat{y_1},\dots,\hat{y_m}$, where $\hat{y_i} \in T$

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep $i$ as $h_i$.

Assign each tag a unique index (similar to word_to_ix, which maps each word to an index), then the prediction rule for $\hat{y_i}$ is

$$ \hat{y_i} = \text{argmax}_j \ (\log \text{Softmax}(Ah_i + b))_j $$

i.e. take the log softmax of the affine map of the hidden state; the predicted tag is the tag that has the maximum value in this vector.

**Note** - this implies that the dimensionality of the target space of $A$ is $|T|$.
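
The prediction rule can be sketched in plain Python for a single hidden state. The matrix A, bias b and hidden vector below are made-up numbers, and the helper names log_softmax and predict_tag are hypothetical; in the model they correspond to a learned linear layer followed by a log softmax.

```python
import math

def log_softmax(scores):
    """Numerically stable log softmax of a list of floats."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def predict_tag(A, b, h, tags):
    """Return the tag j that maximises (log Softmax(A h + b))_j."""
    # Affine map: one score per tag, so A needs len(tags) rows
    scores = [sum(a_jk * h_k for a_jk, h_k in zip(row, h)) + b_j
              for row, b_j in zip(A, b)]
    log_probs = log_softmax(scores)
    best = max(range(len(tags)), key=lambda j: log_probs[j])
    return tags[best], log_probs

# Made-up numbers: |T| = 3 tags, hidden size 2
tags = ["V", "NN", "DET"]
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
tag, log_probs = predict_tag(A, b, [0.2, 0.9], tags)
print(tag, log_probs)
```

Since the log softmax is monotonic in the scores, the argmax is the same as the argmax of the raw affine scores; the log-probabilities matter for the loss rather than for the prediction itself.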

In [13]:
# Preparing data
def prepare_sequence(seq,to_ix):
  idxs = [ to_ix[w] for w in seq]
  return torch.tensor(idxs,dtype=torch.long)

training_data = [
    ("The dog ate the apple".split(),["DET","NN","V","DET","NN"]),
    ("Everybody read that book".split(),["NN","V","DET","NN"])
]

word_to_ix={}

for sentence,tags in training_data:
  for word in sentence:
    if word not in word_to_ix:
      word_to_ix[word]=len(word_to_ix)

      
print(word_to_ix)

tag_to_ix = {"V":0,"NN":1,"DET":2}

# Keeping embedding and hidden dimensions small to see how training progresses
EMBEDDING_DIM =6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


In [0]:
# The model class

class LSTMTag(nn.Module):
  def __init__(self,embedding_dim,hidden_dim,vocab_size,tagset_size):
    super(LSTMTag,self).__init__()
    
    self.hidden_dim = hidden_dim
    self.embeddings = nn.Embedding(vocab_size,embedding_dim)
    
    # The LSTM takes word embeddings as inputs and outputs hidden states
    # with dimensionality hidden_dim
    self.lstm = nn.LSTM(embedding_dim,hidden_dim)
    
    # The linear layer that maps from hidden state to tag space
    self.hidden2tag = nn.Linear(hidden_dim,tagset_size)
    # Creating a default hidden state
    self.hidden = self.init_hidden()
    
  def init_hidden(self):
    # Axes: (num_layers, minibatch_size, hidden_dim); returns (hidden state, cell state)
    return (torch.zeros(1,1,self.hidden_dim),
           torch.zeros(1,1,self.hidden_dim))
  
  def forward(self,sentence):
    embeds = self.embeddings(sentence)
    
    lstm_out,self.hidden = self.lstm(embeds.view(len(sentence),1,-1),self.hidden)
    
    tag_space = self.hidden2tag(lstm_out.view(len(sentence),-1))
    
    tag_scores = F.log_softmax(tag_space,dim=1)
    
    return tag_scores

In [0]:
# Training The Model

model = LSTMTag(EMBEDDING_DIM,HIDDEN_DIM,len(word_to_ix),len(tag_to_ix))

loss_funtion = nn.NLLLoss()

optimizer = optim.SGD(model.parameters(),lr=0.1)



In [17]:
# Checking outputs before training
with torch.no_grad():
  inputs = prepare_sequence(training_data[0][0],word_to_ix)
  tag_scores = model(inputs)
  print(tag_scores)
  
# Element i,j of the output is the score of tag j (0=V, 1=NN, 2=DET) for word i
# sentence "The dog ate the apple" -> ["DET","NN","V","DET","NN"]

tensor([[-1.1889, -0.9245, -1.2083],
        [-1.1635, -0.9058, -1.2608],
        [-1.0403, -0.9671, -1.3224],
        [-1.0702, -0.9456, -1.3145],
        [-1.0898, -0.9396, -1.2985]])
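
These scores are log-probabilities, so the negative log-likelihood loss can be recomputed from them by hand. The sketch below assumes the default mean reduction of nn.NLLLoss and copies the printed (rounded) values, so it is only accurate to about four decimal places.

```python
# Log-probabilities printed above, one row per word of "The dog ate the apple"
tag_scores = [
    [-1.1889, -0.9245, -1.2083],
    [-1.1635, -0.9058, -1.2608],
    [-1.0403, -0.9671, -1.3224],
    [-1.0702, -0.9456, -1.3145],
    [-1.0898, -0.9396, -1.2985],
]
tag_to_ix = {"V": 0, "NN": 1, "DET": 2}
gold = ["DET", "NN", "V", "DET", "NN"]

# NLL loss with mean reduction: average of -log p(correct tag) over the words
picked = [row[tag_to_ix[t]] for row, t in zip(tag_scores, gold)]
loss = -sum(picked) / len(picked)
print(round(loss, 4))  # 1.0817
```

This appears to match the first loss value printed by the training loop further down (1.0816985368728638), which is computed on the same sentence before any weight update.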


In [19]:
# Training Loop 
for epoch in range(300):
  for sentence, tags in training_data:
    # 1. Get the indexes for input and target
    input_ix = prepare_sequence(sentence,word_to_ix)
    target_ix = prepare_sequence(tags,tag_to_ix)
    
    # 2. Zero the gradients, since PyTorch accumulates them across backward passes
    model.zero_grad()

    # 3. Clear out the hidden state of the LSTM, detaching it from its
    # history on the previous instance
    model.hidden = model.init_hidden()
    
    # 4. Forward Pass
    target_score = model(input_ix)
    
    # 5. Calculate Loss
    loss = loss_funtion(target_score,target_ix)
    print(loss.item())
    
    # 6. Backprop and update the weights
    loss.backward()
    optimizer.step()
    

    

1.0816985368728638
1.030313491821289
1.0767052173614502
1.0258985757827759
1.072096586227417
1.0217320919036865
1.0678132772445679
1.0177655220031738
1.0638043880462646
1.0139576196670532
1.0600260496139526
1.0102732181549072
1.056441068649292
1.0066817998886108
1.0530167818069458
1.0031577348709106
1.0497251749038696
0.999678373336792
1.0465415716171265
0.9962244033813477
1.0434446334838867
0.992779016494751
1.0404155254364014
0.9893273711204529
1.0374375581741333
0.9858566522598267
1.0344960689544678
0.9823555946350098
1.0315775871276855
0.9788143038749695
1.0286707878112793
0.9752240777015686
1.0257647037506104
0.9715774655342102
1.0228502750396729
0.9678676128387451
1.0199183225631714
0.9640887975692749
1.0169614553451538
0.9602357149124146
1.013972520828247
0.9563041925430298
1.010945200920105
0.9522902965545654
1.00787353515625
0.9481906294822693
1.004752278327942
0.9440027475357056
1.0015761852264404
0.9397245645523071
0.9983412623405457
0.9353538751602173
0.9950432777404785
0.9

In [21]:
# Calculate the scores after training
with torch.no_grad():
  input_ix = prepare_sequence(training_data[0][0],word_to_ix)
  target_scores = model(input_ix)
  print(target_scores)
  
# Element i,j of the output is the score of tag j (0=V, 1=NN, 2=DET) for word i
# sentence "The dog ate the apple" -> ["DET","NN","V","DET","NN"]

tensor([[-4.2414, -0.8027, -0.6208],
        [-5.6365, -0.0145, -4.5287],
        [-0.0352, -4.2654, -3.8867],
        [-4.5988, -3.9927, -0.0289],
        [-5.9115, -0.0199, -4.0771]])
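
To read the predicted tags off these scores, take the index of the largest value in each row and map it back to a tag. The sketch below copies the printed values into plain Python lists; the reverse mapping ix_to_tag is a hypothetical helper that is not part of the notebook.

```python
# Log-probabilities printed above for "The dog ate the apple"
scores = [
    [-4.2414, -0.8027, -0.6208],
    [-5.6365, -0.0145, -4.5287],
    [-0.0352, -4.2654, -3.8867],
    [-4.5988, -3.9927, -0.0289],
    [-5.9115, -0.0199, -4.0771],
]
tag_to_ix = {"V": 0, "NN": 1, "DET": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # hypothetical helper

# Predicted tag for each word = index of the largest score in its row
predicted = [ix_to_tag[max(range(len(row)), key=row.__getitem__)] for row in scores]
print(predicted)  # ['DET', 'NN', 'V', 'DET', 'NN']
```

The decoded sequence matches the gold tags DET NN V DET NN. The first word is the least certain prediction: DET (-0.6208) only narrowly beats NN (-0.8027).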
