# Getting to know LSTMs better

Created: September 13, 2018  
Author: Thamme Gowda  


Goals:
- To get batches of *unequal length sequences* encoded correctly!
- Know how the hidden states flow between encoders and decoders
- Know how the multiple stacked LSTM layers pass hidden states

Example: a simple bidirectional LSTM which takes 3d input vectors
and produces 2d output vectors in each direction.

In [1]:
import torch 
from torch import nn

In [2]:
lstm = nn.LSTM(3, 2, batch_first=True, bidirectional=True)

In [3]:
# Let's create a batch input.
# 3 sequences in the batch (the first dim), see batch_first=True
# The longest sequence has 4 time steps ==> second dimension
# Each time step has a 3d input vector ==> last dimension
pad_seq = torch.rand(3, 4, 3)

# That is nice in theory,
# but in practice we deal with unequal-length sequences.
# Among those 3 sequences in the batch, let us say the
# first sequence is the longest, with 4 time steps --> no padding needed
# second seq is 3 time steps --> pad the last time step
pad_seq[1, 3, :] = 0.0
# third seq is 2 time steps --> pad the last two steps
pad_seq[2, 2:, :] = 0.0
print("Padded Input:")
print(pad_seq)

# so we have these lengths
lens = [4,3,2]

print("Sequence Lengths: ", lens)

Padded Input:
tensor([[[0.7850, 0.6658, 0.7522],
         [0.3855, 0.7981, 0.6199],
         [0.9081, 0.6357, 0.3619],
         [0.2481, 0.5198, 0.2635]],

        [[0.2654, 0.9904, 0.3050],
         [0.1671, 0.1709, 0.2392],
         [0.0705, 0.4811, 0.3636],
         [0.0000, 0.0000, 0.0000]],

        [[0.6474, 0.5172, 0.0308],
         [0.5782, 0.3083, 0.5117],
         [0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000]]])
Sequence Lengths:  [4, 3, 2]
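
In real code the padded batch usually comes from a list of variable-length tensors rather than from zeroing out slices by hand. A minimal sketch of that, assuming `torch.nn.utils.rnn.pad_sequence` is available in your PyTorch version (the `seqs` list and the seed are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)

# Three sequences of lengths 4, 3 and 2; every time step is a 3d vector.
seqs = [torch.rand(n, 3) for n in (4, 3, 2)]
lens = [len(s) for s in seqs]

# batch_first=True gives (batch, max_len, features);
# shorter sequences are filled with padding_value at the end.
pad_seq = pad_sequence(seqs, batch_first=True, padding_value=0.0)

print(pad_seq.shape)
print(lens)
```

This gives the same layout as the hand-made `pad_seq` above, and `lens` falls out of the data for free.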


In [4]:
# Let's send the padded batch to the LSTM
out,(h_t, c_t) = lstm(pad_seq)
print("All Outputs:")
print(out)

All Outputs:
tensor([[[ 0.0428, -0.3015,  0.0359,  0.0557],
         [ 0.0919, -0.4145,  0.0278,  0.0480],
         [ 0.0768, -0.4989,  0.0203,  0.0674],
         [ 0.1019, -0.4925, -0.0177,  0.0224]],

        [[ 0.0587, -0.3025,  0.0017,  0.0201],
         [ 0.0537, -0.3388, -0.0532,  0.0111],
         [ 0.0839, -0.3811, -0.0446, -0.0020],
         [ 0.0595, -0.3681, -0.0720,  0.0218]],

        [[ 0.0147, -0.2585, -0.0093,  0.0756],
         [ 0.0398, -0.3531, -0.0174,  0.0369],
         [ 0.0458, -0.3476, -0.0912,  0.0243],
         [ 0.0422, -0.3360, -0.0720,  0.0218]]], grad_fn=<TransposeBackward0>)


^^ Each output vector is 2 x 2d = 4d because the LSTM is bidirectional:  
the forward 2d and backward 2d outputs are concatenated.  
Total vectors = 12: 3 seqs in the batch x 4 time steps; each vector is 4d.  
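
To make the concatenation concrete, here is a small sketch that splits the output back into its two directions. It builds its own LSTM and random input (so the numbers differ from the ones printed above); the names `fwd_out` and `bwd_out` are mine.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 2
lstm = nn.LSTM(3, hidden, batch_first=True, bidirectional=True)

x = torch.rand(3, 4, 3)            # (batch, time, features)
out, (h_t, c_t) = lstm(x)

# The last dimension is [forward | backward].
fwd_out = out[:, :, :hidden]       # left-to-right direction
bwd_out = out[:, :, hidden:]       # right-to-left direction

print(out.shape)      # (batch, time, 2 * hidden)
print(fwd_out.shape)  # (batch, time, hidden)
print(bwd_out.shape)  # (batch, time, hidden)
```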

> Hmm, what happened to my padding time steps? Will padded zeros mess with the internal weights of LSTM when I do backprop?

---
Let's look at the last hidden state

In [5]:
print(h_t)

tensor([[[ 0.1019, -0.4925],
         [ 0.0595, -0.3681],
         [ 0.0422, -0.3360]],

        [[ 0.0359,  0.0557],
         [ 0.0017,  0.0201],
         [-0.0093,  0.0756]]], grad_fn=<ViewBackward>)


The last hidden state is made of 2d vectors (the same size as one direction of the output),  
with 2 of them per sequence because the RNN is bidirectional: `h_t[0]` is the forward direction and `h_t[1]` the backward one.  
Each direction has 3 of them, since there are three seqs in the batch,  
each corresponding to the last step.  
But the definition of *last time step* is a bit tricky:  
for the left-to-right LSTM, it is the last step of the input;  
for the right-to-left LSTM, it is the first step of the input.  
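
A quick way to convince yourself of this is to compare `h_t` against the per-step outputs. This is a sketch on a fresh LSTM with random input and no packing, so the values are not the ones shown above.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 2
lstm = nn.LSTM(3, hidden, batch_first=True, bidirectional=True)

x = torch.rand(3, 4, 3)
out, (h_t, c_t) = lstm(x)

# Forward direction: last hidden state == forward half of the output at the LAST step.
fwd_matches = torch.allclose(h_t[0], out[:, -1, :hidden])
# Backward direction: last hidden state == backward half of the output at the FIRST step.
bwd_matches = torch.allclose(h_t[1], out[:, 0, hidden:])

print(h_t.shape, fwd_matches, bwd_matches)
```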

This makes sense now.

--- 
Let's look at $c_t$:

In [6]:
print("Last c_t:")
print(c_t)

Last c_t:
tensor([[[ 0.3454, -1.0070],
         [ 0.1927, -0.6731],
         [ 0.1361, -0.6063]],

        [[ 0.1219,  0.1858],
         [ 0.0049,  0.0720],
         [-0.0336,  0.2787]]], grad_fn=<ViewBackward>)


The cell state has the same shape and layout as the last hidden state: one 2d vector per direction per sequence. Only the values differ.
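
The two are tied together by the LSTM equations in the PyTorch docs: $h_t = o_t \odot \tanh(c_t)$, where the output gate $o_t$ is a sigmoid and so lies between 0 and 1. That means each entry of `h_t` has the same sign as the matching entry of `c_t` and is no larger in magnitude than `tanh(c_t)`. A small sketch to check this (fresh LSTM, random input):

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(3, 2, batch_first=True, bidirectional=True)
out, (h_t, c_t) = lstm(torch.rand(3, 4, 3))

same_shape = h_t.shape == c_t.shape
# |h| <= |tanh(c)| because the output gate is in (0, 1); small slack for float rounding.
bounded = bool((h_t.abs() <= torch.tanh(c_t).abs() + 1e-6).all())
# h and c never disagree in sign.
same_sign = bool((h_t * c_t >= 0).all())

print(same_shape, bounded, same_sign)
```

The printed values above agree with this: for example `h = 0.1019` goes with `c = 0.3454`, and `h = -0.4925` with `c = -1.0070`.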


## Question: 
> what happened to my padding time steps? Did the last hidden state exclude the padded time steps?

I can see that the last hidden state of the forward LSTM didn't distinguish the padded zeros from real input.

Let's look again at the output of each time step and the last hidden state of the left-to-right LSTM.  
We know that the lengths (after removing padding) are \[4, 3, 2].  

In [7]:
print("All time step outputs:")
print(out[:, :, :2])
print("Last hidden state (forward LSTM):")
print(h_t[0])

All time step outputs:
tensor([[[ 0.0428, -0.3015],
         [ 0.0919, -0.4145],
         [ 0.0768, -0.4989],
         [ 0.1019, -0.4925]],

        [[ 0.0587, -0.3025],
         [ 0.0537, -0.3388],
         [ 0.0839, -0.3811],
         [ 0.0595, -0.3681]],

        [[ 0.0147, -0.2585],
         [ 0.0398, -0.3531],
         [ 0.0458, -0.3476],
         [ 0.0422, -0.3360]]], grad_fn=<SliceBackward>)
Last hidden state (forward LSTM):
tensor([[ 0.1019, -0.4925],
        [ 0.0595, -0.3681],
        [ 0.0422, -0.3360]], grad_fn=<SelectBackward>)


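*Okay, now I get it.*  
The forward LSTM's last hidden state is simply its output at time step 4 for every sequence, so for the second and third sequences it was computed after running over the padded zeros.  
When building a sequence-to-sequence model (for machine translation), I can't pass a last hidden state like this to a decoder.

We have to inform the LSTM about the lengths.

How?

That's why we have `torch.nn.utils.rnn.pack_padded_sequence`.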

In [8]:
print("Padded Seqs:")
print(pad_seq)
print("Lens:", lens)

print("Pack Padded Seqs:")
pac_pad_seq = torch.nn.utils.rnn.pack_padded_sequence(pad_seq, lens, batch_first=True)
print(pac_pad_seq)

Padded Seqs:
tensor([[[0.7850, 0.6658, 0.7522],
         [0.3855, 0.7981, 0.6199],
         [0.9081, 0.6357, 0.3619],
         [0.2481, 0.5198, 0.2635]],

        [[0.2654, 0.9904, 0.3050],
         [0.1671, 0.1709, 0.2392],
         [0.0705, 0.4811, 0.3636],
         [0.0000, 0.0000, 0.0000]],

        [[0.6474, 0.5172, 0.0308],
         [0.5782, 0.3083, 0.5117],
         [0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000]]])
Lens: [4, 3, 2]
Pack Padded Seqs:
PackedSequence(data=tensor([[0.7850, 0.6658, 0.7522],
        [0.2654, 0.9904, 0.3050],
        [0.6474, 0.5172, 0.0308],
        [0.3855, 0.7981, 0.6199],
        [0.1671, 0.1709, 0.2392],
        [0.5782, 0.3083, 0.5117],
        [0.9081, 0.6357, 0.3619],
        [0.0705, 0.4811, 0.3636],
        [0.2481, 0.5198, 0.2635]]), batch_sizes=tensor([3, 3, 2, 1]))


Okay, this is doing some magic -- getting rid of all the padded zeros -- Cool!
`batch_sizes=tensor([3, 3, 2, 1])` seems to be the main ingredient of this magic.

`[3, 3, 2, 1]`: I get it!
We have 4 time steps in the batch.
- The first two steps have all 3 seqs in the batch.
- The third step is made of the first 2 seqs in the batch.
- The fourth step is made of the first seq in the batch.
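
The counting above can be written down directly: at time step `t`, the batch size is the number of sequences whose length is greater than `t`. A small sketch comparing that with what `pack_padded_sequence` reports (the name `manual_batch_sizes` is mine):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

lens = [4, 3, 2]

# At step t, count the sequences that are still running (length > t).
manual_batch_sizes = [sum(1 for n in lens if n > t) for t in range(max(lens))]
print(manual_batch_sizes)

packed = pack_padded_sequence(torch.rand(3, 4, 3), lens, batch_first=True)
print(packed.batch_sizes.tolist())
print(packed.data.shape)  # one row per real (non-padded) time step
```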

I now understand why the sequences in the batch have to be sorted in descending order of length!
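
More recent PyTorch releases than the one used for this notebook let you skip the manual sorting by passing `enforce_sorted=False`; the function then sorts internally and remembers how to undo it. A sketch, assuming your installed version has that argument:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
hidden = 2
lstm = nn.LSTM(3, hidden, batch_first=True, bidirectional=True)

lens = torch.tensor([2, 4, 3])          # deliberately NOT sorted
x = torch.rand(3, 4, 3)
for i, n in enumerate(lens.tolist()):
    x[i, n:, :] = 0.0                   # zero out the padded steps

packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lens = pad_packed_sequence(packed_out, batch_first=True)

# Lengths and rows come back in the original (unsorted) batch order.
print(out_lens.tolist())
print(out.shape)
```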

Now let us send it to the LSTM and see what it produces.

In [9]:
pac_pad_out, (pac_ht, pac_ct) = lstm(pac_pad_seq)
# Let's first look at the output; this is a packed output
print(pac_pad_out)

PackedSequence(data=tensor([[ 0.0428, -0.3015,  0.0359,  0.0557],
        [ 0.0587, -0.3025,  0.0026,  0.0203],
        [ 0.0147, -0.2585, -0.0057,  0.0754],
        [ 0.0919, -0.4145,  0.0278,  0.0480],
        [ 0.0537, -0.3388, -0.0491,  0.0110],
        [ 0.0398, -0.3531, -0.0005,  0.0337],
        [ 0.0768, -0.4989,  0.0203,  0.0674],
        [ 0.0839, -0.3811, -0.0262, -0.0056],
        [ 0.1019, -0.4925, -0.0177,  0.0224]], grad_fn=<CatBackward>), batch_sizes=tensor([3, 3, 2, 1]))


Okay, this is the packed output; the sequences are of unequal lengths.
Now we need to restore the padded layout by filling in 0s for the shorter sequences.

In [10]:
pad_out = nn.utils.rnn.pad_packed_sequence(pac_pad_out, batch_first=True, padding_value=0)
print(pad_out)

(tensor([[[ 0.0428, -0.3015,  0.0359,  0.0557],
         [ 0.0919, -0.4145,  0.0278,  0.0480],
         [ 0.0768, -0.4989,  0.0203,  0.0674],
         [ 0.1019, -0.4925, -0.0177,  0.0224]],

        [[ 0.0587, -0.3025,  0.0026,  0.0203],
         [ 0.0537, -0.3388, -0.0491,  0.0110],
         [ 0.0839, -0.3811, -0.0262, -0.0056],
         [ 0.0000,  0.0000,  0.0000,  0.0000]],

        [[ 0.0147, -0.2585, -0.0057,  0.0754],
         [ 0.0398, -0.3531, -0.0005,  0.0337],
         [ 0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<TransposeBackward0>), tensor([4, 3, 2]))
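
This also gives a handle on the earlier question about backprop. One way to probe it is to look at the gradient that reaches the padded input positions, with and without packing. The sketch below uses `out.sum()` as a stand-in loss, which is my assumption and not part of the original experiment; a real loss that masks padded positions would behave differently in the unpacked case.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(3, 2, batch_first=True, bidirectional=True)

lens = [4, 3, 2]
base = torch.rand(3, 4, 3)
base[1, 3:, :] = 0.0
base[2, 2:, :] = 0.0

# True wherever a time step is padding; shape (batch, time).
pad_mask = torch.arange(4).unsqueeze(0) >= torch.tensor(lens).unsqueeze(1)

# 1) Plain padded input: padded steps are processed like any other step.
x1 = base.clone().requires_grad_(True)
out1, _ = lstm(x1)
out1.sum().backward()
grad_on_pad_unpacked = x1.grad[pad_mask].abs().sum().item()

# 2) Packed input: padded steps never enter the LSTM.
x2 = base.clone().requires_grad_(True)
packed_out, _ = lstm(pack_padded_sequence(x2, lens, batch_first=True))
out2, _ = pad_packed_sequence(packed_out, batch_first=True)
out2.sum().backward()
grad_on_pad_packed = x2.grad[pad_mask].abs().sum().item()

print(grad_on_pad_unpacked, grad_on_pad_packed)
```

With packing, nothing flows back to the padded positions, so they cannot influence the weight updates either.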


The output looks good! Now let us look at the hidden state.

In [11]:
print(pac_ht)

tensor([[[ 0.1019, -0.4925],
         [ 0.0839, -0.3811],
         [ 0.0398, -0.3531]],

        [[ 0.0359,  0.0557],
         [ 0.0026,  0.0203],
         [-0.0057,  0.0754]]], grad_fn=<ViewBackward>)


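This is great. As we can see, the forward (left-to-right) LSTM's last hidden state now respects the lengths: it is the output at step 4, 3 and 2 for the three sequences. The same should hold for c_t. The backward states of the two shorter sequences also changed slightly (for example 0.0017 became 0.0026), since the right-to-left pass no longer starts from the padded zeros.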
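
As a cross-check, the forward half can also be recovered from the *unpacked* output by indexing each sequence at `length - 1`, because the left-to-right pass never looks ahead. The backward half cannot be rescued that way. A sketch on a fresh LSTM (variable names are mine, values differ from the ones above):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
hidden = 2
lstm = nn.LSTM(3, hidden, batch_first=True, bidirectional=True)

lens = torch.tensor([4, 3, 2])
x = torch.rand(3, 4, 3)
x[1, 3:, :] = 0.0
x[2, 2:, :] = 0.0

out, (h_t, _) = lstm(x)                                          # no packing
_, (pac_ht, _) = lstm(pack_padded_sequence(x, lens, batch_first=True))

# Forward direction: pick the output at the last REAL step of each sequence.
rows = torch.arange(x.size(0))
fwd_last = out[rows, lens - 1, :hidden]
fwd_ok = torch.allclose(fwd_last, pac_ht[0], atol=1e-6)

# Backward direction: the unpacked pass started on padding for the shorter sequences,
# so its final states differ from the packed ones (the full-length one is unaffected).
bwd_full_len_ok = torch.allclose(h_t[1][0], pac_ht[1][0], atol=1e-6)
bwd_short_differs = not torch.allclose(h_t[1][1:], pac_ht[1][1:], atol=1e-6)

print(fwd_ok, bwd_full_len_ok, bwd_short_differs)
```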

Let us concatenate the forward and backward LSTMs' hidden states:

In [12]:
torch.cat([pac_ht[0],pac_ht[1]], dim=1) 

tensor([[ 0.1019, -0.4925,  0.0359,  0.0557],
        [ 0.0839, -0.3811,  0.0026,  0.0203],
        [ 0.0398, -0.3531, -0.0057,  0.0754]], grad_fn=<CatBackward>)

----

# Multi Layer LSTM

Let us redo the above experiments to understand how a 2-layer LSTM works.

In [13]:
n_layers = 2
inp_size = 3
out_size = 2
lstm2 = nn.LSTM(inp_size, out_size, num_layers=n_layers, batch_first=True, bidirectional=True)

In [14]:
pac_out, (h_n, c_n) = lstm2(pac_pad_seq)
print("Packed Output:")
print(pac_out)
pad_out = nn.utils.rnn.pad_packed_sequence(pac_out, batch_first=True, padding_value=0)
print("Pad Output:")
print(pad_out)


print("Last h_n:")
print(h_n)

print("Last c_n:")
print(c_n)

Packed Output:
PackedSequence(data=tensor([[ 0.2443,  0.0703, -0.0871, -0.0664],
        [ 0.2496,  0.0677, -0.0658, -0.0605],
        [ 0.2419,  0.0687, -0.0701, -0.0521],
        [ 0.3354,  0.0964, -0.0772, -0.0613],
        [ 0.3272,  0.0975, -0.0655, -0.0534],
        [ 0.3216,  0.1055, -0.0504, -0.0353],
        [ 0.3644,  0.1065, -0.0752, -0.0531],
        [ 0.3583,  0.1116, -0.0418, -0.0350],
        [ 0.3760,  0.1139, -0.0438, -0.0351]], grad_fn=<CatBackward>), batch_sizes=tensor([3, 3, 2, 1]))
Pad Output:
(tensor([[[ 0.2443,  0.0703, -0.0871, -0.0664],
         [ 0.3354,  0.0964, -0.0772, -0.0613],
         [ 0.3644,  0.1065, -0.0752, -0.0531],
         [ 0.3760,  0.1139, -0.0438, -0.0351]],

        [[ 0.2496,  0.0677, -0.0658, -0.0605],
         [ 0.3272,  0.0975, -0.0655, -0.0534],
         [ 0.3583,  0.1116, -0.0418, -0.0350],
         [ 0.0000,  0.0000,  0.0000,  0.0000]],

        [[ 0.2419,  0.0687, -0.0701, -0.0521],
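         [ 0.3216,  0.1055, -0.0504, -0.0353],
         [ 0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<TransposeBackward0>), tensor([4, 3, 2]))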

The LSTM output looks similar to that of the single-layer LSTM: it has the same shape, because only the last layer's outputs are returned.

However, the h_n and c_n states are bigger, since there are two layers.
Now it's time to RTFM.


> h_n of shape `(num_layers * num_directions, batch, hidden_size)`: tensor containing the hidden state for `t = seq_len`.
Like output, the layers can be separated using `h_n.view(num_layers, num_directions, batch, hidden_size)` and similarly for c_n.

In [15]:
batch_size = 3
num_dirs = 2
l_n_h_n = h_n.view(n_layers, num_dirs, batch_size, out_size)[-1]
# last layer last time step hidden state
print(l_n_h_n)

tensor([[[ 0.3760,  0.1139],
         [ 0.3583,  0.1116],
         [ 0.3216,  0.1055]],

        [[-0.0871, -0.0664],
         [-0.0658, -0.0605],
         [-0.0701, -0.0521]]], grad_fn=<SelectBackward>)


In [16]:
last_hid = torch.cat([l_n_h_n[0], l_n_h_n[1]], dim=1)

print("last layer last time step hidden state")
print(last_hid)

print("Padded Outputs :")
print(pad_out)

last layer last time step hidden state
tensor([[ 0.3760,  0.1139, -0.0871, -0.0664],
        [ 0.3583,  0.1116, -0.0658, -0.0605],
        [ 0.3216,  0.1055, -0.0701, -0.0521]], grad_fn=<CatBackward>)
Padded Outputs :
(tensor([[[ 0.2443,  0.0703, -0.0871, -0.0664],
         [ 0.3354,  0.0964, -0.0772, -0.0613],
         [ 0.3644,  0.1065, -0.0752, -0.0531],
         [ 0.3760,  0.1139, -0.0438, -0.0351]],

        [[ 0.2496,  0.0677, -0.0658, -0.0605],
         [ 0.3272,  0.0975, -0.0655, -0.0534],
         [ 0.3583,  0.1116, -0.0418, -0.0350],
         [ 0.0000,  0.0000,  0.0000,  0.0000]],

        [[ 0.2419,  0.0687, -0.0701, -0.0521],
         [ 0.3216,  0.1055, -0.0504, -0.0353],
         [ 0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<TransposeBackward0>), tensor([4, 3, 2]))
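
Putting the pieces together for the encoder-to-decoder goal: the sketch below merges the two directions of every layer, so the result can initialise a unidirectional decoder whose hidden size is twice the encoder's. `merge_directions` and the `decoder` are hypothetical names I made up for illustration; this is one possible way to bridge the states, not the only one.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


def merge_directions(state, num_layers, num_dirs=2):
    """(num_layers * num_dirs, batch, hidden) -> (num_layers, batch, num_dirs * hidden)."""
    _, batch, hidden = state.shape
    state = state.reshape(num_layers, num_dirs, batch, hidden)
    # Put the direction axis next to hidden, then flatten the two: [forward | backward].
    return state.transpose(1, 2).reshape(num_layers, batch, num_dirs * hidden)


torch.manual_seed(0)
n_layers, hidden = 2, 2
lstm2 = nn.LSTM(3, hidden, num_layers=n_layers, batch_first=True, bidirectional=True)

lens = torch.tensor([4, 3, 2])
x = torch.rand(3, 4, 3)
x[1, 3:, :] = 0.0
x[2, 2:, :] = 0.0

pac_out, (h_n, c_n) = lstm2(pack_padded_sequence(x, lens, batch_first=True))
pad_out, _ = pad_packed_sequence(pac_out, batch_first=True)

h0 = merge_directions(h_n, n_layers)   # (num_layers, batch, 2 * hidden)
c0 = merge_directions(c_n, n_layers)

# The top layer of h0 lines up with the padded outputs:
# forward half at each sequence's last real step, backward half at step 0.
rows = torch.arange(x.size(0))
top = h0[-1]
fwd_ok = torch.allclose(top[:, :hidden], pad_out[rows, lens - 1, :hidden], atol=1e-6)
bwd_ok = torch.allclose(top[:, hidden:], pad_out[:, 0, hidden:], atol=1e-6)

# Hypothetical decoder: unidirectional, same depth, hidden size 2 * hidden.
decoder = nn.LSTM(3, 2 * hidden, num_layers=n_layers, batch_first=True)
dec_out, _ = decoder(torch.rand(3, 5, 3), (h0, c0))

print(h0.shape, fwd_ok, bwd_ok, dec_out.shape)
```

Concatenating directions keeps all the encoder information but forces the decoder to be twice as wide; summing or projecting the two directions are other common choices.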
