# Imports

In [1]:
import torch
import torch.nn as nn

# The vanilla RNN cell

![image.png](images/rnn_equations.png)
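
In case the image does not render, here are the equations written out. This is a reconstruction from the code and notes later in this notebook (where $W$, $U$, $V$, the biases $b$ and $c$, and $\hat{y} = \mathrm{softmax}(o[t])$ all appear), so take the exact layout as an assumption rather than a transcription of the figure:

$$
\begin{aligned}
a[t] &= b + W\,h[t-1] + U\,x[t] \\
h[t] &= \tanh(a[t]) \\
o[t] &= c + V\,h[t] \\
\hat{y}[t] &= \mathrm{softmax}(o[t])
\end{aligned}
$$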

# Inner workings of an RNN cell
![image.png](images/rnn_anatomy.png)
The biases b and c are omitted to simplify the illustration. ŷ = softmax(o[t]) is also left out of the model anatomy because it is often computed separately, after inference.

# The unrolled representation
![image.png](images/rnn_unrolled.png)

# The compact representation
![image.png](images/rnn_compact.png)

# PyTorch
To implement this in PyTorch, we first need a module for the RNN cell. Recap of the formulas:
![image.png](images/rnn_equations.png)

In [2]:
class RNNCell(nn.Module):
    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size)
        self.U = nn.Linear(in_size, hidden_size)
        self.V = nn.Linear(hidden_size, out_size)
        
    def forward(self, x, hidden_prev):
        # hidden_prev is h[t-1], x is x[t] in the formulas
        
        # self.W(hidden_prev) computes the matrix product W @ hidden_prev because W is a linear layer. Same with U.
        # Why aren't we accounting for the bias b mentioned in equation 1? You'll get a hint if you
        # read the docs for torch.nn.Linear :)
        a = self.W(hidden_prev) + self.U(x)
        hidden_new = torch.tanh(a)
        o = self.V(hidden_new)
        
        # As mentioned before, the softmax can be applied separately, outside the model.
        # Losses like torch.nn.CrossEntropyLoss expect raw, un-softmaxed outputs (logits), which is more numerically stable.
        return o, hidden_new
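
To make the point about un-softmaxed outputs concrete, here is a small sketch. The logits and targets are made-up stand-ins for the `o` returned by the cell; nothing here uses the model itself. It shows the softmax being applied outside the model, and that `cross_entropy` on raw logits matches `log_softmax` followed by `nll_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw outputs (logits) for 4 timesteps and 3 classes,
# standing in for the `o` values returned by the cell.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# Softmax applied outside the model, only when probabilities are needed.
probs = F.softmax(logits, dim=-1)

# cross_entropy takes the logits directly; it combines
# log_softmax and the negative log-likelihood loss.
loss_from_logits = F.cross_entropy(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(probs.sum(dim=-1))  # every row of probabilities sums to 1
print(loss_from_logits, loss_manual)
```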

Make an RNN cell with input size 2, hidden size 8 and output size 3, just like in the illustration.

In [3]:
rnn_cell = RNNCell(2, 8, 3)

# Single timestep inference

In [4]:
inp = torch.randn(2)
hidden = torch.zeros(8)
inp, hidden

(tensor([ 0.7923, -1.9647]), tensor([0., 0., 0., 0., 0., 0., 0., 0.]))

In [5]:
out, hidden_new = rnn_cell(inp, hidden)
out, hidden_new

(tensor([-0.1407,  0.1450,  0.2688], grad_fn=<AddBackward0>),
 tensor([ 0.4187,  0.7815,  0.2956, -0.7836, -0.8777,  0.5662, -0.7539,  0.4958],
        grad_fn=<TanhBackward0>))
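
Coming back to the bias question in the `forward` comments (skip this if you want to work it out yourself): `nn.Linear` carries its own bias by default, so the two linear layers together already supply a bias term. The sketch below uses standalone `W` and `U` layers as stand-ins for the ones inside the cell and checks this numerically.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the W and U layers of the cell (hidden size 8, input size 2).
W = nn.Linear(8, 8)
U = nn.Linear(2, 8)

x = torch.randn(2)
h_prev = torch.randn(8)

# What the cell computes before the tanh.
a_layers = W(h_prev) + U(x)

# The same thing written out: nn.Linear stores weight and bias separately,
# and the two biases added together play the role of b.
b = W.bias + U.bias
a_manual = W.weight @ h_prev + U.weight @ x + b

print(torch.allclose(a_layers, a_manual, atol=1e-5))
```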

# Multi timestep inference

In [6]:
# Say we have 10 timesteps of input
inps = torch.randn(10, 2)
inps

tensor([[-1.1621,  1.6706],
        [ 0.2791,  0.5713],
        [ 1.7893, -0.5475],
        [-0.0800,  0.9690],
        [ 1.0539,  0.0329],
        [ 1.8258,  3.1347],
        [ 0.6438, -0.0331],
        [ 0.4236, -0.9052],
        [-0.1745,  0.0764],
        [-1.0459,  1.7758]])

In [7]:
outs = torch.empty(10, 3)
hidden = torch.zeros(8)
for timestep, inp in enumerate(inps):
    # inp is a tensor of size 2
    out, hidden = rnn_cell(inp, hidden)
    outs[timestep] = out
outs

tensor([[-0.1154, -0.2380,  0.0718],
        [-0.0826, -0.3541,  0.3018],
        [-0.4145, -0.0531,  0.2891],
        [-0.5797, -0.2687,  0.0800],
        [-0.3567, -0.2947,  0.3116],
        [-0.5169, -0.5098,  0.1366],
        [-0.3100, -0.2848,  0.3023],
        [-0.4331, -0.0212,  0.1975],
        [-0.4625, -0.1930,  0.1603],
        [-0.3182, -0.4012,  0.1232]], grad_fn=<CopySlices>)

# Abstract this into another Module?
Looping over timesteps by hand is tedious. Let's wrap the loop in a module that takes the inputs for all timesteps and returns the outputs generated by the RNNCell, so that we can simply write `outs = rnn(inps)`.

In [8]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, out_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.out_size = out_size
        self.rnn_cell = RNNCell(input_size, hidden_size, out_size)
        
    def forward(self, inputs):
        # inputs is going to be of shape (timesteps x input_size) assuming that we are not batching.
        # If the inputs are batched, then it will be of shape (batch_size x timesteps x input_size)
        # In that case, the rhs of below line will be inputs.shape[1]
        num_timesteps = inputs.shape[0]
        outs = torch.empty(num_timesteps, self.out_size)
        hidden = torch.zeros(self.hidden_size) # If batched, this should be torch.zeros(batch_size, self.hidden_size)
        for timestep in range(num_timesteps):
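            # inp is a tensor of size input_size (2 in our example)
            inp = inputs[timestep] # If batched, this should be inputs[:, timestep, :]
            out, hidden = self.rnn_cell(inp, hidden) # No change to this line when batched. Why?
            outs[timestep] = out # If batched... you get the idea. Just add an extra dimension
        return outs

In [9]:
rnn = RNN(2, 8, 3)
# Share the standalone rnn_cell from above so that the outputs below
# can be compared directly with the manual loop.
rnn.rnn_cell = rnn_cell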

In [10]:
rnn(inps)

tensor([[-0.1154, -0.2380,  0.0718],
        [-0.0826, -0.3541,  0.3018],
        [-0.4145, -0.0531,  0.2891],
        [-0.5797, -0.2687,  0.0800],
        [-0.3567, -0.2947,  0.3116],
        [-0.5169, -0.5098,  0.1366],
        [-0.3100, -0.2848,  0.3023],
        [-0.4331, -0.0212,  0.1975],
        [-0.4625, -0.1930,  0.1603],
        [-0.3182, -0.4012,  0.1232]], grad_fn=<CopySlices>)
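
The batching hints in the comments above can be put together into a working module. This is only a sketch: `BatchedRNN` is a hypothetical name, the cell is inlined as three linear layers to keep the block self-contained, and the outputs are collected in a list and stacked instead of being written into a preallocated tensor.

```python
import torch
import torch.nn as nn

class BatchedRNN(nn.Module):
    """Hypothetical batched variant of the RNN module above."""
    def __init__(self, input_size, hidden_size, out_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W = nn.Linear(hidden_size, hidden_size)
        self.U = nn.Linear(input_size, hidden_size)
        self.V = nn.Linear(hidden_size, out_size)

    def forward(self, inputs):
        # inputs: (batch_size, timesteps, input_size)
        batch_size, num_timesteps, _ = inputs.shape
        hidden = torch.zeros(batch_size, self.hidden_size,
                             device=inputs.device, dtype=inputs.dtype)
        outs = []
        for timestep in range(num_timesteps):
            inp = inputs[:, timestep, :]  # (batch_size, input_size)
            # nn.Linear acts on the last dimension only, so the cell
            # update is written exactly as in the unbatched case.
            hidden = torch.tanh(self.W(hidden) + self.U(inp))
            outs.append(self.V(hidden))   # (batch_size, out_size)
        # Stack along a new time dimension: (batch_size, timesteps, out_size)
        return torch.stack(outs, dim=1)

torch.manual_seed(0)
batched_rnn = BatchedRNN(2, 8, 3)
batch = torch.randn(4, 10, 2)
batched_outs = batched_rnn(batch)
print(batched_outs.shape)

# A sequence run on its own gives the same result as its row in the batch.
single_outs = batched_rnn(batch[:1])
print(torch.allclose(batched_outs[:1], single_outs, atol=1e-5))
```

The comment inside the loop is also the answer to the "Why?" above: linear layers only touch the last dimension, so any leading batch dimension passes straight through.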

# The LSTM
![image.png](images/lstm_cell.png)

No, I am not going to implement this myself or put you through that. Let's use PyTorch's own implementation.


## PyTorch's implementation
Just like what we did here, PyTorch exposes both `nn.LSTMCell` and `nn.LSTM`. The semantics are the same: `nn.LSTMCell` expects the input for a single timestep, `nn.LSTM` expects all the timesteps at once.\
`nn.LSTM` also implements [multiple layers](https://stats.stackexchange.com/questions/163304/what-are-the-advantages-of-stacking-multiple-lstms) and [bidirectionality](https://stats.stackexchange.com/questions/163304/what-are-the-advantages-of-stacking-multiple-lstms), which you would otherwise have to build by hand on top of `nn.LSTMCell`.
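
To see that the cell and the full module really do the same computation, here is a sketch that copies the parameters of a single-layer `nn.LSTM` into an `nn.LSTMCell` and loops over the timesteps by hand. The sizes and the `demo_` names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 5, 4

demo_lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
demo_cell = nn.LSTMCell(input_size, hidden_size)

# Copy the single layer's parameters into the cell so both compute the same function.
with torch.no_grad():
    demo_cell.weight_ih.copy_(demo_lstm.weight_ih_l0)
    demo_cell.weight_hh.copy_(demo_lstm.weight_hh_l0)
    demo_cell.bias_ih.copy_(demo_lstm.bias_ih_l0)
    demo_cell.bias_hh.copy_(demo_lstm.bias_hh_l0)

x = torch.randn(3, 6, input_size)  # batch_size x timesteps x input_size

# nn.LSTM: all timesteps in one call.
lstm_out, (h_n, c_n) = demo_lstm(x)

# nn.LSTMCell: one timestep per call, looping by hand.
h = torch.zeros(3, hidden_size)
c = torch.zeros(3, hidden_size)
cell_steps = []
for t in range(x.shape[1]):
    h, c = demo_cell(x[:, t, :], (h, c))
    cell_steps.append(h)
cell_out = torch.stack(cell_steps, dim=1)  # batch_size x timesteps x hidden_size

print(torch.allclose(lstm_out, cell_out, atol=1e-5))
```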

### Subtle differences
1. There is no explicit `out_size`. At each timestep, the cell just gives you its hidden state. If you want the output size to differ from `hidden_size`, you can pass the hidden state through a separate linear layer that goes from `hidden_size` -> `out_size`, or set `proj_size = out_size` in the `nn.LSTM` init params. The effect is similar but not identical: the built-in projection has no bias, the projected state is also what gets fed back into the recurrence, and `proj_size` must be smaller than `hidden_size`.
2. There are now 2 vectors calculated at each timestep: the hidden state and the cell state.
3. Unlike our implementation, `nn.LSTM` also returns the hidden and cell states of the final timestep, in addition to the per-timestep outputs.
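
The two ways of getting an output size different from `hidden_size` can be compared side by side. This is a sketch with made-up variable names; it only looks at shapes, since the two options are separate models with separate weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 10, 768)  # batch_size x timesteps x input_size

# Option 1: plain LSTM followed by a separate linear layer.
plain_lstm = nn.LSTM(input_size=768, hidden_size=8, batch_first=True)
to_out = nn.Linear(8, 2)
hidden_states, _ = plain_lstm(x)   # (16, 10, 8)
out_linear = to_out(hidden_states)  # (16, 10, 2)

# Option 2: let nn.LSTM project its hidden state itself.
proj_lstm = nn.LSTM(input_size=768, hidden_size=8, proj_size=2, batch_first=True)
out_proj, (h_n, c_n) = proj_lstm(x)  # out_proj: (16, 10, 2)

print(out_linear.shape, out_proj.shape)
# The projected hidden state has size 2, the cell state keeps size 8.
print(h_n.shape, c_n.shape)
# The projection is an extra bias-free weight matrix per layer.
print(proj_lstm.weight_hr_l0.shape)
# With proj_size set, the recurrent weights act on the projected state (size 2).
print(proj_lstm.weight_hh_l0.shape, plain_lstm.weight_hh_l0.shape)
```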

In [11]:
lstm = nn.LSTM(
    input_size = 768,
    hidden_size = 8,
    num_layers = 2,
    batch_first = True,
    proj_size = 2
)

Let's say we are dealing with a 2 class classification problem. At each timestep, the input is a word and the model is supposed to predict whether that word falls into category 0 or 1.\
Input shape will be (`batch_size` x `timesteps` x `word_vector_dim`). (How many words in total?)\
Output shape will be (`batch_size` x `timesteps` x `2`)\
Ground truth labels would be of shape (`batch_size` x `timesteps`), where each element is 0 or 1: the category of that particular word.

In [12]:
inputs = torch.randn(16, 10, 768) # batch_size x timesteps x input_size
targets = torch.randint(0, 2, (16, 10)) # batch_size x timesteps
h0 = torch.zeros(2, 16, 2) # num_layers x batch_size x out_size (= proj_size)
c0 = torch.zeros(2, 16, 8) # num_layers x batch_size x hidden_size

In [13]:
outputs, (hn, cn) = lstm(inputs, (h0, c0))
outputs.shape, hn.shape, cn.shape

(torch.Size([16, 10, 2]), torch.Size([2, 16, 2]), torch.Size([2, 16, 8]))

If we don't provide an initial (h, c), it assumes zero vectors.

In [14]:
outputs, (hn, cn) = lstm(inputs)
outputs.shape, hn.shape, cn.shape

(torch.Size([16, 10, 2]), torch.Size([2, 16, 2]), torch.Size([2, 16, 8]))
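
The matching shapes do not by themselves show that the default initial state is zero, so here is a quick check on the values. It is a sketch on a freshly created LSTM with the same configuration; the `demo_` and `out_` names are just for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_lstm = nn.LSTM(input_size=768, hidden_size=8, num_layers=2,
                    batch_first=True, proj_size=2)
x = torch.randn(16, 10, 768)  # batch_size x timesteps x input_size

# Explicit zero initial states: h0 uses proj_size, c0 uses hidden_size.
h0 = torch.zeros(2, 16, 2)
c0 = torch.zeros(2, 16, 8)
out_explicit, _ = demo_lstm(x, (h0, c0))

# No initial states passed at all.
out_default, _ = demo_lstm(x)

print(torch.allclose(out_explicit, out_default))
```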

# Loss
The loss function does not need to distinguish between batches and timesteps. So, we combine the batch and timestep dimensions into one dimension before passing the outputs and targets to the loss function.

In [15]:
flattened_targets = targets.view(-1)
flattened_outputs = outputs.reshape(-1, 2)
flattened_targets.shape, flattened_outputs.shape

(torch.Size([160]), torch.Size([160, 2]))

In [16]:
loss = torch.nn.functional.cross_entropy(flattened_outputs, flattened_targets)
loss

tensor(0.6998, grad_fn=<NllLossBackward0>)
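
Flattening is not the only option. As a sketch with random stand-in tensors: `cross_entropy` also accepts inputs laid out as (batch, classes, extra dims), so moving the class dimension to position 1 gives the same mean loss without reshaping.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(16, 10, 2)         # batch_size x timesteps x num_classes
labels = torch.randint(0, 2, (16, 10))  # batch_size x timesteps

# Flattening batch and time into one dimension, as done above.
loss_flat = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))

# Alternative: put the class dimension second, (batch, classes, timesteps),
# and pass the targets in their original (batch, timesteps) shape.
loss_permuted = F.cross_entropy(logits.permute(0, 2, 1), labels)

print(loss_flat, loss_permuted)
```

Both versions average over all 160 (batch, timestep) positions, which is why they agree.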

In [17]:
loss.backward()

### Now that you can calculate the loss, the rest of the training pipeline is basically the same as for a regular feed forward network
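
For completeness, here is what a few training steps could look like. This is a minimal sketch on random stand-in data: the choice of Adam, the learning rate and the number of steps are all assumptions, not something prescribed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.LSTM(input_size=768, hidden_size=8, num_layers=2,
                batch_first=True, proj_size=2)
# Optimizer and learning rate are assumptions made for this sketch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-in data with the same shapes as used above.
demo_inputs = torch.randn(16, 10, 768)        # batch_size x timesteps x input_size
demo_targets = torch.randint(0, 2, (16, 10))  # batch_size x timesteps

losses = []
for step in range(5):
    optimizer.zero_grad()                      # clear gradients from the previous step
    demo_outputs, _ = model(demo_inputs)       # batch_size x timesteps x 2
    loss = F.cross_entropy(demo_outputs.reshape(-1, 2), demo_targets.reshape(-1))
    loss.backward()                            # backpropagate through all timesteps
    optimizer.step()                           # update the LSTM parameters
    losses.append(loss.item())

print(losses)
```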
# The end!