# RNN

Rather than modeling $P(x_{t}|x_{t-n+1},...,x_{t-1})$ over a fixed window of the previous $n-1$ observations, it is preferable to use a latent variable model:

$$P(x_{t}|x_{1},...,x_{t-1}) \approx P(x_{t}|h_{t-1})$$

where $h_{t-1}$ is a hidden state that stores the sequence information up to time step $t-1$.

In [1]:
import torch
import d2l
from torch import nn
from torch.nn import functional as F  # used for `F.one_hot` in `RNNModel.forward`

## Model

Recall the fully connected layer:

$$\mathbf{H} = \phi(\mathbf{X}\mathbf{W}_{xh} + \mathbf{b}_{h})$$

where $\mathbf{X}\in\mathbb{R}^{n\times{d}}$, $\mathbf{W}_{xh}\in\mathbb{R}^{d\times{h}}$, $\mathbf{b}_{h}\in\mathbb{R}^{1\times{h}}$, and $\mathbf{H}\in\mathbb{R}^{n\times{h}}$.
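
As a quick shape check, the layer above can be sketched in a few lines. The sizes `n`, `d`, `h` and the choice of `tanh` for $\phi$ are illustrative assumptions, not values taken from the text.

```python
import torch

torch.manual_seed(0)
n, d, h = 4, 3, 5           # assumed sizes: batch, input features, hidden units

X = torch.randn(n, d)       # minibatch of n examples with d features each
W_xh = torch.randn(d, h)    # input-to-hidden weights
b_h = torch.zeros(1, h)     # bias, broadcast across the n rows

# H = phi(X W_xh + b_h), with phi assumed to be tanh here
H = torch.tanh(X @ W_xh + b_h)
print(tuple(H.shape))
```

The result has one row per example and one column per hidden unit, matching $\mathbf{H}\in\mathbb{R}^{n\times{h}}$.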

Matters are entirely different when we have hidden states. Assume that we have a minibatch of inputs $\mathbf{X}_{t}\in\mathbb{R}^{n\times{d}}$ at time step $t$.

Denote by $\mathbf{H}_{t}\in\mathbb{R}^{n\times{h}}$ the hidden states at time step $t$. The RNN model updates the hidden states by:

$$\mathbf{H}_{t} = \phi(\mathbf{X}_{t}\mathbf{W}_{xh} + \mathbf{H}_{t-1}\mathbf{W}_{hh} + \mathbf{b}_{h})$$

where $\mathbf{W}_{hh}\in\mathbb{R}^{h\times{h}}$ and $\phi = \tanh$ by default.

![jupyter](../images/8/rnn.svg)
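
The recurrence can be unrolled by hand and compared against `nn.RNN`. The sketch below assumes a single-layer, unidirectional `nn.RNN` with its default `tanh` nonlinearity. The sizes are illustrative, and the mapping of PyTorch's parameters onto $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$ and $\mathbf{b}_{h}$ (transposed weights, sum of the two bias vectors) is how this example lines up the symbols, not something stated in the text.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, h, T = 2, 3, 4, 5     # assumed sizes: batch, input dim, hidden dim, time steps
rnn_layer = nn.RNN(input_size=d, hidden_size=h)  # tanh nonlinearity by default

# PyTorch stores weights as (out, in) and keeps two bias vectors,
# so transpose the weights and add the biases to get the symbols above.
W_xh = rnn_layer.weight_ih_l0.T                      # (d, h)
W_hh = rnn_layer.weight_hh_l0.T                      # (h, h)
b_h = rnn_layer.bias_ih_l0 + rnn_layer.bias_hh_l0    # (h,)

X = torch.randn(T, n, d)    # (num_steps, batch_size, input_size)
H = torch.zeros(n, h)       # initial hidden state H_0
outputs = []
with torch.no_grad():
    for X_t in X:           # iterate over time steps
        H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
        outputs.append(H)
    manual = torch.stack(outputs)                        # (T, n, h)
    reference, _ = rnn_layer(X, torch.zeros(1, n, h))    # (T, n, h)

print(torch.allclose(manual, reference, atol=1e-5))
```

The same weights are reused at every time step; only $\mathbf{H}_{t-1}$ carries information forward.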

## Implementation

In [2]:
#@save
class RNNModel(nn.Module):
    """The RNN model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size  # vocab_size -> num_hiddens -> vocab_size
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        # Shape of inputs: (`batch_size`, `num_steps`)
        # Shape of X: (`num_steps`, `batch_size`, `vocab_size`)
        X = F.one_hot(inputs.T.long(), self.vocab_size).type(torch.float32)
        Y, state = self.rnn(X, state)
        # The fully connected layer will first change the shape of `Y` to
        # (`num_steps` * `batch_size`, `num_hiddens`). Its output shape is
        # (`num_steps` * `batch_size`, `vocab_size`).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, batch_size=1, device=d2l.try_gpu()):
        if not isinstance(self.rnn, nn.LSTM):
            # `nn.RNN` and `nn.GRU` take a single tensor as hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens), device=device)
        else:
            # `nn.LSTM` takes a tuple of hidden states
            return (torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device),
                    torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device))
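
The two branches of `begin_state` reflect how the built-in layers represent their state. The following sketch uses assumed sizes and single-layer, unidirectional layers to show that `nn.GRU` returns one tensor, while `nn.LSTM` returns a tuple of two tensors (hidden state and memory cell) of the same shape.

```python
import torch
from torch import nn

# Illustrative sizes, not taken from the text
num_steps, batch_size, input_size, num_hiddens = 5, 2, 8, 16
X = torch.randn(num_steps, batch_size, input_size)

gru = nn.GRU(input_size, num_hiddens)
lstm = nn.LSTM(input_size, num_hiddens)

# State shape: (num_directions * num_layers, batch_size, num_hiddens)
h0 = torch.zeros(1, batch_size, num_hiddens)
c0 = torch.zeros_like(h0)

_, gru_state = gru(X, h0)            # a single tensor
_, lstm_state = lstm(X, (h0, c0))    # a tuple (hidden state, memory cell)

print(type(gru_state).__name__, tuple(gru_state.shape))
print(type(lstm_state).__name__, [tuple(s.shape) for s in lstm_state])
```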

In [3]:
# vocab_size: 100, num_hiddens: 64
rnn = RNNModel(nn.RNN(input_size=100, hidden_size=64), vocab_size=100)
rnn

RNNModel(
  (rnn): RNN(100, 64)
  (linear): Linear(in_features=64, out_features=100, bias=True)
)
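
To see how shapes flow through `forward`, the sketch below repeats its steps with a bare `nn.RNN` and `nn.Linear`, so that it runs without the `d2l` helper. The batch size and number of steps are assumed values chosen for illustration, and the token indices are random.

```python
import torch
from torch import nn
from torch.nn import functional as F

vocab_size, num_hiddens = 100, 64
batch_size, num_steps = 4, 10       # assumed values for illustration

rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
linear = nn.Linear(num_hiddens, vocab_size)

# Random token indices of shape (batch_size, num_steps)
inputs = torch.randint(0, vocab_size, (batch_size, num_steps))

# Transpose to time-major order, then one-hot encode:
# (num_steps, batch_size, vocab_size)
X = F.one_hot(inputs.T.long(), vocab_size).type(torch.float32)

# Initial state: (num_directions * num_layers, batch_size, num_hiddens)
state = torch.zeros(1, batch_size, num_hiddens)
Y, state = rnn_layer(X, state)      # Y: (num_steps, batch_size, num_hiddens)

# Flatten time and batch dimensions before the output layer:
# (num_steps * batch_size, vocab_size)
output = linear(Y.reshape((-1, Y.shape[-1])))

print(tuple(X.shape), tuple(Y.shape), tuple(output.shape), tuple(state.shape))
```

Each row of `output` holds the unnormalized scores over the vocabulary for one (time step, example) pair.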