### Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) are neural networks with hidden states. Unlike traditional feedforward neural networks, where information flows in one direction from input to output, RNNs have connections that form directed cycles, so the hidden state computed at one time step feeds into the next. This cyclic structure lets an RNN retain a memory of previous inputs and use it when processing the current input.

![RNN](https://media.geeksforgeeks.org/wp-content/uploads/20231204125839/What-is-Recurrent-Neural-Network-660.webp)

Import Libraries

In [1]:
%matplotlib inline
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


In [2]:
class RNNScratch(d2l.Module):
    """The RNN model implemented from scratch."""
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W_xh = nn.Parameter(
            torch.randn(num_inputs, num_hiddens) * sigma
        )
        self.W_hh = nn.Parameter(
            torch.randn(num_hiddens, num_hiddens) * sigma
        )
        self.b_h = nn.Parameter(
            torch.zeros(num_hiddens)
        )

The forward method below defines how to compute the output and hidden state at any time
step, given the current input and the state of the model at the previous time step. Note that
the RNN model loops through the outermost dimension of inputs (the time steps), updating the hidden
state one time step at a time. The model here uses a tanh activation function.
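
Written out with the parameter names used in `RNNScratch`, the update for one time step can be sketched as follows. The symbols $\mathbf{X}_t$ (input minibatch at time step $t$, shape `(batch_size, num_inputs)`) and $\mathbf{H}_t$ (hidden state, shape `(batch_size, num_hiddens)`) are notation assumed here for illustration:

$$\mathbf{H}_t = \tanh\left(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h\right)$$

The same weights $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$, and $\mathbf{b}_h$ are reused at every time step; only the hidden state changes.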

In [3]:
@d2l.add_to_class(RNNScratch)
def forward(self, inputs, state=None):
    if state is None:
        # Initial state with shape: (batch_size, num_hiddens)
        state = torch.zeros(
            (inputs.shape[1], self.num_hiddens),
            device=inputs.device
        )
    else:
        # Unpack the hidden state passed in from a previous call
        state, = state
    outputs = []
    for X in inputs:  # Shape of inputs: (num_steps, batch_size, num_inputs)
        state = torch.tanh(torch.matmul(X, self.W_xh) +
                           torch.matmul(state, self.W_hh) + self.b_h)
        outputs.append(state)
    return outputs, state

The following RNNLMScratch class defines an RNN-based language model, where we pass
in our RNN via the rnn argument of the __init__ method. When training language models, the inputs and outputs are from the same vocabulary. Hence, they have the same dimension, which is equal to the vocabulary size. Note that we use perplexity to evaluate the
model.
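
Perplexity is the exponential of the average per-token cross-entropy, which is why the training and validation steps plot `torch.exp(l)`. The plain-Python sketch below shows the arithmetic on its own. The function name `perplexity` and the list `token_probs` (the probability the model assigned to each correct token) are hypothetical and not part of `d2l`:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token probability 1/4 is as uncertain
# as a uniform guess over 4 tokens, so its perplexity is 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25])

# A perfect model assigns probability 1 to every correct token.
perfect_ppl = perplexity([1.0, 1.0])
```

Lower is better: a perplexity of 1 means the model is always certain and correct, while a perplexity equal to the vocabulary size corresponds to uniform guessing.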

In [5]:
class RNNLMScratch(d2l.Classifier):
    """The RNN-based language model implemented from scratch."""
    def __init__(self, rnn, vocab_size, lr=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.init_params()
    def init_params(self):
        self.W_hq = nn.Parameter(
            torch.randn(self.rnn.num_hiddens, self.vocab_size) * self.rnn.sigma
        )
        self.b_q = nn.Parameter(torch.zeros(self.vocab_size))
    def training_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=True)
        return l
    def validation_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=False)
        return l

Each token is represented internally by a numerical index, which is categorical data. The
most common strategy for such data is to represent each item by a one-hot encoding. A one-hot encoding is a vector
whose length is given by the size of the vocabulary $N$, where all entries are set to 0, except
for the entry corresponding to our token, which is set to 1.
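
As a small illustration, here is a plain-Python sketch of the encoding for a hypothetical 5-token vocabulary. The helper name `one_hot_vector` is an assumption made for this example; the notebook itself relies on `F.one_hot`:

```python
def one_hot_vector(index, vocab_size):
    """Return a list of length vocab_size with a 1 at position index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Token indices 0 and 2 in a vocabulary of size 5.
encoded = [one_hot_vector(i, 5) for i in (0, 2)]
# encoded == [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
```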

In [None]:
@d2l.add_to_class(RNNLMScratch) #@save
def one_hot(self, X):
    # Output shape: (num_steps, batch_size, vocab_size)
    return F.one_hot(X.T, self.vocab_size).type(torch.float32)

The language model uses a fully connected output layer to transform the RNN outputs into
token predictions at each time step.
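
In symbols, the layer is an affine map built from the `W_hq` and `b_q` parameters defined in `init_params`. The notation is assumed here for illustration: $\mathbf{H}_t$ is the RNN output at time step $t$ and $\mathbf{O}_t$ holds the resulting scores over the vocabulary, with shape `(batch_size, vocab_size)`:

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q$$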

In [9]:
@d2l.add_to_class(RNNLMScratch)
def output_layer(self, rnn_outputs):
    # Apply the output layer to the hidden state at each time step
    outputs = [torch.matmul(H, self.W_hq) + self.b_q for H in rnn_outputs]
    return torch.stack(outputs, 1)

In [10]:
@d2l.add_to_class(RNNLMScratch)
def forward(self, X, state=None):
    embs = self.one_hot(X)
    rnn_outputs, _ = self.rnn(embs, state)
    return self.output_layer(rnn_outputs)

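Gradient clipping rescales the gradients during backpropagation whenever their norm exceeds a threshold; the clipped gradients are then used to update the weights. The method below computes the L2 norm over all parameter gradients jointly and, if it exceeds `grad_clip_val`, scales every gradient by `grad_clip_val / norm`, which shrinks the gradient's magnitude while preserving its direction.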
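
To make the arithmetic concrete, here is a plain-Python sketch of clipping by the global L2 norm on two made-up gradients. The function name `clip_by_global_norm` and the example values are assumptions for illustration, not part of `d2l`:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm

# Two hypothetical parameter gradients with joint norm sqrt(9 + 16) = 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)

# After clipping, the joint norm equals the threshold (up to rounding).
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Both gradients are scaled by the same factor, so their ratio (3 to 4) is unchanged.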

In [11]:
@d2l.add_to_class(d2l.Trainer)
def clip_gradients(self, grad_clip_val, model):
    params = [p for p in model.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > grad_clip_val:
        for param in params:
            param.grad[:] *= grad_clip_val / norm

Once a language model has been learned, we can use it not only to predict the next token
but to continue predicting each subsequent one, treating the previously predicted token as
though it were the next in the input. The following predict method generates a continuation, one character at a time, after
ingesting a user-provided prefix. When looping through the characters in prefix, we
keep passing the hidden state to the next time step but do not generate any output. This is
called the warm-up period. After ingesting the prefix, we are ready to begin emitting
the subsequent characters, each of which is fed back into the model as the input at the
next time step.
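
The control flow of warm-up followed by generation can be sketched without any neural network. In the toy example below, `toy_step` is a hypothetical stand-in for one RNN step (it "predicts" the next letter of the alphabet and counts steps in its state), and `generate` is an illustrative name; neither belongs to `d2l`:

```python
def toy_step(token, state):
    """Hypothetical stand-in for one RNN step: returns (prediction, new_state)."""
    prediction = chr((ord(token) - ord('a') + 1) % 26 + ord('a'))
    return prediction, state + 1

def generate(prefix, num_preds, step=toy_step):
    state, outputs = 0, [prefix[0]]
    for i in range(len(prefix) + num_preds - 1):
        # The most recent token is always the input to the next step.
        prediction, state = step(outputs[-1], state)
        if i < len(prefix) - 1:
            # Warm-up: the state is updated, but the prediction is discarded
            # and the next prefix character is used instead.
            outputs.append(prefix[i + 1])
        else:
            # Generation: the model's own prediction is fed back in.
            outputs.append(prediction)
    return ''.join(outputs), state

text, steps = generate('abx', num_preds=3)
```

During warm-up the toy model's guess after `'b'` (which would be `'c'`) is thrown away in favor of the actual prefix character `'x'`; only after the prefix is exhausted do its predictions appear in the output.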

In [None]:
@d2l.add_to_class(RNNLMScratch)
def predict(self, prefix, num_preds, vocab, device=None):
    state, outputs = None, [vocab[prefix[0]]]
    for i in range(len(prefix) + num_preds - 1):
        