## <font color='blue'>Text generation using a Char-RNN model</font>

We're going to train a Recurrent Neural Network (RNN) to understand and generate text character by character. To do this, we'll provide the RNN with a large piece of text and ask it to learn the likelihood of the next character based on the sequence of previous characters.

Let's break it down with a simple example: Imagine our vocabulary consists of just four letters, "helo," and our training sequence is "hello." In this case, we have four separate training examples:

- The RNN should learn that when it sees "h", the next character "e" is likely.
- When it encounters "he", it should expect "l" to come next.
- Similarly, when it has "hel" as input, it should predict "l".
- Finally, after "hell", it should anticipate "o".
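
The four training examples listed above can be enumerated with a short sketch. The names `sequence` and `pairs` are illustrative only and do not appear elsewhere in this notebook.

```python
# Build (context, next character) training pairs from the toy sequence.
sequence = "hello"

# For each position t, the context is everything before t and the
# target is the character at position t.
pairs = [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]

for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

Each pair corresponds to one bullet: the context grows by one character at every step, and the target is always the character that follows it.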

To make this happen, we'll represent each character as a vector using a technique called 1-of-k encoding, where each character is uniquely identified by a specific position in the vector. We'll then feed these character vectors into the RNN one at a time using a step function. The RNN will produce a sequence of output vectors, each with four dimensions, corresponding to the likelihood of the next character in the sequence.
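
As a minimal sketch of 1-of-k encoding for the toy vocabulary, assuming the ordering `h, e, l, o` (the names `vocab` and `toy_char_to_index` are hypothetical), `torch.nn.functional.one_hot` turns each character index into a vector with a single 1:

```python
import torch
import torch.nn.functional as F

vocab = ["h", "e", "l", "o"]          # toy vocabulary from the example
toy_char_to_index = {ch: i for i, ch in enumerate(vocab)}

# Encode the input characters "h", "e", "l", "l" as indices, then as one-hot rows.
indices = torch.tensor([toy_char_to_index[ch] for ch in "hell"])
one_hot = F.one_hot(indices, num_classes=len(vocab))

print(one_hot)
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 1, 0],
#         [0, 0, 1, 0]])
```

Each row is one time step, and the position of the 1 identifies the character.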

In essence, we're training the RNN to understand and generate text character by character, and it will predict the next character based on the context of the preceding characters.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import string
import random
import numpy as np

### <font color='blue'>Some pre-processing</font>

We will train our model using a text file of Shakespeare's plays.

The first step is to create a mapping from characters to integers so that each string can be represented as a list of integers. This is essential because the model accepts only numbers, not strings or characters. With this mapping, the whole corpus becomes a list of integers.
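
The notebook loads a precomputed character list from a file. As a hedged sketch of how such a mapping could be built from raw text, the toy example below uses hypothetical `sample_*` names and sorts the unique characters so the ordering is reproducible:

```python
sample_text = "to be, or not to be"

# Sorted unique characters give a stable, reproducible ordering.
sample_chars = sorted(set(sample_text))
sample_char_to_index = {ch: i for i, ch in enumerate(sample_chars)}
sample_index_to_char = {i: ch for i, ch in enumerate(sample_chars)}

# Encode the text as integers, then decode it back to check the round trip.
encoded = [sample_char_to_index[ch] for ch in sample_text]
decoded = "".join(sample_index_to_char[i] for i in encoded)

print(encoded[:5])
print(decoded == sample_text)
```

The round trip through both dictionaries recovers the original text, which is the property the generation step later relies on.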

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Create a character-to-index and index-to-character mapping
chars = np.load('/content/drive/My Drive/DSC 257R/rnn-gen/chars.npy')
# np.save('chars.npy', chars)
char_to_index = {char: i for i, char in enumerate(chars)}
index_to_char = {i: char for i, char in enumerate(chars)}

In [None]:
char_to_index

{'\n': 0,
 ' ': 1,
 '!': 2,
 '&': 3,
 "'": 4,
 ',': 5,
 '-': 6,
 '.': 7,
 '3': 8,
 ':': 9,
 ';': 10,
 '?': 11,
 'A': 12,
 'B': 13,
 'C': 14,
 'D': 15,
 'E': 16,
 'F': 17,
 'G': 18,
 'H': 19,
 'I': 20,
 'J': 21,
 'K': 22,
 'L': 23,
 'M': 24,
 'N': 25,
 'O': 26,
 'P': 27,
 'Q': 28,
 'R': 29,
 'S': 30,
 'T': 31,
 'U': 32,
 'V': 33,
 'W': 34,
 'X': 35,
 'Y': 36,
 'Z': 37,
 '[': 38,
 ']': 39,
 'a': 40,
 'b': 41,
 'c': 42,
 'd': 43,
 'e': 44,
 'f': 45,
 'g': 46,
 'h': 47,
 'i': 48,
 'j': 49,
 'k': 50,
 'l': 51,
 'm': 52,
 'n': 53,
 'o': 54,
 'p': 55,
 'q': 56,
 'r': 57,
 's': 58,
 't': 59,
 'u': 60,
 'v': 61,
 'w': 62,
 'x': 63,
 'y': 64,
 'z': 65}

Let's examine the mapping between characters and integers shown above.

# ***We are considering 66 different characters and the integer code for 'A' is 12.***

Now let's read in Shakespeare's plays and convert the text to integers.

In [None]:
text = open('/content/drive/My Drive/DSC 257R/rnn-gen/shakespeare_plays.txt', 'r').read()

# Convert the text to a numerical sequence
# text_as_int = [char_to_index[char] for char in text]

data = list(text)
for i, ch in enumerate(data):
    data[i] = char_to_index[ch]

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# data tensor on device
data = torch.tensor(data).to(device)
data = torch.unsqueeze(data, dim=1)

In [None]:
corpus_length = len(text)
print(corpus_length)

3801088


## ***The length of the corpus in characters is 3,801,088.***

### <font color='blue'>Defining our model</font>

##### Initialization:

  The `__init__` method initializes the RNN model with the following parameters:
  - input_size: The size of the character vocabulary. This indicates the number of unique characters that the model can work with.
  - output_size: The size of the output vocabulary. It's typically set to the same value as input_size for character generation tasks.
  - hidden_size: The number of hidden units in the LSTM (Long Short-Term Memory) layer.
  - num_layers: The number of LSTM layers stacked on top of each other.

##### Embedding Layer:

  Inside the `__init__` method, an `nn.Embedding` layer is created. This layer is used to convert character indices (input) into dense vectors of fixed size.
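
A small sketch of what the embedding layer does, using the same vocabulary size as this notebook. The variable names are illustrative. It also checks that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 66                       # matches the vocabulary in this notebook
embedding = nn.Embedding(vocab_size, vocab_size)

indices = torch.tensor([12, 40, 54])  # 'A', 'a', 'o' in char_to_index above
vectors = embedding(indices)
print(vectors.shape)                  # torch.Size([3, 66])

# The lookup equals a one-hot vector multiplied by the weight matrix.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
same = torch.allclose(vectors, one_hot @ embedding.weight)
print(same)                           # True
```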

##### LSTM Layer:

The `nn.LSTM` layer is defined with the specified `input_size`, `hidden_size`, and `num_layers`. It processes the embedded character sequence to capture dependencies and patterns within the sequence.
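
The shapes involved can be checked with a short sketch using random input. Note that `nn.LSTM` treats the first dimension as time unless `batch_first=True` is passed. The tensor `x` here is hypothetical data, not part of the notebook:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=66, hidden_size=512, num_layers=3)

# By default nn.LSTM expects input shaped (seq_len, batch, input_size).
x = torch.randn(5, 2, 66)             # 5 time steps, batch of 2
output, (h_n, c_n) = lstm(x)          # initial state defaults to zeros

print(output.shape)                   # torch.Size([5, 2, 512])
print(h_n.shape)                      # torch.Size([3, 2, 512])
print(c_n.shape)                      # torch.Size([3, 2, 512])
```

`output` holds the top layer's hidden vector at every time step, while `h_n` and `c_n` hold the final hidden and cell states for each of the three layers.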

##### Decoder Layer:

After the LSTM layer, there is a linear (fully connected) layer defined as `nn.Linear`, which takes the output from the LSTM layer and maps it to the desired output size.

##### Forward Pass:

The forward method is where the actual computation occurs. It takes an input sequence (`input_seq`) and a hidden state (`hidden_state`) as input arguments.

First, the input sequence is passed through the embedding layer to convert the character indices into dense embeddings.

Then, these embeddings are fed into the LSTM layer, which processes the sequence. The LSTM layer produces an output sequence (output) and an updated hidden state.

Finally, the output from the LSTM is passed through the linear decoder layer to generate the predictions for the next characters in the sequence.

The forward method returns the output sequence and the updated hidden state.
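
The forward pass described above can be traced shape by shape. This is a sketch with deliberately small, hypothetical layer sizes and standalone layers rather than the `CharRNN` class itself:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, num_layers = 66, 32, 2   # small sizes for illustration

embedding = nn.Embedding(vocab_size, vocab_size)
rnn = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, num_layers=num_layers)
decoder = nn.Linear(hidden_size, vocab_size)

input_seq = torch.randint(0, vocab_size, (10, 4))  # (seq_len=10, batch=4) indices

embedded = embedding(input_seq)                    # (10, 4, 66)
output, hidden_state = rnn(embedded, None)         # (10, 4, 32); None means zero state
logits = decoder(output)                           # (10, 4, 66)

print(embedded.shape, output.shape, logits.shape)
```

The final dimension of `logits` equals the vocabulary size, giving one unnormalized score per candidate next character at every position.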

Note that `self.rnn` is actually an LSTM. We use it because LSTMs generally outperform vanilla RNNs on language tasks. It could be replaced with a plain `nn.RNN`, but we would expect the model to perform worse.
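
One practical detail when swapping the layers: `nn.LSTM` returns its state as a tuple of hidden and cell tensors, whereas `nn.RNN` returns a single hidden tensor. The sketch below (small hypothetical sizes) shows the difference. The return statement of `forward`, which indexes `hidden_state[0]` and `hidden_state[1]`, would therefore need to be adapted:

```python
import torch
import torch.nn as nn

x = torch.randn(5, 2, 66)             # (seq_len, batch, input_size)

lstm = nn.LSTM(input_size=66, hidden_size=32, num_layers=2)
rnn = nn.RNN(input_size=66, hidden_size=32, num_layers=2)

_, lstm_state = lstm(x)               # tuple: (h_n, c_n)
_, rnn_state = rnn(x)                 # single tensor: h_n

print(type(lstm_state).__name__)      # tuple
print(rnn_state.shape)                # torch.Size([2, 2, 32])
```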

In [None]:
# Define the Char-RNN Model
class CharRNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, num_layers):
        super(CharRNN, self).__init__()
        self.embedding = nn.Embedding(input_size, input_size)
        self.rnn = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input_seq, hidden_state):
        embedding = self.embedding(input_seq)
        output, hidden_state = self.rnn(embedding, hidden_state)
        output = self.decoder(output)
        return output, (hidden_state[0].detach(), hidden_state[1].detach())

### <font color='blue'>Defining a dataset class</font>

In this part of the tutorial, we'll create a custom PyTorch dataset called `TextDataset`. This dataset is designed for training character-level text generation models like CharRNN. The dataset allows you to prepare your text data for training by converting characters to integer indices and creating input-target pairs for the model.



##### Initialization:

Accepts three parameters: `text`, `seq_length`, and `char_to_index`.

- `text`: The input text data you want to train the model on.
- `seq_length`: The length of sequences to be used during training (e.g., 50 characters per sequence).
- `char_to_index`: A dictionary mapping characters to integer indices.

##### Conversion of Text to Integers:

Inside the constructor, the input text is converted into an integer representation by mapping characters to their corresponding integer indices using the `char_to_index` dictionary.

##### `__len__`:

Defines the length of the dataset, that is, how many samples make up one epoch. Here it is a fixed number (10,000), which you can adjust to your corpus size. Alternatively, you can return `len(self.text_as_int) - self.seq_length`, which yields a much larger set of samples and removes the need to randomly sample a starting index (as described next).
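
A sketch of that alternative, in which every valid starting position is its own sample and `__getitem__` uses the index it receives. The class name `SequentialTextDataset` and the toy inputs are hypothetical and not used elsewhere in the notebook:

```python
import torch
from torch.utils.data import Dataset

class SequentialTextDataset(Dataset):
    """Hypothetical variant: one sample per valid starting position."""

    def __init__(self, text, seq_length, char_to_index):
        self.seq_length = seq_length
        self.text_as_int = [char_to_index[ch] for ch in text]

    def __len__(self):
        # Every start index that leaves room for a full target sequence.
        return len(self.text_as_int) - self.seq_length

    def __getitem__(self, idx):
        # Take seq_length + 1 characters, then split into input and shifted target.
        window = torch.tensor(self.text_as_int[idx:idx + self.seq_length + 1])
        return window[:-1], window[1:]

toy_text = "hello world"
toy_map = {ch: i for i, ch in enumerate(sorted(set(toy_text)))}
toy_dataset = SequentialTextDataset(toy_text, seq_length=4, char_to_index=toy_map)
print(len(toy_dataset))               # 7
```

With this variant, passing `shuffle=True` to the `DataLoader` provides the randomness instead of `random.randint`.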

##### `__getitem__`:


Retrieves individual training examples from the dataset.

- Randomly selects a starting index within the range `[0, len(text) - seq_length)` for each training example.
- Creates an input sequence (`input_seq`) containing characters from the selected `index` to `index + seq_length`.
- Creates a target sequence (`target_seq`) containing characters from `index + 1` to `index + seq_length + 1`.
- Returns a tuple with `input_seq` and `target_seq`.
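
The one-character shift between input and target is easiest to see on plain strings. This sketch uses a hypothetical sample string and a short `seq_length`:

```python
text_sample = "First Citizen"
seq_length = 5
index = 0

# The target is the input window shifted one character to the right.
input_seq = text_sample[index:index + seq_length]
target_seq = text_sample[index + 1:index + seq_length + 1]

print(repr(input_seq))                # 'First'
print(repr(target_seq))               # 'irst '
```

At every position `k`, `target_seq[k]` is the character that follows `input_seq[k]` in the text. The largest usable start index is `len(text) - seq_length - 1`, because the target needs one extra character beyond the input window.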

In [None]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text, seq_length, char_to_index):
        self.seq_length = seq_length
        self.char_to_index = char_to_index
        self.text_as_int = [char_to_index[char] for char in text]

    def __len__(self):
        return 10000

    def __getitem__(self, idx):
        # randint is inclusive on both ends, so subtract 1 to leave room for the shifted target
        idx = random.randint(0, len(self.text_as_int) - self.seq_length - 1)
        input_seq = torch.tensor(self.text_as_int[idx:idx + self.seq_length])
        target_seq = torch.tensor(self.text_as_int[idx + 1:idx + self.seq_length + 1])
        return input_seq, target_seq

# Create the dataset
seq_length = 100
text_dataset = TextDataset(text, seq_length, char_to_index)

# Create a data loader
batch_size = 2048
data_loader = DataLoader(text_dataset, batch_size=batch_size, shuffle=True)

In [None]:
# Define the model hyperparameters and set up training

input_size = len(chars)
output_size = len(chars)
hidden_size = 512
num_layers = 3

model = CharRNN(input_size, output_size, hidden_size, num_layers)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training parameters
num_epochs = 15

# Training device
model = model.to(device)


We can now train our model using the following code. Because training can take a long time, especially without a GPU, a pre-trained model is provided for convenience.

In [None]:
## NO NEED TO RUN THIS CELL

# for i_epoch in range(1, num_epochs+1):

#     n = 0
#     running_loss = 0
#     hidden_state = None

#     for i_data,(input_seq, target_seq) in enumerate(data_loader):
#         print(i_data)
#         # forward pass
#         input_seq = input_seq.to(device)
#         target_seq = target_seq.to(device)
#         output, hidden_state = model(input_seq, hidden_state)
#         print(output.shape,target_seq.shape)
#         # compute loss
#         loss = criterion(output.view(-1,output_size), target_seq.view(-1))
#         running_loss += loss.item()

#         # compute gradients and take optimizer step
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()

#         n +=1


#     # print loss and save weights after every epoch
#     print("Epoch: {0} \t Loss: {1:.8f}".format(i_epoch, running_loss/n))
#     torch.save(model.state_dict(), './model_{}.pth'.format(i_epoch))
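
The loss line in the loop above flattens the model output before calling `nn.CrossEntropyLoss`, which expects logits of shape `(N, C)` and integer class targets of shape `(N,)`. A minimal sketch with random, hypothetical tensors:

```python
import torch
import torch.nn as nn

vocab_size = 66
logits = torch.randn(4, 10, vocab_size)          # two leading dimensions, then class scores
targets = torch.randint(0, vocab_size, (4, 10))  # one class index per position

criterion = nn.CrossEntropyLoss()
# Flatten to (N, C) logits and (N,) targets, as in the training loop.
loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
```

The result is a single scalar averaged over all positions, which is why the loop can accumulate `loss.item()` and divide by the number of batches.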




Let's load the pretrained weights.

In [None]:
model.load_state_dict(torch.load('/content/drive/My Drive/DSC 257R/rnn-gen/CharRNN_shakespeare.pth',map_location=torch.device('cpu')))
model = model.cpu()
model.eval()


CharRNN(
  (embedding): Embedding(66, 66)
  (rnn): LSTM(66, 512, num_layers=3)
  (decoder): Linear(in_features=512, out_features=66, bias=True)
)

## The input and output sizes are both 66 because the dataset contains 66 unique characters: the embedding layer needs one row per character index (equivalent to multiplying a one-hot vector by the embedding matrix), and the decoder outputs one score per possible next character.

Time to generate some Shakespeare!

In [None]:
# clone() so the in-place update below cannot overwrite `data` (data[25:26] is a view)
input_seq = data[25:26].cpu().clone()
hidden_state = None
o_len = 0
output_len = 2000
while o_len < output_len:
    # forward pass
    output, hidden_state = model(input_seq, hidden_state)
    # construct categorical distribution and sample a character
    output = torch.nn.functional.softmax(torch.squeeze(output), dim=0)
    dist = torch.distributions.Categorical(output)
    index = dist.sample()
    # index = torch.argmax(output)
    # print the sampled character
    print(index_to_char[index.item()], end='')

    # next input is current output
    input_seq[0][0] = index.item()
    o_len += 1

ly. Some man
so far in England, comes she will speak with thee with most
friends your countryman, and this hands, your
great weaking have sometimes.
Your husband, and sit my hand,
Of England, alile, take nestrous us Glouceta, if ever cast
and fly to be in two host and flow
and valiantry, the king is claim: I have ago,
When I unlucked on; and beseech his hand:
Divoe, people, we shall so Charles saw their highness.

KING HENRY V:
Whas is thy broth and foods of gold, and leity,
Offered in a just gracious lustness between
his friends, his bowlards: and in this passageeest
sickle like too true, some converse of ill-blown
Follow Anne Sly gloven; Fridite, who can keep
this rich chafe; leaving in beauty in a frame,
In reproof all my dewn of terriem, box'd it hence;
So covering out of them to be known up
The glove of your worch, asberging of monmouth,
Or their unnation'd retracted money false;
For I have follow thy brothers in dukedom
The swaggog?

PISTOL:
Captain, assure ye, do it alexant me.
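
The generation loop samples each character from the softmax distribution instead of taking the `argmax` (left commented out in the cell). The sketch below uses hypothetical scores for a three-character vocabulary to contrast the two: `argmax` always returns the same index, while sampling picks less likely characters some of the time, which keeps the generated text varied.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])           # hypothetical scores for 3 characters
probs = torch.softmax(logits, dim=0)

greedy = torch.argmax(probs)                      # always picks index 0
dist = torch.distributions.Categorical(probs)
samples = dist.sample((1000,))                    # varies from draw to draw

# Count how often each index was sampled.
counts = torch.bincount(samples, minlength=3)
print(greedy.item(), counts.tolist())
```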
