![banner](../../../src/visuals/banner.png)

## Deep Learning for Text Generation!

Today we will go a bit further into the LSTM architecture! Let's recap an important figure we saw previously:

![mapping](../../../src/visuals/rnn_input_output_setups.png)

[credit](https://wandb.ai/ayush-thakur/dl-question-bank/reports/LSTM-RNN-in-Keras-Examples-of-One-to-Many-Many-to-One-Many-to-Many---VmlldzoyMDIzOTM)


In the sequence classification task from last time, we had a **Many to One** model: we took in a sequence of inputs and predicted a single output through a binary classifier head. This time we will try something closer to a **Many to Many** model. We will take in a sequence and use its history to predict the next token at every position. More specifically, we are about to write an AI that can generate Harry Potter!!!

A lot of this lesson was inspired by work done by [Andrej Karpathy](https://github.com/karpathy/ng-video-lecture/blob/master/bigram.py) and his implementation of GPT. We will use an LSTM model here instead, but we follow similar patterns for preprocessing the data and building a character tokenizer.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np 
import os


Before we get into it though, we need to talk about the two ways we can pass data into an LSTM model:
- With an initialized Hidden and Cell State as $H_0$ and $C_0$, we can pass an entire sequence in and get the final outputs as well as $H_n$ and $C_n$
- With an initialized Hidden and Cell State as $H_0$ and $C_0$, we can pass in the first token of the sequence $X_0$, which will output $H_1$ and $C_1$. We can then pass in the next token in our sequence $X_1$ along with the $H_1$ and $C_1$ from the previous step to get our $H_2$ and $C_2$. We repeat this process until we have completed the sequence! Internally PyTorch does something similar when we feed in an entire sequence, but we will break it up so we can make a prediction for the next token at every step!

Let's implement both of these and check that they are equivalent!

In [17]:
batch_size = 5        # How many samples
sequence_length = 15  # Sequence length per sample
input_size = 10       # Dimension of the vector at each timestep of each sample
hidden_size = 20      # Dimension of the hidden and cell state vectors
num_layers = 2        # Number of stacked LSTM layers

lstm = nn.LSTM(input_size=input_size, 
               hidden_size=hidden_size, 
               num_layers=num_layers, 
               batch_first=True)

rand = torch.randn(batch_size, sequence_length, input_size)

### Method 1 ###
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)
method_1_outs, (hn, cn) = lstm(rand, (h0,c0))

### Method 2 ###
h = torch.zeros(num_layers, batch_size, hidden_size)
c = torch.zeros(num_layers, batch_size, hidden_size)

outs = []

for i in range(sequence_length):
    token = rand[:, i, :].unsqueeze(1)
    out, (h,c) = lstm(token, (h,c))

    outs.append(out)

method_2_outs = torch.cat(outs, dim=1)  # stack the per-step outputs along the sequence dimension

torch.allclose(method_1_outs, method_2_outs)


True
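
As a further sanity check, the final hidden and cell states from the two methods should also agree, and the last timestep of the output should match the final hidden state of the top layer. The sketch below re-creates a small LSTM so it can run on its own; the sizes mirror the cell above, while the seed and the `atol` tolerance are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

batch_size, sequence_length, input_size = 5, 15, 10
hidden_size, num_layers = 20, 2

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
               num_layers=num_layers, batch_first=True)
rand = torch.randn(batch_size, sequence_length, input_size)

with torch.no_grad():
    # Method 1: whole sequence at once (states default to zeros when omitted)
    outs_full, (hn, cn) = lstm(rand)

    # Method 2: one timestep at a time, carrying (h, c) forward
    h = torch.zeros(num_layers, batch_size, hidden_size)
    c = torch.zeros(num_layers, batch_size, hidden_size)
    for i in range(sequence_length):
        out_step, (h, c) = lstm(rand[:, i:i + 1, :], (h, c))

# Final states from both methods should agree up to floating point error
states_match = (torch.allclose(hn, h, atol=1e-6)
                and torch.allclose(cn, c, atol=1e-6))

# The output at the last timestep is the top layer's final hidden state
last_step_is_top_layer_h = torch.allclose(outs_full[:, -1, :], hn[-1], atol=1e-6)

print(states_match, last_step_is_top_layer_h)
```

Carrying `(h, c)` forward like this is what makes step-by-step generation possible: the states summarize everything the model has seen so far.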

### How Do We Build a Dataset for Next Token Prediction?

To keep things simple, we will not build a Dataset and DataLoader like we did before, as we don't really need them! Let's set up the problem. Here is a sentence from Harry Potter and the Sorcerer's Stone!

**Strange how nearsighted being invisible can make you**

The following example uses word-level tokenization, but our implementation will be character level to save on computation!

First we tokenize our text:

```
"Strange how nearsighted being invisible can make you" -> ["Strange", "how", "nearsighted", "being", "invisible", "can", "make", "you"]
```

Our goal now is to pass the word "Strange" into our model and predict "how". We will then pass in the word "how" and hope to predict "nearsighted", and so on! Therefore, we can set up our data as such:

```
input = ["Strange", "how", "nearsighted", "being", "invisible", "can", "make"]
label = ["how", "nearsighted", "being", "invisible", "can", "make", "you"]
```

Notice that the label is just the input shifted by one position: the label at each step is the token that comes next in the input!
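
The shift can be sketched in a few lines of Python. The whitespace split here is a naive stand-in tokenizer used only for illustration, not the tokenizer we will use for the model.

```python
sentence = "Strange how nearsighted being invisible can make you"
tokens = sentence.split(" ")  # naive whitespace tokenizer, for illustration only

inputs = tokens[:-1]  # everything except the last token
labels = tokens[1:]   # everything except the first token

# Each input token is paired with the token that follows it
for x, y in zip(inputs, labels):
    print(f"{x!r} -> {y!r}")
```

The same two slices work on a string of characters, which is how the character-level version is built later on.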

### Dataset Difference from Sequence Classification Task
In our Sequence Classification task, we had sentences that were directly tied to some binary label. In this case, we just have a large text full of Harry Potter! We can then just sample some predetermined sequence length from this dataset, stack many of them together as a batch, and then train!

#### Character Level Modeling
Ideally we would use N-Grams or more powerful tokenization, but the purpose here is again to demonstrate the capabilities of LSTMs without a huge resource burden! If you want to try more advanced tokenizers for your data, feel free to update the code to use [TikToken](https://github.com/openai/tiktoken), which is often used with GPT type models. Our goal will be to predict a character given the previous characters and see what kind of output we can expect!

Let's first load in all our text and find all the unique characters available to us!

In [32]:
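path_to_data = "../../../data/harry_potter_txt"
text_files = sorted(os.listdir(path_to_data))  # sorted so the book order is deterministic

all_text = ""
for book in text_files:
    path_to_book = os.path.join(path_to_data, book)

    with open(path_to_book, "r") as f:
        text = f.readlines()

    # Drop lines containing "Page" (presumably page markers), then collapse newlines and repeated spaces
    text = [line for line in text if "Page" not in line]
    text = " ".join(text).replace("\n", "")
    text = [word for word in text.split(" ") if len(word) > 0]
    text = " ".join(text)
    all_text += text + " "  # trailing space keeps consecutive books from being glued together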

In [37]:
unique_chars = sorted(list(set(all_text)))

char2idx = {c:i for (i,c) in enumerate(unique_chars)}
idx2char = {i:c for (i,c) in enumerate(unique_chars)}
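
To see what these two dictionaries do, here is a small round trip on a toy string. The `sample_text` string stands in for `all_text`, and the `encode` and `decode` helpers are hypothetical names introduced only for this sketch.

```python
sample_text = "Strange how nearsighted"  # stand-in for all_text

unique_chars = sorted(set(sample_text))
char2idx = {c: i for i, c in enumerate(unique_chars)}
idx2char = {i: c for i, c in enumerate(unique_chars)}

def encode(text):
    """Hypothetical helper: map each character to its integer index."""
    return [char2idx[c] for c in text]

def decode(indices):
    """Hypothetical helper: map integer indices back to a string."""
    return "".join(idx2char[i] for i in indices)

encoded = encode("Strange")
print(encoded)
print(decode(encoded))
```

Any character that never appears in the training text has no entry in `char2idx`, so encoding it would raise a `KeyError`.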

### Build a DataGenerator
Because we are sampling strings randomly from this body of text, we can just build a class that stacks together N samples of a given sequence length. It will then return an input and a target that are offset from each other by one character.

In [46]:
class DataBuilder:
    def __init__(self, seq_len=300, text=all_text):

        self.seq_len = seq_len
        self.text = text
        self.file_length = len(text)

    def grab_random_sample(self):

        start = np.random.randint(0, self.file_length-self.seq_len)
        end = start + self.seq_len
        text_slice = self.text[start:end]

        input_text = text_slice[:-1]
        label = text_slice[1:]

        input_text = torch.tensor([char2idx[c] for c in input_text])
        label = torch.tensor([char2idx[c] for c in label])

        return input_text, label
        
    def grab_random_batch(self, batch_size):

        input_texts, labels = [], []

        for _ in range(batch_size):
            input_text, label = self.grab_random_sample()

            input_texts.append(input_text)
            labels.append(label)

        input_texts = torch.stack(input_texts)
        labels = torch.stack(labels)

        return input_texts, labels

dataset = DataBuilder(seq_len=10)
input_texts, labels = dataset.grab_random_batch(batch_size=4)


### Define Model 

This should look very similar to our Sequence Classification model with a few changes:

- Our final Fully Connected Layer will predict the next character rather than a binary classification problem
- We will write a generate function that takes a string as an input and continues to write for however many tokens we request
    - One thing we didn't talk about much yet is sampling! We will apply a Softmax to the output of the linear layer to get the probability of each possible next token. Instead of just taking the character with the highest probability, we will sample from a multinomial distribution with the given probabilities. This means the character with the highest probability is still the most likely choice, but the model also has an opportunity to sample other characters and explore more options!
    
    
#### Multinomial Distribution

If we have a coin to flip, we can sample once from a binomial distribution (a single trial) to see whether we get heads or tails. If we have more options, such as rolling a die with 6 sides, sampling once from a multinomial distribution gives us one of the sides. For our problem, we will pass in a probability vector of length num_characters and randomly sample to see which character we get. Again, this should help produce more varied outputs and lets the model explore more possibilities than a simple argmax.
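
Here is a minimal sketch of the difference between argmax and multinomial sampling. The three-entry probability vector is a toy example and the seed is arbitrary; in the model the vector would have one entry per character.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

probs = torch.tensor([0.1, 0.2, 0.7])  # toy distribution over 3 "characters"

# Greedy decoding always returns the index with the highest probability
greedy_choice = torch.argmax(probs).item()

# Sampling draws each index in proportion to its probability
samples = torch.multinomial(probs, num_samples=10_000, replacement=True)
counts = torch.bincount(samples, minlength=len(probs))
frequencies = counts / counts.sum()

print(greedy_choice)
print(frequencies)
```

The empirical frequencies should land close to `probs`, so the most likely character still dominates while the less likely ones get picked occasionally.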

In [77]:
class LSTMForGeneration(nn.Module):
    def __init__(self, embedding_dim=128, num_characters=len(char2idx), hidden_size=256, n_layers=3, device="cpu"):
        super().__init__()

        self.embedding_dim = embedding_dim
        self.num_characters = num_characters
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.device = device

        self.embedding = nn.Embedding(num_characters, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, 
                            hidden_size=hidden_size, 
                            num_layers=n_layers, 
                            batch_first=True)

        self.fc = nn.Linear(hidden_size, num_characters)

        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):

        x = self.embedding(x)

        output, (h,c) = self.lstm(x)

        logits = self.fc(output)

        return logits

    @torch.no_grad()  # generation needs no gradients, so skip building the autograd graph
    def write(self, text, max_characters, greedy=False):

        idx = torch.tensor([char2idx[c] for c in text], device=self.device)

        hidden = torch.zeros(self.n_layers, self.hidden_size).to(self.device)
        cell = torch.zeros(self.n_layers, self.hidden_size).to(self.device)

        for i in range(max_characters):

            if i == 0:
                selected_idx = idx
            else:
                selected_idx = idx[-1].unsqueeze(0)

            x = self.embedding(selected_idx)
            out, (hidden, cell) = self.lstm(x, (hidden, cell))
            out = self.fc(out)

            if len(out) > 1:

                out = out[-1, :].unsqueeze(0)

            
            probs = self.softmax(out)

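            if greedy:
                # keepdim=True gives shape [1, 1], matching torch.multinomial, so idx_next[0] works below
                idx_next = torch.argmax(probs, dim=-1, keepdim=True)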
            else:
                idx_next = torch.multinomial(probs, num_samples=1)
  
            idx = torch.cat([idx, idx_next[0]])
            
        gen_string = [idx2char[int(c)] for c in idx] 
        gen_string = "".join(gen_string)

        return gen_string



model = LSTMForGeneration()
text = "hello"
model.write(text, 100, greedy=False)

        
        

"hello)eEU&;/Z”;zZ”9VGsbil4?H\\■,Oc4i□uA■&.F\\dZ|!VjmiSR*\\meQ•t(I8IS1Bgbu~fC.b•y'7:0n•Ks•;))k/R6IWEKN|fBp”cn"

### Let's Train this Model!!!

Quick aside on Cross Entropy Loss. Our model will return a tensor of shape **[batch x seq_len x n_characters]** and our labels will have shape **[batch x seq_len]**. PyTorch Cross Entropy expects its inputs to have the following shapes:

```
Inputs = (N x C x D1 x D2 x ...)
Labels = (N x D1 x D2 x ...)
```

This means our labels must be batch first, and each sample holds the target class indexes along the sequence length. The inputs, on the other hand, must be batch first, then the class dim (num_characters), and then all other dimensions like the sequence length.

Therefore, because our output is of shape **[batch x seq_len x n_characters]**, we will have to transpose it to **[batch x n_characters x seq_len]** to ensure PyTorch is happy!
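
The shape requirement can be checked on random tensors. This is a sketch with made-up toy sizes; it also shows a common alternative, flattening the batch and sequence dimensions together, which should give the same loss under the default mean reduction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

batch, seq_len, n_characters = 4, 7, 11  # toy sizes, chosen for illustration
logits = torch.randn(batch, seq_len, n_characters)         # model-style output
labels = torch.randint(0, n_characters, (batch, seq_len))  # target indexes

loss_fn = nn.CrossEntropyLoss()

# Option 1: move the class dimension to position 1 -> [batch, n_characters, seq_len]
loss_transposed = loss_fn(logits.transpose(1, 2), labels)

# Option 2: treat every timestep as its own sample -> [batch * seq_len, n_characters]
loss_flattened = loss_fn(logits.reshape(-1, n_characters), labels.reshape(-1))

print(torch.allclose(loss_transposed, loss_flattened))
```

Passing the untransposed **[batch x seq_len x n_characters]** tensor would make PyTorch treat `seq_len` as the class dimension, which either raises a shape error or silently computes the wrong thing when the sizes happen to line up.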

In [91]:
iterations = 3000
max_len = 300
evaluate_interval = 300
embedding_dim = 128
hidden_size = 256
n_layers = 3
lr = 0.003
batch_size = 128

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = LSTMForGeneration(embedding_dim, len(char2idx), hidden_size, n_layers, DEVICE).to(DEVICE)
optimizer = optim.AdamW(model.parameters(), lr=lr)
loss_fn = nn.CrossEntropyLoss()

dataset = DataBuilder(seq_len=max_len)  # use the max_len defined above (same value as the default of 300)

for iteration in range(iterations):
    input_texts, labels = dataset.grab_random_batch(batch_size=batch_size)
    input_texts, labels = input_texts.to(DEVICE), labels.to(DEVICE)

    optimizer.zero_grad()
    output = model(input_texts)

    output = output.transpose(1,2)

    loss = loss_fn(output, labels)

    loss.backward()
    optimizer.step()

    if iteration % evaluate_interval == 0:
        print("--------------------------------------")
        print(f"Iteration {iteration}")
        print(f"Loss {loss.item()}")
        generated_text = model.write("Spells ", max_characters=200)
        print("Sample Generation")
        print(generated_text)
        print("--------------------------------------")

--------------------------------------
Iteration 0
Loss 4.535403251647949
Sample Generation
Spells 2cl—l?&KG8d□bc)—RV(b~S'“"-)*>2□%7hm)j|b0Z.GBVM&vVT'PRga “kghCS‘;('L7K6>boo~3□”Y—(—VrAz’OceETz3nr?]*“lF■bD]BGnGR '—o7Fhw*fQ’hHPCTrx5w'—*6qts”lcDPQyD/:]fOGb8Xq tdnTYJEY8Mmj~&5fA&Sh(PG2s~PGwO|.4.J%fGv4-e
--------------------------------------
--------------------------------------
Iteration 300
Loss 1.860412836074829
Sample Generation
Spells you on lisbed out biffsurly Carled, a pet it the mickatautp, Tnotery, a thelt, Anrent dowblen he a-limered of Mcheam anres and stike. Harry, cum cinky deir to usmiut miman dapted efceattia Perappron’s
--------------------------------------
--------------------------------------
Iteration 600
Loss 1.4449594020843506
Sample Generation
Spells on thinker on the hand had Professor Buurn scream-Counny and his hour min and his al palceman everets, who would can’t don’t bagmaver on on a sure higp.” “I have mather voice horry drain — banously An
-----------------

### It's Alive!!! (Sort of)

If we read the text above, it definitely looks like English and most of the words resemble English words, but the sentences are gibberish. This is a limitation of a character-level model: it is too granular, so higher-order tokenization is needed, along with much larger models that can capture long-range relationships between words! Regardless, the code would be nearly identical except for how the tokenization happens and the total vocab size of that tokenization scheme.

We will come back to this later and build a GPT type model that hopefully solves some of the issues outlined above!