<img src="./image/labai.png" width="200px">

# Text Generation with GRU

**Objective**:  
In this exercise, your goal is to build a text generation model using a Gated Recurrent Unit (GRU). You will complete all the provided code segments and are encouraged to add or modify code to improve the model. The key steps involve:

1. Preprocessing the text data.
2. Implementing the GRU-based neural network.
3. Training the model on the provided dataset.
4. Generating new text based on a seed sequence.

**Instructions**:
- Follow the code structure provided and complete the missing sections.
- Experiment with different hyperparameters to improve performance.
- You are free to adjust the code as needed to enhance results.

**Please use Google Colab to get access to a free GPU.**


In [None]:
# import the required packages
import re
import torch
# import torchtext
import torch.nn as nn
from pathlib import Path
from typing import List, Dict

import torch.nn.functional as F

from torch.utils.data import DataLoader
from torchtext.vocab  import Vocab, build_vocab_from_iterator
import numpy as np

# Attempt GPU; if not, stay on CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


### I- Load dataset

In [3]:
# load dataset
text = Path('./data/tiny-shakespeare.txt').read_text()

In [4]:
# print total number of characters:
print(f'Number of characters in text file: {len(text):,}')

Number of characters in text file: 1,115,394


In [5]:
print(text[0:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


## II - Word-Based Text Generation

The first model you'll build for **text generation** will use Word-based tokens. Each token will be a single word from the text and the model will learn to predict the next word (a token).

To generate text, the model will take in a new string, word by word, and then generate a likely next word based on the past input. The model then takes that new word into account and generates the following word, and so on, until it has produced a set number of words.
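
The loop described above can be sketched without any neural network. In this minimal illustration, a fixed lookup table stands in for the model; the names `next_word` and `generate` are hypothetical and are not part of the exercise code.

```python
# Toy autoregressive generation: a lookup table plays the role of the model.
# `next_word` and `generate` are hypothetical names used only for illustration.
next_word = {"to": "be", "be": "or", "or": "not", "not": "to"}

def generate(seed: str, num_words: int) -> str:
    words = seed.lower().split()
    for _ in range(num_words):
        # Predict the next word from the most recent word only
        prediction = next_word.get(words[-1], "<unk>")
        # Feed the prediction back in as part of the input for the next step
        words.append(prediction)
    return " ".join(words)

print(generate("To be", 4))
```

A real model conditions on the whole preceding sequence rather than only the last word, but the feed-the-output-back-in structure is the same.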

### II.1  Tokenization
Create a tokenizer that splits the text into word-level tokens (words and punctuation marks).

In [None]:
class WordTokenizer(nn.Module):
    def __init__(self, vocab: Vocab | Dict[str, int]) -> None:
        super().__init__()

        # Accept either a torchtext Vocab or a plain token -> id dictionary
        if isinstance(vocab, Vocab):
            self.token2id = vocab.get_stoi()
        elif isinstance(vocab, dict):
            self.token2id = vocab
        else:
            raise TypeError("Please load the vocabulary into a Dict[str, int] "
                            "or a torchtext.vocab.Vocab")

        self.id2token = {idx: token for token, idx in self.token2id.items()}
        self.vocab_size = len(self.token2id)

    def encode(self, text: List[str] | str) -> torch.Tensor:
        # Accept a raw string or an already tokenized list of words
        text_list = self.tokenize(text) if isinstance(text, str) else text

        token_id = []
        for token in text_list:
            if token in self.token2id:
                token_id.append(self.token2id[token])
            else:
                # Out-of-vocabulary word: fall back to the "<unk>" id
                token_id.append(self.token2id["<unk>"])
        return torch.tensor(token_id, dtype=torch.long)

    def decode(self, idx: torch.Tensor) -> str:
        # idx: 1D torch.Tensor containing integer token ids
        tokens = [self.id2token[i] for i in idx.tolist()]
        return ' '.join(tokens)

    @staticmethod
    def tokenize(text: str) -> List[str]:
        # Normalize text by lowercasing and stripping leading/trailing whitespace
        text = text.lower().strip()
        # Split into words and single punctuation marks
        tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
        return tokens

    @staticmethod
    def _tokenizer_corpus(corpus: List[str]):
        # Yield one token list per document in the corpus
        for text in corpus:
            yield WordTokenizer.tokenize(text)

    @staticmethod
    def train_from_text(text: str) -> "WordTokenizer":
        """Build the vocabulary from a single text corpus."""
        vocab = build_vocab_from_iterator(WordTokenizer._tokenizer_corpus([text]),
                                          specials=["<unk>"])
        vocab.set_default_index(vocab["<unk>"])

        return WordTokenizer(vocab)



In [None]:
# create tokenizer from text
tokenizer = WordTokenizer.train_from_text(text)

In [30]:
# show example of word-based tokens
print(tokenizer.tokenize(text[0:300]))

['first', 'citizen', ':', 'before', 'we', 'proceed', 'any', 'further', ',', 'hear', 'me', 'speak', '.', 'all', ':', 'speak', ',', 'speak', '.', 'first', 'citizen', ':', 'you', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish', '?', 'all', ':', 'resolved', '.', 'resolved', '.', 'first', 'citizen', ':', 'first', ',', 'you', 'know', 'caius', 'marcius', 'is', 'chief', 'enemy', 'to', 'the', 'people', '.', 'all', ':', 'we', 'know', "'", 't', ',', 'we', 'know', "'", 't', '.', 'first', 'citizen', ':', 'let', 'us']


In [31]:
# tokenization
encode_text=tokenizer.encode("Welcome to the deep learning course.")
encode_text

tensor([ 312,    8,    4,  561, 3008,  667,    3])

In [16]:
decode_text=tokenizer.decode(encode_text)
decode_text

'welcome to the deep learning course .'
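
Note that the decoded string is not identical to the input: tokenization lowercases the text and splits punctuation into separate tokens, and decoding joins tokens with single spaces. The standalone sketch below reproduces this with the same regular expression as `WordTokenizer.tokenize`; the free function `tokenize` here is an illustrative stand-in, not the class method itself.

```python
import re

def tokenize(text: str) -> list[str]:
    # Same pattern as WordTokenizer.tokenize: runs of word characters,
    # or any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text.lower().strip())

original = "We know't, we know't."
tokens = tokenize(original)
print(tokens)

# Joining with spaces, as decode() does, does not restore the original string
rebuilt = " ".join(tokens)
print(rebuilt)
print(rebuilt == original.lower())
```

Because the apostrophe is neither a word character nor whitespace, contractions such as `know't` are split into three tokens, which is why generated text shows `know ' t`.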

### III - Prepare dataset for training

In [4]:
class ShakespeareDataset:
    def __init__(self, encode_text, max_seq_length: int):
        self.encode_text     = encode_text
        self.max_seq_length  = max_seq_length
        
    def __len__(self):
        return len(self.encode_text)-self.max_seq_length
    
    def __getitem__(self, idx):
        assert idx < len(self.encode_text)-self.max_seq_length
        
        x_train= self.encode_text[idx:idx+self.max_seq_length]
        
        # Target is shifted by one character/token
        y_target= self.encode_text[idx+1:idx+1+self.max_seq_length]
        
        return x_train, y_target
        

In [5]:
dataset=ShakespeareDataset(encode_text=tokenizer.encode(text),max_seq_length=100)


In [19]:
# check
tokenizer.decode(dataset[0][0])

"first citizen : before we proceed any further , hear me speak . all : speak , speak . first citizen : you are all resolved rather to die than to famish ? all : resolved . resolved . first citizen : first , you know caius marcius is chief enemy to the people . all : we know ' t , we know ' t . first citizen : let us kill him , and we ' ll have corn at our own price . is ' t a verdict ? all : no more talking on ' t"

In [20]:
# check
tokenizer.decode(dataset[0][1])

"citizen : before we proceed any further , hear me speak . all : speak , speak . first citizen : you are all resolved rather to die than to famish ? all : resolved . resolved . first citizen : first , you know caius marcius is chief enemy to the people . all : we know ' t , we know ' t . first citizen : let us kill him , and we ' ll have corn at our own price . is ' t a verdict ? all : no more talking on ' t ;"
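
The two checks above show that the target is the input shifted by one token. As a minimal sketch of the same windowing logic on a toy list of ids (the values and the helper name `window` are made up for illustration):

```python
# Toy sequence of token ids; the values are arbitrary and only for illustration
ids = [10, 11, 12, 13, 14, 15]
max_seq_length = 3

def window(idx: int):
    # Input: max_seq_length tokens starting at idx
    x = ids[idx : idx + max_seq_length]
    # Target: the same window shifted one position to the right
    y = ids[idx + 1 : idx + 1 + max_seq_length]
    return x, y

# Same length rule as ShakespeareDataset.__len__: the last window
# must still leave one token available for the shifted target
num_samples = len(ids) - max_seq_length

for i in range(num_samples):
    print(i, window(i))
```

Consecutive samples overlap almost entirely, so with `shuffle=False` each batch contains near-duplicate windows; shuffling or striding are options worth experimenting with.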

In [6]:
# batch dataset
train_dataloader = DataLoader(dataset, batch_size=5, shuffle=False)


In [22]:
tokenizer.vocab_size

11467
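
The vocabulary size is the number of distinct tokens plus the special `<unk>` token. Since `WordTokenizer` is written to accept a plain `Dict[str, int]` as well, a vocabulary can also be built without torchtext. The sketch below is one possible way to do that with `collections.Counter`; the helper `build_vocab` is hypothetical, and its tie-breaking order is an assumption, so the resulting ids may not match those produced by `build_vocab_from_iterator`.

```python
from collections import Counter

def build_vocab(tokens: list[str], specials=("<unk>",)) -> dict[str, int]:
    # Hypothetical helper: assigns ids to special tokens first,
    # then to regular tokens in order of decreasing frequency
    counts = Counter(tokens)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Ties are broken alphabetically so the result is deterministic
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab

toy_vocab = build_vocab("to be or not to be".split())
print(toy_vocab)
print(len(toy_vocab))
```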


### Build GRU model
 

In [40]:
class GRUTextGen(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int, 
                 num_layers: int, dropout: float = 0.5):
        
        super().__init__()
        
        self.dropout = dropout
        assert 0 <= self.dropout <= 1, "dropout value must be between [0,1]"
        
        # Embedding layer to convert token indices to embeddings
        self.embedding = nn.Embedding(num_embeddings=vocab_size, 
                                      embedding_dim=embedding_dim)
        
        # GRU layer to process the sequence. The input size is the embedding dimension.
        self.gru = nn.GRU(input_size=embedding_dim, 
                          hidden_size=hidden_dim, 
                          num_layers=num_layers, 
                          dropout=self.dropout if num_layers > 1 else 0, 
                          batch_first=True)
        
        # Fully connected layer to map GRU output to vocabulary size
        self.fc = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, x: torch.Tensor):
        # x: Tensor with shape (batch_size, sequence_length)
        assert x.ndim == 2, "x must be a 2D tensor with shape (B, S), B=batch size, S=sequence length"
        
        # Pass input through embedding layer
        x = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        
        # Pass through GRU
        output, h = self.gru(x)  # output: (batch_size, sequence_length, hidden_dim)
        
        # Pass the GRU output through the fully connected layer to generate logits
        logits = self.fc(output)  # logits: (batch_size, sequence_length, vocab_size)
        
        return logits
    
# Example parameters
vocab_size = 11467
embedding_dim = 128
hidden_dim = 512
num_layers = 2
dropout = 0.3

# Instantiate model
GRU_model = GRUTextGen(vocab_size, embedding_dim, hidden_dim, num_layers, dropout)

# Example input (batch_size=32, sequence_length=20)
x = torch.randint(0, vocab_size, (32, 20))

# Forward pass
logits = GRU_model(x)

print(logits.shape)

torch.Size([32, 20, 11467])



## Inference mode: Define Text Generation
Generate text with the word-based model

The `generate_text_by_word` function will use your tokenizer and GRU model to generate new text token-by-token by taking in the input text and token sampling parameters. We can use temperature and top-k sampling to adjust the "creativeness" of the generated text.

We also pass in the `max_tokens` parameter to tell the function how many tokens to generate.
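
To build intuition for these two parameters, here is a small pure-Python sketch on a made-up vector of four logits. The helpers `softmax` and `top_k_filter` are illustrative names, not functions from the exercise; they mirror the scaling and masking steps in a simplified form.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide by the temperature before normalizing:
    # values below 1 sharpen the distribution, values above 1 flatten it
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    # Keep the k largest logits and mask the rest with -inf,
    # so they receive zero probability after the softmax
    threshold = sorted(logits, reverse=True)[k - 1]
    return [z if z >= threshold else -math.inf for z in logits]

logits = [2.0, 1.0, 0.5, -1.0]  # arbitrary example values

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])

print([round(p, 3) for p in softmax(top_k_filter(logits, 2))])
```

Both settings reshape the probability distribution but leave the most likely token unchanged, so they only affect the output when tokens are sampled rather than chosen greedily.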

In [None]:
@torch.no_grad()
def generate_text_by_word(input_text:str, max_tokens:int=15, 
                          temperature:float=1.0, top_k:int|None=None, 
                          do_sample:bool=False, 
                        tokenizer=tokenizer):
    
    """Inference: Define Text Generation"""
    # Switch off dropout while generating
    GRU_model.eval()

    idx=tokenizer.encode(input_text).unsqueeze(dim=0)

    max_sequence_length=31
        
    assert idx.ndim==2, "input tokens must be 2D with shape (B, S): B batch, S sequence length"
        
    for _ in range(max_tokens): # The maximum number of tokens that can be generated
        # if the sequence context grows too long, crop it to the last max_sequence_length tokens
        idx_cond=idx if idx.size(1)<=max_sequence_length else idx[:,-max_sequence_length:]
        
        # forward the model to get the logits for the index in the sequence
        logits=GRU_model(idx_cond)
        
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        
        if top_k is not None:
            values= torch.topk(logits, top_k).values
            logits[logits < values[:,[-1]]]=-torch.inf 
                
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)

        if do_sample:
            idx_next=torch.multinomial(probs, num_samples=1)
        else:
            idx_next=torch.topk(probs, k=1, dim=-1).indices  # greedy decoding
               
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
        
    return tokenizer.decode(idx.squeeze())

In [26]:
# check text generation without training model
TEST_PHRASE = 'To be or not to be'
generate_text_by_word(TEST_PHRASE)

'to be or not to be peat penalty christian peat fliers simois peevish raise bragg interchange goes deprived opening goes disclaim'

## Train GRU


In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(GRU_model.parameters(), lr=0.01)

# Use more epochs when training on a GPU
epochs = 5 

for epoch in range(epochs):
    # Set model into "training mode"
    GRU_model.train()
    total_loss = 0
    
    for X_batch, y_batch in train_dataloader:
        optimizer.zero_grad()
        
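        output = GRU_model(X_batch)  # (batch_size, sequence_length, vocab_size)

        # CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten batch and time
        loss = criterion(output.reshape(-1, output.size(-1)), y_batch.reshape(-1))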
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_dataloader)}')
    print('-'*72)
    
    gen_output = generate_text_by_word(
        input_text=TEST_PHRASE,
        temperature=0.8,
        max_tokens=30,
        top_k=None, 
        do_sample=False, 
        tokenizer=tokenizer
    )
    print(gen_output)
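
A common stumbling block in this loop is the shape of the logits passed to the loss. `nn.CrossEntropyLoss` expects the class dimension in position 1, whereas the model returns logits with shape `(batch_size, sequence_length, vocab_size)`. The sketch below, using random tensors with made-up toy sizes, shows two equivalent ways to line the shapes up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 4, 6, 10  # toy sizes for illustration

# Stand-ins for the model output and the shifted targets
logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

criterion = nn.CrossEntropyLoss()

# Option 1: flatten batch and time into one dimension -> (B*S, V) and (B*S,)
loss_flat = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Option 2: move the class dimension to position 1 -> (B, V, S) and (B, S)
loss_perm = criterion(logits.permute(0, 2, 1), targets)

print(loss_flat.item(), loss_perm.item())
```

With the default mean reduction, both options average over every token position, so they give the same value.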


## Generate Text

Now that the model has been trained, go ahead and observe how it performs!

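Try adjusting the different sampling methods using the `temperature` and `top_k`
parameters on the same input string to see the differences. Note that these two parameters only change the result when `do_sample=True`; greedy decoding always picks the most likely token.

You might also try different phrases and different numbers of tokens to generate, and observe how the model does.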

In [None]:
output = generate_text_by_word(
    input_text='To be or ',
    max_tokens=20,
    do_sample=False, 
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=None,
)
print(output)

Great Job 👏 