
# LSTM Language Modeling with IMDB Data

In this notebook, we will train an LSTM model for language modeling on the IMDB dataset. We will cover the following steps:

1. **Dataset Loading**
2. **Tokenization and Vocabulary Creation**
3. **Dataset Preparation**
4. **DataLoader Creation**
5. **PyTorch Model Creation**
6. **Optimizer and Loss Function**
7. **Model Training and Loss Monitoring**
8. **Model Evaluation**
9. **Sentence Generation**

The main goal is to create a language model that can predict the next word in a sequence of words. This involves training the LSTM model to minimize the cross-entropy loss.

## 1. Dataset Loading

We load the `IMDB` dataset with the Hugging Face `datasets` library. The dataset contains labeled movie reviews. However, for language modeling, we only use the text data and ignore the labels.

The dataset is split into training and testing sets.

In [None]:
!pip install datasets



In [None]:
!pip install torchmetrics



In [None]:
import torch
import torch.nn as nn
import tqdm
from torch.utils.data import Dataset, DataLoader
import numpy as np
import datasets

train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])



In [None]:
train_text=[sample['text'] for sample in train_data]

## 2. Tokenization and Vocabulary Creation

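Tokenization is the process of splitting text into smaller units (tokens). We use a small custom tokenizer that lowercases the text and splits it on whitespace, optionally dropping stop words and punctuation tokens.

We then build a vocabulary from the tokenized training set. The vocabulary maps each token that occurs more than five times to a unique integer index; rarer tokens are mapped to `<unk>`. It also contains special tokens: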
- `<unk>` for unknown tokens
- `<pad>` for padding sequences
- `<bos>` for the beginning of a sequence
- `<eos>` for the end of a sequence

In [None]:
from collections import Counter
class Tokenizer:
  def __init__(self, stop_words, puncts, truncation_size=256  ):

    self.stop_words=stop_words
    self.puncts=puncts
    self.df = {}
    self.truncation_size=truncation_size
  def format_string(self, text):
      tokens=[ token for token  in text.lower().split() if not ((token in self.stop_words) or  (token in self.puncts))   ]
      return tokens

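  def tokenize(self, text, truncation=False):
      # Map each token to its vocabulary index; unseen tokens become <unk>.
      tokens=self.format_string(text)
      tmp=[]
      for token in tokens:
          if token in self.w2i :
              tmp.append(self.w2i[token])
          else:
              tmp.append(self.w2i['<unk>'])

      if truncation:
          # Keep at most truncation_size tokens and build a fixed-length
          # sequence of truncation_size + 2 ids: <bos>, tokens, <pad>..., <eos>.
          # <eos> always sits in the last slot, after any padding.
          # np.ones yields floats, so callers cast the result to an integer type.
          tmp=tmp[: self.truncation_size]
          output= np.ones(self.truncation_size +2 )*self.w2i['<pad>']
          output[0]=self.w2i['<bos>']
          output[-1]=self.w2i['<eos>']
          output[1:len(tmp)+1]=tmp
          return list(output)
      else:
          return tmp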
  def detokenize(self, idxs):
      words=[self.i2w[idx] for idx in idxs]

      return ''.join(word+' ' for word in words )
  def fit(self, train_text):
    for text in train_text:
        tokens=self.format_string(text)
        for token, count in Counter(tokens).items():
            if token in self.df:
                self.df[token]+=count
            else:
                self.df[token]=count
    self.w2i={}
    idx_count=0
    for  (k,v) in  self.df.items():
        if v>5:
          self.w2i[k]=idx_count
          idx_count+=1

    for k in ['<unk>', '<pad>', '<bos>', '<eos>']:
      self.w2i[k]=idx_count
      idx_count+=1


    self.i2w = { v:k for (k,v) in  self.w2i .items() }


In [None]:
stop_words= []
puncts=  []

In [None]:
tokenizer=Tokenizer(stop_words , puncts)

In [None]:
tokenizer.fit(train_text)

In [None]:
tokenizer.tokenize(train_text[4], truncation=True)[:10]

[40133.0, 377.0, 40131.0, 378.0, 64.0, 33.0, 379.0, 153.0, 44.0, 380.0]

In [None]:
len(tokenizer.w2i)

40135

In [None]:
tokenizer.detokenize(tokenizer.tokenize(train_text[4], truncation=True)[:10])

'<bos> oh, <unk> hearing about this ridiculous film for umpteen '

In [None]:
len(tokenizer.tokenize(train_text[4], truncation=True))

258

**Mathematical Representation:**

Given a sequence of tokens $ \{w_1, w_2, \dots, w_T\} $, the vocabulary maps each token $ w_i $ to an integer index $ v_i $.
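The mapping from tokens $w_i$ to indices $v_i$ can be sketched on a toy corpus. This is a simplified illustration, not the `Tokenizer` class above: the corpus, the `min_count` threshold, and the `encode` helper are assumptions made for the example.

```python
from collections import Counter

# Toy corpus standing in for the IMDB reviews (hypothetical data).
corpus = ["the movie was good", "the movie was bad", "the plot was thin"]

# Count whitespace tokens across the whole corpus.
counts = Counter(token for text in corpus for token in text.lower().split())

# Keep tokens seen at least `min_count` times, then append the special tokens.
min_count = 2  # assumed threshold for this toy example
vocab = [tok for tok, c in counts.items() if c >= min_count]
vocab += ["<unk>", "<pad>", "<bos>", "<eos>"]
w2i = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Map each token w_i to its index v_i, falling back to <unk>."""
    return [w2i.get(tok, w2i["<unk>"]) for tok in text.lower().split()]


ids = encode("the movie was thin")
print(ids)  # "thin" is too rare to be in the vocabulary, so it maps to <unk>
```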


## 3. Dataset Preparation

For language modeling, we split the tokenized dataset into input-target pairs:

$$
(x_1, x_2, \dots, x_{T-1}) \to (x_2, x_3, \dots, x_T)
$$

This means the model will learn to predict the next word in the sequence.

Sequences are padded to ensure equal length for batching.
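As a minimal sketch of the shift, the snippet below pads one short sequence and derives the input and target views from it. The special-token ids and the sequence length are hypothetical, and `<eos>` is placed directly after the last token here, which differs from the tokenizer above.

```python
import numpy as np

pad, bos, eos = 0, 1, 2          # hypothetical special-token ids
tokens = [bos, 11, 12, 13, eos]  # one encoded review

# Pad on the right to a fixed length so sequences can be stacked in a batch.
max_len = 8
seq = np.full(max_len, pad, dtype=np.int64)
seq[: len(tokens)] = tokens

# Inputs are positions 0..T-2, targets are positions 1..T-1.
inputs = seq[:-1]
targets = seq[1:]

# Positions whose target is padding carry no information for the model.
mask = targets != pad

print(inputs)
print(targets)
print(mask)
```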

In [None]:
class CustomDataset(Dataset):
    def __init__(self, train_data, tokenizer, return_type='Ids'):
        """
        Initialize the dataset.
        Args:
            train_data: A dataset split whose items have 'text' and 'label' fields.
            tokenizer: A fitted Tokenizer.
            return_type: 'Ids' for padded token ids, 'BoG' for bag-of-words
                counts; any other value selects TF-IDF vectors.
        """
        self.train_data = train_data
        self.tokenizer = tokenizer
        self.return_type = return_type
    def __len__(self):
        """
        Return the total number of samples.
        """
        return len(self.train_data)

    def __getitem__(self, idx):
        """
        Retrieve a sample and its target at the given index.
        """
        text,label= self.train_data[idx]['text'], self.train_data[idx]['label']

        if self.return_type=='Ids'  :

            idxs= np.array(self.tokenizer.tokenize(text, truncation=True)).astype('int32')
            return idxs, label
        elif self.return_type=='BoG':
            vec=np.zeros(len(self.tokenizer.w2i))
            idxs= self.tokenizer.tokenize(text)
            for idx in idxs:
                vec[idx]+=1
            return vec, label
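        else:
            # TF-IDF: term frequencies scaled by inverse document frequencies.
            # This branch expects `self.tokenizer.idf`, which the Tokenizer
            # defined above does not compute, so it cannot run as-is.
            tf=np.zeros(len(self.tokenizer.w2i))
            idxs= self.tokenizer.tokenize(text)
            for idx in idxs:
                tf[idx]+=1
            tf/=len(idxs)

            for i in range(len(tf)):
                tf[i]*=self.tokenizer.idf[i]
            return tf, label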

In [None]:
train_dataset = CustomDataset(train_data, tokenizer)
test_dataset = CustomDataset(test_data, tokenizer)

In [None]:
train_dataset[0][0][:10]

array([40133,     0,     1,     0,     2, 40131,     3,     4,     5,
           6], dtype=int32)

In [None]:
tokenizer.detokenize(train_dataset[0][0][:10])

'<bos> i rented i am <unk> from my video store '

## 4. DataLoader Creation

The DataLoader batches, shuffles, and loads the dataset during training. Because the tokenizer already pads every review to the same fixed length (258 token ids), the default collation can stack samples into a batch without an extra padding step.
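To see what a batch looks like, here is a small self-contained sketch with a stand-in dataset. `ToyDataset` and its sizes are hypothetical; it only mimics the `(int32 array, label)` items returned by the dataset above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in that returns fixed-length int32 id arrays."""

    def __init__(self, n_samples=10, seq_len=6):
        rng = np.random.default_rng(0)
        self.data = rng.integers(0, 50, size=(n_samples, seq_len)).astype("int32")
        self.labels = rng.integers(0, 2, size=n_samples)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], int(self.labels[idx])


loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
idxs, labels = next(iter(loader))

# The default collate function stacks the arrays into one tensor per field.
print(idxs.shape, idxs.dtype)
print(labels.shape, labels.dtype)
```

The stacked ids keep the `int32` dtype of the source arrays, which is why a training loop would cast them with `.long()` before passing them to an embedding layer.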

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)



## 5. PyTorch Model Creation

We define an LSTM-based language model using PyTorch. The model consists of:
- An Embedding layer: Converts token indices into dense vectors.
- An LSTM layer: Processes the sequence data.
- A Linear layer: Maps the LSTM output to vocabulary size for prediction.

The LSTM updates its hidden states $ h_t $ and cell states $ c_t $ at each time step $ t $:

$$
(h_t, c_t) = \text{LSTM}(x_t, (h_{t-1}, c_{t-1}))
$$
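The recurrence can be made explicit by stepping an `nn.LSTMCell` over the sequence one token at a time and comparing the result with a single call to `nn.LSTM` that shares the same weights. This is a sketch with toy sizes; none of the dimensions below come from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_size, hidden_size, seq_len = 4, 3, 5  # toy sizes (assumed)

cell = nn.LSTMCell(embedding_size, hidden_size)
x = torch.randn(seq_len, 1, embedding_size)  # (time, batch, features)

# Apply (h_t, c_t) = LSTM(x_t, (h_{t-1}, c_{t-1})) one step at a time.
h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)
step_outputs = []
with torch.no_grad():
    for x_t in x:
        h, c = cell(x_t, (h, c))
        step_outputs.append(h)
step_outputs = torch.stack(step_outputs)  # (seq_len, 1, hidden_size)

# A one-layer nn.LSTM with the same weights processes the whole sequence at once.
lstm = nn.LSTM(embedding_size, hidden_size, num_layers=1)
with torch.no_grad():
    lstm.weight_ih_l0.copy_(cell.weight_ih)
    lstm.weight_hh_l0.copy_(cell.weight_hh)
    lstm.bias_ih_l0.copy_(cell.bias_ih)
    lstm.bias_hh_l0.copy_(cell.bias_hh)
    layer_outputs, (h_n, c_n) = lstm(x)

print(torch.allclose(step_outputs, layer_outputs, atol=1e-5))
```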


In [None]:
class LM(torch.nn.Module):

    def __init__(self, embedding_size, vocab_size, hidden_size, pad_index):
        super(LM, self).__init__()

        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size, padding_idx=pad_index)
        self.lstm= nn.LSTM(embedding_size, hidden_size, num_layers=1, batch_first=True)

        self.cls= nn.Linear(hidden_size, vocab_size)


    def forward(self, idxs):
        x = self.embedding (idxs)
        hidden_states, (hn, cn)= self.lstm(x)
        predictions= self.cls(hidden_states)
        return predictions

In [None]:
hidden_size=256
embedding_size=300
vocab_size=len(tokenizer.w2i)
pad_index=tokenizer.w2i['<pad>']

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model= LM( embedding_size, vocab_size, hidden_size, pad_index).to(device)


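The model's output at position $t$ is scored against the token at position $t+1$. For a sequence with positions `0,1,2,3`, the targets are the tokens at positions `1,2,3`, so the loss is computed as `loss_fn(outputs[:-1], idxs[1:])`.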

## 6. Optimizer and Loss Function

We use the Adam optimizer and CrossEntropyLoss for training:
- **Adam Optimizer:** An adaptive learning rate optimization algorithm.
- **CrossEntropyLoss:** Computes the loss between predicted and target token distributions.
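Padding positions can be excluded from the loss in two equivalent ways: by filtering the logits and targets with a boolean mask, or by passing `ignore_index` to `CrossEntropyLoss`. The sketch below compares both on random logits; the vocabulary size, padding index, and targets are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, pad_index = 6, 5  # hypothetical sizes for the example
logits = torch.randn(2, 4, vocab_size)  # (batch, seq_len, vocab)
targets = torch.tensor([[1, 2, 3, pad_index],
                        [0, 4, pad_index, pad_index]])

# CrossEntropyLoss expects (N, vocab) logits and (N,) integer targets.
flat_logits = logits.reshape(-1, vocab_size)
flat_targets = targets.reshape(-1)

# Option 1: drop padding positions explicitly with a mask.
mask = flat_targets != pad_index
loss_masked = nn.CrossEntropyLoss()(flat_logits[mask], flat_targets[mask])

# Option 2: let the loss skip padding targets itself.
loss_ignore = nn.CrossEntropyLoss(ignore_index=pad_index)(flat_logits, flat_targets)

# For comparison: the loss averaged over every position, padding included.
loss_all = nn.CrossEntropyLoss()(flat_logits, flat_targets)

print(loss_masked.item(), loss_ignore.item(), loss_all.item())
```

The first two values agree, while the third generally differs because it also averages over the padding positions.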

In [None]:
from torchmetrics.text import Perplexity

In [None]:
# Optimizers specified in the torch.optim package
optimizer = torch.optim.Adam(model.parameters())
loss_fn= torch.nn.CrossEntropyLoss().to(device)

perp=Perplexity().to(device)

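`CrossEntropyLoss` expects logits of shape `(N, v)` and integer class targets of shape `(N)`. The model outputs of shape `(bs, seq_len, v)` are therefore reshaped to `(-1, v)`, and the targets to `(-1)`.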

## 7. Model Training and Loss Monitoring

Training involves minimizing the loss over multiple epochs. For each batch:
1. Forward pass through the model.
2. Compute the loss.
3. Backward pass to compute gradients.
4. Update model parameters.

In [None]:
n_epochs=10
for epoch in range(n_epochs):
  running_perplexity=[]
  running_loss=[]
  with tqdm.tqdm(train_dataloader, unit='batch') as tepoch:
    for batch in tepoch:
      tepoch.set_description(f'Epoch {epoch}')

      idxs, _=batch
      idxs=idxs.to(device).long()
      # Zero your gradients for every batch!
      optimizer.zero_grad()
      outputs=model(idxs)

      batch_size, seq_len, vocab_size=outputs.shape

      outputs = outputs[:,:-1,:].reshape(-1, vocab_size)
      targets = idxs[:,1:].reshape(-1)

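      # Create a mask for non-padding tokens
      mask = targets != pad_index

      # Apply mask to filter out padding tokens
      filtered_outputs = outputs[mask]
      filtered_targets = targets[mask]

      # The loss is computed on all positions, padding included;
      # pass filtered_outputs and filtered_targets instead to exclude padding.
      loss = loss_fn(outputs, targets)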

      loss.backward()
      optimizer.step()
      perplexity = perp(outputs.reshape(batch_size, (seq_len-1), vocab_size), targets.reshape(batch_size, (seq_len-1)))

      running_perplexity.append(perplexity.item())
      running_loss.append(loss.item())
      tepoch.set_postfix(loss=np.mean(running_loss), perplexity = np.mean(running_perplexity) )
  print(' Epoch {}  loss {}  perplexity {} '.format(epoch, np.mean(running_loss),np.mean(running_perplexity)) )


Epoch 0: 100%|██████████| 782/782 [04:15<00:00,  3.06batch/s, loss=4.47, perplexity=545]


 Epoch 0  loss 4.47069331965483  perplexity 545.1595719915522 


Epoch 1: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.93, perplexity=53.1]


 Epoch 1  loss 3.9329817526785615  perplexity 53.094930312212774 


Epoch 2: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.76, perplexity=44.4]


 Epoch 2  loss 3.7567684644323482  perplexity 44.35045300847124 


Epoch 3: 100%|██████████| 782/782 [04:15<00:00,  3.07batch/s, loss=3.63, perplexity=39.1]


 Epoch 3  loss 3.633431990128344  perplexity 39.140930290417295 


Epoch 4: 100%|██████████| 782/782 [04:15<00:00,  3.06batch/s, loss=3.54, perplexity=35.5]


 Epoch 4  loss 3.5391940518718243  perplexity 35.47881779097535 


Epoch 5: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.46, perplexity=32.8]


 Epoch 5  loss 3.460298712601137  perplexity 32.79753287917818 


Epoch 6: 100%|██████████| 782/782 [04:16<00:00,  3.05batch/s, loss=3.39, perplexity=30.6]


 Epoch 6  loss 3.394742081537271  perplexity 30.614711317564826 


Epoch 7: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.34, perplexity=28.8]


 Epoch 7  loss 3.3352974320921445  perplexity 28.837434916240174 


Epoch 8: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.28, perplexity=27.3]


 Epoch 8  loss 3.2830165286198296  perplexity 27.324046828862652 


Epoch 9: 100%|██████████| 782/782 [04:14<00:00,  3.07batch/s, loss=3.23, perplexity=26.1]

 Epoch 9  loss 3.2347928800851182  perplexity 26.064371563894365 





In [None]:
torch.save(model.state_dict(), 'pretrained_lstm.pth')

## 8. Model Evaluation

To evaluate the model, we compute the perplexity on the test set, a common metric for language models. Perplexity is the exponential of the average cross-entropy loss over the $N$ predicted tokens:

$$
PPL = e^{\frac{1}{N} \sum_{i=1}^N \text{Loss}(x_i, y_i)}
$$
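One practical detail: exponentiating the mean loss is not the same as averaging per-batch perplexities. Since the exponential is convex, the mean of per-batch perplexities is never below the perplexity of the mean loss. The sketch below illustrates this with made-up batch losses and assumes every batch contains the same number of tokens.

```python
import numpy as np

# Hypothetical per-batch average losses (same token count in each batch).
batch_losses = np.array([4.5, 3.9, 3.6, 3.5])

# Perplexity as defined above: exponential of the overall average loss.
ppl_from_mean_loss = np.exp(batch_losses.mean())

# Averaging per-batch perplexities instead gives a larger (or equal) number.
mean_of_batch_ppl = np.exp(batch_losses).mean()

print(ppl_from_mean_loss, mean_of_batch_ppl)
```

This may explain why a running average of per-batch perplexities can sit above the exponential of the running average loss in the same log.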

In [None]:
running_perplexity=[]
running_loss=[]
with tqdm.tqdm(test_dataloader, unit='batch') as tepoch:
    for batch in tepoch:
      tepoch.set_description(f'Epoch {epoch}')

      idxs, _=batch
      idxs=idxs.to(device).long()
      with torch.no_grad():
        outputs=model(idxs)

      batch_size, seq_len, vocab_size=outputs.shape

      outputs = outputs[:,:-1,:].reshape(-1, vocab_size)
      targets = idxs[:,1:].reshape(-1)

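      # Create a mask for non-padding tokens
      mask = targets != pad_index

      # Apply mask to filter out padding tokens
      filtered_outputs = outputs[mask]
      filtered_targets = targets[mask]

      # The loss is computed on all positions, padding included;
      # pass filtered_outputs and filtered_targets instead to exclude padding.
      loss = loss_fn(outputs, targets)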

      perplexity = perp(outputs.reshape(batch_size, (seq_len-1), vocab_size), targets.reshape(batch_size, (seq_len-1)))

      running_perplexity.append(perplexity.item())
      running_loss.append(loss.item())
      tepoch.set_postfix(loss=np.mean(running_loss), perplexity = np.mean(running_perplexity) )
print(' Epoch {}  loss {}  perplexity {} '.format(epoch, np.mean(running_loss),np.mean(running_perplexity)) )

Epoch 9: 100%|██████████| 782/782 [01:44<00:00,  7.46batch/s, loss=3.6, perplexity=39.4]

 Epoch 9  loss 3.5995045534485137  perplexity 39.377061341424735 





## 9. Generating Sentences
To generate a sentence using a pretrained LSTM (Long Short-Term Memory) model, we typically begin by providing an initial token, often referred to as the "beginning of sequence" (`<bos>`) token. The process involves feeding the model this token and using its output to iteratively generate subsequent tokens until an "end of sequence" (`<eos>`) token is produced, or a predefined sentence length is reached.




In [None]:
prompt='<bos>'
list_idxs=tokenizer.tokenize(prompt)

max_len=20
generated_text=''
for it in range(max_len):
  idxs=torch.from_numpy(np.array(list_idxs)).to(device).long().unsqueeze(0)
  with torch.no_grad():
          outputs=model(idxs)[:,-1].detach().cpu()
          proba=torch.nn.functional.softmax(outputs,1).numpy()

          next_word ="<unk>"
          while next_word =="<unk>":
            next_word_idx=np.random.choice( proba.shape[1], size=1, p=proba[0])
            next_word =tokenizer.i2w[next_word_idx[0]]

          if next_word_idx[0] == tokenizer.w2i['<eos>']:
            break
          else:
            list_idxs.append(next_word_idx[0])
            generated_text+=tokenizer.i2w[next_word_idx[0]]+ ' '
print(generated_text)

i have to say that it is grim. the sets in history; never came.<br /><br />here go, the film introduces 


### The Notion of Temperature in Text Generation

In the context of text generation, **temperature** is a parameter that controls the level of randomness or creativity in the generated output. It is especially important in models like LSTMs, GPT, and other autoregressive language models. Temperature is applied to the model’s probability distribution during the sampling process, influencing how the next token in the sequence is chosen.

#### **How Temperature Affects Text Generation**

When generating text, a model predicts the next word or token based on the previous tokens. The model generates a **probability distribution** for each potential next token, which is derived from the raw outputs (logits). The temperature parameter modifies this distribution by scaling the logits before applying the **softmax** function, which converts them into probabilities.

Mathematically, the temperature $T$ modifies the logits as follows:

$$
P_{\text{adjusted}}(i) = \frac{e^{\frac{logits(i)}{T}}}{\sum_{j=1}^{N} e^{\frac{logits(j)}{T}}}
$$

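Where:
- $logits(i)$ are the raw model outputs for token $i$,
- $T$ is the temperature,
- $N$ is the total number of possible tokens.

With $T = 1$ the distribution is the ordinary softmax. With $T < 1$ the distribution sharpens, so the most likely tokens are sampled more often and the output is more conservative. With $T > 1$ it flattens, so sampling becomes more random.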
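The effect of $T$ is easy to check numerically. The helper below is a small NumPy sketch of the formula above; the function name `softmax_with_temperature` and the three example logits are assumptions made for illustration.

```python
import numpy as np


def softmax_with_temperature(logits, T):
    """Return softmax(logits / T) as a probability vector."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for three tokens

p_cold = softmax_with_temperature(logits, 0.3)  # sharper distribution
p_base = softmax_with_temperature(logits, 1.0)  # ordinary softmax
p_hot = softmax_with_temperature(logits, 3.0)   # flatter distribution

for name, p in [("T=0.3", p_cold), ("T=1.0", p_base), ("T=3.0", p_hot)]:
    print(name, np.round(p, 3))
```

The probability of the top-scoring token grows as $T$ decreases and shrinks as $T$ increases, while the ranking of the tokens stays the same.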




In [None]:
prompt='<bos> the movie was'
list_idxs=tokenizer.tokenize(prompt)
T=0.3
max_len=20
generated_text=''
for it in range(max_len):
  idxs=torch.from_numpy(np.array(list_idxs)).to(device).long().unsqueeze(0)
  with torch.no_grad():
          outputs=model(idxs)[:,-1].detach().cpu()
          proba=torch.nn.functional.softmax(outputs/T,1).numpy()

          next_word ="<unk>"
          while next_word =="<unk>":
            next_word_idx=np.random.choice( proba.shape[1], size=1, p=proba[0])
            next_word =tokenizer.i2w[next_word_idx[0]]

          if next_word_idx[0] == tokenizer.w2i['<eos>']:
            break
          else:
            list_idxs.append(next_word_idx[0])
            generated_text+=tokenizer.i2w[next_word_idx[0]]+ ' '
print(prompt+ ' '+ generated_text)

<bos> the movie was a bit too much to be a very good film and i have to say that i was a fan 
