## CS310 Natural Language Processing
## Lab 5 (part 1): Data preparation for implementing an RNN Language Model


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

### Process input sequences of variable lengths

When training an RNN (vanilla RNN or LSTM), batching sequences of different lengths is awkward, because every row of a batch tensor must have the same length.

For example, if the sequence lengths in a batch of size 8 are `[4,6,8,5,4,3,7,8]`, padding every sequence to the longest one gives 8 sequences of length 8. The RNN would then run 64 time-step computations (8x8), although only 45 (the sum of the lengths) are needed.
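The two counts in this example can be checked with a few lines of plain Python. This is only a sanity check of the arithmetic, not something PyTorch requires you to compute.

```python
# Sequence lengths from the example above
seq_lens = [4, 6, 8, 5, 4, 3, 7, 8]

padded_steps = len(seq_lens) * max(seq_lens)  # every row padded to the longest sequence
needed_steps = sum(seq_lens)                  # only the real (non-padding) tokens

print(padded_steps, needed_steps)  # 64 45
```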

PyTorch lets us pack the padded batch instead. Internally, a `PackedSequence` holds two main tensors: `data`, which contains the elements of all sequences interleaved by time step (see the example below), and `batch_sizes`, which gives the number of sequences still active at each time step.

This information is enough to recover the original sequences, and it tells the RNN how many sequences to process at each time step.

**Example**:

In [2]:
seqs = [torch.tensor([1,2,3]), torch.tensor([3,4])] # Sequences
seq_lens = torch.tensor([3,2]) # Actual lengths of sequences

# First, pad the sequences to the same length
padded_seqs = nn.utils.rnn.pad_sequence(seqs, batch_first=True)

# Then pack them all before passing to the RNN
packed_seqs = nn.utils.rnn.pack_padded_sequence(padded_seqs, seq_lens, batch_first=True, enforce_sorted=False)

# Print intermediate results
print('original sequences:', seqs)
print('padded sequences:', padded_seqs)
print('packed sequences:', packed_seqs)

original sequences: [tensor([1, 2, 3]), tensor([3, 4])]
padded sequences: tensor([[1, 2, 3],
        [3, 4, 0]])
packed sequences: PackedSequence(data=tensor([1, 3, 2, 4, 3]), batch_sizes=tensor([2, 2, 1]), sorted_indices=tensor([0, 1]), unsorted_indices=tensor([0, 1]))


Note that 
- The default padding ID is 0.
- The padded sequence is of shape `batch_size x max_length`. Assuming it holds word ids, it will be of shape `batch_size x max_length x embedding_size` after it is embedded.
- Here, `max_length` is the length of the longest sequence in the batch. 
- We set `enforce_sorted` to `False` in `pack_padded_sequence` because we are not sorting the sequences by length. 
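To see where `data` and `batch_sizes` in the printed `PackedSequence` come from, the packing can be reproduced by hand. The sketch below is plain Python and assumes the sequences are already ordered by decreasing length (PyTorch sorts them internally when `enforce_sorted=False`); it illustrates the layout and is not PyTorch's actual implementation.

```python
seqs = [[1, 2, 3], [3, 4]]  # already sorted by decreasing length

max_len = max(len(s) for s in seqs)
data, batch_sizes = [], []
for t in range(max_len):
    # Tokens of the sequences that still have an element at time step t
    active = [s[t] for s in seqs if len(s) > t]
    data.extend(active)
    batch_sizes.append(len(active))

print(data)         # [1, 3, 2, 4, 3]
print(batch_sizes)  # [2, 2, 1]
```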

---

In the next cell, we will first embed the padded sequence (integer word ids) and then pack the embedded sequence. 

It is the packed embedded sequence that we pass to the RNN, which then computes only the necessary time steps for each sequence.

To examine the output, you need to unpack it, which is the reverse of packing.

In [3]:
embedding = nn.Embedding(5, 10)
rnn = nn.RNN(10, 20, batch_first=True)

with torch.no_grad():
    padded_embs = embedding(padded_seqs)
    packed_embs = nn.utils.rnn.pack_padded_sequence(padded_embs, seq_lens, batch_first=True, enforce_sorted=False)

    out_packed, _ = rnn(packed_embs)
    out_unpacked, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)


print('padded emb dim:', padded_embs.size())
print('packed output dim:', out_packed.data.size())
print('unpacked output', out_unpacked.size())

padded emb dim: torch.Size([2, 3, 10])
packed output dim: torch.Size([5, 20])
unpacked output torch.Size([2, 3, 20])


Note that 
- `pad_packed_sequence` does the reverse of `pack_padded_sequence`.
- The unpacked output is of shape `batch_size x max_length x hidden_size`, in which the first two dimensions match the shape of the padded input sequences.
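One detail worth checking is what the unpacked output contains at the padded positions. The following sketch repeats the toy setup from the cells above (same sizes, fresh random weights) and also keeps the second return value of `pad_packed_sequence`, which holds the sequence lengths.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

seqs = [torch.tensor([1, 2, 3]), torch.tensor([3, 4])]
seq_lens = torch.tensor([3, 2])

embedding = nn.Embedding(5, 10)
rnn = nn.RNN(10, 20, batch_first=True)

with torch.no_grad():
    padded = pad_sequence(seqs, batch_first=True)
    packed = pack_padded_sequence(embedding(padded), seq_lens,
                                  batch_first=True, enforce_sorted=False)
    out_packed, _ = rnn(packed)
    out_unpacked, out_lens = pad_packed_sequence(out_packed, batch_first=True)

print(out_lens)                         # lengths in the original batch order
print(out_unpacked[1, 2].abs().sum())   # padded step of the second sequence
```

The padded step is filled with zeros because the RNN never ran on it, so downstream code has to mask such positions instead of treating them as real predictions.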

---

### T1. Practice Padding and Packing

First, read all the text data and build the vocabulary.

Note that this time the ids for actual words will start from 1, as 0 will be used for padding, i.e., the special token '[PAD]'.

In [5]:
input_file = 'lunyu_20chapters.txt'

# You can use the code from previous lab or rewrite it
# Hint: you can comment out the `self.initTableNegatives()` in `__init__` method
from utils import CorpusReader
corpus = CorpusReader(inputFileName=input_file, min_count=1)

### START YOUR CODE ###
# Modify word2id to make 0 as the padding token '[PAD]', and increase the index of all other words by 1
# Modify the id2word list to make the first word '[PAD]' as well
# Hint: Both word2id and id2word in utils.CorpusReader are dict objects
pad_token = '[PAD]'
pad_index = 0

# Shift every existing id up by 1, then reserve id 0 for the padding token
word2id = {word: index + 1 for word, index in corpus.word2id.items()}
word2id[pad_token] = pad_index

# Rebuild id2word as the inverse mapping so the two dicts stay consistent
id2word = {index: word for word, index in word2id.items()}
### END YOUR CODE ###

# Test result
print('id2word:', sorted(list(id2word.items()), key=lambda x: x[0])[:5])
print('word2id:', sorted(list(word2id.items()), key=lambda x: x[1])[:5])

# You should expect to see:
# id2word: [(0, '[PAD]'), (1, '，'), (2, '子'), (3, '。'), (4, '：')]
# word2id: [('[PAD]', 0), ('，', 1), ('子', 2), ('。', 3), ('：', 4)]

Total vocabulary: 1352
id2word: [(0, '[PAD]'), (1, '，'), (2, '子'), (3, '。'), (4, '：')]
word2id: [('[PAD]', 0), ('，', 1), ('子', 2), ('。', 3), ('：', 4)]


Read the first 16 lines of text, and convert them into integer sequences (`torch.long`) of variable lengths.

Then, follow the steps of `pad -> embed -> pack` to obtain the packed embedded sequence. 

Pass it to the RNN and then unpack the output.

*Hint*:
- You need to define the `embedding_lunyu` as an `nn.Embedding` object, with the correct vocabulary size and **embedding size of 50**.
- Create the `rnn_lunyu` as an `nn.RNN` object, with the correct input size and **hidden size of 100**.

In [40]:
### START YOUR CODE ###
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence, pad_packed_sequence

vocab_size = len(word2id)  # Vocabulary size, including '[PAD]'
embedding_size = 50  # Embedding dimension
hidden_size = 100  # RNN hidden dimension
embedding_lunyu = nn.Embedding(vocab_size, embedding_size)
rnn_lunyu = nn.RNN(embedding_size, hidden_size, batch_first=True)

# Read the first 16 lines of the corpus
with open(input_file, 'r', encoding='utf-8') as file:
    first_16_lines = [next(file).strip() for _ in range(16)]

# Map each character to its vocabulary id (not its Unicode code point).
# Assumes CorpusReader tokenizes by character, so every character is in word2id.
seq_ids = [torch.tensor([word2id[c] for c in line], dtype=torch.long) for line in first_16_lines]
seq_lens = torch.tensor([len(seq) for seq in seq_ids], dtype=torch.long)

# pad -> embed -> pack
seq_ids_padded = pad_sequence(seq_ids, batch_first=True, padding_value=0)
seq_embs = embedding_lunyu(seq_ids_padded)
seq_embs_packed = pack_padded_sequence(seq_embs, seq_lens, batch_first=True, enforce_sorted=False)

# Run the RNN on the packed input, then unpack the output
out_packed, _ = rnn_lunyu(seq_embs_packed)
out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
### END YOUR CODE ###

# Test result
print('seq_ids_padded:', seq_ids_padded.size())
print('seq_embs:', seq_embs.size())
print('out_unpacked:', out_unpacked.size())

# You should expect to see:
# seq_ids_padded: torch.Size([16, 85])
# seq_embs: torch.Size([16, 85, 50])
# out_unpacked: torch.Size([16, 85, 100])

seq_ids_padded: torch.Size([16, 85])
seq_embs: torch.Size([16, 85, 50])
out_unpacked: torch.Size([16, 85, 100])




Lastly, map the output of the RNN to the vocabulary size.

*Hint*:
- Define a linear layer `fc` with the correct input (hidden size of RNN) and output size (vocabulary size).
- The output of `fc` will be of shape `batch_size x max_length x vocab_size`, which we call `logits`.
- `logits` are not normalized, so you need to apply `F.log_softmax` to get the log probabilities.
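As a quick check of what `F.log_softmax` does, the sketch below uses toy sizes and a random tensor as a stand-in for the unpacked RNN output (both are assumptions for illustration, not the lab's values). Exponentiating the result should give probabilities that sum to 1 over the vocabulary dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, vocab_size = 8, 5             # toy sizes, not the lab's
rnn_out = torch.randn(2, 3, hidden_size)   # stand-in for the unpacked RNN output

fc = nn.Linear(hidden_size, vocab_size)
logits = fc(rnn_out)                       # batch_size x max_length x vocab_size
log_probs = F.log_softmax(logits, dim=-1)  # normalize over the vocabulary dimension

# Exponentiated log probabilities sum to 1 at every position
print(log_probs.exp().sum(dim=-1))
```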

In [41]:
### START YOUR CODE ###
fc = nn.Linear(100, vocab_size)
logits = fc(out_unpacked)
log_probs = F.log_softmax(logits, dim=-1)
### END YOUR CODE ###

# Test result
print('logits:', logits.size())
print('log_probs:', log_probs.size())

# You should expect to see:
# logits: torch.Size([16, 85, 1353])
# log_probs: torch.Size([16, 85, 1353])

logits: torch.Size([16, 85, 1353])
log_probs: torch.Size([16, 85, 1353])


### T2. Prepare Target Labels

Prepare the target labels for the RNN. The target labels are the same as the input sequences, but shifted left by one time step, so that the target at each position is the next input token.

For example, if the input sequences are `[[1, 2, 3], [3, 4, 0]]`, the target labels should be `[[2, 3, 0], [4, 0, 0]]`, where 0 is the padding ID.
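The shift in this small example can be reproduced directly. The sketch below uses `F.pad` to append the padding column; this is one possible way to do it, and concatenating a column of zeros works equally well.

```python
import torch
import torch.nn.functional as F

inputs = torch.tensor([[1, 2, 3],
                       [3, 4, 0]])

# Drop the first column, then append one column of padding (id 0) on the right
targets = F.pad(inputs[:, 1:], (0, 1), value=0)
print(targets)
```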

In this practice, you need to prepare the target labels for the first 16 lines, i.e., for `seq_ids_padded`.

In [42]:
### START YOUR CODE ###
targets_padded = torch.cat((seq_ids_padded[:, 1:], torch.zeros(seq_ids_padded.shape[0], 1, dtype=torch.long)), dim=1)
### END YOUR CODE ###

# Test result
print('targets_padded:', targets_padded.size())
print('last column of targets_padded:', targets_padded[:, -1])

# You should expect to see:
# targets_padded: torch.Size([16, 85])
# last column of targets_padded: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

targets_padded: torch.Size([16, 85])
last column of targets_padded: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


### T3. Compute Perplexity

In order to compute the perplexity, we first need to compute the negative log probabilities.

This can be accomplished with `nn.NLLLoss`, which takes `log_probs` and the target labels as input and returns the negative log probability (cross-entropy) loss. For a single token whose target has probability $p$, the loss is $-\log(p)$; with the default `reduction='mean'`, the output is the average $-\frac{1}{N}\sum_{i=1}^{N} \log(p_i)$ over the $N$ tokens.

However, by default this average runs over all the tokens, including the padding tokens. We need to exclude the padding tokens by setting the `ignore_index` argument to the padding ID, i.e., 0. Also, set the `reduction` argument to `'none'` to get one loss value per position; positions whose target is the padding ID get a loss of 0.
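The effect of `ignore_index` and `reduction` is easiest to see on a tiny batch. The numbers below are made up for illustration: three positions, a vocabulary of four ids, and a last target that is padding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy batch: 3 positions, vocabulary of 4 ids, last target is padding (id 0)
logits = torch.tensor([[0.0, 2.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0],
                       [3.0, 0.0, 0.0, 0.0]])
log_probs = F.log_softmax(logits, dim=-1)
targets = torch.tensor([1, 2, 0])

mean_all = nn.NLLLoss()(log_probs, targets)                 # averages over all 3 positions
mean_real = nn.NLLLoss(ignore_index=0)(log_probs, targets)  # averages over the 2 real tokens
per_token = nn.NLLLoss(ignore_index=0, reduction='none')(log_probs, targets)

print(mean_all, mean_real)
print(per_token)  # the padded position contributes exactly 0
```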

Finally, compute the perplexity of each sequence by exponentiating its average loss over the non-padding tokens.

See the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html


In [43]:
loss_fn = nn.NLLLoss(ignore_index=0, reduction='none')

### START YOUR CODE ###
# Calculate the loss
with torch.no_grad():  # Ensure no gradients are computed to save memory and computations
    # Flatten the targets to match the log_probs expected input
    targets_flat = targets_padded.view(-1)
    log_probs_flat = log_probs.view(-1, log_probs.size(-1))
    loss = loss_fn(log_probs_flat, targets_flat)
### END YOUR CODE ###

# Test result
print('loss:', loss.size())

# You should expect to see:
# loss: torch.Size([1360])
# Here, 1360 = 16 * 85, i.e., the total number of positions in the padded batch (padding positions have a loss of 0)

loss: torch.Size([1360])
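The cell above stops at the per-position loss. A minimal sketch of the remaining step, turning these losses into one perplexity per sequence, could look like the following. It uses toy sizes and uniform predictions rather than the lab's data, and the variable names `num_tokens`, `avg_loss`, and `perplexity` are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_length, vocab_size = 2, 3, 5  # toy sizes, not the lab's

# Uniform predictions: all logits equal, so every id has probability 1/vocab_size
log_probs = F.log_softmax(torch.zeros(batch_size, max_length, vocab_size), dim=-1)
targets_padded = torch.tensor([[2, 3, 0],
                               [4, 0, 0]])  # 0 is the padding ID

loss_fn = nn.NLLLoss(ignore_index=0, reduction='none')
loss = loss_fn(log_probs.view(-1, vocab_size), targets_padded.view(-1))

# Back to batch_size x max_length, then average over non-padding positions only
loss = loss.view(batch_size, max_length)
num_tokens = (targets_padded != 0).sum(dim=1).clamp(min=1)
avg_loss = loss.sum(dim=1) / num_tokens
perplexity = torch.exp(avg_loss)

print(perplexity)  # one value per sequence
```

With uniform predictions the perplexity equals the vocabulary size (5 here), which makes a convenient sanity check for the masking and averaging.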


### Model Architecture

In the `__init__` method, initialize the word embedding layer (`self.embedding` in the code below) from a pretrained embedding weight matrix, for example the one obtained in the previous assignment (the saved word2vec file).

`nn.Embedding` has a class method `from_pretrained` that builds the layer from a pretrained weight matrix. It expects a float tensor, so a `numpy.ndarray` has to be converted first.
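A small sketch of `from_pretrained` in isolation is shown below. The weight matrix is a random stand-in (an assumption for illustration), since the actual word2vec file from the previous assignment is not available here.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pretrained matrix: 6 words x 4 dimensions (random stand-in values)
pretrained = np.random.rand(6, 4).astype(np.float32)

# from_pretrained expects a float tensor; freeze=True (the default) keeps the weights fixed
emb = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=True)

ids = torch.tensor([[1, 2, 0]])
print(emb(ids).shape)            # torch.Size([1, 3, 4])
print(emb.weight.requires_grad)  # False
```

Passing `freeze=False` instead lets the embeddings be fine-tuned together with the rest of the model.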

The `forward` method takes the word id sequences and sequence lengths as inputs, and returns the logits or log probabilities from the RNN.

In [45]:
class RNNLM(nn.Module):
    def __init__(self, pretrained_weight_matrix, hidden_dim, output_dim, **kwargs):
        super().__init__()
        # Embedding layer initialized from the pretrained matrix (frozen by default)
        weights = torch.as_tensor(pretrained_weight_matrix, dtype=torch.float)
        self.embedding = nn.Embedding.from_pretrained(weights)

        # The LSTM input size is the embedding dimension; extra kwargs go to nn.LSTM
        self.rnn = nn.LSTM(weights.shape[1], hidden_dim, batch_first=True, **kwargs)
        # Projects each hidden state to output_dim (the vocabulary size)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, seq, seq_lens):
        # seq: batch_size x max_length word ids; seq_lens: true length of each sequence
        embedded = self.embedding(seq)
        packed_embedded = pack_padded_sequence(embedded, seq_lens, batch_first=True, enforce_sorted=False)
        packed_output, _ = self.rnn(packed_embedded)
        output, _ = pad_packed_sequence(packed_output, batch_first=True)
        logits = self.fc(output)
        # Log probabilities of shape batch_size x max_length x output_dim
        log_probs = torch.log_softmax(logits, dim=-1)

        return log_probs

### Sentence Generation

After training the RNN, we can use it to generate sentences. 

The process is as follows:
- Start with a special token or a sequence of tokens, e.g., ["子", "曰"]
- Pass the sequence to the RNN, and pick the next word from the output probability distribution of the last time step. We use greedy search here, i.e., select the word with the highest probability.
- Append the chosen word to the sequence, and repeat the process until a special token, e.g., "。", is produced, or until the maximum generation length is reached.
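The difference between greedy selection and actual sampling at a single time step can be sketched as follows. The distribution is made up for illustration; the lab's `generate_sentence` function uses the greedy variant.

```python
import torch

torch.manual_seed(0)

# Hypothetical log probabilities over a 4-word vocabulary at the last time step
log_probs = torch.log(torch.tensor([0.1, 0.6, 0.2, 0.1]))

greedy_id = torch.argmax(log_probs).item()                 # always the most likely id
sampled_id = torch.multinomial(log_probs.exp(), 1).item()  # random draw from the distribution

print(greedy_id)   # 1
print(sampled_id)  # depends on the random seed
```

Greedy search is deterministic, so the same start tokens always produce the same sentence, whereas sampling trades that determinism for variety.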

In [None]:
def generate_sentence(model, start_tokens, word2id, id2word, max_length=50):
    model.eval()  # Switch to evaluation mode

    # Convert the start tokens to ids
    current_seq = [word2id[token] for token in start_tokens]
    end_id = word2id["。"]  # Generation stops when this token is produced

    for _ in range(max_length):
        current_tensor = torch.LongTensor(current_seq).unsqueeze(0)  # Add batch dimension

        with torch.no_grad():
            # The model returns log probabilities of shape 1 x seq_len x vocab_size
            log_probs = model(current_tensor, torch.tensor([len(current_seq)]))
            # Greedy search: take the most likely id at the last time step
            next_token_id = torch.argmax(log_probs[0, -1, :]).item()

        current_seq.append(next_token_id)

        if next_token_id == end_id:
            break

    # Map the ids back to tokens and join them into a sentence
    return ''.join(id2word[token_id] for token_id in current_seq)