#  Assignment 1 - Language model foundations 💬

Welcome to the **1st assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **Gurnoor Singh Khurana**
> - ✉️ Email: **gurnoor.khurana@epfl.ch**
> - 🪪 SCIPER: **366788**

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;">

## How to implement this assignment

Please read carefully the following points. All the information on how to read, implement and submit your assignment is explained in detail below.

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook** `assignment1.ipynb` and the **`utils.py`** python file. In the `utils.py` file, you will add all the Dataset and Model classes you will implement according to the skeleton present in the file. In the notebook, you will add the data preprocessing pipeline for all the datasets, the training and testing pipelines of all implemented models and the report (See diagram below). 
    
![assignment_1_arch.png](docs/assignment_1_arch.png)

2. To implement your coding part, you can import the external libraries we provide in the `requirements.txt` file. However, you should not use any other package not included in these requirements. 

3. At the end of the notebook, you will need to fill in a **report** template, providing the results of your implementation. We provide you with the template for the report, therefore you need to fill in the missing Markdown cells with the requested information. 

4. Along with the `assignment1.ipynb` and the `utils.py` files, you also need to upload the pickle files of the following models under the `models/` directory:
    - the three LSTM-variant models (PART 2)  
    - the trained-from-scratch Transformer model (PART 2) 
    - the fine-tuned Encoder-Decoder model (PART 3) 
    - the fine-tuned pre-trained Transformer model (PART 3)
    
You will provide test results on all of the model variants according to the report template.
    
5. Finally, you will need to log your training pipelines using Tensorboard. Please follow the instructions in the `README.md` of the [tensorboard/](tensorboard/README.md) directory.


</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;">

## Assignment Description

- In the first two parts of this assignment, you need to train and evaluate two different language models: an **LSTM-based model** and a **Transformer-based model**. You will first explore the distribution of the input data and perform data cleaning and pre-processing. Then you will build two language model training and testing pipelines implementing an LSTM and a Transformer language model. You will play around with different hyperparameters.
- In the third part, you will build models on the downstream task of **Sentence Paraphrasing**. More specifically, you will fine-tune a sequence-to-sequence (**Encoder-Decoder**) architecture with attention and you will also fine-tune a **Transformer** model for this task.
- Finally, you will fill out a report with the model results for the different parts. 

More specifically:

- **[PART 1: Get to know your data](#1)**
    - [1.1 Data Pre-processing](#11)
    - [1.2 PyTorch Dataset creation](#12)
- **[PART 2: Training Language Models](#2)**
    - [2.1 LSTM-variants](#21)
    - [2.2 Transformer-variants](#22)
- **[PART 3: Fine-tune on the Text Paraphrasing task](#3)**
    - [3.1 Train an Encoder-Decoder model on Text Paraphrasing](#31)
    - [3.2 Run Transformer on Text Paraphrasing](#32)
- **[PART 4: Write your report](#4)**
    
### Deliverables

- ✅ This Jupyter notebook
- ✅ `utils.py` file
- ✅ 3 pickle files with the three LSTM-variant language models (Part 2)
- ✅ Pickle file with the trained-from-scratch Transformer language model (Part 2)
- ✅ Pickle file with the Encoder-Decoder model (Part 3)
- ✅ Pickle file with the fine-tuned pre-trained Transformer model (Part 3)

</div>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: **Add your SCIPER number below as a `str`!**
     
</div>

In [None]:
!pip install datasets
!pip install apache_beam
!pip install torchmetrics
!pip install gensim==4.1.2
!pip install transformers
!pip install evaluate

In [None]:
import regex as re
import random
import numpy as np
import torch


SCIPER = "366788"

try:
    assert re.match(r"\d{6}", SCIPER)[0] == SCIPER, "Invalid SCIPER given. please enter your correct 6-digit SCIPER number above!"
except (AssertionError, TypeError):
    print("Invalid SCIPER given. please enter your correct 6-digit SCIPER number above!")

student_seed = int(SCIPER[-4:])


"""Set seed for reproducibility."""
random.seed(student_seed)
np.random.seed(student_seed)
torch.manual_seed(student_seed)
torch.cuda.manual_seed_all(student_seed)

### Package installation & importing

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

In [None]:
from torch.utils.tensorboard import SummaryWriter
tb_writer = SummaryWriter(log_dir='tensorboard/')

In [None]:
from datasets import load_dataset
from tqdm import tqdm
import gensim
import torch
import math
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.optim as optim
from torchmetrics.classification import BinaryAccuracy
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, AutoConfig, AutoModelWithLMHead, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

---

<a name="1"></a>
# PART 1: Get to know your data 🔎

For the first two parts of this assignment, we will build our language models using the `wikitext-103` dataset.

> The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified 
Good and Featured articles on Wikipedia. 

Below is an example from the dataset: 
<br>_(This example was too long and was cropped)_



<div style="padding:8px 0 8px 15px;background-color:#F3F3F3;border-radius:20px;">
<code>{
"text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
</code>
</div>

🧐 You can find more about this dataset [here](https://huggingface.co/datasets/wikitext).

<a name="11"></a>
## 1.1 Data Preprocessing

In this part, while you get to better understand the dataset structure, you will also perform several steps to clean the dataset before passing it to neural models.

In [None]:
# Loads the dataset
wikitext_dataset = load_dataset("wikitext", 'wikitext-103-v1', split="train")

print(f"Size of the dataset is {len(wikitext_dataset)}")

Size of the dataset is 1801350


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: **Filter out all empty sentences**.

💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.
     
</div>

In [None]:
# YOUR CODE HERE
wikitext_dataset = wikitext_dataset.filter(lambda x: len(x['text'].strip()) > 0)

print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/1801350 [00:00<?, ? examples/s]

Size of the dataset is 1165029


Long sequences in the language model training can significantly slow down the training progress, both for RNN-based and transformer-based models.

One of the tricks that is mentioned in [BERT's paper](https://arxiv.org/abs/1810.04805) is to perform pretraining with shorter sequences in the beginning. 

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Following the same line of reasoning, **keep only samples that have at most 128 tokens**.
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.

</div>

In [None]:
# YOUR CODE HERE
wikitext_dataset = wikitext_dataset.filter(lambda x: len(x['text'].split()) <= 128)

print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/1165029 [00:00<?, ? examples/s]

Size of the dataset is 826663


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Let's make the dataset samples **lower case** to decrease the vocabulary size.

💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.

</div>

In [None]:
# YOUR CODE HERE
def lowerify(item_dict):
    item_dict['text'] = item_dict['text'].lower()
    return item_dict

wikitext_dataset = wikitext_dataset.map(lowerify)

Map:   0%|          | 0/826663 [00:00<?, ? examples/s]

If you take a look at the first few samples of the dataset, you will notice that they belong to [this](https://en.wikipedia.org/wiki/Valkyria_Chronicles_II) Wikipedia article.

We notice that **the titles of sections/articles are also included** in the dataset (for instance the article title itself, or the `gameplay` section in this article). These samples are not very useful for language modeling because they lack sentence structure.
Given the pattern of these samples, we need to **filter them out** from the dataset.
    

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Filter out the samples with `= = <section> = = \n` patterns.
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.
    
</div>

In [None]:
# YOUR CODE HERE
def filter_sections(text_dict):
    # Section and article titles look like "= title =" or "= = section = =".
    text = text_dict['text'].strip()
    if text.startswith("= ") and text.endswith(" ="):
        return False

    return True

wikitext_dataset = wikitext_dataset.filter(filter_sections)

print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/826663 [00:00<?, ? examples/s]

Size of the dataset is 521591
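
As a cross-check on the heading filter, the same pattern can be written as a regular expression. The sketch below is illustrative only: `HEADING_RE` and `is_heading` are hypothetical names that are not part of the assignment, the sample lines are made up, and it assumes headings look like `= title =` or `= = section = =` once surrounding whitespace is stripped.

```python
import re

# Hypothetical pattern: one or more "= " groups, a title, then one or more " =" groups.
HEADING_RE = re.compile(r"^(= )+.+?( =)+$")


def is_heading(text: str) -> bool:
    """Return True if the stripped line looks like a WikiText title or section heading."""
    return HEADING_RE.match(text.strip()) is not None


# Made-up sample lines in the style of the lower-cased dataset.
samples = [
    " = valkyria chronicles ii = \n",
    " = = gameplay = = \n",
    " the game takes place in 1937 . \n",
]
flags = [is_heading(s) for s in samples]
print(flags)
```

A regular expression is stricter than the `startswith`/`endswith` check because it also requires a non-empty title between the `=` groups.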


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Normalize accented letters** (e.g., `clément` becomes `clement`) from text using `gensim.utils.deaccent` to further decrease vocabulary size.
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe. 

</div>

In [None]:
# YOUR CODE HERE
def filter_accents(text_dict):
  text_dict['text'] = gensim.utils.deaccent(text_dict['text'])
  return text_dict

wikitext_dataset = wikitext_dataset.map(filter_accents)

Map:   0%|          | 0/521591 [00:00<?, ? examples/s]

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Remove all samples having **non-English characters**. 
   
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe along with the provided function `isEnglish()`. 

</div>

In [None]:
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

# YOUR CODE HERE
wikitext_dataset = wikitext_dataset.filter(lambda x: isEnglish(x['text']))

print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/521591 [00:00<?, ? examples/s]

Size of the dataset is 432479
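
To make the behaviour of `isEnglish()` concrete: the check passes only when every character is ASCII, so it is really an "ASCII-only" test rather than a language detector. A minimal sketch follows; it re-declares the function so the snippet is self-contained, and the sample strings are made up.

```python
def isEnglish(s):
    # Encoding to UTF-8 and decoding as ASCII fails as soon as one non-ASCII character appears.
    try:
        s.encode(encoding="utf-8").decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


samples = [
    "the cat sat on the mat .",
    "clement",                          # already de-accented, so it passes
    "café au lait",                     # accented letter is not ASCII
    "東京 is the capital of japan .",   # CJK characters are not ASCII
]
results = [isEnglish(s) for s in samples]

# On Python 3.7+, str.isascii() gives the same answer.
same_as_isascii = all(isEnglish(s) == s.isascii() for s in samples)
print(results, same_as_isascii)
```

This is why the accent normalization step runs first: otherwise every sample containing a word such as `clément` would be dropped here.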


### Looking at the vocabulary

Before we move on to additional preprocessing (similar to the dataset used in the week 2 exercises), we will take a look at the vocabulary size of the dataset up to this point. We will assume that tokens can be obtained by simply splitting on `" "`.

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Compute the frequency of all tokens in the dataset.**

</div>

In [None]:
def compute_token_frequency(dataset):
    # YOUR CODE HERE
    # Count how often each whitespace-separated token occurs in the dataset.
    vocab_frequency = dict()
    for datapoint in tqdm(dataset):
        for word in datapoint['text'].strip().split():
            vocab_frequency[word] = vocab_frequency.get(word, 0) + 1

    return vocab_frequency

vocab_frequency = compute_token_frequency(wikitext_dataset)
print(f"\nVocabulary size of the dataset is {len(vocab_frequency)}")

100%|██████████| 432479/432479 [00:41<00:00, 10465.88it/s]


Vocabulary size of the dataset is 189371





As discussed in the lectures, real text datasets have a relatively high fraction of rare tokens. For that reason, let's visualize the histogram of token frequencies to better see this effect.

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Plot a **histogram** with the frequencies of the words of the vocabulary. 
   
💻 API hint: You can use the `matplotlib.pyplot.hist` function. 

</div>

In [None]:
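# YOUR CODE HERE
# Histogram of per-token counts; the y-axis is log-scaled because most tokens are rare.
plt.hist(list(vocab_frequency.values()), bins=50, log=True)
plt.show()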

As you saw in the cells above, the dataset vocabulary is very large. Let's consider every token that occurs less than (or equal to) 5 times as a rare token. 

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Put these rare tokens in the **`rare_tokens` variable** and replace every rare token in the dataset with the `<unk>` token.
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.
    
</div>

In [None]:
rare_threshold = 5
rare_tokens = set()


# YOUR CODE HERE
rare_tokens = set([a for a in vocab_frequency if vocab_frequency[a] <= rare_threshold])

def replace_rare_tokens(item_dict):
  tokens = item_dict['text'].strip().split()
  for i in range(len(tokens)):
    if tokens[i] in rare_tokens:
      tokens[i] = "<unk>"
  item_dict['text'] = " ".join(tokens)
  return item_dict

wikitext_dataset = wikitext_dataset.map(replace_rare_tokens)

print(f"With threshold of {rare_threshold}, we have {len(rare_tokens)} rare tokens.\n",
      f"The vocabulary size is now {len(vocab_frequency) - len(rare_tokens)}")

Map:   0%|          | 0/432479 [00:00<?, ? examples/s]

With threshold of 5, we have 105822 rare tokens.
 The vocabulary size is now 83549
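
The rare-token step is easier to follow on a toy corpus. The sketch below is not the notebook's code: the corpus is made up, and `toy_threshold` and `replace_rare` are hypothetical names (the notebook uses a threshold of 5 on the real dataset).

```python
from collections import Counter

# Made-up corpus; each string plays the role of one dataset sample.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a zebra sat on the mat",
]
toy_threshold = 1  # hypothetical threshold for this tiny corpus

# Count whitespace-separated tokens, then collect those at or below the threshold.
counts = Counter(tok for line in corpus for tok in line.split())
rare = {tok for tok, c in counts.items() if c <= toy_threshold}


def replace_rare(line):
    """Replace every rare token in a sample with <unk>."""
    return " ".join("<unk>" if tok in rare else tok for tok in line.split())


replaced = [replace_rare(line) for line in corpus]
print(sorted(rare))
print(replaced)
```

Storing the rare tokens in a `set` keeps each membership test constant-time, which matters when the lookup runs once per token over hundreds of thousands of samples.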


The dataset still includes many short samples which are not very useful for the language modeling task. We will filter out very short samples from the dataset. _(Note: Assume tokens can be obtained by simple `" "` splitting.)_
    

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

 🎯 Goal: Filter out every sample that has **less than (or equal to) 5 tokens**.
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.

</div>

In [None]:
short_seq_threshold = 5
 
# YOUR CODE HERE
wikitext_dataset = wikitext_dataset.filter(lambda x: len(x['text'].split()) > short_seq_threshold)

print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/432479 [00:00<?, ? examples/s]

Size of the dataset is 401709


After replacing rare tokens with `<unk>`, we could have sentences like `<unk> <unk> <unk> <unk>`, which are again not very useful for language modeling. We will **filter out samples in which more than 5% of the tokens are `<unk>`**.

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Filter out samples in which more than 5% of the tokens are `<unk>`**
    
💻 API hint: Use `datasets.Dataset` utilities to manipulate the dataframe.

</div>

In [None]:
unknown_token_threshold = 0.05  # remove every sample in which more than 5% of the tokens are <unk>

# YOUR CODE HERE
wikitext_dataset = wikitext_dataset.filter(lambda x: x['text'].count('<unk>') / len(x['text'].split()) <= unknown_token_threshold)
print(f"Size of the dataset is {len(wikitext_dataset)}")

Filter:   0%|          | 0/401709 [00:00<?, ? examples/s]

Size of the dataset is 363805
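
The filter above counts `<unk>` occurrences as substrings of the text. A token-level variant is sketched below with a hypothetical helper `unk_ratio` and made-up samples; the two approaches agree whenever `<unk>` only ever appears as a standalone whitespace-separated token.

```python
def unk_ratio(text: str, unk_token: str = "<unk>") -> float:
    """Fraction of whitespace-separated tokens that are exactly the unknown token."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok == unk_token for tok in tokens) / len(tokens)


unknown_token_threshold = 0.05

samples = [
    "<unk> <unk> <unk> <unk>",                                      # ratio 1.0
    "the quick brown fox jumps over the lazy dog " * 3 + "<unk>",   # 1 of 28 tokens
    "the quick brown fox jumps over the lazy dog",                  # ratio 0.0
]
kept = [s for s in samples if unk_ratio(s) <= unknown_token_threshold]
print([round(unk_ratio(s), 4) for s in samples], len(kept))
```

With a 5% threshold, a sample needs at least 20 tokens before it can contain even a single `<unk>` and still be kept.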


Let's recompute the vocabulary of the resulting dataset to see how its size has changed.

In [None]:
vocab_frequency = compute_token_frequency(wikitext_dataset)

print(f"\nvocabulary size of the dataset is {len(vocab_frequency)}")

100%|██████████| 363805/363805 [00:34<00:00, 10497.88it/s]


vocabulary size of the dataset is 83404





---

<a name="12"></a>
## 1.2 PyTorch Dataset creation

After pre-processing the dataset, we will now create a `torch.utils.data.Dataset` class for the WikiText dataset. 
We do so in order to transform the dataset into the right format for the language modeling task. 
The following steps should be implemented:

- Add `<start>` and `<stop>` tokens at the beginning and end of a sentence respectively.
- Add padding tokens ( `<pad>` ) at the end of the sentences.
- Create a fallback to the `<unk>` token if an unseen word is encoded.
- Define dictionaries that map tokens to their respective index in the embedding matrix and vice versa.
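
The steps above can be sketched in plain Python. This is not the `RNNDataset` implementation (that lives in `utils.py`); the function names, the order in which indices are assigned, and the choice to truncate sequences that exceed the maximum length are all assumptions made for illustration.

```python
PAD, UNK, START, STOP = "<pad>", "<unk>", "<start>", "<stop>"


def build_vocab(sentences):
    """Map every token (plus the special tokens) to an index, and back."""
    word_to_index = {PAD: 0, UNK: 1, START: 2, STOP: 3}
    for sentence in sentences:
        for tok in sentence.split():
            word_to_index.setdefault(tok, len(word_to_index))
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


def encode(sentence, word_to_index, max_seq_length):
    """Wrap in <start>/<stop>, map unseen words to <unk>, then truncate or pad."""
    tokens = [START] + sentence.split() + [STOP]
    ids = [word_to_index.get(tok, word_to_index[UNK]) for tok in tokens]
    ids = ids[:max_seq_length]  # simplification: truncation may cut off <stop>
    ids += [word_to_index[PAD]] * (max_seq_length - len(ids))
    return ids


word_to_index, index_to_word = build_vocab(["the cat sat", "the dog ran"])
encoded = encode("the bird sat", word_to_index, max_seq_length=8)
decoded = [index_to_word[i] for i in encoded]
print(encoded)
print(decoded)
```

Here `bird` was never seen when the vocabulary was built, so it falls back to the `<unk>` index, and the sequence is right-padded to the fixed length.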


### Create the RNN Dataset

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Go to the `utils.py` file, and fill in the `RNNDataset` class with your implementation.**
    
</div>

In [None]:
from torch.utils.data import Dataset, DataLoader
import datasets
from src.utils import RNNDataset

MAX_SEQ_LENGTH = 128

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Instantiate** the implemented RNNDataset.
    
</div>

In [None]:
# YOUR CODE HERE
rnn_dataset = RNNDataset(wikitext_dataset, MAX_SEQ_LENGTH, percentage_data=0.1)

### Split data into train and test

Once we have created the dataset ready for the model training pipeline, we will split it into train and test datasets. Then we will pass them to a `DataLoader` class, following the same method we saw in the exercises session. 

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Split** the implemented RNNDataset into train and test subsets.
    
    

</div>

In [None]:
TRAIN_RATIO = 0.9

# dataset_length = len(wikitext_dataset)
dataset_length = len(rnn_dataset)
train_length = math.floor(dataset_length * TRAIN_RATIO)
test_length = dataset_length - train_length

rnn_train_dataset, rnn_test_dataset = torch.utils.data.random_split(rnn_dataset,
                                                               [train_length, test_length],
                                                               generator=torch.Generator().manual_seed(student_seed))

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">
    
🎯 Goal:  Create `DataLoader` objects using `batch_size = 8` for the train and test subsets.
    
</div>

In [None]:
train_dataloader = DataLoader(rnn_train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(rnn_test_dataset, batch_size=8, shuffle=False)

<div style="padding:15px 15px 15px 15px;border-left:3px solid #8e7cc3;background-color:#e4e1eb;border-radius: 15px;">

🎉 Excellent work! By this point, you will have made all the needed steps to make your data ready for training. 

#### Part 1 - Checklist
Here are the core building blocks you created and that you will need for Part 2:
   
- [X] `rnn_dataset`: A Dataset obj with the data, the vocabulary, the pad index, the max sequence length, and maps of idx to word type and vice versa. 
- [X] `train_dataloader`: A DataLoader obj with your training data
- [X] `test_dataloader`: A DataLoader obj with your testing data


_Tip: Try to familiarize yourself with these objects and what functionalities and attributes they provide._
    
</div>

---

<a name="2"></a>
# PART 2:  Training Language Models 🤗

#### Language Model: a probabilistic model of a sequence of tokens.

🔵 **What?**

Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. They are used in natural language processing (NLP) applications, particularly ones that generate text as an output. Some of these applications include machine translation and question answering.

🟡 **How?**

There are several different probabilistic approaches to modeling language, which vary depending on the purpose of the language model. From a technical perspective, the various types differ by the amount of text data they analyze and the math they use to analyze it (architecture). Some LMs we've already seen and will learn about during lectures are n-gram / count-based models, Recurrent Neural Networks (RNNs), and Transformer models. 

🟣 **Why?**

Language modeling is crucial in modern NLP applications. It is the reason that machines can understand qualitative information. Each language model type, in one way or another, turns qualitative information into quantitative information. This allows people to communicate with machines as they do with each other to a limited extent. It is used directly in a variety of industries including tech, finance, healthcare, transportation, legal, military and government. Additionally, it's likely most people reading this have interacted with a language model in some way at some point in the day, whether it be through Google search, an autocomplete text function or engaging with a voice assistant.

ℹ️ Source: [Original article](https://www.techtarget.com/searchenterpriseai/definition/language-modeling#:~:text=Language%20models%20determine%20word%20probability,predict%20or%20produce%20new%20sentences.)
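
In symbols, a language model assigns a probability to a token sequence $w_1, \dots, w_n$. Autoregressive models, such as the LSTM and GPT-style Transformer models trained in this part, factorize that probability with the chain rule:

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$

Training minimizes the average negative logarithm of these conditional probabilities, which is exactly what a token-level cross-entropy loss computes.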



<div style="padding:15px 15px 15px 15px;border-left:3px solid gray;background-color:#F3F3F3;border-radius: 15px;">

In this part, you will train your own language models using the dataset created in Part 1.

More specifically, you need to implement **5 different model variants**, train and test them to compute their perplexity.
    
| Model | Variant | Description |
|:---- |:----- | :----- |
| | Token embeddings trained from scratch | An LSTM model with a trainable token Embedding layer <br>that will be initialized randomly and trained from scratch along with the LM. |
| **LSTM** | Pre-trained token embeddings & frozen | An LSTM model with pre-trained GloVe embeddings as input <br>that will be frozen while the LM is training. |
|  | Pre-trained token embeddings & trainable | An LSTM model with pre-trained GloVe embeddings as input <br>that will be further trained along with the LM. |
||||
| **Transformer** | Trained from scratch | A Transformer based model that follows the architecture of [DistilGPT2](https://huggingface.co/distilgpt2). |
|  | Pre-trained DistilGPT2 | A pre-trained Transformer-based model called [DistilGPT2](https://huggingface.co/distilgpt2) <br>that will be used only for testing (not training). |
    
</div>
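
Perplexity, the metric used to compare these variants, is the exponential of the average per-token negative log-likelihood, so a lower value means the model is less "surprised" by the text. The sketch below uses made-up token probabilities, and the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that spreads its probability uniformly over 4 options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that puts high probability on the token that actually occurred.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uniform, confident)
```

A useful reading: a perplexity of `k` means the model is, on average, as uncertain as if it were choosing uniformly among `k` tokens at each position.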

---
<a name="21"></a>
## 2.1 LSTM-variants


### 2.1.1 Implementing all LSTM variants in one Model class

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Go to the `utils.py` file, and fill in the `VanillaLSTM` class with your implementation.**
    
💻 Implementation hint: You will create one model class for all variants. Try to incorporate all the different cases into one Model class.
    
</div>

In [None]:
from src.utils import VanillaLSTM

### 2.1.2 Building training and testing pipelines

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  Implement training and testing pipelines.
  
💻 Implementation hint: Check the pipelines we created in the exercises sessions.
    
</div>

In [None]:
def train(model, train_loader, optimizer, criterion, logging_str="", num_epochs=1):
    """
    Main training pipeline. Implement the following:
    - pass inputs to the model
    - compute loss
    - perform backward pass and update weights

    :param model: 
    :param train_loader:
    :param optimizer:
    :param criterion: 
    return: 
    """
    
    # YOUR CODE HERE
    # Training loop
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0
        for i, batch in enumerate(tqdm(train_loader, desc='Training')):
            X, Y = batch
            X, target_tokens = X.to(device), Y.to(device)
            optimizer.zero_grad()

            # Forward pass; drop the last position so outputs line up with the targets.
            outputs = model(X)[:, :-1, :]

            # Flatten to (batch * seq_len, vocab_size) and (batch * seq_len,) for the loss.
            outputs = outputs.reshape(-1, outputs.shape[-1])
            target_tokens = target_tokens.reshape(-1)

            # Backward pass and parameter update.
            loss = criterion(outputs, target_tokens)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

            # Log the training loss every 200 steps of the first epoch.
            if i % 200 == 0 and epoch == 0:
                tb_writer.add_scalar(logging_str, loss.item(), i)

        tb_writer.flush()
        print(epoch, f"Epoch loss: {epoch_loss / len(train_loader)}")

    # Average loss of the final epoch.
    return epoch_loss / len(train_loader)

In [None]:
def test(model, test_loader, criterion, logging_str=""):
    """
    Main testing pipeline. Implement the following:
    - pass inputs to the model
    - compute loss
    - compute perplexity

    :param model: 
    :param test_loader:
    :param criterion: 
    return: 
    """
    
    # YOUR CODE HERE
    
    # Testing loop
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    loss_value = 0
    # Gradients are not needed for evaluation; disabling them saves memory and time.
    with torch.no_grad():
        for i, data in tqdm(enumerate(test_loader), desc='Testing'):
            X, Y = data
            X, target_tokens = X.to(device), Y.to(device)
            # Drop the last position so outputs line up with the targets.
            outputs = model(X)[:, :-1, :]

            # Flatten to (batch * seq_len, vocab_size) and (batch * seq_len,) for the loss.
            outputs = outputs.reshape(-1, outputs.shape[-1])
            target_tokens = target_tokens.reshape(-1)

            loss = criterion(outputs, target_tokens)
            loss_value += loss.item()

    test_loss = loss_value / len(test_loader)
    perplexity = np.exp(test_loss)
    
    print(f'Test loss: {test_loss:.3f}')
    print(f'Test Perplexity: {perplexity:.3f}')

    tb_writer.add_scalar(f"{logging_str}/test_loss", test_loss, 0)
    tb_writer.add_scalar(f"{logging_str}/test_perplexity", perplexity, 0)
    tb_writer.flush()
    return test_loss, perplexity

### 2.1.3 Train and test LSTM variants

For **all the LSTM variants** you will perform the following steps:

1. Set hyperparameters
2. Load embeddings if needed
3. Instantiate the model and set training configurations
4. Run training pipeline (from 2.1.2)
5. Save the model
6. Run testing pipeline and compute perplexity (from 2.1.2)

#### LSTM Variant A: Embeddings trained from scratch

An LSTM model with a trainable Embedding layer that will be initialized randomly and trained from scratch along with the LM.

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Set hyperparameters according to the objective of the model.
      
💻 Implementation hint: You can play around with different values for `dropout_rate`, `lr` and `num_layers`.

</div>

In [None]:
# YOUR CODE HERE
vocab_size = 83404 + 3
embedding_dim = 100
hidden_dim = 100
num_layers = 2
dropout_rate = 0.01 
lr = 5e-3  # learning rate; matches the value passed to torch.optim.Adam below

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Instantiate the **model, optimizer and loss**.
      
💻 Implementation hint: Choose your training settings according to the task you need to do.

</div>

In [None]:
# YOUR CODE HERE
model = VanillaLSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
criterion = torch.nn.CrossEntropyLoss(ignore_index = rnn_dataset.pad_idx)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 16,926,407 trainable parameters


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training and testing pipelines** on **10% data** (train and test) and compute perplexity.
      
</div>

In [None]:
train(model, train_dataloader, optimizer, criterion, "LSTM_LM_variant_A/train_loss")
torch.save(model.state_dict(), 'models/lstm_with_random_token_embedding.pt')
test(model, test_dataloader, criterion, "LSTM_LM_variant_A")

Training: 100%|██████████| 4093/4093 [04:00<00:00, 17.01it/s]


0 Epoch loss: 6.58913682207862


Testing: 455it [00:08, 51.79it/s]

Test loss: 6.090
Test Perplexity: 441.466





(6.090100931859278, 441.4659668782315)

#### LSTM Variant B: Pre-trained embeddings & frozen

An LSTM model with pre-trained GloVe embeddings as input that will be frozen while the LM is training.

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Download **GloVe embeddings**.
      
</div>

In [None]:
import gensim.downloader
# Download the "glove-wiki-gigaword-100" embeddings
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-100')



<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Create an embedding layer with dimensions that match the input of `VanillaLSTM` model and initialize it with random weights.
    
💻 API hint: Use `torch.nn.Embedding` class.
      
</div>

In [None]:
# YOUR CODE HERE
torch.manual_seed(student_seed)
torch.cuda.manual_seed_all(student_seed)
initial_embedding_weight = torch.normal(mean=0, std=1/100, size=(vocab_size, embedding_dim))

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Add each GloVe embedding in the respective position in the Embedding layer created in previous step.

💻 API hint: Use the `.key_to_index` mapping of the GloVe vectors and the `.word_to_index` mapping of the dataset.
      
</div>

In [None]:
# YOUR CODE HERE
for word, index in rnn_dataset.word_to_index.items():
    # Overwrite the random row with the GloVe vector when the word is covered.
    if word in glove_vectors:
        initial_embedding_weight[index] = torch.tensor(glove_vectors[word])

In [None]:
initial_embedding_weight = torch.nn.Embedding.from_pretrained(initial_embedding_weight)
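
The initialization logic (random rows first, then pre-trained vectors copied in wherever the word is covered) can be shown on a toy example without any external dependency. Everything in the sketch below is made up for illustration: the tiny `pretrained` table stands in for GloVe, and the vocabulary stands in for the dataset's `word_to_index`.

```python
import random

random.seed(0)
embedding_dim = 4

# Toy stand-ins for the GloVe vectors and the dataset vocabulary (both hypothetical).
pretrained = {"cat": [0.1, 0.2, 0.3, 0.4], "dog": [0.5, 0.6, 0.7, 0.8]}
word_to_index = {"<pad>": 0, "<unk>": 1, "cat": 2, "dog": 3, "zyzzyva": 4}

# Start from small random vectors for every row of the embedding matrix.
weights = [[random.gauss(0.0, 0.01) for _ in range(embedding_dim)]
           for _ in range(len(word_to_index))]

# Overwrite the rows of words that have a pre-trained vector.
covered = 0
for word, index in word_to_index.items():
    if word in pretrained:
        weights[index] = list(pretrained[word])
        covered += 1

coverage = covered / len(word_to_index)
print(f"{covered} of {len(word_to_index)} rows initialized from pre-trained vectors")
```

Special tokens and words missing from the pre-trained table keep their random rows, so it can be worth checking the coverage: with frozen embeddings those rows never move away from their random initialization.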

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Instantiate the **model, optimizer and loss**.
      
💻 Implementation hint: Choose your training settings according to the task you need to do.

</div>

In [None]:
# YOUR CODE HERE
model = VanillaLSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, embedding_weights=initial_embedding_weight, freeze_embeddings=True)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
criterion = torch.nn.CrossEntropyLoss(ignore_index = rnn_dataset.pad_idx)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 8,585,707 trainable parameters


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training and testing pipelines** on **10% data** (train and test) and compute perplexity.
      
</div>

In [None]:
train(model, train_dataloader, optimizer, criterion, "LSTM_LM_variant_B/train_loss")
torch.save(model.state_dict(), 'models/lstm_with_frozen_glove_token_embedding.pt')
test(model, train_dataloader, criterion, "LSTM_LM_variant_B")

Training: 100%|██████████| 4093/4093 [03:46<00:00, 18.06it/s]


0 Epoch loss: 6.524221833166387


Testing: 4093it [01:18, 52.00it/s]

Test loss: 5.906
Test Perplexity: 367.079





(5.905577277251802, 367.07907103218275)

#### LSTM Variant C: Pre-trained embeddings & trainable	
An LSTM model with pre-trained GloVe embeddings as input that will be further trained along with the LM.

_Note: Use the same embedding layer you instantiated with GloVe embeddings in the previous step_

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Instantiate the **model, optimizer and loss**.
      
💻 Implementation hint: Choose your training settings according to the task you need to do.

</div>

In [None]:
# YOUR CODE HERE
model = VanillaLSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, embedding_weights=initial_embedding_weight, freeze_embeddings=False)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
criterion = torch.nn.CrossEntropyLoss(ignore_index = rnn_dataset.pad_idx)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 16,926,407 trainable parameters
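
The gap between the trainable-parameter counts of variants B and C can be checked by hand. The sketch below is a quick sanity check, assuming `embedding_dim = 100` (the dimensionality of the `glove-wiki-gigaword-100` vectors) and assuming that the embedding matrix is the only tensor that switches from frozen to trainable:

```python
# Trainable-parameter counts printed by the notebook for variants B and C.
params_frozen = 8_585_707      # variant B: GloVe embeddings frozen
params_trainable = 16_926_407  # variant C: GloVe embeddings trainable

# Assumption: the embedding matrix is the only difference between the two counts.
embedding_params = params_trainable - params_frozen
embedding_dim = 100  # assumed from the "glove-wiki-gigaword-100" vectors

# Implied number of rows in the embedding matrix (i.e. the vocabulary size).
implied_vocab_size, remainder = divmod(embedding_params, embedding_dim)
print(embedding_params, implied_vocab_size, remainder)
```

The difference divides evenly by 100, which is consistent with a single `vocab_size × 100` embedding matrix being the only newly trainable tensor. Under that assumption the vocabulary holds 83,407 entries.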


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training and testing pipelines** on **10% data** (train and test) and compute perplexity.
      
</div>

In [None]:
train(model, train_dataloader, optimizer, criterion, "LSTM_LM_variant_C/train_loss", 2)
torch.save(model.state_dict(), 'models/lstm_with_nonfreezed_glove_token_embedding.pt')
test(model, train_dataloader, criterion, "LSTM_LM_variant_C")

Training: 100%|██████████| 4093/4093 [03:58<00:00, 17.18it/s]


0 Epoch loss: 6.670323603606999


Training: 100%|██████████| 4093/4093 [03:58<00:00, 17.18it/s]


1 Epoch loss: 5.934493713555987


Testing: 4093it [01:18, 52.02it/s]

Test loss: 5.666
Test Perplexity: 288.800





(5.6657330800524095, 288.7996167723979)

---

<a name="22"></a>
## 2.2 Transformer-variants

For all Transformer variants we will use the architecture of the **DistilGPT2** model. DistilGPT2 (short for Distilled-GPT2) is an English-language model pre-trained with the supervision of the smallest version of Generative Pre-trained Transformer 2 (GPT-2). Like GPT-2, DistilGPT2 can be used to generate text. See more details in the [HuggingFace model card](https://huggingface.co/distilgpt2). 

### 2.2.1 Train DistilGPT-2 from scratch

You will perform the following steps:

1. Load the config of the DistilGPT-2 model using the Transformers library.
2. Load Model class from the config and the respective tokenizer.
3. Change input dataset to fit with the tokenization mechanism of DistilGPT-2.
4. Split dataset into train and test.
5. Create DataLoaders for train and test subsets.
6. Train the model from scratch.
7. Test the model and compute perplexity.
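
Perplexity in the last step is the exponential of the mean per-token cross-entropy loss (measured in nats). The sketch below illustrates the conversion, using the test losses reported for LSTM variants B and C earlier in this notebook as inputs. The helper name `perplexity_from_loss` is an assumption made for illustration only.

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (natural log) into perplexity."""
    return math.exp(mean_cross_entropy)

# Test losses reported earlier in the notebook for LSTM variants B and C.
ppl_b = perplexity_from_loss(5.905577277251802)
ppl_c = perplexity_from_loss(5.6657330800524095)
print(f"variant B: {ppl_b:.3f}, variant C: {ppl_c:.3f}")
```

Because of the exponential, small differences in loss translate into large differences in perplexity, which is why both values are worth reporting.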

In [None]:
model_name = "distilgpt2"
tokenizer_checkpoint = "distilgpt2"

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Load model config, model class and tokenizer.
      
💻 Implementation hint: You should load the **model instance** and not the pre-trained model weights. You should load the pre-trained tokenizer though.

</div>

In [None]:
MAX_SEQ_LENGTH = 128

from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast
# YOUR CODE HERE
gpt_tokenizer = GPT2TokenizerFast.from_pretrained(tokenizer_checkpoint)
model_config = GPT2Config(
    vocab_size = gpt_tokenizer.vocab_size,
    n_positions = MAX_SEQ_LENGTH,
    eos_token_id = gpt_tokenizer.eos_token_id
)
gpt2_scratch_model = GPT2LMHeadModel(model_config)

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Implement the following steps according to the in-line comments.

</div>

In [None]:
# YOUR CODE HERE
def tokenize_dataset(example, tokenizer, MAX_SEQ_LENGTH):
  example_copy = example.copy()
  example_copy.update(tokenizer(example['text'], padding='max_length', max_length=MAX_SEQ_LENGTH))
  return example_copy

# add pad_token the same as the EOS token to not increase vocab size
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# tokenize wikitext_dataset with pre-trained DistilGPT2 tokenizer
encoded_dataset = wikitext_dataset.map(lambda x: tokenize_dataset(x, gpt_tokenizer, MAX_SEQ_LENGTH))

# filter out sentences with length more than MAX_SEQ_LENGTH
limited_encoded_dataset = encoded_dataset.filter(lambda x: len(x['input_ids']) <= MAX_SEQ_LENGTH)

# # create input_ids and attention_mask columns in the dataset
# limited_encoded_dataset = ...

limited_encoded_dataset = limited_encoded_dataset.remove_columns("text")
limited_encoded_dataset = limited_encoded_dataset.with_format("torch")
limited_encoded_dataset = limited_encoded_dataset.map(lambda example:
                                                      {"labels": example["input_ids"]})

Map:   0%|          | 0/363805 [00:00<?, ? examples/s]

Filter:   0%|          | 0/363805 [00:00<?, ? examples/s]

Map:   0%|          | 0/296705 [00:00<?, ? examples/s]

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  **Split** the `limited_encoded_dataset` into train and test subsets.
    
💻 API hint: Use `torch.utils.data.random_split` method with the given `TRAIN_RATIO`.

</div>

In [None]:
# YOUR CODE HERE
TRAIN_RATIO = 0.9
original_length = len(limited_encoded_dataset)
dataset_length = int(0.1 * original_length)
train_length = int(TRAIN_RATIO * dataset_length)
test_length = dataset_length - train_length

transformer_train_dataset, transformer_test_dataset, _ = torch.utils.data.random_split(limited_encoded_dataset,
                                                               [train_length, test_length, original_length-train_length-test_length],
                                                               generator=torch.Generator().manual_seed(student_seed))

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Set hyperparameters according to the objective of the model.
      
💻 Implementation hint: You can play around with different values for `learning_rate`.

</div>

In [None]:
# YOUR CODE HERE
training_args = TrainingArguments(
    output_dir=f"{model_name}-wikitext103",
    evaluation_strategy = "epoch",
    logging_steps=100,
    learning_rate=0.01,
    save_steps=10000,
    weight_decay=0.01,
    num_train_epochs=5
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training** using the `Trainer` class on **10% of data**.

</div>

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=gpt_tokenizer, mlm=False)

trainer = Trainer(
    model = gpt2_scratch_model,
    data_collator = data_collator,
    train_dataset=transformer_train_dataset,
    eval_dataset=transformer_test_dataset,
    args=training_args
)

In [None]:
transformer_train_dataset[0]

In [None]:
trainer.train()
torch.save(trainer.model.state_dict(), 'models/distilgpt2-lm-from-scratch.pt')

Epoch,Training Loss,Validation Loss
1,7.1605,7.10394
2,7.1032,7.072162
3,7.0425,7.064168
4,6.9782,6.966454
5,6.908,6.917663


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **testing** using the `Trainer` class and compute perplexity.

      
</div>

In [None]:
eval_result = trainer.evaluate()
perplexity_from_scratch = math.exp(eval_result["eval_loss"])
print(f"The perplexity on the test dataset is {perplexity_from_scratch:.3f}")

The perplexity on the test dataset is 1009.957


### 2.2.2 Run Pre-trained GPT-2 model

Having trained a Transformer from scratch in the previous section, you will now measure the perplexity of a pre-trained model. No training is needed here: we only evaluate the pre-trained model on the test dataset. 

You will perform the following steps:

1. Load pre-trained model and tokenizer
2. Run testing and compute perplexity

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
import torch

model_id = "distilgpt2"

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Load pre-trained model and tokenizer.
      
</div>

In [None]:
# YOUR CODE HERE
gpt2_pretrained_model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer_pretrained_gpt = GPT2TokenizerFast.from_pretrained(model_id)
tokenizer_pretrained_gpt.pad_token = tokenizer_pretrained_gpt.eos_token

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Set hyperparameters to set up the Trainer.
      
💻 Implementation hint: We will use only the inference part on the trainer.

</div>

In [None]:
# YOUR CODE HERE
training_args = TrainingArguments(
    output_dir=f"pretrained_{model_id}-wikitext103",
    evaluation_strategy = "epoch",
    logging_steps=100,
    learning_rate=0.01,
    save_steps=10000,
    weight_decay=0.01)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **testing** using the `Trainer` class and compute perplexity.
      
</div>

In [None]:
pretrained_trainer = Trainer(
    model = gpt2_pretrained_model,
    data_collator = data_collator,
    train_dataset=transformer_train_dataset,
    eval_dataset=transformer_test_dataset,
    args=training_args
)

In [None]:
eval_result = pretrained_trainer.evaluate()
perplexity_pretrained_model = math.exp(eval_result["eval_loss"])
print(f"The perplexity of pretrained {model_id} on the test dataset is {perplexity_pretrained_model:.3f}")

The perplexity of pretrained distilgpt2 on the test dataset is 136.075


<div style="padding:15px 15px 15px 15px;border-left:3px solid #8e7cc3;background-color:#e4e1eb;border-radius: 15px;">
    
🎉  Excellent work! By this point, you will have implemented all language model variants.

#### Part 2 - Checklist
Here are the core building blocks you created and that you will need for Part 3:
   
- [X] LSTM-variants checkpoints.
- [X] LSTM-variants ppl scores.
- [X] Transformer-variants ppl scores.

_Note: Don't forget to include the tensorboard log to every model you trained, as discussed in the `README.md` of `tensorboard/` dir._
</div>

---

<a name="3"></a>
# PART 3: Fine-tune on the Text Paraphrasing task 🚀

In this part, we will fine-tune and test the language models on the downstream task of text paraphrasing. 

For this task, we will use the [MRPC dataset](https://paperswithcode.com/dataset/mrpc). The Microsoft Research Paraphrase Corpus (MRPC) consists of 5,801 sentence pairs collected from newswire articles. Each pair is labeled by human annotators as a paraphrase or not. 

 ![mrpc.png](docs/mrpc.png)
 
## Models
From this dataset, we will keep only the pairs labeled as paraphrases (label 1). With those, we will test the model's ability to take a sentence as input and produce its paraphrase as output. 

### Encoder-Decoder architectures: 
To build a sequence-to-sequence model, we will create an Encoder-Decoder model with an attention mechanism, similar to the one in the week 3 exercises.

More specifically you need to implement the following:
- Preprocess the dataset to match with the format of the model's input.
- Build an Encoder-Decoder model that will be trained from scratch on the text paraphrasing task.
- Train and test your architectures and compute the train/validation loss score. 

### Transformer-based architectures:

You will also run experiments with the pre-trained Transformer-based model, as we did in Part 2. 
You will again use DistilGPT2. More specifically, you need to implement the following:

- Preprocess the dataset to match the format of the model's input.
- Run training (fine-tuning) of the model on the train dataset.
- Run inference on the test set and compute evaluation scores (see section below).


#### Evaluation for the Transformer model

You will evaluate your model using ROUGE scores. 
 
**ROUGE score** stands for Recall-Oriented Understudy for Gisting Evaluation. In its simplest form, the ROUGE score is the number of words that match between the candidate and the reference, divided by the total number of words in the reference sentence. Because the denominator is the reference length, ROUGE is a recall-oriented metric. 

![rouge.png](docs/rouge.png)

**ROUGE-L score** is based on the length of the longest common subsequence (LCS) between the candidate and the reference. To counter the disadvantages of a pure recall metric as in ROUGE-N, ROUGE-L calculates the weighted harmonic mean (or F-measure) of the precision score and the recall score.

![rouge_l.png](docs/rouge_l.png)

ℹ️ Source: [Original article](https://clementbm.github.io/theory/2021/12/23/rouge-bleu-scores.html#bleu)
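
To make the two definitions concrete, here is a minimal pure-Python sketch of ROUGE-1 recall and the ROUGE-L F-measure on whitespace tokens. It is only an illustration: the function names are assumptions, it uses the balanced (F1) form of the F-measure as a simplification, and the `evaluate` implementation used later applies its own tokenization and may give different numbers.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Share of reference unigrams that also occur in the candidate (clipped counts)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Balanced F-measure of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge1_recall(reference, candidate), rouge_l_f1(reference, candidate))
```

In this toy pair, five of the six reference words are matched and the longest common subsequence is "the cat on the mat", so both scores come out at 5/6.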


### Load MRPC dataset and extract the paraphrase pairs

In [None]:
from datasets import load_dataset
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# load dataset
mrpc_dataset = load_dataset("glue", "mrpc")
MAX_SEQ_LENGTH = 64

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Keep only the **paraphrased pair** of sentences.

</div>

In [None]:
# YOUR CODE HERE
mrpc_dataset = mrpc_dataset.filter(lambda x: x['label'] == 1)

Filter:   0%|          | 0/3668 [00:00<?, ? examples/s]

Filter:   0%|          | 0/408 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1725 [00:00<?, ? examples/s]

<a name="31"></a>
## 3.1 Train Encoder-Decoder models on Text Paraphrasing

In this part, you will preprocess the dataset to make it suitable for the Encoder-Decoder model by adding `<start>` and `<stop>` tokens to each sentence and then padding to the maximum sequence length. From now on, we will refer to sentence 1 as the context and sentence 2 as the reference. Finally, you will compute the train/validation loss.

### Data Preprocessing for encoder-decoder

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Add  `<start>` and `<stop>` tokens and pad the input to `MAX_SEQ_LENGTH` length.

</div>

In [None]:
# YOUR CODE HERE
rnn_word_to_idx = rnn_dataset.word_to_index
MAX_SEQ_LENGTH = 64

def tokenize(x):
  sent1 = "<start> " + x['sentence1'] + " <stop>"
  sent2 = "<start> " + x['sentence2'] + " <stop>"

  sent1_words, sent2_words = sent1.split(), sent2.split()
  sent1_words = sent1_words + ["<pad>"] * (MAX_SEQ_LENGTH - len(sent1_words))
  sent2_words = sent2_words + ["<pad>"] * (MAX_SEQ_LENGTH - len(sent2_words))
  sent1_words, sent2_words = sent1_words[:MAX_SEQ_LENGTH], sent2_words[:MAX_SEQ_LENGTH]


  unk_token = rnn_word_to_idx["<unk>"]
  sent1_ids = [rnn_word_to_idx.get(word, unk_token) for word in sent1_words]
  sent2_ids = [rnn_word_to_idx.get(word, unk_token) for word in sent2_words]
  return {
        'input': sent1_ids,
        'target': sent2_ids
  }

# tokenizing the texts in the mrpc
mrpc_dataset = mrpc_dataset.map(lambda x: {**x, **tokenize(x)})
# # pad sequences
# mrpc_dataset = ...
mrpc_dataset = mrpc_dataset.remove_columns(["sentence1", "sentence2"])
mrpc_dataset = mrpc_dataset.with_format("torch")

Map:   0%|          | 0/2474 [00:00<?, ? examples/s]

Map:   0%|          | 0/279 [00:00<?, ? examples/s]

Map:   0%|          | 0/1147 [00:00<?, ? examples/s]

In [None]:
mrpc_train, mrpc_validation, mrpc_test = mrpc_dataset["train"], mrpc_dataset["validation"], mrpc_dataset["test"]

mrpc_train_dataloader = DataLoader(mrpc_train, batch_size=8, shuffle=True)
mrpc_validation_dataloader = DataLoader(mrpc_validation, batch_size=8, shuffle=False)
mrpc_test_dataloader = DataLoader(mrpc_test, batch_size=8, shuffle=False)

### Run model fine-tuning

In [None]:
# import your enc-dec model
from src.utils import EncoderDecoder

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal:  Implement training and testing pipelines.
  
💻 Implementation hint: Check the pipelines we created in the exercises sessions.
    
</div>

In [None]:
def seq2seq_eval(model, eval_loader, criterion):
    # this function should be called in the train loop to monitor the performance in validation set while training.
    
    epoch_loss = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for i, data in enumerate(eval_loader):
            X_batch, Y_batch = data['input'], data['target']
            batch_loss = 0
            batch_size = len(X_batch)  # the last batch may hold fewer than 8 samples
            for X, Y in zip(X_batch, Y_batch):
              X = X.to(device)
              target_tokens = Y.to(device)
              outputs = model(X, targets=target_tokens)
              loss = criterion(outputs, target_tokens)
              batch_loss += loss.item()

            epoch_loss += batch_loss / batch_size

    return epoch_loss / len(eval_loader)

In [None]:
from tqdm import tqdm
def seq2seq_train(model, train_loader, eval_loader, optimizer, criterion, num_epoch):
    
    best_eval_loss = 1e3 # used to do early stopping
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # Training loop
    for epoch in range(num_epoch):
        epoch_loss = 0
        for i, data in enumerate(tqdm(train_loader)):
            X_batch, Y_batch = data['input'], data['target']
            batch_loss = 0
            batch_size = len(X_batch)  # the last batch may hold fewer than 8 samples
            for X, Y in zip(X_batch, Y_batch):
              X = X.to(device)
              target_tokens = Y.to(device)
              optimizer.zero_grad()

              outputs = model(X, targets=target_tokens)

              loss = criterion(outputs, target_tokens)
              loss.backward()

              # update the parameters after every sample
              optimizer.step()

              batch_loss += loss.item()
              # YOUR CODE HERE

            epoch_loss += batch_loss / batch_size

            if i%20 == 0:
              # use a global step so that later epochs do not overwrite earlier points
              global_step = epoch * len(train_loader) + i
              tb_writer.add_scalar("LSTM_seq2seq_attention/train_loss", batch_loss / batch_size, global_step)

        eval_loss = seq2seq_eval(model, eval_loader, criterion)
        tb_writer.add_scalar("LSTM_seq2seq_attention/validation_loss", eval_loss, epoch)
        tb_writer.flush()
        print("Eval loss:", eval_loss)

    return epoch_loss / len(train_loader)

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Instantiate the **model, optimizer and loss**.
      
💻 Implementation hint: Choose your training settings according to the task you need to do.

</div>

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
seq2seq_with_attention_model = EncoderDecoder(100, len(rnn_word_to_idx), len(rnn_word_to_idx), rnn_word_to_idx)
seq2seq_with_attention_model.to(device)
optimizer = torch.optim.AdamW(seq2seq_with_attention_model.parameters(), lr=1e-3, weight_decay=1e-6)
criterion = torch.nn.NLLLoss(ignore_index=rnn_dataset.pad_idx)
num_params = sum(p.numel() for p in seq2seq_with_attention_model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 25,178,871 trainable parameters


<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training and testing pipelines**.
      
</div>

In [None]:
seq2seq_train(seq2seq_with_attention_model, mrpc_train_dataloader, mrpc_validation_dataloader, optimizer, criterion, 1)
# saving the model
torch.save(seq2seq_with_attention_model.state_dict(), "models/rnn_seq2seq_with_attention.pt")



Eval loss: 5.548291779415948


<a name="32"></a>
## 3.2 Run Transformer on Text Paraphrasing

In this part, you will concatenate each paraphrase pair into a single input sequence, which is then passed to the DistilGPT2 model for fine-tuning and testing.
The input should have the following format:
```
<sentence_1> <eos> <sentence_2> <eos>
```
where `<eos>` is the tokenizer's end-of-sequence token.

From now on, we will refer to sentence 1 as `context` and sentence 2 as `reference`. 

Here, we use a decoder-only model (DistilGPT2) which gets the **context** as input and generates the **reference** sequence (token-by-token). Given this token-by-token generation, the nature of the model is very similar to a language model. The major difference is that causal language models in general try to predict the next token for the whole input, whereas in this case the model should generate only the **reference** (i.e., the **context** should be masked for loss computation).  
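
The input layout and the loss masking described above can be sketched on plain Python lists. The token ids below are made up for illustration, and `build_example` is a hypothetical helper, not part of the assignment skeleton. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. Masking the padding positions as well is an additional choice made in this sketch.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(context_ids, reference_ids, eos_id, max_len):
    """Return (input_ids, attention_mask, labels) for `<context> <eos> <reference> <eos>`."""
    input_ids = (context_ids + [eos_id] + reference_ids + [eos_id])[:max_len]
    n_real = len(input_ids)
    n_pad = max_len - n_real

    # Context tokens and the first <eos> carry no loss; the rest are targets.
    n_context = min(len(context_ids) + 1, n_real)
    labels = [IGNORE_INDEX] * n_context + input_ids[n_context:]

    # Pad on the right; padded positions are excluded from attention and from the loss.
    attention_mask = [1] * n_real + [0] * n_pad
    input_ids = input_ids + [eos_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels

# Hypothetical ids: context = [11, 12, 13], reference = [21, 22], <eos> = 0.
ids, mask, labels = build_example([11, 12, 13], [21, 22], eos_id=0, max_len=10)
print(ids)
print(mask)
print(labels)
```

Only the reference tokens and the closing `<eos>` keep their ids in `labels`, so the loss is computed on the paraphrase alone.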

Finally, you will compute the ROUGE scores as follows:

1. You will generate 5 sequences given each context.
2. You will compute the ROUGE-L score between each of these 5 generations and the **context** => `ROUGE(context, generationX)`
3. You will select the best generation (among the 5) as the predicted reference.
4. You will compute the ROUGE-(1, 2, L) scores between the top generation (from step 3) and the **reference** => `ROUGE(reference, top-generation)`
5. You will provide the average ROUGE-(1, 2, L) scores for all the test dataset samples.
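
The selection procedure in steps 1 to 4 can be sketched independently of any model. In the sketch below the five generations are hard-coded strings and `overlap_f1` is a simple unigram-overlap stand-in for ROUGE-L. Both are assumptions for illustration, and the actual evaluation should use `evaluate.load('rouge')`.

```python
from collections import Counter

def overlap_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, used here as a lightweight stand-in for a ROUGE score."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def pick_best(context: str, generations: list) -> str:
    """Steps 2-3: keep the generation that scores highest against the context."""
    return max(generations, key=lambda g: overlap_f1(context, g))

context = "the company reported higher profits this quarter"
reference = "profits rose at the company this quarter"
generations = [  # hypothetical model outputs (step 1)
    "the weather was nice",
    "the company said profits were higher this quarter",
    "profits fell",
    "a new phone was announced",
    "the quarter ended",
]

best = pick_best(context, generations)
# Step 4: score the selected generation against the reference.
print(best, round(overlap_f1(reference, best), 3))
```

Step 5 then amounts to repeating this for every test sample and averaging the per-sample scores.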

### Data Preprocessing for Transformers

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Load pre-trained **model** and **tokenizer**.
      
</div>

In [None]:
from transformers import AutoTokenizer, GPT2LMHeadModel 
from datasets import load_dataset

model_id = "distilgpt2"
mrpc_dataset = load_dataset("glue", "mrpc")
mrpc_dataset = mrpc_dataset.filter(lambda x: x['label'] == 1)

# YOUR CODE HERE
tokenizer_pretrained_gpt = AutoTokenizer.from_pretrained(model_id)
gpt2_pretrained_model = GPT2LMHeadModel.from_pretrained(model_id)



  0%|          | 0/3 [00:00<?, ?it/s]



<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Concatenate sentences, pass them to the tokenizer and clip to `MAX_SEQ_LENGTH` length.
      
</div>

In [None]:
MAX_SEQ_LENGTH = 64
tokenizer_pretrained_gpt.pad_token = tokenizer_pretrained_gpt.eos_token
eos_token = tokenizer_pretrained_gpt.eos_token

# YOUR CODE HERE
def tokenize_sentences(item, tokenizer, MAX_SEQ_LENGTH):
  item_copy = item.copy()
  sentence1_encoded = tokenizer(item['sentence1'])
  sentence2_encoded = tokenizer(item['sentence2'])

  item_copy['input_ids'] = sentence1_encoded['input_ids'] + [tokenizer.eos_token_id] + sentence2_encoded['input_ids'] + [tokenizer.eos_token_id]
  item_copy['attention_mask'] = sentence1_encoded['attention_mask'] + [1] + sentence2_encoded['attention_mask'] + [1]

  # slice it to max sequence length
  item_copy['input_ids'] = item_copy['input_ids'][:MAX_SEQ_LENGTH]
  item_copy['attention_mask'] = item_copy['attention_mask'][:MAX_SEQ_LENGTH]

  item_copy['input_ids'] = item_copy['input_ids'] + [tokenizer.eos_token_id] * (MAX_SEQ_LENGTH - len(item_copy['input_ids']))
  item_copy['attention_mask'] = item_copy['attention_mask'] + [0] * (MAX_SEQ_LENGTH - len(item_copy['attention_mask']))

  return item_copy


# concatenate sentences along with <eos> and pass them to the tokenizer
mrpc_dataset = mrpc_dataset.map(lambda x: tokenize_sentences(x, tokenizer_pretrained_gpt, MAX_SEQ_LENGTH))

# cut input and attention mask to MAX_SEQ_LENGTH
# mrpc_dataset = mrpc_dataset.map()

Map:   0%|          | 0/2474 [00:00<?, ? examples/s]

Map:   0%|          | 0/279 [00:00<?, ? examples/s]

Map:   0%|          | 0/1147 [00:00<?, ? examples/s]

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Apply the masking technique described above (mask context sequence).
      
</div>

In [None]:
def get_sample_label(sample):
    # this function masks the context (by assigning -100), and makes the paraphrase the target labels
    
    # YOUR CODE HERE
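    output_label = sample["input_ids"].copy()
    # number of context tokens plus the first <eos>, capped at the sequence length
    context_len = len(tokenizer_pretrained_gpt(sample['sentence1'])['input_ids']) + 1
    context_len = min(context_len, len(output_label))
    output_label[:context_len] = [-100] * context_len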
    
    return {**sample, "labels": output_label} 

mrpc_dataset = mrpc_dataset.map(get_sample_label)

mrpc_dataset = mrpc_dataset.with_format("torch")
mrpc_train_dataset, mrpc_eval_dataset = mrpc_dataset["train"], mrpc_dataset["validation"] 

Map:   0%|          | 0/2474 [00:00<?, ? examples/s]

Map:   0%|          | 0/279 [00:00<?, ? examples/s]

Map:   0%|          | 0/1147 [00:00<?, ? examples/s]

### Run model fine-tuning

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Set hyperparameters according to the objective of the model.
      
💻 Implementation hint: You can play around with different values for `learning_rate`.

</div>

In [None]:
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM,DataCollatorForLanguageModeling

# create the finetuning trainer
training_args = TrainingArguments(
    output_dir=f"finetune_{model_id}-MRPC",
    evaluation_strategy = "epoch",
    logging_steps=100,
    learning_rate=0.00001,
    num_train_epochs=15,
    save_steps=10000,
    weight_decay=0.01,
    report_to="none")

gpt2_pretrained_model.transformer.wte.weight.requires_grad = False
gpt2_pretrained_model.lm_head.weight.requires_grad = False

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: Run **training** using the `Trainer` class.
      
</div>

In [None]:
paraphrasing_trainer = Trainer(
    model=gpt2_pretrained_model,
    args=training_args,
    # tokenizer=tokenizer_pretrained_gpt,
    train_dataset=mrpc_train_dataset,
    eval_dataset=mrpc_eval_dataset
)

paraphrasing_trainer.train()
torch.save(paraphrasing_trainer.model.state_dict(), 'models/finetune-distilgpt2-mrpc.pt')



Epoch,Training Loss,Validation Loss
1,1.2909,1.161918
2,1.2127,1.132101
3,1.1319,1.106699
4,1.1,1.096751
5,1.0836,1.090358
6,1.0505,1.08409
7,1.0476,1.081761
8,1.0118,1.079035
9,1.004,1.076815
10,0.9692,1.078204


### Evaluate model with ROUGE scores

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;">

🎯 Goal: For each sample in evaluation set, generate 5 outputs and perform the ROUGE evaluation as presented in the question description above.

💻 Implementation hint: Use the following API call to get top-k generations
    
    generated_sequences = paraphrasing_trainer.model.generate(
        context_ids,
        do_sample=True, 
        max_length=MAX_SEQ_LENGTH, 
        top_k=20, 
        top_p=0.95, 
        no_repeat_ngram_size=2, 
        num_return_sequences=5
    )

_Note 1: For simplicity, you can ignore the contexts that have more than 1 sentence._

_Note 2: In the generated reference, if more than 1 sentence is generated, keep only the first one._

_Note 3: To split into sentences, you can use [`nltk.sent_tokenize()`](https://www.nltk.org/api/nltk.tokenize.html)._

_Note 4: Use the [`evaluate.load('rouge')`](https://huggingface.co/spaces/evaluate-metric/rouge) function to compute the ROUGE metrics._

 
</div>

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

!pip install rouge_score

In [None]:
import nltk
import evaluate
import numpy as np
import torch
from tqdm import tqdm
from transformers import GPT2LMHeadModel
nltk.download('punkt')

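def sentences(x):
    # the decoded text is the context followed by the generation; contexts with
    # more than one sentence are skipped below, so l[0] is the context and
    # l[1] is the first generated sentence (see Note 2)
    l = nltk.tokenize.sent_tokenize(x)
    if len(l) == 1:
      return " "
    return l[1]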


rouge_values = []
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
rouge = evaluate.load('rouge')
i = 0
for sample in tqdm(mrpc_eval_dataset):
    # Note 1: skip contexts that contain more than one sentence
    len_sentence = len(nltk.tokenize.sent_tokenize(sample['sentence1']))
    if len_sentence > 1:
        continue
    # tokenize the context and add a batch dimension for generate()
    context_ids = tokenizer_pretrained_gpt(sample['sentence1'])
    input_ids = torch.tensor(context_ids['input_ids'])
    attention_mask = torch.tensor(context_ids['attention_mask'])
    input_ids = input_ids.unsqueeze(0).to(device)
    attention_mask = attention_mask.unsqueeze(0).to(device)

    # print(sample['sentence1'])
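    generated_sequences = paraphrasing_trainer.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_length=MAX_SEQ_LENGTH,
        top_k=20,
        top_p=0.95,
        no_repeat_ngram_size=2,
        num_return_sequences=5,
        # set explicitly to avoid the per-call `pad_token_id` warning
        pad_token_id=tokenizer_pretrained_gpt.eos_token_id
    )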

    batch_sequences = tokenizer_pretrained_gpt.batch_decode(generated_sequences, skip_special_tokens=True)
    # strip the context and keep a single generated sentence per candidate
    batch_sequences = [sentences(x) for x in batch_sequences]

    # print(batch_sequences)

    i += 1

    # if i > 1: break

    # pick the best candidate given ROUGE-L similarity to the context
    best_sequence = batch_sequences[0]
    best_rouge = 0
    for seq in batch_sequences:
        rouge_L = rouge.compute(predictions=[seq], references=[sample['sentence1']])['rougeL']
        if rouge_L > best_rouge:
            best_rouge = rouge_L
            best_sequence = seq

    
    # compute the ROUGE value of the best candidate with the reference
    rouge_scores = rouge.compute(predictions=[best_sequence], references=[sample['sentence2']])
    # print(rouge_scores)
    rouge_values.append(rouge_scores)

rouge1, rouge2, rougeL, rougeLsum = np.mean([rouge_x['rouge1'] for rouge_x in rouge_values]),\
                                    np.mean([rouge_x['rouge2'] for rouge_x in rouge_values]),\
                                    np.mean([rouge_x['rougeL'] for rouge_x in rouge_values]),\
                                    np.mean([rouge_x['rougeLsum'] for rouge_x in rouge_values])

print(rouge1, rouge2, rougeL, rougeLsum)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



0.24400790164947866 0.032038359294273705 0.19869379273822668 0.19869379273822668





<div style="padding:15px 15px 15px 15px;border-left:3px solid #8e7cc3;background-color:#e4e1eb;border-radius: 15px;">

🎉 Excellent work! You just finished the code implementation parts of the assignment. 

#### Part 3 - Checklist
Here are the elements you will need for the report in Part 4:
   
- [X] LSTM-variants scores on perplexity and their checkpoints.
- [X] DistilGPT2 score on perplexity and its checkpoint.
- [X] Encoder-decoder variant train/validation loss score and its checkpoint.
- [X] DistilGPT2 ROUGE scores and its fine-tuned checkpoint.

_Note: Don't forget to include the TensorBoard log for every model you trained._


</div>

---

<a name="4"></a>
# PART 4: Write your report 📘

Fill in the tables with the respective scores. 

<div style="padding:15px 15px 15px 15px;border-left:3px solid #03befc;background-color:#eff7fe;border-radius: 15px;text-align:center;">

#### Perplexity results on Language Models

| Model - Variant | PPL |
|:--------- | :-----: |
| LSTM Variant A - Embeddings trained from scratch | 441.466 |
| LSTM Variant B - Pre-trained embeddings & frozen | 367.079 |
| LSTM Variant C - Pre-trained embeddings & trainable | 288.800 |
| | |
| DistilGPT2 - Trained from scratch | 1009.957 |
| Pre-trained DistilGPT2 | 136.075 |
    
#### Performance scores on Text Paraphrasing
| Model - Variant | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|:--------- | :-----: | :-----: |  :-----: |  :-----: | 
| Pre-trained DistilGPT2 | 0.244 | 0.032 | 0.199 | 0.199 |

</div>
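
For reading the perplexity table, recall that perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better and a model that guesses uniformly over $V$ tokens has a perplexity of $V$. The sketch below illustrates this relation with the standard library only. The function name `perplexity` and the toy inputs are illustrative, and this is not the evaluation code used to produce the scores above.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that assigns probability 1/4 to every observed token.
uniform_log_probs = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform_log_probs)

# A more confident model (probability 1/2 per token) gets a lower perplexity.
ppl_confident = perplexity([math.log(1 / 2)] * 10)
print(ppl_uniform, ppl_confident)
```

Because of the exponential, small differences in average loss translate into large differences in perplexity, which is worth keeping in mind when comparing the variants in the table.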