#  Assignment 3 - Natural Language Generation 💬

Welcome to the **third and last assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **< First Last >**
> - ✉️ Email: **< first.last >@epfl.ch**
> - 🪪 SCIPER: **XXXXXX**

In [None]:
# TODO: Please enter your SCIPER as a variable below
SCIPER = 123456

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In this assignment, you will be working on natural language generation (NLG). You will be exploring various ways to generate text and demonstrate your understanding of decoding algorithms, the effect of their parameters, and NLG evaluation.
    
- In Part 1, you will implement two decoding algorithms (greedy and beam search), as well as two sampling algorithms (top-p and top-k), to replicate (to some extent) what you would get from Huggingface's `generate` function, which you used during the Week 6 exercise session.

- For Part 2, you will be implementing Contrastive Decoding, a method that combines the logits of two models at generation time.
    
- For Part 3, you will be evaluating NLG metrics for machine translation.

### Table of Contents
- **[Setup](#setup)**
    - [1) Google Colab Setup](#1-google-colab-setup)
    - [2) Local Setup](#2-local-setup)
    - [3) Rest of the Setup](#3-rest-of-the-setup-colab-and-local)

- **[PART 1: NLG Decoding and Sampling Algorithms](#part-1-nlg-decoding-and-sampling-algorithms)**
    - [1.1) Implement decoding and sampling algorithms](#11-implement-decoding-and-sampling-algorithms)
    - [1.2) Test your implementations](#12-testing-your-implementation)
    
- **[PART 2: Contrastive Decoding](#part-2-contrastive-decoding)**
    - [2.1) Implement the contrastive decoding method with adaptive plausibility constraint](#21-implement-contrastive-decoding-with-adaptive-plausibility-constraint)
    - [2.2) Evaluate your generations using the MAUVE metric](#22-evaluate-your-generations-using-the-mauve-metric)

- **[PART 3: MT Evaluation](#part-3-mt-evaluation)**
    - [3.1) Dataset and metrics analysis](#31-dataset-and-metrics-analysis)
    - [3.2) NLG metric calculation](#32-nlg-metric-calculation)
    - [3.3) Correlation calculation](#33-correlation-calculation)
    - [3.4) Correlation analysis](#34-correlation-analysis)

- **[PART 4: Checklist](#part-4-checklist)**
    
### Deliverables

To submit the deliverables, commit the following files to your GitHub Classroom repository:

- ✅ The jupyter notebook: `a3_notebook.ipynb`

- ✅ The python files:
    - [ ] `a3_utils.py`, if you added any helper functions
    - [ ] `a3_decoding.py`
    - [ ] `a3_sampling.py`
    - [ ] `a3_contrastive_decoding.py`
    - [ ] `a3_contrastive_main.py`
    - [ ] `a3_mt_eval.py`

- ✅ The Part 3 open answer MD file: `a3_mt_qa.md`

- ✅ The JSON files generated in Parts 2 & 3: 
    - [ ] `part2_contrastive_generations.json`
    - [ ] `part2_greedy_generations.json`
    - [ ] `part3_metrics.json` 
    - [ ] `part3_corr.json`

### Expected Workload & Resources

We expect the first part of the assignment, notably beam search, to take the longest, so plan your workload accordingly. Keep in mind that this is just our expectation; the same may not apply to everyone! Finishing Part 1 first will help with Part 2. Part 3 can be done independently.

This assignment does not necessarily require a GPU to be completed (*i.e.*, there is no training -- only inference), although some processes such as decoding and model-based metric calculation can be sped up on GPU. Therefore for Part 1's beam search, all of Part 2, and Part 3's metric calculation it may be a good idea to use a GPU.

### Grade Breakdown
Here is the general grade breakdown per section, to help with prioritization:

- Part 1: Decoding & Sampling algorithms (100)
    - Greedy decoding: 15
    - Beam decoding: 40
    - Top-k sampling: 20
    - Top-p sampling: 25
- Part 2: Contrastive Decoding (60)
- Part 3: MT Evaluation (60)

</div>

## Setup

First, if you are using Google Colab, go through parts (1) and (3) of the setup section. If you are using a local machine instead, go through parts (2) and (3) of the setup section.

- **PYTHON VERSION:** Colab uses **Python version 3.10.12** and that's the python version we will use to grade your assignments. If you use a different Python version, we will not deduct points! Just know that it may affect reproducibility of your code. So try to stick to this one.

- **PACKAGE VERSIONS:** We will not deduct points if you use different package versions.

### 1) Google Colab Setup

If you are using Google Colab for this assignment, you will need to run a few commands to set up our environment. If you are running this notebook on a local machine you can skip this section.
Run the following cell to mount your Google Drive. Follow the pop-up window and sign in to your Google account (*i.e.*, the same non-EPFL account you used to store this notebook).

In [None]:
# from google.colab import drive
# drive.mount("/content/drive")

Next, click the 4th button in the left sidebar (named Files), then click the 2nd button that appears in the panel (named Refresh). Under "/drive/MyDrive/", find the Assignment 3 repository folder that you uploaded to your Google Drive, copy its path, and fill it in the cell below. If everything is working correctly, then running the following cell should print the filenames from the assignment:

```
['part3_metrics.json', 'a3_eval_mauve.py', 'requirements.txt', 'figs', 'a3_sampling.py', 'README.md', 'a3_decoding.py', 'a3_tests.py', 'a3_contrastive_main.py', 'a3_contrastive_decoding.py', 'a3_mt_eval.py', 'a3_mt_qa.md', 'part3_corr.json', 'data', 'a3_utils.py', 'a3_notebook.ipynb']
```

In [None]:
# import os
# # TODO: Set `ROOT_PATH` to the location where you downloaded the Assignment folder
# ROOT_PATH = "/content/drive/MyDrive/..."  # Replace with your directory to A3 folder
# print(os.listdir(ROOT_PATH)) # Check the content of the path
# os.chdir(ROOT_PATH) # cd into directory
# print(os.listdir(".")) # Check the content of current folder

And the following cell should read "Hello, this is A3! You successfully linked your directory to Colab."


In [None]:
# from a3_tests import hello_A3
# hello_A3()

Before we start, we also need to run some boilerplate code to set up our environment, same as previous assignments.

In [None]:
# requirements = ROOT_PATH + "/requirements.txt"
# %pip install -r {requirements}

### 2) Local Setup

If you are using a local setup, make sure you have a Python environment created for Python 3.10.12. For example, you can use the following commands to create a conda environment if you haven't done so for previous exercises or assignments:

```shell
conda create -n="mnlp-venv" python=3.10.12
conda activate mnlp-venv
```

Then install the requirements:
```shell
pip install -r requirements.txt
```

And then make sure to set the kernel of this notebook to that environment.

In the cell below, set the relative path to your Assignment 3 folder. If everything is working correctly, then running the following cell should print the filenames from the assignment:

```
['part3_metrics.json', 'a3_eval_mauve.py', 'requirements.txt', 'figs', 'a3_sampling.py', 'README.md', 'a3_decoding.py', 'a3_tests.py', 'a3_contrastive_main.py', 'a3_contrastive_decoding.py', 'a3_mt_eval.py', 'a3_mt_qa.md', 'part3_corr.json', 'data', 'a3_utils.py', 'a3_notebook.ipynb']
```

In [None]:
import os

ROOT_PATH = "."
print(os.listdir(ROOT_PATH))

### 3) Rest of the setup (Colab and Local)

Now, run this cell to load the autoreload extension. This allows us to edit `.py` source files, and re-import them into the notebook for a seamless editing and debugging experience without needing to restart the kernel.

In [None]:
%load_ext autoreload
%autoreload 2

from a3_utils import *
from a3_tests import *

import json
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM
)

seed = SCIPER
torch.manual_seed(seed) # example to set the seed for randomness control
torch.set_printoptions(precision=16) # set print number precision at 16

(Optional) If you would like to set up CUDA with your notebook, you can check whether it's available. If it is not, either your local machine doesn't have GPUs or you need to change your runtime type to a GPU setting via `Edit -> Notebook Settings` on Colab.

In [None]:
if torch.cuda.is_available():
    print("GPU is available! Good to go.")
else:
    print(
        "If you are using Colab, please set your runtime type to a GPU via {Edit -> Notebook Settings}."
    )

And that's it! You are ready to start the assignment.

## **PART 1: NLG Decoding and Sampling Algorithms**
---

### 1.1) Implement decoding and sampling algorithms

For this part of the assignment, you will be implementing the decoding and sampling algorithms in the `a3_decoding.py` and `a3_sampling.py` files. These files include a main function that prints the behavior of your own implementation and of the huggingface `transformers` implementation.

[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is an autoregressive decoder-only Transformer model.
We expect your algorithms to only work on GPT-2 variants (e.g. `gpt2`, `gpt2-medium`, `gpt2-large`). The examples we provide work well on an instruction-finetuned version of GPT-2 small called `vicgalle/gpt2-alpaca-gpt4`, so feel free to just test on that one. To do inference with GPT-2, we use the generic `AutoModelForCausalLM` to load the `GPT2LMHeadModel` class (provided by the `transformers` package). We recommend you closely look at the [GPT2LMHeadModel documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) to understand what the model expects as input and what it returns.

<div style="padding:15px 20px 20px 20px;border-left:3px solid red;background-color:#C88B86;border-radius: 20px;color:#424242;">

#### ⚠️ **Warning** ⚠️

Here are some important points to not miss:
- We do not allow you to use the huggingface `generate` function. The point of the exercise is to implement the algorithms yourself.

- Please carefully read the detailed docstrings and the comments that start with `TODO` / `NOTE: caution`. While the `TODO`s will show where you should add code, the caution notes will point out what not to change.

- Feel free to add testing cells below those already provided in this notebook (in [1.2) Test your implementations](#12-testing-your-implementation)). You are also welcome to write tests in the main function of `a3_decoding.py` and `a3_sampling.py`

- **Decoding algorithms expectations:**
    - **Greedy:** your greedy decoder output **should** match huggingface's output. 
    - **Beam:** 
        - **Scores:** Sometimes for beam search, due to floating point precision or simply the way huggingface counts scores (*e.g.*, they sometimes threshold log probabilities), your scores may not align with theirs. Just aim for it to be around the same ballpark (~0.01 difference, but again we won't be so clear cut with this). We are not measuring how well you can replicate huggingface's implementation to the exact number. Therefore we will be flexible about this as long as your algorithm does the steps required. 
        - **Runtime:** Also, a `num_beams` larger than 15 can run longer than a minute; you don't have to write the most efficient beam search.

- **Sampling algorithms expectations:** Due to slight implementation differences (*e.g.*, how you decide to *not* sample certain tokens), even if the random seed is set at testing time, your results may widely differ. Therefore if the output and its shape do not exactly match huggingface's output for sampling algorithms, you don't have to worry (and if they do, you don't have to worry either :D). In this tricky case, you can verify your implementation by playing with the parameters and seeing if they behave the way you would expect (Hint: think about the extreme cases of these algorithms and check if the algorithms behave as you expect in those cases).

- **Read the documentation:** We already provide you with hints on how to pass tokens to the model (below and as a comment in the code); however, you are responsible for reading the [documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) to fully understand what the `model` function expects as input and what the model returns. This way you will figure out how to unroll the decoding process.

</div>

Here is a reasonable order to implement the classes and features in these python files, although you are welcome to do it in a different order:
- 🎯 Implement the `search` function in `GreedySearchDecoderForCausalLM` class in `a3_decoding.py`
- 🎯 Implement the `search` function in `BeamSearchDecoderForCausalLM` class in `a3_decoding.py` without *length_penalty*
- 🎯 Add *length_penalty* to the `BeamSearchDecoderForCausalLM` class in `a3_decoding.py`
- 🎯 Implement the `sample` function in `TopKSamplerForCausalLM` class in `a3_sampling.py` without *temperature*
- 🎯 Implement the `sample` function in `TopPSamplerForCausalLM` class in `a3_sampling.py` without *temperature*
- 🎯 Add *temperature* functionality to both `TopKSamplerForCausalLM` and `TopPSamplerForCausalLM`

**Do not try to implement all of them at once without testing.** For example, understanding why your greedy implementation fails can inform the rest. So keep testing your code in between the different algorithms. You can modify the main function in these files to test your functions. We have also created the following cells to call your functions and see if they implement the features we request. These tests do not cover all cases. We don't expect you to throw errors when the inputs don't match their constraints, but whenever the constraints are not met, we return None and print statements to help you debug.

<div style="padding:15px 20px 20px 20px;border-left:3px solid purple;background-color:#9A7696;border-radius: 20px;color:#424242;">

#### **Hints**

**Hint (#1):** There are 2 ways to pass the inputs to the model. Please open the
[GPT2LMHeadModel documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) and read the `Parameters` and `Returns` sections while looking at these hints. Either of these approaches is fine since we 
don't expect the most efficient solution, but they depend on how you decide to unroll the decoding process:

  1. **If recomputing the past decoder processes at each decoding step:**
      Since the tokenizer's output dictionary keys match these `Parameters` 
      (i.e., the arguments of the model), you can directly do:
      ```python
      >> self.model(**inputs)
      ```
      Just be careful and think about how you modify the "input_ids" 
      and "attention_mask" keys across decoding steps. 
  
  2. **If using cached decoder hidden states at each decoding step:**
      To speed up the process (although *not required*) you can also get 
      the computed key/values hidden-states *so far* with `use_cache=True`
      where in the first step you may need to do:
      ```python
      >> self.model(**inputs, use_cache=True)
      ```
      This will return an extra dictionary entry called "past_key_values".
      In the next steps you would do, assuming your previous output 
      dict is called `outputs`:
      ```python
      >> self.model(**inputs, use_cache=True, past_key_values=outputs["past_key_values"])
      ```
      Again, be careful as to how you modify the "input_ids" and 
      "attention_mask" keys across decoding steps. In particular the 
      cached setting expects them to be different from the non-cached 
      setting. Read the `Parameters` and `Returns` sections of the 
      GPT2LMHeadModel carefully.

**Hint (#2):** You can implement and use the `self.prepare_next_inputs` function 
in `a3_utils.py` inherited by all decoding and sampling classes 
(although you are not required to) to reduce repeated code and make it more readable.
There isn't a unique solution for this so use it as you wish or create another function in this super class.

**Hint (#3):** If you create a new tensor `tensor_b` and want it to be on the same device as another tensor `tensor_a`, you can do `tensor_b = tensor_b.to(tensor_a.device)`

</div>
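To make Hint #1 concrete, here is a minimal sketch of how the two input regimes differ from one decoding step to the next. It uses plain Python lists instead of tensors, and the helper names `next_inputs_no_cache` and `next_inputs_with_cache` are hypothetical (they are not part of the assignment files). Real inputs are 2D tensors of shape `(batch, sequence)`, so treat this only as an illustration of the bookkeeping.

```python
def next_inputs_no_cache(input_ids, attention_mask, new_token):
    """Recompute everything: the model sees the full sequence again."""
    return input_ids + [new_token], attention_mask + [1]


def next_inputs_with_cache(attention_mask, new_token):
    """Reuse cached keys/values: feed only the new token, but keep an
    attention mask that covers both the past and the new position."""
    return [new_token], attention_mask + [1]


prompt_ids, prompt_mask = [10, 11, 12], [1, 1, 1]  # toy token ids
full_ids, full_mask = next_inputs_no_cache(prompt_ids, prompt_mask, 42)
step_ids, step_mask = next_inputs_with_cache(prompt_mask, 42)

print(full_ids, full_mask)  # full sequence, mask of the same length
print(step_ids, step_mask)  # single new token, mask over past + new
```

In both regimes the attention mask grows by one entry per step; only the `input_ids` differ.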

### 1.2) Testing your implementation

Let's first load up the relevant language modeling class, its tokenizer, and the three datapoints you can test your algorithms on. Then we will create instances of our decoding and sampling classes. As mentioned before, you can test your implementation on an instruction-tuned GPT-2 small that we found on [Huggingface Model Hub](https://huggingface.co/vicgalle/gpt2-alpaca-gpt4) called `vicgalle/gpt2-alpaca-gpt4`. In short, this means that the model was further finetuned on a dataset in the format of instructions followed by demonstrations. The instruction datapoints come from the [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) dataset, which, if you are curious, you can load with:

```python
>> from datasets import load_dataset
>> dataset = load_dataset("tatsu-lab/alpaca")
```

But note that our final testing will not necessarily be on these, so don't try to hardcode your solutions to it ;)

> References: Here are the [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) and [Self-Instruct](https://arxiv.org/abs/2212.10560) papers if you are interested in the topic of instruction tuning!

In [None]:
################################################################################
# 1) Load model and tokenizer
# model_name = "gpt2" # NOTE: the quality of the generations will be low if you use gpt2
model_name = "vicgalle/gpt2-alpaca-gpt4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_name)
MAX_SEQ_LENGTH = tokenizer.model_max_length

################################################################################
# 2) Load relevant data
with open(os.path.join(ROOT_PATH, "data", "part1_input_data.json"), "r") as read_file:
    input_data = json.load(read_file)

# 3) Format the data in the instruction template: in particular,
#    we follow the template the model was finetuned on
all_inputs = []
for i, entry in enumerate(input_data):
    print("-" * 50)
    print(f"Entry #{i + 1}:")
    print("-" * 40)
    final_text = ""
    if entry["input"] == "":
        final_text = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        final_text += "\n\n### Instruction:\n" + entry["instruction"]
    else:
        final_text = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."
        final_text += "\n\n### Instruction:\n" + entry["instruction"]
        final_text += "\n\n### Input:\n" + entry["input"]
    final_text += "\n\n### Response:\n"

    print(final_text)

    ############################################################################
    # 4) Tokenize the relevant text
    curr_inputs = tokenizer(
        final_text,
        max_length=MAX_SEQ_LENGTH,
        truncation=True,
        return_tensors="pt",
    )

    all_inputs.append([f"Entry #{i + 1}", curr_inputs])

In [None]:
################################################################################
# 5) Put model & inputs on CUDA if available
print("-" * 50)
print("CUDA:")
print("-" * 40)


def get_device():
    device = None
    # NOTE: Feel free to uncomment the "mps" lines if you want to use the ARM GPU.
    # if torch.backends.mps.is_available():
    #     device = torch.device("mps")
    #     return device
    if torch.cuda.is_available():
        # TODO: if you want a specific GPU, you can choose which device here,
        #       otherwise picks the default 0 ID-ed GPU
        device = torch.device("cuda:0")
        print("CUDA was available, successfully put inputs and model on GPU device.")
    else:
        device = torch.device("cpu")
        print("Not putting inputs and model on any GPU device.")
    return device


device = get_device()
model = model.to(device)
for _, inputs in all_inputs:
    inputs["input_ids"] = inputs["input_ids"].to(device)
    inputs["attention_mask"] = inputs["attention_mask"].to(device)

Perfect! Now you can use the following tests to debug the behavior of your model. Feel free to play around with the parameters but do not change the `a3_tests.py` file. If you do, it won't affect your grading, but you might not get the right results when testing.

#### 1.2.a) Greedy search test

Greedy search is the first thing you can implement. The output here should match the huggingface implementation. Make sure you implement the *max_new_tokens* parameter behavior right!
- 🎯 You need to implement the `search` function in `GreedySearchDecoderForCausalLM` class in `a3_decoding.py` to run this test
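As a reminder of what greedy search does, here is a minimal sketch on a toy model. The function `toy_logits`, the 4-token vocabulary, and the `EOS` id are all hypothetical stand-ins; the real implementation works on model logits and tensors. The sketch shows the two stopping conditions: reaching `max_new_tokens` or producing the end-of-sequence token.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_logits(sequence):
    """Hypothetical stand-in for a language model: one score per token of a
    4-token vocabulary, depending only on the last token of the sequence."""
    table = {
        1: [0.1, 0.2, 2.0, 0.5],  # after token 1, token 2 scores highest
        2: [0.3, 0.1, 0.2, 1.5],  # after token 2, token 3 scores highest
        3: [2.5, 0.1, 0.4, 0.2],  # after token 3, EOS scores highest
    }
    return table[sequence[-1]]


def greedy_search(prompt, max_new_tokens):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(sequence)
        # Pick the single highest-scoring token at every step.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        sequence.append(next_token)
        if next_token == EOS:
            break
    return sequence


full = greedy_search([1], max_new_tokens=5)       # stops early at EOS
truncated = greedy_search([1], max_new_tokens=1)  # stops at the token budget
print(full, truncated)
```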


In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import greedy_test

# 1) Relevant parameters to greedy search
max_new_tokens = 20

# 2) Run it on the 3 examples
greedy_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    verbose=True,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

#### 1.2.b) Beam search test without length penalty

Next, we implement beam search. First, implement it without length penalty and see if the log probability scores you get are roughly close to huggingface's for different beam sizes (particularly smaller ones). The sequence should be mostly the same, independent of the beam width.
- 🎯 You need to implement the `search` function in `BeamSearchDecoderForCausalLM` class in `a3_decoding.py` to run this test (you don't have to implement the *length_penalty* feature to run it)
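The following is a minimal sketch of beam search without length penalty on a toy model, meant only to show the expand-then-prune loop. The function `toy_log_probs` and its 3-token vocabulary are hypothetical, and the sketch deliberately ignores end-of-sequence handling and batching, which your implementation needs to take care of.

```python
import math


def toy_log_probs(sequence):
    """Hypothetical 3-token model: the next-token distribution depends
    only on the last token of the sequence."""
    table = {
        0: [0.5, 0.4, 0.1],
        1: [0.1, 0.1, 0.8],
        2: [0.3, 0.3, 0.4],
    }
    return [math.log(p) for p in table[sequence[-1]]]


def beam_search(prompt, num_beams, max_new_tokens):
    beams = [(0.0, list(prompt))]  # (sum of log probabilities, token ids)
    for _ in range(max_new_tokens):
        candidates = []
        for score, seq in beams:
            for token, log_prob in enumerate(toy_log_probs(seq)):
                candidates.append((score + log_prob, seq + [token]))
        # Keep only the num_beams highest-scoring continuations.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


best_score, best_seq = beam_search([0], num_beams=2, max_new_tokens=2)[0]
greedy_score, greedy_seq = beam_search([0], num_beams=1, max_new_tokens=2)[0]
print(best_seq, math.exp(best_score))
print(greedy_seq, math.exp(greedy_score))
```

With `num_beams=1` the loop reduces to greedy search. In this toy example the wider beam finds a sequence with a higher total probability than the greedy one, because it keeps a second-best first token alive.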

In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import beam_test

# 1) Relevant parameters to beam search
max_new_tokens = 15
num_beams = 4
num_return_sequences = 4
length_penalty = 0.0

# 2) Run it on the 3 examples
beam_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    num_beams,
    length_penalty,
    num_return_sequences,
    verbose=True,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

#### 1.2.c) Beam search test with length penalty

Next, we implement length penalty for beam search. We do so by dividing the score (sum of the log probabilities) by the length of the generation raised to the power of the penalty. 
- 🎯 You need to implement the `search` function in `BeamSearchDecoderForCausalLM` class in `a3_decoding.py` with *length_penalty* to run this test. Make sure the no penalty setting outcome has not changed.
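As a small worked example of the rule above, the sketch below compares a short and a long hypothesis under the three penalty values used in the test cells. The helper name `length_penalized_score` and the toy scores are hypothetical, and which tokens count toward the length is an assumption here (the sketch counts generated tokens only).

```python
def length_penalized_score(sum_log_probs, generated_length, length_penalty):
    """Divide the summed log probability by length ** length_penalty."""
    return sum_log_probs / (generated_length ** length_penalty)


short_hyp = (-4.0, 4)   # (sum of log probabilities, number of generated tokens)
long_hyp = (-6.0, 12)

preferences = {}
for penalty in (0.0, 1.5, -1.5):
    short_score = length_penalized_score(*short_hyp, penalty)
    long_score = length_penalized_score(*long_hyp, penalty)
    preferences[penalty] = "long" if long_score > short_score else "short"

print(preferences)
```

Because log probabilities are negative, a positive penalty shrinks the magnitude of long hypotheses more and therefore favors them, a negative penalty does the opposite, and a penalty of `0.0` leaves the raw score unchanged.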

In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import beam_test

# 1) Relevant parameters to beam search + penalty
max_new_tokens = 15
num_beams = 5
num_return_sequences = 3
length_penalty = 1.5

# 2) Run it on the 3 examples
beam_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    num_beams,
    length_penalty,
    num_return_sequences,
    verbose=False,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import beam_test

# 1) Relevant parameters to beam search + penalty
max_new_tokens = 25
num_beams = 5
num_return_sequences = 3
length_penalty = -1.5

# 2) Run it on the 3 examples
beam_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    num_beams,
    length_penalty,
    num_return_sequences,
    verbose=False,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

#### 1.2.d) Top-K sampling without the temperature feature

Next, we implement top-k sampling. First, implement it without the temperature feature.
- 🎯 You need to implement the `sample` function in `TopKSamplerForCausalLM` class in `a3_sampling.py` to run this test (you don't have to implement the *temperature* feature to run it)
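Here is a minimal sketch of one top-k sampling step on plain Python lists. The helper names `top_k_filter` and `softmax` and the toy logits are hypothetical, and masking with `-inf` before the softmax is one possible way to exclude tokens, not necessarily the one you should use.

```python
import math
import random


def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf.
    Note: ties at the threshold are all kept in this sketch."""
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = softmax(top_k_filter(logits, top_k=2))

rng = random.Random(0)  # local generator so the sketch is reproducible
next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_token)
```

The extreme case `top_k=1` leaves all probability mass on the highest-scoring token, so sampling then behaves like greedy search.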

In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import top_k_test

# 1) Relevant parameters to top-k sampling
max_new_tokens = 20
top_k = 15
temperature = 1.0
seed = SCIPER

# 2) Run it on the 3 examples
top_k_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    top_k,
    temperature,
    seed,
    verbose=True,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

#### 1.2.e) Top-P sampling without the temperature feature

Then, we implement top-p sampling. First, implement it without temperature.
- 🎯 You need to implement the `sample` function in `TopPSamplerForCausalLM` class in `a3_sampling.py` to run this test (you don't have to implement the *temperature* feature to run it)
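Here is a minimal sketch of the top-p (nucleus) filtering step on a toy distribution. The helper name `top_p_filter` is hypothetical, and the exact boundary rule (whether the token that crosses the threshold is kept) is an assumption; implementations differ on this detail, which is one reason sampled outputs may not match huggingface's exactly.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / cumulative
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, top_p=0.7)   # keeps the two most probable tokens
tiny = top_p_filter(probs, top_p=0.1)      # keeps only the most probable token
everything = top_p_filter(probs, top_p=1.0)  # keeps the full distribution
print(nucleus, tiny, everything)
```

The two extreme cases are useful sanity checks: a very small `top_p` behaves like greedy search, and `top_p=1.0` samples from the unfiltered distribution.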


In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import top_p_test

# 1) Relevant parameters to top-p
max_new_tokens = 20
top_p = 0.92
temperature = 1.0
seed = SCIPER

# 2) Run it on the 3 examples
top_p_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    top_p,
    temperature,
    seed,
    verbose=True,  # NOTE: set it to True/False to show/hide the decoded sequence, can help debugging
    with_color=True,  # NOTE: set it to True/False to choose whether to display
    #                         the input sequence (red) and the generated sequence (green) in diff colors
)

#### 1.2.f) Temperature for both sampling

Now let's test what happens when you implement both sampling algorithms with temperature.

- 🎯 Add the *temperature* feature to both `TopKSamplerForCausalLM` and `TopPSamplerForCausalLM` classes in `a3_sampling.py` to run this test. Make sure the default no temperature setting outcome (*i.e.*, when `temperature=1.0`) has not changed.
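To build intuition for the temperature parameter, the sketch below divides toy logits by the temperature before the softmax. The helper name `softmax_with_temperature` is hypothetical, and applying the temperature before the top-k or top-p filter is an assumption of this sketch.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1 / temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.7)  # more peaked than the default
base = softmax_with_temperature(logits, 1.0)   # plain softmax
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform

for name, dist in (("0.7", sharp), ("1.0", base), ("2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 concentrates probability on the top tokens, a temperature above 1 spreads it out, and `temperature=1.0` leaves the distribution unchanged.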

In [None]:
from importlib import reload
import a3_tests
reload(a3_tests)
from a3_tests import top_k_test, top_p_test

# 1) Relevant parameters to sampling
max_new_tokens = 20
top_k = 15
top_p = 0.92
temperature = 0.7
seed = SCIPER

# 2) Run it on the 3 examples
top_k_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    top_k,
    temperature,
    seed,
    verbose=True,
    with_color=True,
)

print()
top_p_test(
    model,
    tokenizer,
    all_inputs,
    max_new_tokens,
    top_p,
    temperature,
    seed,
    verbose=True,
    with_color=True,
)

## **PART 2: Contrastive Decoding**
---

<!-- ![Contrast Decoding Figure](figs/cd_figure.png) -->
<center><img width="500" src="https://raw.githubusercontent.com/epfl-nlp/cs-552-modern-nlp/a64d201fb5fb359299769067dfd47bead4b30f94/Exercises/Week%206%20-%20Text%20Generation/figs/cd_figure.png"/></center>

**Paper Link**: https://arxiv.org/abs/2210.15097

In this section, you will be implementing the [Contrastive Decoding](https://arxiv.org/abs/2210.15097) (CD) method.
The CD objective returns the difference between the log-likelihoods of two language models (LMs): an expert model and an amateur model. The expert LM is usually larger in size than the amateur one.

The intuition behind CD lies in the observation that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference can be used as a signal to generate more plausible text. **Figure 1** above illustrates the CD method.

However, amateur LMs are not always mistaken: small language models still capture many simple aspects of English grammar and common sense (*e.g.*, subject-verb agreement). Thus, penalizing all behaviors from amateur LMs indiscriminately would penalize these simple aspects that are correct (False negative), and conversely reward implausible tokens (False positive). Therefore, the paper introduces the plausibility constraint, which complements the CD objective and avoids these failure modes.
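The false-positive failure mode and the effect of the plausibility constraint can be seen on a toy next-token distribution. This is a minimal sketch under stated assumptions: the helper name `contrastive_scores` and the probabilities are hypothetical, it works on a single step with plain lists, and it ignores the amateur temperature. Refer to Section 3 of the paper for the exact formulation.

```python
import math


def contrastive_scores(expert_probs, amateur_probs, alpha):
    """Score = log p_expert - log p_amateur for tokens the expert finds
    plausible (p_expert >= alpha * max p_expert), and -inf otherwise."""
    cutoff = alpha * max(expert_probs)
    return [
        math.log(pe) - math.log(pa) if pe >= cutoff else -math.inf
        for pe, pa in zip(expert_probs, amateur_probs)
    ]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


expert = [0.60, 0.30, 0.099, 0.001]
amateur = [0.70, 0.10, 0.1999, 0.0001]

constrained = contrastive_scores(expert, amateur, alpha=0.1)
unconstrained = contrastive_scores(expert, amateur, alpha=0.0)
print(argmax(constrained), argmax(unconstrained))
```

Without the constraint (`alpha=0.0`), token 3 wins even though the expert gives it a probability of only 0.001, simply because the amateur finds it even less likely. With `alpha=0.1` that token is filtered out and token 1, which the expert finds plausible and prefers more strongly than the amateur does, is selected.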

#### 🎯 TODO: In order to understand how contrastive decoding works please read **Section 3** of [the paper](https://arxiv.org/abs/2210.15097). You do not need to read the other sections to understand the method.

### 2.1) Implement Contrastive Decoding with Adaptive Plausibility Constraint

In this implementation, you will use `GPT2-Large` as the expert model and `GPT2-Small` as the amateur model (referred to as `gpt2-large` and `gpt2` on HuggingFace respectively). You will be using a subset of the [CC-News](https://huggingface.co/datasets/cc_news) dataset, which contains news articles from news sites all over the world. The task is to take the first 32 tokens of each article as the prompt, then continue generating the next 128 tokens using the Contrastive Decoding (CD) objective subject to the adaptive plausibility constraint.

🎯 TODO - The files you should now look at are:

1. `a3_contrastive_main.py`: Just read this file to understand how the data is processed.
2. `a3_contrastive_decoding.py`
    - In this second file there are two main functions you will be implementing:
        - `greedy_search_with_contrastive_decoding`
        - `filter_using_adaptive_plausibility_constraint`

You can interface with these files using command line arguments as shown below.

<div style="padding:15px 20px 20px 20px;border-left:3px solid red;background-color:#C88B86;border-radius: 20px;color:#424242;"> 

⚠️ **Warning** ⚠️
Note that `GPT2-Large` will take 3.25GB of disk space and `GPT2-Small` 548MB, so make sure you have enough space for both of them (or use Colab 😄)
</div>


In [None]:
# # NOTE: you can uncomment and run the following if Colab gives you "NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968"
# #       the problem stems from having too much output text
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!python a3_contrastive_main.py \
    --model_name_or_path gpt2-large \
    --amateur_name_or_path gpt2 \
    --decoding-mode contrastive \
    --amateur_temperature 0.5 \
    --max-new-tokens 128 \
    --num-generations 50 \
    --alpha 0.1 \
    --outfile part2_contrastive_generations.json

#### 2.1.1) Do the same but using greedy search for comparison

In [None]:
!python a3_contrastive_main.py \
    --model_name_or_path gpt2-large \
    --decoding-mode greedy \
    --max-new-tokens 128 \
    --num-generations 50 \
    --outfile part2_greedy_generations.json

### 2.2) Evaluate Your Generations Using The MAUVE Metric

Now that you have implemented contrastive decoding, let's make sure it's generating higher quality text than plain greedy decoding. For this we can use the MAUVE metric, which was introduced in [this paper](https://arxiv.org/abs/2102.01454). In short, MAUVE measures the distribution similarity between the set of generated texts and the set of gold references.

In this part of the assignment you do not need to implement anything. However, please verify that the generations you get using the contrastive decoding method receive a higher score than the greedy generations.

In [None]:
# Evaluate contrastive decoding generations
!python a3_eval_mauve.py --filepath part2_contrastive_generations.json

In [None]:
# Evaluate greedy generations
!python a3_eval_mauve.py --filepath part2_greedy_generations.json

## **PART 3: MT Evaluation**
---

In this section, you will be evaluating several NLG metrics to understand which ones correlate best with human judgement when it comes to machine translation (MT), a popular task in NLG. We will first take a close look at one of the [Workshop on Statistical Machine Translation](https://aclanthology.org/venues/wmt/) (WMT) datasets. WMT provides several challenges each year it's hosted to evaluate MT systems (*e.g.* you can consider Google Translate as a system), and also to evaluate the metrics we use. You read that right: we will evaluate evaluation. While there are several WMT datasets that you can find [here](https://github.com/Unbabel/COMET/blob/master/data/README.md), we will focus on a Direct Assessment (DA) dataset.

### 3.1) Dataset and metrics analysis

The DA scores are human evaluation scores: given a **source** (`src`) sentence in one language and a target **reference** (`ref`) in a different language, the annotator rates how good the **hypothesis** (`mt`), in other words the prediction of the machine translation system, is. Such scores are typically either calculated from several criteria and then aggregated with weighting (also called Multidimensional Quality Metrics - MQM), or collected as raw scores on a given scale and then normalized across annotators. The DA scores in the dataset we will evaluate were collected in the latter way.
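To make "normalized across annotators" more concrete, here is a minimal sketch of one common normalization, a per-annotator z-score. It assumes toy data and hypothetical column names (`annotator_id`, `raw`); whether the dataset used exactly this recipe is an assumption, and the provided CSV only contains the final `score` column.

```python
import pandas as pd

# Toy raw scores on a 0-100 scale. The column names are hypothetical and
# do not correspond to columns of the assignment's CSV file.
raw_df = pd.DataFrame(
    {
        "annotator_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
        "raw": [60.0, 70.0, 80.0, 10.0, 50.0, 90.0],
    }
)


def z_normalize_per_annotator(df, annotator_col="annotator_id", score_col="raw"):
    """Subtract each annotator's mean and divide by their (sample) std."""
    grouped = df.groupby(annotator_col)[score_col]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # NaN or 0 if an annotator has < 2 distinct scores
    return (df[score_col] - mean) / std


raw_df["z"] = z_normalize_per_annotator(raw_df)
print(raw_df)
```

In this toy example annotator `a1` uses a narrow part of the scale and `a2` a wide one, yet both end up with the same z-scores (-1, 0, 1), which is the point of normalizing before comparing annotators.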

We subsampled a huggingface dataset instance of the WMT Shared Tasks which can be found [here](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation). In particular, we subsampled within each source sentence (`src`) and language pair (`lp`) group from 2019 so that we get roughly equal amounts of data in each category. Let's do a quick analysis to get you accustomed to the data format:

In [None]:
import os
import pandas as pd


def cnt_unique_prop(dataset, prop_name, do_print=True):
    uniques = dataset[prop_name].unique().tolist()
    print("(" + prop_name + ") # of unique instances: ", len(uniques))
    if do_print:
        uniques.sort()
        print("(" + prop_name + ") unique instances: ", uniques, "\n")


wmt_da_df = pd.read_csv(os.path.join("data", "wmt-da-human-evaluation_filtered.csv"))
print("-" * 80)
print("Dataset properties: ")
print("-" * 80)
print("Dataset size: ", len(wmt_da_df))
print("Features: ", wmt_da_df.columns.to_list(), "\n")

cnt_unique_prop(wmt_da_df, "year", do_print=True)
cnt_unique_prop(wmt_da_df, "domain", do_print=True)
cnt_unique_prop(wmt_da_df, "lp", do_print=True)
cnt_unique_prop(wmt_da_df, "src", do_print=False)
cnt_unique_prop(wmt_da_df, "mt", do_print=False)
cnt_unique_prop(wmt_da_df, "ref", do_print=False)
cnt_unique_prop(wmt_da_df, "annotators", do_print=False)
cnt_unique_prop(wmt_da_df, "entry_id", do_print=False)

print("\nRange of DA:")
print("- max: ", wmt_da_df["score"].max())
print("- min: ", wmt_da_df["score"].min())

wmt_da_df.head()

Now let's check roughly how many translations we have:
- per `lp` -- *i.e.* language pair -- group
- per `src` and `lp` -- *i.e.* source sentence and language pair -- group

In [None]:
import matplotlib.pyplot as plt

count_df = wmt_da_df.groupby(["lp"]).size().reset_index(name="count")
sorted_count_df = count_df.sort_values(by="count", ascending=False)
plt.figure(figsize=(8, 5))
plt.bar(
    sorted_count_df["lp"],
    sorted_count_df["count"],
    color="purple",
)
plt.xlabel("Language Pair")
plt.ylabel("Count")
plt.title("Count Histogram for Each Language Pair")
plt.xticks(rotation=45)
plt.show()

In [None]:
import numpy as np  # needed for np.arange below

count_df = wmt_da_df.groupby(["src", "lp"]).size().reset_index(name="count")
sorted_count_df = count_df.sort_values(by="count", ascending=False)
plt.figure(figsize=(8, 5))
plt.bar(
    np.arange(len(sorted_count_df)),
    sorted_count_df["count"],
    color="green",
)
plt.xlabel("Source-LP Group")
plt.ylabel("Count")
plt.title("Count Histogram for src-lp group")
plt.xticks([], rotation=45)
plt.show()

As you can see, we have roughly 100 entries per language pair and around 5 hypotheses for each source/lp pair, making the data overall uniform across the different categories.

For MT metrics, we will focus on [BLEU](https://aclanthology.org/P02-1040/), [BERTScore](https://arxiv.org/abs/1904.09675) and [COMET](https://aclanthology.org/2020.emnlp-main.213/) (not to be confused with many other methods and tools called by the same name 😉 such as [[1]](https://arxiv.org/abs/1906.05317) and [[2]](https://www.comet.com/site/)):

- **BLEU (BiLingual Evaluation Understudy)**:
    You have already seen BLEU in Week 6's exercise session. The BLEU score measures the similarity between the gold reference and the generated candidate. 100 means they are a perfect match, and 0 means none of the n-gram precisions get fulfilled. Note that this range may instead be 1.0 to 0.0 depending on the implementation, such as the huggingface `evaluate` one:

    - **bleu:** As you learned in the exercise, the BLEU score is calculated by comparing the $n$-grams and combining their average precisions from $n$=1 to $n$=4 with a brevity penalty.
    - **bleu-1:** Geometric mean of 1-gram precisions; it does not use the brevity penalty.
    - **bleu-4:** Geometric mean of 4-gram precisions; it does not use the brevity penalty.
    - **bleu-brevity-penalty:** The brevity penalty penalizes generated sentences that are *too short* compared to the closest reference length with exponential decay.
- For **BERTScore** and **COMET**, you will answer the following questions to better understand how they calculate the metrics.
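The BLEU components listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions (a single reference, whitespace tokenization, no smoothing, scores in the 0.0 to 1.0 range), not the huggingface `evaluate` implementation; all function names here are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as often as it appears in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates longer than the reference, exponential decay otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def toy_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, a single zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(candidate), len(reference)) * geo_mean


reference = "the cat sat on the mat".split()
short_candidate = "the cat sat on".split()

# All n-gram precisions are 1.0 here, so only the brevity penalty lowers the score.
print(round(toy_bleu(short_candidate, reference), 4))  # 0.6065
```

The short candidate shows why the brevity penalty matters: every n-gram it contains appears in the reference, so precision alone would give it a perfect score even though it drops the end of the sentence.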

#### 🎯 TODO: Go to `a3_mt_qa.md` and answer Q1 and Q2.

### 3.2) NLG metric calculation

Now that you have seen parts of the dataset and the metrics, it's time to calculate scores over the generated hypotheses. You will now head to `a3_mt_eval.py` and complete the *TODO* statements in the `calculate_metrics` function. In particular, you will be calculating [BLEU](https://huggingface.co/spaces/evaluate-metric/bleu), [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore), and [COMET](https://huggingface.co/spaces/evaluate-metric/comet) with the linked huggingface evaluate functions.

While a GPU may help speed up the calculation in this part, it is not necessary (it will take ~10-20 min on CPU). If you prefer to run the function directly from the command line, feel free to run `python a3_mt_eval.py` and skip the following cell. Otherwise, run the following cell to generate the `part3_metrics.json` file.

**Hint:** It is okay to calculate each metric independently (*e.g.*, by commenting out the code for metrics you have already calculated) and save them in different files; you don't necessarily have to make the following work in one go. Just make sure they eventually get saved in a joint file called `part3_metrics.json` so that you can use it for the correlation calculation, and make sure all of your code is uncommented at submission time.

In [None]:
from a3_mt_eval import calculate_metrics

calculate_metrics()

Here's a quick assertion statement to make sure that your generated file is correctly formatted:

In [None]:
from a3_mt_eval import load_json

metric_dict = load_json("part3_metrics.json")
k_list = list(metric_dict.keys())
assert set(k_list) == set(
    [
        "bleu",
        "bleu-1",
        "bleu-4",
        "bertscore-precision",
        "bertscore-recall",
        "bertscore-f1",
        "comet",
    ]
)
for k in k_list:
    assert isinstance(metric_dict[k], dict)
    assert len(metric_dict[k]) == 1723
    assert "3" in metric_dict[k].keys()
    assert isinstance(metric_dict[k].get("3"), float)

### 3.3) Correlation calculation

We have successfully calculated the NLG metrics for each hypothesis and stored them in `part3_metrics.json`. Now, we will calculate how much these scores correlate with human judgements in the `evaluate_metrics` function in `a3_mt_eval.py`. We will calculate [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), which requires ordinal, or in other words, ranked associations between data. One way we can turn the DA scores into ranks is to consider the human scores for the various hypotheses of the same **source** and **language pair** as defining such ranks. For that you will create (worst_hypothesis, better_hypothesis) pair combinations. After getting the data you will calculate the correlation between the human scores and the metric scores and save it in `part3_corr.json`.

Note that there are several ways to calculate Kendall's tau. We would like you to implement the first definition given in [this Wikipedia article](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient). You won't be penalized if you calculate a different version, but the results may differ.
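As a point of reference, here is a minimal sketch of that first definition (often called tau-a): the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It is a generic version over two plain score lists with toy values; the function name is hypothetical, and how you build the pairs within each source/language-pair group in `a3_mt_eval.py` is left to you.

```python
from itertools import combinations


def kendall_tau_a(human_scores, metric_scores):
    """(concordant - discordant) / (n choose 2) over all index pairs.

    Pairs tied in either list count as neither concordant nor discordant.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("Both score lists must have the same length.")
    n = len(human_scores)
    if n < 2:
        raise ValueError("At least two items are needed to form a pair.")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells us whether both lists order
        # items i and j the same way (positive) or oppositely (negative).
        product = (human_scores[i] - human_scores[j]) * (metric_scores[i] - metric_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy scores for four hypotheses of the same source sentence.
human = [0.1, 0.5, 0.9, -0.3]
metric = [0.2, 0.4, 0.3, 0.1]

# 5 of the 6 pairs are ordered the same way and 1 is flipped: (5 - 1) / 6.
print(round(kendall_tau_a(human, metric), 4))  # 0.6667
```

A value of 1.0 means the metric ranks every pair exactly as the human annotators do, and -1.0 means it reverses every pair.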

You will not need a GPU for this portion of the task. As in the previous subsection, if you prefer to run the function directly from the command line, feel free to set `already_predicted_scores=True` in the main function and run `python a3_mt_eval.py` instead of the following cell. Otherwise, run the following cell to generate the `part3_corr.json` file:

In [None]:
from a3_mt_eval import evaluate_metrics

evaluate_metrics()

Here's a quick assertion statement to make sure that your generated file is correctly formatted:

In [None]:
from a3_mt_eval import load_json

corr_dict = load_json("part3_corr.json")
k_list = list(corr_dict.keys())
assert set(k_list) == set(
    [
        "bleu",
        "bleu-1",
        "bleu-4",
        "bertscore-precision",
        "bertscore-recall",
        "bertscore-f1",
        "comet",
    ]
)
for k in k_list:
    assert isinstance(corr_dict[k], float)

Awesome! ✨ All you have left to do now is to analyze the results.

### 3.4) Correlation analysis

Here is a sorted version of the results you got:

In [None]:
from a3_mt_eval import load_json

corr_dict = load_json("part3_corr.json")
dict(sorted(corr_dict.items(), key=lambda item: item[1]))

#### 🎯 TODO: Go to `a3_mt_qa.md` and answer Q3 and Q4.

## **PART 4: Checklist**
---

<div style="padding:15px 15px 15px 15px;border-left:3px solid #8e7cc3;background-color:#e4e1eb;border-radius: 15px;color:#424242;">

🎉 Excellent work! You just reached the end of the assignment. 

To submit the deliverables, commit the following files to your GitHub Classroom repository:

- ✅ The jupyter notebook: `a3_notebook.ipynb`

- ✅ The python files:
    - [ ] `a3_utils.py`, if you added any helper functions
    - [ ] `a3_decoding.py`
    - [ ] `a3_sampling.py`
    - [ ] `a3_contrastive_decoding.py`
    - [ ] `a3_contrastive_main.py`
    - [ ] `a3_mt_eval.py`

- ✅ The Part 3 open answer MD file: `a3_mt_qa.md`

- ✅ The JSON files generated in Parts 2 & 3: 
    - [ ] `part2_contrastive_generations.json`
    - [ ] `part2_greedy_generations.json`
    - [ ] `part3_metrics.json` 
    - [ ] `part3_corr.json`

</div>