#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **< First Last >**
> - ✉️ Email: **< first.last >@epfl.ch**
> - 🪪 SCIPER: **XXXXXX**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (finetuning) and evaluation of a pre-trained language model ([RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)) on a **Sentiment Analysis (SA)** task, which aims to determine whether a product review's emotional tone is positive or negative.

- In Part 2, building on the finetuned model from Part 1, you will identify the shortcuts (i.e., some salient or toxic features) that the model learnt for this task.

- In Part 3, you will annotate 80 randomly assigned new datapoints with ground-truth labels. One or two other annotators will cross-annotate the same datapoints, and you will learn how to calculate agreement statistics, an important indicator of the quality of a collected dataset.

- In Part 4, since human annotation is time- and effort-consuming, you will turn to automatic labeling, which offers many ways to obtain silver labels and enlarge the dataset, e.g., paraphrasing each text input in different words without changing its meaning. You will use a [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) paraphrase model to expand the sentiment analysis training data, and evaluate the improvement brought by data augmentation.

For Part 1 and Part 2, you will need to complete the code in the corresponding `.py` files (`sa.py` for Part 1, `shortcut.py` for Part 2). You will be provided with the function descriptions and detailed instructions about the code snippets you need to write.


### Table of Contents
- **PART 1: Sentiment Analysis (33 pts)**
    - 1.1 Dataset Processing (10 pts)
    - 1.2 Model Training and Evaluation (18 pts)
    - 1.3 Fine-Grained Validation (5 pts)
- **PART 2: Identify Model Shortcuts (22 pts)**
    - 2.1 N-gram Pattern Extraction (6 pts)
    - 2.2 Distill Potentially Useful Patterns (8 pts)
    - 2.3 Case Study (8 pts)
- **PART 3: Annotate New Data (25 pts)**
    - 3.1 Write an Annotation Guideline (5 pts)
    - 3.2 Annotate Your Datapoints with Partner(s) (8 pts)
    - 3.3 Agreement Measure (12 pts)
- **PART 4: Data Augmentation (20 pts)**
    - 4.1 Data Augmentation with Paraphrasing (15 pts)
    - 4.2 Retrain RoBERTa Model with Data Augmentation (5 pts)
    
### Deliverables

- ✅ This jupyter notebook: `assignment2.ipynb`
- ✅ `sa.py` and `shortcut.py` files
- ✅ Checkpoints for RoBERTa models finetuned on original and augmented SA training data (Part 1 and Part 4), including:
    - `models/lr1e-05-warmup0.3/`
    - `models/lr2e-05-warmup0.3/`
    - `models/augmented/lr1e-05-warmup0.3/`
- ✅ Model prediction results on each domain data (Part 1.3 Fine-Grained Validation): `predictions/`
- ✅ Cross-annotated new SA data (Part 3), including:
    - `data/<your_assigned_dataset_id>-<your_sciper_number>.jsonl`
    - `data/<your_assigned_dataset_id>-<your_partner_sciper_number>.jsonl`
    - (for group of 3) `data/<your_assigned_dataset_id>-<your_second_partner_sciper_number>.jsonl`
- ✅ Paraphrase-augmented SA training data (Part 4), including:
    - `data/augmented_train_sa.jsonl`
- ✅ `./tensorboard` directory with logs for all trained/finetuned models, including:
    - `tensorboard/part1_lr1e-05/`
    - `tensorboard/part1_lr2e-05/`
    - `tensorboard/part4_lr1e-05/`

### How to implement this assignment

Please read the following points carefully. All the information on how to read, implement and submit your assignment is explained in detail below:

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook `assignment2.ipynb`** and the **`sa.py`**, **`shortcut.py`** python files.

2. Along with the above files, you also need to upload the model files under the **`models/`** dir for the following models:
    - finetuned RoBERTa models on original SA training data (PART 1)  
    - finetuned RoBERTa model on augmented SA training data (PART 4)

3. You also need to upload model prediction results in Part 1.3 Fine-Grained Validation, saved in **`predictions/`**.

4. You also need to upload new data files under the **`data/`** dir (along with our already provided data), including:
    - new SA data with your and your partner's annotations (Part 3)
    - paraphrase-augmented SA training data (Part 4)

5. Finally, you will need to log your training using Tensorboard. Please follow the instructions in the `README.md` of the **``tensorboard/``** directory.

**Note**: Large files such as model checkpoints and logs should be pushed to the repository with Git LFS. Training the models on a GPU speeds up the process; we recommend using Colab's free GPU service for this. A tutorial on how to use Git LFS and Colab can be found [here](https://github.com/epfl-nlp/cs-552-modern-nlp/blob/main/Exercises/tutorials.md).
    
</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Environment Setup**

### **Option 1: creating your own environment**

```
conda create --name mnlp-a2 python=3.10
conda activate mnlp-a2
pip install -r requirements.txt
```

**Note**: If some package versions in our suggested environment do not work, feel free to try other package versions suitable for your computer, but remember to update ``requirements.txt`` and explain the environment changes in your notebook (no penalty for this if necessary).

### **Option 2: using Google Colab**
If you are using a Google Colab notebook for this assignment, you will need to run a few commands to set up the environment on Google Colab, as shown below:
    
</div>

In [None]:
# This cell makes sure modules are auto-loaded when you change external python files
%load_ext autoreload
%autoreload 2

In [None]:
# If you are working in Colab, then consider mounting your assignment folder to your drive
from google.colab import drive
drive.mount('/content/drive')

# Direct to your assignment folder.
%cd /content/drive/MyDrive/path-to-your-assignment-folder

Install packages that are not included in the Colab base environment:

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

# Install dependencies
!pip install -r requirements.txt

In [None]:
import numpy as np
import jsonlines
import random

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# TODO: Enter your Sciper number
SCIPER = '...'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

In [None]:
# Check the availability of GPU (proceed only if it prints 'Good to go!')
if torch.cuda.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">
    
# PART 1: Sentiment Analysis (33 pts)

In this part, we will finetune a pretrained language model (RoBERTa) on a sentiment analysis (SA) task.

> Specifically, we will focus on a binary sentiment classification task for multi-domain product reviews. It requires the model to **classify a given paragraph of review by its sentiment polarity (positive or negative)**. 

</div>

### Load Training Dataset (`train_sa.jsonl`) 

**You can run the following cell to take a first look at your data**. Each data sample is a Python dictionary, which consists of the following components:
- input review (*'review'*): a natural language sentence or a paragraph commenting about a product.
- domain (*'domain'*): describing the type of product being reviewed.
- label of sentiment (*'label'*): indicating whether the review states positive or negative views about the product.

In [None]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid % 200 == 0:
            print(sample)

In [None]:
# We use the following pretrained tokenizer and model
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

## 🎯 Q1.1: **Dataset Processing (10 pts)**

Our first step is to construct a PyTorch Dataset for the SA task. Specifically, we will need to implement **tokenization** and **padding** using a HuggingFace pre-trained tokenizer.

**TODO🔻: Complete `SADataset` class following the instructions in `sa.py`, and test by running the following cell.**
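
To make the padding step concrete, here is a minimal sketch in plain Python of what padding and attention masks look like for a batch. It is not the expected `SADataset` implementation: the helper name `pad_batch`, the toy token ids and the `pad_id` value are assumptions made for illustration only.

```python
from typing import Dict, List


def pad_batch(token_ids: List[List[int]], pad_id: int, max_length: int) -> Dict[str, List[List[int]]]:
    """Truncate/pad every sequence to a common length and build attention masks."""
    # Pad to the longest sequence in the batch, but never beyond max_length
    target_len = min(max_length, max(len(ids) for ids in token_ids))
    input_ids, attention_mask = [], []
    for ids in token_ids:
        ids = ids[:target_len]  # naive truncation (real tokenizers also keep the closing special token)
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # right-pad with the pad token id
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids, made up for illustration (not real tokenizer output)
batch = pad_batch([[0, 713, 2], [0, 100, 657, 24, 2]], pad_id=1, max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

In practice, a HuggingFace tokenizer can do this for you when it is called with its `padding`, `truncation` and `max_length` arguments; follow the instructions in `sa.py` for the exact behaviour expected here.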

In [None]:
from sa import SADataset
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
dataset = SADataset("data/train_sa.jsonl", tokenizer)

In [None]:
from testA2 import test_SADataset
test_SADataset(dataset)

## 🎯 Q1.2: **Model Training and Evaluation (18 pts)**

Next, we will implement the training and evaluation process to finetune the model. 

- For training: you will need to calculate the **loss** and update the model weights using the **Adam optimizer**. Additionally, we add a **learning rate scheduler** so that the learning rate adapts over the course of training.

- For evaluation: you will need to compute the **confusion matrix** and **F1 scores** to assess the model performance.

**TODO🔻: Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `sa.py` file. You can test `compute_metrics()` by running the following cell.**
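
As a reminder of how the evaluation metrics relate to each other, the sketch below computes a 2x2 confusion matrix and per-class F1 scores on toy labels. The function name `binary_metrics`, the label encoding (1 = positive, 0 = negative) and the matrix layout are assumptions; the signature and layout required by `compute_metrics()` are the ones described in `sa.py`.

```python
from typing import List, Tuple


def binary_metrics(labels: List[int], preds: List[int]) -> Tuple[List[List[int]], float, float]:
    """Return (confusion matrix, F1 of the positive class, F1 of the negative class).

    Assumed layout: confusion[true_label][predicted_label], with 1 = positive, 0 = negative.
    """
    confusion = [[0, 0], [0, 0]]
    for y, p in zip(labels, preds):
        confusion[y][p] += 1

    def f1(cls: int) -> float:
        tp = confusion[cls][cls]      # correctly predicted as cls
        fp = confusion[1 - cls][cls]  # other class predicted as cls
        fn = confusion[cls][1 - cls]  # cls predicted as the other class
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    return confusion, f1(1), f1(0)


# Toy labels and predictions, made up for illustration
labels = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 1]
confusion, f1_pos, f1_neg = binary_metrics(labels, preds)
macro_f1 = (f1_pos + f1_neg) / 2  # Macro-F1 is the unweighted mean of the per-class F1 scores
print(confusion, f1_pos, f1_neg, macro_f1)
```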

In [None]:
from sa import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

#### **Start Training and Validation!**

TODO🔻: (1) [coding question] Train the model with the following two different learning rates (other hyperparameters should be kept consistent). 

> A. learning_rate = 1e-5

> B. learning_rate = 2e-5

**Note:** *Each training will take ~7-10 minutes using a T4 Colab GPU.*
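
To build intuition for how the learning rate interacts with a warmup-based scheduler, the sketch below computes a linear warmup followed by a linear decay in plain Python. This particular schedule shape is an assumption (it is one common choice, matching what `get_linear_schedule_with_warmup` in `transformers` implements); the scheduler you actually need is the one specified in `sa.py`. The function name and `total_steps` value are hypothetical.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_percent: float, peak_lr: float) -> float:
    """Learning rate at a given step: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_percent)
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from peak_lr to 0 at the last step
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000  # hypothetical number of optimizer steps
for peak_lr in (1e-5, 2e-5):
    schedule = [linear_warmup_decay(s, total_steps, 0.3, peak_lr) for s in (0, 150, 300, 650, 1000)]
    print(peak_lr, schedule)
```

Under this assumption, the learning rate you set is the peak value reached at the end of warmup, so setting B doubles the step size at every point of the schedule compared with setting A.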

In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3

In [None]:
learning_rate = 1e-5  # play around with this hyperparameter

train(...,
      model_save_root='models/', tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

TODO🔻: (2) [textual question] compare and discuss the results. 

- Which learning rate is better? Explain your answers.

## 🎯 Q1.3: **Fine-Grained Validation (5 pts)**

TODO🔻: (1) [coding question] Using the model checkpoint trained with the first learning_rate setting (lr=1e-5), check the model performance on each domain subset of the validation set. You should report **the validation loss**, **confusion matrix**, **F1 scores** and **Macro-F1 on each domain**.

In [None]:
# Split the test sets into subsets with different domains
# Save the subsets under 'data/'
# Replace "..." with your code
domain_data = {}
...

for domain, samples in domain_data.items():
    with jsonlines.open("data/test_sa_"+domain+".jsonl", mode="w") as writer:
        for sd in samples:
            writer.write(sd)

In [None]:
learning_rate = 1e-5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = ...
model = ...
model.to(device)

results_save_dir = 'predictions/'

# Evaluate and save prediction results in each domain
# Replace "..." with your code
for domain in [...]:

    dev_loss, confusion, f1_pos, f1_neg = evaluate(...,
                                                   result_save_file='predictions/test_'+domain+'.jsonl')
    macro_f1 = (f1_pos + f1_neg) / 2

    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f}')
    print(f'Confusion Matrix:')
    print(confusion)
    print(f'F1: ({f1_pos*100:.2f}%, {f1_neg*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

TODO🔻: (2) [textual question] compare and discuss the results. 

**Questions:**
- On which domain does the model perform the best? the worst?
- Give some possible explanations of why the model's best-performing domain is easier, and why its worst-performing domain is more challenging. Use some examples to support your explanations.

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the `predictions/` folder, by specifying the `result_save_file` parameter in the *evaluate* function.

(Write your answer to the questions here.)

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 2: Identify Model Shortcuts (22 pts)

In this part, we aim to find the shortcut features learnt by the sentiment analysis model we trained in Part 1. We will be using the model checkpoint trained with `learning rate=1e-5`.

</div>

## 🎯 Q2.1: **N-gram Pattern Extraction (6 pts)**
We hypothesize that `n-gram`s could be the potential shortcut features learnt by the SA model. An `n-gram` is defined as a sequence of n consecutive words appearing in a natural language sentence or paragraph.

Thus, we aim to extract n-grams that appear in a review and may serve as key indicators of the polarity of the review's sentiment, for example:

>- **Review 1**: This book was **horrible**. If it was possible to rate it **lower than one star** I would have.
>- **Review 2**: **Excellent** book, **highly recommended**. Helps to put a realistic perspective on millionaires.

For Review 1, the `1-gram "horrible"` and the `4-gram "lower than one star"` serve as two key indicators of negative sentiment. For Review 2, the `1-gram "excellent"` and the `2-gram "highly recommended"` clearly indicate positive sentiment.

TODO🔻: (1) [coding question] Complete `ngram_extraction()` function in `shortcut.py` file.

The returned *ngrams* is a **list** of dictionaries. The `n-th` **dictionary** corresponds to the `n-grams` (n=1, 2, 3, 4).

The keys of each dictionary should be the **unique n-gram strings** appearing in the reviews, and the value of each n-gram key records the frequency of positive/negative predictions **made by the model** when the n-gram appears in the review, i.e., `[#positive_predictions, #negative_predictions]`.

> Example: **`ngrams[0]['horrible'][0]`** should return the number of positive predictions made by the model when the 1-gram token 'horrible' appears in the given review, i.e., the first entry of `[#positive_predictions, #negative_predictions]`.

**Note:** (1) Any sequence that contains punctuation should NOT be counted as an n-gram (e.g. `it is great .` is NOT a 4-gram, but `it is great` is a 3-gram); (2) Stop-words should NOT be counted as 1-grams, but can appear in longer n-gram sequences (e.g. `is` is NOT a 1-gram, but `it is great` can be a 3-gram).
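
The two counting rules in the note can be illustrated on a single, already tokenized review. The sketch below is not the expected `ngram_extraction()` implementation: the function name `count_ngrams`, the tiny stop-word list, whitespace tokenization, the prediction encoding (1 = positive) and counting each n-gram once per review are all assumptions made for illustration.

```python
import string
from collections import defaultdict
from typing import Dict, List

# Tiny illustrative stop-word list (an assumption; use the list specified in shortcut.py)
STOP_WORDS = {"it", "is", "the", "a", "this", "was"}
PUNCT = set(string.punctuation)


def is_punct(token: str) -> bool:
    """A token counts as punctuation here if all of its characters are punctuation."""
    return all(ch in PUNCT for ch in token)


def count_ngrams(tokens: List[str], prediction: int, max_n: int = 4) -> List[Dict[str, List[int]]]:
    """Count the n-grams of one review under the model's prediction for that review.

    The (n-1)-th dict maps an n-gram string to [#positive_predictions, #negative_predictions].
    """
    ngrams = [defaultdict(lambda: [0, 0]) for _ in range(max_n)]
    slot = 0 if prediction == 1 else 1  # assumed encoding: 1 = positive, 0 = negative
    for n in range(1, max_n + 1):
        seen = set()  # count each n-gram at most once per review (an assumption)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if any(is_punct(tok) for tok in window):
                continue  # rule (1): sequences containing punctuation are not n-grams
            if n == 1 and window[0] in STOP_WORDS:
                continue  # rule (2): stop-words are not 1-grams
            seen.add(" ".join(window))
        for gram in seen:
            ngrams[n - 1][gram][slot] += 1
    return ngrams


counts = count_ngrams("it is great .".split(), prediction=1)
for n, table in enumerate(counts, start=1):
    print(n, dict(table))
```

On this toy input, `it is great` is kept as a 3-gram while `it is great .` is rejected as a 4-gram, and `is` only survives inside longer n-grams.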

## 🎯 Q2.2: **Distill Potentially Useful Patterns (8 pts)**

TODO🔻: (2) [coding question] For each group of n-grams (n=1,2,3,4), find and **print** the **top-100 n-gram sequences** with the **greatest frequency of appearance**, which could contain frequent semantic features and would be used as our feature list.
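
As a small illustration of ranking by frequency of appearance, the sketch below selects the most frequent entries from a toy count table in the `[#positive_predictions, #negative_predictions]` format. The function name `top_k_frequent` and the toy counts are made up, and defining frequency as the sum of both counts is an assumption.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_frequent(counts: Dict[str, List[int]], k: int) -> List[Tuple[str, List[int]]]:
    """Return the k n-grams with the largest total count (positive + negative)."""
    return heapq.nlargest(k, counts.items(), key=lambda item: sum(item[1]))


# Toy counts, made up for illustration
toy_counts = {"great": [40, 2], "horrible": [1, 30], "book": [25, 25], "ok": [3, 4]}
top_2 = top_k_frequent(toy_counts, k=2)
print(top_2)
```

Note that in this toy table the most frequent entry, `book`, is split evenly between the two predictions, which is why frequency alone does not identify a shortcut feature.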

In [None]:
from shortcut import ngram_extraction

In [None]:
# all your saved model prediction results from 1.3 Fine-Grained Validation
prediction_files = [...]

# TODO: Define your tokenizer
tokenizer = ...
ngrams = ngram_extraction(prediction_files, tokenizer)

top_100 = {}
for n, counts in enumerate(ngrams):
    # TODO: find top-100 n-grams (n=1,2,3 or 4) associated with the greatest frequency of appearance
    top_100_freq = ...

    print(f'Top-100 most frequent {n+1}-grams:')
    print(top_100_freq)

    top_100[n] = top_100_freq

**Among each type of top-100 frequent n-grams above**, we aim to further find out the n-grams which **most likely** lead to *positive*/*negative* predictions (positive/negative shortcut features). 

TODO🔻: (3) [coding&text question] Design **two different methods to re-rank** the top-100 n-grams to extract shortcut features. For each method, you should extract **1** feature in each n-gram group (n=1, 2, 3, 4) for positive and negative predictions (1\*4\*2=8 features in total per method).

Explain each of your design choices in natural language, and compare which method finds more reasonable patterns.


In [None]:
# TODO: [Method 1] find top-1 positive and negative patterns
...

# TODO: [Explanation of Method 1]
...

In [None]:
# TODO: [Method 2] find top-1 positive and negative patterns
...

# TODO: [Explanation of Method 2]
...

TODO🔻: Compare and discuss the results from the two methods above.

## 🎯 Q2.3: **Case Study (8 pts)**

TODO🔻: Among the shortcut features you found in 2.1, find **4 representative** cases (pairs of `[review, n-gram feature]`) where the shortcut feature **will lead to a wrong prediction**.

For example, the 1-gram feature "excellent" has been considered as a shortcut for *positive* sentiment, while the ground-truth label of the given review containing "excellent" is *negative*.

**Questions:**
- Based on your case study, do you detect any limitations of the n-gram patterns?
- Which type of n-gram (1/2/3/4-gram) pattern is the most robust as a shortcut for sentiment prediction, and why?

In [None]:
# TODO: you can fill your code for finding cases here
...

TODO🔻: (Write your case study discussions and answers to the questions here.)

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 3: Annotate New Data (25 pts)

In this part, you will **annotate** the gold labels of some **new** SA data samples, and measure the degree of **agreement** between your and **one or two partners'** annotations.
    
</div>

## 🎯 Q3.1: **Write an Annotation Guideline (5 pts)**

TODO🔻: Imagine that you are going to assign this annotation task to a crowdsourcing worker who is not at all familiar with computer science and NLP. Think about how you would explain this annotation task to them so that they can do a decent job. Write an annotation guideline for a worker who is going to do this task for you.

**Note:** You should come up with your own guideline without the help of your partner(s) from Part 3.2.

(Write your annotation guideline here.)

## 🎯 Q3.2: **Annotate Your Datapoints with Partner(s) (8 pts)**

TODO🔻: Annotate 80 datapoints (20 in each domain of "books", "dvd", "electronics" and "housewares") assigned to you and your partner(s), by editing the value of the key **"label"** in each datapoint. You and your partner(s) should annotate **independently of each other**, i.e., each of you provide your own 80 annotations.

Please find your assigned annotation dataset **ID** and **your partner(s)** according to this [list](https://docs.google.com/spreadsheets/d/1hOwBUb8XE8fitYa4hlAwq8mARZe3ZsL4/edit?usp=sharing&ouid=108194779329215429936&rtpof=true&sd=true). Your annotation dataset can be found [here](https://drive.google.com/drive/folders/1IHXU_v3PDGbZG6r9T5LdjKJkHQ351Mb4?usp=sharing).

**Name your annotated file as `<your_assigned_dataset_id>-<your_sciper_number>.jsonl`.**

**You should also submit your partner's annotated file `<assigned_dataset_id>-<your_partner_sciper_number>.jsonl`.**

## 🎯 Q3.3: **Agreement Measure (12 pts)**

TODO🔻: Based on your and your partner's annotations in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators on **each domain** and **across all domains**.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement
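
For reference, Cohen's Kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below implements this for two annotators on toy labels and maps the score to the bands above. The function names and toy annotations are made up, and assigning a boundary value (e.g. exactly 0.4) to the lower band is an arbitrary choice. For your answer you can use the linked scikit-learn implementation (`cohen_kappa_score` in `sklearn.metrics`).

```python
from collections import Counter
from typing import List


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both annotators pick the same label independently
    p_e = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout (convention chosen here)
    return (p_o - p_e) / (1 - p_e)


def interpret(score: float) -> str:
    """Map a score to the interpretation bands listed above."""
    if score <= 0:
        return "No Agreement"
    if score >= 1.0:
        return "Perfect Agreement"
    bands = [(0.2, "Slight Agreement"), (0.4, "Fair Agreement"),
             (0.6, "Moderate Agreement"), (0.8, "Substantial Agreement")]
    for upper, name in bands:
        if score <= upper:
            return name
    return "Near Perfect Agreement"


# Toy annotations from two annotators, made up for illustration
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f} ({interpret(kappa)})")
```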

**Questions:**
- What is the overall degree of agreement between you and your partner(s) according to the above interpretation of score ranges?
- In which domains do disagreements between you and your partner(s) happen most and least frequently? Give some examples to explain why that is the case.
- Are there possible ways to address the disagreements between annotators?

In [None]:
# Fill your code for calculating agreement scores here.
...

(Write your answers to the questions here.)

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 4: Data Augmentation (20 pts)

We only used 20% of the whole dataset for training, which might limit the model performance. In this final part, we will try to enlarge the training set by **data augmentation**.

Specifically, we will **`Rephrase`** some current training samples using a pretrained paraphraser, so that the paraphrased synthetic samples preserve the meaning of the originals while changing their surface form.

You can use the pretrained T5 paraphraser [here](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base).

</div>

## 🎯 Q4.1: **Data Augmentation with Paraphrasing (15 pts)**
TODO🔻: Implement the functions `get_paraphrase_batch` and `get_paraphrase_dataset`, following the details given in the two code blocks below.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# get the given pretrained paraphrase model and the corresponding tokenizer (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base)
paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def get_paraphrase_batch(
    model,
    tokenizer,
    input_samples,
    n,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=256,
    device='cuda:0'):
    '''
    Input
      model: paraphraser
      tokenizer: paraphrase tokenizer
      input_samples: a batch (list) of real samples to be paraphrased
      n: number of paraphrases to get for each input sample
      for other parameters, please refer to:
          https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig
    Output:
      synthetic_samples: a list of paraphrased samples
    '''

    # TODO: implement paraphrasing on a batch of input samples
    ...

    return synthetic_samples

In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
BATCH_SIZE = 8
N_PARAPHRASE = 2

def get_paraphrase_dataset(model, tokenizer, data_path, batch_size, n_paraphrase):
    '''
    Input
      model: paraphrase model
      tokenizer: paraphrase tokenizer
      data_path: path to the `jsonl` file of training data
      batch_size: number of input samples to be paraphrased in one batch
      n_paraphrase: number of paraphrased sequences for each sample
    Output:
      paraphrase_dataset: a list of all paraphrase samples. Do not include the original training data.
    '''
    paraphrase_dataset = []
    with jsonlines.open(data_path, "r") as reader:

        # TODO: get paraphrases for the whole training dataset using get_paraphrase_batch
        ...

    return paraphrase_dataset

**Note:** Running the paraphrasing will take ~20-30 minutes using a T4 Colab GPU, although the running time depends on your implementation.
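
The bookkeeping around the paraphraser (batching the inputs, then attaching each paraphrase to the domain and label of its source sample) can be sketched without loading any model. Everything below is hypothetical: `fake_paraphraser` is a stand-in for a real paraphrase call, the helper names are made up, the toy samples and their label values are invented, and the code assumes that the paraphraser returns its outputs as a flat list grouped by input, `n` per input.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(samples: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive batches of at most batch_size samples."""
    batch: List[Dict] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller batch
        yield batch


def expand_with_paraphrases(samples: Iterable[Dict],
                            paraphrase_fn: Callable[[List[str], int], List[str]],
                            batch_size: int, n: int) -> List[Dict]:
    """Build paraphrased copies of the samples, keeping every field except 'review'."""
    augmented = []
    for batch in batched(samples, batch_size):
        texts = [s["review"] for s in batch]
        outputs = paraphrase_fn(texts, n)  # assumed: flat list, n outputs per input, in input order
        assert len(outputs) == len(texts) * n
        for i, sample in enumerate(batch):
            for text in outputs[i * n:(i + 1) * n]:
                augmented.append({**sample, "review": text})  # paraphrase inherits domain and label
    return augmented


def fake_paraphraser(texts: List[str], n: int) -> List[str]:
    """Stand-in for a real paraphrase model (hypothetical, for illustration only)."""
    return [f"{t} (v{j})" for t in texts for j in range(n)]


# Toy samples with made-up label values
toy = [
    {"review": "great book", "domain": "books", "label": "positive"},
    {"review": "broke after a week", "domain": "electronics", "label": "negative"},
    {"review": "boring plot", "domain": "dvd", "label": "negative"},
]
paraphrased = expand_with_paraphrases(toy, fake_paraphraser, batch_size=2, n=2)
print(len(paraphrased), paraphrased[0])
```

The inherited labels are silver labels: they are only as reliable as the paraphraser's ability to preserve the sentiment of the original review.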

In [None]:
paraphrase_dataset = get_paraphrase_dataset(paraphrase_model, paraphrase_tokenizer, data_train_path, BATCH_SIZE, N_PARAPHRASE)

In [None]:
# Original training dataset
with jsonlines.open(data_train_path, "r") as reader:
    origin_data = [dt for dt in reader.iter()]

all_data = origin_data + paraphrase_dataset

# Write all the original and paraphrased data samples into training dataset
augmented_data_train_path = os.path.join(data_dir, 'augmented_train_sa.jsonl')
with jsonlines.open(augmented_data_train_path, "w") as writer:
    writer.write_all(all_data)

assert len(all_data) == 3 * len(origin_data)

## 🎯 Q4.2: **Retrain RoBERTa Model with Data Augmentation (5 pts)** 
TODO🔻: Retrain the sentiment analysis model with the augmented (original+paraphrased), larger dataset :)

**Note:** *Training on the augmented data will take about 15 minutes using a T4 Colab GPU.*

In [None]:
# Re-train a RoBERTa SA model on the augmented training dataset
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

train(...,
      model_save_root='models/augmented/', tensorboard_path="./tensorboard/part4_lr{}".format(learning_rate))

TODO🔻: Discuss your results by answering the following questions

- Compare the performances of models in Part 1 and Part 4. Does the data augmentation help with the performance and why (give possible reasons)?
- No matter whether the data augmentation helps or not, list **three** possible ways to improve our current data augmentation method.

(Write your answers to the questions here.)

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

### **5 Upload Your Notebook, Data and Models**

Please upload your completed Jupyter notebook to your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your **datasets** **(annotated and augmented)**, as well as **all your trained models** in Part 1 and Part 4, in your GitHub Classroom repository.
    
</div>