#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: Malak Lahlou Nabil
> - ✉️ Email: malak.lahlounabil@epfl.ch
> - 🪪 SCIPER: 329571

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (finetuning) and evaluation of a pre-trained language model ([RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)) on a **Sentiment Analysis (SA)** task, which aims to determine whether a product review's emotional tone is positive or negative.

- For part-2, following the first finetuning task, you will need to identify the shortcuts (i.e. some salient or toxic features) that the model learnt for the specific task.

- For part-3, you are supposed to annotate 80 randomly assigned new datapoints as ground-truth labels. Additionally, the cross annotation should be conducted by another one or two annotators, and you will learn about how to calculate the agreement statistics as a significant characteristic reflecting the quality of a collected dataset.

- For part-4, since the human annotation is quite time- and effort-consuming, there are plenty of ways to get silver-labels from automatic labeling to augment the dataset scale, e.g., paraphrasing each text input in different words without changing its meaning. You will use a [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) paraphrase model to expand the training data of sentiment analysis, and evaluate the improvement of data augmentation.

For Parts 1 and Part 2, you will need to complete the code in the corresponding `.py` files (`sa.py` for Part 1, `shortcut.py` for Part 2). You will be provided with the function descriptions and detailed instructions about the code snippet you need to write.


### Table of Contents
- **PART 1: Sentiment Analysis (33 pts)**
    - 1.1 Dataset Processing (10 pts)
    - 1.2 Model Training and Evaluation (18 pts)
    - 1.3 Fine-Grained Validation (5 pts)
- **PART 2: Identify Model Shortcuts (22 pts)**
    - 2.1 N-gram Pattern Extraction (6 pts)
    - 2.2 Distill Potentially Useful Patterns (8 pts)
    - 2.3 Case Study (8 pts)
- **PART 3: Annotate New Data (25 pts)**
    - 3.1 Write an Annotation Guideline (5 pts)
    - 3.2 Annotate Your Datapoints with Partner(s) (8 pts)
    - 3.3 Agreement Measure (12 pts)
- **PART 4: Data Augmentation (20 pts)**
    - 4.1 Data Augmentation with Paraphrasing (15 pts)
    - 4.2 Retrain RoBERTa Model with Data Augmentation (5 pts)
    
### Deliverables

- ✅ This jupyter notebook: `assignment2.ipynb`
- ✅ `sa.py` and `shortcut.py` file
- ✅ Checkpoints for RoBERTa models finetuned on original and augmented SA training data (Part 1 and Part 4), including:
    - `models/lr1e-05-warmup0.3/`
    - `models/lr2e-05-warmup0.3/`
    - `models/augmented/lr1e-05-warmup0.3/`
- ✅ Model prediction results on each domain data (Part 1.3 Fine-Grained Validation): `predictions/`
- ✅ Cross-annotated new SA data (Part 3), including:
    - `data/<your_assigned_dataset_id>-<your_sciper_number>.jsonl`
    - `data/<your_assigned_dataset_id>-<your_partner_sciper_number>.jsonl`
    - (for group of 3) `data/<your_assigned_dataset_id>-<your_second_partner_sciper_number>.jsonl`
- ✅ Paraphrase-augmented SA training data (Part 4), including:
    - `data/augmented_train_sa.jsonl`
- ✅ `./tensorboard` directory with logs for all trained/finetuned models, including:
    - `tensorboard/part1_lr1e-05/`
    - `tensorboard/part1_lr2e-05/`
    - `tensorboard/part4_lr1e-05/`

### How to implement this assignment

Please read carefully the following points. All the information on how to read, implement and submit your assignment is explained in details below:

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook `assignment2.ipynb`** and the **`sa.py`**, **`shortcut.py`** python files.

2. Along with above files, you need to additionally upload model files under the **`models/`** dir, regarding the following models:
    - finetuned RoBERTa models on original SA training data (PART 1)  
    - finetuned RoBERTa model on augmented SA training data (PART 4)

3. You also need to upload model prediction results in Part 1.3 Fine-Grained Validation, saved in **`predictions/`**.

4. You also need to upload new data files under the **`data/`** dir (along with our already provided data), including:
    - new SA data with your and your partner's annotations (Part 3)
    - paraphrase-augmented SA training data (Part 4)

5. Finally, you will need to log your training using Tensorboard. Please follow the instructions in the `README.md` of the **``tensorboard/``** directory.

**Note**: Large files such as model checkpoints and logs should be pushed to the repository with Git LFS. You may also find that training the models on a GPU can speed up the process, we recommend using Colab's free GPU service for this. A tutorial on how to use Git LFS and Colab can be found [here](https://github.com/epfl-nlp/cs-552-modern-nlp/blob/main/Exercises/tutorials.md).
    
</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Environment Setup**

### **Option 1: creating your own environment**

```
conda create --name mnlp-a2 python=3.10
conda activate mnlp-a2
pip install -r requirements.txt
```

**Note**: If some package versions in our suggested environment do not work, feel free to try other package versions suitable for your computer, but remember to update ``requirements.txt`` and explain the environment changes in your notebook (no penalty for this if necessary).

### **Option 2: using Google Colab**
If you are using Google Colab notebook for this assignment, you will need to run a few commands to set up our environment on Google Colab, as shown below:
    
</div>

In [1]:
# This cell makes sure modules are auto-loaded when you change external python files
%load_ext autoreload
%autoreload 2

In [3]:
# If you are working in Colab, then consider mounting your assignment folder to your drive
from google.colab import drive
drive.mount('/content/drive')

# Direct to your assignment folder.
%cd /content/drive/MyDrive/a2-2024-mimimamalah

Mounted at /content/drive
/content/drive/MyDrive/a2-2024-mimimamalah


Install packages that are not included in the Colab base envrionemnt:

In [3]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

# Install dependencies
!pip install -r requirements.txt



In [4]:
import numpy as np
import jsonlines
import random

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# TODO: Enter your Sciper number
SCIPER = '329571'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7b2e680dcc50>

In [5]:
# Check the availability of GPU (proceed only it returns True!)
if torch.cuda.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

Good to go!


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">
    
# PART 1: Sentiment Analysis (33 pts)

In this part, we will finetune a pretrained language model (Roberta) on sentiment analysis(SA) task.

> Specifically, we will focus on a binary sentiment classification task for multi-domain product reviews. It requires the model to **classify a given paragraph of review by its sentiment polarity (positive or negative)**.

</div>

### Load Training Dataset (`train_sa.jsonl`)

**You can run the following cell to have the first glance at your data**. Each data sample is a python dictionary, which consists of following components:
- input review (*'review'*): a natural language sentence or a paragraph commenting about a product.
- domain (*'domain'*): describing the type of product being reviewed.
- label of sentiment (*'label'*): indicating whether the review states positive or negative views about the product.

In [None]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid % 200 == 0:
            print(sample)

{'review': "THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life", 'domain': 'books', 'label': 'negative'}
{'review': 'Sphere by Michael Crichton is an excellant novel. This was certainly the hardest to put down of all of the Crichton novels that I have read. The story revolves around a man named Norman Johnson. Johnson is a phycologist. He travels with 4 other civilans to a remote location in the Pacific Ocean to help the Navy in a top secret misssion. They quickly learn that under the ocean is a half mile long sp

In [None]:
# We use the following pretrained tokenizer and model
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 🎯 Q1.1: **Dataset Processing (10 pts)**

Our first step is to constructing a Pytorch Dataset for SA task. Specifically, we will need to implement **tokenization** and **padding** using a HuggingFace pre-trained tokenizer.

**TODO🔻: Complete `SADataset` class following the instructions in `sa.py`, and test by running the following cell.**

In [None]:
from sa import SADataset
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
dataset = SADataset("data/train_sa.jsonl", tokenizer)

Building SA Dataset...


1600it [00:05, 297.60it/s] 


In [None]:
from testA2 import test_SADataset
test_SADataset(dataset)

SADataset test correct ✅


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 🎯 Q1.2: **Model Training and Evaluation (18 pts)**

Next, we will implement the training and evaluation process to finetune the model.

- For training: you will need to calculate the **loss** and update the model weights by using **Adam optimizer**. Additionally, we add a **learning rate schedular** to adopt an adaptive learning rate during the whole training process.

- For evaluation: you will need to compute the **confusion matrix** and **F1 scores** to assess the model performance.

**TODO🔻: Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `sa.py` file, you can test compute_metrics() by running the following cell.**

In [None]:
from sa import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

TODO🔻: (1) [coding question] Train the model with the following two different learning rates (other hyperparameters should be kept consistent).

> A. learning_rate = 1e-5

> B. learning_rate = 2e-5

**Note:** *Each training will take ~7-10 minutes using a T4 Colab GPU.*

In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
learning_rate = 1e-5  # play around with this hyperparameter

# On Ed train models on data/train_sa.jsonl,
# and evaluate on data/test_sa.jsonl
train_dataset = SADataset('data/train_sa.jsonl', tokenizer)
dev_dataset = SADataset('data/test_sa.jsonl', tokenizer)

train(train_dataset=train_dataset,
      dev_dataset=dev_dataset,
      model=model,
      device=device,
      batch_size=batch_size,
      epochs=epochs,
      learning_rate=learning_rate,
      warmup_percent=warmup_percent,
      max_grad_norm=max_grad_norm,
      model_save_root='models/',
      tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Building SA Dataset...


1600it [00:03, 437.99it/s]


Building SA Dataset...


6400it [00:08, 786.68it/s] 
Training: 100%|██████████| 200/200 [00:15<00:00, 12.51it/s]
Evaluation: 100%|██████████| 800/800 [00:24<00:00, 33.31it/s]


Epoch: 0 | Training Loss: 0.697 | Validation Loss: 0.684
Epoch 0 SA Validation:
Confusion Matrix:
[[ 854 2346]
 [ 606 2594]]
F1: (36.65%, 63.73%) | Macro-F1: 50.19%
Model Saved!


Training: 100%|██████████| 200/200 [00:23<00:00,  8.36it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.33it/s]


Epoch: 1 | Training Loss: 0.492 | Validation Loss: 0.384
Epoch 1 SA Validation:
Confusion Matrix:
[[2689  511]
 [ 193 3007]]
F1: (88.42%, 89.52%) | Macro-F1: 88.97%
Model Saved!


Training: 100%|██████████| 200/200 [00:20<00:00,  9.72it/s]
Evaluation: 100%|██████████| 800/800 [00:24<00:00, 32.32it/s]


Epoch: 2 | Training Loss: 0.317 | Validation Loss: 0.311
Epoch 2 SA Validation:
Confusion Matrix:
[[2874  326]
 [ 284 2916]]
F1: (90.41%, 90.53%) | Macro-F1: 90.47%
Model Saved!


Training: 100%|██████████| 200/200 [00:19<00:00, 10.50it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.96it/s]


Epoch: 3 | Training Loss: 0.175 | Validation Loss: 0.408
Epoch 3 SA Validation:
Confusion Matrix:
[[3004  196]
 [ 401 2799]]
F1: (90.96%, 90.36%) | Macro-F1: 90.66%
Model Saved!


In [None]:
# I don't know if we should reload the roberta model or not.

learning_rate = 2e-5  # play around with this hyperparameter

# On Ed train models on data/train_sa.jsonl,
# and evaluate on data/test_sa.jsonl
train_dataset = SADataset('data/train_sa.jsonl', tokenizer)
dev_dataset = SADataset('data/test_sa.jsonl', tokenizer)

train(train_dataset=train_dataset,
      dev_dataset=dev_dataset,
      model=model,
      device=device,
      batch_size=batch_size,
      epochs=epochs,
      learning_rate=learning_rate,
      warmup_percent=warmup_percent,
      max_grad_norm=max_grad_norm,
      model_save_root='models/',
      tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Building SA Dataset...


1600it [00:01, 1015.67it/s]


Building SA Dataset...


6400it [00:07, 805.33it/s] 
Training: 100%|██████████| 200/200 [00:16<00:00, 12.04it/s]
Evaluation: 100%|██████████| 800/800 [00:18<00:00, 43.72it/s]


Epoch: 0 | Training Loss: 0.121 | Validation Loss: 0.492
Epoch 0 SA Validation:
Confusion Matrix:
[[3006  194]
 [ 429 2771]]
F1: (90.61%, 89.89%) | Macro-F1: 90.25%
Model Saved!


Training: 100%|██████████| 200/200 [00:19<00:00, 10.45it/s]
Evaluation: 100%|██████████| 800/800 [00:21<00:00, 37.22it/s]


Epoch: 1 | Training Loss: 0.093 | Validation Loss: 0.670
Epoch 1 SA Validation:
Confusion Matrix:
[[2970  230]
 [ 411 2789]]
F1: (90.26%, 89.69%) | Macro-F1: 89.98%


Training: 100%|██████████| 200/200 [00:16<00:00, 12.41it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.57it/s]


Epoch: 2 | Training Loss: 0.095 | Validation Loss: 0.625
Epoch 2 SA Validation:
Confusion Matrix:
[[2894  306]
 [ 317 2883]]
F1: (90.28%, 90.25%) | Macro-F1: 90.27%
Model Saved!


Training: 100%|██████████| 200/200 [00:16<00:00, 12.07it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.18it/s]

Epoch: 3 | Training Loss: 0.037 | Validation Loss: 0.893
Epoch 3 SA Validation:
Confusion Matrix:
[[2698  502]
 [ 207 2993]]
F1: (88.39%, 89.41%) | Macro-F1: 88.90%





TODO🔻: (2) [textual question] compare and discuss the results.

- Which learning rate is better?



- For learning rate 1e-5:
  * Training Loss: ~0.175
  * Validation Loss: ~0.408
  * Macro-F1 Score: ~90.66%
  * Confusion Matrix:

    [[3004   196]

    [401     2799]]

- For learning rate 2e-5:
  * Training Loss: ~0.093
  * Validation Loss: ~0.670
  * Macro-F1 Score: ~90.25%
  * Confusion Matrix:

    [[2698     502]

    [207      2993]]

- The model with a learning rate of 1e-5 has a higher training loss but a significantly lower validation loss compared to the 2e-5 model. This suggests that the 1e-5 model is generalizing better to the validation set, which is often better to ensure they perform well on unseen data.

- The 2e-5 model has a lower training loss which might indicate it is fitting the training data better, but this comes at the cost of a higher validation loss, potentially indicating overfitting to the training data.

- The Macro-F1 Scores are very similar, with the 1e-5 model slightly outperforming the 2e-5 model. Given that F1 is an important measure of a model's performance, especially in class-imbalanced datasets, the 1e-5 learning rate seems to be the better choice.

#### Confusion Matrix Comparison
*   True Positives (TP): At 1e-5, the model has more true positives, meaning it correctly identifies positive cases more often than at 2e-5.
*   False Negatives (FN): At 1e-5, the model has fewer false negatives, which is better as it means fewer positive cases are being missed.
*   False Positives (FP): At 2e-5, the model has fewer false positives, indicating it's better at not misclassifying negative cases as positive.
*   True Negatives (TN): Conversely, at 2e-5, the model has more true negatives, meaning it correctly identifies negative cases more often than at 1e-5.

=> The model trained with a 1e-5 learning rate is better at identifying positive cases correctly (higher TP, lower FN). The model with a 2e-5 learning rate is better at avoiding false alarms (lower FP) and identifying negative cases (higher TN).


## 🎯 Q1.3: **Fine-Grained Validation (5 pts)**

TODO🔻: (1) [coding question] Use the model checkpoint trained from the first learning_rate setting (lr=1e-5), check the model performance on each domain subsets of the validation set. You should report **the validation loss**, **confusion matrix**, **F1 scores** and **Macro-F1 on each domain**.

In [None]:
# Split the test sets into subsets with different domains
# Save the subsets under 'data/'
# Replace "..." with your code
domain_data = {}
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sample in reader.iter():
        if sample['domain'] not in domain_data:
            domain_data[sample['domain']] = []
        domain_data[sample['domain']].append(sample)

for domain, samples in domain_data.items():
    with jsonlines.open("data/test_sa_"+domain+".jsonl", mode="w") as writer:
        for sd in samples:
            writer.write(sd)

In [None]:
learning_rate = 1e-5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint_path = 'models/lr1e-05-warmup0.3'
tokenizer = RobertaTokenizer.from_pretrained(checkpoint_path)
model = RobertaForSequenceClassification.from_pretrained(checkpoint_path)
model.to(device)

results_save_dir = 'predictions/'

# Evaluate and save prediction results in each domain
# Replace "..." with your code
for domain in domain_data.keys():

    dev_dataset = SADataset(f"data/test_sa_{domain}.jsonl", tokenizer)

    dev_loss, confusion, f1_pos, f1_neg = evaluate(dev_dataset, model, device, batch_size,
                                                   result_save_file='predictions/test_'+domain+'.jsonl')
    macro_f1 = (f1_pos + f1_neg) / 2

    # It bothers me the fact that the domain does not in a new line
    print(f'\nDomain: {domain}')
    #print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f}')
    print(f'Confusion Matrix:')
    print(confusion)
    print(f'F1: ({f1_pos*100:.2f}%, {f1_neg*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building SA Dataset...


400it [00:01, 375.66it/s]
Evaluation: 100%|██████████| 50/50 [00:01<00:00, 42.04it/s]



Domain: books
Validation Loss: 0.080
Confusion Matrix:
[[198   2]
 [  6 194]]
F1: (98.02%, 97.98%) | Macro-F1: 98.00%
Building SA Dataset...


400it [00:01, 292.64it/s]
Evaluation: 100%|██████████| 50/50 [00:01<00:00, 45.28it/s]



Domain: dvd
Validation Loss: 0.083
Confusion Matrix:
[[200   0]
 [  8 192]]
F1: (98.04%, 97.96%) | Macro-F1: 98.00%
Building SA Dataset...


400it [00:00, 1008.61it/s]
Evaluation: 100%|██████████| 50/50 [00:01<00:00, 43.96it/s]



Domain: electronics
Validation Loss: 0.065
Confusion Matrix:
[[196   4]
 [  2 198]]
F1: (98.49%, 98.51%) | Macro-F1: 98.50%
Building SA Dataset...


400it [00:00, 536.96it/s]
Evaluation: 100%|██████████| 50/50 [00:01<00:00, 42.63it/s]


Domain: housewares
Validation Loss: 0.018
Confusion Matrix:
[[199   1]
 [  1 199]]
F1: (99.50%, 99.50%) | Macro-F1: 99.50%





TODO🔻: (2) [textual question] compare and discuss the results.

**Questions:**
- On which domain does the model perform the best? the worst?
- Give some possible explanations of why the model's best-performed domain is easier, and why the model's worst-performed domain is more challenging. Use some examples to support your explanations.

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the `predictions/` folder, by specifying the `result_save_file` parameter in the *evaluate* function.

- **Books**
  - Validation Loss: 0.080
  - Confusion Matrix: High true positives (198) and true negatives (194), very few false positives (2) and false negatives (6).
  - F1 Score: (98.02%, 97.98%).
  - Macro-F1: 98.00%.

- **DVD**
  - Validation Loss: 0.083
  - Confusion Matrix: Perfect true positives (200), high true negatives (192), no false positives (0), and some false negatives (8).
  - F1 Score: (98.04%, 97.96%).
  - Macro-F1: 98.00%.

- **Electronics**
  - Validation Loss: 0.065
  - Confusion Matrix: High true positives (196) and true negatives (198), few false positives (4) and very few false negatives (2).
  - F1 Score: (98.49%, 98.51%).
  - Macro-F1: 98.50%.

- **Housewares**
  - Validation Loss: 0.018
  - Confusion Matrix: Nearly perfect true positives (199) and true negatives (199), very few false positives (1) and false negatives (1).
  - F1 Score: (99.50%, 99.50%).
  - Macro-F1: 99.50%.

**Performance Comparison:**
- The model performs the **best on the housewares domain**, with the highest F1 scores and Macro-F1, and the lowest validation loss.
- The model performs the **worst on the books domain**, though it's still strong.

**Possible Explanations:**
- **Housewares domain**:

  *   Clear Indicators of Discontent: The language used in the housewares reviews tends to be very clear and direct about the issues the customers faced. Phrases like "cheaply made," "did not work," "completely useless," "burned ourselves," and "horrible piece of junk" are unambiguous and strong indicators of negative sentiment.
  *  Specificity and Detail: These reviews are detailed and specific about the product's issues. They don't just express dissatisfaction; they explain the problems encountered with the products.
  * Less Subjectivity and Nuance: Unlike book or movie reviews, which can be highly subjective and nuanced, housewares reviews tend to be more objective. A product either performs its function or doesn't, which is a simpler for a model to interpret.
  *  Consistent Vocabulary: The vocabulary related to housewares problems is consistent ("faulty"). Once the model learns these indicators of negative sentiment, it can apply them to unseen reviews in the same domain.

- **Books domain**:

 *  Subjective Content: The book reviews include emotional expressions ("horrible," "headache the entire time," "lit this book on fire"). They often reflect a reader's personal and subjective experience. This subjectivity can include mixed feelings.
 *  Elaborate Narratives: Reviews in the "books" domain can contain elaborate narratives that explain a reader's thought process, which might not have clearly defined sentiment cues, making it harder for the model.
 *  Cultural and Contextual References: Understanding and interpreting reviews often require cultural or literary context. For example,sarcasm, or metaphors might be misinterpreted without deeper knowledge.
 *  Varying Detail: Book reviews can be long and include a level of detail and critique that is specific to literature. The model must understand all of this information to predict the sentiment.
 *  Critique of Authorial Intent: Some negative reviews may critique the author rather than the content of the book, which could be challenging for a model if it has been trained to focus on product functionality rather than personal opinions about the creator.


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 2: Identify Model Shortcuts (22 pts)

In this part, We aim to find out the shortcut features learnt by the sentiment analysis model we have trained in Part1. We will be using the model checkpoint trained with `learning rate=1e-5`.

</div>

## 🎯 Q2.1: **N-gram Pattern Extraction (6 pts)**
We hypothesize that `n-gram`s could be the potential shortcut features learnt by the SA model. An `n-gram` is defined as a sequence of n consecutive words appeared in a natural language sentence or paragraph.

Thus, we aim to extract that an n-gram that appears in a review may serve as a key indicator of the polarity of the review's sentiment, for example:

>- **Review 1**: This book was **horrible**. If it was possible to rate it **lower than one star** I would have.
>- **Review 2**: **Excellent** book, **highly recommended**. Helps to put a realistic perspective on millionaires.

For Review 1, the `1-gram "horrible"` and the `4-gram "lower than one star"` serve as two key indicators of negative sentiment. While for Review 2, the `1-gram "excellent"` and the `2-gram "highly recommended"` obviously indicate positive sentiment.

TODO🔻: (1) [coding question] Complete `ngram_extraction()` function in `shortcut.py` file.

The returned *ngrams* contains a **list** of dictionaries. The `n-th` **dictionary** corresponds the `n-grams` (n=1,2, 3, 4).

The keys of each dictionary should be a **unique n-gram string** appeared in reviews, and the value of each n-gram key records the frequency of positive/negative predictions **made by the model** when the n-gram appears in the review, i.e., `\[#positive_predictions, #negative_predictions\]`.

> Example: **`ngrams`[0]['horrible'][0]** should return the number of the positive predictions made by the model when the 1-gram token 'horrible' appear in the given review. i.e., \[#positive_predictions, #negative_predictions\].

**Note:** (1) All the sequences contain punctuations should NOT be counted as a n-gram (e.g. `it is great .` is NOT a 4-gram, but `it is great` is a 3-gram); (2) All stop-words should NOT be counted as 1-grams, but can appear in other n-gram sequences (e.g. `is` is NOT a 1-gram token, but `it is great` can be a 3-gram token.)

## 🎯 Q2.2: **Distill Potentially Useful Patterns (8 pts)**

TODO🔻: (2) [coding question] For each group of n-grams (n=1,2,3,4), find and **print** the **top-100 n-gram sequences** with the **greatest frequency of appearance**, which could contain frequent semantic features and would be used as our feature list.

In [None]:
from shortcut import ngram_extraction

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import glob

# all your saved model prediction results from 1.3 Fine-Grained Validation
prediction_files = glob.glob(os.path.join('predictions/', 'test_*.jsonl'))

# TODO: Define your tokenizer
checkpoint_path = 'models/lr1e-05-warmup0.3'
tokenizer = RobertaTokenizer.from_pretrained(checkpoint_path)
ngrams = ngram_extraction(prediction_files, tokenizer)

top_100 = {}
for n, counts in enumerate(ngrams):
    # TODO: find top-100 n-grams (n=1,2,3 or 4) associated with the greatest frequency of appearance
    sorted_ngrams = sorted(counts.items(), key=lambda item: sum(item[1]), reverse=True)

    top_100_freq = sorted_ngrams[:100]

    print(f'Top-100 most frequent {n+1}-grams:')
    print(top_100_freq)

    top_100[n] = top_100_freq

100%|██████████| 400/400 [00:07<00:00, 52.82it/s]
100%|██████████| 400/400 [00:02<00:00, 163.24it/s]
100%|██████████| 400/400 [00:02<00:00, 173.73it/s]
100%|██████████| 400/400 [00:05<00:00, 72.79it/s]


Top-100 most frequent 1-grams:
[('one', [487, 552]), ('book', [461, 489]), ('like', [294, 329]), ('would', [260, 329]), ('movie', [212, 341]), ('good', [265, 252]), ('time', [217, 285]), ('even', [157, 276]), ('great', [290, 133]), ('well', [247, 173]), ('get', [176, 235]), ('film', [221, 172]), ('much', [163, 214]), ('first', [187, 189]), ('read', [154, 203]), ('really', [170, 173]), ('use', [181, 160]), ('also', [203, 137]), ('j', [165, 150]), ('work', [136, 171]), ('story', [159, 128]), ('many', [157, 128]), ('people', [114, 163]), ('make', [118, 159]), ('way', [145, 131]), ('could', [95, 172]), ('us', [128, 129]), ('better', [107, 148]), ('two', [126, 128]), ('2', [95, 156]), ('h', [127, 118]), ('new', [138, 103]), ('little', [132, 107]), ('r', [117, 118]), ('used', [111, 119]), ('l', [128, 100]), ('back', [90, 134]), ('b', [124, 99]), ('man', [127, 95]), ('product', [84, 137]), ('vd', [121, 97]), ('never', [96, 118]), ('made', [105, 108]), ('p', [131, 82]), ('3', [103, 107]), ('lo

**Among each type of top-100 frequent n-grams above**, we aim to further find out the n-grams which **most likely** lead to *positive*/*negative* predictions (positive/negative shortcut features).

TODO🔻: (3) [coding&text question] Design **two different methods to re-rank** the top-100 n-grams to extract shortcut features. For each method, you should extract **1** feature in each of n-grams group (n=1, 2, 3, 4) for positve and negative prediction (1\*4\*2=8 features in total for 1 method).

Explain each of your design choices in natural language, and compare which method finds more reasonable patterns.


In [None]:
# TODO: [Method 1] find top-1 positive and negative patterns
def extract_shortcut_features_ratio(ngrams_list):
    shortcut_features = {'positive': [], 'negative': []}

    for ngrams in ngrams_list:

        sorted_pos = sorted(ngrams, key=lambda x: (x[1][0] / (x[1][1] + 1)), reverse=True)
        sorted_neg = sorted(ngrams, key=lambda x: (x[1][1] / (x[1][0] + 1)), reverse=True)

        shortcut_features['positive'].append(sorted_pos[0])
        shortcut_features['negative'].append(sorted_neg[0])

    return shortcut_features

top_100_lists = [top_100[n] for n in range(4)]

shortcut_features = extract_shortcut_features_ratio(top_100_lists)

for type in ['positive', 'negative']:
    print(f"\nTop {type} shortcut features for each n-gram group:")
    for i, feature in enumerate(shortcut_features[type]):
        print(f"{i+1}-gram: {feature}")


Top positive shortcut features for each n-gram group:
1-gram: ('love', [148, 61])
2-gram: ('john son', [19, 0])
3-gram: ('g iov anni', [8, 0])
4-gram: ('w ither sp oon', [6, 0])

Top negative shortcut features for each n-gram group:
1-gram: ('bad', [34, 130])
2-gram: ('da v', [0, 14])
3-gram: ('gig ab eat', [0, 13])
4-gram: ('j apan ese culture', [0, 5])


**Explanation of Method 1 :**

**Ratio of Positive to Negative Predictions**



  * Explanation: This method calculates the ratio of positive to negative predictions for each n-gram. The intuition is that n-grams with a high positive-to-negative ratio are strong indicators of positive sentiment, while those with a low ratio indicate negative sentiment. This method reflects the n-gram's tendency to appear in positive or negative contexts.

  * Calculation: For each n-gram, calculate the ratio of positive to negative predictions (pos/neg).
  For positive shortcut features, select the n-gram with the highest pos/neg ratio in each n-gram group.
  Inversly for negative shortcut features.

In [None]:
# TODO: [Method 2] find top-1 positive and negative patterns

# Used ChatGPT
# Prompt used: can you give me another method similar to the ratio of positive to negative predictions for identifying top-1 positive and negative shortcut features
# Output of the tool:
# For a second method to identify top-1 positive and negative shortcut features, focusing on the Predictive Strength Score (PSS), let's define and apply it as follows:
# Method 2: Predictive Strength Score (PSS) for Shortcut Features
# The Predictive Strength Score (PSS) is a method that quantifies the indicative power of an n-gram for a particular sentiment (positive or negative) by considering both the frequency of its occurrence in the desired sentiment class and its frequency in the opposite class. The score aims to balance raw frequency with the contrast between sentiment classes, offering a nuanced view of an n-gram's predictive value.
# Formula for PSS
# Given \(pos\) as the count of an n-gram in positive samples and \(neg\) as its count in negative samples, the PSS can be calculated as follows:
# \[ PSS = \left( \frac{|pos - neg|}{pos + neg} \right) \times \log(pos + neg + 1) \]
# This formula incorporates both the difference in sentiment counts (providing contrast) and the overall frequency of the n-gram (adding a weighting factor that prioritizes more common n-grams).

from math import log

def predictive_strength_score(pos, neg):
    return (abs(pos - neg) / (pos + neg)) * log(pos + neg + 1)

def extract_shortcut_features_pss(ngrams_list):
    shortcut_features = {'positive': [], 'negative': []}

    for ngrams in ngrams_list:
        ngrams_with_pss = []

        for ngram, counts in ngrams:
            pos, neg = counts
            pss = predictive_strength_score(pos, neg)
            ngrams_with_pss.append((ngram, pss))

        ngrams_sorted = sorted(ngrams_with_pss, key=lambda x: x[1], reverse=True)

        if ngrams_sorted:
            shortcut_features['positive'].append(ngrams_sorted[0][0])
            shortcut_features['negative'].append(ngrams_sorted[-1][0])
    return shortcut_features

shortcut_features_pss = extract_shortcut_features_pss(top_100_lists)

for sentiment in ['positive', 'negative']:
    print(f"\nTop {sentiment} shortcut features for each n-gram group using PSS:")
    for i, feature in enumerate(shortcut_features_pss[sentiment]):
        print(f"{i+1}-gram: {feature}")



Top positive shortcut features for each n-gram group using PSS:
1-gram: bad
2-gram: highly recommend
3-gram: gig ab eat
4-gram: w ither sp oon

Top negative shortcut features for each n-gram group using PSS:
1-gram: able
2-gram: b rit
3-gram: eng ross ing
4-gram: ff gold bl um


**Explanation of Method 2:**
**Predictive Strength Score (PSS)**

  - Explanation: This method computes a Predictive Strength Score (PSS) for each n-gram, which combines both the frequency of the n-gram and its distribution across positive and negative predictions. The score is designed to prioritize n-grams not only frequently appearing but also showing a clear preference for positive or negative predictions.

  - Calculation: Compute the score as follows: PSS = (abs(pos - neg) / (pos + neg)) * log(pos + neg).
  For positive shortcut features, select the n-gram with the highest PSS where pos > neg in each n-gram group.
  Inversly for negative shortcut features.

TODO🔻: Compare and discuss the results from two methods above.


* Method 1 : (Pos/Neg Ratio) straightforwardly identifies n-grams that lean strongly towards positive or negative sentiments based on their occurrence ratio. It is directly linked to the sentiment indication of the n-gram. However, it might favor n-grams that are very rare but not necessarily indicative of the general sentiment.

* Method 2 (PSS) : tries to balance the predictiveness of an n-gram with its occurrence frequency, making sure that selected features are both indicative of sentiment and sufficiently common.

* Both methods found the shortcut "bad" for 1-gram, however, method 1 found it as the top negative shortcut feature while method 2 found it as the top positive shortcut feature. Even if the word bad has a negative connotation in general, sometimes it can be used as not bad which is not a negative connotation.



## 🎯 Q2.3: **Case Study (8 pts)**

TODO🔻: Among the shortcut features you found in 2.1, find out **4 representative** cases (pair of `\[review, n-gram feature\]`) where the shortcut feature **will lead to a wrong prediction**.

For example, the 1-gram feature "excellent" has been considered as a shortcut for *positive* sentiment, while the ground-truth label of the given review containing "excellent" is *negative*.

**Questions:**
- Based on your case study, do you detect any limitations of the n-gram patterns?
- Which type of n-gram (1/2/3/4-gram) pattern is more robust to be used for sentiment prediction shortcut and why?

In [None]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')

# Load the dataset into a list of dictionaries
dataset = []
with jsonlines.open(data_train_path, "r") as reader:
    for sample in reader:
        dataset.append(sample)

In [None]:
# Used ChatGPT
# Prompt used: "Write a Python function to find cases in a dataset where the presence of an n-gram contradicts the expected label."
# Output of the tool: The function below
# To verify the correctness of the output, I tested the function for the 4 representative cases where the shortcut feature will lead to a wrong prediction.
# Modifications made: None required.
# Explanation of AI-based code functionality:
# It checks if the specified n-gram is present in each review's text. If the n-gram is found, the function then checks if the review's label
# contradicts the expected label associated with that n-gram

def find_misleading_cases(dataset, ngram, expected_label):
    """
    Find cases where the presence of an n-gram contradicts the expected label.

    :param dataset: List of dictionaries, each representing a review.
    :param ngram: The n-gram string to search for.
    :param expected_label: The label ('positive' or 'negative') the n-gram is usually associated with.
    :return: A list of reviews where the n-gram is present but the label contradicts the expected sentiment.
    """
    misleading_cases = []
    for review in dataset:
        # Check if the n-gram is in the review text
        if ngram in review['review'].lower():
            # Check if the review label contradicts the expected label associated with the n-gram
            actual_label = review['label']
            if (expected_label == 'positive' and actual_label != 'positive') or (expected_label == 'negative' and actual_label != 'negative'):
                misleading_cases.append(review)
    return misleading_cases

In [None]:
ngram = "excellent"
expected_label = "positive"
misleading_cases = find_misleading_cases(dataset, ngram, expected_label)

for case in misleading_cases[:4]:
    print(f"Review: {case['review']}\nLabel: {case['label']}\n")

Review: There are plenty of excellent Play Therapy books out there, this is not one of them.  Maybe it's just me, but a play therapy book should be way more user friendly.  I need quick access, color and... ART
Label: negative

Review: I'm an avid reader of history, as well as processing a degree in the subject.  So imagine my surprise when, after receiving this book from a friend of mine for Christmas, I read the erroneous account of the Children's Crusade of 1212.  I had done research on this topic, so I was horrified to read the completely inaccurate account of what occurred.  Had the author not read any historical analysis on the subject from the last 50 years?  If he had, he would have realized that there were actually two crusades - one consisting of mainly French people led by Stephen of Cloyes who, when told to turn back by King Philip II, did so.  That ended that crusade.  The other one, led by a shepherd from Germany named Nicholas, led a group across the Alps into Italy.  So

In [None]:
ngram = "love"
expected_label = "positive"
misleading_cases = find_misleading_cases(dataset, ngram, expected_label)

for case in misleading_cases[:4]:
    print(f"Review: {case['review']}\nLabel: {case['label']}\n")

Review: Very average book!  Frei doesn't go down as one of my favorite sports writers. Loves to sensationlize rather then just tell a great story
Label: negative

Review: Just finished reading this novel this morning and I think it is probably the weakest one of the lot.  Like a lot of reviewers, I didn't find this one anywhere near as gripping or well-developed plotwise as a number of the earlier ones in the Stephanie Plum series.  As well, the supposed sexual tension- love triangle between Stephanie, Joe and Ranger is approaching its use-by date.  It is definitely becoming boring and I think Ms Evanovich needs to resolve Stephanie's romantic situation once and for all.  I still enjoy characters like Lula and Grandma Mazur but a lot of the situations Stephanie found herself in fell flat with me.  There are only so many times she can have her car blown up, handbag destroyed, apartment broken into, etc, before it becomes tiresome and repetitive.  I still think there is a spark of life i

In [None]:
ngram = "much better"
expected_label = "positive"
misleading_cases = find_misleading_cases(dataset, ngram, expected_label)

for case in misleading_cases[:4]:
    print(f"Review: {case['review']}\nLabel: {case['label']}\n")

Review: Didn't care for this book at all.  If you want to learn about Stephen King you'd be much better off reading or listening to his book &quot;On Writing.&quot
Label: negative

Review: ...are incorrect and that his fractal models are much better. Of course if his models were worth more than the paper they're written on Mandelbrot wouldn't have to write books like this because he'd be cleaning out everyone else's wallets on the stock market. In particular, if it's true that extreme events are more unlikely than most people think then he could easily exploit this with a suitable derivative. But the fact is, Mandelbrot doesn't know anything that countless other traders don't already know. So instead Mandelbrot is forced to resort to telling people how smart he is through his books rather than actually being smart enough to make a killing on Wall Street. But I am glad I read this book. Having seen Mandelbrot 'on tour' a few years ago I developed a strong prejudice against him. But read

In [None]:
ngram = "like"
expected_label = "positive"
misleading_cases = find_misleading_cases(dataset, ngram, expected_label)

for case in misleading_cases[:4]:
    print(f"Review: {case['review']}\nLabel: {case['label']}\n")

Review: I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist seems implausible and reaches for effect and panders to the "shrink" bashers of the world. This "play for effect" tone throughout the book was very distasteful to me
Label: negative

Review: I picked up the first book in this series (The Eyre Affair) based purely on its premise and was left somewhat underwhelmed. Still, the potential for the series seemed so large that I went ahead and read this second one too, only to be even less enchanted with the franchise. This is a pure sequel, and any newcomers are advised to read the misadventures of Thursda

TODO🔻: (Write your case study discussions and answers to the questions here.)

Limitations of n-gram patterns for sentiment prediction:

1. **Context Sensitivity:** N-gram patterns lack contextual understanding, leading to misinterpretations of sentiment, especially in cases of sarcasm, irony, or nuanced language.
2. **Over-Reliance on Specific Words:** N-grams focus solely on the presence or absence of specific words or phrases, ignoring the broader context of the text.
3. **Limited Generalization:** Higher-order n-grams may capture more complex patterns but risk overfitting to specific language patterns in the training data, reducing generalization to unseen data.
4. **Neglect of Negations:** N-grams may fail to account for negations (e.g., "not good"), resulting in incorrect sentiment classification.

Regarding the robustness of n-gram patterns for sentiment prediction shortcuts, the choice depends on the balance between specificity and generalization:

- **1-gram (Unigram):** Offers simplicity and generalization but may lack context and nuanced understanding, leading to misclassifications.
- **2-gram (Bigram):** Captures slightly more context than unigrams but still suffers from limited contextual understanding and may not adequately handle longer phrases.
- **3-gram (Trigram) and 4-gram (Four-gram):** These higher-order n-grams capture more context and can potentially discern more nuanced sentiment patterns. However, they also increase the risk of overfitting to specific language patterns in the training data, reducing generalization to unseen data.

Higher-order n-grams (3-gram and 4-gram) may offer more context and potentially better sentiment prediction, they are also more prone to overfitting, which is the case in this context. Thus, I recommend using 1-gram (unigram) or 2-gram (bigram) for sentiment prediction.

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# Part 3: Annotate New Data (25 pts)

In this part, you will **annotate** the gold labels of some **new** SA data samples, and measure the degree of **agreement** between your and **one or two partners'** annotations.
    
</div>

## 🎯 Q3.1: **Write an Annotation Guideline (5 pts)**

TODO🔻: Imagine that you are going to assign this annotation task to a crowdsourcing worker, who is completely not familiar with computer science and NLP. Think about how you are going to explain this annotation task to him in order to guide him do a decent job. Write an annotation guideline for such a worker who are going to do this task for you.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Part 3.2

**Guidelines for Sentiment Annotation:**

---

1. **Read Carefully:**
   - Take your time to read each text completely. Understanding the overall context is crucial for accurately determining the sentiment.

2. **Consider Overall Sentiment:**
   - Focus on the overall sentiment of the entire text rather than isolated words. Some texts may contain both positive and negative words, but what matters is the general feeling conveyed. For example, a reviewer might criticize some aspects of the product but still recommend it, this is considered a positive review. Another example, if a reviewer considers the product to not be worth the money even if it works and suggests another product instead, it is considered a negative review.

3. **Look for Keywords:**
   - Certain words or phrases can be strong indicators of sentiment. For example, words like "love", "happy", "great" suggest positive sentiments, whereas "hate", "disappointed", "worst" suggest negative sentiments. However, be mindful of the context, as the same word can be used sarcastically or in different contexts.

4. **Beware of Sarcasm and Irony:**
   - Sometimes, a text may seem positive at first glance but is actually negative due to sarcasm. Pay attention to clues that might indicate the author is being sarcastic or ironic.

5. **Consider All Aspects:**
   - Sometimes, the criticism may not be about the product itself, but rather about specific aspects such as the storyline or historical accuracy. Take into account the broader context of the review when assessing sentiment.

6. **Clarify Confusion:**
   - If the criticism pertains to elements beyond the product, such as historical accuracy or storytelling, ensure to distinguish between criticism of the product itself and criticism of external factors.

## 🎯 Q3.2: **Annotate Your Datapoints with Partner(s) (8 pts)**

TODO🔻: Annotate 80 datapoints (20 in each domain of "books", "dvd", "electronics" and "housewares") assigned to you and your partner(s), by editing the value of the key **"label"** in each datapoint. You and your partner(s) should annotate **independently of each other**, i.e., each of you provide your own 80 annotations.

Please find your assigned annotation dataset **ID** and **your partner(s)** according to this [list](https://docs.google.com/spreadsheets/d/1hOwBUb8XE8fitYa4hlAwq8mARZe3ZsL4/edit?usp=sharing&ouid=108194779329215429936&rtpof=true&sd=true). Your annotation dataset can be found [here](https://drive.google.com/drive/folders/1IHXU_v3PDGbZG6r9T5LdjKJkHQ351Mb4?usp=sharing).

**Name your annotated file as `<your_assigned_dataset_id>-<your_sciper_number>.jsonl`.**

**You should also submit your partner's annotated file `<assigned_dataset_id>-<your_partner_sciper_number>.jsonl`.**

In [None]:
# Token from Ed-discussion : Script for fast annotation #357
# Modified the code to add a function to get current progress by counting the number of entries in the annotations file
# Indeed, my colab was crashing multiple times in the middle of annotating.
# Prompt used: "How to count the number of items in a JSONLines file in Python."
# Output of the tool: The function get_current_progress
# To verify the correctness of the output, I just checked that indeed when colab was crashing it started from the last review I annotated

from IPython.display import clear_output

dataset_id = 3
scipher = SCIPER

annotations_file_path = f"data/{dataset_id}-{scipher}.jsonl"

def get_current_progress(file_path):
    if os.path.exists(file_path):
        with jsonlines.open(file_path, mode='r') as reader:
            return sum(1 for _ in reader)
    return 0

with jsonlines.open(f"data/{dataset_id}.jsonl", "r") as reader:
    samples = [sample for sample in reader.iter()]

id_to_label = {
    0: 'positive',
    1: 'negative'
}

backups = []
start_index = get_current_progress(annotations_file_path)

with jsonlines.open(annotations_file_path, mode="a") as writer:
    for i, sample in enumerate(samples[start_index:], start=start_index):
        print(f'{round(i / len(samples) * 100, 2)}% complete')
        print(f'Review: {sample["review"]}')
        try:
            label = id_to_label[int(input('Sentiment: '))]
            writer.write({"review": sample["review"], "domain": sample["domain"], "label": label})
            backups.append((sample, label))
            clear_output(wait=False)
        except KeyboardInterrupt:
            break

## 🎯 Q3.3: **Agreement Measure (12 pts)**

TODO🔻: Based on your and your partner's annotations in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators on **each domain** and **across all domains**.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement

**Questions:**
- What is the overall degree of agreement between you and your partner(s) according to the above interpretation of score ranges?
- In which domain are disagreements most and least frequently happen between you and your partner(s)? Give some examples to explain why that is the case.
- Are there possible ways to address the disagreements between annotators?

In [None]:
# Fill your code for calculating agreement scores here.
from collections import Counter

def compute_cohens_kappa(annotations1, annotations2):

    total_annotations = len(annotations1)
    label_agreement = {}
    for label1, label2 in zip(annotations1, annotations2):
        if label1 == label2:
            if label1 in label_agreement:
                label_agreement[label1] += 1
            else:
                label_agreement[label1] = 1

    observed_agreement = sum(label_agreement.values()) / total_annotations

    annotator1_counts = Counter(annotations1)
    annotator2_counts = Counter(annotations2)

    expected_agreement = sum((annotator1_counts[label] * annotator2_counts[label]) / (total_annotations ** 2)
                             for label in annotator1_counts)

    kappa_score = (observed_agreement - expected_agreement) / (1 - expected_agreement)

    return kappa_score

def compute_alpha(annotations1, annotations2):

    data = list(zip(annotations1, annotations2))

    label_pairs = Counter(data)

    observed_agreement = sum([label_pairs[label] for label in label_pairs if label[0] == label[1]]) / len(data)

    total_pairs = sum(label_pairs.values())
    expected_agreement = sum([(label_pairs[label1] / total_pairs) * (label_pairs[label2] / total_pairs)
                              for label1 in label_pairs for label2 in label_pairs if label1[0] == label2[0]])

    if expected_agreement == 1:
        return 1
    alpha_score = 1 - (1 - observed_agreement) / (1 - expected_agreement)
    return alpha_score

def read_annotations(file_path):
    annotations = []
    with jsonlines.open(file_path, mode='r') as reader:
        for line in reader:
            annotations.append(line['label'])
    return annotations

annotator1_file_path = "data/3-329571.jsonl"
annotator2_file_path = "data/3-347729.jsonl"

annotator1_annotations = read_annotations(annotator1_file_path)
annotator2_annotations = read_annotations(annotator2_file_path)

kappa_score = compute_cohens_kappa(annotator1_annotations, annotator2_annotations)
print(f"Cohen's Kappa: {kappa_score}")

alpha_score = compute_alpha(annotator1_annotations, annotator2_annotations)
print(f"Krippendorff's Alpha: {alpha_score}")

Cohen's Kappa: 0.5298372513562388
Krippendorff's Alpha: 0.5803066989507667


In [None]:
def read_annotations(file_path):
    annotations = []
    with jsonlines.open(file_path, mode='r') as reader:
        for line in reader:
            annotations.append({'review': line['review'], 'label': line['label'], 'domain': line['domain']})
    return annotations

def print_domain_disagreements(domain, disagreements):
    print(f"Domain: {domain}")
    if disagreements:
        print("Examples of disagreements:")
        for example in disagreements[:3]:
            print("Review : ", example[0]['review'])
            print("Annotation 1:", example[0]['label'])
            print("Annotation 2:", example[1]['label'])
            print("-" * 50)
    else:
        print("No disagreements found.")

annotator1_annotations = read_annotations(annotator1_file_path)
annotator2_annotations = read_annotations(annotator2_file_path)

domains = set(annotation['domain'] for annotation in annotator1_annotations)

for domain in domains:
    domain_annotations1 = [annotation for annotation in annotator1_annotations if annotation['domain'] == domain]
    domain_annotations2 = [annotation for annotation in annotator2_annotations if annotation['domain'] == domain]
    disagreements = [(annot1, annot2) for annot1, annot2 in zip(domain_annotations1, domain_annotations2) if annot1['label'] != annot2['label']]
    print_domain_disagreements(domain, disagreements)


Domain: books
Examples of disagreements:
Review :  I used to wonder why Hollywood ignored the crimes of Soviet Communism, which easily rival the crimes of National Socialism, something they've shown us over and over and over again (certainly understandable). Not one major film has been made showing us the summary executions during Lenin's "Red Terror," the deaths of millions during forced collectivization, the torture and mass murder during Stalin's purges of the late 30's, the cold-blooded massacre of thousands of Polish citizens at Katyn, or the horrors of the gulag. But Hollywood has been plenty busy pumping out films on "McCarthyism" and the ghastly "blacklist." They're listed in this book: Guilty by Suspicion, The Front, The Way We Were, The Majestic, Marathon Man, The House on Carroll Street, Fellow Traveler, One of Hollywood Ten. This is all propaganda of course. The way they tell it you'd think it was one of the darkest and most tragic eras in human history. McCarthy and his il

In [None]:
def count_different_labels_by_domain(annotations1, annotations2, domains):
    different_label_counts_by_domain = {domain: 0 for domain in domains}

    for annotation1, annotation2 in zip(annotations1, annotations2):
        domain = annotation1['domain']
        label1 = annotation1['label']
        label2 = annotation2['label']
        if label1 != label2:
            different_label_counts_by_domain[domain] += 1

    return different_label_counts_by_domain

annotator1_annotations = read_annotations(annotator1_file_path)
annotator2_annotations = read_annotations(annotator2_file_path)

domains = set(annotation['domain'] for annotation in annotator1_annotations + annotator2_annotations)

different_label_counts = count_different_labels_by_domain(annotator1_annotations, annotator2_annotations, domains)

for domain, count in different_label_counts.items():
    print(f"Domain: {domain}, Different Label Count: {count}")


Domain: books, Different Label Count: 1
Domain: electronics, Different Label Count: 4
Domain: dvd, Different Label Count: 1
Domain: housewares, Different Label Count: 7


Based on the interpretation of score ranges:

1. **Overall Degree of Agreement:**
   - Cohen's Kappa: 0.5298
   - Krippendorff's Alpha: 0.5803

   The degree of agreement between the annotators falls within moderate to substantial agreement according to Cohen's Kappa and Krippendorff's Alpha.

2. **Disagreements Frequency by Domain:**
   - Most Frequent: "housewares" domain with 7 instances of disagreements.
   - Least Frequent: "dvd" and "books" domains, each with 1 instance of disagreement.

   **Examples and Explanation:**
   - Most disagreements are in the "Housewares" domain because one reviewer (me) believe that even if the product is useful, it does not meet the expectations of the user. Indeed, they say the product does what it claims to do, however they suggest another product and praise the other product. For me it is a negative review since they would not recommend it.
   - **"Housewares" Domain:** Multiple disagreements in this domain due to varying preferences, experiences, or interpretations of product features. For instance, differences in usability assessments, as seen in the conflicting evaluations of product size and performance, contribute to disagreements.
   - **"DVD" and "Books" Domains:** Disagreements are less common in these domains because the criteria for evaluation is more objective. For example, in the "books" domain, the disagreement revolves around differing interpretations of historical events and political ideologies, which may be less subjective compared to product reviews.

   In conclusion, to address the disagreement, we should agree on what a positive or negative review should be, on what criteria should be used for evaluation.

3. **Addressing Disagreements:**

   - Review 1 (Shedender Product):
      - Annotation 1 (Negative): The reviewer acknowledges that the Shedender removes fur but criticizes its design, mentioning that the teeth of the comb are not deep enough, causing removed hair to fly or stay on top of the fur. They suggest considering the Furminator instead for better efficiency.
      - Annotation 2 (Positive): In contrast, the second annotator praises the Shedender's performance, stating that it works as advertised and efficiently removes fur.
      - Explanation: The disagreement arises from differing perspectives on the product's effectiveness and usability. Annotator 1 focuses on specific design flaws and recommends an alternative, while Annotator 2 emphasizes the Shedender's functionality and cost-saving benefits.

  - Review 2 (Brush Design):
      - Annotation 1 (Negative): The reviewer criticizes the brush's design, stating that it is difficult to use and recommends opting for the Furminator instead due to its easier usability and larger sizes.
      - Annotation 2 (Positive): Conversely, the second annotator praises the brush's deshedding capabilities and technological features, highlighting its effectiveness in removing hair.
      - Explanation: The disagreement revolves around the usability and effectiveness of the brush. Annotator 1 finds flaws in its design and suggests an alternative, while Annotator 2 appreciates its deshedding performance and technological aspects.

  - Review 3 (Product Performance):

    - Annotation 1 (Negative): The reviewer acknowledges the product's performance but criticizes its size and usability, mentioning that it is time-consuming to use, especially on larger animals.
    - Annotation 2 (Positive): In contrast, the second annotator praises the product's performance and effectiveness, stating that it performs well and efficiently removes pet hair.
    - Explanation: The disagreement arises also from differing opinions on the product's practicality and suitability for larger animals. Annotator 1 finds issues with its size and usability, while Annotator 2 focuses on its overall performance and effectiveness.



<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# Part 4: Data Augmentation (20 pts)

Since we only used 20% of the whole dataset for training, which might limit the model performance. In the final part, we will try to enlarge the training set by **data augmentation**.  

Specifically, we will **`Rephrase`** some current training samples using pretrained paraphraser. So that the paraphrased synthetic samples would preserve the semantic similarity while change the surface format.

You can use the pretrained T5 paraphraser [here](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base).

</div>

## 🎯 Q4.1: **Data Augmentation with Paraphrasing (15 pts)**
TODO🔻: Implement functions named `get_paraphrase_batch` and `get_paraphrase_dataset` with the details in the below two blocks.

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# get the given pretrained paraphrase model and the corresponding tokenizer (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base)
paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def get_paraphrase_batch(
    model,
    tokenizer,
    input_samples,
    n,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=256,
    device='cuda:0'):
    '''
    Input
      model: paraphraser
      tokenizer: paraphrase tokenizer
      input_samples: a batch (list) of real samples to be paraphrased
      n: number of paraphrases to get for each input sample
      for other parameters, please refer to:
          https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig
    Output: Tuple.
      synthetic_samples: a list of paraphrased samples
    '''

    # TODO: implement paraphrasing on a batch of imput samples

    inputs = tokenizer(input_samples, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)

    synthetic_samples = []
    for i in range(len(input_samples)):
        input_text = inputs.input_ids[i].unsqueeze(0)
        with torch.no_grad():
            outputs = model.generate(
                input_text,
                max_length=max_length,
                num_return_sequences=n,
                repetition_penalty=repetition_penalty,
                diversity_penalty=diversity_penalty,
                no_repeat_ngram_size=no_repeat_ngram_size,
                temperature=temperature,
                # Changed from True to False because of this error
                # ValueError: `diversity_penalty` is not 0.0 or `num_beam_groups` is not 1, triggering group beam search. In this generation mode, `do_sample` must be set to `False`
                do_sample=False,
                top_k=50,
                top_p=0.95,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                early_stopping=True,
                num_beams=2,
                num_beam_groups=2
            )
        for output in outputs:
            paraphrase = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            synthetic_samples.append(paraphrase)

    return synthetic_samples

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  return self.fget.__get__(instance, owner)()


In [7]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
BATCH_SIZE = 8
N_PARAPHRASE = 2

def get_paraphrase_dataset(model, tokenizer, data_path, batch_size, n_paraphrase):
    '''
    Input
      model: paraphrase model
      tokenizer: paraphrase tokenizer
      data_path: path to the `jsonl` file of training data
      batch_size: number of input samples to be paraphrases in one batch
      n_paraphrase: number of paraphrased sequences for each sample
    Output:
      paraphrase_dataset: a list of all paraphrase samples. Do not include the original training data.
    '''
    paraphrase_dataset = []
    with jsonlines.open(data_path, "r") as reader:

        samples = [sample for sample in reader]

        for i in range(0, len(samples), batch_size):
            batch = samples[i:i + batch_size]
            input_samples = [sample["review"] for sample in batch]
            domains_labels = [(sample["domain"], sample["label"]) for sample in batch]

            synthetic_samples = get_paraphrase_batch(
                model,
                tokenizer,
                input_samples,
                n_paraphrase,
                repetition_penalty=10.0,
                diversity_penalty=3.0,
                no_repeat_ngram_size=2,
                temperature=0.7,
                max_length=256,
                device=device
            )

            for j, (domain, label) in enumerate(domains_labels):
                for k in range(n_paraphrase):
                    index = j * n_paraphrase + k
                    paraphrased_review = synthetic_samples[index]
                    paraphrase_dataset.append({"review": paraphrased_review, "domain": domain, "label": label})

    return paraphrase_dataset

**Note:** run paraphrasing, which will take ~20-30 minutes using a T4 Colab GPU. But the running time could depend on various implementations.

In [8]:
paraphrase_dataset = get_paraphrase_dataset(paraphrase_model, paraphrase_tokenizer, data_train_path, BATCH_SIZE, N_PARAPHRASE)



In [9]:
# Original training dataset
with jsonlines.open(data_train_path, "r") as reader:
    origin_data = [dt for dt in reader.iter()]

all_data = origin_data + paraphrase_dataset

# Write all the original and paraphrased data samples into training dataset
augmented_data_train_path = os.path.join(data_dir, 'augmented_train_sa.jsonl')
with jsonlines.open(augmented_data_train_path, "w") as writer:
    writer.write_all(all_data)

assert len(all_data) == 3 * len(origin_data)

## 🎯 Q4.2: **Retrain RoBERTa Model with Data Augmentation (5 pts)**
TODO🔻: Retrain the sentiment analysis model with the augmented (original+paraphrased), larger dataset :)

**Note:** *Training on the augmented data will take about 15 minutes using a T4 Colab GPU.*

In [13]:
from sa import SADataset
from sa import compute_metrics, train, evaluate

# Re-train a RoBERTa SA model on the augmented training dataset
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

train_dataset = SADataset('data/augmented_train_sa.jsonl', tokenizer)
dev_dataset = SADataset('data/test_sa.jsonl', tokenizer)

epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
learning_rate = 1e-5

train(train_dataset=train_dataset,
      dev_dataset=dev_dataset,
      model=model,
      device=device,
      batch_size=BATCH_SIZE,
      epochs=epochs,
      learning_rate=learning_rate,
      warmup_percent=warmup_percent,
      max_grad_norm=max_grad_norm,
      model_save_root='models/augmented/', tensorboard_path="./tensorboard/part4_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Building SA Dataset...


4800it [00:07, 663.98it/s] 


Building SA Dataset...


6400it [00:09, 692.44it/s] 
Training:   0%|          | 0/600 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Training: 100%|██████████| 600/600 [00:47<00:00, 12.73it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.10it/s]


Epoch: 0 | Training Loss: 0.663 | Validation Loss: 0.347
Epoch 0 SA Validation:
Confusion Matrix:
[[2957  243]
 [ 594 2606]]
F1: (87.60%, 86.16%) | Macro-F1: 86.88%
Model Saved!


Training: 100%|██████████| 600/600 [00:46<00:00, 12.79it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.32it/s]


Epoch: 1 | Training Loss: 0.398 | Validation Loss: 0.308
Epoch 1 SA Validation:
Confusion Matrix:
[[2918  282]
 [ 331 2869]]
F1: (90.49%, 90.35%) | Macro-F1: 90.42%
Model Saved!


Training: 100%|██████████| 600/600 [00:46<00:00, 12.82it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 45.84it/s]


Epoch: 2 | Training Loss: 0.285 | Validation Loss: 0.421
Epoch 2 SA Validation:
Confusion Matrix:
[[2809  391]
 [ 243 2957]]
F1: (89.86%, 90.32%) | Macro-F1: 90.09%


Training: 100%|██████████| 600/600 [00:46<00:00, 12.95it/s]
Evaluation: 100%|██████████| 800/800 [00:17<00:00, 46.22it/s]


Epoch: 3 | Training Loss: 0.165 | Validation Loss: 0.453
Epoch 3 SA Validation:
Confusion Matrix:
[[2951  249]
 [ 349 2851]]
F1: (90.80%, 90.51%) | Macro-F1: 90.65%
Model Saved!


TODO🔻: Discuss your results by answering the following questions

- Compare the performances of models in Part 1 and Part 4. Does the data augmentation help with the performance and why (give possible reasons)?
- No matter whether the data augmentation helps or not, list **three** possible ways to improve our current data augmentation method.

  * **Initial Performance**: The model trained with augmented data showed a notably better initial performance, indicating that the augmented data likely provides a **more varied set of examples**. This variety can help the model learn more general features early on, which are applicable across a broader spectrum of the data.

  * **Generalization to Unseen Data**: The augmented data model exhibited higher or similar Macro-F1 scores across epochs compared to the model trained on normal data. Higher Macro-F1 scores suggest **better generalization** to unseen data, possibly because the augmented data introduces the model to a wider range of linguistic variations and sentiment expressions, reducing the model's overfitting on the training set's specific characteristics.

  * **Effectiveness in Learning**: The performance improvements with augmented data indicate that augmentation can effectively enhance the model's learning, as seen by the relatively quick achievement of high F1 scores. This suggests that data augmentation can mitigate some of the challenges associated with limited or unbalanced training data. This augmentation can help the model by providing more examples of underrepresented classes **(more balanced data)**.


### Three Ways to Improve Data Augmentation

1. **Contextual Augmentation**: Add new examples to the training dataset that demonstrate new contexts, new domains for the model to learn from.
 Instead of text generation, use more complex techniques to create new sentences that maintain the original sentiment while altering the sentence structure. Techniques like paraphrasing or employing models that understand context can generate more meaningful variations.

2. **Adversarial Filtering Algorithms**: To refine the dataset by removing examples that are too simplistic or might encourage shortcut learning, where the model relies on superficial cues rather than understanding the text. Through adversarial filtering, complex examples are prioritized in the training set, encouraging the model to develop more robust learning strategies. This involves identifying and excluding "easy" examples that don't contribute to learning complex patterns.


3. **Controlled Augmentation**: Construct controlled datasets that test specific dimensions of what we want in the first place By implementing a strategy to ensure that the augmented data does not disproportionately represent certain domains or sentiments. An example to control it is to monitor the distribution of key features in the augmented dataset and adjusting the augmentation process to avoid biases.


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

### **5 Upload Your Notebook, Data and Models**

Please upload your filled jupyter notebook in your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your **datasets** **(anotated and augmented)**, as well as **all your trained models** in Part 1 and Part 4, in your GitHub Classroom repository.
    
</div>