#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **Gurnoor Singh Khurana**
> - ✉️ Email: **gurnoor.khurana@epfl.ch**
> - 🪪 SCIPER: **366788**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;">

## **Assignment Description**
- In the first part of this assignment, you will implement training (fine-tuning) and evaluation of a pre-trained language model ([DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)) on the natural language inference (NLI) task, also known as recognizing textual entailment (RTE).

- After the first fine-tuning task, you will identify the shortcuts (i.e., salient or toxic features) that the model has learned for this task.

- In Part 3, you will annotate 100 randomly assigned test datapoints to produce ground-truth labels. One or two other annotators will cross-annotate the same datapoints, and you will learn how to calculate agreement statistics, an important indicator of the quality of a collected dataset.

- In Part 4, since human annotation is time-consuming and labor-intensive, you will use automatic labeling to obtain silver labels and scale up the dataset. We provide references to some simple methods (EDA and Back Translation), but you are encouraged to explore more advanced mechanisms. You will evaluate how much your data augmentation method improves your model's performance.

For each part, you will need to complete the code in the corresponding `.py` files (`nli.py` for Part-1, `shortcut.py` for Part-2, `eda.py` for Part-4). You will be provided with the function descriptions and detailed instructions about the code snippet you need to write.


### Table of Contents
- **[PART 1: Model Finetuning for NLI](#1)**
    - [1.1 Data Processing](#11)
    - [1.2 Model Training and Evaluation](#12)
- **[PART 2: Identify Model Shortcut](#2)**
    - [2.1 Word-Pair Pattern Extraction](#21)
    - [2.2 Distill Potentially Useful Patterns](#22)
    - [2.3 Case Study](#23)
- **[PART 3: Annotate New Data](#3)**
    - [3.1 Write an Annotation Guideline](#31)
    - [3.2 Annotate Your 100 Datapoints with Partner(s)](#32)
    - [3.3 Agreement Measure](#33)
    - [3.4 Robustness Check](#34)
- **[PART 4: Data Augmentation](#4)**
    
### Deliverables

- ✅ This jupyter notebook
- ✅ `nli.py` file
- ✅ `shortcut.py` file
- ✅ Finetuned DistilBERT models for NLI task (Part 1 and Part 4)
- ✅ Annotated and cross-annotated data files (Part 3)
- ✅ New dataset from data augmentation (Part 4)

</div>

### Google Colab Setup
If you are using a Google Colab notebook for this assignment, you will need to run a few commands to set up the environment on Google Colab. If you are running this notebook on a local machine, you can skip this section.

Run the following cell to mount your Google Drive. Follow the pop-up window and sign in to your Google account (the same account you used to store this notebook).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now click the fourth icon in the left sidebar (Files), then click the second button at the top of the Files panel (Refresh). Under "/drive/MyDrive/", find the Assignment 2 folder that you uploaded to your Google Drive, copy its path, and fill it in below. If everything is working correctly, running the following cell should print the filenames from the assignment:

```
['Assignment2.ipynb', 'requirements.txt', 'runs', 'predictions', 'nli_data', 'testA2.py', 'nli.py', 'shortcut.py']
```

In [2]:
import os
# TODO: Fill in the path to the folder where you downloaded the assignment
ROOT_PATH = "/content/drive/MyDrive/NLP/A2" # Replace with your directory to A2 folder
print(os.listdir(ROOT_PATH))

['Assignment2.ipynb', 'testA2.py', '.DS_Store', 'nli_data', '__pycache__', 'predictions', 'runs', 'requirements.txt', 'eda.py', 'shortcut.py', 'nli.py', 'case_study_contradiction.jsonl', 'case_study_entailment.jsonl']


Before we start, we also need to run some boilerplate code to set up our environment, same as previous assignments. You'll need to rerun this setup code each time you start the notebook.

In [3]:
requirements = ROOT_PATH + "/requirements.txt"
!pip install -r {requirements}
!pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting jsonlines==3.1.0
  Downloading jsonlines-3.1.0-py3-none-any.whl (8.6 kB)
Collecting transformers==4.26.1
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting apex==0.9.10.dev0
  Downloading apex-0.9.10dev.tar.gz (36 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface-hub==0.12.1
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)


Run this cell to load the autoreload extension. This allows us to edit .py source files, and re-import them into the notebook for a seamless editing and debugging experience.

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
from copy import deepcopy
import numpy as np 
from tqdm import tqdm
import jsonlines
import sys
import time
import random

import torch
import torch.utils.data
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_constant_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

Once you have successfully mounted your Google Drive and located the path to this assignment, run the following cell to allow us to import from the `.py` files of this assignment. If it works correctly, it should print the message:

```
Hello A2!
```

In [6]:
sys.path.append(ROOT_PATH)

from testA2 import hello_A2
hello_A2()

Hello A2!


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Note that if CUDA is not enabled, `torch.cuda.is_available()` will return False and this notebook will fall back to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment.

We will be using `torch.float = torch.float32` for all operations.

Please refer to https://pytorch.org/docs/stable/tensor_attributes.html#torch-dtype for more details about data types.

In [7]:
if torch.cuda.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

Good to go!


### Local Setup
If you skipped the Google Colab setup, you still need to fill in the path to the downloaded Assignment folder and install the required packages.

In [37]:
ROOT_PATH = "..." # Replace with your directory to A2 folder

In [38]:
requirements = ROOT_PATH + "/requirements.txt"
!pip install -r {requirements}

ERROR: Could not open requirements file: [Errno 2] No such file or directory: '.../requirements.txt'

In [39]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [40]:
from copy import deepcopy
import numpy as np 
from tqdm import tqdm
import jsonlines
import sys
import time, os
import random

import torch
import torch.utils.data
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_constant_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

<a name="1"></a>
## **PART 1: Finetuning DistilBERT for NLI**
---

### **What is the NLI task?🧐**
> Given a pair of sentences, a "premise" and a "hypothesis", NLI (or RTE) aims to determine their logical relationship, i.e., whether the hypothesis logically follows from the premise (entailment), contradicts it (contradiction), or is undetermined by it (neutral).

> Defined as a machine learning task, NLI can be considered a 3-class (entailment, contradiction, or neutral) classification task with a sentence-pair input ("premise" and "hypothesis").

> **You can run the following cell to get a first look at your data**. Each data sample is a Python dictionary consisting of the following components:
- premise sentence (*'premise'*)
- hypothesis sentence (*'hypothesis'*)
- domain (*'domain'*): describing the topic of the premise and hypothesis sentences (e.g., government regulations, telephone talks, etc.)
- label (*'label'*): indicating the logical relation between premise and hypothesis (i.e., entailment, contradiction, or neutral).

In [8]:
# If you use Google Colab, then data_dir = 'GOOGLE_DRIVE_PATH/nli_data'
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, 'dev_in_domain.jsonl')
with jsonlines.open(data_dev_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        print(sample)
        if sid == 2:
            break

{'premise': 'The new rights are nice enough', 'hypothesis': 'Everyone really likes the newest benefits ', 'domain': 'slate', 'label': 'neutral'}
{'premise': 'This site includes a list of all award winners and a searchable database of Government Executive articles.', 'hypothesis': 'The Government Executive articles housed on the website are not able to be searched.', 'domain': 'government', 'label': 'contradiction'}
{'premise': "uh i don't know i i have mixed emotions about him uh sometimes i like him but at the same times i love to see somebody beat him", 'hypothesis': 'I like him for the most part, but would still enjoy seeing someone beat him.', 'domain': 'telephone', 'label': 'entailment'}


In [9]:
# Enter your SCIPER number
SCIPER = '366788'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

In [10]:
print('Your random seed is: ', seed)

Your random seed is:  366788


In [11]:
# We use the following pretrained tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

### **1.1 Dataset Processing**
Our first step is to load the datasets for the NLI task by constructing a PyTorch Dataset. Specifically, we will need to implement tokenization and padding with a HuggingFace pre-trained tokenizer.

**Complete the `NLIDataset` class following the instructions in `nli.py`, and test it by running the following cell.**
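
To make the expected input layout concrete, here is a minimal pure-Python sketch of how a premise-hypothesis pair is typically packed for a BERT-style model: `[CLS] premise [SEP] hypothesis [SEP]`, followed by padding, with an attention mask marking the real tokens. The helper name `encode_pair`, the whitespace tokenizer, and the toy vocabulary are assumptions made for illustration only; in `nli.py` the HuggingFace tokenizer does this work with WordPiece tokens.

```python
def encode_pair(premise, hypothesis, vocab, max_len=12):
    """Toy sentence-pair encoder (hypothetical helper, not part of nli.py)."""
    # Whitespace splitting stands in for WordPiece tokenization (assumption).
    tokens = (["[CLS]"] + premise.lower().split() + ["[SEP]"]
              + hypothesis.lower().split() + ["[SEP]"])
    tokens = tokens[:max_len]                  # naive truncation, may cut the final [SEP]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(input_ids)      # 1 marks a real token
    pad = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad        # pad up to the fixed length
    attention_mask += [0] * pad                # 0 marks padding
    return input_ids, attention_mask


# Made-up vocabulary, just large enough for the example below.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "consider": 4, "forget": 5, "the": 6, "service": 7}
ids, mask = encode_pair("Consider the service", "Forget the service", toy_vocab)
print(ids)   # [2, 4, 6, 7, 3, 5, 6, 7, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Padding every sample to the same length lets a `DataLoader` stack samples into a batch, and the attention mask keeps the model from attending to the padded positions.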

In [12]:
from nli import NLIDataset
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

from testA2 import test_NLIDataset
test_NLIDataset(dataset)

Building NLI Dataset...


9815it [00:19, 515.18it/s]

NLIDataset test correct ✅





### **1.2 Model Training and Evaluation**
Next, we will implement the training and evaluation process to fine-tune the model. For model training, you will need to calculate the loss and update the model weights by stepping the optimizer. Additionally, we add a learning rate scheduler so that the learning rate adapts over the course of training.
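
The imports at the top of this notebook include `get_constant_schedule_with_warmup`. Assuming that is the scheduler used inside `train()`, the learning rate ramps up linearly from zero during the warmup steps and then stays constant. The sketch below reproduces that shape in plain Python; the function name `constant_with_warmup` is hypothetical, and the way the warmup length is derived from `warmup_percent` is an assumption (check `nli.py` for the exact rule).

```python
def constant_with_warmup(step, warmup_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then constant at 1."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


# Values taken from this notebook's training setup.
steps_per_epoch = 6136      # 98176 training samples / batch size 16
epochs = 4
warmup_percent = 0.3
base_lr = 2e-5

total_steps = steps_per_epoch * epochs
# Assumption: warmup length is a fraction of the total number of steps.
warmup_steps = int(warmup_percent * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps - 1):
    lr = base_lr * constant_with_warmup(step, warmup_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

Under these assumptions the learning rate only reaches its nominal value after 7363 of 24544 steps, i.e., a little into the second epoch.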

For evaluation, you will need to compute accuracy and F1 scores to assess the model performance. 

**Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `nli.py` file. You can test `compute_metrics()` by running the following cell.**
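
As a reference for what these metrics measure, here is a small pure-Python sketch of accuracy and per-class F1 on string labels. It is not the `compute_metrics()` from `nli.py` (whose exact signature is defined there); the helper name `accuracy_and_f1` and the toy labels are assumptions for illustration. The class order (entailment, neutral, contradiction) follows the order in which the F1 scores are reported in this notebook.

```python
LABELS = ["entailment", "neutral", "contradiction"]


def accuracy_and_f1(preds, golds, labels=LABELS):
    """Return accuracy and a list of per-class F1 scores (hypothetical helper)."""
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    f1s = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1s.append(2 * tp / denom if denom else 0.0)
    return acc, f1s


# Toy labels, not taken from the dataset.
golds = ["entailment", "neutral", "contradiction", "entailment"]
preds = ["entailment", "contradiction", "contradiction", "neutral"]
acc, f1s = accuracy_and_f1(preds, golds)
macro_f1 = sum(f1s) / len(f1s)   # unweighted mean over the three classes
print(f"Accuracy: {acc:.2%} | Macro-F1: {macro_f1:.2%}")
```

Macro-F1 weights each class equally, so a class the model handles poorly lowers the score even when overall accuracy looks acceptable.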

In [13]:
from nli import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

Try the following different hyperparameter settings, compare and discuss the results. (Other hyperparameters should not be changed.)

> A. learning_rate 2e-5

> B. learning_rate 5e-5

**Note:** *Each training run takes about 1 hour on a GPU; please keep your computer and notebook active during training.*

**Question: Which learning rate is better? Explain your answer.**

In [14]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

train_dataset = NLIDataset(ROOT_PATH+"/nli_data/train.jsonl", tokenizer)
dev_dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = ROOT_PATH+'/runs/'

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

Building NLI Dataset...


98176it [02:15, 724.17it/s] 


Building NLI Dataset...


9815it [00:07, 1239.93it/s]


In [None]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

learning_rate = 5e-5 # play around with this hyperparameter

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Training: 100%|██████████| 6136/6136 [13:21<00:00,  7.65it/s]
Evaluation: 100%|██████████| 614/614 [00:24<00:00, 25.09it/s]


Epoch: 0 | Training Loss: 0.804 | Validation Loss: 0.638
Epoch 0 NLI Validation:
Accuracy: 73.98% | F1: (76.71%, 70.46%, 74.68%) | Macro-F1: 73.95%
Model Saved!


Training: 100%|██████████| 6136/6136 [13:11<00:00,  7.76it/s]
Evaluation: 100%|██████████| 614/614 [00:24<00:00, 24.99it/s]


Epoch: 1 | Training Loss: 0.555 | Validation Loss: 0.594
Epoch 1 NLI Validation:
Accuracy: 76.29% | F1: (79.43%, 72.31%, 76.92%) | Macro-F1: 76.22%
Model Saved!


Training: 100%|██████████| 6136/6136 [13:05<00:00,  7.81it/s]
Evaluation: 100%|██████████| 614/614 [00:24<00:00, 25.25it/s]


Epoch: 2 | Training Loss: 0.333 | Validation Loss: 0.682
Epoch 2 NLI Validation:
Accuracy: 75.46% | F1: (78.70%, 70.75%, 76.57%) | Macro-F1: 75.34%


Training: 100%|██████████| 6136/6136 [13:07<00:00,  7.79it/s]
Evaluation: 100%|██████████| 614/614 [00:24<00:00, 25.38it/s]


Epoch: 3 | Training Loss: 0.217 | Validation Loss: 0.925
Epoch 3 NLI Validation:
Accuracy: 75.20% | F1: (78.18%, 71.22%, 76.10%) | Macro-F1: 75.17%


### **Fine-Grained Validation**

Using the model checkpoint saved under the first hyperparameter setting (learning_rate 2e-5) in 1.2, check the model performance on each domain subset of the validation set. Report the validation loss, accuracy, F1 scores and Macro-F1 on each domain, then compare and discuss the results.

**Questions: On which domain does the model perform best? On which does it perform worst? Give some possible explanations of why the best-performing domain is easier and why the worst-performing domain is more challenging. Use some examples to support your explanations.**

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the './predictions/' folder, by specifying the *result_save_file* of the *evaluate* function.
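
One way to locate supporting examples is to count which (gold label, predicted label) confusions occur most often among the misclassified records of a domain. The sketch below assumes each saved record carries `label` and `prediction` keys, as the case-study output later in this notebook suggests; the helper name `confusion_counts` and the toy records are hypothetical.

```python
from collections import Counter


def confusion_counts(records):
    """Count (gold label, predicted label) pairs among misclassified records."""
    return Counter((r["label"], r["prediction"])
                   for r in records if r["label"] != r["prediction"])


# Toy records with placeholder texts, not real dataset samples.
records = [
    {"premise": "p1", "hypothesis": "h1", "domain": "slate",
     "label": "neutral", "prediction": "contradiction"},
    {"premise": "p2", "hypothesis": "h2", "domain": "slate",
     "label": "neutral", "prediction": "contradiction"},
    {"premise": "p3", "hypothesis": "h3", "domain": "slate",
     "label": "entailment", "prediction": "entailment"},
]
for (gold, pred), n in confusion_counts(records).most_common():
    print(f"gold={gold:<13} pred={pred:<13} count={n}")
```

In practice the records would be read from a file such as `predictions/<domain>.jsonl`, and the most frequent confusion is a natural place to start looking for examples.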

In [15]:
batch_size = 16
learning_rate = 2e-5
warmup_percent = 0.3
checkpoint = ROOT_PATH+'/runs/lr{}-warmup{}'.format(learning_rate, warmup_percent)

# Split the validation sets into subsets with different domains
# Save the subsets under './nli_data/'
# Replace "..." with your code
data_dir = ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, 'dev_in_domain.jsonl')
samples = []
with jsonlines.open(data_dev_path, "r") as reader:
    for sid, sample in enumerate(tqdm(reader.iter())):
        samples.append(sample)

for domain in ["fiction", "government", "slate", "telephone", "travel"]:
    with jsonlines.open(os.path.join(data_dir, f"dev_{domain}.jsonl"), 'w') as writer:
      samples_domain = list(filter(lambda x: x['domain'] == domain, samples))
      writer.write_all(samples_domain)

9815it [00:00, 111052.32it/s]


In [16]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)

for domain in ["fiction", "government", "slate", "telephone", "travel"]:
    
    # Evaluate and save prediction results in each domain
    # Replace "..." with your code
    dev_domain_dataset = NLIDataset(ROOT_PATH+f"/nli_data/dev_{domain}.jsonl", tokenizer)
    dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(
                                              dev_domain_dataset,
                                              model, device,
                                              batch_size, no_labels=False,
                                              result_save_file=ROOT_PATH + f"/predictions/{domain}.jsonl"
                                            )
    macro_f1 = (f1_ent + f1_neu + f1_con) / 3
    
    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
    print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


1973it [00:01, 1300.61it/s]
Evaluation: 100%|██████████| 124/124 [00:03<00:00, 31.41it/s]


Domain: fiction
Validation Loss: 0.616 | Accuracy: 75.77%
F1: (78.12%, 71.96%, 77.25%) | Macro-F1: 75.78%
Building NLI Dataset...


1945it [00:03, 588.05it/s]
Evaluation: 100%|██████████| 122/122 [00:04<00:00, 28.48it/s]


Domain: government
Validation Loss: 0.516 | Accuracy: 81.65%
F1: (84.93%, 78.07%, 81.47%) | Macro-F1: 81.49%
Building NLI Dataset...


1955it [00:01, 1012.94it/s]
Evaluation: 100%|██████████| 123/123 [00:04<00:00, 29.11it/s]


Domain: slate
Validation Loss: 0.716 | Accuracy: 72.58%
F1: (74.84%, 68.94%, 73.98%) | Macro-F1: 72.59%
Building NLI Dataset...


1966it [00:02, 684.04it/s]
Evaluation: 100%|██████████| 123/123 [00:05<00:00, 23.72it/s]


Domain: telephone
Validation Loss: 0.608 | Accuracy: 76.45%
F1: (79.65%, 71.85%, 77.62%) | Macro-F1: 76.37%
Building NLI Dataset...


1976it [00:01, 1191.12it/s]
Evaluation: 100%|██████████| 124/124 [00:04<00:00, 28.33it/s]


Domain: travel
Validation Loss: 0.600 | Accuracy: 79.45%
F1: (83.76%, 76.40%, 77.79%) | Macro-F1: 79.32%


## **Task2: Identify Shortcuts**

We aim to find some shortcuts that the model from 1.2 (under the first hyperparameter setting) has learned.

### **2.1 Word-Pair Pattern Extraction**

We extract simple word-pair patterns that the model may have learned from the NLI data.

For this, we assume that a pair of words that occur in a premise-hypothesis sentence pair (one occurs in premise and the other occurs in hypothesis) may serve as a key indicator of the logical relationship between the premise and hypothesis sentences. For example:

>- Premise: Consider the United States Postal Service.
>- Hypothesis: Forget the United States Postal Service.

Here the word-pair "consider" and "forget" determine that the premise and hypothesis have a *contradiction* relationship, so (consider, forget) --> *contradiction* might be a good pattern to learn.

**Note:** 
- We do not consider naive word-pair patterns where the word from the premise and the word from the hypothesis are identical, e.g., (service, service) from the above premise-hypothesis sentence pair.
- We also exclude stop words, punctuation, and words that contain the special prefix '##' (e.g., '##s') from the pattern extraction.

In [19]:
# stop words and punctuation to be excluded from the pattern extraction

import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words.append('uh')

import string
puncs = string.punctuation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Complete `word_pair_extraction()` function in `shortcut.py` file.**

The keys of the returned dictionary *word_pairs* should be the **different word-pairs** that appear in premise-hypothesis sentence pairs, i.e., (a word from the premise, a word from the hypothesis).

The value of a word-pair key records the counts of entailment, neutral and contradiction predictions **made by the model** when the word-pair occurs, i.e., \[#entailment_predictions, #neutral_predictions,  #contradiction_predictions\].

**Note:** Remember to exclude naive word pairs (i.e., premise word identical to hypothesis word), stop words, punctuation, and words with the special prefix '##'.
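
The counting rule described above can be sketched as follows. This is a simplified illustration, not the `word_pair_extraction()` from `shortcut.py`: the helper name `count_word_pairs` and the pre-tokenized input format (`premise_tokens`, `hypothesis_tokens`) are assumptions, and the real function reads prediction files and uses the DistilBERT tokenizer instead.

```python
from collections import defaultdict
from itertools import product
import string

LABEL_INDEX = {"entailment": 0, "neutral": 1, "contradiction": 2}


def count_word_pairs(samples, stop_words, puncs=string.punctuation):
    """Map (premise word, hypothesis word) -> [#ent, #neu, #con] predictions."""
    punc_set = set(puncs)

    def keep(tok):
        # Drop stop words, single punctuation marks and '##' word pieces.
        return (tok not in stop_words and tok not in punc_set
                and not tok.startswith("##"))

    word_pairs = defaultdict(lambda: [0, 0, 0])
    for s in samples:
        p_words = {t for t in s["premise_tokens"] if keep(t)}
        h_words = {t for t in s["hypothesis_tokens"] if keep(t)}
        # Using sets counts each pair once per sentence pair (a design choice).
        for p, h in product(p_words, h_words):
            if p != h:  # skip naive identical pairs
                word_pairs[(p, h)][LABEL_INDEX[s["prediction"]]] += 1
    return dict(word_pairs)


# Toy sample based on the example above; the prediction is assumed.
samples = [{
    "premise_tokens": ["consider", "the", "postal", "service", "."],
    "hypothesis_tokens": ["forget", "the", "postal", "service", "."],
    "prediction": "contradiction",
}]
pairs = count_word_pairs(samples, stop_words={"the"})
print(pairs[("consider", "forget")])  # [0, 0, 1]
```

Note that both orderings of a pair, such as ('postal', 'service') and ('service', 'postal'), are counted as separate keys, which matches the mirrored pairs visible in the top-100 list below.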

### **2.2 Distill Potentially Useful Patterns**

Find and print the **top-100** word-pairs that are associated with the **largest total number** of model predictions, which might contain frequently used patterns.

In [20]:
from shortcut import word_pair_extraction
import importlib
import shortcut
importlib.reload(shortcut)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<module 'shortcut' from '/content/drive/MyDrive/NLP/A2/shortcut.py'>

In [21]:
# all your saved model prediction results in 1.2 Fine-Grained Validation
domains = ["fiction", "government", "slate", "telephone", "travel"]
prediction_files = [ROOT_PATH + f"/predictions/{domain}.jsonl" for domain in domains] 

tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
word_pairs = word_pair_extraction(prediction_files, tokenizer)

# find top-100 word-pairs associated with the largest total number of model predictions
top_100_freq_pairs = sorted([(pair, x[0]+x[1]+x[2]) for pair, x in word_pairs.items()], key=lambda x: x[1], reverse=True)
top_100_freq_pairs = list(map(lambda x: x[0], top_100_freq_pairs))[:100]

print(top_100_freq_pairs)

[('services', 'legal'), ('postal', 'service'), ('service', 'postal'), ('legal', 'services'), ('know', 'time'), ('know', 'like'), ('yeah', 'like'), ('like', 'lot'), ('would', 'could'), ('da', 'ca'), ('ca', 'da'), ('know', 'money'), ('year', 'last'), ('kids', 'children'), ('know', 'think'), ('yeah', 'never'), ('know', 'get'), ('know', 'people'), ('york', 'new'), ('yeah', 'one'), ('one', 'people'), ('two', 'one'), ('like', 'think'), ('would', 'people'), ('last', 'year'), ('yeah', 'time'), ('one', 'get'), ('state', 'l'), ('one', 'many'), ('going', 'get'), ('one', 'like'), ('good', 'bad'), ('national', 'saving'), ('new', 'york'), ('year', 'one'), ('legal', 'aid'), ('know', 'lot'), ('yeah', 'get'), ('know', 'go'), ('said', 'told'), ('one', 'last'), ('many', 'people'), ('well', 'get'), ('u', 'us'), ('united', 'us'), ('states', 'us'), ('case', 'studies'), ('saving', 'savings'), ('one', 'year'), ('may', 'might'), ('well', 'time'), ('like', 'never'), ('know', 'never'), ('yeah', 'think'), ('know'

**Among the top-100 frequent word-pairs above**, find the **top-5** word-pairs whose occurrences **most likely** lead to *entailment* predictions (entailment patterns), and the **top-5** word-pairs whose occurrences **most likely** lead to *contradiction* predictions (contradiction patterns).

**Explain your rules for finding these word pairs.**

In [22]:
# find top-5 entailment and contradiction patterns
entailment_values = sorted([(word_pair, word_pairs[word_pair][0]) for word_pair in top_100_freq_pairs], key=lambda x: x[1], reverse=True)
contradiction_values = sorted([(word_pair, word_pairs[word_pair][2]) for word_pair in top_100_freq_pairs], key=lambda x: x[1], reverse=True)
top_5_entailment = list(map(lambda x: x[0], entailment_values))[:5]
top_5_contradict = list(map(lambda x: x[0], contradiction_values))[:5]

print("Entailment Patterns:")
print(top_5_entailment)
print("Contradiction Patterns:")
print(top_5_contradict)

Entailment Patterns:
[('postal', 'service'), ('services', 'legal'), ('service', 'postal'), ('legal', 'services'), ('know', 'like')]
Contradiction Patterns:
[('yeah', 'never'), ('services', 'legal'), ('know', 'never'), ('legal', 'services'), ('postal', 'service')]
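
The ranking above sorts by raw prediction counts, so very frequent pairs such as ('services', 'legal') appear in both lists. An alternative rule, sketched here as one possible option rather than the required solution, ranks pairs by the share of predictions that fall into the target class. The helper name `top_k_by_ratio` and the toy counts are hypothetical.

```python
def top_k_by_ratio(word_pairs, candidates, label_idx, k=5):
    """Rank candidate pairs by the share of predictions at index label_idx."""
    def ratio(pair):
        counts = word_pairs[pair]
        total = sum(counts)
        return counts[label_idx] / total if total else 0.0
    return sorted(candidates, key=ratio, reverse=True)[:k]


# Made-up counts in [#entailment, #neutral, #contradiction] order.
toy_pairs = {
    ("services", "legal"): [40, 30, 30],
    ("good", "bad"): [2, 3, 15],
    ("kids", "children"): [12, 2, 1],
}
candidates = list(toy_pairs)
print(top_k_by_ratio(toy_pairs, candidates, label_idx=0, k=1))  # [('kids', 'children')]
print(top_k_by_ratio(toy_pairs, candidates, label_idx=2, k=1))  # [('good', 'bad')]
```

Restricting the candidates to the top-100 frequent pairs keeps rare pairs with a trivially high ratio (for example, a single occurrence) out of the ranking.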


### **2.3 Case Study**

Find and study **4 representative** cases where a pattern that you found in 2.2 **fails**, e.g., the premise-hypothesis sentence pair contains ('good', 'bad') but has an *entailment* gold label.

**Based on your case study, explain the limitations of the word-pair patterns.**

In [27]:
# Not all the 5 word pairs make sense in the above code block where we
# assign values to the variables top_5_entailment and top_5_contradict
# So we choose the ones that will help us interpret the results properly
reasonable_entailment_patterns = [('know', 'like')]
reasonable_contradiction_patterns = [('yeah', 'never'), ('know', 'never')]

In [28]:
# you can fill your code for finding cases here
import itertools
with jsonlines.open(ROOT_PATH + f"/case_study_entailment.jsonl", "w") as writer_e:
  with jsonlines.open(ROOT_PATH + f"/case_study_contradiction.jsonl", "w") as writer_c:
    for domain in ["fiction", "government", "slate", "telephone", "travel"]:
      with jsonlines.open(ROOT_PATH + f"/predictions/{domain}.jsonl", "r") as reader:
          for sid, sample in enumerate(tqdm(reader.iter())):
            p_tokens = tokenizer.tokenize(sample['premise'])
            h_tokens = tokenizer.tokenize(sample['hypothesis'])

            p_h_pairs = itertools.product(p_tokens, h_tokens)
            found_entailment, found_contradiction = False, False
            for pair in p_h_pairs:
              if not found_entailment and pair in reasonable_entailment_patterns and sample['prediction'] == 'contradiction':
                writer_e.write(sample)
                found_entailment = True
              
              if not found_contradiction and pair in reasonable_contradiction_patterns and sample['prediction'] == 'entailment':
                writer_c.write(sample)
                found_contradiction = True

1973it [00:03, 599.13it/s]
1945it [00:03, 571.02it/s]
1955it [00:02, 830.20it/s] 
1966it [00:01, 1356.50it/s]
1976it [00:01, 1240.57it/s]


In [30]:
with jsonlines.open(ROOT_PATH + f"/case_study_entailment.jsonl") as reader_e:
  for sample in reader_e.iter():
    print(sample)

print("Contradiction examples")
with jsonlines.open(ROOT_PATH + f"/case_study_contradiction.jsonl") as reader_c:
  for sample in reader_c.iter():
    print(sample)

{'premise': 'and the same is true of the drug hangover you know if you', 'hypothesis': "It's nothing like a drug hangover.", 'domain': 'telephone', 'label': 'contradiction', 'prediction': 'contradiction'}
{'premise': "but i don't know you know  maybe you could do that for a certain period of time but i mean how long does that kind of a thing take you know to to um say to question the person or to get into their head", 'hypothesis': "It's not worth doing if you have to question the person like that.", 'domain': 'telephone', 'label': 'neutral', 'prediction': 'contradiction'}
{'premise': "i don't know if you have a place there called uh or you probably have something similar we call it Service Merchandise", 'hypothesis': 'You probably have nothing like it.', 'domain': 'telephone', 'label': 'neutral', 'prediction': 'contradiction'}
{'premise': "um-hum yeah i know what that's like uh-huh", 'hypothesis': 'I have no idea what that is like.', 'domain': 'telephone', 'label': 'contradiction', 'p

Representative Cases:
1. - 'premise': 'and the same is true of the drug hangover you know if you'
   - 'hypothesis': 'It's nothing like a drug hangover.'
   
   The pair ('know', 'like') occurs in this sample and matches the entailment pattern. However, the sentences contradict each other: the hypothesis negates the premise ("nothing like").
2. - 'premise': "um-hum yeah i know what that's like uh-huh"
   - 'hypothesis': 'I have no idea what that is like.'

   The same reasoning applies here: ('know', 'like') is present, but "no idea" in the hypothesis reverses the meaning of the premise.

3. - 'premise': 'yeah right right yeah i know i uh i remember my college days  and having to do that too'
   - 'hypothesis': "I remember that when I went to college we didn't have anything like that."
   
   The same reasoning applies here as well.
4. - 'premise': "oh really yeah i've i've never seen either one of them"
   - 'hypothesis': "I've never looked at either of them."
   
   Here the pair ('yeah', 'never') is present, which matches the contradiction pattern, but the sample has an entailment label: "never" appears in both sentences, so they agree.

5. - 'premise': "okay i'll keep that in mind yeah you serve that yourself or the for a family"
   - 'hypothesis': 'I will never forget that. You can have that on your own or share with a family.'

   The same reasoning as in case 4 applies: "never forget" paraphrases "keep that in mind" rather than contradicting it.

Taken together, these cases show that a word pair on its own ignores negation and the surrounding context, so it cannot reliably determine the relationship between premise and hypothesis.

## **Task3: Annotate New Data**

To check the robustness of the developed model, **some additional sets of test data** are collected (under /nli_data/test_data/), which contain NLI samples from outside the domains of the training and validation data.

However, the test data does not have gold labels for the relationships between premise and hypothesis sentences, i.e., all the labels are marked as *hidden*. **We will annotate the data ourselves.**

### **3.1 Write an Annotation Guideline**

Imagine that you are going to assign this annotation task to a crowdsourcing worker who is completely unfamiliar with computer science and NLP. Think about how you would explain the task to them so that they can do a decent job. Write an annotation guideline for such a worker.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Task 3.2

Annotation guideline:

Read the premise sentence and the hypothesis sentence. Three cases are possible:
- Given the premise, the hypothesis is true. In this case, the label is entailment.
- Given the premise, the hypothesis is false. In this case, the label is contradiction.
- Given the premise, it cannot be determined whether the hypothesis is true or false. In this case, the label is neutral.

This guideline is simple enough to be understood by someone with no computer science background. It only requires knowledge of the English language.




### **3.2 Annotate Your 100 Datapoints with Partner(s)**

Annotate your 100 test datapoints with your partner(s), by editing the value of the key "label_student1", "label_student2" and "label_student3" (if you are in a group of three students) in each datapoint.

**Note:** 
- You can download the assigned annotation file (`<your-testset-id>.jsonl`) by [this link](https://drive.google.com/drive/folders/146ExExmpnSUayu6ArGiN5gQzCPJp0myB?usp=share_link)
- Please find your annotation partner according to the "Student Pairing List for A2 Task3" shared on Ed.

**Name your annotated file as `<index>-<sciper_number>.jsonl`.** 

For example, if you get `01.jsonl` to annotate, you should name your deliverable as `01-<your_sciper_number>.jsonl`.

### **3.3 Agreement Measure**

Based on your and your partner's annotations on the 100 test datapoints in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators. Discuss the agreement measure results.
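
To make the statistic less of a black box, here is a minimal from-scratch sketch of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` and the toy labels are assumptions made for illustration; the notebook itself uses `cohen_kappa_score` from scikit-learn.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # using each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Degenerate case (one identical label everywhere); returning 1.0 is a
        # choice made for this sketch, not necessarily what a library does.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only.
toy_a = ["entailment", "entailment", "neutral", "neutral"]
toy_b = ["entailment", "neutral", "neutral", "neutral"]
print(cohen_kappa(toy_a, toy_b))  # 0.5
```

In the toy example, p_o = 3/4 and p_e = (2*1 + 2*3) / 16 = 1/2, which gives a kappa of 0.5.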

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement
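
The mapping above can be written as a small lookup. This is only a sketch: the helper name `interpret_agreement` is hypothetical, and the treatment of boundaries (upper bounds inclusive, values at or below 0 counted as no agreement) is an assumption, since the list does not specify it.

```python
def interpret_agreement(score):
    """Map a kappa/alpha value to the bands listed above."""
    if score <= 0:
        return "No Agreement"  # assumption: negative values are grouped here
    if score >= 1.0:
        return "Perfect Agreement"
    bands = [
        (0.2, "Slight Agreement"),
        (0.4, "Fair Agreement"),
        (0.6, "Moderate Agreement"),
        (0.8, "Substantial Agreement"),
    ]
    for upper, name in bands:
        if score <= upper:  # assumption: upper bound belongs to the lower band
            return name
    return "Near Perfect Agreement"


print(interpret_agreement(0.7424632631419482))  # Substantial Agreement
```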

> **Questions**: What is your interpretation of Cohen's Kappa or Krippendorff's Alpha value according to the above mapping? Which kinds of disagreement happen most frequently between you and your partner(s), i.e., *entailment* vs. *neutral*, *entailment* vs. *contradiction*, or *neutral* vs. *contradiction*? For the second question, give some examples to explain why that is the case. Are there possible ways to address the disagreements between two annotators?

According to the above mapping, my co-annotator and I have substantial agreement (Cohen's Kappa of about 0.74). The most frequent disagreements between us were between entailment and neutral. This is expected: the difference between entailment and contradiction is quite clear, and so is the difference between contradiction and neutral. Deciding between entailment and neutral, however, is more subjective, because it depends on how much the annotator is willing to infer from the premise.

Examples:
1. - 'premise': 'This essay was selected as the First Prize winner ($1,000) in the Sixth VERBATIM Essay Competition.'
   - 'hypothesis': 'It was chosen and awarded a substantial cash prize for the  VERBATIM contest run last year.'

   I chose the neutral label for this example, whereas my partner chose entailment. My reasoning was that the premise only says the essay won first prize in the Sixth VERBATIM Essay Competition; it says nothing about when the competition took place. Since the hypothesis mentions 'last year', we cannot guarantee from the premise alone that the hypothesis is true.

2. - 'premise': 'Each generation of girls faces new  new technology, new moral issues, new opportunities.'
   - 'hypothesis': 'Each generation of girls faces new opportunities for employment. '

   I chose the entailment label for this example, whereas my partner chose neutral. My reasoning was that the premise mentions 'new opportunities' and the hypothesis mentions 'new opportunities for employment'. The latter can be assumed to be a subset of the former, so I thought entailment was the best label.


To resolve such disagreements, the best way is for the annotators to discuss their reasoning and then agree on a label. However, this is often not practical. A more practical approach is to have multiple annotators and choose the label given by the majority of them.
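
The majority-vote idea can be sketched as follows. The helper `majority_label` is hypothetical, and returning `tie_value` when there is no unique winner is one possible design choice; such items could then be sent back for discussion.

```python
from collections import Counter


def majority_label(votes, tie_value=None):
    """Return the label with strictly the most votes, else tie_value."""
    if not votes:
        return tie_value
    ranked = Counter(votes).most_common()
    # A tie for first place means there is no majority to fall back on.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_value
    return ranked[0][0]


print(majority_label(["entailment", "neutral", "entailment"]))  # entailment
print(majority_label(["entailment", "neutral"]))                # None
```

With only two annotators every disagreement is a tie, which is why a third annotator (or a discussion step) is needed for this scheme to resolve anything.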

In [15]:
# fill your code here
from sklearn.metrics import cohen_kappa_score

# Collect each annotator's labels from their own annotation file.
labels_1, labels_2 = [], []
with jsonlines.open(ROOT_PATH + "/nli_data/03-366788.jsonl") as reader:
  for sample in tqdm(reader.iter()):
    labels_1.append(sample['label_student1'])

with jsonlines.open(ROOT_PATH + "/nli_data/03-337560.jsonl") as reader:
  for sample in tqdm(reader.iter()):
    labels_2.append(sample['label_student2'])

# Agreement between the two annotators on the same 100 datapoints.
cohen_kappa_score(labels_1, labels_2)

100it [00:00, 352.42it/s]
100it [00:00, 363.68it/s]


0.7424632631419482
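
To check which label pairs account for the disagreements, the two label lists can be tabulated. This is a minimal sketch with a hypothetical helper `disagreement_pairs` and toy labels; it assumes the labels are strings so that each pair can be sorted.

```python
from collections import Counter


def disagreement_pairs(labels_a, labels_b):
    """Count unordered label pairs on which two annotators disagree."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            # Sort so (neutral, entailment) and (entailment, neutral) are merged.
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Toy labels for illustration only, not the real annotations.
toy_1 = ["entailment", "neutral", "contradiction", "neutral"]
toy_2 = ["neutral", "entailment", "contradiction", "contradiction"]
print(disagreement_pairs(toy_1, toy_2).most_common())
# [(('entailment', 'neutral'), 2), (('contradiction', 'neutral'), 1)]
```

In the notebook, the same function could be called on `labels_1` and `labels_2` from the cell above.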

### **3.4 Robustness Check**

Take into account both your and your partner's annotations, determine the final labels of the 100 test datapoints, by editing the value of the key "label" in each of your datapoint.

Evaluate the performance of your developed model in 1.4 (still under the first hyperparameter setting) on your annotated 100 test datapoints, and compare with the model performance on the validation set.

> **Question**: Do you think that your developed model is robust in handling out-of-domain NLI predictions?

I think my model is reasonably robust in handling out-of-domain NLI predictions. It reaches an accuracy of 74.00% and a Macro-F1 score of 73.65% on the annotated test set, which is quite reasonable for zero-shot predictions on unseen domains and is comparable to the model's results on the in-domain validation dataset.

**Note**: To obtain the final labels, I use my own annotations because I found them more reasonable than my partner's. Since we have substantial agreement (Cohen's Kappa of 0.74), I do not expect this choice to make much difference.

In [54]:
# fill your code here
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

learning_rate = 2e-5
warmup_percent = 0.3
checkpoint = ROOT_PATH+'/runs/lr{}-warmup{}'.format(learning_rate, warmup_percent)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)

annotated_dataset = NLIDataset(ROOT_PATH+f"/nli_data/03-final.jsonl", tokenizer)
dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(
                                          annotated_dataset,
                                          model, device,
                                          batch_size, no_labels=False,
                                          result_save_file=ROOT_PATH + f"/predictions/annotated_dataset.jsonl"
                                        )
macro_f1 = (f1_ent + f1_neu + f1_con) / 3

print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')



Building NLI Dataset...


100it [00:00, 304.25it/s]
Evaluation: 100%|██████████| 7/7 [00:00<00:00, 30.05it/s]


Validation Loss: 0.781 | Accuracy: 74.00%
F1: (77.11%, 70.77%, 73.08%) | Macro-F1: 73.65%


## **Task4: Data Augmentation**

Finally, we use a data augmentation method to create more training data, and use the augmented data to improve the model performance. The data augmentation method we are going to use is [EDA](https://aclanthology.org/D19-1670/).

### **4.1 EDA: Easy Data Augmentation algorithm for Text**

For this section, we will need to implement the simplest data augmentation techniques on textual sentences, including **SR** (Synonym Replacement), **RD** (Random Deletion), **RS** (Random Swap), **RI** (Random Insertion).

You should complete all the functions in `eda.py` script, and you can test them with a simple testcase by running the following cell.

- **Synonym Replacement (SR)**
> In Synonym Replacement, we randomly replace some words in the sentence with their synonyms.

In [23]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
  
from nltk.corpus import wordnet

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


You can test whether you get the synonyms right and see an example with synonym replacement.

In [24]:
from eda import get_synonyms
from testA2 import test_get_synonyms

test_get_synonyms(get_synonyms)

The synonyms for the word "task" are:  ['chore', 'undertaking', 'project', 'labor', 'job', 'tax']


In [26]:
from eda import synonym_replacement

print(f" Example of Synonym Replacement: {synonym_replacement('hey man how are you doing',3)}")

 Example of Synonym Replacement: hey humans how are you doing
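
For intuition, here is a minimal sketch of synonym replacement. It is not the `eda.py` implementation: `synonym_replacement_sketch` and `TOY_SYNONYMS` are hypothetical names, and the hard-coded table stands in for a WordNet lookup so that the example runs without downloads.

```python
import random

# Hypothetical toy synonym table; eda.py is expected to query WordNet instead.
TOY_SYNONYMS = {
    "man": ["gentleman", "fellow"],
    "task": ["job", "chore"],
}


def synonym_replacement_sketch(sentence, n, rng=None):
    """Replace up to n words that have an entry in TOY_SYNONYMS."""
    rng = rng or random.Random()
    words = sentence.split()
    # Only words with a known synonym are candidates for replacement.
    candidates = [i for i, w in enumerate(words) if w in TOY_SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(TOY_SYNONYMS[words[i]])
    return " ".join(words)


print(synonym_replacement_sketch("hey man how are you doing", 3, random.Random(0)))
```

Even with `n=3`, only one word changes here because only "man" has an entry in the toy table.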


- **Random Deletion (RD)**

> In Random Deletion, we randomly delete a word if a uniformly generated number between 0 and 1 is smaller than a pre-defined threshold. This allows for a random deletion of some words of the sentence.

In [27]:
from eda import random_deletion

print(f" Example of Random Deletion: {random_deletion('hey man how are you doing', p=0.3, max_deletion_n=3)}")

 Example of Random Deletion: hey man how are doing
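
A minimal sketch of random deletion, following the description above. It is not the `eda.py` implementation: the name `random_deletion_sketch` is hypothetical, and never deleting every word is an assumption made here so that the output is not empty.

```python
import random


def random_deletion_sketch(sentence, p, max_deletion_n, rng=None):
    """Drop each word with probability p, deleting at most max_deletion_n words."""
    rng = rng or random.Random()
    words = sentence.split()
    # Assumption: never delete every word, so cap the budget at len(words) - 1.
    budget = max(0, min(max_deletion_n, len(words) - 1))
    kept = []
    for word in words:
        if budget > 0 and rng.random() < p:
            budget -= 1  # this word is deleted
        else:
            kept.append(word)
    return " ".join(kept)


# With p=1.0 every draw is below the threshold, so the first three words go.
print(random_deletion_sketch("hey man how are you doing", p=1.0, max_deletion_n=3))
# are you doing
```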


- **Random Swap (RS)**
> In Random Swap, we randomly swap the order of two words in a sentence.

In [28]:
from eda import swap_word

print(f" Example of Random Swap: {swap_word('hey man how are you doing')}")

 Example of Random Swap: are man how hey you doing
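
A minimal sketch of random swap. The name `swap_word_sketch` is hypothetical and this is not the `eda.py` implementation; it simply draws two distinct positions and exchanges the words.

```python
import random


def swap_word_sketch(sentence, rng=None):
    """Swap two distinct, randomly chosen word positions."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    # sample() without replacement guarantees i != j.
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(swap_word_sketch("hey man how are you doing", random.Random(0)))
```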


- **Random Insertion (RI)**
> Finally, in Random Insertion, we randomly insert synonyms of a word at a random position.
> Data augmentation operations should not change the true label of a sentence, as that would introduce unnecessary noise into the data. Inserting a synonym of a word in a sentence, as opposed to a random word, is more likely to be relevant to the context and retain the original label of the sentence.

In [29]:
from eda import random_insertion

print(f" Example of Random Insertion: {random_insertion('hey man how are you doing', n=2)}")

 Example of Random Insertion: hey man how are coif coif you doing
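
A minimal sketch of random insertion. As with the other sketches, this is not the `eda.py` implementation: `random_insertion_sketch` and `TOY_SYNONYMS` are hypothetical, and the small table replaces a WordNet lookup.

```python
import random

# Hypothetical toy synonym table; eda.py is expected to query WordNet instead.
TOY_SYNONYMS = {
    "man": ["gentleman", "fellow"],
    "task": ["job", "chore"],
}


def random_insertion_sketch(sentence, n, rng=None):
    """Insert n synonyms of randomly chosen words at random positions."""
    rng = rng or random.Random()
    words = sentence.split()
    for _ in range(n):
        sources = [w for w in words if w in TOY_SYNONYMS]
        if not sources:
            break  # no word has a known synonym
        synonym = rng.choice(TOY_SYNONYMS[rng.choice(sources)])
        # randint is inclusive, so index len(words) appends at the end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(random_insertion_sketch("hey man how are you doing", 2, random.Random(0)))
```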


### **4.2 Augment Your Model**

By combining all the functions you have implemented in 4.1, you can come up with your own data augmentation pipeline with various p and n ;)

The next step is to expand the training data you used in Task1, re-train your model in 1.4 on your augmented data, and re-evaluate its performance on both the given validation set and your manually annotated 100 test datapoints.

Discuss the improvements that your data augmentation brings to your model. ***Include some examples of old vs. new model predictions to demonstrate the improvements.***

**Warning: In terms of data size and training time control, we stipulate that your augmented training data should not be larger than 100M.** (Currently the training data train.jsonl is about 25M.)

In [30]:
def aug(sent,n,p):
    print(f" Original Sentence : {sent}")
    print(f" SR Augmented Sentence : {synonym_replacement(sent, n)}")
    print(f" RD Augmented Sentence : {random_deletion(sent, p, n)}")
    print(f" RS Augmented Sentence : {swap_word(sent)}")
    print(f" RI Augmented Sentence : {random_insertion(sent,n)}")
    
aug('hey man how are you doing', p=0.2, n=2)

 Original Sentence : hey man how are you doing
 SR Augmented Sentence : hey gentleman how are you doing
 RD Augmented Sentence : hey man how are you
 RS Augmented Sentence : hey man how you are doing
 RI Augmented Sentence : hey man how mankind are you ar doing


- Augment training dataset and Re-train your model

In [49]:
# fill your code here
import random
data_train_path = os.path.join(ROOT_PATH+'/nli_data', 'train.jsonl')
data_augmented_path = os.path.join(ROOT_PATH+'/nli_data', 'train_augmented.jsonl')
with jsonlines.open(data_augmented_path, "w", flush=True) as writer:
  with jsonlines.open(data_train_path, "r") as reader:
      for sid, sample in enumerate(tqdm(reader.iter())):
        premise, hypothesis = sample['premise'], sample['hypothesis']
        # Deletion probability and number of edits, drawn once per sample.
        p = random.uniform(0, 1)
        n_premise = random.randint(1, len(premise.split()))
        n_hypothesis = random.randint(1, len(hypothesis.split()))

        # Candidate augmentations of the premise (swap needs at least two words).
        augmentations_premise = [
            synonym_replacement(premise, n_premise),
            random_deletion(premise, p, n_premise),
            random_insertion(premise, n_premise)
        ]
        if len(premise.split()) > 1:
          augmentations_premise.append(swap_word(premise))

        # Candidate augmentations of the hypothesis.
        augmentations_hypothesis = [
            synonym_replacement(hypothesis, n_hypothesis),
            random_deletion(hypothesis, p, n_hypothesis),
            random_insertion(hypothesis, n_hypothesis)
        ]
        if len(hypothesis.split()) > 1:
          augmentations_hypothesis.append(swap_word(hypothesis))

        # Keep the original sample, then add three augmented pairs with the same label.
        writer.write(sample)
        choices_premise = random.choices(range(len(augmentations_premise)), k=3)
        choices_hypothesis = random.choices(range(len(augmentations_hypothesis)), k=3)
        for a, b in zip(choices_premise, choices_hypothesis):
          sample_new = sample.copy()
          sample_new['premise'], sample_new['hypothesis'] = augmentations_premise[a], augmentations_hypothesis[b]
          writer.write(sample_new)

98176it [40:37, 40.27it/s]
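
As a rough sanity check against the 100M limit, the arithmetic behind the cell above can be sketched as follows. The cell writes each original sample plus three augmented copies. The sketch assumes augmented lines are about as large as the original ones, which is only approximately true, and uses the approximate 25M figure from the assignment text; `augmented_budget` is a hypothetical helper.

```python
ORIGINAL_SAMPLES = 98176   # from the tqdm counter printed by the cell above
ORIGINAL_SIZE_MB = 25      # approximate size quoted in the assignment text
LIMIT_MB = 100             # limit stated in the assignment text


def augmented_budget(copies_per_sample):
    """Return (total samples, approximate size in MB) after augmentation."""
    factor = 1 + copies_per_sample  # the original plus its augmented copies
    return ORIGINAL_SAMPLES * factor, ORIGINAL_SIZE_MB * factor


samples, size_mb = augmented_budget(3)
print(samples, size_mb, size_mb <= LIMIT_MB)  # 392704 100 True
```

The sample count agrees with the `392704it` counter printed later when `train_augmented.jsonl` is read back, and three copies per sample is the largest multiple that stays within the limit under this estimate.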


In [56]:
# sample train dataset by taking 10% of samples
import random
random.seed(42)
data_train_path = os.path.join(ROOT_PATH+'/nli_data', 'train.jsonl')
data_subset_path = os.path.join(ROOT_PATH+'/nli_data', 'train_subset.jsonl')
with jsonlines.open(data_subset_path, "w", flush=True) as writer:
  with jsonlines.open(data_train_path, "r") as reader:
      for sid, sample in enumerate(tqdm(reader.iter())):
        # Keep each sample independently with probability 0.1 (roughly 10% overall).
        if random.uniform(0, 1) <= 0.1:
          writer.write(sample)

98176it [00:02, 41148.77it/s]


In [63]:
# sample train + augmented dataset
import random
random.seed(4)
data_train_subset_path = os.path.join(ROOT_PATH+'/nli_data', 'train_subset.jsonl')
data_train_augmented_path = os.path.join(ROOT_PATH+'/nli_data', 'train_augmented.jsonl')
data_subset_path = os.path.join(ROOT_PATH+'/nli_data', 'train_augmented_subset.jsonl')
with jsonlines.open(data_subset_path, "w", flush=True) as writer:
  # Start from the 10% subset of the original training data.
  with jsonlines.open(data_train_subset_path, "r") as reader:
      for sid, sample in enumerate(tqdm(reader.iter())):
        writer.write(sample)

  # Add roughly 10% of the fully augmented file on top.
  with jsonlines.open(data_train_augmented_path, "r") as reader:
      for sid, sample in enumerate(tqdm(reader.iter())):
        if random.uniform(0, 1) <= 0.1:
          writer.write(sample)

9872it [00:01, 7825.43it/s]
392704it [00:10, 36318.25it/s]


In [51]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

train_dataset = NLIDataset(ROOT_PATH+"/nli_data/train_subset.jsonl", tokenizer)
dev_dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = ROOT_PATH+'/runs/task4/'

learning_rate = 2e-5 # play around with this hyperparameter

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root, save_on_acc=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

Building NLI Dataset...


9872it [00:08, 1151.61it/s]


Building NLI Dataset...


9815it [00:10, 925.21it/s] 
Training: 100%|██████████| 617/617 [01:19<00:00,  7.74it/s]
Evaluation: 100%|██████████| 614/614 [00:25<00:00, 24.29it/s]
  precision = tps / (tps + fps)


Epoch: 0 | Training Loss: 0.773 | Validation Loss: 0.643
Epoch 0 NLI Validation:
Accuracy: 41.33% | F1: (nan%, 49.09%, 52.14%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 617/617 [01:18<00:00,  7.89it/s]
Evaluation: 100%|██████████| 614/614 [00:25<00:00, 24.42it/s]


Epoch: 1 | Training Loss: 0.573 | Validation Loss: 0.541
Epoch 1 NLI Validation:
Accuracy: 46.78% | F1: (nan%, 55.20%, 58.21%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 617/617 [01:18<00:00,  7.84it/s]
Evaluation: 100%|██████████| 614/614 [00:24<00:00, 24.56it/s]


Epoch: 2 | Training Loss: 0.318 | Validation Loss: 0.659
Epoch 2 NLI Validation:
Accuracy: 47.78% | F1: (nan%, 56.95%, 59.16%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 617/617 [01:18<00:00,  7.85it/s]
Evaluation: 100%|██████████| 614/614 [00:25<00:00, 24.47it/s]


Epoch: 3 | Training Loss: 0.113 | Validation Loss: 1.262
Epoch 3 NLI Validation:
Accuracy: 48.25% | F1: (nan%, 57.28%, 60.19%) | Macro-F1: nan%
here
Model Saved!


In [18]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

train_dataset = NLIDataset(ROOT_PATH+"/nli_data/train_augmented_subset.jsonl", tokenizer)
dev_dataset = NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = ROOT_PATH+'/runs/task4/4.2/'

learning_rate = 2e-5 # play around with this hyperparameter

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root, save_on_acc=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

Building NLI Dataset...


13455it [00:12, 1214.45it/s]

expected string or bytes-like object


13937it [00:13, 1120.50it/s]

expected string or bytes-like object


15598it [00:14, 669.41it/s]

expected string or bytes-like object


16388it [00:16, 610.07it/s]

expected string or bytes-like object


16654it [00:16, 658.36it/s]

expected string or bytes-like object


23071it [00:26, 1170.40it/s]

expected string or bytes-like object


24174it [00:27, 1180.66it/s]

expected string or bytes-like object


24420it [00:27, 1161.87it/s]

expected string or bytes-like object


27158it [00:30, 1182.32it/s]

expected string or bytes-like object


28570it [00:31, 763.94it/s]

expected string or bytes-like object


29196it [00:32, 612.67it/s]

expected string or bytes-like object


30346it [00:34, 611.51it/s]

expected string or bytes-like object
expected string or bytes-like object


31069it [00:35, 628.83it/s]

expected string or bytes-like object


32964it [00:38, 1161.74it/s]

expected string or bytes-like object


34165it [00:39, 1147.35it/s]

expected string or bytes-like object


34406it [00:39, 1175.51it/s]

expected string or bytes-like object
expected string or bytes-like object


36939it [00:41, 1229.41it/s]

expected string or bytes-like object


41375it [00:45, 1168.54it/s]

expected string or bytes-like object


41995it [00:45, 1210.49it/s]

expected string or bytes-like object


44183it [00:47, 1196.29it/s]

expected string or bytes-like object


44680it [00:48, 712.74it/s]

expected string or bytes-like object


45300it [00:49, 627.23it/s]

expected string or bytes-like object


45499it [00:49, 626.88it/s]

expected string or bytes-like object


48137it [00:53, 654.82it/s]

expected string or bytes-like object
expected string or bytes-like object


49020it [00:54, 896.36it/s] 


Building NLI Dataset...


9815it [00:07, 1291.78it/s]
Training: 100%|██████████| 3063/3063 [07:15<00:00,  7.03it/s]
Evaluation: 100%|██████████| 614/614 [00:23<00:00, 25.64it/s]
  precision = tps / (tps + fps)


Epoch: 0 | Training Loss: nan | Validation Loss: 0.531
Epoch 0 NLI Validation:
Accuracy: 47.38% | F1: (nan%, 57.04%, 58.36%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 3063/3063 [07:09<00:00,  7.14it/s]
Evaluation: 100%|██████████| 614/614 [00:23<00:00, 26.68it/s]


Epoch: 1 | Training Loss: nan | Validation Loss: 0.488
Epoch 1 NLI Validation:
Accuracy: 49.72% | F1: (nan%, 60.52%, 60.34%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 3063/3063 [07:09<00:00,  7.13it/s]
Evaluation: 100%|██████████| 614/614 [00:23<00:00, 26.61it/s]


Epoch: 2 | Training Loss: nan | Validation Loss: 0.745
Epoch 2 NLI Validation:
Accuracy: 49.76% | F1: (nan%, 58.16%, 63.04%) | Macro-F1: nan%
here
Model Saved!


Training: 100%|██████████| 3063/3063 [07:09<00:00,  7.12it/s]
Evaluation: 100%|██████████| 614/614 [00:23<00:00, 26.47it/s]


Epoch: 3 | Training Loss: nan | Validation Loss: 1.134
Epoch 3 NLI Validation:
Accuracy: 49.69% | F1: (nan%, 57.65%, 63.21%) | Macro-F1: nan%
here
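
While building the augmented dataset, the log above prints `expected string or bytes-like object` several times. That is the message Python's `re` functions raise when given a non-string argument, so one possible cause (an assumption, since `NLIDataset` and `eda.py` are not shown here) is that some augmented premise or hypothesis is not a string. A defensive filter before training could look like this sketch, where `is_valid_sample` is a hypothetical helper and the records are toy data:

```python
def is_valid_sample(sample):
    """Check that premise and hypothesis are non-empty strings."""
    return all(
        isinstance(sample.get(key), str) and sample[key].strip()
        for key in ("premise", "hypothesis")
    )


# Toy records for illustration only.
records = [
    {"premise": "a dog runs", "hypothesis": "an animal moves", "label": 0},
    {"premise": None, "hypothesis": "an animal moves", "label": 0},
    {"premise": "a dog runs", "hypothesis": "", "label": 1},
]
clean = [r for r in records if is_valid_sample(r)]
print(len(clean))  # 1
```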


## Observation regarding Task 4.2
We follow this procedure in Task 4.2:
- We first sample a subset of the original training dataset by keeping roughly 10% of the data (9,872 samples).
- Then we train a second model on this subset plus roughly 10% of the augmented dataset (49,020 samples in total).
- We observe that the model trained in step 1 is less accurate on the in-domain validation set than the model trained in step 2 at every epoch: 41.33% vs. 47.38% after the first epoch and 48.25% vs. 49.69% after the last.
- This is expected, since step 2 uses more data than step 1. This experiment suggests that data augmentation is helpful in low-resource scenarios.
- The difference is most apparent in the first epoch, where the gap in accuracy is about 6 points; it narrows to about 1.4 points by the last epoch.
- In both runs the first F1 value and the Macro-F1 are reported as nan, and the training loss of the second run is nan, so this comparison relies on validation accuracy only.
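
The validation logs above print `nan%` for the first F1 value, together with a warning pointing at `precision = tps / (tps + fps)`. A plausible explanation, assumed here because the `evaluate` implementation is not shown in this section, is a 0/0 division when a class is never predicted. A guarded version of the per-class F1 could look like this sketch (`f1_from_counts` is a hypothetical helper, and returning 0.0 for an undefined score is one common convention):

```python
def f1_from_counts(tps, fps, fns):
    """F1 for one class, returning 0.0 instead of dividing by zero."""
    precision = tps / (tps + fps) if (tps + fps) > 0 else 0.0
    recall = tps / (tps + fns) if (tps + fns) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_from_counts(0, 0, 10))  # 0.0 (class never predicted)
print(f1_from_counts(5, 5, 5))   # 0.5
```

With such a guard the Macro-F1 would stay finite, which would allow the two runs to be compared on more than accuracy.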

### **5 Upload Your Notebook, Data and Models**

Please **rename** your filled jupyter notebook as **your Sciper number** and upload it to your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your processed (e.g., annotated and augmented) datasets, as well as all your trained models in Task 1 and Task 4, in your GitHub Classroom repository.

The datasets and models that you need to submit include:

**1. The best model checkpoint you trained in the Section 1.2 "Start Training and Validation!"**

**2. The best model prediction results in the Section 1.2 "Fine-Grained Validation"**

**3. Your annotated test dataset in the Section 3.2 "Annotate Your 100 Datapoints with Partner(s)"**

**4. Your augmented training data and best model checkpoint in the Section 4.2 "Augment Your Model"**

**Note:** You may need to use [GitHub LFS](https://edstem.org/eu/courses/379/discussion/27240) for submitting large files.

**Note**: For Task 4, as described above, I trained on a subset of the data. The following files are present in the `nli_data` folder:
- train_augmented.jsonl: This file contains the whole original training dataset (i.e. the 25M dataset) together with its augmented samples. The size of this file is about 100M, the limit stated in the notebook.
- train_subset.jsonl: This is a subset of the training dataset, containing a randomly sampled 10% of the samples from the original training data.
- train_augmented_subset.jsonl: This file contains all samples from `train_subset.jsonl` plus a randomly sampled 10% of the `train_augmented.jsonl` file.