# 📚  Exercise Session - Week 5

Welcome to Week 5 exercise session of CS552-Modern NLP!

We will continue playing with `DistilBert` this week, and learn about the dataset biases and prompting.

[Part 1: Biases](#bias0)
- [1.1 Hypothesis only NLI](#bias1)

[Part 2: Prompting](#prompt0)
- [2.1 Zero-shot Prompting](#promp1)
- [2.2 Few-shot Prompting](#promt2)

### 0. Setups

In [1]:
## Set up the device
import torch

if torch.cuda.is_available():
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built(): # enable the usage of Apple silicon
    DEVICE = torch.device('mps')
else:
    DEVICE = torch.device('cpu')

print(f"Using device on: [{DEVICE}]")

Using device on: [cpu]


<a name="bias0"></a>
## 1. Biases

Recall our knowledge about the NLI tasks, the model would be given a pair of sentence: `(premise, hypothesis)`, and needs to judge the relationship between them. Specifically, given the *premise*, if the *hypothesis* is **true (entailment)**, **false (condradiction)**, or **neither (neutral)**. Idealy, The label of the hypothesis should be entirely based upon the given premise. However, *if the model is able to correctly guess the label without seeing the premise, it is likely detecting biased statisitcal patterns that are undesirable*, such as tendency to use certain words among different classes (ex: using negation words such as 'not' for the contradiction label).

Inspired by the paper [Hypothesis Only Baselines in Natural Language Inference](https://aclanthology.org/S18-2023.pdf), the first part of this lab will investigate a classifier's internal bias when performing the NLI task by testing its hypothesis-only performance.

**`Note`** In this dataset the labels are as follows: `0-Entailment`, `1- Neutral`, and `2- Contradict`.

In [2]:
import json
import jsonpickle
import os
import sys
import random
import numpy as np
import pandas as pd
from tqdm import trange, tqdm
import matplotlib.pyplot as plt
from typing import List, Dict, Optional

import torch
import torch.nn as nn
from torch.utils.data import RandomSampler, DataLoader, SequentialSampler

import datasets
from datasets import load_dataset

from transformers import RobertaForMaskedLM,RobertaTokenizer, RobertaForSequenceClassification, DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from sklearn.metrics import accuracy_score, f1_score


  from .autonotebook import tqdm as notebook_tqdm


<a name="bias1"></a>
## 1.1 Train: Hypothesis only NLI

Let's firstly train a `distilbert` model on the SNLI dataset, but only access the hypothesis. We reuse the functions from the Exercise4.

In [3]:
def load_pretrained(model_name, num_labels=2):
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

  # Map the model onto targeted (predefined) DEVICE
  model = model.to(DEVICE)
  return tokenizer, model

In [4]:
train_dataset = load_dataset("snli", split='train')
test_dataset = load_dataset("snli", split='test')
train_dataset = train_dataset.filter(lambda example: example["label"]!=-1)
test_dataset = test_dataset.filter(lambda example: example["label"]!=-1)
print('#Training samples: ', len(train_dataset))
print('#Test samples: ', len(test_dataset))

tokenizer, model = load_pretrained('distilbert-base-uncased', num_labels=3)

#Training samples:  549367
#Test samples:  9824


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

In [6]:
def evaluate_model_nli(model, tokenizer, test_loader):
  all_labels = None
  all_preds = None

  for b in tqdm(test_loader):
    premise = b['premise']
    hypothesis = b['hypothesis']
    label = b['label']

    # step1: tokenize the text
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True)

    inputs = inputs.to(DEVICE)
    label = label.to(DEVICE)

    # step2: run the model to make the prediction
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1)

    if all_labels is None:
      all_labels = label.cpu()
      all_preds = pred.cpu()
    else:
      all_labels = torch.concat([all_labels, label.cpu()])
      all_preds = torch.concat([all_preds, pred.cpu()])

  assert len(all_preds)==len(all_labels), 'Test Failed. Check your code!'
  # step3: compute eval metrices
  # compute f1 score between model predictions and ground-truth labels (you can use sklearn.metrics)
  f1 = f1_score(all_labels, all_preds, average='macro')

  # compute accuracy score between model predictions and ground-truth labels (you can use sklearn.metrics)
  acc = accuracy_score(all_labels, all_preds)

  # compute the accuracy on Entailment(label==0) samples
  entailment_acc = accuracy_score(all_labels[all_labels==0], all_preds[all_labels==0])

  # compute the accuracy on Neutral(label==1) samples
  neutral_acc = accuracy_score(all_labels[all_labels==1], all_preds[all_labels==1])

  # compute the accuracy on Contradict(label==1) samples
  contradict_acc = accuracy_score(all_labels[all_labels==2], all_preds[all_labels==2])

  print('Accuracy: ', acc*100, '%')
  print(' -- Entailment Accuracy: ', entailment_acc*100, '%')
  print(' -- Neutral Accuracy: ', neutral_acc*100, '%')
  print(' -- Contradict Accuracy: ', contradict_acc*100, '%')
  print('F1 score: ', f1)

  return all_preds, all_labels, acc, f1

In [8]:
# ETS: <1min on colab T4 gpu

# Since the parameters of the classification head are not trained now, the results of this initial evaluation are basically random :)
all_preds, all_labels, acc, f1 = evaluate_model_nli(model, tokenizer, test_loader)

100%|██████████| 614/614 [02:32<00:00,  4.03it/s]

Accuracy:  33.876221498371336 %
 -- Entailment Accuracy:  81.59144893111639 %
 -- Neutral Accuracy:  17.707362534948743 %
 -- Contradict Accuracy:  0.3089280197713933 %
F1 score:  0.23918460087939894





`TODO-1`: Implement the `tokenize_function` to tokenize only the `hypothesis` in each input `examples`.

In [None]:
# TODO: Define a function to tokenize the text
def tokenize_function(examples, hyp_only=True, max_length=512, device=DEVICE):
  '''
  INPUT:
    examples: input samples in the dataset
    hyp_only: if True, only tokenize the "hypothesis"; tokenize both "premise" and "hypothesis" if False
    max_length: maximal number of tokens
    device: cuda or cpu or mps (default as the pre-defined DEVICE)
  OUTPUT:
    tokenized: tokenized sample, truncation=True, padding=True
  '''
  if hyp_only:
    # TODO
        tokenized = tokenizer(examples["hypothesis"], max_length=max_length, padding=True, truncation = True)
  else:
    # TODO 
    tokenized = tokenizer(examples["premise"], examples["hypothesis"], max_length=max_length, padding=True, truncation = True)
  return tokenized

# Tokenize the train and test data
tokenized_train = train_dataset.map(tokenize_function, batched = True)
tokenized_test = test_dataset.map(tokenize_function, batched = True)

# Define a data collator to handle padding
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

Map: 100%|██████████| 549367/549367 [00:33<00:00, 16373.43 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 17387.01 examples/s]


In [10]:
# Import the trainer and training arguments
from transformers import TrainingArguments, Trainer

# Define the output directory and other training arguments
output_dir_name = "snli-hyp-distilbert"

training_args = TrainingArguments(
   output_dir = output_dir_name,
   learning_rate = 2e-5,
   per_device_train_batch_size = 16,
   per_device_eval_batch_size = 16,
   num_train_epochs = 1,
   max_steps = 5000,
   weight_decay = 0.01,
   save_strategy = "steps",
   save_steps = 500,
   push_to_hub = False,
)

# Initialize the trainer
trainer = Trainer(
   model = model,
   args = training_args,
   train_dataset = tokenized_train,
   tokenizer = tokenizer,
   data_collator = data_collator,
)

  trainer = Trainer(


In [11]:
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize


ValueError: API key must be 40 characters long, yours was 2

In [12]:
# ETS: <1min on colab T4 gpu
all_preds, all_labels, acc, f1 = evaluate_model_nli(model, tokenizer, test_loader)

100%|██████████| 614/614 [02:31<00:00,  4.04it/s]

Accuracy:  33.27565146579804 %
 -- Entailment Accuracy:  63.42042755344418 %
 -- Neutral Accuracy:  29.574401988195092 %
 -- Contradict Accuracy:  5.591597157862218 %
F1 score:  0.28203308929406345





As you see, the model is able to correctly guess the labels of almost **60-70%** of the NLI hypotheses without seeing what the premise is.

Question: What do you think are ways that biases can be mitigated? Think about both the data collection process and model training for places where one can intervene.

<a name="prompt0"></a>
## 2. Prompting

The following sections will be based on the papers [Exploiting Cloze Questions for Few Shot Text Classification and Natural
Language Inference](https://arxiv.org/pdf/2001.07676.pdf) and [How Many Data Points is a Prompt Worth?](https://arxiv.org/pdf/2103.08493.pdf).

The first paper introduced Pattern Exploiting Training (PET), in which a NLP task is reformulated to a cloze style task for few shot learning. We will go into this a little more during the few-shot section of this lab.

The basic idea is to only tune a linear classification head (mostly be a MLP layer, attached on the top of original pretrained model) instead of training the entire model to perform the classification task. Unlike language modeling, which predicts the next-token from the whole vocaulary, we are predicting a word from a list of **verbalizers**, where each verbalizer corresponds to one label.

### NLI and Sentiment classification
We will be looking at classification tasks (NLI and sentiment) where we only need a single word verbalizers. However this paradigm can be extended to more complex tasks, with multi-token verbalizers.

First lets try **zero-shots prompting**. We will use `Roberta-large` for this section and investigate an easier `sentiment-analysis` task on IMDB dataset.


In [13]:
test_dataset = load_dataset('imdb', split='test')

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaForMaskedLM.from_pretrained('roberta-large')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


`TODO-2`: Complete the `lm_guess_sent` function to get the probability of each verbalizer in the `<mask>`, then make the prediction.

In [None]:
def get_targets(verbalizer = 1):  #retreives the token ids for the verbalizers
  targets = verbalize(verbalizer).keys()
  target_ids = []
  for target in targets:
    id= tokenizer.get_vocab().get("\u0120"+ target, None) #how roberta ecodes wods
    target_ids.append(id)
  return target_ids

def lm_guess_sent(model, text, template_num = 1, verb_num = 1, context_samples = None, context_labels = None, device=DEVICE):
  model = model.to(device)

  verbalizer = verbalize(verb_num) # choose a pair of verbalizers
  target_ids = get_targets(verb_num) # get ids of verbalizers
  text_template = template(text, template_num, context_samples=context_samples, context_labels=context_labels) # get a template with text

  # TODO: encode texts with the template (text_template), return tensor
  encoded_input = tokenizer(text_template, return_tensors='pt', padding='longest', truncation=True).to(device)

  masked_index = torch.nonzero(encoded_input["input_ids"][0] == tokenizer.mask_token_id, as_tuple=False).squeeze(-1).to(device) #getting index of mask token
  model_outputs = model(**encoded_input)
  outputs = model_outputs["logits"]

  # TODO: get the logits for masked tokens
  logits = outputs[0, masked_index, :] # get the logits for the masked

  probs = logits.softmax(dim=-1) # probability of tokens

  # TODO: get the probability of the two verbalizer tokens
  probs = probs[..., target_ids]

  # TODO: get prediction as the index with higher probability
  _, predictions = probs.topk(1)
  input_ids = encoded_input["input_ids"][0]
  tokens = input_ids.detach().cpu().numpy().copy()
  p = target_ids[predictions]

  prediction = verbalizer[tokenizer.decode([p]).strip()] #get corresponging label from verbalizer
  return prediction


<a name="prompt1"></a>
### 2.1 Zero-shot Prompting


We will be using the IMDB dataset again to test prompting in the zero shot setting.

We need two things to do the prompting

- a **Verbalizer** that matches a word to each label
- a **Template** to add the review, with one masked token that will predict one of the verbalizers

Success of this method varies by template and verbalizer, so it is nice to test a few.

In [15]:
def verbalize(num = 1):
  if num == 1:
    return {"great":1, "horrible":0}
  if num == 2:
    return {"great":1, "terrible":0}


def template(text, num = 1, context_samples=None, context_labels=None):
    if num == 1:
      return "It was <mask>." + text
    if num == 2:
      return "So <mask>!" + text


Alright, lets see how the pre-trained roberta does on the prompted sentiment analysis.

#### Verbalizer #1

In [16]:
test_data_subset = pd.DataFrame(test_dataset[random.choices(range(len(test_dataset)), k=500)])

guess = test_data_subset.apply(lambda x: lm_guess_sent(model, x['text'], template_num = 1, verb_num = 1), axis=1).tolist()


Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The expanded size of the tensor (1076) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1076].  Tensor sizes: [1, 514]

In [None]:
test_data_subset['guess'] = guess

print("Accuracy :", accuracy_score(test_data_subset['label'], test_data_subset['guess']))
print("Positive Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==1]['label'], test_data_subset[test_data_subset['label']==1]['guess']))
print("Negative Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==0]['label'], test_data_subset[test_data_subset['label']==0]['guess']))
print("F1 :", f1_score(test_data_subset['label'], test_data_subset['guess'], average = 'micro'))


Accuracy : 0.878
Positive Accuracy : 0.95
Negative Accuracy : 0.8115384615384615
F1 : 0.878


It seems the first verbalizer works better for the Positive reviews. How can we improve the performance without retrain or finetune the model?

#### Verbalizer #2
Lets try different verbalizers (selection 2).






In [None]:
guess = test_data_subset.apply(lambda x: lm_guess_sent(model, x['text'], template_num = 1, verb_num = 2), axis=1).tolist()


In [None]:
test_data_subset['guess2'] = guess

print("Accuracy :", accuracy_score(test_data_subset['label'], test_data_subset['guess2']))
print("Positive Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==1]['label'], test_data_subset[test_data_subset['label']==1]['guess2']))
print("Negative Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==0]['label'], test_data_subset[test_data_subset['label']==0]['guess2']))

print("F1 :", f1_score(test_data_subset['label'], test_data_subset['guess2'], average = 'micro'))


Accuracy : 0.898
Positive Accuracy : 0.9125
Negative Accuracy : 0.8846153846153846
F1 : 0.898


**Thinking**: The second verbalizer seems work well with ~90% accuracy in both classes! Why do you think this formulation of the task works in the zero-shot setting? Can you think of any ways to *pick the most effective verbalizers* in a more systematic way?

You can feel free to try your own templates/verbalizers to see how your design choices affect performance, and which ones could improve performance.

<a name="prompt2"></a>
### 2.2 Few-shot Prompting

Now given that we have an access to a very small labeled dataset (e.g. 5 samples), how can we make a great use of these information?

If we finetune the model on these 5 samples, the model is very likely to overfit to some biased shortcuts. **Recall the prompting trick, do we have some ways to re-design the template to combine the labeled samples?**

In [None]:
train_data = load_dataset('imdb', split='train')
train_data = train_data.shuffle(seed=42)
fewshot_samples = train_data.select(range(10))

context_samples = fewshot_samples['text']
context_labels = fewshot_samples['label']

In [None]:
def verbalize(num = 1):
  if num == 1:
    return {"great":1, "horrible":0}
  if num == 2:
    return {"great":1, "terrible":0}


def template(text, num = 1, context_samples = None, context_labels = None):
    if num == 1:
      temp = "It was <mask>." + text
      pos_prefix = "It was great."
      neg_prefix = "It was horrible."
    elif num == 2:
      temp = "So <mask>!" + text
      pos_prefix = "It was great."
      neg_prefix = "It was terrible."
    else:
      raise NotImplemented

    # Build 'Context' with few-shot labeled samples
    if context_samples is not None:
      assert context_labels is not None, 'Please provide labels to the few-shot samples!'
      context = ''
      for c,y in zip(context_samples, context_labels):
        if y==0:
          context += (neg_prefix+' '.join(c.split(' ')[:25])+'//')
        elif y==1:
          context += (pos_prefix+' '.join(c.split(' ')[:25])+'//')
      return context+temp
    return temp



In [None]:
# Let's see how the template would look like
template(test_data_subset['text'][0], num = 1, context_samples = context_samples, context_labels = context_labels)

'It was great.There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier//It was great.This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of//It was horrible.George P. Cosmatos\' "Rambo: First Blood Part II" is pure wish-fulfillment. The United States clearly didn\'t win the war in Vietnam. They caused damage to//It was great.In the process of trying to establish the audiences\' empathy with Jake Roedel (Tobey Maguire) the filmmakers slander the North and the Jayhawkers. Missouri never//It was horrible.Yeh, I know -- you\'re quivering with excitement. Well, *The Secret Lives of Dentists* will not upset your expectations: it\'s solidly made but essentially unimaginative,//It was great.While this movie\'s style isn\'t as understated and realistic as a sound version probably would have been, this is still a very good film. In//It was 

Then we can test how the model performs with **few-shot prompting**.

#### Verbalizer number 1

In [None]:
guess = test_data_subset.apply(lambda x: lm_guess_sent(model, x['text'], template_num = 1, verb_num = 1, context_samples=context_samples, context_labels=context_labels), axis=1).tolist()


In [None]:
test_data_subset['10shots-guess'] = guess

print("Accuracy :", accuracy_score(test_data_subset['label'], test_data_subset['10shots-guess']))
print("Positive Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==1]['label'], test_data_subset[test_data_subset['label']==1]['10shots-guess']))
print("Negative Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==0]['label'], test_data_subset[test_data_subset['label']==0]['10shots-guess']))
print("F1 :", f1_score(test_data_subset['label'], test_data_subset['10shots-guess'], average = 'micro'))


Accuracy : 0.556
Positive Accuracy : 1.0
Negative Accuracy : 0.059322033898305086
F1 : 0.556


#### Verbalizer number 2

Now we will use different verbalizers to see how the model performs.

In [None]:
guess = test_data_subset.apply(lambda x: lm_guess_sent(model, x['text'], template_num = 1, verb_num = 2, context_samples=context_samples, context_labels=context_labels), axis=1).tolist()


In [None]:
test_data_subset['10shots-guess2'] = guess

print("Accuracy :", accuracy_score(test_data_subset['label'], test_data_subset['10shots-guess2']))
print("Positive Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==1]['label'], test_data_subset[test_data_subset['label']==1]['10shots-guess2']))
print("Negative Accuracy :", accuracy_score(test_data_subset[test_data_subset['label']==0]['label'], test_data_subset[test_data_subset['label']==0]['10shots-guess2']))
print("F1 :", f1_score(test_data_subset['label'], test_data_subset['10shots-guess2'], average = 'micro'))


Accuracy : 0.528
Positive Accuracy : 1.0
Negative Accuracy : 0.0
F1 : 0.528


**Thinking**: What do you think of the performance? Why do you think it could happen? What can we do to improve?


Feel free to process or change the prompts/contexts as you like, then you can see how your design choices could influence the few-shot prompting performance :)