# Multiple Choice Commonsense Reasoning - SocialIQA

> Fine-tuning Pre-trained Language Models on Multiple Choice Question Answering for Commonsense Reasoning 

> Based on HuggingFace Transformers

> Chaehyeong Kim, CONVEI Lab 

In [None]:
!pip install transformers==4.26.1 evaluate datasets


In [None]:
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
import transformers
transformers.logging.set_verbosity_error()

## Load Library

In [None]:
import os
import json
import random
import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForMultipleChoice, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate

from tqdm.auto import tqdm

## Set Hyperparameters

In [None]:
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
DEVICE

device(type='cuda')

In [None]:
BASE_DIR = os.getcwd()
OUTPUT_DIR = os.path.join(BASE_DIR, 'checkpoint')
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [None]:
MODEL_NAME = 'bert-base-uncased'

## Load Tokenizer and Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMultipleChoice.from_pretrained(MODEL_NAME).to(DEVICE)


In [None]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

## Load Data

In [None]:
datasets = load_dataset('social_i_qa')


In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answerA', 'answerB', 'answerC', 'label'],
        num_rows: 33410
    })
    validation: Dataset({
        features: ['context', 'question', 'answerA', 'answerB', 'answerC', 'label'],
        num_rows: 1954
    })
})

In [None]:
train_dataset = load_dataset('social_i_qa', split='train[:70%]')
valid_dataset = load_dataset('social_i_qa', split='train[70%:]')
test_dataset = load_dataset('social_i_qa', split='validation')



In [None]:
print(len(train_dataset))
print(len(valid_dataset))
print(len(test_dataset))

23387
10023
1954


In [None]:
train_dataset[0]

{'context': 'Cameron decided to have a barbecue and gathered her friends together.',
 'question': 'How would Others feel as a result?',
 'answerA': 'like attending',
 'answerB': 'like staying home',
 'answerC': 'a good friend to have',
 'label': '1'}

In [None]:
from datasets import ClassLabel
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets['train'])

| | context | question | answerA | answerB | answerC | label |
|---|---|---|---|---|---|---|
| 0 | Aubrey bought a large chocolate cake for her birthday, ate the whole thing by herself, and then ordered a pizza. | How would you describe Aubrey? | disciplined | lazy | ravenous | 3 |
| 1 | Kendall had her friends over for dinner and gave them a warm welcome when they arrived. | How would Others feel as a result? | like a good hostess | excited about the dinner | accepted | 3 |
| 2 | Carson decided he wanted a new start and moved to another state. | How would Carson feel afterwards? | scared to start fresh | ready to start over | tired of the old place | 2 |
| 3 | Austin did not bring a bike. Austin rode Jesse's bike. | How would Austin feel afterwards? | unaware | thrifty | indebted | 3 |
| 4 | Tracy went to work with a friend on selling computers door to door. | What will Tracy want to do next? | rest on her laurels | learn more about the industry | break away from the partnership | 2 |
| 5 | Out of breath and winded, Tracy finally came within sight not far off from the finish line. | How would Tracy feel afterwards? | burdened by the last hurdle | committed | near | 2 |
| 6 | In a photo editing app, Cameron was being silly, and Cameron turned Taylor's face upside down. | What will Taylor want to do next? | throw the photo away | show the photo to Tyler | burn the photo into ash | 1 |
| 7 | After making a deal and almost signing with one company, Riley accepted another offer. | What will Riley want to do next? | Apologise to first company | Pretend like nothing happened | apply for multiple jobs | 1 |
| 8 | Remy spent the night in the hospital. They had a heart attack. | How would Remy feel afterwards? | gravely ill | glad to be taken care off | much worse | 2 |
| 9 | Sydney turned red when Carl cracked a sex joke and stared at her. | How would you describe Sydney? | Sidney did not like the sexual attention and wanted to have sex with him that night | embarrassed | Sidney did not like the sexual attention which she thought premature | 2 |


## Preprocess Data

In [None]:
answer_names = ['answerA', 'answerB', 'answerC']
num_choices = len(answer_names)

In [None]:
def preprocess_function(examples):
    num_examples = len(examples['context'])
    # Repeat each first sentence three times (=num_choices) to go with the three possibilities of second sentences.
    first_sentences = [[examples['context'][i] + tokenizer.sep_token + examples['question'][i]] * num_choices for i in range(num_examples)]
    second_sentences = [[examples[answer][i] for answer in answer_names] for i in range(num_examples)]
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i : i + num_choices] for i in range(0, len(v), num_choices)] for k, v in tokenized_examples.items()}

This function returns, for each key, a list of lists of lists: the outer list has one entry per example (5 in the check below), each example holds one entry per choice (3), and each choice is a list of input IDs. The innermost lengths vary because no padding has been applied yet.
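The flatten, tokenize, regroup pattern can be tried without downloading a model. The sketch below is only an illustration: `toy_tokenize` is a hypothetical whitespace splitter standing in for the real tokenizer, and `flatten_and_regroup` is a made-up name that mirrors the structure of `preprocess_function`.

```python
NUM_CHOICES = 3

def toy_tokenize(first, second):
    # Hypothetical stand-in for the real tokenizer: joins each pair with
    # "[SEP]" and splits on whitespace instead of producing vocabulary ids.
    return {"input_ids": [(a + " [SEP] " + b).split() for a, b in zip(first, second)]}

def flatten_and_regroup(contexts, answers_per_example):
    # Repeat each context once per choice and flatten into two parallel lists.
    first = [c for c in contexts for _ in range(NUM_CHOICES)]
    second = [a for answers in answers_per_example for a in answers]
    encoded = toy_tokenize(first, second)
    # Regroup every NUM_CHOICES consecutive encodings into one example.
    return {
        k: [v[i : i + NUM_CHOICES] for i in range(0, len(v), NUM_CHOICES)]
        for k, v in encoded.items()
    }

contexts = ["Cameron hosted a barbecue.", "Quinn cleaned the room."]
answers = [
    ["like attending", "like staying home", "a good friend to have"],
    ["eat snacks", "help out a friend", "pick up clothes"],
]
grouped = flatten_and_regroup(contexts, answers)
# Number of examples, choices per example, and token counts of the first example.
print(len(grouped["input_ids"]), len(grouped["input_ids"][0]))
print([len(ids) for ids in grouped["input_ids"][0]])
```

The nesting is examples, then choices, then tokens, which is the same layout the real `preprocess_function` produces for `input_ids` and `attention_mask`.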

In [None]:
examples = datasets['train'][:5]
features = preprocess_function(examples)
print(len(features['input_ids']), len(features['input_ids'][0]), [len(x) for x in features['input_ids'][0]])

5 3 [26, 27, 29]


In [None]:
idx = 0
[tokenizer.decode(features['input_ids'][idx][i]) for i in range(num_choices)]

['[CLS] cameron decided to have a barbecue and gathered her friends together. [SEP] how would others feel as a result? [SEP] like attending [SEP]',
 '[CLS] cameron decided to have a barbecue and gathered her friends together. [SEP] how would others feel as a result? [SEP] like staying home [SEP]',
 '[CLS] cameron decided to have a barbecue and gathered her friends together. [SEP] how would others feel as a result? [SEP] a good friend to have [SEP]']

In [None]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)
tokenized_valid_data = valid_dataset.map(preprocess_function, batched=True)


In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        # SocialIQA labels are the strings '1'-'3'; convert them to 0-based ints.
        labels = [int(feature.pop(label_name))-1 for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

When called on a list of examples, the collator flattens all the `input_ids`/`attention_mask` entries into one long list and passes it to the `tokenizer.pad` method. This returns a dictionary of tensors of shape `(batch_size * num_choices, seq_length)`, which we then reshape to `(batch_size, num_choices, seq_length)`.
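To make the shape bookkeeping concrete, here is a plain-Python sketch of the same flatten, pad, un-flatten sequence. It is an illustration under stated assumptions, not the collator itself: `collate` and `PAD_ID` are hypothetical names, nested lists stand in for tensors, and padding is done by hand rather than by `tokenizer.pad`.

```python
PAD_ID = 0  # assumed padding id for this sketch

def collate(features):
    """features: list of examples, each a list of num_choices token-id lists."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten to (batch_size * num_choices) sequences.
    flat = [seq for example in features for seq in example]
    # Pad every sequence to the longest one in the batch.
    max_len = max(len(seq) for seq in flat)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in flat]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in flat]

    def regroup(rows):
        # Un-flatten back to batch_size x num_choices x max_len.
        return [rows[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(mask)}

features = [
    [[101, 7, 102], [101, 7, 8, 102], [101, 7, 8, 9, 102]],
    [[101, 5, 102], [101, 5, 6, 102], [101, 102]],
]
batch = collate(features)
print(len(batch["input_ids"]), len(batch["input_ids"][0]), len(batch["input_ids"][0][0]))
```

Every choice in the batch ends up with the same length, so the nested lists could be converted to a single tensor of shape `(batch_size, num_choices, seq_length)`.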

In [None]:
accepted_keys = ['input_ids', 'attention_mask', 'label']
features = [{k: v for k, v in tokenized_train_data[i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

In [None]:
[tokenizer.decode(batch['input_ids'][8][i].tolist()) for i in range(3)]

['[CLS] quinn wanted to help me clean my room up because it was so messy. [SEP] what will quinn want to do next? [SEP] eat messy snacks [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] quinn wanted to help me clean my room up because it was so messy. [SEP] what will quinn want to do next? [SEP] help out a friend [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] quinn wanted to help me clean my room up because it was so messy. [SEP] what will quinn want to do next? [SEP] pick up the dirty clothes [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

## Train Model

In [None]:
accuracy = evaluate.load('accuracy')


In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',
    save_strategy='steps',
    logging_strategy='steps',
    eval_steps=200,
    save_steps=200,
    logging_steps=20,
    save_total_limit=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=0.1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    seed=516,
    logging_first_step=True,
    load_best_model_at_end=True,
    metric_for_best_model='loss',
    greater_is_better=False,
    push_to_hub=False,
)
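A note on checkpoint selection: with `load_best_model_at_end=True`, `metric_for_best_model='loss'` and `greater_is_better=False`, the checkpoint with the lowest evaluation loss is the one reloaded after training. The sketch below replays that selection on the evaluation values from this notebook's training log (rounded to four decimals). The `eval_log` list and variable names are illustrative only and are not part of the Trainer API.

```python
# (step, eval_loss, eval_accuracy), rounded from the training log in this notebook.
eval_log = [
    (200, 0.7941, 0.6448),
    (400, 0.7077, 0.7024),
    (600, 0.6758, 0.7163),
    (800, 0.7006, 0.7287),
    (1000, 0.7170, 0.7269),
    (1200, 0.6870, 0.7316),
    (1400, 0.6728, 0.7333),
    (1600, 0.8570, 0.7334),
    (1800, 0.8285, 0.7286),
    (2000, 0.8467, 0.7358),
]

# Selection used in this run: lowest evaluation loss wins.
best_by_loss = min(eval_log, key=lambda row: row[1])
# Alternative criterion for comparison: highest evaluation accuracy.
best_by_accuracy = max(eval_log, key=lambda row: row[2])

print("best by loss:", best_by_loss)
print("best by accuracy:", best_by_accuracy)
```

The two criteria pick different checkpoints in this run, so switching `metric_for_best_model` to `'accuracy'` (with `greater_is_better=True`) would likely reload a later checkpoint. Whether that helps on the test set is not something this log can tell us.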

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_valid_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: answerC, answerA, answerB, question, context. If answerC, answerA, answerB, question, context are not expected by `BertForMultipleChoice.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 23387
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 2
  Total optimization steps = 2193
  Number of trainable parameters = 109483009


{'loss': 1.1277, 'learning_rate': 4.997720018239854e-05, 'epoch': 0.0}
{'loss': 1.0982, 'learning_rate': 4.954400364797081e-05, 'epoch': 0.03}
{'loss': 1.0244, 'learning_rate': 4.908800729594164e-05, 'epoch': 0.05}
{'loss': 0.9792, 'learning_rate': 4.863201094391245e-05, 'epoch': 0.08}
{'loss': 0.9231, 'learning_rate': 4.8176014591883265e-05, 'epoch': 0.11}
{'loss': 0.8773, 'learning_rate': 4.772001823985408e-05, 'epoch': 0.14}
{'loss': 0.9189, 'learning_rate': 4.72640218878249e-05, 'epoch': 0.16}
{'loss': 0.8316, 'learning_rate': 4.680802553579571e-05, 'epoch': 0.19}
{'loss': 0.8826, 'learning_rate': 4.6352029183766534e-05, 'epoch': 0.22}
{'loss': 0.8651, 'learning_rate': 4.5896032831737345e-05, 'epoch': 0.25}


The following columns in the evaluation set don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: answerC, answerA, answerB, question, context. If answerC, answerA, answerB, question, context are not expected by `BertForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.8621, 'learning_rate': 4.544003647970816e-05, 'epoch': 0.27}


Saving model checkpoint to /content/checkpoint/checkpoint-200
Configuration saved in /content/checkpoint/checkpoint-200/config.json


{'eval_loss': 0.7940899729728699, 'eval_accuracy': 0.6448169210815126, 'eval_runtime': 87.0256, 'eval_samples_per_second': 115.173, 'eval_steps_per_second': 7.205, 'epoch': 0.27}


Model weights saved in /content/checkpoint/checkpoint-200/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-200/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-200/special_tokens_map.json


{'loss': 0.8356, 'learning_rate': 4.498404012767898e-05, 'epoch': 0.3}
{'loss': 0.8107, 'learning_rate': 4.45280437756498e-05, 'epoch': 0.33}
{'loss': 0.7912, 'learning_rate': 4.407204742362061e-05, 'epoch': 0.36}
{'loss': 0.8026, 'learning_rate': 4.361605107159143e-05, 'epoch': 0.38}
{'loss': 0.8193, 'learning_rate': 4.316005471956224e-05, 'epoch': 0.41}
{'loss': 0.8107, 'learning_rate': 4.270405836753306e-05, 'epoch': 0.44}
{'loss': 0.8157, 'learning_rate': 4.2248062015503877e-05, 'epoch': 0.47}
{'loss': 0.768, 'learning_rate': 4.1792065663474694e-05, 'epoch': 0.49}
{'loss': 0.7383, 'learning_rate': 4.1336069311445504e-05, 'epoch': 0.52}




{'loss': 0.7413, 'learning_rate': 4.088007295941633e-05, 'epoch': 0.55}


Saving model checkpoint to /content/checkpoint/checkpoint-400
Configuration saved in /content/checkpoint/checkpoint-400/config.json


{'eval_loss': 0.7077067494392395, 'eval_accuracy': 0.7023845156140875, 'eval_runtime': 86.8596, 'eval_samples_per_second': 115.393, 'eval_steps_per_second': 7.219, 'epoch': 0.55}


Model weights saved in /content/checkpoint/checkpoint-400/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-400/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-400/special_tokens_map.json


{'loss': 0.7331, 'learning_rate': 4.042407660738714e-05, 'epoch': 0.57}
{'loss': 0.7448, 'learning_rate': 3.9968080255357956e-05, 'epoch': 0.6}
{'loss': 0.7354, 'learning_rate': 3.9512083903328774e-05, 'epoch': 0.63}
{'loss': 0.6658, 'learning_rate': 3.905608755129959e-05, 'epoch': 0.66}
{'loss': 0.7313, 'learning_rate': 3.860009119927041e-05, 'epoch': 0.68}
{'loss': 0.6673, 'learning_rate': 3.8144094847241226e-05, 'epoch': 0.71}
{'loss': 0.7464, 'learning_rate': 3.7688098495212036e-05, 'epoch': 0.74}
{'loss': 0.6994, 'learning_rate': 3.7232102143182854e-05, 'epoch': 0.77}
{'loss': 0.7052, 'learning_rate': 3.677610579115367e-05, 'epoch': 0.79}




{'loss': 0.729, 'learning_rate': 3.632010943912449e-05, 'epoch': 0.82}


Saving model checkpoint to /content/checkpoint/checkpoint-600
Configuration saved in /content/checkpoint/checkpoint-600/config.json


{'eval_loss': 0.6758001446723938, 'eval_accuracy': 0.7162526189763544, 'eval_runtime': 86.9617, 'eval_samples_per_second': 115.258, 'eval_steps_per_second': 7.21, 'epoch': 0.82}


Model weights saved in /content/checkpoint/checkpoint-600/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-600/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-600/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-200] due to args.save_total_limit


{'loss': 0.767, 'learning_rate': 3.5864113087095306e-05, 'epoch': 0.85}
{'loss': 0.6626, 'learning_rate': 3.540811673506612e-05, 'epoch': 0.88}
{'loss': 0.7229, 'learning_rate': 3.4952120383036933e-05, 'epoch': 0.9}
{'loss': 0.6922, 'learning_rate': 3.449612403100775e-05, 'epoch': 0.93}
{'loss': 0.6326, 'learning_rate': 3.404012767897857e-05, 'epoch': 0.96}
{'loss': 0.6536, 'learning_rate': 3.3584131326949385e-05, 'epoch': 0.98}
{'loss': 0.6742, 'learning_rate': 3.31281349749202e-05, 'epoch': 1.01}
{'loss': 0.5375, 'learning_rate': 3.267213862289102e-05, 'epoch': 1.04}
{'loss': 0.5207, 'learning_rate': 3.221614227086183e-05, 'epoch': 1.07}




{'loss': 0.5332, 'learning_rate': 3.176014591883265e-05, 'epoch': 1.09}


Saving model checkpoint to /content/checkpoint/checkpoint-800
Configuration saved in /content/checkpoint/checkpoint-800/config.json


{'eval_loss': 0.7005630135536194, 'eval_accuracy': 0.7287239349496158, 'eval_runtime': 85.7014, 'eval_samples_per_second': 116.953, 'eval_steps_per_second': 7.316, 'epoch': 1.09}


Model weights saved in /content/checkpoint/checkpoint-800/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-800/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-800/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-400] due to args.save_total_limit


{'loss': 0.4998, 'learning_rate': 3.1304149566803465e-05, 'epoch': 1.12}
{'loss': 0.4784, 'learning_rate': 3.084815321477428e-05, 'epoch': 1.15}
{'loss': 0.5086, 'learning_rate': 3.0392156862745097e-05, 'epoch': 1.18}
{'loss': 0.5643, 'learning_rate': 2.9936160510715917e-05, 'epoch': 1.2}
{'loss': 0.4753, 'learning_rate': 2.9480164158686728e-05, 'epoch': 1.23}
{'loss': 0.4942, 'learning_rate': 2.902416780665755e-05, 'epoch': 1.26}
{'loss': 0.4964, 'learning_rate': 2.8568171454628362e-05, 'epoch': 1.29}
{'loss': 0.4991, 'learning_rate': 2.811217510259918e-05, 'epoch': 1.31}
{'loss': 0.4926, 'learning_rate': 2.7656178750569994e-05, 'epoch': 1.34}




{'loss': 0.4683, 'learning_rate': 2.7200182398540814e-05, 'epoch': 1.37}


Saving model checkpoint to /content/checkpoint/checkpoint-1000
Configuration saved in /content/checkpoint/checkpoint-1000/config.json


{'eval_loss': 0.7169884443283081, 'eval_accuracy': 0.7269280654494662, 'eval_runtime': 87.0822, 'eval_samples_per_second': 115.098, 'eval_steps_per_second': 7.2, 'epoch': 1.37}


Model weights saved in /content/checkpoint/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1000/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-800] due to args.save_total_limit


{'loss': 0.5223, 'learning_rate': 2.674418604651163e-05, 'epoch': 1.4}
{'loss': 0.5536, 'learning_rate': 2.6288189694482446e-05, 'epoch': 1.42}
{'loss': 0.5101, 'learning_rate': 2.583219334245326e-05, 'epoch': 1.45}
{'loss': 0.5023, 'learning_rate': 2.5376196990424077e-05, 'epoch': 1.48}
{'loss': 0.551, 'learning_rate': 2.4920200638394894e-05, 'epoch': 1.5}
{'loss': 0.5166, 'learning_rate': 2.446420428636571e-05, 'epoch': 1.53}
{'loss': 0.4603, 'learning_rate': 2.4008207934336525e-05, 'epoch': 1.56}
{'loss': 0.5046, 'learning_rate': 2.3552211582307343e-05, 'epoch': 1.59}
{'loss': 0.4824, 'learning_rate': 2.309621523027816e-05, 'epoch': 1.61}




{'loss': 0.4997, 'learning_rate': 2.2640218878248974e-05, 'epoch': 1.64}


Saving model checkpoint to /content/checkpoint/checkpoint-1200
Configuration saved in /content/checkpoint/checkpoint-1200/config.json


{'eval_loss': 0.6870073676109314, 'eval_accuracy': 0.7316172802554125, 'eval_runtime': 87.2547, 'eval_samples_per_second': 114.871, 'eval_steps_per_second': 7.186, 'epoch': 1.64}


Model weights saved in /content/checkpoint/checkpoint-1200/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1200/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1200/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1000] due to args.save_total_limit


{'loss': 0.5088, 'learning_rate': 2.218422252621979e-05, 'epoch': 1.67}
{'loss': 0.4965, 'learning_rate': 2.172822617419061e-05, 'epoch': 1.7}
{'loss': 0.4811, 'learning_rate': 2.1272229822161423e-05, 'epoch': 1.72}
{'loss': 0.4798, 'learning_rate': 2.081623347013224e-05, 'epoch': 1.75}
{'loss': 0.5431, 'learning_rate': 2.0360237118103057e-05, 'epoch': 1.78}
{'loss': 0.4483, 'learning_rate': 1.990424076607387e-05, 'epoch': 1.81}
{'loss': 0.562, 'learning_rate': 1.944824441404469e-05, 'epoch': 1.83}
{'loss': 0.4694, 'learning_rate': 1.8992248062015506e-05, 'epoch': 1.86}
{'loss': 0.4848, 'learning_rate': 1.853625170998632e-05, 'epoch': 1.89}




{'loss': 0.5054, 'learning_rate': 1.8080255357957137e-05, 'epoch': 1.92}


Saving model checkpoint to /content/checkpoint/checkpoint-1400
Configuration saved in /content/checkpoint/checkpoint-1400/config.json


{'eval_loss': 0.6728479862213135, 'eval_accuracy': 0.7333133792277761, 'eval_runtime': 86.703, 'eval_samples_per_second': 115.602, 'eval_steps_per_second': 7.232, 'epoch': 1.92}


Model weights saved in /content/checkpoint/checkpoint-1400/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1400/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1400/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-600] due to args.save_total_limit


{'loss': 0.5291, 'learning_rate': 1.7624259005927954e-05, 'epoch': 1.94}
{'loss': 0.4489, 'learning_rate': 1.716826265389877e-05, 'epoch': 1.97}
{'loss': 0.4925, 'learning_rate': 1.6712266301869586e-05, 'epoch': 2.0}
{'loss': 0.312, 'learning_rate': 1.6256269949840403e-05, 'epoch': 2.02}
{'loss': 0.3194, 'learning_rate': 1.580027359781122e-05, 'epoch': 2.05}
{'loss': 0.2651, 'learning_rate': 1.5344277245782034e-05, 'epoch': 2.08}
{'loss': 0.2604, 'learning_rate': 1.4888280893752852e-05, 'epoch': 2.11}
{'loss': 0.2644, 'learning_rate': 1.4432284541723667e-05, 'epoch': 2.13}
{'loss': 0.2339, 'learning_rate': 1.3976288189694483e-05, 'epoch': 2.16}




{'loss': 0.2269, 'learning_rate': 1.35202918376653e-05, 'epoch': 2.19}


Saving model checkpoint to /content/checkpoint/checkpoint-1600
Configuration saved in /content/checkpoint/checkpoint-1600/config.json


{'eval_loss': 0.8569731116294861, 'eval_accuracy': 0.7334131497555622, 'eval_runtime': 87.2617, 'eval_samples_per_second': 114.861, 'eval_steps_per_second': 7.185, 'epoch': 2.19}


Model weights saved in /content/checkpoint/checkpoint-1600/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1600/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1600/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1200] due to args.save_total_limit


{'loss': 0.2949, 'learning_rate': 1.3064295485636116e-05, 'epoch': 2.22}
{'loss': 0.2793, 'learning_rate': 1.2608299133606931e-05, 'epoch': 2.24}
{'loss': 0.241, 'learning_rate': 1.2152302781577749e-05, 'epoch': 2.27}
{'loss': 0.2978, 'learning_rate': 1.1696306429548564e-05, 'epoch': 2.3}
{'loss': 0.2835, 'learning_rate': 1.1240310077519382e-05, 'epoch': 2.33}
{'loss': 0.2951, 'learning_rate': 1.0784313725490197e-05, 'epoch': 2.35}
{'loss': 0.2576, 'learning_rate': 1.0328317373461013e-05, 'epoch': 2.38}
{'loss': 0.2874, 'learning_rate': 9.87232102143183e-06, 'epoch': 2.41}
{'loss': 0.2577, 'learning_rate': 9.416324669402646e-06, 'epoch': 2.44}




{'loss': 0.2971, 'learning_rate': 8.960328317373462e-06, 'epoch': 2.46}


Saving model checkpoint to /content/checkpoint/checkpoint-1800
Configuration saved in /content/checkpoint/checkpoint-1800/config.json


{'eval_loss': 0.82845538854599, 'eval_accuracy': 0.7286241644218298, 'eval_runtime': 87.326, 'eval_samples_per_second': 114.777, 'eval_steps_per_second': 7.18, 'epoch': 2.46}


Model weights saved in /content/checkpoint/checkpoint-1800/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1800/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1800/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1600] due to args.save_total_limit


{'loss': 0.302, 'learning_rate': 8.504331965344279e-06, 'epoch': 2.49}
{'loss': 0.2395, 'learning_rate': 8.048335613315095e-06, 'epoch': 2.52}
{'loss': 0.292, 'learning_rate': 7.592339261285911e-06, 'epoch': 2.54}
{'loss': 0.255, 'learning_rate': 7.136342909256727e-06, 'epoch': 2.57}
{'loss': 0.2439, 'learning_rate': 6.680346557227543e-06, 'epoch': 2.6}
{'loss': 0.2503, 'learning_rate': 6.224350205198359e-06, 'epoch': 2.63}
{'loss': 0.2506, 'learning_rate': 5.768353853169175e-06, 'epoch': 2.65}
{'loss': 0.2669, 'learning_rate': 5.312357501139991e-06, 'epoch': 2.68}
{'loss': 0.2807, 'learning_rate': 4.856361149110807e-06, 'epoch': 2.71}




{'loss': 0.2866, 'learning_rate': 4.400364797081624e-06, 'epoch': 2.74}


Saving model checkpoint to /content/checkpoint/checkpoint-2000
Configuration saved in /content/checkpoint/checkpoint-2000/config.json


{'eval_loss': 0.8466891646385193, 'eval_accuracy': 0.7358076424224285, 'eval_runtime': 86.9034, 'eval_samples_per_second': 115.335, 'eval_steps_per_second': 7.215, 'epoch': 2.74}


Model weights saved in /content/checkpoint/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-2000/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-2000/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1800] due to args.save_total_limit


{'loss': 0.2561, 'learning_rate': 3.9443684450524395e-06, 'epoch': 2.76}
{'loss': 0.2226, 'learning_rate': 3.488372093023256e-06, 'epoch': 2.79}
{'loss': 0.283, 'learning_rate': 3.032375740994072e-06, 'epoch': 2.82}
{'loss': 0.256, 'learning_rate': 2.5763793889648885e-06, 'epoch': 2.85}
{'loss': 0.2784, 'learning_rate': 2.1203830369357045e-06, 'epoch': 2.87}
{'loss': 0.3112, 'learning_rate': 1.6643866849065208e-06, 'epoch': 2.9}
{'loss': 0.2552, 'learning_rate': 1.208390332877337e-06, 'epoch': 2.93}
{'loss': 0.2319, 'learning_rate': 7.523939808481532e-07, 'epoch': 2.95}
{'loss': 0.2594, 'learning_rate': 2.963976288189695e-07, 'epoch': 2.98}




Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /content/checkpoint/checkpoint-1400 (score: 0.6728479862213135).


{'train_runtime': 3184.8548, 'train_samples_per_second': 22.03, 'train_steps_per_second': 0.689, 'train_loss': 0.5212774588754066, 'epoch': 3.0}


TrainOutput(global_step=2193, training_loss=0.5212774588754066, metrics={'train_runtime': 3184.8548, 'train_samples_per_second': 22.03, 'train_steps_per_second': 0.689, 'train_loss': 0.5212774588754066, 'epoch': 3.0})
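The `Total optimization steps = 2193` line and the logged learning rates can be reproduced by hand from the training arguments. The sketch below assumes a single device, a linear decay to zero, and no warmup steps; these assumptions are consistent with the numbers in this log but are not stated explicitly in the notebook.

```python
import math

num_examples = 23387   # size of the training split
per_device_batch = 16  # per_device_train_batch_size
grad_accum = 2         # gradient_accumulation_steps
epochs = 3             # num_train_epochs
base_lr = 5e-5         # learning_rate

# Batches per epoch; the last batch is partial.
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# One optimizer update every grad_accum batches.
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs

def linear_lr(step):
    # Linear decay from base_lr to zero over total_steps (no warmup assumed).
    return base_lr * (total_steps - step) / total_steps

print(batches_per_epoch, updates_per_epoch, total_steps)
print(linear_lr(1), linear_lr(20))
```

The first two logged learning rates (after step 1 and step 20, given `logging_first_step=True` and `logging_steps=20`) agree with `linear_lr(1)` and `linear_lr(20)` up to floating-point rounding.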

## Evaluate Model

In [None]:
model.eval()
test_gths = []; test_preds = []
with torch.no_grad():
    # Wrap the dataset itself so tqdm knows the total and shows real progress.
    for example in tqdm(test_dataset):
        context_question = example['context'] + tokenizer.sep_token + example['question']
        # Encode the three (context + question, answer) pairs as one padded batch.
        inputs = tokenizer([[context_question, example[answer]] for answer in answer_names], padding=True, return_tensors='pt').to(DEVICE)
        # Labels are the strings '1'-'3'; shift them to 0-based indices.
        labels = int(example['label']) - 1
        # Add a batch dimension: (num_choices, seq_len) -> (1, num_choices, seq_len).
        outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()})
        logits = outputs.logits
        preds = logits.argmax().item()
        test_gths.append(labels)
        test_preds.append(preds)
        torch.cuda.empty_cache()


In [None]:
accuracy.compute(predictions=test_preds, references=test_gths)

{'accuracy': 0.6038894575230297}

# Assignment: [HW5] Multiple choice commonsense reasoning (due 05/21 23:59)

In this lab, we trained and evaluated a `bert-base` model on the `social_i_qa` dataset.  
Accordingly, this assignment is to train and evaluate a `roberta-base` model on the `social_i_qa` and `commonsense_qa` datasets.

```
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMultipleChoice.from_pretrained('roberta-base')
datasets = load_dataset('commonsense_qa')
```

- What to submit
  - Q1) Compare performance on SocialIQA and CommonsenseQA under identical conditions
    - Q1-a) Evaluate the `roberta-base` model trained on the `social_i_qa` dataset on the test set. Report macro precision, recall, and F1-score in addition to accuracy. (20 points)
      - The `social_i_qa` dataset has only two splits, train and validation, so use the validation split as the test set.
    - Q1-b) Evaluate the `roberta-base` model trained on the `commonsense_qa` dataset on the test set. Report macro precision, recall, and F1-score in addition to accuracy. (20 points)
      - The `commonsense_qa` dataset has three splits, train/validation/test, so use the test split as the test set.
    - Q1-c) If performance differs between SocialIQA and CommonsenseQA, explain why. (20 points)
  - Q2) Improve performance on SocialIQA using COMET inferences
    - The [`socialiqa` folder](https://drive.google.com/drive/folders/17WyMnrNvKKeMatPp4U4AypMlrVMZHR9n?usp=sharing) contains the train/dev(=test) sets annotated with COMET inferences (inferences generated from the given context and question), organized by relation type. Use them to improve performance on SocialIQA.
    - Q2-a) Describe your experimental setup (e.g. the types of COMET inferences used for training, hyperparameters, ...) and explain why you chose it. (20 points)
    - Q2-b) Report your results (accuracy, macro precision, recall, F1-score) and explain why performance improved. (20 points)
- Length
    - At least 2 pages and at most 4 pages
- Format
    - Convert to PDF and submit on LearnUs
    - File name = HW5_studentID_name.pdf (e.g. HW5_2021321394_김채형.pdf)
- Deadline
    - May 21 (Sun) 23:59
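Since Q1 and Q2 ask for macro precision, recall, and F1-score alongside accuracy, the following sketch shows how those macro averages are defined. It is a plain-Python illustration with a hypothetical helper name (`macro_scores`) and toy labels. It takes macro F1 as the unweighted mean of per-class F1 scores, and a library implementation could be used instead.

```python
def macro_scores(references, predictions, num_classes):
    """Accuracy plus unweighted (macro) averages of per-class precision, recall, F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(references, predictions))
    for c in range(num_classes):
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        # Define a score as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(1 for r, p in pairs if r == p) / len(pairs),
        "macro_precision": sum(precisions) / num_classes,
        "macro_recall": sum(recalls) / num_classes,
        "macro_f1": sum(f1s) / num_classes,
    }

# Toy 3-class example with 0-based labels, as produced by int(label) - 1.
toy_references = [0, 1, 2, 0, 1, 2]
toy_predictions = [0, 2, 2, 0, 1, 1]
scores = macro_scores(toy_references, toy_predictions, num_classes=3)
print(scores)
```

In the notebook, `test_gths` and `test_preds` from the evaluation loop could be passed in place of the toy lists, with `num_classes=3` for SocialIQA.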