# intermediate_hellaswag
This notebook takes our Cosmos QA dataset and trains an intermediate model.

## Imports & Settings

First, change the working directory to the parent so that we can use our custom functions.

In [1]:
import os
os.chdir('..')
# os.getcwd( )

In [2]:
import params
from utils import *
from trainer import *

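import numpy as np
import pandas as pd
import torch  # used explicitly below (torch.Generator, torch.optim)
from datasets import load_from_disk

from transformers import RobertaTokenizer, RobertaForMultipleChoice
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# suppress model warnings from transformers
# (aliased so it does not clash with the standard-library logging module)
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# set logging level for the standard-library logger
import logging
logging.basicConfig(format='%(message)s', level=logging.INFO)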

In [3]:
# set general seeds
set_seeds(1)

# set dataloader generator seed
g = torch.Generator()
g.manual_seed(1)

# set params for this model
params.num_labels = 4
params.output_dir = "model_saves/intermediate_CosmosQA_01"
params.dataset_path = "data/inter_cosmosqa/itesd_cosmosqa_balanced.hf"

# Ensure we're on an ARM environment if necessary.
platform_check()

We're Armed: macOS-13.1-arm64-i386-64bit
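
The `set_seeds` helper comes from `utils` and is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it seeds Python's `random`, NumPy, and PyTorch (the name `set_seeds_sketch` is hypothetical and the real implementation may differ):

```python
import random


def set_seeds_sketch(seed: int) -> None:
    """Seed every random number generator available in this environment.

    Hypothetical stand-in for utils.set_seeds; NumPy and PyTorch are
    seeded only if they can be imported.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
set_seeds_sketch(1)
first = [random.random() for _ in range(3)]
set_seeds_sketch(1)
second = [random.random() for _ in range(3)]
print(first == second)  # prints True
```

The separate `torch.Generator` seeded above is what gets passed to the dataloaders, so shuffling order is reproducible independently of the global seed.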


## Load Data

### Cosmos QA

In [4]:
cosmos_datasets = load_from_disk(params.dataset_path)

In [5]:
cosmos_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 22272
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 2985
    })
    test: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 2825
    })
})

In [6]:
def show_one(example):
    print(f"Context: {example['context']}")
    print(f"Question: {example['question']}")
    print(f"  A - {example['answer0']}")
    print(f"  B - {example['answer1']}")
    print(f"  C - {example['answer2']}")
    print(f"  D - {example['answer3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

show_one(cosmos_datasets["train"][50])

Context: I had mentioned my call from Omaha in my previous entry and I finally got a hold of the HR lady who was trying to reach me . Like I had expected , it was far from a solid offer , rather just seeing if I was still interested in positions . Still it was the first real bite I ' ve gotten from the line I had sunk in that pond and its not like I have a whole lot of options , so I said I was still interested .
Question: What happened after the call ?
  A - I accepted the position .
  B - I chose a different option .
  C - I went fishing .
  D - None of the above choices .

Ground truth: option A


## Preprocess

In [7]:
params.tokenizer = RobertaTokenizer.from_pretrained("roberta-base", use_fast=True)

In [8]:
# encoding_dict
encoded_datasets = cosmos_datasets.map(mc_preprocessing, batched=True)

encoded_datasets

Loading cached processed dataset at data/inter_cosmosqa/itesd_cosmosqa_balanced.hf/train/cache-990048748da5baca.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'input_ids', 'attention_mask'],
        num_rows: 22272
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2985
    })
    test: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2825
    })
})

### Double-Check input_id lengths
We perform this check to confirm that a maximum length of 256 tokens is sufficient for this task.

In [9]:
train_ids = encoded_datasets["train"]['input_ids']

# one length per (example, answer option) pair
lengths = [len(choice) for example in train_ids for choice in example]

print(len(lengths))

89088


In [10]:
lengths[:10]

[75, 75, 76, 77, 97, 98, 98, 100, 91, 94]

In [11]:
max(lengths)

212
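
The check above can be wrapped into a small report. This is a sketch only: `length_report` is a hypothetical helper, and it assumes the lengths were collected without padding, as in the cells above.

```python
def length_report(lengths, max_length):
    """Summarize how token lengths compare with the model's length limit."""
    longest = max(lengths)
    n_over = sum(1 for n in lengths if n > max_length)
    return {
        "longest": longest,
        "over_limit": n_over,
        "fraction_over": n_over / len(lengths),
        "headroom": max_length - longest,
    }


# The first ten lengths printed above, checked against the 256-token limit.
sample_lengths = [75, 75, 76, 77, 97, 98, 98, 100, 91, 94]
report = length_report(sample_lengths, 256)
print(report)
```

With the full training set, a longest sequence of 212 leaves 44 tokens of headroom under the 256 limit.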

### View input Structure

The inputs are four copies of the context and question, each joined with one answer option. Each sequence starts with the \<s> (BOS) token, which RoBERTa also uses as its classification (CLS) token. The segments are separated by pairs of \</s> tokens, which serve as both the end-of-sequence and the separator token.

https://huggingface.co/docs/transformers/model_doc/roberta
https://stackoverflow.com/questions/61465223/roberta-tokenization-of-multiple-sequences
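
The layout described above can be illustrated at the string level. This is only a sketch: the real `mc_preprocessing` function works on token ids and is not shown here, and `build_choice_strings` is a hypothetical name.

```python
BOS, SEP = "<s>", "</s>"


def build_choice_strings(context, question, answers):
    """Return one string per answer option, mirroring the decoded layout:
    <s>context</s></s>question</s></s>answer</s>
    """
    return [
        f"{BOS}{context}{SEP}{SEP}{question}{SEP}{SEP}{answer}{SEP}"
        for answer in answers
    ]


choices = build_choice_strings(
    "Beth managed to locate a small eatery that sold sandwiches.",
    "What's a possible reason Beth located an eatery?",
    [
        "Because she was hungry.",
        "None of the above choices.",
        "Because the eatery sells sandwiches.",
        "Because the writer didn't know what they were ordering.",
    ],
)
for text in choices:
    print(text)
```

Each example therefore yields four sequences that share the same context and question prefix and differ only in the final segment.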

In [12]:
show_one(cosmos_datasets["train"][100])

Context: Beth managed to locate a small eatery that sold sandwiches . I headed over with Will and asked for one . At first , I did n't know what I was ordering ; I just pointed to what looked good . Turns out that I ordered a brie sandwich with lettuce , tomato , and butter .
Question: What 's a possible reason Beth located an eatery ?
  A - Because she was hungry .
  B - None of the above choices .
  C - Because the eatery sells sandwiches .
  D - Because the writer did n't know what they were ordering .

Ground truth: option A


In [13]:
print(params.tokenizer.decode(encoded_datasets['train']["input_ids"][100][0]))
print(params.tokenizer.decode(encoded_datasets['train']["input_ids"][100][1]))
print(params.tokenizer.decode(encoded_datasets['train']["input_ids"][100][2]))
print(params.tokenizer.decode(encoded_datasets['train']["input_ids"][100][3]))

<s>Beth managed to locate a small eatery that sold sandwiches. I headed over with Will and asked for one. At first, I didn't know what I was ordering ; I just pointed to what looked good. Turns out that I ordered a brie sandwich with lettuce, tomato, and butter.</s></s>What's a possible reason Beth located an eatery?</s></s>Because she was hungry.</s>
<s>Beth managed to locate a small eatery that sold sandwiches. I headed over with Will and asked for one. At first, I didn't know what I was ordering ; I just pointed to what looked good. Turns out that I ordered a brie sandwich with lettuce, tomato, and butter.</s></s>What's a possible reason Beth located an eatery?</s></s>None of the above choices.</s>
<s>Beth managed to locate a small eatery that sold sandwiches. I headed over with Will and asked for one. At first, I didn't know what I was ordering ; I just pointed to what looked good. Turns out that I ordered a brie sandwich with lettuce, tomato, and butter.</s></s>What's a possible r

In [14]:
print(encoded_datasets['train']["input_ids"][100][0])
print(encoded_datasets['train']["attention_mask"][100][0])


[0, 387, 4774, 2312, 7, 12982, 10, 650, 19969, 219, 14, 1088, 19072, 479, 38, 3475, 81, 19, 2290, 8, 553, 13, 65, 479, 497, 78, 2156, 38, 222, 295, 75, 216, 99, 38, 21, 12926, 25606, 38, 95, 3273, 7, 99, 1415, 205, 479, 27271, 66, 14, 38, 2740, 10, 741, 3636, 15649, 19, 24515, 2156, 20406, 2156, 8, 9050, 479, 2, 2, 2264, 128, 29, 10, 678, 1219, 10472, 2034, 41, 19969, 219, 17487, 2, 2, 10105, 79, 21, 11130, 479, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## Prep Dataloaders

In [15]:
# collect the necessary features from the encoded datasets and arrange them
# into a format accepted by the dataloader
train_features = construct_input(encoded_datasets['train'])
validate_features = construct_input(encoded_datasets['validation'])

In [16]:
# prepare the DataLoaders with a custom collate function
train_dataloader = DataLoader(
            train_features,
            sampler = RandomSampler(train_features),
            batch_size = params.batch_size,
            worker_init_fn=seed_worker,
            generator=g,
            collate_fn=collate
        )

validation_dataloader = DataLoader(
            validate_features,
            sampler = RandomSampler(validate_features),
            batch_size = params.batch_size,
            worker_init_fn=seed_worker,
            generator=g,
            collate_fn=collate
        )
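
The `collate` function comes from our custom modules and is not shown here. A minimal sketch of the kind of padding such a function might perform, using plain lists rather than tensors and assuming padding to the longest sequence in the batch (the real function may instead pad to a fixed length; `pad_batch` is a hypothetical name):

```python
PAD_ID = 1  # pad token id for roberta-base


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every choice in every example to the longest sequence in the batch.

    batch: list of examples; each example is a list of choices;
           each choice is a list of token ids.
    Returns (input_ids, attention_mask) as nested lists of equal length.
    """
    longest = max(len(choice) for example in batch for choice in example)
    input_ids = [
        [choice + [pad_id] * (longest - len(choice)) for choice in example]
        for example in batch
    ]
    attention_mask = [
        [[1] * len(choice) + [0] * (longest - len(choice)) for choice in example]
        for example in batch
    ]
    return input_ids, attention_mask


# Two toy examples with two choices each (real examples have four).
toy_batch = [[[0, 5, 2], [0, 5, 6, 2]], [[0, 7, 2], [0, 2]]]
ids, mask = pad_batch(toy_batch)
print(ids)
print(mask)
```

The trailing 1s in the batch printed by the next cell are consistent with padding that uses RoBERTa's pad token id.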

In [17]:
# view one batch from the dataloader
next(iter(train_dataloader))

[tensor([[[    0, 43725,   352,  ...,     1,     1,     1],
          [    0, 43725,   352,  ...,     1,     1,     1],
          [    0, 43725,   352,  ...,     1,     1,     1],
          [    0, 43725,   352,  ...,     1,     1,     1]],
 
         [[    0,   713,  3298,  ...,     1,     1,     1],
          [    0,   713,  3298,  ...,     1,     1,     1],
          [    0,   713,  3298,  ...,     1,     1,     1],
          [    0,   713,  3298,  ...,     1,     1,     1]],
 
         [[    0,   243,  2594,  ...,     1,     1,     1],
          [    0,   243,  2594,  ...,     1,     1,     1],
          [    0,   243,  2594,  ...,     1,     1,     1],
          [    0,   243,  2594,  ...,     1,     1,     1]],
 
         ...,
 
         [[    0,   243,  2092,  ...,     1,     1,     1],
          [    0,   243,  2092,  ...,     1,     1,     1],
          [    0,   243,  2092,  ...,     1,     1,     1],
          [    0,   243,  2092,  ...,     1,     1,     1]],
 
         [[ 

## Train

* Note: if continuing from a checkpoint, skip to the next section.

Load transformers.RobertaForMultipleChoice, which is a RoBERTa model with a multiple-choice classification head on top (a linear layer over the pooled output, followed by a softmax across the choices):

In [18]:
# Load the RobertaForMultipleChoice model
model = RobertaForMultipleChoice.from_pretrained('roberta-base',
                                                  num_labels = params.num_labels,
                                                  output_attentions = False,
                                                  output_hidden_states = False,
                                                    )

from torchinfo import summary
summary(model, input_size=(1, 4, 256), dtypes=['torch.IntTensor'])

Layer (type:depth-idx)                                       Output Shape              Param #
RobertaForMultipleChoice                                     [1, 4]                    --
├─RobertaModel: 1-1                                          [4, 768]                  --
│    └─RobertaEmbeddings: 2-1                                [4, 256, 768]             --
│    │    └─Embedding: 3-1                                   [4, 256, 768]             38,603,520
│    │    └─Embedding: 3-2                                   [4, 256, 768]             768
│    │    └─Embedding: 3-3                                   [4, 256, 768]             394,752
│    │    └─LayerNorm: 3-4                                   [4, 256, 768]             1,536
│    │    └─Dropout: 3-5                                     [4, 256, 768]             --
│    └─RobertaEncoder: 2-2                                   [4, 256, 768]             --
│    │    └─ModuleList: 3-6                                  --               
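
The shapes in the summary above (input `[1, 4, 256]`, encoder output `[4, 768]`, final output `[1, 4]`) reflect how a multiple-choice head treats each option as its own sequence. A minimal sketch of that reshaping with plain lists, where `score_sequence` is a hypothetical stand-in for the encoder plus linear layer:

```python
def score_choices(input_ids, score_sequence):
    """Flatten (batch, choices, seq_len) to (batch * choices, seq_len),
    score each sequence with one logit, then regroup to (batch, choices)."""
    batch_size = len(input_ids)
    num_choices = len(input_ids[0])
    flat = [seq for example in input_ids for seq in example]
    flat_scores = [score_sequence(seq) for seq in flat]
    return [
        flat_scores[i * num_choices:(i + 1) * num_choices]
        for i in range(batch_size)
    ]


def predict(logits):
    """Pick the index of the highest-scoring choice for each example."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy batch: 2 examples x 4 choices; the dummy scorer just sums the token ids.
toy_ids = [
    [[0, 1, 2], [0, 9, 2], [0, 3, 2], [0, 4, 2]],
    [[0, 8, 2], [0, 1, 2], [0, 2, 2], [0, 3, 2]],
]
logits = score_choices(toy_ids, lambda seq: float(sum(seq)))
print(logits)           # one row of four scores per example
print(predict(logits))  # prints [1, 0]
```

Because each sequence receives a single score, the label is simply the position of the correct option among the four.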

Move the model to the device and initialize the trainer.

In [19]:
model.to(params.device)
# print(f"Trained Dataset: {dataset_path}")
print(f"Device: {params.device}")

optimizer = torch.optim.Adam(params=model.parameters(), 
                             lr=params.learning_rate,
                             weight_decay=params.weight_decay) #roberta

trainer = Trainer(model=model,
                  device=params.device,
                  tokenizer=params.tokenizer,
                  train_dataloader=train_dataloader,
                  validation_dataloader=validation_dataloader,
                  epochs=params.epochs,
                  optimizer=optimizer,
                  val_loss_fn=params.val_loss_fn,
                  num_labels=params.num_labels,
                  output_dir=params.output_dir,
                  save_freq=params.save_freq,
                  checkpoint_freq=params.checkpoint_freq)

output_parameters()

Device: mps

          Training Dataset: data/inter_cosmosqa/itesd_cosmosqa_balanced.hf
          Number of Labels: 4
          Batch Size: 16
          Learning Rate: 1e-05
          Weight Decay: 0
          Epochs: 10
          Output Directory: model_saves/intermediate_CosmosQA_01
          Save Frequency: 1
          Checkpoint Frequency: 1
          Max Length: 256
          


In [20]:
trainer.fit()

Epoch 1: 100%|██████████| 1392/1392 [1:30:00<00:00,  3.88s/batch]
	 Validation 186: 100%|██████████| 187/187 [03:16<00:00,  1.05s/batch]

 	 - Train loss: 1.014240
	 - Validation Loss: 0.935583
	 - Validation Accuracy: 0.640523
	 - Validation F1: 0.640523
	 - Validation Recall: 0.640523
	 - Validation Precision: 0.640523 

	 * Model @ epoch 1 saved to model_saves/intermediate_CosmosQA_01/E01_A0.64_F0.64
	 * Model checkpoint saved to model_saves/intermediate_CosmosQA_01/E01_A0.64_F0.64/checkpoint.pt

Epoch 2:   0%|          | 1/1392 [00:03<1:31:10,  3.93s/batch]


KeyboardInterrupt: 
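
In the epoch 1 output above, accuracy, F1, recall, and precision are all identical. One plausible explanation, assuming the trainer uses micro averaging (not confirmed by this notebook), is that micro-averaged precision, recall, and F1 all reduce to accuracy when every example has exactly one label. A small sketch with toy labels and a hypothetical `micro_metrics` helper:

```python
def micro_metrics(y_true, y_pred, num_labels):
    """Compute accuracy and micro-averaged precision, recall, and F1."""
    tp = fp = fn = 0
    for label in range(num_labels):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, truth was another
            elif t == label:
                fn += 1  # truth was this label, predicted another
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f1


# Toy labels for a four-way task (not taken from the real validation set).
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 2, 2, 3, 1, 1]
accuracy, precision, recall, f1 = micro_metrics(y_true, y_pred, num_labels=4)
print(accuracy, precision, recall, f1)
```

Every wrong prediction counts once as a false positive for the predicted label and once as a false negative for the true label, so the three micro-averaged scores match accuracy.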

## Continue Training from Checkpoint

In [None]:
# Load the RobertaForMultipleChoice model
model = RobertaForMultipleChoice.from_pretrained('roberta-base',
                                                  num_labels = params.num_labels,
                                                  output_attentions = False,
                                                  output_hidden_states = False,
                                                    )

model.to(params.device)
print(f"Device: {params.device}")

optimizer = torch.optim.Adam(params=model.parameters(),
                             lr=params.learning_rate,
                             weight_decay=params.weight_decay) #roberta

checkpoint_load = "model_saves/intermediate_cosmos_01/E07_A0.68_F0.68/checkpoint.pt"

trainer = Trainer(model=model,
                  device=params.device,
                  tokenizer=params.tokenizer,
                  train_dataloader=train_dataloader,
                  validation_dataloader=validation_dataloader,
                  epochs=params.epochs,
                  optimizer=optimizer,
                  val_loss_fn=params.val_loss_fn,
                  num_labels=params.num_labels,
                  output_dir=params.output_dir,
                  save_freq=params.save_freq,
                  checkpoint_freq=params.checkpoint_freq, 
                  checkpoint_load=checkpoint_load)

In [None]:
trainer.fit()
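
The checkpoint path above is hard-coded. As a convenience, the save names seen in the training log (for example `E01_A0.64_F0.64`) could be parsed to pick the most recent checkpoint. This is a sketch only: the naming pattern is inferred from the log lines, and `parse_save_name`, `latest_checkpoint`, and the `model_saves/run` directory are hypothetical.

```python
import re

# Matches names such as E01_A0.64_F0.64 (epoch, accuracy, F1).
SAVE_PATTERN = re.compile(r"E(\d+)_A(\d+\.\d+)_F(\d+\.\d+)$")


def parse_save_name(path):
    """Extract epoch, accuracy, and F1 from a save directory name."""
    match = SAVE_PATTERN.search(path)
    if match is None:
        return None
    return {
        "epoch": int(match.group(1)),
        "accuracy": float(match.group(2)),
        "f1": float(match.group(3)),
    }


def latest_checkpoint(save_dirs):
    """Return the checkpoint.pt path for the highest epoch, or None."""
    parsed = [(parse_save_name(d), d) for d in save_dirs]
    parsed = [(info, d) for info, d in parsed if info is not None]
    if not parsed:
        return None
    _, newest = max(parsed, key=lambda pair: pair[0]["epoch"])
    return f"{newest}/checkpoint.pt"


# Hypothetical directory listing.
dirs = ["model_saves/run/E01_A0.64_F0.64", "model_saves/run/E07_A0.68_F0.68"]
print(latest_checkpoint(dirs))
```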