## Benchmarking

Here we load our preprocessed evaluation dataset `MarioBarbeque/DeepMind-LinAlg-1D-eval` into a DataLoader with a specific collate function in order to benchmark the `flan-t5-large` model's performance on solving linear equations in a single variable before fine-tuning it on the large DeepMind training dataset.

For the benchmarking in this notebook, we use a single-node, single-GPU (NVIDIA T4) compute instance. To speed up computation, we install the NVIDIA `apex` Python package, which is used to compute the normalization layers of the T5 model. Regarding `apex`, the 🤗 T5 documentation states:

"*[after installation] the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter.*"

The `apex` package and its `optimizers` module will also be useful when we actually train our model: we can construct a fused `FusedAdam` optimizer (in Adam or AdamW mode) in place of the standard `torch.optim.AdamW` optimizer.

For this benchmarking notebook, we first update and install all relevant software on our compute instance.


In [0]:
import torch
print(torch.__version__)

2.3.1+cu121


In [0]:
# ensure we have the most recent version of transformers
!pip install -U transformers
# ensure we have ninja installed to speed up Nvidia apex source compilation
!pip install ninja
# install Nvidia Apex for optimized computation in the T5 normalization layers
!pip install git+https://github.com/NVIDIA/apex.git --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext"

dbutils.library.restartPython()

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/f2/3a/8bdab26e09c5a242182b7ba9152e216d5ab4ae2d78c4298eb4872549cd35/transformers-4.47.1-py3-none-any.whl.metadata
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Obtaining dependency information for huggingface-hub<1.0,>=0.24.0 from https://files.pythonhosted.org/packages/61/8c/fbdc0a88a622d9fa54e132d7bf3ee03ec602758658a2db5b339a65be2cfe/huggingface_hub-0.27.0-py3-none-any.whl.metadata
  Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Obtaining dependency information for tokenizers<

In [0]:
import torch
torch.version.cuda

'12.1'

In [0]:
!pip freeze | grep apex

apex @ git+https://github.com/NVIDIA/apex.git@73375b3bbcb59a5d6ff43f2fafd00b9ecdbe0417


In [0]:
# confirm the apex library is available
from apex import normalization

In [0]:
# load our eval dataset
from datasets import load_dataset

eval_dataset = load_dataset("MarioBarbeque/DeepMind-LinAlg-1D-eval")



In [0]:
# grab the only relevant Dataset object within the DatasetDict object and peek it
eval_dataset = eval_dataset["test"]
eval_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

In [0]:
# check the preprocessed, tokenized dataset is loaded as expected
eval_dataset[:5]

{'input_ids': [[5175,
   162,
   3,
   4949,
   4613,
   1935,
   26,
   1768,
   668,
   3166,
   3,
   18,
   3,
   27640,
   3274,
   3,
   5947,
   2773,
   21,
   3,
   26,
   5,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [5175,
   162,
   9526,
   1935,
   40,
   1768,
   3479,
   1935,
   40,
   3,
   18,
   3,
   10124,
   3,
   18,
   3,
   3891,
   3274,
   3,
   632,
   21,
   3,
   40,
   5,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [5175,
   162,
   3,
   18,
   4389,
   1935,
   17,
   1768,
   1179,
   4225,
   3,
   18,
   505,
   3707,
   1768,
   668,
   4201,
   3274,
   3,
   632,
   21,
   3,
   17,
   5,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [5175,
   162,
   6374,
   1935,
   122,
   3274,
   3,
   19978,
   1935,
   122,
   3,
   18,
   3,
   4450,
   1935,
   122,
   3,
   18,
   3,
   4440,
   1935,
   122,
   3,
   18,
   3,
   26755,
   21,
   3,
   122,
   5,
   1,
   0,
   0,
   

In [0]:
# reinstantiate our tokenizer and model in bfloat16
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.bfloat16)

2024-12-18 04:36:08.426196: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [0]:
# double check the precision of our tensors
model.dtype

torch.bfloat16

In [0]:
# use the custom mem_status fn to check the amount of memory used on the T4 GPU
def mem_status(): 
    if torch.cuda.is_available():
        gpus = torch.cuda.device_count()
        print("Memory status: ")
        for i in range(gpus):
            properties = torch.cuda.get_device_properties(i)
            total_memory = properties.total_memory / (1024 ** 3)  # Convert to GB
            allocated_memory = torch.cuda.memory_allocated(i) / (1024 ** 3)  # Convert to GB
            reserved_memory = torch.cuda.memory_reserved(i) / (1024 ** 3)  # Convert to GB
            available_memory = total_memory - reserved_memory
            print(f"GPU {i}:")
            print(f"  Total memory: {total_memory:.2f} GB")
            print(f"  Allocated memory: {allocated_memory:.2f} GB")
            print(f"  Reserved memory: {reserved_memory:.2f} GB")
            print(f"  Available memory: {available_memory:.2f} GB")
    else:
        print("No GPU available.")

mem_status()

Memory status: 
GPU 0:
  Total memory: 15.57 GB
  Allocated memory: 1.50 GB
  Reserved memory: 1.53 GB
  Available memory: 14.04 GB
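
As a rough sanity check on the allocated figure, the weights alone should occupy about two bytes per parameter in bfloat16. The sketch below assumes roughly 780 million parameters for `flan-t5-large` (an approximate figure, not measured in this notebook); `estimate_weight_gib` is a hypothetical helper.

```python
def estimate_weight_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Estimate the memory taken by model weights alone, in GiB (1024**3 bytes)."""
    return num_params * bytes_per_param / (1024 ** 3)

# ASSUMPTION: ~780M parameters for flan-t5-large (approximate, not measured here)
approx_params = 780_000_000
print(f"bfloat16 weights: ~{estimate_weight_gib(approx_params):.2f} GiB")
print(f"float32 weights:  ~{estimate_weight_gib(approx_params, 4):.2f} GiB")
```

Under that assumption the bfloat16 estimate lands close to the allocated memory reported by `mem_status()`. In the notebook itself, `sum(p.numel() for p in model.parameters())` would give the exact parameter count.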


In [0]:
# as a preliminary step, we convert our tokenized dataset's format to numpy
# this will ultimately be required under the hood by the DataCollatorForSeq2Seq class for padding the labels to the same length in each of our batches
eval_dataset.set_format("numpy")

In [0]:
# time to configure our dataloaders with a unique collating function

from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader

batch_size = 32 # multiple of 8 to optimize computation on Nvidia tensor cores

# we make use of the standard seq2seq collator and pass the powerful `pad_to_multiple_of` argument as discussed in the preprocessing notebook
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

eval_dataloader = DataLoader(
    eval_dataset, 
    batch_size=batch_size, 
    collate_fn=data_collator
)

In [0]:
# grab a batch from our eval dataloader
for batch in eval_dataloader:
    break
{k: v.shape for k, v in batch.items()}

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


{'input_ids': torch.Size([32, 34]),
 'attention_mask': torch.Size([32, 34]),
 'labels': torch.Size([32, 4])}
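
The collator's `pad_to_multiple_of` argument rounds each batch's padded length up to the next multiple of the given value. A minimal sketch of that rounding rule in plain Python follows; `padded_length` and `pad_batch` are hypothetical helpers for illustration, not the collator's actual implementation.

```python
def padded_length(longest: int, multiple: int) -> int:
    """Round `longest` up to the nearest multiple of `multiple`."""
    return ((longest + multiple - 1) // multiple) * multiple

def pad_batch(sequences, pad_id=0, multiple=2):
    """Right-pad every sequence with `pad_id` to a shared, rounded-up length."""
    target = padded_length(max(len(s) for s in sequences), multiple)
    return [list(s) + [pad_id] * (target - len(s)) for s in sequences]

labels = [[489, 1], [3, 6039, 1], [3, 18, 2577, 1]]
# the longest label has 4 tokens, already a multiple of 2, so every row is padded to length 4
print(pad_batch(labels))
```

With `multiple=8`, a 34-token input would instead be padded to 40 tokens.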

In [0]:
# put all elements of the batch on the GPU
device = torch.device("cuda")
batch = {k: v.to(device) for k, v in batch.items()}

In [0]:
# check how the model responds to a single batch before starting the whole benchmarking eval
outputs = model(**batch)
outputs.loss, outputs.logits.shape

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


(tensor(15.1875, device='cuda:0', dtype=torch.bfloat16,
        grad_fn=<NllLossBackward0>),
 torch.Size([32, 4, 32128]))

In [0]:
outputs.logits.argmax(dim=-1), batch["labels"]

(tensor([[ 314,    1,    1,  489],
         [ 668,    1,    1,  431],
         [   3,    1,    1, 1401],
         [   3, 5783,    1,    1],
         [   3,   18,    1,    1],
         [   3,   18, 2469,    1],
         [   3, 6039,    1,    1],
         [   3,    1,    1,    3],
         [   3,    1,    1,    3],
         [   3,   18, 3420,    1],
         [ 944,    1,    1,  489],
         [   3,   18,    1,    1],
         [ 668,    1,    1,  489],
         [   3,   18,    1,    1],
         [   3,   18, 2773,    1],
         [ 204,    1,    1,  204],
         [   3,   18, 2688,    1],
         [   3,    1,    1,    3],
         [   3,   18, 3420,    1],
         [   3,   18, 3420,    1],
         [ 489,    1,    1,  305],
         [   3,   18, 2469,    1],
         [   3,    1,    1,    3],
         [   3,   18, 2773,    1],
         [   3,    1,    1,    3],
         [   3, 4536,    1,    1],
         [ 314,    1,    1,  489],
         [   3,   18, 3341,    1],
         [   3,   18

In [0]:
# check the updated mem status after passing the test batch into the model for inference
mem_status()

Memory status: 
GPU 0:
  Total memory: 15.57 GB
  Allocated memory: 3.40 GB
  Reserved memory: 3.47 GB
  Available memory: 12.11 GB


In [0]:
# compare predictions to labels in the test batch
for pred, label in zip([tokenizer.decode(pred, skip_special_tokens=True) for pred in outputs.logits.argmax(dim=-1)], [tokenizer.decode(label, skip_special_tokens=True) for label in batch["labels"]]):
    print(f"prediction: {pred}, label: {label}")

prediction: 4 7, label: 7
prediction: 9 6, label: 2
prediction: 21, label: 23
prediction: -6, label: -8
prediction: -, label: -17
prediction: -35, label: -28
prediction: -8, label: -12
prediction: , label: 27
prediction: , label: 42
prediction: -36, label: -38
prediction: 25 7, label: 7
prediction: -, label: -12
prediction: 9 7, label: 14
prediction: -, label: -11
prediction: -23, label: -38
prediction: 2 2, label: 2
prediction: -26, label: -48
prediction: , label: 49
prediction: -36, label: -29
prediction: -36, label: -44
prediction: 7 5, label: 5
prediction: -35, label: -45
prediction: , label: 29
prediction: -23, label: -42
prediction: , label: 10
prediction: -10, label: -7
prediction: 4 7, label: 7
prediction: -31, label: -45
prediction: -23, label: -37
prediction: 18, label: 19
prediction: , label: 49
prediction: -8, label: -9


In [0]:
# create exact match benchmark metric to be used for evaluation
from evaluate import load

exact_match_test_benchmark = load("exact_match")
# f1_benchmark = load("f1")

In [0]:
# TODO remove since we are adding items individually

# compute metric for first batch - how many of the predictions are correct? 
exact_match_test_benchmark.add_batch(predictions=[tokenizer.decode(pred, skip_special_tokens=True) for pred in outputs.logits.argmax(dim=-1)], references=[tokenizer.decode(label, skip_special_tokens=True) for label in batch["labels"]])

# f1_benchmark.add_batch(predictions=[pred for pred in outputs.logits.argmax(dim=-1).tolist()], references=[label for label in batch["labels"].tolist()])

print(">>> The exact_match score of this batch is: " + str(exact_match_test_benchmark.compute()))
# print(">>> The f1 score of this batch is: " + str(f1_benchmark.compute()))

>>> The exact_match score of this batch is: {'exact_match': 0.0}


In [0]:
# sanity check: compare the predicted and label token ids pairwise, example by example
num_correct = 0
for label, pred in zip(batch["labels"], outputs.logits.argmax(dim=-1)):
    if label.tolist() == pred.tolist():
        num_correct += 1
num_correct

0

In [0]:
# get the exact match score and create a dataset for individual partial correctness in the same loop
batch_partials = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}

for pred, label in zip(outputs.logits.argmax(dim=-1), batch["labels"]):
    
    exact_match_test_benchmark.add(predictions=tokenizer.decode(pred, skip_special_tokens=True), references=tokenizer.decode(label, skip_special_tokens=True))
    
    batch_partials["predicted_tokens"].append(pred)
    batch_partials["label_tokens"].append(label)
    batch_partials["decoded_prediction"].append(tokenizer.decode(pred, skip_special_tokens=True))
    batch_partials["decoded_label"].append(tokenizer.decode(label, skip_special_tokens=True))

print(">>> The exact_match score of this batch is: " + str(exact_match_test_benchmark.compute()))

>>> The exact_match score of this batch is: {'exact_match': 0.0}


In [0]:
# peek dict
batch_partials

{'predicted_tokens': [tensor([314,   1,   1, 489], device='cuda:0'),
  tensor([668,   1,   1, 431], device='cuda:0'),
  tensor([   3,    1,    1, 1401], device='cuda:0'),
  tensor([   3, 5783,    1,    1], device='cuda:0'),
  tensor([ 3, 18,  1,  1], device='cuda:0'),
  tensor([   3,   18, 2469,    1], device='cuda:0'),
  tensor([   3, 6039,    1,    1], device='cuda:0'),
  tensor([3, 1, 1, 3], device='cuda:0'),
  tensor([3, 1, 1, 3], device='cuda:0'),
  tensor([   3,   18, 3420,    1], device='cuda:0'),
  tensor([944,   1,   1, 489], device='cuda:0'),
  tensor([ 3, 18,  1,  1], device='cuda:0'),
  tensor([668,   1,   1, 489], device='cuda:0'),
  tensor([ 3, 18,  1,  1], device='cuda:0'),
  tensor([   3,   18, 2773,    1], device='cuda:0'),
  tensor([204,   1,   1, 204], device='cuda:0'),
  tensor([   3,   18, 2688,    1], device='cuda:0'),
  tensor([3, 1, 1, 3], device='cuda:0'),
  tensor([   3,   18, 3420,    1], device='cuda:0'),
  tensor([   3,   18, 3420,    1], device='cuda:0'),


In [0]:
# confirm we can construct a 🤗 datasets Dataset object from this dict
from datasets import Dataset

ds = Dataset.from_dict(batch_partials)

In [0]:
# peek 🤗 Dataset format
ds, ds["predicted_tokens"], ds["decoded_label"]

(Dataset({
     features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
     num_rows: 32
 }),
 [[314, 1, 1, 489],
  [668, 1, 1, 431],
  [3, 1, 1, 1401],
  [3, 5783, 1, 1],
  [3, 18, 1, 1],
  [3, 18, 2469, 1],
  [3, 6039, 1, 1],
  [3, 1, 1, 3],
  [3, 1, 1, 3],
  [3, 18, 3420, 1],
  [944, 1, 1, 489],
  [3, 18, 1, 1],
  [668, 1, 1, 489],
  [3, 18, 1, 1],
  [3, 18, 2773, 1],
  [204, 1, 1, 204],
  [3, 18, 2688, 1],
  [3, 1, 1, 3],
  [3, 18, 3420, 1],
  [3, 18, 3420, 1],
  [489, 1, 1, 305],
  [3, 18, 2469, 1],
  [3, 1, 1, 3],
  [3, 18, 2773, 1],
  [3, 1, 1, 3],
  [3, 4536, 1, 1],
  [314, 1, 1, 489],
  [3, 18, 3341, 1],
  [3, 18, 2773, 1],
  [3, 1, 1, 507],
  [3, 1, 1, 3],
  [3, 6039, 1, 1]],
 ['7',
  '2',
  '23',
  '-8',
  '-17',
  '-28',
  '-12',
  '27',
  '42',
  '-38',
  '7',
  '-12',
  '14',
  '-11',
  '-38',
  '2',
  '-48',
  '49',
  '-29',
  '-44',
  '5',
  '-45',
  '29',
  '-42',
  '10',
  '-7',
  '7',
  '-45',
  '-37',
  '19',
  '49',
  '-9'])

In [0]:
# check the number of eval steps for the entire benchmarking process
num_eval_steps = len(eval_dataloader)
num_eval_steps

313
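
The step count follows from the dataset and batch sizes, since the DataLoader keeps the final partial batch by default. A quick check of the arithmetic:

```python
import math

num_rows, batch_size = 10_000, 32
num_steps = math.ceil(num_rows / batch_size)
last_batch = num_rows - (num_steps - 1) * batch_size
print(num_steps, last_batch)  # 313 16
```

So the last of the 313 batches holds only 16 examples rather than 32.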

In [0]:
# now we run a full benchmarking on our eval set
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_eval_steps))
exact_match_benchmark = load("exact_match")
# create an empty dict to populate with the results for partial correctness evaluation
partials = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    # logits = outputs.logits
    # predictions = torch.argmax(logits, dim=-1)
    for pred, label in zip(outputs.logits.argmax(dim=-1), batch["labels"]):
        # decode once and reuse the strings for both the metric and the partials dict
        decoded_pred = tokenizer.decode(pred, skip_special_tokens=True)
        decoded_label = tokenizer.decode(label, skip_special_tokens=True)
        # add decoded predictions and labels to the metric object created for the full run
        exact_match_benchmark.add(predictions=decoded_pred, references=decoded_label)
        # populate the partial correctness dict for detailed, individual eval
        partials["predicted_tokens"].append(pred)
        partials["label_tokens"].append(label)
        partials["decoded_prediction"].append(decoded_pred)
        partials["decoded_label"].append(decoded_label)

    # alternatively, the decoded lists for each batch could be passed to `exact_match_benchmark.add_batch`
    progress_bar.update(1)
    if progress_bar.n % 100 == 0:
        mem_status()

print(exact_match_benchmark.compute())
partial_correctness_dataset = Dataset.from_dict(partials)
# print(f1_benchmark.compute())

  0%|          | 0/313 [00:00<?, ?it/s]

Memory status: 
GPU 0:
  Total memory: 15.57 GB
  Allocated memory: 3.53 GB
  Reserved memory: 3.77 GB
  Available memory: 11.81 GB
Memory status: 
GPU 0:
  Total memory: 15.57 GB
  Allocated memory: 3.54 GB
  Reserved memory: 3.81 GB
  Available memory: 11.77 GB
Memory status: 
GPU 0:
  Total memory: 15.57 GB
  Allocated memory: 3.54 GB
  Reserved memory: 3.81 GB
  Available memory: 11.77 GB
{'exact_match': 0.0956}


In [0]:
# peek the created dataset
partial_correctness_dataset

Dataset({
    features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
    num_rows: 10000
})

In [0]:
# let's push the partial evaluation dataset to the hub for saving
dbutils.widgets.text("hf_token", "", "hf_token")

In [0]:
hf_token = dbutils.widgets.get("hf_token")
!huggingface-cli login --token $hf_token

usage: huggingface-cli <command> [<args>] login [-h] [--token TOKEN]
                                                [--add-to-git-credential]
huggingface-cli <command> [<args>] login: error: argument --token: expected one argument
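
The error above suggests that `$hf_token` expanded to an empty string, so the CLI saw `--token` with no value. A minimal sketch of a guard that fails early and logs in through the `huggingface_hub` Python API instead of the shell is below; `hub_login` is a hypothetical helper.

```python
def hub_login(token: str) -> None:
    """Log in to the Hugging Face Hub, refusing to proceed with an empty token."""
    token = (token or "").strip()
    if not token:
        raise ValueError("hf_token is empty; fill in the widget before logging in")
    # imported lazily so the guard above works even where huggingface_hub is missing
    from huggingface_hub import login
    login(token=token)
```

In this notebook it would be called as `hub_login(dbutils.widgets.get("hf_token"))`.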


In [0]:
partial_correctness_dataset.push_to_hub("FLAN-T5-DeepMind-LinAlg-1D-benchmark", commit_message="dataset constructed for benchmarking the partial correctness of the pretrained FLAN T5 large model's ability to solve 1D linear equations")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark/commit/6c76ae43e827c8037a55a790c57a88f0d03addc5', commit_message="dataset constructed for benchmarking the partial correctness of the pretrained FLAN T5 large model's ability to solve 1D linear equations", commit_description='', oid='6c76ae43e827c8037a55a790c57a88f0d03addc5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark'), pr_revision=None, pr_num=None)

In [0]:
from datasets import load_dataset

partial_correctness_ds = load_dataset("MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark")



Downloading readme:   0%|          | 0.00/415 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/103k [00:00<?, ?B/s]

Generating eval split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [0]:
# rename the split to eval 
data = partial_correctness_ds.pop("train")
partial_correctness_ds["eval"] = data
partial_correctness_ds

In [0]:
partial_correctness_ds.push_to_hub("FLAN-T5-DeepMind-LinAlg-1D-benchmark", commit_message="update the dataset's split name")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/418 [00:00<?, ?B/s]



CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark/commit/4cbda527ccc0f11917280f9a0fd37ac0cac072b2', commit_message="update the dataset's split name", commit_description='', oid='4cbda527ccc0f11917280f9a0fd37ac0cac072b2', pr_url=None, pr_revision=None, pr_num=None)

In [0]:
partial_correctness_ds = partial_correctness_ds["eval"]
partial_correctness_ds

Dataset({
    features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
    num_rows: 10000
})

In [0]:
# let's analyze this data
partial_correctness_ds[:10]

{'predicted_tokens': [[314, 1, 1, 489],
  [668, 1, 1, 431],
  [3, 1, 1, 1401],
  [3, 5783, 1, 1],
  [3, 18, 1, 1],
  [3, 18, 2469, 1],
  [3, 6039, 1, 1],
  [3, 1, 1, 3],
  [3, 1, 1, 3],
  [3, 18, 3420, 1]],
 'label_tokens': [[489, 1, 0, 0],
  [204, 1, 0, 0],
  [1902, 1, 0, 0],
  [3, 6039, 1, 0],
  [3, 10794, 1, 0],
  [3, 18, 2577, 1],
  [3, 5947, 1, 0],
  [2307, 1, 0, 0],
  [6426, 1, 0, 0],
  [3, 18, 3747, 1]],
 'decoded_prediction': ['4 7',
  '9 6',
  '21',
  '-6',
  '-',
  '-35',
  '-8',
  '',
  '',
  '-36'],
 'decoded_label': ['7',
  '2',
  '23',
  '-8',
  '-17',
  '-28',
  '-12',
  '27',
  '42',
  '-38']}

In [0]:
# first, add some interesting metrics to each of our records
# let's compute the F1, precision, and recall scores for each prediction, inclusive of the special tokens, while also adding a boolean flag for exact matches
from sklearn.metrics import f1_score, precision_score, recall_score

def compute_individual_metrics(example):
    example["f1__w_special"] = f1_score(y_true=example["label_tokens"], y_pred=example["predicted_tokens"], average='micro')
    example["precision__w_special"] = precision_score(y_true=example["label_tokens"], y_pred=example["predicted_tokens"], average='micro')
    example["recall__w_special"] = recall_score(y_true=example["label_tokens"], y_pred=example["predicted_tokens"], average='micro')
    example["is_exact_match"] = example["decoded_prediction"] == example["decoded_label"]
    return example

In [0]:
# test the metric computation on a single record
test_record = compute_individual_metrics(partial_correctness_ds[0])
test_record

{'predicted_tokens': [314, 1, 1, 489],
 'label_tokens': [489, 1, 0, 0],
 'decoded_prediction': '4 7',
 'decoded_label': '7',
 'f1__w_special': 0.25,
 'precision__w_special': 0.25,
 'recall__w_special': 0.25,
 'is_exact_match': False}

In [0]:
individual_metrics_ds = partial_correctness_ds.map(compute_individual_metrics)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [0]:
individual_metrics_ds[:5]

{'predicted_tokens': [[314, 1, 1, 489],
  [668, 1, 1, 431],
  [3, 1, 1, 1401],
  [3, 5783, 1, 1],
  [3, 18, 1, 1]],
 'label_tokens': [[489, 1, 0, 0],
  [204, 1, 0, 0],
  [1902, 1, 0, 0],
  [3, 6039, 1, 0],
  [3, 10794, 1, 0]],
 'decoded_prediction': ['4 7', '9 6', '21', '-6', '-'],
 'decoded_label': ['7', '2', '23', '-8', '-17'],
 'f1__w_special': [0.25, 0.25, 0.25, 0.5, 0.5],
 'precision__w_special': [0.25, 0.25, 0.25, 0.5, 0.5],
 'recall__w_special': [0.25, 0.25, 0.25, 0.5, 0.5],
 'is_exact_match': [False, False, False, False, False]}
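
The three metric columns are identical in every row, which is expected: with exactly one predicted and one true token per position, micro-averaged precision, recall, and F1 all reduce to position-wise accuracy. A small pure-Python sketch (`token_accuracy` is a hypothetical helper) reproduces the values above:

```python
def token_accuracy(predicted, label):
    """Fraction of positions where the predicted token id equals the label token id."""
    if len(predicted) != len(label):
        raise ValueError("sequences must have the same length")
    matches = sum(p == l for p, l in zip(predicted, label))
    return matches / len(label)

print(token_accuracy([314, 1, 1, 489], [489, 1, 0, 0]))  # 0.25 (only the EOS at position 1 matches)
print(token_accuracy([3, 5783, 1, 1], [3, 6039, 1, 0]))  # 0.5
```

In the first record the single match is an end-of-sequence token, so the score says nothing about the digits of the answer.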

In [0]:
# we know the exact match score is .0956
exact_matches = individual_metrics_ds.filter(lambda x: x['is_exact_match'] == True)

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [0]:
len(exact_matches)

956

In [0]:
num_perfect_tokens = 0
for match in exact_matches:
    if match['f1__w_special'] == 1.0:
        num_perfect_tokens += 1
num_perfect_tokens

126

In [0]:
# that is, only about 13% of the exactly matched decoded predictions are also identical at the token level
print(f">>> Of the correct decoded matches, only {num_perfect_tokens / len(exact_matches):.4f} have predicted and label tokens that are identical.")

>>> Of the correct decoded matches, only 0.1318 have predicted and label tokens that are identical.


In [0]:
exact_matches[:25]

{'predicted_tokens': [[3, 4525, 1, 1],
  [3, 18, 2773, 1],
  [3, 4949, 1, 1],
  [3, 1, 1, 505],
  [3, 1, 1, 944],
  [3, 1, 1, 668],
  [3, 18, 2469, 1],
  [3, 1, 1, 507],
  [3, 6039, 1, 1],
  [3, 1, 1, 604],
  [3, 1, 1, 943],
  [3, 1, 1, 305],
  [3, 1, 1, 209],
  [3, 1, 1, 6654],
  [3, 18, 3647, 1],
  [3, 1, 1, 898],
  [3, 1, 1, 335],
  [3, 1, 1, 1630],
  [3, 1, 1, 335],
  [3, 2292, 1, 1],
  [3, 18, 2773, 1],
  [3, 632, 1, 1],
  [3, 4949, 1, 1],
  [3, 1, 1, 204],
  [3, 1, 1, 505]],
 'label_tokens': [[3, 4525, 1, 0],
  [3, 18, 2773, 1],
  [3, 4949, 1, 0],
  [505, 1, 0, 0],
  [944, 1, 0, 0],
  [668, 1, 0, 0],
  [3, 18, 2469, 1],
  [507, 1, 0, 0],
  [3, 6039, 1, 0],
  [604, 1, 0, 0],
  [943, 1, 0, 0],
  [305, 1, 0, 0],
  [209, 1, 0, 0],
  [6654, 1, 0, 0],
  [3, 18, 3647, 1],
  [898, 1, 0, 0],
  [335, 1, 0, 0],
  [1630, 1, 0, 0],
  [335, 1, 0, 0],
  [3, 2292, 1, 0],
  [3, 18, 2773, 1],
  [3, 632, 1, 0],
  [3, 4949, 1, 0],
  [204, 1, 0, 0],
  [505, 1, 0, 0]],
 'decoded_prediction': ['-5',
  

In [0]:
wrong_matches = individual_metrics_ds.filter(lambda x: x['is_exact_match'] == False)

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [0]:
# count the wrong predictions with a token-level F1 of at least 0.5
# many of these partial matches may come from special tokens (EOS, padding) rather than from the answer itself
num = 0
for m in wrong_matches:
    if m["f1__w_special"] >= 0.5:
        num += 1
num

4861

The model's poor overall performance, notably its exact match score of `0.0956`, coupled with the prevalence of special tokens in both predictions and labels, makes it difficult to interpret the partial correctness of predictions. It may be wiser to go ahead and train our model first. From there, we can review its most important metric, the exact match score, for full correctness of predictions, and then investigate the structure of partial correctness in those (hopefully) much improved predictions. At that point, we might learn something interesting from the trends in incorrect answers before ultimately returning here and applying a similar analysis to this messier, pretrained benchmark.
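
One way to reduce the influence of special tokens in a later analysis would be to drop them before scoring. The sketch below assumes T5's padding id `0` and end-of-sequence id `1`, and scores the remaining tokens as multisets, ignoring order. `content_token_f1` is a hypothetical helper, not something used in this notebook.

```python
from collections import Counter

def content_token_f1(predicted, label, ignore_ids=(0, 1)):
    """F1 over token ids after dropping `ignore_ids`, ignoring token order."""
    pred = Counter(t for t in predicted if t not in ignore_ids)
    gold = Counter(t for t in label if t not in ignore_ids)
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

# this pair matches only on the EOS token, so it scores 0.25 position-wise but 0.0 here
print(content_token_f1([3, 1, 1, 3], [2307, 1, 0, 0]))  # 0.0
```

Other ids can be added to `ignore_ids` if further tokens turn out to carry no answer content. That choice is left as an assumption to check against the tokenizer's vocabulary.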