In [1]:
%load_ext autoreload
%autoreload 2
import lm_eval.evaluator
import lm_eval.tasks  # used below via lm_eval.tasks.get_task_dict
import llminference as L

# Load a pretrained model & evaluate

This codebase provides its own harnesses for evaluating language models and interfaces with external ones, with a particular focus on text generation.

In [9]:
adapter = L.Adapter.from_pretrained("EleutherAI/pythia-410m")

## TriviaQA

We evaluate TriviaQA using a custom harness that supports on-disk context caching. It is deliberately bare-bones, so it is easy to inspect the data and results directly.

In [17]:
triviaqa_data = L.qa.TriviaQA.data()
examples = [L.qa.add_zero_shot_prompt(triviaqa_data[i], L.qa.TriviaQA.DEFAULT_PROMPT) for i in range(20)]
display(examples[2])
results = list(L.qa.evaluate(adapter, examples, batch_size=10, open_book=False, use_cache=False))
display(results[2])
print("em_accuracy", sum(r["match"] for r in results) / len(results))

{'question': 'What was Amy Williams sled called on which she won Olympic gold for Britain at Vancouver in the Skeleton event?',
 'question_id': 'bb_9117',
 'answers': ['Arthur', 'Arthur (name)'],
 'context': 'Amy Williams: My body told me it was time to quit | Daily Mail Online\ncomments\nOlympic champion Amy Williams remembers how she fought an urge to cry as she climbed from her sled at the skeleton bob World Championships in Lake Placid.\nWhile others rushed to congratulate her on her finest performance since she had swept into the hearts of the nation at the Vancouver Games two years earlier, her reaction had nothing to do with finishing fifth.\nAfter 10 years of hurling herself head first down a mountain on a tea tray, she knew she had raced for the last time.\nAll over: Amy knew she had to call a halt to her skeleton years, but  is excited  at carving out a new future\n‘My body was screaming out: “Stop, stop, stop!” ’ said Williams as she recalled the precise moment two months ag

Evaluating EleutherAI/pythia-410m: 100%|██████████| 2/2 [00:10<00:00,  5.14s/it]


{'question_id': 'bb_9117',
 'output': '\n\nAmy Williams\n\nAmy Williams\n\nAmy Williams\n\nAmy Williams\n\nAmy Williams\n\nAmy Williams',
 'match': False}

em_accuracy 0.15
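
The `match` flag above feeds directly into `em_accuracy`. The notebook does not show how `L.qa.evaluate` decides a match, so the following is only a minimal sketch of one common exact-match rule: normalise both strings, then check whether the output begins with any accepted answer. The helpers `normalise` and `is_match` are hypothetical names, not part of `llminference`.

```python
import re
import string


def normalise(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_match(output: str, answers: list[str]) -> bool:
    """Assumed rule: the normalised output starts with a normalised answer."""
    prediction = normalise(output)
    candidates = [normalise(answer) for answer in answers]
    # Skip answers that normalise to the empty string, which would match anything.
    return any(c and prediction.startswith(c) for c in candidates)


# Values copied from the example above (output shortened).
answers = ["Arthur", "Arthur (name)"]
output = "\n\nAmy Williams\n\nAmy Williams"

results = [
    {"match": is_match(output, answers)},
    {"match": is_match("\n\nArthur\n\nArthur", answers)},
]
print(results[0]["match"])  # False
print("em_accuracy", sum(r["match"] for r in results) / len(results))  # 0.5
```

Under this rule the repeated "Amy Williams" output scores as a miss, which agrees with the `'match': False` shown above.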


## LM Evaluation Harness

[LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) provides a broad set of tasks. The adapter can be passed directly to its evaluator, shown here for `wikitext` with `limit=2` to keep the run short.

In [5]:
display(lm_eval.evaluator.evaluate(adapter, lm_eval.tasks.get_task_dict(["wikitext"]), limit=2))

Running loglikelihood_rolling requests


100%|██████████| 2/2 [00:28<00:00, 14.25s/it]


{'results': {'wikitext': {'word_perplexity': 19.207863308659714,
   'byte_perplexity': 1.6874262449793889,
   'bits_per_byte': 0.7548244453765823}},
 'versions': {'wikitext': 1}}
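
The three numbers above are closely related. As a rough check, `bits_per_byte` should equal the base-2 logarithm of `byte_perplexity`. If the word-level and byte-level perplexities are computed from the same total log-likelihood (an assumption here), the ratio of their logarithms gives the average number of bytes per word in the evaluated text. The function names below are illustrative only.

```python
import math

# Values copied from the wikitext results above.
word_perplexity = 19.207863308659714
byte_perplexity = 1.6874262449793889
reported_bits_per_byte = 0.7548244453765823


def bits_per_byte(byte_ppl: float) -> float:
    """Convert a per-byte perplexity into bits per byte."""
    return math.log2(byte_ppl)


def bytes_per_word(word_ppl: float, byte_ppl: float) -> float:
    """Average bytes per word, assuming both perplexities share one log-likelihood."""
    return math.log(word_ppl) / math.log(byte_ppl)


print(bits_per_byte(byte_perplexity))  # agrees with the reported value
print(bytes_per_word(word_perplexity, byte_perplexity))  # between 5.6 and 5.7
```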

## Outcompare

Outcompare is a custom harness for comparing the greedy generations of a language model against a reference output (e.g. from the same model before quantisation to low precision).

In [7]:
outcompare_data = L.outcompare.Dataset.load("../data/pythia-410m.json")
display(L.outcompare.evaluate(adapter.model, outcompare_data, batch_size=16, limit=64))

{'entropy_rmse': 0.0,
 'entropy_rmse_stderr': 0.0,
 'exact_match_length': 64.0,
 'exact_match_length_stderr': 0.0,
 'edit_distance_L16': 0.0,
 'edit_distance_L16_stderr': 0.0}
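
To make the metric names more concrete, here is a minimal sketch of two per-example quantities they plausibly correspond to. It assumes that `exact_match_length` counts the leading tokens on which the generation agrees with the reference, and that `edit_distance_L16` is a Levenshtein distance over the first 16 generated tokens. Neither definition is confirmed by the output above, and the functions are not the `L.outcompare` implementation.

```python
from typing import Sequence


def exact_match_length(reference: Sequence[int], generated: Sequence[int]) -> int:
    """Count leading positions where the two token sequences agree."""
    length = 0
    for ref_token, gen_token in zip(reference, generated):
        if ref_token != gen_token:
            break
        length += 1
    return length


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                         # delete token_a
                current[j - 1] + 1,                      # insert token_b
                previous[j - 1] + (token_a != token_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


# Hypothetical token ids: the generation diverges at the fourth token.
reference = [5, 8, 2, 9, 4, 7]
generated = [5, 8, 2, 1, 4, 7]
print(exact_match_length(reference, generated))      # 3
print(edit_distance(reference[:4], generated[:4]))   # 1
```

With identical sequences these give the full length and zero respectively, the same pattern as the unmodified-model results above.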

## Deliberately mess up the model & see what happens

In [8]:
# Zero the attention output projection of layer 4 in place, then re-run the comparison
adapter.model.gpt_neox.layers[4].attention.dense.weight.data.fill_(0)
display(L.outcompare.evaluate(adapter.model, outcompare_data, batch_size=16, limit=64))

{'entropy_rmse': 0.2652008831501007,
 'entropy_rmse_stderr': 0.05454297363758087,
 'exact_match_length': 6.546875,
 'exact_match_length_stderr': 0.8977843523025513,
 'edit_distance_L16': 7.640625,
 'edit_distance_L16_stderr': 0.6312651038169861}
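
Each metric is reported alongside a `_stderr` value. A minimal sketch of how such a pair could be computed from per-example values is shown below. It assumes the standard error of the mean with the sample standard deviation. Whether `L.outcompare` uses exactly this estimator is not stated here, and the input values are made up for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and the standard error of the mean (needs >= 2 values)."""
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


# Hypothetical per-example match lengths, not taken from the run above.
match_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
mean, stderr = mean_and_stderr(match_lengths)
print(f"exact_match_length {mean:.3f} +/- {stderr:.3f}")
```

The standard error shrinks as `limit` grows, so small differences between two models are easier to trust on larger samples.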