In [1]:
%load_ext autoreload
%autoreload 2

import llminference as L



# Load a pretrained model & evaluate

This codebase provides, and interfaces with, several harnesses for evaluating language models, with a particular focus on text generation.

In [2]:
adapter = L.Adapter.from_pretrained("EleutherAI/pythia-410m")

## SQuAD

We evaluate SQuAD using a custom harness. It is deliberately bare-bones, so the data and results are easy to inspect directly.

In [8]:
squad_data = L.qa.SQuAD.data()
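# Pick the default 1-shot prompt template for the loaded model, then build 10 one-shot examples
prompt_template = L.qa.get_default_prompt_template(adapter.model.config._name_or_path, shots=1)
examples = [L.qa.add_few_shot_prompt(squad_data[i], k=1, prompt_template=prompt_template)
            for i in range(10)]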
display(examples[3])
results = list(L.qa.evaluate(adapter, examples, batch_size=10))
display(results[3])
print("accuracy", sum(r["match"] for r in results) / len(results))

{'id': '57274e145951b619008f87eb',
 'context': 'Title: Martin Luther. Background: On 31 October 1517, Luther wrote to his bishop, Albert of Mainz, protesting the sale of indulgences. He enclosed in his letter a copy of his "Disputation of Martin Luther on the Power and Efficacy of Indulgences", which came to be known as The Ninety-Five Theses. Hans Hillerbrand writes that Luther had no intention of confronting the church, but saw his disputation as a scholarly objection to church practices, and the tone of the writing is accordingly "searching, rather than doctrinaire." Hillerbrand writes that there is nevertheless an undercurrent of challenge in several of the theses, particularly in Thesis 86, which asks: "Why does the pope, whose wealth today is greater than the wealth of the richest Crassus, build the basilica of St. Peter with the money of poor believers rather than with his own money?"\nTitle: American Broadcasting Company. Background: In the spring of 1975, Fred Pierce, the newl

Evaluating EleutherAI/pythia-410m: 100%|██████████| 1/1 [01:12<00:00, 72.33s/it]


{'id': '57274e145951b619008f87eb',
 'output': " Nepali\nQuestion: What is the difference between 'private' and 'un-aided' schools?\nAnswer:",
 'match': True,
 'prefill_length': 1748}

accuracy 0.2
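The `match` flag above is computed inside the harness, and its exact rule is not shown in this notebook. As a rough illustration only, the sketch below follows the normalisation convention of the official SQuAD evaluation script (lowercase, strip punctuation and English articles, collapse whitespace). Comparing only the first generated line is an assumption, motivated by the model continuing with a new `Question:` after its answer. The names `normalize_answer` and `is_match` and the toy strings are hypothetical, not part of `llminference`.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_match(output: str, answers: Sequence[str]) -> bool:
    """Assumed rule: compare only the first generated line with each reference."""
    first_line = output.strip().split("\n")[0]
    prediction = normalize_answer(first_line)
    return any(prediction == normalize_answer(answer) for answer in answers)


# Toy strings for illustration, not taken from the SQuAD data above
print(is_match(" The Eiffel Tower.\nQuestion: next?", ["Eiffel Tower"]))
```

A stricter or looser rule (for example, substring containment) would change the reported accuracy, so it is worth checking the harness source before comparing numbers across codebases.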


## Outcompare

Outcompare is a custom harness for comparing the greedy generations of a language model against a reference output (e.g. the same model before quantisation to low precision).

In [9]:
outcompare_data = L.outcompare.Dataset.load("../data/pythia-410m.json")
display(L.outcompare.evaluate(adapter.model, outcompare_data, batch_size=16, limit=64))

{'entropy_rmse': 0.0,
 'entropy_rmse_stderr': 0.0,
 'exact_match_length': 64.0,
 'exact_match_length_stderr': 0.0,
 'edit_distance_L16': 0.0,
 'edit_distance_L16_stderr': 0.0}
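The metric names are not defined in this notebook. A plausible reading, stated here as an assumption, is that `exact_match_length` counts the leading tokens on which the generation agrees with the reference, and that `edit_distance_L16` is a Levenshtein distance over the first 16 tokens. The sketch below implements that reading on plain token-id lists; the function names are hypothetical and the real harness may differ.

```python
from typing import Sequence


def exact_match_length(reference: Sequence[int], generated: Sequence[int]) -> int:
    """Number of leading tokens on which the two sequences agree."""
    n = 0
    for ref_token, gen_token in zip(reference, generated):
        if ref_token != gen_token:
            break
        n += 1
    return n


def edit_distance(reference: Sequence[int], generated: Sequence[int], length: int = 16) -> int:
    """Levenshtein distance between the first `length` tokens of each sequence."""
    a, b = list(reference[:length]), list(generated[:length])
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


# Made-up token ids: the sequences diverge at position 2 by one substitution
reference = [1, 2, 3, 4, 5]
generated = [1, 2, 9, 4, 5]
print(exact_match_length(reference, generated), edit_distance(reference, generated))
```

Under this reading, a model compared against its own reference agrees on every token and has zero edit distance, which is consistent with the zero error values printed above.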

### Deliberately mess up the model & see what happens

In [10]:
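# Zero the attention output projection of layer 4, in place; `adapter.model` stays damaged afterwards
adapter.model.gpt_neox.layers[4].attention.dense.weight.data.fill_(0)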
display(L.outcompare.evaluate(adapter.model, outcompare_data, batch_size=16, limit=64))

{'entropy_rmse': 0.2652008831501007,
 'entropy_rmse_stderr': 0.05454297363758087,
 'exact_match_length': 6.546875,
 'exact_match_length_stderr': 0.8977843523025513,
 'edit_distance_L16': 7.640625,
 'edit_distance_L16_stderr': 0.6312651038169861}
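Each metric is reported alongside a `_stderr` value. Assuming this is the standard error of the mean over the evaluated examples (the notebook does not say so explicitly), it could be computed as in the sketch below. The use of the sample standard deviation, the name `mean_and_stderr` and the input values are all illustrative assumptions.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean, using the sample standard deviation."""
    n = len(values)
    mean = statistics.fmean(values)
    # With a single value the spread is undefined; report 0.0 in that case
    stderr = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean, stderr


# Made-up per-example match lengths, not taken from the run above
lengths = [4, 8, 6, 10]
mean, stderr = mean_and_stderr(lengths)
print(f"exact_match_length {mean:.3f} +/- {stderr:.3f}")
```

Reading the damaged-model results with their standard errors helps judge whether a change in a metric is larger than the example-to-example noise.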