# How-To Evaluate Models with LM-Eval

The `harness` research project implements a wrapper over the `lm-eval` framework provided by EleutherAI. It is designed to make it easy to evaluate NLP models and compare their performance. This tutorial walks through evaluating NLP models with `harness`: how to set up the framework, how to use it to evaluate models, and how to interpret the results.

## Installation

The `harness` project is designed to be an installable module, which allows users to call it from outside its package. It can be installed as follows:

In [3]:
try:
    import harness
except ModuleNotFoundError:
    !pip install git+https://github.com/microsoft/archai.git@pre-release#subdirectory=research/harness
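
Before moving on, it can help to confirm that the packages used in this tutorial are visible to the current Python environment. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper introduced here for illustration and is not part of `harness` or `lm_eval`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Top-level module names imported elsewhere in this tutorial.
for name in ("harness", "lm_eval", "transformers"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A `missing` entry suggests that the installation step above did not run in the same environment as the notebook kernel.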

## Wrap Requirements (Model and Tokenizer)

The first step is to load a model (an instance of `torch.nn.Module`) and its tokenizer (loaded through `transformers.AutoTokenizer`).

In this example, we will load the pre-trained `gpt2` from the Hugging Face Hub and its tokenizer:

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from harness.lm_eval_hf_model import HFEvalModel

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

hf_model = HFEvalModel(model, tokenizer)

## Start the Evaluation

After the model and tokenizer have been loaded, evaluating a model is as simple as calling the `evaluate_wrapper()` function.

Note that we opted to create a wrapper over the `lm_eval.evaluator.evaluate()` method to meet our research needs, which consist of easily prototyping new models based on Hugging Face. Nevertheless, users are free to bring their own models and any additional functionality they might need.

### Required Arguments

* `hf_model`: An instance of a model and tokenizer that have been wrapped with the `HFEvalModel` class.
* `tasks`: A list of string-based task identifiers.

### Optional Arguments

* `num_fewshot`: Number of few-shot samples, defaults to `0`.
* `no_cache`: Disables the caching mechanism and re-computes the predictions, defaults to `False`.
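
Because `tasks` is a list of plain strings, a typo in a task name is easy to miss until the evaluation is launched. The sketch below shows one way to check requested names against a list of available ones, such as `lm_eval.tasks.ALL_TASKS`. The helper `validate_tasks` is hypothetical (it is not part of `harness` or `lm_eval`), and the short `available_tasks` list is a stand-in so that the example runs on its own.

```python
import difflib
from typing import Iterable, List


def validate_tasks(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested task names, raising ValueError on an unknown one."""
    available = list(available)
    known = set(available)
    valid = []
    for name in requested:
        if name not in known:
            # Suggest up to three similarly spelled task names, if any exist.
            suggestions = difflib.get_close_matches(name, available, n=3)
            hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
            raise ValueError(f"Unknown task '{name}'.{hint}")
        valid.append(name)
    return valid


# Stand-in for lm_eval.tasks.ALL_TASKS (assumption: a list of strings).
available_tasks = ["anli_r1", "arc_challenge", "arc_easy", "copa"]
checked_tasks = validate_tasks(["copa"], available_tasks)
print(checked_tasks)
```

In a real session, the stand-in list would be replaced by `ALL_TASKS` and the returned list passed as the `tasks` argument.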

In [5]:
from lm_eval.tasks import ALL_TASKS
from harness.lm_eval_evaluator import evaluate_wrapper

print(f"List of tasks: {ALL_TASKS}")

outputs = evaluate_wrapper(
    hf_model,
    ["copa"],
    num_fewshot=0,
    no_cache=True,
)

Could not import signal.SIGPIPE (this is expected on Windows machines)


List of tasks: ['anagrams1', 'anagrams2', 'anli_r1', 'anli_r2', 'anli_r3', 'arc_challenge', 'arc_easy', 'arithmetic_1dc', 'arithmetic_2da', 'arithmetic_2dm', 'arithmetic_2ds', 'arithmetic_3da', 'arithmetic_3ds', 'arithmetic_4da', 'arithmetic_4ds', 'arithmetic_5da', 'arithmetic_5ds', 'blimp_adjunct_island', 'blimp_anaphor_gender_agreement', 'blimp_anaphor_number_agreement', 'blimp_animate_subject_passive', 'blimp_animate_subject_trans', 'blimp_causative', 'blimp_complex_NP_island', 'blimp_coordinate_structure_constraint_complex_left_branch', 'blimp_coordinate_structure_constraint_object_extraction', 'blimp_determiner_noun_agreement_1', 'blimp_determiner_noun_agreement_2', 'blimp_determiner_noun_agreement_irregular_1', 'blimp_determiner_noun_agreement_irregular_2', 'blimp_determiner_noun_agreement_with_adj_2', 'blimp_determiner_noun_agreement_with_adj_irregular_1', 'blimp_determiner_noun_agreement_with_adj_irregular_2', 'blimp_determiner_noun_agreement_with_adjective_1', 'blimp_distracto

Reusing dataset super_glue (C:\Users\gderosa\.cache\huggingface\datasets\super_glue\copa\1.0.2\d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)


  0%|          | 0/3 [00:00<?, ?it/s]

Running loglikelihood requests


100%|██████████| 200/200 [00:24<00:00,  8.27it/s]
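
The `num_fewshot` argument makes it straightforward to compare zero-shot and few-shot behavior on the same tasks. The sketch below loops over several values and collects one result per setting. It assumes only the call shape used in this tutorial, `evaluate_wrapper(hf_model, tasks, num_fewshot=..., no_cache=...)`; the names `sweep_fewshot` and `fake_evaluate` are hypothetical, and `fake_evaluate` is a stub that stands in for `evaluate_wrapper` so that the example runs without downloading a model.

```python
from typing import Any, Callable, Dict, Sequence


def sweep_fewshot(
    evaluate_fn: Callable[..., Any],
    model: Any,
    tasks: Sequence[str],
    shots: Sequence[int] = (0, 1, 5),
) -> Dict[int, Any]:
    """Run `evaluate_fn` once per few-shot setting and collect the outputs."""
    results = {}
    for k in shots:
        # no_cache=True forces predictions to be re-computed for each setting.
        results[k] = evaluate_fn(model, list(tasks), num_fewshot=k, no_cache=True)
    return results


def fake_evaluate(model, tasks, num_fewshot=0, no_cache=False):
    """Stub standing in for evaluate_wrapper; it only echoes its arguments."""
    return {"tasks": tasks, "num_fewshot": num_fewshot, "no_cache": no_cache}


sweep = sweep_fewshot(fake_evaluate, model=None, tasks=["copa"], shots=(0, 2))
for k, output in sweep.items():
    print(k, output)
```

With the real function, the call would be `sweep_fewshot(evaluate_wrapper, hf_model, ["copa"])`; each setting repeats the full evaluation, so the run time grows with the number of values in `shots`.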


## Formatting the Outputs

After the predictions have been computed, they are stored in the `outputs` variable (a dictionary). In addition, `lm_eval` provides a function named `make_table()` that formats the outputs into a readable table:

In [6]:
from lm_eval.evaluator import make_table

print(make_table(outputs))

|Task|Version|Metric|Value|   |Stderr|
|----|------:|------|----:|---|-----:|
|copa|      0|acc   | 0.64|±  |0.0482|
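
To interpret the table, the `Stderr` column can be turned into an approximate confidence interval around `Value`. The sketch below uses a normal approximation (value ± 1.96 × stderr for roughly 95% coverage) and compares the lower bound with the 0.5 chance level of COPA, which is a two-choice task. The dictionary layout (`"results"` → task → `"acc"` / `"acc_stderr"`) is an assumption about how `outputs` is structured, filled in by hand with the numbers from the table above; `confidence_interval` is a hypothetical helper.

```python
from typing import Tuple


def confidence_interval(value: float, stderr: float, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval: value +/- z * stderr."""
    margin = z * stderr
    return value - margin, value + margin


# Assumed layout, populated with the values shown in the table above.
outputs_example = {"results": {"copa": {"acc": 0.64, "acc_stderr": 0.0482}}}

metrics = outputs_example["results"]["copa"]
low, high = confidence_interval(metrics["acc"], metrics["acc_stderr"])
print(f"acc = {metrics['acc']:.2f}, approx. 95% CI = [{low:.3f}, {high:.3f}]")

chance = 0.5  # COPA offers two alternatives per question
print("above chance" if low > chance else "not distinguishable from chance")
```

With a standard error this large relative to the score, differences of a few accuracy points between models on this task should be read with caution.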

