# How-To Evaluate a Custom Task with LM-Eval

Even though `lm-eval` framework supports more than 200 tasks, one might want to implement an additional one. With that in mind, this tutorial walks through the process of creating a custom task, including it in the registry and evaluating models with it.

## Installation

The `harness` project is designed to be an installable module, which allow users to call it from outside its package. Thus, one can install it as follows:

In [16]:
try:
    import harness
except ModuleNotFoundError:
    !pip install git+https://github.com/microsoft/archai.git@pre-release#subdirectory=research/harness

## Creating a Custom Task

Tasks always inherits from the base class `Task`, which is implemented by the `lm_eval.base` module. When defining a custom task, there are some constants and methods that need to be overriden:

### Constants

* `VERSION`: Indicates the version of the task for reproducibility.
* `DATASET_PATH`: Name of the dataset from the Hugging Face Hub.
* `DATASET_NAME`: Configuration name of the dataset from the Hugging Face Hub.

### Methods

* `should_decontaminate()`: Whether can be decontaminated with an n-grams file.
* `has_training_docs()`: Whether dataset supports a training set.
* `has_validation_docs()`: Whether dataset supports a validation set.
* `has_test_docs()`: Whether dataset supports a testing set.
* `test_docs()`: Indicates the `DatasetDict` key to be used for the testing samples.
* `doc_to_text()`: Defines the task input.
* `doc_to_target()`: Defines the task target.
* `construct_requests()`: Creates a tuple of requests that defines the core computation of the task (e.g., usually zero-shot is conducted using log-likelihood over the desired target token).
* `process_results()`: Processes the output of the requests and calculates their metric (e.g., accuracy).
* `aggregation()`: Defines how multiple outputs should be aggregated (e.g., mean).
* `higher_is_better()`: Defines if a higher metric value corresponds to a better metric.

*One can refer to the `lm-eval` implemented tasks if additional information is needed: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/lm_eval/tasks.*

In this example, we will be implementing the AX-b task from the SuperGLUE benchmark:

In [17]:
from typing import Any, Dict, List

from datasets.arrow_dataset import Dataset
from harness.utils.request_factory import Request, rf
from lm_eval.base import Task
from lm_eval.metrics import mean

class AXb(Task):
    VERSION = 0
    DATASET_PATH = "super_glue"
    DATASET_NAME = "axb"

    def should_decontaminate(self) -> bool:
        return False

    def has_training_docs(self) -> bool:
        return False

    def has_validation_docs(self) -> bool:
        return False

    def has_test_docs(self) -> bool:
        return True

    def test_docs(self) -> Dataset:
        return self.dataset["test"]

    def doc_to_text(self, doc: Dict[str, Any]) -> str:
        return f"{doc['sentence1']}\nQuestion: {doc['sentence2']} True or False?\nAnswer:"

    def doc_to_target(self, doc: Dict[str, Any]) -> str:
        available_labels = {0: "True", 1: "False"}
        label = doc["label"]

        return f" {available_labels[label]}"

    def construct_requests(self, doc: Dict[str, Any], ctx: str) -> List[Request]:
        ll_true = rf.loglikelihood(ctx, " True")
        ll_false = rf.loglikelihood(ctx, " False")

        return ll_true, ll_false

    def process_results(self, doc: Dict[str, Any], results: List[str]) -> Dict[str, Any]:
        ll_true, ll_false = results

        prediction = int(ll_false > ll_true)
        reference = doc["label"]

        acc = 1.0 if (ll_true > ll_false) == reference else 0.0

        return {"acc": acc}

    def aggregation(self) -> Dict[str, Any]:
        return {"acc": mean}

    def higher_is_better(self) -> Dict[str, Any]:
        return {"acc": True}

## Adding Task to Registry

After a custom task has been defined, it needs to be added to two constants that enables its usability:

* `ALL_TASKS`: List of available tasks (useful when parsing from the command line).
* `TASK_REGISTRY`: Dictionary mapping the task identifier and its class.

In [18]:
from lm_eval.tasks import ALL_TASKS, TASK_REGISTRY

ALL_TASKS.append("axb")
TASK_REGISTRY.update({"axb": AXb})

## Evaluate using Custom Task

Finally, the custom task evaluation follows the same protocol defined by the `simple_evaluation.ipynb` example, as follows: 

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer

from lm_eval.evaluator import make_table

from harness.lm_eval_evaluator import evaluate_wrapper
from harness.lm_eval_hf_model import HFEvalModel

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

hf_model = HFEvalModel(model, tokenizer)

outputs = evaluate_wrapper(
        hf_model,
        ["axb"],
        num_fewshot=0,
        no_cache=True,
    )

print(make_table(outputs))

Reusing dataset super_glue (C:\Users\gderosa\.cache\huggingface\datasets\super_glue\axb\1.0.2\d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)


  0%|          | 0/1 [00:00<?, ?it/s]

Running loglikelihood requests


100%|██████████| 2200/2200 [09:33<00:00,  3.83it/s]


|Task|Version|Metric|Value |   |Stderr|
|----|------:|------|-----:|---|-----:|
|axb |      0|acc   |0.5652|±  |0.0149|

