<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/4-Evaluating%20LLMs/4_4_lm-evaluation-harness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>4.4-Evaluating Open Source Models with lm-evaluation-harness.
  </h2>
    <h3>Evaluate models from Hugging Face using EleutherAI/lm-evaluation-harness.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra)

____________
Models: Llama-3.2-1B / pruned40-llama-1b.

Colab Environment: T4 / L4 / A100 GPU.

Keys:
* EleutherAI.
* lm-evaluation-harness
* Hugging Face
___________
**Disclaimer: This notebook was created after the first edition of the book was published. It is not included in the book’s original content but is intended to supplement and expand on the topics covered.**
This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (<a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Apress</a>).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

-----


The lm-eval library by EleutherAI provides easy access to academic benchmarks that have become industry standards. It supports the evaluation of both open-source models and APIs from providers like OpenAI, and it can also evaluate adapters created with techniques such as LoRA.

In this notebook, I will focus on a small but important feature of the library: evaluating models compatible with Hugging Face's Transformers library.

# Installing libraries

In [None]:
!pip install -q lm-eval
!pip install -q transformers
from lm_eval import evaluator, tasks, models


# Evaluating Models.

First, I will create a simple function that wraps the call to the `simple_evaluate` function of the `evaluator` module in the lm-eval library.

This function receives the model name, the list of tasks to run, and the number of few-shot examples to use in the evaluation (0 means no few-shot prompts).


In [None]:
def evaluate_hf_model(model_name, tasks=['arc_easy'], num_fewshot=0):
    """
    Calls the evaluator to evaluate a model available on Hugging Face.

    Args:
    - model_name: The model name on Hugging Face.
    - tasks: List of tasks (benchmarks) to evaluate.
    - num_fewshot: Number of few-shot examples per task.

    Returns:
    - metrics: Dictionary mapping each task name to its metrics.
    """
    # The model name and target device are passed as a comma-separated string.
    model_args = f"pretrained={model_name},device=cuda"

    results = evaluator.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=tasks,
        num_fewshot=num_fewshot,  # Number of few-shot samples.
        limit=None,  # Use all the samples in the evaluation dataset.
        bootstrap_iters=10
    )

    # Keep only the per-task metrics; fall back to an empty dict if missing.
    metrics = results.get('results', {})
    return metrics


Let’s take a closer look at the parameters that the `simple_evaluate` function receives:

* `model`:
Here, the value "hf" is passed, which selects the Hugging Face Transformers backend, so the model is loaded from the Hugging Face Hub. Other backends exist for API providers such as OpenAI and Anthropic and for locally served models; their exact names depend on the installed version of the library. Depending on the value provided, the parameters for `model_args` will need to be adjusted accordingly.

* `model_args`:
In this case, it includes the name of the model from Hugging Face and the device on which the model should be loaded.

* `tasks`:
A list of benchmarks (tasks) to evaluate.

* `num_fewshot`:
The number of few-shot examples for each task. The library is responsible for generating the examples to pass to the model. There is no theoretical maximum, although in practice the prompt has to fit in the model's context window.

* `limit`:
The number of records from the dataset to use for evaluation. If set to `None`, the entire dataset will be used.

* `bootstrap_iters`:
The number of bootstrap resamples. Bootstrapping is a statistical technique used to estimate the uncertainty (standard errors and confidence intervals) of the evaluation metrics: the library resamples the data with replacement (10 iterations in this case) and recalculates the metric for each resample. Ten iterations is very few, but it allows for faster execution.
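
To make `bootstrap_iters` more concrete, here is a minimal, self-contained sketch of bootstrap resampling for the standard error of a mean score. It is not the library's implementation: the per-example scores are synthetic and the helper name `bootstrap_stderr` is hypothetical. It only illustrates why very few iterations give a noisier estimate than many.

```python
import random
import statistics


def bootstrap_stderr(scores, iters, seed=0):
    """Estimate the standard error of the mean of `scores` by bootstrap resampling."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = []
    for _ in range(iters):
        # Draw n examples with replacement and recompute the metric (here: the mean).
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(statistics.fmean(sample))
    # The spread of the resampled metric approximates its standard error.
    return statistics.stdev(resampled_means)


# Synthetic per-example correctness (1 = correct, 0 = wrong); not real benchmark data.
data_rng = random.Random(42)
scores = [1 if data_rng.random() < 0.64 else 0 for _ in range(1000)]

for iters in (10, 100, 1000):
    print(f"iters={iters:>4}  stderr~{bootstrap_stderr(scores, iters):.4f}")
```

With only 10 resamples the estimate can move noticeably from one seed to another; with more resamples it settles close to the analytical value for a proportion, at the cost of extra computation.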

In [None]:
# Select tasks to evaluate.
tasks = ['lambada', 'boolq']

I have selected two tasks for evaluation:

- `boolq`: This task evaluates the model's reading comprehension. The model is presented with a text passage and must answer a binary question ('yes' or 'no') based on the information provided. While the output is a simple binary response, the task requires nuanced understanding of the text to produce accurate answers.

- `lambada`: This task assesses the model’s ability to predict text from context. The model is given a passage and must predict its last word. This requires the model to integrate information from the entire context, as the final word often depends on subtle relationships and meaning within the text. Although the output is a single word, the difficulty lies in the strong contextual understanding and language modeling it demands. In the library, `lambada` is a group that runs two variants, `lambada_openai` and `lambada_standard`, which is why the results contain an entry for each.
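
Both tasks are scored from log-likelihoods rather than from free-form generation (the run log further below reports "Running loglikelihood requests"). The following sketch shows the general idea with hypothetical numbers: a yes/no question is answered by picking the continuation with the higher log-likelihood, and perplexity is the exponential of the negative mean log-likelihood of the target words. The function names and all values are illustrative assumptions, not the library's code.

```python
import math


def pick_choice(loglikelihoods):
    """Return the answer whose continuation has the highest log-likelihood."""
    return max(loglikelihoods, key=loglikelihoods.get)


def perplexity(target_loglikelihoods):
    """Perplexity as exp of the negative mean log-likelihood of the targets."""
    mean_ll = sum(target_loglikelihoods) / len(target_loglikelihoods)
    return math.exp(-mean_ll)


# Hypothetical log-likelihoods a model might assign to each answer of one BoolQ question.
boolq_example = {"yes": -0.4, "no": -1.1}
prediction = pick_choice(boolq_example)
print(prediction)

# Hypothetical log-likelihoods of the correct final word for three LAMBADA passages.
lambada_lls = [-1.2, -0.3, -2.9]
print(round(perplexity(lambada_lls), 3))
```

Lower perplexity means the model assigns higher probability to the correct final word, so for this metric smaller is better, while for accuracy larger is better.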

There are many more tasks to choose from, including comprehensive tests like MMLU (Massive Multitask Language Understanding), which is used in the renowned Hugging Face Open LLM Leaderboard.

As the first model to evaluate, I chose the smallest model from the Llama family. It’s important to note that, although everything is done through a library, the evaluation takes place in our environment. This means that evaluating a very large model can take a significant amount of time.

In [None]:
metrics_base = evaluate_hf_model("meta-llama/Llama-3.2-1B", tasks=tasks)

INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-1B', 'device': 'cuda'}
INFO:lm-eval:Using device 'cuda'


INFO:lm-eval:Using model type 'default'


INFO:lm-eval:Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}


The repository for super_glue contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/super_glue.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


INFO:lm-eval:Building contexts for boolq on rank 0...
100%|██████████| 3270/3270 [00:01<00:00, 1700.01it/s]
INFO:lm-eval:Building contexts for lambada_standard on rank 0...
100%|██████████| 5153/5153 [00:10<00:00, 499.86it/s]
INFO:lm-eval:Building contexts for lambada_openai on rank 0...
100%|██████████| 5153/5153 [00:10<00:00, 496.27it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 16846/16846 [04:51<00:00, 57.78it/s]


bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 146.23it/s]


bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 137.90it/s]
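
The 16846 log-likelihood requests reported in the log above can be reproduced with a little arithmetic, assuming each example produces two requests: one per answer option for `boolq`, and one per variant for LAMBADA, since `lambada` expanded into `lambada_openai` and `lambada_standard`. The sketch below is a back-of-the-envelope estimate, not something the library exposes; the throughput is simply the value shown in the progress bar on a Colab GPU.

```python
# Example counts taken from the "Building contexts" lines in the log.
boolq_examples = 3270
lambada_examples = 5153

# Assumption: boolq scores two continuations ("yes"/"no") per example,
# and each of the two LAMBADA variants scores one continuation per example.
requests = boolq_examples * 2 + lambada_examples * 2
print(requests)  # 16846

# Throughput reported by the progress bar (requests per second).
throughput = 57.78
seconds = requests / throughput
print(f"{int(seconds // 60)}m {int(seconds % 60)}s")  # 4m 51s
```

The same estimate helps when planning larger runs: the number of requests grows with the number of examples and answer options, and the throughput drops as the model gets bigger.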


In [None]:
metrics_base

{'boolq': {'alias': 'boolq',
  'acc,none': 0.6415902140672783,
  'acc_stderr,none': 0.008387090607540316},
 'lambada_openai': {'alias': 'lambada_openai',
  'perplexity,none': 5.746826006398914,
  'perplexity_stderr,none': 0.1939159153154742,
  'acc,none': 0.6219677857558704,
  'acc_stderr,none': 0.0067555455026514066},
 'lambada_standard': {'alias': 'lambada_standard',
  'perplexity,none': 8.673946525846537,
  'perplexity_stderr,none': 0.37872093321740846,
  'acc,none': 0.5346400155249369,
  'acc_stderr,none': 0.006949240168656243}}
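
Each metric comes with a standard error (the `*_stderr` keys). A common way to read it is as an approximate 95% confidence interval of plus or minus 1.96 standard errors, assuming the estimate is roughly normally distributed. The sketch below applies this to the `boolq` accuracy shown above; the helper `confidence_interval` is a hypothetical name, not part of the library.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate 95% confidence interval under a normal approximation."""
    return value - z * stderr, value + z * stderr


# Values copied from the metrics_base output above.
boolq_acc = 0.6415902140672783
boolq_stderr = 0.008387090607540316

low, high = confidence_interval(boolq_acc, boolq_stderr)
print(f"boolq accuracy: {boolq_acc:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Two models whose intervals overlap heavily on a task should not be ranked on that task alone.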

The second model to be evaluated is one I created through a pruning process applied to a Llama model. This has resulted in a smaller model that retains much of its capabilities.  

In fact, according to the Open LLM Leaderboard, the pruned model scores 40% higher than the original model.

If you’d like to see how the pruned model was created, here’s a link to the notebook:  https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb


In [None]:
metrics_pruned = evaluate_hf_model("oopere/pruned40-llama-1b", tasks=tasks)

INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'oopere/pruned40-llama-1b', 'device': 'cuda'}
INFO:lm-eval:Using device 'cuda'


INFO:lm-eval:Using model type 'default'


INFO:lm-eval:Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}


INFO:lm-eval:Building contexts for boolq on rank 0...
100%|██████████| 3270/3270 [00:01<00:00, 1639.82it/s]
INFO:lm-eval:Building contexts for lambada_standard on rank 0...
100%|██████████| 5153/5153 [00:10<00:00, 492.87it/s]
INFO:lm-eval:Building contexts for lambada_openai on rank 0...
100%|██████████| 5153/5153 [00:10<00:00, 494.42it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 16846/16846 [04:51<00:00, 57.71it/s]


bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 137.31it/s]


bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 113.79it/s]


In [None]:
metrics_pruned

{'boolq': {'alias': 'boolq',
  'acc,none': 0.6211009174311927,
  'acc_stderr,none': 0.00848467871856502},
 'lambada_openai': {'alias': 'lambada_openai',
  'perplexity,none': 90.1044721908429,
  'perplexity_stderr,none': 4.994319975462519,
  'acc,none': 0.29846691247816803,
  'acc_stderr,none': 0.006375059594075383},
 'lambada_standard': {'alias': 'lambada_standard',
  'perplexity,none': 171.13005138829917,
  'perplexity_stderr,none': 8.31697633954944,
  'acc,none': 0.24781680574422665,
  'acc_stderr,none': 0.00601505029677859}}
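
To compare the two runs side by side, the dictionaries can be flattened into a small table. The sketch below hard-codes a subset of the values printed above so that it runs on its own; in the notebook you would pass `metrics_base` and `metrics_pruned` directly. The `compare_metrics` helper is hypothetical, and it relies only on the `'metric,filter'` key format visible in the output.

```python
def compare_metrics(base, pruned):
    """Return (task, metric, base_value, pruned_value, relative_change) rows."""
    rows = []
    for task, task_metrics in base.items():
        for key, base_value in task_metrics.items():
            # Skip non-numeric entries such as 'alias' and the stderr columns.
            if not isinstance(base_value, float) or "_stderr" in key:
                continue
            metric = key.split(",")[0]  # 'acc,none' -> 'acc'
            pruned_value = pruned[task][key]
            change = (pruned_value - base_value) / base_value
            rows.append((task, metric, base_value, pruned_value, change))
    return rows


# Subset of the values printed in the two outputs above.
base = {
    "boolq": {"alias": "boolq", "acc,none": 0.6415902140672783},
    "lambada_openai": {
        "alias": "lambada_openai",
        "perplexity,none": 5.746826006398914,
        "acc,none": 0.6219677857558704,
    },
}
pruned = {
    "boolq": {"alias": "boolq", "acc,none": 0.6211009174311927},
    "lambada_openai": {
        "alias": "lambada_openai",
        "perplexity,none": 90.1044721908429,
        "acc,none": 0.29846691247816803,
    },
}

for task, metric, b, p, change in compare_metrics(base, pruned):
    print(f"{task:<16} {metric:<11} {b:>8.3f} {p:>8.3f} {change:>+9.1%}")
```

On these numbers, the `boolq` accuracy of the pruned model is about 3% lower in relative terms, while its `lambada_openai` accuracy is roughly half that of the base model and its perplexity is many times higher.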

# Conclusion.
This notebook serves as a brief introduction to one of the main libraries for model evaluation.

I chose to demonstrate how to evaluate models hosted on Hugging Face because this lets you evaluate not only third-party models but also, as shown in the second example, models you have created and uploaded to the Hugging Face Hub yourself.

This evaluation approach is one of the simplest and fastest methods to run. I personally use it almost every time I create a new model.

You can find a complete list of available tasks in the library's repository:
https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks


## Author's Note.
In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.