Lighteval is a toolkit for evaluating large language models through prompting without fine-tuning. A minimal notebook demonstrating the use of lighteval can be found at https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/lighteval_example.ipynb .

Familiarize yourself with the notebook and use lighteval to run the following and record the results:

1) Evaluate the small gpt2 model in zero-, one- and five-shot settings on the ARC challenge task.

2) Evaluate a larger, more modern model such as allenai/OLMo-1B-hf in zero-, one- and five-shot settings on the ARC challenge task.

For information about the ARC Challenge, see Clark et al. (2018) Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge; for examples of the data, see https://huggingface.co/datasets/allenai/ai2_arc, and for results on the task reported in the literature, see e.g. https://paperswithcode.com/sota/common-sense-reasoning-on-arc-challenge.

Based on your results, explain briefly

a) how the gpt2 results relate to the random baseline and what this implies about the model

b) what the results for the larger model indicate about the effectiveness of few-shot prompting

c) how the results for the larger model relate to the state of the art (best reported results) in the task

## Lighteval example

Install the [`lighteval`](https://github.com/huggingface/lighteval) package.

In [1]:
!pip install --quiet lighteval

Run evaluation using the small [`gpt2`](https://huggingface.co/gpt2) model and the [ARC challenge](https://paperswithcode.com/dataset/arc) task in a zero-shot setting.

The format of the `--tasks` argument is

```suite|task|few_shot|truncate_few_shots```

Where supported `suite|task` values are given in the [list of available tasks](https://github.com/huggingface/lighteval/wiki/Available-Tasks), and `few_shot` specifies the number of shots for few-shot prompting. (The argument `truncate_few_shots` can be ignored here).
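As a small illustration of the format above, the sketch below assembles and splits `--tasks` strings in plain Python. The helper names `build_task_spec` and `parse_task_spec` are hypothetical and are not part of lighteval; the only thing taken from the notebook is the `suite|task|few_shot|truncate_few_shots` field order.

```python
def build_task_spec(suite: str, task: str, few_shot: int, truncate_few_shots: int = 0) -> str:
    """Join the four fields with '|' in the order described above (hypothetical helper)."""
    if few_shot < 0:
        raise ValueError("few_shot must be non-negative")
    return f"{suite}|{task}|{few_shot}|{truncate_few_shots}"


def parse_task_spec(spec: str) -> dict:
    """Split a task string back into its four named fields (hypothetical helper)."""
    suite, task, few_shot, truncate_few_shots = spec.split("|")
    return {
        "suite": suite,
        "task": task,
        "few_shot": int(few_shot),
        "truncate_few_shots": int(truncate_few_shots),
    }


# The three shot settings used in the runs in this notebook.
for shots in (0, 1, 25):
    print(build_task_spec("leaderboard", "arc:challenge", shots))
```

Note that the task name itself may contain a colon (`arc:challenge`), so only `|` is treated as a field separator.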

The evaluation result for the accuracy metric can be found toward the end of the output on the `leaderboard:arc:challenge:0 ... acc` line as well as in a JSON file in the `evals` directory after execution.
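To collect the accuracy values from the JSON files rather than reading them off the console, something like the following sketch may help. The exact layout of the files lighteval writes is not shown in this notebook, so the code makes no assumption about it: it walks every `.json` file under the output directory and reports any numeric value stored under a key named `acc`. The function names are hypothetical.

```python
import json
from pathlib import Path


def find_metric(node, metric="acc", path=()):
    """Yield (key_path, value) for every numeric entry named `metric` in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key == metric and is_number:
                yield path + (key,), value
            else:
                yield from find_metric(value, metric, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_metric(item, metric, path + (str(index),))


def collect_metric(root="./evals", metric="acc"):
    """Return (file, key_path, value) rows for all JSON files under `root`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    rows = []
    for json_file in sorted(root_path.rglob("*.json")):
        with open(json_file, encoding="utf-8") as handle:
            data = json.load(handle)
        for key_path, value in find_metric(data, metric):
            rows.append((str(json_file), "/".join(key_path), value))
    return rows


if __name__ == "__main__":
    for file_name, key_path, value in collect_metric():
        print(f"{file_name}: {key_path} = {value:.4f}")
```

Because the search is by key name only, it may also pick up `acc` entries that are not the headline result (for example per-task and aggregate entries), so the printed key path should be checked against the task name.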

## ZERO SHOT

In [2]:
!lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|arc:challenge|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


## ONE SHOT

In [3]:
!lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|arc:challenge|1|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


## FEW SHOT

In [4]:
!lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|arc:challenge|25|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


# LARGER MODEL allenai/OLMo-1B-hf

## ZERO SHOT


In [5]:
!lighteval accelerate \
    --model_args "pretrained=allenai/OLMo-1B-hf" \
    --tasks "leaderboard|arc:challenge|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


## ONE SHOT

In [6]:
!lighteval accelerate \
    --model_args "pretrained=allenai/OLMo-1B-hf" \
    --tasks "leaderboard|arc:challenge|1|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


## FEW SHOT

In [8]:
!lighteval accelerate \
    --model_args "pretrained=allenai/OLMo-1B-hf" \
    --tasks "leaderboard|arc:challenge|25|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"


a) how the gpt2 results relate to the random baseline and what this implies about the model

The random baseline is 25.02%, while the GPT-2 model hovers around 19-20% across the runs, so it scores even worse than random guessing. This implies that the model has no usable ability to answer these questions and is not suited for this kind of reasoning.
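To gauge whether a score of 19-20% is meaningfully below chance or just noise, a rough check is to express the gap in standard errors of a binomial proportion. The sketch below does this under two assumptions that are not stated in the notebook: that accuracy was computed on the ARC-Challenge test split, and that this split contains 1,172 questions. The function name is hypothetical.

```python
import math


def chance_z_score(accuracy: float, baseline: float, n_questions: int) -> float:
    """Distance between observed accuracy and the baseline, in binomial standard errors."""
    standard_error = math.sqrt(baseline * (1.0 - baseline) / n_questions)
    return (accuracy - baseline) / standard_error


N_TEST = 1172      # ASSUMPTION: size of the ARC-Challenge test split
BASELINE = 0.2502  # random baseline quoted in the answer above

for acc in (0.19, 0.20):
    z = chance_z_score(acc, BASELINE, N_TEST)
    print(f"accuracy {acc:.0%}: z = {z:+.2f} standard errors from chance")
```

Under these assumptions both values land roughly four to five standard errors below the baseline, which suggests the shortfall is systematic rather than sampling noise.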

b) what the results for the larger model indicate about the effectiveness of few-shot prompting

For the larger `OLMo-1B-hf` model, accuracy was 29%, 31% and 32% in the zero-shot, one-shot and few-shot settings, respectively. This indicates that the model improves as it sees more examples in the prompt, although the gains are small (about three percentage points in total).
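The size of the few-shot effect can be made explicit with a few lines of arithmetic. The sketch below uses the rounded accuracies reported above, keyed by the number of shots passed in the `--tasks` argument (0, 1 and 25), and prints the gain over zero-shot and the margin over the 25.02% random baseline. The function name is hypothetical.

```python
BASELINE = 0.2502  # random baseline quoted earlier

# Rounded OLMo-1B-hf accuracies from the answer above, keyed by number of shots.
olmo_results = {0: 0.29, 1: 0.31, 25: 0.32}


def gains_over_zero_shot(results: dict) -> dict:
    """Absolute accuracy gain of each setting relative to the zero-shot run."""
    zero_shot = results[0]
    return {shots: round(acc - zero_shot, 4) for shots, acc in sorted(results.items())}


gains = gains_over_zero_shot(olmo_results)
for shots, acc in sorted(olmo_results.items()):
    print(
        f"{shots:>2}-shot: acc = {acc:.0%}, "
        f"gain over 0-shot = {gains[shots] * 100:+.1f} pp, "
        f"margin over random = {(acc - BASELINE) * 100:+.1f} pp"
    )
```

Since the inputs are rounded to whole percentages, the differences are only accurate to about one percentage point.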

c) how the results for the larger model relate to the state of the art (best reported results) in the task

The state of the art for this task is GPT-4, with a reported score of 96.4%. The best few-shot result for the larger model here was about 33%, so it is not even close.