### ⚠ IMPORTANT ⚠

Please ensure your Colab runtime is set to the following:

A100 GPU

Evaluating and instruction-tuning an LLM is a resource-intensive process - please make sure you're using the appropriate instance.

# Evaluating an Instruct-tuned Model

Now that we've spent some time creating models with:

- Unsupervised pre-training
- Supervised fine-tuning
- Some instruction-tuning

We're ready to think about how to evaluate these models.

## Instruct-tuned Evaluation

We will now repeat the process we used on our baseline - but using the instruct-tuned version of our model!

### Load Mistral AI's Mistral-7B-Instruct in 4-bit Quantization

Let's grab our dependencies, and load our model!

In [1]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl


In [2]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

As in previous notebooks - let's set up the quantization config for our model.

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)
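
To get a feel for what these settings buy us, here is a rough back-of-the-envelope sketch of the storage cost per parameter with and without double quantization (the `bnb_4bit_use_double_quant` option). The block sizes (64 weights per quantization block, 256 constants per second-level block) follow the QLoRA paper's description, and the ~7.24B parameter count is an approximation. The estimate also ignores layers that stay in higher precision, activations, and the KV cache, so treat the numbers as ballpark only.

```python
# Rough estimate only: block sizes follow the QLoRA paper's description,
# and N_PARAMS is an approximate parameter count for Mistral-7B (assumption).
N_PARAMS = 7.24e9


def bits_per_param(double_quant: bool, block_size: int = 64, second_block_size: int = 256) -> float:
    """Approximate storage bits per weight for 4-bit block-wise quantization."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit constants per block, plus fp32 constants for the second level
        constant_bits = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # one fp32 constant per block of weights
        constant_bits = 32 / block_size
    return weight_bits + constant_bits


def weights_gb(bits: float, n_params: float = N_PARAMS) -> float:
    """Convert bits per parameter into gigabytes of weight storage."""
    return n_params * bits / 8 / 1e9


for label, bits in [
    ("fp16", 16.0),
    ("nf4", bits_per_param(double_quant=False)),
    ("nf4 + double quant", bits_per_param(double_quant=True)),
]:
    print(f"{label:>20}: {bits:.3f} bits/param, ~{weights_gb(bits):.2f} GB")
```

The saving from double quantization is small per parameter, but it adds up across billions of weights.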


Now that our quantization settings are in place - let's load up our model!

In [7]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


#### ❓ Question:

Taking a look at the [model card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) (and the linked resources on the card) is this an instruct-tuned model or not?

### Collect and Load the Eleuther AI Evaluation Harness

Now that we have our instruct-tuned model loaded - we need to evaluate it.

For that, we'll use a tool called [Eleuther AI's LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness). This is a specialized tool for running benchmarks on various language tasks.

Let's start by grabbing and installing it!

Why Eleuther AI's Evaluation Harness? Well - it's what powers the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)!

In [8]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -e .

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 34126, done.
remote: Counting objects: 100% (241/241), done.
remote: Compressing objects: 100% (127/127), done.
remote: Total 34126 (delta 144), reused 186 (delta 111), pack-reused 33885
Receiving objects: 100% (34126/34126), 23.16 MiB | 12.24 MiB/s, done.
Resolving deltas: 100% (23858/23858), done.
/content/lm-evaluation-harness
Obtaining file:///content/lm-evaluation-harness
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.2)
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting jsonlines (from lm_eval==0.4.2)
  Downloading jsonlines-

Now, we can wrap our model in the harness's `HFLM` interface so that it can be evaluated.

In [9]:
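import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the already-loaded model (and its tokenizer) so the harness can drive it
eval_model = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=16)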

Next, we'll choose our tasks and evaluate our instruct-tuned model!

>NOTE: This step will take ~30-40 minutes to run in full on the A100 - so set aside enough time if you want to run it to completion!

We're going to leverage three benchmarks today:

- [HellaSwag](https://rowanzellers.com/hellaswag/)
- [ARC Easy](https://leaderboard.allenai.org/arc_easy/submissions/get-started)
- A subset of the [MMLU benchmark](https://paperswithcode.com/dataset/mmlu), focusing only on single-subject tasks (`machine_learning`, plus `conceptual_physics` for the chain-of-thought run), evaluated separately in the MMLU sections below.

These are lightweight benchmarks of the kind used to "score" models against each other on the Open LLM Leaderboard.

We'll consider a simple average of their scores as the "overall" score of the instruct-tuned model.

You could easily extend the number of tasks considered if you wanted to more exactly emulate the Open LLM Leaderboard.
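
As a sketch of what that extension could look like, the snippet below lists the six tasks and few-shot counts that the original Open LLM Leaderboard used, as we understand them (worth verifying against the leaderboard's own documentation). It builds one set of `simple_evaluate` keyword arguments per task, since each task needs its own `num_fewshot`. The helper `build_eval_plan` is a hypothetical name, and the actual evaluation call is left commented out because it needs the loaded `eval_model`.

```python
# Task name -> few-shot count. These settings are our understanding of the
# original Open LLM Leaderboard setup (assumption; verify before relying on it).
LEADERBOARD_TASKS = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def build_eval_plan(task_shots: dict, batch_size: int = 16) -> list:
    """Build one kwargs dict per task, because num_fewshot differs per task."""
    return [
        {"tasks": [task], "num_fewshot": shots, "batch_size": batch_size}
        for task, shots in task_shots.items()
    ]


plan = build_eval_plan(LEADERBOARD_TASKS)
for kwargs in plan:
    print(kwargs)
    # In the notebook, each entry would be run with:
    # results = lm_eval.simple_evaluate(model=eval_model, **kwargs)
```

Running the full set takes far longer than the two tasks used here, so it is worth starting with a single entry of the plan.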

In [None]:
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=16,
)

INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
INFO:lm-eval:Using pre-initialized model
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


INFO:lm-eval:Setting fewshot random generator seed to 1234
INFO:lm-eval:Setting fewshot random generator seed to 1234
INFO:lm-eval:Building contexts for arc_easy on rank 0...
100%|██████████| 2376/2376 [00:02<00:00, 861.86it/s]
INFO:lm-eval:Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:04<00:00, 2292.78it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests:  58%|█████▊    | 29045/49669 [05:09<02:38, 130.00it/s]

In [None]:
import pandas as pd

pd.DataFrame(results["results"])

| metric | hellaswag | arc_easy |
|---|---|---|
| acc,none | 0.655845 | 0.819444 |
| acc_stderr,none | 0.004741 | 0.007893 |
| acc_norm,none | 0.833201 | 0.771044 |
| acc_norm_stderr,none | 0.00372 | 0.008622 |
| alias | hellaswag | arc_easy |
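
The "overall" score mentioned earlier is just an unweighted mean across tasks. Here is a minimal sketch using the numbers from the results above. Which metric to average is a choice: we assume `acc_norm,none` by default, and `overall_score` is a hypothetical helper name.

```python
from statistics import mean

# Values copied from the results above; in the notebook you would read them
# from results["results"] instead.
scores = {
    "hellaswag": {"acc,none": 0.655845, "acc_norm,none": 0.833201},
    "arc_easy": {"acc,none": 0.819444, "acc_norm,none": 0.771044},
}


def overall_score(task_results: dict, metric: str = "acc_norm,none") -> float:
    """Simple (unweighted) average of one metric across tasks."""
    return mean(task[metric] for task in task_results.values())


print(f"overall acc_norm: {overall_score(scores):.4f}")
print(f"overall acc:      {overall_score(scores, 'acc,none'):.4f}")
```

Note that the two metrics rank the tasks differently, so the choice of metric changes the "overall" number noticeably.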


### Few-Shot MMLU (Machine Learning)

In [None]:
fs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=5,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:34<00:00, 12.94it/s]


In [None]:
import pandas as pd

pd.DataFrame(fs_mmlu_results["results"])

| metric | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| acc,none | 0.482143 |
| acc_norm,none | 0.482143 |
| acc_norm_stderr,none | 0.047428 |
| acc_stderr,none | 0.047428 |
| alias | mmlu_flan_n_shot_loglikelihood_machine_learning |
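
The `acc_stderr` value is worth a sanity check. This is a minimal sketch, assuming the 448 loglikelihood requests in the log correspond to 112 four-option questions and that the standard error is the sample standard error of a 0/1 score. Under those assumptions the formula below reproduces the reported number and gives an approximate 95% confidence interval.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 scores with mean acc."""
    return math.sqrt(acc * (1 - acc) / (n - 1))


n_questions = 448 // 4          # assumption: 4 answer options per question
acc = 54 / n_questions          # 54 correct answers gives the reported 0.482143
stderr = accuracy_stderr(acc, n_questions)

# Normal-approximation 95% confidence interval
low, high = acc - 1.96 * stderr, acc + 1.96 * stderr
print(f"acc={acc:.6f} stderr={stderr:.6f} 95% CI=({low:.3f}, {high:.3f})")
```

With only 112 questions the interval is wide (roughly 0.39 to 0.58), so small differences between runs on this task should be read with caution.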


### Zero-Shot MMLU (Machine Learning)

In [None]:
zs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:05<00:00, 85.45it/s] 


In [None]:
import pandas as pd

pd.DataFrame(zs_mmlu_results["results"])

| metric | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| acc,none | 0.482143 |
| acc_norm,none | 0.482143 |
| acc_norm_stderr,none | 0.047428 |
| acc_stderr,none | 0.047428 |
| alias | mmlu_flan_n_shot_loglikelihood_machine_learning |
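
The zero-shot numbers are identical to the 5-shot numbers reported earlier. As a rough sketch (treating the two runs as independent, which overstates the uncertainty since they share the same questions), the helpers below compute a z-score for the difference and the smallest gap that would stand out at roughly the 95% level with this sample size. The function names are hypothetical.

```python
import math


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z-score of the accuracy difference between two runs."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)


def min_detectable_gap(se_a: float, se_b: float, z: float = 1.96) -> float:
    """Smallest accuracy gap that would reach the given z threshold."""
    return z * math.sqrt(se_a**2 + se_b**2)


few_shot = (0.482143, 0.047428)   # (acc, stderr) from the 5-shot run
zero_shot = (0.482143, 0.047428)  # (acc, stderr) from the zero-shot run

print("z-score:", z_score(*few_shot, *zero_shot))
print("min detectable gap:", round(min_detectable_gap(few_shot[1], zero_shot[1]), 3))
```

Under these assumptions, the two settings would need to differ by roughly 13 accuracy points before this task alone could tell them apart.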


In [None]:
cot_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_cot_zeroshot_conceptual_physics"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:num_fewshot has been set to 0 for mmlu_flan_cot_zeroshot_conceptual_physics in its config. Manual configuration will be ignored.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running generate_until requests


100%|██████████| 26/26 [00:42<00:00,  1.64s/it]


In [None]:
import pandas as pd

pd.DataFrame(cot_mmlu_results["results"])

| metric | mmlu_flan_cot_zeroshot_conceptual_physics |
|---|---|
| alias | mmlu_flan_cot_zeroshot_conceptual_physics |
| exact_match,get-answer | 0.153846 |
| exact_match_stderr,get-answer | 0.07216 |
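
An `exact_match` score depends on pulling a final answer out of free-form generated text, so formatting matters as much as reasoning. The sketch below is a hypothetical illustration of that idea, not the harness's actual `get-answer` filter: the regular expression, helper names, and sample generations are all made up for the example.

```python
import re

# Hypothetical extraction rule: a letter A-D right after "answer is",
# optionally wrapped in parentheses, and not the start of a longer word.
ANSWER_RE = re.compile(r"answer is \(?([A-D])(?![A-Za-z])")


def extract_answer(generation: str):
    """Return the extracted answer letter, or None if nothing matches."""
    match = ANSWER_RE.search(generation)
    return match.group(1) if match else None


def exact_match(generations: list, golds: list) -> float:
    """Fraction of generations whose extracted answer equals the gold label."""
    hits = sum(extract_answer(g) == gold for g, gold in zip(generations, golds))
    return hits / len(golds)


generations = [
    "Let's think step by step. Friction opposes motion. The answer is (B).",
    "Option C looks right to me.",  # plausible reasoning, but nothing to extract
    "So the answer is A.",
]
golds = ["B", "C", "A"]

score = exact_match(generations, golds)
print(f"exact_match: {score:.3f}")
```

In this toy run the second generation picks the gold option but still scores zero because no answer could be extracted. That kind of formatting miss is one possible contributor to a low chain-of-thought score. The reported 0.153846 is consistent with 4 correct answers out of the 26 items shown in the progress bar.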


#### ❓ Question:

What *exactly* are these two benchmarks measuring?