### ⚠ IMPORTANT ⚠

Please ensure your Colab runtime is set to the following:

A100 GPU

Evaluation and instruction-tuning a LLM is a resource intensive process - please make sure you're using the appropriate instance.

# Model Evaluation: A Primer

Now that we've spent some time creating models with:

- Unsupervised pre-training -> nanoGPT -> Shapereare
- Supervised fine-tuning -> gpt2 -> b-mc2/sql-create-context -> Understand and write sql query
- Some instruction-tuning -> Mistral.v01 -> mosaicml/instruct-v3 -> Thought how to respond to the instruction

We're ready to begin to think about how we can evaluate these models.

## Baseline Evaluation

In order to properly understand how our model is improving - we need to first start with a baseline evaluation of our model's performance.

Let's start with Mistral AI's `Mistral-7B` model.

We're going to load and compare everything in 4-bit quantization today in order to ensure we can fit the model on our Google Colab instance.

Let's start by setting up and loading our model to prepare it for evaluation.

### Load Mistral AI's Mistral-7B in 4-bit Quantization

Let's grab our dependencies, and load our model!

In [1]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 MB[0m [31m32.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m374.7/374.7 kB[0m [31m30.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.9/511.9 kB[0m [31m36.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m93.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m91.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m57.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

Conforming to previous notebooks - let's set up our quantization config for our model.

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

Now we have our quantization settings confirmed - let's load up our model!

In [5]:
from huggingface_hub import login
login(new_session=False)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [6]:
%%time
model_id = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/996 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

CPU times: user 1min 5s, sys: 1min 7s, total: 2min 12s
Wall time: 1min 39s


In [None]:
#With Quantization
#!nvidia-smi

In [7]:
#Without  Quantization
!nvidia-smi

Sun Aug 10 02:57:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             51W /  400W |    7363MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
#Without Quantization GPU memory
#29073/1024


In [None]:
#With Quantization GPU memory
#5441/1024

In [None]:
# With Quantization
# CPU times: user 37.1 s, sys: 29.1 s, total: 1min 6s
# Wall time: 1min 2s

####❓ Question:

Taking a look at the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1) (and the linked resources on the card) is this an instruct-tuned model or not?

### Collect and Load the Eleuther AI Evaluation Harness

Now that we have our baseline model loaded - we need to evaluate it.

For that, we'll use a tool called [Eleuther AI's LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness). This is a specialized tool for running benchmarks on various language tasks.

Let's start by grabbing and installing it!

Why Eleuther AI's Evaluation Harness? Well - it's what powers the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)!

In [8]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -e .

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 55391, done.[K
remote: Counting objects: 100% (367/367), done.[K
remote: Compressing objects: 100% (210/210), done.[K
remote: Total 55391 (delta 292), reused 157 (delta 157), pack-reused 55024 (from 3)[K
Receiving objects: 100% (55391/55391), 31.57 MiB | 23.02 MiB/s, done.
Resolving deltas: 100% (38303/38303), done.
/content/lm-evaluation-harness
Obtaining file:///content/lm-evaluation-harness
  Installing build dependencies ... [?25l[?25hdone
  Checking if build backend supports build_editable ... [?25l[?25hdone
  Getting requirements to build editable ... [?25l[?25hdone
  Preparing editable metadata (pyproject.toml) ... [?25l[?25hdone
Collecting evaluate (from lm_eval==0.4.9.1)
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting datasets<4.0,>=2.16.0 (from lm_eval==0.4.9.1)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting jsonlines (from lm_eval==0.4.9.1)
 

Now, we can cast our model to the desired format.

In [9]:
import lm_eval
from lm_eval.models.huggingface import HFLM
eval_model = HFLM(model, batch_size=4)



We'll set up our tasks so we can leverage them at evaluation time!

Next, we can evaluate our base model!

>NOTE: This step will take ~30-40min. to run in full on the A100 - so ensure you set aside time to run it fully if you desire!

We're going to leverage two benchmarks today:

- [HellaSwag](https://rowanzellers.com/hellaswag/)
- [ARC Easy](https://leaderboard.allenai.org/arc_easy/submissions/get-started)
- A subset of the [MMLU benchmark](https://paperswithcode.com/dataset/mmlu), focusing only on the `machine_learning` task.

These are lightweight benchmarks used to "score" models against eachother on the OpenLM leaderboard.

We'll consider a simple average of their scores as the "overall" score of the baseline model.

You could easily extend the number of tasks considered if you wanted to more exactly emulate the Open LLM Leaderboard.

In [None]:
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=16,
)

In [None]:
import pandas as pd

pd.DataFrame(results["results"])

Unnamed: 0,hellaswag,arc_easy
"acc,none",0.608743,0.798822
"acc_stderr,none",0.00487,0.008226
"acc_norm,none",0.805616,0.786195
"acc_norm_stderr,none",0.003949,0.008413
alias,hellaswag,arc_easy


In [None]:
fs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=5,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:37<00:00, 11.95it/s]


In [None]:
import pandas as pd

pd.DataFrame(fs_mmlu_results["results"])

Unnamed: 0,mmlu_flan_n_shot_loglikelihood_machine_learning
"acc,none",0.401786
"acc_norm,none",0.401786
"acc_norm_stderr,none",0.046533
"acc_stderr,none",0.046533
alias,mmlu_flan_n_shot_loglikelihood_machine_learning


### Zero-Shot MMLU

In [None]:
zs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:09<00:00, 44.81it/s]


In [None]:
import pandas as pd

pd.DataFrame(zs_mmlu_results["results"])

Unnamed: 0,mmlu_flan_n_shot_loglikelihood_machine_learning
"acc,none",0.3125
"acc_norm,none",0.3125
"acc_norm_stderr,none",0.043995
"acc_stderr,none",0.043995
alias,mmlu_flan_n_shot_loglikelihood_machine_learning


### Chain of Thought

Now let's try a Chain of Thought example!

In [None]:
cot_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_cot_zeroshot_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:num_fewshot has been set to 0 for mmlu_flan_cot_zeroshot_machine_learning in its config. Manual configuration will be ignored.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running generate_until requests
100%|██████████| 11/11 [01:06<00:00,  6.02s/it]


In [None]:
import pandas as pd

pd.DataFrame(cot_mmlu_results["results"])

Unnamed: 0,mmlu_flan_cot_zeroshot_machine_learning
alias,mmlu_flan_cot_zeroshot_machine_learning
"exact_match,get-answer",0.0
"exact_match_stderr,get-answer",0.0
