### ⚠ IMPORTANT ⚠

Please ensure your Colab runtime is set to the following:

A100 GPU

Evaluating and instruction-tuning an LLM is a resource-intensive process - please make sure you're using the appropriate instance.

# Evaluating an Instruct-tuned Model

Now that we've spent some time creating models with:

- Unsupervised pre-training
- Supervised fine-tuning
- Some instruction-tuning

We're ready to think about how to evaluate these models.

## Instruct-tuned Evaluation

We will now repeat the process we used on our baseline - but using the instruct-tuned version of our model!

### Load Mistral AI's Mistral-7B in 4-bit Quantization

Let's grab our dependencies, and load our model!

In [None]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl

In [None]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

As in previous notebooks - let's set up the quantization config for our model.

In [None]:
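bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in half precision
)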

In [None]:
from huggingface_hub import login
login(new_session=False)

Now that we have our quantization settings configured - let's load up our model!

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
!nvidia-smi

In [None]:
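import gc

# Keep the quantized `model`: it is evaluated later in this notebook.
# del model

# Drop the full-precision copy if it exists (it won't on a fresh
# top-to-bottom run), then release cached GPU memory.
globals().pop("model_without_quantization", None)
del tokenizer
gc.collect()
torch.cuda.empty_cache()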

In [None]:
!nvidia-smi

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model_without_quantization = AutoModelForCausalLM.from_pretrained(
    model_id
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
!nvidia-smi
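
To put the `nvidia-smi` readings in context, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone at different precisions. This is only a sketch: the parameter count (`N_PARAMS`, roughly 7.24 billion) is an assumption, and the estimate ignores activations, the KV cache, and the small overhead that quantization constants add.

```python
# Rough weight-only memory estimate for a ~7B-parameter model.
# ASSUMPTION: ~7.24e9 parameters; activations, KV cache, and
# quantization overhead are deliberately ignored.
N_PARAMS = 7.24e9


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

The actual numbers reported by `nvidia-smi` will be higher than these estimates, since the CUDA context and any cached buffers also occupy GPU memory.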

#### ❓ Question:

Taking a look at the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1) (and the linked resources on the card), is this an instruct-tuned model or not?

### Collect and Load the Eleuther AI Evaluation Harness

Now that we have our instruct-tuned model loaded - we need to evaluate it.

For that, we'll use a tool called [Eleuther AI's LM evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness). This is a specialized tool for running benchmarks on various language tasks.

Let's start by grabbing and installing it!

Why Eleuther AI's Evaluation Harness? Well - it's what powers the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)!

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -e .

Now, we can wrap our model in the format the harness expects.

In [None]:
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the already-loaded (quantized) model so the harness can drive it.
eval_model = HFLM(pretrained=model, batch_size=16)

We'll set up our tasks so we can leverage them at evaluation time!

Next, we can evaluate our instruct-tuned model!

> NOTE: This step takes roughly 30-40 minutes to run in full on the A100 - so set aside enough time if you want to run it to completion!

We're going to leverage the following benchmarks today:

- [HellaSwag](https://rowanzellers.com/hellaswag/)
- [ARC Easy](https://leaderboard.allenai.org/arc_easy/submissions/get-started)
- Subsets of the [MMLU benchmark](https://paperswithcode.com/dataset/mmlu), focusing only on the `machine_learning` and `conceptual_physics` tasks (run in their own cells further down).

These are lightweight benchmarks used to "score" models against each other on the Open LLM Leaderboard.

We'll consider a simple average of their scores as the "overall" score of the instruct-tuned model.

You could easily extend the number of tasks considered if you wanted to more exactly emulate the Open LLM Leaderboard.
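
As a minimal sketch of that "simple average", the helper below takes a results dictionary and averages one metric across tasks. The dictionary shape and the `"acc,none"` key follow what recent versions of the harness appear to return, but treat both as assumptions and check your own `results["results"]` first. The scores in `example_results` are placeholders for illustration, not real measurements, and `overall_score` is a hypothetical helper name.

```python
from statistics import mean

# PLACEHOLDER values in the assumed shape of `results` from the harness;
# these are NOT real scores.
example_results = {
    "results": {
        "hellaswag": {"acc,none": 0.60, "acc_norm,none": 0.80},
        "arc_easy": {"acc,none": 0.80, "acc_norm,none": 0.70},
    }
}


def overall_score(results: dict, metric: str = "acc,none") -> float:
    """Unweighted mean of one metric across every task that reports it."""
    scores = [
        task_metrics[metric]
        for task_metrics in results["results"].values()
        if metric in task_metrics
    ]
    return mean(scores)


print(overall_score(example_results))
```

Once the evaluation cell has run, `overall_score(results)` should work on the real output, provided the metric key matches what your harness version emits.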

In [None]:
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=16,
)

In [None]:
import pandas as pd

pd.DataFrame(results["results"])

In [None]:
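# Temperature controls the "creativity" (randomness) of sampled output.
# Typical range: 0 to 2.
#   Temperature near 0 -> less creativity (more deterministic output).
#   Temperature near 2 -> more creativity (more varied output).
#
# Temperature is used in the softmax formula: the logits are divided by
# the temperature before being converted into token probabilities.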
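
To make the temperature notes concrete, here is a small sketch of the standard temperature-scaled softmax, where each logit is divided by the temperature before normalizing. The logits are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not a function from the libraries used in this notebook.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        # A temperature of exactly 0 is usually handled as greedy argmax instead.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution so that less likely tokens are sampled more often.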

### Few-shot MMLU (Machine Learning)

In [None]:
fs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=5,
    batch_size=16,
)

In [None]:
import pandas as pd

pd.DataFrame(fs_mmlu_results["results"])

### Zero-Shot MMLU (Machine Learning)

In [None]:
zs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

In [None]:
import pandas as pd

pd.DataFrame(zs_mmlu_results["results"])
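
The two MMLU runs above differ only in `num_fewshot`. As a rough illustration of what that changes, the sketch below prepends solved examples to the question being asked. The template, the example questions, and the `build_prompt` helper are all hypothetical; the harness uses its own per-task templates and draws its few-shot examples from the task's dataset.

```python
# HYPOTHETICAL template and examples, for illustration only.
def build_prompt(question: str, shots: list[tuple[str, str]], num_fewshot: int) -> str:
    """Prepend `num_fewshot` solved examples to the unanswered question."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots[:num_fewshot]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


shots = [
    ("Which algorithm is used for clustering? (A) k-means (B) linear regression", "A"),
    ("Which loss is typical for classification? (A) MSE (B) cross-entropy", "B"),
]
question = "Which method reduces overfitting? (A) dropout (B) a larger learning rate"

print(build_prompt(question, shots, num_fewshot=0))  # zero-shot: question only
print("---")
print(build_prompt(question, shots, num_fewshot=2))  # few-shot: two solved examples first
```

With `num_fewshot=0` the model sees only the question, whereas with a few-shot setting it also sees worked examples that demonstrate the expected answer format.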

### Zero-Shot Chain-of-Thought MMLU (Conceptual Physics)

In [None]:
cot_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_cot_zeroshot_conceptual_physics"],
    num_fewshot=0,
    batch_size=16,
)

In [None]:
import pandas as pd

pd.DataFrame(cot_mmlu_results["results"])

#### ❓ Question:

What *exactly* are these two benchmarks measuring?