<a href="https://colab.research.google.com/github/menhguin/minp_paper/blob/main/%5BPUBLIC%5D_Min_P_Evals_Replication_for_GPQA_and_GSM8K_COT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Evaluation

In this notebook, we'll use the [language model evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness)
utility built by EleutherAI to evaluate our model on a suite of different tasks.

# Step 1. Install EleutherAI Evaluations Harness
*   Logging into WandB is optional.
*   Logging into the Hugging Face API is required to run GPQA. GPQA is a gated dataset, which is meant to prevent its questions from leaking into training data.

In [None]:
!pip install -e git+https://github.com/EleutherAI/lm-evaluation-harness.git#egg=lm_eval[wandb,vllm] # installs the harness with the wandb and vllm extras
!pip install lm_eval[wandb] # skip if you don't want to use wandb to log results
!huggingface-cli login

# Step 2. Run selected evals
Change parameters as preferred:

*   **Top P:** Lower values are more selective. It is recommended to use Top P = 0.9-0.95. - *E.g. Top P = 0.9 means using the fewest tokens that make up 90% of the probability distribution, and the remaining ~10% is truncated.*
*   **Min P:** Lower values are less selective. It is recommended to use Min P = 0.05-0.1. - *E.g. Min P = 0.1 means every token where P < 10% of P(most probable token) is truncated.*
*   **Temperature scaling:** With Top P, usually only temperatures from 0 to 1 produce coherent output, but Min P can still give good outputs at temperatures up to 3-5!
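
To make the three parameters above concrete, here is a minimal pure-Python sketch of the two truncation rules on a toy distribution. It is an illustration only, not the vLLM or harness implementation: the function names and `toy_probs` are made up for this example, and applying temperature before truncation is an assumption of the sketch.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_keep(probs, min_p):
    """Indices of tokens with P >= min_p * P(most probable token)."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

def top_p_keep(probs, top_p):
    """Fewest highest-probability tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return sorted(kept)

# Hypothetical next-token distribution over five tokens.
toy_probs = [0.50, 0.30, 0.12, 0.04, 0.04]

# Min P = 0.1: the cutoff is 0.1 * 0.50 = 0.05, so the two 0.04 tokens are truncated.
print(min_p_keep(toy_probs, 0.1))

# Top P = 0.9: 0.50 + 0.30 + 0.12 = 0.92 reaches 0.9, so the rest is truncated.
print(top_p_keep(toy_probs, 0.9))

# Higher temperature flattens the distribution before any truncation is applied.
print(softmax([4.0, 2.0, 1.0], temperature=1.0))
print(softmax([4.0, 2.0, 1.0], temperature=3.0))
```

Note that the Min P cutoff scales with the top token's probability, so it tightens when the model is confident and loosens when the distribution is flat, whereas the Top P cutoff is a fixed share of the total mass.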

## A. GPQA Main Generative (5-shot)

In [None]:
sampler = "min_p"
sampler_value = "0.1"
tasks = "gpqa_main_generative_n_shot"
model = "vllm"
num_fewshot = "5"

In [None]:
temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature} \
    --device cuda
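
If you want to compare several temperatures, the command above can also be assembled in Python. The following is a rough sketch, assuming you would rather build an argument list than edit the cell by hand; `build_lm_eval_command` is a hypothetical helper, it only uses flags that already appear in the cell above, and it leaves out `--wandb_args` and `--device` for brevity.

```python
def build_lm_eval_command(tasks, num_fewshot, sampler, sampler_value, temperature,
                          model="vllm", pretrained="mistralai/Mistral-7B-v0.1"):
    """Assemble the lm_eval argument list used in this notebook (hypothetical helper)."""
    gen_kwargs = f"{sampler}={sampler_value},temperature={temperature},do_sample=True"
    return [
        "lm_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained},dtype=auto",
        "--batch_size", "auto",
        "--tasks", tasks,
        "--num_fewshot", str(num_fewshot),
        "--log_samples",
        "--output_path", "./lm-eval-output/",
        "--gen_kwargs", gen_kwargs,
    ]

# Print one command per temperature; nothing is executed here.
for temperature in (0.5, 1, 2):
    cmd = build_lm_eval_command("gpqa_main_generative_n_shot", 5, "min_p", "0.1", temperature)
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```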

## B1. GSM8K Chain-of-Thought (8-shot)

In [None]:
sampler = "min_p"
sampler_value = "0.1"
tasks = "gsm8k_cot"
model = "vllm"
num_fewshot = "8"

In [None]:
temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature} \
    --device cuda

## B2. GSM8K Chain-of-Thought (8-shot) (Self Consistency)
We do not recommend running this unless you either have 50-100x the compute needed for the previous evals, or lower the question limit to ~10 via `--limit 10`.
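
The extra compute comes from sampling many reasoning chains per question and then voting on the final answer. Below is a minimal sketch of that voting step, assuming the final answers have already been extracted from each sampled chain; `majority_vote` and `sampled_answers` are illustrative names, not the harness's own code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled chains for one question.
sampled_answers = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled_answers))
```

Every extra sample per question is another full chain-of-thought generation, which is why the cost grows roughly linearly with the number of samples.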

In [None]:
sampler = "min_p"
sampler_value = "0.1"
tasks = "gsm8k_cot_self_consistency"
model = "vllm"
num_fewshot = "8"

In [None]:
temperature = 1

# self-consistency can have a lot of runs, remove --limit at your peril
!lm_eval \
    --model {model} \
    --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature} \
    --device cuda


# Reference: EleutherAI Eval Harness task list
For those curious to run other evals! Please note that Min P is currently only accessible for `generate_until` tasks. There is currently no easy way to index these tasks; I just Ctrl + F'd `generate_until` on the [EleutherAI Evals Harness Repo](https://github.com/EleutherAI/lm-evaluation-harness).

In [None]:
!lm_eval --tasks list
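
As a rough stand-in for the Ctrl + F search described above, the sketch below scans a local clone of the repo for task YAML files that mention `generate_until`. It assumes the repo has been cloned into `lm-evaluation-harness` and that task configs live under `lm_eval/tasks`; `find_generate_until_configs` is a hypothetical helper. A plain text match can miss tasks that inherit their output type from another config, so treat the result as a starting point.

```python
from pathlib import Path

def find_generate_until_configs(tasks_dir):
    """Return sorted YAML paths under tasks_dir whose text mentions generate_until."""
    matches = []
    for path in Path(tasks_dir).rglob("*.yaml"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if "generate_until" in text:
            matches.append(path)
    return sorted(matches)

# Assumed location of the task configs inside a local clone.
tasks_dir = Path("lm-evaluation-harness/lm_eval/tasks")
if tasks_dir.is_dir():
    for path in find_generate_until_configs(tasks_dir):
        print(path.relative_to(tasks_dir))
else:
    print(f"{tasks_dir} not found; clone the repo first")
```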

# Alternate: Git Clone Method for EleutherAI Evaluations Harness
An alternative install method that sometimes gets around Evals Harness installation issues.

In [None]:
%%capture
!git clone https://github.com/EleutherAI/lm-evaluation-harness
!pip install -e lm-evaluation-harness