
Roadmap for next 1-2 weeks:
* **More reliable generation/answer matching** - Basically, get the Goodfire Llama 8B to near parity with HF and vLLM on a few tasks. I think it's mainly an answer-matching problem for now, since the generations definitely exist. I managed to get GPQA and TriviaQA generating answers, but they aren't being graded properly (?).
* **Code cleanup and optimisation** - This was very scrappy. I hadn't even finished setting up my IDE properly, so expect a few more days of just streamlining stuff. I wasn't able to set up fast batching/parallelisation, and the docs don't seem to offer an obvious option for it, so I'm stuck with slow 1-by-1 generations for now. This won't matter until I fix the more important bugs, but it will be annoying once I do.
* **Figuring out and implementing basic SAE feature methods** - I basically have to implement this while doing cleanup. Now that generation and scoring work at all, I need to start implementing the actual fun features. But again, this depends on designing the end-user flow, so some guidance would be helpful.
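
To make the answer-matching problem a bit more concrete, here is a minimal sketch of a lenient numeric grader for GSM8K-style outputs. `extract_final_number` and `is_match` are made-up helper names for illustration; they are not part of lm-evaluation-harness or the Goodfire SDK, and the harness's own filters may well behave differently.

```python
import math
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(generation: str) -> Optional[float]:
    """Return the last number found in a generation, or None if there is none."""
    matches = _NUMBER.findall(generation)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting.
    return float(matches[-1].replace(",", ""))


def is_match(generation: str, target: str) -> bool:
    """Lenient grading: compare the last number in the generation to the target."""
    predicted = extract_final_number(generation)
    if predicted is None:
        return False
    return math.isclose(predicted, float(target.replace(",", "")))


print(is_match("She sells 9 eggs at $2 each. The final answer is 18", "18"))  # True
```

Taking the last number is deliberately forgiving. If a stricter pattern scores much lower than this on the same samples, that points at formatting differences in the generations rather than wrong answers.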

# Step 1. Install EleutherAI Evaluation Harness
*   Logging into WandB is optional.
*   Logging into the Hugging Face API is required to run GPQA. This is to prevent dataset leakage.
*   Uses Goodfire API! Experimental as of 11th Jan 2025

In [None]:
import os
import huggingface_hub
from google.colab import userdata

# Install latest versions of necessary libraries
!pip install goodfire
!pip install vllm
# Installs this fork of the eval harness (required). Drop "wandb" from the extras if you don't want to log results to WandB.
!pip install -e "git+https://github.com/menhguin/lm-evaluation-harness.git#egg=lm_eval[wandb,vllm]"

Automated login for Hugging Face Hub via Colab Secrets. If you don't have a token saved there, it'll prompt for manual login. If you remove this cell completely, you can't run GPQA or use Llama models via HF.

In [None]:
# Check for Huggingface API key and log in if available, otherwise prompt for manual login
hf_token = userdata.get('HF_READ_TOKEN')
if hf_token:
    huggingface_hub.login(hf_token)
else:
    print("Huggingface token not found. Please login manually.")
    !huggingface-cli login

Automated login for WandB via Colab Secrets. If you don't have this, it'll just prompt you later if you use wandb_args.

In [None]:
# Check for WandB API key and log in if available, otherwise skip login
wandb_token = userdata.get('WANDB_API_KEY')
if wandb_token:
    os.environ["WANDB_API_KEY"] = wandb_token
    import wandb
    wandb.login()
else:
    print("WandB token not found. Continuing without logging into WandB.")

Automated login for Goodfire API via Colab Secrets.

In [None]:
# Get API key from Colab secrets
GOODFIRE_API_KEY = userdata.get('GOODFIRE_API_KEY')
if not GOODFIRE_API_KEY:
    raise ValueError("Please set GOODFIRE_API_KEY in Colab secrets")
os.environ['GOODFIRE_API_KEY'] = GOODFIRE_API_KEY

# Step 2. Run evaluation

In [None]:
# gsm8k_cot_llama is currently the only one that definitely works
sampler = "top_p"
sampler_value = "0.9"
tasks = "gsm8k_cot_llama"
model = "goodfire"
model_args = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_fewshot = "8"

temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained={model_args} \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --limit 30 \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},inspect=true,do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature}_{model}_{model_args.replace('/', '_')} \
    --device cuda
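
Once a run finishes, the scores can be read back from `--output_path` instead of scrolling through the cell output. The sketch below assumes the harness writes files named `results_*.json` somewhere under that directory, each with a top-level `"results"` key mapping task names to metrics. That matches what I've seen, but treat the layout as an assumption and check your own output folder. `load_results` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_results(output_dir="./lm-eval-output/"):
    """Map each results file path to its per-task metrics dict.

    Assumes files named results_*.json with a top-level "results" key.
    """
    collected = {}
    for path in sorted(Path(output_dir).rglob("results_*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Fall back to an empty dict if the expected key is missing.
        collected[str(path)] = data.get("results", {})
    return collected
```

Calling `load_results()` after a few runs gives one entry per results file, which makes it easy to put the Goodfire and vLLM numbers for the same task side by side.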

In [None]:
sampler = "top_p"
sampler_value = "0.9"
tasks = "gpqa_main_generative_n_shot"
model = "goodfire"
model_args = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_fewshot = "1"

temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained={model_args} \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --limit 30 \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature}_{model}_{model_args.replace('/', '_')} \
    --device cuda

In [None]:
sampler = "top_p"
sampler_value = "0.9"
tasks = "gpqa_main_generative_n_shot"
model = "goodfire"
model_args = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_fewshot = "5"

temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained={model_args} \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --limit 30 \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature}_{model}_{model_args.replace('/', '_')} \
    --device cuda

In [None]:
sampler = "top_p"
sampler_value = "0.9"
tasks = "triviaqa"
model = "goodfire"
model_args = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_fewshot = "0"

temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained={model_args} \
    --batch_size "auto" \
    --tasks {tasks} \
    --num_fewshot {num_fewshot} \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --limit 30 \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature}_{model}_{model_args.replace('/', '_')} \
    --device cuda

In [None]:
sampler = "top_p"
sampler_value = "0.9"
tasks = "gsm8k_cot_llama"
model = "vllm"
model_args = "meta-llama/Meta-Llama-3.1-8B-Instruct"

temperature = 1

!lm_eval \
    --model {model} \
    --model_args pretrained={model_args} \
    --batch_size "auto" \
    --tasks {tasks} \
    --limit 30 \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs {sampler}={sampler_value},temperature={temperature},do_sample=True \
    --wandb_args project=lm-eval-harness-integration,name={tasks}_{sampler}_{sampler_value}_temp_{temperature}_{model}_{model_args.replace('/', '_')} \
    --device cuda
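
The cells above repeat the same command with different variables. As part of the cleanup, one option is to build the argument list in Python instead. This is a rough sketch: `build_lm_eval_args` is a hypothetical helper, it only uses flags that already appear in the cells above, and it leaves out the `inspect=true` entry from the first cell's `--gen_kwargs`.

```python
import shlex


def build_lm_eval_args(model, pretrained, tasks, num_fewshot=None,
                       sampler="top_p", sampler_value="0.9", temperature=1,
                       limit=30, use_wandb=True):
    """Build the lm_eval argument list used in the cells above (hypothetical helper)."""
    gen_kwargs = f"{sampler}={sampler_value},temperature={temperature},do_sample=True"
    args = [
        "lm_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--batch_size", "auto",
        "--tasks", tasks,
        "--limit", str(limit),
        "--log_samples",
        "--output_path", "./lm-eval-output/",
        "--gen_kwargs", gen_kwargs,
        "--device", "cuda",
    ]
    if num_fewshot is not None:
        # The Goodfire cells pass these three flags together; the vLLM cell omits them.
        args += ["--num_fewshot", str(num_fewshot),
                 "--apply_chat_template", "--fewshot_as_multiturn"]
    if use_wandb:
        # Same run-name scheme as the cells above.
        run_name = (f"{tasks}_{sampler}_{sampler_value}_temp_{temperature}"
                    f"_{model}_{pretrained.replace('/', '_')}")
        args += ["--wandb_args",
                 f"project=lm-eval-harness-integration,name={run_name}"]
    return args


args = build_lm_eval_args("goodfire", "meta-llama/Meta-Llama-3.1-8B-Instruct",
                          "gsm8k_cot_llama", num_fewshot=8)
print(shlex.join(args))
```

Passing the list to `subprocess.run(args, check=True)` would launch the run. Setting `use_wandb=False` drops `--wandb_args` for anyone who skipped the WandB login.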

# Reference: EleutherAI Eval Harness task list
For those curious to run other evals! Please note that Min P is currently only accessible for `generate_until` tasks. There is currently no easy way to index these tasks; I just Ctrl + F'd `generate_until` on the [EleutherAI Evals Harness Repo](https://github.com/EleutherAI/lm-evaluation-harness).
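
The Ctrl + F approach can be roughly scripted against a local checkout of the repo. This sketch assumes task configs are YAML files that declare `output_type: generate_until` directly. Configs that inherit that setting from another file will be missed, so treat the result as a lower bound. `find_generate_until_tasks` is a hypothetical helper, and `tasks_dir` is wherever the tasks folder lives in your checkout.

```python
from pathlib import Path


def find_generate_until_tasks(tasks_dir):
    """Return YAML config paths under tasks_dir that declare output_type: generate_until."""
    hits = []
    for path in sorted(Path(tasks_dir).rglob("*.yaml")):
        # Plain text scan, so custom YAML tags in the configs can't break parsing.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith("output_type:") and "generate_until" in stripped:
                hits.append(path)
                break
    return hits
```

The function returns a list of `Path` objects, so `[p.name for p in find_generate_until_tasks(tasks_dir)]` gives a quick list of candidate config files to try.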

In [None]:
# Test Goodfire Client
import goodfire

client = goodfire.Client(api_key=GOODFIRE_API_KEY)

# Simple test call
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello!"}],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_completion_tokens=10
)

# Access response using ChatCompletion object attributes
print("Test response:", response.choices[0].message['content'])
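
On the slow 1-by-1 generations from the roadmap: one thing to try is issuing several requests like the test call above at the same time from a thread pool. The sketch below is generic Python, not a Goodfire feature. `generate_one` is a hypothetical callable that wraps a single request, and I haven't verified whether the Goodfire client is safe to share across threads or what rate limits apply.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_many(prompts, generate_one, max_workers=4):
    """Run generate_one over prompts concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the inputs, so results line up with prompts.
        return list(pool.map(generate_one, prompts))


# Stand-in for a real API call, just to show the call shape.
def fake_generate(prompt):
    return f"echo: {prompt}"


print(generate_many(["Say hello!", "Say goodbye!"], fake_generate))
```

Threads fit here because each request spends most of its time waiting on the network. Keep `max_workers` small until the API's rate limits are known.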

In [None]:
!lm-eval --tasks list