<a href="https://colab.research.google.com/github/menhguin/natural_language_rl/blob/main/Goodfire_LLM_Eval_Harness_integration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1. Install EleutherAI Evaluation Harness
*   Logging into WandB is optional.
*   Logging into the Hugging Face Hub is required to run GPQA. The dataset is gated to prevent it from leaking into training data.
*   Uses Goodfire API! Experimental as of Jan 2025

In [1]:
import os
import huggingface_hub
from google.colab import userdata

# Install latest versions of necessary libraries
!pip install goodfire
!pip install -e git+https://github.com/menhguin/lm-evaluation-harness.git#egg=lm_eval[wandb,vllm] # the wandb extra is only needed if you log results to WandB

Collecting goodfire
  Downloading goodfire-0.3.4-py3-none-any.whl.metadata (24 kB)
Collecting httpx<0.28.0,>=0.27.2 (from goodfire)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting ipywidgets<9.0.0,>=8.1.5 (from goodfire)
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting scipy<2.0.0,>=1.14.1 (from goodfire)
  Downloading scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.0/62.0 kB 2.6 MB/s eta 0:00:00
Collecting comm>=0.1.3 (from ipywidgets<9.0.0,>=8.1.5->goodfire)
  Downloading comm-0.2.2-py3-none-any.whl.metadata (3.7 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets<9.0.0,>=8.1.5->goodfire)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jedi>=0.16 (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (

Automated login for Hugging Face Hub via Colab Secrets. If the `HF_READ_TOKEN` secret is not set, the cell prompts for a manual login instead. If you remove this cell entirely, you can't run GPQA or use Llama models via HF.

In [2]:
# Check for Huggingface API key and log in if available, otherwise prompt for manual login
hf_token = userdata.get('HF_READ_TOKEN')
if hf_token:
    huggingface_hub.login(hf_token)
else:
    print("Huggingface token not found. Please login manually.")
    !huggingface-cli login
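
The secret lookup above can be made tolerant of missing secrets and usable outside Colab. This is a minimal sketch, not part of the original notebook: `read_secret` is a hypothetical helper, the environment-variable fallback is an added assumption, and it assumes `userdata.get` may raise (rather than return an empty value) when a secret is missing or access has not been granted.

```python
import os

try:
    from google.colab import userdata  # only importable inside Colab
except ImportError:
    userdata = None


def read_secret(name, default=None):
    """Return a secret from Colab Secrets, else the environment, else `default`."""
    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Assumption: a missing or inaccessible secret raises here.
            pass
    return os.environ.get(name, default)


hf_token = read_secret("HF_READ_TOKEN")
print("HF token found" if hf_token else "HF token not found; log in manually.")
```

The same helper would work for `WANDB_API_KEY` and `GOODFIRE_API_KEY`, so the three login cells could share one code path.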

Automated login for WandB via Colab Secrets. If the `WANDB_API_KEY` secret is not set, WandB will prompt you later if you use `wandb_args`.

In [3]:
# Check for WandB API key and log in if available, otherwise skip login
wandb_token = userdata.get('WANDB_API_KEY')
if wandb_token:
    os.environ["WANDB_API_KEY"] = wandb_token
    import wandb
    wandb.login()
else:
    print("WandB token not found. Continuing without logging into WandB.")

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: minh1228 (menhguin). Use `wandb login --relogin` to force relogin


Automated login for Goodfire API via Colab Secrets.

In [4]:
# Get API key from Colab secrets
GOODFIRE_API_KEY = userdata.get('GOODFIRE_API_KEY')
if not GOODFIRE_API_KEY:
    raise ValueError("Please set GOODFIRE_API_KEY in Colab secrets")
os.environ['GOODFIRE_API_KEY'] = GOODFIRE_API_KEY

# Step 2. Run evaluation

In [7]:
!python -m lm_eval \
    --model goodfire \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tasks gsm8k_cot_llama \
    --num_fewshot 8 \
    --limit 30 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs top_p=0.9,temperature=1.0

2025-01-11 17:50:26.904809: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-11 17:50:26.928468: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-11 17:50:26.935370: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-11 17:50:26.952194: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-11:17:50:32,175 INFO     [__main__.py:279] Ve
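
Once the run finishes, the scores can be read back from the `--output_path` directory. The sketch below assumes the harness writes files named `results_*.json` somewhere under that directory, each with a top-level `results` mapping of task name to metrics; `load_latest_results` is a hypothetical helper, so check the layout against what your run actually produced.

```python
import json
from pathlib import Path


def load_latest_results(output_dir):
    """Return the `results` mapping from the newest results_*.json under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    # Pick the most recently modified results file (assumed naming scheme).
    files = sorted(root.rglob("results_*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        return {}
    with files[-1].open() as f:
        return json.load(f).get("results", {})


# Print one line per task and metric from the latest run, if any.
for task, metrics in load_latest_results("./lm-eval-output/").items():
    for name, value in metrics.items():
        print(f"{task:20s} {name:35s} {value}")
```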

# Reference: EleutherAI Eval Harness task list
For those curious to run other evals! Please note that Min P is currently only accessible for `generate_until` tasks. There is currently no easy way to index these tasks; I just searched (Ctrl + F) for `generate_until` in the [EleutherAI Evals Harness Repo](https://github.com/EleutherAI/lm-evaluation-harness).
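
The manual search can be approximated with a short script over a local clone of the repo. This is a rough sketch under stated assumptions: task configs are YAML files under `lm_eval/tasks`, each declares `output_type` and `task` on their own top-level lines, and `find_generate_until_tasks` is a hypothetical helper. Configs that inherit `output_type` from another file via `include` will be missed.

```python
import re
from pathlib import Path

# Match top-level `output_type: generate_until` and `task: <name>` lines.
GENERATE_UNTIL = re.compile(r"^output_type:\s*generate_until\s*$", re.MULTILINE)
TASK_NAME = re.compile(r"^task:\s*(\S+)\s*$", re.MULTILINE)


def find_generate_until_tasks(tasks_dir):
    """Return sorted task names whose YAML config declares generate_until."""
    root = Path(tasks_dir)
    if not root.is_dir():
        return []
    found = set()
    for path in root.rglob("*.yaml"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if GENERATE_UNTIL.search(text):
            match = TASK_NAME.search(text)
            # Fall back to the file stem when the config has no `task:` line.
            found.add(match.group(1) if match else path.stem)
    return sorted(found)


# Assumed path: a local clone of the harness repository.
print(find_generate_until_tasks("lm-evaluation-harness/lm_eval/tasks"))
```

Text matching is used instead of a YAML parser because some task configs contain custom tags that a plain YAML loader may reject.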

In [6]:
# Test Goodfire Client
import goodfire

client = goodfire.Client(api_key=GOODFIRE_API_KEY)

# Simple test call
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello!"}],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_completion_tokens=10
)

# Access response using ChatCompletion object attributes
print("Test response:", response.choices[0].message['content'])

Test response: Hello! It's nice to meet you. How


In [None]:
!lm-eval --tasks list