#Step 1. Install EleutherAI Evaluation Harness
*   Logging into WandB is optional.
*   Logging into the Hugging Face API is required to run GPQA. The dataset is gated to keep its questions from leaking into training data.
*   Uses the Goodfire API, which is experimental as of Jan 2025.

In [None]:
import os
import huggingface_hub
from google.colab import userdata

# Install the latest versions of the necessary libraries
!pip install goodfire
# The harness itself is required; remove "wandb" from the extras if you don't want to log results to WandB
!pip install -e "git+https://github.com/menhguin/lm-evaluation-harness.git#egg=lm_eval[wandb,vllm]"

Automated login for Hugging Face Hub via Colab Secrets. If you don't have a token stored there, you'll be prompted to log in manually. If you remove this cell entirely, you can't run GPQA or use Llama models via HF.

In [2]:
# Check for a Hugging Face API key and log in if available; otherwise prompt for manual login
hf_token = userdata.get('HF_READ_TOKEN')
if hf_token:
    huggingface_hub.login(hf_token)
else:
    print("Huggingface token not found. Please login manually.")
    !huggingface-cli login

Automated login for WandB via Colab Secrets. If you don't have this, it'll just prompt you later if you use wandb_args.

In [3]:
# Check for WandB API key and log in if available, otherwise skip login
wandb_token = userdata.get('WANDB_API_KEY')
if wandb_token:
    os.environ["WANDB_API_KEY"] = wandb_token
    import wandb
    wandb.login()
else:
    print("WandB token not found. Continuing without logging into WandB.")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mminh1228[0m ([33mmenhguin[0m). Use [1m`wandb login --relogin`[0m to force relogin


Automated login for Goodfire API via Colab Secrets.

In [4]:
# Try to get GOODFIRE_API_KEY from environment or Colab secrets
GOODFIRE_API_KEY = os.getenv('GOODFIRE_API_KEY') or userdata.get('GOODFIRE_API_KEY')
if not GOODFIRE_API_KEY:
    raise ValueError("Please set GOODFIRE_API_KEY in environment or Colab secrets")
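The three login cells above share one pattern: look for a key, use it if present, and fall back otherwise. A minimal sketch of that pattern as a single helper follows. The name `resolve_secret` and the injectable `colab_get` argument are hypothetical, not part of this notebook or of any library. The sketch also assumes that the Colab secrets getter may raise an exception, rather than return `None`, when a secret is missing or access has not been granted; that is worth checking against your Colab version.

```python
import os
from typing import Callable, Optional


def resolve_secret(
    name: str,
    colab_get: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a secret from the environment, else from a Colab-style getter.

    `colab_get` is a hypothetical stand-in for something like `userdata.get`.
    Any exception it raises is treated as "secret not available".
    """
    # Prefer the environment so the same code also works outside Colab.
    value = os.environ.get(name)
    if value:
        return value

    if colab_get is None:
        return None

    try:
        # Normalise empty strings to None so callers can use a simple truth test.
        return colab_get(name) or None
    except Exception:
        # Assumption: a missing or inaccessible secret surfaces as an exception.
        return None
```

In this notebook it could be called as `resolve_secret('GOODFIRE_API_KEY', userdata.get)`, with the existing `raise ValueError(...)` kept for the case where it returns `None`.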

#Step 2. Run evaluation

In [9]:
import os

# Set environment variable for Goodfire API key if not already set
if 'GOODFIRE_API_KEY' not in os.environ:
    os.environ['GOODFIRE_API_KEY'] = GOODFIRE_API_KEY  # Make sure this is defined from previous cells

# Run the evaluation using the command-line interface
!python -m lm_eval \
    --model goodfire \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tasks gsm8k_cot_llama \
    --num_fewshot 8 \
    --limit 10 \
    --batch_size auto \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs top_p=0.9,temperature=1.0,do_sample=True

2025-01-11 15:46:36.174423: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-11 15:46:36.198326: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-11 15:46:36.205320: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-11 15:46:36.221929: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-11:15:46:43,967 INFO     [__main__.py:279] Ve
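
Once a run finishes, the scores can be read back from the directory passed to `--output_path`. The sketch below assumes the harness writes files named `results_*.json` somewhere under that directory, each with a top-level `"results"` object mapping task names to metrics. Both the file layout and the helper names (`load_latest_results`, `summarize`) are assumptions for illustration, so check them against the files your run actually produced.

```python
import json
from pathlib import Path


def load_latest_results(output_dir):
    """Return the "results" mapping from the newest results_*.json file.

    Assumes (not confirmed by this notebook) that result files are named
    results_*.json and sit somewhere below `output_dir`.
    """
    files = sorted(
        Path(output_dir).rglob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not files:
        return {}
    with files[-1].open(encoding="utf-8") as f:
        data = json.load(f)
    return data.get("results", {})


def summarize(results):
    """Format every numeric metric as a 'task<TAB>metric<TAB>value' line."""
    lines = []
    for task, metrics in sorted(results.items()):
        for name, value in sorted(metrics.items()):
            # Skip non-numeric entries such as aliases or "N/A" placeholders.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                lines.append(f"{task}\t{name}\t{value:.4f}")
    return lines
```

For the command above, `print("\n".join(summarize(load_latest_results("./lm-eval-output/"))))` would list whatever metrics were recorded. With `--limit 10` the numbers are only a smoke test, not a meaningful score.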

#Reference: EleutherAI Eval Harness task list
For those curious to run other evals! Please note that Min P is currently only accessible for `generate_until` tasks. There is currently no easy way to index these tasks; I just searched (Ctrl + F) for `generate_until` in the [EleutherAI Eval Harness repo](https://github.com/EleutherAI/lm-evaluation-harness).
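
The manual search described above can be roughly automated against a local clone of the repo. This sketch assumes task configs are YAML files that declare `output_type: generate_until` and name the task on a top-level `task:` line. The function name `find_generate_until_tasks` is hypothetical. Configs that inherit `output_type` from another file via `include:` will be missed, so treat the result as a starting list rather than a complete index.

```python
import re
from pathlib import Path

# Match top-level "output_type: generate_until" and "task: <name>" lines.
OUTPUT_TYPE_RE = re.compile(
    r"^output_type:[ \t]*['\"]?generate_until['\"]?[ \t]*$", re.MULTILINE
)
TASK_RE = re.compile(r"^task:[ \t]*['\"]?([\w\-.]+)['\"]?[ \t]*$", re.MULTILINE)


def find_generate_until_tasks(tasks_dir):
    """Return sorted task names whose YAML declares output_type: generate_until."""
    found = set()
    for path in Path(tasks_dir).rglob("*.yaml"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if not OUTPUT_TYPE_RE.search(text):
            continue
        match = TASK_RE.search(text)
        # Fall back to the file name when no simple "task:" line is present.
        found.add(match.group(1) if match else path.stem)
    return sorted(found)
```

Pointing it at the `lm_eval/tasks` folder of a clone (the path is an assumption about the repo layout) gives candidate names to pass to `--tasks`.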

In [6]:
# Test Goodfire Client
import goodfire

client = goodfire.Client(api_key=GOODFIRE_API_KEY)

# Simple test call
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello!"}],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_completion_tokens=10
)

# Access response using ChatCompletion object attributes
print("Test response:", response.choices[0].message['content'])

Test response: Hello! It's nice to meet you. How


In [None]:
!lm-eval --tasks list