### **Using lm-evaluation-harness with the GSM8K and TriviaQA benchmark tasks and watsonx.ai foundation models**


### **Prerequisites**
Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a watsonx.ai Runtime instance.
- Create a Cloud Object Storage (COS) instance.

**Note: When using Watson Studio, you already have a COS instance associated with the project you are running the notebook in.**

### **How to install lm-evaluation-harness**
lm-evaluation-harness is a unified framework for testing generative language models on a large number of evaluation tasks. For more information and the source code, see its GitHub repository: https://github.com/EleutherAI/lm-evaluation-harness/tree/main

There are two ways to install it:

1. Package installation, to use the harness as is:

!pip install lm-eval | tail -n 1

2. Local (editable) installation, for debugging purposes:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

### **Validate installation**


In [2]:
!pip list | findstr ibm_watsonx_ai
!pip list | findstr lm_eval


ibm_watsonx_ai              1.3.20



[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


lm_eval                     0.4.8
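
The `findstr` filter used in the cell above works only in a Windows shell. As a platform-independent alternative, the installed versions can be read from Python with the standard library's `importlib.metadata`. This is a minimal sketch: the helper name `installed_version` is illustrative, and the distribution names are assumed to match the spelling shown in the `pip list` output.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip list output (assumed spelling).
for name in ("ibm_watsonx_ai", "lm_eval"):
    print(f"{name:<28}{installed_version(name) or 'not installed'}")
```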





### **Setting up necessary IBM watsonx credentials**
Required credentials:

- IBM Cloud API key
- watsonx.ai endpoint URL for your IBM Cloud region
- watsonx.ai project ID


In [None]:
from dotenv import load_dotenv
import os

load_dotenv()  # Load variables from a local .env file into the process environment

True
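
Once `load_dotenv()` has run, it can be worth confirming that the three credentials are actually present before starting a long evaluation. The sketch below assumes the variable names `WATSONX_API_KEY`, `WATSONX_URL`, and `WATSONX_PROJECT_ID`. These names are an assumption and are not taken from this notebook, so adjust them to whatever your `.env` file and installed harness version expect. The helper `missing_credentials` is likewise illustrative.

```python
import os

# Assumed variable names; adjust to match your .env file.
REQUIRED_VARS = ("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")


def missing_credentials(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of the real process environment.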

## **lm-evaluation-harness — Cheat-Sheet**

### 1. Key CLI flags
| Flag | Purpose | Example |
|------|---------|---------|
| `--model` | Selects the harness adapter | `watsonx_llm` |
| `--model_args` | Passes provider-specific keyword arguments | `model_id=<MODEL_NAME>` |
| `--tasks` | Specifies the benchmark tasks (comma-separated) | `gsm8k,triviaqa` |
| `--limit` | Maximum number of examples per task | `500` |
| `--seed` | Random seed for reproducible sampling | `42` |
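
`--model_args` takes a single string of comma-separated `key=value` pairs, so the command line is easy to assemble programmatically. The following sketch builds the argument list from the flags in the table. The helper names `format_model_args` and `build_command` are illustrative and are not part of the harness.

```python
def format_model_args(args: dict) -> str:
    """Join keyword arguments into the comma-separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def build_command(model_id: str, tasks: list, limit: int = 500, seed: int = 42) -> list:
    """Assemble an lm_eval argument list from the flags in the table."""
    return [
        "lm_eval",
        "--model", "watsonx_llm",
        "--model_args", format_model_args({"model_id": model_id}),
        "--tasks", ",".join(tasks),
        "--limit", str(limit),
        "--seed", str(seed),
    ]


print(" ".join(build_command("google/flan-ul2", ["gsm8k", "triviaqa"])))
```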

---

### 2. Model IDs available on IBM watsonx.ai

| Friendly name | `model_id` string |
|---------------|-------------------|
| Llama-4 Maverick 17B | `meta-llama/llama-4-maverick-17b-128e-instruct-fp8` |
| Mistral Large | `mistralai/mistral-large` |
| Granite 3.3 8B Instruct | `ibm/granite-3-3-8b-instruct` |
| Flan-UL2 | `google/flan-ul2` |

---





In [None]:
import os
import subprocess

# Model IDs to evaluate on watsonx.ai
models = [
    "meta-llama/llama-4-maverick-17b-128e-instruct-fp8",
    "mistralai/mistral-large",
    "ibm/granite-3-3-8b-instruct",
    "google/flan-ul2",
]

# Evaluation tasks as a comma-separated string, e.g. "gsm8k,triviaqa"
tasks = "gsm8k"

# Create results directory if it doesn't exist
os.makedirs("results", exist_ok=True)

# Run evaluation for each model
for model_id in models:
    model_name = model_id.split("/")[-1]
    output_path = f"results/{model_name.replace('-', '_')}_results.json"

    command = [
        "lm_eval",
        "--model", "watsonx_llm",
        "--verbosity", "ERROR",
        "--model_args", f"model_id={model_id}",
        "--limit", "500",
        "--tasks", tasks,
        "--output_path", output_path,
        "--seed", "42",
    ]

    print(f"\n🔍 Evaluating model: {model_id}")
    subprocess.run(command)
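
`subprocess.run` does not raise when the child process exits with an error, so a failed evaluation in the loop above passes silently. One way to surface failures, sketched here with a hypothetical helper `run_and_report`, is to inspect the return code of each run.

```python
import subprocess
import sys


def run_and_report(command: list, label: str) -> bool:
    """Run a command and return True if it exited with status 0."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        print(f"{label}: failed with exit code {completed.returncode}")
        return False
    return True


# Demonstration with a trivial command; in the evaluation loop, `command` would
# be the lm_eval argument list and `label` the model ID.
ok = run_and_report([sys.executable, "-c", "print('done')"], "demo")
print("succeeded" if ok else "failed")
```

Collecting the labels for which the helper returns `False` gives a list of models to rerun before the results are summarized.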

### **Summary results**

In [16]:
import os
import json
from tabulate import tabulate

# 99% Confidence Level Z-score
Z_99 = 2.576

results_dir = "results"
summary = []

# Walk through all subfolders in results/
for model_folder in os.listdir(results_dir):
    folder_path = os.path.join(results_dir, model_folder)

    if os.path.isdir(folder_path):
        # Find the JSON file inside each folder
        json_files = [f for f in os.listdir(folder_path) if f.endswith(".json")]
        if not json_files:
            print(f"⚠️ No JSON file found in {model_folder}")
            continue

        # Assume one result file per model folder
        result_path = os.path.join(folder_path, json_files[0])

        try:
            with open(result_path, "r", encoding="utf-8") as f:
                data = json.load(f)
                row = {"model": model_folder}

                for task, results in data.get("results", {}).items():
                    exact_match = None
                    stderr = None

                    for key, val in results.items():
                        if key.startswith("exact_match,strict-match"):
                            label = f"{task}:ExactMatch"
                            exact_match = val
                            row[label] = round(val, 3)

                        if key.startswith("exact_match_stderr,strict-match"):
                            stderr_label = f"{task}:StdErr"
                            stderr = val
                            row[stderr_label] = round(val, 4)  # Keep 4 decimals for more precision

                    # Calculate 99% Confidence Interval if both values are found
                    if exact_match is not None and stderr is not None:
                        ci_lower = exact_match - Z_99 * stderr
                        ci_upper = exact_match + Z_99 * stderr
                        ci_label = f"{task}:99% CI"
                        # Format as [lower, upper]
                        row[ci_label] = f"[{round(ci_lower, 3)}, {round(ci_upper, 3)}]"

                summary.append(row)

        except Exception as e:
            print(f"❌ Failed to read {result_path}: {e}")

# Display table
if summary:
    print("\n📊 Summary Table:")
    print(tabulate(summary, headers="keys", tablefmt="github"))
else:
    print("⚠️ No valid results to display.")





📊 Summary Table:
| model                                               |   gsm8k:ExactMatch |   gsm8k:StdErr | gsm8k:99% CI   |
|-----------------------------------------------------|--------------------|----------------|----------------|
| flan_ul2_results.json                               |              0.246 |         0.0193 | [0.196, 0.296] |
| granite_3_3_8b_instruct_results.json                |              0.676 |         0.021  | [0.622, 0.73]  |
| llama_4_maverick_17b_128e_instruct_fp8_results.json |              0.924 |         0.0119 | [0.893, 0.955] |
| mistral_large_results.json                          |              0.914 |         0.0126 | [0.882, 0.946] |
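
As a check on the interval column, each 99% interval is the normal-approximation interval `score ± 2.576 × stderr`. The sketch below reproduces the first row of the table from its rounded values. It also compares the reported standard error with `sqrt(p * (1 - p) / (n - 1))` for `n = 500` examples. That formula is an assumption about how the standard error of a 0/1 metric is estimated, although it is consistent with the numbers shown here. The function names are illustrative.

```python
import math

Z_99 = 2.576  # two-sided 99% z-score, as in the summary cell


def normal_interval(score: float, stderr: float, z: float = Z_99) -> tuple:
    """Return (lower, upper) for score +/- z * stderr, rounded to 3 decimals."""
    return (round(score - z * stderr, 3), round(score + z * stderr, 3))


def binary_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes (assumed n - 1 denominator)."""
    return math.sqrt(p * (1 - p) / (n - 1))


# First row of the table: flan_ul2, exact match 0.246, stderr 0.0193
print(normal_interval(0.246, 0.0193))       # (0.196, 0.296)
print(round(binary_stderr(0.246, 500), 4))  # 0.0193
```

Because the table values are already rounded, recomputed bounds can differ from the notebook output in the last digit for other rows.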
