### Evaluating Explanations with an LLM Evaluator (Ollama)

This notebook experiments with **LLM-based product evals** using a small labeled dataset of explanations for an article.

**Runtime:** Ollama

To run this notebook:

- Install and open **Ollama**
- Pull and load the model:

```bash
ollama pull llama3.2
ollama run llama3.2
```


#### Reference

This flow is inspired by Eugene Yan’s article:

> **“How I Build Product Evaluations for LLM Applications”**  
> https://eugeneyan.com/writing/product-evals/

I follow three of his ideas:

- Labeling a **small dataset** (with enough fails)  
- Using **binary labels** (`PASS` / `FAIL`)  
- Treating the **evaluator as a model to align** and measuring metrics  

---

#### What this notebook does

- Loads `data/evals/article_explanations.csv` with:
  - `input` – paragraph/snippet from the article  
  - `output` – a candidate explanation  
  - `label` – human ground truth (`PASS` / `FAIL`)
- Uses an **LLM via Ollama** as an evaluator:
  - Given `input` + `output`, it predicts `pred_label` (`PASS` / `FAIL`)
- Computes metrics (with **FAIL as the “positive” class**):
  - Precision / Recall / F1 for **FAIL**  
  - Cohen’s Kappa between `label` and `pred_label`

This gives a small, reusable **eval harness** for the task:

> *“Is this a good explanation of this paragraph?”*

---

#### How to read the metrics

- **Precision (FAIL)** – of the explanations the evaluator marked FAIL, how many did the human also label FAIL?  
- **Recall (FAIL)** – of all explanations the human labeled FAIL, how many did the evaluator catch?  
- **F1 (FAIL)** – harmonic mean of precision and recall for FAIL.  
- **Kappa** – agreement between human and LLM labels, corrected for the agreement expected by chance.  
  - Higher is better; ~0.4–0.6 is already decent.
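
To make these definitions concrete, here is a minimal hand-rolled sketch that derives all four numbers from confusion counts, with FAIL as the positive class. The helper name `fail_metrics` and the six toy labels are illustrative assumptions, not data from this notebook. The sketch also assumes every label is exactly `PASS` or `FAIL`.

```python
def fail_metrics(y_true, y_pred):
    """Precision/recall/F1 for FAIL plus Cohen's kappa, from two label lists.

    Hypothetical helper; assumes every label is exactly "PASS" or "FAIL".
    """
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one labeled pair")

    tp = sum(t == "FAIL" and p == "FAIL" for t, p in pairs)  # bad, flagged
    fp = sum(t == "PASS" and p == "FAIL" for t, p in pairs)  # good, flagged
    fn = sum(t == "FAIL" and p == "PASS" for t, p in pairs)  # bad, missed
    tn = sum(t == "PASS" and p == "PASS" for t, p in pairs)  # good, passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    p_observed = (tp + tn) / n
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance < 1 else float("nan"))
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


# Toy labels for illustration only
human = ["FAIL", "FAIL", "FAIL", "PASS", "PASS", "PASS"]
model = ["FAIL", "FAIL", "PASS", "PASS", "PASS", "FAIL"]
print(fail_metrics(human, model))
```

In this toy case the evaluator agrees with the human on 4 of 6 items (about 67%), but chance alone would give 50% agreement, so kappa is only about 0.33.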

---

#### Future work

- Add more examples (especially realistic FAILs).  
- Try different evaluator prompts and compare metrics.  
- Split into multiple evaluators (e.g., **faithfulness** vs **main idea** vs **conciseness**).

In [10]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
DATA_EVALS = PROJECT_ROOT / "data" / "evals" / "article_explanations.csv"

# print("Project root:", PROJECT_ROOT)
print("Eval file exists:", DATA_EVALS.exists())

df = pd.read_csv(DATA_EVALS)
df.head(4)

Eval file exists: True


Unnamed: 0,id,input,output,label
0,1,The author recommends using simple binary labe...,The author argues that we should always use de...,FAIL
1,2,The author recommends using simple binary labe...,"The key message is to prefer simple, discrete ...",PASS
2,3,He points out that fine-grained numeric scales...,The article says numeric rating scales are ide...,FAIL
3,4,He notes that stakeholders sometimes ask for g...,The article observes that stakeholders often a...,PASS


#### Step 2: Align an evaluator LLM

In [11]:
# Setup Ollama LLM Evaluator
from llama_index.llms.ollama import Ollama

eval_llm = Ollama(
    model="llama3.2",
    base_url="http://127.0.0.1:11434",
    request_timeout=300.0,
)

# quick sanity check
resp = eval_llm.complete("Say hello in one short sentence.")
print(resp.text)


Hello!


In [12]:
# write the evaluator prompt - output should be PASS or FAIL
def build_eval_prompt(row):
    return f"""
You are evaluating an explanation of a paragraph from an article about building product evaluations.

ARTICLE PARAGRAPH:
\"\"\"{row['input']}\"\"\"

MODEL EXPLANATION:
\"\"\"{row['output']}\"\"\"

EVALUATION CRITERION:
- The explanation should be accurate and faithful to the paragraph.
- It should clearly capture the *main idea* the author is making.
- It should be concise (around 2–3 sentences).
- If it misses key points, invents claims, or is vague/general, it should be FAIL.

Respond with exactly one word: PASS or FAIL.
"""

def llm_evaluate(row):
    prompt = build_eval_prompt(row)
    resp = eval_llm.complete(prompt)
    return resp.text.strip().upper()
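
Small models do not always honor "exactly one word". Below is a minimal sketch of a stricter parser, assuming we would rather mark ambiguous replies with a separate `UNKNOWN` value than guess. `parse_verdict` and the `UNKNOWN` label are hypothetical additions, not part of the original notebook.

```python
import re


def parse_verdict(raw_text):
    """Map raw evaluator text to 'PASS', 'FAIL', or 'UNKNOWN' (hypothetical helper)."""
    # Find standalone PASS/FAIL tokens, ignoring case and surrounding punctuation
    tokens = re.findall(r"\b(PASS|FAIL)\b", str(raw_text).upper())
    verdicts = set(tokens)
    # Accept the reply only if it names exactly one of the two verdicts
    if len(verdicts) == 1:
        return verdicts.pop()
    return "UNKNOWN"


print(parse_verdict("fail."))          # a reply with trailing punctuation
print(parse_verdict("PASS or FAIL?"))  # an ambiguous reply
```

`llm_evaluate` could return `parse_verdict(resp.text)` instead of the raw uppercased text; rows tagged `UNKNOWN` can then be counted or inspected separately before computing metrics.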


In [13]:
# Split into dev and test sets, stratified by label
from sklearn.model_selection import train_test_split

df_dev, df_test = train_test_split(df, test_size=0.25, random_state=42, stratify=df['label'])

print("Dev size:", len(df_dev), "Test size:", len(df_test))


Dev size: 12 Test size: 5


In [None]:
df_dev = df_dev.copy()
df_dev['pred_label'] = df_dev.apply(llm_evaluate, axis=1)

# Normalize true labels
df_dev['label'] = df_dev['label'].astype(str).str.strip().str.upper()

# Normalize predicted labels: strip, uppercase, remove trailing punctuation
df_dev['pred_label'] = (
    df_dev['pred_label']
    .astype(str)
    .str.strip()
    .str.upper()
    .str.replace('.', '', regex=False)   # remove dots
)

print("True labels:\n", df_dev['label'].value_counts())
print("\nPred labels:\n", df_dev['pred_label'].value_counts())

In [None]:
# Compute precision, recall, and Cohen’s Kappa
from sklearn.metrics import precision_recall_fscore_support, cohen_kappa_score

y_true = df_dev['label']
y_pred = df_dev['pred_label']

# Treat FAIL as the "positive" class, since we care about catching defects.
# Boolean targets keep the metric call binary even if the evaluator
# returns something other than PASS or FAIL.
y_true_fail = (y_true == 'FAIL')
y_pred_fail = (y_pred == 'FAIL')

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_fail, y_pred_fail, average='binary', pos_label=True
)

kappa = cohen_kappa_score(y_true, y_pred)

print(f"Precision (FAIL): {precision:.3f}")
print(f"Recall (FAIL):    {recall:.3f}")
print(f"F1 (FAIL):        {f1:.3f}")
print(f"Cohen's Kappa:    {kappa:.3f}")

Precision (FAIL): 1.000
Recall (FAIL):    0.857
F1 (FAIL):        0.923
Cohen's Kappa:    0.833
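
Aggregate scores hide which kind of mistake the evaluator makes. Here is a small sketch, assuming plain label sequences, that tallies every (human, evaluator) pair. Unexpected evaluator outputs show up as their own row instead of being silently counted as not-FAIL. `confusion_counts` is a hypothetical helper and the labels below are toy values, not the notebook's data.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count each (human label, evaluator label) pair (hypothetical helper)."""
    return Counter(zip(y_true, y_pred))


# Toy labels for illustration only; with the notebook's DataFrame this would be
# confusion_counts(df_dev['label'], df_dev['pred_label'])
human = ["FAIL", "FAIL", "PASS", "PASS"]
model = ["FAIL", "PASS", "PASS", "UNSURE"]

counts = confusion_counts(human, model)
for (true_label, pred_label), count in sorted(counts.items()):
    print(f"human={true_label:<5} evaluator={pred_label:<7} n={count}")
```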


#### Step 3: Wrap it in a tiny “eval harness”

In [None]:
import numpy as np

def run_eval(df_samples, name="run"):
    # Handle an empty sample set
    if df_samples is None or len(df_samples) == 0:
        return {
            "run": name,
            "n_samples": 0,
            "precision_FAIL": np.nan,
            "recall_FAIL": np.nan,
            "f1_FAIL": np.nan,
            "kappa": np.nan,
        }, df_samples

    df_samples = df_samples.copy()
    df_samples['pred_label'] = df_samples.apply(llm_evaluate, axis=1)
    
    # Normalize true labels
    df_samples['label'] = (
        df_samples['label']
        .astype(str)
        .str.strip()
        .str.upper()
    )

    # Normalize predicted labels
    df_samples['pred_label'] = (
        df_samples['pred_label']
        .astype(str)
        .str.strip()
        .str.upper()
        .str.replace('.', '', regex=False)
    )

    # Convert to booleans: is this a FAIL?
    y_true = df_samples['label']
    y_pred = df_samples['pred_label']

    y_true_fail = (y_true == 'FAIL')
    y_pred_fail = (y_pred == 'FAIL')
    
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_fail,
        y_pred_fail,
        average='binary',
        pos_label=True,
    )

    kappa = cohen_kappa_score(y_true, y_pred)
    
    row = {
        "run": name,
        "n_samples": len(df_samples),
        "precision_FAIL": precision,
        "recall_FAIL": recall,
        "f1_FAIL": f1,
        "kappa": kappa,
    }
    return row, df_samples

# Example: use on test set after you're happy with dev performance
metrics, df_test_scored = run_eval(df_test, name="baseline_prompt")
metrics


{'run': 'baseline_prompt',
 'n_samples': 5,
 'precision_FAIL': 0.75,
 'recall_FAIL': 1.0,
 'f1_FAIL': 0.8571428571428571,
 'kappa': 0.5454545454545454}
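
When several prompts are tried, the metric rows returned by `run_eval` can be collected and ranked. This is a minimal sketch, assuming each row is a dict with a `run` name and a `kappa` value as above. `rank_runs`, the run names, and the numbers are hypothetical placeholders, not measured results.

```python
import math


def rank_runs(rows, metric="kappa"):
    """Sort metric rows best-first; rows with a missing/NaN metric go last."""
    def sort_key(row):
        value = row.get(metric)
        missing = value is None or math.isnan(value)
        # False sorts before True, so valid rows come first, highest metric first
        return (missing, 0.0 if missing else -value)
    return sorted(rows, key=sort_key)


# Placeholder rows for illustration only (not real results)
rows = [
    {"run": "prompt_a", "kappa": 0.2},
    {"run": "empty_set", "kappa": float("nan")},
    {"run": "prompt_b", "kappa": 0.7},
]
ranked = rank_runs(rows)
print([row["run"] for row in ranked])
```

Ranking on kappa is only one choice; passing `metric="recall_FAIL"` would favor prompts that catch the most defects.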