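### Install helper libraries for evaluation

This cell installs the Python packages used in this notebook and by the RAFT helper scripts:

- **openai** – the client library we use to call models through an OpenAI-compatible API (including the judge model).  
- **langchain_openai** – LangChain’s integration for OpenAI-compatible models.  
- **azure-identity** – Azure’s authentication library.  
- **python-dotenv (`dotenv`)** – reads settings (like URLs and API keys) from a simple text file called `config.env`.  
- **coloredlogs** – adds color to log output so it is easier to read.  
- **datasets** – the Hugging Face library for loading and saving datasets.

Of these, the notebook code itself imports `openai` and `dotenv`; the others are presumably dependencies of the RAFT scripts (`eval.py`, `format.py`) run in later cells.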


In [None]:
!pip install openai langchain_openai azure-identity python-dotenv coloredlogs datasets

### Configure evaluation files and paths

This cell sets up where all of our evaluation files live and gives them clear names:

- **Experiment name**: here it’s `"surfing"` – used to build file paths.
- **Eval questions file**: the questions generated in notebook `1-generate-data`.
- **Raw answers**:
  - **Baseline raw answers** – what the original (teacher) model answered.
  - **Student raw answers** – what our fine-tuned model answered.
- **Formatted answers**:
  - Files where answers are put into a standard **RAFT eval format** so they’re easy to score.
- **Score files**:
  - **Row scores (JSONL)** – one line per question, with its scores.
  - **Aggregate metrics (JSON)** – overall averages for each score type.

> **JSON vs JSONL**  
> - **JSON**: a single structured document (often for summaries or settings).  
> - **JSONL** (“JSON Lines”): many small JSON objects, one per line (great for datasets).

At the end, it prints a summary of all file paths and checks that the eval questions file actually exists.
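
To make the JSON vs JSONL distinction above concrete, here is a minimal sketch using only the standard library. The record contents and variable names are made up for illustration; they are not taken from the workshop dataset.

```python
import json

# Hypothetical per-question records (illustration only)
rows = [
    {"question": "q1", "score": 4.0},
    {"question": "q2", "score": 5.0},
]

# JSONL: one compact JSON object per line, easy to append to and to stream
jsonl_text = "\n".join(json.dumps(r) for r in rows) + "\n"
parsed_rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# JSON: a single document, here a summary computed from the rows
summary = {"mean_score": sum(r["score"] for r in parsed_rows) / len(parsed_rows)}
json_text = json.dumps(summary, indent=2)

print(jsonl_text)
print(json_text)
```

This mirrors how the notebook uses the two formats: per-question scores go to JSONL, and the overall averages go to a single JSON file.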


In [None]:
# Configure experiment name, eval file locations, and related env vars
import os
from pathlib import Path
from dotenv import load_dotenv

# LOAD CONFIGURATION
load_dotenv("config.env")

# Basic experiment config
experiment_name = "surfing"
experiment_dir = f"dataset/{experiment_name}-files"

# Questions generated by notebook 1-generate-data
dataset_path_hf_eval = f"{experiment_dir}/{experiment_name}-hf.eval.jsonl"
os.environ["DATASET_PATH_HF_EVAL"] = dataset_path_hf_eval

# Raw RAFT answer files (model outputs before formatting)
dataset_path_hf_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-hf.eval.answer.baseline.jsonl"
os.environ["DATASET_PATH_HF_EVAL_ANSWER_BASELINE"] = dataset_path_hf_eval_answer_baseline

dataset_path_hf_eval_answer_student = f"{experiment_dir}/{experiment_name}-hf.eval.answer.student.jsonl"
os.environ["DATASET_PATH_HF_EVAL_ANSWER_STUDENT"] = dataset_path_hf_eval_answer_student

# Formatted answer files (for scoring)
dataset_path_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.baseline.jsonl"
os.environ["DATASET_PATH_EVAL_ANSWER_BASELINE"] = dataset_path_eval_answer_baseline

dataset_path_eval_answer_student = f"{experiment_dir}/{experiment_name}-eval.answer.student.jsonl"
os.environ["DATASET_PATH_EVAL_ANSWER_STUDENT"] = dataset_path_eval_answer_student

# Scored answer files (row-level scores)
dataset_path_eval_answer_score_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.score.baseline.jsonl"
dataset_path_eval_answer_score_student  = f"{experiment_dir}/{experiment_name}-eval.answer.score.student.jsonl"

# Aggregated metrics files
dataset_path_eval_answer_score_metrics_baseline = (
    f"{experiment_dir}/{experiment_name}-eval.answer.score.metrics.baseline.json"
)
dataset_path_eval_answer_score_metrics_student = (
    f"{experiment_dir}/{experiment_name}-eval.answer.score.metrics.student.json"
)

print(
    f"""
Evaluation File Configuration
-----------------------------
Eval questions file                  : {dataset_path_hf_eval}

Baseline raw answers (model output)  : {dataset_path_hf_eval_answer_baseline}
Student  raw answers (model output)  : {dataset_path_hf_eval_answer_student}

Baseline formatted answers (for eval): {dataset_path_eval_answer_baseline}
Student  formatted answers (for eval): {dataset_path_eval_answer_student}

Baseline row scores JSONL            : {dataset_path_eval_answer_score_baseline}
Student  row scores JSONL            : {dataset_path_eval_answer_score_student}

Baseline aggregate metrics JSON      : {dataset_path_eval_answer_score_metrics_baseline}
Student  aggregate metrics JSON      : {dataset_path_eval_answer_score_metrics_student}
"""
)

if not Path(dataset_path_hf_eval).is_file():
    raise FileNotFoundError(f"Eval file not found: {dataset_path_hf_eval}")


### Configure baseline, student, and judge model endpoints

This cell tells the notebook **which models to talk to** and **where they live**:

- Reads a shared **base URL** and **API key** from the environment.
- Defines three logical roles:
  - **Baseline model** – the original “teacher” model we compare against.
  - **Student model** – the fine-tuned Granite model we trained.
  - **Judge model** – a strong model (Qwen) used **only to score answers**.

It then sets environment variables like:

- `BASELINE_OPENAI_BASE_URL`, `BASELINE_OPENAI_API_KEY`, etc.  
- `STUDENT_OPENAI_BASE_URL`, `STUDENT_DEPLOYMENT_NAME`, etc.  
- `JUDGE_OPENAI_BASE_URL`, `JUDGE_OPENAI_DEPLOYMENT`, etc.

> **Endpoint / router**  
> - An **endpoint** is the URL where we send our requests.  
> - A **router** is a service that receives those requests and forwards them to the right model.

Finally, it prints a short summary so you can see which endpoint and model name each role is using.
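
The per-role variables follow a simple prefix convention. The sketch below shows one way a script could resolve its settings from a prefix such as `BASELINE` or `JUDGE`. The helper name `resolve_role_config` and the placeholder URL and key are assumptions for illustration, and the actual lookup inside `.gorilla/raft/eval.py` may differ.

```python
import os

def resolve_role_config(prefix: str, env=None) -> dict:
    """Hypothetical helper: collect <PREFIX>_OPENAI_* settings for one role."""
    env = os.environ if env is None else env
    keys = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_DEPLOYMENT")
    config = {}
    for key in keys:
        name = f"{prefix}_{key}"
        if name not in env:
            raise KeyError(f"Missing environment variable: {name}")
        config[key.lower()] = env[name]
    return config

# Placeholder values, not real endpoints or keys
fake_env = {
    "JUDGE_OPENAI_BASE_URL": "http://router.example/v1",
    "JUDGE_OPENAI_API_KEY": "placeholder-key",
    "JUDGE_OPENAI_DEPLOYMENT": "qwen.qwen3-32b-v1:0",
}
judge_config = resolve_role_config("JUDGE", fake_env)
print(judge_config["openai_deployment"])
```

Keeping one set of variables per role means the same script can be pointed at a different model just by changing the prefix.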


In [None]:
# Configure baseline, student, and judge model endpoints (via router)
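import os

OPENAI_BASE_URL   = os.getenv('OPENAI_BASE_URL')
OPENAI_API_KEY    = os.getenv('OPENAI_API_KEY')

# Fail early with a clear message: assigning None to os.environ raises a TypeError
for _name in ("OPENAI_BASE_URL", "OPENAI_API_KEY", "STUDENT_OPENAI_BASE_URL"):
    if not os.getenv(_name):
        raise EnvironmentError(f"{_name} is not set; check config.env")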

BASELINE_MODEL_NAME = "openai.gpt-oss-120b-1:0"
STUDENT_MODEL_NAME  = "granite-4.0-micro"
JUDGE_MODEL_NAME    = "qwen.qwen3-32b-v1:0"

# Baseline model env (used by .gorilla/raft/eval.py with --env-prefix BASELINE)
os.environ["BASELINE_OPENAI_BASE_URL"]  = OPENAI_BASE_URL
os.environ["BASELINE_OPENAI_API_KEY"]   = OPENAI_API_KEY
os.environ["BASELINE_OPENAI_DEPLOYMENT"] = BASELINE_MODEL_NAME
os.environ["BASELINE_MODEL_API"]        = "chat"

# Student model env (used by .gorilla/raft/eval.py with --env-prefix STUDENT)
os.environ["STUDENT_OPENAI_BASE_URL"]   = os.getenv('STUDENT_OPENAI_BASE_URL')
os.environ["STUDENT_OPENAI_API_KEY"]    = OPENAI_API_KEY
os.environ["STUDENT_DEPLOYMENT_NAME"]   = "granite-student"
os.environ["STUDENT_MODEL_API"]         = "chat"

# Judge model env (used by judge client via the router)
os.environ["JUDGE_OPENAI_BASE_URL"]     = OPENAI_BASE_URL
os.environ["JUDGE_OPENAI_API_KEY"]      = OPENAI_API_KEY
os.environ["JUDGE_OPENAI_DEPLOYMENT"]   = JUDGE_MODEL_NAME

print(
    f"""
Model Endpoint Configuration
----------------------------
Baseline endpoint : {os.environ['BASELINE_OPENAI_BASE_URL']}
Baseline model    : {BASELINE_MODEL_NAME}

Student endpoint  : {os.environ['STUDENT_OPENAI_BASE_URL']}
Student model     : {STUDENT_MODEL_NAME}
Student deployment: {os.environ['STUDENT_DEPLOYMENT_NAME']}

Judge endpoint    : {os.environ['JUDGE_OPENAI_BASE_URL']}
Judge model       : {JUDGE_MODEL_NAME}
"""
)


### Run the baseline model over the evaluation questions

This cell calls the **baseline (teacher) model** to answer every evaluation question.

- Uses the RAFT `eval.py` script to:
  - Read the eval questions file.
  - Send each question to the **baseline model** via the router.
  - Save the answers as a **raw JSONL file**.
- It only runs if the baseline answers file does **not** already exist, so you don’t waste time re-running the same evaluation.

This gives us the baseline’s raw answers, which we’ll later format and score.

In [None]:
%%bash
# Run baseline model over eval split (if not already done)
if [ ! -f "$DATASET_PATH_HF_EVAL_ANSWER_BASELINE" ]; then
  echo "Running baseline model over eval split..."
  python .gorilla/raft/eval.py \
    --question-file "$DATASET_PATH_HF_EVAL" \
    --answer-file "$DATASET_PATH_HF_EVAL_ANSWER_BASELINE" \
    --model "$BASELINE_OPENAI_DEPLOYMENT" \
    --env-prefix BASELINE \
    --mode "$BASELINE_MODEL_API"
else
  echo "Baseline answers file already exists, skipping."
fi


### Convert baseline raw answers into RAFT eval format

This cell uses the RAFT `format.py` helper to:

- Take the **baseline raw answers JSONL** as input.
- Produce a **formatted eval JSONL** file.

The **RAFT eval format** organizes each example into a common structure (question, context, gold/reference answer, model’s final answer), which makes it easy for the judge model and scoring code to work with.
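
For orientation, one line of the formatted file might look like the record sketched below. The field names `question`, `context`, `gold_final_answer`, and `final_answer` are the ones the scoring code in this notebook reads; the values are placeholders, and the real output of `format.py` may carry additional fields.

```python
import json

# Placeholder record; only the field names come from this notebook
record = {
    "question": "<the evaluation question>",
    "context": "<DOCUMENT>supporting text</DOCUMENT>",
    "gold_final_answer": "<the reference answer>",
    "final_answer": "<the model's answer>",
}

# One formatted example corresponds to one line of the JSONL file
line = json.dumps(record)

# Quick sanity check that a parsed line has every field the scorer needs
required = {"question", "context", "gold_final_answer", "final_answer"}
missing = required - set(json.loads(line))
print("missing fields:", sorted(missing))
```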

In [None]:
# Format baseline raw answers into RAFT eval format
!python .gorilla/raft/format.py \
    --input "$DATASET_PATH_HF_EVAL_ANSWER_BASELINE" \
    --input-type jsonl \
    --output "$DATASET_PATH_EVAL_ANSWER_BASELINE" \
    --output-format eval

### Helper: nicely display one formatted answer

This cell defines two helper functions:

- `row_to_markdown(df, idx)` – turns one row from a DataFrame into a readable Markdown block.
- `pretty_print_row(df, idx)` – actually displays that Markdown in the notebook.

It also:
- Cleans up special tags like `<DOCUMENT>`, `<ANSWER>`, and custom quote markers so they display nicely.
- Loads the **baseline formatted answers** and prints one example.

This lets you eyeball a single example to check that the question, context, and answers all look reasonable before you run large-scale scoring.

In [None]:
# Utilities to pretty-print an eval row as Markdown for manual inspection
import pandas as pd

def row_to_markdown(df, idx):
    sample = df.iloc[idx]
    md = ""
    for name in df.columns.values:
        value = str(sample[name])  # non-string cells (numbers, NaN) have no .replace
        value = value.replace("<DOCUMENT>", "`<DOCUMENT>`").replace("</DOCUMENT>", "`</DOCUMENT>`")
        value = value.replace("<ANSWER>", "`<ANSWER>`").replace("</ANSWER>", "`</ANSWER>`")
        value = value.replace("##begin_quote##", "`##begin_quote##`").replace("##end_quote##", "`##end_quote##`")
        md += "### " + name + "\n" + value + "\n"
    return md

def pretty_print_row(df, idx):
    from IPython.display import display, Markdown
    display(Markdown(row_to_markdown(df, idx)))

print("Baseline (formatted) answer example:")
pretty_print_row(pd.read_json(dataset_path_eval_answer_baseline, lines=True), 0)


### Run the student (fine-tuned) model over the evaluation questions

This cell does the **same type of evaluation as the baseline**, but for the **student model**:

- Calls the RAFT `eval.py` script again.
- Sends each eval question to the **student deployment**.
- Writes out a **student raw answers JSONL** file.

It also skips running if the output file already exists.  
Now we have raw answers for both baseline and student on exactly the same questions.

In [None]:
%%bash
# Run student model over eval split (if not already done)
if [ ! -f "$DATASET_PATH_HF_EVAL_ANSWER_STUDENT" ]; then
  echo "Running student model over eval split..."
  python .gorilla/raft/eval.py \
    --question-file "$DATASET_PATH_HF_EVAL" \
    --answer-file "$DATASET_PATH_HF_EVAL_ANSWER_STUDENT" \
    --model "$STUDENT_DEPLOYMENT_NAME" \
    --env-prefix STUDENT \
    --mode "$STUDENT_MODEL_API"
else
  echo "Student answers file already exists, skipping."
fi


### Preview the student model’s raw answers

This cell:

- Loads the **student raw answers JSONL** into a pandas table.
- Shows the first two rows.

It’s just a quick spot-check to confirm that the student model produced answers in the expected structure.

In [None]:
# Check the first few student raw answers
import pandas as pd

pd.read_json(dataset_path_hf_eval_answer_student, lines=True).head(2)

### Convert student raw answers into RAFT eval format

This cell mirrors what we did for the baseline:

- Takes the **student raw answers JSONL** as input.
- Uses RAFT `format.py` to create a **student eval JSONL** file in the standard eval format.

After this, both baseline and student answers are ready to be scored in exactly the same way.

In [None]:
# Format student raw answers into RAFT eval format
!python .gorilla/raft/format.py \
    --input "$DATASET_PATH_HF_EVAL_ANSWER_STUDENT" \
    --input-type jsonl \
    --output "$DATASET_PATH_EVAL_ANSWER_STUDENT" \
    --output-format eval

### Display one student formatted answer for comparison

This cell:

- Loads the **student formatted answers**.
- Uses `pretty_print_row` to render one of them as Markdown.

You can compare this visually with the baseline example to see how the student’s reasoning and style differ from the teacher’s on the same question.

In [None]:
# Pretty-print one student formatted answer for inspection
import pandas as pd

print("Student (formatted) answer example:")
pretty_print_row(pd.read_json(dataset_path_eval_answer_student, lines=True), 0)

### Set up the judge model client

This cell configures the **judge model**, which scores how good each answer is.

- Uses the `OpenAI` Python client, pointing it at our internal **router** (`base_url`) with an **API key**.
- Chooses **Qwen3 32B** (`qwen.qwen3-32b-v1:0`) as the judge model:
  - It doesn’t answer user questions for the app itself.
  - Instead, it reads the question, context, gold answer, and model answer, and returns quality scores.

It prints the judge URL and model name, then builds a `judge_client` we’ll reuse when scoring.

In [None]:
# Configure judge client (Qwen via router) using OpenAI-compatible API
from openai import OpenAI

JUDGE_BASE_URL = OPENAI_BASE_URL
JUDGE_API_KEY  = OPENAI_API_KEY
JUDGE_MODEL    = JUDGE_MODEL_NAME  # reuse the name set in the endpoint configuration cell

print("Judge base URL:", JUDGE_BASE_URL)
print("Judge model:   ", JUDGE_MODEL)

judge_client = OpenAI(
    base_url=JUDGE_BASE_URL,
    api_key=JUDGE_API_KEY,
)

### Helper: safely pull JSON out of the judge’s response

Sometimes a model wraps JSON in extra text (e.g., comments or backticks).  
This helper function:

- Looks for the first `{` and last `}` in the text.
- Returns just that portion as a JSON string.
- Raises a clear error if it can’t find a JSON object.

We use this to be more forgiving if the judge model doesn’t respond with “perfect” JSON.
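
One caveat of the first-`{`-to-last-`}` approach: if the response contains two separate objects, or a stray `}` in trailing commentary, the slice is no longer valid JSON. A stricter variant, sketched here with the standard library's `json.JSONDecoder.raw_decode`, tries to parse a complete object starting at each `{` in turn. The function name `first_json_object` is hypothetical and not part of the notebook's pipeline.

```python
import json

def first_json_object(text: str) -> dict:
    """Hypothetical alternative: return the first parseable JSON object in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    raise ValueError(f"Could not find JSON object in: {text!r}")

# Trailing "{draft}" would break a first-"{"-to-last-"}" slice
wrapped = 'Scores: {"coherence": 4, "fluency": 5} (note: {draft})'
print(first_json_object(wrapped))
```

The trade-off is a little more code in exchange for tolerating extra braces after the object.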

In [None]:
# Helper: extract the outermost {...} span from a string (if judge wraps it)
def extract_json_block(text: str) -> str:
    """
    Return the substring from the first "{" to the last "}" in the text.
    Useful if the judge wraps JSON in extra text or backticks.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        raise ValueError(f"Could not find JSON object in: {text!r}")
    return text[start : end + 1]

### Call the judge model once for a single QA pair

This cell defines `judge_single`, which:

1. Builds a **system message** telling the judge to be an impartial evaluator.  
2. Sends the judge:
   - The **question**,
   - The **context** (the supporting text),
   - The **gold/reference answer** (what we believe is correct),
   - The **model’s answer** (baseline or student).
3. Asks the judge to return a small JSON object with scores from 1 to 5 (whole or decimal) for:
   - **Coherence** – Is the answer logically consistent and well-structured?
   - **Relevance** – Does it answer the question and stay on topic?
   - **Groundedness** – Is it supported by the given context (not hallucinated)?
   - **Fluency** – Is the language clear and readable?
   - **Similarity** – How close is it to the reference/gold answer?

The function:
- Calls the judge via `judge_client.chat.completions.create`.
- Parses the JSON (using `extract_json_block` if needed).
- Returns a dictionary with standardized score names like `"coherence.gpt_coherence"`.
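
The parse-and-rename step can be sketched in isolation, without calling a model. The example below uses a hard-coded string in place of a real judge response and adds a range check that the notebook's `judge_single` does not perform; `to_named_scores` is a hypothetical name.

```python
import json

DIMENSIONS = ("coherence", "relevance", "groundedness", "fluency", "similarity")

def to_named_scores(raw_json: str) -> dict:
    """Hypothetical helper: parse judge JSON and rename keys to '<dim>.gpt_<dim>'."""
    data = json.loads(raw_json)
    scores = {}
    for dim in DIMENSIONS:
        value = float(data[dim])  # KeyError if the judge omitted a dimension
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{dim} score out of range 1-5: {value}")
        scores[f"{dim}.gpt_{dim}"] = value
    return scores

# Stand-in for a judge response (not real model output)
fake_response = (
    '{"coherence": 4, "relevance": 5, "groundedness": 4.5, '
    '"fluency": 5, "similarity": 3}'
)
print(to_named_scores(fake_response))
```

Rejecting out-of-range values early keeps a single malformed judge reply from skewing the averages computed later.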

In [None]:
# Call the judge model once to score a single QA pair
import json

def judge_single(question: str, context: str, gold: str, answer: str) -> dict:
    """
    Call the judge model (Qwen via the router) to score one answer.

    Returns a dict with keys like:
      - "coherence.gpt_coherence"
      - "relevance.gpt_relevance"
      - "groundedness.gpt_groundedness"
      - "fluency.gpt_fluency"
      - "similarity.gpt_similarity"
    """

    system_msg = (
        "You are an impartial evaluation assistant. "
        "Given a question, context, reference answer, and model answer, "
        "you will score the model answer on several dimensions from 1 to 5.\n\n"
        "Return ONLY a valid JSON object, no commentary, in this format:\n"
        "{\n"
        '  "coherence":   <number 1-5>,\n'
        '  "relevance":   <number 1-5>,\n'
        '  "groundedness":<number 1-5>,\n'
        '  "fluency":     <number 1-5>,\n'
        '  "similarity":  <number 1-5>\n'
        "}\n"
        "Use whole or decimal numbers (e.g., 3 or 4.5)."
    )

    user_msg = f"""
Question:
{question}

Context (information to base the answer on):
{context}

Reference answer (ground truth):
{gold}

Model answer to evaluate:
{answer}
"""

    resp = judge_client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        temperature=0,
        max_tokens=256,
    )

    text = resp.choices[0].message.content.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = json.loads(extract_json_block(text))

    # Map to the same naming style as the original RAFT/Azure notebook
    scores = {
        "coherence.gpt_coherence":       float(data["coherence"]),
        "relevance.gpt_relevance":       float(data["relevance"]),
        "groundedness.gpt_groundedness": float(data["groundedness"]),
        "fluency.gpt_fluency":           float(data["fluency"]),
        "similarity.gpt_similarity":     float(data["similarity"]),
    }
    return scores

### Score a full dataset and compute row-level + overall metrics

This cell defines `score_dataset_file`, which runs the judge over **all** examples in a formatted eval file:

- **Input**:
  - `formatted_answers_path` – the RAFT eval JSONL (baseline or student).
  - `row_scores_output_path` – where to save per-question scores (JSONL).
  - `metrics_output_path` – where to save overall averages (JSON).
- **What it does**:
  1. Loads all examples into a pandas DataFrame.
  2. Loops through each row with a `tqdm` progress bar.
  3. For each row, calls `judge_single(...)` to get the five scores.
  4. Accumulates sums for each metric.
  5. Stores a row-level record with:
     - Inputs (question, context, gold answer, model answer),
     - Outputs (the scores from the judge).
  6. Writes:
     - A **row-level scores JSONL** file (one line per question).
     - An **aggregate metrics JSON** file with average scores.

It then returns the metrics dictionary so we can use it in later cells.
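
The accumulation and averaging in steps 4 and 6 amount to a per-metric mean. A minimal sketch with made-up scores (two rows, two metrics) shows the arithmetic; `aggregate_means` is a hypothetical name, and it returns an empty dict when there are no rows.

```python
def aggregate_means(score_rows: list) -> dict:
    """Hypothetical helper: average each metric across per-row score dicts."""
    sums = {}
    for scores in score_rows:
        for name, value in scores.items():
            sums[name] = sums.get(name, 0.0) + value
    n = len(score_rows)
    return {name: total / n for name, total in sums.items()} if n else {}

# Made-up scores for illustration
rows = [
    {"coherence.gpt_coherence": 4.0, "fluency.gpt_fluency": 5.0},
    {"coherence.gpt_coherence": 3.0, "fluency.gpt_fluency": 4.0},
]
print(aggregate_means(rows))
```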

In [None]:
# Score a full formatted dataset file and write row-level + aggregate metrics
import json

import pandas as pd
from tqdm import tqdm
def score_dataset_file(
    formatted_answers_path: str,
    row_scores_output_path: str,
    metrics_output_path: str,
) -> dict:
    """
    formatted_answers_path: JSONL from .gorilla/raft/format.py (eval format)
    row_scores_output_path: JSONL where we write per-row inputs & outputs
    metrics_output_path:    JSON file with aggregated mean scores
    """

    df = pd.read_json(formatted_answers_path, lines=True)
    print(f"Loaded {len(df)} examples from {formatted_answers_path}")

    rows_out = []
    metric_sums = None
    n = 0

    for _, row in tqdm(df.iterrows(), total=len(df)):
        question = row["question"]
        context = row.get("context", "")
        gold    = row["gold_final_answer"]
        answer  = row["final_answer"]

        scores = judge_single(question, context, gold, answer)

        # Initialize metric sums
        if metric_sums is None:
            metric_sums = {k: 0.0 for k in scores.keys()}

        for k, v in scores.items():
            metric_sums[k] += v

        rows_out.append(
            {
                "inputs": {
                    "question":          question,
                    "context":           context,
                    "final_answer":      answer,
                    "gold_final_answer": gold,
                },
                "outputs": scores,
            }
        )
        n += 1

    # Write row-level scores
    with open(row_scores_output_path, "w") as f:
        for row in rows_out:
            f.write(json.dumps(row) + "\n")
    print("Wrote row scores to:", row_scores_output_path)

    # Compute aggregate metrics (metric_sums stays None if the file had no rows)
    metrics = {k: v / n for k, v in (metric_sums or {}).items()}
    with open(metrics_output_path, "w") as f:
        json.dump(metrics, f, indent=2)
    print("Wrote aggregate metrics to:", metrics_output_path)

    return metrics


### Evaluate the baseline model with the judge

This cell:

- Imports `tqdm` and `json` for progress and file handling.
- Calls `score_dataset_file` on the **baseline formatted answers**.
- Prints the baseline’s average scores for each metric (coherence, relevance, etc.).

These scores form our **reference point** – how well the teacher model performs on this eval set.

In [None]:
# Run judge over baseline formatted answers and compute metrics
from tqdm import tqdm
import json

baseline_metrics = score_dataset_file(
    formatted_answers_path=dataset_path_eval_answer_baseline,
    row_scores_output_path=dataset_path_eval_answer_score_baseline,
    metrics_output_path=dataset_path_eval_answer_score_metrics_baseline,
)

print("\nBaseline metrics (judge model):")
for k, v in baseline_metrics.items():
    print(f"{k}: {v:.3f}")

### Compare baseline vs student scores and compute improvement

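The next two cells first score the **student formatted answers** with the judge, exactly as we did for the baseline, and then compare the two sets of averages. The comparison cell: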

- Builds a small pandas table (`metrics_df`) with:
  - One row per metric,
  - Two columns: `baseline` and `student`.
- Adds an `improvement` column that shows the **relative change**:
  \[
  \text{improvement} = \frac{\text{student} - \text{baseline}}{\text{baseline}}
  \]

In plain terms:
- A **positive** improvement value means the student scored higher than the baseline for that metric.
- A **negative** value means the student did worse.

The displayed table lets you quickly see where fine-tuning helped, hurt, or had little effect.
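
As a worked example with made-up averages: if the baseline scores 4.0 on a metric and the student scores 4.4, the relative improvement is (4.4 - 4.0) / 4.0 = 0.10, or 10%. The sketch below applies the same formula in plain Python; the numbers are illustrative, not results from this workshop, and `relative_improvement` is a hypothetical name.

```python
def relative_improvement(baseline: float, student: float) -> float:
    """(student - baseline) / baseline; undefined when baseline is 0."""
    if baseline == 0:
        raise ZeroDivisionError("baseline score is 0; relative change is undefined")
    return (student - baseline) / baseline

# Illustrative averages only: (baseline, student)
examples = {"coherence": (4.0, 4.4), "fluency": (5.0, 4.5)}
for metric, (baseline, student) in examples.items():
    change = relative_improvement(baseline, student)
    print(f"{metric}: {change:+.1%}")
```

Because the judge's scores start at 1, the baseline average should never be 0 in practice, so the guard is only a safety net.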

In [None]:
# Run judge over student formatted answers and compute metrics
student_metrics = score_dataset_file(
    formatted_answers_path=dataset_path_eval_answer_student,
    row_scores_output_path=dataset_path_eval_answer_score_student,
    metrics_output_path=dataset_path_eval_answer_score_metrics_student,
)

print("\nStudent metrics (judge model):")
for k, v in student_metrics.items():
    print(f"{k}: {v:.3f}")

In [None]:
# Compare baseline vs student metrics and compute relative improvement
import pandas as pd

metrics_df = pd.DataFrame.from_dict(
    {
        "baseline": baseline_metrics,
        "student":  student_metrics,
    }
)

metrics_df["improvement"] = (metrics_df["student"] - metrics_df["baseline"]) / metrics_df["baseline"]

metrics_df
