# Task
Use labeled data to build an LLM judge.

* Understand the data
* Split data into train/dev/test
* Write a judge prompt with a few few-shot examples drawn from train
* Iterate on dev
* Measure TPR and TNR on test
* Use Judgy to estimate unbiased performance and a confidence interval
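
The split and measurement steps above can be sketched roughly as follows. This is a minimal stdlib-only sketch, not the notebook's actual implementation: the helper names (`stratified_split`, `tpr_tnr`, `corrected_pass_rate`) and the 20/40/40 fractions are assumptions made for illustration. The last function shows the standard misclassification correction that a bias-correction tool of this kind typically applies; it is not Judgy's API.

```python
import random
from collections import defaultdict


def stratified_split(items, key, fractions=(0.2, 0.4), seed=0):
    """Split items into train/dev/test, stratified by key(item).

    fractions gives the (train, dev) shares; test receives the remainder.
    The default shares are an assumption, not a recommendation from the source.
    """
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for item in items:
        by_key[key(item)].append(item)

    train, dev, test = [], [], []
    for group in by_key.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_dev = round(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


def tpr_tnr(human, judge, positive="PASS"):
    """Return (TPR, TNR) of judge labels measured against human labels."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    tn = sum(h != positive and j != positive for h, j in pairs)
    n_pos = sum(h == positive for h, _ in pairs)
    n_neg = len(pairs) - n_pos
    # A rate is undefined (NaN) when its class is absent from the human labels.
    tpr = tp / n_pos if n_pos else float("nan")
    tnr = tn / n_neg if n_neg else float("nan")
    return tpr, tnr


def corrected_pass_rate(observed_rate, tpr, tnr):
    """Correct the judge's raw pass rate for its known error rates.

    Solves observed = theta * TPR + (1 - theta) * (1 - TNR) for theta,
    then clips the result to [0, 1].
    """
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    return min(1.0, max(0.0, (observed_rate + tnr - 1) / denom))
```

With the traces loaded below, a call such as `stratified_split(traces, key=lambda t: t.label)` would keep the PASS/FAIL ratio roughly equal across the three splits. Stratifying on `label` rather than `dietary_restriction` is a design choice; with few traces per category, stratifying on both can leave some splits empty.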



In [None]:
import json
from typing import Literal

from pydantic import BaseModel


class LabeledTrace(BaseModel):
    query: str
    dietary_restriction: str
    response: str
    success: bool
    error: str | None
    trace_id: str
    query_id: str
    label: Literal["PASS", "FAIL"]
    reasoning: str
    confidence: Literal["HIGH", "MEDIUM", "LOW"]
    labeled: bool


with open("reference_files/labeled_traces.jsonl") as f:
    traces: list[LabeledTrace] = [LabeledTrace(**json.loads(line)) for line in f]

print(f"Loaded {len(traces)} traces")
traces[0]

In [None]:
import polars as pl
from polars import DataFrame

df = pl.DataFrame([t.model_dump() for t in traces])
df.head()

In [None]:
import altair as alt

diet_counts: DataFrame = df.group_by("dietary_restriction").len().rename({"len": "count"})

def stacked_bar(df, col):
    """Stacked bar chart of row counts per dietary restriction, colored by `col`."""
    data = df.group_by("dietary_restriction", col).len().rename({"len": "count"})
    return alt.Chart(data).mark_bar().encode(
        x="dietary_restriction:N",
        y="count:Q",
        color=f"{col}:N",
        tooltip=["dietary_restriction", col, "count"],
    ).properties(title=f"{col} by Dietary Restriction", width=400, height=250)

OK, so it looks like every trace is marked as a success.

In [None]:
stacked_bar(df, "success")

A few categories, such as whole30, have only FAIL examples.
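
Single-label categories can also be listed directly instead of being read off the chart. The helper below is a hypothetical sketch (the name `single_label_categories` and the toy rows are assumptions, not part of the dataset); it takes `(category, label)` pairs, so it could be fed `((t.dietary_restriction, t.label) for t in traces)`.

```python
from collections import defaultdict


def single_label_categories(pairs):
    """Map each category that has exactly one distinct label to that label."""
    seen = defaultdict(set)
    for category, label in pairs:
        seen[category].add(label)
    return {cat: next(iter(labels)) for cat, labels in seen.items() if len(labels) == 1}


# Toy rows for illustration only; these are not taken from the labeled traces.
toy = [("vegan", "PASS"), ("vegan", "FAIL"), ("whole30", "FAIL"), ("whole30", "FAIL")]
print(single_label_categories(toy))  # {'whole30': 'FAIL'}
```

Categories flagged this way matter for the split: a judge cannot be checked for false negatives on a category with no PASS examples.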

In [None]:
stacked_bar(df, "label")

The confidence level is almost always HIGH (not sure where this field comes from), though it looks like we struggled with diabetic-friendly and gluten-free.

In [None]:
stacked_bar(df, "confidence")