# 05. Evaluate with Agent-as-Judge

This notebook demonstrates how to use the `make_judge` API with the `{{trace}}` variable to create an "agent-as-judge". This allows the judge to evaluate the full execution trace of the agent, not just the inputs and outputs.

Agent-as-judge support is relatively new and still a work in progress, so the behavior shown here may change.

In [None]:
%run ./00_setup.ipynb

## Load evaluation data as records

In [None]:
eval_dataset = mlflow.genai.datasets.get_dataset(
    name=f"{CATALOG}.{SCHEMA}.{EVAL_TABLE}",
)

eval_records = eval_dataset.to_df()[["inputs", "expectations"]].to_dict(
    orient="records"
)
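
The run cell later reads `eval_records[0]["inputs"]["query"]`, so each record needs an `inputs` dict with a `query` string. A quick sanity check might look like the sketch below. `check_records` is a hypothetical helper, not part of MLflow, and the sample records are made up for illustration.

```python
def check_records(records):
    """Return the indices of records that lack a string inputs["query"]."""
    bad = []
    for i, rec in enumerate(records):
        inputs = rec.get("inputs")
        if not isinstance(inputs, dict) or not isinstance(inputs.get("query"), str):
            bad.append(i)
    return bad


# Illustrative records only; real ones come from eval_dataset.to_df().
sample_records = [
    {"inputs": {"query": "Lease text ..."}, "expectations": {"lessee": "Example Tenant"}},
    {"inputs": {}, "expectations": {}},
]

print(check_records(sample_records))  # → [1]
```

Running `check_records(eval_records)` before the evaluation would surface malformed rows early rather than as a `KeyError` further down.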

## Define Agent-as-Judge

We define a custom judge that has access to the `{{trace}}` variable. This variable contains the JSON representation of the agent's execution trace.

In [None]:
from mlflow.genai.judges import make_judge

JUDGE_MODEL = "databricks:/databricks-claude-sonnet-4"

agent_trace_judge = make_judge(
    name="agent_trace_quality",
    instructions="""Evaluate the quality of {{ trace }} for the following:
    1. Was the response format used to extract entities from the raw text input?
    2. Did the extraction include all the entities (
        "start_date", "end_date", "leased_space", "lessee", "lessor", "signing_date",
        "term_of_payment", "designated_use", "extension_period", "expiration_date_of_lease")?
    Your response must be a boolean: yes (if the trace satisfies both criteria) or no.""",
    model=JUDGE_MODEL,
)
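
The second criterion in the judge instructions can also be checked deterministically, which is a useful cross-check on the judge's verdict. The sketch below assumes the agent's extraction is available as a flat dict keyed by entity name; that shape is an assumption, and `missing_entities` is a hypothetical helper.

```python
# Entity names copied from the judge instructions above.
EXPECTED_ENTITIES = (
    "start_date", "end_date", "leased_space", "lessee", "lessor",
    "signing_date", "term_of_payment", "designated_use",
    "extension_period", "expiration_date_of_lease",
)


def missing_entities(extraction):
    """Return the expected entity names that are absent or empty in `extraction`."""
    return [
        name for name in EXPECTED_ENTITIES
        if extraction.get(name) in (None, "")
    ]


# Illustrative extraction only; assumes a flat dict keyed by entity name.
sample_extraction = {
    "lessee": "Example Tenant",
    "lessor": "Example Landlord",
    "start_date": "",
}

print(missing_entities(sample_extraction))
```

An empty result means every expected entity is present, so a "no" from the judge would then point at the first criterion rather than the second.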

## Run evaluation

In [None]:
from pprint import pprint
from IPython.display import Markdown, display

In [None]:
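# Run the agent on the first evaluation record so that a trace is recorded.
# extract_lease_data is expected to come from 00_setup.
result = extract_lease_data(eval_records[0]["inputs"]["query"])

# Fetch the trace produced by the call above.
sample_trace_id = mlflow.get_last_active_trace_id()
sample_trace = mlflow.get_trace(sample_trace_id)

# Ask the judge to evaluate the full trace.
feedback = agent_trace_judge(trace=sample_trace)

display(Markdown(f"**Performance Rating:** {feedback.value}"))
display(Markdown(f"**Analysis:** {feedback.rationale}"))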
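
The instructions ask the judge for a yes/no answer, but this notebook does not confirm whether `feedback.value` comes back as a string or a Python boolean. A small normalizer that accepts either could be sketched as follows. `verdict_to_bool` is a hypothetical helper, and the set of accepted spellings is an assumption.

```python
def verdict_to_bool(value):
    """Map a judge verdict such as "yes"/"no" (or a bool) to a Python bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"yes", "true"}:
        return True
    if text in {"no", "false"}:
        return False
    # Fail loudly instead of guessing at an unexpected verdict.
    raise ValueError(f"Unrecognized judge verdict: {value!r}")


print(verdict_to_bool("Yes"))  # → True
print(verdict_to_bool("no"))   # → False
```

With this in place, `verdict_to_bool(feedback.value)` would give a value that can be counted or averaged once the judge runs over more than one record.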