# 04. Evaluate with `make_judge`

The `make_judge` API is new and requires MLflow 3.4.0 or later. See the [detailed documentation](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/make-judge/).

In [None]:
%run ./00_setup.ipynb

## Load evaluation data as records

In [None]:
# `mlflow`, CATALOG, SCHEMA, and EVAL_TABLE are assumed to be defined by 00_setup.ipynb
eval_dataset = mlflow.genai.datasets.get_dataset(
    name=f"{CATALOG}.{SCHEMA}.{EVAL_TABLE}",
)

eval_records = eval_dataset.to_df()[["inputs", "expectations"]].to_dict(
    orient="records"
)
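Each record built above is a plain dict with an `inputs` key and an `expectations` key. A quick shape check before running the evaluation can catch rows where one of them is missing. The sketch below is a minimal illustration under that assumption; `check_records` and the sample records are hypothetical and not part of MLflow.

```python
def check_records(records):
    """Return the indices of records that lack 'inputs' or 'expectations'."""
    required = {"inputs", "expectations"}
    bad = []
    for i, rec in enumerate(records):
        # A record is valid only if it is a dict containing both required keys
        if not isinstance(rec, dict) or not required.issubset(rec):
            bad.append(i)
    return bad


# Hypothetical toy records, for illustration only
sample_records = [
    {"inputs": {"document": "lease text"}, "expectations": {"lessee": "Example Tenant LLC"}},
    {"inputs": {"document": "lease text"}},  # missing 'expectations'
]
print(check_records(sample_records))
```

In the notebook, the same check could be applied to `eval_records` before calling `mlflow.genai.evaluate`.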

## Define custom LLM judges for each field

In [None]:
from mlflow.genai.judges import make_judge

JUDGE_MODEL = "databricks:/databricks-claude-sonnet-4"

In [None]:
# Create judges for each entity field
start_date_judge = make_judge(
    name="start_date",
    instructions="""Evaluate if the extracted lease start date is correct.
The start date must:
1. Be correctly identified from the lease agreement
2. Be in a valid date format
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

end_date_judge = make_judge(
    name="end_date",
    instructions="""Evaluate if the extracted lease end date is correct.
The end date must:
1. Be correctly identified from the lease agreement
2. Be in a valid date format
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

leased_space_judge = make_judge(
    name="leased_space",
    instructions="""Evaluate if the description of the leased space or property is accurately extracted.
The extraction should:
1. Include all relevant details about the space
2. Match the expected description
3. Not omit key identifying information

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

lessee_judge = make_judge(
    name="lessee",
    instructions="""Evaluate if the lessee (tenant) is correctly identified.
The extraction should:
1. Correctly identify the tenant from the lease agreement
2. Include the full legal name if applicable
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

lessor_judge = make_judge(
    name="lessor",
    instructions="""Evaluate if the lessor (landlord) is correctly identified.
The extraction should:
1. Correctly identify the landlord from the lease agreement
2. Include the full legal name if applicable
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

signing_date_judge = make_judge(
    name="signing_date",
    instructions="""Evaluate if the lease signing date is correctly extracted.
The signing date must:
1. Be the date when the lease agreement was signed
2. Be in a valid date format
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

term_of_payment_judge = make_judge(
    name="term_of_payment",
    instructions="""Evaluate if the payment terms and schedule are correctly identified.
The extraction should:
1. Include the payment schedule (monthly, quarterly, etc.)
2. Capture any relevant payment conditions
3. Match the expected terms

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

designated_use_judge = make_judge(
    name="designated_use",
    instructions="""Evaluate if the designated or permitted use of the leased property is correctly extracted.
The extraction should:
1. Accurately describe the permitted use
2. Include any restrictions if mentioned
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

extension_period_judge = make_judge(
    name="extension_period",
    instructions="""Evaluate if any extension or renewal period is correctly identified.
The extraction should:
1. Identify if an extension period exists
2. Include the duration if specified
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)

expiration_date_judge = make_judge(
    name="expiration_date_of_lease",
    instructions="""Evaluate if the lease expiration date is correctly extracted.
The expiration date must:
1. Be correctly identified from the lease agreement
2. Be in a valid date format
3. Match the expected value

Document: {{ inputs }}
Extracted Value: {{ outputs }}
Expected Value: {{ expectations }}

Your response must be a boolean: yes or no.""",
    model=JUDGE_MODEL,
)
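The ten judge prompts above share one layout: a task sentence, a numbered list of criteria, the three template variables, and the yes/no instruction. One way to cut the repetition is to assemble the prompt text from its parts. The sketch below is only an illustration; `build_instructions` is a hypothetical helper, not an MLflow function, and it uses plain strings rather than f-strings for the template lines so that `{{ inputs }}`, `{{ outputs }}`, and `{{ expectations }}` stay literal.

```python
def build_instructions(task, criteria_intro, criteria):
    """Assemble a judge prompt that follows the layout used in this notebook."""
    # Number the criteria starting at 1, matching the hand-written prompts
    numbered = [f"{i}. {c}" for i, c in enumerate(criteria, start=1)]
    lines = [
        task,
        criteria_intro,
        *numbered,
        "",
        "Document: {{ inputs }}",
        "Extracted Value: {{ outputs }}",
        "Expected Value: {{ expectations }}",
        "",
        "Your response must be a boolean: yes or no.",
    ]
    return "\n".join(lines)


lessee_instructions = build_instructions(
    task="Evaluate if the lessee (tenant) is correctly identified.",
    criteria_intro="The extraction should:",
    criteria=[
        "Correctly identify the tenant from the lease agreement",
        "Include the full legal name if applicable",
        "Match the expected value",
    ],
)
print(lessee_instructions)
```

The resulting string could then be passed as `instructions=` to `make_judge`, together with `name` and `model` as in the cell above.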

## Run evaluation with custom judges

In [None]:
# Collect all judges in a list for evaluation
field_judges = [
    start_date_judge,
    end_date_judge,
    leased_space_judge,
    lessee_judge,
    lessor_judge,
    signing_date_judge,
    term_of_payment_judge,
    designated_use_judge,
    extension_period_judge,
    expiration_date_judge,
]

# Run evaluation with all field judges
# (`extract_lease_data` is assumed to be defined by 00_setup.ipynb)
with mlflow.start_run(run_name="Eval with make_judge"):
    mlflow.genai.evaluate(
        data=eval_records,
        predict_fn=extract_lease_data,
        scorers=field_judges,
    )
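Because every judge answers yes or no, a per-field pass rate is a natural summary of the run. The sketch below assumes the verdicts have already been gathered into a plain dict of lists keyed by judge name; `pass_rates` and the sample verdicts are hypothetical, and this does not represent the return type of `mlflow.genai.evaluate`.

```python
def pass_rates(verdicts_by_field):
    """Compute the fraction of 'yes' verdicts for each field."""
    rates = {}
    for field, verdicts in verdicts_by_field.items():
        if not verdicts:
            # No verdicts recorded for this field, so no rate can be computed
            rates[field] = None
            continue
        # Compare case-insensitively and ignore surrounding whitespace
        passed = sum(1 for v in verdicts if str(v).strip().lower() == "yes")
        rates[field] = passed / len(verdicts)
    return rates


# Hypothetical verdicts, for illustration only
sample_verdicts = {
    "start_date": ["yes", "yes", "no", "yes"],
    "lessee": ["no", "Yes"],
}
print(pass_rates(sample_verdicts))
```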