


# Evaluate Model Performance using LLM-as-a-Judge with MLflow

In this demo, we will learn how to implement LLM-as-a-Judge to evaluate model performance using the mlflow.evaluate API.

**Learning Objectives**

*By the end of this demo, you will be able to:*

* Compute standard QA metrics using mlflow.evaluate

* Use built-in LLM-as-a-Judge to calculate answer similarity metrics using mlflow.evaluate

* Define custom LLM-as-a-Judge metrics using mlflow.evaluate

## Setup Demo Environment

### Install dependencies and configure environment

In [0]:
%pip install mlflow databricks_genai_inference==0.2.3 evaluate torch transformers textstat
dbutils.library.restartPython()

### Reset MLflow Experiment

In [0]:
import mlflow

# Default MLflow experiment name in Databricks is /Users/{username}/{notebook filename}
notebook_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
experiment = mlflow.get_experiment_by_name(notebook_name)
print(experiment)

from mlflow.utils.databricks_utils import get_databricks_host_creds
from mlflow.utils.request_utils import augmented_raise_for_status
from mlflow.utils.rest_utils import http_request
import time

def delete_all_runs(experiment_id: str) -> int:
    """
    Bulk delete all runs in an experiment.
    
    :param experiment_id: The ID of the experiment containing the runs to delete.
    :return: The number of runs deleted.
    """
    # Current time in milliseconds
    max_timestamp_millis = int(time.time() * 1000)
    
    json_body = {
        "experiment_id": experiment_id, 
        "max_timestamp_millis": max_timestamp_millis,
        "max_runs": 10000  # Maximum allowed value
    }
    
    response = http_request(
        host_creds=get_databricks_host_creds(),
        endpoint="/api/2.0/mlflow/databricks/runs/delete-runs",
        method="POST",
        json=json_body,
    )
    
    augmented_raise_for_status(response)
    return response.json()["runs_deleted"]

# get_experiment_by_name() returns None when no experiment exists yet
if experiment is not None:
    deleted_runs_count = delete_all_runs(experiment.experiment_id)
    print(f"Deleted {deleted_runs_count} runs.")

## Basic Question-Answering Evaluation
Create a test dataset with two columns: inputs, the questions that will be passed to the model, and ground_truth, the reference answers that the model's generated outputs will be compared against.

In [0]:
import mlflow
import mlflow.deployments
import pandas as pd
import mlflow.metrics.genai

mlflow.deployments.set_deployments_target("databricks")

In [0]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)

Use the DBRX Instruct model serving endpoint as the model that answers each question. Call mlflow.evaluate() with the endpoint URI and the evaluation dataframe.



In [0]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-dbrx-instruct",
        data=eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

Inspect the evaluation results table as a dataframe to see row-by-row metrics to further assess model performance.

In [0]:
results.tables["eval_results_table"]
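
To make the row-by-row numbers easier to interpret, here is a minimal sketch of an exact-match score, one of the metrics commonly reported for question-answering evaluation. The function name `exact_match_accuracy` and the sample strings are hypothetical, and this is an illustration of the idea, not MLflow's implementation.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that are string-identical to their target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    matches = sum(pred == target for pred, target in zip(predictions, targets))
    return matches / len(predictions)


# Hypothetical model outputs and references, for illustration only
predictions = ["A hook that runs after render.", "It always runs."]
targets = ["A hook that runs after render.", "The finally block always runs."]
print(exact_match_accuracy(predictions, targets))  # 0.5
```

Free-form answers rarely match a reference word for word, so a score like this tends to be near zero even for good answers. That gap is what LLM-judged metrics are meant to close.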

## LLM-judged correctness with DBRX
Construct an answer similarity metric using the answer_similarity() metric factory function.

In [0]:
from mlflow.metrics.genai import EvaluationExample, answer_similarity

# Create an example that shows what a graded answer looks like for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "targets": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)

# Construct the metric using DBRX as the judge
answer_similarity_metric = answer_similarity(model="endpoints:/databricks-dbrx-instruct", examples=[example])

print(answer_similarity_metric)

Call mlflow.evaluate() again, this time with the new answer_similarity_metric.

In [0]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-dbrx-instruct",
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric],  # use the answer similarity metric created above
    )
results.metrics

In [0]:
results.tables["eval_results_table"]
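
The following sketch illustrates the general LLM-as-a-Judge pattern behind a metric like this: the judge receives the question, the model's answer, the target, and a graded example, and replies with a score and a justification that must be parsed. The prompt wording, the reply format, and the helper names `build_judge_prompt` and `parse_judge_reply` are all assumptions made for illustration; they are not MLflow's actual template or parser, and no endpoint is called.

```python
import re


def build_judge_prompt(question, answer, target, example):
    """Assemble a grading prompt. The wording is hypothetical, not MLflow's template."""
    return (
        "Rate how similar the answer is to the target on a 1-5 scale.\n"
        f"Example input: {example['input']}\n"
        f"Example output: {example['output']}\n"
        f"Example target: {example['target']}\n"
        f"Example score: {example['score']}\n"
        f"Example justification: {example['justification']}\n\n"
        f"Input: {question}\n"
        f"Answer: {answer}\n"
        f"Target: {target}\n"
        "Reply in the form 'score: <1-5>' then 'justification: <text>'."
    )


def parse_judge_reply(reply):
    """Return (score, justification), or (None, reply) if the reply is malformed."""
    match = re.search(
        r"score:\s*([1-5])\b.*?justification:\s*(.+)",
        reply,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return None, reply.strip()
    return int(match.group(1)), match.group(2).strip()


# A hand-written reply standing in for a real judge response
reply = "score: 4\njustification: Covers the main points but omits the developer."
score, justification = parse_judge_reply(reply)
print(score, justification)
```

Returning `None` for a malformed reply keeps a single bad judge response from being silently counted as a valid score.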

## Custom LLM-judged metric for professionalism
Create a custom metric that scores the professionalism of the model outputs. Use make_genai_metric with a metric definition, a grading prompt, a grading example, and the judge model configuration.

In [0]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="endpoints:/databricks-dbrx-instruct",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(professionalism_metric)
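
The aggregations argument controls how per-row judge scores are summarized into run-level metrics. As a rough sketch of what "mean", "variance", and "p90" report, the snippet below computes them for a made-up list of scores. It assumes population variance and a linearly interpolated 90th percentile, which may differ from the exact conventions MLflow uses; the helper names are hypothetical.

```python
import math
import statistics


def percentile(values, q):
    """Percentile of a non-empty list, with linear interpolation (q in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def aggregate_scores(scores):
    """Summarize per-row scores with the three aggregations named above."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # population variance (assumption)
        "p90": percentile(scores, 90),
    }


# Made-up professionalism scores for five answers
scores = [2, 3, 4, 4, 5]
summary = aggregate_scores(scores)
print(summary)
```

For these scores the mean is 3.6, the variance is 1.04, and the 90th percentile is about 4.6. A high p90 with a low mean would suggest that only a few answers reach a formal tone.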

Call mlflow.evaluate with your new professionalism metric.

In [0]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-dbrx-instruct",
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric],  # use the professionalism metric we created above
    )
results.metrics

In [0]:
results.tables["eval_results_table"]

## Conclusion
In this demo, we showed how to evaluate LLMs using the standard metrics that mlflow.evaluate computes for question answering, such as exact match. These metrics apply only to specific tasks, which motivates custom metrics that align with your business use case. We also showed how LLMs can be used to evaluate the output of other LLMs.


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>