
Support model evaluation with an endpoint URL #11262

Merged
merged 11 commits into mlflow:master from B-Step62:evaluate-on-endpoint on Mar 1, 2024

Conversation

@B-Step62 (Collaborator) commented Feb 28, 2024

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11262/merge

Checkout with GitHub CLI

gh pr checkout 11262

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

This change enhances the mlflow.evaluate() function to accept an endpoint URL as the model argument. We have started to receive frequent requests for this from users who want to evaluate proprietary model endpoints or Databricks-served models. Without this feature, they need to implement a custom predict function, which is not especially difficult but is less convenient.

Important notes

  • We only support OpenAI-compatible chat/completion models, to avoid asking users for too many configurations.
  • The input dataset should contain a single text input, and MLflow constructs the chat/completion request from it. This limits some input flexibility, e.g. a system prompt for chat models, but is consistent with other flavors like OpenAI and Transformers. More complex use cases should be handled by a custom function (see the sketch after these notes).
  • Inference params are not supported in the current API, but they are indispensable for chat/completion models. This PR adds a new argument inference_params to the API; it is only used for endpoint evaluation, not for other model types. We can follow up to support inference params for all model types.
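
For context, below is roughly what users previously had to write by hand: a minimal sketch of a custom predict function that posts to an OpenAI-compatible completion endpoint. The endpoint URL, token, and response shape here are illustrative assumptions, not part of this PR.

import pandas as pd
import requests

# Hypothetical endpoint; replace with your own serving endpoint URL.
ENDPOINT_URL = "https://example.com/serving-endpoints/my-model/invocations"


def predict_fn(inputs: pd.DataFrame) -> list[str]:
    outputs = []
    for prompt in inputs["inputs"]:
        resp = requests.post(
            ENDPOINT_URL,
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={"prompt": prompt, "max_tokens": 100},  # completion-format payload
        )
        resp.raise_for_status()
        # Extract the generated text from the completion response
        outputs.append(resp.json()["choices"][0]["text"])
    return outputs

Such a callable can then be passed as model=predict_fn to mlflow.evaluate(); the endpoint support in this PR folds that boilerplate into MLflow itself.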

Example

import pandas as pd

import mlflow

eval_data = pd.DataFrame(
    {
        # Input data must be a string column named "inputs".
        "inputs": [
            "Write 3 reasons why you should use MLflow?",
            "Can you explain the difference between classification and regression?",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        # URL of the OpenAI-compatible chat/completion endpoint
        model="https://example.com/serving-endpoints/databricks-mpt-30b-instruct/invocations",
        data=eval_data,
        # New argument for passing inference params
        inference_params={"max_tokens": 100, "temperature": 0.0},
        # New argument for passing auth info
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        # By specifying the "completion" model type, MLflow automatically formats
        # the input text into a completion request and extracts the generated
        # text from the completion response.
        model_type="completion",
    )
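
The returned EvaluationResult can then be inspected as with any other mlflow.evaluate() call; the attribute names below follow the standard API:

# Aggregate metrics computed by the default evaluator
print(results.metrics)
# Per-row inputs, outputs, and metric values
eval_table = results.tables["eval_results_table"]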

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (WIP)

Tested in Databricks. Sample code for one test case:

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform ....",
            "Apache Spark is an open-source, ...",
        ],
    }
)


with mlflow.start_run(run_name="mlflow-eval-chat") as run:
    results = mlflow.evaluate(
        # Pass the URL of an OpenAI-compatible chat endpoint
        model="https://${{DB_HOST}}/serving-endpoints/databricks-mixtral-8x7b-instruct/invocations",
        data=eval_data,
        targets="ground_truth",
        model_type="chat",
        inference_params={"max_tokens": 100, "temperature": 0.0},
        headers={"Authentication": "Bearer ${{DB_TOKEN}}"},
        extra_metrics=[mlflow.metrics.exact_match()],
    )
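
For reference, with model_type="chat" each input row is wrapped into an OpenAI-style chat payload before being posted to the endpoint. A sketch of the request shape (illustrative, not the exact implementation):

request_payload = {
    "messages": [{"role": "user", "content": "What is MLflow?"}],
    "max_tokens": 100,
    "temperature": 0.0,
}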

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

The mlflow.evaluate() API supports a model endpoint URL as an evaluation subject. This is particularly useful when you want to evaluate deployed models or proprietary APIs such as Databricks Foundation Model APIs.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs. labels Feb 28, 2024

github-actions bot commented Feb 28, 2024

Documentation preview for 20a9741 will be available when this CircleCI job completes successfully.


return False


def _hash_array_of_dict_as_bytes(data):
@B-Step62 (Collaborator, Author) commented:

note: This hashing support is required for passing a list of dictionaries (chat/completion payloads), or a similar pandas DataFrame, as an input dataset. The hashing is needed to generate the dataset digest, but the existing logic doesn't handle a list or array that contains dictionaries. The implementation is not super clean; it's just a band-aid that avoids breaking existing use cases.
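
A minimal sketch of the underlying idea (illustrative only, not MLflow's exact implementation): dicts are unhashable, so a deterministic byte representation is built by serializing with sorted keys before digesting:

import hashlib
import json


def dict_to_bytes(d: dict) -> bytes:
    # Sorted keys make the byte representation stable across key orderings.
    return json.dumps(d, sort_keys=True).encode("utf-8")


payload = [{"role": "user", "content": "What is MLflow?"}]
digest = hashlib.md5(b"".join(dict_to_bytes(d) for d in payload)).hexdigest()
print(digest[:8])  # stable short digest for the dataset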

@@ -1140,7 +1173,9 @@ def _convert_data_to_mlflow_dataset(data, targets=None, predictions=None):
     from pyspark.sql import DataFrame as SparkDataFrame

     if isinstance(data, list):
-        return mlflow.data.from_numpy(np.array(data), targets=np.array(targets))
+        return mlflow.data.from_numpy(
+            np.array(data), targets=np.array(targets) if targets else None
+        )
@B-Step62 (Collaborator, Author) commented:

note: This is a small but important change. In the original logic, the targets param was always wrapped in a numpy array when the input data is a list, so a downstream `if targets is None` check never fires. Concretely, when we pass a list as input data and leave targets as None, it raises a confusing error.

This is not specific to endpoint evaluation; however, implicitly requiring targets only for list input is confusing, so I think we should fix it in this PR.
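
A quick illustration of the failure mode (variable names are illustrative):

import numpy as np

targets = None
wrapped = np.array(targets)  # becomes array(None, dtype=object), not None
print(wrapped is None)       # False: the `if targets is None` check is defeated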

Comment on lines +451 to +461

def _hash_array_of_dict_as_bytes(data):
    # NB: If an array or list contains a dictionary element, it can't be hashed with
    # pandas.util.hash_array, so we need to hash the elements manually here. This is
    # particularly for the LLM use case where the input can be a list of dictionaries
    # (chat/completion payloads); it doesn't handle more complex cases like nested lists.
    result = b""
    for elm in data:
        if isinstance(elm, (list, np.ndarray)):
            result += _hash_array_of_dict_as_bytes(elm)
        elif isinstance(elm, dict):
            result += _hash_dict_as_bytes(elm)
A collaborator commented:

Can we move all of these hashing methods into a separate module, ideally located in mlflow.data, since these are all about hashing for use with MLflow Datasets?

@B-Step62 (Collaborator, Author) replied Feb 29, 2024:

Agree this should be in a separate module, but for the location I would still put it under the evaluation module. These methods are used for hashing various inputs (lists, numpy arrays, DataFrames) and are only used for the EvaluationDataset digest.

Now that the digest method for each flavor (e.g. TensorflowDataset) has been moved to its respective module (ref), I think we can create evaluation/dataset.py and put all these methods there. WDYT?

A collaborator replied:

I would prefer to put all of EvaluationDataset in mlflow/data to maintain consistency with other datasets - e.g. we don't put TensorflowDataset in mlflow/tensorflow

A collaborator added:

Feel free to tackle this in a follow-up ticket as long as we track it :)

@B-Step62 (Collaborator, Author) replied Feb 29, 2024:

> I would prefer to put all of EvaluationDataset in mlflow/data

Makes sense.

> Feel free to tackle this in a follow-up ticket as long as we track it :)

Yeah, let me do that. I created a JIRA for next sprint (ML-39039), but I may be able to pick it up sooner if no fun surprises happen in other tasks :)

@B-Step62 force-pushed the evaluate-on-endpoint branch 2 times, most recently from ca89855 to a62c2c3 on February 29, 2024 at 23:30
@harupy (Member) left a comment:

LGTM!

@dbczumar (Collaborator) left a comment:

LGTM with docs nits! Thanks @B-Step62

1. Add support for list input
2. Fix issue of None targets handling to unblock 1
3. Add support for a dictionary element in input data, to directly pass chat/completion payload
4. Fix issue of dictionary hashing to unblock 2
5. Other comments

@B-Step62 B-Step62 merged commit eeb1d4d into mlflow:master Mar 1, 2024
37 checks passed
@B-Step62 B-Step62 deleted the evaluate-on-endpoint branch March 1, 2024 04:51
artjen pushed a commit to artjen/mlflow that referenced this pull request Mar 26, 2024