
Support model evaluation with an endpoint URL #11262

Merged
merged 11 commits into mlflow:master from B-Step62:evaluate-on-endpoint on Mar 1, 2024

Conversation

@B-Step62 (Collaborator) commented Feb 28, 2024

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11262/merge

Checkout with GitHub CLI

gh pr checkout 11262

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

This change enhances the mlflow.evaluate() function to accept an endpoint URL as the model argument. We have started to receive frequent requests for this from users who want to evaluate proprietary model endpoints or Databricks-served models. Without this feature, they need to implement a custom predict function, which is not especially difficult but is less convenient.

Important notes

  • We only support OpenAI-compatible chat/completion models, to avoid asking users for too many configurations.
  • The input dataset should contain a single text input, and MLflow constructs the chat/completion request from it. This limits some input flexibility, e.g. a system prompt for chat models, but is consistent with other flavors like OpenAI and Transformers. More complex use cases should be handled by a custom function (see the sketch after these notes).
  • Inference params are not supported in the current API, but they are indispensable for chat/completion models. This PR adds a new argument inference_params to the API; it is only used for endpoint evaluation, not for other model types. We can follow up to support inference params for all model types.
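
For context, below is roughly what users previously had to write by hand: a minimal sketch of a custom predict function that posts to an OpenAI-compatible completion endpoint. The endpoint URL, token, and response shape here are illustrative assumptions, not part of this PR.

import pandas as pd
import requests

# Hypothetical endpoint; replace with your own serving endpoint URL.
ENDPOINT_URL = "https://example.com/serving-endpoints/my-model/invocations"


def predict_fn(inputs: pd.DataFrame) -> list[str]:
    outputs = []
    for prompt in inputs["inputs"]:
        resp = requests.post(
            ENDPOINT_URL,
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={"prompt": prompt, "max_tokens": 100},  # completion-format payload
        )
        resp.raise_for_status()
        # Extract the generated text from the completion response
        outputs.append(resp.json()["choices"][0]["text"])
    return outputs

Such a callable can then be passed as model=predict_fn to mlflow.evaluate(); the endpoint support in this PR folds that boilerplate into MLflow itself.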

Example

import pandas as pd

import mlflow

eval_data = pd.DataFrame(
    {
        # Input data must be a string column named "inputs".
        "inputs": [
            "Write 3 reasons why you should use MLflow?",
            "Can you explain the difference between classification and regression?",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        # URL of the OpenAI-compatible chat/completion endpoint
        model="https://example.com/serving-endpoints/databricks-mpt-30b-instruct/invocations",
        data=eval_data,
        # New argument for passing inference params
        inference_params={"max_tokens": 100, "temperature": 0.0},
        # New argument for passing auth info
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        # By specifying the "completion" model type, MLflow automatically formats
        # the input text into a completion request and extracts the generated
        # text from the completion response.
        model_type="completion",
    )
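
The returned EvaluationResult can then be inspected as with any other mlflow.evaluate() call; the attribute names below follow the standard API:

# Aggregate metrics computed by the default evaluator
print(results.metrics)
# Per-row inputs, outputs, and metric values
eval_table = results.tables["eval_results_table"]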

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (WIP)

Tested in Databricks. Sample code for one test case:

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform ....",
            "Apache Spark is an open-source, ...",
        ],
    }
)


with mlflow.start_run(run_name="mlflow-eval-chat") as run:
    results = mlflow.evaluate(
        # Pass the URL of an OpenAI-compatible chat endpoint
        model="https://${{DB_HOST}}/serving-endpoints/databricks-mixtral-8x7b-instruct/invocations",
        data=eval_data,
        targets="ground_truth",
        model_type="chat",
        inference_params={"max_tokens": 100, "temperature": 0.0},
        headers={"Authentication": "Bearer ${{DB_TOKEN}}"},
        extra_metrics=[mlflow.metrics.exact_match()],
    )
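
For reference, with model_type="chat" each input row is wrapped into an OpenAI-style chat payload before being posted to the endpoint. A sketch of the request shape (illustrative, not the exact implementation):

request_payload = {
    "messages": [{"role": "user", "content": "What is MLflow?"}],
    "max_tokens": 100,
    "temperature": 0.0,
}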

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

The mlflow.evaluate() API supports a model endpoint URL as an evaluation subject. This is particularly useful when you want to evaluate deployed models or proprietary APIs such as Databricks Foundation Model APIs.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs. labels Feb 28, 2024

github-actions bot commented Feb 28, 2024

Documentation preview for 20a9741 will be available when this CircleCI job completes successfully.


return False


def _hash_array_of_dict_as_bytes(data):
@B-Step62 (Collaborator, Author) commented:

note: This hashing support is required for passing a list of dictionaries (chat/completion payloads), or a similar pandas DataFrame, as an input dataset. The hashing is needed to generate the dataset digest, but the existing logic doesn't handle a list or array that contains dictionaries. The implementation is not super clean; it's just a band-aid that avoids breaking existing use cases.
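
A minimal sketch of the underlying idea (illustrative only, not MLflow's exact implementation): dicts are unhashable, so a deterministic byte representation is built by serializing with sorted keys before digesting:

import hashlib
import json


def dict_to_bytes(d: dict) -> bytes:
    # Sorted keys make the byte representation stable across key orderings.
    return json.dumps(d, sort_keys=True).encode("utf-8")


payload = [{"role": "user", "content": "What is MLflow?"}]
digest = hashlib.md5(b"".join(dict_to_bytes(d) for d in payload)).hexdigest()
print(digest[:8])  # stable short digest for the dataset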

@@ -1140,7 +1173,9 @@ def _convert_data_to_mlflow_dataset(data, targets=None, predictions=None):
     from pyspark.sql import DataFrame as SparkDataFrame

     if isinstance(data, list):
-        return mlflow.data.from_numpy(np.array(data), targets=np.array(targets))
+        return mlflow.data.from_numpy(
+            np.array(data), targets=np.array(targets) if targets else None
+        )
@B-Step62 (Collaborator, Author) commented:

note: This is a small but important change. In the original logic, the targets param was always wrapped in a numpy array when the input data is a list, so a downstream `if targets is None` check never fires. Concretely, when we pass a list as input data and leave targets as None, it raises a confusing error.

This is not specific to endpoint evaluation; however, implicitly requiring targets only for list input is confusing, so I think we should fix it in this PR.
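
A quick illustration of the failure mode (variable names are illustrative):

import numpy as np

targets = None
wrapped = np.array(targets)  # becomes array(None, dtype=object), not None
print(wrapped is None)       # False: the `if targets is None` check is defeated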

Comment on lines +451 to +461

def _hash_array_of_dict_as_bytes(data):
    # NB: If an array or list contains a dictionary element, it can't be hashed with
    # pandas.util.hash_array, so we need to hash the elements manually here. This is
    # particularly for the LLM use case where the input can be a list of dictionaries
    # (chat/completion payloads); it doesn't handle more complex cases like nested lists.
    result = b""
    for elm in data:
        if isinstance(elm, (list, np.ndarray)):
            result += _hash_array_of_dict_as_bytes(elm)
        elif isinstance(elm, dict):
            result += _hash_dict_as_bytes(elm)
A collaborator commented:

Can we move all of these hashing methods into a separate module, ideally located in mlflow.data, since these are all about hashing for use with MLflow Datasets?

@B-Step62 (Collaborator, Author) replied Feb 29, 2024:

Agree this should be in a separate module, but for the location I would still put it under the evaluation module. These methods are used for hashing various inputs (lists, numpy arrays, DataFrames) and are only used for the EvaluationDataset digest.

Now that the digest method for each flavor (e.g. TensorflowDataset) has been moved to its respective module (ref), I think we can create evaluation/dataset.py and put all these methods there. WDYT?

A collaborator replied:

I would prefer to put all of EvaluationDataset in mlflow/data to maintain consistency with other datasets - e.g. we don't put TensorflowDataset in mlflow/tensorflow

A collaborator added:

Feel free to tackle this in a follow-up ticket as long as we track it :)

@B-Step62 (Collaborator, Author) replied Feb 29, 2024:

> I would prefer to put all of EvaluationDataset in mlflow/data

Makes sense.

> Feel free to tackle this in a follow-up ticket as long as we track it :)

Yeah, let me do that. I created a JIRA for next sprint (ML-39039), but I may be able to pick it up sooner if no fun surprises happen in other tasks :)

@B-Step62 force-pushed the evaluate-on-endpoint branch 2 times, most recently from ca89855 to a62c2c3 on February 29, 2024 at 23:30
@harupy (Member) left a comment:

LGTM!

@dbczumar (Collaborator) left a comment:

LGTM with docs nits! Thanks @B-Step62

1. Add support for list input
2. Fix issue of None targets handling to unblock 1
3. Add support for a dictionary element in input data, to directly pass chat/completion payload
4. Fix issue of dictionary hashing to unblock 2
5. Other comments

@B-Step62 B-Step62 merged commit eeb1d4d into mlflow:master Mar 1, 2024
37 checks passed
@B-Step62 B-Step62 deleted the evaluate-on-endpoint branch March 1, 2024 04:51
artjen pushed a commit to artjen/mlflow that referenced this pull request Mar 26, 2024