Evaluation Default evaluator #5092
Conversation
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
if not _matplotlib_initialized:
    import matplotlib.pyplot as pyplot

    pyplot.rcParams["figure.dpi"] = 144
I don't think we should mutate rcParams. Can we just specify dpi and figsize when saving plots?
It seems we cannot. We would need to set it when creating the figure, and doing that would require modifying the shap source code.
Maybe we could try this:
https://stackoverflow.com/a/35394637
Yes, I just found it as well. We can use with matplotlib.rc_context(...).
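As a minimal sketch of the rc_context approach (the dpi and figsize values here are illustrative assumptions, not the PR's exact settings):

import matplotlib
import matplotlib.pyplot as pyplot

# rc_context applies the overrides only inside the block; the global
# rcParams are restored automatically on exit, so nothing is mutated.
with matplotlib.rc_context({"figure.dpi": 288, "figure.figsize": (6.0, 4.0)}):
    pyplot.plot([0, 1], [0, 1])
    pyplot.savefig("example_plot.png")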
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
_last_failed_evaluator = None


def _get_last_failed_evaluator():
Where do we use this function?
In the usage log, we want to record which evaluator an error was raised from (TODO), for debugging purposes, so I added this.
Why do we only store the last failed evaluator?
The evaluate API raises an error as soon as it hits the first failed evaluator (the API is designed that way): if any evaluator fails, evaluate raises an error; otherwise it returns all of the merged successful results.
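A hypothetical sketch of the pattern being described; apart from _get_last_failed_evaluator, the names and driver loop below are assumptions for illustration:

_last_failed_evaluator = None


def _get_last_failed_evaluator():
    # Return the name of the evaluator that raised most recently, for usage logging.
    return _last_failed_evaluator


def _evaluate_all(model, evaluators):
    global _last_failed_evaluator
    _last_failed_evaluator = None
    merged_results = {}
    for name, evaluator in evaluators.items():
        try:
            merged_results.update(evaluator.evaluate(model))
        except Exception:
            # Record which evaluator failed before re-raising, so the usage
            # logger can report it; the first failure aborts the whole run.
            _last_failed_evaluator = name
            raise
    return merged_results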
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
sampled_X = sampled_X.rename(columns=truncated_feature_name_map, copy=False)

if algorithm:
    supported_algos = ['exact', 'permutation', 'partition']
Note: in shap 0.40, the shap.Explainer algorithm argument only works reliably when set to 'exact', 'permutation', or 'partition'; the remaining algorithms (listed in the shap docs) were all buggy in my testing. So I limit it to these three values.
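For reference, a rough sketch of passing the algorithm argument to shap.Explainer (the model and data are placeholders, not the PR's code):

import shap
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Per the note above, only 'exact', 'permutation', and 'partition' behaved
# reliably in shap 0.40.
explainer = shap.Explainer(model.predict, X, algorithm="permutation")
shap_values = explainer(X)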
try:
    pyplot.clf()
    do_plot()
    pyplot.savefig(artifact_file_local_path)
Can we set the dpi value to something higher than the default of 100? The images look really blurry and out of place in the tracking UI. 300 might be a better option.
pyplot.savefig(artifact_file_local_path, dpi=300)
Or perhaps save in .svg format to ensure the plots scale well at different browser zoom levels?
I increased the dpi to 288 (previously, in Databricks notebooks, it was 72).
Saving the figure as SVG is not a good choice: the Pillow package does not support SVG, and we need to return the artifact as a Pillow Image instance.
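A minimal sketch of the PNG-plus-Pillow constraint described above (the file name and plot are placeholders):

import matplotlib.pyplot as pyplot
from PIL import Image

pyplot.plot([0, 1], [0, 1])
# Save as PNG at the higher dpi; Pillow cannot load SVG, and the artifact
# must be returned as a PIL Image instance.
pyplot.savefig("artifact.png", dpi=288)
image_artifact = Image.open("artifact.png")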
sk_metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=self.label_list,
).plot(cmap="Blues")
We need to scale these plots for class counts > 6. The multiclass example with 10 classes is illegible, with numeric values wrapping and overwriting one another. Setting figsize overrides on the canvas and changing the dpi at figure save time will produce a much clearer visualization.
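A hypothetical sketch of that suggestion; the size formula, dpi, and dummy data below are assumptions, not an agreed design:

import matplotlib.pyplot as pyplot
import numpy as np
import sklearn.metrics as sk_metrics

label_list = list(range(10))  # e.g. a 10-class problem
confusion_matrix = np.random.randint(0, 50, size=(10, 10))

# Grow the canvas with the class count so cell annotations stop overlapping.
side = max(6.0, 0.8 * len(label_list))
fig, ax = pyplot.subplots(figsize=(side, side))
sk_metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=label_list,
).plot(cmap="Blues", ax=ax)
fig.savefig("confusion_matrix.png", dpi=288)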
The autologging code also generates a confusion matrix plot for the training dataset and has a similar issue. Let's do a follow-up PR to improve both cases.
Created a follow-up ticket for this:
https://databricks.atlassian.net/browse/ML-19817
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
LGTM. Nice implementation! Sorry for the delayed review, Weichen.
def evaluate(
    *,
What does the "*" mean here? Do you intend to make it a nameless positional arg like *args? It may be worth spelling it out even if it's not used, and putting it at the end of the argument list, e.g.
def _evaluate(model, model_type, dataset, actual_run_id, evaluator_name_list, evaluator_name_to_conf_map, *unused_args):
No. The * is Python syntax, not an argument: it means all of the following arguments must be keyword-only arguments. https://www.python.org/dev/peps/pep-3102/
Making all arguments keyword-only is a proposal raised by @dbczumar, so that in the future, if the MLflow containerized model provides a "model_type" property, we can remove the "model_type" argument here while keeping backwards compatibility.
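A small illustration of the bare * (the function body and calls are made up for demonstration):

def evaluate(*, model, model_type, dataset):
    # Everything after the bare `*` is keyword-only (PEP 3102).
    return f"evaluating {model} ({model_type}) on {dataset}"


evaluate(model="m", model_type="classifier", dataset="d")  # OK
# evaluate("m", "classifier", "d")  # TypeError: takes 0 positional arguments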
Wait, removing the model_type argument is a breaking change, right?
In the future, we can change the API to be:

def evaluate(
    *, model, dataset, run_id, evaluators=None, evaluator_config=None, **kwargs
):
    if "model_type" in kwargs:
        model_type = kwargs["model_type"]
        logger.warning("deprecation: ....")
    else:
        model_type = model.model_type
    ...

Then we can keep backwards compatibility.
I didn't know about this usage pattern. Interesting!
Another way to keep backward compatibility is to use a keyword arg for model_type, e.g.

def evaluate(
    model, dataset, run_id, model_type=None, evaluators=None, evaluator_config=None, **kwargs
):

What's the downside of going this more "conventional" way of specifying optional arguments?
In the current evaluate implementation, the model_type argument is required. In the future we will make it optional and deprecated, and then remove the argument.
@@ -496,33 +685,72 @@ def evaluate(
        a nested dictionary whose key is the evaluator name.
    :return: An :py:class:`mlflow.models.evaluation.EvaluationDataset` instance containing
        evaluation results.
Nice! I really like that the documentation here spells out the expected outcomes and limitations. Hopefully we'll make it shorter and easier to navigate over time.
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
What changes are proposed in this pull request?
Prototype of the Evaluation Default evaluator.
Test code:
How is this patch tested?
(Details)
Does this PR change the documentation?
1. Check the ci/circleci: build_doc check. If it's successful, proceed to the next step, otherwise fix it.
2. Click Details on the right to open the job page of CircleCI.
3. Click the Artifacts tab.
4. Click docs/build/html/index.html.
.Release Notes
Is this a user-facing change?
Implement built-in default evaluator for model evaluation.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes