
Add pos_label to mlflow.evaluate #6696

Merged

harupy merged 29 commits into mlflow:master on Sep 9, 2022

Conversation

harupy (Member) commented on Sep 5, 2022:

Signed-off-by: harupy <hkawamura0130@gmail.com>

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Add pos_label to mlflow.evaluate to provide a way to specify the positive label to use when computing metrics for binary classification.

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

  • Unit tests

  • Manual test

import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from pprint import pprint
import warnings

warnings.simplefilter("ignore")
client = mlflow.MlflowClient()

# Binary
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="binary",
        evaluators="default",
        evaluator_config={"log_model_explainability": False},
    )

print(f"Binary: pos_label = None")
pprint(client.get_run(run.info.run_id).data.metrics)

print("\n" + "=" * 80 + "\n")

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="binary",
        evaluators="default",
        evaluator_config={
            "log_model_explainability": False,
            "pos_label": 1,
        },
    )

print(f"Binary: pos_label = 1s")
pprint(client.get_run(run.info.run_id).data.metrics)

print("\n" + "=" * 80 + "\n")

# Multi-class
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="multiclass",
        evaluators="default",
        evaluator_config={"log_model_explainability": False},
    )

print(f"Multi-class: pos_label = None")
pprint(client.get_run(run.info.run_id).data.metrics)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Click the Details link on the Preview docs check.
  2. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Add pos_label to mlflow.evaluate to specify the positive label to use when computing metrics for binary classification.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Sep 5, 2022
@harupy harupy marked this pull request as ready for review September 6, 2022 05:38
dbczumar (Collaborator) left a comment:

LGTM! Thanks @harupy !

@harupy harupy force-pushed the add-pos-label-evaluate-master branch from 39524de to c6a0ea7 on September 6, 2022 22:29
@@ -147,7 +147,7 @@ def _get_binary_sum_up_label_pred_prob(positive_class_index, positive_class, y,
     return y_bin, y_pred_bin, y_prob_bin


-def _get_classifier_per_class_metrics(y, y_pred):
+def _get_classifier_per_class_metrics(y, y_pred, *, pos_label=1):
Contributor:

Why are we defaulting to 1? I'm looking at https://github.com/mlflow/mlflow/pull/6117/files, and it seems that in that PR we default to None?

harupy (Member Author), Sep 7, 2022:

@jimmyxu-db I chose 1 to preserve the current behavior of the default evaluator.

In mlflow.sklearn.eval_and_log_metrics, pos_label is used for both binary and multi-class classification models. pos_label = None with average = 'weighted' works for both model types.
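
A minimal sketch (not part of this PR) of that point, using toy labels: sklearn's scorers accept average='weighted' for both binary and multi-class targets, while average='binary' requires binary targets and a pos_label.

from sklearn.metrics import recall_score

y_true_bin, y_pred_bin = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
y_true_multi, y_pred_multi = [0, 1, 2, 2, 1], [0, 2, 2, 1, 1]

# Works for both target types; pos_label is ignored when average != 'binary'.
print(recall_score(y_true_bin, y_pred_bin, average="weighted"))
print(recall_score(y_true_multi, y_pred_multi, average="weighted"))

# Requires binary targets and a positive label.
print(recall_score(y_true_bin, y_pred_bin, average="binary", pos_label=1))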

harupy (Member Author), Sep 7, 2022:

We can set the default to None (and use average = 'weighted') if that makes more sense.

@WeichenXu123 Thoughts?

Collaborator:

We can set the default to None (and use average = 'weighted') if that makes more sense.

For the binary case, should we only set average = 'binary'?

Also, pos_label = 1 is consistent with the default value of sklearn's metric functions (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), so I propose using a default value of 1.
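
A quick check (not from the PR) that this matches sklearn's default: precision_score already uses pos_label=1 unless told otherwise.

from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Identical results: pos_label=1 is sklearn's default for binary averaging.
assert precision_score(y_true, y_pred) == precision_score(y_true, y_pred, pos_label=1)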

harupy (Member Author), Sep 7, 2022:

For the binary case, should we only set average = 'binary'?

It depends on how balanced the dataset is. If it's imbalanced, other options such as weighted might make more sense.

harupy (Member Author), Sep 7, 2022:

@yxiong

How about we pass in both pos_label and average to allow users to specify them? I can see two options here:

This sounds good to me.

The downside of this is that multiclass metrics will fail.

It actually won't in the default evaluator. The default evaluator logs one-vs-rest metrics for each class, while eval_and_log_metrics just uses <sklearn_score_function>(y_true, y_pred, average="weighted", pos_label=None). @WeichenXu123 Do you know why we chose this approach?

def _get_classifier_per_class_metrics_collection_df(y, y_pred, labels):
    per_class_metrics_list = []
    for positive_class_index, positive_class in enumerate(labels):
        (y_bin, y_pred_bin, _,) = _get_binary_sum_up_label_pred_prob(
            positive_class_index, positive_class, y, y_pred, None
        )
        per_class_metrics = {"positive_class": positive_class}
        per_class_metrics.update(_get_classifier_per_class_metrics(y_bin, y_pred_bin))
        per_class_metrics_list.append(per_class_metrics)
    return pd.DataFrame(per_class_metrics_list)

The returned dataframe looks like this:

   positive_class  true_negatives  false_positives  false_negatives  true_positives    recall  precision  f1_score   roc_auc  precision_recall_auc
0               0            2774              205              137             184  0.573209   0.473008  0.518310  0.858065              0.371424
1               1            2753              209              191             147  0.434911   0.412921  0.423631  0.831169              0.385339
2               2            2752              208              158             182  0.535294   0.466667  0.498630  0.847666              0.409305
3               3            2823              153              160             164  0.506173   0.517350  0.511700  0.818308              0.510070
4               4            2784              168              263              85  0.244253   0.335968  0.282862  0.761322              0.271138
5               5            2730              238              244              88  0.265060   0.269939  0.267477  0.781107              0.319155
6               6            2791              189              207             113  0.353125   0.374172  0.363344  0.805106              0.340894
7               7            2757              220              221             102  0.315789   0.316770  0.316279  0.748290              0.268311
8               8            2799              185              232              84  0.265823   0.312268  0.287179  0.751020              0.225234
9               9            2729              233              195             143  0.423077   0.380319  0.400560  0.828694              0.388790

Contributor:

Hmm, I got a little confused. Does that mean evaluate has a different behavior than eval_and_log_metrics? In particular, I expect to see a single output from the evaluation result (which will be sorted by AutoML to select the best model), not one-output-per-label.

harupy (Member Author):

Does that mean evaluate has a different behavior than eval_and_log_metrics?

Correct.

harupy (Member Author), Sep 7, 2022:

Quick script to check the differences:

import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_classes=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")
    model_uri = mlflow.get_artifact_uri("model")
    result = mlflow.evaluate(
        model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="multiclass-classification-dataset",
        evaluators="default",
        evaluator_config={"log_model_explainability": False},
    )


with mlflow.start_run():
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")

harupy (Member Author):

@yxiong We discussed this issue. We'll update the default evaluator so that it logs the same metrics as eval_and_log_metrics, and also update the default value of pos_label to None.

@@ -147,7 +147,7 @@ def _get_binary_sum_up_label_pred_prob(positive_class_index, positive_class, y,
     return y_bin, y_pred_bin, y_prob_bin


-def _get_classifier_per_class_metrics(y, y_pred):
+def _get_classifier_per_class_metrics(y, y_pred, *, pos_label=1):
Contributor:

How about we pass in both pos_label and average to allow users to specify them? I can see two options here:

  1. Make the default consistent with sklearn (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html). The downside of this is that multiclass metrics will fail.
  2. Set the average to weighted by default so multiclass metrics won't fail, but change it back to binary when pos_label is specified. Otherwise it'll produce the wrong metrics for binary classification (see "Pass in the pos_label parameter to eval_and_log_metrics" #5807).

(2) is what we did before with eval_and_log_metrics, but I think (1) is more intuitive (see the sketch after this comment).

This is important to get right! Please confirm the approach we choose with me before merging this PR.
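
A minimal sketch (an assumption, not code from this PR or from eval_and_log_metrics) of how option (2) behaves; the helper name recall_with_fallback is hypothetical:

from sklearn.metrics import recall_score


def recall_with_fallback(y_true, y_pred, pos_label=None):
    if pos_label is not None:
        # A pos_label was given: switch to binary averaging (binary targets only).
        return recall_score(y_true, y_pred, average="binary", pos_label=pos_label)
    # No pos_label: weighted averaging works for binary and multi-class targets alike.
    return recall_score(y_true, y_pred, average="weighted")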

WeichenXu123 (Collaborator) commented on Sep 8, 2022:

  1. What about changing the pos_label config name to binary_classifier_pos_label? In this PR it only applies to the binary classifier case.

  2. The Default Evaluator also computes metrics for multinomial classifiers, see

    metrics["f1_score_micro"] = sk_metrics.f1_score(y, y_pred, average="micro", labels=labels)

    We can explicitly set the pos_label argument to None there.

  3. Let's rename the _get_classifier_per_class_metrics function to _get_binary_classification_metrics.

@harupy harupy requested review from yxiong and removed request for yxiong September 8, 2022 03:29
yxiong (Contributor) left a comment:

LGTM in general. Please add some more tests to make sure the behavior is correct.

@@ -926,6 +926,12 @@ def evaluate(
- **log_metrics_with_dataset_info**: A boolean value specifying whether or not to include
information about the evaluation dataset in the name of each metric logged to MLflow
Tracking during evaluation, default value is True.
- **pos_label**: The positive label used to compute binary classification metrics such as
precision, recall, f1, etc (default: ``None``). This parameter is only used for binary
harupy (Member Author):

Should pos_label default to 1 for binary classification models?

harupy (Member Author):

eval_and_log_metrics uses None by default regardless of the model type.

Contributor:

In eval_and_log_metrics, average is set to weighted by default, in which case pos_label doesn't matter. I think that's wrong, and we do not need to be backward compatible with that. Setting the pos_label default to 1 makes sense.

harupy (Member Author):

That makes sense, I will update the code.
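
A small check (not from the PR) of the reviewer's point above: once average='weighted', pos_label has no effect (sklearn only warns that it is ignored), whereas average='binary' makes it matter.

from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1]

# Same value either way: pos_label is ignored when average != 'binary' (sklearn emits a UserWarning).
weighted_default = precision_score(y_true, y_pred, average="weighted")
weighted_pos0 = precision_score(y_true, y_pred, average="weighted", pos_label=0)
assert weighted_default == weighted_pos0

# With average='binary', pos_label does matter.
print(precision_score(y_true, y_pred, average="binary", pos_label=1))  # precision for class 1
print(precision_score(y_true, y_pred, average="binary", pos_label=0))  # precision for class 0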

Comment on lines +929 to +935
- **pos_label**: The positive label to use when computing classification metrics such as
precision, recall, f1, etc. for binary classification models (default: ``1``). For
multiclass classification and regression models, this parameter will be ignored.
- **average**: The averaging method to use when computing classification metrics such as
precision, recall, f1, etc. for multiclass classification models
(default: ``'weighted'``). For binary classification and regression models, this
parameter will be ignored.
harupy (Member Author), Sep 8, 2022:

@yxiong @WeichenXu123 Added average and updated the pos_label doc. Could you check?

Collaborator:

I think it's fine
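
A usage sketch of the two evaluator_config keys documented above, following the mlflow.evaluate call pattern used elsewhere in this PR (the dataset and model here are illustrative only):

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    info = mlflow.sklearn.log_model(model, "model")
    mlflow.evaluate(
        info.model_uri,
        X,
        targets=y,
        model_type="classifier",
        dataset_name="iris",
        evaluators="default",
        # "average" applies to multiclass models; for a binary model, "pos_label" would be used instead.
        evaluator_config={"average": "micro", "log_model_explainability": False},
    )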

@harupy harupy requested a review from yxiong September 8, 2022 08:22
WeichenXu123 (Collaborator) left a comment:

LGTM!



@pytest.mark.parametrize("average", [None, "weighted", "macro", "micro"])
def test_evaluation_multiclass_classification_with_average(average):
harupy (Member Author):

@yxiong Added a test for average.

yxiong (Contributor) left a comment:

Thanks for addressing my comments. LGTM!

Comment on lines +152 to +160
"example_count": len(y_true),
"accuracy_score": sk_metrics.accuracy_score(y_true, y_pred),
"recall_score": sk_metrics.recall_score(
y_true, y_pred, average=average, pos_label=pos_label
),
"precision_score": sk_metrics.precision_score(
y_true, y_pred, average=average, pos_label=pos_label
),
"f1_score": sk_metrics.f1_score(y_true, y_pred, average=average, pos_label=pos_label),
harupy (Member Author), Sep 8, 2022:

Changed these metric names to align with eval_and_log_metrics.

if not is_binomial:
    metrics["f1_score_micro"] = sk_metrics.f1_score(y, y_pred, average="micro", labels=labels)
    metrics["f1_score_macro"] = sk_metrics.f1_score(y, y_pred, average="macro", labels=labels)
def _get_binary_classifier_metrics(*, y_true, y_pred, y_proba=None, labels=None, pos_label=1):
harupy (Member Author):

Enforce keyword-only arguments to avoid swapping y_true and y_pred by mistake.
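
A tiny illustration (not the PR's code) of what the bare * buys here: callers must pass the arrays by name, so swapping them positionally becomes an immediate TypeError rather than a silently wrong metric. The function name below is hypothetical:

def _example_binary_metrics(*, y_true, y_pred, pos_label=1):
    # Placeholder body; the real function computes the metrics shown above.
    return {"pos_label": pos_label, "n": len(y_true)}

_example_binary_metrics(y_true=[0, 1, 1], y_pred=[0, 0, 1])   # OK
# _example_binary_metrics([0, 1, 1], [0, 0, 1])               # TypeError: takes 0 positional arguments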

harupy (Member Author) commented on Sep 8, 2022:

@yxiong @dbczumar @WeichenXu123 Here's the comparison of logged metrics between mlflow.evaluate and mlflow.sklearn.eval_and_log_metrics:

Code
import mlflow
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import load_iris, load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from pprint import pprint
import logging
import shutil
import warnings

logging.getLogger("mlflow").setLevel(logging.CRITICAL)
warnings.simplefilter("ignore")
client = mlflow.MlflowClient()

evaluator_config = {
    "log_metrics_with_dataset_info": False,
    "log_model_explainability": False,
    "metric_prefix": "test_",
}


def divider(title, length=80):
    length = shutil.get_terminal_size(fallback=(80, 24))[0] if length is None else length
    rest = length - len(title) - 2
    left = rest // 2 if rest % 2 else (rest + 1) // 2
    return "\n{} {} {}\n".format("=" * left, title, "=" * (rest - left))


# Binary
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="breast_cancer",
        evaluators="default",
        evaluator_config=evaluator_config,
    )

print(divider("mlflow.evaluate - binary classification"))
pprint(client.get_run(run.info.run_id).data.metrics)


with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")

print(divider("mlflow.sklearn.eval_and_log_metrics - binary classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

# Multi-class
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="iris",
        evaluators="default",
        evaluator_config=evaluator_config,
    )

print(divider("mlflow.evaluate - multiclass classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")

print(divider("eval_and_log_metrics - multiclass classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

# Regression
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="regressor",
        dataset_name="diabetes",
        evaluators="default",
        evaluator_config=evaluator_config,
    )

print(divider("mlflow.evaluate - regression"))
pprint(client.get_run(run.info.run_id).data.metrics)


with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")

print(divider("mlflow.sklearn.eval_and_log_metrics - regression"))
pprint(client.get_run(run.info.run_id).data.metrics)
=================== mlflow.evaluate - binary classification ====================

{'test_accuracy_score': 0.9574468085106383,
 'test_example_count': 188.0,
 'test_f1_score': 0.9669421487603306,
 'test_false_negatives': 4.0,
 'test_false_positives': 4.0,
 'test_log_loss': 0.09432132481354148,
 'test_precision_recall_auc': 0.9976327843071228,
 'test_precision_score': 0.9669421487603306,
 'test_recall_score': 0.9669421487603306,
 'test_roc_auc': 0.9956827433082521,
 'test_score': 0.9574468085106383,
 'test_true_negatives': 63.0,
 'test_true_positives': 117.0}

========= mlflow.sklearn.eval_and_log_metrics - binary classification ==========

{'test_accuracy_score': 0.9574468085106383,
 'test_f1_score': 0.9574468085106383,
 'test_log_loss': 0.0943213248135415,
 'test_precision_score': 0.9574468085106383,
 'test_recall_score': 0.9574468085106383,
 'test_roc_auc_score': 0.9956827433082521,
 'test_score': 0.9574468085106383}

================= mlflow.evaluate - multiclass classification ==================

{'test_accuracy_score': 1.0,
 'test_example_count': 50.0,
 'test_f1_score': 1.0,
 'test_log_loss': 0.12474422166493657,
 'test_precision_score': 1.0,
 'test_recall_score': 1.0,
 'test_score': 1.0}

=============== eval_and_log_metrics - multiclass classification ===============

{'test_accuracy_score': 1.0,
 'test_f1_score': 1.0,
 'test_log_loss': 0.12474422166493657,
 'test_precision_score': 1.0,
 'test_recall_score': 1.0,
 'test_roc_auc_score': 1.0,
 'test_score': 1.0}

========================= mlflow.evaluate - regression =========================

{'test_example_count': 146.0,
 'test_max_error': 152.70618561385456,
 'test_mean_absolute_error': 41.96445302653266,
 'test_mean_absolute_percentage_error': 0.35510850217076595,
 'test_mean_on_label': 152.43835616438355,
 'test_mean_squared_error': 2817.801570168678,
 'test_r2_score': 0.5103954261351439,
 'test_root_mean_squared_error': 53.082968739216895,
 'test_score': 0.5103954261351439,
 'test_sum_on_label': 22256.0}

=============== mlflow.sklearn.eval_and_log_metrics - regression ===============

{'test_mae': 41.96445302653266,
 'test_mse': 2817.801570168678,
 'test_r2_score': 0.5103954261351439,
 'test_rmse': 53.082968739216895,
 'test_score': 0.5103954261351439}

We still have a few differences:

  • mlflow.sklearn.eval_and_log_metrics logs score, while mlflow.evaluate doesn't.
  • Regression metric names (mae vs mean_absolute_error).

harupy (Member Author) commented on Sep 9, 2022:

@yxiong Once this PR is merged, the difference between mlflow.sklearn.eval_and_log_metrics and mlflow.evaluate is the regression metric naming (e.g. mae vs. mean_absolute_error) as shown in the comment above. Is it possible to update AutoML to support long regression metric names logged by mlflow.evaluate?

@harupy harupy requested a review from yxiong September 9, 2022 01:19
yxiong (Contributor) commented on Sep 9, 2022:

@yxiong Once this PR is merged, the difference between mlflow.sklearn.eval_and_log_metrics and mlflow.evaluate is the regression metric naming (e.g. mae vs. mean_absolute_error) as shown in the comment above. Is it possible to update AutoML to support long regression metric names logged by mlflow.evaluate?

I think that's ok. @jimmyxu-db Please take note of this naming difference during our migration.

@harupy harupy merged commit b05b214 into mlflow:master Sep 9, 2022
nnethery pushed a commit to nnethery/mlflow that referenced this pull request Feb 1, 2024
* Add pos_label to mlflow.evaluate
* fix dataset_name
* fix _get_classifier_per_class_metrics_collection_df
* fix failed tests
* use evaluator_config
* remove average
* revert
* remove pos_label
* revert changes in examples/pipelines/sklearn_regression
* default value
* add **
* Update mlflow/models/evaluation/base.py
* fix metrics to log in default evaluator
* fix doc
* mention default value
* add _score
* fix
* add average and set pos_label to 1 by default
* remove _get_classifier_metrics
* fix doc
* refactor
* add test
* rename dataset
* fix
* test pos_label=None
* fix ternary
* fix
* fix tests
* remove redundant function parameters

Signed-off-by: harupy <hkawamura0130@gmail.com>
Co-authored-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com>