Add pos_label to mlflow.evaluate #6696
Conversation
LGTM! Thanks @harupy !
@@ -147,7 +147,7 @@ def _get_binary_sum_up_label_pred_prob(positive_class_index, positive_class, y,
     return y_bin, y_pred_bin, y_prob_bin


-def _get_classifier_per_class_metrics(y, y_pred):
+def _get_classifier_per_class_metrics(y, y_pred, *, pos_label=1):
Why are we defaulting to 1? I'm looking at https://github.com/mlflow/mlflow/pull/6117/files, and it seems that in that PR we default to None?
@jimmyxu-db I chose 1 to preserve the current behavior of the default evaluator.

In mlflow.sklearn.eval_and_log_metrics, pos_label is used for both binary and multi-class classification models. pos_label = None with average = 'weighted' works for both model types.
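For reference (not part of this diff), a minimal sklearn sketch of what the two parameter combinations do; the labels are made up:

from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# average='binary': the score is computed only for the class given by pos_label.
print(precision_score(y_true, y_pred, average="binary", pos_label=1))  # 1.0

# average='weighted': per-class scores are averaged, weighted by class support;
# pos_label is ignored here, which is why this form also works for multiclass labels.
print(precision_score(y_true, y_pred, average="weighted"))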
We can set the default to None (and use average = 'weighted') if that makes more sense. @WeichenXu123 Thoughts?
We can set the default to None (and use average = 'weighted') if that makes more sense.

For the binary case, shouldn't we only set average = 'binary'? Also, pos_label = 1 is consistent with the sklearn metric functions' default value (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), so I propose using 1 as the default.
For binary case, we should only set average = 'binary' ?

It depends on how balanced the dataset is. If it's imbalanced, other options such as weighted might make more sense.
How about we pass in both pos_label and average to allow users to specify them? I can see two options here:

This sounds good to me.

The downside for this is multiclass metrics, it will fail.

It actually won't fail in the default evaluator. The default evaluator logs 1-vs-rest metrics for each class, while eval_and_log_metrics just uses <sklearn_score_function>(y_true, y_pred, average="weighted", pos_label=None). @WeichenXu123 Do you know why we chose this approach?
mlflow/mlflow/models/evaluation/default_evaluator.py (lines 186 to 197 in 1f7f9dc):

def _get_classifier_per_class_metrics_collection_df(y, y_pred, labels):
    per_class_metrics_list = []
    for positive_class_index, positive_class in enumerate(labels):
        (y_bin, y_pred_bin, _,) = _get_binary_sum_up_label_pred_prob(
            positive_class_index, positive_class, y, y_pred, None
        )
        per_class_metrics = {"positive_class": positive_class}
        per_class_metrics.update(_get_classifier_per_class_metrics(y_bin, y_pred_bin))
        per_class_metrics_list.append(per_class_metrics)
    return pd.DataFrame(per_class_metrics_list)
The returned dataframe looks like this:
positive_class true_negatives false_positives false_negatives true_positives recall precision f1_score roc_auc precision_recall_auc
0 0 2774 205 137 184 0.573209 0.473008 0.518310 0.858065 0.371424
1 1 2753 209 191 147 0.434911 0.412921 0.423631 0.831169 0.385339
2 2 2752 208 158 182 0.535294 0.466667 0.498630 0.847666 0.409305
3 3 2823 153 160 164 0.506173 0.517350 0.511700 0.818308 0.510070
4 4 2784 168 263 85 0.244253 0.335968 0.282862 0.761322 0.271138
5 5 2730 238 244 88 0.265060 0.269939 0.267477 0.781107 0.319155
6 6 2791 189 207 113 0.353125 0.374172 0.363344 0.805106 0.340894
7 7 2757 220 221 102 0.315789 0.316770 0.316279 0.748290 0.268311
8 8 2799 185 232 84 0.265823 0.312268 0.287179 0.751020 0.225234
9 9 2729 233 195 143 0.423077 0.380319 0.400560 0.828694 0.388790
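To make the 1-vs-rest scheme above concrete, here is a small hand-rolled sketch (illustrative labels only, not taken from the evaluator code):

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 1, 0, 1])

# For each class c, relabel samples as 1 (== c) / 0 (!= c) and compute binary metrics,
# which is roughly what the binarization helper above does per positive class.
for c in np.unique(y_true):
    y_bin = (y_true == c).astype(int)
    y_pred_bin = (y_pred == c).astype(int)
    print(c, recall_score(y_bin, y_pred_bin, average="binary", pos_label=1))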
Hmm, I got a little confused. Does that mean evaluate has a different behavior than eval_and_log_metrics? In particular, I expect to see a single output from the evaluation result (which will be sorted by AutoML to select the best model), not one output per label.
Does that mean evaluate has a different behavior than eval_and_log_metrics?
Correct.
Quick script to check the differences:

import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_classes=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")
    model_uri = mlflow.get_artifact_uri("model")
    result = mlflow.evaluate(
        model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="multiclass-classification-dataset",
        evaluators="default",
        evaluator_config={"log_model_explainability": False},
    )

with mlflow.start_run():
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")
@yxiong We discussed this issue. We'll update the default evaluator so that it logs the same metrics as eval_and_log_metrics, and also update the default value of pos_label to None.
@@ -147,7 +147,7 @@ def _get_binary_sum_up_label_pred_prob(positive_class_index, positive_class, y,
     return y_bin, y_pred_bin, y_prob_bin


-def _get_classifier_per_class_metrics(y, y_pred):
+def _get_classifier_per_class_metrics(y, y_pred, *, pos_label=1):
How about we pass in both pos_label and average to allow users to specify them? I can see two options here:

1. Make the default consistent with sklearn (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html). The downside is that multiclass metrics will fail.
2. Set average to weighted by default so multiclass metrics won't fail, but change it back to binary when pos_label is specified. Otherwise it'll produce the wrong metrics for binary classification (see "Pass in the pos_label parameter to eval_and_log_metrics" #5807; the pitfall is illustrated in the sketch below).

(2) is what we did before with eval_and_log_metrics, but I think (1) is more intuitive.

This is important to get right! Please confirm the chosen approach with me before merging this PR.
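To illustrate the pitfall in option (2) with made-up numbers (my own sketch, not taken from #5807):

from sklearn.metrics import recall_score

# Imbalanced binary data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Recall of the positive class alone -- what a user specifying pos_label=1 expects.
print(recall_score(y_true, y_pred, average="binary", pos_label=1))  # 0.5

# Support-weighted average over both classes; the majority class dominates, so keeping
# average='weighted' after pos_label is specified would report a misleading number.
print(recall_score(y_true, y_pred, average="weighted"))  # 0.8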
LGTM in general. Please add some more tests to make sure the behavior is correct.
mlflow/models/evaluation/base.py (outdated)
@@ -926,6 +926,12 @@ def evaluate(
     - **log_metrics_with_dataset_info**: A boolean value specifying whether or not to include
       information about the evaluation dataset in the name of each metric logged to MLflow
       Tracking during evaluation, default value is True.
+    - **pos_label**: The positive label used to compute binary classification metrics such as
+      precision, recall, f1, etc (default: ``None``). This parameter is only used for binary
Should pos_label default to 1 for binary classification models?
eval_and_log_metrics uses None by default regardless of the model type.
In eval_and_log_metrics, average is set to weighted by default, in which case pos_label doesn't matter. I think that's wrong, and we do not need to stay backward compatible with it. Setting the pos_label default to 1 makes sense.
That makes sense, I will update the code.
- **pos_label**: The positive label to use when computing classification metrics such as
  precision, recall, f1, etc. for binary classification models (default: ``1``). For
  multiclass classification and regression models, this parameter will be ignored.
- **average**: The averaging method to use when computing classification metrics such as
  precision, recall, f1, etc. for multiclass classification models
  (default: ``'weighted'``). For binary classification and regression models, this
  parameter will be ignored.
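Assuming the config keys documented above land as shown, usage would look roughly like this (a sketch only; model_uri and eval_df are placeholders, and the keys are passed through evaluator_config rather than as top-level arguments):

import mlflow

result = mlflow.evaluate(
    model_uri,                  # placeholder: URI of a previously logged classifier
    eval_df,                    # placeholder: evaluation data containing the target column
    targets="label",
    model_type="classifier",
    dataset_name="my-dataset",
    evaluators="default",
    evaluator_config={
        "pos_label": 1,         # used for binary classification metrics
        "average": "weighted",  # used for multiclass classification metrics
    },
)
print(result.metrics)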
@yxiong @WeichenXu123 Added average and updated the pos_label doc. Could you check?
I think it's fine
LGTM!
@pytest.mark.parametrize("average", [None, "weighted", "macro", "micro"])
def test_evaluation_multiclass_classification_with_average(average):
@yxiong Added a test for average.
Thanks for addressing my comments. LGTM!
"example_count": len(y_true), | ||
"accuracy_score": sk_metrics.accuracy_score(y_true, y_pred), | ||
"recall_score": sk_metrics.recall_score( | ||
y_true, y_pred, average=average, pos_label=pos_label | ||
), | ||
"precision_score": sk_metrics.precision_score( | ||
y_true, y_pred, average=average, pos_label=pos_label | ||
), | ||
"f1_score": sk_metrics.f1_score(y_true, y_pred, average=average, pos_label=pos_label), |
Changed the metric names to align with eval_and_log_metrics.
if not is_binomial:
    metrics["f1_score_micro"] = sk_metrics.f1_score(y, y_pred, average="micro", labels=labels)
    metrics["f1_score_macro"] = sk_metrics.f1_score(y, y_pred, average="macro", labels=labels)

def _get_binary_classifier_metrics(*, y_true, y_pred, y_proba=None, labels=None, pos_label=1):
Enforce keyword form to avoid swapping y_true and y_pred by mistake.
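A tiny illustration (my own, not from the diff) of what the keyword-only marker buys:

def _metrics_sketch(*, y_true, y_pred, pos_label=1):
    # Trimmed-down signature for illustration; the real function computes the metrics.
    return {"pos_label": pos_label}

# _metrics_sketch([0, 1], [1, 1])                # TypeError: positional arguments are rejected,
_metrics_sketch(y_true=[0, 1], y_pred=[1, 1])    # so the two arrays can't be silently swapped.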
@yxiong @dbczumar @WeichenXu123 Here's the comparison of logged metrics between mlflow.evaluate and mlflow.sklearn.eval_and_log_metrics:

import shutil  # needed by divider() below
import mlflow
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import load_iris, load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from pprint import pprint
import logging
import warnings

logging.getLogger("mlflow").setLevel(logging.CRITICAL)
warnings.simplefilter("ignore")

client = mlflow.MlflowClient()
evaluator_config = {
    "log_metrics_with_dataset_info": False,
    "log_model_explainability": False,
    "metric_prefix": "test_",
}


def divider(title, length=80):
    length = shutil.get_terminal_size(fallback=(80, 24))[0] if length is None else length
    rest = length - len(title) - 2
    left = rest // 2 if rest % 2 else (rest + 1) // 2
    return "\n{} {} {}\n".format("=" * left, title, "=" * (rest - left))


# Binary
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="breast_cancer",
        evaluators="default",
        evaluator_config=evaluator_config,
    )
print(divider("mlflow.evaluate - binary classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")
print(divider("mlflow.sklearn.eval_and_log_metrics - binary classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

# Multi-class
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="classifier",
        dataset_name="iris",
        evaluators="default",
        evaluator_config=evaluator_config,
    )
print(divider("mlflow.evaluate - multiclass classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")
print(divider("eval_and_log_metrics - multiclass classification"))
pprint(client.get_run(run.info.run_id).data.metrics)

# Regression
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_train, y_train)

with mlflow.start_run() as run:
    info = mlflow.sklearn.log_model(model, "model")
    result = mlflow.evaluate(
        info.model_uri,
        X_test,
        targets=y_test,
        model_type="regressor",
        dataset_name="diabetes",
        evaluators="default",
        evaluator_config=evaluator_config,
    )
print(divider("mlflow.evaluate - regression"))
pprint(client.get_run(run.info.run_id).data.metrics)

with mlflow.start_run() as run:
    mlflow.sklearn.eval_and_log_metrics(model, X_test, y_test, prefix="test_")
print(divider("mlflow.sklearn.eval_and_log_metrics - regression"))
pprint(client.get_run(run.info.run_id).data.metrics)
We still have a few differences:
@yxiong Once this PR is merged, the difference between
I think that's ok. @jimmyxu-db Please take note of this naming difference during our migration.
* Add pos_label to mlflow.evaluate
* fix dataset_name
* fix _get_classifier_per_class_metrics_collection_df
* fix failed tests
* use evaluator_config
* remove average
* revert
* remove pos_label
* revert changes in examples/pipelines/sklearn_regression
* default value
* add **
* Update mlflow/models/evaluation/base.py
* fix metrics to log in default evaluator
* fix doc
* mention default value
* add _score
* fix
* add average and sett pos_label to 1 by default
* remove _get_classifier_metrics
* fix doc
* refactor
* add test
* rename dataset
* fix
* test pos_label=None
* fix ternary
* fix
* fix tests
* remove redundant function parameters

Signed-off-by: harupy <hkawamura0130@gmail.com>
Co-authored-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com>
Signed-off-by: harupy <hkawamura0130@gmail.com>
Related Issues/PRs
#xxx
What changes are proposed in this pull request?
Add pos_label to mlflow.evaluate to provide a way to specify the positive label to use when computing metrics for binary classification.

How is this patch tested?
I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.
Unit tests
Manual test
Does this PR change the documentation?

See the Details link on the Preview docs check.

Release Notes
Is this a user-facing change?
Add pos_label to mlflow.evaluate to specify the positive label to use when computing metrics for binary classification.

What component(s), interfaces, languages, and integrations does this PR affect?
Components

- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface

- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language

- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations

- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes