
Support logging PyTorch metrics on training batch steps with autolog #5516

Merged (9 commits, Mar 21, 2022)

Conversation

@adamreeve (Contributor) commented on Mar 21, 2022

What changes are proposed in this pull request?

Add a new log_every_n_step parameter to mlflow.pytorch.autolog that enables logging step-based metrics against the step count instead of the epoch number. By default this parameter is None, and these metrics are logged against epoch numbers to preserve backwards compatibility.
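As an illustration of the intended semantics (the helper name below is hypothetical, not part of the PR), the step-gating logic can be sketched as:

```python
def should_log_batch_metrics(global_step, log_every_n_step):
    """Hypothetical helper: decide whether step-based metrics should be
    logged for this training batch. With the default of None, batch-level
    logging is disabled and metrics are logged per epoch as before."""
    if log_every_n_step is None:
        return False
    # Log on every Nth batch, counting steps from zero.
    return (global_step + 1) % log_every_n_step == 0
```

So with log_every_n_step=10, metrics would be recorded on steps 9, 19, 29, and so on, keyed by the trainer's global step.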

This fixes #4235.

This is a second attempt after #5497 was reverted, as that PR accessed trainer.logger_connector, which is being made private in the next PyTorch-Lightning version.

The new approach uses only trainer.callback_metrics for recent PyTorch-Lightning versions, and additionally tracks which metrics have been logged on epochs so these aren't treated as step-based metrics. Recording the metrics seen on epochs and steps happens outside the check for whether they should be logged to MLflow. This way, if, e.g., log_every_n_epoch is greater than 1, we still know which metrics were logged on the epoch within PyTorch-Lightning and shouldn't be treated as step-based metrics in the second epoch (and similarly when steps are logged less frequently than epochs, though that seems quite unlikely).
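A minimal sketch of that bookkeeping (class and method names are illustrative, not the PR's actual code): any metric key ever seen at an epoch boundary is excluded from the step-based set.

```python
class EpochMetricTracker:
    """Illustrative sketch: remember which metric keys were produced at
    epoch boundaries so they are never treated as step-based metrics,
    even on epochs that are skipped for MLflow logging
    (e.g. when log_every_n_epoch > 1)."""

    def __init__(self):
        self._epoch_metric_keys = set()

    def record_epoch_metrics(self, callback_metrics):
        # Always record the keys, regardless of whether this particular
        # epoch's metrics are actually logged to MLflow.
        self._epoch_metric_keys.update(callback_metrics.keys())

    def step_metrics(self, callback_metrics):
        # Anything not previously seen at an epoch boundary is treated
        # as a step-based metric.
        return {
            key: value
            for key, value in callback_metrics.items()
            if key not in self._epoch_metric_keys
        }
```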

I was a little concerned about the performance impact of retrieving and checking the metric keys on each step and epoch, so I ran some benchmarks; the impact was fairly negligible. I used a very simple PyTorch model with a single layer and random data, and ran 100 epochs with 64 training batches each, logging to a local MLflow dev server with a SQLite tracking store. I also ran the same benchmarks after merging in the current changes from #5460, which enables batched SQL writing; with that change, the extra time taken when logging on steps is greatly reduced:

Test case                            | Time taken (s) | With batch writing (s)
Logging epochs only                  | 26.6           | 18.8
Logging every 10th step with #5497   | 62.7           | 19.0
Logging every 10th step with this PR | 63.1           | 19.2

I had hoped that changing to this approach would mean the code could be the same for all PyTorch-Lightning versions. Unfortunately, versions prior to 1.4.0 had test failures: trainer.callback_metrics, when accessed within a callback after a training batch, didn't include the metrics from the current batch, as these are only updated after all callbacks have run. For older versions I therefore had to revert to the logger_connector approach.
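The resulting version dispatch can be sketched as follows. This is a simplification under stated assumptions: the pre-1.4.0 path goes through the private logger_connector mentioned above, and the exact attribute access in the PR may differ.

```python
def get_current_batch_metrics(trainer, pl_version):
    """Illustrative dispatcher, assuming pl_version is a
    (major, minor, patch) tuple for the installed PyTorch-Lightning."""
    if pl_version >= (1, 4, 0):
        # Recent versions: the public property already reflects the
        # current training batch when read from a callback.
        return dict(trainer.callback_metrics)
    # Older versions: callback_metrics is only refreshed after all
    # callbacks have run, so read via the (private) logger_connector.
    return dict(trainer.logger_connector.callback_metrics)
```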

How is this patch tested?

New unit test added.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

mlflow.pytorch.autolog now accepts a log_every_n_step parameter that can be used to log step-based metrics against the step number.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions bot added the area/tracking and rn/feature labels and removed the rn/feature label (Mar 21, 2022)
@harupy added the enable-dev-tests label (Mar 21, 2022)
@harupy (Member) left a comment:

lgtm!

@harupy harupy merged commit 3ab6fbf into mlflow:master Mar 21, 2022
@adamreeve adamreeve deleted the autolog_step_4235_2 branch March 21, 2022 03:24
erensahin pushed a commit to erensahin/mlflow that referenced this pull request on Apr 11, 2022: Support logging PyTorch metrics on training batch steps with autolog (mlflow#5516)

  • Support logging PyTorch metrics on training batch steps
  • Formatting
  • More formatting fixes
  • Change default for log_every_n_step to None
  • Handle getting step metrics for PyTorch-Lightning < 1.4.0
  • Add test case logging every 10th step
  • Add test case with epochs logged less frequently than steps
  • Use only the public Trainer.callback_metrics property to access PyTorch metrics
  • Go back to using the logger_connector for PL < 1.4.0

All commits signed off by Adam Reeve <adreeve@gmail.com>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels

  • area/tracking: Tracking service, tracking client APIs, autologging
  • enable-dev-tests: Enables cross-version tests for dev versions
  • rn/feature: Mention under Features in Changelogs

Successfully merging this pull request may close these issues:

  • [BUG] mlflow.autolog() for PyTorch Lightning logs _step metrics for epochs instead of steps