
Support logging PyTorch metrics on training batch steps with autolog #5516

Merged (9 commits, Mar 21, 2022)

Conversation

@adamreeve (Contributor) commented on Mar 21, 2022

What changes are proposed in this pull request?

Add a new log_every_n_step parameter to mlflow.pytorch.autolog that enables logging step-based metrics against the step count instead of the epoch number. By default this parameter is None, and these metrics are logged against epoch numbers to preserve backwards compatibility.
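As an illustration of the intended semantics (the helper name below is hypothetical, not part of the PR), the step-gating logic can be sketched as:

```python
def should_log_batch_metrics(global_step, log_every_n_step):
    """Hypothetical helper: decide whether step-based metrics should be
    logged for this training batch. With the default of None, batch-level
    logging is disabled and metrics are logged per epoch as before."""
    if log_every_n_step is None:
        return False
    # Log on every Nth batch, counting steps from zero.
    return (global_step + 1) % log_every_n_step == 0
```

So with log_every_n_step=10, metrics would be recorded on steps 9, 19, 29, and so on, keyed by the trainer's global step.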

This fixes #4235.

This is a second attempt after #5497 was reverted, as that PR accessed trainer.logger_connector, which is being made private in the next PyTorch-Lightning version.

The new approach uses only trainer.callback_metrics for recent PyTorch-Lightning versions, and additionally tracks which metrics have been logged on epochs so these aren't treated as step-based metrics. Recording the metrics seen on epochs and steps happens outside the check for whether they should be logged to MLflow. This way, if, e.g., log_every_n_epoch is greater than 1, we still know which metrics were logged on the epoch within PyTorch-Lightning and shouldn't be treated as step-based metrics in the second epoch (and similarly when steps are logged less frequently than epochs, though that seems quite unlikely).
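A minimal sketch of that bookkeeping (class and method names are illustrative, not the PR's actual code): any metric key ever seen at an epoch boundary is excluded from the step-based set.

```python
class EpochMetricTracker:
    """Illustrative sketch: remember which metric keys were produced at
    epoch boundaries so they are never treated as step-based metrics,
    even on epochs that are skipped for MLflow logging
    (e.g. when log_every_n_epoch > 1)."""

    def __init__(self):
        self._epoch_metric_keys = set()

    def record_epoch_metrics(self, callback_metrics):
        # Always record the keys, regardless of whether this particular
        # epoch's metrics are actually logged to MLflow.
        self._epoch_metric_keys.update(callback_metrics.keys())

    def step_metrics(self, callback_metrics):
        # Anything not previously seen at an epoch boundary is treated
        # as a step-based metric.
        return {
            key: value
            for key, value in callback_metrics.items()
            if key not in self._epoch_metric_keys
        }
```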

I was a little concerned about the performance impact of retrieving and checking the metric keys on each step and epoch, so I ran some benchmarks; the impact was fairly negligible. I used a very simple PyTorch model with a single layer and random data, and ran 100 epochs with 64 training batches each, logging to a local MLflow dev server with a SQLite tracking store. I also ran the same benchmarks after merging in the current changes from #5460, which enables batched SQL writing; with that change, the extra time taken when logging on steps is greatly reduced:

Test case                            | Time taken (s) | With batch writing (s)
Logging epochs only                  | 26.6           | 18.8
Logging every 10th step with #5497   | 62.7           | 19.0
Logging every 10th step with this PR | 63.1           | 19.2

I had hoped that changing to this approach would mean the code could be the same for all PyTorch-Lightning versions. Unfortunately, versions prior to 1.4.0 had test failures: trainer.callback_metrics, when accessed within a callback after a training batch, didn't include the metrics from the current batch, as these are only updated after all callbacks have run. For older versions I therefore had to revert to the logger_connector approach.
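The resulting version dispatch can be sketched as follows. This is a simplification under stated assumptions: the pre-1.4.0 path goes through the private logger_connector mentioned above, and the exact attribute access in the PR may differ.

```python
def get_current_batch_metrics(trainer, pl_version):
    """Illustrative dispatcher, assuming pl_version is a
    (major, minor, patch) tuple for the installed PyTorch-Lightning."""
    if pl_version >= (1, 4, 0):
        # Recent versions: the public property already reflects the
        # current training batch when read from a callback.
        return dict(trainer.callback_metrics)
    # Older versions: callback_metrics is only refreshed after all
    # callbacks have run, so read via the (private) logger_connector.
    return dict(trainer.logger_connector.callback_metrics)
```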

How is this patch tested?

New unit test added.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

mlflow.pytorch.autolog now accepts a log_every_n_step parameter that can be used to log step-based metrics against the step number.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions bot added the area/tracking and rn/feature labels and removed the rn/feature label (Mar 21, 2022)
@harupy added the enable-dev-tests label (Mar 21, 2022)
@harupy (Member) left a comment:

lgtm!

@harupy harupy merged commit 3ab6fbf into mlflow:master Mar 21, 2022
@adamreeve adamreeve deleted the autolog_step_4235_2 branch March 21, 2022 03:24
erensahin pushed a commit to erensahin/mlflow that referenced this pull request on Apr 11, 2022: Support logging PyTorch metrics on training batch steps with autolog (mlflow#5516)

  • Support logging PyTorch metrics on training batch steps
  • Formatting
  • More formatting fixes
  • Change default for log_every_n_step to None
  • Handle getting step metrics for PyTorch-Lightning < 1.4.0
  • Add test case logging every 10th step
  • Add test case with epochs logged less frequently than steps
  • Use only the public Trainer.callback_metrics property to access PyTorch metrics
  • Go back to using the logger_connector for PL < 1.4.0

All commits signed off by Adam Reeve <adreeve@gmail.com>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels

  • area/tracking: Tracking service, tracking client APIs, autologging
  • enable-dev-tests: Enables cross-version tests for dev versions
  • rn/feature: Mention under Features in Changelogs

Successfully merging this pull request may close these issues:

  • [BUG] mlflow.autolog() for PyTorch Lightning logs _step metrics for epochs instead of steps