Support logging PyTorch metrics on training batch steps with autolog #5516
Merged
Conversation
Signed-off-by: Adam Reeve <adreeve@gmail.com>
…ch metrics Signed-off-by: Adam Reeve <adreeve@gmail.com>
github-actions bot added the area/tracking (Tracking service, tracking client APIs, autologging) and rn/feature (Mention under Features in Changelogs) labels, and removed the rn/feature label, on Mar 21, 2022.
harupy approved these changes on Mar 21, 2022.
lgtm!
erensahin pushed a commit to erensahin/mlflow that referenced this pull request on Apr 11, 2022:
…lflow#5516)
* Support logging PyTorch metrics on training batch steps
* Formatting
* More formatting fixes
* Change default for log_every_n_step to None
* Handle getting step metrics for PyTorch-Lightning < 1.4.0
* Add test case logging every 10th step
* Add test case with epochs logged less frequently than steps
* Use only the public Trainer.callback_metrics property to access PyTorch metrics
* Go back to using the logger_connector for PL < 1.4.0

Signed-off-by: Adam Reeve <adreeve@gmail.com>
Labels
* area/tracking: Tracking service, tracking client APIs, autologging
* enable-dev-tests: Enables cross-version tests for dev versions
* rn/feature: Mention under Features in Changelogs.
What changes are proposed in this pull request?

Add a new log_every_n_step parameter to mlflow.pytorch.autolog that enables logging step-based metrics against the step count instead of the epoch number. By default this parameter is False, and these metrics will be logged against epoch numbers to keep backwards compatibility. This fixes #4235.
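As a rough illustration of the gating semantics described above (a hypothetical helper, not MLflow source code), a step-gating check along these lines decides whether a given training batch should have its metrics logged:

```python
# Hypothetical sketch of log_every_n_step-style gating; the helper name and
# the (step + 1) convention are assumptions for illustration, not MLflow code.
def should_log_step(global_step, log_every_n_step):
    """Return True when step-based metrics should be logged for this batch."""
    if not log_every_n_step:
        # Disabled by default, so behaviour stays epoch-based.
        return False
    return (global_step + 1) % log_every_n_step == 0

# With log_every_n_step=10, only every 10th training batch is logged:
logged_steps = [s for s in range(30) if should_log_step(s, 10)]
# logged_steps == [9, 19, 29]
```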
This is a second attempt after #5497 was reverted, as that PR accessed trainer.logger_connector, which is being made private in the next PyTorch-Lightning version. The new approach only uses trainer.callback_metrics for recent PyTorch-Lightning versions, and additionally tracks which metrics have been logged on epochs so that these aren't treated as step-based metrics. Recording the metrics seen on epochs and steps is done outside the check for whether they should be logged in MLflow, so that, e.g., if log_every_n_epoch is greater than 1, we still know which metrics were logged on the epoch within PyTorch-Lightning and shouldn't be considered step-based metrics in the second epoch (and similarly for when steps are logged less frequently than epochs, although that seems quite unlikely).

I was a little concerned about the performance impact of retrieving and checking the metric keys on each step and epoch, so I did some benchmarking, and the impact was fairly negligible. I used a very simple PyTorch model with a single layer and random data and ran 100 epochs, with each epoch running 64 training batches, logging to a local MLflow dev server with a SQLite tracking store. I also ran the same benchmarks after merging in the current changes from #5460, which enables batched SQL writes, and the extra time taken when logging on steps is greatly reduced.
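The epoch-vs-step bookkeeping described above could be sketched roughly as follows (a self-contained illustration with assumed names, not the actual MLflow callback):

```python
# Illustrative sketch: track which metric keys appear at epoch boundaries so
# they are excluded from step-based logging. Class and method names are
# assumptions for illustration.
class MetricKeyTracker:
    def __init__(self):
        self._epoch_metric_keys = set()

    def on_epoch_end(self, callback_metrics):
        # Record epoch-level keys even when this epoch is not logged to
        # MLflow (e.g. log_every_n_epoch > 1), so later steps still know
        # these keys are epoch-based.
        self._epoch_metric_keys.update(callback_metrics)

    def step_metrics(self, callback_metrics):
        # Only metrics never seen at an epoch boundary count as step metrics.
        return {k: v for k, v in callback_metrics.items()
                if k not in self._epoch_metric_keys}

tracker = MetricKeyTracker()
tracker.on_epoch_end({"val_loss": 0.5})
step_only = tracker.step_metrics({"train_loss_step": 0.9, "val_loss": 0.5})
# step_only == {"train_loss_step": 0.9}
```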
I had hoped that changing to this approach would mean the code could be the same for all PyTorch-Lightning versions, but unfortunately versions prior to 1.4.0 had test failures: trainer.callback_metrics accessed within a callback after a training batch didn't include the metrics from the current batch, as these are only updated after all callbacks have run, so I had to revert to the logger_connector approach for older versions.

How is this patch tested?
New unit test added.
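The version gate described earlier (public Trainer.callback_metrics for Lightning >= 1.4.0, logger_connector internals otherwise) could be sketched roughly like this; the function names and the logger_connector attribute path are illustrative assumptions, not the exact pre-1.4.0 API:

```python
from types import SimpleNamespace

def parse_version(v):
    # Minimal "X.Y.Z" parser; real code would use packaging.version instead.
    return tuple(int(p) for p in v.split("."))

def get_batch_metrics(trainer, pl_version):
    # Names and dispatch are a sketch of the approach, not MLflow source.
    if parse_version(pl_version) >= (1, 4, 0):
        # Recent Lightning: the public property is up to date in callbacks.
        return dict(trainer.callback_metrics)
    # Older Lightning: callback_metrics is stale after a training batch,
    # so fall back to logger_connector internals (attribute illustrative).
    return dict(trainer.logger_connector.callback_metrics)

trainer = SimpleNamespace(callback_metrics={"train_loss": 0.7})
metrics = get_batch_metrics(trainer, "1.5.2")
# metrics == {"train_loss": 0.7}
```

Note the tuple comparison: a naive string compare would order "1.10.0" before "1.4.0", which is why the version is parsed first.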
Does this PR change the documentation?

1. Wait for the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
2. Click Details on the right to open the job page of CircleCI.
3. Open the Artifacts tab.
4. Open docs/build/html/index.html.

Release Notes
Is this a user-facing change?

mlflow.pytorch.autolog now accepts a log_every_n_step parameter that can be used to log step-based metrics against the step number.

What component(s), interfaces, languages, and integrations does this PR affect?
Components
* area/artifacts: Artifact stores and artifact logging
* area/build: Build and test infrastructure for MLflow
* area/docs: MLflow documentation pages
* area/examples: Example code
* area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
* area/models: MLmodel format, model serialization/deserialization, flavors
* area/projects: MLproject format, project running backends
* area/scoring: MLflow Model server, model deployment tools, Spark UDFs
* area/server-infra: MLflow Tracking server backend
* area/tracking: Tracking Service, tracking client APIs, autologging

Interface
* area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
* area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
* area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
* area/windows: Windows support

Language
* language/r: R APIs and clients
* language/java: Java APIs and clients
* language/new: Proposals for new client languages

Integrations
* integrations/azure: Azure and Azure ML integrations
* integrations/sagemaker: SageMaker integrations
* integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
* rn/breaking-change: The PR will be mentioned in the "Breaking Changes" section
* rn/none: No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
* rn/feature: A new user-facing feature worth mentioning in the release notes
* rn/bug-fix: A user-facing bug fix worth mentioning in the release notes
* rn/documentation: A user-facing documentation change worth mentioning in the release notes