Introduce time-based batch metrics logging and change XGBoost to use it #3619
Conversation
Signed-off-by: Andrew Nitu <andrewnitu@gmail.com>
@andrewnitu Awesome! Left a few comments
mlflow/utils/autologging_utils.py (Outdated)
# we pass the batch_metrics_handler through, such that the callback can access it
def _timed_log_batch(batch_metrics_handler, run_id, metrics):
Can we move this into BatchMetricsHandler? Seems to make sense given that the method refers to an instance of batch_metrics_handler.
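A minimal sketch of that refactor, folding the timing into the class (attribute names like total_log_batch_time are assumptions based on the rest of this thread, not the actual diff):

import time

from mlflow.tracking import MlflowClient


class BatchMetricsHandler:
    def _timed_log_batch(self, metrics):
        # time the log_batch call itself so the handler can track its own overhead
        start = time.time()
        MlflowClient().log_batch(run_id=self.run_id, metrics=metrics)
        self.total_log_batch_time += time.time() - start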
mlflow/utils/autologging_utils.py (Outdated)
batch_metrics_handler.num_log_batch += 1


class BatchMetricsHandler:  # BatchMetricsLogger maybe?
+1. Let's call this BatchMetricsLogger.
mlflow/utils/autologging_utils.py (Outdated)
if step in self.data:
    self.data[step].append([int(time_wrapper_for_timestamp() * 1000), metrics])
else:
    self.data[step] = [[int(time_wrapper_for_timestamp() * 1000), metrics]]
Can we construct Metric objects here and just append metric objects to a list, rather than keeping track of things by step? Seems like we ultimately collapse everything into a list at purge time anyway. If we want to maintain a sorted order based on step, timestamp, etc., we can use the sorted function within the purge routine.
Yeah, I don't see why not. It seems kind of wasteful to group them by step and then ungroup them again, as I'm doing now.
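Something like this, perhaps (a sketch, assuming record_metrics receives a plain dict of metric names to values; only the flat-list idea is from the discussion above):

import time

from mlflow.entities import Metric


class BatchMetricsLogger:
    def __init__(self, run_id):
        self.run_id = run_id
        self.data = []  # flat list of Metric objects, no grouping by step

    def record_metrics(self, metrics, step):
        timestamp = int(time.time() * 1000)
        for key, value in metrics.items():
            self.data.append(Metric(key, value, timestamp, step))

    def _purge(self):
        # restore an ordering by (step, timestamp) only when we actually log
        self.data = sorted(self.data, key=lambda m: (m.step, m.timestamp))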
mlflow/utils/autologging_utils.py (Outdated)
self.total_training_time += training_time

if step in self.data:
    self.data[step].append([int(time_wrapper_for_timestamp() * 1000), metrics])
Can time_wrapper_for_timestamp() and these other timing functions give us times in millis so we don't have to convert them?
On second thought, I'm not sure we need these timer wrappers. See #3619 (comment)
mlflow/utils/autologging_utils.py (Outdated)
metrics_slices = [
    metrics[i * MAX_METRICS_PER_BATCH : (i + 1) * MAX_METRICS_PER_BATCH]
    for i in range((len(metrics) + MAX_METRICS_PER_BATCH - 1) // MAX_METRICS_PER_BATCH)
]
I think using the step parameter for range() will simplify things here, e.g.:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
(Credits to https://stackoverflow.com/a/312464/11952869)
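Applied here, the slicing above then collapses to something like this (a sketch reusing the names from the original diff):

for metrics_slice in chunks(metrics, MAX_METRICS_PER_BATCH):
    MlflowClient().log_batch(run_id=run_id, metrics=metrics_slice)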
mlflow/utils/autologging_utils.py (Outdated)
def time_wrapper_for_log():
    return time.time()


def time_wrapper_for_current():
    return time.time()


def time_wrapper_for_timestamp():
Instead of wrapping time.time() for mocking, can we just manipulate the total_log_batch_time and total_training_time properties of BatchMetricsHandler in our test cases?
If we're concerned about the measurement of total_training_time and total_log_batch_time, we can always construct another test case that performs sleeps to simulate training / logging and then verifies that total_training_time / total_log_batch_time exceed expected thresholds.
I tried to test this, but for some reason (at least in pytest) sleep doesn't increase the system clock, even though the test is clearly taking longer (so the sleep is running).
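If sleeps stay flaky, one alternative is to patch time.time with a hand-rolled clock so the test controls elapsed time exactly. A sketch (BatchMetricsLogger and run_id are assumed from the surrounding discussion, not the actual test file):

from unittest import mock


def test_logging_overhead_ratio(run_id):
    fake_time = [1000.0]

    with mock.patch("time.time", side_effect=lambda: fake_time[0]):
        metrics_logger = BatchMetricsLogger(run_id)
        metrics_logger.record_metrics({"loss": 0.5}, step=0)
        fake_time[0] += 10  # simulate 10 seconds of "training" deterministically
        metrics_logger.record_metrics({"loss": 0.4}, step=1)
        # assertions on total_training_time etc. would go here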
mlflow/xgboost.py (Outdated)
@@ -340,6 +341,9 @@ def record_eval_results(eval_results):
    """

    def callback(env):
        batch_metrics_handler.record_metrics(
Can we add batch_metrics_handler as an argument to record_eval_results and thread it through to this callback, to ensure that we're not accidentally referencing some state left over from a previous BatchMetricsHandler, for example?
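i.e., roughly this shape, with the handler passed in rather than captured from module state (a sketch; the env fields follow the old-style xgboost callback API used in the diff above):

def record_eval_results(eval_results, batch_metrics_handler):
    def callback(env):
        # record against the handler we were given, not leftover global state
        batch_metrics_handler.record_metrics(
            dict(env.evaluation_result_list), env.iteration
        )
        eval_results.append(dict(env.evaluation_result_list))

    return callback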
mlflow/utils/autologging_utils.py (Outdated)
@contextlib.contextmanager
def with_batch_metrics_handler():
Can we add docs here?
mlflow/utils/autologging_utils.py (Outdated)
# data is an array of tuples of the form (timestamp, metrics at timestamp)
self.data = {}
Seems like a dictionary to me! (Though I think it should be a list of Metric objects - see comment below.)
Signed-off-by: Andrew Nitu <andrewnitu@gmail.com>
mlflow/utils/autologging_utils.py (Outdated)
Context manager that yields a BatchMetricsLogger object, which metrics can be logged against.
The BatchMetricsLogger will keep metrics in a list until it decides they should be logged, at
which point the accumulated metrics will be batch logged. The BatchMetricsLogger will ensure
that logging imposes no more than a 10% overhead on the training, where the training is
measured by adding up the time elapsed between consecutive calls to record_metrics.

Once the context is closed, any metrics that have yet to be logged will be logged.

:param run_id: ID of the run that the metrics will be logged to.
Nit: Instead of future tense, we should favor using present tense where possible (e.g., "The BatchMetricsLogger keeps metrics in a list ..." instead of "The BatchMetricsLogger will keep metrics in a list ...").
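For reference, call sites of this context manager end up reading roughly like this (a sketch using the final names settled on later in this thread):

with batch_metrics_logger(run_id) as metrics_logger:
    for step, metrics in enumerate(eval_results):
        metrics_logger.record_metrics(metrics, step)
# on exit, any still-unlogged metrics are flushed in one final batch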
mlflow/xgboost.py (Outdated)
# logging metrics on each iteration.
for idx, metrics in enumerate(eval_results):
    try_mlflow_log(mlflow.log_metrics, metrics, step=idx)
run_id = mlflow.tracking.fluent._get_or_start_run().info.run_id
Can we use mlflow.active_run().info.run_id instead? A run should have already been created on line 350 (either that, or an active run already existed prior to autologging).
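i.e. (assuming a run is active by the time this executes):

run_id = mlflow.active_run().info.run_id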
mlflow/utils/autologging_utils.py (Outdated)
if self.total_log_batch_time == 0:  # we don't yet have data on how long logging takes
    return True
Now I don't think we need this :)
Signed-off-by: Andrew Nitu <andrewnitu@gmail.com>
Almost there! Found a few more small issues
mlflow/utils/autologging_utils.py (Outdated)
@contextlib.contextmanager
def with_batch_metrics_logger(run_id):
Nit: Let's drop the leading with so that usages become with batch_metrics_logger rather than with with_batch_metrics_logger:
- def with_batch_metrics_logger(run_id):
+ def batch_metrics_logger(run_id):
mlflow/utils/autologging_utils.py (Outdated)
with_batch_metrics_logger = BatchMetricsLogger(run_id)
yield with_batch_metrics_logger
with_batch_metrics_logger._purge()
- with_batch_metrics_logger = BatchMetricsLogger(run_id)
- yield with_batch_metrics_logger
- with_batch_metrics_logger._purge()
+ metrics_logger = BatchMetricsLogger(run_id)
+ yield metrics_logger
+ metrics_logger._purge()
mlflow/utils/autologging_utils.py (Outdated)
self.total_log_batch_time += end - start

def _should_purge(self):
    log_batch_time_fudge_factor = 10
Fudge factor seems like the wrong term here. This is the desired ratio of training time to batch logging time. Perhaps target_training_to_logging_time_ratio?
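With the rename, _should_purge might read roughly like this (a sketch; the two totals are the attributes discussed earlier in the thread):

def _should_purge(self):
    target_training_to_logging_time_ratio = 10

    # flush only once accumulated training time is at least 10x the time
    # spent logging, which keeps logging overhead at or below ~10%
    return (
        self.total_training_time
        >= self.total_log_batch_time * target_training_to_logging_time_ratio
    )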
log_batch_mock.reset_mock()  # resets the 'calls' of this mock

# the above 'training' took 1 second. So with fudge factor of 10x,
As above, I don't think that fudge factor is accurate terminology for this use case.
run_id = mlflow.tracking.fluent._get_or_start_run().info.run_id
with with_batch_metrics_logger(run_id) as batch_metrics_logger:
    batch_metrics_logger.record_metrics({"x": 1}, step=0)  # data doesn't matter
    # first metrics should be skipped to record a previous timestamp and batch log time
"skipped" seems to imply that we're dropping metrics or not logging them; I think we mean that we're logging them immediately (i.e., "skipping waiting").
def test_batch_metrics_logger_logs_all_metrics(start_run):  # pylint: disable=unused-argument
    with mock.patch.object(MlflowClient, "log_batch") as log_batch_mock:
Do we need to mock log_batch? Ideally, it would be nice to test that the metrics are actually logged by leaving this unmocked and querying the run data after the iterative calls to record_metrics() complete.
Do we want to do this for every test? Or is it sufficient to do it for one test and know that it works end-to-end, then save time by keeping the mock for the rest of the tests?
Actually, it seems this really only applies to one of the tests. The rest care about the intermediate state as the BatchMetricsLogger is logging, not the final outcome, so it makes more sense to use mocks there.
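For the one end-to-end case, an unmocked test could look roughly like this (a sketch; batch_metrics_logger is the context manager from this PR, and the client calls are standard MLflow APIs):

import mlflow
from mlflow.tracking import MlflowClient


def test_batch_metrics_logger_records_metrics_to_run():
    with mlflow.start_run() as run:
        with batch_metrics_logger(run.info.run_id) as metrics_logger:
            for step in range(10):
                metrics_logger.record_metrics({"loss": 1.0 / (step + 1)}, step)

    # query the run back and verify all ten values actually landed
    metric_history = MlflowClient().get_metric_history(run.info.run_id, "loss")
    assert len(metric_history) == 10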
mlflow/utils/autologging_utils.py (Outdated)
self.total_training_time += training_time

for key, value in metrics.items():
    self.data.append(Metric(key, value, current_timestamp, step))
The metric timestamp must be an integer value with millisecond resolution; e.g., this should be the following, based on what we do in the fluent API (mlflow/mlflow/tracking/fluent.py, line 443 in 055978c):

timestamp = int(time.time() * 1000)

- self.data.append(Metric(key, value, current_timestamp, step))
+ self.data.append(Metric(key, value, int(current_timestamp * 1000), step))
I found this by running our XGBoost example, where I encountered warning logs from the file store about operating on timestamp content in float form, rather than integer form:
/Users/czumar/mlflow/mlflow/xgboost.py:412: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
all_arg_names = inspect.getargspec(original)[0] # pylint: disable=W1505
[0] train-mlogloss:0.74723
[1] train-mlogloss:0.54060
[2] train-mlogloss:0.40276
[3] train-mlogloss:0.30789
[4] train-mlogloss:0.24052
[5] train-mlogloss:0.19087
[6] train-mlogloss:0.15471
[7] train-mlogloss:0.12807
[8] train-mlogloss:0.10722
[9] train-mlogloss:0.09053
/Users/czumar/mlflow/mlflow/xgboost.py:387: UserWarning: Logging to MLflow failed: invalid literal for int() with base 10: '1604679208.8766131'
try_mlflow_log(mlflow.log_artifact, filepath)
/Users/czumar/mlflow/mlflow/xgboost.py:465: UserWarning: Logging to MLflow failed: invalid literal for int() with base 10: '1604679208.8766131'
try_mlflow_log(mlflow.log_artifact, filepath)
/Users/czumar/mlflow/mlflow/xgboost.py:501: UserWarning: Logging to MLflow failed: invalid literal for int() with base 10: '1604679208.8766131'
input_example=input_example,
^ Interestingly, the error is only encountered when subsequent artifact logging operations are called, at which point the file store reads the logged metric files (not sure why it needs to do this, but it does)
Can we make sure to add a test case for this, ensuring that timestamps have millisecond resolution and are integer values?
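One possible shape for that test, capturing the Metric objects passed to log_batch (a sketch; the keyword-argument access assumes log_batch is called with metrics=..., as in the diff above):

from unittest import mock


def test_batch_metrics_logger_uses_integer_millisecond_timestamps():
    with mock.patch.object(MlflowClient, "log_batch") as log_batch_mock:
        with mlflow.start_run() as run:
            with batch_metrics_logger(run.info.run_id) as metrics_logger:
                metrics_logger.record_metrics({"x": 1.0}, step=0)

    for call in log_batch_mock.call_args_list:
        for metric in call[1]["metrics"]:
            assert isinstance(metric.timestamp, int)
            assert metric.timestamp >= 1e12  # epoch milliseconds, not seconds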
mlflow/utils/autologging_utils.py (Outdated)
    for i in range(0, len(self.data), MAX_METRICS_PER_BATCH)
]
for metrics_slice in metrics_slices:
    MlflowClient().log_batch(run_id=self.run_id, metrics=metrics_slice)
Can we wrap this in a call to try_mlflow_log() to ensure that failures don't prevent future metrics from being logged? We should document this behavior and, if possible, we should add a test case for it too.
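i.e., roughly (try_mlflow_log is the existing helper in autologging_utils; everything else follows the diff above):

for metrics_slice in metrics_slices:
    # a failed batch emits a warning instead of raising, so later slices
    # (and later record_metrics calls) are still logged
    try_mlflow_log(
        MlflowClient().log_batch, run_id=self.run_id, metrics=metrics_slice
    )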
Signed-off-by: Andrew Nitu <andrewnitu@gmail.com>
mlflow/utils/autologging_utils.py (Outdated)
@@ -261,6 +264,9 @@ def batch_metrics_logger(run_id):
    that logging imposes no more than a 10% overhead on the training, where the training is
    measured by adding up the time elapsed between consecutive calls to record_metrics.

    If logging a batch fails, a log will be emitted and subsequent metrics will continue to
- If logging a batch fails, a log will be emitted and subsequent metrics will continue to
+ If logging a batch fails, a warning will be emitted and subsequent metrics will continue to
LGTM with one tiny comment - thanks @andrewnitu !
Introduce time-based batch metrics logging and change XGBoost to use it (mlflow#3619)

- xgboost log on every iteration with timing
- get avg time
- fix
- batch send all at the end of training
- stash
- rename promise to future
- remove batch_log_interval
- make should_purge have no side effects
- do not assume step anymore
- add test case
- stash
- autofmt
- linting
- some cleanup and gather batch log time on initial iteration
- more cleanup
- reimport time
- revert changes to xgboost example
- add chunking test and clean up tests
- refactor chunking test
- revert adding __eq__ method to metric entity
- remove commented-out code
- fix xgboost autolog tests
- remove unused import
- remove unused import
- code review
- fix line lenght
- change to total log batch time instead of average
- make test go through two cycles of batch logging
- code review
- some code review
- code review
- remove extra param from xgboost example
- nit fix

Signed-off-by: Andrew Nitu <andrewnitu@gmail.com>
What changes are proposed in this pull request?
How is this patch tested?
Release Notes
Is this a user-facing change?
XGBoost autologging will now incrementally log iterations as they are trained, instead of all at once at the end of training.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes