Async calls to log_metric #1550
Comments
Also, this would make it more fault-tolerant. Currently, an experiment whose configured Tracking URI is temporarily unreachable will die.
A potential workaround for the lag from logging every epoch is to use mlflow.log_metrics, or the lower-level MlflowClient.log_batch, to send several metrics in one call. Having an async process as a built-in feature would definitely be useful, though.
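The batching workaround above could look roughly like the sketch below. `MetricBuffer`, `mlflow_sink`, and `flush_every` are hypothetical names made up for illustration; the only MLflow calls used are `MlflowClient.log_batch` and the `Metric` entity. The buffer collects metrics in memory and sends them in one request once enough have accumulated, so most epochs make no network call.

```python
import time
from typing import Callable, Dict, List, Tuple

# One buffered metric: (key, value, timestamp in ms, step)
Record = Tuple[str, float, int, int]


class MetricBuffer:
    """Hypothetical helper: collects metrics and hands them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[Record]], None], flush_every: int = 50):
        self.sink = sink
        self.flush_every = flush_every
        self._pending: List[Record] = []

    def log(self, metrics: Dict[str, float], step: int) -> None:
        """Buffer the metrics for one step; flush once enough have piled up."""
        now_ms = int(time.time() * 1000)
        for key, value in metrics.items():
            self._pending.append((key, float(value), now_ms, step))
        if len(self._pending) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self.sink(batch)


def mlflow_sink(run_id: str) -> Callable[[List[Record]], None]:
    """Build a sink that sends each batch with MlflowClient.log_batch."""
    # Imported lazily so the buffer itself has no hard dependency on MLflow.
    from mlflow.entities import Metric
    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    def send(batch: List[Record]) -> None:
        metrics = [Metric(key, value, ts, step) for key, value, ts, step in batch]
        client.log_batch(run_id, metrics=metrics)

    return send
```

In a training loop this might be used as `buffer = MetricBuffer(mlflow_sink(run.info.run_id))`, then `buffer.log({"loss": loss}, step=epoch)` each epoch, and a final `buffer.flush()` after training. Note that the calls are still synchronous: each flush blocks until the tracking server responds.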
Also interested in this. It would be very useful both for fault tolerance and for cases where the running time of an epoch is short.
@apurva-koti Is there any progress on this?
Hi folks, the MLflow fluent API, including
We created a simple async wrapper around MLflow in our project: https://github.com/microsoft/qlib/blob/d7d19feb4ebb0c4318ac3bfda32a34c56e28a6a0/qlib/workflow/recorder.py#L298. Hope it will be helpful.
The quantities that most need async logging tend to be ones that are not simple parameters and metrics. For my use case, @you-n-g's solution was of immense help. I added some multi-threading capability to their code and simplified a few downstream patterns, for anyone who may find that helpful: https://github.com/phelps-matthew/dl-schema/blob/torch-advanced/dl_schema/recorder_base.py
Is it possible to call log_metrics and have it execute asynchronously?
In my situation, I am running experiments with simple models and simple datasets that are very fast to train and test, but when I call log_metrics every epoch, the run takes much longer because of the logging to MLflow. I was thinking that we could build a queue that is processed in parallel with the original code, without blocking the main train and test processes.
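The queue idea described above can be sketched with the standard library alone. This is a minimal illustration and not MLflow's own implementation: `AsyncMetricLogger` and `log_fn` are hypothetical names, and `log_fn` stands for any callable with the shape `log_fn(metrics, step=...)`. The training loop only enqueues, while a background thread drains the queue and makes the slow calls.

```python
import queue
import threading
from typing import Callable, Dict, List, Optional


class AsyncMetricLogger:
    """Hypothetical helper: forwards metrics to `log_fn` from a background thread."""

    def __init__(self, log_fn: Callable[..., None], maxsize: int = 0):
        self._log_fn = log_fn
        # maxsize=0 means unbounded; a positive value makes log_metrics block
        # when the worker falls too far behind.
        self._queue: queue.Queue = queue.Queue(maxsize)
        self.errors: List[Exception] = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        """Enqueue a copy of the metrics and return without waiting for the log call."""
        self._queue.put((dict(metrics), step))

    def _drain(self) -> None:
        """Worker loop: make the (possibly slow) logging calls in order."""
        while True:
            item = self._queue.get()
            if item is None:  # sentinel posted by close()
                break
            metrics, step = item
            try:
                self._log_fn(metrics, step=step)
            except Exception as exc:
                # Record the failure instead of killing the training run.
                self.errors.append(exc)

    def close(self) -> None:
        """Wait for all queued metrics to be processed, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

A training loop would call `logger.log_metrics({"loss": loss}, step=epoch)` each epoch and `logger.close()` at the end, then inspect `logger.errors`. Because failures are collected rather than raised, a temporarily unreachable tracking server no longer stops the experiment, although this sketch does not retry failed calls. Whether the fluent API's active run is visible from a worker thread may depend on the MLflow version, so passing a function bound to an explicit run ID (for example one built on `MlflowClient.log_batch`) is likely the safer choice.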