MLflow UI slowdowns with the number of timeseries metrics #1571
@jdlesage Thank you for raising this concern. Regarding the ListRuns performance issue with a large number of metrics, it seems reasonable to create a new database table containing the latest metric for each run (e.g., the metric with the maximum step, timestamp, and value); ListRuns can then read from this table without fetching all metrics for each run. Regarding the database size problem, it might make sense to log metric values less frequently. Perhaps logging a metric every 10-20 steps is sufficient if the aggregate number of steps per metric is very large. Alternatively, we recommend using a scalable, SQLAlchemy-compatible database in order to meet storage capacity requirements. It would be useful to know how many distinct metrics you're logging for each run and how many timeseries ("step") entries you're logging per metric. Can you provide this information?
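The "latest metric per run" table suggested above could be sketched as follows. This is purely illustrative: the table and column names (`metrics`, `latest_metrics`, `run_uuid`, etc.) are assumptions and do not reflect MLflow's actual schema, and SQLite stands in for the real backing store (the upsert syntax needs SQLite 3.24+).

```python
# Hypothetical sketch: keep a one-row-per-metric lookup table alongside the
# full metric history, so listing runs never scans the history table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (
    run_uuid TEXT, key TEXT, value REAL, timestamp INTEGER, step INTEGER
);
CREATE TABLE latest_metrics (
    run_uuid TEXT, key TEXT, value REAL, timestamp INTEGER, step INTEGER,
    PRIMARY KEY (run_uuid, key)
);
""")

def log_metric(run_uuid, key, value, timestamp, step):
    """Append to the full history and upsert the latest-value row."""
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
        (run_uuid, key, value, timestamp, step),
    )
    # Only overwrite the latest row when the incoming step is newer.
    conn.execute(
        """INSERT INTO latest_metrics VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(run_uuid, key) DO UPDATE
           SET value = excluded.value,
               timestamp = excluded.timestamp,
               step = excluded.step
           WHERE excluded.step >= latest_metrics.step""",
        (run_uuid, key, value, timestamp, step),
    )

for step in range(100):
    log_metric("run1", "loss", 1.0 / (step + 1), 1000 + step, step)

# ListRuns can now read one row per (run, metric) instead of the full history.
row = conn.execute(
    "SELECT step, value FROM latest_metrics "
    "WHERE run_uuid = 'run1' AND key = 'loss'"
).fetchone()
print(row)  # (99, 0.01)
```

The tradeoff is one extra write per logged metric in exchange for run listing becoming independent of history length.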
Agreed. I will soon open a PR to create the separate table and compare performance. We have also advised our teams to reduce the logging frequency. It would be nice to implement something that enforces this policy (sampling at the end of the run?); that could be another nice PR. FYI, in our database the "biggest" run is: Our database is currently MariaDB (version 10.3.11).
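The "sampling at the end of the run" idea mentioned above could look something like this minimal sketch. The helper name `subsample_history` and the `(step, value)` representation are hypothetical, not part of MLflow:

```python
def subsample_history(history, keep_every=10):
    """Down-sample a metric history, keeping every Nth point plus the
    final one. `history` is a list of (step, value) pairs; this is an
    illustrative policy, not MLflow's actual behavior."""
    if not history:
        return []
    sampled = [pt for i, pt in enumerate(history) if i % keep_every == 0]
    if sampled[-1] != history[-1]:
        sampled.append(history[-1])  # always preserve the final value
    return sampled

history = [(s, s * 0.1) for s in range(95)]
print(len(subsample_history(history)))  # 11: steps 0, 10, ..., 90, plus 94
```

Keeping the last point matters because the final metric value is usually what run comparison displays.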
Hello, I am working with @jdlesage on our MLflow server. I was looking into this issue and found two separate causes for the slowness of the UI:
I think we could greatly improve this by replacing all those queries and the Python loop with a single SQL JOIN, with the right filter to get only the metrics at the last step. What do you think?
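The single-JOIN idea above can be sketched with a classic groupwise-maximum self-join. SQLite is used here just for illustration; the schema is a simplified assumption, not MLflow's real one:

```python
# Hypothetical sketch: fetch the last-step value of every metric of every
# run in one query, instead of one query per run plus a Python loop.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (run_uuid TEXT, key TEXT, value REAL, step INTEGER)"
)
rows = [
    ("run1", "loss", 0.9, 0), ("run1", "loss", 0.5, 1), ("run1", "loss", 0.2, 2),
    ("run2", "loss", 0.8, 0), ("run2", "loss", 0.3, 1),
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)

# Self-join: keep a row only if no other row for the same (run, metric)
# has a larger step, i.e. the LEFT JOIN finds no "later" match.
latest = conn.execute("""
    SELECT m.run_uuid, m.key, m.value, m.step
    FROM metrics m
    LEFT JOIN metrics later
      ON later.run_uuid = m.run_uuid
     AND later.key = m.key
     AND later.step > m.step
    WHERE later.step IS NULL
    ORDER BY m.run_uuid
""").fetchall()
print(latest)  # [('run1', 'loss', 0.2, 2), ('run2', 'loss', 0.3, 1)]
```

With an index on `(run_uuid, key, step)` this kind of query stays fast even when the history table grows large.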
@t-henri's PR fixes the problem. Closing the issue.
System information
MLflow version (run mlflow --version): 1.0.0

Describe the problem
On our server, we log runs that use timeseries metrics (metrics that use the step parameter). This leads to more than 10K metric entries per run in the metrics table.
These kinds of runs slow the server down considerably. To compute the list of runs, MLflow needs to join the runs table with the metrics table, which has become huge.
Beyond the performance issue, this also increases the size of our database. The current solution will hardly scale with the number of runs.
Suggestions
1/ A simple solution would be to limit the number of steps that can be registered per experiment. => This could be seen as a limitation for users.
2/ Change the DB schema: store timeseries metrics in a separate table and don't use it in the experiment listing. => This will not fix the database size problem.
3/ Use another way to store timeseries; for example, store them in the run artifacts. => This is much more complex to implement and changes the design of MLflow.
What are your recommendations? Thanks!