
MLflow UI slows down with the number of timeseries metrics #1571

Closed
jdlesage opened this issue Jul 11, 2019 · 4 comments

@jdlesage
Contributor

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7
  • MLflow installed from (source or binary): source
  • MLflow version (run mlflow --version): 1.0.0
  • Python version: 3.6
  • npm version (if running the dev UI): N/A
  • Exact command to reproduce:

Describe the problem

On our server, we log runs that use timeseries metrics (metrics that use the step parameter). This leads to more than 10K metric entries per run in the metrics table.
Runs like these slow the server down considerably: to compute the list of runs, MLflow must join the runs table with the metrics table, which becomes huge.
Beyond the performance issue, this also inflates the size of our database. The current design will hardly scale with the number of runs.

Suggestions

1/ A simple solution would be to limit the number of steps that can be registered per experiment. => Users could see this as a limitation.

2/ Change the DB schema: store timeseries metrics in a separate table and don't use it in the experiment listing. => This would not fix the database size problem.

3/ Store timeseries another way, for example in the run artifacts. => This is much more complex to implement and changes the design of MLflow.

What are your recommendations? Thanks!

@dbczumar
Collaborator

@jdlesage Thank you for raising this concern. Regarding the ListRuns performance issue with a large number of metrics, it seems reasonable to create a new database table containing the latest metric for each run (e.g., the metric with the maximum step, timestamp, and value); ListRuns can then read from this table without fetching all metrics for each run. LogMetric can then write to both tables: the table consisting of metric timeseries and the table consisting of latest metric values. We would be more than happy to review and merge a PR that implements this proposal!
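A minimal sketch of this proposal, using stdlib sqlite3 with simplified, made-up table and column names (the real MLflow schema and SQL dialect differ): `log_metric` writes to both the full timeseries table and a per-(run, key) `latest_metrics` table, so a listing query never has to scan the timeseries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL, step INTEGER, timestamp INTEGER);
CREATE TABLE latest_metrics (run_id TEXT, key TEXT, value REAL, step INTEGER, timestamp INTEGER,
                             PRIMARY KEY (run_id, key));
""")

def log_metric(run_id, key, value, step, timestamp):
    """Write to both tables: the full timeseries and the per-(run, key) latest value."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
                 (run_id, key, value, step, timestamp))
    # Keep only the entry with the greatest (step, timestamp, value), as proposed above.
    conn.execute("""
        INSERT INTO latest_metrics VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(run_id, key) DO UPDATE SET
            value = excluded.value, step = excluded.step, timestamp = excluded.timestamp
        WHERE (excluded.step, excluded.timestamp, excluded.value)
              > (latest_metrics.step, latest_metrics.timestamp, latest_metrics.value)
    """, (run_id, key, value, step, timestamp))

for s, v in enumerate([0.9, 0.5, 0.1]):
    log_metric("run1", "loss", v, s, 1000 + s)

# ListRuns can now read the latest value without fetching all metrics for each run.
row = conn.execute("SELECT value, step FROM latest_metrics WHERE run_id = 'run1'").fetchone()
print(row)  # → (0.1, 2)
```

The upsert (`ON CONFLICT ... DO UPDATE`) keeps the write path to a single statement per table; the row-value comparison implements the "maximum step, timestamp, and value" tie-breaking in one place.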

Regarding the database size problem, it might make sense to log metric values less frequently. Perhaps logging a metric every 10-20 steps is sufficient if the aggregate number of steps per metric is very large. Alternatively, we recommend using a scalable, SQLAlchemy-compatible database in order to meet storage capacity requirements.

It would be useful to know how many distinct metrics you're logging for each run and how many timeseries ("step") entries you're logging per metric. Can you provide this information?

@jdlesage
Contributor Author

Agreed. I will soon open a PR to create the separate table and compare performance.

We also advised our teams to reduce the logging frequency. It would be nice to implement something that enforces this policy (sampling at the end of the run?). That could be another nice PR.
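A minimal sketch of such a down-sampling policy, in plain Python with hypothetical names (where it would hook into MLflow is not decided here): keep every Nth step plus the final one, so the latest value always survives.

```python
def downsample(history, every=10):
    """Keep every Nth (step, value) point plus the final one."""
    if not history:
        return []
    kept = [point for i, point in enumerate(history) if i % every == 0]
    if kept[-1] is not history[-1]:
        kept.append(history[-1])  # never drop the latest value
    return kept

# 2500 steps for one metric, matching the scale reported in this thread
history = [(step, 1.0 / (step + 1)) for step in range(2500)]
sampled = downsample(history, every=10)
print(len(sampled))  # → 251
```

A 10x reduction brings 2500 steps down to 251 stored points per metric while preserving both the overall curve shape and the exact final value.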

FYI, the "biggest" run in our database has:

  • 2500 steps per metric
  • 25 distinct metrics per run

Our database is currently MariaDB (version 10.3.11).

@t-henri
Contributor

t-henri commented Jul 18, 2019

Hello,

I am working with @jdlesage on our MLflow server. I was looking into this issue and found two separate causes for the slowness of the UI:

  • There is a for loop in the Python code that keeps only the latest metrics for a given run (in mlflow/store/dbmodels/models.py), and when we list runs we spend around 75% of the time in this loop. In our case, the page takes 7 seconds to load, and 5.5 of those seconds are spent in this loop.

  • We issue a lot of SQL queries: to load the page, we make one query to get the list of runs, then three queries per run (one for its metrics, one for its tags, and one for its params). So with 100 runs, that is 301 queries to the database.

I think we could greatly improve this by replacing all those queries and the Python loop with a single SQL JOIN, with the right filter to keep only the metrics at the last step.

What do you think?
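A sketch of that greatest-n-per-group query, using stdlib sqlite3 with simplified table names (the real MLflow schema and dialect differ): a self-join against a per-(run, key) MAX(step) subquery returns all latest metric values for the whole page in one statement, replacing both the per-run queries and the Python loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL, step INTEGER)")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", [
    ("run1", "loss", 0.9, 0), ("run1", "loss", 0.2, 1),
    ("run1", "acc",  0.5, 0), ("run1", "acc",  0.8, 1),
    ("run2", "loss", 0.7, 0),
])

# One query for the whole page: for each (run_id, key), keep only the row
# whose step equals that group's maximum step.
latest = conn.execute("""
    SELECT m.run_id, m.key, m.value, m.step
    FROM metrics m
    JOIN (SELECT run_id, key, MAX(step) AS max_step
          FROM metrics GROUP BY run_id, key) last
      ON m.run_id = last.run_id AND m.key = last.key AND m.step = last.max_step
    ORDER BY m.run_id, m.key
""").fetchall()
print(latest)
# → [('run1', 'acc', 0.8, 1), ('run1', 'loss', 0.2, 1), ('run2', 'loss', 0.7, 0)]
```

With 100 runs this turns 301 round-trips plus a Python filtering loop into a single query that the database can plan and index efficiently.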

@jdlesage
Contributor Author

jdlesage commented Aug 2, 2019

@t-henri's PR fixes the problem. I am closing the issue.
