
MLflow UI slows down with the number of timeseries metrics #1571

Closed
jdlesage opened this issue Jul 11, 2019 · 4 comments

@jdlesage
Contributor

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7
  • MLflow installed from (source or binary): source
  • MLflow version (run mlflow --version): 1.0.0
  • Python version: 3.6
  • npm version (if running the dev UI): N/A
  • Exact command to reproduce:

Describe the problem

On our server, we log runs that use timeseries metrics (metrics that use the step parameter). This leads to more than 10K metric entries per run in the metrics table.
Runs like these slow the server down considerably: to compute the list of runs, MLflow must join the runs table with the metrics table, which becomes huge.
Beyond the performance issue, this also inflates the size of our database. The current design will hardly scale with the number of runs.

Suggestions

1/ A simple solution would be to limit the number of steps that can be registered per experiment. => Users could see this as a limitation.

2/ Change the DB schema: store timeseries metrics in a separate table and don't use it in the experiment listing. => This would not fix the database size problem.

3/ Store timeseries another way, for example in the run artifacts. => This is much more complex to implement and changes the design of MLflow.

What are your recommendations? Thanks!

@dbczumar
Collaborator

@jdlesage Thank you for raising this concern. Regarding the ListRuns performance issue with a large number of metrics, it seems reasonable to create a new database table containing the latest metric for each run (e.g., the metric with the maximum step, timestamp, and value); ListRuns can then read from this table without fetching all metrics for each run. LogMetric can then write to both tables: the table consisting of metric timeseries and the table consisting of latest metric values. We would be more than happy to review and merge a PR that implements this proposal!
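A minimal sketch of this proposal, using stdlib sqlite3 with simplified, made-up table and column names (the real MLflow schema and SQL dialect differ): `log_metric` writes to both the full timeseries table and a per-(run, key) `latest_metrics` table, so a listing query never has to scan the timeseries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL, step INTEGER, timestamp INTEGER);
CREATE TABLE latest_metrics (run_id TEXT, key TEXT, value REAL, step INTEGER, timestamp INTEGER,
                             PRIMARY KEY (run_id, key));
""")

def log_metric(run_id, key, value, step, timestamp):
    """Write to both tables: the full timeseries and the per-(run, key) latest value."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
                 (run_id, key, value, step, timestamp))
    # Keep only the entry with the greatest (step, timestamp, value), as proposed above.
    conn.execute("""
        INSERT INTO latest_metrics VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(run_id, key) DO UPDATE SET
            value = excluded.value, step = excluded.step, timestamp = excluded.timestamp
        WHERE (excluded.step, excluded.timestamp, excluded.value)
              > (latest_metrics.step, latest_metrics.timestamp, latest_metrics.value)
    """, (run_id, key, value, step, timestamp))

for s, v in enumerate([0.9, 0.5, 0.1]):
    log_metric("run1", "loss", v, s, 1000 + s)

# ListRuns can now read the latest value without fetching all metrics for each run.
row = conn.execute("SELECT value, step FROM latest_metrics WHERE run_id = 'run1'").fetchone()
print(row)  # → (0.1, 2)
```

The upsert (`ON CONFLICT ... DO UPDATE`) keeps the write path to a single statement per table; the row-value comparison implements the "maximum step, timestamp, and value" tie-breaking in one place.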

Regarding the database size problem, it might make sense to log metric values less frequently. Perhaps logging a metric every 10-20 steps is sufficient if the aggregate number of steps per metric is very large. Alternatively, we recommend using a scalable, SQLAlchemy-compatible database in order to meet storage capacity requirements.

It would be useful to know how many distinct metrics you're logging for each run and how many timeseries ("step") entries you're logging per metric. Can you provide this information?

@jdlesage
Contributor Author

Agreed. I will soon open a PR to create the separate table and compare performance.

We also advised our teams to reduce the logging frequency. It would be nice to implement something that enforces this policy (sampling at the end of the run?). That could be another nice PR.
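A minimal sketch of such a down-sampling policy, in plain Python with hypothetical names (where it would hook into MLflow is not decided here): keep every Nth step plus the final one, so the latest value always survives.

```python
def downsample(history, every=10):
    """Keep every Nth (step, value) point plus the final one."""
    if not history:
        return []
    kept = [point for i, point in enumerate(history) if i % every == 0]
    if kept[-1] is not history[-1]:
        kept.append(history[-1])  # never drop the latest value
    return kept

# 2500 steps for one metric, matching the scale reported in this thread
history = [(step, 1.0 / (step + 1)) for step in range(2500)]
sampled = downsample(history, every=10)
print(len(sampled))  # → 251
```

A 10x reduction brings 2500 steps down to 251 stored points per metric while preserving both the overall curve shape and the exact final value.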

FYI, the "biggest" run in our database has:

  • 2500 steps per metric
  • 25 distinct metrics per run

Our database is currently MariaDB (version 10.3.11).

@t-henri
Contributor

t-henri commented Jul 18, 2019

Hello,

I am working with @jdlesage on our MLflow server. I was looking into this issue and found two separate causes for the slowness of the UI:

  • There is a for loop in the Python code that keeps only the latest metrics for a given run (in mlflow/store/dbmodels/models.py), and when we list runs we spend around 75% of the time in this loop. In our case, the page takes 7 seconds to load, and 5.5 of those seconds are spent in this loop.

  • We issue a lot of SQL queries: to load the page, we make one query to get the list of runs, then three queries per run (one for its metrics, one for its tags, and one for its params). So with 100 runs, that is 301 queries to the database.

I think we could greatly improve this by replacing all those queries and the Python loop with a single SQL JOIN, with the right filter to keep only the metrics at the last step.

What do you think?
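A sketch of that greatest-n-per-group query, using stdlib sqlite3 with simplified table names (the real MLflow schema and dialect differ): a self-join against a per-(run, key) MAX(step) subquery returns all latest metric values for the whole page in one statement, replacing both the per-run queries and the Python loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL, step INTEGER)")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", [
    ("run1", "loss", 0.9, 0), ("run1", "loss", 0.2, 1),
    ("run1", "acc",  0.5, 0), ("run1", "acc",  0.8, 1),
    ("run2", "loss", 0.7, 0),
])

# One query for the whole page: for each (run_id, key), keep only the row
# whose step equals that group's maximum step.
latest = conn.execute("""
    SELECT m.run_id, m.key, m.value, m.step
    FROM metrics m
    JOIN (SELECT run_id, key, MAX(step) AS max_step
          FROM metrics GROUP BY run_id, key) last
      ON m.run_id = last.run_id AND m.key = last.key AND m.step = last.max_step
    ORDER BY m.run_id, m.key
""").fetchall()
print(latest)
# → [('run1', 'acc', 0.8, 1), ('run1', 'loss', 0.2, 1), ('run2', 'loss', 0.7, 0)]
```

With 100 runs this turns 301 round-trips plus a Python filtering loop into a single query that the database can plan and index efficiently.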

@jdlesage
Contributor Author

jdlesage commented Aug 2, 2019

@t-henri's PR fixes the problem. I am closing the issue.
