MLFlow UI takes very long time to load experiments #1517

mbelang · 2019-06-27T17:52:25Z

System information

mlflow 1.0.0
python 3.6.8
mlflow server running in kubernetes EKS behind ELB
experiments stored on EBS volume
Current gunicorn configuration:

[2019-06-27 17:32:42 +0000] [13] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 4
  worker_class: gevent
  threads: 3
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 300
  graceful_timeout: 30
  keepalive: 300
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: DEBUG
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: mlflow.server:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7fc2cfbff158>
  on_reload: <function OnReload.on_reload at 0x7fc2cfbff268>
  when_ready: <function WhenReady.when_ready at 0x7fc2cfbff378>
  pre_fork: <function Prefork.pre_fork at 0x7fc2cfbff488>
  post_fork: <function Postfork.post_fork at 0x7fc2cfbff598>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fc2cfbff6a8>
  worker_int: <function WorkerInt.worker_int at 0x7fc2cfbff7b8>
  worker_abort: <function WorkerAbort.worker_abort at 0x7fc2cfbff8c8>
  pre_exec: <function PreExec.pre_exec at 0x7fc2cfbff9d8>
  pre_request: <function PreRequest.pre_request at 0x7fc2cfbffae8>
  post_request: <function PostRequest.post_request at 0x7fc2cfbffb70>
  child_exit: <function ChildExit.child_exit at 0x7fc2cfbffc80>
  worker_exit: <function WorkerExit.worker_exit at 0x7fc2cfbffd90>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fc2cfbffea0>
  on_exit: <function OnExit.on_exit at 0x7fc2cfc0d048>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []

Describe the problem

We have some experiments with a lot of runs (over 400) and the UI takes forever to load all the runs. For instance, loading 400 runs takes around 2m15s. I had to bump the AWS ELB idle timeout to 300s to get that page to load. I saw that the UI makes a single AJAX call that loads all runs at once.

Would it be possible to have lazy loading of the runs instead of loading them all at the same time?


mateiz · 2019-06-27T18:15:42Z

Have you tried running this outside EKS, ELB, etc? It's hard to tell what is adding the overhead here. For better performance, we also recommend the SQL store instead of the file store (https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers).
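
One way to separate server time from browser rendering time is to time the runs search call directly, bypassing the UI. The following is a minimal sketch, not an official MLflow tool. It assumes the server exposes the REST endpoint `/api/2.0/mlflow/runs/search` and accepts `experiment_ids` and `max_results` in the JSON body; check both against the REST API docs for your MLflow version. The function names are hypothetical.

```python
import json
import time
import urllib.request


def build_search_request(base_url, experiment_id, max_results=100):
    """Build a POST request for the runs search endpoint (path is an assumption)."""
    body = json.dumps({
        "experiment_ids": [str(experiment_id)],
        "max_results": max_results,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/2.0/mlflow/runs/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_request(request, opener=urllib.request.urlopen):
    """Send one request and return (elapsed_seconds, response_size_in_bytes)."""
    start = time.perf_counter()
    with opener(request) as response:
        payload = response.read()
    return time.perf_counter() - start, len(payload)
```

Calling `time_request(build_search_request("http://localhost:5000", 1))` against a local server and then against the load-balanced URL gives two numbers to compare. A large gap points at the network path, while a slow local call points at the tracking store.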

mbelang · 2019-06-27T18:48:35Z

Yes I did and all works fine but I never tried with that amount of runs. I will do that and report back.

I also saw that mlflow 1.0.0 supports the SQL store, which I like better than the file store. Is there a guide for migrating data from the file store to the SQL store?

mateiz · 2019-06-27T19:32:56Z

Unfortunately there's no migration tool, but you could use the lower-level Python API to move data over (https://mlflow.org/docs/latest/python_api/mlflow.tracking.html). We would like to support a migration tool in the repo actually if someone's interested in building one.
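
As a rough illustration of what moving data with the lower-level Python API could look like, here is a minimal sketch that copies a single run. It is not the migration script discussed in this thread. It assumes `src` and `dst` are `mlflow.tracking.MlflowClient` instances pointing at the old and new tracking stores, and that `run` is a run object returned by the source client. The function name `copy_run` is hypothetical, and artifacts are not copied.

```python
def copy_run(src, dst, run, dst_experiment_id):
    """Copy one run's params, tags and metric history from src to dst.

    Assumption: src and dst behave like mlflow.tracking.MlflowClient.
    Returns the ID of the newly created run in the destination store.
    """
    new_run = dst.create_run(
        dst_experiment_id,
        start_time=run.info.start_time,
        tags=dict(run.data.tags),
    )
    new_id = new_run.info.run_id

    for key, value in run.data.params.items():
        dst.log_param(new_id, key, value)

    # run.data.metrics holds only the latest value per key,
    # so the full history is fetched from the source for each metric.
    for key in run.data.metrics:
        for m in src.get_metric_history(run.info.run_id, key):
            dst.log_metric(new_id, key, m.value, timestamp=m.timestamp, step=m.step)

    # Preserve the original status and end time.
    dst.set_terminated(new_id, status=run.info.status, end_time=run.info.end_time)
    return new_id
```

A full migration would loop over the experiments and runs of the source store, create matching experiments in the destination, and call this function for each run.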

mbelang · 2019-06-27T19:51:09Z

OK, I just did some local testing.
I copied all the production data to my local machine and started the container locally. Loading is faster than in the Kubernetes deployment: 30s vs 2m15s. It is still quite long IMHO, as it loads all the data (in my case 77 MB) in one shot.

The main difference between my production deployment and my machine is that locally I avoid the network transfer from EBS volume -> container -> browser.

I do not know if moving to the SQL store will really help speed things up. Do you have a benchmark comparing the file store vs SQL?

mateiz · 2019-06-27T20:00:25Z

Yeah, we published some benchmarks for the SQL store here: https://databricks.com/blog/2019/03/28/mlflow-v0-9-0-features-sql-backend-projects-in-docker-and-customization-in-python-models.html. File systems tend to get slow once you have to go through thousands of directories, like our tracking store has to. It does depend on the file system and underlying media somewhat, but this particular benchmark is on a very fast local SSD and we see a big difference even there.

mbelang · 2019-06-27T20:14:40Z

Will migrate to SQL and since we will need to write a migration script we will contribute it at the same time.

mateiz · 2019-06-27T20:15:49Z

That would be awesome, thanks!

amesar · 2019-07-02T02:44:45Z

Here's a b/w utility to export from one MLflow server and import to another.

https://github.com/amesar/mlflow-fun/tree/master/tools/core#exportimport-run

mbelang · 2019-07-03T19:15:45Z

Thx @amesar for your script. I modified it a bit so that a single invocation transfers all experiments and runs.

@mateiz the DB setup is a really good improvement in terms of speed: 1 minute to load 500 runs instead of close to 3 minutes.

I do feel this is still quite long and the UI should benefit from lazy-loading/paging.

mateiz · 2019-07-04T10:09:40Z

Glad it helped. We are adding pagination right now actually and we'll reduce the default number of runs shown on the UI and enable paging instead.

BTW another thing I wanted to point out is that the "table" view on the UI is much slower than the "compact" view, and does not use react-virtualized. We're planning to get rid of it altogether in a future release, since you can always move some fields to be their own columns in the compact view. Try the compact view if you are using the table one right now.

mbelang · 2019-07-04T10:12:39Z

Great news. Will try that and thanks for the update.

Where would be the right place in the repo to contribute the migration script?

mbelang · 2019-07-05T12:21:45Z

FYI, by default it is compact view.

nachiketparanjape · 2019-07-19T06:46:18Z

Hi @mbelang were you able to get a resolution for this?

I am seeing a similar issue. Even though I have 1500 runs, my UI will only show "1000 latest runs".
I am also running the mlflow server with --gunicorn-opts "--timeout=120", as per the recommendations on #627.

mbelang · 2019-07-19T10:53:44Z

@nachiketparanjape as mentioned above, I managed to cut the loading time from 2m30s+ to about 1m by switching to a Postgres backend.

Also as mentioned above, they are implementing pagination, which to me is the real fix.

I didn't take any chances and set the gunicorn timeout to 300s. If you are deployed in the cloud, do not forget to bump the load balancer timeout as well.

nachiketparanjape · 2019-07-19T17:01:09Z

@mbelang I do have the setup in cloud with a postgres RDS backend. Takes me around 10s to load 300 runs in an experiment with 120s gunicorn timeout.

I think pagination would probably help the current issue I am facing : Unable to load more than 1000 runs. Thanks for the input!

Zangr · 2019-07-27T00:14:41Z

Mlflow 1.1 will load the initial 100 runs and show a "Load more" button when you scroll to the end of the table.
#1564

mbelang · 2019-07-29T14:38:48Z

Still takes a solid 30 seconds to load those 100 runs and scrolling to the end doesn't bring the Load More button.

mateiz · 2019-07-29T17:51:12Z

Maxime, I think this patch to the SQL storage backend may also help: #1660. Do you want to try that out? You can comment on there if you find an issue with it.

mbelang · 2019-07-29T18:05:13Z

Cool, will try when I get the chance. Also, what about the Load More button that I do not see? I do not think it is related though, right?

mateiz · 2019-07-29T22:49:27Z

The button should appear if you are running MLflow 1.1 and you have more than 100 runs. Did you update your server to 1.1?

mbelang · 2019-07-29T23:51:08Z

Yes, it asked me to do a db upgrade, which I did, and I have more than 400 runs.

Zangr · 2019-07-30T00:39:17Z

@mbelang could you please check if the search run response with your initial 100 runs has a non-empty next_page_token field?

mbelang · 2019-07-30T11:39:07Z

@Zangr the next_page_token field is not empty.
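
For readers who want to see how the page token is meant to be used, here is a minimal sketch of a loop that keeps requesting pages until no token comes back. It assumes `client` is an `mlflow.tracking.MlflowClient` from a version whose `search_runs` accepts `max_results` and `page_token` and returns a list-like page with a `token` attribute; verify this against the Python API docs for your version. The helper name `iter_all_runs` is hypothetical.

```python
def iter_all_runs(client, experiment_id, page_size=100):
    """Yield every run of an experiment, fetching one page at a time.

    Assumption: client.search_runs returns a list-like object whose
    `token` attribute is empty or None on the last page.
    """
    token = None
    while True:
        page = client.search_runs(
            experiment_ids=[experiment_id],
            max_results=page_size,
            page_token=token,
        )
        yield from page
        token = getattr(page, "token", None)
        if not token:
            break
```

Because this is a generator, a caller can stop after the first page or two, which is the same idea as the "Load more" button: only pay for the runs that are actually looked at.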

Zangr · 2019-08-06T16:58:03Z

Thanks @mbelang! Looks like "Load more" is not showing in windows/IE. We are investigating the issue. For now, mac/chrome can be a workaround.
#1694

diadochos · 2019-08-31T02:38:43Z

Is the "Load more" functionality also supposed to work when the runs are nested?
I am using mlflow==1.2.0 with sqlite, and it seems to timeout.

mbelang · 2019-09-17T18:06:50Z

@diadochos you have to scroll down to the bottom; after the first 100 runs the button will appear.

@Zangr With 1.2.0, the Load More button appears but it is still very very slow to load even 100 runs. FYI, I was on Linux and Chromium.

stale · 2019-10-08T18:27:28Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mbelang · 2019-10-08T19:20:55Z

Just updated to 1.3.0 and it is loading like bullet train!

Thank you. I consider this issue as fixed.

JeremieMelo · 2020-09-21T00:37:22Z

With MLflow 1.9.1 it is still super slow, with just 8 runs and very few params and metrics. It takes almost a minute to react to every click.

simonhessner · 2021-04-22T21:02:31Z

I have around 850 runs and it takes 2-3 minutes to load, sometimes I even get an INTERNAL_SERVER_ERROR. Is there anything I can do to speed it up? I am using the default mlflow setup

LeandroMoura3 · 2021-04-23T00:13:54Z

Is it trivial to make a build of MLflow 1.3 and run it in a local environment if the user is a beginner in Python? If not, it is really necessary that the 1.3 version mentioned above makes it into a release. So this issue isn't closed.

alam-shahul · 2023-08-20T23:18:45Z

Can this be fixed for local MLflow experiments? I feel that it shouldn't be this slow.

mbelang mentioned this issue Jul 8, 2019

Tracking UI breaks when there are too many runs for a single experiment #627

Closed

Zangr added the area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server label Jul 27, 2019

Zangr self-assigned this Jul 27, 2019

stale bot added the stale label Oct 8, 2019

mbelang closed this as completed Oct 8, 2019

e-dorigatti mentioned this issue Jan 12, 2024

We'd like your feedback on our MLflow experience! #10329

Open
