MLFlow UI takes very long time to load experiments #1517

Closed
mbelang opened this issue Jun 27, 2019 · 32 comments
Assignees
Labels
area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server stale

Comments

@mbelang

mbelang commented Jun 27, 2019

System information

  • mlflow 1.0.0
  • python 3.6.8
  • mlflow server running in kubernetes EKS behind ELB
  • experiments stored on EBS volume
  • Current gunicorn configuration:
[2019-06-27 17:32:42 +0000] [13] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 4
  worker_class: gevent
  threads: 3
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 300
  graceful_timeout: 30
  keepalive: 300
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: DEBUG
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: mlflow.server:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7fc2cfbff158>
  on_reload: <function OnReload.on_reload at 0x7fc2cfbff268>
  when_ready: <function WhenReady.when_ready at 0x7fc2cfbff378>
  pre_fork: <function Prefork.pre_fork at 0x7fc2cfbff488>
  post_fork: <function Postfork.post_fork at 0x7fc2cfbff598>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fc2cfbff6a8>
  worker_int: <function WorkerInt.worker_int at 0x7fc2cfbff7b8>
  worker_abort: <function WorkerAbort.worker_abort at 0x7fc2cfbff8c8>
  pre_exec: <function PreExec.pre_exec at 0x7fc2cfbff9d8>
  pre_request: <function PreRequest.pre_request at 0x7fc2cfbffae8>
  post_request: <function PostRequest.post_request at 0x7fc2cfbffb70>
  child_exit: <function ChildExit.child_exit at 0x7fc2cfbffc80>
  worker_exit: <function WorkerExit.worker_exit at 0x7fc2cfbffd90>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fc2cfbffea0>
  on_exit: <function OnExit.on_exit at 0x7fc2cfc0d048>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []

Describe the problem

We have some experiments with many runs (over 400), and the UI takes a very long time to load all of them. For instance, loading 400 runs takes around 2m15s. I had to bump the AWS ELB idle timeout to 300s to get that page to load. I saw that the UI makes a single AJAX call to load all runs at once.

Would it be possible to have lazy loading of the runs instead of loading them all at the same time?

@mateiz
Contributor

mateiz commented Jun 27, 2019

Have you tried running this outside EKS, ELB, etc.? It's hard to tell what is adding the overhead here. For better performance, we also recommend the SQL store instead of the file store (https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers).

@mbelang
Author

mbelang commented Jun 27, 2019

Yes, I did, and everything works fine, but I never tried with that number of runs. I will do that and report back.

I also saw that mlflow 1.0.0 supports the SQL store, which I like better than the file store. Is there a guide for migrating data from the file store to the SQL store?

@mateiz
Contributor

mateiz commented Jun 27, 2019

Unfortunately there's no migration tool, but you could use the lower-level Python API to move data over (https://mlflow.org/docs/latest/python_api/mlflow.tracking.html). We would like to support a migration tool in the repo actually if someone's interested in building one.
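
For readers wondering what moving data over with the lower-level API might look like, here is a minimal sketch, not an official migration tool. It assumes `src` and `dst` are `mlflow.tracking.MlflowClient` instances pointed at the old and new tracking URIs, that `run` is a run object read from `src`, and that the target experiment already exists in `dst`. The function name `copy_run` is hypothetical. Client method signatures have changed between MLflow releases, so check them against the installed version. Artifacts are not copied, and system tags are copied unfiltered.

```python
def copy_run(src, dst, run, dst_experiment_id):
    """Copy one run's params, tags, and metric history from src to dst.

    Assumption: src and dst behave like mlflow.tracking.MlflowClient,
    and run is a Run object obtained from src.
    """
    new_run = dst.create_run(dst_experiment_id, start_time=run.info.start_time)
    new_id = new_run.info.run_id

    for key, value in run.data.params.items():
        dst.log_param(new_id, key, value)

    for key, value in run.data.tags.items():
        dst.set_tag(new_id, key, value)

    for key in run.data.metrics:
        # run.data.metrics only holds the latest value per key,
        # so the full history is fetched from the source store.
        for m in src.get_metric_history(run.info.run_id, key):
            dst.log_metric(new_id, key, m.value, m.timestamp, m.step)

    # Mark the copy with the original status and end time.
    dst.set_terminated(new_id, status=run.info.status, end_time=run.info.end_time)
    return new_id
```

A full migration would loop over every experiment and run in the source store and call this function once per run.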

@mbelang
Author

mbelang commented Jun 27, 2019

OK, I just did some local testing.
I copied all the production data to my local machine and started the container locally. Loading is faster than on the Kubernetes deployment: 30s vs 2m15s. That is still quite long IMHO, as it loads all the data (in my case 77 MB) in one shot.

The main difference between my production deployment and my machine is that locally I save the network transfer from EBS volume -> container -> browser.

I do not know if moving to the SQL store will really help speed things up. Do you have a benchmark comparing the file store vs SQL?

@mateiz
Contributor

mateiz commented Jun 27, 2019

Yeah, we published some benchmarks for the SQL store here: https://databricks.com/blog/2019/03/28/mlflow-v0-9-0-features-sql-backend-projects-in-docker-and-customization-in-python-models.html. File systems tend to get slow once you have to go through thousands of directories, like our tracking store has to. It does depend on the file system and underlying media somewhat, but this particular benchmark is on a very fast local SSD and we see a big difference even there.
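
To get a rough feel for how many directories and files a file-store root contains, a small sketch like the following can help. It assumes the file store keeps one directory per experiment and per run, with separate small files for params, metrics, and tags; that layout is an assumption here and may differ between versions. `count_store_files` is a hypothetical helper, not part of MLflow.

```python
import os


def count_store_files(root):
    """Return (directories, files) found anywhere under a file-store root."""
    n_dirs = 0
    n_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files
```

Calling it on the directory passed as the backend store (for example a local `mlruns` folder) gives the number of file-system entries that may have to be visited when every run of an experiment is listed.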

@mbelang
Author

mbelang commented Jun 27, 2019

Will migrate to SQL and since we will need to write a migration script we will contribute it at the same time.

@mateiz
Contributor

mateiz commented Jun 27, 2019

That would be awesome, thanks!

@amesar

amesar commented Jul 2, 2019

Here's a b/w utility to export from one MLflow server and import to another.

https://github.com/amesar/mlflow-fun/tree/master/tools/core#exportimport-run

@mbelang
Author

mbelang commented Jul 3, 2019

Thanks @amesar for your script. I modified it a bit so that a single invocation transfers all experiments and runs.

@mateiz the DB setup is a really good improvement in terms of speed: 1 minute to load 500 runs instead of close to 3 minutes.

I do feel this is still quite long, and the UI would benefit from lazy loading/paging.

@mateiz
Contributor

mateiz commented Jul 4, 2019

Glad it helped. We are adding pagination right now actually and we'll reduce the default number of runs shown on the UI and enable paging instead.

BTW another thing I wanted to point out is that the "table" view on the UI is much slower than the "compact" view, and does not use react-virtualized. We're planning to get rid of it altogether in a future release, since you can always move some fields to be their own columns in the compact view. Try the compact view if you are using the table one right now.

@mbelang
Author

mbelang commented Jul 4, 2019

Great news. Will try that, and thanks for the update.

Where would be the right place in the repo to contribute the migration script?

@mbelang
Author

mbelang commented Jul 5, 2019

FYI, by default it is compact view.

@nachiketparanjape

Hi @mbelang were you able to get a resolution for this?

I am seeing a similar issue. Even though I have 1500 runs, my UI will only show "1000 latest runs".
I am also running the mlflow server with --gunicorn-opts "--timeout=120", as recommended in #627.

@mbelang
Author

mbelang commented Jul 19, 2019

@nachiketparanjape as mentioned above, I managed to cut the loading time from 2m30s+ to about 1m by switching to a Postgres backend.

Also as mentioned above, they are implementing pagination, which to me is the real fix.

I didn't take any chances and set the gunicorn timeout to 300s. If you are deployed in the cloud, do not forget to bump the load balancer timeout as well.

@nachiketparanjape

@mbelang I do have the setup in the cloud with a Postgres RDS backend. It takes around 10s to load 300 runs in an experiment with a 120s gunicorn timeout.

I think pagination would probably help with the issue I am currently facing: being unable to load more than 1000 runs. Thanks for the input!

@Zangr Zangr added the area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server label Jul 27, 2019
@Zangr Zangr self-assigned this Jul 27, 2019
@Zangr
Contributor

Zangr commented Jul 27, 2019

MLflow 1.1 will load the initial 100 runs and show a "Load more" button when you scroll to the end of the table.
#1564

@mbelang
Author

mbelang commented Jul 29, 2019

It still takes a solid 30 seconds to load those 100 runs, and scrolling to the end doesn't bring up the Load More button.

@mateiz
Contributor

mateiz commented Jul 29, 2019

Maxime, I think this patch to the SQL storage backend may also help: #1660. Do you want to try that out? You can comment on there if you find an issue with it.

@mbelang
Author

mbelang commented Jul 29, 2019

Cool, will try when I get the chance. As for the Load More button that I do not see, I do not think it is related, right?

@mateiz
Contributor

mateiz commented Jul 29, 2019

The button should appear if you are running MLflow 1.1 and you have more than 100 runs. Did you update your server to 1.1?

@mbelang
Author

mbelang commented Jul 29, 2019

Yes, it asked me to do a DB upgrade, which I did, and I have more than 400 runs.

@Zangr
Contributor

Zangr commented Jul 30, 2019

@mbelang could you please check if the search run response with your initial 100 runs has a non-empty next_page_token field?
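
One way to check this outside the browser is to query the tracking server's REST API directly and follow the token page by page. The sketch below is an assumption-laden illustration: the endpoint path `/api/2.0/preview/mlflow/runs/search`, the request fields `experiment_ids`, `max_results`, and `page_token`, and the helper names `fetch_page` and `iter_pages` are not taken from this thread and should be verified against the REST API docs for the server version in use.

```python
import json
import urllib.request


def fetch_page(base_url, experiment_id, page_token=None, max_results=100):
    """POST one runs/search request and return the decoded JSON body.

    Assumption: the server exposes the search endpoint at this path.
    """
    body = {"experiment_ids": [experiment_id], "max_results": max_results}
    if page_token:
        body["page_token"] = page_token
    req = urllib.request.Request(
        base_url + "/api/2.0/preview/mlflow/runs/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_pages(fetch, max_pages=50):
    """Yield (run_count, next_page_token) per page until the token is empty."""
    token = None
    for _ in range(max_pages):
        page = fetch(token)
        token = page.get("next_page_token")
        yield len(page.get("runs", [])), token
        if not token:
            break
```

With a hypothetical local server and experiment ID, `iter_pages(lambda t: fetch_page("http://localhost:5000", "1", t))` yields one tuple per page. A non-empty token on the first page means the server is offering more runs than the initial 100.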

@mbelang
Author

mbelang commented Jul 30, 2019

@Zangr the next_page_token field is not empty.

@Zangr
Contributor

Zangr commented Aug 6, 2019

Thanks @mbelang! It looks like "Load more" is not showing in Windows/IE. We are investigating the issue. For now, Mac/Chrome can be a workaround.
#1694

@diadochos

Is the "Load more" functionality also supposed to work when the runs are nested?
I am using mlflow==1.2.0 with sqlite, and it seems to time out.

@mbelang
Author

mbelang commented Sep 17, 2019

@diadochos you have to scroll down to the bottom; after the first 100 runs, the button will appear.

@Zangr With 1.2.0, the Load More button appears, but it is still very slow to load even 100 runs. FYI, I was on Linux and Chromium.

@stale

stale bot commented Oct 8, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 8, 2019
@mbelang
Author

mbelang commented Oct 8, 2019

Just updated to 1.3.0 and it is loading like a bullet train!

Thank you. I consider this issue as fixed.

@mbelang mbelang closed this as completed Oct 8, 2019
@JeremieMelo

With MLflow 1.9.1, it is still super slow with just 8 runs and very few params and metrics. It takes almost a minute to react to every click.

@simonhessner
Contributor

I have around 850 runs and it takes 2-3 minutes to load; sometimes I even get an INTERNAL_SERVER_ERROR. Is there anything I can do to speed it up? I am using the default mlflow setup.

@LeandroMoura3

Is it trivial to build MLflow 1.3 and run it in a local environment if the user is a beginner in Python? If not, the 1.3 version mentioned above really needs to go into a release, so this issue should not be closed.

@alam-shahul

Can this be fixed for local MLflow experiments? I feel that it shouldn't be this slow.


10 participants