MLFlow UI takes very long time to load experiments #1517
Comments
Have you tried running this outside EKS, ELB, etc.? It's hard to tell what is adding the overhead here. For better performance, we also recommend the SQL store instead of the file store (https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers).
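Switching stores is a server-side flag; a minimal sketch of the two setups (paths, the SQLite file name, and host/port are placeholders):

```shell
# Default file store: runs live under ./mlruns, one directory per run.
mlflow server --backend-store-uri ./mlruns --default-artifact-root ./mlruns

# SQL store: point the backend at a database URI instead (SQLite shown;
# PostgreSQL/MySQL URIs work the same way).
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 0.0.0.0 --port 5000
```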
Yes, I did, and everything works fine, but I never tried with that many runs. I will do that and report back. I also saw that MLflow 1.0.0 supports a SQL store, which I like better than the file store. Is there a migration guide for moving data from the file store to the SQL store?
Unfortunately there's no migration tool, but you could use the lower-level Python API to move data over (https://mlflow.org/docs/latest/python_api/mlflow.tracking.html). We would actually like to support a migration tool in the repo if someone's interested in building one.
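Lacking a built-in tool, such a script reduces to a read-and-replay loop: read each run from the source store, then replay its params, metrics, and tags against a fresh run in the destination store. A minimal sketch of that loop's shape, with plain dicts standing in for the client objects (the `plan_copy` helper is hypothetical scaffolding, not part of the MLflow API):

```python
# A real script would wrap this around mlflow.tracking.MlflowClient:
# one client for the source store, one for the destination, replaying
# log_param / log_metric / set_tag calls for every run it reads.

def plan_copy(run):
    """Turn one source run (params/metrics/tags as dicts) into the ordered
    client calls a migration script would replay on the destination."""
    calls = []
    for key, value in run.get("params", {}).items():
        calls.append(("log_param", key, value))
    for key, value in run.get("metrics", {}).items():
        calls.append(("log_metric", key, value))
    for key, value in run.get("tags", {}).items():
        calls.append(("set_tag", key, value))
    return calls

# Example: one run's worth of data, as a migration script might read it.
source_run = {
    "params": {"lr": "0.01"},
    "metrics": {"loss": 0.42},
    "tags": {"stage": "dev"},
}
replay = plan_copy(source_run)
```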
Ok, I just did some local testing. The main difference between my production deployment and my machine is that locally I avoid the network transfer from EBS volume -> container -> browser. I don't know whether moving to the SQL store will really speed things up. Do you have a benchmark comparing the file store vs. the SQL store?
Yeah, we published some benchmarks for the SQL store here: https://databricks.com/blog/2019/03/28/mlflow-v0-9-0-features-sql-backend-projects-in-docker-and-customization-in-python-models.html. File systems tend to get slow once you have to go through thousands of directories, as our tracking store has to. It depends somewhat on the file system and underlying media, but this particular benchmark is on a very fast local SSD and we see a big difference even there.
Will migrate to SQL, and since we will need to write a migration script, we will contribute it at the same time.
That would be awesome, thanks!
Here's a b/w utility to export from one MLflow server and import to another. https://github.com/amesar/mlflow-fun/tree/master/tools/core#exportimport-run
Thanks @amesar for your script. I modified it a bit to allow transferring all experiments and runs in a single pass. @mateiz the DB setup is a really good improvement in terms of speed: 1 minute to load 500 runs instead of close to 3 minutes. I still feel this is quite long, and the UI should benefit from lazy loading/paging.
Glad it helped. We are adding pagination right now actually and we'll reduce the default number of runs shown on the UI and enable paging instead. BTW another thing I wanted to point out is that the "table" view on the UI is much slower than the "compact" view, and does not use react-virtualized. We're planning to get rid of it altogether in a future release, since you can always move some fields to be their own columns in the compact view. Try the compact view if you are using the table one right now.
Great news. Will try that, and thanks for the update. Where would be the right place in the repo to contribute the migration script?
FYI, the compact view is the default.
@nachiketparanjape as mentioned above, I managed to cut the loading time from over 2m30s to about 1m by switching to a Postgres backend. Also as mentioned above, they are implementing pagination, which to me is the real fix. I didn't take any chances and set the gunicorn timeout to 300s. If you are deployed in the cloud, don't forget to bump the load balancer timeout as well.
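The timeout bump described above can be expressed as server flags; a sketch assuming a Postgres backend (the database URI, bucket name, and 300s value are placeholders; `--gunicorn-opts` passes options straight through to gunicorn):

```shell
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --gunicorn-opts "--timeout 300"
```

The load balancer's idle timeout is configured separately on the cloud side and should be at least as large as the gunicorn timeout, or the LB will cut the connection first.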
@mbelang I do have the setup in the cloud with a Postgres RDS backend. It takes me around 10s to load 300 runs in an experiment with a 120s gunicorn timeout. I think pagination would probably help with the current issue I am facing: being unable to load more than 1000 runs. Thanks for the input!
MLflow 1.1 will load the initial 100 runs and show a "Load more" button when you scroll to the end of the table.
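The "Load more" behavior boils down to token-based paging: fetch a page, keep the returned token, and pass it back to get the next page until the token comes back empty. A minimal sketch of that pattern, where `fetch_page` is a hypothetical stand-in for a paginated search endpoint (MLflow's search API similarly accepts a page token and a max-results cap):

```python
def fetch_page(items, page_size, token):
    """Return one page of items plus the token for the next page
    (None when the last page has been served). The integer token
    is an illustrative choice; real APIs use opaque strings."""
    start = token or 0
    page = items[start:start + page_size]
    next_token = start + page_size if start + page_size < len(items) else None
    return page, next_token

def fetch_all(items, page_size):
    """Drain every page, the way repeated 'Load more' clicks would."""
    out, token = [], None
    while True:
        page, token = fetch_page(items, page_size, token)
        out.extend(page)
        if token is None:
            return out
```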
Still takes a solid 30 seconds to load those 100 runs, and scrolling to the end doesn't bring up the "Load more" button.
Maxime, I think this patch to the SQL storage backend may also help: #1660. Do you want to try that out? You can comment on there if you find an issue with it. |
Cool, will try when I get the chance. Also, the "Load more" button still isn't appearing for me.
The button should appear if you are running MLflow 1.1 and you have more than 100 runs. Did you update your server to 1.1? |
Yes, it asked me to do a DB upgrade, which I did, and I have more than 400 runs.
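The schema upgrade mentioned here is a one-off CLI step against the tracking database; a sketch (the database URI is a placeholder):

```shell
# Upgrade the tracking database schema after bumping the MLflow version.
# Back up the database first before running the migration.
mlflow db upgrade postgresql://mlflow:password@db-host:5432/mlflow
```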
@mbelang could you please check whether the search-runs response for your initial 100 runs has a non-empty
@Zangr the |
Is the "Load more" functionality also supposed to work when the runs are nested? |
@diadochos you have to scroll down to the bottom; after the first 100 runs, the button will appear. @Zangr With 1.2.0, the
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Just updated to 1.3.0 and it is loading like a bullet train! Thank you. I consider this issue fixed.
On MLflow 1.9.1 it is still super slow: just 8 runs with very few params and metrics, and it takes almost a minute to react to every click.
I have around 850 runs and it takes 2-3 minutes to load; sometimes I even get an INTERNAL_SERVER_ERROR. Is there anything I can do to speed it up? I am using the default MLflow setup.
Is it trivial to build MLflow 1.3 and run it in a local environment if the user is a beginner in Python? If not, the 1.3 version mentioned above really needs to go into a release. So this issue isn't closed.
Can this be fixed for local MLflow experiments? I feel that it shouldn't be this slow. |
System information
Describe the problem
We have some experiments with a lot of runs (over 400) and the UI takes forever to load them all. For instance, loading 400 runs takes around 2m15s. I had to bump the AWS ELB idle timeout to 300s to get that page to load at all. I saw that the UI makes a single Ajax call to load all runs at once.
Would it be possible to have lazy loading of the runs instead of loading them all at the same time?