
Performance for accessing run data, FileSystem backend, caching, other workarounds? #1902

Closed
rquintino opened this issue Oct 3, 2019 · 5 comments

@rquintino

Hi, we're currently exploring options to address the main issue we have with mlflow right now: performance when reading run data (filesystem backend) :(. We imagine it is highly related to the way mlflow stores data, e.g. every single param and metric goes into a separate file. Depending on the project we can track hundreds of data points per run, so we end up with an mlruns folder of 50k–100k files, even for a modest number of runs.

We are also considering SQL backends, but we wonder whether some workarounds might be possible even with the filesystem backend, like caching or aggregating, especially for stale/closed runs.

This, together with the fact that we can't use the mlflow UI to compare runs across different experiments (we use one backend per project), greatly reduces the usefulness of the mlflow UI for projects with a large number of runs. We end up producing a huge dataframe (which takes around ~1 min to update for a few thousand runs), which is itself cached and refreshed only on an as-needed basis.

That is blazingly fast, but we lose the mlflow UI, so we were wondering whether some kind of caching could be enabled for the mlflow server/UI, refreshed periodically, something like that.
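For illustration, the aggregate-and-cache idea can be sketched with nothing but the standard library. This is a hypothetical sketch, not mlflow's own API: it assumes a FileStore-like layout of `mlruns/<experiment>/<run>/params/<name>` with one value per file, collapses all params into a single JSON file, and rebuilds that file only when something under `mlruns/` is newer than the cache.

```python
# Sketch of the caching workaround: aggregate the many per-param files of a
# FileStore-style layout into one JSON file, rebuilt only when stale.
# The layout (mlruns/<experiment>/<run>/params/<name>) mirrors mlflow's
# FileStore, but this code is independent of mlflow itself.
import json
from pathlib import Path

def build_cache(mlruns: Path, cache_file: Path) -> dict:
    """Aggregate every run's params into one dict and persist it as JSON."""
    runs = {}
    for params_dir in mlruns.glob("*/*/params"):
        run_id = params_dir.parent.name
        runs[run_id] = {p.name: p.read_text().strip()
                        for p in params_dir.iterdir() if p.is_file()}
    cache_file.write_text(json.dumps(runs))
    return runs

def load_runs(mlruns: Path, cache_file: Path) -> dict:
    """Serve from the cache unless some file under mlruns/ is newer."""
    if cache_file.exists():
        cache_mtime = cache_file.stat().st_mtime
        if all(p.stat().st_mtime <= cache_mtime for p in mlruns.rglob("*")):
            return json.loads(cache_file.read_text())
    return build_cache(mlruns, cache_file)
```

Reading one JSON file instead of tens of thousands of small files is where the speedup comes from; the mtime scan keeps the cache honest for runs that are still being written to.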

thanks!

@sabman

sabman commented Oct 4, 2019

Can you add a tag with the date, or an expiry date, and then filter by that?
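The filtering half of that suggestion might look like the sketch below. The `expiry` tag name and ISO date format are assumptions, and the runs are plain dicts rather than mlflow objects; with mlflow itself the tag would be set at logging time (e.g. via `mlflow.set_tag`) and the filtering done in the UI's search box.

```python
# Sketch of filtering runs by a hypothetical "expiry" tag (ISO date string).
from datetime import date

def active_runs(runs: list, today: date) -> list:
    """Keep only runs whose 'expiry' tag has not passed; untagged runs stay."""
    return [r for r in runs
            if date.fromisoformat(r["tags"].get("expiry", "9999-12-31")) >= today]
```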

@ankmathur96
Contributor

Have you tried using the SQL store with a MySQL backend?
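For reference, pointing the tracking server at a SQL store is a matter of the `--backend-store-uri` flag; the host, database name, credentials, and bucket below are placeholders:

```shell
# Placeholder host/database/credentials and bucket -- substitute your own.
# Requires a MySQL driver in the server's environment, e.g.: pip install pymysql
mlflow server \
  --backend-store-uri mysql+pymysql://mlflow_user:mlflow_pass@db.example.com/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000
```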

@DoktorMike

I also find retrieval quite slow, and it sometimes even times out, resulting in an "Internal Server Error". Even as few as 10 runs in an experiment can sometimes time out. I'm also using S3 for artifact storage. I could switch to a MySQL backend, but that shouldn't really matter for something as small as 10 runs, right? So there must be another reason this is so slow. I'll post here if I come up with a solution.

@stale

stale bot commented Nov 4, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 4, 2019
@stale

stale bot commented Jan 3, 2020

This issue has not been interacted with for 60 days since it was marked as stale, so it is now closed. If you are still affected or have more information, please re-open the issue!


4 participants