
Mlflow works extremely slow with PostgreSQL #1763

Closed
DimaZhu opened this issue Aug 20, 2019 · 19 comments

@DimaZhu

DimaZhu commented Aug 20, 2019

Hi, everyone. We started using MLflow several weeks ago and decided to use PostgreSQL to speed up processing. But now an experiment with ~300 runs takes effectively forever to load and exceeds any reasonable time limit. Then I updated to v1.1 and it started to work better (~2 minutes), but that is still painfully slow. I looked in the database and found out that there are 5500 runs. Could that be the reason MLflow works so slowly? And if so, is there an API to clean deleted runs from the db?

@dbczumar
Collaborator

@DimaZhu Can you try this again on MLflow 1.2 and confirm that the issue persists? The latest release contains some performance improvements for the SQLAlchemyStore.

Are you logging a large number of metrics to MLflow? The 1.2 release should help significantly in that case.
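
If many small log_metric/log_param calls are part of the slowdown, one option (just a sketch, not something verified in this thread; the experiment ID and values below are placeholders) is to batch them with MlflowClient.log_batch so a run's metrics and params go to the store in a single request:

import time
from mlflow.tracking import MlflowClient
from mlflow.entities import Metric, Param

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the tracking server
run = client.create_run(experiment_id="0")  # "0" is a placeholder experiment ID
now_ms = int(time.time() * 1000)

# Send ~50 metrics and ~40 params in one round trip instead of one call each.
metrics = [Metric(key=f"metric_{i}", value=float(i), timestamp=now_ms, step=0) for i in range(50)]
params = [Param(key=f"param_{i}", value=str(i)) for i in range(40)]
client.log_batch(run.info.run_id, metrics=metrics, params=params)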

Regarding your question about run deletion, MLflow provides a delete_run() API; here are the docs for the Python API: https://mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.MlflowClient.delete_run
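
For reference, a minimal sketch of calling that API (the run ID below is a placeholder). Note that delete_run only marks the run as deleted (lifecycle_stage = 'deleted') in the backend store; it does not physically remove the rows:

from mlflow.tracking import MlflowClient

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the PostgreSQL-backed server
client.delete_run("0123456789abcdef0123456789abcdef")  # placeholder run ID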

@dbczumar dbczumar self-assigned this Aug 20, 2019
@DimaZhu
Author

DimaZhu commented Aug 21, 2019

@dbczumar Just tried v1.2, didn't notice any difference.
Btw, mlflow --version crashes:

Traceback (most recent call last):
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/bin/mlflow", line 10, in <module>
    sys.exit(cli())
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 716, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 641, in make_context
    self.parse_args(ctx, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 1089, in parse_args
    rest = Command.parse_args(self, ctx, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 940, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 1477, in handle_parse_result
    self.callback, ctx, self, value)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 96, in invoke_param_callback
    return callback(ctx, param, value)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/decorators.py", line 270, in callback
    raise RuntimeError('Could not determine version')
RuntimeError: Could not determine version

We have approximately 40 parameters and 50 metrics per run and also artifacts.

I'm sorry, my question wasn't clear enough. In the MLflow UI I can see approximately 500 runs in total, but in psql I can see 5500 rows; the deleted ones are still in the database and marked as 'deleted'. Can this be a bottleneck? I read in the docs that with the filesystem backend I should manually delete removed experiments from the trash. Do I need to do something similar with PostgreSQL?
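
One way to check how many of those rows are soft-deleted runs (a sketch, assuming the client talks to the same tracking server; the experiment ID is a placeholder) is to search with run_view_type set to DELETED_ONLY:

from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the tracking server
deleted = client.search_runs(experiment_ids=["0"], run_view_type=ViewType.DELETED_ONLY)
active = client.search_runs(experiment_ids=["0"], run_view_type=ViewType.ACTIVE_ONLY)
# Note: search_runs paginates, so these counts are capped at the default max_results.
print(f"deleted: {len(deleted)}, active: {len(active)}")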

@dbczumar
Collaborator

@DimaZhu Can you try applying the patch in #1767 and let me know if this improves UI load times? When you try it, please be sure to take a backup of your MLflow database before running the migration defined in the PR.

@DimaZhu
Author

DimaZhu commented Aug 30, 2019

I tried it and didn't notice any difference. Also, mlflow server didn't ask to update the db schema. Maybe I can share a copy of the db with you?

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

@DimaZhu are you sure that you installed MLflow from source by cloning the MLflow repository, checking out the branch associated with PR #1767, and invoking pip install -e /path/to/mlflow/repository?

@DimaZhu
Author

DimaZhu commented Sep 1, 2019

@dbczumar Yes, I'm pretty sure. I created a new conda env for this test and installed mlflow following the instructions in the Dockerfile. And mlflow --version works in this build.

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

@DimaZhu What is the output of mlflow --version on the host machine where the tracking server is running? Are you running the tracking server within a Docker container?

@DimaZhu
Author

DimaZhu commented Sep 1, 2019

1.2.0
No, mlflow server runs without Docker. I just copied the instructions from the Dockerfile.

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

If you've installed MLflow from source, the version displayed should be 1.2.1.dev0 (VERSION = '1.2.1.dev0'). Can you try following the instructions in my previous comment and let me know if installing that way produces the expected version string?

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar, I followed the instructions. mlflow --version crashes again, but from the output of 'pip install -e .' I can see the right version, mlflow==1.2.1.dev0. This time I can't launch mlflow server successfully: I see "Oops! Something went wrong." and an image of a waterfall. Do you have any suggestions?

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu were you asked to upgrade your database upon running mlflow server? Is there any log output from the MLflow server that is emitted when the UI displays Oops! Something went wrong?

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar No, mlflow server didn't ask to upgrade, and there are no logs in the browser. Maybe something is wrong with access to the db. I'll check.

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar 🎉 That's much, much better! 30 s vs 240 s. And already-loaded runs open almost instantly, so you don't have to wait another 4 minutes! Thank you very much!

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu Awesome! Glad to hear that. If possible, it would be great to see how we could improve things further by using your database as an example workload. If you still don't mind sharing it, feel free to send me (corey.zumar) a message on the open source MLflow slack: http://tinyurl.com/mlflow-slack

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar To do this right, I should discuss it with my boss. I'll let you know.

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar I've found an interesting detail. I ran npm install from (probably) master and it worked fine. Now I've tried it from the branch in the pull request:

npm install --verbose
npm info it worked if it ends with ok
npm verb cli [
npm verb cli   '/Vol0/user/d.zhukov/.linuxbrew/Cellar/node/12.9.1/bin/node',
npm verb cli   '/Vol0/user/d.zhukov/.linuxbrew/bin/npm',
npm verb cli   'install',
npm verb cli   '--verbose'
npm verb cli ]
npm info using npm@6.10.3
npm info using node@v12.9.1
npm verb npm-session 654e045844d17c50
npm timing stage:loadCurrentTree Completed in 7ms
npm timing stage:loadIdealTree:cloneCurrentTree Completed in 0ms
npm timing stage:loadIdealTree:loadShrinkwrap Completed in 1ms
npm timing stage:loadIdealTree:loadAllDepsIntoIdealTree Completed in 2ms
npm timing stage:loadIdealTree Completed in 5ms
npm timing stage:generateActionsToTake Completed in 4ms
npm verb correctMkdir /Vol0/user/d.zhukov/.npm/_locks correctMkdir not in flight; initializing
npm verb lock using /Vol0/user/d.zhukov/.npm/_locks/staging-e87a4b00c8b60fcf.lock for /Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/node_modules/.staging
npm verb unlock done using /Vol0/user/d.zhukov/.npm/_locks/staging-e87a4b00c8b60fcf.lock for /Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/node_modules/.staging
npm timing stage:executeActions Completed in 86ms
npm timing stage:rollbackFailedOptional Completed in 0ms
npm info linkStuff !invalid#1
npm info lifecycle undefined~install: undefined
npm info lifecycle undefined~postinstall: undefined
npm info lifecycle undefined~prepublish: undefined
npm info lifecycle undefined~prepare: undefined
npm timing stage:runTopLevelLifecycles Completed in 114ms
npm WARN saveError ENOENT: no such file or directory, open '/Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/package.json'
npm info lifecycle undefined~preshrinkwrap: undefined
npm info lifecycle undefined~shrinkwrap: undefined
npm notice created a lockfile as package-lock.json. You should commit this file.
npm info lifecycle undefined~postshrinkwrap: undefined
npm WARN enoent ENOENT: no such file or directory, open '/Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/package.json'
npm verb enoent This is related to npm not being able to find a file.
npm verb enoent 
npm WARN server No description
npm WARN server No repository field.
npm WARN server No README data
npm WARN server No license field.

And it failed. Installing from master fixes it.

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu Does running npm install npm-clean help?

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu I also filed #1805 to improve the performance of the ListExperiments, GetExperiment, and GetExperimentByName APIs in the SQLAlchemy store. This should substantially improve UI performance in your case, as you mentioned you have a large number of experiments. Can you try it out?

@dbczumar
Collaborator

This class of performance issues has been addressed by #1767 and #1805. These changes will be incorporated into the next MLflow release. In the meantime, you can test the changes by cloning the MLflow repository and running the MLflow tracking server from the master branch. If this issue is still occurring, please feel free to re-open the ticket.
