
Mlflow works extremely slow with PostgreSQL #1763

Closed
DimaZhu opened this issue Aug 20, 2019 · 19 comments

@DimaZhu

DimaZhu commented Aug 20, 2019

Hi, everyone. We started using MLflow several weeks ago and decided to use PostgreSQL to speed up processing. But now an experiment with ~300 runs takes effectively forever to load and exceeds any reasonable time limit. Then I updated to v1.1 and it started to work better (~2 minutes), but that is still painfully slow. I looked in the database and found out that there are 5500 runs. Could that be the reason MLflow works so slowly? And if so, is there an API to clean deleted runs from the db?

@dbczumar
Collaborator

@DimaZhu Can you try this again on MLflow 1.2 and confirm that the issue persists? The latest release contains some performance improvements for the SQLAlchemyStore.

Are you logging a large number of metrics to MLflow? The 1.2 release should help significantly in that case.
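
If many small log_metric/log_param calls are part of the slowdown, one option (just a sketch, not something verified in this thread; the experiment ID and values below are placeholders) is to batch them with MlflowClient.log_batch so a run's metrics and params go to the store in a single request:

import time
from mlflow.tracking import MlflowClient
from mlflow.entities import Metric, Param

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the tracking server
run = client.create_run(experiment_id="0")  # "0" is a placeholder experiment ID
now_ms = int(time.time() * 1000)

# Send ~50 metrics and ~40 params in one round trip instead of one call each.
metrics = [Metric(key=f"metric_{i}", value=float(i), timestamp=now_ms, step=0) for i in range(50)]
params = [Param(key=f"param_{i}", value=str(i)) for i in range(40)]
client.log_batch(run.info.run_id, metrics=metrics, params=params)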

Regarding your question about run deletion, MLflow provides a delete_run() API; here are the docs for the Python API: https://mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.MlflowClient.delete_run
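
For reference, a minimal sketch of calling that API (the run ID below is a placeholder). Note that delete_run only marks the run as deleted (lifecycle_stage = 'deleted') in the backend store; it does not physically remove the rows:

from mlflow.tracking import MlflowClient

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the PostgreSQL-backed server
client.delete_run("0123456789abcdef0123456789abcdef")  # placeholder run ID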

@dbczumar dbczumar self-assigned this Aug 20, 2019
@DimaZhu
Author

DimaZhu commented Aug 21, 2019

@dbczumar Just tried v1.2, didn't notice any difference.
Btw, mlflow --version crashes:

Traceback (most recent call last):
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/bin/mlflow", line 10, in <module>
    sys.exit(cli())
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 716, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 641, in make_context
    self.parse_args(ctx, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 1089, in parse_args
    rest = Command.parse_args(self, ctx, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 940, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 1477, in handle_parse_result
    self.callback, ctx, self, value)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/core.py", line 96, in invoke_param_callback
    return callback(ctx, param, value)
  File "/Vol0/user/d.zhukov/miniconda3/envs/odometry/lib/python3.6/site-packages/click/decorators.py", line 270, in callback
    raise RuntimeError('Could not determine version')
RuntimeError: Could not determine version

We have approximately 40 parameters and 50 metrics per run and also artifacts.

I'm sorry, my question wasn't clear enough. In the MLflow UI I can see approximately 500 runs in total, but in psql I can see 5500 rows; the deleted ones are still in the database and marked as 'deleted'. Can this be a bottleneck? I read in the docs that with the filesystem backend I should manually delete removed experiments from the trash. Do I need to do something similar with PostgreSQL?
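
One way to check how many of those rows are soft-deleted runs (a sketch, assuming the client talks to the same tracking server; the experiment ID is a placeholder) is to search with run_view_type set to DELETED_ONLY:

from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the tracking server
deleted = client.search_runs(experiment_ids=["0"], run_view_type=ViewType.DELETED_ONLY)
active = client.search_runs(experiment_ids=["0"], run_view_type=ViewType.ACTIVE_ONLY)
# Note: search_runs paginates, so these counts are capped at the default max_results.
print(f"deleted: {len(deleted)}, active: {len(active)}")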

@dbczumar
Collaborator

@DimaZhu Can you try applying the patch in #1767 and let me know if this improves UI load times? When you try it, please be sure to take a backup of your MLflow database before running the migration defined in the PR.

@DimaZhu
Author

DimaZhu commented Aug 30, 2019

I tried it and didn't notice any difference. Also, mlflow server didn't ask to update the db schema. Maybe I can share a copy of the db with you?

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

@DimaZhu are you sure that you installed MLflow from source by cloning the MLflow repository, checking out the branch associated with PR #1767, and invoking pip install -e /path/to/mlflow/repository?

@DimaZhu
Author

DimaZhu commented Sep 1, 2019

@dbczumar Yes, I'm pretty sure. I created a new conda env for this test and installed mlflow following the instructions in the Dockerfile. And mlflow --version works in this build.

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

@DimaZhu What is the output of mlflow --version on the host machine where the tracking server is running? Are you running the tracking server within a Docker container?

@DimaZhu
Author

DimaZhu commented Sep 1, 2019

1.2.0
No, mlflow server runs without Docker. I just copied the instructions from the Dockerfile.

@dbczumar
Collaborator

dbczumar commented Sep 1, 2019

If you've installed MLflow from source, the version displayed should be 1.2.1.dev0 (VERSION = '1.2.1.dev0'). Can you try following the instructions in my previous comment and let me know if installing that way produces the expected version string?

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar, I followed the instructions. mlflow --version crashes again, but from the output of 'pip install -e .' I can see the right version, mlflow==1.2.1.dev0. This time I can't launch mlflow server successfully: I see "Oops! Something went wrong." and an image of a waterfall. Do you have any suggestions?

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu were you asked to upgrade your database upon running mlflow server? Is there any log output from the MLflow server that is emitted when the UI displays Oops! Something went wrong?

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar No, mlflow server didn't ask to upgrade, and there are no logs in the browser. Maybe something is wrong with access to the db. I'll check.

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar 🎉 That's much, much better! 30 s vs 240 s. And already-loaded runs open almost instantly, so you don't have to wait another 4 minutes! Thank you very much!

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu Awesome! Glad to hear that. If possible, it would be great to see how we could improve things further by using your database as an example workload. If you still don't mind sharing it, feel free to send me (corey.zumar) a message on the open source MLflow slack: http://tinyurl.com/mlflow-slack

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar To do this right, I should discuss it with my boss. I'll let you know.

@DimaZhu
Author

DimaZhu commented Sep 2, 2019

@dbczumar I've found an interesting detail. I ran npm install from (probably) master and it worked fine. Now I've tried it from the branch in the pull request:

npm install --verbose
npm info it worked if it ends with ok
npm verb cli [
npm verb cli   '/Vol0/user/d.zhukov/.linuxbrew/Cellar/node/12.9.1/bin/node',
npm verb cli   '/Vol0/user/d.zhukov/.linuxbrew/bin/npm',
npm verb cli   'install',
npm verb cli   '--verbose'
npm verb cli ]
npm info using npm@6.10.3
npm info using node@v12.9.1
npm verb npm-session 654e045844d17c50
npm timing stage:loadCurrentTree Completed in 7ms
npm timing stage:loadIdealTree:cloneCurrentTree Completed in 0ms
npm timing stage:loadIdealTree:loadShrinkwrap Completed in 1ms
npm timing stage:loadIdealTree:loadAllDepsIntoIdealTree Completed in 2ms
npm timing stage:loadIdealTree Completed in 5ms
npm timing stage:generateActionsToTake Completed in 4ms
npm verb correctMkdir /Vol0/user/d.zhukov/.npm/_locks correctMkdir not in flight; initializing
npm verb lock using /Vol0/user/d.zhukov/.npm/_locks/staging-e87a4b00c8b60fcf.lock for /Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/node_modules/.staging
npm verb unlock done using /Vol0/user/d.zhukov/.npm/_locks/staging-e87a4b00c8b60fcf.lock for /Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/node_modules/.staging
npm timing stage:executeActions Completed in 86ms
npm timing stage:rollbackFailedOptional Completed in 0ms
npm info linkStuff !invalid#1
npm info lifecycle undefined~install: undefined
npm info lifecycle undefined~postinstall: undefined
npm info lifecycle undefined~prepublish: undefined
npm info lifecycle undefined~prepare: undefined
npm timing stage:runTopLevelLifecycles Completed in 114ms
npm WARN saveError ENOENT: no such file or directory, open '/Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/package.json'
npm info lifecycle undefined~preshrinkwrap: undefined
npm info lifecycle undefined~shrinkwrap: undefined
npm notice created a lockfile as package-lock.json. You should commit this file.
npm info lifecycle undefined~postshrinkwrap: undefined
npm WARN enoent ENOENT: no such file or directory, open '/Vol0/user/d.zhukov/Projects/odometry/submodules/mlflow/mlflow/server/package.json'
npm verb enoent This is related to npm not being able to find a file.
npm verb enoent 
npm WARN server No description
npm WARN server No repository field.
npm WARN server No README data
npm WARN server No license field.

And it failed. Installing from master fixes it.

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu Does running npm install npm-clean help?

@dbczumar
Collaborator

dbczumar commented Sep 2, 2019

@DimaZhu I also filed #1805 to improve the performance of the ListExperiments, GetExperiment, and GetExperimentByName APIs in the SQLAlchemy store. This should substantially improve UI performance in your case, as you mentioned you have a large number of experiments. Can you try it out?

@dbczumar
Collaborator

This class of performance issues has been addressed by #1767 and #1805. These changes will be incorporated into the next MLflow release. In the meantime, you can test the changes by cloning the MLflow repository and running the MLflow tracking server from the master branch. If this issue is still occurring, please feel free to re-open the ticket.
