MLflow worker timeout when opening UI #925

Closed
jseppanen opened this issue Feb 26, 2019 · 38 comments
Labels
area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server priority/important-soon The issue is worked on by the community currently or will be very soon, ideally in time for the

Comments

@jseppanen

jseppanen commented Feb 26, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.5
  • MLflow installed from (source or binary): pip install mlflow
  • MLflow version (run mlflow --version): mlflow, version 0.8.2
  • Python version: Python 3.6.6 :: Anaconda, Inc.
  • npm version (if running the dev UI):
  • Exact command to reproduce: mlflow server --file-store /bigdata/mlflow --host 0.0.0.0

Describe the problem

The MLflow UI shows the Niagara Falls error page with "Oops! Something went wrong" every time I try to open it. I've been using it for two months; it recently started crashing, and as of today I cannot get the UI to open at all.

Logs

server logs after fresh restart:

[2019-02-26 12:34:36 +0000] [9] [INFO] Starting gunicorn 19.9.0
[2019-02-26 12:34:36 +0000] [9] [INFO] Listening at: http://0.0.0.0:5000 (9)
[2019-02-26 12:34:36 +0000] [9] [INFO] Using worker: sync
[2019-02-26 12:34:36 +0000] [12] [INFO] Booting worker with pid: 12
[2019-02-26 12:34:36 +0000] [14] [INFO] Booting worker with pid: 14
[2019-02-26 12:34:36 +0000] [15] [INFO] Booting worker with pid: 15
[2019-02-26 12:34:36 +0000] [18] [INFO] Booting worker with pid: 18
[2019-02-26 12:35:30 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:14)
[2019-02-26 12:35:30 +0000] [14] [INFO] Worker exiting (pid: 14)
[2019-02-26 12:35:30 +0000] [28] [INFO] Booting worker with pid: 28

browser console logs when opening UI:

setupAjaxHeaders.js:22 
{_xsrf: "2|a583f945|b32757069a3ea1c54e37f87dba1c1428|1549020795"}
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
jquery.js:9355 POST http://localhost:5000/ajax-api/2.0/preview/mlflow/runs/search net::ERR_EMPTY_RESPONSE
Actions.js:155 XHR failed 
{readyState: 0, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}
react-dom.production.min.js:151 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
AppErrorBoundary.js:19 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
:5000/#/experiments/1:1 Uncaught (in promise) 
t {xhr: {…}}
@5ke

5ke commented Jun 17, 2019

Same problem here! I started a parameter search before the weekend and have therefore run far more experiments than before. Now I can't start the UI anymore. Where do I start troubleshooting?

@5ke

5ke commented Jun 17, 2019

It turns out that the UI simply doesn't handle too many runs (in my case it starts struggling when mlruns contains more than circa 1000 experiments). Around this threshold the UI becomes unstable (sometimes it crashes, sometimes it works, but it's never quick and responsive), and eventually there are too many runs and it won't load at all.
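
To see how close a file store is to the run counts mentioned here, one option is to count run directories per experiment. This is a minimal, unofficial sketch. It assumes the usual file-store layout, where each experiment directory and each run directory contains its own meta.yaml. The name count_runs is made up for illustration and is not part of MLflow.

```python
from pathlib import Path


def count_runs(store_root):
    """Return {experiment_dir_name: number_of_run_dirs} for a file-store layout.

    Assumption: experiments and runs are directories that each hold a meta.yaml.
    """
    counts = {}
    for exp_dir in Path(store_root).iterdir():
        # Skip plain files and hidden directories such as .trash
        if not exp_dir.is_dir() or exp_dir.name.startswith("."):
            continue
        # Only directories with their own meta.yaml are treated as experiments
        if not (exp_dir / "meta.yaml").is_file():
            continue
        counts[exp_dir.name] = sum(
            1
            for run_dir in exp_dir.iterdir()
            if run_dir.is_dir() and (run_dir / "meta.yaml").is_file()
        )
    return counts
```

Calling count_runs on the directory passed to --file-store or --backend-store-uri (for example /bigdata/mlflow from the original report) gives a per-experiment tally that can be compared against the point where the UI starts to struggle.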

This goes a bit against the philosophy of being able to track all your experiments.

Would using a local db instead of file storage help? Hosting externally is not an option for me.

As a side note: during troubleshooting I discovered that when you move runs around into different folders, it's important to update the artifact_location parameter in the main meta.yaml, otherwise you'll experience a different type of crash, without a clear warning.
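
The artifact_location pitfall can be checked before starting the server. The sketch below is an illustration under assumptions, not an MLflow utility. It assumes each experiment directory has a meta.yaml with a single-line artifact_location entry, and that local locations are either plain paths or file: URIs. The helper names read_artifact_location and missing_artifact_dirs are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_artifact_location(meta_path):
    """Return the artifact_location value from an experiment meta.yaml, or None."""
    for line in Path(meta_path).read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "artifact_location":
            # Drop surrounding whitespace and optional YAML quotes
            return value.strip().strip("'\"")
    return None


def missing_artifact_dirs(store_root):
    """List (experiment_dir, artifact_location) pairs whose local path is missing."""
    problems = []
    for meta in sorted(Path(store_root).glob("*/meta.yaml")):
        location = read_artifact_location(meta)
        if location is None:
            continue
        parsed = urlparse(location)
        # Remote stores (s3://, etc.) are out of scope for this local check
        if parsed.scheme not in ("", "file"):
            continue
        if not Path(parsed.path).exists():
            problems.append((meta.parent.name, location))
    return problems
```

Any experiment reported by missing_artifact_dirs is a candidate for a hand-edit of artifact_location in its meta.yaml after runs have been moved.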

@selimonat

Same here. The UI either times out or crashes with about 650 runs... sometimes it works, mostly it doesn't.

@Zangr
Contributor

Zangr commented Jul 26, 2019

In MLflow 1.1 the runs listing was changed to show the first 100 runs plus a "Load more" button if you have more than 100. Could you please try it and see if that improves your situation?

@selimonat

We were checking that out a few days ago and got exactly the same error.

@mlflow mlflow deleted a comment from jakubczakon Jul 29, 2019
@CrankyDragon

So I just started running into this issue after upgrading to 1.0.0+ and noticed that the URL for static files is incorrect, causing the worker to block. Basically if I switch:
http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css
to:
http://localhost:5000/static-files/static/css/main.fbf8a477.css
These assets load fine. Anyone have a patch for this?

@datsabk

datsabk commented Aug 28, 2019

Any update on this issue? It still exists on v1.2.0.0

@pranasziaukas

pranasziaukas commented Sep 12, 2019

This is unexpectedly unpleasant. I did a number of runs with the idea of sorting best-to-worst metrics afterwards. But the UI indeed crashes after more than ~1000 runs...

Moreover, it only has to load the first 100 runs and show "load more" afterwards.

@stale

stale bot commented Oct 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 3, 2019
@stale

stale bot commented Dec 2, 2019

This issue has not been interacted with for 60 days since it was marked as stale! As such, the issue is now closed. If you have this issue or more information, please re-open the issue!

@stale stale bot closed this as completed Dec 2, 2019
@Nintorac

Still getting this issue in MLFlow 1.4. Is the situation improved by using a database backend?

@HungUnicorn

getting issue with 1.7 and postgres backend

@gkonstanty

gkonstanty commented Apr 29, 2020

It is the same for 1.7 without postgres backend.
Could this issue be re-opened?

@gkonstanty

I have just upgraded server to run on 1.9.0 version (without postgres backend) and nothing has changed.

Adding --gunicorn-opts "--timeout 180" has helped somewhat, but the number of our experiments is constantly growing, so even 180 seconds will soon not be sufficient. And waiting that long for the results of some simple queries is kinda annoying.

Could you please check this issue?

@faddey-w

faddey-w commented Jul 6, 2020

Same issue, number of runs is ~50. In my case this machine also runs TensorBoard, which uses a lot of RAM. It looks like less available RAM makes this issue more severe.

@tmywada

tmywada commented Aug 19, 2020

I face the same issue with version 1.10.0 and the file system store. All files are generated as expected, but the same "WORKER TIMEOUT" message appears when I try to access individual records (i.e., clicking a date's hyperlink).

@Zangr Zangr added area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server priority/important-soon The issue is worked on by the community currently or will be very soon, ideally in time for the and removed stale labels Aug 20, 2020
@Zangr
Contributor

Zangr commented Aug 20, 2020

Reopening this issue as per community request and reassigning priority to get it into the queue.

@Zangr Zangr reopened this Aug 20, 2020
@spott

spott commented Sep 22, 2020

@gkonstanty : where are you adding the --gunicorn-opts "--timeout 180" option?

@5ke

5ke commented Oct 1, 2020

@spott @gkonstanty : where are you adding the --gunicorn-opts "--timeout 180" option?

I couldn't get mlflow ui --gunicorn-opts "--timeout 180" to work either (error: no such option --gunicorn-opts)

But the following worked for me:
GUNICORN_CMD_ARGS="--timeout 180" mlflow ui

@gkonstanty

Sorry, @spott, I missed your msg.

I'm adding it to mlflow server:

mlflow server --host 0.0.0.0 -p 5000 --backend-store-uri /mlflow/data/ --default-artifact-root /mlflow/artifacts/ --gunicorn-opts "--timeout 180"

@kenimou

kenimou commented Mar 15, 2021

I have the same issue (version 1.14.1) and I use Postgres as the backend database. When I access the UI, the list of experiments doesn't load and the spinner just keeps spinning. To me it feels like it could not query the database to display the data.

It happens very often (> 50% of the time when I try to access the UI, especially after a run is finished) and I don't even have a lot of experiments (<10 per collection, 2 collections in total).

@sergeyleyko

mlflow 1.18, year 2021, issue is still here..

@daniel-beyond

mlflow 1.21 same issue...

@dprateek1991

I received somewhat better performance with this -

mlflow server --backend-store-uri=postgresql://postgres:${RDS_PASSWORD}@${RDS_HOST}:5432/mlflow --default-artifact-root=${ARTIFACT_STORE} --host 0.0.0.0 --port 5000 --gunicorn-opts "--worker-class gevent --threads 3 --workers 3 --timeout 300 --keep-alive 300 --log-level INFO"

@marioGab

mlflow 1.26.1 still same issue! Can you please provide a workaround or something? Even setting gunicorn-opts is not helping.

@jmahlik
Contributor

jmahlik commented Jun 28, 2022

It seems like limiting the initial experiments displayed and having a "load more" button similar to how the runs pages work might help.

The issue indeed presents only when there are about 1000+ experiments. The Python APIs continue to function fine with a large number of experiments; it's only the UI/JS side that gets bogged down.

I tracked down how loading more runs was implemented: #1564. Maybe some of the implementation could be borrowed?
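
While the UI struggles, the same "first 100, then load more" pattern can be used from Python. The helper below is a generic sketch of token-based paging and does not depend on MLflow. The name iter_pages is made up for illustration. As far as I know, MlflowClient.search_runs accepts max_results and page_token and returns a paged list with a token attribute, which is what the commented usage assumes.

```python
def iter_pages(fetch_page):
    """Yield pages from a token-based paged API until no continuation token is left.

    fetch_page(token) must return an iterable of items that also carries a
    `token` attribute (None or empty when there are no more pages).
    """
    token = None
    while True:
        page = fetch_page(token)
        yield list(page)
        token = getattr(page, "token", None)
        if not token:
            break


# Hypothetical usage against an MLflow tracking server (requires mlflow installed;
# the experiment id "1" is a placeholder):
#
#   from mlflow.tracking import MlflowClient
#   client = MlflowClient()
#   fetch = lambda token: client.search_runs(
#       experiment_ids=["1"], max_results=100, page_token=token)
#   for page in iter_pages(fetch):
#       print(len(page))
```

Fetching 100 runs at a time keeps each request small, which also makes it easier to tell whether a timeout comes from the backend query or from the UI rendering.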

@marioGab

I only have 3 to 5 experiments but still get this error. I just have a lot of metrics and parameters in each experiment, so the "load more" idea would not help in my case.

@alokpadhi

Well I found one workaround here. Try mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME --gunicorn-opts "--timeout 0"

This will wait until the data transfer finishes and the page loads. I had to delete a few of my experiments from my S3 bucket to make it load a little faster, which I think is not a very welcome workaround.

Will have to wait for a permanent fix to this.

@jmahlik
Contributor

jmahlik commented Jul 20, 2022

With a large timeout set, our problem seems to be only with the UI generating the experiments table with 1000+ experiments. I wonder if defaulting the experiments sidebar to hidden would help as a short-term fix? Collapsing the sidebar seems to fix the problem (after waiting a few minutes for it to load). Yes, it would break almost immediately when someone clicks to expand it, but a user could still use the UI if they knew the experiment ID beforehand.

Maybe these are really two separate issues. One for large runs and one for experiments on the home page?

@sergeyleyko

The problem occurs not only with a large number of experiments but also with a large number of logged metrics per experiment, as in my case.
So the issue is due to bad UI design, and I suppose it can only be fixed with a big refactoring.

@ptavaressilva

Problem persists on 1.28.0
MLflow is great, but for me this bug renders it practically useless.

@ChenchengLiang

Using http in the browser instead of https works for me.

@crowne

crowne commented Jun 29, 2023

I was getting this error when clicking on experiments while my blob storage was not correctly configured.
Once the configuration was corrected, the error went away.

@hvanmegen

> I was getting this error when clicking on experiments while my blob storage was not correctly configured. Once the configuration was corrected, the error went away.

What did you change to get rid of this error?

@hvanmegen

> Using http in the browser instead of https works for me.

oh, seriously?

@crowne

crowne commented Jul 11, 2023

> > I was getting this error when clicking on experiments while my blob storage was not correctly configured. Once the configuration was corrected, the error went away.
>
> What did you change to get rid of this error?

I had to change the format of the environment variable AZURE_STORAGE_CONNECTION_STRING.
The previous, incorrect format was preventing the application from connecting to Azure Blob Storage.

@alam-shahul

Is there a way for this issue to get even higher priority? I don't really understand how people use MLflow given that this bug exists. Perhaps it's because most people aren't using the UI feature...

@pbrit

pbrit commented Aug 29, 2023

I managed to get it fixed by (in my setup):

  • increasing the number of workers --workers 20 (6 should be enough because it's exactly how many connections are allowed per domain in modern browsers, 20 for good measure).
  • switching to eventlet

It's worth noting that the root cause, at least in my setup, has nothing to do with mlflow itself. I'm running it in Kubernetes (GKE) and in order to access it on my machine I'm using kubectl port-forward. It looks like the way kubectl is proxying the requests exhausts all workers and since they are synchronous by default no new connections can be accepted. Supposedly port-forwarding isn't compatible with the event Gunicorn worker class.

@here Could you describe your setup a bit, please? Are you using kubectl port-forward or anything similar to it?
