High CPU load in Passenger Core, percentage of requests freeze #1709

Closed
tinco opened this issue Dec 21, 2015 · 5 comments
Comments

@tinco (Contributor) commented Dec 21, 2015

Under certain conditions, requests appear to get stuck (no response) in Passenger 5.0.20 with a Ruby app.

Conditions:
The behavior most likely started after an increase in websocket connections (a script that makes one every N seconds was introduced recently). The server is not particularly high traffic, but each session is coupled to a websocket that tunnels I/O constantly. The server performs I/O with many backends, possibly leaking connections. We haven't managed to reproduce the conditions outside production yet.

Research:
Thread 0x7fa221283700 is suspected to be stuck in FdSinkChannel (down the line from RequestHandler::onRequestBody), to be reviewed.
Thread 0x7fa219ef9700 looks to be stuck in a runSync, but that is likely a side-effect (e.g. frozen event loop).

Backtraces available here: (only for Phusion staff)
https://phusionnl.slack.com/files/niels/F0H3SPQHY/-.sh

@cbeckr commented Jan 20, 2016

We may have run into the same issue after updating to 5.0.23 (from version 4): The server peaks at 100% CPU doing basic ELB health checks; we're also using Websockets (Faye / EM).
Here's a gist: https://gist.github.com/cbeckr/e4559ed1e14fd34d2c44
What struck me was how all queued requests have the exact same timestamp.

@OnixGH (Contributor) commented Jan 26, 2016

Research into #1732 revealed it to be looping at exactly the same place (a write loop in FdSinkChannel), and the code in question was found to contain a wrong exit condition. Closing this as a duplicate.
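
For context, the failure mode described here is a non-blocking write loop that keeps spinning instead of yielding back to the event loop. The following is an illustrative sketch only, not Passenger's actual FdSinkChannel code; the function name and structure are invented to show how a wrong exit condition can turn an EAGAIN into a 100% CPU busy spin that starves the rest of the event loop:

```cpp
// Illustrative sketch only -- not Passenger's actual FdSinkChannel code.
// Shows how a non-blocking write loop should hand control back to the
// event loop on EAGAIN, and how a wrong exit condition busy-spins instead.
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Writes as much of buf as the socket accepts. Returns the number of bytes
// written; the caller re-arms a writability watcher for the remainder.
ssize_t writeSome(int fd, const char *buf, size_t size) {
    size_t written = 0;
    while (written < size) {
        ssize_t ret = ::write(fd, buf + written, size - written);
        if (ret == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                // Correct: stop looping and wait for the fd to become
                // writable again. A broken exit condition here (e.g. one
                // that retries immediately, or compares against a value
                // that never changes) keeps this loop spinning forever,
                // pinning a core and freezing the event loop -- which
                // matches the symptoms reported above.
                break;
            }
            return -1; // genuine I/O error
        }
        written += static_cast<size_t>(ret);
    }
    return static_cast<ssize_t>(written);
}
```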

@OnixGH (Contributor) commented Jan 26, 2016

@cbeckr I'm not sure your issue is the same. The gist mentions "Watchdog seems to be killed; forcing shutdown of all subprocesses", which is not a symptom of this issue (unless you were killing it manually). If the watchdog died without you killing it, your system might be out of memory and killing processes, which can cause all sorts of trouble.

In any case we'll release a fix as part of 5.0.24 soon, so it's worth seeing if that makes a difference.

OnixGH pushed a commit that referenced this issue Jan 26, 2016
Suspected cause for GH-1709, GH-1732.

@cbeckr commented Jan 26, 2016

@OnixGH thanks for the update, we'll keep an eye out for 5.0.24.
The watchdog was killed because I also sent it a SIGQUIT, following the guidance in "The right way to deal with frozen processes on Unix". However, SIGQUIT is obviously only handled by the workers - something minor that could be clarified in the documentation.
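
For anyone following the same debugging route: as the thread notes, the backtrace-on-SIGQUIT behaviour applies to the application (worker) processes, not to the watchdog. Below is a minimal sketch, assuming a POSIX system with glibc's <execinfo.h>, of that kind of handler; this is not Passenger's code, just an illustration of why a process relying on SIGQUIT's default disposition terminates instead of printing a trace:

```cpp
// Minimal illustration (not Passenger's code): a worker-style process that
// dumps a native backtrace to stderr when it receives SIGQUIT. A process
// that does not install such a handler takes SIGQUIT's default action
// (terminate with a core dump) instead of reporting where it is stuck.
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

static void dumpBacktrace(int) {
    void *frames[64];
    // backtrace() and backtrace_symbols_fd() avoid heap allocation,
    // so they are reasonable to call from a signal handler.
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = dumpBacktrace;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGQUIT, &sa, nullptr);
    for (;;) {
        pause(); // send SIGQUIT to this PID to see the backtrace
    }
}
```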

@OnixGH (Contributor) commented Jan 26, 2016

@cbeckr the link you're referring to is pretty old, and we've been in the process of moving documentation into the Passenger Library.

It's a work in progress (it's not completely there yet but the SIGQUIT part is mentioned here), but we'll take your comment into account :)

OnixGH pushed a commit that referenced this issue Jan 26, 2016
Suspected cause for GH-1709, GH-1732. Inspection revealed another wrong
exit condition (causing premature exit in edge-case code), which was
also fixed.