fpm: some workers get stuck on "\0fscf\0" write to master process over time #11447
Comments
For reference, this is the full configuration file used:
And the systemd unit file (here we run 2 different php-fpm pinned each to half of the NUMA node for performance reasons, but single CPU/NUMA servers are also affected):
I have been thinking about this one and I'm not sure why it blocks on that write. This message is sent after the request ends so that a message composed of multiple writes and not finalized by a newline can be flushed (e.g. https://github.com/php/php-src/blob/06dd1d78a7ec1678b53ef657033c2021f4dc902f/sapi/fpm/tests/log-bwd-multiple-msgs.phpt ). This pipe is used for all logs from the child, so I'm trying to figure out when it can actually block; normally it doesn't, because the master should read from it just fine. It looks like for some reason this pipe is open but the master does not read from it. Would you be able to change
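As a rough model of the flushing described in this comment, the sketch below buffers bytes arriving from one child and emits a log message either at a newline or when the flush marker from the issue title arrives. It is written in Python for brevity and is not FPM's actual implementation; the `ChildLogBuffer` class and its method names are hypothetical.

```python
# Flush marker as quoted in the issue title: NUL, "fscf", NUL.
FLUSH_CMD = b"\x00fscf\x00"


class ChildLogBuffer:
    """Hypothetical, simplified model of a master-side buffer for one child's log pipe."""

    def __init__(self):
        self.pending = b""   # bytes received but not yet emitted
        self.messages = []   # completed log messages, in arrival order

    def feed(self, chunk: bytes) -> None:
        """Add bytes read from the pipe and emit every message that is now complete."""
        self.pending += chunk
        while True:
            nl = self.pending.find(b"\n")
            fl = self.pending.find(FLUSH_CMD)
            if nl == -1 and fl == -1:
                return  # nothing terminates the buffered bytes yet
            if fl == -1 or (nl != -1 and nl < fl):
                # A newline finalizes the message.
                self._emit(self.pending[:nl])
                self.pending = self.pending[nl + 1:]
            else:
                # The flush marker finalizes a message that had no newline.
                self._emit(self.pending[:fl])
                self.pending = self.pending[fl + len(FLUSH_CMD):]

    def _emit(self, raw: bytes) -> None:
        if raw:
            self.messages.append(raw.decode("utf-8", "replace"))
```

In this model a message written in several pieces without a trailing newline stays in `pending` until the marker arrives, which is why the child sends the marker at the end of every request.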
I've set I really don't know what was triggering this, but I do remember often seeing multiple php-fpm workers stuck for the exact same amount of time, so something was probably happening that affected more than one worker while it was ongoing.
Ever since enabling debug, the problem hasn't surfaced again. Feel free to close this issue if nothing can be done with the current information; I'll reopen it with debug output attached if it ever reappears on our servers.
It might signal some issue, as this should not normally happen. Can you describe the CouchBase issue in a bit more detail, specifically the moxi-server setup? It might give me some idea of what happened.
The moxi-server is just a local process that accepts memcache GET/SET calls and forwards them to a CouchBase cluster that it is connected to and whose topology it knows. From php-fpm, the code basically just uses the memcached pecl module (3.2.0, the latest at the time of writing) to connect to that socket and send the GETs and SETs, many of which are large multi-GETs, at an average rate of about 3000/s across two different php-fpm backends of 768 processes each.
I have been looking into this and thinking about it, and it looks to me like the primary cause might have been UNIX socket overload and/or something else related to the high load, which according to the report happened at the same time. We probably won't figure that part out, but the more important question for me is why FPM needs to wait forever for this write. Obviously something went wrong with the pipe buffer (or something related to it), as the waiting process was never unblocked. It's a blocking pipe, so it can potentially get into such a state, but why does this happen only for the request-flushing write and not for other logs? So how could we recover better from this issue? I have a few ideas:
Well, the second one makes sense only for a non-blocking socket, so it is really just one idea. :) I think the above could potentially help and also improve pipe throughput. It requires some changes to the reading and writing logic, though. It might also have side effects, so it needs to be properly tested under heavy load. For those reasons it will be more of a feature: because of their impact, these changes will target only the master branch.
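The list of ideas itself is not preserved in this copy of the thread, so the following is only a generic illustration of what a non-blocking write with a bounded wait can look like, not the change proposed for FPM. It is sketched in Python, and `write_with_deadline` is a hypothetical helper name.

```python
import os
import select
import time


def write_with_deadline(fd: int, data: bytes, timeout: float) -> int:
    """Write as much of data as possible to fd within timeout seconds.

    Returns the number of bytes written, which may be less than len(data).
    Note: this leaves fd in non-blocking mode.
    """
    os.set_blocking(fd, False)
    deadline = time.monotonic() + timeout
    written = 0
    while written < len(data):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # give up instead of waiting forever
        # Wait until the fd is reported writable or the deadline passes.
        _, writable, _ = select.select([], [fd], [], remaining)
        if not writable:
            break
        try:
            written += os.write(fd, data[written:])
        except BlockingIOError:
            # Reported writable but no room for this write; retry until the deadline.
            continue
    return written
```

With something along these lines, a writer whose peer has stopped reading gets control back after the timeout and can decide what to do (drop the message, retry later, or exit) rather than staying stuck in `write()`.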
Description
After updating from PHP 7.3 to 8.1, we are now seeing the number of "active" php-fpm processes slowly grow over time, which is not expected. Looking at the fpm-status output, many processes report a duration of multiple days.
In the above, after restarting php-fpm, the number of active processes goes back down to around 60. So only about 30 got stuck out of 1 billion accepted connections. That is really not many, but it results in +50% active processes after a week, and it keeps getting worse as time goes on.
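Stuck workers of this kind can be picked out of the status output programmatically. The sketch below is in Python and assumes the status page was fetched in its full JSON form and already parsed into a dict; the field names (`processes`, `pid`, `state`, `request duration`), the `Idle` state value, and the microsecond unit are assumptions about that format and should be checked against your own output. `long_running` is a hypothetical helper.

```python
def long_running(status: dict, min_seconds: float) -> list:
    """Return (pid, seconds) pairs for non-idle processes whose current
    request has lasted at least min_seconds, longest first."""
    found = []
    for proc in status.get("processes", []):
        # Assumption: "request duration" is reported in microseconds.
        seconds = proc.get("request duration", 0) / 1_000_000
        if proc.get("state") != "Idle" and seconds >= min_seconds:
            found.append((proc["pid"], seconds))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling it with a threshold of one day (`24 * 3600`) would list only workers in the state described here, since a normal request should never last that long.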
Looking at the stuck processes, all are on the exact same write. Here's one example:
That fd is a pipe to the master php-fpm process:
The fscf string can be found in sapi/fpm/fpm/fpm_stdio.c:
So for some reason, it looks like in some rare cases, sending this command from a worker back to the master gets stuck forever.
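To illustrate how a write this small can stall at all: a pipe has a fixed kernel buffer, and once the reader stops draining it, even a 6-byte write has nowhere to go. The Python sketch below reproduces that condition in a single process, using a non-blocking write end so that the point where a blocking writer would hang shows up as an exception instead. Both function names are hypothetical.

```python
import os

FLUSH_CMD = b"\x00fscf\x00"  # the 6-byte marker from the issue title


def fill_pipe(write_fd: int, chunk: int = 4096) -> int:
    """Write to a non-blocking pipe until it refuses more; return bytes accepted."""
    total = 0
    while True:
        try:
            total += os.write(write_fd, b"x" * chunk)
        except BlockingIOError:
            return total


def flush_would_block_on_full_pipe() -> bool:
    """Return True if writing the flush marker to a full, unread pipe cannot proceed."""
    read_fd, write_fd = os.pipe()
    try:
        os.set_blocking(write_fd, False)
        fill_pipe(write_fd)  # nobody reads, so the buffer fills up
        try:
            os.write(write_fd, FLUSH_CMD)
        except BlockingIOError:
            return True   # a blocking writer would be stuck here
        return False
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

On a blocking pipe the same final write simply never returns until the reader drains some data, which matches the stuck workers described above.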
PHP Version
PHP 8.1.20
Operating System
RHEL 7.9