"Failed to write, and not due to blocking" on Memcached 1.5.3 #458
Comments
Sounds like the client killed the connection for some reason :( Server tries to write but gets peer'ed. Looks like php-memcached, libmemcached based client. What client version? + What distro/where did the server and client builds come from? |
Client is from Ubuntu repositories. Server is compiled and packaged manually.
I also opened php-memcached-dev/php-memcached#429 |
Well the error is likely on the client, unless the server is returning a malformed response. Hopefully you'll have more luck with the php-memcached project? At some point somewhere the client should be emitting an error code. |
Alright, thanks for narrowing it down. I'll wait for them to respond on php-memcached-dev/php-memcached#429 |
@dormando what does this mean, do you mean the client closed an end of the connection or something else? I'm not seeing close in strace.
|
@glennpratt If the server is getting "connection reset by peer" when writing, it means it received a TCP RST (connection was reset). That could happen for a number of reasons (firewalls? client crash? fd recycle?), so I can't really tell you why offhand. |
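For anyone who wants to see what a reset looks like from the receiving side, here is a minimal sketch in Python on loopback. It is only an illustration, not what memcached or libmemcached does internally: the function name `peer_reset_is_visible` is made up, and it forces an RST by closing the client with `SO_LINGER` set to a zero timeout, which is just one of the possible causes listed above.

```python
import socket
import struct


def peer_reset_is_visible():
    """Return True if the accepted socket observes a reset from its peer."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(2.0)

    # l_onoff=1, l_linger=0: close() aborts the connection with an RST
    # instead of the normal FIN handshake (assumption: illustrative trigger only).
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.recv(1)  # a clean close would return b"" here
        return False
    except ConnectionResetError:  # ECONNRESET, "Connection reset by peer"
        return True
    finally:
        server_side.close()
        listener.close()
```

A packet capture taken while this runs should show the RST segment, which is the thing to look for in the tcpdump output when the server logs the error.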
Thanks for the feedback, greatly appreciated. I can't say I see any of those things from strace or tcpdump. Here's a Wireshark summary from both sides: https://gist.github.com/glennpratt/1973c7e5f423dfe7365c72d89c34ab21 |
Got some "acked unseen segments" toward the end of the capture. packet loss/rate issues? |
Memcached Server
Bad Client machine
Good Client machine
|
Also, here's the wireshark summaries from a good session: https://gist.github.com/glennpratt/9e29e54504806acdc2fc8feca35558f8 |
confused; what's a good client machine vs bad client machine? only certain clients get disconnected from even the same servers? |
The behavior only seems to happen from one particular server (out of three in a cluster) at a time. The behavior reliably returns on new clusters, after replacing servers, etc. The errors will stop for 1 or 2 request cycles after restarting memcached. It seems possible that the issue may be triggered by the actual content of the cache traffic. |
That's certainly possible, which is why I was asking about getting client errors at first. It should be raising some kind of exception internally. also when I said your wireshark is missing data, I don't mean the machine was retransmitting, I mean your tcpdump wasn't capturing everything so it wasn't fully correlating. It could've missed the RST in there. |
I am not seeing the "Connection reset by peer" log message except when the client program actually exits, in which case it is likely expected. I'm not seeing that log on the server while the client seems to be hung in a loop around EAGAIN. The client code is not receiving an Exception in PHP land as far as I can tell. When I inspect the PHP process with gdb while it is in this loop, it is in libmemcached code. I'll make some new captures and include something from strace and gdb if I can. |
Any chance you could re-summarize the problem from the start? The more detail the better since I don't know drupal/php/etc. As originally written it sounds like the client EAGAIN's for some amount of time, then the server gets peer'ed trying to write. From what you say now it sounds like the client gets into EAGAIN infinite loops sometimes, and the client will eventually timeout and get killed? |
Hey, how goes? I'd like to get this figured out or close out the issue. thanks! |
Hi @dormando, Thanks for checking in! To summarize from the start: a Drupal application runs on 3 webnodes with a single memcached daemon. We have isolated the problem to one configuration setting for the PHP memcached extension: The problem is this: during application bootstrap there are a number of gets and sets. On one or two of the three nodes, the application bootstrap becomes extremely slow. Looking at gdb, the process is looping inside libmemcached. Looking at Wireshark, we see the "TCP Window Full" message at the same time that the process goes into a very slow loop inside libmemcached. None of this happens with As far as I can tell, the connection resets only happened when a timeout killed the process, as you suggest. In other words, not an issue. |
Sorry! I know I pinged but just getting back to looking through issues. Sounds like you figured it out? Something killing your process? Okay if we close the issue then? :) |
@dormando, No we are stuck now. We need
|
Got it. I misread that part, sorry. From your latest update it might be possible you're just filling the TCP window? I thought libmemcached had a pressure release though; it's been a while since I've thought about it. Basically: if you issue large gets and sets but aren't reading the gets off the socket fast enough, the receive buffer fills on the client side. Then on memcached's side its send buffer fills, so it goes into a loop waiting to send more data before going back into read mode. Think that's something that might be happening here? If that's not it I'd have to dig in really closely, or we'd need a reproducible test independent of your app that I can examine. |
Ah I meant to more clearly say this can happen on streams of gets: "get foo\r\n" x 100k without reading the socket and you'll deadlock. |
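To make the deadlock description concrete, here is a rough Python sketch of the opposite pattern: send gets in bounded batches and drain the replies before sending more, so neither side's buffer can fill up. The function name `fetch_in_windows` and the `window` parameter are made up for illustration, and this is not how libmemcached or php-memcached is implemented. It also counts `END\r\n` terminators naively, which would miscount if a stored value contained that byte sequence.

```python
import socket


def fetch_in_windows(sock, keys, window=100):
    """Issue text-protocol gets in batches, reading all replies between batches.

    Returns the number of gets whose replies were fully received.
    """
    completed = 0
    for start in range(0, len(keys), window):
        batch = keys[start:start + window]
        # Write at most `window` requests before switching to reading.
        sock.sendall(b"".join(b"get " + key + b"\r\n" for key in batch))

        # Each get reply ends with "END\r\n"; read until the batch is answered.
        # Simplification: assumes no value contains the bytes "END\r\n".
        buf = bytearray()
        while buf.count(b"END\r\n") < len(batch):
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("server closed the connection mid-batch")
            buf += chunk
        completed += len(batch)
    return completed
```

Something like `fetch_in_windows(sock, [b"foo"] * 100_000, window=100)` keeps at most 100 replies outstanding, whereas writing all 100k requests first is the case that can wedge both ends.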
For others who might stumble upon this. The fix was to set send and recv size.
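The exact client settings are not reproduced above. As a generic illustration only, and assuming "send and recv size" refers to the socket send and receive buffers, this is what setting them looks like at the OS level in Python. The helper name `socket_with_buffers` is hypothetical; `SO_SNDBUF` and `SO_RCVBUF` are the standard socket options.

```python
import socket


def socket_with_buffers(send_bytes, recv_bytes):
    """Create a TCP socket with explicit send/receive buffer size requests."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the sizes before connecting so they apply from the start.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
    # The kernel may adjust the request (Linux, for example, doubles it),
    # so read the values back rather than trusting what was asked for.
    effective = (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
    return sock, effective
```

Larger buffers only move the point at which both sides fill up; they do not remove the need to read replies while writing.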
|
FWIW, I just got this today on macOS and Drupal 8.8.5:
It happens with the default options, so I checked the combinations of all 4 options suggested in this issue separately:
|
The fix is to not write a million keys to the socket without reading anything back. |
Thanks for the answer, I opened an issue on the Drupal |
I'm trying to debug an issue where connections to memcache time out after a stream of EAGAIN messages on the recvfrom and sendto system calls.
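For readers unfamiliar with EAGAIN: on a non-blocking socket it means "nothing can be transferred right now", for example because the peer is not draining its receive buffer. The sketch below is a hedged Python illustration of that condition, not the libmemcached code path. The function name `send_with_backpressure` is invented, and Python surfaces EAGAIN as `BlockingIOError`.

```python
import select
import socket


def send_with_backpressure(sock, payload, timeout=5.0):
    """Send all of `payload` on a non-blocking socket.

    Returns how many times the send hit EAGAIN. Raises TimeoutError if the
    socket stays unwritable for `timeout` seconds (peer is not reading).
    """
    sock.setblocking(False)
    view = memoryview(payload)
    eagain_count = 0
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: send buffer is full
            eagain_count += 1
            # Wait for writability instead of spinning on send().
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("peer is not draining its receive buffer")
    return eagain_count
```

A long run of EAGAIN on both sendto and recvfrom in strace is consistent with each side waiting for the other to make room.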
Temporarily upgrading to the latest version of memcached did not resolve the issue. It can be reproduced on multiple servers across different availability zones. The servers have been replaced, and the issue eventually recurs on the new servers.
On the client
I have been unable to isolate the issue from Drupal so far, as testing with PHP manually works fine. The initial drush9 command also works after a memcached restart but fails on all following runs.
During this time ~13,000 sets and gets are performed from the Drupal bootstrap.
On the server with -vvv flags enabled
Version info
Client Config