Streams should check for Session shutdown when waiting for data & clean up timers #127
Conversation
Done |
Force-pushed from 7eab13b to 9aa57cd
Thank you for your excellent PR and debugging work!
One thing that does alarm me is how […]. It's only ever blocked on in […]. But it's also used for […]
|
Force-pushed from 9aa57cd to 5cd99ad
PR feedback addressed, I think. |
For anyone following along at home: https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork has a zip with a Linux amd64 Nomad build with this fix. If you're paranoid, I recommend just building it yourself, however. |
Now that things have stabilized, going to see if this issue continues to appear. |
All hell broke loose due to a provider network outage. After 24 hours with this custom fork of Nomad deployed to both a ~100 node cluster, and a ~2300 node cluster, we do not have any stuck/pending allocs. Confirmed, this fixes hashicorp/nomad#23305. |
*confirmed, it improves the bug. Still happening, sadly, and the output of […]. When I'm back from vacation there's a good chance I'm going to just shoehorn a timer of some kind into yamux's Read method to get it out of that select statement at the bottom of the method.
Then the node gets marked as […] |
Thanks for the update @lattwood! I just wanted to let you know we still intend to merge this PR after some more internal validation across teams, since many HashiCorp products rely on yamux. We're also planning on at least improving the archaic infrastructure to pass […]. Lots of vacations this time of year for us as well, so it will probably be a couple weeks before we're able to merge this. |
Sorry for the delay! I've finally been testing this patch. Playing around for a while with […]. The fact that this patch seems to have solved your issue makes me lean toward merging this even if we can't come to a satisfactory explanation of why.
(Emphasis mine.) These are buffered channels: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L75-L76. I don't see a way for Stream.Read to miss a send on […]. I know it's been a long time, but do you still have a goroutine dump handy? Is it possible another goroutine was blocked in […]? I tried to write a test to exercise the […]. I sprinkled some […] |
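For readers following along, here is a minimal sketch of the notification pattern being discussed (an illustration under assumed semantics, not yamux source): the notify channel has capacity 1 and senders use a non-blocking send, so a wakeup sent before the reader reaches its select is parked in the buffer rather than lost.

```go
package main

import "fmt"

// asyncNotify mirrors the non-blocking wakeup pattern described above
// (assumed for illustration, not verbatim yamux code).
func asyncNotify(ch chan struct{}) {
	select {
	case ch <- struct{}{}: // park one wakeup in the buffered slot
	default: // a wakeup is already pending; coalesce this one
	}
}

func main() {
	recvNotifyCh := make(chan struct{}, 1) // capacity 1

	// The notification fires before anyone is waiting...
	asyncNotify(recvNotifyCh)
	asyncNotify(recvNotifyCh) // ...and a second one is coalesced, not lost.

	// ...yet a later receive still observes it, so the waiter is not missed.
	<-recvNotifyCh
	fmt.Println("wakeup observed")
}
```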
Argh, sorry that #130 caused a merge conflict. Tests now pass with the race detector enabled, and we finally enabled GHA CI! We've been testing your patch more internally, so I wanted to leave a quick update: […]
I'm hoping to get this all finished up this week so it makes it into the next Nomad patch release. |
After a lot of attempts I have been unable to cause the bug this appears to fix.
In theory, if io.ReadFull blocked indefinitely, that could prevent Session.Close() from closing each Stream. This would leave the yamux Session in the precise state being seen here: Session.shutdownCh is closed, but individual streams have not been notified.
If that case is possible, this patch fixes it. Unfortunately, the only way I was able to replicate that case was by disabling client write timeouts (as well as carefully blocking server writes to deadlock keepalive handling), which is not the case in the Nomad environment in the original bug report: hashicorp/nomad#23305
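As an illustration of the blocking-read ingredient in that hypothesis, here is a self-contained toy (an assumed setup, not yamux code): a read on a connection whose peer never writes only returns because a deadline is set; with no deadline it would block forever, and any cleanup that depends on that goroutine would never run.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	// Stand-in for a read deadline; without this, the ReadFull below
	// would block forever because the peer never writes.
	_ = server.SetReadDeadline(time.Now().Add(100 * time.Millisecond))

	buf := make([]byte, 12) // e.g. a yamux frame header is 12 bytes
	_, err := io.ReadFull(server, buf)
	fmt.Println("ReadFull returned:", err)
}
```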
Goroutine dumps observing this bug would be extremely useful and may be reported privately to: nomad-oss-debug@hashicorp.com
After extensive testing I have found no downside to accepting this patch and intend to merge it. Leaving open a few more days to let other folks have a chance to review.
See #131 for some of the changes I made to the test suite while testing this. Note that those changes are not exhaustive: I also did a lot of manual testing directly with yamux as well as via Nomad. At no point could I reproduce the bug or find a problem with this PR.
Thanks @lattwood, and I'm very sorry for the long delay!
@schmichael so, without revealing too much, what we're doing with Nomad regularly sees our hosting providers firewall off the Nomad Masters from the Nomad Agents for upwards of 30 minutes at a time. Agents can initiate TCP connections to the Masters when this happens, but all traffic from the Masters to the Agents becomes blocked. So the client SYN makes it to the server, but the server SYN-ACK doesn't leave the provider's network. I think that could actually manifest as […], so I think you're onto something here.
With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux. I'll see about grabbing some goroutine stacktraces the next time I'm kicking nodes. And I'll get this PR's conflicts addressed. |
How many do ya want? :D |
Hm, phooey. Is it significantly better than without the yamux patch?
Working on a distributed system written in Go means the team is pretty used to combing through huge mountains of goroutines, so share whatever you're able. 😁 |
@schmichael stacks sent. It's a fork of Nomad 1.7.7 using this version of yamux: https://github.com/lattwood/yamux/tree/shutdown_edgecase (not the rebased branch for this PR).

get_stack.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

target=$1
ssh "root@${target}" 'killall -SIGQUIT nomad && sleep 3'
ssh "root@${target}" 'journalctl --unit nomad.service --since "5 minutes ago" > gostacks.txt'
scp "root@${target}:gostacks.txt" "out/${target}_stacks.txt"
```

Run against every node with a pending allocation:

```bash
parallel -j16 './get_stack.sh {}' ::: `nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u`
```
|
Stream.Close is called when the Session's keepalive fails. This call wakes up Stream.Read & Stream.Write waiters. The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to s.stateLock.Unlock() and the select statement at the bottom of the method. When this occurs, if deadlines haven't been set on the Stream in question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement, this coordination gap is addressed. This uses `goto START` instead of `return 0, io.EOF`, as that would potentially leave a last read on the buffer in the Read case.
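To make the coordination gap and the shape of the fix concrete, here is a runnable toy model (hypothetical stand-in types, not the actual yamux source): the waiter selects on its per-stream notify channel, an optional deadline, and the session-wide shutdown channel, so a shutdown that happens before the waiter reaches the select still wakes it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stream is a hypothetical stand-in for a yamux-like stream.
type stream struct {
	recvNotifyCh chan struct{} // capacity-1 data/state notification
	shutdownCh   chan struct{} // closed when the whole session shuts down
}

// waitForData mimics the waiting portion of Read: without the shutdownCh
// case, a waiter that arrives after shutdown (and has no deadline) blocks
// forever.
func (s *stream) waitForData(timeout <-chan time.Time) error {
	select {
	case <-s.recvNotifyCh:
		return nil // caller re-checks the buffer and stream state
	case <-timeout:
		return errors.New("i/o deadline reached")
	case <-s.shutdownCh:
		return errors.New("session shutdown") // the case this PR adds
	}
}

func main() {
	s := &stream{
		recvNotifyCh: make(chan struct{}, 1),
		shutdownCh:   make(chan struct{}),
	}
	close(s.shutdownCh) // simulate a session that already shut down

	// No deadline is set (a nil channel never fires), yet the wait returns.
	fmt.Println(s.waitForData(nil))
}
```

In the real patch the shutdown case re-checks stream state via `goto START` rather than returning immediately, so a final buffered read is not dropped.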
In hashicorp#31, the Read path was changed to move away from time.After. This change was not reflected in the Write path, and this commit rectifies that.
When stopping a time.Timer, you need to check the return value and, if the Timer has already fired, drain the channel to prevent a leak.
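A minimal sketch of the Stop-and-drain idiom referred to above (generic Go, not the yamux change itself): if Stop reports that the timer already fired, the tick is sitting in the channel and should be drained before the timer or its channel is reused.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.NewTimer(10 * time.Millisecond)
	time.Sleep(20 * time.Millisecond) // let the timer fire before we stop it

	if !t.Stop() {
		// Stop returned false: the timer already fired, so its value is
		// buffered in t.C. Drain it so a later Reset or receive doesn't
		// see a stale tick.
		select {
		case <-t.C:
		default:
		}
	}
	fmt.Println("timer stopped and drained")
}
```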
Force-pushed from 5cd99ad to 84b3fc6
PR rebased against hashicorp/yamux/main |
(this patch does reduce the frequency of it getting blocked tho) |
@schmichael I think this is good to merge at this point? Also, any chance you happened to look at that yamux fork mentioned on the Nomad issue about potentially swapping out for SPDY, in terms of compatibility issues? If it doesn't seem like a horrible idea, I think we could be open to running a Nomad fork on our end to kick the tires on it as it relates to solving the issue we're experiencing. |
Friendly poke @schmichael |
Peeked at a few of the goroutine dumps you sent and something is clearly blocking […]. Merging! Thanks for your extensive help and even more extensive patience @lattwood. We can continue debugging on the Nomad issue. |
The main point of this dependency upgrade is to pull in the fixes in hashicorp/yamux#127, which prevent leaking deadlocked goroutines. It has been observed to improve the issue in #23305 but does not seem sufficient to fix it entirely. Since touching yamux is a rare and scary event, I do **not** intend to backport this. If we discover the improvements are stable and significant enough, or if further fixes land in yamux, backporting can be done at that time.
Stream.Close is called when the Session's keepalive fails. This call wakes up Stream.Read & Stream.Write waiters. The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to s.stateLock.Unlock() and the select statement at the bottom of the method. When this occurs, if deadlines haven't been set on the Stream in question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement, this coordination gap is addressed. This uses `goto START` instead of `return 0, io.EOF`, as that would potentially leave a last read on the buffer in the Read case.

Should fix hashicorp/nomad#23305. Definitely will fix a bug about allocations getting stuck in the pending state until the Nomad Agent is restarted that I haven't filed.

The issue was discovered by sending a "broken" Nomad Agent a SIGQUIT and seeing that the goroutine responsible for fetching allocation updates from the Nomad Servers was blocked in the select statement at the bottom of Stream.Read for > 24 hours (or, more specifically, for however long the allocation had been in the pending state), with a lot of stuff about msgpack and jobspec fields in the middle of the stacktrace.
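As an aside on the SIGQUIT technique: a full goroutine dump can also be taken in-process with runtime/pprof, which avoids killing the agent (a generic Go sketch, not something specific to Nomad or yamux):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 prints every goroutine's full stack, in the same style as
	// the dump a Go program produces when it receives SIGQUIT.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```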