[v0.20.x-branch] Backport #10540: discovery: fix gossiper shutdown deadlock#10554
[v0.20.x-branch] Backport #10540: discovery: fix gossiper shutdown deadlock#10554yyforyongyu merged 2 commits intov0.20.x-branchfrom
Conversation
When processing a remote network announcement, it is possible for two error messages to be sent back on the errChan. Since Brontide doesn't actually read from errChan, and since errChan only buffered one error message, the sending goroutine would deadlock forever. This would only become apparent when the gossiper attempted to shut down and got hung up. For now, we can fix this simply by buffering up to two error messages on errChan. There is an existing TODO to restructure this logic entirely to use the actor model, and we can do a more thorough fix as part of that work. This bug was discovered while doing full node fuzz testing and was triggered by sending a specific channel_announcement message and then shutting down LND. (cherry picked from commit 21588ac)
|
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-10540-to-v0.20.x-branch
git worktree add --checkout .worktree/backport-10540-to-v0.20.x-branch backport-10540-to-v0.20.x-branch
cd .worktree/backport-10540-to-v0.20.x-branch
git reset --hard HEAD^
git cherry-pick -x 5fa025e9e7a180de0a81750757fcc20eb37b28a9
git push --force-with-lease |
0d4fc6e to
f427fee
Compare
🟠 PR Severity: HIGH
🟠 High (1 file)
🟡 Medium (0 files)None 🟢 Low (1 file)
AnalysisThis PR backports #10540 which fixes a shutdown deadlock in the gossiper. The primary change is in Key observations:
The HIGH severity is appropriate because the gossiper is responsible for network topology propagation, and while this is a shutdown-specific bug rather than affecting normal operation, changes to concurrency control in the discovery package warrant careful review by an engineer familiar with the gossip protocol's threading model. To override, add a |
|
Fixed the conflict manually. |
gijswijs
left a comment
There was a problem hiding this comment.
Confirmed that the backport is the same fix that's already in master, so:
LGTM! 🎉
Backport of #10540
When processing a remote network announcement, it is possible for two error messages to be sent back on the
errChan. SinceBrontidedoesn't actually read fromerrChan, and sinceerrChanonly buffered one error message, the sending goroutine would deadlock forever. This would only become apparent when the gossiper attempted to shut down and got hung up.For now, we can fix this simply by buffering up to two error messages on
errChan. There is an existing TODO to restructure this logic entirely to use the actor model, and we can do a more thorough fix as part of that work.This bug was discovered while doing full node fuzz testing and was triggered by sending a specific channel_announcement message and then shutting down LND.