Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discovery: make gossip replies synchronous #2916

Merged

Conversation

Projects
None yet
3 participants
@cfromknecht
Copy link
Collaborator

commented Apr 8, 2019

This PR modifies the discovery.GossipSyncer to send all replies and and gossip filter backlogs synchronously. This serves to throttle the rate at which the serving node writes messages to the wire, which has been observed to result to a disconnect if the remote node's discovery stream is filled up. Testing on my node has shown overall more stable connections, as this helps in more closely aligning our sending rate to the throughput of the other node.

In the process, we make a pair of syncer's state machines totally isolated, which facilitates a nice removal of duplicate test code in discovery/syncer_test.go since the requesting side can now be tested independently of the remote peer's state machine.

See commit messages for more details.

@cfromknecht cfromknecht added this to the 0.7 milestone Apr 11, 2019

@Roasbeef Roasbeef modified the milestones: 0.7, 0.6.1 Apr 12, 2019

@Roasbeef Roasbeef requested a review from wpaulino Apr 18, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:split-syncer-query-reply branch from 543e094 to 6172d82 Apr 19, 2019

@Roasbeef
Copy link
Member

left a comment

First pass review completed, starting to run this now on a few nodes. I think this combined with #2924 will put us in an even better place w.r.t our p2p interaction with lnd but also other implementations. Will focus on the tests (as that's where the bulk of the diff is) during my next round. The core changes themselves are easy to follow though. We'll need to also possibly synchronize landing this with #2932 since it also does some overhaul work in this file.

Show resolved Hide resolved discovery/syncer.go
Show resolved Hide resolved discovery/syncer.go
// sendToPeer is a function closure that should send the set of
// targeted messages to the peer we've been assigned to sync the graph
// state from.
// sendToPeer sends a variadic number of messages to the remote peer.

This comment has been minimized.

Copy link
@wpaulino

wpaulino Apr 24, 2019

Collaborator

Any reason we still keep the async version around? We could just make all the sends sync, or expose it through the function bool parameter.

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Apr 24, 2019

Author Collaborator

initially i thought it might be nice to keep the async version for use in our own channelGraphSyncer, since it would be less blocking and more available to receive commands from the the SyncManager. it also seemed a more minimal change, though i think it's worth exploring making them all synchronous, especially since we already handle the errors on most of the call sites

Show resolved Hide resolved discovery/syncer_test.go
@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 26, 2019

@Roasbeef

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Needs a rebase now that #2932 has landed.

@cfromknecht cfromknecht force-pushed the cfromknecht:split-syncer-query-reply branch from 14f73a5 to 5372d27 Apr 27, 2019

@wpaulino

This comment has been minimized.

Copy link
Collaborator

commented Apr 27, 2019

Need another rebase 😬

cfromknecht added some commits Apr 27, 2019

discovery/syncer: separate query + reply handlers
This commit creates a distinct replyHandler, completely isolating the
requesting state machine from the processing of queries from the remote
peer. Before the two were interlaced, and the syncer could only reply to
messages in certain states. Now the two will be complete separated,
which is preliminary step to make the replies synchronous (as otherwise
we would be blocking our own requesting state machine).

With this changes, the channelGraphSyncer of each peer will drive the
replyHanlder of the other. The two can now operate independently, or
even spun up conditionally depending on advertised support for gossip
queries, as shown below:

          A                                 B
 channelGraphSyncer ---control-msg--->
                                        replyHandler
 channelGraphSyncer <--control-msg----
           gossiper <--gossip-msgs----

                    <--control-msg---- channelGraphSyncer
       replyHandler
                    ---control-msg---> channelGraphSyncer
                    ---gossip-msgs---> gossiper
lnpeer+peer: use importable lnpeer.ErrPeerExiting
As a prepatory step to making gossip replies synchronous, we will move
the ErrPeerExiting error into the lnpeer package so that it can be
imported by the discovery package. With synchronous sends, this error
can now be detected and handled by goroutines in the syncer, and cause
them to exit instead of continue to sending backlogs.
discovery/syncer: make gossip sends synchronous
This commit makes all replies in the gossip syncer synchronous, meaning
that they will wait for each message to be successfully written to the
remote peer before attempting to send the next. This helps throttle
messages the remote peer has requested, preventing unintended
disconnects when the remote peer is slow to process messages. This
changes also helps out congestion in the peer by forcing the syncer to
buffer the messages instead of dumping them into the peer's queue.
discovery/syncer_test: fix benign off-by-one in DelayDOS test
Prior to this change, the numQueryResponses that we calculated would be
one more than what we actually wanted since it didn't account for the
initial QueryChannelRange msg. This resulted in the test sending one
extra delayed query than was configured. This doesn't fundamentally
impact the test, but does make what happens in the test more reflective
of the configuration.

@cfromknecht cfromknecht force-pushed the cfromknecht:split-syncer-query-reply branch from 5372d27 to 5251ebe Apr 27, 2019

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 27, 2019

@wpaulino rebased

@wpaulino
Copy link
Collaborator

left a comment

LGTM

@Roasbeef
Copy link
Member

left a comment

LGTM 🚋

Running on my nodes the past few days, I've found that this and the flush changes in brontide seem to really speed up node restarts, and also improve peer connection stability.

@Roasbeef Roasbeef merged commit 985902b into lightningnetwork:master Apr 30, 2019

1 of 2 checks passed

coverage/coveralls Coverage decreased (-0.04%) to 60.042%
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

cfromknecht added a commit to cfromknecht/lnd that referenced this pull request May 6, 2019

peer: break sendMessage on server quit
This commit modifies sendMessage to break on the server's quit channel,
which allows synchronous callers of SendMessage or SendLazyMessage to
receive an error during server shutdown which can be independent of a
particular peer's shutdown.

As of lightningnetwork#2916, all replies
made by gossip syncers were modified to be synchronous. In certain
cases, This would prevent the syncers from shutting down promptly, as
they would try to offload a batch a of messages that could not be
aborted. Now, an error will be propagated back to the caller, allowing
them to detect their condition, and reevaluate their own quit signals,
releasing any waitgrouped goroutines and permitting a quick shutdown.

cfromknecht added a commit to cfromknecht/lnd that referenced this pull request May 6, 2019

peer: break sendMessage on server quit
This commit modifies sendMessage to break on the server's quit channel,
which allows synchronous callers of SendMessage or SendLazyMessage to
receive an error during server shutdown which can be independent of a
particular peer's shutdown.

As of lightningnetwork#2916, all replies
made by gossip syncers were modified to be synchronous. In certain
cases, This would prevent the syncers from shutting down promptly, as
they would try to offload a batch a of messages that could not be
aborted. Now, an error will be propagated back to the caller, allowing
them to detect the error condition, and reevaluate their own quit
signals, releasing any waitgrouped goroutines and permitting a quick
shutdown.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.