Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peer: catch write timeouts, retry with backoff #2819

Merged
merged 6 commits into from
Mar 27, 2019

Conversation

cfromknecht
Copy link
Contributor

@cfromknecht cfromknecht commented Mar 22, 2019

Fixes #2784.

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Straight forward diff, will start testing this on mainnet now across various nodes.

peer.go Show resolved Hide resolved
peer.go Outdated Show resolved Hide resolved
@wpaulino wpaulino added this to the 0.6 milestone Mar 25, 2019
peer.go Outdated Show resolved Hide resolved
peer.go Show resolved Hide resolved
peer.go Outdated Show resolved Hide resolved
@cfromknecht cfromknecht force-pushed the peer-write-retry branch 5 times, most recently from b20590d to aae0d28 Compare March 26, 2019 07:30
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💥

I've been running this on my mainnet nodes lately, and have seen it resolve most of the issues (aside from caching) that we've seen with peer connectivity. One minor nit re a duplicated constant, but once rebased, this is ready to land IMO.

peer.go Show resolved Hide resolved
This commit modifies the writeHandler to catch timeout
errors, and retry writes to the socket after a small
backoff, which increases exponentially from 5s to 1m.
With the growing channel graph size, some lower-powered
devices can be slow to pull messages off the wire during
validation. The current behavior will cause us to
disconnect the peer, and resend all of the messages that
the remote peer is slow to validate. Catching the timeout
helps in preventing such expensive reconnection cycles,
especially as the network continues to grow.

This is also a preliminary step to reducing the
write timeout constant. This will allow concurrent usage
of the write pools w/out devoting excessive amounts of
time blocking the pool for slow peers.
This commit reduces the peer's write timeout to 5s.
Now that the peer catches write timeouts and doesn't
disconnect, this will ensure we spend less time blocking
in the write pool in case others also need to access the
workers concurrently. Slower peers will now only block
for 5s, after every reattempt w/ exponential backoff.
In this commit, we add a 5 minute idle timer to
the write handler. After catching the write
timeouts, it's been observed that some connections
have trouble reading a message for several hours.
This typically points to a deeper issue w/ the peer
or, e.g. the remote peer switched networks. This now
mirrors the idle timeout used in the read handler,
such that we will disconnect a peer if we are unable
to send or receive a message from the peer after 5
minutes.

We also modify the readHandler to drain its
idleTimer's channel in the even that the timer had
already fired, but we successfully sent the message.
Bumps the default read and write handlers to be well
above the average number of peers a node has. Since
the worker counts specify only a maximum number of
concurrent read/write workers, it is expected that
the actual usage would converge to the requirements
of the node anyway. However, in preparation for a
major release, this is a conservative measure to
ensure that the default values aren't too low and
improve network instability.
@Roasbeef Roasbeef merged commit b935f69 into lightningnetwork:master Mar 27, 2019
@cfromknecht cfromknecht deleted the peer-write-retry branch March 27, 2019 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants