peer: catch write timeouts, retry with backoff #2819

Merged
merged 6 commits into from Mar 27, 2019

3 participants
@cfromknecht
Collaborator

cfromknecht commented Mar 22, 2019

Fixes #2784.

@Roasbeef
Member

Roasbeef left a comment

Straightforward diff, will start testing this on mainnet now across various nodes.

@wpaulino wpaulino added this to the 0.6 milestone Mar 25, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:peer-write-retry branch 5 times, most recently from b20590d to aae0d28 Mar 25, 2019

@Roasbeef
Member

Roasbeef left a comment

LGTM 💥

I've been running this on my mainnet nodes lately, and have seen it resolve most of the issues (aside from caching) that we've seen with peer connectivity. One minor nit re a duplicated constant, but once rebased, this is ready to land IMO.

cfromknecht added some commits Mar 26, 2019

peer: retry writes with delay on timeout errors

This commit modifies the writeHandler to catch timeout
errors, and retry writes to the socket after a small
backoff, which increases exponentially from 5s to 1m.
With the growing channel graph size, some lower-powered
devices can be slow to pull messages off the wire during
validation. The current behavior will cause us to
disconnect the peer, and resend all of the messages that
the remote peer is slow to validate. Catching the timeout
helps in preventing such expensive reconnection cycles,
especially as the network continues to grow.

This is also a preliminary step to reducing the
write timeout constant. This will allow concurrent usage
of the write pools w/out devoting excessive amounts of
time blocking the pool for slow peers.

peer: reduce write timeout to 5 seconds

This commit reduces the peer's write timeout to 5s.
Now that the peer catches write timeouts and doesn't
disconnect, this will ensure we spend less time blocking
in the write pool in case others also need to access the
workers concurrently. Slower peers will now only block
for 5s, after every reattempt w/ exponential backoff.

peer: add symmetric write idle timeout

In this commit, we add a 5 minute idle timer to
the write handler. After catching the write
timeouts, it's been observed that some connections
have trouble reading a message for several hours.
This typically points to a deeper issue w/ the peer
or, e.g. the remote peer switched networks. This now
mirrors the idle timeout used in the read handler,
such that we will disconnect a peer if we are unable
to send or receive a message from the peer after 5
minutes.

We also modify the readHandler to drain its
idleTimer's channel in the event that the timer had
already fired, but we successfully sent the message.

lncfg/workers: bump default read/write workers from 16 -> 100

Bumps the default read and write handlers to be well
above the average number of peers a node has. Since
the worker counts specify only a maximum number of
concurrent read/write workers, it is expected that
the actual usage would converge to the requirements
of the node anyway. However, in preparation for a
major release, this is a conservative measure to
ensure that the default values aren't too low and
to improve network stability.

@cfromknecht cfromknecht force-pushed the cfromknecht:peer-write-retry branch from aae0d28 to c536516 Mar 26, 2019

@Roasbeef Roasbeef merged commit b935f69 into lightningnetwork:master Mar 27, 2019

1 of 2 checks passed

coverage/coveralls Coverage decreased (-0.05%) to 59.744%
continuous-integration/travis-ci/pr The Travis CI build passed

@cfromknecht cfromknecht deleted the cfromknecht:peer-write-retry branch Mar 27, 2019

@hpbock hpbock referenced this pull request Mar 30, 2019

Closed

Peer connections flapping #2784
