-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
peer: catch write timeouts, retry with backoff #2819
peer: catch write timeouts, retry with backoff #2819
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Straight forward diff, will start testing this on mainnet now across various nodes.
b20590d
to
aae0d28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 💥
I've been running this on my mainnet nodes lately, and have seen it resolve most of the issues (aside from caching) that we've seen with peer connectivity. One minor nit re a duplicated constant, but once rebased, this is ready to land IMO.
This commit modifies the writeHandler to catch timeout errors, and retry writes to the socket after a small backoff, which increases exponentially from 5s to 1m. With the growing channel graph size, some lower-powered devices can be slow to pull messages off the wire during validation. The current behavior will cause us to disconnect the peer, and resend all of the messages that the remote peer is slow to validate. Catching the timeout helps in preventing such expensive reconnection cycles, especially as the network continues to grow. This is also a preliminary step to reducing the write timeout constant. This will allow concurrent usage of the write pools w/out devoting excessive amounts of time blocking the pool for slow peers.
This commit reduces the peer's write timeout to 5s. Now that the peer catches write timeouts and doesn't disconnect, this will ensure we spend less time blocking in the write pool in case others also need to access the workers concurrently. Slower peers will now only block for 5s, after every reattempt w/ exponential backoff.
In this commit, we add a 5 minute idle timer to the write handler. After catching the write timeouts, it's been observed that some connections have trouble reading a message for several hours. This typically points to a deeper issue w/ the peer or, e.g. the remote peer switched networks. This now mirrors the idle timeout used in the read handler, such that we will disconnect a peer if we are unable to send or receive a message from the peer after 5 minutes. We also modify the readHandler to drain its idleTimer's channel in the even that the timer had already fired, but we successfully sent the message.
Bumps the default read and write handlers to be well above the average number of peers a node has. Since the worker counts specify only a maximum number of concurrent read/write workers, it is expected that the actual usage would converge to the requirements of the node anyway. However, in preparation for a major release, this is a conservative measure to ensure that the default values aren't too low and improve network instability.
aae0d28
to
c536516
Compare
Fixes #2784.