peer: catch write timeouts, retry with backoff #2819
Roasbeef left a comment
I've been running this on my mainnet nodes lately, and have seen it resolve most of the issues (aside from caching) that we've seen with peer connectivity. One minor nit re a duplicated constant, but once rebased, this is ready to land IMO.
This commit modifies the writeHandler to catch timeout errors, and retry writes to the socket after a small backoff, which increases exponentially from 5s to 1m. With the growing channel graph size, some lower-powered devices can be slow to pull messages off the wire during validation. The current behavior will cause us to disconnect the peer, and resend all of the messages that the remote peer is slow to validate. Catching the timeout helps in preventing such expensive reconnection cycles, especially as the network continues to grow. This is also a preliminary step to reducing the write timeout constant. This will allow concurrent usage of the write pools w/out devoting excessive amounts of time blocking the pool for slow peers.
This commit reduces the peer's write timeout to 5s. Now that the peer catches write timeouts and doesn't disconnect, this will ensure we spend less time blocking in the write pool in case others also need to access the workers concurrently. Slower peers will now only block for 5s, after every reattempt w/ exponential backoff.
In this commit, we add a 5 minute idle timer to the write handler. After catching the write timeouts, it's been observed that some connections have trouble reading a message for several hours. This typically points to a deeper issue w/ the peer or, e.g., the remote peer switched networks. This now mirrors the idle timeout used in the read handler, such that we will disconnect a peer if we are unable to send or receive a message from the peer within 5 minutes. We also modify the readHandler to drain its idleTimer's channel in the event that the timer had already fired, but we successfully sent the message.
Bumps the default read and write worker counts to be well above the average number of peers a node has. Since the worker counts specify only a maximum number of concurrent read/write workers, it is expected that the actual usage would converge to the requirements of the node anyway. However, in preparation for a major release, this is a conservative measure to ensure that the default values aren't too low, and to improve network stability.