
Increase IO_TIMEOUT to allow nodes on high-latency connections to sync #3109

Merged
merged 1 commit into mimblewimble:master from mmgen:io-timeout-patch on Nov 13, 2019

Conversation

@mmgen (Contributor) commented Nov 11, 2019

Commit d3dbafa "Use blocking IO in P2P to reduce CPU load" (merged into v2.1.0) introduced the constant IO_TIMEOUT, setting it to 1 second.

On nodes with high-latency connections, this short timeout causes the txhashset archive download during step 2 of the IBD process to invariably fail before it completes. Since there's no mechanism for resuming a failed download, this means the node gets stuck at this stage and never syncs.

Increasing IO_TIMEOUT to 10 seconds solves the issue on my node; others might suggest a more optimal value for the constant.

@DavidBurkett (Contributor) commented Nov 11, 2019

Does this change affect shutdown times?

@mmgen (Contributor, Author) commented Nov 11, 2019

Doesn't seem to. Haven't noticed any difference in my node's behavior.

@hashmap (Member) commented Nov 11, 2019

Yes, it would increase shutdown time. Also, we need to figure out why it fails the download: a timeout is not an exceptional situation, and it doesn't stop reading from a peer. Perhaps we partially read, get a timeout, and drop the buffer; I'll check it. This would be a short-term fix.
In the long term we need to think about using async IO on top of tokio or async-std. We used tokio for p2p and got rid of it in early 2018 because it was hard to maintain and reason about (we still have tokio in the api layer and it's a pain), but the latest stable Rust brings us async/.await, which makes async IO much more pleasant to work with. However, this change would require some redesign in the message-processing area: we need to separate network IO from header/block processing.

@mmgen (Contributor, Author) commented Nov 11, 2019

Agree this is a short-term fix. But it does solve the problem and doesn't seem to cause any serious issues of its own. The downloads were repeatedly failing partway through with EAGAIN. Here's a snippet from the log file:

    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 2256000/460392662
    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 9888000/460392662
    ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })

Suggestion: the fix could be made more fine-grained by using the longer timeout only for the txhashset download. Other connections thus wouldn't be affected.

@hashmap (Member) commented Nov 11, 2019

Interesting. And I was wrong about the longer shutdown: it would have been the case with the original PR, but we then changed the behavior, so now we just close the remaining threads after a short period of time.

@mmgen (Contributor, Author) commented Nov 11, 2019

I've been unable to observe any difference in the patched node's behavior, aside from the fact that it now syncs :-)

@hashmap hashmap merged commit 928097a into mimblewimble:master Nov 13, 2019
10 checks passed:
mimblewimble.grin Build #20191111.1 succeeded
mimblewimble.grin (linux api/util/store) succeeded
mimblewimble.grin (linux chain/core/keychain) succeeded
mimblewimble.grin (linux pool/p2p/src) succeeded
mimblewimble.grin (linux release) succeeded
mimblewimble.grin (linux servers) succeeded
mimblewimble.grin (macos release) succeeded
mimblewimble.grin (macos test) succeeded
mimblewimble.grin (windows release) succeeded
mimblewimble.grin (windows test) succeeded
@mmgen mmgen deleted the mmgen:io-timeout-patch branch Nov 13, 2019
@antiochp antiochp added this to the 3.0.0 milestone Dec 11, 2019
@antiochp antiochp mentioned this pull request Dec 11, 2019
garyyu added a commit to garyyu/grin that referenced this pull request Jan 14, 2020
Increase IO_TIMEOUT to allow nodes on high-latency connections to sync (mimblewimble#3109)
