Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txhashset archive save to file fail (connection closed) #2929

Closed
antiochp opened this issue Jun 28, 2019 · 8 comments
Closed

txhashset archive save to file fail (connection closed) #2929

antiochp opened this issue Jun 28, 2019 · 8 comments
Labels

Comments

@antiochp
Copy link
Member

@antiochp antiochp commented Jun 28, 2019

Added some logging to the txhashset download so we can track progress via the log file (on the 2.x.x branch) -

20190628 09:40:09.321 DEBUG grin_p2p::peer - Asking 67.207.70.122:3414 for txhashset archive at 231880 000018532638.
20190628 09:40:14.687 DEBUG grin_p2p::protocol - handle_payload: txhashset archive for 000018532638 at 231880. size=253592404
20190628 09:40:25.693 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 2976000/253592404
20190628 09:40:36.748 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 9072000/253592404
20190628 09:40:47.769 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 16944000/253592404
20190628 09:40:58.829 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 24384000/253592404
...
20190628 09:49:02.918 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 238656000/253592404
20190628 09:49:14.008 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 245472000/253592404
20190628 09:49:25.148 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 250128000/253592404
20190628 09:49:36.447 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 251232000/253592404
20190628 09:49:38.456 ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" })

I have seen the above scenario a few time now. We download what appears to be almost all the bytes and then the connection is dropped.

Seems like too much of a coincidence that we get so close to getting the whole file and then fail to save it with a "connection closed" (like 99% of the way there on multiple occasions).

I suspect we actually have all the bytes and that the connection is getting closed on the sending side as soon as it has sent everything over (for whatever reason).
We must be able to do something smarter on the receiving side to handle this more robustly.

Related - #2639.

@antiochp antiochp added the bug label Jun 28, 2019
@antiochp antiochp added this to the 2.x.x milestone Jun 28, 2019
@antiochp
Copy link
Member Author

@antiochp antiochp commented Jun 28, 2019

Here's another instance of it happening -

...
20190628 10:11:09.139 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 261024000/275364910
20190628 10:11:20.201 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 266400000/275364910
20190628 10:11:31.297 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 271008000/275364910
20190628 10:11:37.064 ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" })

We log the update msg every 10s so here again we got within 10s of completing the download and the connection mysteriously dropped on the other end.

I would bet "several minutes worth of Grin" that we actually received the full file and then our connection handling caused it to fail to save successfully.

@hashmap
Copy link
Member

@hashmap hashmap commented Jun 28, 2019

Can confirm, saw it recently. Will also check how we handle it.

hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 28, 2019
We use a buffered SyncChannel (buffer size is 10) to store messages to
be sent to a remote peer. In send impl we use try_send which return
error if channel is closed (peer is disconnected) or the buffer is full.
Such error leads to dropping the peer.

When we send a txhaset archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow it may lead to channel sending error hence the peer's drop.

We could increase buffer, but at the same time we use SyncChannel as
regular channel (with unlimited buffer), so it may make more sense to switch to it. It's unlikely that a peer can buffer too many messages, so wasting memory on prealocating a big buffer is not justified.

Adresses mimblewimble#2929
@antiochp
Copy link
Member Author

@antiochp antiochp commented Jun 28, 2019

@hashmap My understanding of our current threading model is as follows (please correct me if this is wrong) -

  • bunch of peer threads receiving msgs from our peers
  • some arbitrary thread sending data out to peers (broadcast of block or tx etc.)
    • we use the send_channel here to get the data to the necessary peer thread

We do not use the individual peer threads to actually send data out to our peers.
But we may receive a tx from one peer and then use the thread corresponding to that peer to broadcast out to all our other peers.

But in the case of msgs that have a response, like the txhashset request we do use the same thread to send data back to that peer.
So we receive a txhashset request on the thread corresponding to a particular peer and then we use that same thread to "reply" to that peer, effectively locking that thread up for an extended period of time (as you mention in #2930).

I wonder if we should consider spawning a separate thread for long running "replies" like the txhashset response?

Edit: This is not correct. We have the send_channel in there - so we send out via the network using the peer thread.

hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 28, 2019
When we send a txhashet archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow buffer capacity 10 may be not enough,  hence the peer's drop.

Safer attempt to address mimblewimble#2929 in 2.0.0
hashmap added a commit that referenced this issue Jun 30, 2019
When we send a txhashet archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow buffer capacity 10 may be not enough,  hence the peer's drop.

Safer attempt to address #2929 in 2.0.0
hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 30, 2019
When we send a txhashet archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow buffer capacity 10 may be not enough,  hence the peer's drop.

Safer attempt to address mimblewimble#2929 in 2.0.0
hashmap added a commit that referenced this issue Jun 30, 2019
When we send a txhashet archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow buffer capacity 10 may be not enough,  hence the peer's drop.

Safer attempt to address #2929 in 2.0.0
@mmgen
Copy link
Contributor

@mmgen mmgen commented Nov 11, 2019

Related issue: with the introduction of blocking IO in d3dbafa, a failed txhashset archive download throwing the IBD process into an infinite loop has now become an issue with high-latency connections. It's easily solved by increasing the constant IO_TIMEOUT (in p2p/src/conn.rs) to 10000 milliseconds.

See: #3109

@mmgen
Copy link
Contributor

@mmgen mmgen commented Nov 24, 2019

Still observing this with freshly-compiled client (master@78220feb). When remote peer is a v1 client, DL gets to 99% and then fails with "connection reset by peer":

    201911DD HH:MM:SS.nnn DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 484608000/489598372
    201911DD HH:MM:SS.nnn ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })

This appears to be a problem with v1 clients, so is obviously unfixable at this point, unless the fix is backported and users agree to upgrade. (Apparently, #2934 didn't help, since these are 2.0.0 clients).

When remote peer is v2 (v2.1.1), I'm observing even stranger behavior: The download begins normally, but after 30 seconds or so the peer just stops sending data. The connection remains open, no error is reported, but the data flow just stops.

The following debugging code reveals that the blocking is occurring at self.stream.read_exact(), which indicates the problem is with the remote end of the connection:

diff --git a/p2p/src/conn.rs b/p2p/src/conn.rs                                                                                                                            
index 4bfe5bc6..efe683ab 100644
--- a/p2p/src/conn.rs
+++ b/p2p/src/conn.rs
@@ -106,10 +106,12 @@ impl<'a> Message<'a> {
        let mut written = 0;
        while written < len {
            let read_len = cmp::min(8000, len - written);
+           debug!("before read: len: {}, written: {}, read_len: {}", len, written, read_len);
            let mut buf = vec![0u8; read_len];
            self.stream.read_exact(&mut buf[..])?;
            writer.write_all(&mut buf)?;
            written += read_len;
+           debug!("after read:  len: {}, written: {}, read_len: {}", len, written, read_len);
        }
        Ok(written)
    }

Both errors have been observed multiple (10x) times, and the behavior is always the same.

If the latter error has been fixed in v3.0.0, then please ignore. There are very few 3.0.0 nodes active on mainnet, so haven't been able to test against them.

@quentinlesceller
Copy link
Member

@quentinlesceller quentinlesceller commented May 11, 2020

@antiochp any update on that issue? Is this still relevant?

@mmgen
Copy link
Contributor

@mmgen mmgen commented May 11, 2020

@antiochp any update on that issue? Is this still relevant?

Don't know. I haven't fired up a fresh Grin node in awhile. Perhaps I'll do so in the coming days.

@antiochp
Copy link
Member Author

@antiochp antiochp commented May 13, 2020

I'm going to close this for now. I believe this is fixed in 3.x.x.
If we see this behavior or something similar we can reopen a new issue.

@antiochp antiochp closed this May 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants