txhashset archive save to file fail (connection closed) #2929
Added some logging to the txhashset download so we can track progress via the log file (on the 2.x.x branch) -
I have seen the above scenario a few times now. We download what appears to be almost all the bytes and then the connection is dropped.
Seems like too much of a coincidence that we get so close to getting the whole file and then fail to save it with a "connection closed" (like 99% of the way there on multiple occasions).
I suspect we actually have all the bytes and that the connection is getting closed on the sending side as soon as it has sent everything over (for whatever reason).
Related - #2639.
Here's another instance of it happening -
We log the update msg every 10s so here again we got within 10s of completing the download and the connection mysteriously dropped on the other end.
I would bet "several minutes worth of Grin" that we actually received the full file and then our connection handling caused it to fail to save successfully.
We use a buffered SyncChannel (buffer size is 10) to store messages to be sent to a remote peer. The send impl uses try_send, which returns an error if the channel is closed (the peer is disconnected) or the buffer is full. Such an error leads to the peer being dropped. While we send a txhashset archive, the peer's thread is busy sending it and can't send other messages, e.g. pings. If the network connection is slow, this may lead to a channel send error and hence the peer being dropped. We could increase the buffer, but we already use SyncChannel as if it were a regular channel (with an unlimited buffer), so it may make more sense to switch to one. It's unlikely that a peer will buffer too many messages, so wasting memory on preallocating a big buffer is not justified. Addresses mimblewimble#2929
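To illustrate the difference described above, here is a minimal sketch using only the Rust standard library. It is not the Grin code: the constant `SEND_BUFFER_CAP` and the function names are hypothetical, and the "stalled consumer" is simulated by a receiver that never reads. It shows `try_send` on a bounded `sync_channel` failing with `Full` once 10 messages are queued, while an unbounded `channel` keeps accepting messages as long as the receiver exists.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Hypothetical capacity, mirroring the buffer size of 10 mentioned above.
const SEND_BUFFER_CAP: usize = 10;

/// Tries to queue `n` messages on a bounded channel whose consumer is stalled
/// (the receiver is kept alive but never read). Returns how many messages
/// were accepted before `try_send` reported an error.
pub fn bounded_accepts(n: usize) -> usize {
    let (tx, _rx) = sync_channel::<usize>(SEND_BUFFER_CAP);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            // Buffer is full: this is the case that would get a slow peer dropped.
            Err(TrySendError::Full(_)) => break,
            // Receiver is gone: the peer really is disconnected.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Same experiment with an unbounded channel: `send` never blocks and only
/// fails if the receiver has been dropped.
pub fn unbounded_accepts(n: usize) -> usize {
    let (tx, _rx) = channel::<usize>();
    (0..n).filter(|i| tx.send(*i).is_ok()).count()
}

/// Both channel kinds still report a dropped receiver as an error.
pub fn detects_disconnect() -> bool {
    let (tx, rx) = sync_channel::<usize>(SEND_BUFFER_CAP);
    drop(rx);
    matches!(tx.try_send(0), Err(TrySendError::Disconnected(_)))
}
```

With a stalled consumer, `bounded_accepts(25)` stops at the capacity while `unbounded_accepts(25)` queues everything. The trade-off is that the unbounded variant gives up backpressure, so memory use is limited only by how fast the consumer eventually drains the queue.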
@hashmap My understanding of our current threading model is as follows (please correct me if this is wrong) -
But in the case of msgs that have a response, like the txhashset request we do use the same thread to send data back to that peer.
I wonder if we should consider spawning a separate thread for long running "replies" like the txhashset response?
Edit: This is not correct. We have the
When we send a txhashset archive, the peer's thread is busy sending it and can't send other messages, e.g. pings. If the network connection is slow, a buffer capacity of 10 may not be enough, hence the peer being dropped. A safer attempt to address mimblewimble#2929 in 2.0.0
Related issue: with the introduction of blocking IO in d3dbafa, a failed txhashset archive download throwing the IBD process into an infinite loop has now become an issue with high-latency connections. It's easily solved by increasing the constant IO_TIMEOUT (in p2p/src/conn.rs) to 10000 milliseconds.
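For anyone who wants to see the timeout effect in isolation, here is a rough sketch with plain `std::net` sockets. It is not taken from p2p/src/conn.rs: `read_with_timeout` and `demo` are hypothetical names, the timeout is passed as a parameter rather than read from a constant, and the "slow peer" is simulated by a local thread that sleeps before sending.

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Reads exactly `len` bytes from a blocking socket, failing if any single
/// read stalls for longer than `timeout`.
pub fn read_with_timeout(
    stream: &mut TcpStream,
    len: usize,
    timeout: Duration,
) -> io::Result<Vec<u8>> {
    stream.set_read_timeout(Some(timeout))?;
    let mut buf = vec![0u8; len];
    // On timeout this returns an error (WouldBlock or TimedOut, depending on
    // the platform) instead of blocking forever.
    stream.read_exact(&mut buf)?;
    Ok(buf)
}

/// Simulates a high-latency peer on localhost: the peer waits `delay` before
/// sending 4 bytes, and the reader gives up after `timeout`.
pub fn demo(delay: Duration, timeout: Duration) -> io::Result<Vec<u8>> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let sender = thread::spawn(move || {
        if let Ok((mut peer, _)) = listener.accept() {
            thread::sleep(delay);
            let _ = peer.write_all(b"grin");
        }
    });
    let mut stream = TcpStream::connect(addr)?;
    let result = read_with_timeout(&mut stream, 4, timeout);
    let _ = sender.join();
    result
}
```

A peer that answers within the timeout is read normally, while one that takes longer than the timeout produces an error even though the data would have arrived shortly after. That is the failure mode a larger timeout is meant to avoid on high-latency links.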
Still observing this with a freshly compiled client (master@78220feb). When the remote peer is a v1 client, the download gets to 99% and then fails with "connection reset by peer":
This appears to be a problem with v1 clients, so is obviously unfixable at this point, unless the fix is backported and users agree to upgrade. (Apparently, #2934 didn't help, since these are 2.0.0 clients).
When remote peer is v2 (v2.1.1), I'm observing even stranger behavior: The download begins normally, but after 30 seconds or so the peer just stops sending data. The connection remains open, no error is reported, but the data flow just stops.
The following debugging code reveals that the blocking is occurring at
Both errors have been observed multiple (10x) times, and the behavior is always the same.
If the latter error has been fixed in v3.0.0, then please ignore. There are very few 3.0.0 nodes active on mainnet, so haven't been able to test against them.