This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Connections between peers on a private testnet drop for no apparent reason #2668

Closed
lgpawel opened this issue Oct 17, 2016 · 23 comments · Fixed by #4102
Assignees
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust.
Milestone

Comments

@lgpawel

lgpawel commented Oct 17, 2016

With Parity version v1.4.0-unstable-271bcf4-20161009/x86_64-linux-gnu/rustc1.12.0 and a private testnet (spec file: https://gist.github.com/lgpawel/a6ba7660b778dc9775b00849abfc8be0) I experience unexpected disconnects. In both of the following examples, there are just two instances of Parity running in parallel on the same machine, launched with --bootnodes pointing the instances at one another. For some reason (which may or may not be a separate issue), the number of connected peers reported is greater than 1, as if there were multiple parallel connections between the instances.

In these logs the instances sit idle reporting two connections between them, then one of them drops, and after some time the other does too: https://gist.github.com/lgpawel/711033cfc0f4e0bd180d2e5397d3e91b

In these, the instances sit idle, apparently stable with three connections between them; then I launch mining on both of them (explicitly limiting it to 1 and 2 cores, respectively, of a four-core processor) and the connections are dropped immediately, this time with much more verbose output: https://gist.github.com/lgpawel/c7c093325a533388d961e9e294849290

@lgpawel lgpawel changed the title Connections between peers drop for no apparent reason Connections between peers on a private testnet drop for no apparent reason Oct 17, 2016
@tomusdrw tomusdrw added F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. labels Oct 17, 2016
@tomusdrw
Collaborator

Updates from Gitter:

The upgraded Parity indeed shows peer info. Now, again with two instances, I have 2 peers connected in each instead of 1, both with the same id (the enode of the other instance, unsurprisingly) and differing only in localAddress and remoteAddress; see https://gist.github.com/lgpawel/638b21babf3cf61047b1b118b12311fa

@lgpawel
Author

lgpawel commented Oct 18, 2016

Now with v1.4.0-unstable-6c7af57-20161017/x86_64-linux-gnu/rustc1.12.0 — the same launch parameters, no mining, and an unexpected disconnect with nothing meaningful in the logs: https://gist.github.com/lgpawel/76473425d6eab03ee036c84c01d3d47a

@arkpar
Collaborator

arkpar commented Oct 31, 2016

Should be fixed in latest master. @lgpawel could you confirm?

@lgpawel
Author

lgpawel commented Nov 1, 2016

Not really, unfortunately. I pulled in the latest updates and cargo-built the release; the version now reports as v1.4.0-unstable-cb3bb24-20161101/x86_64-linux-gnu/rustc1.12.0 (incidentally, running with the -v or --version flag outputs nothing now). I ran three instances in parallel, pointing them to one another via --bootnodes. They established multiple (2, 4, and 7) connections between one another, but this was stable for an hour, as long as none of them mined anything.

Then I ran miners for two of them. (The third sits on a rather long fork that it cannot reorg by itself. The fork is a result of one of these disconnects happening some time ago; I was not around to force a reconnect before it got that long. I don't care about this, though.) All connections between instances 1 and 3 dropped pretty much instantly, and within minutes one of the two connections between instances 1 and 2 followed. After some time, one of the connections between 2 and 3 died, soon followed by the rest of them, and finally the remaining connection between 1 and 2 was also lost. All of this happened over the course of some 30 minutes.

I don't care about 3 disconnecting from 1 or 2; I guess this may even be intended, when 3 e.g. receives a block freshly mined by 1 or 2, recognizes it as incompatible with the fork it's sitting on, and concludes that these peers are not going to send sensible blocks. But 1 and 2 disconnecting is obviously an issue.

I didn't grab the full logs, but some of them are here: https://gist.github.com/lgpawel/0171fe09498b1c95cffa277ea4d02eb3

@gavofyork gavofyork modified the milestone: 1.4 Civility Nov 1, 2016
@gavofyork gavofyork added this to the 1.5 Tenuity milestone Nov 17, 2016
@gavofyork gavofyork added M4-network M4-core ⛓ Core client code / Rust. and removed M4-core ⛓ Core client code / Rust. M4-network labels Nov 17, 2016
@arkpar
Collaborator

arkpar commented Dec 14, 2016

Could not reproduce this on latest master with the given chain spec. @lgpawel could you check again? Please run with -l network=trace and collect logs if this is still the case.

@lgpawel
Author

lgpawel commented Dec 16, 2016

Just to be on the safe side, I (re)installed Cargo from http://rustup.rs, re-cloned the Parity repo, ran git submodule init, git submodule update and cargo build --release, getting version Parity/v1.5.0-unstable-03db4d0-20161215/x86_64-linux-gnu/rustc1.13.0. I used the chain spec downloaded verbatim from the Gist above (https://gist.github.com/lgpawel/a6ba7660b778dc9775b00849abfc8be0) and decided to work with a fresh chain.

I ran three nodes as parity --chain ChainSpecExample.json --db-path nodeX --port 3030X --jsonrpc-port 854X --no-dapps --no-ui --author [...] --bootnodes [...] -l network=trace --log-file nodeX.log with X = 1, 2, 3 and --bootnodes pointing the nodes to one another. They established double connections with one another. After some 7 hours, I turned mining on for each of them with ethminer -C -t 1 -F http://127.0.0.1:854X (on a four-core processor) and left it overnight.
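For illustration only, a sketch of how the per-node command lines above could be generated programmatically; the helper name is made up, the --author and --bootnodes values stay elided as in the comment, and the flags are exactly the ones quoted:

```python
def launch_command(x, bootnodes, author="[...]"):
    """Build the parity invocation for node number x (1, 2, or 3),
    following the nodeX / 3030X / 854X numbering scheme above."""
    return (
        f"parity --chain ChainSpecExample.json --db-path node{x} "
        f"--port 3030{x} --jsonrpc-port 854{x} --no-dapps --no-ui "
        f"--author {author} --bootnodes {','.join(bootnodes)} "
        f"-l network=trace --log-file node{x}.log"
    )

# e.g. the command for node 2 (bootnode URL here is a placeholder):
cmd = launch_command(2, ["enode://...@127.0.0.1:30301"])
```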

Now, the redundant connections between 1 and 3, 1 and 2, and 2 and 3 dropped some 2.5, 3.5, and 10 hours after mining was switched on, respectively. As of now, 15 hours in, single connections between all nodes are being kept.

So it's not 100% conclusive to me. I had rather expected to find all the connections, including the redundant ones, still live in the morning, in which case I'd have been more than happy to have this issue closed. As it is, however, I can still entertain the possibility that the connection dropping I've observed may not be limited to redundant connections.

I'm not posting any logs yet, because after a run that long with -l network=trace they sit at 1.8-2.2 gigs each (it's slightly confusing why there's a difference like this, especially since the larger of them was written by node 1 which, as indicated above, lost both of its redundant connections earlier on and therefore possibly had less to log for some six hours). If you'd like to see them, please let me know which parts I should post, or where to upload them so they're convenient to view.

I'm keeping the setup running as of now, and I certainly can let it run over the weekend, unless I accidentally kill the processes or something.

Update: a few minutes ago (some 17.5 hours after mining was switched on) a second connection between 2 and 3 (re)appeared. This is something I've never seen in my local testnet experiments before, and it's nice to see connections appear rather than drop for a change, but for a redundant connection it seems somewhat dodgy. Further update: back to a single connection between these nodes 20 minutes later.

(Disclaimer: I'm not looking into specific peer info from Parity; instead I infer the number of connections between specific nodes from the number of peers each of them reports. E.g. if these go from 2-2-2 to 2-3-3, I conclude that a connection appeared between 2 and 3.)
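The inference in the disclaimer can be sketched as code (my reconstruction, not the reporter's actual method; it only covers the simple three-node case, where a single new or dropped connection changes exactly two peer counts by the same amount):

```python
def changed_pair(before, after):
    """Given per-node peer counts before and after, return the 1-based
    pair of nodes whose link changed, or None if no single-pair change
    can be inferred. A new connection between nodes i and j bumps both
    their peer counts by one and leaves the third node untouched."""
    deltas = [a - b for a, b in zip(after, before)]
    moved = [i for i, d in enumerate(deltas) if d != 0]
    if len(moved) == 2 and deltas[moved[0]] == deltas[moved[1]]:
        return (moved[0] + 1, moved[1] + 1)
    return None

# 2-2-2 -> 2-3-3: a connection appeared between nodes 2 and 3.
assert changed_pair([2, 2, 2], [2, 3, 3]) == (2, 3)
```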

@arkpar
Collaborator

arkpar commented Dec 16, 2016

So each node reports 4 connections initially, as in 0/4/4?
Is it possible that you have two network routes between the peers? E.g. each peer has two network adapters connected to each other over the same or different networks?

@lgpawel
Author

lgpawel commented Dec 16, 2016

0/4/25, to be exact, but yes. I wouldn't suspect anything like that, i.e. this computer doesn't have any weird networking setup, just a single Ethernet adapter with no specific config (ifconfig lists just that one plus lo). However, --bootnodes were specified using the LAN IP of the computer and not 127.0.0.1, if that matters.

As for the status of the test: the remaining connection between 1 and 2 dropped some three hours ago, and just a while ago the same happened to the last connection between 1 and 3. This left 1 isolated from the rest of the network for some five minutes until the 1-3 connection reappeared. (Also the additional 2-3 connection reappeared and disappeared four times in the meantime.)

I don't know whether, even on a small LAN testnet, a five-minute disconnect is enough for a miner to develop a critically long fork, but if five minutes is needed to reconnect even to nodes on the same computer, then I guess it may take even longer, especially in other circumstances. That is, if it isn't by design. Also note that the connection between 1 and 2 was not reestablished for 3 hours. (Update: it has just been reestablished, after some 3.5 hours.)

I guess I should upload just the relevant parts of the logs, which I suppose would mean from some time before a connection is lost until shortly after. Please let me know how long the sections I extract should be, and for which disconnects (or reconnects).

@arkpar
Collaborator

arkpar commented Dec 16, 2016

@lgpawel a few seconds' worth of logs around any disconnect event, plus the start, should do it. Thank you!

@lgpawel
Author

lgpawel commented Dec 19, 2016

I had left it running over the weekend, and this morning I discovered that they were all disconnected from one another: node 1 from the others for 10 hours already, and nodes 2 and 3 from one another for some two hours, with some short reconnects happening after that. I've stopped the processes; the logs are 6.5-7.7 gigs each.

As for the recommendation above, I decided to look at the messages printed every 30 seconds that contain the peer count, and whenever this count changed between consecutive messages, to extract all the messages logged in that 30-second window. However, this is still a lot: across all three nodes there were 220 occurrences of this kind, and each of them has anything between 146 and 193527 lines logged (although both of these are outliers). The whole set is 2.5M lines and 200 MB.
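The extraction procedure just described might look roughly like this (my sketch, not the actual script; the "N/M/K peers" pattern in the regex is an assumption about the status-line shape and may not match real Parity output exactly):

```python
import re

# Assumed shape of the periodic status line, e.g. "... 2/4/25 peers ..."
PEERS_RE = re.compile(r"(\d+)/\d+/\d+ peers")

def changed_windows(lines):
    """Scan log lines; whenever the connected-peer count differs between
    two consecutive status lines, yield (old_count, new_count, window),
    where window is every line logged since the previous status line."""
    last_count, window = None, []
    for line in lines:
        window.append(line)
        m = PEERS_RE.search(line)
        if m:
            count = int(m.group(1))
            if last_count is not None and count != last_count:
                yield (last_count, count, window)
            last_count, window = count, []
```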

In https://gist.github.com/lgpawel/66c5453ada316f199ad8311d28b5a60e I've pasted a list of these log snippets with their respective line counts and sizes. The file names contain the node number, times of the beginning and end of the window logged, and a change in the number of peers that has happened within it.

I can upload them somewhere (what filesharing service is easy to use these days? or should I set up a GitHub repo?), but if only some of them are needed, please let me know. If I should crop them even further, please let me know, but if so, then I'd like to be able to script it.

@arkpar
Collaborator

arkpar commented Dec 19, 2016

A couple of these files would do for a start. I'm interested in the disconnect reason; it should be in the log along with the disconnect event.

@lgpawel
Author

lgpawel commented Dec 20, 2016

Here https://gist.github.com/lgpawel/c4d0f827630a42a100559ff7e262dbcb are two pairs of disconnect logs, chosen by virtue of relative brevity. One pair is node1.log.2016-12-18 15:03:07--2016-12-18 15:03:37.1--0.snippet and node3.log.2016-12-18 15:02:51--2016-12-18 15:03:21.1--0.snippet, the other is node2.log.2016-12-16 10:40:32--2016-12-16 10:41:02.3--2.snippet and node3.log.2016-12-16 10:40:37--2016-12-16 10:41:07.3--2.snippet. Ask away if more fragments are needed.

@arkpar
Collaborator

arkpar commented Dec 22, 2016

So it looks like the node was disconnected because it failed to respond to a ping. Is there a log from the other node for the same time? If you are mining on the same machine, consider allocating fewer CPU cores to mining. Also use --allow-ips=private to prevent Parity from needlessly trying to connect to external nodes.

EDIT: nm, I've found the second log in the same gist.

@lgpawel
Author

lgpawel commented Dec 22, 2016

If you are mining on the same machine consider allocating less CPU cores for mining.

As stated above, I passed the flags -C -t 1 to ethminer, and the processor has four cores, so I imagine that ideally the mining processes occupied one core each and everything else ran on the remaining core. I had not enforced any stricter control, though.

@gavofyork
Contributor

Any movement on this, @arkpar?

@arkpar
Collaborator

arkpar commented Jan 10, 2017

Potential fix has been merged to master. @lgpawel could you give it another run with the latest build?

@lgpawel
Author

lgpawel commented Jan 13, 2017

I'm now testing it in a basically identical setup as in https://github.com/ethcore/parity/issues/2668#issuecomment-267560784. Parity is now in Parity/v1.6.0-unstable-311730e-20170112/x86_64-linux-gnu/rustc1.14.0 version, the chain spec is the one at https://gist.github.com/lgpawel/a6ba7660b778dc9775b00849abfc8be0, and the command for the nodes is parity --chain ChainSpecExample.json -d nodeX --port 3030X --jsonrpc-port 854X --no-dapps --no-ui --author 0b501e7eedea110ca7edda7aba5eda7a5e1ec7ed -l network=trace --log-file nodeX.log --bootnodes [...].

So far the behaviour is similar to the report above. All the nodes initially reported four connections each, from which I infer that they established double connections to one another. This remained consistent for some 18 hours, until I switched mining on (with ethminer -C -t 1 -F http://127.0.0.1:854X again). Then, in about two minutes, nodes 1 and 2 went down by one connection each, and in another six minutes by one more, while node 3 went down by two at once. Since then (though that was just an hour ago) the situation seems stable, with a single connection between each pair of nodes. I'll keep this running over the weekend and report any updates afterwards.

https://gist.github.com/lgpawel/4ab5b97a2e062f07b02b1db2f687e371 contains the log fragments for the connection drops mentioned above. For node 1 there was also a brief moment in which one of the two remaining connections showed up as pending; this log fragment is included as well.

As an aside, in https://gist.github.com/lgpawel/7168834dc25ce7d8ef0a932fb854e791 I've posted a few hundred lines from the top of each of the logs, up to the point where they already report the duplicate connections; maybe this is of some interest. I guess that even if the connection dropping is fixed, this warrants an issue of its own, even if a less important one?

@lgpawel
Author

lgpawel commented Jan 16, 2017

Putting aside a handful of seconds-long disconnects, over the weekend there were three occasions on which the nodes all disconnected from one another, leaving the network completely fragmented, and then reestablished full connectivity (in all cases after about 90 minutes). Luckily this seems not to have broken the consensus in this case, as the nodes are now connected and report importing mined blocks from one another.

In https://gist.github.com/lgpawel/c19f39fb13e00a7c478b52725496ef76 there's a minute's worth of logs from each of the nodes, building up to the first of the three longer disconnects. At the end there's a list of snippets that I generated with a script; the filenames serve as a transcript of changes to the reported connection numbers. The file sizes and line counts are not directly comparable to those from the run a month ago, as the log verbosity seems to have been altered internally and my script is different now as well.

I'd say the connection stability is much, much better: there are far fewer disconnects, and the connections seem to get picked up after at most ca. 90 minutes (compared to more than 10 hours without reconnecting in one case previously). Yet these periods of complete network fragmentation are worrying, so I'm not going to go ahead and close this issue yet. Please ask for more logs if needed.

@arkpar
Collaborator

arkpar commented Jan 17, 2017

It looks like the remaining disconnects might be caused by your LAN going down for a short duration. Try using the localhost address instead of the LAN address for the bootnodes, i.e. replace 192.168.0.128 with 127.0.0.1 in all bootnode URLs.
As for the duplicate connections, this should be fixed in latest master with #4180.
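The suggested substitution amounts to rewriting the host part of each bootnode URL; a minimal sketch (the enode ID is a placeholder, and 192.168.0.128 is the LAN address named above):

```python
def localize_bootnode(url, lan_ip="192.168.0.128"):
    """Point an enode URL at the loopback address instead of the LAN
    address, so a brief LAN outage cannot affect same-host peers."""
    return url.replace(f"@{lan_ip}:", "@127.0.0.1:")

# e.g. enode://aa11@192.168.0.128:30301 -> enode://aa11@127.0.0.1:30301
url = localize_bootnode("enode://aa11@192.168.0.128:30301")
```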

@lgpawel
Author

lgpawel commented Jan 17, 2017

I've now restarted the run accordingly after updating to Parity/v1.6.0-unstable-8852a05-20170117/x86_64-linux-gnu/rustc1.14.0. So far I can see that the duplicate connections do not appear. I'll report back in a few days' time.

Would it be worth trying this out with nodes running on separate machines with actual LAN connections between them? Or should it be considered solved altogether, if LAN problems were responsible for the last disconnects?

@arkpar
Collaborator

arkpar commented Jan 18, 2017

If this is caused by LAN problems then it is indeed not an issue, as long as the peers reconnect when the network is back up.

@lgpawel
Author

lgpawel commented Jan 20, 2017

OK, I ran it with the new version, for 17 hours before mining was switched on and 53 hours afterwards, and I've seen no redundant connections and only a handful of seconds-long disconnects. I guess this concludes it all. Thanks!

@arkpar
Collaborator

arkpar commented Jan 20, 2017

Thank you for the detailed reports and testing.

@arkpar arkpar closed this as completed Jan 20, 2017
4 participants