Skip to content
This repository has been archived by the owner on Aug 23, 2020. It is now read-only.

Repeated adding/removing of neighbors brakes TCP connection #345

Closed
bahamapascal opened this issue Nov 10, 2017 · 18 comments
Closed

Repeated adding/removing of neighbors brakes TCP connection #345

bahamapascal opened this issue Nov 10, 2017 · 18 comments

Comments

@bahamapascal
Copy link

When a TCP neighbor is repeatedly added and removed via API call it will eventually brake the connection
This behavior varies, but in general after 10 - 20 removals and re-adding's it will trigger this bug.
Reestablishing of the connection is then only possible by restarting the IRI application. Some times if that doesn't help, the affected TCP neighbor also has to restart his node.
UDP neighbors are not affected by this.

@bahamapascal
Copy link
Author

I will add a 2 Gi bounty to who ever fixes this issue.
The bounty is valid until the 26th of November.

@nayanmshah
Copy link

Tried almost close to 20 times. Cannot reproduce it.

@bahamapascal
Copy link
Author

Hi @nayanmshah, apologies for my late reply, but I was swamped with work these days.
Any way, I have started the test again and it broke after aprox 31 times of removing/adding, the connection broke. I set the interval of the adding/removing to 5 min.
You can finde the console log here: https://gist.github.com/bahamapascal/75de4c90e5eae24a98bff86380605dba
On line 1682 was the last successful connection to the TCP neighbor

@romansemko
Copy link
Contributor

+1 reproducible by almost everyone testing Nelson at this moment. Does this affect UDP connections?

@rajivshah3
Copy link
Member

+1

Also saw this which I've never seen before:
12/16 21:33:10.876 [pool-5-thread-14] ERROR c.i.i.n.r.ReplicatorSourceProcessor - Did not receive neighbors listener port
Then it closed a sink right after:
12/16 21:33:10.877 [pool-5-thread-14] INFO c.i.i.n.r.ReplicatorSinkPool - Sink XX.XX.XX.XXX closed
12/16 21:33:10.877 [pool-5-thread-14] INFO com.iota.iri.network.Neighbor - Source XX.XX.XX.XXX closed

romansemko added a commit to SemkoDev/nelson.cli that referenced this issue Dec 16, 2017
@bahamapascal
Copy link
Author

@romansemko No this only affects TCP connections.

@bahamapascal
Copy link
Author

By the way, the 2 Gi bounty is still available if this gets fixed

@romansemko
Copy link
Contributor

Oh, maybe I should give it a shot then 👍

@bahamapascal
Copy link
Author

Would be awesome!

@mlouielu
Copy link

@bahamapascal Can you retry with 1.4.1.4? I can't reproduce this on 1.4.1.4 after 50+ remove/add neighbors.

@romansemko
Copy link
Contributor

I can try running Nelson with TCP. Will report if I see any difference.

@bahamapascal
Copy link
Author

@mlouielu Will do and let you know

@bahamapascal
Copy link
Author

Tested with 1.4.1.4 again and same issue.
Have upladed the log: https://gist.github.com/bahamapascal/9e8c39219eaca964a5cf57c1e2ed1205#file-iri-log-L1613
Line 1613 is the last time that it successfully connected to the TCP neighbor

@0vercrypt
Copy link

0vercrypt commented Dec 21, 2017

I'm not a java developer but I tried finding the problem and fixing it. I didn't commit anything but here's the problem explanation + fix:

The problem refers to the multithreading of ReplicatorSinkPool. The pool is currently limited to 32 threads. If you add a TCP neighbor, a new thread will be created. If this pool exceeds the 32 threads (limit of 32 seen in network/Replicator.java) the new threads will be queued but never executed because of the already added ones that are still executing. To reproduce this behavior just set the limit in Replicator.java (NUM_THREADS) to 3 and try removing and adding a TCP neighbor 4 times.

I fixed the problem after many tries with the following:
I implemented the following into network/ReplicatorSinkProcessor.java:

class ReplicatorSinkProcessor implements Runnable {
    [...]
    private boolean isStopped = false;
    [...]
    public void stop() {
        isStopped = true;
    }
    [...]
    @Override
    public void run() {
        [...]
                    while (!replicatorSinkPool.shutdown && !isStopped) {
[...]

After that I added some lines to network/ReplicatorSinkPool.java:

import java.util.Map;
import java.util.HashMap;
[...]
public class ReplicatorSinkPool  implements Runnable {
    [...]
    private final Map<Neighbor, Runnable> runnables = new HashMap<Neighbor, Runnable>();
    [...]
    public void createSink(TCPNeighbor neighbor) {
        if(runnables.containsKey(neighbor)) {
            ((ReplicatorSinkProcessor)runnables.get(neighbor)).stop();
        }
        Runnable proc = new ReplicatorSinkProcessor( neighbor, this, port);
        sinkPool.submit(proc);
        runnables.put(neighbor, proc);
    }

What happens is the following:
If the threads in the pool get the stop-flag (via stop() -> isStopped = true) the while loop gets interrupted. The thread will then be in "completed" state to make space for new threads.
Completed threads get kicked out of the pool when adding new threads if the thread pool is full.

Reach out to me on slack if I'm wrong: @Leibi133

@GhostTyper
Copy link

I hope you get the bounty. :D

@0vercrypt
Copy link

0vercrypt commented Dec 22, 2017

Sorry, you're surely totally right @alon-e.

So please forget about my last fix. The correct fix is the following (without the old fix):
network/Neighbor.java:

[...]
public abstract class Neighbor {
    [...]
     public void setFlagged(boolean flagged) {
         this.flagged = flagged;
     }
+    
+    private boolean stopped = false;
+    public void stop() {
+        this.stopped = true;
+    }
+    public boolean isStopped() {
+        return stopped;
+    }
    [...]

network/Node.java:

[...]
public class Node {
    [...]
    public boolean removeNeighbor(final URI uri, boolean isConfigured) {
        final Neighbor neighbor = newNeighbor(uri, isConfigured);
        if (uri.getScheme().equals("tcp")) {
            neighbors.stream().filter(n -> n instanceof TCPNeighbor)
                    .map(n -> ((TCPNeighbor) n))
                    .filter(n -> n.equals(neighbor))
                    .forEach(TCPNeighbor::clear);
        }
+        neighbors.stream().filter(n -> n.equals(neighbor)).forEach(Neighbor::stop);
        return neighbors.remove(neighbor);
    }
    [...]

network/replicator/ReplicatorSinkProcessor.java:

[...]
class ReplicatorSinkProcessor implements Runnable {
    [...]
    @Override
    public void run() {
        [...]
-                    while (!replicatorSinkPool.shutdown) {
+                    while (!replicatorSinkPool.shutdown && !neighbor.isStopped()) {

Lines with + have been added, lines with - removed. All other lines weren't touched.
Hope this helps if you didn't fix the issue yet. Now there won't be any more GC problems anymore :)

@bahamapascal
Copy link
Author

I have forked IRI and integrated the changes as @Leibi133 decribed:
dev...bahamapascal:dev

Have tested it now and it's working perfectly! A big thanks to Leibi33 for the Fix!
I have now sent him his Bounty :)

@bahamapascal
Copy link
Author

The fix is now integrated to the dev branch: 3049751

Thanks to @Leibi133 for the fix as well as Alon-e and Paul H for integrating it to the dev brach.

stplaydog added a commit to triasteam/StreamNet that referenced this issue May 5, 2019
[fix iotaledger#265] Define Ancestor forward parameters and data types
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants