
Nodes are not reconnecting to each other after OnTerminated due to long time without connection #462

Closed
vncoelho opened this issue Nov 11, 2018 · 9 comments
Labels
discussion Initial issue state - proposed but not yet accepted

Comments

@vncoelho (Member) commented Nov 11, 2018

@erikzhang, @jsolman

We are not experts in this part of the code, but we now have a dedicated team examining it from different angles.

After removing a node's connection, as expected, it closed all its connections after a couple of minutes:

OnTerminated bye bye  
 endPoint.Address=172.18.0.2
OnTerminated bye bye  
 endPoint.Address=172.18.0.6
OnTerminated bye bye  
 endPoint.Address=172.18.0.3
OnTerminated bye bye  
 endPoint.Address=172.18.0.9
OnTerminated bye bye  
 endPoint.Address=172.18.0.8
OnTerminated bye bye  
 endPoint.Address=172.18.0.7
OnTerminated bye bye  
 endPoint.Address=172.18.0.4

However, the problem is that the node is not able to fully reconnect to these peers afterwards. This is one of the main causes of chain-sync problems, both for consensus nodes and for normal seed/RPC nodes.

After this batch of OnTerminated calls, the actors seem to have trouble re-establishing the connections, which often report OnTerminated again after some time.
Yet ConnectedPeers.Count keeps reporting values close to the limit (10 by default).
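The behavior we would expect is roughly the following: the peer-manager actor watches each connection actor and, when one terminates, schedules a fresh connection attempt instead of only logging. This is a minimal hypothetical sketch in Akka.NET, not the actual neo code; `PeerManager`, `ReconnectTo`, and the 30-second delay are all assumptions:

```csharp
// Hypothetical sketch (NOT the actual neo code): an Akka.NET manager actor
// that reacts to Terminated by scheduling a delayed reconnect, so the peer
// count can recover after a batch of disconnections. Connection actors are
// assumed to have been registered via Context.Watch when they connected.
using System;
using Akka.Actor;

public class PeerManager : ReceiveActor
{
    public PeerManager()
    {
        Receive<Terminated>(t =>
        {
            Console.WriteLine($"OnTerminated bye bye {t.ActorRef.Path}");
            // Schedule a reconnect attempt instead of dropping the peer forever.
            Context.System.Scheduler.ScheduleTellOnce(
                TimeSpan.FromSeconds(30),                 // assumed backoff
                Self,
                new ReconnectTo(t.ActorRef.Path.Name),
                Self);
        });
        Receive<ReconnectTo>(r =>
        {
            // Dial the remote endpoint again here (omitted in this sketch).
        });
    }

    public sealed class ReconnectTo
    {
        public ReconnectTo(string endpoint) { Endpoint = endpoint; }
        public string Endpoint { get; }
    }
}
```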

@erikzhang erikzhang added the discussion Initial issue state - proposed but not yet accepted label Nov 11, 2018
@vncoelho (Member, Author) commented Nov 11, 2018

Even in normal operation, connections are dropping all the time and calling OnTerminated for some peers, even when those peers are stable; maybe it is due to a lack of communication between them.

We are going to double check that.

@erikzhang (Member) commented

Please find out where the problem lies.

@vncoelho (Member, Author) commented

I am trying, master. aehuaheuaea

We noticed that this timeout is more or less normal.
We need some more experiments. Let's keep in mind that there is a possible problem here; we are not 100% sure.

@vncoelho (Member, Author) commented

An issue was also opened at Akka.NET.

@vncoelho (Member, Author) commented Nov 15, 2018

Apparently, these are the offending lines, @erikzhang:

neo/neo/Network/UPnP.cs

Lines 33 to 35 in 60a02ee

s.SendTo(data, ipe);
s.SendTo(data, ipe);
s.SendTo(data, ipe);

By commenting out these lines we can stop the problem reported here and in #463. Even with these lines commented out, everything seems to work normally. O.o
At first I thought it was a HandShake message.

When not commented we have:

[ERROR][11/15/2018 14:36:29][Thread 0003][akka://NeoSystem/user/$b] Network is unreachable
Cause: System.Net.Sockets.SocketException (0x80004005): Network is unreachable
   at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
   at System.Net.Sockets.Socket.SendTo(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, EndPoint remoteEP)

It cannot communicate and reports the network as unreachable.
But the network is up and we can still ping that machine.

Maybe it is related to the way the socket was terminated, because everything works perfectly until there is a problem with the network adapter.
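One way to keep the UPnP discovery broadcast from throwing into the actor system when the adapter drops would be to catch the SocketException around the SendTo calls. This is a hedged sketch of that idea, not the actual fix and not neo's real UPnP code; the SSDP request string and endpoint are standard UPnP discovery values, everything else is an assumption:

```csharp
// Sketch only: wrap the UPnP discovery broadcast so a transient
// "Network is unreachable" SocketException does not propagate upward.
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

static class UpnpDiscoverySketch
{
    public static void TryDiscover()
    {
        string req = "M-SEARCH * HTTP/1.1\r\n" +
                     "HOST: 239.255.255.250:1900\r\n" +
                     "ST: upnp:rootdevice\r\n" +
                     "MAN: \"ssdp:discover\"\r\n" +
                     "MX: 3\r\n\r\n";
        byte[] data = Encoding.ASCII.GetBytes(req);
        var ipe = new IPEndPoint(IPAddress.Broadcast, 1900);
        using var s = new Socket(AddressFamily.InterNetwork,
                                 SocketType.Dgram, ProtocolType.Udp);
        s.SetSocketOption(SocketOptionLevel.Socket,
                          SocketOptionName.Broadcast, true);
        for (int i = 0; i < 3; i++)      // the three SendTo calls above
        {
            try
            {
                s.SendTo(data, ipe);
            }
            catch (SocketException)      // e.g. network is unreachable
            {
                return;                  // give up quietly; UPnP is optional
            }
        }
    }
}
```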

@vncoelho (Member, Author) commented Nov 15, 2018

By the way, what is this socket, Erik? aheuaheuahea

Is it TCP or WS?
I still haven't seen where WS is used.

@vncoelho (Member, Author) commented Nov 21, 2018

@erikzhang and @jsolman, the strange thing is that when nodes are stuck like this while trying to sync, restarting is the best option.

I mean, by killing the application and starting it again, the consensus nodes sync very fast.
In this sense, I think the problem might be related to priority in receiving messages (some are expiring and getting lost).

The two options I thought of are:

  • Create a method that kills NeoSystem and restarts everything inside neo
  • Stop all services and give total priority to just receiving blocks

What do you think, @erikzhang?

@vncoelho (Member, Author) commented Nov 21, 2018

I like the idea of destroying NeoSystem and initializing it again.
However, it does not track down and fix the root cause of the problem.
Still, I think it is a simple and good solution for now.

Basically, we would need to call a method like ReinitializeNeoSystem() whenever we detect that a node is falling behind.
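The restart idea could be sketched as a simple sync watchdog: if the local block height stops advancing for too long, tear down and rebuild the system. ReinitializeNeoSystem, SyncWatchdog, and the 5-minute threshold are all hypothetical names and values, not the actual neo API:

```csharp
// Hypothetical sketch of the restart-on-stall idea; none of these names
// exist in neo. A caller would feed in the observed block height
// periodically (e.g. from a timer).
using System;

public class SyncWatchdog
{
    private uint lastHeight;
    private DateTime lastAdvance = DateTime.UtcNow;

    public void OnHeightObserved(uint height)
    {
        if (height > lastHeight)
        {
            // The chain is advancing: record progress.
            lastHeight = height;
            lastAdvance = DateTime.UtcNow;
        }
        else if (DateTime.UtcNow - lastAdvance > TimeSpan.FromMinutes(5))
        {
            // No progress for too long: destroy and recreate NeoSystem.
            ReinitializeNeoSystem();
            lastAdvance = DateTime.UtcNow;
        }
    }

    private void ReinitializeNeoSystem()
    {
        // Dispose the current NeoSystem and start a fresh one (omitted).
    }
}
```

This keeps the workaround contained in one place, even though, as noted above, it does not address the underlying reconnection bug.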

@vncoelho (Member, Author) commented Apr 9, 2019

I believe this has been solved, but let's keep a reference to it in #620.
