NetworkDB incorrect number of entries in networkNodes #1836
mavenugo merged 1 commit into moby:master
Also, nDB.networks is not properly cleaned up, but I will take care of that in a separate PR.
  logrus.Debugf("%s: joined network %s", nDB.config.NodeName, nid)
- if _, err := nDB.bulkSync(networkNodes, true); err != nil {
+ if _, err := nDB.bulkSync(nDB.networkNodes[nid], true); err != nil {
Should nDB.bulkSync() be called under a mutex?
Also, I'm not sure nDB.networkNodes[nid] needs to be passed explicitly. I guess the bulkSync function can use networkNodes from the nDB object, right? WDYT?
Just checked, my bad: I thought that passing the element while only doing a read was safe, but it is not.
Definitely yes. I was actually thinking of creating some helper functions for getNodesNetwork and getNetworkNodes and deleting all the duplicate code that is around. I will probably fix this and create a new PR for that. Thanks for catching that.
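To illustrate the locking concern raised above, here is a minimal sketch of reading a per-network node list safely: take a copy of the slice while holding the read lock, then hand the copy to whatever runs outside the lock. The `db` type and the `nodesFor` helper are hypothetical stand-ins for illustration only; they are not the NetworkDB code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// db is a hypothetical stand-in for NetworkDB with only the fields needed here.
type db struct {
	sync.RWMutex
	networkNodes map[string][]string // network ID -> node names
}

// nodesFor returns a copy of the node list for nid, taken under the read lock,
// so the caller can use it after the lock is released.
func (d *db) nodesFor(nid string) []string {
	d.RLock()
	defer d.RUnlock()
	return append([]string(nil), d.networkNodes[nid]...)
}

func main() {
	d := &db{networkNodes: map[string][]string{"net1": {"node-a", "node-b"}}}

	snapshot := d.nodesFor("net1")

	// A later write to the shared list does not affect the snapshot.
	d.Lock()
	d.networkNodes["net1"][0] = "changed"
	d.Unlock()

	fmt.Println(snapshot) // prints [node-a node-b]
}
```

Copying costs an allocation per call, but it avoids holding the lock for the duration of a potentially slow sync.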
b8bea36 to 88698b2
@fcrisciani I tried quickly adding and removing containers on a node, and also quickly killing and restarting the daemon on a node, but I am not hitting this issue, so the exact sequence of events is not very clear to me. What steps are you using exactly? And when you see the issue, does the same node's IP occur multiple times but with different names? We make the node name unique every time a node joins the cluster. If this is what is happening, it should be a temporary issue, because a node is printed as a peer only if it exists in nDB.nodes. After an ungraceful daemon restart, memberlist will quickly mark the old node name as no longer alive and will trigger a leave. So we should take it out of
@sanimej just run the test that is in the PR against the master code; that will show you the issue.
@fcrisciani the test indeed fails on master as you explained, and it passes with your fix.
  maxRetry := 5
  dbs := createNetworkDBInstances(t, 2, "node")

  logrus.SetLevel(logrus.DebugLevel)
A rapid (within networkReapTime 30min) leave/join network
can corrupt the list of nodes per network with multiple copies
of the same nodes.
The fix makes sure that each node is present only once
Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
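As a rough illustration of the "each node is present only once" invariant described in the commit message, the sketch below uses an idempotent add and a delete that removes every occurrence. The `nodeIndex` type and its `addNode`/`deleteNode` methods are assumed names for this example; they are not taken from the PR and do not reproduce the actual NetworkDB implementation or its locking.

```go
package main

import "fmt"

// nodeIndex is a hypothetical stand-in for the per-network node list
// (network ID -> node names); it is not the real NetworkDB type.
type nodeIndex map[string][]string

// addNode appends nodeName to the list for nid only if it is not already
// present, so a repeated join cannot create a duplicate entry.
func (idx nodeIndex) addNode(nid, nodeName string) {
	for _, n := range idx[nid] {
		if n == nodeName {
			return
		}
	}
	idx[nid] = append(idx[nid], nodeName)
}

// deleteNode removes every occurrence of nodeName from the list for nid.
func (idx nodeIndex) deleteNode(nid, nodeName string) {
	kept := make([]string, 0, len(idx[nid]))
	for _, n := range idx[nid] {
		if n != nodeName {
			kept = append(kept, n)
		}
	}
	idx[nid] = kept
}

func main() {
	idx := nodeIndex{}
	idx.addNode("net1", "node-a")
	idx.addNode("net1", "node-b")
	idx.addNode("net1", "node-a") // rapid re-join: ignored, no duplicate

	fmt.Println(len(idx["net1"]), idx["net1"]) // prints 2 [node-a node-b]

	idx.deleteNode("net1", "node-a")
	fmt.Println(idx["net1"]) // prints [node-b]
}
```

A linear scan is adequate for small node lists; a set keyed by node name would be an alternative if the lists grow large.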
88698b2 to 297c3d4
LGTM