Consolidate node management logic #2035
Conversation
1962808 to d5d0e38
Codecov Report
    @@            Coverage Diff            @@
    ##             master    #2035   +/-   ##
    =========================================
      Coverage          ?   40.05%
    =========================================
      Files             ?      138
      Lines             ?    22132
      Branches          ?        0
    =========================================
      Hits              ?     8864
      Misses            ?    11967
      Partials          ?     1301
Continue to review full report at Codecov.
    nDB.RLock()
    active, left, _, n := nDB.getNode(nEvent, false)
    defer nDB.RUnlock()
Do you really want to hold the nDB lock for the entire function call?
Yep, the function is fairly short, and holding the lock avoids dealing with nasty races where other node events come in the middle and change the state. If a fresh event arrives, it has to be handled atomically in one shot. In the previous logic, every time you took the lock back you had to restart all the checks from scratch, because another, possibly newer event might have changed the node state in the meantime. In the end, the added complexity and exposure to race conditions in this part of the code is not worth the cost of holding the lock. These are also node events, so they should not be nearly as frequent as table events; even when synced every 30s by push/pull, the current logic will anyway return early at line 35.
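For readers following along, here is a minimal, self-contained sketch of the locking pattern being defended above. The struct and helper names are invented for illustration and are not the PR's actual code; the point is only that the check and the resulting decision happen under a single lock acquisition, so no other node event can slip in between.

```go
package main

import (
	"fmt"
	"sync"
)

type nodeState int

const (
	nodeActive nodeState = iota
	nodeLeft
)

// networkDB is a stand-in for the real NetworkDB, reduced to what the
// locking argument needs.
type networkDB struct {
	sync.RWMutex
	nodes map[string]nodeState
}

// handleNodeEvent takes its decision while holding the read lock end to end,
// so the check and the decision are based on one consistent snapshot.
func (db *networkDB) handleNodeEvent(name string, join bool) bool {
	db.RLock()
	defer db.RUnlock()

	state, ok := db.nodes[name]
	if !ok {
		return false // unknown node, nothing to do
	}
	if join && state == nodeActive {
		return false // already active, nothing to do
	}
	if !join && state == nodeLeft {
		return false // already left, nothing to do
	}
	// In the real code the transition itself is delegated to changeNodeState;
	// this sketch stops at the decision.
	return true
}

func main() {
	db := &networkDB{nodes: map[string]nodeState{"node-1": nodeActive}}
	fmt.Println(db.handleNodeEvent("node-1", true)) // false: already active
}
```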
    // the node is already marked as active nothing to do
    moved, err := nDB.changeNodeState(n.Name, nodeActiveState)
    if err != nil {
        logrus.WithError(err).Error("unable to find the node to move, this is a bug")
        return true, nil
    }
    func (nDB *NetworkDB) purgeReincarnation(mn *memberlist.Node) bool {
General comment on this function. In the previous incarnation of this function, nodes were purged if the address matched. Now I see 2 changes:
- Nodes are also purged from the Active state (which wasn't the case before).
- Nodes are purged only if the Port matches and the Name doesn't match.
These are clearly functional changes rather than just a reorganization of the functions.
Can you please give more clarification on these?
Replying to the points:
- This function is now called only during the notifyJoin from memberlist. The check for the active state covers the case of a quick node reboot, where a new instance from the same host (same IP and same port, but a new auto-generated name) joins the cluster before the previous incarnation is marked as dead by memberlist due to the ping failure. In this specific case the function identifies that the node is actually a new incarnation of the same node, so it takes care of cleaning up the previous one before the memberlist notification arrives. The main objective here is to avoid sending a node with the same IP into the failed list; doing so would make the retry logic poll that node periodically, which is the wrong decision in this case because that identity is gone forever and has been replaced by a new one.
- The condition for a new incarnation is same IP and same port but a different name (see the sketch after this reply). If a node joins with the same name and the same IP it is not a new incarnation, it is simply a node that got disconnected and came back; it won't reach this code anyway, but will be intercepted by the logic at https://github.com/docker/libnetwork/pull/2035/files#diff-e0a0ef8c8c9573b1dabd65379df2af3fL51 and following.

Overall I thought it would be clearer to have a single function that handles all these conditions on the node join, instead of splitting the logic between join and leave.
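As a rough illustration of the reincarnation condition described in the second point above (same IP, same port, different name), here is a small, self-contained sketch. The types and function names are made up for illustration and only mirror the idea, not the PR's actual purgeReincarnation implementation.

```go
package main

import (
	"fmt"
	"net"
)

// member is a minimal stand-in for a memberlist node.
type member struct {
	Name string
	Addr net.IP
	Port uint16
}

// findReincarnated returns the names of known nodes that the joining member
// supersedes: same address and same port, but a different (auto-generated)
// name. Those stale identities should be purged instead of being moved to
// the failed list and retried forever.
func findReincarnated(known []member, joining member) []string {
	var stale []string
	for _, n := range known {
		if n.Addr.Equal(joining.Addr) && n.Port == joining.Port && n.Name != joining.Name {
			stale = append(stale, n.Name)
		}
	}
	return stale
}

func main() {
	known := []member{{Name: "node-old", Addr: net.ParseIP("10.0.0.5"), Port: 7946}}
	joining := member{Name: "node-new", Addr: net.ParseIP("10.0.0.5"), Port: 7946}
	fmt.Println(findReincarnated(known, joining)) // [node-old]
}
```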
    case NodeEventTypeLeave:
        if left {
            // the node is already marked as left nothing to do.
        moved, err := nDB.changeNodeState(n.Name, nodeLeftState)
changeNodeState not only resets the nodeReapInterval, it also cleans up all the state created for this particular node. In the current behavior that cleanup was not done here; it was handled in NotifyLeave.
I am not sure how these two paths differ when it comes to handling NodeEventTypeLeave vs NotifyLeave, but this PR clearly introduces new functionality that needs a bit more explanation.
This is an explicit message coming from a node notifying its intention to leave the cluster. Under normal conditions the node should have already cleaned up all the state it leaves behind, so the cleanup here is only an extra precaution. It is also true that this message is supposed to arrive before the memberlist notifyLeave, which previously took care of the cleanup, but I think this behavior is in line with the node's intention of leaving the cluster.
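To make the "state change plus cleanup" idea concrete, here is a hedged sketch with invented types; it is not the actual changeNodeState from this PR, but it shows why the cleanup on an explicit leave is harmless even when the node has already cleaned up after itself.

```go
package main

import "fmt"

type nodeState int

const (
	nodeActiveState nodeState = iota
	nodeLeftState
)

// db is a toy stand-in holding a per-node state and some per-node entries.
type db struct {
	states  map[string]nodeState
	entries map[string][]string // per-node leftovers, purely illustrative
}

// changeNodeState moves a node to the requested state and, when the node is
// leaving, drops whatever state it still owns. Returns true if a move happened.
func (d *db) changeNodeState(name string, newState nodeState) bool {
	cur, ok := d.states[name]
	if !ok || cur == newState {
		return false // unknown node or already in the requested state
	}
	d.states[name] = newState
	if newState == nodeLeftState {
		// A well-behaved node has already cleaned up before sending the
		// leave message; deleting here is only a precaution and is a no-op
		// in that case.
		delete(d.entries, name)
	}
	return true
}

func main() {
	d := &db{
		states:  map[string]nodeState{"node-1": nodeActiveState},
		entries: map[string][]string{"node-1": {"some entry"}},
	}
	fmt.Println(d.changeNodeState("node-1", nodeLeftState)) // true
	fmt.Println(len(d.entries))                             // 0: leftovers purged
}
```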
Created method to handle the node state change with cleanup operation associated.
Realign testing client with the new diagnostic interface.
Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
d5d0e38 to b499aa9
mavenugo left a comment:
Thanks for the clarifications.
LGTM
Created method to handle the node state change with cleanup operation associated.
Realign testing client with the new diagnostic interface.
Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>