
Consolidate node management logic#2035

Merged
mavenugo merged 1 commit into moby:master from fcrisciani:netdb-realign-test-infra
Dec 13, 2017

Conversation

@fcrisciani

Created method to handle the node state change with cleanup operation
associated.
Realign testing client with the new diagnostic interface.

Signed-off-by: Flavio Crisciani flavio.crisciani@docker.com

@fcrisciani fcrisciani force-pushed the netdb-realign-test-infra branch 2 times, most recently from 1962808 to d5d0e38 Compare December 12, 2017 20:00
@codecov-io

codecov-io commented Dec 12, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@7d95d32).
The diff coverage is 80.48%.


@@            Coverage Diff            @@
##             master    #2035   +/-   ##
=========================================
  Coverage          ?   40.05%           
=========================================
  Files             ?      138           
  Lines             ?    22132           
  Branches          ?        0           
=========================================
  Hits              ?     8864           
  Misses            ?    11967           
  Partials          ?     1301
Impacted Files Coverage Δ
networkdb/networkdbdiagnose.go 0% <0%> (ø)
networkdb/event_delegate.go 68.57% <50%> (ø)
networkdb/delegate.go 74.26% <66.66%> (ø)
networkdb/nodemgmt.go 92.59% <92.59%> (ø)

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d95d32...b499aa9. Read the comment docs.

Comment thread networkdb/delegate.go

nDB.RLock()
active, left, _, n := nDB.getNode(nEvent, false)
defer nDB.RUnlock()
Contributor

Do you really want to hold the nDB lock for the entire function call?

Author

Yep. The function is fairly short, and holding the lock avoids dealing with nasty races where other node events arrive in the middle and change the state. If a fresh event comes in, it has to be handled atomically in one shot. In the previous logic, every time you re-took the lock you had to start all the checks from scratch, because another, possibly newer, event could have changed the node state. In the end, the increase in complexity and exposure to race conditions in this part of the code is not worth the duration of the lock. These are also node events, so they should not be nearly as frequent as table events; even if synced every 30s by push/pull, the current logic will return at line 35 anyway.
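The pattern being discussed can be sketched as follows. This is a simplified, hypothetical illustration, not the actual libnetwork code: the `networkDB` type, its `nodes` map, and `handleNodeEvent` are invented names. The point is only that the read lock is acquired once and held (via `defer`) for the whole handler, so the state check and the handling happen atomically with respect to other node events.

```go
package main

import (
	"fmt"
	"sync"
)

// networkDB is a toy stand-in for the real NetworkDB type.
type networkDB struct {
	sync.RWMutex
	nodes map[string]string // node name -> state ("active", "left", ...)
}

// handleNodeEvent holds the read lock for the entire call, so no
// concurrent writer can change the node state between the lookup and
// whatever handling follows it.
func (nDB *networkDB) handleNodeEvent(name string) string {
	nDB.RLock()
	defer nDB.RUnlock() // held until return, as in the discussion above
	state, ok := nDB.nodes[name]
	if !ok {
		return "unknown"
	}
	return state
}

func main() {
	nDB := &networkDB{nodes: map[string]string{"node-1": "active"}}
	fmt.Println(nDB.handleNodeEvent("node-1")) // prints "active"
}
```

The trade-off named in the thread is visible here: the lock is held longer, but the alternative of releasing and re-acquiring it would force every check to be redone after each re-acquisition.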

Comment thread networkdb/delegate.go Outdated
// the node is already marked as active nothing to do
moved, err := nDB.changeNodeState(n.Name, nodeActiveState)
if err != nil {
logrus.WithError(err).Error("unable to find the node to move, this is a bug")
Contributor

this is a bug :-)

Author

done

Comment thread networkdb/nodemgmt.go
return true, nil
}

func (nDB *NetworkDB) purgeReincarnation(mn *memberlist.Node) bool {
Contributor

General comment on this function. In the previous incarnation of this function, the nodes were purged if the address matched. But now I see two changes:

  1. Nodes are also purged from the Active state (which wasn't the case before).
  2. Nodes are purged only if the Port matches and the Name doesn't match.

These are obviously changes in functionality rather than just a reorganization of the functions. Can you please clarify these points?

Author

Replying to the points:

  1. This function is now called only during the notifyJoin from memberlist. The check for Active covers the case of a quick node reboot, where a new instance from the same host (same IP and same port, but a new auto-generated name) joins the cluster before the previous incarnation is marked as dead by memberlist due to the ping failure. In this specific case the function identifies that the node is actually a new incarnation of the same node, and so cleans up the previous one before the memberlist notification arrives. The main objective is to avoid sending a node with the same IP into the failed list; doing so would make the retry logic poll that node periodically, which is the wrong decision here because that identity is gone forever, replaced by a new one.

  2. The condition for a new incarnation is same IP, same port, but a different name. If a node joins with the same name and same IP it is not a new incarnation; it is simply a node that got disconnected and came back, and it won't arrive here anyway, because it is intercepted by the logic at https://github.com/docker/libnetwork/pull/2035/files#diff-e0a0ef8c8c9573b1dabd65379df2af3fL51 and following.

Overall I thought it would be clearer to have a single function that takes care of all these conditions and addresses them on node join, instead of splitting the logic between join/leave.
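The reincarnation condition described in point 2 can be sketched as a predicate. This is an illustrative reduction, assuming hypothetical names (`nodeInfo`, `isReincarnation`); the real `purgeReincarnation` operates on `memberlist.Node` values and NetworkDB's internal node maps.

```go
package main

import "fmt"

// nodeInfo is a toy stand-in for the node identity fields compared in
// the discussion: address, port, and auto-generated name.
type nodeInfo struct {
	Name string
	Addr string
	Port int
}

// isReincarnation captures the condition from the thread: a joining
// node is a new incarnation of a known node when it has the same IP
// and port but a different (newly auto-generated) name.
func isReincarnation(known, joining nodeInfo) bool {
	return known.Addr == joining.Addr &&
		known.Port == joining.Port &&
		known.Name != joining.Name
}

func main() {
	old := nodeInfo{Name: "node-abc", Addr: "10.0.0.5", Port: 7946}
	reborn := nodeInfo{Name: "node-xyz", Addr: "10.0.0.5", Port: 7946}
	fmt.Println(isReincarnation(old, reborn)) // prints "true"
}
```

A node rejoining with the same name and IP would fail the `Name !=` check, matching the thread's point that such a node is handled by the earlier join logic rather than by this purge.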

Comment thread networkdb/delegate.go
case NodeEventTypeLeave:
if left {
// the node is already marked as left nothing to do.
moved, err := nDB.changeNodeState(n.Name, nodeLeftState)
Contributor

changeNodeState not only resets the nodeReapInterval, it also cleans up all the state created for this particular node. In the current behavior that cleanup was not done here; it was handled in NotifyLeave.
I am not sure how these two paths differ when it comes to handling NodeEventTypeLeave vs NotifyLeave, but this PR clearly brings in new functionality that needs a bit more explanation.

Author

This is an explicit message from a node notifying its intention to leave the cluster. Under normal conditions that node should already have cleaned up all of its state, so the cleanup here is only an extra precaution. It is also true that this message is supposed to arrive before the memberlist NotifyLeave, which previously took care of the cleanup, but I think this behavior is in line with the node's intention of leaving the cluster.
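The shape of the state change being debated can be sketched like this. This is a hypothetical simplification, not the real `changeNodeState`: the per-state maps and string states are invented, and the real function also manages reap timers and per-node table state. The sketch only shows the idea of moving a node between state sets idempotently, returning whether a move happened.

```go
package main

import "fmt"

// netDB is a toy stand-in holding one set per node state.
type netDB struct {
	active map[string]bool
	left   map[string]bool
}

// changeNodeState moves a node to newState. It returns (false, nil)
// when the node is already in that state (nothing to do), and an
// error when the node is unknown, mirroring the "unable to find the
// node to move" log in the diff above.
func (db *netDB) changeNodeState(name, newState string) (bool, error) {
	inActive := db.active[name]
	inLeft := db.left[name]
	if !inActive && !inLeft {
		return false, fmt.Errorf("node %s not found", name)
	}
	switch newState {
	case "left":
		if inLeft {
			return false, nil // already marked as left, nothing to do
		}
		delete(db.active, name)
		db.left[name] = true
		// in the real code, node state cleanup would happen here as an
		// extra precaution, per the discussion above
	case "active":
		if inActive {
			return false, nil // already active, nothing to do
		}
		delete(db.left, name)
		db.active[name] = true
	}
	return true, nil
}

func main() {
	db := &netDB{active: map[string]bool{"node-1": true}, left: map[string]bool{}}
	moved, err := db.changeNodeState("node-1", "left")
	fmt.Println(moved, err) // prints "true <nil>"
}
```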

Created method to handle the node state change with cleanup operation
associated.
Realign testing client with the new diagnostic interface

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
@fcrisciani fcrisciani force-pushed the netdb-realign-test-infra branch from d5d0e38 to b499aa9 Compare December 13, 2017 17:40
Contributor

@mavenugo mavenugo left a comment


Thanks for the clarifications.

LGTM

@mavenugo mavenugo merged commit 26531e5 into moby:master Dec 13, 2017