Bug 1884420: bootstrap: API shows up, start it again #102

Merged

merged 1 commit into openshift:master from start-again on Oct 2, 2020

Conversation

@celebdor (Contributor) commented Oct 1, 2020

In order to keep the VIP on the bootstrap node until the masters' API shows up, we increased the priority of the bootstrap keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap node is asked to stay around after clustering (at which point its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked as if it would never come back up.

This PR addresses that by continuing to check for the API server on the bootstrap node and reloading keepalived if it shows up again. If the API is gone for good, the behavior is the same as before; but if it only went down for a while because of API pod restarts or resource pressure, we reload keepalived and reclaim the API VIP.
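
Schematically, the change described above amounts to a loop like the one below. This is only a sketch of the idea, not the repository's actual monitor code; watchBootstrapAPI, apiIsReachable, reloadKeepalived, and the threshold value are placeholders.

// Sketch only: hypothetical helpers standing in for the real monitor code.
package monitor

import (
	"time"

	"github.com/sirupsen/logrus"
)

const bootstrapApiFailuresThreshold = 4 // placeholder value

func watchBootstrapAPI(stopKeepalived chan<- bool,
	apiIsReachable func() bool, reloadKeepalived func()) {
	consecutiveErr := 0
	for {
		if !apiIsReachable() {
			consecutiveErr++
			if consecutiveErr > bootstrapApiFailuresThreshold {
				// API has been down past the threshold: ask for keepalived to be
				// stopped so the VIP can move to the masters. The message keeps
				// being re-sent every second while the API stays down.
				logrus.Info("API down past threshold, stopping keepalived")
				stopKeepalived <- true
			}
		} else {
			if consecutiveErr > bootstrapApiFailuresThreshold {
				// The counter is still above the threshold, so keepalived was
				// stopped earlier; the API came back (e.g. after a pod restart),
				// so reload keepalived and reclaim the API VIP.
				logrus.Info("API is back, reloading keepalived")
				reloadKeepalived()
			}
			consecutiveErr = 0
		}
		time.Sleep(time.Second)
	}
}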

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2020
        consecutiveErr = 0
    }
    if consecutiveErr > bootstrapApiFailuresThreshold {
        log.WithFields(logrus.Fields{
            "consecutiveErr":                consecutiveErr,
            "bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
        }).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
        bootstrapStopKeepalived <- true
        bootstrapStopKeepalived <- stopped
        return
Contributor:

We shouldn't exit this function after the 'stop' message has been sent to the channel.

        consecutiveErr = 0
    }
    if consecutiveErr > bootstrapApiFailuresThreshold {
        log.WithFields(logrus.Fields{
            "consecutiveErr":                consecutiveErr,
            "bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
        }).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
        bootstrapStopKeepalived <- true
        return
        bootstrapStopKeepalived <- stopped
Contributor:

The 'stop' message will be sent every second after the API goes down, right?

Contributor:

Well, it doesn't matter, because on the MCO side, if there is no running Keepalived process it just logs a message.

Contributor (Author):

Right.
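
To illustrate why the repeated sends are harmless, a consumer of the channel could look roughly like this (a sketch with invented names, reusing the logrus import from the sketch above; per the comment in this thread, the real handling on the MCO side just logs when there is no Keepalived process to stop):

// Hypothetical consumer: repeated stop messages are fine because a missing
// keepalived process only produces a log line, not an error path.
func consumeStopMessages(stop <-chan bool, sendToControlSocket func(cmd string) error) {
	for range stop {
		if err := sendToControlSocket("stop"); err != nil {
			logrus.Infof("no running keepalived to stop: %v", err)
		}
	}
}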

@cybertron (Member) left a comment:

One minor thing about the log message, but otherwise this is working for me and should fix the problem.

    if err == nil {
        log.Info("Stop message successfully sent to Keepalived container control socket")
        log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))
Member:

This isn't formatting correctly. I think you need log.Infof to use the format string.
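
For reference, logrus's Info treats its arguments like fmt.Print, so the %s verb is not substituted; Infof is the printf-style variant the comment refers to:

// Info: the "%s" is passed through literally and the value is concatenated after it.
log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))

// Infof: the format string is applied, producing the intended message.
log.Infof("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))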

@@ -166,15 +173,17 @@ func handleBootstrapStopKeepalived(kubeconfigPath string, bootstrapStopKeepalive
            "consecutiveErr": consecutiveErr,
        }).Info("handleBootstrapStopKeepalived: detect failure on API")
    } else {
        if consecutiveErr > bootstrapApiFailuresThreshold { // Means it was stopped
Member:

Ah, I believe this is why I added the separate start command, so I wouldn't have to check this. With a separate command I can just resend it safely and don't have to worry about a message getting lost because keepalived happens to restart or something.

This should be fine for the bootstrap, though. It isn't a long-lived thing anyway.

@celebdor (Contributor, Author) commented Oct 1, 2020

/retitle Bug 1884420: bootstrap: API shows up, start it again

@celebdor celebdor changed the title [WIP] bootstrap: API shows up, start it again Bug 1884420: bootstrap: API shows up, start it again Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 1, 2020
@openshift-ci-robot (Contributor):

@celebdor: This pull request references Bugzilla bug 1884420, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2020
@openshift-ci-robot (Contributor):

@celebdor: This pull request references Bugzilla bug 1884420, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yboaron (Contributor) commented Oct 1, 2020

One concern: because we don't synchronize the config-change and stop-bootstrap functions, the config-change function can trigger a Keepalived 'reload' as a result of a config change even though Keepalived should already be stopped. But I assume that after a second the handleBootstrapStopKeepalived function will resend the stop, so we should be good.

Since the masters don't know the bootstrap IP address, we must verify that Keepalived on the bootstrap isn't running; otherwise two nodes will hold the VIP.

@yboaron (Contributor) commented Oct 1, 2020

/lgtm

@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: celebdor, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020
@celebdor (Contributor, Author) commented Oct 1, 2020

> One concern: because we don't synchronize the config-change and stop-bootstrap functions, the config-change function can trigger a Keepalived 'reload' as a result of a config change even though Keepalived should already be stopped. But I assume that after a second the handleBootstrapStopKeepalived function will resend the stop, so we should be good.
>
> Since the masters don't know the bootstrap IP address, we must verify that Keepalived on the bootstrap isn't running; otherwise two nodes will hold the VIP.

Which two nodes?

@openshift-merge-robot openshift-merge-robot merged commit 0158c53 into openshift:master Oct 2, 2020
@openshift-ci-robot (Contributor):

@celebdor: All pull requests linked via external trackers have merged:

Bugzilla bug 1884420 has been moved to the MODIFIED state.

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@celebdor celebdor deleted the start-again branch October 2, 2020 21:16