Bug 1884420: bootstrap: API shows up, start it again #102

Merged

merged 1 commit into openshift:master from start-again on Oct 2, 2020

Conversation

@celebdor (Contributor) commented Oct 1, 2020

In order to keep the VIP on the bootstrap node until the masters' API shows up, we increased the priority of the bootstrap keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap node is asked to stay around after clustering (at which point its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked as if it would never come back up.

This PR addresses that by continuing to check for the API server on the bootstrap node and reloading keepalived if it shows up again. If the API is gone for good, the behavior is the same as before; but if it only went down for a while because of API pod restarts or resource pressure, we reload keepalived and reclaim the API VIP.
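
Schematically, the change described above amounts to a loop like the one below. This is only a sketch of the idea, not the repository's actual monitor code; watchBootstrapAPI, apiIsReachable, reloadKeepalived, and the threshold value are placeholders.

// Sketch only: hypothetical helpers standing in for the real monitor code.
package monitor

import (
	"time"

	"github.com/sirupsen/logrus"
)

const bootstrapApiFailuresThreshold = 4 // placeholder value

func watchBootstrapAPI(stopKeepalived chan<- bool,
	apiIsReachable func() bool, reloadKeepalived func()) {
	consecutiveErr := 0
	for {
		if !apiIsReachable() {
			consecutiveErr++
			if consecutiveErr > bootstrapApiFailuresThreshold {
				// API has been down past the threshold: ask for keepalived to be
				// stopped so the VIP can move to the masters. The message keeps
				// being re-sent every second while the API stays down.
				logrus.Info("API down past threshold, stopping keepalived")
				stopKeepalived <- true
			}
		} else {
			if consecutiveErr > bootstrapApiFailuresThreshold {
				// The counter is still above the threshold, so keepalived was
				// stopped earlier; the API came back (e.g. after a pod restart),
				// so reload keepalived and reclaim the API VIP.
				logrus.Info("API is back, reloading keepalived")
				reloadKeepalived()
			}
			consecutiveErr = 0
		}
		time.Sleep(time.Second)
	}
}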

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2020
        consecutiveErr = 0
    }
    if consecutiveErr > bootstrapApiFailuresThreshold {
        log.WithFields(logrus.Fields{
            "consecutiveErr":                consecutiveErr,
            "bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
        }).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
        bootstrapStopKeepalived <- true
        bootstrapStopKeepalived <- stopped
        return
Contributor:

We shouldn't exit this function after the 'stop' message has been sent to the channel.

        consecutiveErr = 0
    }
    if consecutiveErr > bootstrapApiFailuresThreshold {
        log.WithFields(logrus.Fields{
            "consecutiveErr":                consecutiveErr,
            "bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
        }).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
        bootstrapStopKeepalived <- true
        return
        bootstrapStopKeepalived <- stopped
Contributor:

The 'stop' message will be sent every second after the API goes down, right?

Contributor:

Well, it doesn't matter, because on the MCO side, if there is no running Keepalived process it just logs a message.

Contributor (Author):

Right.
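
To illustrate why the repeated sends are harmless, a consumer of the channel could look roughly like this (a sketch with invented names, reusing the logrus import from the sketch above; per the comment in this thread, the real handling on the MCO side just logs when there is no Keepalived process to stop):

// Hypothetical consumer: repeated stop messages are fine because a missing
// keepalived process only produces a log line, not an error path.
func consumeStopMessages(stop <-chan bool, sendToControlSocket func(cmd string) error) {
	for range stop {
		if err := sendToControlSocket("stop"); err != nil {
			logrus.Infof("no running keepalived to stop: %v", err)
		}
	}
}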

@cybertron (Member) left a comment:

One minor thing about the log message, but otherwise this is working for me and should fix the problem.

    if err == nil {
        log.Info("Stop message successfully sent to Keepalived container control socket")
        log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))
Member:

This isn't formatting correctly. I think you need log.Infof to use the format string.
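
For reference, logrus's Info treats its arguments like fmt.Print, so the %s verb is not substituted; Infof is the printf-style variant the comment refers to:

// Info: the "%s" is passed through literally and the value is concatenated after it.
log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))

// Infof: the format string is applied, producing the intended message.
log.Infof("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))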

@@ -166,15 +173,17 @@ func handleBootstrapStopKeepalived(kubeconfigPath string, bootstrapStopKeepalive
            "consecutiveErr": consecutiveErr,
        }).Info("handleBootstrapStopKeepalived: detect failure on API")
    } else {
        if consecutiveErr > bootstrapApiFailuresThreshold { // Means it was stopped
Member:

Ah, I believe this is why I added the separate start command, so I wouldn't have to check this. With a separate command I can just resend it safely and don't have to worry about a message getting lost because keepalived happens to restart or something.

This should be fine for the bootstrap, though. It isn't a long-lived thing anyway.

@celebdor (Contributor, Author) commented Oct 1, 2020

/retitle Bug 1884420: bootstrap: API shows up, start it again

@celebdor celebdor changed the title [WIP] bootstrap: API shows up, start it again Bug 1884420: bootstrap: API shows up, start it again Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 1, 2020
@openshift-ci-robot (Contributor):

@celebdor: This pull request references Bugzilla bug 1884420, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2020
@openshift-ci-robot (Contributor):

@celebdor: This pull request references Bugzilla bug 1884420, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yboaron (Contributor) commented Oct 1, 2020

One concern: because we don't synchronize the config-change and stop-bootstrap functions, the config-change function can trigger a Keepalived 'reload' as a result of a config change even though Keepalived should already be stopped. But I assume that after a second the handleBootstrapStopKeepalived function will resend the stop, so we should be good.

Since the masters don't know the bootstrap IP address, we must verify that Keepalived on the bootstrap isn't running; otherwise two nodes will hold the VIP.

@yboaron (Contributor) commented Oct 1, 2020

/lgtm

@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: celebdor, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020
@celebdor (Contributor, Author) commented Oct 1, 2020

> One concern: because we don't synchronize the config-change and stop-bootstrap functions, the config-change function can trigger a Keepalived 'reload' as a result of a config change even though Keepalived should already be stopped. But I assume that after a second the handleBootstrapStopKeepalived function will resend the stop, so we should be good.
>
> Since the masters don't know the bootstrap IP address, we must verify that Keepalived on the bootstrap isn't running; otherwise two nodes will hold the VIP.

Which two nodes?

@openshift-merge-robot openshift-merge-robot merged commit 0158c53 into openshift:master Oct 2, 2020
@openshift-ci-robot (Contributor):

@celebdor: All pull requests linked via external trackers have merged:

Bugzilla bug 1884420 has been moved to the MODIFIED state.

In response to this:

Bug 1884420: bootstrap: API shows up, start it again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@celebdor celebdor deleted the start-again branch October 2, 2020 21:16