Bug 1884420: bootstrap: API shows up, start it again #102
Conversation
pkg/monitor/dynkeepalived.go
Outdated
 			consecutiveErr = 0
 		}
 		if consecutiveErr > bootstrapApiFailuresThreshold {
 			log.WithFields(logrus.Fields{
 				"consecutiveErr": consecutiveErr,
 				"bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
 			}).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
-			bootstrapStopKeepalived <- true
+			bootstrapStopKeepalived <- stopped
 			return
We shouldn't exit this function after the 'stop' message has been sent to the channel.
 			consecutiveErr = 0
 		}
 		if consecutiveErr > bootstrapApiFailuresThreshold {
 			log.WithFields(logrus.Fields{
 				"consecutiveErr": consecutiveErr,
 				"bootstrapApiFailuresThreshold": bootstrapApiFailuresThreshold,
 			}).Info("handleBootstrapStopKeepalived: Num of failures exceeds threshold")
-			bootstrapStopKeepalived <- true
-			return
+			bootstrapStopKeepalived <- stopped
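(For illustration only, a reduced sketch of the control-flow point raised above; `monitorAPI`, `apiHealthy`, and `stop` are hypothetical names, not the ones in dynkeepalived.go.)

```go
package monitor

import "time"

// monitorAPI is a sketch: once the failure threshold is crossed it signals
// 'stop', but it keeps looping instead of returning, so a later successful
// API check can still be observed and acted on.
func monitorAPI(apiHealthy func() bool, stop chan<- bool, threshold int) {
	consecutiveErr := 0
	for {
		if apiHealthy() {
			consecutiveErr = 0
		} else {
			consecutiveErr++
		}
		if consecutiveErr > threshold {
			stop <- true // signal keepalived to stop, but do NOT return here
		}
		time.Sleep(time.Second)
	}
}
```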
The 'stop' message will be sent every second after the API goes down, right?
Well, it doesn't matter, because on the MCO side, if there is no running Keepalived process, it just logs a message.
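(Purely hypothetical sketch of the tolerance being described: if the receiving side finds no running keepalived process, it just logs and moves on, so a 'stop' message repeated every second is harmless. None of these names come from the actual MCO code.)

```go
package monitor

import log "github.com/sirupsen/logrus"

// handleStopCommand ignores a stop request when keepalived is not running,
// so the sender is free to repeat the command without side effects.
func handleStopCommand(keepalivedRunning func() bool, stopKeepalived func() error) error {
	if !keepalivedRunning() {
		log.Info("received stop command but no keepalived process is running; nothing to do")
		return nil
	}
	return stopKeepalived()
}
```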
right
One minor thing about the log message, but otherwise this is working for me and should fix the problem.
pkg/monitor/dynkeepalived.go
Outdated
 		if err == nil {
-			log.Info("Stop message successfully sent to Keepalived container control socket")
+			log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg[:]))
This isn't formatting correctly. I think you need log.Infof to use the format string.
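(A minimal, self-contained illustration of the logrus point; `cmdMsg` is just a placeholder value here.)

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	cmdMsg := []byte("reload")

	// log.Info does not interpret format verbs: the "%s" is printed literally
	// and the second argument is simply concatenated onto the message.
	log.Info("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg))

	// log.Infof treats its first argument as a format string, which is what
	// this log line needs.
	log.Infof("Command message successfully sent to Keepalived container control socket: %s", string(cmdMsg))
}
```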
@@ -166,15 +173,17 @@ func handleBootstrapStopKeepalived(kubeconfigPath string, bootstrapStopKeepalive
 				"consecutiveErr": consecutiveErr,
 			}).Info("handleBootstrapStopKeepalived: detect failure on API")
 		} else {
+			if consecutiveErr > bootstrapApiFailuresThreshold { // Means it was stopped
Ah, I believe this is why I added the separate start command, so I wouldn't have to check this. With a separate command I can just resend it safely and don't have to worry about a message getting lost because keepalived happens to restart or something.
This should be fine for the bootstrap though. It isn't a long-lived thing anyway.
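(A hedged sketch of the idea in the comment above: the monitor just sends the command matching the current API state, and because both commands are safe to repeat, nothing needs to remember whether keepalived was already stopped or whether a message was lost. The command names and helper are illustrative, not the real ones.)

```go
package monitor

const (
	cmdStop  = "stop"  // safe to resend: stopping an already-stopped keepalived is a no-op
	cmdStart = "start" // safe to resend: starting an already-running keepalived is a no-op
)

// commandForAPIState picks the command for the observed API state; because
// both commands are idempotent, a lost or repeated message is not a problem.
func commandForAPIState(apiHealthy bool) string {
	if apiHealthy {
		return cmdStart
	}
	return cmdStop
}
```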
/retitle Bug 1884420: bootstrap: API shows up, start it again
@celebdor: This pull request references Bugzilla bug 1884420, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
@celebdor: This pull request references Bugzilla bug 1884420, which is valid. 3 validation(s) were run on this bug
In order to keep the VIP on the bootstrap node until the masters' API shows up, we increased the priority of the bootstrap keepalived API VIP membership. So that the VIP can still move to the masters when the bootstrap is asked to stay around even after clustering (when its API server is already gone), we implemented a mechanism in the monitor that stops keepalived. The problem was that sometimes, during clustering, the API on the bootstrap node could go down for long enough that it looked like it would never come back up.

This PR addresses that by continuing to check for the API server on the bootstrap node and reloading keepalived if it shows up again. If the API is gone for good, the behavior is the same as before, but if it only went down for a while because of API pod restarts or resource issues, we reload keepalived and reclaim the API VIP.
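A sketch of the overall behavior described above, assuming a single polling goroutine; the names (`bootstrapAPIMonitor`, `apiHealthy`, `signal`) and the exact command strings are illustrative, not the real implementation:

```go
package monitor

import "time"

// bootstrapAPIMonitor counts consecutive API failures; past the threshold it
// asks keepalived to stop (releasing the VIP), but it keeps polling so that a
// recovered API triggers a reload and the bootstrap node reclaims the VIP.
func bootstrapAPIMonitor(apiHealthy func() bool, signal chan<- string, threshold int) {
	consecutiveErr := 0
	for {
		if !apiHealthy() {
			consecutiveErr++
			if consecutiveErr > threshold {
				signal <- "stop" // repeated sends are tolerated on the receiving side
			}
		} else {
			if consecutiveErr > threshold {
				// The API came back after we had given up on it: reload
				// keepalived so this node takes the API VIP again.
				signal <- "reload"
			}
			consecutiveErr = 0
		}
		time.Sleep(time.Second)
	}
}
```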
I have one concern: since we don't synchronize the config-change and stop-bootstrap functions, the config-change function can trigger a keepalived 'reload' as a result of a config change even though keepalived should already be stopped. Since the masters don't know the bootstrap IP address, we must verify that keepalived on the bootstrap isn't running; otherwise two nodes will hold the VIP.
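(One possible way to address this, sketched under the assumption that both code paths can share a small piece of state; the type and method names are made up for illustration, not taken from the PR.)

```go
package monitor

import "sync"

// keepalivedState lets the stop path and the config-change path agree on
// whether keepalived may still be reloaded.
type keepalivedState struct {
	mu      sync.Mutex
	stopped bool
}

// MarkStopped records that keepalived has been asked to stop on the bootstrap.
func (s *keepalivedState) MarkStopped() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stopped = true
}

// ReloadAllowed reports whether a config-change reload may still be sent;
// once stopped, reloads are suppressed so two nodes never hold the VIP.
func (s *keepalivedState) ReloadAllowed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return !s.stopped
}
```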
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: celebdor, yboaron.
Which two nodes?
@celebdor: All pull requests linked via external trackers have merged: Bugzilla bug 1884420 has been moved to the MODIFIED state.