[magnum-auto-healer] Fix duplicated repair action #1530
Conversation
Build succeeded.
log.Infof("Node %s is found in unhealthy again, but we're going to defer the repair because it maybe in reboot process", serverID)
firstTimeRebootNodes[serverID] = n
processed = true
}
If it was rebooted a while ago but is still reported as unhealthy, we just delete it.
@ricolin What do you mean "we just delete it"? We cannot just delete the node, since it's being rebooted.
Just noting the behavior, so correct me if I misinterpret it.
I mean we will not find the node in firstTimeRebootNodes [1] if it's not firsttimeUnhealthy and the delay minutes since the last reboot have already passed. And that node will be deleted at line 471.
[1] 3d06649#diff-9846ce0766626342e9df737eff34e6aba6b203aa80e0ba8fa021ac26ddbad785R467
Yes, if the node is not firsttimeUnhealthy and the delay minutes have already passed, then it will be deleted from the k8s PoV.
CheckDelayAfterAdd: 10 * time.Minute,
DryRun: false,
CloudProvider: "openstack",
MonitorInterval: 60 * time.Second,
Is there a reason why we changed this to 60?
Good question. The default report interval from kubelet to kube-apiserver is 40s, so it's possible that the node status hasn't been updated within 30s. And based on our experience in prod, 30s is a bit too frequent. So I'm proposing to change it to a number just above 40s; 60s is a good balance. I will update the release note to cover this.
I'm fine with 60 secs, but I wonder whether there is a proper way to notify users about the default config change. A release note? :)
I have already updated the release note.
/lgtm
@@ -316,6 +316,10 @@ func (c *Controller) repairNodes(unhealthyNodes []healthcheck.NodeInfo) {
newNode := node.KubeNode.DeepCopy()
newNode.Spec.Unschedulable = true

// Skip cordon for master node
Why?
Because by default we have already set Unschedulable for all master nodes. It's not necessary to set it again here.
If this is specific to Magnum, I would recommend we add a config option to decide if the master node should be uncordoned or not. repairNodes is supposed to be agnostic to providers.
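The reviewer's suggestion could look roughly like this: gate the master-node cordon behavior behind a config option so the generic repair path stays provider-agnostic. All names here (HealerConfig, CordonMasterNodes, shouldCordon) are hypothetical, not part of the actual codebase.

```go
package main

import "fmt"

// HealerConfig sketches a provider-agnostic option (hypothetical name).
type HealerConfig struct {
	// CordonMasterNodes would be false for Magnum, where master nodes
	// are already set Unschedulable by default.
	CordonMasterNodes bool
}

// shouldCordon decides whether repairNodes ought to cordon a node,
// instead of hard-coding a "skip masters" rule for one provider.
func shouldCordon(cfg HealerConfig, isMaster bool) bool {
	if isMaster {
		return cfg.CordonMasterNodes
	}
	return true // workers are always cordoned before repair
}

func main() {
	magnum := HealerConfig{CordonMasterNodes: false}
	fmt.Println(shouldCordon(magnum, true))  // master on Magnum: skip cordon
	fmt.Println(shouldCordon(magnum, false)) // worker: cordon as usual
}
```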
// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
// 2) If the node is not the first time in error, check if the last reboot time is in provider.Config.RebuildDelayAfterReboot
func (provider OpenStackCloudProvider) FirstTimeRepair(n healthcheck.NodeInfo, serverID string, firstTimeRebootNodes map[string]healthcheck.NodeInfo) (bool, error) {
FirstTimeRepair doesn't need to be exposed because it's only called in Repair()
Good point. Will fix in next patch set.
@@ -250,6 +250,61 @@ func (provider OpenStackCloudProvider) waitForServerDetachVolumes(serverID strin
return rootVolumeID, err
}

// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
reboot and uncordon it.
Also, please describe explicitly the meaning of the returned value (processed) and what the caller should do.
Will do.
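A documented, unexported version of the function along the lines the reviewer asked for might read as below. The body is a placeholder and the signature is simplified; only the doc-comment structure is the point.

```go
package main

import "fmt"

// firstTimeRepair handles the first occurrence of an unhealthy node
// (sketch; real signature takes the node info and reboot-tracking map):
//  1. If the node is unhealthy for the first time, reboot and uncordon it.
//  2. Otherwise, check whether the last reboot is still within
//     provider.Config.RebuildDelayAfterReboot.
//
// It returns processed == true when the node has been handled here and the
// caller (Repair) should skip further repair actions for it this round;
// on processed == false the caller proceeds with the full rebuild.
func firstTimeRepair(firstTimeUnhealthy bool) (processed bool, err error) {
	if firstTimeUnhealthy {
		// reboot + uncordon would happen here in the real implementation
		return true, nil
	}
	return false, nil
}

func main() {
	p, _ := firstTimeRepair(true)
	fmt.Println(p) // first-time unhealthy node is handled by the reboot path
}
```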
n.RebootAt = time.Now()
firstTimeRebootNodes[serverID] = n
unHealthyNodes[serverID] = n
This duplicates line 269.
}
}

n.RebootAt = time.Now()
The nodes []healthcheck.NodeInfo outside of this function is not going to be changed, because n here is just a copy.
I have removed the duplicated line at 269, and the variable n will be saved at line 290. So I think the new value of RebootAt will be saved. Please correct me if I missed something. Cheers.
There are several changes in this commit: 1) Fix the duplicated repair actions 2) Change default health check period from 30s to 60s
(force-pushed from 3d06649 to 278c871)
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openstacker

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Doing a ninja approval based on the lgtm from Lingxian :)
There are several changes in this commit: 1) Fix the duplicated repair actions 2) Change default health check period from 30s to 60s (cherry picked from commit db60e00)
What this PR does / why we need it:
This is kind of a regression issue. After introducing the feature that restarts a broken node when it's found unhealthy for the first time, the stack update action completes quickly because no real action is involved. But the controller is still polling the status periodically, so another round of checking comes in and triggers the repair again.
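The idea of the fix can be reduced to a small sketch (names illustrative, not the actual controller code): because the controller polls, the same unhealthy node can appear in several consecutive rounds before its reboot takes effect, and tracking already-repaired nodes lets later rounds defer instead of firing a duplicate repair.

```go
package main

import "fmt"

// dedupeRepairs simulates several polling rounds over unhealthy node IDs.
// A node already sent for repair is skipped in later rounds, mirroring the
// firstTimeRebootNodes tracking used by the fix (map usage illustrative).
func dedupeRepairs(rounds [][]string) int {
	firstTimeRebootNodes := map[string]bool{}
	repairs := 0
	for _, unhealthy := range rounds {
		for _, id := range unhealthy {
			if firstTimeRebootNodes[id] {
				continue // already rebooted; defer instead of repairing again
			}
			firstTimeRebootNodes[id] = true
			repairs++
		}
	}
	return repairs
}

func main() {
	// node-a shows up unhealthy in two consecutive polls, but is only
	// repaired once.
	fmt.Println(dedupeRepairs([][]string{{"node-a"}, {"node-a"}}))
}
```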
Which issue this PR fixes (if applicable):
fixes #1529
Special notes for reviewers:
Release note: