Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[magnum-auto-healer] Fix duplicated repair action #1530

Merged
merged 1 commit into from May 26, 2021

Conversation

openstacker
Copy link
Contributor

@openstacker openstacker commented May 12, 2021

What this PR does / why we need it:
This is kind of a regression issue. After introducing the feature that restarts the broken node if it's found the first time, then the stack update action is quick because there is no real action involved. But the controller is still polling the status periodically, so another round check coming and will trigger the repair again.

Which issue this PR fixes(if applicable):
fixes #1529

Special notes for reviewers:

Release note:

[magnum-auto-healer] Fix duplicated auto-healing actions and the default health check period is changed from 30s to 60s.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 12, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 12, 2021
@theopenlab-ci
Copy link

theopenlab-ci bot commented May 12, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented May 12, 2021

Build succeeded.

log.Infof("Node %s is found in unhealthy again, but we're going to defer the repair because it maybe in reboot process", serverID)
firstTimeRebootNodes[serverID] = n
processed = true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's rebooted for a while ago but still presented as un healthy, we just delete it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricolin What do you mean "we just delete it"? We cannot just delete the node, since it's being rebooted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noted for the behavior, so correct me if I misinterpret it

I mean we will not found node in firstTimeRebootNodes [1] if it's not firsttimeUnhealthy and already passed delay mins since last reboot. And that node will be deleted at line 471

[1] 3d06649#diff-9846ce0766626342e9df737eff34e6aba6b203aa80e0ba8fa021ac26ddbad785R467

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if the node is not firsttimeUnhealthy and already passed delayed mins, then it will be deleted from the k8s PoV.

CheckDelayAfterAdd: 10 * time.Minute,
DryRun: false,
CloudProvider: "openstack",
MonitorInterval: 60 * time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there reason why we change this to 60?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The default report interview from kubelet to kube-apiserver is 40s, so it's possible that the node status hasn't been updated within 30s. And based on our experience on prod, 30s is a bit too often. So I'm proposing to change it to a number just more than 40s so 60s is a good balance. I will update the release to cover this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with 60 secs, but I wonder is there any proper way to notify users about the default config change. releasenote?:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already updated the release note.

Copy link
Contributor

@ricolin ricolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

log.Infof("Node %s is found in unhealthy again, but we're going to defer the repair because it maybe in reboot process", serverID)
firstTimeRebootNodes[serverID] = n
processed = true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noted for the behavior, so correct me if I misinterpret it

I mean we will not found node in firstTimeRebootNodes [1] if it's not firsttimeUnhealthy and already passed delay mins since last reboot. And that node will be deleted at line 471

[1] 3d06649#diff-9846ce0766626342e9df737eff34e6aba6b203aa80e0ba8fa021ac26ddbad785R467

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2021
@openstacker
Copy link
Contributor Author

/lgtm

@openstacker openstacker reopened this May 17, 2021
@theopenlab-ci
Copy link

theopenlab-ci bot commented May 17, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented May 17, 2021

Build succeeded.

@@ -316,6 +316,10 @@ func (c *Controller) repairNodes(unhealthyNodes []healthcheck.NodeInfo) {
newNode := node.KubeNode.DeepCopy()
newNode.Spec.Unschedulable = true

// Skip cordon for master node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because by default we have already set the UnScheduable for all master nodes. It's not necessary to set it again here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is specific to magnum, I would recommend we add a config option to decide if master node should be uncordoned or not. The repairNodes is supposed to be agnostic to providers.

// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
// 2) If the node is not the first time in error, check if the last reboot time is in provider.Config.RebuildDelayAfterReboot
func (provider OpenStackCloudProvider) FirstTimeRepair(n healthcheck.NodeInfo, serverID string, firstTimeRebootNodes map[string]healthcheck.NodeInfo) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FirstTimeRepair doesn't need to be exposed because it's only called in Repair()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Will fix in next patch set.

@@ -250,6 +250,61 @@ func (provider OpenStackCloudProvider) waitForServerDetachVolumes(serverID strin
return rootVolumeID, err
}

// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reboot and uncordon it.

Also, please describe explicitly what's the meaning of the returned value(processed) and what the called should do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


n.RebootAt = time.Now()
firstTimeRebootNodes[serverID] = n
unHealthyNodes[serverID] = n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate with Line 269

}
}

n.RebootAt = time.Now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nodes []healthcheck.NodeInfo outside of this function is not going to be changed because n here is just a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the duplicated line at 269 and the variable n will be saved at line 290. So I think the new value of RebootAt will be saved. Please correct me if I missed something. Cheers.

There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@theopenlab-ci
Copy link

theopenlab-ci bot commented May 24, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented May 24, 2021

Build succeeded.

@lingxiankong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 25, 2021
@openstacker
Copy link
Contributor Author

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openstacker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2021
@openstacker
Copy link
Contributor Author

Doing a ninja approval based on the lgtm from Lingxian :)

@k8s-ci-robot k8s-ci-robot merged commit db60e00 into kubernetes:master May 26, 2021
openstacker added a commit to openstacker/cloud-provider-openstack that referenced this pull request May 31, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
k8s-ci-robot pushed a commit that referenced this pull request Jun 2, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
openstacker added a commit to openstacker/cloud-provider-openstack that referenced this pull request Jun 2, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
openstacker added a commit to openstacker/cloud-provider-openstack that referenced this pull request Jun 2, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
k8s-ci-robot pushed a commit that referenced this pull request Jun 2, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
lingxiankong pushed a commit that referenced this pull request Jun 3, 2021
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
powellchristoph pushed a commit to powellchristoph/cloud-provider-openstack that referenced this pull request Jan 19, 2022
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s
jingczhang pushed a commit to nokia/cloud-provider-openstack that referenced this pull request Jun 1, 2022
There are several changes in this commit:

1) Fix the duplicated repair actions
2) Change default health check period from 30s to 60s

(cherry picked from commit db60e00)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[magnum-auto-healer] Fix duplicated healing action
4 participants