[magnum-auto-healer] Fix duplicated repair action #1530
Conversation
Build succeeded.
log.Infof("Node %s is found in unhealthy again, but we're going to defer the repair because it maybe in reboot process", serverID)
firstTimeRebootNodes[serverID] = n
processed = true
}
If it was rebooted a while ago but is still reported as unhealthy, we just delete it.
@ricolin What do you mean "we just delete it"? We cannot just delete the node, since it's being rebooted.
Just noting the behavior, so correct me if I misinterpret it.
I mean we will not find the node in firstTimeRebootNodes [1] if it's not firsttimeUnhealthy and the delay minutes since the last reboot have already passed. And that node will be deleted at line 471.
[1] 3d06649#diff-9846ce0766626342e9df737eff34e6aba6b203aa80e0ba8fa021ac26ddbad785R467
Yes, if the node is not firsttimeUnhealthy and the delay minutes have already passed, then it will be deleted from the k8s PoV.
CheckDelayAfterAdd: 10 * time.Minute,
DryRun: false,
CloudProvider: "openstack",
MonitorInterval: 60 * time.Second,
Is there a reason why we changed this to 60?
Good question. The default report interval from kubelet to kube-apiserver is 40s, so it's possible that the node status hasn't been updated within 30s. And based on our experience in prod, 30s is a bit too frequent. So I'm proposing to change it to a number just above 40s; 60s is a good balance. I will update the release note to cover this.
I'm fine with 60 secs, but I wonder whether there is a proper way to notify users about the default config change. A release note? :)
I have already updated the release note.
/lgtm
@@ -316,6 +316,10 @@ func (c *Controller) repairNodes(unhealthyNodes []healthcheck.NodeInfo) {
newNode := node.KubeNode.DeepCopy()
newNode.Spec.Unschedulable = true

// Skip cordon for master node
Why?
Because by default we have already set Unschedulable for all master nodes. It's not necessary to set it again here.
If this is specific to Magnum, I would recommend we add a config option to decide if the master node should be uncordoned or not. repairNodes is supposed to be agnostic to providers.
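The reviewer's suggestion could look roughly like this: gate the master-node cordon behavior behind a config option so the generic repair path stays provider-agnostic. All names here (HealerConfig, CordonMasterNodes, shouldCordon) are hypothetical, not part of the actual codebase.

```go
package main

import "fmt"

// HealerConfig sketches a provider-agnostic option (hypothetical name).
type HealerConfig struct {
	// CordonMasterNodes would be false for Magnum, where master nodes
	// are already set Unschedulable by default.
	CordonMasterNodes bool
}

// shouldCordon decides whether repairNodes ought to cordon a node,
// instead of hard-coding a "skip masters" rule for one provider.
func shouldCordon(cfg HealerConfig, isMaster bool) bool {
	if isMaster {
		return cfg.CordonMasterNodes
	}
	return true // workers are always cordoned before repair
}

func main() {
	magnum := HealerConfig{CordonMasterNodes: false}
	fmt.Println(shouldCordon(magnum, true))  // master on Magnum: skip cordon
	fmt.Println(shouldCordon(magnum, false)) // worker: cordon as usual
}
```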
// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
// 2) If the node is not the first time in error, check if the last reboot time is in provider.Config.RebuildDelayAfterReboot
func (provider OpenStackCloudProvider) FirstTimeRepair(n healthcheck.NodeInfo, serverID string, firstTimeRebootNodes map[string]healthcheck.NodeInfo) (bool, error) {
FirstTimeRepair doesn't need to be exposed because it's only called in Repair()
Good point. Will fix in next patch set.
@@ -250,6 +250,61 @@ func (provider OpenStackCloudProvider) waitForServerDetachVolumes(serverID strin
return rootVolumeID, err
}

// FirstTimeRepair Handle the first time repair for a node
// 1) If the node is the first time in error, reboot it
reboot and uncordon it.
Also, please describe explicitly the meaning of the returned value (processed) and what the caller should do.
Will do.
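A documented, unexported version of the function along the lines the reviewer asked for might read as below. The body is a placeholder and the signature is simplified; only the doc-comment structure is the point.

```go
package main

import "fmt"

// firstTimeRepair handles the first occurrence of an unhealthy node
// (sketch; real signature takes the node info and reboot-tracking map):
//  1. If the node is unhealthy for the first time, reboot and uncordon it.
//  2. Otherwise, check whether the last reboot is still within
//     provider.Config.RebuildDelayAfterReboot.
//
// It returns processed == true when the node has been handled here and the
// caller (Repair) should skip further repair actions for it this round;
// on processed == false the caller proceeds with the full rebuild.
func firstTimeRepair(firstTimeUnhealthy bool) (processed bool, err error) {
	if firstTimeUnhealthy {
		// reboot + uncordon would happen here in the real implementation
		return true, nil
	}
	return false, nil
}

func main() {
	p, _ := firstTimeRepair(true)
	fmt.Println(p) // first-time unhealthy node is handled by the reboot path
}
```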
n.RebootAt = time.Now()
firstTimeRebootNodes[serverID] = n
unHealthyNodes[serverID] = n
This duplicates line 269.
}
}

n.RebootAt = time.Now()
The nodes []healthcheck.NodeInfo outside of this function is not going to be changed, because n here is just a copy.
I have removed the duplicated line at 269, and the variable n will be saved at line 290. So I think the new value of RebootAt will be saved. Please correct me if I missed something. Cheers.
There are several changes in this commit: 1) Fix the duplicated repair actions 2) Change default health check period from 30s to 60s
(force-pushed from 3d06649 to 278c871)
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openstacker

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Doing a ninja approval based on the lgtm from Lingxian :)
There are several changes in this commit: 1) Fix the duplicated repair actions 2) Change default health check period from 30s to 60s (cherry picked from commit db60e00)
What this PR does / why we need it:
This is kind of a regression issue. After introducing the feature that restarts a broken node when it's found unhealthy for the first time, the stack update action completes quickly because no real action is involved. But the controller is still polling the status periodically, so another round of checking comes in and triggers the repair again.
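The idea of the fix can be reduced to a small sketch (names illustrative, not the actual controller code): because the controller polls, the same unhealthy node can appear in several consecutive rounds before its reboot takes effect, and tracking already-repaired nodes lets later rounds defer instead of firing a duplicate repair.

```go
package main

import "fmt"

// dedupeRepairs simulates several polling rounds over unhealthy node IDs.
// A node already sent for repair is skipped in later rounds, mirroring the
// firstTimeRebootNodes tracking used by the fix (map usage illustrative).
func dedupeRepairs(rounds [][]string) int {
	firstTimeRebootNodes := map[string]bool{}
	repairs := 0
	for _, unhealthy := range rounds {
		for _, id := range unhealthy {
			if firstTimeRebootNodes[id] {
				continue // already rebooted; defer instead of repairing again
			}
			firstTimeRebootNodes[id] = true
			repairs++
		}
	}
	return repairs
}

func main() {
	// node-a shows up unhealthy in two consecutive polls, but is only
	// repaired once.
	fmt.Println(dedupeRepairs([][]string{{"node-a"}, {"node-a"}}))
}
```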
Which issue this PR fixes (if applicable):
fixes #1529
Special notes for reviewers:
Release note: