
[KWOK] retry when applying Stages fails #911

Merged: 5 commits merged into kubernetes-sigs:main on Jan 26, 2024

Conversation

caozhuozi (Contributor) commented Jan 11, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #816

Special notes for your reviewer:

Since we now have a new design based on the weight delaying queue (#902), the old implementation (#904) was rejected. I will close the old one and start over here.

Currently, I only modified the node controller for a pre-review. Please feel free to give comments!

Does this PR introduce a user-facing change?

Add Stage retry mechanism

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none, kind/feature, and cncf-cla: yes labels (Jan 11, 2024)
netlify bot commented Jan 11, 2024

Deploy Preview for k8s-kwok canceled.

Latest commit: 8832ec1
Latest deploy log: https://app.netlify.com/sites/k8s-kwok/deploys/65b30a5b4442aa0008e7a419

@k8s-ci-robot (Contributor)

Hi @caozhuozi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test and size/M labels (Jan 11, 2024)
Resolved review threads: pkg/kwok/controllers/node_controller.go (4)
@k8s-ci-robot added the size/L label and removed the size/M label (Jan 14, 2024)
caozhuozi (Contributor, Author) commented Jan 14, 2024

Hi, @wzshiming.
I added a shouldRetry function to determine whether an error should be retried.

Currently, I only check for network errors and skip retries for all other error types.
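
A minimal sketch (not code from this PR) of what such a check could look like; detecting network failures via the net.Error interface and errors.As is my assumption about how "network errors" are recognized:

package main

import (
	"errors"
	"fmt"
	"net"
)

// shouldRetry reports whether an error looks like a transient network failure
// worth retrying; every other error is treated as permanent.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	return errors.As(err, &netErr)
}

func main() {
	netFailure := &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
	fmt.Println(shouldRetry(fmt.Errorf("apply stage: %w", netFailure))) // true: network error in the chain
	fmt.Println(shouldRetry(errors.New("invalid status template")))     // false: not retriable
}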

The logic is slightly different from the one I proposed before: #911 (comment)
The main difference is that I gave up on the IsKwokError() method, which was designed to check whether an error was generated by kwok itself, such as ComputePatchError.
It is easy to define the error:

var ComputePatchError = errors.New("compute patch error")

But it's hard for me to integrate the newly defined error into the existing code. For example:

func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
	patch, err := c.renderer.ToJSON(tpl, node)
	if err != nil {
		return nil, err
	}

	original, err := json.Marshal(node.Status)
	if err != nil {
		return nil, err
	}

	sum, err := strategicpatch.StrategicMergePatch(original, patch, node.Status)
	if err != nil {
		return nil, err
	}

	nodeStatus := corev1.NodeStatus{}
	err = json.Unmarshal(sum, &nodeStatus)
	if err != nil {
		return nil, err
	}

	dist, err := json.Marshal(nodeStatus)
	if err != nil {
		return nil, err
	}

	if bytes.Equal(original, dist) {
		return nil, nil
	}

	return json.Marshal(map[string]json.RawMessage{
		"status": patch,
	})
}

If I return the newly defined error from computePatch directly, the original error message within the method will be lost.
For example:

func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
	patch, err := c.renderer.ToJSON(tpl, node)
	if err != nil {
		return nil, ComputePatchError  // the original `err` will be lost
	}
	// omitted
}

An alternative is to return the newly defined error where the computePatch method is called:

patch, err = c.computePatch(node, next.StatusTemplate)
if err != nil {
	logger.Debug("failed to obtain the patch of node",
		"node", node.Name,
		"err", err,
	)
	return ComputePatchError
}

But I am not sure whether this is idiomatic Go.

@wzshiming (Member)

I think we can use errors.Join or fmt.Errorf.
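
For illustration only (not code from this PR), both approaches keep the original error reachable while letting callers detect the sentinel; ErrComputePatch is just the sentinel name from the discussion above:

package main

import (
	"errors"
	"fmt"
)

// ErrComputePatch stands in for the sentinel error being discussed.
var ErrComputePatch = errors.New("compute patch error")

func main() {
	cause := errors.New("failed to render status template")

	// errors.Join keeps both errors in the wrap chain (Go 1.20+).
	joined := errors.Join(ErrComputePatch, cause)

	// fmt.Errorf with %w also wraps the sentinel and keeps the cause in the message.
	wrapped := fmt.Errorf("%w: %v", ErrComputePatch, cause)

	// Either way, callers can still detect the sentinel and see the cause.
	fmt.Println(errors.Is(joined, ErrComputePatch), joined)
	fmt.Println(errors.Is(wrapped, ErrComputePatch), wrapped)
}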

@caozhuozi (Contributor, Author)

Hi, @wzshiming. Do you mean to join the specific error with the newly defined error?
I have two concerns here:

  1. I'm not sure whether this is idiomatic Go.

  2. The other concern is where to put the join statement. There are currently two choices:

    • inside the computePatch method:

        func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
        	patch, err := c.renderer.ToJSON(tpl, node)
        	if err != nil {
        		return nil, errors.Join(ComputePatchError, err)
        	}
        	// omitted
        }
    • where the computePatch method is called:

      patch, err = c.computePatch(node, next.StatusTemplate)
      if err != nil {
          return errors.Join(ComputePatchError, err)
      }

    Which one do you prefer?

@wzshiming (Member)

I think you can finish this and verify the various corner cases locally.

caozhuozi (Contributor, Author) commented Jan 15, 2024

I made several changes in this commit:

  1. Define a new error type, ErrComputePatch, which covers both finalizer and status patching errors.
  2. Remove the resource IsNotFound checks in the node controller, as they are now handled in shouldRetry, which is the single entry point for classifying errors.
  3. Move the patch operation out of finalizersModify and rename the method to computeFinalizerPatch. If any error occurs in computeFinalizerPatch, we also mark it as ErrComputePatch. The motivation is to keep all the patch operations in playStage, which makes the logic clearer.
  4. Correspondingly, rename computePatch to computeStatusPatch to indicate that it is used exclusively for status patches.
  5. Generalize the patchResource method so that it can be used for both status patching and finalizer patching.

Please feel free to give comments! 🙏

Resolved review threads: pkg/kwok/controllers/node_controller.go (2), pkg/kwok/controllers/error.go (1)
@caozhuozi (Contributor, Author)

Since we only need to check for network errors, I removed the ErrComputePatch definition in the latest commit.

Resolved review threads: pkg/kwok/controllers/node_controller.go (2)
@wzshiming (Member)

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label (Jan 17, 2024)
@caozhuozi (Contributor, Author)

In this new commit,

  • Bring the finalizersModify logic back.
  • Check whether the key already exists in the map when adding a job.
  • Create a function in utils that returns a default backoff setting shared by all kwok controllers for retried jobs (a rough sketch follows this list).
  • Rebase the code on the recent changes.
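
As referenced in the list above, a rough sketch (with placeholder values, not the ones merged in this PR) of what a shared default backoff helper in utils could look like:

package controllers

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// defaultBackoff returns the backoff shared by the kwok controllers when
// re-queueing failed stage jobs. The concrete values are placeholders.
func defaultBackoff() wait.Backoff {
	return wait.Backoff{
		Duration: 1 * time.Second,  // delay before the first retry
		Factor:   2.0,              // exponential growth per retry
		Cap:      30 * time.Second, // upper bound on the delay
	}
}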

If there are no problems with this version, I will start changing the other controllers.

@wzshiming (Member)

Sorry for the delay. This PR may need to be rebased due to the conflict with #920. I will take time to test this PR this week.

@caozhuozi (Contributor, Author)

Sorry for the delay. This PR may need to be rebased due to the conflict with #920. I will take time to test this PR this week.

Oh, @wzshiming! Understood, no worries! I will rebase it. 😊

Resolved review threads: pkg/kwok/controllers/node_controller.go (2), pkg/kwok/controllers/utils.go (1), pkg/kwok/controllers/stage_controller.go (1)
@k8s-ci-robot added the needs-rebase label (Jan 23, 2024)
@caozhuozi force-pushed the feat/retry branch 2 times, most recently from ff2c2dd to 2ecf597 (January 24, 2024 14:13)
@k8s-ci-robot removed the needs-rebase label (Jan 24, 2024)
Resolved review threads: pkg/kwok/controllers/stage_controller.go (1), pkg/kwok/controllers/node_controller.go (2)
needRetry, err := c.playStage(ctx, pod.Resource, pod.Stage)
if err != nil {
	logger.Error("failed to apply stage", err,
		"node", pod.Resource.Name, "stage", pod.Stage.Name())
Member

Suggested change
- "node", pod.Resource.Name, "stage", pod.Stage.Name())
+ "pod", pod.Key,
+ "stage", pod.Stage.Name(),
+ )

Resolved review thread: pkg/kwok/controllers/pod_controller.go (1)
if needRetry {
	*resource.RetryCount++
	logger.Info("retrying for failed job",
		"resource", resource.Resource.GetName(), "stage", resource.Stage.Name(), "retry", *resource.RetryCount)
Member

Suggested change
- "resource", resource.Resource.GetName(), "stage", resource.Stage.Name(), "retry", *resource.RetryCount)
+ "resource", resource.Key,
+ "stage", resource.Stage.Name(),
+ "retry", *resource.RetryCount,
+ )

Resolved review threads: pkg/kwok/controllers/node_controller.go (1), pkg/kwok/controllers/stage_controller.go (1), pkg/kwok/controllers/pod_controller.go (1), pkg/kwok/controllers/utils.go (1)
Comment on lines 470 to 471
logger.Debug("Skip modify status",
"reason", "do not need to modify status",
Member

Suggested change
- logger.Debug("Skip modify status",
- 	"reason", "do not need to modify status",
+ logger.Debug("Skip node",
+ 	"reason", "do not need to modify",


// backoffDelayByStep calculates the backoff delay period based on steps
func backoffDelayByStep(steps int, c wait.Backoff) time.Duration {
	if steps <= 0 {
Member

Suggested change
- if steps <= 0 {
+ if steps < 0 {

The parameter for the first failure should be 0, which gives the expected 1s delay.

Contributor Author

Since steps is a uint64 after the latest change and can never be less than zero, I just removed the if clause.
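
For reference, a self-contained sketch (my assumption of the behavior discussed here, not the merged code) of a backoffDelayByStep that takes a uint64 step count, where step 0 yields the base 1s delay:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// backoffDelayByStep scales b.Duration by b.Factor once per step and caps the
// result at b.Cap, so steps == 0 returns the base delay unchanged.
func backoffDelayByStep(steps uint64, b wait.Backoff) time.Duration {
	delay := float64(b.Duration)
	for i := uint64(0); i < steps; i++ {
		delay *= b.Factor
		if b.Cap > 0 && time.Duration(delay) >= b.Cap {
			return b.Cap
		}
	}
	return time.Duration(delay)
}

func main() {
	b := wait.Backoff{Duration: 1 * time.Second, Factor: 2.0, Cap: 30 * time.Second}
	for step := uint64(0); step < 6; step++ {
		fmt.Println(step, backoffDelayByStep(step, b)) // 1s, 2s, 4s, 8s, 16s, 30s
	}
}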

Comment on lines 308 to 317
*pod.RetryCount++
logger.Info("retrying for failed job",
	"pod", pod.Key,
	"stage", pod.Stage.Name(),
	"retry", *pod.RetryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*pod.RetryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)
@wzshiming (Member) commented Jan 25, 2024

Suggested change
*pod.RetryCount++
logger.Info("retrying for failed job",
"pod", pod.Key,
"stage", pod.Stage.Name(),
"retry", *pod.RetryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*pod.RetryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)
retryCount := atomic.AddUint64(resource.RetryCount, 1) - 1
logger.Info("retrying for failed job",
"pod", pod.Key,
"stage", pod.Stage.Name(),
"retry", retryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(retryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)

In some corner cases there might be a data race; using an atomic add avoids a panic.
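
A tiny standalone illustration (not from the PR) of the race being described: many goroutines bumping the same counter is only safe with the atomic add, which also returns the new value for logging:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var retryCount uint64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// atomic.AddUint64 is safe for concurrent use and returns the
			// updated value, unlike a plain retryCount++ which would race.
			_ = atomic.AddUint64(&retryCount, 1)
		}()
	}
	wg.Wait()
	fmt.Println(retryCount) // always 100
}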

Contributor Author

Thanks!!

)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*node.RetryCount, c.backoff)
Member

ditto

)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*resource.RetryCount, c.backoff)
Member

ditto

@wzshiming (Member)

/approve
/lgtm
/label tide/merge-method-squash

@k8s-ci-robot added the tide/merge-method-squash label (Jan 26, 2024)
@k8s-ci-robot added the lgtm label (Jan 26, 2024)
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caozhuozi, wzshiming

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Jan 26, 2024)
@k8s-ci-robot merged commit e1f43e2 into kubernetes-sigs:main (Jan 26, 2024)
28 checks passed
@caozhuozi (Contributor, Author)

@wzshiming

I made many mistakes and misunderstood a lot of things along the way (and I also learned a lot). Thank you for your great patience and for taking the trouble to guide me through this PR.

@k8s-ci-robot added the release-note label and removed the release-note-none label (Jan 26, 2024)