
[KWOK] retry when applying Stages fails #911

Merged: 5 commits merged into kubernetes-sigs:main on Jan 26, 2024

Conversation

caozhuozi (Contributor) commented Jan 11, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #816

Special notes for your reviewer:

Since we now have a new design based on the weight delaying queue (#902), the old implementation (#904) was rejected. I will close the old one and start over here.

Currently, I only modified the node controller for a pre-review. Please feel free to give comments!

Does this PR introduce a user-facing change?

Add Stage retry mechanism

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none, kind/feature, and cncf-cla: yes labels (Jan 11, 2024)
netlify bot commented Jan 11, 2024

Deploy Preview for k8s-kwok canceled.

Latest commit: 8832ec1
Latest deploy log: https://app.netlify.com/sites/k8s-kwok/deploys/65b30a5b4442aa0008e7a419

@k8s-ci-robot (Contributor)

Hi @caozhuozi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test and size/M labels (Jan 11, 2024)
Resolved review threads: pkg/kwok/controllers/node_controller.go (4)
@k8s-ci-robot added the size/L label and removed the size/M label (Jan 14, 2024)
caozhuozi (Contributor, Author) commented Jan 14, 2024

Hi, @wzshiming.
I added a shouldRetry function to determine whether an error should be retried.

Currently, I only check for network errors and skip retries for all other error types.
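
A minimal sketch (not code from this PR) of what such a check could look like; detecting network failures via the net.Error interface and errors.As is my assumption about how "network errors" are recognized:

package main

import (
	"errors"
	"fmt"
	"net"
)

// shouldRetry reports whether an error looks like a transient network failure
// worth retrying; every other error is treated as permanent.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	return errors.As(err, &netErr)
}

func main() {
	netFailure := &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
	fmt.Println(shouldRetry(fmt.Errorf("apply stage: %w", netFailure))) // true: network error in the chain
	fmt.Println(shouldRetry(errors.New("invalid status template")))     // false: not retriable
}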

The logic is slightly different from the one I proposed before: #911 (comment)
The main difference is that I gave up on the IsKwokError() method, which was designed to check whether an error was generated by kwok itself, such as ComputePatchError.
It is easy to define the error:

var ComputePatchError = errors.New("compute patch error")

But it's hard for me to integrate the newly defined error into the existing code. For example:

func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
	patch, err := c.renderer.ToJSON(tpl, node)
	if err != nil {
		return nil, err
	}

	original, err := json.Marshal(node.Status)
	if err != nil {
		return nil, err
	}

	sum, err := strategicpatch.StrategicMergePatch(original, patch, node.Status)
	if err != nil {
		return nil, err
	}

	nodeStatus := corev1.NodeStatus{}
	err = json.Unmarshal(sum, &nodeStatus)
	if err != nil {
		return nil, err
	}

	dist, err := json.Marshal(nodeStatus)
	if err != nil {
		return nil, err
	}

	if bytes.Equal(original, dist) {
		return nil, nil
	}

	return json.Marshal(map[string]json.RawMessage{
		"status": patch,
	})
}

If I return the newly defined error from computePatch directly, the original error message within the method will be lost.
For example:

func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
	patch, err := c.renderer.ToJSON(tpl, node)
	if err != nil {
		return nil, ComputePatchError  // the original `err` will be lost
	}
	// omitted
}

An alternative is to return the newly defined error where the computePatch method is called:

patch, err = c.computePatch(node, next.StatusTemplate)
if err != nil {
	logger.Debug("failed to obtain the patch of node",
		"node", node.Name,
		"err", err,
	)
	return ComputePatchError
}

But I am not sure whether this is idiomatic Go.

@wzshiming (Member)

I think we can use errors.Join or fmt.Errorf.
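
For illustration only (not code from this PR), both approaches keep the original error reachable while letting callers detect the sentinel; ErrComputePatch is just the sentinel name from the discussion above:

package main

import (
	"errors"
	"fmt"
)

// ErrComputePatch stands in for the sentinel error being discussed.
var ErrComputePatch = errors.New("compute patch error")

func main() {
	cause := errors.New("failed to render status template")

	// errors.Join keeps both errors in the wrap chain (Go 1.20+).
	joined := errors.Join(ErrComputePatch, cause)

	// fmt.Errorf with %w also wraps the sentinel and keeps the cause in the message.
	wrapped := fmt.Errorf("%w: %v", ErrComputePatch, cause)

	// Either way, callers can still detect the sentinel and see the cause.
	fmt.Println(errors.Is(joined, ErrComputePatch), joined)
	fmt.Println(errors.Is(wrapped, ErrComputePatch), wrapped)
}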

@caozhuozi (Contributor, Author)

Hi, @wzshiming. Do you mean to join the specific error with the newly defined error?
I have two concerns here:

  1. I'm not sure whether this is idiomatic Go.

  2. The other concern is where to put the join statement. There are currently two choices:

    • inside the computePatch method:

        func (c *NodeController) computePatch(node *corev1.Node, tpl string) ([]byte, error) {
        	patch, err := c.renderer.ToJSON(tpl, node)
        	if err != nil {
        		return nil, errors.Join(ComputePatchError, err)
        	}
        	// omitted
        }
    • where the computePatch method is called:

      patch, err = c.computePatch(node, next.StatusTemplate)
      if err != nil {
          return errors.Join(ComputePatchError, err)
      }

    Which one do you prefer?

@wzshiming (Member)

I think you can finish this and verify the various corner cases locally.

caozhuozi (Contributor, Author) commented Jan 15, 2024

I made several changes in this commit:

  1. Define a new error type, ErrComputePatch, which covers both finalizer and status patching errors.
  2. Remove the resource IsNotFound checks in the node controller, as they are now handled in shouldRetry, which is the single entry point for classifying errors.
  3. Move the patch operation out of finalizersModify and rename the method to computeFinalizerPatch. If any error occurs in computeFinalizerPatch, we also mark it as ErrComputePatch. The motivation is to keep all the patch operations in playStage, which makes the logic clearer.
  4. Correspondingly, rename computePatch to computeStatusPatch to indicate that it is used exclusively for status patches.
  5. Generalize the patchResource method so that it can be used for both status patching and finalizer patching.

Please feel free to give comments! 🙏

Resolved review threads: pkg/kwok/controllers/node_controller.go (2), pkg/kwok/controllers/error.go (1)
@caozhuozi (Contributor, Author)

Since we only need to check for network errors, I removed the ErrComputePatch definition in the latest commit.

Resolved review threads: pkg/kwok/controllers/node_controller.go (2)
@wzshiming (Member)

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label (Jan 17, 2024)
@caozhuozi (Contributor, Author)

In this new commit,

  • Bring the finalizersModify logic back.
  • Check whether the key already exists in the map when adding a job.
  • Create a function in utils that returns a default backoff setting shared by all kwok controllers for retried jobs (a rough sketch follows this list).
  • Rebase the code on the recent changes.
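
As referenced in the list above, a rough sketch (with placeholder values, not the ones merged in this PR) of what a shared default backoff helper in utils could look like:

package controllers

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// defaultBackoff returns the backoff shared by the kwok controllers when
// re-queueing failed stage jobs. The concrete values are placeholders.
func defaultBackoff() wait.Backoff {
	return wait.Backoff{
		Duration: 1 * time.Second,  // delay before the first retry
		Factor:   2.0,              // exponential growth per retry
		Cap:      30 * time.Second, // upper bound on the delay
	}
}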

If there are no problems with this version, I will start changing the other controllers.

@wzshiming (Member)

Sorry for the delay. This PR may need to be rebased due to the conflict with #920. I will take time to test this PR this week.

@caozhuozi (Contributor, Author)

Sorry for the delay. This PR may need to be rebased due to the conflict with #920. I will take time to test this PR this week.

Oh, @wzshiming! Understood, no worries! I will rebase it. 😊

Resolved review threads: pkg/kwok/controllers/node_controller.go (2), pkg/kwok/controllers/utils.go (1), pkg/kwok/controllers/stage_controller.go (1)
@k8s-ci-robot added the needs-rebase label (Jan 23, 2024)
@caozhuozi force-pushed the feat/retry branch 2 times, most recently from ff2c2dd to 2ecf597 (January 24, 2024 14:13)
@k8s-ci-robot removed the needs-rebase label (Jan 24, 2024)
Resolved review threads: pkg/kwok/controllers/stage_controller.go (1), pkg/kwok/controllers/node_controller.go (2)
needRetry, err := c.playStage(ctx, pod.Resource, pod.Stage)
if err != nil {
	logger.Error("failed to apply stage", err,
		"node", pod.Resource.Name, "stage", pod.Stage.Name())
Member

Suggested change
- "node", pod.Resource.Name, "stage", pod.Stage.Name())
+ "pod", pod.Key,
+ "stage", pod.Stage.Name(),
+ )

Resolved review thread: pkg/kwok/controllers/pod_controller.go (1)
if needRetry {
	*resource.RetryCount++
	logger.Info("retrying for failed job",
		"resource", resource.Resource.GetName(), "stage", resource.Stage.Name(), "retry", *resource.RetryCount)
Member

Suggested change
- "resource", resource.Resource.GetName(), "stage", resource.Stage.Name(), "retry", *resource.RetryCount)
+ "resource", resource.Key,
+ "stage", resource.Stage.Name(),
+ "retry", *resource.RetryCount,
+ )

Resolved review threads: pkg/kwok/controllers/node_controller.go (1), pkg/kwok/controllers/stage_controller.go (1), pkg/kwok/controllers/pod_controller.go (1), pkg/kwok/controllers/utils.go (1)
Comment on lines 470 to 471
logger.Debug("Skip modify status",
"reason", "do not need to modify status",
Member

Suggested change
- logger.Debug("Skip modify status",
- 	"reason", "do not need to modify status",
+ logger.Debug("Skip node",
+ 	"reason", "do not need to modify",


// backoffDelayByStep calculates the backoff delay period based on steps
func backoffDelayByStep(steps int, c wait.Backoff) time.Duration {
	if steps <= 0 {
Member

Suggested change
- if steps <= 0 {
+ if steps < 0 {

The parameter for the first failure should be 0, which gives the expected 1s delay.

Contributor Author

Since steps is a uint64 after the latest change and can never be less than zero, I just removed the if clause.
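
For reference, a self-contained sketch (my assumption of the behavior discussed here, not the merged code) of a backoffDelayByStep that takes a uint64 step count, where step 0 yields the base 1s delay:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// backoffDelayByStep scales b.Duration by b.Factor once per step and caps the
// result at b.Cap, so steps == 0 returns the base delay unchanged.
func backoffDelayByStep(steps uint64, b wait.Backoff) time.Duration {
	delay := float64(b.Duration)
	for i := uint64(0); i < steps; i++ {
		delay *= b.Factor
		if b.Cap > 0 && time.Duration(delay) >= b.Cap {
			return b.Cap
		}
	}
	return time.Duration(delay)
}

func main() {
	b := wait.Backoff{Duration: 1 * time.Second, Factor: 2.0, Cap: 30 * time.Second}
	for step := uint64(0); step < 6; step++ {
		fmt.Println(step, backoffDelayByStep(step, b)) // 1s, 2s, 4s, 8s, 16s, 30s
	}
}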

Comment on lines 308 to 317
*pod.RetryCount++
logger.Info("retrying for failed job",
	"pod", pod.Key,
	"stage", pod.Stage.Name(),
	"retry", *pod.RetryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*pod.RetryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)
@wzshiming (Member) commented Jan 25, 2024

Suggested change
*pod.RetryCount++
logger.Info("retrying for failed job",
"pod", pod.Key,
"stage", pod.Stage.Name(),
"retry", *pod.RetryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*pod.RetryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)
retryCount := atomic.AddUint64(resource.RetryCount, 1) - 1
logger.Info("retrying for failed job",
"pod", pod.Key,
"stage", pod.Stage.Name(),
"retry", retryCount,
)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(retryCount, c.backoff)
c.addStageJob(pod, retryDelay, 1)

In some corner cases there might be a data race; using an atomic add avoids a panic.
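
A tiny standalone illustration (not from the PR) of the race being described: many goroutines bumping the same counter is only safe with the atomic add, which also returns the new value for logging:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var retryCount uint64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// atomic.AddUint64 is safe for concurrent use and returns the
			// updated value, unlike a plain retryCount++ which would race.
			_ = atomic.AddUint64(&retryCount, 1)
		}()
	}
	wg.Wait()
	fmt.Println(retryCount) // always 100
}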

Contributor Author

Thanks!!

)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*node.RetryCount, c.backoff)
Member

ditto

)
// for failed jobs, we re-push them into the queue with a lower weight
// and a backoff period to avoid blocking normal tasks
retryDelay := backoffDelayByStep(*resource.RetryCount, c.backoff)
Member

ditto

@wzshiming (Member)

/approve
/lgtm
/label tide/merge-method-squash

@k8s-ci-robot added the tide/merge-method-squash label (Jan 26, 2024)
@k8s-ci-robot added the lgtm label (Jan 26, 2024)
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caozhuozi, wzshiming

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Jan 26, 2024)
@k8s-ci-robot merged commit e1f43e2 into kubernetes-sigs:main (Jan 26, 2024)
28 checks passed
@caozhuozi (Contributor, Author)

@wzshiming

I made many mistakes and misunderstood a lot of things along the way (and I also learned a lot). Thank you for your great patience and for taking the trouble to guide me through this PR.

@k8s-ci-robot added the release-note label and removed the release-note-none label (Jan 26, 2024)