Do not bump API requests backoff in the Job controller due to pod failures #118759

Merged

@mimowo mimowo commented Jun 20, 2023

What type of PR is this?

/kind cleanup
/kind bug

What this PR does / why we need it:

Pod failures shouldn't increase the backoff for API requests: the two have been counted independently since #114768 and have used different constants since #118615.

Which issue(s) this PR fixes:

Part of #118527

Special notes for your reviewer:

Pod failures still bump the exponential pod failure backoff delay used for pod recreations.
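To illustrate the separation described above, here is a minimal standalone sketch of two independent exponential backoffs, one driven by pod failures and one by API errors. The type `jobBackoff`, the helper `expBackoff`, and all four constants are hypothetical names and illustrative values chosen for this example; they are assumptions, not the controller's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only: assumptions for this sketch, not taken from the PR.
const (
	podFailureBase  = 10 * time.Second
	podFailureLimit = 10 * time.Minute
	apiBase         = 1 * time.Second
	apiLimit        = 1 * time.Minute
)

// expBackoff returns base * 2^(n-1), capped at limit, and zero when n <= 0.
func expBackoff(base, limit time.Duration, n int) time.Duration {
	if n <= 0 {
		return 0
	}
	d := base
	for i := 1; i < n; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	if d > limit {
		return limit
	}
	return d
}

// jobBackoff (hypothetical) keeps the two failure counters separate, so
// a pod failure never lengthens the delay applied to API retries.
type jobBackoff struct {
	podFailures int
	apiErrors   int
}

func (b *jobBackoff) recordPodFailure() { b.podFailures++ }
func (b *jobBackoff) recordAPIError()   { b.apiErrors++ }

// podRecreationDelay depends only on the pod failure count.
func (b *jobBackoff) podRecreationDelay() time.Duration {
	return expBackoff(podFailureBase, podFailureLimit, b.podFailures)
}

// apiRetryDelay depends only on the API error count.
func (b *jobBackoff) apiRetryDelay() time.Duration {
	return expBackoff(apiBase, apiLimit, b.apiErrors)
}

func main() {
	var b jobBackoff
	for i := 0; i < 3; i++ {
		b.recordPodFailure()
	}
	b.recordAPIError()
	// Three pod failures and one API error.
	fmt.Println(b.podRecreationDelay(), b.apiRetryDelay()) // prints: 40s 1s
}
```

In this sketch, three pod failures grow the pod recreation delay while the API retry delay stays at its base value, which is the behavior the PR description aims for.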

Also, clean up the stale expectations for expectedForGetKey, as syncJob has not returned forget since #114768.

Does this PR introduce a user-facing change?

Reduce delay when processing jobs after a transient API error

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 20, 2023
@mimowo
Contributor Author

mimowo commented Jun 20, 2023

/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 20, 2023
@mimowo mimowo force-pushed the dont-apibackoff-on-pod-failures branch from 9fcbb2c to 784a309 Compare June 20, 2023 09:31
@mimowo
Contributor Author

mimowo commented Jun 20, 2023

/test pull-kubernetes-e2e-kind

@alculquicondor
Member

I'm not sure the release note is very user friendly. Maybe something simpler like:

Reduce delay when processing jobs after a transient API error

failed int
}

func TestJobBackoffReset(t *testing.T) {
Member

Can you summarize why this test is changing so drastically?

Contributor Author

This test checks that the rate-limiter backoff in the syncJob queue is reset after a successful execution. To do that, it used to add an item with AddRateLimited based on the error due to pod failures, then check that the queue was emptied after a succeeded pod (essentially, that Forget is called).

However, pod failures no longer enqueue in the rate limiter. The queue is still emptied after a successful syncJob, so it seems to make sense to preserve the test that the rate limiter is emptied (Forget is called).

Additionally, the test used to run in two variants (parallelism=1 and parallelism=2); I don't think that matters for the current scenario.
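The scenario described in this comment can be sketched with a small standalone model of a per-item exponential rate limiter. The type `itemBackoff` and the helper `process` are hypothetical stand-ins written for this example; the method names When, Forget and NumRequeues are borrowed from the client-go workqueue rate limiter interface, but this is not the client-go implementation and not the test from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used to simulate a failed sync.
var errTransient = errors.New("transient API error")

// itemBackoff is a simplified, single-goroutine stand-in for a per-item
// exponential rate limiter. It is an assumption of this sketch.
type itemBackoff struct {
	base, limit time.Duration
	failures    map[string]int
}

func newItemBackoff(base, limit time.Duration) *itemBackoff {
	return &itemBackoff{base: base, limit: limit, failures: map[string]int{}}
}

// When records one more failure for key and returns the delay before the
// next retry: base * 2^(previous failures), capped at limit.
func (r *itemBackoff) When(key string) time.Duration {
	n := r.failures[key]
	r.failures[key] = n + 1
	d := r.base
	for i := 0; i < n; i++ {
		d *= 2
		if d >= r.limit {
			return r.limit
		}
	}
	if d > r.limit {
		return r.limit
	}
	return d
}

// NumRequeues reports how many failures are recorded for key.
func (r *itemBackoff) NumRequeues(key string) int { return r.failures[key] }

// Forget clears the failure history for key, resetting its backoff.
func (r *itemBackoff) Forget(key string) { delete(r.failures, key) }

// process models the worker loop's reaction to a sync result: an error
// requeues the key with backoff, a success forgets it.
func process(r *itemBackoff, key string, syncErr error) (time.Duration, bool) {
	if syncErr != nil {
		return r.When(key), true
	}
	r.Forget(key)
	return 0, false
}

func main() {
	r := newItemBackoff(time.Second, time.Minute)
	key := "default/my-job"

	process(r, key, errTransient)
	process(r, key, errTransient)
	fmt.Println(r.NumRequeues(key)) // prints: 2

	// A successful sync empties the rate limiter for this key.
	process(r, key, nil)
	fmt.Println(r.NumRequeues(key)) // prints: 0
}
```

The check at the end mirrors what the preserved test asserts: after an error followed by a success, no requeues remain recorded for the key.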

t.Errorf("%s: unexpected job failure", name)
}
// the queue is emptied on success
fakePodControl.Err = nil
Member

Should this be a different test case instead?

Contributor Author

I don't think so; this is part of the scenario. First an element is put into the queue (due to an error), and here the queue is emptied by a success (so Forget is called).

Member

@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: baae73c0691b893c019f7667df86c2fd39d1da92

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2023
@k8s-ci-robot k8s-ci-robot merged commit 2651e70 into kubernetes:master Jun 20, 2023
12 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 20, 2023
@mimowo mimowo deleted the dont-apibackoff-on-pod-failures branch November 29, 2023 15:01
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

3 participants