Benchmark job with backoff limit per index #121393

mimowo · 2023-10-20T13:59:21Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Experiments on my local machine running the Benchmark integration test for jobs with failures using the BenchmarkLargeFailureHandling test. Here we have

b.Cleanup(setDurationDuringTest(&jobcontroller.DefaultJobPodFailureBackOff, fastPodFailureBackoff))
b.Cleanup(setDurationDuringTest(&jobcontroller.MaxJobPodFailureBackOff, fastPodFailureBackoff))

to minimize the timing differences due to different backoff strategies.

The results of the benchmark are put the KEP update PR: kubernetes/enhancements#4321.

Also, lowered the nPods parameter to 10 and 100 to avoid issues with long running time on CI. The parameters can be easily increased for local testing.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs

k8s-ci-robot · 2023-10-20T13:59:30Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mimowo · 2023-10-23T10:42:27Z

/test pull-kubernetes-unit

mimowo · 2023-10-23T15:14:26Z

@alculquicondor PTAL at the benchmark results.

Do you think this warrants further investigation at this point?
Do you think we want something like this merged, or the one time experiment on-demand is enough: Benchmark job with backoff limit per index #121393? I'm a little bit worried about merging the benchmark given our issues with the runtime for integration tests.

alculquicondor

Why is this benchmark relevant?
If there are no Pod failures, why would having backoffLimitPerIndex have an impact?

mimowo · 2023-10-25T07:26:07Z

Why is this benchmark relevant?
If there are no Pod failures, why would having backoffLimitPerIndex have an impact?

This is a sanity benchmark to make sure there is no significant penalty for stepping into some new branches of code like here or here.

However, to verify the new branches don't add significant penalty at scale we would need to have many failures, probably >1000. At this point, delay due to expotential backoff would take over. Since the delay patterns are different between global and perIndex backoffLimit I'm not sure how to design the proper full benchmark.

mimowo · 2023-10-25T10:19:13Z

One idea that occurs to me now is to use an integration test. Mark 1000 pods as failed, and measure just the duration of syncJob execution for comparison. This approach would allow to abstract out the expotential delay differences. Even if be not very representative for user experience, it will get us some clues about performance.

EDIT: I'm trying yet another approach with setting the backoffDelay to 10ms, or less only to make it less relevant.
EDIT: pushed one more benchmark with failing pods (with nPods failing pods). To minimize the differences with backoff strategies I use:

b.Cleanup(setDurationDuringTest(&jobcontroller.DefaultJobPodFailureBackOff, fastPodFailureBackoff))
b.Cleanup(setDurationDuringTest(&jobcontroller.MaxJobPodFailureBackOff, fastPodFailureBackoff))

I will publish the results when have them.

mimowo · 2023-10-30T11:02:06Z

I think it's valuable to keep this for a bit.
You should probably copy some of the results to the KEP, for future reference when considering the graduation to stable.

SGTM. I have prepared the KEP update: kubernetes/enhancements#4321

alculquicondor

/approve

test/integration/job/job_test.go

alculquicondor · 2023-10-31T16:08:55Z

/retitle Benchmark job with backoff limit per index

k8s-ci-robot · 2023-10-31T16:09:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/integration/job/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2023-10-31T16:42:59Z

cc @kannon92 @pohly
would you like to review as well?

mimowo · 2023-10-31T17:25:20Z

/test pull-kubernetes-unit
retry as this is failure unrelated with this PR, but it might be good to look into this deeper as this is a job failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121393/pull-kubernetes-unit/1719392853026672640
FYI @alculquicondor @dejanzele

mimowo · 2023-10-31T17:35:12Z

/test pull-kubernetes-unit retry as this is failure unrelated with this PR, but it might be good to look into this deeper as this is a job failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121393/pull-kubernetes-unit/1719392853026672640 FYI @alculquicondor @dejanzele

Ticketed: #121652

mimowo · 2023-10-31T18:08:05Z

/retest

seans3 · 2023-10-31T20:31:47Z

/remove-sig api-machinery

dims · 2023-10-31T22:30:40Z

/retest

alculquicondor · 2023-11-01T13:01:18Z

oops
/lgtm

k8s-ci-robot · 2023-11-01T13:01:27Z

LGTM label has been added.

Git tree hash: d25535bb9583f54359ab98c41aa974c115131fde

pacoxu · 2023-11-02T04:28:54Z

This is a test change. I suppose that we can add milestone v1.29 for this.

mimowo · 2023-11-02T07:33:51Z

Yes, I think it makes sense to add the milestone.

mimowo · 2023-11-02T07:34:16Z

/assign @soltysh

pacoxu · 2023-11-02T07:40:48Z

/milestone v1.29

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 20, 2023

k8s-ci-robot requested review from atiratree and mortent October 20, 2023 14:00

k8s-ci-robot added area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 20, 2023

mimowo force-pushed the backoff-limit-per-index-load-test branch from 3ec9651 to b78af4c Compare October 23, 2023 08:54

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2023

mimowo force-pushed the backoff-limit-per-index-load-test branch from b78af4c to 2962755 Compare October 23, 2023 09:55

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2023

mimowo changed the title ~~WIP: Benchmark job with backoff limit per index~~ [DO NOT MERGE] Benchmark job with backoff limit per index [BENCHMARK EXPERIMENTS] Oct 23, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2023

mimowo mentioned this pull request Oct 23, 2023

Graduate BackoffLimitPerIndex to Beta #121356

Merged

6 tasks

alculquicondor reviewed Oct 24, 2023

View reviewed changes

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 25, 2023

mimowo force-pushed the backoff-limit-per-index-load-test branch from 2297edd to 4819547 Compare October 25, 2023 12:23

mimowo force-pushed the backoff-limit-per-index-load-test branch 2 times, most recently from 5c04f66 to ff87bf3 Compare October 30, 2023 10:55

alculquicondor reviewed Oct 31, 2023

View reviewed changes

test/integration/job/job_test.go Outdated Show resolved Hide resolved

k8s-ci-robot changed the title ~~[DO NOT MERGE] Benchmark job with backoff limit per index [BENCHMARK EXPERIMENTS]~~ Benchmark job with backoff limit per index Oct 31, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2023

Benchmark job with backoff limit per index

168e016

mimowo force-pushed the backoff-limit-per-index-load-test branch from ff87bf3 to 168e016 Compare October 31, 2023 16:35

k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 31, 2023

k8s-ci-robot assigned alculquicondor Nov 1, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023

k8s-ci-robot assigned soltysh Nov 2, 2023

k8s-ci-robot added this to the v1.29 milestone Nov 2, 2023

k8s-ci-robot merged commit 515d1ce into kubernetes:master Nov 2, 2023
15 checks passed

jensentanlo mentioned this pull request Nov 2, 2023

Backoff Limit Per Index For Indexed Jobs kubernetes/enhancements#3850

Open

8 tasks

mimowo deleted the backoff-limit-per-index-load-test branch November 29, 2023 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark job with backoff limit per index #121393

Benchmark job with backoff limit per index #121393

mimowo commented Oct 20, 2023 •

edited

k8s-ci-robot commented Oct 20, 2023

mimowo commented Oct 23, 2023

mimowo commented Oct 23, 2023

alculquicondor left a comment

mimowo commented Oct 25, 2023

mimowo commented Oct 25, 2023 •

edited

mimowo commented Oct 30, 2023

alculquicondor left a comment

alculquicondor commented Oct 31, 2023

k8s-ci-robot commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

seans3 commented Oct 31, 2023

dims commented Oct 31, 2023

alculquicondor commented Nov 1, 2023

k8s-ci-robot commented Nov 1, 2023

pacoxu commented Nov 2, 2023

mimowo commented Nov 2, 2023

mimowo commented Nov 2, 2023

pacoxu commented Nov 2, 2023

Benchmark job with backoff limit per index #121393

Benchmark job with backoff limit per index #121393

Conversation

mimowo commented Oct 20, 2023 • edited

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot commented Oct 20, 2023

mimowo commented Oct 23, 2023

mimowo commented Oct 23, 2023

alculquicondor left a comment

Choose a reason for hiding this comment

mimowo commented Oct 25, 2023

mimowo commented Oct 25, 2023 • edited

mimowo commented Oct 30, 2023

alculquicondor left a comment

Choose a reason for hiding this comment

alculquicondor commented Oct 31, 2023

k8s-ci-robot commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

mimowo commented Oct 31, 2023

seans3 commented Oct 31, 2023

dims commented Oct 31, 2023

alculquicondor commented Nov 1, 2023

k8s-ci-robot commented Nov 1, 2023

pacoxu commented Nov 2, 2023

mimowo commented Nov 2, 2023

mimowo commented Nov 2, 2023

pacoxu commented Nov 2, 2023

mimowo commented Oct 20, 2023 •

edited

mimowo commented Oct 25, 2023 •

edited