Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark job with backoff limit per index #121393

Merged

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Oct 20, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of: kubernetes/enhancements#3850

Special notes for your reviewer:

  1. Experiments on my local machine running the Benchmark integration test for jobs with failures using the BenchmarkLargeFailureHandling test. Here we have
b.Cleanup(setDurationDuringTest(&jobcontroller.DefaultJobPodFailureBackOff, fastPodFailureBackoff))
b.Cleanup(setDurationDuringTest(&jobcontroller.MaxJobPodFailureBackOff, fastPodFailureBackoff))

to minimize the timing differences due to different backoff strategies.

The results of the benchmark are put the KEP update PR: kubernetes/enhancements#4321.

Also, lowered the nPods parameter to 10 and 100 to avoid issues with long running time on CI. The parameters can be easily increased for local testing.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 20, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 20, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 20, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index-load-test branch from 3ec9651 to b78af4c Compare October 23, 2023 08:54
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index-load-test branch from b78af4c to 2962755 Compare October 23, 2023 09:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2023
@mimowo mimowo changed the title WIP: Benchmark job with backoff limit per index [DO NOT MERGE] Benchmark job with backoff limit per index [BENCHMARK EXPERIMENTS] Oct 23, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Oct 23, 2023

/test pull-kubernetes-unit

@mimowo
Copy link
Contributor Author

mimowo commented Oct 23, 2023

@alculquicondor PTAL at the benchmark results.

  1. Do you think this warrants further investigation at this point?
  2. Do you think we want something like this merged, or the one time experiment on-demand is enough: Benchmark job with backoff limit per index #121393? I'm a little bit worried about merging the benchmark given our issues with the runtime for integration tests.

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this benchmark relevant?
If there are no Pod failures, why would having backoffLimitPerIndex have an impact?

@mimowo
Copy link
Contributor Author

mimowo commented Oct 25, 2023

Why is this benchmark relevant?
If there are no Pod failures, why would having backoffLimitPerIndex have an impact?

This is a sanity benchmark to make sure there is no significant penalty for stepping into some new branches of code like here or here.

However, to verify the new branches don't add significant penalty at scale we would need to have many failures, probably >1000. At this point, delay due to expotential backoff would take over. Since the delay patterns are different between global and perIndex backoffLimit I'm not sure how to design the proper full benchmark.

@mimowo
Copy link
Contributor Author

mimowo commented Oct 25, 2023

One idea that occurs to me now is to use an integration test. Mark 1000 pods as failed, and measure just the duration of syncJob execution for comparison. This approach would allow to abstract out the expotential delay differences. Even if be not very representative for user experience, it will get us some clues about performance.

EDIT: I'm trying yet another approach with setting the backoffDelay to 10ms, or less only to make it less relevant.
EDIT: pushed one more benchmark with failing pods (with nPods failing pods). To minimize the differences with backoff strategies I use:

b.Cleanup(setDurationDuringTest(&jobcontroller.DefaultJobPodFailureBackOff, fastPodFailureBackoff))
b.Cleanup(setDurationDuringTest(&jobcontroller.MaxJobPodFailureBackOff, fastPodFailureBackoff))

I will publish the results when have them.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 25, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index-load-test branch from 2297edd to 4819547 Compare October 25, 2023 12:23
@mimowo mimowo force-pushed the backoff-limit-per-index-load-test branch 2 times, most recently from 5c04f66 to ff87bf3 Compare October 30, 2023 10:55
@mimowo
Copy link
Contributor Author

mimowo commented Oct 30, 2023

I think it's valuable to keep this for a bit.
You should probably copy some of the results to the KEP, for future reference when considering the graduation to stable.

SGTM. I have prepared the KEP update: kubernetes/enhancements#4321

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

test/integration/job/job_test.go Outdated Show resolved Hide resolved
@alculquicondor
Copy link
Member

/retitle Benchmark job with backoff limit per index

@k8s-ci-robot k8s-ci-robot changed the title [DO NOT MERGE] Benchmark job with backoff limit per index [BENCHMARK EXPERIMENTS] Benchmark job with backoff limit per index Oct 31, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index-load-test branch from ff87bf3 to 168e016 Compare October 31, 2023 16:35
@mimowo
Copy link
Contributor Author

mimowo commented Oct 31, 2023

cc @kannon92 @pohly
would you like to review as well?

@mimowo
Copy link
Contributor Author

mimowo commented Oct 31, 2023

/test pull-kubernetes-unit
retry as this is failure unrelated with this PR, but it might be good to look into this deeper as this is a job failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121393/pull-kubernetes-unit/1719392853026672640
FYI @alculquicondor @dejanzele

@mimowo
Copy link
Contributor Author

mimowo commented Oct 31, 2023

/test pull-kubernetes-unit retry as this is failure unrelated with this PR, but it might be good to look into this deeper as this is a job failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121393/pull-kubernetes-unit/1719392853026672640 FYI @alculquicondor @dejanzele

Ticketed: #121652

@mimowo
Copy link
Contributor Author

mimowo commented Oct 31, 2023

/retest

@seans3
Copy link
Contributor

seans3 commented Oct 31, 2023

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 31, 2023
@dims
Copy link
Member

dims commented Oct 31, 2023

/retest

@alculquicondor
Copy link
Member

oops
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: d25535bb9583f54359ab98c41aa974c115131fde

@pacoxu
Copy link
Member

pacoxu commented Nov 2, 2023

This is a test change. I suppose that we can add milestone v1.29 for this.

@mimowo
Copy link
Contributor Author

mimowo commented Nov 2, 2023

Yes, I think it makes sense to add the milestone.

@mimowo
Copy link
Contributor Author

mimowo commented Nov 2, 2023

/assign @soltysh

@pacoxu
Copy link
Member

pacoxu commented Nov 2, 2023

/milestone v1.29

@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Nov 2, 2023
@k8s-ci-robot k8s-ci-robot merged commit 515d1ce into kubernetes:master Nov 2, 2023
15 checks passed
@mimowo mimowo deleted the backoff-limit-per-index-load-test branch November 29, 2023 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

7 participants