Add remaining e2e tests for Job BackoffLimitPerIndex based on KEP #121633
Conversation
/sig apps
// we use parallelism=1 to make sure only one pod is created for the asserts
parallelism = int32(1)
ginkgo.By("Creating an indexed job with backoffLimit per index and maxFailedIndexes")
job := e2ejob.NewTestJob("fail", "with-max-failed-indexes", v1.RestartPolicyNever, parallelism, completions, nil, backoffLimit)
side note: having these global variables for parallelism, completions and backoffLimit makes the tests very hard to read. I need to jump from one place to another to check what the values are. We should get rid of them in a follow-up.
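For illustration, a minimal sketch of the follow-up shape being suggested, with each test declaring its own values (the `e2ejob.NewTestJob` call is taken from the diff above; the concrete values are illustrative):

```go
// each test declares what it needs, next to the assertions that depend on it
parallelism := int32(1) // one pod at a time, so the asserts only ever see one pod
completions := int32(2) // illustrative value
backoffLimit := int32(6) // illustrative value
job := e2ejob.NewTestJob("fail", "with-max-failed-indexes",
	v1.RestartPolicyNever, parallelism, completions, nil, backoffLimit)
```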
sgtm
Actually, I just realized this `parallelism = int32(1)` was a bug that could override the parallelism for the next test to run. It caused this failure, which I also reproduced locally: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121633/pull-kubernetes-e2e-gce/1719394173829779456. It could cause flakes depending on the order of execution of the tests.
Fixed now by `parallelism := int32(1)`. However, it's better to avoid the global vars; I will open an issue for this.
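For readers less familiar with the Go gotcha here, a self-contained sketch (variable names and values are illustrative) of why `=` leaks across tests while `:=` does not:

```go
package main

import "fmt"

// package-level default shared across tests (stands in for the e2e globals)
var parallelism = int32(2)

// buggy version: plain assignment mutates the shared variable, so
// parallelism=1 leaks into whichever test runs next
func buggyTest() {
	parallelism = int32(1)
	fmt.Println("buggyTest ran with parallelism =", parallelism)
}

// fixed version: := declares a new local that shadows the global,
// leaving the shared value untouched
func fixedTest() {
	parallelism := int32(1)
	fmt.Println("fixedTest ran with parallelism =", parallelism)
}

func main() {
	fixedTest()
	fmt.Println("after fixedTest, shared parallelism =", parallelism) // still 2
	buggyTest()
	fmt.Println("after buggyTest, shared parallelism =", parallelism) // now 1
}
```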
Opened the cleanup issue for now; maybe some folks will want to grab it: #121689
test/e2e/apps/job.go
ginkgo.By("Verifying the Job status fields to ensure the upper indexes didn't execute") | ||
job, err = e2ejob.GetJob(ctx, f.ClientSet, f.Namespace.Name, job.Name) | ||
framework.ExpectNoError(err, "failed to retrieve latest job object") | ||
gomega.Expect(job.Status.FailedIndexes).Should(gomega.HaveValue(gomega.Equal("0,1"))) |
Is there a guarantee that both pods will run? Could this flake?
There is a guarantee on that, because the job does not specify maxFailedIndexes, so all indexes are expected to be executed.
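For readers outside the thread, a rough sketch of the two batch/v1 spec fields in play (values are illustrative; pod template and other required fields omitted):

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// With BackoffLimitPerIndex set and MaxFailedIndexes left nil, each index
// keeps its own failure counter and there is no cap on how many indexes may
// fail, so indexes 0 and 1 are both guaranteed to execute (and fail) here.
func sketchSpec() batchv1.JobSpec {
	return batchv1.JobSpec{
		CompletionMode:       ptr.To(batchv1.IndexedCompletion),
		Completions:          ptr.To[int32](2),
		Parallelism:          ptr.To[int32](1),
		BackoffLimitPerIndex: ptr.To[int32](1),
		// MaxFailedIndexes: nil — no limit on how many indexes may fail
	}
}
```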
I ran it a couple of times locally and it passed consistently.
oh, right, it's a FailIndex, not FailJob.
It would be more interesting to use a job where, let's say, the odd indexes fail. I think there's a helper for that already?
yes, we have a helper; I was thinking about using it. I will adjust if you think it makes more sense.
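In case it's useful context, a hypothetical sketch of what such a helper could look like — the actual helper in the e2e framework may differ. JOB_COMPLETION_INDEX is the env var the Job controller injects into pods of Indexed Jobs:

```go
// failOddIndexesCommand is a hypothetical helper name; it returns a container
// command that fails for odd completion indexes and succeeds for even ones.
func failOddIndexesCommand() []string {
	return []string{
		"sh", "-c",
		"if [ $((JOB_COMPLETION_INDEX % 2)) -eq 1 ]; then exit 1; fi",
	}
}
```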
Force-pushed from e5fd24e to cb18d82.
/test pull-kubernetes-unit
/approve
/lgtm
LGTM label has been added. Git tree hash: c74bf0c2f313f25cbd9f9c1008e540d3f088a6c8
/retest
LGTM!
/assign @soltysh
Force-pushed from cb18d82 to ae73cf9.
/assign @soltysh
/test pull-kubernetes-unit
@pacoxu
/approve
/triage accepted
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/milestone v1.29
LGTM label has been added. Git tree hash: c7fb7c87a8bdc4246388d325a23467eea10db4c3
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of: kubernetes/enhancements#3850
Special notes for your reviewer:
The test scenarios were outlined in the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs, but I forgot about them when preparing #121368. Filling the gap now.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: