Backoff only when failed pod shows up #60985

soltysh · 2018-03-09T15:44:13Z

What this PR does / why we need it:
When the backoff policy was introduced, we started delaying sync runs for a job that had failed several times before. As a result, such jobs do not report status right away in cases that are unrelated to failed pods, e.g. a successful run. This PR ensures the backoff is applied only when updatePod receives a failed pod.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #59918 #59527

/assign @janetkuo @kow3ns

Release note:

None

soltysh · 2018-03-09T15:44:29Z

/cc @clamoriniere1A @csrwng

k8s-ci-robot · 2018-03-09T15:44:30Z

@soltysh: GitHub didn't allow me to request PR reviews from the following users: clamoriniere1A.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @clamoriniere1A @csrwng

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

soltysh · 2018-03-09T15:44:57Z

@janetkuo @kow3ns not sure if we want to have this for 1.10, I'll mark it as such, but feel free to drop it from there.

soltysh · 2018-03-09T17:02:59Z

Fixed the bazel update.

clamoriniere1A · 2018-03-09T21:23:52Z

pkg/controller/job/job_controller.go

-	backoff := getBackoff(jm.queue, key)
+	var backoff time.Duration
+	if immediate {
+		backoff = time.Duration(0)


I think that forcing the backoff value to 0 will only do half of the job, because it does not clear the counter of previous failures in the rateLimitedQueue.

What do you think about the following code?

	if immediate {
		jm.queue.Forget(key)
	}
	backoff := getBackoff(jm.queue, key)

jdumars · 2018-03-12T14:43:27Z

If this gets an LGTM ASAP we can get this into 1.10. Also this needs a SIG and priority assigned.

soltysh · 2018-03-12T16:57:41Z

Comments addressed.
@clamoriniere1A @janetkuo @kow3ns ptal

clamoriniere1A · 2018-03-12T17:47:47Z

/lgtm

k8s-ci-robot · 2018-03-12T17:47:54Z

@clamoriniere1A: changing LGTM is restricted to assignees, and only kubernetes org members may be assigned issues.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

janetkuo · 2018-03-13T00:08:24Z

pkg/controller/job/job_controller.go

+	immediate := true
+	if curPod.Status.Phase == v1.PodFailed {
+		immediate = false
+	}


nit:

immediate := curPod.Status.Phase != v1.PodFailed

janetkuo · 2018-03-13T00:09:53Z

pkg/controller/job/job_controller.go

 		return
 	}

-	// Retrieves the backoff duration for this Job
+	if immediate {
+		jm.queue.Forget(key)


Just to confirm, this will cause backoff == time.Duration(0), right?

soltysh · Mar 14, 2018

Yes, this will remove the key from the queue, which will result in a fresh add with a zero backoff.

janetkuo · 2018-03-13T00:11:54Z

pkg/controller/job/job_controller_test.go

+			newPod.Status.Phase = tc.phase
+			manager.updatePod(oldPod, newPod)
+
+			if queue.duration.Nanoseconds() != tc.backoff {


nit: It's probably simpler to move * DefaultJobBackOff.Nanoseconds() here

janetkuo · 2018-03-13T00:13:32Z

pkg/controller/job/job_controller_test.go

+		"1st success": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
+		"2nd success": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
+		"1st running": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
+		"2nd running": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},


Are the running cases and the success cases the same?

soltysh · Mar 14, 2018

Yes, they are, but I decided to include both for clarity :-)

k8s-github-robot · 2018-03-13T00:14:32Z

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @kow3ns @soltysh

Pull Request Labels

sig/apps: Pull Request will be escalated to these SIGs if needed.
priority/important-longterm: Escalate to the pull request owners; move out of the milestone after 1 attempt.
kind/bug: Fixes a bug discovered during the current release.


jdumars · 2018-03-13T14:53:44Z

@janetkuo when satisfied, could you add the LGTM so it ships in 1.10? Thanks!

soltysh · 2018-03-14T10:50:00Z

@janetkuo nits addressed, ptal

jdumars · 2018-03-15T15:07:08Z

Hi @janetkuo could you PTAL so we can wrap this up for 1.10? Thanks!

janetkuo · 2018-03-16T00:13:10Z

/lgtm

k8s-ci-robot · 2018-03-16T00:13:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere1A, janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/job/OWNERS~~ [janetkuo,soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot · 2018-03-16T05:55:00Z

Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions here.

@janetkuo

Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Never clean backoff in job controller

What this PR does / why we need it:
In #60985 I've added a mechanism which allows immediate job status update, unfortunately that broke the backoff logic seriously. I'm sorry for that. I've changed the immediate mechanism so that it NEVER cleans the backoff, but for the cases when we want fast status update it uses a zero backoff.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62382

Special notes for your reviewer:
/assign @janetkuo

Release note:

None
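
One way to picture why clearing the history on every non-failed update could interfere with a retry limit is the Go sketch below. It assumes, purely for illustration, that the number of retries is read from the same per-key requeue count that the backoff uses; the event type, the replay function and the job key are hypothetical and do not come from the controller code.

```go
package main

import "fmt"

// event is one pod update; failed is true when the pod phase was Failed.
type event struct{ failed bool }

// replay returns the requeue count for a single job key after a sequence of
// pod updates. With forgetOnImmediate set, every non-failed update clears the
// history; otherwise the update is treated as a zero-delay enqueue that
// leaves the count alone.
func replay(events []event, forgetOnImmediate bool) int {
	const key = "default/my-job" // hypothetical job key
	requeues := map[string]int{}
	for _, e := range events {
		if e.failed {
			requeues[key]++
			continue
		}
		if forgetOnImmediate {
			delete(requeues, key)
		}
	}
	return requeues[key]
}

func main() {
	// Failed and non-failed updates interleave, as they might when new pods
	// start running between failures.
	events := []event{{true}, {false}, {true}, {false}, {true}}

	fmt.Println("forget on non-failed update:", replay(events, true))
	fmt.Println("zero delay, history kept:", replay(events, false))
}
```

Under these assumptions, the count in the first strategy keeps dropping back to zero between failures, so any limit compared against it would be reached late or not at all, while the second strategy still gives a fast status update without losing the history.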

k8s-ci-robot assigned janetkuo and kow3ns Mar 9, 2018

k8s-ci-robot requested review from erictune and janetkuo March 9, 2018 15:44

k8s-ci-robot requested a review from csrwng March 9, 2018 15:44

soltysh added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. status/approved-for-milestone labels Mar 9, 2018

soltysh added this to the v1.10 milestone Mar 9, 2018

k8s-github-robot added the milestone/incomplete-labels label Mar 9, 2018

soltysh force-pushed the issue59918 branch from f297648 to e7642cc Compare March 9, 2018 17:02

clamoriniere1A reviewed Mar 9, 2018

k8s-github-robot added milestone/removed and removed milestone/incomplete-labels labels Mar 12, 2018

k8s-github-robot removed this from the v1.10 milestone Mar 12, 2018

soltysh force-pushed the issue59918 branch from e7642cc to bbbb6da Compare March 12, 2018 16:15

soltysh added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Mar 12, 2018

janetkuo reviewed Mar 13, 2018

janetkuo added this to the v1.10 milestone Mar 13, 2018

k8s-github-robot removed the milestone/removed label Mar 13, 2018

Backoff only when failed pod shows up

1266252

soltysh force-pushed the issue59918 branch from bbbb6da to 1266252 Compare March 14, 2018 10:49

kow3ns added this to Backlog in Workloads Mar 15, 2018

kow3ns moved this from Backlog to In Progress in Workloads Mar 15, 2018

janetkuo approved these changes Mar 16, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2018

k8s-github-robot merged commit 5d67222 into kubernetes:master Mar 16, 2018

Workloads automation moved this from In Progress to Done Mar 16, 2018

soltysh mentioned this pull request Mar 16, 2018

[e2e flake] [sig-apps] Job should run a job to completion when tasks sometimes fail and are not locally restarted #59527

Closed

soltysh deleted the issue59918 branch March 16, 2018 16:30

ceshihao mentioned this pull request Apr 27, 2018

Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382

Closed

soltysh mentioned this pull request May 10, 2018

Never clean backoff in job controller #63650

Merged
