[e2e flake] [sig-apps] Job should run a job to completion when tasks sometimes fail and are not locally restarted #59527

mithrav · 2018-02-08T03:43:36Z

Backstory is in #54904, but this is still the second-highest flake on Velodrome: http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

mithrav · 2018-02-08T03:43:54Z

kind/bug

mithrav · 2018-02-08T03:44:14Z

/priority failing-test
/priority important-soon
/sig apps
/kind flake
@kubernetes/sig-apps-test-failures

k8s-ci-robot · 2018-02-08T03:44:21Z

@mithrav: Reiterating the mentions to trigger a notification:
@kubernetes/sig-apps-test-failures

In response to this:

/priority failing-test
/priority important-soon
/sig apps
/kind flake
@kubernetes/sig-apps-test-failures


mithrav · 2018-02-26T23:40:45Z

Pinging for updates. This is still a top cause of flakiness.

soltysh · 2018-03-09T10:35:11Z

So looking at the logs there are 2 problems here actually:

1. A bug already filed as "Jobs - takes a while to update job status if job succeeds after retrying many times" (#59918), where the job controller does not notice a successful run after several retries.
2. The backoff is not cleared after a successful run ("Improves backoff policy in JobController", #60202).

All in all it looks like the backoff policy is causing both problems. Digging further to get to the actual root cause.

soltysh · 2018-03-09T10:39:19Z

So it looks like no. 2 is caused by 1, actually, and since I'm able to reproduce this once in a while locally I'll try to get to the root cause.

@janetkuo

Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Backoff only when failed pod shows up

**What this PR does / why we need it**:
Upon introducing the backoff policy we started to delay sync runs for the job when it failed several times before. This leads to failed jobs not reporting status right away in cases that are not related to failed pods, eg. a successful run. This PR ensures the backoff is applied only when `updatePod` receives a failed pod.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #59918 #59527

/assign @janetkuo @kow3ns

**Release note**:
```release-note
None
```

soltysh · 2018-03-16T16:07:35Z

#60985 merged, closing.
/close

k8s-ci-robot added the needs-sig label Feb 8, 2018

kow3ns added this to Backlog in Workloads Feb 26, 2018

soltysh self-assigned this Mar 9, 2018

soltysh mentioned this issue Mar 9, 2018

Backoff only when failed pod shows up #60985

Merged

k8s-ci-robot closed this as completed Mar 16, 2018

Workloads automation moved this from Backlog to Done Mar 16, 2018
