[e2e flake] [sig-apps] Job should run a job to completion when tasks sometimes fail and are not locally restarted #59527

Closed
mithrav opened this issue Feb 8, 2018 · 7 comments
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps.
Projects

Comments

@mithrav

mithrav commented Feb 8, 2018

Back story in #54904, but this is still the second-highest flake on Velodrome: http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

@k8s-ci-robot added the needs-sig label Feb 8, 2018
@mithrav
Author

mithrav commented Feb 8, 2018

kind/bug

@mithrav
Author

mithrav commented Feb 8, 2018

/priority failing-test
/priority important-soon
/sig apps
/kind flake
@kubernetes/sig-apps-test-failures

@k8s-ci-robot added the kind/failing-test, sig/apps, priority/important-soon, and kind/flake labels and removed the needs-sig label Feb 8, 2018
@k8s-ci-robot
Contributor

@mithrav: Reiterating the mentions to trigger a notification:
@kubernetes/sig-apps-test-failures

In response to this:

/priority failing-test
/priority important-soon
/sig apps
/kind flake
@kubernetes/sig-apps-test-failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kow3ns kow3ns added this to Backlog in Workloads Feb 26, 2018
@mithrav
Author

mithrav commented Feb 26, 2018

Pinging for updates. This is still a top cause of flakiness.

@soltysh soltysh self-assigned this Mar 9, 2018
@soltysh
Contributor

soltysh commented Mar 9, 2018

Looking at the logs, there are actually 2 problems here:

  1. The bug filed as "Jobs - takes a while to update job status if job succeeds after retrying many times" (#59918), where the job controller does not notice a successful run after several retries.
  2. The backoff is not cleared after a successful run ("Improves backoff policy in JobController", #60202).

All in all it looks like the backoff policy is causing both problems. Digging further to get to the actual root cause.

@soltysh
Contributor

soltysh commented Mar 9, 2018

It looks like no. 2 is actually caused by no. 1. Since I'm able to reproduce this locally once in a while, I'll try to get to the root cause.

k8s-github-robot pushed a commit that referenced this issue Mar 16, 2018
Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Backoff only when failed pod shows up

**What this PR does / why we need it**:
When the backoff policy was introduced, we started delaying sync runs for a job that had failed several times before. As a result, such jobs do not report their status right away even when the triggering event is unrelated to a failed pod, e.g. a successful run. This PR ensures the backoff is applied only when `updatePod` receives a failed pod.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #59918 #59527

/assign @janetkuo @kow3ns 

**Release note**:
```release-note
None
```
@soltysh
Contributor

soltysh commented Mar 16, 2018

#60985 merged, closing.
/close

Workloads automation moved this from Backlog to Done Mar 16, 2018
Projects
Workloads
  
Done

3 participants