Flaky e2e test - terminal-failed-job #1691

Open
alenkacz opened this issue Sep 24, 2020 · 8 comments · Fixed by #1693

Comments

@alenkacz
Contributor

What happened:
Failed test kudo/harness/terminal-failed-job on a PR introducing a KEP (no code)

What you expected to happen:
no failure

How to reproduce it (as minimally and precisely as possible):
run e2e tests

Anything else we need to know?:
https://app.circleci.com/pipelines/github/kudobuilder/kudo/5320/workflows/2781ca5b-9107-4e7c-b2c2-c6ccd121b615/jobs/15615/steps

Environment:

  • Kubernetes version (use kubectl version):
  • Kudo version (use kubectl kudo version):
  • Operator:
  • operatorVersion:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@alenkacz changed the title from "Flaky e2e test" to "Flaky e2e test - terminal-failed-job" on Sep 24, 2020
@ANeumann82
Member

I've had a first look at this, and it doesn't look easy. For some reason k8s does not time out the second job in the e2e test.

The failing test has two parts:

  • A Job that fails after 3 tries and brings the KUDO plan to a FATAL_ERROR state
  • A Job that times out after 60 seconds and brings the KUDO plan to a FATAL_ERROR state

The first part works, but the second part never reaches the timeout. The event log does not show the expected "DeadlineExceeded" event that would indicate the job actually timed out, so I'm not sure where the issue lies. KUDO seems to detect the timeout correctly when it does happen; it's k8s that does not trigger the timeout here.
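
For context, a minimal sketch (not KUDO's actual health-check code; the package and function names are made up): once spec.activeDeadlineSeconds elapses, the Job controller should add a "Failed" condition with reason "DeadlineExceeded" to the Job status, which is the signal a health check would look for.

// Sketch only, not KUDO code. Checks whether the Job controller has marked
// the Job as failed because its active deadline passed.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobDeadlineExceeded reports whether the Job carries a Failed condition
// with reason "DeadlineExceeded".
func jobDeadlineExceeded(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue && c.Reason == "DeadlineExceeded" {
			return true
		}
	}
	return false
}

In the failing run, neither this condition nor the corresponding event ever shows up, which is why the test hangs instead of reaching FATAL_ERROR.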

@zen-dog
Contributor

zen-dog commented Sep 24, 2020

In addition to the above analysis: I've found this suspicious line in the etcd log:

2020-09-23T12:25:12.136185294Z stderr F 2020-09-23 12:25:12.136028 W | 
etcdserver: read-only range request "key:\"/registry/jobs/kudo-test-smiling-eft/timeout-job\" " with result 
"range_response_count:1 size:3361" took too long (215.5448ms) to execute

It seems that the request takes too long to return. I've also counted how many health requests we make for the above job; each one produces a line like:

2020-09-23T12:24:50.115387918Z stderr F 2020/09/23 12:24:50 HealthUtil: job "timeout-job" still running or failed

Within the 5 minutes of the test, we made 1196 such requests, so ~4 req/s. That's not a lot, but it's still somewhat excessive. Otherwise, though, I don't think this is a KUDO issue; it looks more like a kind/docker flake.
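
For illustration only (a minimal controller-runtime sketch, not KUDO's actual reconciler; the type and helper names below are invented): the usual way to keep the health-check request rate bounded is to re-check on a fixed delay via RequeueAfter instead of requeueing immediately.

// Sketch only, assuming a controller-runtime based reconciler.
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type jobHealthReconciler struct{} // hypothetical reconciler

func (r *jobHealthReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	healthy, err := jobIsHealthy(ctx, req) // hypothetical health check
	if err != nil {
		return ctrl.Result{}, err
	}
	if !healthy {
		// Poll again after a fixed delay; this caps the request rate against
		// the apiserver/etcd no matter how long the Job takes to finish.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

func jobIsHealthy(ctx context.Context, req ctrl.Request) (bool, error) {
	// Placeholder for the real check against the Job's status/conditions.
	return false, nil
}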

@alenkacz
Contributor Author

I don't think that PR actually fixed it :)

@alenkacz alenkacz reopened this Sep 25, 2020
@porridge
Member

I think the point is that there isn't much we can do with the previous logs, and we'll reopen once the issue resurfaces with the improved logging in place.

@alenkacz
Contributor Author

Alright. I would keep it open, because people tend to ignore failures in PRs and I want to be reminded via the issue, but let's close it then :)

@ANeumann82
Member

Well, I would have kept it open as well :D The close was not intentional; I hoped I could find the bug and fix it in the linked PR, but the failure seems to happen only rarely.

I don't mind either way - I'm pretty sure we'll find the issue again if the test flakes again :)


@ANeumann82
Member

Even with the additional logging output, there's no obvious explanation for why the timeout does not occur. We should probably leave this open, but I don't think there's much we can do at the moment.
