Flaky e2e test - terminal-failed-job #1691

Open
alenkacz opened this issue Sep 24, 2020 · 8 comments · Fixed by #1693

Comments

@alenkacz
Contributor

What happened:
Failed test kudo/harness/terminal-failed-job on a PR introducing a KEP (no code)

What you expected to happen:
no failure

How to reproduce it (as minimally and precisely as possible):
run e2e tests

Anything else we need to know?:
https://app.circleci.com/pipelines/github/kudobuilder/kudo/5320/workflows/2781ca5b-9107-4e7c-b2c2-c6ccd121b615/jobs/15615/steps

Environment:

  • Kubernetes version (use kubectl version):
  • Kudo version (use kubectl kudo version):
  • Operator:
  • operatorVersion:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@alenkacz changed the title from "Flaky e2e test" to "Flaky e2e test - terminal-failed-job" on Sep 24, 2020
@ANeumann82
Member

I've had a first look at this, and it doesn't look easy. For some reason k8s does not time out the second job in the e2e test.

The failing test has two parts:

  • A Job that fails after 3 tries and brings the KUDO plan to a FATAL_ERROR state
  • A Job that times out after 60 seconds and brings the KUDO plan to a FATAL_ERROR state

The first part works, but the second part never reaches the timeout. The event log does not show the expected "DeadlineExceeded" event that would indicate the job actually timed out, so I'm not sure where the issue lies. KUDO seems to detect the timeout correctly when it does happen; it's k8s that does not trigger the timeout here.
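
For context, a minimal sketch (not KUDO's actual health-check code; the package and function names are made up): once spec.activeDeadlineSeconds elapses, the Job controller should add a "Failed" condition with reason "DeadlineExceeded" to the Job status, which is the signal a health check would look for.

// Sketch only, not KUDO code. Checks whether the Job controller has marked
// the Job as failed because its active deadline passed.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobDeadlineExceeded reports whether the Job carries a Failed condition
// with reason "DeadlineExceeded".
func jobDeadlineExceeded(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue && c.Reason == "DeadlineExceeded" {
			return true
		}
	}
	return false
}

In the failing run, neither this condition nor the corresponding event ever shows up, which is why the test hangs instead of reaching FATAL_ERROR.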

@zen-dog
Contributor

zen-dog commented Sep 24, 2020

In addition to the above analysis: I've found this suspicious line in the etcd log:

2020-09-23T12:25:12.136185294Z stderr F 2020-09-23 12:25:12.136028 W | 
etcdserver: read-only range request "key:\"/registry/jobs/kudo-test-smiling-eft/timeout-job\" " with result 
"range_response_count:1 size:3361" took too long (215.5448ms) to execute

It seems that the request takes too long to return. I've also counted how many health requests we make for the above job; each one produces a line like:

2020-09-23T12:24:50.115387918Z stderr F 2020/09/23 12:24:50 HealthUtil: job "timeout-job" still running or failed

Within the 5 minutes of the test, we made 1196 such requests, so ~4 req/s. That's not a lot, but it's still somewhat excessive. Otherwise, though, I don't think this is a KUDO issue; it looks more like a kind/docker flake.
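
For illustration only (a minimal controller-runtime sketch, not KUDO's actual reconciler; the type and helper names below are invented): the usual way to keep the health-check request rate bounded is to re-check on a fixed delay via RequeueAfter instead of requeueing immediately.

// Sketch only, assuming a controller-runtime based reconciler.
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type jobHealthReconciler struct{} // hypothetical reconciler

func (r *jobHealthReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	healthy, err := jobIsHealthy(ctx, req) // hypothetical health check
	if err != nil {
		return ctrl.Result{}, err
	}
	if !healthy {
		// Poll again after a fixed delay; this caps the request rate against
		// the apiserver/etcd no matter how long the Job takes to finish.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

func jobIsHealthy(ctx context.Context, req ctrl.Request) (bool, error) {
	// Placeholder for the real check against the Job's status/conditions.
	return false, nil
}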

@alenkacz
Contributor Author

I don't think that PR actually fixed it :)

@alenkacz alenkacz reopened this Sep 25, 2020
@porridge
Member

I think the point is that there isn't much we can do with the previous logs, and we'll reopen once the issue resurfaces with the improved logging in place.

@alenkacz
Contributor Author

Alright. I would keep it open, because people tend to ignore failures in PRs and I want to be reminded via the issue, but let's close it then :)

@ANeumann82
Member

Well, I would have kept it open as well :D The close was not intentional; I hoped I could find the bug and fix it in the linked PR, but the failure seems to happen only rarely.

I don't mind either way - I'm pretty sure we'll find the issue again if the test flakes again :)


@ANeumann82
Member

Even with the additional logging output, there's no obvious explanation for why the timeout does not occur. We should probably leave this open, but I don't think there's much we can do at the moment.
