
Flaky integration tests #4640

Closed
annasong20 opened this issue May 13, 2022 · 14 comments
Labels
kind/bug — Categorizes issue or PR as related to a bug.
kind/flake — Categorizes issue or PR as related to a flaky test.
triage/accepted — Indicates an issue or PR is ready to be actively worked on.

Comments

@annasong20
Contributor

Describe the bug

Many of the integration tests in remoteload_test.go that run `kustomize build` on remote URLs are flaky. They exhibit the intended behavior on my machine, but sporadically fail when run for every PR on the server.

Files that can reproduce the issue

We have observed the following flaky tests:

Expected output

The expected output is written in the test cases.

Actual output

On my local machine, the output is as expected. On the server, the tests mostly pass, but occasionally fail. This logs the output of some of the flaky tests on a server run.

Kustomize version

I ran the tests on the master branch, where HEAD was at commit 22668ea.

Platform

I use macOS. On the server, the tests fail only on macOS, not Linux.

Additional context

Issue #4623 also mentions this flakiness.

@annasong20 added the kind/bug label on May 13, 2022
@k8s-ci-robot added the needs-triage label on May 13, 2022
@rajatgupta24

Hey @annasong20, I saw this issue, and I'm learning how to write Go tests.
Can I work on this?

@annasong20
Contributor Author

@rajatgupta24 Sure, go for it!

@natasha41575
Contributor

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Aug 10, 2022
@natasha41575 added the kind/flake label on Aug 10, 2022
@annasong20
Contributor Author

annasong20 commented Aug 10, 2022

After a flaky run, I found that all flaky tests failed on `git checkout FETCH_HEAD` in the cloner. Given that we run tests concurrently, I believe the flaky tests fail when repos from different tests are cloned concurrently and FETCH_HEAD points to the wrong HEAD.

We can fix this either by changing the `git checkout FETCH_HEAD` line or by running the integration tests sequentially.
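For context, the clone flow in question looks roughly like the following. This is a hedged sketch of my reading of api/internal/git/cloner.go, not a verbatim copy; `run` and `cloneAndCheckout` are illustrative names:

```go
// Rough approximation of the cloner's git-exec sequence (illustrative only).
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a git subcommand inside dir.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// cloneAndCheckout mirrors the shallow fetch + FETCH_HEAD checkout under discussion.
// Each call uses its own temporary directory, so FETCH_HEAD should be isolated per clone.
func cloneAndCheckout(repoURL, ref string) (string, error) {
	dir, err := os.MkdirTemp("", "kustomize-clone-")
	if err != nil {
		return "", err
	}
	steps := [][]string{
		{"init"},
		{"remote", "add", "origin", repoURL},
		{"fetch", "--depth=1", "origin", ref},
		{"checkout", "FETCH_HEAD"}, // the step the flaky tests hang on
	}
	for _, s := range steps {
		if err := run(dir, s...); err != nil {
			return "", fmt.Errorf("git %v in %s: %w", s, dir, err)
		}
	}
	return dir, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: clone-sketch <repo-url> <ref>")
		return
	}
	dir, err := cloneAndCheckout(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cloned into", dir)
}
```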

@natasha41575
Contributor

natasha41575 commented Aug 25, 2022

FYI @annasong20 I looked into it a little bit, and I think you might be able to replace https://github.com/kubernetes-sigs/kustomize/blob/master/api/internal/git/cloner.go#L33-36 with `git fetch origin --depth=1` and `git checkout origin/HEAD`. I'm not 100% sure (I'm not a git expert), so I'll need you to verify.

@annasong20
Contributor Author

/assign

@natasha41575
Contributor

If there isn't a clean, thread-safe way to write the git commands, we can also consider guarding these critical lines of code with a mutex.
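A minimal sketch of that idea, with placeholder names (`cloneMu` and `cloneGuarded` are not existing identifiers in api/internal/git):

```go
// Serialize clones with a package-level mutex so that, even if callers ever
// run in parallel, only one git fetch/checkout sequence executes at a time.
package git

import "sync"

var cloneMu sync.Mutex

// cloneGuarded wraps whatever clone logic we already have.
func cloneGuarded(doClone func() error) error {
	cloneMu.Lock()
	defer cloneMu.Unlock()
	return doClone()
}
```

This trades a bit of test runtime for determinism, and it only matters if the clones can actually run concurrently.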

@natasha41575
Contributor

Per offline discussion, we now think a concurrency issue is unlikely: all the remote tests run in the same package and therefore do not run in parallel.

@annasong20
Contributor Author

My update after looking into this issue some more:

  • Each flaky test fails for the same reason: it gets stuck on git checkout FETCH_HEAD. However, as stated above, this shouldn't be an issue of tests running concurrently. Moreover, even if the tests did run concurrently, each git command should run in its own unique, temporary directory.
  • The git checkout may be timing out because
    • the git fetch on the preceding line isn't behaving correctly, though this is questionable because we check for errors, or
    • some other process is locking the files in the git directory and blocking the git command indefinitely.
  • A failed sub-test will fail the parent test because the same require object is reused. This is a bug I need to fix (see the sketch after this list).
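On the last bullet, a hedged sketch of the kind of fix I have in mind: bind assertions to each sub-test's own *testing.T rather than reusing an object tied to the parent. The test names and URLs below are placeholders, not code from remoteload_test.go:

```go
package remoteload_test

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestRemoteLoadCases(t *testing.T) {
	cases := map[string]string{
		"caseA": "https://example.com/org/repo//dir?ref=main", // placeholder URLs
		"caseB": "https://example.com/org/repo//other?ref=main",
	}
	for name, url := range cases {
		url := url
		t.Run(name, func(t *testing.T) {
			// Reusing a require.Assertions bound to the parent's *testing.T here
			// would call FailNow on the parent when this case fails. Binding to
			// this sub-test's own t keeps the failure scoped to the case.
			req := require.New(t)
			req.NotEmpty(url)
		})
	}
}
```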

@KnVerey
Contributor

KnVerey commented Aug 26, 2022

I'm not sure this fully explains it, but the sheer size of our repo now that it has a docs site again is likely a contributing factor. We should consider preventing submodule initialization and raising the timeout on tests where these aren't important (Kustomize already supports both).

That said, @natasha41575 @annasong20 and I had a discussion about this suite, and I proposed that we step back to the coverage we actually want instead of necessarily fixing the tests in their current form. Here's the tentative plan:
(1) An exhaustive suite of unit tests covering the pipeline from user input to parameters to the URL/params that get passed to git. This probably just means enumerating the permutations and auditing repospec_test.go.
(2) A suite of integration tests that use the file:// protocol and exhaustively exercise all the query parameters we support. We don't have this at all yet, and it would replace most of the tests in remoteload_test.go (see the sketch at the end of this comment).
(3) A small number of protocol tests that are free to use any parameters that make sense (e.g. we probably want to disable submodules and extend the timeout).

@mightyguava if you're able to help us with this, we would greatly appreciate it.
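To make (2) concrete, here is a hedged sketch of what a file://-based test could look like. It assumes repospec accepts file:// URLs and the submodules/timeout query parameters as written; `initLocalRepo` and the test name are illustrative, and the exact krusty API usage should be verified against the current codebase:

```go
package remoteload_test

import (
	"os"
	"os/exec"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
	"sigs.k8s.io/kustomize/api/krusty"
	"sigs.k8s.io/kustomize/kyaml/filesys"
)

// initLocalRepo creates a throwaway local git repo that kustomize can clone
// over the file:// protocol, avoiding any network dependency.
func initLocalRepo(t *testing.T, files map[string]string) string {
	t.Helper()
	dir := t.TempDir()
	for name, content := range files {
		require.NoError(t, os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644))
	}
	for _, args := range [][]string{
		{"init"},
		{"add", "."},
		{"-c", "user.email=ci@example.com", "-c", "user.name=ci", "commit", "-m", "init"},
	} {
		cmd := exec.Command("git", args...)
		cmd.Dir = dir
		out, err := cmd.CombinedOutput()
		require.NoError(t, err, string(out))
	}
	return dir
}

func TestRemoteLoadFileProtocol(t *testing.T) {
	repo := initLocalRepo(t, map[string]string{
		"kustomization.yaml": "resources:\n- configmap.yaml\n",
		"configmap.yaml":     "apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: demo\n",
	})
	// Exercise the query parameters without touching the network. The exact
	// parameter names should be checked against repospec.go.
	url := "file://" + repo + "?submodules=false&timeout=120"
	k := krusty.MakeKustomizer(krusty.MakeDefaultOptions())
	m, err := k.Run(filesys.MakeFsOnDisk(), url)
	require.NoError(t, err)
	yml, err := m.AsYaml()
	require.NoError(t, err)
	require.Contains(t, string(yml), "name: demo")
}
```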

@mightyguava
Contributor

/assign @mightyguava

@mightyguava
Contributor

With #4777 and #4783, can this issue be closed?

@KnVerey
Contributor

KnVerey commented Sep 22, 2022

Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.

/close

@k8s-ci-robot
Contributor

@KnVerey: Closing this issue.

