
Flaky integration tests #4640

Closed
annasong20 opened this issue May 13, 2022 · 14 comments
Labels
kind/bug — Categorizes issue or PR as related to a bug.
kind/flake — Categorizes issue or PR as related to a flaky test.
triage/accepted — Indicates an issue or PR is ready to be actively worked on.

Comments

@annasong20
Contributor

Describe the bug

Many of the integration tests in remoteload_test.go that run `kustomize build` on remote URLs are flaky. They exhibit the intended behavior on my machine, but sporadically fail when run for every PR on the server.

Files that can reproduce the issue

We have observed the following flaky tests:

Expected output

The expected output is written in the test cases.

Actual output

On my local machine, the output is as expected. On the server, the tests mostly pass, but occasionally fail. This logs the output of some of the flaky tests on a server run.

Kustomize version

I ran the tests on the master branch, where HEAD was at commit 22668ea.

Platform

I use macOS. On the server, the tests fail only on macOS, not Linux.

Additional context

Issue #4623 also mentions this flakiness.

@annasong20 added the kind/bug label on May 13, 2022
@k8s-ci-robot added the needs-triage label on May 13, 2022
@rajatgupta24

Hey @annasong20, I saw this issue, and I'm learning how to write Go tests.
Can I work on this?

@annasong20
Contributor Author

@rajatgupta24 Sure, go for it!

@natasha41575
Contributor

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Aug 10, 2022
@natasha41575 added the kind/flake label on Aug 10, 2022
@annasong20
Contributor Author

annasong20 commented Aug 10, 2022

After a flaky run, I found that all flaky tests failed on `git checkout FETCH_HEAD` in the cloner. Given that we run tests concurrently, I believe the flaky tests fail when repos from different tests are cloned concurrently and FETCH_HEAD points to the wrong HEAD.

We can fix this either by changing the `git checkout FETCH_HEAD` line or by running the integration tests sequentially.
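For context, the clone flow in question looks roughly like the following. This is a hedged sketch of my reading of api/internal/git/cloner.go, not a verbatim copy; `run` and `cloneAndCheckout` are illustrative names:

```go
// Rough approximation of the cloner's git-exec sequence (illustrative only).
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a git subcommand inside dir.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// cloneAndCheckout mirrors the shallow fetch + FETCH_HEAD checkout under discussion.
// Each call uses its own temporary directory, so FETCH_HEAD should be isolated per clone.
func cloneAndCheckout(repoURL, ref string) (string, error) {
	dir, err := os.MkdirTemp("", "kustomize-clone-")
	if err != nil {
		return "", err
	}
	steps := [][]string{
		{"init"},
		{"remote", "add", "origin", repoURL},
		{"fetch", "--depth=1", "origin", ref},
		{"checkout", "FETCH_HEAD"}, // the step the flaky tests hang on
	}
	for _, s := range steps {
		if err := run(dir, s...); err != nil {
			return "", fmt.Errorf("git %v in %s: %w", s, dir, err)
		}
	}
	return dir, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: clone-sketch <repo-url> <ref>")
		return
	}
	dir, err := cloneAndCheckout(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cloned into", dir)
}
```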

@natasha41575
Contributor

natasha41575 commented Aug 25, 2022

FYI @annasong20 I looked into it a little bit, and I think you might be able to replace https://github.com/kubernetes-sigs/kustomize/blob/master/api/internal/git/cloner.go#L33-36 with `git fetch origin --depth=1` and `git checkout origin/HEAD`. I'm not 100% sure (I'm not a git expert), so I'll need you to verify.

@annasong20
Contributor Author

/assign

@natasha41575
Contributor

If there isn't a clean, thread-safe way to write the git commands, we can also consider guarding these critical lines of code with a mutex.
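A minimal sketch of that idea, with placeholder names (`cloneMu` and `cloneGuarded` are not existing identifiers in api/internal/git):

```go
// Serialize clones with a package-level mutex so that, even if callers ever
// run in parallel, only one git fetch/checkout sequence executes at a time.
package git

import "sync"

var cloneMu sync.Mutex

// cloneGuarded wraps whatever clone logic we already have.
func cloneGuarded(doClone func() error) error {
	cloneMu.Lock()
	defer cloneMu.Unlock()
	return doClone()
}
```

This trades a bit of test runtime for determinism, and it only matters if the clones can actually run concurrently.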

@natasha41575
Contributor

Per offline discussion, we now think a concurrency issue is unlikely: all the remote tests run in the same package and therefore do not run in parallel.

@annasong20
Contributor Author

My update after looking into this issue some more:

  • Each flaky test fails for the same reason: it gets stuck on git checkout FETCH_HEAD. However, as stated above, this shouldn't be an issue of tests running concurrently. Moreover, even if the tests did run concurrently, each git command should run in its own unique, temporary directory.
  • The git checkout may be timing out because
    • the git fetch on the preceding line isn't behaving correctly, though this is questionable because we check for errors, or
    • some other process is locking the files in the git directory and blocking the git command indefinitely.
  • A failed sub-test will fail the parent test because the same require object is reused. This is a bug I need to fix (see the sketch after this list).
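On the last bullet, a hedged sketch of the kind of fix I have in mind: bind assertions to each sub-test's own *testing.T rather than reusing an object tied to the parent. The test names and URLs below are placeholders, not code from remoteload_test.go:

```go
package remoteload_test

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestRemoteLoadCases(t *testing.T) {
	cases := map[string]string{
		"caseA": "https://example.com/org/repo//dir?ref=main", // placeholder URLs
		"caseB": "https://example.com/org/repo//other?ref=main",
	}
	for name, url := range cases {
		url := url
		t.Run(name, func(t *testing.T) {
			// Reusing a require.Assertions bound to the parent's *testing.T here
			// would call FailNow on the parent when this case fails. Binding to
			// this sub-test's own t keeps the failure scoped to the case.
			req := require.New(t)
			req.NotEmpty(url)
		})
	}
}
```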

@KnVerey
Contributor

KnVerey commented Aug 26, 2022

I'm not sure this fully explains it, but the sheer size of our repo now that it has a docs site again is likely a contributing factor. We should consider preventing submodule initialization and raising the timeout on tests where these aren't important (Kustomize already supports both).

That said, @natasha41575 @annasong20 and I had a discussion about this suite, and I proposed that we step back to the coverage we actually want instead of necessarily fixing the tests in their current form. Here's the tentative plan:
(1) An exhaustive suite of unit tests covering the pipeline from user input to parameters to the URL/params that get passed to git. This probably just means enumerating the permutations and auditing repospec_test.go.
(2) A suite of integration tests that use the file:// protocol and exhaustively exercise all the query parameters we support. We don't have this at all yet, and it would replace most of the tests in remoteload_test.go (see the sketch at the end of this comment).
(3) A small number of protocol tests that are free to use any parameters that make sense (e.g. we probably want to disable submodules and extend the timeout).

@mightyguava if you're able to help us with this, we would greatly appreciate it.
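To make (2) concrete, here is a hedged sketch of what a file://-based test could look like. It assumes repospec accepts file:// URLs and the submodules/timeout query parameters as written; `initLocalRepo` and the test name are illustrative, and the exact krusty API usage should be verified against the current codebase:

```go
package remoteload_test

import (
	"os"
	"os/exec"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
	"sigs.k8s.io/kustomize/api/krusty"
	"sigs.k8s.io/kustomize/kyaml/filesys"
)

// initLocalRepo creates a throwaway local git repo that kustomize can clone
// over the file:// protocol, avoiding any network dependency.
func initLocalRepo(t *testing.T, files map[string]string) string {
	t.Helper()
	dir := t.TempDir()
	for name, content := range files {
		require.NoError(t, os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644))
	}
	for _, args := range [][]string{
		{"init"},
		{"add", "."},
		{"-c", "user.email=ci@example.com", "-c", "user.name=ci", "commit", "-m", "init"},
	} {
		cmd := exec.Command("git", args...)
		cmd.Dir = dir
		out, err := cmd.CombinedOutput()
		require.NoError(t, err, string(out))
	}
	return dir
}

func TestRemoteLoadFileProtocol(t *testing.T) {
	repo := initLocalRepo(t, map[string]string{
		"kustomization.yaml": "resources:\n- configmap.yaml\n",
		"configmap.yaml":     "apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: demo\n",
	})
	// Exercise the query parameters without touching the network. The exact
	// parameter names should be checked against repospec.go.
	url := "file://" + repo + "?submodules=false&timeout=120"
	k := krusty.MakeKustomizer(krusty.MakeDefaultOptions())
	m, err := k.Run(filesys.MakeFsOnDisk(), url)
	require.NoError(t, err)
	yml, err := m.AsYaml()
	require.NoError(t, err)
	require.Contains(t, string(yml), "name: demo")
}
```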

@mightyguava
Contributor

/assign @mightyguava

@mightyguava
Contributor

With #4777 and #4783, can this issue be closed?

@KnVerey
Contributor

KnVerey commented Sep 22, 2022

Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.

/close

@k8s-ci-robot
Contributor

@KnVerey: Closing this issue.

