
Reduce flakiness of density test #66239

Closed
wojtek-t opened this issue Jul 16, 2018 · 13 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

Based on this run:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/179/build-log.txt
a couple of things went wrong:

  1. Failure to delete an RC shouldn't break the whole test:
    https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/density.go#L365
    [That should just record an error, but not break the whole test.]

  2. If one of the nodes failed for some reason, we should force-delete the pods from it, so that namespace deletion doesn't take 1h and fail after that.
    [It's probably enough to:

  • when deleting an RC fails, list all pods with that selector and force-delete them (see the sketch after this list)]

  3. Probably reduce the timeout for namespace deletion.
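
For illustration, a minimal sketch of the force-deletion fallback from point 2, assuming a recent client-go (context-aware calls); `forceDeletePodsBySelector` is a hypothetical helper, not something that exists in the e2e framework:

```go
package density

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePodsBySelector is a hypothetical fallback for when deleting an RC
// fails or times out: list all pods matching the RC's selector and delete them
// with a zero grace period, so that namespace deletion doesn't hang for an hour
// on pods stuck on a broken node.
func forceDeletePodsBySelector(ctx context.Context, c kubernetes.Interface, ns, selector string) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	grace := int64(0) // force deletion: skip the graceful termination period
	for _, pod := range pods.Items {
		err := c.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})
		if err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```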

/assign @shyamjvs

Shyam - please take a look or delegate.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 16, 2018
@wojtek-t wojtek-t added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jul 16, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 16, 2018
@shyamjvs
Member

Yes, I definitely agree we should improve our deletion logic - it's not very robust currently. I can remember some RC deletion timeout flakes even from our smaller jobs.

Failure to delete an RC shouldn't break the whole test:

I suggested such an idea (in a very rough sense) to you some time ago, if you remember - but in a more general sense, solving for all such flakes instead of addressing them point by point. My idea is roughly the following:

  • whenever some non-critical operation in the test fails, we log it to some flakes.txt file and proceed without failing our tests
  • at the end, as part of the CI job, we detect whether any flakes happened by looking into that file (this would be at the test-infra level)
  • if so (and the count exceeds some allowed threshold), we just mark a separate 'flakes' step in the job as red (instead of making our tests red)
  • we can later process those flakes offline and try to fix them (see the sketch after this list)
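
To make the idea concrete, a minimal sketch of such a flake recorder - the names here (FlakeReport, RecordFlakeIfError, flakes.txt) are illustrative, not the actual framework API:

```go
package flakes

import (
	"fmt"
	"os"
	"sync"
)

// FlakeReport collects non-critical failures instead of failing the test.
type FlakeReport struct {
	mu     sync.Mutex
	flakes []string
}

// RecordFlakeIfError logs err as a flake (if non-nil) and lets the test proceed.
func (r *FlakeReport) RecordFlakeIfError(err error, msg string) {
	if err == nil {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.flakes = append(r.flakes, fmt.Sprintf("%s: %v", msg, err))
}

// Dump writes the recorded flakes to a file (e.g. flakes.txt) that the CI job
// inspects at the test-infra level; the job marks a separate 'flakes' step red
// if the returned count exceeds the allowed threshold.
func (r *FlakeReport) Dump(path string) (int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	for _, flake := range r.flakes {
		fmt.Fprintln(f, flake)
	}
	return len(r.flakes), nil
}
```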

Pros of that approach:

  • we can clearly separate flakes from genuine test failures (so it avoids giving the community a wrong signal about the scalability tests)
  • tests can continue to run until the end and give some useful info instead of failing partway through
  • it will allow us to capture flakes systematically - helping us plan/prioritize their fixes

Cons of that approach:

  • Sometimes genuine scalability issues may be hidden as 'flakes'. While this can happen, my feeling is we should be able to spot that from the flakes.txt file just as we currently do from the build log. Also, we can limit the flake threshold to a very low value (e.g. <5) to avoid masking serious issues.

Wdyt?

@shyamjvs
Member

cc @mborsz (who's currently looking into addressing our test flakiness)

Maciej - Would you be willing to look into this?

@wojtek-t
Member Author

@shyamjvs - yes, I know we were talking about that.
But given the ClusterLoader effort, I don't want to spend too much effort on fixing that.

So if you can do that in a way that we will be able to reuse with ClusterLoader - I'm fine with that.
If not - do the simplest possible thing without fixing all possible problems.

@shyamjvs
Member

But given the ClusterLoader effort, I don't want to spend too much effort on fixing that.

IMO isolating flakes is an independent problem from moving stuff to cluster-loader :)

So if you can do that in a way that we will be able to reuse with ClusterLoader - I'm fine with that.

Yes - I'm trying to make some changes in the test framework that'll also be reusable by cluster-loader.

k8s-github-robot pushed a commit that referenced this issue Jul 24, 2018
Automatic merge from submit-queue (batch tested with PRs 66296, 66382). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add flake-reporting utility to testing framework

One step towards #66239

/cc @wojtek-t @mborsz 
(whoever can review first)

```release-note
NONE
```
@shyamjvs
Member

So with #66296 in, we now have a library available for reporting flakes. The next steps are:

  • Move our tests from using framework.ExpectNoError() to framework.RecordFlakeIfError() wherever it makes sense. This should be done for non-critical operations, e.g. creating/updating/deleting individual RCs, services, etc. (see the illustrative snippet after this list)
  • Fail the test at the end if flakeCount > {some small threshold}
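
As an illustration of that migration (err stands for the result of the actual deletion call; the exact RecordFlakeIfError argument list is an assumption here - check the utility added in #66296 for the real signature):

```go
// Before: a single failed RC deletion aborts the whole density test.
framework.ExpectNoError(err, "deleting RC "+rcName)

// After: the failure is recorded as a flake and the test keeps running;
// the test only fails at the end if the flake count exceeds a small threshold.
framework.RecordFlakeIfError(err, "deleting RC "+rcName)
```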

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2018
@wojtek-t
Member Author

/remove-lifecycle stale

/assign @mborsz

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2019
@wojtek-t
Member Author

While a lot of work has happened here, I will still leave this open since some flakes are not yet fully understood.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2019
@wojtek-t
Member Author

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 29, 2019
@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 16, 2019
@wojtek-t
Member Author

The density test has been merged with the load test in the meantime. They are also pretty stable now.
Closing.
