
Reduce flakiness of density test #66239

Closed
wojtek-t opened this issue Jul 16, 2018 · 13 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

Based on this run:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/179/build-log.txt
a couple of things went wrong:

  1. Failure to delete an RC shouldn't break the whole test:
    https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/density.go#L365
    [That should just record an error, but not break the whole test.]

  2. If one of the nodes failed for some reason, we should force-delete the pods from it, so that namespace deletion doesn't take 1h and fail after that.
    [It's probably enough to:

  • when deleting an RC fails, list all pods with that selector and force-delete them (see the sketch after this list)]

  3. Probably reduce the timeout for namespace deletion.
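
For illustration, a minimal sketch of the force-deletion fallback from point 2, assuming a recent client-go (context-aware calls); `forceDeletePodsBySelector` is a hypothetical helper, not something that exists in the e2e framework:

```go
package density

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePodsBySelector is a hypothetical fallback for when deleting an RC
// fails or times out: list all pods matching the RC's selector and delete them
// with a zero grace period, so that namespace deletion doesn't hang for an hour
// on pods stuck on a broken node.
func forceDeletePodsBySelector(ctx context.Context, c kubernetes.Interface, ns, selector string) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	grace := int64(0) // force deletion: skip the graceful termination period
	for _, pod := range pods.Items {
		err := c.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})
		if err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```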

/assign @shyamjvs

Shyam - please take a look or delegate.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 16, 2018
@wojtek-t wojtek-t added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jul 16, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 16, 2018
@shyamjvs
Member

Yes, I definitely agree we should improve our deletion logic - it's not very robust currently. I can remember some RC deletion timeout flakes even from our smaller jobs.

Failure to delete an RC shouldn't break the whole test:

I suggested such an idea (in a very rough sense) to you some time ago, if you remember - but in a more general sense, solving for all such flakes instead of addressing them point by point. My idea is roughly the following:

  • whenever some non-critical operation in the test fails, we log it to some flakes.txt file and proceed without failing our tests
  • at the end, as part of the CI job, we detect whether any flakes happened by looking into that file (this would be at the test-infra level)
  • if so (and the count exceeds some allowed threshold), we just mark a separate 'flakes' step in the job as red (instead of making our tests red)
  • we can later process those flakes offline and try to fix them (see the sketch after this list)
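
To make the idea concrete, a minimal sketch of such a flake recorder - the names here (FlakeReport, RecordFlakeIfError, flakes.txt) are illustrative, not the actual framework API:

```go
package flakes

import (
	"fmt"
	"os"
	"sync"
)

// FlakeReport collects non-critical failures instead of failing the test.
type FlakeReport struct {
	mu     sync.Mutex
	flakes []string
}

// RecordFlakeIfError logs err as a flake (if non-nil) and lets the test proceed.
func (r *FlakeReport) RecordFlakeIfError(err error, msg string) {
	if err == nil {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.flakes = append(r.flakes, fmt.Sprintf("%s: %v", msg, err))
}

// Dump writes the recorded flakes to a file (e.g. flakes.txt) that the CI job
// inspects at the test-infra level; the job marks a separate 'flakes' step red
// if the returned count exceeds the allowed threshold.
func (r *FlakeReport) Dump(path string) (int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	for _, flake := range r.flakes {
		fmt.Fprintln(f, flake)
	}
	return len(r.flakes), nil
}
```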

Pros of that approach:

  • we can clearly separate flakes from genuine test failures (so it avoids giving the community a wrong signal about the scalability tests)
  • tests can continue to run until the end and give some useful info instead of failing partway through
  • it will allow us to capture flakes systematically - helping us plan/prioritize their fixes

Cons of that approach:

  • Sometimes genuine scalability issues may be hidden as 'flakes'. While this can happen, my feeling is we should be able to spot that from the flakes.txt file just as we currently do from the build log. Also, we can limit the flake threshold to a very low value (e.g. <5) to avoid masking serious issues.

Wdyt?

@shyamjvs
Member

cc @mborsz (who's currently looking into addressing our test flakiness)

Maciej - Would you be willing to look into this?

@wojtek-t
Member Author

@shyamjvs - yes, I know we were talking about that.
But given the ClusterLoader effort, I don't want to spend too much effort on fixing that.

So if you can do that in a way that we will be able to reuse with ClusterLoader - I'm fine with that.
If not - do the simplest possible thing without fixing all possible problems.

@shyamjvs
Member

But given the ClusterLoader effort, I don't want to spend too much effort on fixing that.

IMO isolating flakes is an independent problem from moving stuff to cluster-loader :)

So if you can do that in a way that we will be able to reuse with ClusterLoader - I'm fine with that.

Yes - I'm trying to make some changes in the test framework that'll also be reusable by cluster-loader.

k8s-github-robot pushed a commit that referenced this issue Jul 24, 2018
Automatic merge from submit-queue (batch tested with PRs 66296, 66382). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add flake-reporting utility to testing framework

One step towards #66239

/cc @wojtek-t @mborsz 
(whoever can review first)

```release-note
NONE
```
@shyamjvs
Member

So with #66296 in, we now have a library available for reporting flakes. The next steps are:

  • Move our tests from using framework.ExpectNoError() to framework.RecordFlakeIfError() wherever it makes sense. This should be done for non-critical operations, e.g. creating/updating/deleting individual RCs, services, etc. (see the illustrative snippet after this list)
  • Fail the test at the end if flakeCount > {some small threshold}
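
As an illustration of that migration (err stands for the result of the actual deletion call; the exact RecordFlakeIfError argument list is an assumption here - check the utility added in #66296 for the real signature):

```go
// Before: a single failed RC deletion aborts the whole density test.
framework.ExpectNoError(err, "deleting RC "+rcName)

// After: the failure is recorded as a flake and the test keeps running;
// the test only fails at the end if the flake count exceeds a small threshold.
framework.RecordFlakeIfError(err, "deleting RC "+rcName)
```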

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2018
@wojtek-t
Member Author

/remove-lifecycle stale

/assign @mborsz

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2019
@wojtek-t
Member Author

While a lot of work has happened here, I will still leave this open since some flakes are not yet fully understood.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2019
@wojtek-t
Member Author

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 29, 2019
@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 16, 2019
@wojtek-t
Member Author

The density test has been merged with the load test in the meantime. They are also pretty stable now.
Closing.
