Label flaky e2es with [Flaky] & slow tests with [Slow] #19021
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. |
Labelling this PR as size/M |
GCE e2e build/test failed for commit d7e1a8604f2392ab8a2b6091ed3638c8a394dd49. |
@spxtr this is ready for review. |
GCE e2e test build/test passed for commit 436574f4233ef7ba0de5fc522cec00ba11f61df5. |
GCE e2e test build/test passed for commit e96a00225b9eb4f33f0df32f91fb92ec88ebeb84. |
GCE e2e test build/test passed for commit 5ee8850. |
LGTM but I haven't tested it. |
@k8s-oncall: currently the PR builder is totally shot because stuff I changed in #18900 was coupled with these skip lists. I'm going to manually merge this so that hopefully we can get back to where we were. |
Label flaky e2es with [Flaky] & slow tests with [Slow]
Squash? |
@mikedanese I didn't want to squash because I wanted it to be clear what I had done, (in particular, the last two commits I wanted to keep clearly apart, since they're fairly invasive). |
Why is this preferable to the FLAKY env vars. Those are much easier to track. I would prefer if we keep the test labels to invariant properties of tests |
Why are they much easier to track? I (and perhaps I'm the exception) think this is easier, because this sets the metadata about the test in the test itself, rather than in some random file elsewhere in the repo.
I see flakiness as an invariant property of a test; if we have to change the test or the code proper to make the test not flaky, that PR should similarly update the label to call it not flaky anymore. |
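To make the label-based scheme under discussion concrete: the label lives in the Ginkgo test name itself, and a runner skips labeled tests with a regex. Here is a minimal sketch in shell; the test names are made up for illustration, and this is not the actual hack/jenkins runner logic:

```shell
#!/usr/bin/env bash
# Hypothetical test names; in the e2e suite the label is part of the
# Ginkgo Describe/It string, e.g. "DaemonRestart [Flaky] ...".
tests=(
  "Pods should be restarted"
  "DaemonRestart [Flaky] survives a restart"
  "Density [Slow] starts many pods"
)

# Skip anything labeled [Flaky] or [Slow], mirroring a runner flag
# like --ginkgo.skip="\[(Flaky|Slow)\]".
skip='\[(Flaky|Slow)\]'

for t in "${tests[@]}"; do
  if [[ "$t" =~ $skip ]]; then
    echo "SKIP: $t"
  else
    echo "RUN:  $t"
  fi
done
```

The point of the thread: changing a test's flakiness means editing the test, so the label sits right next to the code that must change.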
@quinton-hoole had a comment on a related issue, following up here.
Looking at this PR, actually, the partitioning of flaky tests isn't/wasn't a function of environment: everything just skipped the same sets. I'm trying to reduce the number of dimensions by which we partition the e2e tests so that we can more sanely keep track of what's running where. I'm going to keep working on labels for the other dimensions. |
@ihmccreery What you're seeing is in fact a form of inheritance, and does in fact make sense if you think about it a bit. GCE_FLAKY_TESTS are the tests that are flaky on GCE, whether they run in serial, parallel, GKE, non-GKE etc. So that's why you see them being inherited. Make sense? |
@ihmccreery what would perhaps make sense is to restructure the inheritance hierarchy something like the following:
etc. |
To tell the truth there is a form of inheritance, but it comes from laziness. I created GKE_FLAKY suite when we started to have tests that are stable on GCE and very flaky on GKE. There was no well defined semantics, and we used GCE_FLAKY as things that are generally flaky, and GKE_FLAKY for things that are flaky on GKE (i.e. all the rules for GKE tests were adding those two sets, and GCE tests were using GCE only list). I see that @quinton-hoole responded as well:) |
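The "inheritance" described above can be sketched as regex concatenation; the variable names come from the thread, but the values here are illustrative, not the real skip lists from the Jenkins scripts:

```shell
#!/usr/bin/env bash
# Illustrative values; the real lists lived in the Jenkins e2e scripts.
GCE_FLAKY_TESTS="Services.*change.the.type|DaemonRestart"
GKE_FLAKY_TESTS="ServiceLoadBalancer"

# GCE suites skipped only the GCE list...
GCE_SKIP="${GCE_FLAKY_TESTS}"
# ...while GKE suites "inherited" GCE flakes by concatenating both lists
# into one skip regex.
GKE_SKIP="${GCE_FLAKY_TESTS}|${GKE_FLAKY_TESTS}"

echo "GKE skip regex: ${GKE_SKIP}"
```

This is the implicit hierarchy @quinton-hoole describes: a GKE run skips everything flaky on GCE plus everything flaky only on GKE.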
tl;dr: @quinton-hoole you're correct, but I think it falls into YAGNI. Your proposal seems reasonable to me, but, like I said, it's an added level of complexity in an already (IMO) overly-complex system of partitioning tests. In my (limited) experience, our team hasn't really experienced any pain from not being able to label where precisely a test is and isn't flaky; indeed, we don't even really have great tooling to determine such a thing, even if we could label it. On the flip side, we've experienced quite a lot of pain from trying to manage a very complex system of partitioning tests, not having tests running in certain places, etc. I propose we keep a single, monolithic [Flaky] label. |
@ihmccreery - we already had GKE only flaky tests. What do you suggest we do with them - do we keep them running and block our build, turn them off and miss a possible regression, or don't block on GKE suite failures? |
No, the GKE flaky tests were a subset of the GCE flaky tests, so the skipped tests were actually identical in both environments. I'm suggesting that if a test is flaky in GKE, mark it as [Flaky]. |
(This is exactly the kind of confusion I'm trying to avoid.) |
@ihmccreery The existing regex scheme was put in place precisely because we needed it to disable tests that only flaked when run in parallel, or only when run on GKE, or only when run on AWS etc. You can collapse that all into [Flaky], but you will lose a lot of fidelity if you do. Over to @ixdy to continue this review, as I'm not supposed to stick my nose into testing stuff :-) |
At the time, that was the case. That said, this test was on the flaky list anyway. tl;dr: I agree with @ihmccreery that we probably don't need to worry about this, at least not right now. |
I'm OK with removing GKE_FLAKY, as it's not very useful now. OTOH, PARALLEL_FLAKY is still a thing. If I understand your proposal right, you want to have a single suite of 'flaky' tests, running exactly as they do in the 'normal' suite (e.g. in parallel). It's a valid idea, as it puts a bit more pressure on test owners. What's more, it will require that tests be better written; I'm only afraid of the engineering cost of fixing the existing ones... So generally, after some thought, I actually like the idea of a single "flaky" suite. The main drawback I see is the cost of fixing things that are currently flaky only when run in parallel. @ihmccreery - I'm a huge fan of this effort and your work. |
Continued work on #10548.
Notably, I'm collapsing:
GKE_FLAKY_TESTS
GCE_FLAKY_TESTS
GCE_PARALLEL_FLAKY_TESTS
into one label, [Flaky]. If a test is flaky in the parallel run, then perhaps it's disruptive and should be marked as such, or it's flaky. (There was one set of tests, DaemonRestart, in GCE_PARALLEL_FLAKY_TESTS that was already in [Disruptive], so I just kicked it out of flaky entirely.)