Stop using ginkgo.flakeAttempts #68091

Open
dims opened this issue Aug 30, 2018 · 37 comments

@dims dims commented Aug 30, 2018

  • Is this really useful? (fuzzer-based tests?)
  • Does this contribute to hiding flakiness in our test suites?
  • Is it time to get rid of it?

(Whether we really need Ginkgo at all is the larger question, I guess!)
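
To make the concern concrete, here is a minimal sketch (not from the real e2e suite; it assumes Ginkgo v1's flakeAttempts setting) of how retries can mask a flaky spec: with --ginkgo.flakeAttempts=2, a spec that fails half the time reports green roughly 75% of the time, so the flake rarely surfaces in CI results.

```go
// flaky_suite_test.go — hypothetical spec for illustration only.
// Run with:  go test . -ginkgo.flakeAttempts=2
package flaky_test

import (
	"math/rand"
	"testing"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestFlaky(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Flaky demo suite")
}

var _ = Describe("a flaky spec", func() {
	It("fails about half the time", func() {
		r := rand.New(rand.NewSource(time.Now().UnixNano()))
		// With the default flakeAttempts=1 this fails ~50% of runs.
		// With flakeAttempts=2 a failed attempt is silently retried, so
		// the suite reports success ~75% of the time (1 - 0.5^2).
		Expect(r.Intn(2)).To(Equal(0))
	})
})
```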

@dims dims commented Aug 30, 2018

/sig testing

@k8s-ci-robot k8s-ci-robot added sig/testing and removed needs-sig labels Aug 30, 2018

@bgrant0607 bgrant0607 commented Aug 30, 2018

It would be good to measure what the impact of this is now. Maybe turn it off in the post-submit tests for a few days after code freeze so that we could observe the impact? Do we have a way to monitor failures in the pre-submit tests?


@BenTheElder BenTheElder commented Aug 30, 2018

We do have testgrid alerting for presubmits, but it's not always configured; the blocking tests should ping the gke-kubernetes-engprod team (and me, actually) directly.


@spiffxp spiffxp commented Aug 30, 2018

I was literally talking about this yesterday with @jdumars and @calebamiles; stop reading my mind.

I think we should turn this off selectively before turning it off globally. I would welcome a job that runs the "normal" and "flaky" tests together, with this setting turned off, so that we could accumulate some data first, ref: kubernetes/test-infra#9208

I don't have enough context yet to answer the broader question about Ginkgo, but anecdotally I haven't heard a lot of love for it.


@fejta-bot fejta-bot commented Nov 28, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


@fejta-bot fejta-bot commented Dec 28, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten


@spiffxp spiffxp commented Jan 4, 2019

We still don't have useful data because the job @dims created has been perpetually failing (due to timeout IIRC)

/remove-lifecycle rotten

/assign @dims @spiffxp
because we worked on it

/assign @mariantalla
has to do with deflaking; giving you a heads-up this way until we have something better to cluster related issues (some combo of sig+kind labels? area label? project board?)


@spiffxp spiffxp commented Jan 4, 2019

/area deflake

@mariantalla mariantalla added this to Inbox-needs prioritization in Deflaking kubernetes e2e tests Jan 8, 2019

@fejta-bot fejta-bot commented Apr 19, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


@BenTheElder BenTheElder commented Apr 19, 2019

/remove-lifecycle stale
someday...


@dims dims commented Apr 26, 2019

/remove-lifecycle stale
/unassign


@fejta-bot fejta-bot commented Oct 24, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


@fejta-bot fejta-bot commented Nov 23, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten


@BenTheElder BenTheElder commented Dec 6, 2019

/remove-lifecycle rotten
cc @liggitt


@spiffxp spiffxp commented Dec 6, 2019

I hacked up http://velodrome.k8s.io/dashboard/db/job-health-release-informing?orgId=1 to use 7d avgs and added the single flake attempt job to get this comparison of the two jobs over the last 6 months

[chart: single-flake-vs-multi-attempt, failure/flake rates of the two jobs over the last 6 months]

If you had asked me before November, I would have said it's clear we should drop the retries based on the difference in failure/flake rates. I still lean in that direction now, since the similarity of the two jobs suggests to me that we're not getting any additional value from the retries.

Another thing we could do is see if the flakes discovered by each job differ appreciably
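
(For anyone reproducing the chart offline: the 7d smoothing is just a trailing-window average over per-day rates. A minimal sketch, with made-up daily flake rates rather than anything pulled from velodrome:)

```go
package main

import "fmt"

// rolling7 returns the trailing 7-day average for each day,
// using a shorter window for the first few days.
func rolling7(daily []float64) []float64 {
	out := make([]float64, len(daily))
	sum := 0.0
	for i, v := range daily {
		sum += v
		if i >= 7 {
			sum -= daily[i-7]
		}
		window := i + 1
		if window > 7 {
			window = 7
		}
		out[i] = sum / float64(window)
	}
	return out
}

func main() {
	// Hypothetical per-day flake rates (fraction of runs that flaked).
	daily := []float64{0.10, 0.30, 0.05, 0.40, 0.20, 0.15, 0.25, 0.35, 0.10}
	fmt.Println(rolling7(daily))
}
```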


@spiffxp spiffxp commented Dec 6, 2019

https://gist.github.com/spiffxp/ff38f6d32c657679b20a81a7a01b7415 - a look at the top 20 flakes caught by each job over the past 8 weeks

We could also try looking at the long tail of flakes caught by each, but I stopped after seeing they both caught at least 120 over the past 8 weeks, and their rankings up top weren't appreciably different.


@spiffxp spiffxp commented Dec 6, 2019

/remove-lifecycle rotten


@spiffxp spiffxp commented Dec 6, 2019

Sigh. So upon closer inspection, the ci-kubernetes-e2e-gci-gce job does not appear to actually use flakeAttempts. So we are essentially comparing the same apple to itself when we compare against the single-flake-attempt job.

To compare apples and oranges, we can look at the characteristics of the PR job vs. the CI job. The flake rate of the CI job is roughly double the flake rate of the PR job.

I still feel like we should rip the bandaid off. Going to get a list of impacted jobs


@spiffxp spiffxp commented Dec 6, 2019

The following jobs explicitly set flakeAttempts to 2:

  • ci-cadvisor-e2e
  • ci-cloud-provider-azure-conformance
  • ci-cloud-provider-azure-conformance-vmss
  • ci-cloud-provider-azure-multiple-zones
  • ci-cloud-provider-azure-serial
  • ci-cloud-provider-azure-slow
  • ci-cloud-provider-azure-slow-vmss
  • ci-containerd-node-e2e
  • ci-containerd-node-e2e-1-2
  • ci-containerd-node-e2e-1-3
  • ci-containerd-node-e2e-features
  • ci-containerd-node-e2e-features-1-2
  • ci-containerd-node-e2e-features-1-3
  • ci-cos-containerd-node-e2e
  • ci-cos-containerd-node-e2e-features
  • ci-cri-containerd-node-e2e
  • ci-cri-containerd-node-e2e-features
  • ci-kubernetes-e2e-aks-engine-azure-1-14-windows
  • ci-kubernetes-e2e-aks-engine-azure-1-15-windows
  • ci-kubernetes-e2e-aks-engine-azure-1-16-windows
  • ci-kubernetes-e2e-aks-engine-azure-1-17-windows
  • ci-kubernetes-e2e-aks-engine-azure-master-staging-windows
  • ci-kubernetes-e2e-aks-engine-azure-master-windows
  • ci-kubernetes-e2e-aws-eks-1-13-conformance
  • ci-kubernetes-e2e-aws-eks-1-13-correctness
  • ci-kubernetes-e2e-gce-private-cluster-correctness
  • ci-kubernetes-e2e-gce-scale-correctness
  • ci-kubernetes-e2e-kops-aws
  • ci-kubernetes-e2e-kops-aws-beta
  • ci-kubernetes-e2e-kops-aws-canary
  • ci-kubernetes-e2e-kops-aws-channelalpha
  • ci-kubernetes-e2e-kops-aws-ena-nvme
  • ci-kubernetes-e2e-kops-aws-ha-uswest2
  • ci-kubernetes-e2e-kops-aws-imagecentos7
  • ci-kubernetes-e2e-kops-aws-imageubuntu1604
  • ci-kubernetes-e2e-kops-aws-networking-kopeio-vxlan
  • ci-kubernetes-e2e-kops-aws-newrunner
  • ci-kubernetes-e2e-kops-aws-sig-cli
  • ci-kubernetes-e2e-kops-aws-stable1
  • ci-kubernetes-e2e-kops-aws-stable2
  • ci-kubernetes-e2e-kops-aws-stable3
  • ci-kubernetes-e2e-kops-aws-updown
  • ci-kubernetes-e2e-kops-aws-vpc-cni
  • ci-kubernetes-e2e-kops-aws-weave
  • ci-kubernetes-e2e-kops-gce
  • ci-kubernetes-e2e-kops-gce-canary
  • ci-kubernetes-e2e-kops-gce-channelalpha
  • ci-kubernetes-e2e-kops-gce-ha
  • pull-cadvisor-e2e
  • pull-cloud-provider-azure-e2e
  • pull-cri-containerd-node-e2e
  • pull-kops-e2e-kubernetes-aws
  • pull-kops-e2e-kubernetes-aws-1-15
  • pull-kops-e2e-kubernetes-aws-1-16
  • pull-kubernetes-e2e-aks-engine-azure-windows
  • pull-kubernetes-e2e-gke
  • pull-kubernetes-e2e-kops-aws
  • pull-kubernetes-node-e2e
  • pull-kubernetes-node-e2e-alpha
  • pull-kubernetes-node-e2e-containerd

The following jobs use GINKGO_TOLERATE_FLAKES which ends up setting flakeAttempts to 2:

  • pull-kubernetes-e2e-containerd-gce
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce-alpha-features
  • pull-kubernetes-e2e-gce-csi-serial
  • pull-kubernetes-e2e-gce-device-plugin-gpu
  • pull-kubernetes-e2e-gce-iscsi
  • pull-kubernetes-e2e-gce-iscsi-serial
  • pull-kubernetes-e2e-gce-rbe
  • pull-kubernetes-e2e-gce-storage-disruptive
  • pull-kubernetes-e2e-gce-storage-slow
  • pull-kubernetes-e2e-gce-storage-slow-rbe
  • pull-kubernetes-e2e-gce-storage-snapshot
  • pull-kubernetes-e2e-gci-gce-autoscaling
  • pull-release-cluster-up
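
(For reference, since both spellings appear in this thread: the jobs in the second list don't set the Ginkgo flag directly; the e2e wrapper script turns GINKGO_TOLERATE_FLAKES=y into flakeAttempts=2. A rough sketch of that mapping, expressed in Go rather than the actual shell wrapper, with illustrative names:)

```go
package main

import (
	"fmt"
	"os"
)

// flakeAttemptsFromEnv mirrors the behavior described above:
// GINKGO_TOLERATE_FLAKES=y means two attempts per spec, otherwise one.
// This is an illustrative sketch, not the real wrapper-script logic.
func flakeAttemptsFromEnv() int {
	if os.Getenv("GINKGO_TOLERATE_FLAKES") == "y" {
		return 2
	}
	return 1
}

func main() {
	// The runner would pass something like this to the test binary.
	fmt.Printf("--ginkgo.flakeAttempts=%d\n", flakeAttemptsFromEnv())
}
```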

@spiffxp spiffxp commented Dec 6, 2019

Options in order from least to most disruptive:

  1. explicitly add flakeAttempts=2 to the ci-kubernetes-e2e-gci-gce job, and wait to see if the flake rate goes down appreciably, and whether this increases job duration... if so, maybe we should keep it
  2. explicitly remove the GINKGO_TOLERATE_FLAKES env var from pull-kubernetes-e2e-gce and see if the flake rate goes up appreciably, and whether this decreases job duration... this will tell us whether we expose more (or different) flakes on a high traffic job
  3. just go ahead and remove it everywhere, and leave it to the community to quarantine and fix whatever flakes rise to the surface (for example, as of today, none of the jobs on release-master-blocking use it, and almost none of the jobs on release-master-informing use it)

The "rip the band aid off" option is 3, and that's what I'd prefer we do. This next month is typically a quiet period, and we would expose just how much flakiness we've been hiding, if any.

/assign @dims
I'm bringing you back in for your opinion here


@liggitt liggitt commented Dec 6, 2019

3. just go ahead and remove it everywhere, and leave it to the community to quarantine and fix whatever flakes rise to the surface (for example, as of today, none of the jobs on release-master-blocking use it, and almost none of the jobs on release-master-informing use it)

The "rip the band aid off" option is 3, and that's what I'd prefer we do. This next month is typically a quiet period, and we would expose just how much flakiness we've been hiding, if any.

That would be my preference as well. To prepare for the cold dose of reality, I'd recommend:

If/when the band-aid comes off, sending out a summary (maybe weekly?) of the top flaking tests and the delta from the last report might help keep visibility on the problem and incentivize action.
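
(Sketching what such a report could look like: a rough tally in Go, where the test names and per-period failure counts are hypothetical placeholders rather than data from testgrid or BigQuery:)

```go
package main

import (
	"fmt"
	"sort"
)

// flakeReport prints the top-N flaking tests for the current period,
// with the change versus the previous period's counts.
func flakeReport(current, previous map[string]int, topN int) {
	names := make([]string, 0, len(current))
	for name := range current {
		names = append(names, name)
	}
	// Highest current-period flake count first.
	sort.Slice(names, func(i, j int) bool { return current[names[i]] > current[names[j]] })

	if topN > len(names) {
		topN = len(names)
	}
	for _, name := range names[:topN] {
		delta := current[name] - previous[name] // a missing key reads as 0
		fmt.Printf("%-45s flakes=%3d delta=%+d\n", name, current[name], delta)
	}
}

func main() {
	// Hypothetical counts; a real report would aggregate CI job results.
	previous := map[string]int{
		"[sig-node] Pods should be restarted": 7,
		"[sig-storage] PV should bind":        12,
	}
	current := map[string]int{
		"[sig-node] Pods should be restarted": 15,
		"[sig-storage] PV should bind":        9,
		"[sig-network] DNS should resolve":    4,
	}
	flakeReport(current, previous, 3)
}
```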


@justaugustus justaugustus commented Dec 6, 2019

/subscribe

@smarterclayton smarterclayton commented Dec 6, 2019

"rip the bandaid off" = Getting my poncho for the hot arterial blood about to go everywhere


@alejandrox1 alejandrox1 commented Dec 7, 2019

/cc


@spiffxp spiffxp commented Dec 9, 2019

I appreciate what @liggitt is asking for, but I'm concerned that blows the scope from "stop retrying flakes" out to "put together better deflaking guides, policies, and tools than we have now", which is going to take a while.

I'd rather not have perfect be the prerequisite for this, but am willing to massage existing content into something more cohesive than what we have.

Looking for existing content on flakes:

Existing dashboards:


@liggitt liggitt commented Dec 9, 2019

Even just an email to sig leads and kubernetes-dev describing the intent, along with those topics/references, would be helpful.


@dims dims commented Dec 10, 2019

+100 to throw the lever after dropping an email heads up @liggitt @spiffxp


@dims dims commented Dec 10, 2019

/me hands @smarterclayton some ski goggles!


@spiffxp spiffxp commented Dec 10, 2019

kubernetes-dev@ thread, proposed merging by Friday the 13th. [ makes spooky noises ]


@BenTheElder BenTheElder commented Dec 17, 2019

pull-kubernetes-e2e-kind (NOT .*kind.*conformance jobs) was setting GINKGO_TOLERATE_FLAKES=y to follow pull-kubernetes-e2e-gce.

This was resolved over the weekend shortly after the main change merged. kubernetes-sigs/kind#1171

This job will also be impacted.

@spiffxp spiffxp moved this from Inbox (unprioritized) to In progress in Deflaking kubernetes e2e tests Dec 20, 2019