Implement release blocking job criteria #347

spiffxp · 2018-10-22T21:30:30Z

This is an umbrella issue for followup work to #346

That PR describes aspirational release blocking job criteria. This issue is intended to track the followup work, including:

- ~~assignment of owners to all jobs~~
- ~~descriptions added to all release-master-blocking jobs~~
- ~~propose the creation of sig-foo-alerts@googlegroups.com or reuse of sig-foo-test-failures for all sigs that need to be responsive to test failures~~
- ~~the creation of the release-informing dashboard, and moving jobs out of release-master-blocking to that dashboard~~
- a bigquery run that generates metrics for jobs currently on the release-master-blocking dashboard

/sig release
this is sig-release policy
/sig testing
this will be assisted by sig-testing tooling

EDIT 2019-07-23: AFAIK metrics is the only thing that remains to close this out

jberkus · 2018-10-23T00:02:50Z

Are we consolidating all non-blocking dashboards into -informing?

I'd be in favor of that.

BenTheElder · 2018-12-18T22:23:31Z

/cc

spiffxp · 2019-01-03T03:07:20Z

/milestone v1.14

I would like for us to implement this for the v1.14 release

fejta-bot · 2019-04-03T03:35:19Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

justaugustus · 2019-04-29T00:40:23Z

/remove-lifecycle stale

justaugustus · 2019-05-01T11:33:35Z

/help
/milestone v1.15

spiffxp · 2019-05-07T16:25:17Z

> assignment of owners to all jobs
>
> descriptions added to all release-master-blocking jobs
>
> propose the creation of sig-foo-alerts@googlegroups.com or reuse of sig-foo-test-failures for all sigs that need to be responsive to test failures

/assign
I'm handling this in #441

> the creation of the release-informing dashboard, and moving jobs out of release-master-blocking to that dashboard

this is done

> a bigquery run that generates metrics for jobs currently on the release-master-blocking dashboard

I would recommend someone look at https://github.com/kubernetes/test-infra/tree/master/metrics for this

jeefy · 2019-08-12T16:21:23Z

/assign @jberkus

wojtek-t · 2019-08-13T07:16:27Z

> This is sort of the mechanism by which people review PRs ... I really hope you are not missing 75% or more of these 😕

I actually bet I do if people don't ping me directly. I just opened a list of those where I was reviewer and I didn't see the majority of those. If something is critical and really requires my attention, I should be at the very least assigned as approver (those I'm trying to follow, but I'm also missing some percentage of those).

> That's not really reasonable, policies should be expected to evolve, not be set in stone, and tomorrow I may introduce some unforeseen form of testing that we need to account for.
> I also don't think we should be trying to handle soak testing at all right now. There's no useful soak testing on which to inform policy.

I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

> I'm not seeing an actual resolution in this thread,

There was an hour-long discussion during the Zoom meeting, where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

BenTheElder · 2019-08-13T19:01:09Z

> I actually bet I do if people don't ping me directly. I just opened a list of those where I was reviewer and I didn't see the majority of those. If something is critical and really requires my attention, I should be at the very least assigned as approver (those I'm trying to follow, but I'm also missing some percentage of those).

This is a separate discussion but ... this kinda defeats the purpose of having assigned reviewers. Please consider dropping a note to contribex about why / how this doesn't work so we can fix this. :/

> I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

There has been no indication that we will have soak tests.

What we call soak tests now are easily among the longest failing tests we have and likely to be removed in the near future... These have been relatively unmaintained for on the order of year(s). Who is prioritizing them? I've heard zero discussion of this in SIG-Testing or SIG-Release ...

> There was a 1-hour-long discussion that happened during the zoom meeting. Where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

Excellent. 👍

spiffxp · 2019-08-16T00:42:19Z

I've got an initial attempt at a dashboard that displays metrics relevant to release-blocking criteria: kubernetes/test-infra#13879 (comment)

http://velodrome.k8s.io/dashboard/db/job-health-release-blocking

spiffxp · 2019-08-16T00:44:25Z

I caught that the bazel jobs were postsubmits that didn't meet the "scheduled at least every 3 hours" criterion, so I swapped them with periodics that did in kubernetes/test-infra#13907
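
A minimal sketch of how a schedule criterion like this could be checked, assuming it is read as "a run starts at least every 3 hours". The function name `meets_schedule_criterion` and the sample start times are hypothetical and are not taken from test-infra; a postsubmit only runs when something merges, so its gaps can be arbitrarily long.

```python
from datetime import datetime, timedelta

# Assumption: the criterion is read as "a run starts at least every 3 hours".
MAX_GAP = timedelta(hours=3)


def meets_schedule_criterion(start_times, max_gap=MAX_GAP):
    """Return True if no two consecutive runs start more than max_gap apart."""
    ordered = sorted(start_times)
    if len(ordered) < 2:
        # A single run (or none) is not evidence of a regular schedule.
        return False
    return all(later - earlier <= max_gap
               for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical data: a periodic job that starts every 2 hours, and a
# postsubmit whose runs depend on when changes happen to merge.
day = datetime(2019, 8, 15)
periodic = [day + timedelta(hours=2 * i) for i in range(6)]
postsubmit = [day, day + timedelta(hours=1), day + timedelta(hours=9)]

print(meets_schedule_criterion(periodic))    # True
print(meets_schedule_criterion(postsubmit))  # False
```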

spiffxp · 2019-08-16T00:46:37Z

The serial job takes way too long, and is failing due to timeout. I think we should kick out egregious offenders, and encourage a pattern of adding a feature-specific job if the feature is truly necessary to be release-blocking.

Opened issues to start this for:

- kubelet resource tracking tests: The serial regular resource usage tracking e2e tests are too slow kubernetes/kubernetes#81490
- HPA tests: The serial HPA e2e tests are too slow kubernetes/kubernetes#81491

tpepper · 2019-08-16T17:50:19Z

/assign @tpepper

guineveresaenger · 2019-08-16T17:54:15Z

/assign @guineveresaenger

spiffxp · 2019-08-16T18:02:16Z

@msau42 is looking to split out serial storage tests into another release-blocking job as well, in kubernetes/test-infra#13936

I feel like splitting tests into more parallel blocking jobs is a sound approach for now. But it's only going to get us so far before we run into new limits:

- we pick up additional overhead of standup/teardown of yet another cluster
- more jobs = more opportunity for one of them to flake, so overall reliability of master-blocking may go down
- what other jobs should be running these tests? eg: skew, upgrade/downgrade, scale, other envs, etc
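
To put rough numbers on the reliability point above, here is a small sketch. It assumes every job flakes independently and passes at the same rate; both are simplifications, and the 99% pass rate is a hypothetical figure, not a measurement from any dashboard.

```python
def all_pass_probability(per_job_pass_rate, job_count):
    """Probability that every job passes, assuming independent flakes."""
    return per_job_pass_rate ** job_count


# Hypothetical: each blocking job passes 99% of the time on its own.
for jobs in (1, 5, 10, 20):
    print(jobs, round(all_pass_probability(0.99, jobs), 3))
```

Under these assumptions the chance of an all-green run falls as jobs are added, even though no individual job got any worse.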

At some point it's worth questioning why these tests need to be release-blocking, and if there is some sort of bar they should be held to. We presumably do this for Conformance tests, though IMO it's not as rigorously measured as it could be, and relies on extensive human review.

jberkus · 2019-08-27T20:56:12Z

Not completed despite the merger of #752, mostly because there are still some unanswered issues:

- Revise Blocking criteria around flakiness #773
- Decide on procedure for jobs entering (and leaving) Blocking #774
- Document process/criteria by which we decide that Informing failures are tolerable #775

alejandrox1 · 2019-09-11T16:24:36Z

will start with #775
/assign

alejandrox1 · 2019-10-07T16:30:16Z

/remove-help
/milestone v1.17

guineveresaenger · 2019-10-21T16:41:23Z

Closing in favor of #773, #774, #775.

Thanks everyone!

spiffxp added the sig/release, sig/testing, and priority/important-longterm labels Oct 22, 2018

spiffxp mentioned this issue Oct 22, 2018

Add release blocking job criteria #346

Merged

spiffxp changed the title ~~Implement release block job criteria~~ Implement release blocking job criteria Oct 22, 2018

spiffxp mentioned this issue Oct 30, 2018

Demote gke jobs from release-blocking kubernetes/test-infra#9959

Merged

jberkus mentioned this issue Dec 4, 2018

Decide on criteria for grouping release tests into blocking/informing/extra #405

Closed

spiffxp added this to the v1.14 milestone Jan 3, 2019

This was referenced Jan 3, 2019

Different set of blocking jobs on stable1, 2 and 3 releases kubernetes/test-infra#9363

Closed

testgrid sig-release-master-update descriptions incorrect kubernetes/test-infra#9378

Closed

This was referenced Jan 12, 2019

Identify and assign release-blocking job owners #441

Closed

Re-arrange SIG-Master tests into Blocking and Informing kubernetes/test-infra#10505

Closed

spiffxp mentioned this issue Jan 25, 2019

sig-release-master-uprade-optional is now google-gke-upgrade kubernetes/test-infra#10965

Merged

justaugustus added this to Needs Grooming in SIG Release Mar 19, 2019

mariantalla mentioned this issue Mar 28, 2019

[Tracking] Consolidate and automate creation of sig-release owned testgrid dashboards kubernetes/test-infra#11977

Closed

4 tasks

k8s-ci-robot added the lifecycle/stale label Apr 3, 2019

k8s-ci-robot removed the lifecycle/stale label Apr 29, 2019

justaugustus added this to To do in Release Team via automation May 1, 2019

k8s-ci-robot modified the milestones: v1.14, v1.15 May 1, 2019

k8s-ci-robot added the help wanted label May 1, 2019

k8s-ci-robot assigned spiffxp May 7, 2019

spiffxp mentioned this issue May 14, 2019

Define criteria for a job to be release-blocking #24

Closed

jberkus mentioned this issue Aug 9, 2019

Update the Blocking Jobs document and the CI Signal Handbook based on… #752

Merged

k8s-ci-robot assigned jberkus Aug 12, 2019

spiffxp mentioned this issue Aug 13, 2019

Implement kubernetes job health dashboard kubernetes/test-infra#13879

Closed

k8s-ci-robot assigned tpepper Aug 16, 2019

k8s-ci-robot assigned guineveresaenger Aug 16, 2019

This was referenced Aug 16, 2019

The serial regular resource usage tracking e2e tests are too slow kubernetes/kubernetes#81490

Closed

The serial HPA e2e tests are too slow kubernetes/kubernetes#81491

Closed

k8s-ci-robot assigned alejandrox1 Sep 11, 2019

guineveresaenger unassigned spiffxp, tpepper and guineveresaenger Oct 7, 2019

k8s-ci-robot removed the help wanted label Oct 7, 2019

k8s-ci-robot modified the milestones: v1.16, v1.17 Oct 7, 2019

guineveresaenger closed this as completed Oct 21, 2019

SIG Release automation moved this from In progress to Done (1.17) Oct 21, 2019

Release Team automation moved this from In progress to Done (1.17) Oct 21, 2019
