Add release blocking job criteria #346

spiffxp · 2018-10-20T20:42:29Z

The Kubernetes community has developed an over-reliance on e2e tests that are flaky, slow, and seemingly unmaintained. We, the release team, end up spending much of our time rationalizing and explaining away failing or flaky tests as red-but-green, or ignoring meaningful failures under the guise of perpetual-failure being the expected normal state.

This results in delays leading up to release, contributes to the reputation of .0 releases being beta-quality at best, and allows regressions to slip through the cracks in patch releases.

This PR proposes release blocking job criteria based on #24 to help us eliminate noise and improve signal

The criteria spelled out here are less strict than the criteria in that issue but this is based on what the jobs look like today.

Once this PR lands I would like to move the release-blocking jobs toward the criteria listed, and refine as described in the TODO's. e.g.:

take the gke jobs off of release-blocking boards
make a release-master-informing board and migrate jobs to it
update ci signal docs accordingly
hunt down job owners and update job descriptions
propose and implement the use of sig-foo-alerts googlegroups for testgrid alerts
implement a velodrome dashboard that shows how well release blocking jobs are
meeting these criteria

I plan on using #347 as the umbrella issue for that

spiffxp · 2018-10-20T20:43:33Z

/cc @jberkus @AishSundar @krzyzacy

AishSundar · 2018-10-21T06:24:22Z

release-blocking-jobs.md

+We intend to drive the criteria to stricter thresholds over time, e.g.,
+- finishing a run in 60min or less
+- running at least every 2 hours
+- passing at least 90% of all runs1


^^ (s/runs1/runs)

AishSundar · 2018-10-21T06:38:10Z

release-blocking-jobs.md

+are often jobs that are some combination of slow, flaky, infrequent, or
+difficult to maintain.
+
+These jobs may still block the release, but because they require a lot of


I like the idea of moving the "noisy" jobs into a separate dashboard so we dint loose their signal right away and help maintain the "blocking" dashboard in a green state for most part. But I am afraid we will still end up spending the same amount of time and effort chasing the failures in the "informing" dashboard to determine GO/NoGO for a release. I am not sure when and how much to dial down our attention towards the failures here.
It might be worthwhile to add some kind of ownership expectation for these jobs eg: if we dont get attention from Job/Test Owners around failing tests within 'x' days we will not block release on them.

I'm hoping if nothing else it means we spend our time babysitting fewer jobs. These jobs should still have clear owners etc. But I think our decision to block/not-block is the "human interpretation" part I'm trying to call out here. I'll see if I can re-word

AishSundar · 2018-10-21T06:40:59Z

release-blocking-jobs.md

+These jobs may still block the release, but because they require a lot of
+manual, human interpretation, we choose to move them to a separate dashboard.
+
+<!-- TOOD(spiffxp): implement release-master-informing, populate with:


I am assuming we will be opening PRs to move these jobs out with a list of criteria they dont meet?

AishSundar · 2018-10-21T06:42:16Z

release-blocking-jobs.md

+manual, human interpretation, we choose to move them to a separate dashboard.
+
+<!-- TOOD(spiffxp): implement release-master-informing, populate with:
+- gce-master-scale-correctness


this job finishes in 1.5 hrs. Curious as to which criteria it doesnt meet.

I'll have to double check, I may be wrong. I'll reword this so that it's an example not an authoritative list. As you say, I think the discussion for what jobs should be here belongs in a followup PR

that's been a pretty reliable, informative job this cycle. I'd like to keep it.

found it: although the job usually takes ~90min, it only runs once a week

(to run it more frequently would require the allocation of more quota, as right now these jobs share the same project and have have been manually scheduled across days of the week to avoid clashing with each other)

Do we need something for a call for funding for things like this. Is that part of the cncf hosting move?

AishSundar · 2018-10-21T06:48:50Z

release-blocking-jobs.md

-  - kubeadm
-  - kubeadm-$(X-1)-on-$(X)
+  - kubeadm-gce
+  - kubeadm-gce-x.$(y-1)-on-x.y

 ## Testgrid Release Blocking Dashboards


What about the upgrade jobs and the upgrade dashboards ? I understand they are much more thorny. do we want to tackle them separately? FWIW, @justinsb work on splitting the gce and gke upgrade jobs into (i) Serial & Disruptive and (ii) Slow (parallel) jobs have helped in stabilizing the latter.

Practically speaking I think they should remain separate from master while we focus on them. It may be that we decide to move upgrade jobs into blocking/informing down the line.

I don't want to bikeshed on it, but just to document since I went looking:

Previous releases use

release-blocking

release-all

And then the release-all dashboards fall out of maintenance and are almost always red (usually because of upgrade jobs)

What we current have with master is

master-blocking

master-upgrade

master-upgrade-optional

master-kubectl-skew

misc

I haven't audited whether everything not in blocking ~= master-all, I would hope so but suspect not. And then we have the fact that full scale scalability testing is never done on other release branches, only master.

spiffxp · 2018-10-22T17:28:06Z

Discussed during sig-release, general agreement on this, but would still appreciate a pass from @jberkus, and if possible @krzyzacy

krzyzacy · 2018-10-22T17:43:36Z

release-blocking-jobs.md

+
+The ideal release-blocking job meets the following criteria:
+
+- It provisions clusters via open-source software instead of a hosted service


aka, no gke?

cc @jagosan

krzyzacy · 2018-10-22T17:47:08Z

release-blocking-jobs.md

-<!-- TODO(krzyzacy): define blocking criteria -->
+## Release Blocking Criteria
+
+The ideal release-blocking job meets the following criteria:


what if an already-blocking job doesn't, do we kick it out of release blocking?

My goal is for us to use followup PR's to suggest kicking out of blocking. If we decide no, we can't kick out just yet, we'll at least have a documented decision for why it needs to stay

sgtm, maybe also document that here (when/who will need to make that followup kickout PR)

opened #347

krzyzacy · 2018-10-22T17:50:07Z

release-blocking-jobs.md

+
+Jobs that are considered useful or necessary to inform whether a commit on
+master is ready for release, but that clearly do not meet release-blocking 
+criteria, may be placed on the sig-release-master-informing dashboard. These


So if we intend to create sig-release-master-informing, what about per release branch dashboards? is the release-x.y-all sufficient?

I am hoping -all is sufficient, and that master is a special case for now while we work to consolidate into an -all

spiffxp · 2018-10-22T18:41:38Z

/hold
for comment, will revise shortly

krzyzacy · 2018-10-22T21:47:04Z

LGTM besides someone from gke should be ok with gke not blocking oss releases :-)

jagosan · 2018-10-22T22:00:39Z

This is a reasonable direction and aligned with efforts to date. We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

spiffxp · 2018-10-22T22:17:34Z

We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

Agree. I do not think much is lost in terms of release signal by removing these GKE tests from release-blocking. The results will still be available in dashboards such as google-gke and google-gke-staging for those that wish to pay attention to them.

AishSundar · 2018-10-22T22:24:32Z

/lgtm

FWIW we will still continue to monitor the status of the tests in the release-informing dashboard (atleast in 1.13 timeframe) to ensure we are not loosing any critical signal (which doesnt seem to be the case historically for the past 2 releases).

jberkus · 2018-10-22T23:38:39Z

release-blocking-jobs.md


 We use [1.12 release-blocking dashboard](http://k8s-testgrid.appspot.com/sig-release-1.12-blocking)
-as a source of truth:
+as a source of truth. The tab names are autogenerated based on the job names,
+and have the branch/version in their name. For example, for release x.y:


I'd like a comment here that this is just the initial set of tests, and that each of these will be evaluated based on the criteria above. It's really not clear from the description that that's the case, just from your comments.

Hmmmm, never mind, you DO have that above.

jberkus · 2018-10-22T23:40:26Z

/lgtm

I'm really happy with the test criteria. Particularly having overall owners for jobs, not just for the individual tests.

tpepper · 2018-10-23T08:36:11Z

release-blocking-jobs.md

+- gce-master-scale-correctness
+- gce-master-scale-performance
+- gce-cos-master-serial
+- gke-cos-master-serial


Action pending from the 1.12 release retrospective: should we open an issue on analyzing g[ck]e-cos-master-serial versus g[ck]e-cos-master-slow versus g[ck]e-cos-master-default for valuable coverage, quality of signal, support status, etc.? Or put another way, lacking that data and human resources to collect it...is it a beneficial clean up or a problem to simply drop g[ck]e-cos-master-serial instead of moving them to release-master-informing?

I have seen master-serial give us some additional value from time to time vs the master-default one. So there are a few tests spread between the 3 dashboards that give us signal, trick is finding out which ones they are. IOW, I dont think we should drop them completely (atleast right away), but be much more vigilant and aware of what value the various jobs in the informing dashboard are adding, when do we actually find a k8s release blocking issue due to their failures and how responsive the owners are. Once we have that data collected for a release or two we must be in a better place to make a call.

spiffxp · 2018-10-27T19:37:11Z

Holding one more business day for comment, will merge Tuesday 2018-10-30

lavalamp · 2018-10-29T18:53:46Z

release-blocking-jobs.md

+  - represented by a configured testgrid alert
+  - contact info is also part of job description
+-->
+- Fails for no more than 10 runs in a row


These requirements are a little strange-- suppose something is merged that breaks the tests. What we want to happen is that the release is blocked. What would actually happen, interpreting these rules literally, is that we kick all the tests out.

I agree, and would welcome help rewording this. I'm trying to put the stake in sand before setting it in stone. The idea is to punt out jobs that are perpetually failing due to a lack of attention. If they are failing and actively being worked, that's different. The case could also be made that there should be a standard process to follow of revert-before-fix once an offending commit is identified.

I personally think it's actually a bad idea to explicitly spell out the criteria, I think it requires human judgement. I suggest something like:

We intend to remove jobs that are getting an insufficient amount of attention from their owners, i.e., they have test problems that aren't being addressed. We won't punt jobs merely because they are failing, we'll punt them when they are not providing good signal (positive OR negative) AND there's no indication that they will begin providing a reliable signal in relatively short (e.g., 1 week) timeframe. There's no shame in being removed, once the problems have been fixed a job can be added back. In order to prevent the rules from being gamed, we leave it up to humans to determine whether or not a given test meets the criteria.

I'd further add that if it's not obvious or is contentious whether a test job meets the criteria, that sig release and sig testing TLs together as a group should make the decision.

I wouldn't bother defining what should happen if that doesn't work, we should just figure it out if it ever happens.

We're not going to have bots automatically act on this for quite a while (if ever?). Think of these more as triggering conditions for human review. I call out human review as a necessary part of the process to add or remove below.
I'd like to get humans out of the business of having to continually scan jobs and metrics, so they have more time to focus on the business of making decisions. My hope is that by specifying criteria in a manner that can be automated, we can then start focusing more on spelling out the process we should follow.

I added some of lavalamp's language and more strongly emphasized that humans have the final say. I also sketched out a process that we can iterate on in followups

spiffxp · 2018-10-30T17:13:09Z

/hold cancel
this is ready to go

jberkus · 2018-10-30T17:17:44Z

/lgtm

Let's do this.

k8s-ci-robot · 2018-10-30T17:17:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AishSundar, jberkus, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [AishSundar,jberkus,spiffxp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 20, 2018

k8s-ci-robot requested review from calebamiles and jdumars October 20, 2018 20:42

k8s-ci-robot added the sig/release Categorizes an issue or PR as relevant to SIG Release. label Oct 20, 2018

k8s-ci-robot requested review from AishSundar, jberkus and krzyzacy October 20, 2018 20:43

AishSundar reviewed Oct 21, 2018

View reviewed changes

krzyzacy reviewed Oct 22, 2018

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2018

spiffxp mentioned this pull request Oct 22, 2018

Implement release blocking job criteria #347

Closed

k8s-ci-robot assigned AishSundar Oct 22, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 22, 2018

jberkus reviewed Oct 22, 2018

View reviewed changes

k8s-ci-robot assigned jberkus Oct 22, 2018

tpepper reviewed Oct 23, 2018

View reviewed changes

lavalamp reviewed Oct 29, 2018

View reviewed changes

Add release blocking job criteria

423ed58

spiffxp force-pushed the release-blocking-criteria branch from 3deee75 to 423ed58 Compare October 30, 2018 17:11

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2018

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2018

k8s-ci-robot merged commit 05b684a into kubernetes:master Oct 30, 2018

spiffxp deleted the release-blocking-criteria branch October 30, 2018 17:23

jberkus mentioned this pull request Dec 4, 2018

Decide on criteria for grouping release tests into blocking/informing/extra #405

Closed

BenTheElder mentioned this pull request May 1, 2019

Define criteria for a job to be release-blocking #24

Closed

alejandrox1 mentioned this pull request May 22, 2019

Updated release-blocking-job criteria #634

Merged

spiffxp mentioned this pull request Jul 27, 2021

Document criteria for kubernetes/kubernetes merge-blocking jobs kubernetes/test-infra#6380

Open


		The ideal release-blocking job meets the following criteria:

		- It provisions clusters via open-source software instead of a hosted service

Add release blocking job criteria #346

Add release blocking job criteria #346

Conversation

spiffxp commented Oct 20, 2018 • edited Loading

spiffxp commented Oct 20, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffxp commented Oct 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffxp commented Oct 22, 2018

krzyzacy commented Oct 22, 2018

jagosan commented Oct 22, 2018

spiffxp commented Oct 22, 2018

AishSundar commented Oct 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jberkus commented Oct 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffxp commented Oct 27, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffxp commented Oct 30, 2018

jberkus commented Oct 30, 2018

k8s-ci-robot commented Oct 30, 2018

spiffxp commented Oct 20, 2018 •

edited

Loading