Add release blocking job criteria #346

Merged
merged 1 commit into kubernetes:master from spiffxp:release-blocking-criteria on Oct 30, 2018

Conversation

@spiffxp
Member

spiffxp commented Oct 20, 2018

The Kubernetes community has developed an over-reliance on e2e tests that are flaky, slow, and seemingly unmaintained. We, the release team, end up spending much of our time rationalizing and explaining away failing or flaky tests as red-but-green, or ignoring meaningful failures under the guise of perpetual-failure being the expected normal state.

This results in delays leading up to release, contributes to the reputation of .0 releases being beta-quality at best, and allows regressions to slip through the cracks in patch releases.

This PR proposes release-blocking job criteria based on #24 to help us eliminate noise and improve signal.

The criteria spelled out here are less strict than the criteria in that issue, but this is based on what the jobs look like today.

Once this PR lands I would like to move the release-blocking jobs toward the criteria listed, and refine as described in the TODOs, e.g.:

  • take the gke jobs off of release-blocking boards
  • make a release-master-informing board and migrate jobs to it
  • update ci signal docs accordingly
  • hunt down job owners and update job descriptions
  • propose and implement the use of sig-foo-alerts googlegroups for testgrid alerts
  • implement a velodrome dashboard that shows how well release blocking jobs are meeting these criteria (see the sketch after this comment)

I plan on using #347 as the umbrella issue for that
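
To make the velodrome bullet above concrete, here is a minimal sketch of what an automated criteria check could look like. It is an illustration only: the `Run` record, the function name, and the default thresholds are assumptions for the sketch, not the exact numbers or data sources proposed in this PR; a real implementation would pull run results from testgrid/prow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Run:
    """One CI run of a job (hypothetical record; real data would come from testgrid/prow)."""
    started: datetime
    duration: timedelta
    passed: bool


def meets_release_blocking_criteria(
    runs: List[Run],
    max_duration: timedelta = timedelta(hours=2),   # placeholder threshold
    max_gap: timedelta = timedelta(hours=3),        # placeholder threshold
    min_pass_rate: float = 0.8,                     # placeholder threshold
    max_consecutive_failures: int = 10,
) -> Dict[str, bool]:
    """Evaluate one job's recent runs against illustrative release-blocking thresholds."""
    if not runs:
        # No runs at all is itself a signal problem; flag every criterion.
        return {"fast_enough": False, "frequent_enough": False,
                "pass_rate_ok": False, "no_long_failure_streak": False}
    runs = sorted(runs, key=lambda r: r.started)
    gaps = [later.started - earlier.started for earlier, later in zip(runs, runs[1:])]
    worst_streak = streak = 0
    for r in runs:
        streak = 0 if r.passed else streak + 1
        worst_streak = max(worst_streak, streak)
    return {
        "fast_enough": all(r.duration <= max_duration for r in runs),
        "frequent_enough": all(g <= max_gap for g in gaps),
        "pass_rate_ok": sum(r.passed for r in runs) / len(runs) >= min_pass_rate,
        "no_long_failure_streak": worst_streak <= max_consecutive_failures,
    }
```

A velodrome-style dashboard could run a check like this per job and chart which criteria pass over time; the point is to surface the data, not to act on it automatically.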

@spiffxp

Member

spiffxp commented Oct 20, 2018

@k8s-ci-robot k8s-ci-robot requested review from AishSundar, jberkus and krzyzacy Oct 20, 2018

We intend to drive the criteria to stricter thresholds over time, e.g.,
- finishing a run in 60min or less
- running at least every 2 hours
- passing at least 90% of all runs1

@AishSundar

AishSundar Oct 21, 2018

Contributor

nit: typo

@krzyzacy

krzyzacy Oct 22, 2018

Member

^^ (s/runs1/runs)

are often jobs that are some combination of slow, flaky, infrequent, or
difficult to maintain.
These jobs may still block the release, but because they require a lot of

@AishSundar

AishSundar Oct 21, 2018

Contributor

I like the idea of moving the "noisy" jobs into a separate dashboard so we don't lose their signal right away and can keep the "blocking" dashboard green for the most part. But I am afraid we will still end up spending the same amount of time and effort chasing the failures in the "informing" dashboard to determine Go/No-Go for a release. I am not sure when and how much to dial down our attention towards the failures here.
It might be worthwhile to add some kind of ownership expectation for these jobs, e.g., if we don't get attention from job/test owners around failing tests within 'x' days we will not block the release on them.

@spiffxp

spiffxp Oct 22, 2018

Member

I'm hoping if nothing else it means we spend our time babysitting fewer jobs. These jobs should still have clear owners, etc. But I think our decision to block/not-block is the "human interpretation" part I'm trying to call out here. I'll see if I can reword.

These jobs may still block the release, but because they require a lot of
manual, human interpretation, we choose to move them to a separate dashboard.
<!-- TODO(spiffxp): implement release-master-informing, populate with:

@AishSundar

AishSundar Oct 21, 2018

Contributor

I am assuming we will be opening PRs to move these jobs out, with a list of the criteria they don't meet?

@spiffxp
manual, human interpretation, we choose to move them to a separate dashboard.
<!-- TODO(spiffxp): implement release-master-informing, populate with:
- gce-master-scale-correctness

@AishSundar

AishSundar Oct 21, 2018

Contributor

this job finishes in 1.5 hrs. Curious as to which criteria it doesn't meet.

@spiffxp

spiffxp Oct 22, 2018

Member

I'll have to double-check; I may be wrong. I'll reword this so that it's an example, not an authoritative list. As you say, I think the discussion of which jobs should be here belongs in a followup PR.

@jberkus

jberkus Oct 22, 2018

Contributor

that's been a pretty reliable, informative job this cycle. I'd like to keep it.

@spiffxp

spiffxp Oct 22, 2018

Member

found it: although the job usually takes ~90min, it only runs once a week

@spiffxp

spiffxp Oct 22, 2018

Member

(to run it more frequently would require the allocation of more quota, as right now these jobs share the same project and have been manually scheduled across days of the week to avoid clashing with each other)

@tpepper

tpepper Oct 23, 2018

Contributor

Do we need a call for funding for things like this? Is that part of the CNCF hosting move?

- kubeadm
- kubeadm-$(X-1)-on-$(X)
- kubeadm-gce
- kubeadm-gce-x.$(y-1)-on-x.y
## Testgrid Release Blocking Dashboards

@AishSundar

AishSundar Oct 21, 2018

Contributor

What about the upgrade jobs and the upgrade dashboards? I understand they are much more thorny. Do we want to tackle them separately? FWIW, @justinsb's work on splitting the gce and gke upgrade jobs into (i) Serial & Disruptive and (ii) Slow (parallel) jobs has helped in stabilizing the latter.

@spiffxp

spiffxp Oct 22, 2018

Member

Practically speaking I think they should remain separate from master while we focus on them. It may be that we decide to move upgrade jobs into blocking/informing down the line.


I don't want to bikeshed on it, but just to document since I went looking:

Previous releases use

  • release-blocking
  • release-all

And then the release-all dashboards fall out of maintenance and are almost always red (usually because of upgrade jobs)

What we currently have with master is:

  • master-blocking
  • master-upgrade
  • master-upgrade-optional
  • master-kubectl-skew
  • misc

I haven't audited whether everything not in blocking ~= master-all; I would hope so, but suspect not. And then we have the fact that full-scale scalability testing is never done on other release branches, only master.
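
The audit mentioned above could be scripted against the testgrid config. The sketch below is an assumption-laden illustration: it presumes a YAML config with a `dashboards` list whose entries carry `name` and `dashboard_tab`/`test_group_name` fields, and the dashboard names (including the `sig-release-master-all` target) are placeholders built from the list above rather than dashboards that necessarily exist.

```python
import yaml  # assumes PyYAML is available


def tab_groups(config: dict, dashboard_name: str) -> set:
    """Return the set of test group names on a named dashboard (assumed config shape)."""
    for dash in config.get("dashboards", []):
        if dash.get("name") == dashboard_name:
            return {tab["test_group_name"] for tab in dash.get("dashboard_tab", [])}
    return set()


def audit(config: dict, blocking: str, topic_dashboards: list, all_dashboard: str) -> None:
    """Compare the union of non-blocking master dashboards against a hypothetical -all dashboard."""
    non_blocking = set().union(*(tab_groups(config, name) for name in topic_dashboards))
    everything = tab_groups(config, all_dashboard)
    print("on topic dashboards but missing from -all:", sorted(non_blocking - everything))
    print("in -all but on no topic dashboard:",
          sorted(everything - non_blocking - tab_groups(config, blocking)))


if __name__ == "__main__":
    with open("testgrid-config.yaml") as f:  # placeholder path
        cfg = yaml.safe_load(f)
    # Dashboard names are taken from the list above, with a guessed "sig-release-" prefix.
    audit(
        cfg,
        blocking="sig-release-master-blocking",
        topic_dashboards=[
            "sig-release-master-upgrade",
            "sig-release-master-upgrade-optional",
            "sig-release-master-kubectl-skew",
            "sig-release-misc",
        ],
        all_dashboard="sig-release-master-all",  # hypothetical; no such dashboard exists today
    )
```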

@spiffxp

Member

spiffxp commented Oct 22, 2018

Discussed during sig-release; general agreement on this, but I would still appreciate a pass from @jberkus and, if possible, @krzyzacy.

The ideal release-blocking job meets the following criteria:
- It provisions clusters via open-source software instead of a hosted service

@krzyzacy

krzyzacy Oct 22, 2018

Member

aka, no gke?

@spiffxp

@krzyzacy
<!-- TODO(krzyzacy): define blocking criteria -->
## Release Blocking Criteria
The ideal release-blocking job meets the following criteria:

@krzyzacy

krzyzacy Oct 22, 2018

Member

what if an already-blocking job doesn't meet these criteria? Do we kick it out of release blocking?

@spiffxp

spiffxp Oct 22, 2018

Member

My goal is for us to use followup PRs to suggest kicking jobs out of blocking. If we decide we can't kick one out just yet, we'll at least have a documented decision for why it needs to stay.

@krzyzacy

krzyzacy Oct 22, 2018

Member

sgtm, maybe also document that here (when/who will need to make that followup kickout PR)

@spiffxp
Jobs that are considered useful or necessary to inform whether a commit on
master is ready for release, but that clearly do not meet release-blocking
criteria, may be placed on the sig-release-master-informing dashboard. These

@krzyzacy

krzyzacy Oct 22, 2018

Member

So if we intend to create sig-release-master-informing, what about per-release-branch dashboards? Is the release-x.y-all dashboard sufficient?

@spiffxp

spiffxp Oct 22, 2018

Member

I am hoping -all is sufficient, and that master is a special case for now while we work to consolidate into an -all

@spiffxp

Member

spiffxp commented Oct 22, 2018

/hold
for comment, will revise shortly

@krzyzacy

Member

krzyzacy commented Oct 22, 2018

LGTM, though someone from GKE should be OK with GKE not blocking OSS releases :-)

@jagosan

jagosan commented Oct 22, 2018

This is a reasonable direction and aligned with efforts to date. We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

@spiffxp

Member

spiffxp commented Oct 22, 2018

We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

Agree. I do not think much is lost in terms of release signal by removing these GKE tests from release-blocking. The results will still be available in dashboards such as google-gke and google-gke-staging for those that wish to pay attention to them.

@AishSundar

Contributor

AishSundar commented Oct 22, 2018

/lgtm

FWIW we will still continue to monitor the status of the tests in the release-informing dashboard (at least in the 1.13 timeframe) to ensure we are not losing any critical signal (which doesn't seem to be the case historically for the past 2 releases).

We use the [1.12 release-blocking dashboard](http://k8s-testgrid.appspot.com/sig-release-1.12-blocking)
as a source of truth. The tab names are autogenerated based on the job names,
and have the branch/version in their name. For example, for release x.y:

@jberkus

jberkus Oct 22, 2018

Contributor

I'd like a comment here that this is just the initial set of tests, and that each of these will be evaluated based on the criteria above. It's really not clear from the description that that's the case, just from your comments.

@jberkus

jberkus Oct 22, 2018

Contributor

Hmmmm, never mind, you DO have that above.

@jberkus

Contributor

jberkus commented Oct 22, 2018

/lgtm

I'm really happy with the test criteria. Particularly having overall owners for jobs, not just for the individual tests.

- gce-master-scale-correctness
- gce-master-scale-performance
- gce-cos-master-serial
- gke-cos-master-serial

@tpepper

tpepper Oct 23, 2018

Contributor

Action pending from the 1.12 release retrospective: should we open an issue on analyzing g[ck]e-cos-master-serial versus g[ck]e-cos-master-slow versus g[ck]e-cos-master-default for valuable coverage, quality of signal, support status, etc.? Or, put another way, lacking that data and the human resources to collect it, is it a beneficial cleanup or a problem to simply drop g[ck]e-cos-master-serial instead of moving them to release-master-informing?

@AishSundar

AishSundar Oct 23, 2018

Contributor

I have seen master-serial give us some additional value from time to time vs the master-default one. So there are a few tests spread between the 3 dashboards that give us signal; the trick is finding out which ones they are. IOW, I don't think we should drop them completely (at least not right away), but we should be much more vigilant and aware of what value the various jobs in the informing dashboard are adding, when we actually find a k8s release-blocking issue due to their failures, and how responsive the owners are. Once we have that data collected for a release or two we should be in a better place to make a call.

@spiffxp

Member

spiffxp commented Oct 27, 2018

Holding one more business day for comment, will merge Tuesday 2018-10-30

- represented by a configured testgrid alert
- contact info is also part of job description
-->
- Fails for no more than 10 runs in a row

@lavalamp

lavalamp Oct 29, 2018

Member

These requirements are a little strange-- suppose something is merged that breaks the tests. What we want to happen is that the release is blocked. What would actually happen, interpreting these rules literally, is that we kick all the tests out.

@spiffxp

spiffxp Oct 29, 2018

Member

I agree, and would welcome help rewording this. I'm trying to put the stake in sand before setting it in stone. The idea is to punt out jobs that are perpetually failing due to a lack of attention. If they are failing and actively being worked, that's different. The case could also be made that there should be a standard process to follow of revert-before-fix once an offending commit is identified.

@lavalamp

lavalamp Oct 29, 2018

Member

I personally think it's actually a bad idea to explicitly spell out the criteria; I think it requires human judgement. I suggest something like:

We intend to remove jobs that are getting an insufficient amount of attention from their owners, i.e., they have test problems that aren't being addressed. We won't punt jobs merely because they are failing; we'll punt them when they are not providing good signal (positive OR negative) AND there's no indication that they will begin providing a reliable signal in a relatively short (e.g., 1 week) timeframe. There's no shame in being removed; once the problems have been fixed a job can be added back.

In order to prevent the rules from being gamed, we leave it up to humans to determine whether or not a given test meets the criteria.

I'd further add that if it's not obvious or is contentious whether a test job meets the criteria, that sig release and sig testing TLs together as a group should make the decision.

I wouldn't bother defining what should happen if that doesn't work, we should just figure it out if it ever happens.

@spiffxp

spiffxp Oct 29, 2018

Member

We're not going to have bots automatically act on this for quite a while (if ever?). Think of these more as triggering conditions for human review. I call out human review as a necessary part of the process to add or remove below.
I'd like to get humans out of the business of having to continually scan jobs and metrics, so they have more time to focus on the business of making decisions. My hope is that by specifying criteria in a manner that can be automated, we can then start focusing more on spelling out the process we should follow.
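
As a small illustration of "triggering conditions for human review": a criteria check like the one sketched earlier in this thread could feed a report such as the following, where nothing is removed automatically and the output only tells humans where to look. The function name, data shape, and job names are assumptions for the sketch.

```python
from typing import Dict, List


def jobs_needing_human_review(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Given per-job criteria results (e.g., from the earlier sketch), list the jobs
    where at least one criterion fails. The release team still decides what, if
    anything, to do about each flagged job."""
    return sorted(job for job, checks in results.items() if not all(checks.values()))


# Hypothetical usage:
# jobs_needing_human_review({
#     "gce-cos-master-default": {"fast_enough": True, "pass_rate_ok": True},
#     "gce-master-scale-correctness": {"fast_enough": True, "frequent_enough": False},
# })
# -> ["gce-master-scale-correctness"]
```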

@spiffxp

spiffxp Oct 30, 2018

Member

I added some of lavalamp's language and more strongly emphasized that humans have the final say. I also sketched out a process that we can iterate on in followups

@k8s-ci-robot k8s-ci-robot removed the lgtm label Oct 30, 2018

@spiffxp

Member

spiffxp commented Oct 30, 2018

/hold cancel
this is ready to go

@jberkus

Contributor

jberkus commented Oct 30, 2018

/lgtm

Let's do this.

@k8s-ci-robot k8s-ci-robot added the lgtm label Oct 30, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Oct 30, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AishSundar, jberkus, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [AishSundar,jberkus,spiffxp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 05b684a into kubernetes:master Oct 30, 2018

2 checks passed

cla/linuxfoundation: spiffxp authorized
tide: In merge pool.

@spiffxp spiffxp deleted the spiffxp:release-blocking-criteria branch Oct 30, 2018
