Add release blocking job criteria #346

Merged (1 commit, Oct 30, 2018)

Conversation

@spiffxp (Member) commented Oct 20, 2018

The Kubernetes community has developed an over-reliance on e2e tests that are flaky, slow, and seemingly unmaintained. We, the release team, end up spending much of our time rationalizing and explaining away failing or flaky tests as red-but-green, or ignoring meaningful failures under the guise of perpetual-failure being the expected normal state.

This results in delays leading up to release, contributes to the reputation of .0 releases being beta-quality at best, and allows regressions to slip through the cracks in patch releases.

This PR proposes release-blocking job criteria, based on #24, to help us eliminate noise and improve signal.

The criteria spelled out here are less strict than the criteria in that issue, but they are based on what the jobs look like today.

Once this PR lands, I would like to move the release-blocking jobs toward the criteria listed and refine them as described in the TODOs, e.g.:

  • take the gke jobs off of release-blocking boards
  • make a release-master-informing board and migrate jobs to it
  • update ci signal docs accordingly
  • hunt down job owners and update job descriptions
  • propose and implement the use of sig-foo-alerts googlegroups for testgrid alerts
  • implement a velodrome dashboard that shows how well release-blocking jobs are
    meeting these criteria (a rough sketch of such a check follows below)

I plan on using #347 as the umbrella issue for that.
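To make that dashboard item more concrete, here is a minimal sketch of what an automated check against the proposed thresholds could look like. It is purely illustrative: the `Run` record and the `meets_release_blocking_criteria` helper are hypothetical, not part of velodrome, testgrid, or any existing tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Run:
    """One finished run of a CI job (hypothetical shape, for illustration only)."""
    started: datetime
    duration: timedelta
    passed: bool


def meets_release_blocking_criteria(runs: List[Run]) -> Dict[str, bool]:
    """Check a job's recent runs against the stricter thresholds discussed in
    this PR: finishes in 60min or less, runs at least every 2 hours, and
    passes at least 90% of all runs."""
    if not runs:
        return {"finishes_in_60min": False, "runs_every_2h": False, "passes_90pct": False}
    runs = sorted(runs, key=lambda r: r.started)
    # Gaps between consecutive run start times tell us how frequently the job runs.
    gaps = [later.started - earlier.started for earlier, later in zip(runs, runs[1:])]
    return {
        "finishes_in_60min": all(r.duration <= timedelta(minutes=60) for r in runs),
        "runs_every_2h": all(gap <= timedelta(hours=2) for gap in gaps),
        "passes_90pct": sum(r.passed for r in runs) / len(runs) >= 0.9,
    }
```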

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 20, 2018
@k8s-ci-robot k8s-ci-robot added the sig/release Categorizes an issue or PR as relevant to SIG Release. label Oct 20, 2018
@spiffxp (Member Author) commented Oct 20, 2018

/cc @jberkus @AishSundar @krzyzacy

We intend to drive the criteria to stricter thresholds over time, e.g.,
- finishing a run in 60min or less
- running at least every 2 hours
- passing at least 90% of all runs1
Contributor

nit: typo

Member

^^ (s/runs1/runs)

are often jobs that are some combination of slow, flaky, infrequent, or
difficult to maintain.

These jobs may still block the release, but because they require a lot of
Contributor

I like the idea of moving the "noisy" jobs into a separate dashboard so we don't lose their signal right away and can keep the "blocking" dashboard green for the most part. But I am afraid we will still end up spending the same amount of time and effort chasing the failures in the "informing" dashboard to determine go/no-go for a release. I am not sure when, and how much, to dial down our attention to the failures there.
It might be worthwhile to add some kind of ownership expectation for these jobs, e.g., if we don't get attention from job/test owners on failing tests within 'x' days, we will not block the release on them.

Member Author

I'm hoping that, if nothing else, it means we spend our time babysitting fewer jobs. These jobs should still have clear owners, etc. But I think our decision to block or not block is the "human interpretation" part I'm trying to call out here. I'll see if I can reword.

These jobs may still block the release, but because they require a lot of
manual, human interpretation, we choose to move them to a separate dashboard.

<!-- TODO(spiffxp): implement release-master-informing, populate with:
Contributor

I am assuming we will be opening PRs to move these jobs out, with a list of the criteria they don't meet?

Member Author

Yes

manual, human interpretation, we choose to move them to a separate dashboard.

<!-- TODO(spiffxp): implement release-master-informing, populate with:
- gce-master-scale-correctness
Contributor

This job finishes in 1.5 hours. Curious which criteria it doesn't meet.

Member Author

I'll have to double-check; I may be wrong. I'll reword this so that it's an example, not an authoritative list. As you say, I think the discussion of which jobs should be here belongs in a followup PR.

Contributor

That's been a pretty reliable, informative job this cycle. I'd like to keep it.

Member Author

Found it: although the job usually takes ~90 min, it only runs once a week.

Member Author

(To run it more frequently would require allocating more quota; right now these jobs share the same project and have been manually scheduled across days of the week to avoid clashing with each other.)

Member

Do we need a call for funding for things like this? Is that part of the CNCF hosting move?

- kubeadm
- kubeadm-$(X-1)-on-$(X)
- kubeadm-gce
- kubeadm-gce-x.$(y-1)-on-x.y

## Testgrid Release Blocking Dashboards
Contributor

What about the upgrade jobs and the upgrade dashboards? I understand they are much thornier. Do we want to tackle them separately? FWIW, @justinsb's work on splitting the gce and gke upgrade jobs into (i) Serial & Disruptive and (ii) Slow (parallel) jobs has helped stabilize the latter.

Member Author

Practically speaking, I think they should remain separate from master while we focus on them. It may be that we decide to move upgrade jobs into blocking/informing down the line.


I don't want to bikeshed on it, but just to document since I went looking:

Previous releases use:

  • release-blocking
  • release-all

And then the release-all dashboards fall out of maintenance and are almost always red (usually because of upgrade jobs).

What we currently have with master is:

  • master-blocking
  • master-upgrade
  • master-upgrade-optional
  • master-kubectl-skew
  • misc

I haven't audited whether everything not in blocking ~= master-all; I would hope so, but suspect not. And then there's the fact that full-scale scalability testing is never done on other release branches, only master.

@spiffxp (Member Author) commented Oct 22, 2018

Discussed during sig-release; there was general agreement on this, but I would still appreciate a pass from @jberkus and, if possible, @krzyzacy.


The ideal release-blocking job meets the following criteria:

- It provisions clusters via open-source software instead of a hosted service
Member

aka, no gke?

Member Author

Yes


<!-- TODO(krzyzacy): define blocking criteria -->
## Release Blocking Criteria

The ideal release-blocking job meets the following criteria:
Member

What if an already-blocking job doesn't? Do we kick it out of release-blocking?

Member Author

My goal is for us to use followup PRs to suggest kicking jobs out of blocking. If we decide we can't kick one out just yet, we'll at least have a documented decision for why it needs to stay.

Member

SGTM; maybe also document that here (when/who will need to make that followup kickout PR).

Member Author

opened #347


Jobs that are considered useful or necessary to inform whether a commit on
master is ready for release, but that clearly do not meet release-blocking
criteria, may be placed on the sig-release-master-informing dashboard. These
Member

So if we intend to create sig-release-master-informing, what about per-release-branch dashboards? Is release-x.y-all sufficient?

Member Author

I am hoping -all is sufficient, and that master is a special case for now while we work to consolidate into an -all.

@spiffxp (Member Author) commented Oct 22, 2018

/hold
for comment, will revise shortly

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2018
@krzyzacy (Member)

LGTM, though someone from GKE should be OK with GKE not blocking OSS releases :-)

@jagosan commented Oct 22, 2018

This is a reasonable direction and aligned with efforts to date. We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

@spiffxp (Member Author) commented Oct 22, 2018

> We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.

Agree. I do not think much is lost in terms of release signal by removing these GKE tests from release-blocking. The results will still be available in dashboards such as google-gke and google-gke-staging for those that wish to pay attention to them.

@AishSundar (Contributor)

/lgtm

FWIW, we will still continue to monitor the status of the tests in the release-informing dashboard (at least in the 1.13 timeframe) to ensure we are not losing any critical signal (which doesn't seem to have been the case historically for the past two releases).

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 22, 2018

We use [1.12 release-blocking dashboard](http://k8s-testgrid.appspot.com/sig-release-1.12-blocking)
-as a source of truth:
+as a source of truth. The tab names are autogenerated based on the job names,
+and have the branch/version in their name. For example, for release x.y:
Contributor

I'd like a comment here that this is just the initial set of tests, and that each of these will be evaluated based on the criteria above. It's really not clear from the description that that's the case, just from your comments.

Contributor

Hmmmm, never mind, you DO have that above.

@jberkus (Contributor) commented Oct 22, 2018

/lgtm

I'm really happy with the test criteria. Particularly having overall owners for jobs, not just for the individual tests.

- gce-master-scale-correctness
- gce-master-scale-performance
- gce-cos-master-serial
- gke-cos-master-serial
Member

Action pending from the 1.12 release retrospective: should we open an issue on analyzing g[ck]e-cos-master-serial versus g[ck]e-cos-master-slow versus g[ck]e-cos-master-default for valuable coverage, quality of signal, support status, etc.? Or, put another way, lacking that data and the human resources to collect it: is it a beneficial cleanup, or a problem, to simply drop g[ck]e-cos-master-serial instead of moving them to release-master-informing?

Contributor

I have seen master-serial give us some additional value from time to time vs. master-default. So there are a few tests spread between the three dashboards that give us signal; the trick is finding out which ones they are. IOW, I don't think we should drop them completely (at least not right away), but we should be much more vigilant about what value the various jobs in the informing dashboard are adding, when we actually find a k8s release-blocking issue due to their failures, and how responsive the owners are. Once we have that data collected for a release or two, we should be in a better place to make a call.

@spiffxp (Member Author) commented Oct 27, 2018

Holding one more business day for comment; will merge Tuesday 2018-10-30.

- represented by a configured testgrid alert
- contact info is also part of job description
-->
- Fails for no more than 10 runs in a row
Member

These requirements are a little strange: suppose something is merged that breaks the tests. What we want to happen is that the release is blocked. What would actually happen, interpreting these rules literally, is that we kick all the tests out.

Member Author

I agree, and would welcome help rewording this. I'm trying to put a stake in the sand before setting it in stone. The idea is to punt jobs that are perpetually failing due to a lack of attention. If they are failing and actively being worked on, that's different. The case could also be made that there should be a standard revert-before-fix process to follow once an offending commit is identified.

Member

I personally think it's actually a bad idea to explicitly spell out the criteria; I think it requires human judgment. I suggest something like:

We intend to remove jobs that are getting an insufficient amount of attention from their owners, i.e., they have test problems that aren't being addressed. We won't punt jobs merely because they are failing; we'll punt them when they are not providing good signal (positive OR negative) AND there's no indication that they will begin providing a reliable signal in a relatively short (e.g., one-week) timeframe. There's no shame in being removed; once the problems have been fixed, a job can be added back.

In order to prevent the rules from being gamed, we leave it up to humans to determine whether or not a given test meets the criteria.

I'd further add that if it's not obvious, or is contentious, whether a test job meets the criteria, the SIG Release and SIG Testing TLs together as a group should make the decision.

I wouldn't bother defining what should happen if that doesn't work; we should just figure it out if it ever happens.

Member Author

We're not going to have bots automatically act on this for quite a while (if ever?). Think of these more as triggering conditions for human review. I call out human review below as a necessary part of the process to add or remove jobs.
I'd like to get humans out of the business of having to continually scan jobs and metrics, so they have more time to focus on the business of making decisions. My hope is that by specifying criteria in a manner that can be automated, we can then start focusing more on spelling out the process we should follow.
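As an illustration of such a triggering condition, here is a rough sketch of the "fails for more than 10 runs in a row" check. The data shape (an ordered list of pass/fail results) and the `needs_human_review` helper are assumptions for the example, not an actual testgrid or velodrome API.

```python
from typing import Sequence


def needs_human_review(results: Sequence[bool], limit: int = 10) -> bool:
    """results is ordered oldest-to-newest; True means the run passed.
    Returns True once the job's current failing streak exceeds `limit`,
    mirroring the "fails for no more than 10 runs in a row" criterion."""
    streak = 0
    for passed in results:
        # Reset the streak on any passing run, otherwise keep counting failures.
        streak = 0 if passed else streak + 1
    return streak > limit
```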

Member Author

I added some of lavalamp's language and more strongly emphasized that humans have the final say. I also sketched out a process that we can iterate on in followups.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2018
@spiffxp (Member Author) commented Oct 30, 2018

/hold cancel
This is ready to go.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2018
@jberkus (Contributor) commented Oct 30, 2018

/lgtm

Let's do this.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AishSundar, jberkus, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [AishSundar,jberkus,spiffxp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
