Add release blocking job criteria #346
Conversation
release-blocking-jobs.md
Outdated
We intend to drive the criteria to stricter thresholds over time, e.g.,
- finishing a run in 60min or less
- running at least every 2 hours
- passing at least 90% of all runs1
nit: typo
^^ (s/runs1/runs)
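For context on the quoted thresholds: they are simple enough to evaluate mechanically once per-job run statistics are available. A minimal sketch, assuming a hypothetical stats shape (the field names and `JobStats` type below are illustrative, not an actual testgrid API):

```python
from dataclasses import dataclass

@dataclass
class JobStats:
    """Aggregate stats for a release-blocking job (hypothetical shape)."""
    avg_duration_min: float  # average wall-clock time of a run, in minutes
    interval_hours: float    # how often the job is scheduled
    pass_rate: float         # fraction of runs that passed, 0.0 to 1.0

def meets_strict_criteria(stats: JobStats) -> bool:
    """Check the stricter thresholds the doc intends to drive toward."""
    return (stats.avg_duration_min <= 60
            and stats.interval_hours <= 2
            and stats.pass_rate >= 0.90)

# Example: a weekly 90-minute job fails both the duration and frequency checks.
weekly_scale_job = JobStats(avg_duration_min=90, interval_hours=168, pass_rate=0.95)
print(meets_strict_criteria(weekly_scale_job))  # False
```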
are often jobs that are some combination of slow, flaky, infrequent, or
difficult to maintain.

These jobs may still block the release, but because they require a lot of
I like the idea of moving the "noisy" jobs into a separate dashboard so we don't lose their signal right away and help maintain the "blocking" dashboard in a green state for the most part. But I am afraid we will still end up spending the same amount of time and effort chasing the failures in the "informing" dashboard to determine Go/No-Go for a release. I am not sure when and how much to dial down our attention towards the failures here.
It might be worthwhile to add some kind of ownership expectation for these jobs, e.g.: if we don't get attention from Job/Test Owners around failing tests within 'x' days, we will not block the release on them.
I'm hoping if nothing else it means we spend our time babysitting fewer jobs. These jobs should still have clear owners etc. But I think our decision to block/not-block is the "human interpretation" part I'm trying to call out here. I'll see if I can re-word
release-blocking-jobs.md
Outdated
These jobs may still block the release, but because they require a lot of
manual, human interpretation, we choose to move them to a separate dashboard.

<!-- TOOD(spiffxp): implement release-master-informing, populate with:
I am assuming we will be opening PRs to move these jobs out, with a list of criteria they don't meet?
Yes
release-blocking-jobs.md
Outdated
manual, human interpretation, we choose to move them to a separate dashboard.

<!-- TOOD(spiffxp): implement release-master-informing, populate with:
- gce-master-scale-correctness
This job finishes in 1.5 hrs. Curious as to which criteria it doesn't meet.
I'll have to double check, I may be wrong. I'll reword this so that it's an example not an authoritative list. As you say, I think the discussion for what jobs should be here belongs in a followup PR
that's been a pretty reliable, informative job this cycle. I'd like to keep it.
found it: although the job usually takes ~90min, it only runs once a week
(to run it more frequently would require the allocation of more quota, as right now these jobs share the same project and have been manually scheduled across days of the week to avoid clashing with each other)
Do we need a call for funding for things like this? Is that part of the CNCF hosting move?
- kubeadm
- kubeadm-$(X-1)-on-$(X)
- kubeadm-gce
- kubeadm-gce-x.$(y-1)-on-x.y

## Testgrid Release Blocking Dashboards
What about the upgrade jobs and the upgrade dashboards? I understand they are much more thorny; do we want to tackle them separately? FWIW, @justinsb's work on splitting the gce and gke upgrade jobs into (i) Serial & Disruptive and (ii) Slow (parallel) jobs has helped in stabilizing the latter.
Practically speaking I think they should remain separate from master while we focus on them. It may be that we decide to move upgrade jobs into blocking/informing down the line.
I don't want to bikeshed on it, but just to document since I went looking:
Previous releases use
- release-blocking
- release-all
And then the release-all dashboards fall out of maintenance and are almost always red (usually because of upgrade jobs)
What we currently have with master is
- master-blocking
- master-upgrade
- master-upgrade-optional
- master-kubectl-skew
- misc
I haven't audited whether everything not in blocking ~= master-all, I would hope so but suspect not. And then we have the fact that full scale scalability testing is never done on other release branches, only master.

The ideal release-blocking job meets the following criteria:

- It provisions clusters via open-source software instead of a hosted service
aka, no gke?
Yes
cc @jagosan
<!-- TODO(krzyzacy): define blocking criteria -->
## Release Blocking Criteria

The ideal release-blocking job meets the following criteria:
What if an already-blocking job doesn't? Do we kick it out of release-blocking?
My goal is for us to use followup PRs to suggest kicking jobs out of blocking. If we decide we can't kick one out just yet, we'll at least have a documented decision for why it needs to stay.
sgtm, maybe also document that here (when/who will need to make that followup kickout PR)
opened #347

Jobs that are considered useful or necessary to inform whether a commit on
master is ready for release, but that clearly do not meet release-blocking
criteria, may be placed on the sig-release-master-informing dashboard. These
So if we intend to create sig-release-master-informing, what about per-release-branch dashboards? Is the release-x.y-all sufficient?
I am hoping -all is sufficient, and that master is a special case for now while we work to consolidate into an -all.
/hold
LGTM, besides someone from GKE should be OK with GKE not blocking OSS releases :-)
This is a reasonable direction and aligned with efforts to date. We should be careful to ensure there is a strong provider-agnostic solution to replace any currently-blocking tests that will be removed.
Agree. I do not think much is lost in terms of release signal by removing these GKE tests from release-blocking. The results will still be available in dashboards such as google-gke and google-gke-staging for those that wish to pay attention to them.
/lgtm FWIW we will still continue to monitor the status of the tests in the release-informing dashboard (at least in the 1.13 timeframe) to ensure we are not losing any critical signal (which doesn't seem to be the case historically for the past 2 releases).

We use [1.12 release-blocking dashboard](http://k8s-testgrid.appspot.com/sig-release-1.12-blocking)
as a source of truth:
as a source of truth. The tab names are autogenerated based on the job names,
and have the branch/version in their name. For example, for release x.y:
I'd like a comment here that this is just the initial set of tests, and that each of these will be evaluated based on the criteria above. It's really not clear from the description that that's the case, just from your comments.
Hmmmm, never mind, you DO have that above.
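As an aside, the version-templated tab names quoted earlier (e.g. kubeadm-gce-x.$(y-1)-on-x.y) expand mechanically once a release version is known. The helper below is purely illustrative of that naming convention, not actual testgrid tooling:

```python
def tab_name(template: str, major: int, minor: int) -> str:
    """Expand a version-templated job name into a concrete tab name.

    The $(y-1) placeholder in the quoted examples refers to the
    previous minor version (assumed convention, for illustration)."""
    return (template
            .replace("x.$(y-1)", f"{major}.{minor - 1}")
            .replace("x.y", f"{major}.{minor}"))

# For release 1.12, the skew job name expands to previous-minor-on-current.
print(tab_name("kubeadm-gce-x.$(y-1)-on-x.y", 1, 12))  # kubeadm-gce-1.11-on-1.12
```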
/lgtm I'm really happy with the test criteria. Particularly having overall owners for jobs, not just for the individual tests.
- gce-master-scale-correctness
- gce-master-scale-performance
- gce-cos-master-serial
- gke-cos-master-serial
Action pending from the 1.12 release retrospective: should we open an issue on analyzing g[ck]e-cos-master-serial versus g[ck]e-cos-master-slow versus g[ck]e-cos-master-default for valuable coverage, quality of signal, support status, etc.? Or put another way, lacking that data and the human resources to collect it, is it a beneficial cleanup or a problem to simply drop g[ck]e-cos-master-serial instead of moving them to release-master-informing?
I have seen master-serial give us some additional value from time to time vs the master-default one. So there are a few tests spread between the 3 dashboards that give us signal; the trick is finding out which ones they are. IOW, I don't think we should drop them completely (at least right away), but be much more vigilant and aware of what value the various jobs in the informing dashboard are adding, when we actually find a k8s release-blocking issue due to their failures, and how responsive the owners are. Once we have that data collected for a release or two we should be in a better place to make a call.
Holding one more business day for comment, will merge Tuesday 2018-10-30
- represented by a configured testgrid alert
- contact info is also part of job description
-->
- Fails for no more than 10 runs in a row
These requirements are a little strange-- suppose something is merged that breaks the tests. What we want to happen is that the release is blocked. What would actually happen, interpreting these rules literally, is that we kick all the tests out.
I agree, and would welcome help rewording this. I'm trying to put the stake in sand before setting it in stone. The idea is to punt out jobs that are perpetually failing due to a lack of attention. If they are failing and actively being worked, that's different. The case could also be made that there should be a standard process to follow of revert-before-fix once an offending commit is identified.
I personally think it's actually a bad idea to explicitly spell out the criteria, I think it requires human judgement. I suggest something like:
We intend to remove jobs that are getting an insufficient amount of attention from their owners, i.e., they have test problems that aren't being addressed. We won't punt jobs merely because they are failing, we'll punt them when they are not providing good signal (positive OR negative) AND there's no indication that they will begin providing a reliable signal in relatively short (e.g., 1 week) timeframe. There's no shame in being removed, once the problems have been fixed a job can be added back.
In order to prevent the rules from being gamed, we leave it up to humans to determine whether or not a given test meets the criteria.
I'd further add that if it's not obvious or is contentious whether a test job meets the criteria, that sig release and sig testing TLs together as a group should make the decision.
I wouldn't bother defining what should happen if that doesn't work, we should just figure it out if it ever happens.
We're not going to have bots automatically act on this for quite a while (if ever?). Think of these more as triggering conditions for human review. I call out human review as a necessary part of the process to add or remove below.
I'd like to get humans out of the business of having to continually scan jobs and metrics, so they have more time to focus on the business of making decisions. My hope is that by specifying criteria in a manner that can be automated, we can then start focusing more on spelling out the process we should follow.
I added some of lavalamp's language and more strongly emphasized that humans have the final say. I also sketched out a process that we can iterate on in followups
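To make the "triggering conditions for human review" idea in this thread concrete, a consecutive-failure check could be sketched as below. The run-history format is an assumption for illustration, not an actual testgrid data model:

```python
def consecutive_failures(run_results: list) -> int:
    """Count failing runs at the head of the history (most recent first).

    True = passing run, False = failing run (hypothetical data shape;
    a real check would read testgrid results)."""
    count = 0
    for passed in run_results:
        if passed:
            break
        count += 1
    return count

def needs_human_review(run_results: list, max_consecutive: int = 10) -> bool:
    """Flag the job for review, not automatic removal: humans make the call."""
    return consecutive_failures(run_results) > max_consecutive

# Twelve failures in a row trips the "no more than 10 runs in a row" condition.
history = [False] * 12 + [True] * 20
print(needs_human_review(history))  # True
```

Note the asymmetry this preserves: the check only surfaces candidates for discussion, matching the point above that removal decisions stay with humans.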
3deee75 to 423ed58
/hold cancel
/lgtm Let's do this.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: AishSundar, jberkus, spiffxp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The Kubernetes community has developed an over-reliance on e2e tests that are flaky, slow, and seemingly unmaintained. We, the release team, end up spending much of our time rationalizing and explaining away failing or flaky tests as red-but-green, or ignoring meaningful failures under the guise of perpetual-failure being the expected normal state.
This results in delays leading up to release, contributes to the reputation of .0 releases being beta-quality at best, and allows regressions to slip through the cracks in patch releases.
This PR proposes release blocking job criteria based on #24 to help us eliminate noise and improve signal.
The criteria spelled out here are less strict than the criteria in that issue, but this is based on what the jobs look like today.
Once this PR lands I would like to move the release-blocking jobs toward the criteria listed, and refine as described in the TODOs, e.g.:
meeting these criteria
I plan on using #347 as the umbrella issue for that