
Implement release blocking job criteria #347

Closed
spiffxp opened this issue Oct 22, 2018 · 36 comments
Labels

  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Member

spiffxp commented Oct 22, 2018

This is an umbrella issue for followup work to #346

That PR describes aspirational release blocking job criteria. This issue is intended to track the followup work, including:

  • assignment of owners to all jobs
  • descriptions added to all release-master-blocking jobs
  • propose the creation of sig-foo-alerts@googlegroups.com or reuse of sig-foo-test-failures for all sigs that need to be responsive to test failures
  • the creation of the release-informing dashboard, and moving jobs out of release-master-blocking to that dashboard
  • a bigquery run that generates metrics for jobs currently on the release-master-blocking dashboard

/sig release
this is sig-release policy
/sig testing
this will be assisted by sig-testing tooling

EDIT 2019-07-23: AFAIK metrics is the only thing that remains to close this out

@spiffxp added the sig/release, sig/testing, and priority/important-longterm labels Oct 22, 2018
@spiffxp changed the title from "Implement release block job criteria" to "Implement release blocking job criteria" Oct 22, 2018
@jberkus
Contributor

jberkus commented Oct 23, 2018

Are we consolidating all non-blocking dashboards into -informing?

I'd be in favor of that.

@BenTheElder
Member

/cc

@spiffxp
Member Author

spiffxp commented Jan 3, 2019

/milestone v1.14

I would like for us to implement this for the v1.14 release

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 3, 2019
@justaugustus
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 29, 2019
@justaugustus justaugustus added this to To do in Release Team via automation May 1, 2019
@justaugustus
Member

/help
/milestone v1.15

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.14, v1.15 May 1, 2019
@k8s-ci-robot added the help wanted label May 1, 2019
@spiffxp
Member Author

spiffxp commented May 7, 2019

  • assignment of owners to all jobs
  • descriptions added to all release-master-blocking jobs
  • propose the creation of sig-foo-alerts@googlegroups.com or reuse of sig-foo-test-failures for all sigs that need to be responsive to test failures

/assign
I'm handling this in #441

  • the creation of the release-informing dashboard, and moving jobs out of release-master-blocking to that dashboard

this is done

  • a bigquery run that generates metrics for jobs currently on the release-master-blocking dashboard

I would recommend someone look at https://github.com/kubernetes/test-infra/tree/master/metrics for this
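
As a rough illustration of the kind of per-job metrics such a run could produce, here is a minimal Python sketch. The record fields (`job`, `passed`, `duration_minutes`), the job names, and the sample values are all hypothetical; they are not the schema or the queries used in kubernetes/test-infra, where the data would come from BigQuery rather than an in-memory list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; field names and values are made up for illustration.
RUNS = [
    {"job": "job-a", "passed": True, "duration_minutes": 40.0},
    {"job": "job-a", "passed": False, "duration_minutes": 55.0},
    {"job": "job-a", "passed": True, "duration_minutes": 45.0},
    {"job": "job-b", "passed": True, "duration_minutes": 100.0},
    {"job": "job-b", "passed": True, "duration_minutes": 120.0},
]


def job_metrics(runs):
    """Group runs by job and report run count, pass rate, and mean duration."""
    by_job = defaultdict(list)
    for run in runs:
        by_job[run["job"]].append(run)

    metrics = {}
    for job, job_runs in by_job.items():
        metrics[job] = {
            "runs": len(job_runs),
            # True counts as 1 and False as 0, so the sum is the number of passes.
            "pass_rate": sum(r["passed"] for r in job_runs) / len(job_runs),
            "mean_duration_minutes": mean(r["duration_minutes"] for r in job_runs),
        }
    return metrics


if __name__ == "__main__":
    for job, m in sorted(job_metrics(RUNS).items()):
        print(
            f"{job}: runs={m['runs']} pass_rate={m['pass_rate']:.2f} "
            f"mean_duration={m['mean_duration_minutes']:.1f}m"
        )
```

Pass rate and duration are only two plausible metrics; which ones matter depends on the criteria described in #346.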

@jeefy
Member

jeefy commented Aug 12, 2019

/assign @jberkus

@wojtek-t
Member

This is sort of the mechanism by which people review PRs ... I really hope you are not missing 75% or more of these 😕

I actually bet I do if people don't ping me directly. I just opened a list of those where I was a reviewer and I didn't see the majority of those. If something is critical and really requires my attention, I should be at the very least assigned as an approver (those I'm trying to follow, but I'm also missing some percentage of those).

That's not really reasonable, policies should be expected to evolve, not be set in stone, and tomorrow I may introduce some unforeseen form of testing that we need to account for.
I also don't think we should be trying to handle soak testing at all right now. There's no useful soak testing on which to inform policy.

I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

I'm not seeing an actual resolution in this thread,

There was an hour-long discussion during the Zoom meeting, where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

@BenTheElder
Member

I actually bet I do if people don't ping me directly. I just opened a list of those where I was a reviewer and I didn't see the majority of those. If something is critical and really requires my attention, I should be at the very least assigned as an approver (those I'm trying to follow, but I'm also missing some percentage of those).

This is a separate discussion but ... this kinda defeats the purpose of having assigned reviewers. Please consider dropping a note to contribex about why / how this doesn't work so we can fix this. :/

I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

There has been no indication that we will have soak tests.

What we call soak tests now are easily among the longest failing tests we have and likely to be removed in the near future... These have been relatively unmaintained for on the order of year(s). Who is prioritizing them? I've heard zero discussion of this in SIG-Testing or SIG-Release ...

There was an hour-long discussion during the Zoom meeting, where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

Excellent. 👍

@spiffxp
Member Author

spiffxp commented Aug 16, 2019

I've got an initial attempt at a dashboard that displays metrics relevant to release-blocking criteria: kubernetes/test-infra#13879 (comment)

http://velodrome.k8s.io/dashboard/db/job-health-release-blocking

@spiffxp
Member Author

spiffxp commented Aug 16, 2019

I caught that the bazel jobs were postsubmits that didn't meet the "scheduled at least every 3 hours" criterion, so I swapped them with periodics that did: kubernetes/test-infra#13907
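
For illustration, a check of this scheduling criterion could be sketched in Python as below. This is only a sketch under stated assumptions: the job dictionaries, their `interval` key, and the simple duration format (`2h`, `90m`, `1h30m`) are hypothetical stand-ins and are not read from the real job configs, and cron-style schedules are not handled.

```python
import re
from datetime import timedelta

# Threshold taken from the "at least every 3 hours" criterion in the thread.
MAX_INTERVAL = timedelta(hours=3)

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_interval(text):
    """Parse a simple duration such as '2h', '90m', or '1h30m' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything that is not fully made of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported interval: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def meets_schedule_criterion(job):
    """Return True only if the job has a fixed interval of at most MAX_INTERVAL.

    Jobs without an interval (for example, ones triggered by merges) have no
    guaranteed schedule, so this sketch treats them as not meeting the criterion.
    """
    interval = job.get("interval")
    if interval is None:
        return False
    return parse_interval(interval) <= MAX_INTERVAL


if __name__ == "__main__":
    jobs = [
        {"name": "hypothetical-periodic", "interval": "2h"},
        {"name": "hypothetical-postsubmit"},  # no fixed schedule
    ]
    for job in jobs:
        print(job["name"], meets_schedule_criterion(job))
```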

@spiffxp
Member Author

spiffxp commented Aug 16, 2019

The serial job takes way too long, and is failing due to timeout. I think we should kick out egregious offenders, and encourage a pattern of adding a feature-specific job if the feature is truly necessary to be release-blocking.

Opened issues to start this for

@tpepper
Member

tpepper commented Aug 16, 2019

/assign @tpepper

@guineveresaenger
Contributor

/assign @guineveresaenger

@spiffxp
Member Author

spiffxp commented Aug 16, 2019

@msau42 is looking to split out serial storage tests into another release-blocking job as well: kubernetes/test-infra#13936

I feel like splitting tests into more parallel blocking jobs is a sound approach for now. But it's only going to get us so far before we run into new limits:

  • we pick up additional overhead of standup/teardown of yet another cluster
  • more jobs = more opportunity for one of them to flake, so overall reliability of master-blocking may go down
  • what other jobs should be running these tests? eg: skew, upgrade/downgrade, scale, other envs, etc

At some point it's worth questioning why these tests need to be release-blocking, and if there is some sort of bar they should be held to. We presumably do this for Conformance tests, though IMO it's not as rigorously measured as it could be, and relies on extensive human review.
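
The "more jobs = more opportunity to flake" point can be made concrete with a little arithmetic. The Python sketch below assumes, purely for illustration, that jobs pass or fail independently of each other and that each has a 95% pass rate; neither assumption comes from measured data, and real jobs sharing infrastructure are likely correlated.

```python
def all_pass_probability(pass_rates):
    """Probability that every job passes, assuming jobs fail independently."""
    probability = 1.0
    for rate in pass_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"pass rate out of range: {rate}")
        probability *= rate
    return probability


if __name__ == "__main__":
    # Illustrative numbers only: every job is given the same 95% pass rate.
    for job_count in (5, 10, 15):
        probability = all_pass_probability([0.95] * job_count)
        print(f"{job_count} jobs: P(all green) = {probability:.2f}")
```

Under these assumptions, ten jobs at 95% each are all green only about 60% of the time, so each extra blocking job lowers the chance of a fully green dashboard even if the job itself is fairly reliable.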

@jberkus
Contributor

jberkus commented Aug 27, 2019

Not completed despite the merger of #752. Mostly because there's still some issues unanswered:

@alejandrox1
Contributor

will start with #775
/assign

@alejandrox1
Contributor

/remove-help
/milestone v1.17

@k8s-ci-robot removed the help wanted label Oct 7, 2019
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Oct 7, 2019
@guineveresaenger
Contributor

Closing in favor of #773, #774, #775.

Thanks everyone!

SIG Release automation moved this from In progress to Done (1.17) Oct 21, 2019
Release Team automation moved this from In progress to Done (1.17) Oct 21, 2019