
Assess version markers used in release-branch job configs #850

Open · 3 of 9 tasks
justaugustus opened this issue Nov 5, 2019 · 37 comments
Labels: area/release-eng · kind/bug · kind/cleanup · lifecycle/frozen · priority/important-soon · sig/release

@justaugustus
Member

justaugustus commented Nov 5, 2019

Version markers are text files stored in the root of various GCS buckets.

They represent the results of different types of Kubernetes build jobs and act as a sort of public API for accessing builds. They are leveraged in extraction strategies for e2e tests, in release engineering tooling, and in user-created scripts.
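
For illustration, here is a minimal sketch of how a consumer reads a marker; the bucket name (kubernetes-release-dev) and marker path (ci/latest.txt) are assumptions for the example, not a canonical API:

```python
# Minimal sketch: read a version marker over GCS's public HTTP endpoint.
# The bucket name and marker path below are illustrative assumptions.
from urllib.request import urlopen

def read_version_marker(bucket: str, marker: str) -> str:
    """Fetch a version marker (a one-line text file) from a GCS bucket."""
    url = f"https://storage.googleapis.com/{bucket}/{marker}"
    with urlopen(url) as resp:
        return resp.read().decode("utf-8").strip()

# Prints something like "v1.18.0-alpha.0.1234+abcdef0123456"
print(read_version_marker("kubernetes-release-dev", "ci/latest.txt"))
```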

Unfortunately, the way certain version markers are generated and utilized is confusing at best and disruptive at worst.

There are a variety of problems, some of which are symptoms of the others:

Generic version markers are not explicit

We publish a set of additional generic version markers:

  • k8s-master
  • k8s-beta
  • k8s-stable1
  • k8s-stable2
  • k8s-stable3

Depending on the point in the release cycle, the meaning of these markers can
change.

  • k8s-master always points to the version on master.
  • k8s-beta may represent:
    • master's build version (pre-branch cut)
    • a to-be-released build version (post-branch cut)
    • a recently released build version (post-release)

Knowing what these markers mean at any given time presumes knowledge of the
build/release process or a correct interpretation of the
Kubernetes versions doc,
which has frequently been out of date and lives in a low-visibility location.
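
To make the ambiguity concrete, here is a hedged sketch of how a resolver for these markers would have to behave; the phase names and the mapping are assumptions drawn from the bullets above, not an actual test-infra API:

```python
# Illustrative only: what a generic marker means depends on where we are
# in the release cycle. current_minor is the minor currently in development.
def resolve_generic_marker(marker: str, phase: str, current_minor: int) -> str:
    """Map a generic marker to the branch it effectively points at."""
    if marker == "k8s-master":
        return "master"  # unambiguous: always tracks master
    if marker == "k8s-beta":
        if phase == "pre-branch-cut":
            return "master"  # indistinguishable from k8s-master here
        # post-branch-cut and post-release: the newest release branch
        return f"release-1.{current_minor}"
    if marker.startswith("k8s-stable"):
        n = int(marker[len("k8s-stable"):])
        return f"release-1.{current_minor - n}"  # stable1 = newest stable
    raise ValueError(f"unknown marker: {marker}")

# During the 1.17 cycle (post-branch-cut):
#   resolve_generic_marker("k8s-beta", "post-branch-cut", 17)    -> "release-1.17"
#   resolve_generic_marker("k8s-stable1", "post-branch-cut", 17) -> "release-1.16"
```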

Manually created jobs using generic version markers can be inaccurate

Non-generated jobs using generic version markers do not get the same level of
scrutiny as ones that are generated via
releng/test_config.yaml.

This leads to discrepancies between the versions presumed to be under test
and the versions that may be displayed in testgrid.

ci-kubernetes-e2e-gce-beta-stable1-gci-kubectl-skew is a great example:

https://github.com/kubernetes/test-infra/blob/96e08f4be2a86189f59c72055785f817ac346d30/config/jobs/kubernetes/sig-cli/sig-cli-config.yaml#L85-L112

All variants of that prowjob have landed on the sig-release-job-config-errors
dashboard for various misconfiguration issues that are the result of generic
version markers.
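
One cheap mitigation (a sketch only, not existing tooling) would be a lint that flags hand-maintained job configs still using generic markers; the --extract flag syntax mirrors the linked sig-cli-config.yaml:

```python
# Hypothetical lint: flag hand-written prowjob configs whose --extract
# flags reference generic version markers. Not an existing test-infra tool.
import re
import sys

GENERIC = re.compile(r"--extract=ci/k8s-(master|beta|stable[1-9])")

def lint(path: str) -> None:
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if GENERIC.search(line):
                print(f"{path}:{lineno}: generic marker: {line.strip()}")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        lint(p)
```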


I'd like to establish a rough plan of record to continue iteratively fixing some of these issues.

Plan of record


Previous Issues

linux/amd64 version markers are colliding with cross builds

(Fixed in kubernetes/test-infra#18290.)

"Fast" (linux/amd64-only) builds run every 5 minutes, while cross builds run
every hour.
They also write to the same version markers (latest.txt,
latest-<major>.txt, latest-<major>.<minor>.txt).

The Kubernetes build jobs have a mechanism for checking if a build already
exists and will exit early to save on test cycles.

What this means is that if a "fast" build has already happened for a commit,
then the corresponding cross build will exit without building.

This has been happening pretty consistently lately, so cross build consumers
are using much older versions of Kubernetes than intended.

(Note that this condition only happens on master.)
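
The exit-early check amounts to something like this (a sketch of the logic described above; names are illustrative):

```python
# Sketch of the exit-early logic described above. The fast
# (linux/amd64-only) job runs every 5 minutes and the cross build only
# hourly, so the fast job usually publishes the marker first -- and the
# cross build then skips, producing no cross artifacts for that commit.
def should_build(commit_build_version: str, published_marker: str) -> bool:
    """Return False when the marker already records this commit's build."""
    return commit_build_version != published_marker
```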

Cross builds are stored in a separate GCS bucket

(Fixed in kubernetes/test-infra#14030.)

This makes long-term usage of cross builds a little more difficult, since
scripts utilizing version markers tend to consider only the version marker
filename, while the GCS bucket name remains unparameterized.
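
A sketch of the fix direction: make the bucket an explicit parameter next to the marker filename, so cross-build consumers can't silently read the wrong bucket (bucket names here are illustrative assumptions):

```python
# Sketch: parameterize the bucket alongside the marker filename rather
# than hard-coding it. Bucket names below are illustrative assumptions.
from urllib.request import urlopen

def resolve_build(marker: str, bucket: str) -> str:
    url = f"https://storage.googleapis.com/{bucket}/ci/{marker}"
    with urlopen(url) as resp:
        return resp.read().decode("utf-8").strip()

# fast builds:  resolve_build("latest.txt", "kubernetes-release-dev")
# cross builds: resolve_build("latest.txt", "kubernetes-release-dev-cross")
```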

Generated jobs may not represent intention

(Fixed in kubernetes/test-infra#15564.)

As the generic version markers can shift throughout the release cycle, every
time we regenerate jobs, they may not represent what we intend to test.

The best examples of this are pretty much every job using the k8s-beta
version marker, and more specifically, skew and upgrade jobs.

bazel version markers appear to be unused

(Fixed in kubernetes/test-infra#15612.)

ref: kubernetes/test-infra#15106

/assign
/area release-eng
/priority important-longterm
/milestone v1.17

@k8s-ci-robot k8s-ci-robot added the area/release-eng Issues or PRs related to the Release Engineering subproject label Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 5, 2019
@tpepper
Member

tpepper commented Nov 6, 2019

/cc

@justaugustus justaugustus modified the milestones: v1.17, v1.18 Dec 4, 2019
@justaugustus justaugustus added the sig/release Categorizes an issue or PR as relevant to SIG Release. label Dec 9, 2019
@justaugustus
Member Author

justaugustus commented Dec 9, 2019

To respond to @spiffxp's comment on the Branch Management issue:

I would suggest that branch management is the role that should handle "what 'channel' (beta/stable1/stable2/stable3) corresponds to which version?"

Agreed. Now codified in the Branch Management handbook.

  • the beta channel should only exist after the release-1.y branch is cut, and be unused after the v1.y.0 release is cut (aka during the period that builds being cut have the word beta in them, and the branch manager is running branchff and handling cherry-picks prior to the .0 release)
  • the stableN versions are moved forward after the v1.y.0 release is cut, so that stable1 refers to v1.y.0, the most recent stable release, stable2 refers to v1.y-1.0, the previous stable release, etc. (sketched below)
  • release teams have forgotten to do this last part since 1.11 (ref: kubernetes/test-infra#13577 (comment)), so we're in a state where the channels don't mean what they should
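
The stableN advance above is simple arithmetic. A minimal sketch (illustrative names, not a real API):

```python
# Illustration of the channel-advance rule quoted above: after v1.y.0
# ships, each stableN marker moves forward one minor. Purely arithmetic.
def stable_channels(latest_released_minor: int) -> dict:
    """stable1 -> newest stable release, stable2 -> the one before, etc."""
    return {
        f"k8s-stable{n}": f"v1.{latest_released_minor - (n - 1)}.0"
        for n in (1, 2, 3)
    }

# stable_channels(17) ->
# {"k8s-stable1": "v1.17.0", "k8s-stable2": "v1.16.0", "k8s-stable3": "v1.15.0"}
```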

other ideas include:

I liked this idea and actually mentioned this to @tpepper after bumping into kubernetes/test-infra#15514.

My thought here is that creating the new release-branch jobs immediately after the final patch release would mean turning down CI on the last-supported branch much sooner, and would give us time to watch the stability of the new release-branch jobs.

This comes at the cost of more branch fast-forwards and more cherry picks.

Do we think that's worth it?

I'm not a fan of this nomenclature, especially because it has consistently caused confusion and inconsistency around what's under test at any given point in the release cycle.

Are there any glaring things that we'd need to look out for going down this route?

@spiffxp
Member

spiffxp commented Dec 9, 2019

This comes at the cost of more branch fast-forwards and more cherry picks.

I don't think it would cause more cherry-picks? Those don't start happening until after code freeze.

I was envisioning that alphas would get cut off of the release-1.y branch with this approach, and that master's version wouldn't bump until after code freeze. This is different from today, where master's version bumps as soon as the release branch is cut.

Are there any glaring things that we'd need to look out for going down this route?

We'll "lose" historical data for jobs on our dashboards (testgrid, triage, velodrome, etc), since none of them comprehend job renames or moves. Early in the release cycle is probably the best time to induce such a gap.

Outside of that I suspect it's not glaring things, just lots of tiny renames. @Katharine might be able to better explain what prevented us from moving ahead with the rename in kubernetes/test-infra#12516.

@justaugustus
Member Author

I don't think it would cause more cherry-picks? Those don't start happening until after code freeze.

@spiffxp -- Good point. This was mushy brain from triaging other stuff.

We'll "lose" historical data for jobs on our dashboards (testgrid, triage, velodrome, etc), since none of them comprehend job renames or moves. Early in the release cycle is probably the best time to induce such a gap.

I think I'm fine with losing some historical data if it leads to ease of management for the team over time.

@kubernetes/release-engineering -- What are your thoughts on this?

@justaugustus
Member Author

Some discussion in Slack here: https://kubernetes.slack.com/archives/C09QZ4DQB/p1576104279099300
...and I'm poking at the version markers and release job generation here: kubernetes/test-infra#15564

@justaugustus
Member Author

justaugustus commented Jan 13, 2020

Here's another instance of wrestling with version markers being a general nightmare: kubernetes/test-infra#15875

That PR should've been at most a few commits.

This cycle I'm going to be looking at renaming the release-branch jobs that reference beta, stable{1,2,3} and removing the generic suffix annotations. That should be an easy-ish way to get started.

From there, we'll need to look at refactoring generate_tests.py and test_config.yaml.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 12, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus justaugustus reopened this Jul 1, 2020
@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jul 1, 2020
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus
Member Author

/reopen
My old friend continues in kubernetes/test-infra#28079...

@k8s-ci-robot k8s-ci-robot reopened this Nov 22, 2022
@k8s-ci-robot
Contributor

@justaugustus: Reopened this issue.

In response to this:

/reopen
My old friend continues in kubernetes/test-infra#28079...

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus
Member Author

k8s-beta rears its head again in the 1.26 release.
See this unmerged PR as an example of fixes: kubernetes/test-infra#26028

@xmudrii is on the right track by opening this issue for k8s-stable4 marker support as a stopgap: #2094

I'm happy for someone else to attempt closing this, but it may make just as much sense for me to tackle it, given the context.

@justaugustus
Member Author

The plan in the description may still be the right path forward, but proceed with caution and ask questions, as it's been a while since I updated it.

Plan of record

@justaugustus justaugustus added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 22, 2022
@justaugustus justaugustus modified the milestones: v1.21, v1.26 Nov 22, 2022
@xmudrii
Member

xmudrii commented Nov 22, 2022

@justaugustus -- I'd be happy to help with this! Let me take a look into the current situation and then we can sync about this if needed.

@cici37
Contributor

cici37 commented Nov 22, 2022

I would be happy to help with this work. Please let me know if help is needed :)

@justaugustus
Member Author

Sounds good. Please work together on this, Cici + Marko!
/assign @cici37 @xmudrii
/unassign

@justaugustus
Member Author

(Linking the k8s-stable4 issue that @xmudrii opened: #2094)
