Kubernetes CI Policy: release-blocking jobs must run in dedicated cluster #18549

Closed
27 tasks done
spiffxp opened this issue Jul 30, 2020 · 13 comments
Labels

  • area/jobs
  • area/release-eng: Issues or PRs related to the Release Engineering subproject
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

@spiffxp
Member

spiffxp commented Jul 30, 2020

Part of #18551

Why it's necessary:

  • we believe that jobs declared with Guaranteed Pod QoS may not be protected against BestEffort or Burstable Pods hogging all resources on the same node
  • a cluster that only has Guaranteed pods is far more likely to respect resource requirements
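
For reference, a minimal sketch of how Kubernetes derives a Pod's QoS class from its containers' requests and limits. The `Resources` type and `QOSClass` function are hypothetical stand-ins written for this illustration (not the Kubernetes API types), and quantities are compared as plain strings rather than parsed.

```go
package main

import "fmt"

// Resources is a simplified, hypothetical stand-in for one container's
// resource requests and limits (resource name -> quantity string).
type Resources struct {
	Requests map[string]string
	Limits   map[string]string
}

// QOSClass approximates how Kubernetes assigns a Pod QoS class.
// Quantities are compared as plain strings, so "1" and "1000m" are
// treated as different here; real code would parse them first.
func QOSClass(containers []Resources) string {
	anySet := false
	guaranteed := len(containers) > 0
	for _, c := range containers {
		if len(c.Requests) > 0 || len(c.Limits) > 0 {
			anySet = true
		}
		for _, name := range []string{"cpu", "memory"} {
			limit, hasLimit := c.Limits[name]
			request, hasRequest := c.Requests[name]
			// A missing request defaults to the limit; a missing limit,
			// or a request that differs from it, rules out Guaranteed.
			if !hasLimit || (hasRequest && request != limit) {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	job := []Resources{{
		Requests: map[string]string{"cpu": "4", "memory": "8Gi"},
		Limits:   map[string]string{"cpu": "4", "memory": "8Gi"},
	}}
	fmt.Println(QOSClass(job)) // prints "Guaranteed"
}
```

A job only lands in the Guaranteed class when every container sets cpu and memory limits with matching requests, which is why a single container without limits is enough to make a Pod Burstable.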

What decisions do we need to make:

  • For all of the jobs being migrated to k8s-infra-prow-build, we're going to use the same node pool as everything else. Pinning to a dedicated node pool would require much more boilerplate (or possibly augmenting prow's preset feature). If, after migrating everything, we find we do need a dedicated node pool, then we'll pay the added cost.
  • Jobs that can't be migrated quickly will remain in google.com-owned k8s-prow-builds until we can overcome whatever obstacles are keeping them there. This means no community visibility into resource consumption, and test-infra-oncall will have to be relied upon.

Boskos projects to be added:

Jobs to be migrated

TODO:

@BenTheElder
Member

NodePool is probably fine, though it has the downside that you have to explicitly set a matching node selector and taint toleration in every prowjob podspec if you want to actually make it dedicated. there's no abstraction for this, and they're a little verbose.

we've done this before when experimenting with ubuntu nodes for kind IPv6 before we'd standardized on that.
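
To make the verbosity concrete, here is a rough sketch of the two pieces each podspec would need: a node selector and a matching toleration. The `PodSpec` and `Toleration` types are simplified stand-ins that mirror the `nodeSelector` and `tolerations` fields of a Kubernetes pod spec, and the `dedicated` key and the pool name are assumptions made up for this example, not values from the thread.

```go
package main

import "fmt"

// Toleration and PodSpec are simplified stand-ins that mirror the
// nodeSelector and tolerations fields of a Kubernetes pod spec.
type Toleration struct {
	Key, Operator, Value, Effect string
}

type PodSpec struct {
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// poolKey is a hypothetical label/taint key; use whatever key the
// dedicated node pool is actually labeled and tainted with.
const poolKey = "dedicated"

// PinToPool adds the node selector and the matching toleration.
// Both are needed: the selector keeps the pod on the pool, and the
// taint (which this toleration matches) keeps other pods off it.
func PinToPool(spec *PodSpec, pool string) {
	if spec.NodeSelector == nil {
		spec.NodeSelector = map[string]string{}
	}
	spec.NodeSelector[poolKey] = pool

	want := Toleration{Key: poolKey, Operator: "Equal", Value: pool, Effect: "NoSchedule"}
	for _, t := range spec.Tolerations {
		if t == want {
			return // already tolerated; keep the call idempotent
		}
	}
	spec.Tolerations = append(spec.Tolerations, want)
}

func main() {
	spec := &PodSpec{}
	PinToPool(spec, "release-blocking") // hypothetical pool name
	fmt.Printf("%+v\n", *spec)
}
```

Without an abstraction like this in prow itself, the equivalent YAML has to be repeated in every job definition.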

@spiffxp
Member Author

spiffxp commented Aug 1, 2020

/sig testing
/sig release
/wg k8s-infra
/area release-eng
/area jobs

@spiffxp
Member Author

spiffxp commented Aug 8, 2020

Current status (not counting issues that are held open to monitor jobs)

spiffxp@spiffxp-macbookpro:test-infra (master %)$ go test -v -count=1 ./config/tests/jobs
# ...
=== RUN   TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build-1-19: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build-fast: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build-stable1: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build-stable2: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: ci-kubernetes-build-stable3: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-build-1-16: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-build-1-17: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-build-1-18: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-build-1-19: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-build-master: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-test-1-16: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-test-1-17: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-test-1-18: should run on cluster: k8s-infra-prow-build, found: default
    TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild: jobs_test.go:1084: periodic-kubernetes-bazel-test-1-19: should run on cluster: k8s-infra-prow-build, found: default
--- PASS: TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild (0.02s)

Will need to decide how we'd like to define exceptions for the build jobs, and/or dedicate resources to them.
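
One possible shape for such a check with exceptions, as a sketch only. This is not the code in jobs_test.go: the `Job` type, the `ReleaseBlocking` flag, and the exceptions map are hypothetical, and the thread had not yet settled on how exceptions should be defined. Only the message format is taken from the output above.

```go
package main

import "fmt"

// Job is a hypothetical, pared-down view of a prowjob config entry.
type Job struct {
	Name            string
	Cluster         string
	ReleaseBlocking bool
}

const wantCluster = "k8s-infra-prow-build"

// CheckClusters returns one message per release-blocking job that is
// neither on wantCluster nor listed in exceptions.
func CheckClusters(jobs []Job, exceptions map[string]bool) []string {
	var problems []string
	for _, j := range jobs {
		if !j.ReleaseBlocking || exceptions[j.Name] || j.Cluster == wantCluster {
			continue
		}
		problems = append(problems, fmt.Sprintf(
			"%s: should run on cluster: %s, found: %s", j.Name, wantCluster, j.Cluster))
	}
	return problems
}

func main() {
	jobs := []Job{
		{Name: "ci-kubernetes-build", Cluster: "default", ReleaseBlocking: true},
		{Name: "ci-kubernetes-build-fast", Cluster: "default", ReleaseBlocking: true},
	}
	// Hypothetical exception list, for illustration only.
	exceptions := map[string]bool{"ci-kubernetes-build-fast": true}
	for _, p := range CheckClusters(jobs, exceptions) {
		fmt.Println(p)
	}
}
```

An explicit allowlist keeps exceptions visible in review, at the cost of one more list to maintain as jobs are migrated.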

ameukam added a commit to ameukam/test-infra that referenced this issue Oct 29, 2020
Duplicate of ci-ingress-gce-e2e that validates the job can run on
k8s-infra-prow-build.

Ref: kubernetes/k8s.io#1093
Part of : kubernetes#18549

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
@justaugustus
Member

Assigning Dan to TL completion of this.
/assign @hasheddan

@spiffxp
Member Author

spiffxp commented Nov 5, 2020

Updated description to enumerate all release-blocking jobs and the umbrella issues tracking their progress or the PRs that migrated them

ameukam added a commit to ameukam/test-infra that referenced this issue Nov 10, 2020
Ref : kubernetes#19483
Part of: kubernetes#18549

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
@spiffxp
Member Author

spiffxp commented Jan 8, 2021

Current status...

spiffxp@spiffxp-macbookpro:test-infra (master %)$ go test -v -count=1 ./config/tests/jobs/...
# ...
=== RUN   TestKubernetesReleaseBlockingJobsShouldRunOnK8sInfraProwBuild
    jobs_test.go:1049: ci-kubernetes-build: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: ci-kubernetes-build-1-19: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: ci-kubernetes-build-1-20: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: ci-kubernetes-build-stable2: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: ci-kubernetes-build-stable3: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: periodic-kubernetes-bazel-build-1-17: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: periodic-kubernetes-bazel-build-1-18: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: periodic-kubernetes-bazel-build-1-19: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: periodic-kubernetes-bazel-build-1-20: should run on cluster: k8s-infra-prow-build, found: default
    jobs_test.go:1049: periodic-kubernetes-bazel-build-master: should run on cluster: k8s-infra-prow-build, found: default

We have "canary" versions of these jobs that run on k8s-infra-prow-build, and write to gs://k8s-release-dev.

However, there is much out there that depends on the existing jobs that write to gs://kubernetes-release-dev and gcr.io/kubernetes-ci-images (ref: #19483 (comment)). I'm considering a KEP for deprecation/migration/removal.

I think full completion of the deprecation process is out of scope for this issue. So I would like us to identify what the minimum set of jobs/scripts/etc should be pulling from gs://k8s-release-dev to call this done.
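
As a rough aid for that inventory, a small sketch that reports which lines of a script or job config still mention the legacy locations named above. `LegacyRefs` and the sample script content are made up for this example; a real audit would also need to walk files and handle locations assembled from variables.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyLocations are the google.com-hosted locations named in the thread.
var legacyLocations = []string{
	"gs://kubernetes-release-dev",
	"gcr.io/kubernetes-ci-images",
}

// LegacyRefs returns, per legacy location, the 1-based numbers of the
// lines in text that mention it.
func LegacyRefs(text string) map[string][]int {
	refs := map[string][]int{}
	for i, line := range strings.Split(text, "\n") {
		for _, loc := range legacyLocations {
			if strings.Contains(line, loc) {
				refs[loc] = append(refs[loc], i+1)
			}
		}
	}
	return refs
}

func main() {
	// Made-up script content, for illustration only.
	script := "gsutil ls gs://kubernetes-release-dev/ci\n" +
		"gsutil ls gs://k8s-release-dev/ci\n" +
		"docker pull gcr.io/kubernetes-ci-images/example"
	for loc, lines := range LegacyRefs(script) {
		fmt.Println(loc, lines)
	}
}
```

Lines that reference gs://k8s-release-dev are not reported, since that name is not a substring of either legacy location.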

@spiffxp
Member Author

spiffxp commented Feb 2, 2021

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 2, 2021
@spiffxp
Member Author

spiffxp commented Feb 8, 2021

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 8, 2021
@spiffxp
Member Author

spiffxp commented Feb 23, 2021

I think the following PRs can close this out:

@spiffxp
Member Author

spiffxp commented Feb 23, 2021

/assign
/assign @justaugustus
Can you think of anything else that needs to happen to call this done?

@spiffxp
Member Author

spiffxp commented Feb 24, 2021

Tangentially related: followup work to deprecate google-hosted artifacts related to release-blocking builds will be tracked under kubernetes/k8s.io#1571

@spiffxp
Member Author

spiffxp commented Feb 24, 2021

/close
Calling this done!

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

5 participants