
RFC: Move boskos testing projects pool to kubernetes.io #390

Open
cblecker opened this issue Oct 5, 2019 · 8 comments

Comments

@cblecker
Member

commented Oct 5, 2019

I'd like to start looking at moving the boskos pool over to public-owned projects.

Things I see on the surface we'd need to do:

  • Template to create the projects
  • Understand what the "bare" state of these projects is
  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)
  • What quotas do we need to request for the projects
  • Probably other unknown things

cc: @kubernetes/test-infra-admins

@stevekuznetsov

commented Oct 7, 2019

/retitle RFC: Move boskos testing projects pool to kubernetes.io

@k8s-ci-robot changed the title from "RFC: Move bozkos testing projects pool to kubernetes.io" to "RFC: Move boskos testing projects pool to kubernetes.io" Oct 7, 2019
@BenTheElder

Member

commented Oct 7, 2019

  • Template to create the projects

roughly it's just projects with the CI service account and some admins having access, and quota depending on which pool they are going into

  • Understand what the "bare" state of these projects is

Literally bare. No resources. Just a namespace w/ quota

  • Who needs what permissions to these projects (should we set up an RBAC group for on-call? what service accounts do we need for prow/boskos/janitor?)

Some humans should have backup access, but primarily the CI service account needs access.

That would be pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com (this is visible from the CI logs)

In the future this should be some service account from a publicly owned prow.

  • What quotas do we need to request for the projects

Each boskos pool is defined by the kind of quota present. I don't think the GCP non-GKE pool is particularly special (and the GKE pool should be managed by GKE...)

There are also pools for e.g. GPU testing, which need quota for that, and I think for scale testing (which of course needs more of basically all resources)

  • Probably other unknown things

We should consider the fact that the state of "is this project available" is in CRDs in the build cluster.

@BenTheElder

Member

commented Oct 7, 2019

... continuing (accidentally hit enter)

As long as the state is in the build cluster, that means to switch prow over we'll either have serious disruption (need to spin down the pool) or need a whole new pool.

Humans generally have no need to access these projects, so in terms of getting the community access to the infra, the boskos projects are uninteresting: they're ~100% controlled by automation via public config already.

In terms of spending CNCF GCP credits, they're somewhat more interesting I suppose, if that's what we're going for.

If we're interested in just getting things migrated because we should migrate things, it will be much more useful to migrate boskos along with the management and state and generally replace the legacy prow.k8s.io service accounts etc. (can you tell by the jenkins in the name?) ...

@dims

Member

commented Oct 16, 2019

/assign @thockin

@thockin

Member

commented Oct 16, 2019

This seems like something we can and should enable ASAP. Christoph started with great questions. I'd like to add a couple:

How can we break down the billing or attribution for this? With a single big pool and a single CI service account, I have no idea who spent what money on what things. I think we need to do better than this.

  • Can we use a service account for each coarse "purpose"?
  • Can we use a distinct pool of projects for each coarse purpose?
  • Both?

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Can quota requests be automated?

Who owns this, that we can have this conversation?

The net result of this is probably a script which ensures the requisite projects exist and have the correct IAM for the appropriate CI SA, plus a link to docs explaining what they are for. That alone seems straightforward, but without an owner to drive it, I don't think we can reasonably do much.
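
A minimal sketch of what such a script could look like, written as a dry-run planner in Python: it only builds the `gcloud` command lines and does not run them. The project IDs, the `roles/editor` role, and the helper names `plan_project` and `plan_pool` are assumptions for illustration. The service account is the one named earlier in this thread.

```python
from typing import Iterable, List, Set

# The CI service account mentioned earlier in this thread.
CI_SERVICE_ACCOUNT = "pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"
# Assumption: placeholder role; the thread does not say which role CI needs.
CI_ROLE = "roles/editor"


def plan_project(project_id: str, exists: bool) -> List[List[str]]:
    """Return the gcloud commands that bring one project to the desired state."""
    commands: List[List[str]] = []
    if not exists:
        commands.append(["gcloud", "projects", "create", project_id])
    # Always (re)apply the binding so that IAM drift gets corrected.
    commands.append([
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{CI_SERVICE_ACCOUNT}",
        f"--role={CI_ROLE}",
    ])
    return commands


def plan_pool(project_ids: Iterable[str], existing: Set[str]) -> List[List[str]]:
    """Plan every project in a pool, given the set of project IDs that already exist."""
    plan: List[List[str]] = []
    for project_id in project_ids:
        plan.extend(plan_project(project_id, project_id in existing))
    return plan


if __name__ == "__main__":
    # Hypothetical project IDs, not real boskos projects.
    pool = ["k8s-boskos-example-01", "k8s-boskos-example-02"]
    for command in plan_pool(pool, existing={"k8s-boskos-example-01"}):
        print(" ".join(command))
```

Keeping the planner separate from execution means the output can be reviewed (or diffed in CI) before anything touches real projects. Quota requests are deliberately left out, since it is not yet clear they can be automated.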

@BenTheElder

Member

commented Oct 17, 2019

Can we use a service account for each coarse "purpose"?

We can but the CI users will need to correctly activate their unique service account.

These service accounts need to make their way into Prow, and not much prevents someone from using the wrong SA (the Prow cluster is so old, and has been upgraded in place for so long, that it doesn't have RBAC...)

Older style "bootstrap.py" prowjobs (don't worry about the details, most of our CI jobs are these though...) automagically activate a default service account before we get to testing.

Can we use a distinct pool of projects for each coarse purpose?

Yes, we have a few pools today. The GCP projects are monitored here, showing a few types (e.g. GPU):
http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1

The full set of resources (including AWS) is here: https://github.com/kubernetes/test-infra/blob/d8449cb095fb6dc791958bbaf8940c7c1007410c/prow/cluster/boskos-resources.yaml

The biggest trick is just figuring out what a distinct use is and carving these up...
Unfortunately a ton of our CI tests are relatively ownerless so this may be tricky.

Most tests use the generic GCE pool but they don't have to.

Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Prow runs O(12000) builds/tests a day. If only 25% of those are GCP e2e, that is ~3000 runs a day, so we'd churn through ~300 projects a day at 10 uses before retiring. I think this probably wouldn't scale.
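
The estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes every GCP e2e run leases exactly one project; the function name is made up for illustration.

```python
import math


def projects_retired_per_day(builds_per_day: int,
                             gcp_e2e_fraction: float,
                             uses_before_retire: int) -> int:
    """Estimate how many projects reach their use limit each day.

    Assumption: each GCP e2e run leases exactly one project.
    """
    e2e_runs = builds_per_day * gcp_e2e_fraction
    return math.ceil(e2e_runs / uses_before_retire)


if __name__ == "__main__":
    print(projects_retired_per_day(12000, 0.25, 10))  # 300
    print(projects_retired_per_day(12000, 0.25, 1))   # 3000
```

With single-use projects the churn rises to roughly 3000 projects a day, which is why the retirement threshold matters so much here.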

Can quota requests be automated?

I took a quick look now and didn't see an API, but I'm not sure.

Who owns this, that we can have this conversation?

  • boskos the tool? => I might have an answer, but waiting for confirmation
  • or this migration? ... unsure
  • the prow.k8s.io deployment? => the test-infra maintainers / Google engprod team, nominally, at the moment; the infra runs in the build / test workload cluster.

@thockin

Member

commented Oct 17, 2019

@BenTheElder

Member

commented Oct 17, 2019

As we consider moving prow into community space, we will HAVE to get a better story around this.

Agreed. I'm certainly not thrilled about the current state...

That said, I generally don't think we can consider the presubmit testing to be trustworthy, and scheduling with boskos is cooperative. Changing that would be a bit involved.

Cull the herd?

Yes and no.

A lot of valuable signal shouldn't be culled imo, but still doesn't have a clear owner 😞 (e.g. who owns the periodic integration and unit testing ...?)

We probably need to enforce ownership better somehow. I'm not sure how.

I'd like to set the objective at EVERY test identifies which pool it belongs to and then as needed we can split those pools to better indicate the prime spenders.

We can do that incrementally with the new community-owned pools we set up. I have no idea what the right granularity would be, though.
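
One way to approach this incrementally is an audit that counts jobs per pool and lists jobs that declare none. The sketch below is purely illustrative: the job records and the `pool` field are hypothetical and do not correspond to an actual Prow job config key.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical job records; "pool" is an illustrative field, not a real Prow key.
JOBS: List[Dict[str, str]] = [
    {"name": "example-e2e-gce", "pool": "example-gce-pool"},
    {"name": "example-e2e-gce-slow", "pool": "example-gce-pool"},
    {"name": "example-e2e-gpu", "pool": "example-gpu-pool"},
    {"name": "example-e2e-unlabeled"},
]


def audit_pools(jobs: List[Dict[str, str]]) -> Tuple[Counter, List[str]]:
    """Count jobs per declared pool and collect the names of jobs with no pool."""
    per_pool: Counter = Counter()
    missing: List[str] = []
    for job in jobs:
        pool = job.get("pool")
        if pool:
            per_pool[pool] += 1
        else:
            missing.append(job["name"])
    return per_pool, missing


if __name__ == "__main__":
    per_pool, missing = audit_pools(JOBS)
    for pool, count in sorted(per_pool.items()):
        print(f"{pool}: {count} job(s)")
    print("jobs without a pool:", missing)
```

A pool with a disproportionate job count is a candidate for splitting, and the list of unlabeled jobs doubles as a list of tests that still need an owner.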

A project takes 30-45 seconds to create.

... that is a lot faster than I thought. If we can get this to work, that would be a neat trick! 🙃

Yes

ACK, I'm hoping for an official "stepping up to the plate" in the next couple of days ... will circle back. @sebastienvas may serve as a transitional owner (previously worked on this).

Yes

ACK ... I can certainly help, I'm also hoping for more help though, perhaps @dims who raised this :-)
