document stance on allow-listing registry.k8s.io traffic #122

Closed
BenTheElder opened this issue Oct 25, 2022 · 11 comments · Fixed by #124

Comments

@BenTheElder
Member

We cannot afford to commit to the backing endpoints and implementation details of registry.k8s.io being stable; the project needs to be able to take advantage of whatever resources are available to us at any given point in time in order to keep the project afloat.

As it stands, we're very close to running out of funds and effectively in an emergency state: we are exceeding our $3M/year GCP credits, with container image hosting a massively dominant cost at more than 2/3 of our spend. Even in the future, when we shift traffic to other platforms using the registry.k8s.io system, we need to remain flexible and should not commit to specific backing details.

E.g., we may be receiving new resources from other vendors, following the current escalation with the CNCF / Governing Board.

We should clearly and prominently document an explicit stance on this, bolded, in this repo's README (the README we point https://registry.k8s.io to).

We've already had requests to document the exact list of endpoints to allowlist, which is not an expectation we can sustain.

We should also consider giving pointers regarding how end-users can run their own mirrors to:

  • insulate themselves from shifting implementation details of registry.k8s.io affecting their egress allow-lists
  • reduce costs to the project
  • improve reliability for their own clusters (i.e. not depend on the uptime of a volunteer-staffed free registry)
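
For illustration only, here is a minimal sketch (not project-endorsed tooling) of pre-populating a private mirror using go-containerregistry's crane package; the destination registry and the image list are assumptions, and clusters would then be pointed at the mirror (e.g. via containerd/CRI registry mirror configuration) instead of registry.k8s.io:

```go
// Hypothetical sketch: copy a known set of images from registry.k8s.io into a
// private mirror so clusters never need egress to the upstream backends.
// The destination registry and image list below are assumptions.
package main

import (
	"log"
	"strings"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Images your clusters actually pull; these tags are only examples.
	images := []string{
		"registry.k8s.io/pause:3.9",
		"registry.k8s.io/coredns/coredns:v1.10.1",
	}
	const mirror = "registry.example.internal/k8s" // hypothetical private registry

	for _, src := range images {
		// registry.k8s.io/<repo>:<tag> -> <mirror>/<repo>:<tag>
		dst := mirror + strings.TrimPrefix(src, "registry.k8s.io")
		if err := crane.Copy(src, dst); err != nil {
			log.Fatalf("copying %s to %s: %v", src, dst, err)
		}
		log.Printf("mirrored %s -> %s", src, dst)
	}
}
```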

/sig k8s-infra
/priority important-soon
/kind documentation

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/documentation Categorizes issue or PR as related to documentation. labels Oct 25, 2022
@BenTheElder BenTheElder self-assigned this Oct 25, 2022
@hh
Member

hh commented Oct 25, 2022

/cc

@upodroid
Member

upodroid commented Oct 25, 2022

End users get to pull images off the internet for free. For the 0.1% of our users who run K8s on networks with restricted egress, we will share the endpoints from which our images will be served for a particular source IP, and give 30 days' notice (or some other arbitrary window) if they change. If that doesn't work for customer X, then they should run mirrors at their own cost. You shouldn't be complaining about services offered for free.

We can make some uptime guarantees (99.5%) with the tradeoff that we can serve the images from wherever we want and we provide a pre-agreed notice period.

@BenTheElder
Member Author

BenTheElder commented Oct 26, 2022

I don't think we should make any timing guarantees or uptime guarantees. This is free and barely staffed or funded.

At any point if we're ready to take advantage of new infra, we should be free to do so.
Users that are sensitive to these changes due to enterprise compliance etc. should simply host their own mirror, with guaranteed uptime and stable implementation details and API endpoints.

We may simply not be able to guarantee even IP addresses. If users need to have extremely tight restrictions on this, they need to sort that out themselves going forward.

I don't think other free OSS package hosts commit to guarantees like this.

@BenTheElder
Member Author

cc @dims @ameukam @spiffxp @thockin (k8s infra chairs + leads)

@dims
Member

dims commented Oct 26, 2022

I agree, Ben.

@ameukam
Member

ameukam commented Oct 26, 2022

The project and the infrastructure are maintained by volunteers at the moment. We should not provide SLAs for the public workloads we host.

I agree with the proposal.

@thockin
Member

thockin commented Oct 26, 2022

I think this is the only reasonable stance we can take. That said, I like the idea of "giving notice". We can't reach every user or fix it for them, but we CAN give warning.

What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend? Could be a mailing list or a git repo or a static URL or something that can be monitored. For users who need to know the full set, they can pay attention.

This shifts most of the onus back to those users who know best what they specifically need.

It doesn't need to be IPs (can't be), just the set of hostnames that the proxy might redirect them to for blob backends.
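
To make the idea concrete, here's a rough sketch of how a user could consume such a list, assuming a hypothetical static URL that publishes one backend hostname per line (neither the URL nor the format is anything we've committed to):

```go
// Hypothetical sketch: fetch a published backend-hostname list and report any
// hostnames missing from a local egress allow-list. The URL and the
// one-hostname-per-line format are assumptions, not a committed interface.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	const listURL = "https://example.invalid/registry-backends.txt" // hypothetical

	resp, err := http.Get(listURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	published, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Local allow-list, one hostname per line.
	local, err := os.ReadFile("allowlist.txt")
	if err != nil {
		panic(err)
	}
	allowed := map[string]bool{}
	for _, h := range strings.Fields(string(local)) {
		allowed[h] = true
	}

	// Flag any published backend hostname not yet in the local allow-list.
	for _, h := range strings.Fields(string(published)) {
		if !allowed[h] {
			fmt.Println("not in allow-list:", h)
		}
	}
}
```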

What think?

@BenTheElder
Member Author

> What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend? Could be a mailing list or a git repo or a static URL or something that can be monitored. For users who need to know the full set, they can pay attention.

Maybe, however ...

I can't find any precedent for this (other than the implicit, undocumented expectation we unintentionally created around k8s.gcr.io / gcr.io/google-containers).

Docker Hub, PyPI, crates.io, the Go module proxy, ... none of them appear to do anything remotely like documenting the required endpoints and waiting some period of time before rolling out new ones.

I'm not sure we should go out of our way to create a new precedent, which would then restrict our ability to roll out optimizations (and not just cost optimizations; also things like serving out of new regions both to limit cross-region traffic and to reduce latency) and would add additional work for maintaining the registry. We've already added and removed regions multiple times as we've built this out.

@aojea
Member

aojea commented Oct 26, 2022

> What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend?

Is the notification mechanism not going to have the same understaffing problem? If that happens, having a notification channel that doesn't work is worse than having nothing, IMHO.

@jhoblitt

It seems that an availability guarantee for "backend" endpoints is an anti-feature, as:

  1. We don't want end-consumers to start depending directly on a specific endpoint. The odds are that if a deployment is using a specific endpoint, the dependency will have a lifetime of weeks to years.
  2. It ties our hands operationally for doing load shedding or even taking a misbehaving endpoint offline.
  3. In the case where a client has spent minutes to hours trying to pull down a layer over a poor connection, the end-user should already have a strong tolerance for downloads timing out.

@BenTheElder
Member Author

Drafted something here: #124
