
Age Out Old Registry Content #144

Closed
BenTheElder opened this issue Feb 1, 2023 · 17 comments
Labels
committee/steering Denotes an issue or PR intended to be handled by the steering committee. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/release Categorizes an issue or PR as relevant to SIG Release.

Comments

@BenTheElder
Member

We have migrated forward all existing images since the beginning of hosting Kubernetes images, from gcr.io/google-containers to k8s.gcr.io to registry.k8s.io.

We should consider establishing an intended retention policy for images hosted by registry.k8s.io, and communicating that early.

Not retaining all images indefinitely could help the container-image-promoter avoid dealing with an indefinitely growing set of images to synchronize and sign, and may also have minor hosting-cost benefits.

Even if we set a very lengthy period, we should probably consider not hosting content indefinitely.

@jeefy
Member

jeefy commented Feb 1, 2023

Jeefy lizard brain policy idea:

Non-Prod images: Age out after 9 months
Prod images: Age out after release-EOL + 1y (so if 1.27 went EOL in Jan 2024, its images would get culled in Jan 2025)

@mrbobbytables
Member

mrbobbytables commented Feb 1, 2023

I am broadly in support of this. I don't think it's a reasonable ask for the project to host all releases for all time.
Do we have any data on which versions are being pulled?
I know there are 3rd-party reports available (e.g. the Datadog report) that show 1.21 is still the most common version deployed right now; I would want to make sure we take that into account.

Right now I'm leaning towards EOL-9 (3 years), but would want some data before making any decision.

EDIT: Even without data I think we should remove the google-containers images...oof

@ameukam
Member

ameukam commented Feb 1, 2023

/sig testing
/kind cleanup

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Feb 1, 2023
@BenTheElder
Member Author

BenTheElder commented Feb 1, 2023

Some additional points:

  • We have a lot of images that are not related to Kubernetes releases, but which may have their own release timelines. I think we should maybe just set an N-year policy, where N is relatively generous but still gives us room to stop perma-hosting.

  • The mechanism to remove images needs some thinking ...

    • The source of truth for which images are available is the https://github.com/kubernetes/k8s.io repo, and https://github.com/kubernetes-sigs/promo-tools copies them to the backing registries (so, to be clear, no changes will happen in this repo; it's a multi-repo problem and I figured visibility might be best here, but we need to forward this to more people).
    • We might have to cull them from the image promoter manifests somehow, and then start to permit auto-pruning things that are in production storage but not in the manifests (see the sketch after this list). Automating the policy to drop things from the source manifests sounds tricky, but I'm not sure there's a more reasonable approach. The implementation details need input from promo-tools maintainers and will probably at least somewhat drive the viable policy. cc @puerco @kubernetes/release-engineering (EDIT: see RFC: Feasibility of aging out old content kubernetes-sigs/promo-tools#719)
  • IMHO, despite needing to consider the mechanics, we should decide whether we're doing this, settle on a reasonable policy, and start communicating ahead of actually implementing it. It may take time to staff these changes, but communicating sooner, alongside the new registry, would be beneficial to users.
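To make the auto-pruning bullet a bit more concrete, here is a minimal sketch of that step. It assumes nothing about the real promo-tools internals; the function name and data shapes are hypothetical. The idea is simply that anything present in the backing registry but absent from the source-of-truth manifests becomes a prune candidate.

```go
package main

import "fmt"

// pruneCandidates returns digests that exist in the backing registry but are
// no longer referenced by any promoter manifest, i.e. content an auto-pruner
// could be allowed to delete once a retention policy drops it from the
// source-of-truth manifests. Sketch only; not promo-tools code.
func pruneCandidates(manifestDigests, registryDigests []string) []string {
	referenced := make(map[string]bool, len(manifestDigests))
	for _, d := range manifestDigests {
		referenced[d] = true
	}
	var candidates []string
	for _, d := range registryDigests {
		if !referenced[d] {
			candidates = append(candidates, d)
		}
	}
	return candidates
}

func main() {
	manifests := []string{"sha256:aaa", "sha256:bbb"}              // still promoted
	registry := []string{"sha256:aaa", "sha256:bbb", "sha256:ccc"} // present in storage
	fmt.Println(pruneCandidates(manifests, registry))              // [sha256:ccc]
}
```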

EDIT: Even without data I think we should remove the google-containers images...oof

Yeah. That also just hasn't happened due to lack of policy. We did the flip to k8s.gcr.io July 24th 2020. kubernetes/release#270 (comment)

/sig testing

While I'm sure sig-testing is happy to support this effort, I'd suggest that the policy is a matter for a combination of k8s-infra (what k8s-infra is willing to fund resources for generally) and release (particularly around the promo-tools support for this and the Kubernetes release timeframe).

/sig k8s-infra
/sig release
/remove-sig testing

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/release Categorizes an issue or PR as relevant to SIG Release. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 1, 2023
@mrbobbytables
Member

/committee steering
as this has budget / CNCF-related items

@k8s-ci-robot k8s-ci-robot added the committee/steering Denotes an issue or PR intended to be handled by the steering committee. label Feb 1, 2023
@sftim

sftim commented Feb 2, 2023

Technical aside: when we serve layer redirects, we also get the option to serve a Warning header alongside the redirect.
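For illustration only, a redirect handler attaching such a notice could look roughly like this. The upstream blob host is a placeholder, and registry.k8s.io's actual implementation lives in its own repo and may differ.

```go
package main

import "net/http"

// redirectBlob illustrates serving a Warning header alongside a layer
// redirect, as suggested above. Sketch only; not the registry.k8s.io code.
func redirectBlob(w http.ResponseWriter, r *http.Request) {
	// Hypothetical upstream blob URL.
	target := "https://example-blob-store.example.com" + r.URL.Path

	// RFC 7234 "Warning" header; code 299 is "miscellaneous persistent warning".
	w.Header().Set("Warning", `299 - "this image is past its retention window and may be removed"`)
	http.Redirect(w, r, target, http.StatusTemporaryRedirect)
}

func main() {
	http.HandleFunc("/v2/", redirectBlob)
	_ = http.ListenAndServe(":8080", nil)
}
```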

@BenTheElder
Member Author

BenTheElder commented Feb 6, 2023

One additional complication for people to consider: we more commonly host our own base images at this point, so building old commits from source will become more challenging (but not necessarily impossible*) if we age out those images.

* non-reproducible builds may be an issue, e.g. the "debian-base" image.

@justinsb
Member

justinsb commented Feb 7, 2023

Posted this on the linked issue, but maybe it's better here:

The cost reduction would primarily be because we would break people, and encourage them to upgrade (?)

I think other OSS projects follow a similar strategy, e.g. the "normal" debian APT repos don't work with old releases, but there is a public archive.

I don't know the actual strategy for when debian distros move to the archive. For kubernetes, if we support 4 versions (?), we probably want to keep at least 5 versions "live" so that people aren't forced to upgrade all their EOL clusters at once, but we probably want to keep no more than 8 versions "live" so that people are nudged to upgrade. So I come up with a range of 5-8 releases if we wanted to split off old releases, and I can imagine a case for any of those values.

@xmudrii
Member

xmudrii commented Feb 7, 2023

For kubernetes, if we support 4 versions (?), we probably want to keep at least 5 versions "live" so that people aren't forced to upgrade all their EOL clusters at once, but we probably want to keep no more than 8 versions "live" so that people are nudged to upgrade.

This brings up a very good point. We have to keep in mind that you can't skip Kubernetes versions when upgrading; otherwise you'll go against the skew policy and eventually break the cluster. Let's say we remove images for versions up until 1.22, but someone is using 1.20. They don't have a way to upgrade their cluster to a newer version, because they would have to start by upgrading to 1.21, which wouldn't exist any longer. This is a very problematic scenario because the only way forward is more or less to start from scratch, and that's unacceptable for many.

We need to be a bit generous here. I agree with @justinsb that we should target 5-8 releases. I'd probably go closer to 8.

@BenTheElder
Member Author

These are good points.

I think we should probably be implementing policy in terms of time though, both to be more manageable to implement and because we have many images that are not part of Kubernetes releases.

If we consider the lifespan of releases but put the policy in terms of time, we could say e.g. "3 years after publication", which works out to roughly 5-8 releases (since releases ship about every 1/3 of a year and are supported for one year).
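As a rough illustration of a purely time-based rule (the 3-year figure comes from the comment above; how the publication timestamp would actually be obtained, e.g. from registry metadata, is left open):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: a "3 years after publication" style check.
const retention = 3 * 365 * 24 * time.Hour

func eligibleForRemoval(published, now time.Time) bool {
	return now.Sub(published) > retention
}

func main() {
	// Hypothetical publication date for some old image.
	published := time.Date(2020, time.March, 25, 0, 0, 0, 0, time.UTC)
	fmt.Println(eligibleForRemoval(published, time.Now()))
}
```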

@jeremyrickard

I think we should set a more aggressive timeline going forward, say starting with 1.27 we'll host artifacts for 1 year after they hit end of support. I'm not sure how we best handle older things, but things like 1.21 are still being heavily used. If we said "5" releases, that would still fall into that window pretty soon.

If we have to choose between the health of the project overall (CI infra, etc.) and impacting people with those older versions, I think we unfortunately have to choose the health of the project :( Can we provide some extraordinary mechanisms for people to pull maybe tarballs of older stuff and some instructions on how to put those into a registry, like some sort of cold storage for folks?
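As a rough sketch of what such a cold-storage workflow could look like for users, assuming the go-containerregistry crane library (the image reference, tarball name, and destination registry are placeholders, not a commitment from the project); the crane CLI's pull/push subcommands can do the same without writing code:

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	src := "registry.k8s.io/pause:3.2"             // example of an older image
	tarball := "pause-3.2.tar"                     // local cold-storage artifact
	dst := "registry.example.com/mirror/pause:3.2" // a registry you control

	// Pull the image and save it to a tarball for archival.
	img, err := crane.Pull(src)
	if err != nil {
		log.Fatalf("pull: %v", err)
	}
	if err := crane.Save(img, src, tarball); err != nil {
		log.Fatalf("save: %v", err)
	}

	// Later, restore from the tarball and push into your own registry.
	restored, err := crane.Load(tarball)
	if err != nil {
		log.Fatalf("load: %v", err)
	}
	if err := crane.Push(restored, dst); err != nil {
		log.Fatalf("push: %v", err)
	}
}
```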

@dims
Member

dims commented Feb 10, 2023

💯 to "starting with 1.27 we'll host artifacts for 1 year after they hit end of support".

@xmudrii
Member

xmudrii commented Feb 10, 2023

Can we provide some extraordinary mechanisms for people to pull maybe tarballs of older stuff and some instructions on how to put those into a registry, like some sort of cold storage for folks?

What if we keep the latest patch release for each minor? For example, 1 year after reaching the EOL date, we keep images for the latest patch release, but delete all other images. Eventually, we can remove the latest patch release, let's say, 3-4 years after the EOL date. That will reduce storage and (hopefully) bandwidth costs, but at the same time, it shouldn't break any clusters or workflows.
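To illustrate the tag-selection part of that idea, here is a small sketch that keeps only the newest patch release per minor. It is purely illustrative; real tooling would operate on the promoter manifests and would need to handle non-semver tags.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var semverRE = regexp.MustCompile(`^v(\d+)\.(\d+)\.(\d+)$`)

// latestPatchPerMinor maps each minor version to its newest patch tag.
func latestPatchPerMinor(tags []string) map[string]string {
	keep := map[string]string{} // e.g. "1.20" -> "v1.20.15"
	best := map[string]int{}
	for _, t := range tags {
		m := semverRE.FindStringSubmatch(t)
		if m == nil {
			continue // skip non-semver tags in this sketch
		}
		minor := m[1] + "." + m[2]
		patch, _ := strconv.Atoi(m[3])
		if cur, ok := best[minor]; !ok || patch > cur {
			best[minor] = patch
			keep[minor] = t
		}
	}
	return keep
}

func main() {
	tags := []string{"v1.20.0", "v1.20.15", "v1.21.0", "v1.21.14"}
	fmt.Println(latestPatchPerMinor(tags)) // map[1.20:v1.20.15 1.21:v1.21.14]
}
```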

@aojea
Member

aojea commented Feb 10, 2023

The source code will remain at https://github.com/kubernetes/kubernetes/releases; it's not that people who need it can't do make release ;)

@jberkus

jberkus commented Feb 13, 2023

Technical capabilities aside, the ideal set would be, IMHO:

  • Everything for the last 4 releases
  • Just the final patch release for the 4 releases before that, just to enable upgrading

If we had the ability, I expect there are probably lots of other images for older releases, from subprojects etc., that could be purged much more aggressively.

@BenTheElder
Member Author

Thanks all for all the input and suggestions!


The source code will remain at https://github.com/kubernetes/kubernetes/releases; it's not that people who need it can't do make release ;)

That's not sufficient, because the base images typically cannot just be re-built from source, and the source code references base images. Those base images are a snapshot of various packages at a point in time.

We also run binary builds in containers sometimes (see kubernetes/kubernetes), where the build-environment image is also an important point-in-time snapshot for reproducing a source build.


We've been discussing this further in various forums, and on further review I think the cost angle is going to be nearly irrelevant once we have moved traffic off k8s.gcr.io onto registry.k8s.io.

Storage costs are not large enough to matter. Image storage de-dupes pretty well and is compressed; in real numbers we're actually looking at less than 2 TB currently, even after ~8 years. 2 TB costs something like $50/mo to store in the GCS standard tier, for example.

Bandwidth costs matter, deeply, ... but only for content people actually use, and they're going to be a lot more manageable on registry.k8s.io (e.g. >50% of requests to k8s.gcr.io came from AWS, and we're now serving content out of AWS ... no egress).

For the tooling "needs to iterate all of this" angle: I've spent the past few days working on a new tool that needs to scan all the images, and it's actually quite doable to optimize this even as the set of images grows. AR provides a cheap API to list all digests in a repo. I think we can teach the promoter tools to avoid repeatedly fetching content-addressable data we've already processed.
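For illustration, listing digests via the GCR/AR listing extension with go-containerregistry and skipping already-processed digests might look roughly like this. The repository name and the in-memory "processed" set are placeholders; this is not promo-tools code.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/google"
)

func main() {
	// Hypothetical Artifact Registry repository.
	repo, err := name.NewRepository("us-central1-docker.pkg.dev/example-project/images/pause")
	if err != nil {
		log.Fatal(err)
	}

	// One listing call returns every manifest digest (with tags and timestamps).
	tags, err := google.List(repo, google.WithAuthFromKeychain(google.Keychain))
	if err != nil {
		log.Fatal(err)
	}

	// Stand-in for a persisted cache keyed by digest: content-addressable data
	// never changes under the same digest, so anything already processed
	// (signed, scanned, etc.) can be skipped without re-fetching it.
	processed := map[string]bool{}
	for digest, info := range tags.Manifests {
		if processed[digest] {
			continue
		}
		fmt.Printf("new digest %s (tags: %v)\n", digest, info.Tags)
		processed[digest] = true
	}
}
```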


I think making base images in particular disappear is going to cause more harm than benefit ...

We're also really ramping up having users migrate, so I think we are missing the window on "let's get a policy in place before people use it", and we've forced users to adapt to a new registry twice in the past few years, so I think we can just introduce a policy later if we wind up needing it.

@mrbobbytables' outreach to subprojects and other CNCF projects to migrate, and other pending cost-optimization efforts, are probably a better use of time at the moment.

@BenTheElder closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 7, 2023
@BenTheElder
Member Author

There are other issues tracking things like potentially sunsetting k8s.gcr.io; they're tracked in the general k8s-infra repo at https://github.com/kubernetes/k8s.io
