🚨 Sigstore Signature images do not match across different geo-locations 🚨 #187
Opened kubernetes-sigs/promo-tools#784 to track resolving any bugs in the image promoter.
So far this seems to affect only the sigstore images. Given that clients will generally be fetching these with a tag that is computed from the digest of the adjacent image that was signed, not the digest of the signature "images" themselves, this is probably unlikely to break anyone, but it's worth fixing regardless.
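To make the tag computation concrete, here's a minimal sketch of how a client locates the signature "image" (assuming `crane` and `cosign` are installed; kube-scheduler:v1.26.3 is just an example image discussed later in this thread):

```bash
# Sketch: how the signature "image" tag is derived from the signed image's digest.
IMAGE=registry.k8s.io/kube-scheduler:v1.26.3   # example image

# Digest of the signed image -- this is what must match across regions.
DIGEST=$(crane digest "${IMAGE}")

# cosign computes the signature tag from that digest:
cosign triangulate "${IMAGE}"
# -> registry.k8s.io/kube-scheduler:sha256-<hex>.sig

# Equivalent manual construction of the same tag:
echo "registry.k8s.io/kube-scheduler:${DIGEST/:/-}.sig"
```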
This could cause a problem if a single image pull (many API calls) somehow gets routed to multiple instances of the registry.k8s.io backend in different regions, because the signature blobs available would not match. We think this is very unlikely. Still something to fix.
So ... I've computed an index of all images in each backing registry, mapping each partial_ref to the digest it points to. A partial_ref in this case is roughly the registry-relative reference (repo plus tag or digest). This is ~600M of JSON. It took O(hours) to obtain, given the rate limits on scanning our registries and the volume of images.

I've then filtered this back down to only tag refs, plus digest refs that have no tags pointing at them. Both types map to the digest they point to. Filtering this way reduces the data set but not the information; it just means we skip digest refs that already have tags pointing at them. The tradeoff is that to diff you need to check both the ref and the digest between two hosts, but we want to know if tags are different anyhow.

I would share, but even the filtered and processed version is 353M ... EDIT: The filtered version is available as gzip-compressed JSON here: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1679213042607229?thread_ts=1679166550.351119&cid=CJH2GBF7Y

Anyhow, by running some queries over this data I can see that none of the backing registries have the same number of refs. If I pick two regions and compute the ref diffs, what I see every time so far is a mix of dangling digest refs (with no tag pointing at them) and sigstore signature tags.

Unfortunately there are a large number of dangling digest refs in the region diffs, so we can't just say "well, it's all sigstore tags" and call it a day. There are also too many of these to quickly fetch and check all of the manifests. But inspecting a random sample of dangling digest refs from the region pairs I checked, so far 100% of them are sigstore manifests. I would guess that we pushed signature tags to images multiple times and these dangling refs are previous signature pushes. ALL of the tag-type references in the diffs so far are sigstore signature tags.
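For illustration only (the actual code is linked a few comments down), a stripped-down version of the index/diff idea might look like the sketch below. The Artifact Registry hosts and repo name are placeholders, and plain `crane ls` only sees tagged refs, not the dangling untagged digests discussed above:

```bash
# Rough sketch of the ref -> digest index and cross-region diff.
# The hosts below are illustrative placeholders for two backing registries.
REGION_A=us-west1-docker.pkg.dev/k8s-artifacts-prod/images
REGION_B=us-west2-docker.pkg.dev/k8s-artifacts-prod/images
REPO=kube-scheduler   # one repo for brevity; the real scan walks every repo

index() {
  local host=$1
  # Emit "repo:tag digest" lines, sorted so they can be diffed.
  for tag in $(crane ls "${host}/${REPO}"); do
    echo "${REPO}:${tag} $(crane digest "${host}/${REPO}:${tag}")"
  done | sort
}

# Any tag that exists in only one region, or points at a different digest,
# shows up in this diff.
diff <(index "${REGION_A}") <(index "${REGION_B}")
```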
The tag-type references in the diff also suggest that we have signed images whose signature is only published at all in some regions, AFAICT, which is a bit worse than the exact signature varying by region ... E.g. for the us-west1 vs us-west2 AR instances:
Missing:
Available:
You can verify that these are really missing / available yourself (see the sketch below). This also applies to k8s.gcr.io with its eu/us/asia backing registries; however, it's far less visible there, as users are far less likely to ever encounter different backing registries given the very broad geographic scopes.
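A minimal spot check along those lines, assuming the backing Artifact Registry instances follow the `<region>-docker.pkg.dev` layout (the hosts and ref below are placeholders, not taken from the diff above):

```bash
# Check whether a given ref is present in each regional backing registry.
REF=kube-scheduler:v1.26.3   # placeholder; substitute a ref from the diff
for host in us-west1-docker.pkg.dev/k8s-artifacts-prod/images \
            us-west2-docker.pkg.dev/k8s-artifacts-prod/images; do
  if crane manifest "${host}/${REF}" > /dev/null 2>&1; then
    echo "available: ${host}/${REF}"
  else
    echo "missing:   ${host}/${REF}"
  fi
done
```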
Quantifying scale of sigstore tags:
Note that this is going to include each manifest, so there's potentially one of these for each architecture within the same image. The more interesting detail is that we have some other image tags that are only in some backends:
100% of these are only missing from the k8s.gcr.io registries; see below for how this happened: #187 (comment). You can verify that these are in other backends, like this sample:
Code is in BenTheElder@2e32a2c / https://github.com/BenTheElder/registry.k8s.io/tree/check-images; the data file is in the Slack thread linked above.
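Relatedly, a back-of-the-envelope way to count sigstore tags for a single repo (the numbers above came from the full scan in the linked code):

```bash
# Count cosign signature tags (sha256-<digest>.sig) in one repo.
REPO=registry.k8s.io/kube-scheduler
crane ls "${REPO}" | grep -c '\.sig$'
```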
The "cluster-api-aure" tags were partially synced before and led to kubernetes/k8s.io#4368 which should be catching future mis-configuration leading to partial sync on the promoter config side of things. We should make sure that test is actually running on the migrated |
kubernetes/k8s.io#4988 will ensure we keep applying the regression test that should prevent subprojects from being mis-configured to not promote to all regions (i.e. the cluster-api-aure situation).
Confirmed: dangling digests that are not in all regions are 100% sigstore manifests (containing signature layers). Scanned with BenTheElder@a10201c.
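For reference, a sketch of the kind of per-digest check such a scan can perform: a cosign signature manifest is recognizable by its simple-signing layer media type (the repo and digest below are placeholders):

```bash
# Decide whether a (dangling) digest is a cosign signature manifest.
HOST=registry.k8s.io/kube-scheduler   # placeholder repo
DIGEST=sha256:0000000000000000000000000000000000000000000000000000000000000000  # placeholder

if crane manifest "${HOST}@${DIGEST}" \
     | jq -r '.layers[].mediaType' \
     | grep -q 'application/vnd.dev.cosign.simplesigning.v1+json'; then
  echo "cosign signature manifest"
else
  echo "regular image manifest"
fi
```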
So, recapping: TL;DR of backing-registry skew after fully checking through all mismatching tags and digests in a snapshot from this weekend. The following cases appear to exist:
1. Sigstore signature manifests whose digests differ between regions.
2. Signed images whose signature is only published in some regions.
3. The "cluster-api-aure" tags, which were only partially promoted and are missing from the k8s.gcr.io backing registries.
These are all known issues. Case 3 should not get worse thanks to regression tests (kubernetes/k8s.io#4988). Cases 1 & 2 are being worked on, and kubernetes-sigs/promo-tools#784 is probably the best place to track that. See also, for 1 & 2:
OK, regarding the diverging signatures: looking at images promoted before the promoter started breaking due to the rate limits, the .sig layers match. I found a mismatching tag in the images promoted as part of the (failed) v1.26.3 release. For example, registry.k8s.io/kube-scheduler:v1.26.3 is fully signed and replicated, and all copies match:
There are some images which have missing signatures, but the signatures that are there all match; for example:
Of all the images we promoted that day, the one that has a different digest is the following:
Here's what's going on: When the promoter signs, it stamps the images with its own signer identity:
(note the SA ID in the last line: krel-trust@) The diverging digest has the identity from the signature we add when the build process runs:
(note the identity here is krel-staging@) This signature is the only one that is different in the release, so we are not re-signing. It is simply that, when processing the signatures for this particular image, the promoter got rate limited and died in the middle.
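For anyone who wants to check the signer identity themselves, here's a sketch, assuming these are keyless cosign signatures with the Fulcio certificate stored in the standard layer annotation; the identity (krel-trust@... vs krel-staging@...) shows up in the certificate's Subject Alternative Name:

```bash
# Locate the signature manifest for an image and inspect its signing identity.
SIG_REF=$(cosign triangulate registry.k8s.io/kube-scheduler:v1.26.3)

crane manifest "${SIG_REF}" \
  | jq -r '.layers[0].annotations["dev.sigstore.cosign/certificate"]' \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'
```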
Wait, we're pushing images to prod and then mutating them? Why?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Please note that this issue is linked in LFX Mentorship 2024 term 2.
Thanks! This issue is just for tracking / visibility to users of the registry; the necessary changes will be in repos like kubernetes/release, where image publication is managed. If/when it is fixed, we will replicate updates back here for visibility.
@aliok This project seems interesting to me. I really want to work on this project. Is there any prerequisite task that needs to be done?
Hi folks, please discuss possibly working on this in kubernetes/release#2962, and let's reserve this issue for indicating to users of the registry when we have progress or more details on the situation.
Is there an existing issue for this?
What did you expect to happen?
Images should have identical digests no matter what region I pull from.
This does not appear to be the case for some of the sigstore images added by the image-promoter.
Thread: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1679166550351119
This issue is to track the problem; the underlying fix will happen in the backing registries and in the image promoter (https://github.com/kubernetes-sigs/promo-tools) if we actively have a bug still causing this.
To be clear, this is not a bug in the registry application; however, it will be visible to users of the registry, and more visible on registry.k8s.io than on k8s.gcr.io (because k8s.gcr.io has much, much broader backing regions: eu, us, asia).
We'll want to fix the underlying issues, if any remain, in promo-tools and then fix up the backing registry contents somehow.
Debugging Information
I have a script that inspects some important high-bandwidth images. It's a bit slow, and currently it only checks k8s.gcr.io / registry.k8s.io: https://github.com/BenTheElder/registry.k8s.io/blob/check-images/hack/tools/check-images.sh
We'll need to check the backing stores. I noticed a difference between my laptop at home and an SSH session to a cloud workstation.
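A minimal version of that kind of check (assuming `crane` is installed; the image list is just a sample of commonly pulled tags):

```bash
# The same tag should resolve to the same digest on both hosts,
# regardless of where you run this from.
for img in kube-apiserver:v1.26.3 kube-scheduler:v1.26.3 pause:3.9; do
  echo "${img}"
  echo "  k8s.gcr.io:      $(crane digest "k8s.gcr.io/${img}")"
  echo "  registry.k8s.io: $(crane digest "registry.k8s.io/${img}")"
done
```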
Anything else?
/sig release