Cloud run: recover missing digests via registry lookup server/#4986#879
Conversation
ToreMerkely
commented
May 11, 2026
- fix(digest): accept all four OCI manifest media types in one request
- feat(cloudrun): resolve tag-pinned image digests via registry lookup (Slice 7, #4986)
- fix(cloudrun): keep tag-pinned image as artifact name after digest resolution
- feat(cloudrun): --resolve-names flag rewrites digest-pinned artifact names to tag-pinned (Slice 8, #4986)
RemoteDockerImageSha256 previously asked the registry for only the
Docker fat-manifest type first, then retried with the Docker single-
arch type if that came back wrong. Two issues:
- A 404 on the first request short-circuited the retry: the function
returned the 404 error instead of falling back. Real registries
(Artifact Registry, GHCR, ECR) return 404 MANIFEST_UNKNOWN for an
Accept type they don't have.
- The two Docker types were the only ones advertised. OCI single-arch
manifests (vnd.oci.image.manifest.v1+json) and OCI image indexes
(vnd.oci.image.index.v1+json) were never asked for — and modern AR
pushes routinely use one of these.
Per the OCI Distribution spec, clients should list every manifest
media type they understand in a single Accept header; the registry
picks the best match and returns the canonical Docker-Content-Digest
header regardless of which it chose. Replace the two-call retry with
one request advertising all four canonical types.
Surfaces in cloud-run snapshots as the difference between
"jobs come back with empty digests" and "jobs come back digest-
pinned".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(Slice 7, #4986)
For Cloud Run Services, Cloud Run rewrites the image to a digest at
deploy time and exposes the digest-pinned reference on the Revision —
so the snapshot picks up the sha256 for free. For Cloud Run Jobs, the
spec is left as-deployed: tag-pinned at deploy, tag-pinned in the
snapshot, empty digest in the artifacts. Same gap applies to any
tag-pinned Service container as an edge case.
Close it with an OCI Distribution lookup (\`GET /v2/<name>/manifests/<tag>\`)
against the hosting registry, using a GCP OAuth bearer token derived
from Application Default Credentials. Works for both Jobs that have
never been executed and ones that have — the registry doesn't care.
- New \`digestResolver\` interface in internal/cloudrun + production
\`gcpRegistryResolver\` (registry.go). Scope-filtered to GCP-hosted
registries (Artifact Registry \`*-docker.pkg.dev\`, legacy GCR
\`*.gcr.io\`); other hosts skip resolution because the OAuth token
won't work against them.
- \`splitTagPinnedImage\` parses \`host/path:tag\`, rejecting digest-
pinned, no-tag, port-without-tag, and no-host inputs.
- \`Client\` gains \`resolver digestResolver\` and \`log *logger.Logger\`
fields; \`New(ctx, log)\` constructs a gcpRegistryResolver from ADC
(cloud-platform scope) and a logger. If the token source can't be
built (rare), resolver is left nil — behaviour falls back to today's
empty-digest output, never fatal.
- New \`(c *Client).resolveTagPinnedDigests(map)\` walks each artifact's
digests map, resolves tag-pinned entries, and swaps the key for the
digest-pinned form. Wired into \`ListJobs\` and the \`toService\`
revision loop so both kinds benefit. Failures log at debug and leave
the original tag-pinned entry alone.
- \`cmd/kosli/snapshotCloudRun.go\` passes the package-level logger to
\`cloudrun.New\`.
The runner's identity needs \`roles/artifactregistry.reader\` in
addition to \`roles/run.viewer\` for the lookup to succeed. Without
the new role, AR returns 403 DENIED, the resolver logs it at debug,
and the snapshot still completes with the tag-pinned reference and
empty digest (same as before this slice). See
out-4986-important-to-document.md.
Tests: 8 splitTagPinnedImage cases (AR/GCR/digest-pinned/no-tag/port-
no-tag/port+tag/empty-tag/no-host), isGCPRegistryHost table covering
AR + GCR + non-GCP rejections, 5 resolveTagPinnedDigests cases
(replaces empty, leaves resolved alone, failure preserves entry,
nil-resolver no-op, mixed map).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…solution
Slice 7 swapped the tag-pinned key for image@sha256:<hex> after a
successful registry lookup. The server uses the digests-map key as
the artifact's display name, so on the Kosli UI / API a resolved Job
artifact stopped looking like the thing the user deployed:
name: .../ghost-job@sha256:2c5cefcc...
vs. the more useful
name: .../ghost-job:c85b06b0... fingerprint: 2c5cefcc...
The tag is what humans wrote at \`gcloud run jobs deploy --image=...\`
and is what they recognise. Keep it as the key; only fill in the
resolved hex as the value.
Same rule applies to Service revisions if the resolver ever fires on
a tag-pinned Service container (rare — Cloud Run usually resolves
those at deploy time). \`resolveTagPinnedDigests\` is the single seam
for both kinds, so the behaviour is symmetric by construction.
Drive-by: \`digestResolver.Resolve\` no longer needs to return the
synthesised digest-pinned reference — drop the unused return so the
contract is "give me the hex" instead of "give me the hex and the
ref-form I might want to use".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…names to tag-pinned (Slice 8, #4986)
Cloud Run Services come back from the API digest-pinned (image rewritten
to digest at deploy time, original tag dropped from the spec). The
resulting artifact name on Kosli is the opaque sha256, not the commit
SHA or version tag the user actually deployed. Same for any Job whose
spec was deployed digest-pinned.
Recover the tag via an Artifact Registry reverse-lookup:
- New \`tagResolver\` interface, prod impl \`gcpArtifactRegistryTagResolver\`
calling \`projects.locations.repositories.dockerImages.get\` with the
image+digest as the resource name. (Initial attempt used
dockerImages.list?filter=…, which returns HTTP 400 — that endpoint
doesn't actually accept a filter query param; gcloud's --filter is
client-side. dockerImages.get is single-record-by-name and works.)
- \`splitDigestPinnedAR\` parser for digest-pinned AR references,
scope-filtered to \`*-docker.pkg.dev\` (GCR lacks an equivalent
reverse-lookup endpoint).
- \`pickLatestTag\` for the multi-tag case: longest tag wins (commit
SHAs beat moving aliases like ":released"), lex breaks ties. AR
exposes no per-tag timestamp, so a time-based "latest" isn't an
option.
- New \`(c *Client).resolveNamesForDigestPinned\` walks the digests
map for entries containing "@sha256:", asks the resolver for a
tag, swaps the key for "image:<tag>". Hex value preserved. Same
fail-silent semantics as resolveTagPinnedDigests.
- Wired into both \`ListJobs\` and the \`toService\` revision loop.
- \`Client.New\` gains a \`resolveNames bool\` arg; both resolvers now
share a single ADC token source created once in New.
- \`--resolve-names\` boolean flag on \`kosli snapshot cloud-run\`,
default false.
Verified end-to-end against the hello-world-cli-demo test bed: every
Service revision and every Job artifact comes back tag-pinned with the
resolved digest as the value when the flag is set.
Requires roles/artifactregistry.reader on the runner's identity
(already needed for the forward-resolution path added in Slice 7); no
new permission.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Claude finished @ToreMerkely's task in 3m 39s —— View job PR Review: Cloud Run Registry Digest Resolution
Static AnalysisAll three checks pass:
SummaryClean, well-structured PR that adds two capabilities — forward digest resolution (tag→sha256 via OCI Distribution) and reverse name resolution (digest→tag via AR REST API) — while collapsing the old two-request I left three inline comments — one actionable, two informational:
What's Good
The previous review comments about map-mutation-during-iteration and URL path-segment encoding have been appropriately addressed or acknowledged. The |
…4986) The Long description claimed only \`roles/run.viewer\` was required. With Slices 7-8 in place, digest resolution and the optional name recovery also need \`roles/artifactregistry.reader\` on the AR repository (or the project). Spell this out, note that the missing role is non-fatal (tag-pinned artifacts just surface with empty digests), and call out the AR/GCR-only scope of the resolver so users deploying from non-GCP registries know to expect as-is image refs. Also a short paragraph on what \`--resolve-names\` does, mirroring the existing flag-description text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drive-by cleanup: a single struct literal in the table had over- padded field alignment because the surrounding entries used different field sets. gofmt produces the version in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y calls (#4986)
Forward and reverse registry resolution were each calling
\`requests.NewKosliClient\` per artifact. \`NewKosliClient\` builds a
fresh \`retryablehttp.Client\` + \`http.Client\` + \`http.Transport\` —
which means a brand-new connection pool every time. At a handful of
artifacts the cost is invisible; at the "several hundred artifacts in
a single report" scale flagged on the PR, that's ~100s of TCP+TLS
handshakes serialised back-to-back (~150ms each = ~30s of dead time
on top of the real work).
Hoist the client construction up to \`cloudrun.New(...)\` so a single
\`*requests.Client\` is built once and shared by both:
- \`gcpRegistryResolver\` (forward, tag → digest, via the digest
package's OCI Distribution call)
- \`gcpArtifactRegistryTagResolver\` (reverse, digest → tag, via the
AR REST \`dockerImages.get\` call)
Connection pool warm, TCP+TLS handshakes amortise across every
artifact in the snapshot.
To plumb it through the forward path, \`digest.RemoteDockerImageSha256\`
now takes an injected \`*requests.Client\` instead of constructing its
own. The logger param drops out of the public signature — the client
already carries one. Only call site (the cloudrun forward resolver)
and the one test caller updated.
If client construction fails (rare; only on bad proxy URL), both
resolvers stay nil and the snapshot falls back to today's pre-
resolution behaviour. Same fail-open story as ADC token-source
failure.
Live-verified against the test bed: 6 artifacts, 6 status 200, one
client throughout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tructor (#4986) Slice 8 added \`--resolve-names\` and threaded it through to \`newCloudRunClient(ctx, resolveNames)\`. Unit-test coverage of the plumbing was missing: every stub in the suite discarded the bool parameter (\`_ bool\`), so a future refactor that silently breaks the wiring (e.g. binding the flag to the wrong variable, or forgetting to forward the bool) wouldn't have been caught. New test runs the cobra command twice — once with \`--resolve-names\` and once without — and asserts the bool that reaches the client constructor matches the user's intent. Also pins the default to false so a flipped default would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>