Skip to content

Cloud run: recover missing digests via registry lookup server/#4986#879

Merged
ToreMerkely merged 8 commits into
mainfrom
4986-cloud-run-job-for-missing-digest
May 11, 2026
Merged

Cloud run: recover missing digests via registry lookup server/#4986#879
ToreMerkely merged 8 commits into
mainfrom
4986-cloud-run-job-for-missing-digest

Conversation

@ToreMerkely
Copy link
Copy Markdown
Contributor

  • fix(digest): accept all four OCI manifest media types in one request
  • feat(cloudrun): resolve tag-pinned image digests via registry lookup (Slice 7, #4986)
  • fix(cloudrun): keep tag-pinned image as artifact name after digest resolution
  • feat(cloudrun): --resolve-names flag rewrites digest-pinned artifact names to tag-pinned (Slice 8, #4986)

ToreMerkely and others added 4 commits May 11, 2026 07:40
RemoteDockerImageSha256 previously asked the registry for only the
Docker fat-manifest type first, then retried with the Docker single-
arch type if that came back wrong. Two issues:

  - A 404 on the first request short-circuited the retry: the function
    returned the 404 error instead of falling back. Real registries
    (Artifact Registry, GHCR, ECR) return 404 MANIFEST_UNKNOWN for an
    Accept type they don't have.
  - The two Docker types were the only ones advertised. OCI single-arch
    manifests (vnd.oci.image.manifest.v1+json) and OCI image indexes
    (vnd.oci.image.index.v1+json) were never asked for — and modern AR
    pushes routinely use one of these.

Per the OCI Distribution spec, clients should list every manifest
media type they understand in a single Accept header; the registry
picks the best match and returns the canonical Docker-Content-Digest
header regardless of which it chose. Replace the two-call retry with
one request advertising all four canonical types.

Surfaces in cloud-run snapshots as the difference between
"jobs come back with empty digests" and "jobs come back digest-
pinned".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(Slice 7, #4986)

For Cloud Run Services, Cloud Run rewrites the image to a digest at
deploy time and exposes the digest-pinned reference on the Revision —
so the snapshot picks up the sha256 for free. For Cloud Run Jobs, the
spec is left as-deployed: tag-pinned at deploy, tag-pinned in the
snapshot, empty digest in the artifacts. Same gap applies to any
tag-pinned Service container as an edge case.

Close it with an OCI Distribution lookup (\`GET /v2/<name>/manifests/<tag>\`)
against the hosting registry, using a GCP OAuth bearer token derived
from Application Default Credentials. Works for both Jobs that have
never been executed and ones that have — the registry doesn't care.

  - New \`digestResolver\` interface in internal/cloudrun + production
    \`gcpRegistryResolver\` (registry.go). Scope-filtered to GCP-hosted
    registries (Artifact Registry \`*-docker.pkg.dev\`, legacy GCR
    \`*.gcr.io\`); other hosts skip resolution because the OAuth token
    won't work against them.
  - \`splitTagPinnedImage\` parses \`host/path:tag\`, rejecting digest-
    pinned, no-tag, port-without-tag, and no-host inputs.
  - \`Client\` gains \`resolver digestResolver\` and \`log *logger.Logger\`
    fields; \`New(ctx, log)\` constructs a gcpRegistryResolver from ADC
    (cloud-platform scope) and a logger. If the token source can't be
    built (rare), resolver is left nil — behaviour falls back to today's
    empty-digest output, never fatal.
  - New \`(c *Client).resolveTagPinnedDigests(map)\` walks each artifact's
    digests map, resolves tag-pinned entries, and swaps the key for the
    digest-pinned form. Wired into \`ListJobs\` and the \`toService\`
    revision loop so both kinds benefit. Failures log at debug and leave
    the original tag-pinned entry alone.
  - \`cmd/kosli/snapshotCloudRun.go\` passes the package-level logger to
    \`cloudrun.New\`.

The runner's identity needs \`roles/artifactregistry.reader\` in
addition to \`roles/run.viewer\` for the lookup to succeed. Without
the new role, AR returns 403 DENIED, the resolver logs it at debug,
and the snapshot still completes with the tag-pinned reference and
empty digest (same as before this slice). See
out-4986-important-to-document.md.

Tests: 8 splitTagPinnedImage cases (AR/GCR/digest-pinned/no-tag/port-
no-tag/port+tag/empty-tag/no-host), isGCPRegistryHost table covering
AR + GCR + non-GCP rejections, 5 resolveTagPinnedDigests cases
(replaces empty, leaves resolved alone, failure preserves entry,
nil-resolver no-op, mixed map).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…solution

Slice 7 swapped the tag-pinned key for image@sha256:<hex> after a
successful registry lookup. The server uses the digests-map key as
the artifact's display name, so on the Kosli UI / API a resolved Job
artifact stopped looking like the thing the user deployed:

    name: .../ghost-job@sha256:2c5cefcc...

vs. the more useful

    name: .../ghost-job:c85b06b0...   fingerprint: 2c5cefcc...

The tag is what humans wrote at \`gcloud run jobs deploy --image=...\`
and is what they recognise. Keep it as the key; only fill in the
resolved hex as the value.

Same rule applies to Service revisions if the resolver ever fires on
a tag-pinned Service container (rare — Cloud Run usually resolves
those at deploy time). \`resolveTagPinnedDigests\` is the single seam
for both kinds, so the behaviour is symmetric by construction.

Drive-by: \`digestResolver.Resolve\` no longer needs to return the
synthesised digest-pinned reference — drop the unused return so the
contract is "give me the hex" instead of "give me the hex and the
ref-form I might want to use".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…names to tag-pinned (Slice 8, #4986)

Cloud Run Services come back from the API digest-pinned (image rewritten
to digest at deploy time, original tag dropped from the spec). The
resulting artifact name on Kosli is the opaque sha256, not the commit
SHA or version tag the user actually deployed. Same for any Job whose
spec was deployed digest-pinned.

Recover the tag via an Artifact Registry reverse-lookup:

  - New \`tagResolver\` interface, prod impl \`gcpArtifactRegistryTagResolver\`
    calling \`projects.locations.repositories.dockerImages.get\` with the
    image+digest as the resource name. (Initial attempt used
    dockerImages.list?filter=…, which returns HTTP 400 — that endpoint
    doesn't actually accept a filter query param; gcloud's --filter is
    client-side. dockerImages.get is single-record-by-name and works.)
  - \`splitDigestPinnedAR\` parser for digest-pinned AR references,
    scope-filtered to \`*-docker.pkg.dev\` (GCR lacks an equivalent
    reverse-lookup endpoint).
  - \`pickLatestTag\` for the multi-tag case: longest tag wins (commit
    SHAs beat moving aliases like ":released"), lex breaks ties. AR
    exposes no per-tag timestamp, so a time-based "latest" isn't an
    option.
  - New \`(c *Client).resolveNamesForDigestPinned\` walks the digests
    map for entries containing "@sha256:", asks the resolver for a
    tag, swaps the key for "image:<tag>". Hex value preserved. Same
    fail-silent semantics as resolveTagPinnedDigests.
  - Wired into both \`ListJobs\` and the \`toService\` revision loop.
  - \`Client.New\` gains a \`resolveNames bool\` arg; both resolvers now
    share a single ADC token source created once in New.
  - \`--resolve-names\` boolean flag on \`kosli snapshot cloud-run\`,
    default false.

Verified end-to-end against the hello-world-cli-demo test bed: every
Service revision and every Job artifact comes back tag-pinned with the
resolved digest as the value when the flag is set.

Requires roles/artifactregistry.reader on the runner's identity
(already needed for the forward-resolution path added in Slice 7); no
new permission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ToreMerkely ToreMerkely enabled auto-merge (squash) May 11, 2026 08:55
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

Claude finished @ToreMerkely's task in 3m 39s —— View job


PR Review: Cloud Run Registry Digest Resolution

  • Run static analysis (build, vet, mod tidy) — all clean
  • Review changed files
  • Post review feedback

Static Analysis

All three checks pass:

  • go build ./... — compiles cleanly
  • go vet ./... — no issues
  • go mod tidy — no drift in go.mod / go.sum

Summary

Clean, well-structured PR that adds two capabilities — forward digest resolution (tag→sha256 via OCI Distribution) and reverse name resolution (digest→tag via AR REST API) — while collapsing the old two-request RemoteDockerImageSha256 dance into a single content-negotiated call. The shared HTTP client across both resolvers is a good performance call, especially at scale. Tests are thorough with good coverage of parsing edge cases.

I left three inline comments — one actionable, two informational:

  1. registry.go:175 — Missing HTTP status check before JSON unmarshal (actionable). If the AR API returns 403/404, the error surfaces as "digest has no tags" instead of the real HTTP status. A pre-unmarshal status check would make IAM-permission issues much easier to diagnose for operators.

  2. digest.go:343 — Empty digest on missing header (informational). Pre-existing behaviour now reachable from a second call site. The cloudrun caller guards for it; just flagging for awareness.

  3. registry.go:231docker.pkg.dev accepted by isGCPRegistryHost but rejected by splitDigestPinnedAR (informational). Both outcomes are correct — the forward resolver can use the bare host, the reverse resolver needs a concrete location — but a comment would help a future reader understand the intentional asymmetry.

What's Good

  • Shared client pattern — constructing one *requests.Client in New() and threading it through both resolvers is the right design for snapshot-scale workloads.
  • Fail-open semantics — resolver failures are debug-logged and leave original entries untouched; never fatal. This is the correct posture for an enrichment step.
  • splitTagPinnedImage parsing — correctly handles port-in-host vs. tag disambiguation (localhost:5000/foo vs. localhost:5000/foo:v1), with targeted test coverage for each edge.
  • Accept header consolidation — single round-trip content negotiation replacing the old two-request fallback is strictly better.
  • --resolve-names flag test (added per prior review feedback) — captures the bool at the constructor seam and verifies both flag-present and flag-absent paths.

The previous review comments about map-mutation-during-iteration and URL path-segment encoding have been appropriately addressed or acknowledged. The pickLatestTag heuristic (longest wins, lex tiebreak) is pragmatic for the commit-SHA-vs-alias use case.

Comment thread internal/cloudrun/cloudrun.go
Comment thread internal/cloudrun/registry.go Outdated
Comment thread internal/cloudrun/registry.go
Comment thread cmd/kosli/snapshotCloudRun_test.go
Comment thread internal/digest/digest.go Outdated
ToreMerkely and others added 2 commits May 11, 2026 11:01
…4986)

The Long description claimed only \`roles/run.viewer\` was required.
With Slices 7-8 in place, digest resolution and the optional name
recovery also need \`roles/artifactregistry.reader\` on the AR
repository (or the project). Spell this out, note that the missing
role is non-fatal (tag-pinned artifacts just surface with empty
digests), and call out the AR/GCR-only scope of the resolver so users
deploying from non-GCP registries know to expect as-is image refs.

Also a short paragraph on what \`--resolve-names\` does, mirroring the
existing flag-description text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drive-by cleanup: a single struct literal in the table had over-
padded field alignment because the surrounding entries used
different field sets. gofmt produces the version in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/cloudrun/registry.go Outdated
Comment thread internal/cloudrun/cloudrun.go
Comment thread cmd/kosli/snapshotCloudRun_test.go
ToreMerkely and others added 2 commits May 11, 2026 13:13
…y calls (#4986)

Forward and reverse registry resolution were each calling
\`requests.NewKosliClient\` per artifact. \`NewKosliClient\` builds a
fresh \`retryablehttp.Client\` + \`http.Client\` + \`http.Transport\` —
which means a brand-new connection pool every time. At a handful of
artifacts the cost is invisible; at the "several hundred artifacts in
a single report" scale flagged on the PR, that's ~100s of TCP+TLS
handshakes serialised back-to-back (~150ms each = ~30s of dead time
on top of the real work).

Hoist the client construction up to \`cloudrun.New(...)\` so a single
\`*requests.Client\` is built once and shared by both:

  - \`gcpRegistryResolver\` (forward, tag → digest, via the digest
    package's OCI Distribution call)
  - \`gcpArtifactRegistryTagResolver\` (reverse, digest → tag, via the
    AR REST \`dockerImages.get\` call)

Connection pool warm, TCP+TLS handshakes amortise across every
artifact in the snapshot.

To plumb it through the forward path, \`digest.RemoteDockerImageSha256\`
now takes an injected \`*requests.Client\` instead of constructing its
own. The logger param drops out of the public signature — the client
already carries one. Only call site (the cloudrun forward resolver)
and the one test caller updated.

If client construction fails (rare; only on bad proxy URL), both
resolvers stay nil and the snapshot falls back to today's pre-
resolution behaviour. Same fail-open story as ADC token-source
failure.

Live-verified against the test bed: 6 artifacts, 6 status 200, one
client throughout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tructor (#4986)

Slice 8 added \`--resolve-names\` and threaded it through to
\`newCloudRunClient(ctx, resolveNames)\`. Unit-test coverage of the
plumbing was missing: every stub in the suite discarded the bool
parameter (\`_ bool\`), so a future refactor that silently breaks
the wiring (e.g. binding the flag to the wrong variable, or
forgetting to forward the bool) wouldn't have been caught.

New test runs the cobra command twice — once with \`--resolve-names\`
and once without — and asserts the bool that reaches the client
constructor matches the user's intent. Also pins the default to
false so a flipped default would be caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/cloudrun/registry.go
Comment thread internal/digest/digest.go
Comment thread internal/cloudrun/registry.go
@ToreMerkely ToreMerkely merged commit 586ed33 into main May 11, 2026
27 of 28 checks passed
@ToreMerkely ToreMerkely deleted the 4986-cloud-run-job-for-missing-digest branch May 11, 2026 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants