Promotes the hardening from PR #1211 (cascade-restart mitigations + publish phase timing) to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tadasant approved these changes Apr 27, 2026
rdimitrov added a commit that referenced this pull request Apr 27, 2026
…1213)

## Summary

Follow-up to #1211 to encode the GCP API and IAM dependencies its node-SA bindings rely on. After #1211 merged, the staging deploy hit two undocumented prereqs in sequence: Cloud Resource Manager API not enabled (because `projects.NewIAMMember` calls SetIamPolicy through CRM), then the Pulumi service account missing `roles/resourcemanager.projectIamAdmin`. Both fixed manually to unblock; this PR makes them explicit so they don't bite again.

## Changes

**`ensureRequiredAPIs` adopts five GCP APIs** as Pulumi-managed `projects.Service` resources, all with `DisableOnDestroy: false` and `DisableDependentServices: false` so a Pulumi destroy/refactor can never disable a shared API:

| API | Why |
|-----|-----|
| `cloudresourcemanager` | `projects.NewIAMMember` → SetIamPolicy |
| `compute` | `compute.GetDefaultServiceAccount` invoke + GKE |
| `container` | GKE cluster |
| `logging` | fluentbit-gke ships container logs |
| `monitoring` | managed Prometheus collector ships metrics |

CRM is created explicitly (not in the loop) so callers get a direct reference for `pulumi.DependsOn`. Storage is intentionally **not** Pulumi-managed: Pulumi state itself lives in a GCS bucket, chicken-and-egg.

**`grantNodeServiceAccountRoles`** now `DependsOn` the CRM `projects.Service` for the four IAM bindings, and uses `compute.GetDefaultServiceAccount` to derive the SA email (Compute API was already required for GKE, which avoids a CRM-dependent project-number lookup).

**`deploy/README.md`** adds `roles/resourcemanager.projectIamAdmin` to the Pulumi SA's required roles and clarifies which APIs must be enabled before the first `pulumi up` (storage, cloudresourcemanager, container) vs. which `ensureRequiredAPIs` adopts.

## Notes

`projects.Service` is idempotent against already-enabled APIs: on existing projects (staging/prod), Pulumi adopts the live enablement into state without re-enabling. This was confirmed live: staging deploy went green after the manual unblocks above, validating the runtime model.

## Test plan

- [x] `go build ./...` clean for `deploy/`
- [x] `golangci-lint run` clean for `deploy/pkg/providers/gcp/...` (one pre-existing `nilnil` at line 61, unrelated)
- [x] Manual unblock + green staging deploy verified the runtime behavior this PR encodes
- [ ] On merge: staging deploy adopts the existing API enablements without churn, then proceeds normally
- [ ] Prod deploy via #1212 (image promotion) inherits the same setup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
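The shape of the API list described in the commit message can be sketched in plain Go, without the Pulumi SDK. This is only an illustrative model: `apiSpec`, its `Explicit` field, and `requiredAPIs` are hypothetical names, not the code in `deploy/pkg/providers/gcp`. The struct stands in for the arguments that would be passed to a `projects.Service` resource, and the full service names assume the usual `<name>.googleapis.com` form.

```go
package main

import "fmt"

// apiSpec is a hypothetical stand-in for the arguments given to a Pulumi
// `projects.Service` resource. It is not the real SDK type.
type apiSpec struct {
	Service                  string // full service name, e.g. "compute.googleapis.com"
	DisableOnDestroy         bool   // false: a destroy never disables a shared API
	DisableDependentServices bool   // false: dependents are never disabled either
	Explicit                 bool   // true: created outside the loop so callers can depend on it
}

// requiredAPIs returns the five APIs from the table in the commit message.
// Cloud Resource Manager comes first and is marked explicit; the other four
// are generated in a loop.
func requiredAPIs() []apiSpec {
	specs := []apiSpec{{
		Service:                  "cloudresourcemanager.googleapis.com",
		DisableOnDestroy:         false,
		DisableDependentServices: false,
		Explicit:                 true,
	}}
	for _, name := range []string{"compute", "container", "logging", "monitoring"} {
		specs = append(specs, apiSpec{
			Service:                  name + ".googleapis.com",
			DisableOnDestroy:         false,
			DisableDependentServices: false,
		})
	}
	return specs
}

func main() {
	for _, s := range requiredAPIs() {
		fmt.Printf("%s disableOnDestroy=%t explicit=%t\n",
			s.Service, s.DisableOnDestroy, s.Explicit)
	}
}
```

Storage is deliberately absent from the list, matching the note that the bucket holding Pulumi state cannot be managed by the same stack.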
## Summary

Promotes v1.7.1 to production. Contents: the hardening + diagnostics from #1211.

What's in v1.7.1 (vs v1.7.0):

- `PodDisruptionBudget` with `minAvailable: 1` + `TopologySpreadConstraints` to keep pods spread across nodes
- `StartupProbe` (30 × 5s) covering the new DB-retry budget
- `cmd/registry/main.go` `slog` `validate_ms` timing on the publish path so the next slow `/v0/publish` is self-diagnostic

## Test plan

- … (`ghcr.io/modelcontextprotocol/registry:1.7.1`)
- `deploy-production.yml` rolls out v1.7.1: watch for a clean rolling update (`maxSurge: 1`, `maxUnavailable: 0`) and confirm both pods reach Ready
- `Starting MCP Registry Application v1.7.1` shows up in Cloud Logging from the new pods

🤖 Generated with Claude Code