
deploy: update prod to v1.7.1#1212

Merged
rdimitrov merged 1 commit into main from promote-prod-v1.7.1
Apr 27, 2026


@rdimitrov
Member

Summary

Promotes v1.7.1 to production. Contents: the hardening + diagnostics from #1211.

What's in v1.7.1 (vs v1.7.0):

  • Pulumi-managed IAM bindings for the default compute SA so fluentbit-gke and the GMP collector can ship logs/metrics (already applied to live cluster manually)
  • PodDisruptionBudget with minAvailable: 1 + TopologySpreadConstraints to keep pods spread across nodes
  • StartupProbe (30 × 5s) covering the new DB-retry budget
  • Bounded retry-with-backoff on initial DB connect in cmd/registry/main.go
  • Memory request 128→256Mi, limit 256→512Mi
  • slog validate_ms timing on the publish path so the next slow /v0/publish is self-diagnostic

Test plan

  • v1.7.1 release workflow built and pushed image successfully (ghcr.io/modelcontextprotocol/registry:1.7.1)
  • Pulumi.gcpProd.yaml diff is the single-line image bump
  • On merge, deploy-production.yml rolls out v1.7.1 — watch for clean rolling update (maxSurge: 1, maxUnavailable: 0) and confirm both pods reach Ready
  • After rollout, confirm Starting MCP Registry Application v1.7.1 shows up in Cloud Logging from the new pods

🤖 Generated with Claude Code

Promotes the hardening from PR #1211 (cascade-restart mitigations + publish
phase timing) to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rdimitrov added a commit that referenced this pull request Apr 27, 2026
…1213)

## Summary

Follow-up to #1211 to encode the GCP API and IAM dependencies its
node-SA bindings rely on. After #1211 merged, the staging deploy hit two
undocumented prereqs in sequence — Cloud Resource Manager API not
enabled (because `projects.NewIAMMember` calls SetIamPolicy through
CRM), then the Pulumi service account missing
`roles/resourcemanager.projectIamAdmin`. Both fixed manually to unblock;
this PR makes them explicit so they don't bite again.

## Changes

**`ensureRequiredAPIs` adopts five GCP APIs** as Pulumi-managed
`projects.Service` resources, all with `DisableOnDestroy: false` and
`DisableDependentServices: false` so a Pulumi destroy/refactor can never
disable a shared API:

| API | Why |
|-----|-----|
| `cloudresourcemanager` | `projects.NewIAMMember` → SetIamPolicy |
| `compute` | `compute.GetDefaultServiceAccount` invoke + GKE |
| `container` | GKE cluster |
| `logging` | fluentbit-gke ships container logs |
| `monitoring` | managed Prometheus collector ships metrics |

CRM is created explicitly (not in the loop) so callers get a direct
reference for `pulumi.DependsOn`. Storage is intentionally **not**
Pulumi-managed — Pulumi state itself lives in a GCS bucket,
chicken-and-egg.

**`grantNodeServiceAccountRoles`** now `DependsOn` the CRM
`projects.Service` for the four IAM bindings, and uses
`compute.GetDefaultServiceAccount` to derive the SA email (Compute API
was already required for GKE — avoids a CRM-dependent project-number
lookup).

**`deploy/README.md`** — adds `roles/resourcemanager.projectIamAdmin` to
the Pulumi SA's required roles and clarifies which APIs must be enabled
before the first `pulumi up` (storage, cloudresourcemanager, container)
vs. which `ensureRequiredAPIs` adopts.

## Notes

`projects.Service` is idempotent against already-enabled APIs — on
existing projects (staging/prod), Pulumi adopts the live enablement into
state without re-enabling. This was confirmed live: staging deploy went
green after the manual unblocks above, validating the runtime model.

## Test plan

- [x] `go build ./...` clean for `deploy/`
- [x] `golangci-lint run` clean for `deploy/pkg/providers/gcp/...` (one
pre-existing `nilnil` at line 61, unrelated)
- [x] Manual unblock + green staging deploy verified the runtime
behavior this PR encodes
- [ ] On merge: staging deploy adopts the existing API enablements
without churn, then proceeds normally
- [ ] Prod deploy via #1212 (image promotion) inherits the same setup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rdimitrov rdimitrov merged commit 762a484 into main Apr 27, 2026
6 checks passed
@rdimitrov rdimitrov deleted the promote-prod-v1.7.1 branch April 27, 2026 16:44
