MIR-993: Configurable port-wait timeout for slow-cold-init addons #755

Merged: evanphx merged 1 commit into main from mir-993-mysql-addon-fails-health-check-on-first-boot-15s-t on Apr 13, 2026

Conversation

@evanphx evanphx commented Apr 13, 2026

Summary

  • Fixes MIR-993: MySQL addon sandboxes consistently fail their first-boot health check on loaded dev hardware because MySQL 8's cold init (--initialize → temporary setup server → real server) takes ~20s, exceeding the hardcoded 15s port-bind budget. The pool then marks the sandbox DEAD and crash-backs-off, so the fast (~1s) second boot is delayed too.
  • Adds an optional port_wait_timeout to SandboxSpec, threaded through both the legacy controller path (controllers/sandbox/sandbox.go) and the saga create path (controllers/sandbox/create_saga.go, via a new PortWaitTimeout field on bootContainersOut → waitPortsIn, wired by matching saga:"port_wait_timeout" tags). Empty or invalid values fall back to the existing 15s default, so normal apps and typos are unaffected.
  • Sets budgets on the affected addons via a new PortWaitTimeout field on addon.CreateSandboxPoolSpec:
    • MySQL: 60s (shared + dedicated)
    • Postgres: 60s (shared + dedicated)
    • RabbitMQ: 90s (dedicated) — Erlang VM + mnesia boot is slower than SQL engines
  • specsMatch now compares PortWaitTimeout so changing the budget on an existing pool is detected as a spec drift.
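
The fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the PR names resolvePortWaitTimeout and the 15s default, but the signature, the constant name defaultPortWaitTimeout, and the whitespace trimming shown here are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// defaultPortWaitTimeout mirrors the 15s default described in the PR.
// The constant name is hypothetical.
const defaultPortWaitTimeout = 15 * time.Second

// resolvePortWaitTimeout is a sketch of the fallback behavior: empty,
// unparsable, and non-positive values all resolve to the default.
// The real function's signature may differ.
func resolvePortWaitTimeout(raw string) time.Duration {
	raw = strings.TrimSpace(raw) // trimming is an assumption of this sketch
	if raw == "" {
		return defaultPortWaitTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		// Covers garbage ("soon"), bare numbers ("30"), "0s", and negatives.
		return defaultPortWaitTimeout
	}
	return d
}

func main() {
	for _, raw := range []string{"", "60s", "0s", "-5s", "soon", "30"} {
		fmt.Printf("%q -> %s\n", raw, resolvePortWaitTimeout(raw))
	}
}
```

Note that a bare number such as "30" is rejected by time.ParseDuration because it has no unit, so it takes the same fallback path as a typo.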

Test plan

  • make lint — 0 issues
  • Unit test TestResolvePortWaitTimeout covers empty / valid / zero / negative / garbage / bare-number fallback behavior
  • Saga tests TestCreateSandboxSaga_PortWaitTimeoutDefault / _Override assert the spec value flows through to WaitForPort
  • TestSpecsMatchCoversAllFields and TestSandboxControllerFrozen hashes updated; saga path carries the equivalent change
  • Manual repro in local dev cluster: the server log for the MySQL shared pool sandbox shows "waiting for ports to be bound ... timeout: 1m0s", and first boot comes up cleanly without a DEAD / crash-backoff cycle
  • make test-blackbox — existing MySQL addon blackbox still green

@evanphx evanphx requested a review from a team as a code owner April 13, 2026 18:30
coderabbitai Bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 46064378-17a4-40b0-93eb-1082487878f1

📥 Commits

Reviewing files that changed from the base of the PR and between c79f8d2 and 310987c.

📒 Files selected for processing (15)
  • api/compute/compute_v1alpha/schema.gen.go
  • api/compute/schema.yml
  • controllers/deployment/launcher.go
  • controllers/deployment/specs_match_test.go
  • controllers/sandbox/create_saga.go
  • controllers/sandbox/create_saga_test.go
  • controllers/sandbox/port_wait_timeout_test.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/sandbox_frozen_test.go
  • pkg/addon/framework.go
  • pkg/addon/mysql/dedicated.go
  • pkg/addon/mysql/shared.go
  • pkg/addon/postgresql/dedicated.go
  • pkg/addon/postgresql/shared.go
  • pkg/addon/rabbitmq/dedicated.go
✅ Files skipped from review due to trivial changes (5)
  • pkg/addon/postgresql/shared.go
  • controllers/deployment/specs_match_test.go
  • pkg/addon/mysql/shared.go
  • api/compute/schema.yml
  • pkg/addon/postgresql/dedicated.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/addon/mysql/dedicated.go
  • pkg/addon/rabbitmq/dedicated.go
  • controllers/sandbox/sandbox_frozen_test.go
  • controllers/deployment/launcher.go
  • controllers/sandbox/port_wait_timeout_test.go
  • controllers/sandbox/sandbox.go

📝 Walkthrough

Added a new optional PortWaitTimeout string field to SandboxSpec and schema (duration parsed via time.ParseDuration). Controller logic was updated to use resolvePortWaitTimeout with a 15s default instead of a hardcoded timeout; specsMatch compares the new field. Saga inputs/outputs and wait logic were adjusted, tests added/updated (including port-wait unit tests, mock runtime, fingerprint and frozen-file hash updates). The addon framework exposes PortWaitTimeout, and MySQL/PostgreSQL pools set 60s while RabbitMQ sets 90s.
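
To make the "port-wait budget" concrete, here is a hedged sketch of a wait loop bounded by a timeout. It assumes the check is a plain TCP connect from the caller's network namespace; the actual WaitForPort in this repository may work differently (for example, by inspecting the sandbox's namespace). The names waitForPort and demo are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForPort polls addr until a TCP connect succeeds or the timeout
// elapses. Hypothetical helper, not the repository's WaitForPort.
func waitForPort(ctx context.Context, addr string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var dialer net.Dialer
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()

	for {
		conn, err := dialer.DialContext(ctx, "tcp", addr)
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("port %s not bound within %s: %w", addr, timeout, ctx.Err())
		case <-ticker.C:
			// Retry on the next tick.
		}
	}
}

// demo exercises both outcomes: a bound port and a port that was released.
func demo() (boundErr, unboundErr error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err, nil
	}
	addr := ln.Addr().String()
	boundErr = waitForPort(context.Background(), addr, time.Second)

	ln.Close() // release the port so the next wait runs out of budget
	unboundErr = waitForPort(context.Background(), addr, 200*time.Millisecond)
	return boundErr, unboundErr
}

func main() {
	bound, unbound := demo()
	fmt.Println("bound listener error:", bound)
	fmt.Println("released port error:", unbound)
}
```

With a loop of this shape, a server that needs about 20s to bind always exhausts a 15s budget on first boot, which is the failure mode the per-addon budgets are meant to avoid.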




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
controllers/deployment/launcher.go (1)

846-848: Prefer semantic timeout comparison over raw string equality.

Line 846 currently treats equivalent duration strings (for example, 60s vs 1m) as a mismatch, which can trigger unnecessary pool replacement.

♻️ Suggested refactor
-	if spec1.PortWaitTimeout != spec2.PortWaitTimeout {
-		return fmt.Sprintf("port wait timeout mismatch: %s vs %s", spec1.PortWaitTimeout, spec2.PortWaitTimeout), false
-	}
+	resolve := func(raw string) time.Duration {
+		if strings.TrimSpace(raw) == "" {
+			return 15 * time.Second
+		}
+		d, err := time.ParseDuration(raw)
+		if err != nil || d <= 0 {
+			return 15 * time.Second
+		}
+		return d
+	}
+	t1 := resolve(spec1.PortWaitTimeout)
+	t2 := resolve(spec2.PortWaitTimeout)
+	if t1 != t2 {
+		return fmt.Sprintf("port wait timeout mismatch: %s (%s) vs %s (%s)",
+			spec1.PortWaitTimeout, t1, spec2.PortWaitTimeout, t2), false
+	}
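
The difference between string equality and duration equality can be shown in isolation. This is a simplified sketch with a hypothetical helper, sameDuration; unlike the suggestion above, it does not resolve invalid values to the 15s default and instead falls back to comparing the raw strings when either side fails to parse.

```go
package main

import (
	"fmt"
	"time"
)

// sameDuration reports whether two raw duration strings denote the same
// length of time. Hypothetical helper for illustration only.
func sameDuration(a, b string) bool {
	da, errA := time.ParseDuration(a)
	db, errB := time.ParseDuration(b)
	if errA != nil || errB != nil {
		// If either side does not parse, compare the raw strings.
		return a == b
	}
	return da == db
}

func main() {
	fmt.Println("string equality:  ", "60s" == "1m")
	fmt.Println("duration equality:", sameDuration("60s", "1m"))
}
```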
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/deployment/launcher.go` around lines 846 - 848, The current
string equality check for PortWaitTimeout on spec1 and spec2 can misreport
equivalent durations (e.g. "60s" vs "1m"); change the comparison to parse both
spec1.PortWaitTimeout and spec2.PortWaitTimeout with time.ParseDuration, handle
parse errors (log/return a clear mismatch/error if parsing fails), and compare
the resulting time.Duration values numerically so equivalent durations are
treated as equal; reference the symbols spec1.PortWaitTimeout and
spec2.PortWaitTimeout and ensure the mismatch message reflects the original
strings or the parsed durations for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/compute/schema.yml`:
- Around line 26-32: Update the port_wait_timeout documentation to state that
empty, unparsable, or non-positive durations (<=0, e.g. "0s") are treated the
same by the controller and will fall back to the default of 15s; reference that
values are parsed via time.ParseDuration and give an example of a non-positive
value being ignored so users don't assume "0s" disables the wait. Keep the
existing examples (e.g. "60s") and explicitly note addons needing longer waits
should set a larger positive duration.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 056a84f3-60d3-4147-9e40-1c09ea2ed031

📥 Commits

Reviewing files that changed from the base of the PR and between a29b71d and c79f8d2.


Comment thread api/compute/schema.yml Outdated
Sandbox creation fails any port-bind health check that doesn't complete
within a hardcoded 15s, intentionally marking the sandbox DEAD so the
pool crash-backs-off. MySQL 8's first-boot runs --initialize, a temp
setup server, then the real server — ~20s on loaded dev hardware — so
the first attempt always fails, and crash cooldown delays the retry.
Postgres (initdb) and RabbitMQ (Erlang VM + mnesia) have similar
cold-init behavior.

Add an optional port_wait_timeout to SandboxSpec, threaded through both
the legacy controller path and the saga create path, and set sensible
budgets on the affected addons:

- MySQL:    60s (shared + dedicated)
- Postgres: 60s (shared + dedicated)
- RabbitMQ: 90s (dedicated)

Empty spec falls back to the existing 15s default, so normal apps are
unaffected. Invalid/non-positive values also fall back rather than
bricking a pool.

Fixes MIR-993.
@evanphx evanphx force-pushed the mir-993-mysql-addon-fails-health-check-on-first-boot-15s-t branch from c79f8d2 to 310987c April 13, 2026 20:09
@evanphx evanphx merged commit d00e1a2 into main Apr 13, 2026
12 checks passed
@evanphx evanphx deleted the mir-993-mysql-addon-fails-health-check-on-first-boot-15s-t branch April 13, 2026 21:07