fix: increase readiness probe timeout and period for all products #136
Raise TimeoutSeconds from 1→3 and PeriodSeconds from 3→5 across Connect, Workbench, PackageManager, and Flightdeck. The 1s timeout was causing false readiness failures under load spikes on otherwise healthy pods.
../../.worktrees resolves to Projects/.worktrees, not ptd-workspace/.worktrees. Correct to ../.worktrees throughout.
Claude finished @Lytol's task. Review complete.
Submitted a review via the GitHub review system. The changes look good: the timeout/period increases are reasonable for addressing false readiness failures under load. Left one non-blocking suggestion about extracting the duplicated magic numbers into shared constants.
Review Summary
Changes verified: The diff correctly updates TimeoutSeconds (1→3) and PeriodSeconds (3→5) on readiness probes for Connect, Workbench, PackageManager, and Flightdeck. Liveness probes are untouched. The CLAUDE.md changes fix worktree paths (unrelated but fine to bundle).
Controller checklist:
- ✅ Reconciliation remains idempotent — these are static defaults applied every reconcile
- ✅ No API/CRD changes required — values are hardcoded in controller logic
- ✅ Existing deployments pick up new values on next rolling update
- ✅ No security concerns (probe config only)
One suggestion (non-blocking):
The timeout and period values are now duplicated as magic numbers across four controllers (connect.go:731-732, workbench.go:934-935, package_manager.go:582-583, flightdeck_controller.go:349-350). Consider extracting shared constants like:
    const (
        defaultReadinessTimeoutSeconds int32 = 3
        defaultReadinessPeriodSeconds  int32 = 5
    )

This would make future tuning a one-line change and make intent clearer. The codebase already uses this pattern for defaultWorkbenchReadinessProbePath.
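As a rough illustration of how the shared constants might be consumed, the sketch below builds a readiness probe configuration from them while leaving FailureThreshold per product. The `probe` type is a local stand-in for the timing fields of `corev1.Probe` so the example compiles with the standard library alone, and `newReadinessProbe` is a hypothetical helper, not a function from this repository.

```go
package main

import "fmt"

// Shared defaults, as proposed in the review suggestion.
const (
	defaultReadinessTimeoutSeconds int32 = 3
	defaultReadinessPeriodSeconds  int32 = 5
)

// probe is a local stand-in (assumption) for the timing fields of
// corev1.Probe from k8s.io/api/core/v1; the real controllers would
// populate corev1.Probe directly.
type probe struct {
	TimeoutSeconds   int32
	PeriodSeconds    int32
	FailureThreshold int32
}

// newReadinessProbe is a hypothetical helper: timeout and period come
// from the shared constants, FailureThreshold stays product-specific.
func newReadinessProbe(failureThreshold int32) probe {
	return probe{
		TimeoutSeconds:   defaultReadinessTimeoutSeconds,
		PeriodSeconds:    defaultReadinessPeriodSeconds,
		FailureThreshold: failureThreshold,
	}
}

func main() {
	connect := newReadinessProbe(3)    // Connect and Workbench use 3
	flightdeck := newReadinessProbe(2) // PackageManager and Flightdeck use 2
	fmt.Printf("%+v\n%+v\n", connect, flightdeck)
}
```

With a helper along these lines, each controller would pass only its own FailureThreshold, and the timing values would live in one place.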
Note on FailureThreshold differences: Connect and Workbench use FailureThreshold: 3 while PackageManager and Flightdeck use FailureThreshold: 2. This is pre-existing and likely intentional (heavier products get more chances), but worth calling out for visibility.
LGTM — the values are reasonable for reducing false failures under load while still detecting genuinely unhealthy pods within ~15-25s.
## [1.24.1](v1.24.0...v1.24.1) (2026-05-07)

### Bug Fixes

* increase readiness probe timeout and period for all products ([#136](#136)) ([fe3803e](fe3803e))
Summary
- `TimeoutSeconds` from 1s → 3s and `PeriodSeconds` from 3s → 5s on readiness probes for Connect, Workbench, PackageManager, and Flightdeck

Note that I also fixed the CLAUDE.md to use the appropriate nesting for worktree paths.

Test plan
- `go test ./...` passes (pre-existing failures in `internal/controller/core` are unrelated: missing kubebuilder etcd binary locally, reproduce identically on unmodified `main`)
- `go vet ./...` clean
- `kubectl describe pod`)