Skip to content

fix: increase readiness probe timeout and period for all products#136

Merged
Lytol merged 3 commits into
mainfrom
probe-readiness-timeout
May 7, 2026
Merged

fix: increase readiness probe timeout and period for all products#136
Lytol merged 3 commits into
mainfrom
probe-readiness-timeout

Conversation

@Lytol
Copy link
Copy Markdown
Contributor

@Lytol Lytol commented May 7, 2026

Summary

  • Raises TimeoutSeconds from 1s → 3s and PeriodSeconds from 3s → 5s on readiness probes for Connect, Workbench, PackageManager, and Flightdeck
  • The 1s timeout was causing false readiness failures on healthy pods under load spikes (observed on Workbench)
  • Liveness probes and all other probe fields (thresholds, initial delay) are unchanged
  • No API/CRD changes — purely controller defaults; existing deployments will pick up the new values automatically on the next reconcile via a rolling update

Note that I also fixed the CLAUDE.md to use the appropriate nesting for worktree paths.

Test plan

  • go test ./... passes (pre-existing failures in internal/controller/core are unrelated — missing kubebuilder etcd binary locally, reproduce identically on unmodified main)
  • go vet ./... clean
  • Deploy to a test cluster and confirm readiness probes reflect new values (kubectl describe pod)
  • Verify no false readiness failures under normal and loaded conditions

Lytol added 2 commits May 7, 2026 15:05
Raise TimeoutSeconds from 1→3 and PeriodSeconds from 3→5 across
Connect, Workbench, PackageManager, and Flightdeck. The 1s timeout
was causing false readiness failures under load spikes on otherwise
healthy pods.
../../.worktrees resolves to Projects/.worktrees, not
ptd-workspace/.worktrees. Correct to ../.worktrees throughout.
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Claude finished @Lytol's task —— View job


Review Complete

  • Read review guidelines
  • Analyze diff and changed files
  • Submit review

Submitted a review via the GitHub review system. The changes look good — the timeout/period increases are reasonable for addressing false readiness failures under load. Left one non-blocking suggestion about extracting the duplicated magic numbers into shared constants.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Changes verified: The diff correctly updates TimeoutSeconds (1→3) and PeriodSeconds (3→5) on readiness probes for Connect, Workbench, PackageManager, and Flightdeck. Liveness probes are untouched. The CLAUDE.md changes fix worktree paths (unrelated but fine to bundle).

Controller checklist:

  • ✅ Reconciliation remains idempotent — these are static defaults applied every reconcile
  • ✅ No API/CRD changes required — values are hardcoded in controller logic
  • ✅ Existing deployments pick up new values on next rolling update
  • ✅ No security concerns (probe config only)

One suggestion (non-blocking):

The timeout and period values are now duplicated as magic numbers across four controllers (connect.go:731-732, workbench.go:934-935, package_manager.go:582-583, flightdeck_controller.go:349-350). Consider extracting shared constants like:

const (
    defaultReadinessTimeoutSeconds int32 = 3
    defaultReadinessPeriodSeconds  int32 = 5
)

This would make future tuning a one-line change and make intent clearer. The codebase already uses this pattern for defaultWorkbenchReadinessProbePath.

Note on FailureThreshold differences: Connect and Workbench use FailureThreshold: 3 while PackageManager and Flightdeck use FailureThreshold: 2. This is pre-existing and likely intentional (heavier products get more chances), but worth calling out for visibility.

LGTM — the values are reasonable for reducing false failures under load while still detecting genuinely unhealthy pods within ~15-25s.

@Lytol Lytol marked this pull request as ready for review May 7, 2026 22:08
amdove
amdove previously approved these changes May 7, 2026
@Lytol Lytol added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit fe3803e May 7, 2026
7 checks passed
@Lytol Lytol deleted the probe-readiness-timeout branch May 7, 2026 22:19
ian-flores pushed a commit that referenced this pull request May 7, 2026
## [1.24.1](v1.24.0...v1.24.1) (2026-05-07)

### Bug Fixes

* increase readiness probe timeout and period for all products ([#136](#136)) ([fe3803e](fe3803e))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants