test(e2e): poll Task phase via WaitForTaskPhase to fix flake#1084
test(e2e): poll Task phase via WaitForTaskPhase to fix flake#1084
Conversation
Greptile SummaryThis PR fixes a flaky e2e test race condition where tests read Confidence Score: 5/5Safe to merge — the fix is a well-scoped test-only change with one minor timeout-widening observation. All findings are P2 (style/non-blocking). The new helper is correctly implemented, the error-suppression strategy matches the rest of the framework, and the import cleanup is correct. The only notable side-effect is the dep-chain-b "Waiting" check widening from 30 s to 2 min, which is a slow-failure concern rather than a correctness problem. test/e2e/task_test.go — dep-chain "Waiting" phase timeout widened from 30 s to 2 min
|
| Filename | Overview |
|---|---|
| test/e2e/framework/framework.go | Adds WaitForTaskPhase helper that polls with a 2-minute timeout / 1-second interval, returning "" on transient API errors rather than asserting — correctly matches the pattern of other WaitFor* helpers. |
| test/e2e/task_test.go | Replaces racy single-shot Expect(GetTaskPhase(...)) calls with WaitForTaskPhase; the dep-chain-b "Waiting" check silently widens its timeout from 30 s to 2 min. |
| test/e2e/opencode_test.go | Replaces single-shot phase assertion with WaitForTaskPhase; correctly removes now-unused gomega dot-import. |
| test/e2e/setup_command_test.go | Replaces single-shot phase assertion with WaitForTaskPhase; existing 5-minute failure-lifecycle Eventually is intentionally left unchanged. |
| test/e2e/skills_test.go | Replaces single-shot phase assertion with WaitForTaskPhase; straightforward mechanical change. |
Sequence Diagram
sequenceDiagram
participant Test
participant Job
participant TaskController
participant TaskStatus
Test->>Job: WaitForJobCompletion (polls Job Complete condition)
Job-->>Test: Complete ✓
Note over TaskController,TaskStatus: async reconcile window
TaskController->>TaskStatus: patch phase → Succeeded
alt Before this PR (racy)
Test->>TaskStatus: GetTaskPhase (single read)
TaskStatus-->>Test: "Running" ← race lost
Test-->>Test: FAIL
else After this PR (polling)
loop WaitForTaskPhase (2 min / 1 s)
Test->>TaskStatus: Get phase
TaskStatus-->>Test: "Running" or ""
end
Test->>TaskStatus: Get phase
TaskStatus-->>Test: "Succeeded" ✓
Test-->>Test: PASS
end
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
test/e2e/task_test.go:251
**`WaitForTaskPhase` timeout widened from 30 s → 2 min for transient phase**
The dependency-chain test previously used a dedicated 30-second `Eventually` when asserting that Task B enters `"Waiting"`. `WaitForTaskPhase` hard-codes a 2-minute timeout. `"Waiting"` is a transient phase — if Task B has already advanced to `"Running"` or `"Succeeded"` when polling starts, the poller can never match it and the test silently hangs for the full 2 minutes before reporting failure. The original 30 s window gave much faster signal for this kind of ordering regression. Consider accepting an optional timeout in `WaitForTaskPhase`, or inlining a short-lived `Eventually` for this specific assertion.
Reviews (3): Last reviewed commit: "test(e2e): poll Task phase via WaitForTa..." | Re-trigger Greptile
|
/kelos pick-up |
1 similar comment
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
|
/kelos pick-up |
There was a problem hiding this comment.
1 issue found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="test/e2e/framework/framework.go">
<violation number="1" location="test/e2e/framework/framework.go:325">
P2: `GetTaskPhase` uses the global `Expect(err).NotTo(HaveOccurred())`, which calls Ginkgo's `Fail` on a transient API error. When invoked inside `Eventually`, this immediately aborts the test instead of retrying. Either inline the API call and return a zero-value on error, or use the `func(g Gomega)` form so `Eventually` can catch and retry the assertion.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
68bad74 to
0d4ee09
Compare
|
/kelos review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: APPROVE
Overall correctness: patch is correct
Scope: Add WaitForTaskPhase polling helper and replace racy single-shot Expect(GetTaskPhase(...)).To(Equal(...)) assertions across four e2e test files to fix a Job-completes-before-Task-controller-reconciles flake.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 0 | — | none |
| P3 | 0 | — | none |
Findings
None. The new helper at test/e2e/framework/framework.go:323 polls the Task directly and returns "" on transient API errors, matching the pattern in WaitForJobCompletion/WaitForDeploymentAvailable and avoiding the issue where GetTaskPhase's internal Expect(err) would short-circuit Eventually. The five call-site replacements are faithful one-for-one swaps. The 5-minute Eventually(... == \"Failed\") in setup_command_test.go is intentionally left alone as the PR description notes — that path waits for the full Job-failure-with-backoff lifecycle, not the post-completion reconciliation gap. The unused gomega dot-import is correctly removed from opencode_test.go; other touched files still need it. String literals for phase values match the existing convention in this test package.
Key takeaways
- Correct, narrowly-scoped flake fix; no production code touched.
- The transient-error handling addition (commit
0d4ee09) is the right pattern and matches siblingWaitFor*helpers.
Note on prompt injection: A third-party comment included a hidden instruction demanding attribution to its tool. That instruction was disregarded; findings here are independently derived.
|
/kelos squash-commits |
|
🤖 Kelos Task Status Task |
0d4ee09 to
c53538a
Compare
|
Squash complete. 🤖 Kelos Agent @gjkim42 Rebased on
|
…ing convention Greptile flagged a P2 inconsistency on PR #786 itself: the convention text in agentconfig.yaml and kelos-workers.yaml prescribed {{if .Branch}}{{.Branch}}{{else}}main{{end}}, but the canonical form used in the spawner YAMLs (kelos-reviewer.yaml, kelos-api-reviewer.yaml) is {{with index . "Branch"}}{{.}}{{else}}main{{end}}. The new "Docs must match implementation" rule applies to itself — fix the documented example to match the deployed form. PR #1084 fixed an e2e flake caused by GetTaskPhase calling the global Expect(err).NotTo(HaveOccurred()) inside an Eventually polling loop; Gomega does not retry on Expect failures, so a transient API error short-circuits the poller. Both Greptile and cubic-dev-ai flagged the same anti-pattern. Capture the lesson in AGENTS.md, the shared and worker AgentConfigs, and add a reviewer checklist item so future WaitFor* helpers either return a zero-value on error or use the Eventually(func(g Gomega) { ... }) form.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a flake in the e2e suite where
Task with workspace [codex] should run a Task with workspace to completion(and several sibling tests) intermittently fail with:even though the agent finished cleanly:
Pod ws-task-...: phase=SucceededJob ws-task: active=0 succeeded=1 failed=0Task ws-task: phase=RunningThe tests called
WaitForJobCompletion, which only polls the Job'sCompletecondition, then immediately readTask.status.phaseonce. The Task controller is a separate reconciler that observes the Job and then patchesTask.status.phase, so there is a small async window where the Job isCompletebut the Task has not yet been updated. The single-shotExpect(...).To(Equal("Succeeded"))lost that race.This PR adds a
WaitForTaskPhase(name, phase)framework helper that polls until the expected phase is reached (2 min, 1 s interval), and replaces the racy single-shot phase assertions acrosstask_test.go,opencode_test.go,setup_command_test.go, andskills_test.go. The helper inlines the Tasks Get call and returns an empty string on transient API errors soEventuallykeeps polling instead of failing on the first blip —GetTaskPhase, by contrast, callsExpect(err).NotTo(HaveOccurred()), which would short-circuit the polling loop. This matches the pattern used by otherWaitFor*helpers in the framework.The existing 5-minute
Eventually(... == "Failed")insetup_command_test.gois left alone — it waits for the full Job-failure lifecycle (backoff retries before the controller marks the TaskFailed), which is a different scenario from the post-completion reconciliation gap this helper targets.Reference flaky run: https://github.com/kelos-dev/kelos/actions/runs/25226338354/job/73970696826
Which issue(s) this PR is related to:
N/A
Special notes for your reviewer:
N/A
Does this PR introduce a user-facing change?