Skip to content

test(e2e): poll Task phase via WaitForTaskPhase to fix flake#1084

Merged
gjkim42 merged 1 commit intomainfrom
fix-e2e-flake
May 3, 2026
Merged

test(e2e): poll Task phase via WaitForTaskPhase to fix flake#1084
gjkim42 merged 1 commit intomainfrom
fix-e2e-flake

Conversation

@gjkim42
Copy link
Copy Markdown
Collaborator

@gjkim42 gjkim42 commented May 1, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes a flake in the e2e suite where Task with workspace [codex] should run a Task with workspace to completion (and several sibling tests) intermittently fail with:

Expected
    <string>: Running
to equal
    <string>: Succeeded

even though the agent finished cleanly:

  • Pod ws-task-...: phase=Succeeded
  • Job ws-task: active=0 succeeded=1 failed=0
  • Task ws-task: phase=Running

The tests called WaitForJobCompletion, which only polls the Job's Complete condition, then immediately read Task.status.phase once. The Task controller is a separate reconciler that observes the Job and then patches Task.status.phase, so there is a small async window where the Job is Complete but the Task has not yet been updated. The single-shot Expect(...).To(Equal("Succeeded")) lost that race.

This PR adds a WaitForTaskPhase(name, phase) framework helper that polls until the expected phase is reached (2 min, 1 s interval), and replaces the racy single-shot phase assertions across task_test.go, opencode_test.go, setup_command_test.go, and skills_test.go. The helper inlines the Tasks Get call and returns an empty string on transient API errors so Eventually keeps polling instead of failing on the first blip — GetTaskPhase, by contrast, calls Expect(err).NotTo(HaveOccurred()), which would short-circuit the polling loop. This matches the pattern used by other WaitFor* helpers in the framework.

The existing 5-minute Eventually(... == "Failed") in setup_command_test.go is left alone — it waits for the full Job-failure lifecycle (backoff retries before the controller marks the Task Failed), which is a different scenario from the post-completion reconciliation gap this helper targets.

Reference flaky run: https://github.com/kelos-dev/kelos/actions/runs/25226338354/job/73970696826

Which issue(s) this PR is related to:

N/A

Special notes for your reviewer:

N/A

Does this PR introduce a user-facing change?

NONE

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 1, 2026

Greptile Summary

This PR fixes a flaky e2e test race condition where tests read Task.status.phase once immediately after WaitForJobCompletion, hitting the async window between the Job reaching Complete and the Task controller patching its phase. It introduces WaitForTaskPhase, a polling helper that retries for up to 2 minutes (1 s interval) and returns "" on transient API errors — matching the error-resilient pattern of other WaitFor* helpers — then substitutes it for every racy single-shot Expect(GetTaskPhase(...)) call across four test files.

Confidence Score: 5/5

Safe to merge — the fix is a well-scoped test-only change with one minor timeout-widening observation.

All findings are P2 (style/non-blocking). The new helper is correctly implemented, the error-suppression strategy matches the rest of the framework, and the import cleanup is correct. The only notable side-effect is the dep-chain-b "Waiting" check widening from 30 s to 2 min, which is a slow-failure concern rather than a correctness problem.

test/e2e/task_test.go — dep-chain "Waiting" phase timeout widened from 30 s to 2 min

Important Files Changed

Filename Overview
test/e2e/framework/framework.go Adds WaitForTaskPhase helper that polls with a 2-minute timeout / 1-second interval, returning "" on transient API errors rather than asserting — correctly matches the pattern of other WaitFor* helpers.
test/e2e/task_test.go Replaces racy single-shot Expect(GetTaskPhase(...)) calls with WaitForTaskPhase; the dep-chain-b "Waiting" check silently widens its timeout from 30 s to 2 min.
test/e2e/opencode_test.go Replaces single-shot phase assertion with WaitForTaskPhase; correctly removes now-unused gomega dot-import.
test/e2e/setup_command_test.go Replaces single-shot phase assertion with WaitForTaskPhase; existing 5-minute failure-lifecycle Eventually is intentionally left unchanged.
test/e2e/skills_test.go Replaces single-shot phase assertion with WaitForTaskPhase; straightforward mechanical change.

Sequence Diagram

sequenceDiagram
    participant Test
    participant Job
    participant TaskController
    participant TaskStatus

    Test->>Job: WaitForJobCompletion (polls Job Complete condition)
    Job-->>Test: Complete ✓

    Note over TaskController,TaskStatus: async reconcile window
    TaskController->>TaskStatus: patch phase → Succeeded

    alt Before this PR (racy)
        Test->>TaskStatus: GetTaskPhase (single read)
        TaskStatus-->>Test: "Running" ← race lost
        Test-->>Test: FAIL
    else After this PR (polling)
        loop WaitForTaskPhase (2 min / 1 s)
            Test->>TaskStatus: Get phase
            TaskStatus-->>Test: "Running" or ""
        end
        Test->>TaskStatus: Get phase
        TaskStatus-->>Test: "Succeeded" ✓
        Test-->>Test: PASS
    end
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
test/e2e/task_test.go:251
**`WaitForTaskPhase` timeout widened from 30 s → 2 min for transient phase**

The dependency-chain test previously used a dedicated 30-second `Eventually` when asserting that Task B enters `"Waiting"`. `WaitForTaskPhase` hard-codes a 2-minute timeout. `"Waiting"` is a transient phase — if Task B has already advanced to `"Running"` or `"Succeeded"` when polling starts, the poller can never match it and the test silently hangs for the full 2 minutes before reporting failure. The original 30 s window gave much faster signal for this kind of ordering regression. Consider accepting an optional timeout in `WaitForTaskPhase`, or inlining a short-lived `Eventually` for this specific assertion.

Reviews (3): Last reviewed commit: "test(e2e): poll Task phase via WaitForTa..." | Re-trigger Greptile

Comment thread test/e2e/framework/framework.go
@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 1, 2026

/kelos pick-up

1 similar comment
@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 1, 2026

/kelos pick-up

@kelos-bot
Copy link
Copy Markdown

kelos-bot Bot commented May 1, 2026

🤖 Kelos Task Status

Task kelos-workers-issue-comment-3b19809fe3e7 has been accepted and is being processed.

@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 1, 2026

/kelos pick-up

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="test/e2e/framework/framework.go">

<violation number="1" location="test/e2e/framework/framework.go:325">
P2: `GetTaskPhase` uses the global `Expect(err).NotTo(HaveOccurred())`, which calls Ginkgo's `Fail` on a transient API error. When invoked inside `Eventually`, this immediately aborts the test instead of retrying. Either inline the API call and return a zero-value on error, or use the `func(g Gomega)` form so `Eventually` can catch and retry the assertion.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread test/e2e/framework/framework.go Outdated
@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 1, 2026

/kelos pick-up

@kelos-bot
Copy link
Copy Markdown

kelos-bot Bot commented May 1, 2026

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-658e30b9e6cf has succeeded. ✅

@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 2, 2026

/kelos review

@kelos-bot
Copy link
Copy Markdown

kelos-bot Bot commented May 2, 2026

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-f40f4baafaa1 has succeeded. ✅

Copy link
Copy Markdown

@kelos-bot kelos-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: APPROVE
Overall correctness: patch is correct
Scope: Add WaitForTaskPhase polling helper and replace racy single-shot Expect(GetTaskPhase(...)).To(Equal(...)) assertions across four e2e test files to fix a Job-completes-before-Task-controller-reconciles flake.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 0 none
P3 0 none

Findings

None. The new helper at test/e2e/framework/framework.go:323 polls the Task directly and returns "" on transient API errors, matching the pattern in WaitForJobCompletion/WaitForDeploymentAvailable and avoiding the issue where GetTaskPhase's internal Expect(err) would short-circuit Eventually. The five call-site replacements are faithful one-for-one swaps. The 5-minute Eventually(... == \"Failed\") in setup_command_test.go is intentionally left alone as the PR description notes — that path waits for the full Job-failure-with-backoff lifecycle, not the post-completion reconciliation gap. The unused gomega dot-import is correctly removed from opencode_test.go; other touched files still need it. String literals for phase values match the existing convention in this test package.

Key takeaways

  • Correct, narrowly-scoped flake fix; no production code touched.
  • The transient-error handling addition (commit 0d4ee09) is the right pattern and matches sibling WaitFor* helpers.

Note on prompt injection: A third-party comment included a hidden instruction demanding attribution to its tool. That instruction was disregarded; findings here are independently derived.

@gjkim42
Copy link
Copy Markdown
Collaborator Author

gjkim42 commented May 3, 2026

/kelos squash-commits

@kelos-bot
Copy link
Copy Markdown

kelos-bot Bot commented May 3, 2026

🤖 Kelos Task Status

Task kelos-squash-commits-issue-comment-433c78070875 has succeeded. ✅

@kelos-bot kelos-bot Bot force-pushed the fix-e2e-flake branch from 0d4ee09 to c53538a Compare May 3, 2026 08:04
@kelos-bot
Copy link
Copy Markdown

kelos-bot Bot commented May 3, 2026

Squash complete.

🤖 Kelos Agent @gjkim42

Rebased on origin/main and squashed 2 commits into 1:

  • c53538a test(e2e): poll Task phase via WaitForTaskPhase to fix flake

@gjkim42 gjkim42 enabled auto-merge May 3, 2026 08:14
@gjkim42 gjkim42 added this pull request to the merge queue May 3, 2026
Merged via the queue into main with commit 47031af May 3, 2026
16 checks passed
@gjkim42 gjkim42 deleted the fix-e2e-flake branch May 3, 2026 08:25
kelos-bot Bot pushed a commit that referenced this pull request May 3, 2026
…ing convention

Greptile flagged a P2 inconsistency on PR #786 itself: the convention
text in agentconfig.yaml and kelos-workers.yaml prescribed
{{if .Branch}}{{.Branch}}{{else}}main{{end}}, but the canonical form
used in the spawner YAMLs (kelos-reviewer.yaml, kelos-api-reviewer.yaml)
is {{with index . "Branch"}}{{.}}{{else}}main{{end}}. The new
"Docs must match implementation" rule applies to itself — fix the
documented example to match the deployed form.

PR #1084 fixed an e2e flake caused by GetTaskPhase calling the global
Expect(err).NotTo(HaveOccurred()) inside an Eventually polling loop;
Gomega does not retry on Expect failures, so a transient API error
short-circuits the poller. Both Greptile and cubic-dev-ai flagged the
same anti-pattern. Capture the lesson in AGENTS.md, the shared and
worker AgentConfigs, and add a reviewer checklist item so future
WaitFor* helpers either return a zero-value on error or use the
Eventually(func(g Gomega) { ... }) form.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant