Skip to content

Conversation

@jumski
Copy link
Contributor

@jumski jumski commented Jun 19, 2025

Problem

The playground app wasn't receiving real-time updates when steps failed, only when they completed. Investigation revealed that step:failed events were not being broadcast from the fail_task function.

Root Cause

PostgreSQL was optimizing away the broadcast_step_failed CTE because:

  • It only contained a SELECT statement
  • It wasn't referenced by the main query
  • PostgreSQL didn't recognize that realtime.send() has necessary side effects

This caused the step:failed broadcast to be silently skipped.

Solution

Replaced the CTE approach with a direct check and PERFORM statement after the main UPDATE:

-- Check if step failed by querying the step_states table
SELECT (status = 'failed') INTO v_step_failed 
FROM pgflow.step_states 
WHERE pgflow.step_states.run_id = fail_task.run_id 
  AND pgflow.step_states.step_slug = fail_task.step_slug;

-- Send broadcast event for step failure if the step was failed
IF v_step_failed THEN
  PERFORM realtime.send(...);
END IF;

This matches the pattern already used for run:failed events and ensures the broadcast always executes.

Testing

Added regression tests in both pgTAP and TypeScript to prevent this issue from recurring:

  • pkgs/core/supabase/tests/regressions/step_failed_event_bug.test.sql
  • pkgs/client/__tests__/integration/regressions/step-failed-event-bug.test.ts

Both tests verify that step:failed events are properly broadcast when a step fails permanently.

Fixes

Resolves the issue where error/failed status changes don't propagate to the playground app UI.

jumski added 6 commits June 19, 2025 20:52
…optimization

- Added new integration tests to demonstrate that step:failed events are not broadcasted
- Included tests for both full and minimal reproduction scenarios highlighting the bug
- Updated test setup to simulate failure conditions and verify event broadcasting
- Documented the issue where CTE optimization causes realtime.send() to be skipped
- These changes help reproduce and confirm the bug, guiding future fixes
… optimization

- Updated test to reflect manual status update and added note about missing step:started event
- Corrected SQL in test setup to match step slug naming conventions
- Fixed bug in SQL logic that caused step:failed event not to be sent by using PERFORM in CTE
- Adjusted test assertions to verify event broadcasting after fix
- Clarified comments and logging for better understanding of the bug and fix
Refactors the test to replace manual step status updates with a poll_and_fail helper,
ensuring consistent failure handling and simplifying the test logic.
…iled event broadcasting

Refactors test file locations, renames for clarity, and enhances test descriptions.
Addresses a bug where step:failed events were not broadcast due to CTE optimization.
Includes removal of obsolete test file and updates to test comments for better clarity.
…events and messaging

Enhances the fail_task procedure to accurately check step failure status, send appropriate realtime
events for both step and run failures, and manage message retries and archiving for failed tasks.
Also updates migration to include the new realtime event handling logic.
- Replace CTE-based broadcast with direct PERFORM statement
- Add regression tests to prevent future issues
- Fix column ambiguity in fail_task function
@changeset-bot
Copy link

changeset-bot bot commented Jun 19, 2025

🦋 Changeset detected

Latest commit: ab17a0c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 7 packages
Name Type
@pgflow/core Patch
pgflow Patch
@pgflow/client Patch
@pgflow/edge-worker Patch
@pgflow/example-flows Patch
@pgflow/dsl Patch
@pgflow/website Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 19, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@netlify
Copy link

netlify bot commented Jun 19, 2025

Deploy Preview for pgflow-demo canceled.

Name Link
🔨 Latest commit 6a3cdf6
🔍 Latest deploy log https://app.netlify.com/projects/pgflow-demo/deploys/685470ea9489f90008967b9d

@nx-cloud
Copy link

nx-cloud bot commented Jun 19, 2025

View your CI Pipeline Execution ↗ for commit 6a3cdf6.

Command Status Duration Result
nx affected -t lint typecheck test build ✅ Succeeded 3m 56s View ↗

☁️ Nx Cloud last updated this comment at 2025-06-19 20:24:30 UTC

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Jun 19, 2025

Deploying pgflow with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6a3cdf6
Status: ✅  Deploy successful!
Preview URL: https://6fe22bf3.pgflow.pages.dev
Branch Preview URL: https://fix-playground-status-update.pgflow.pages.dev

View logs

@jumski jumski merged commit 220c867 into main Jun 19, 2025
7 checks passed
graphite-app bot pushed a commit that referenced this pull request Nov 3, 2025
…Testing

 (#304)

## Summary

This PR addresses **critical bugs** in pgflow's realtime broadcasting system and dramatically improves test coverage for broadcast events. Two unreferenced CTEs in `start_ready_steps()` are never executed due to PostgreSQL's query optimizer, causing:

1. **`step:started`** **events never broadcast** - breaks real-time DAG visualization
2. **`step:completed`** **events for empty maps never broadcast** - breaks observability for edge cases

Additionally, existing integration tests were too weak to catch these bugs. This PR adds comprehensive test coverage using new event matchers that verify exact event sequences, payloads, and counts.

## Root Cause: PostgreSQL CTE Optimization

PostgreSQL's query optimizer **does not execute unreferenced CTEs with SELECT statements** - even when those SELECTs call functions with side effects like `realtime.send()`.

**Critical distinction:**

- ✅ **CTEs with INSERT/UPDATE/DELETE** - ALWAYS executed (side effects assumed)
- ❌ **CTEs with SELECT only** - SKIPPED if unreferenced (no side effects assumed)

In `pkgs/core/schemas/0100_function_start_ready_steps.sql`:

```sql
-- This CTE is NEVER executed (unreferenced SELECT)
broadcast_events AS (
  SELECT realtime.send(...)  -- ❌ Never runs!
  FROM started_step_states
),

-- This CTE is ALSO never executed (unreferenced SELECT)
broadcast_empty_completed AS (
  SELECT realtime.send(...)  -- ❌ Never runs!
  FROM completed_empty_steps
),

-- Only this INSERT executes
INSERT INTO pgflow.step_tasks (...)
SELECT ... FROM sent_messages;  -- ✅ Runs, but CTEs above don't
```

## Bugs Discovered

### Bug 1: `step:started` Events Never Broadcast

**Impact:**

- Clients never receive `step:started` events during flow execution
- DAG visualizations only update when steps complete (not when they begin)
- Active step tracking (`flowState.activeStep`) never updates
- WebSocket inspection shows only `step:*:completed` events

**Affected Code:** `pkgs/core/schemas/0100_function_start_ready_steps.sql:144-162`

The `broadcast_events` CTE that sends `step:started` messages is unreferenced and never executes.

### Bug 2: Empty Map `step:completed` Events Never Broadcast

**Impact:**

- Empty map steps (arrays with length 0) complete silently with no broadcast
- Clients miss completion events for edge case flows
- Breaks observability for map steps with empty input arrays

**Affected Code:** `pkgs/core/schemas/0100_function_start_ready_steps.sql:45-64`

The `broadcast_empty_completed` CTE is unreferenced and never executes.

## Historical Context

This is the **same bug pattern** that affected `step:failed` events, fixed in commit `2b1ea777` (June 19, 2025):

> "fix: address bug where step:failed event was not broadcast due to CTE optimization"

**Related commits:**

- `2b1ea777` - Fixed `step:failed` broadcast with PERFORM statements
- `220c8672` - PR #161: "Fix step:failed events not being broadcast"
- `ab17a0c5` - Ensured step:failed events are broadcast
- `fe18250a` - Addressed broadcasting bug due to CTE optimization

The fix for `step:failed` used explicit `PERFORM` statements to force execution. However, this pattern was **never applied** to `step:started` or empty map broadcasts.

## Why Tests Didn't Catch This

The existing integration test in `pkgs/client/__tests__/integration/real-flow-execution.test.ts` only verified that **some** events were received:

```typescript
// OLD TEST (too weak)
let stepEventCount = 0;
step.on('*', () => { stepEventCount++; });
// ...
expect(stepEventCount).toBeGreaterThan(0);  // ❌ Passes with only 1 event!
```

This passes even if only `step:completed` is broadcast (count = 1), without checking for `step:started`.

## Changes in This PR

### 1\. Test Infrastructure Improvements

Added comprehensive event matchers to `pkgs/client/__tests__/helpers/test-utils.ts`:

- `toHaveReceivedEvent(type, payload?)` - Verify specific event was received with optional payload matching
- `toNotHaveReceivedEvent(type)` - Verify event was NOT received
- `toHaveReceivedEventCount(type, count)` - Verify exact event count
- `toHaveReceivedTotalEvents(count)` - Verify total event count
- `toHaveReceivedEventSequence(types[])` - Verify exact event sequence
- `toHaveReceivedInOrder(type1, type2)` - Verify ordering of two events

These matchers enable precise verification of broadcast behavior.

### 2\. Client Unit Test Coverage (100+ New Test Cases)

Enhanced `pkgs/client/__tests__/FlowRun.test.ts` and `FlowStep.test.ts` with:

**FlowRun Event Tests:**

- ✅ Comprehensive payload validation for `run:started`, `run:completed`, `run:failed`
- ✅ Event lifecycle verification (started → completed, started → failed)
- ✅ Duplicate event rejection (same status transitions)
- ✅ Foreign-run event protection
- ✅ `waitForStatus(Failed)` with timeout/abort support

**FlowStep Event Tests:**

- ✅ Comprehensive payload validation for all step event types
- ✅ Event sequence verification (started → completed, started → failed)
- ✅ Empty map edge case (completed ONLY, no started)
- ✅ Duplicate event rejection
- ✅ Foreign-step event protection
- ✅ `waitForStatus(Started)` edge cases for empty maps

**Key improvements:**

- All event assertions now use event matchers (not just count checks)
- Payload validation ensures correct data in events
- Event sequence verification catches ordering bugs
- Edge case coverage for empty maps and error conditions

### 3\. Integration Test Coverage (4 New Tests)

Added critical integration tests to `pkgs/client/__tests__/integration/real-flow-execution.test.ts`:

#### Test 1: **CRITICAL -** **`step:started`** **Broadcast Verification** (CURRENTLY FAILING)

```typescript
it('CRITICAL: broadcasts step:started events (CTE optimization bug check)', ...)
```

This test **WILL FAIL** until Bug #1 is fixed. It specifically verifies:

- `step:started` event is broadcast when `start_ready_steps()` executes
- Event payload contains correct `run_id`, `step_slug`, `status`
- Event sequence is `['step:started', 'step:completed']`
- Both events received exactly once

**Why it fails:** The `broadcast_events` CTE is unreferenced and never executes.

#### Test 2: Empty Map Steps Edge Case

```typescript
it('empty map steps: skip step:started and go straight to step:completed', ...)
```

Verifies the **expected behavior** for empty map steps:

- NO `step:started` event (empty maps skip started state)
- Only `step:completed` event is broadcast
- Event payload has correct status and empty array output

**Note:** This test will ALSO FAIL due to Bug #2 (`broadcast_empty_completed` unreferenced).

#### Test 3: Enhanced Event Verification

```typescript
it('receives broadcast events during flow execution', ...)
```

Updated to use event matchers instead of weak count checks:

- Verifies exact event types (`run:completed`, `step:completed`)
- Validates event payloads (run_id, flow_slug, status, output)
- Counts events to ensure no duplicates

#### Test 4: `waitForStatus(Started)` Behavior

```typescript
it('waitForStatus(Started): waits for step to reach Started status', ...)
```

Verifies that:

- Root steps are started immediately by `start_flow()`
- `waitForStatus(Started)` resolves immediately if already started
- Step has `started_at` timestamp when in Started status

### 4\. Documentation Updates

Updated `.claude/skills/pgtap-testing/SKILL.md` description to be more comprehensive:

- Added all common phrasings users might use to trigger testing skill
- Emphasized realtime event testing patterns
- Made skill activation more reliable

## Failing Tests (To Be Fixed in Follow-up Commit)

The following tests are **EXPECTED TO FAIL** until the SQL bug is fixed:

1. ❌ `CRITICAL: broadcasts step:started events` - Fails because `broadcast_events` CTE never executes
2. ❌ `empty map steps: skip step:started and go straight to step:completed` - Fails because `broadcast_empty_completed` CTE never executes

## Solution (Not Included in This PR)

The fix will use the same pattern as the `step:failed` fix from commit `2b1ea777`:

**Option 1: Reference CTEs to force execution**

```sql
broadcast_events AS (
  SELECT realtime.send(...) as broadcast_result
  FROM started_step_states
)

INSERT INTO pgflow.step_tasks (...)
SELECT ... FROM sent_messages
-- Force CTE execution by referencing it
WHERE EXISTS (SELECT 1 FROM broadcast_events WHERE false);
```

The `WHERE EXISTS (...WHERE false)` ensures:

- CTE **must be evaluated** (referenced)
- **Zero performance impact** (WHERE false = no filtering)
- **No change to INSERT behavior**

**Option 2: Use PERFORM statements (like step:failed fix)**

```sql
-- Move broadcasts out of CTE
PERFORM realtime.send(...)
FROM started_step_states;

-- Then do INSERT
INSERT INTO pgflow.step_tasks ...
```

## Testing Strategy

This PR follows a **test-first approach**:

1. ✅ **Add comprehensive test infrastructure** (event matchers)
2. ✅ **Add failing tests that document expected behavior**
3. ⬜ **Fix SQL bugs** (follow-up commit)
4. ⬜ **Verify all tests pass**

## Impact Assessment

### Before This PR

- **Test Coverage:** Weak event counting (any event = pass)
- **Bug Detection:** Failed to catch missing broadcasts
- **Observability:** Client applications missing critical events
- **Real-time UX:** DAG visualizations only update on completion

### After This PR (Tests Only)

- **Test Coverage:** 100+ new test cases with event matchers
- **Bug Detection:** Failing tests document exact bugs
- **Documentation:** Clear understanding of event lifecycles
- **Regression Prevention:** Future CTE bugs will be caught

### After SQL Fix (Follow-up)

- **All Tests Pass:** Green CI with comprehensive coverage
- **Full Observability:** All broadcast events working correctly
- **Real-time UX:** DAG visualizations update during execution
- **Production Ready:** Robust event system with test coverage

## Files Changed

### Test Infrastructure

- `pkgs/client/__tests__/helpers/test-utils.ts` - Event matchers already exist (no changes)

### Unit Tests (Enhanced)

- `pkgs/client/__tests__/FlowRun.test.ts` - Added lifecycle, payload, edge case tests
- `pkgs/client/__tests__/FlowStep.test.ts` - Added lifecycle, payload, empty map tests

### Integration Tests (New)

- `pkgs/client/__tests__/integration/real-flow-execution.test.ts` - Added 4 new tests

### Documentation

- `.claude/skills/pgtap-testing/SKILL.md` - Enhanced skill description

### SQL (Bug Location - Not Fixed in This PR)

- `pkgs/core/schemas/0100_function_start_ready_steps.sql` - Contains both bugs (lines 45-64, 144-162)

## Next Steps

1. **Merge this PR** - Establishes test coverage and failing tests
2. **Fix SQL bugs** - Apply CTE reference pattern or PERFORM statements
3. **Verify tests pass** - All new tests should turn green
4. **Audit other SQL functions** - Check for similar unreferenced CTE patterns
5. **Consider linting rule** - Detect unreferenced CTEs with function calls

## Checklist

- [x] Added comprehensive event matchers to test utils
- [x] Enhanced FlowRun unit tests with event verification
- [x] Enhanced FlowStep unit tests with event verification
- [x] Added failing integration test for `step:started` broadcast
- [x] Added failing integration test for empty map broadcast
- [x] Updated documentation (skill descriptions)
- [x] Documented root cause and historical context
- [ ] Fixed SQL bugs (follow-up commit)
- [ ] All tests passing (after SQL fix)

## Related Issues

This PR addresses the same class of bug as:

- Commit `2b1ea777` - `step:failed` broadcast fix
- PR #161 - "Fix step:failed events not being broadcast"

## Breaking Changes

None. This PR only adds tests and documentation.

## Migration Required

None. SQL changes will come in follow-up commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants