df.signal not propagated into child sub-orchestrations (breaks wait_for_signal inside df.race) #150

@pinodeca

Summary

When df.wait_for_signal is wrapped inside a df.race branch, an external signal sent via df.signal(instance_id, name, data) is recorded on the parent instance but never propagated to the child sub-orchestration that hosts the waiter. The signal branch therefore never completes on the signal and loses the race to any sibling that completes first (timeout, sleep, sql), which breaks the canonical "approve / reject within N seconds" pattern.

Originally filed against v0.1.1 during a bug bash; confirmed still present on HEAD of main (v0.2.1 development).

Repro

-- Race: signal branch (waits up to 20s) vs sleep branch (15s)
SELECT df.start_workflow(
  df.race(
    df.seq(df.wait_for_signal('approve', 20), df.sql('INSERT INTO audit VALUES (...);')),
    df.seq(df.sleep(15),                       df.sql('INSERT INTO audit VALUES (...);'))
  ),
  'some-instance-id'
);

-- Send signal 4s later
SELECT df.signal('some-instance-id', 'approve', '{}');

Expected: signal branch wins at T+4s, audit row written.
Actual: sleep branch wins at T+15s; the signal branch is then cancelled because the parent dropped its sub-orchestration future (reason=dropped_future in the history below).
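
The timing behind Expected vs Actual can be checked with a toy model. This is only a sketch, not duroxide code: `race_outcome` and `Winner` are hypothetical names, and the model assumes the signal branch finishes either when the signal reaches its waiter or when its own timeout expires.

```rust
/// Which branch of the two-way race finishes first, and at what time
/// (seconds after the workflow starts).
#[derive(Debug, PartialEq)]
enum Winner {
    Signal(u32),
    Sleep(u32),
}

/// Toy model of the repro: a signal branch with a timeout racing a sleep branch.
/// `delivered_at` is the time the signal reaches the waiter, or `None` if it
/// never does (the behaviour reported in this issue).
fn race_outcome(delivered_at: Option<u32>, signal_timeout: u32, sleep_secs: u32) -> Winner {
    // The signal branch finishes on delivery, or at its timeout if nothing
    // arrives before then.
    let signal_done = match delivered_at {
        Some(t) if t <= signal_timeout => t,
        _ => signal_timeout,
    };
    // Ties go to the signal branch; that is an arbitrary choice in this model.
    if signal_done <= sleep_secs {
        Winner::Signal(signal_done)
    } else {
        Winner::Sleep(sleep_secs)
    }
}
```

With the repro's numbers, `race_outcome(Some(4), 20, 15)` yields `Winner::Signal(4)` (expected), while `race_outcome(None, 20, 15)` yields `Winner::Sleep(15)` (actual, because the signal never reaches the child).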

Control (works correctly)

df.seq(df.wait_for_signal('go', 60), df.sql(...))

The wait_for_signal mechanism itself is fine; only the race / sub-orchestration routing path is broken.

Evidence (duroxide.history of failing instance)

event  time     detail
6               SubOrchestrationScheduled (signal branch as sub::6)
7               SubOrchestrationScheduled (sleep branch as sub::7)
8      T+4s     ExternalEvent name='approve' data='{}' on the parent
9      T+15s    SubOrchestrationCompleted source=7 (sleep wins)
11              SubOrchestrationCancelRequested source=6, reason=dropped_future
12              SubOrchestrationFailed kind=Cancelled

The parent's ExternalEvent is never routed to sub::6 where wait_for_signal('approve', 20) is blocked.

Root cause

Signal routing does not traverse the parent → child sub-orchestration boundary.

Impact

Breaks the canonical "approve / reject within N seconds" workflow. Any composition that places wait_for_signal inside race (or any other operator that lowers branches into sub-orchestrations) is silently broken. Workarounds (hoisting wait_for_signal to the top level and emulating race via timers) defeat the purpose of df.race.

Suggested fix

When df.signal writes an ExternalEvent for instance I, fan the event out to all live descendant sub-orchestration sessions of I whose pending WaitForSignal node has a matching signal_name. Candidate locations:

  • raise_external_event in src/client.rs — query duroxide for live children + matching waiters, raise to each.
  • Or push the fan-out into the duroxide-pg-opt provider where parent/child relationships are already tracked.
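
The fan-out described above could look roughly like the following. This is a minimal, in-memory sketch under stated assumptions: `Instance`, its fields, and `fan_out_targets` are hypothetical and do not reflect duroxide's or duroxide-pg-opt's real schema or API. It only shows the selection step (which descendants should receive the event), not how the event is raised.

```rust
use std::collections::HashMap;

/// Hypothetical view of one orchestration instance; not duroxide's real schema.
struct Instance {
    /// Id of the parent instance, or `None` for a top-level instance.
    parent: Option<String>,
    /// False once the instance has completed, failed, or been cancelled.
    live: bool,
    /// Name of the signal this instance is currently blocked on, if any.
    waiting_for: Option<String>,
}

/// Returns the ids of all live descendants of `root` that have a pending
/// waiter for `signal_name`. Assumes parent links form a tree (no cycles).
fn fan_out_targets(
    instances: &HashMap<String, Instance>,
    root: &str,
    signal_name: &str,
) -> Vec<String> {
    let mut targets = Vec::new();
    let mut stack = vec![root.to_string()];
    while let Some(id) = stack.pop() {
        for (child_id, child) in instances {
            // Only follow live children of the instance being visited.
            if child.parent.as_deref() != Some(id.as_str()) || !child.live {
                continue;
            }
            if child.waiting_for.as_deref() == Some(signal_name) {
                targets.push(child_id.clone());
            }
            // Descend further: waiters may sit several levels below `root`.
            stack.push(child_id.clone());
        }
    }
    targets.sort(); // deterministic order for callers and tests
    targets
}
```

In the failing instance, a lookup for 'approve' starting at the parent would select sub::6 and skip sub::7. The linear scan per node is only for brevity; a provider that already tracks parent/child relationships could presumably answer the same question with an indexed query.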

Test coverage gap

Signal tests and race tests exist independently, but no E2E test places wait_for_signal inside a race branch:

  • Signal-only: tests/e2e/sql/07_signals.sql, tests/e2e/sql/11_cross_connection.sql
  • Race-only: tests/e2e/sql/01_core_primitives.sql, tests/e2e/sql/19_vars_in_join_race.sql, tests/e2e/sql/22_break_in_join_race.sql

A new E2E exercising the repro above should land alongside the fix.
