Summary
df.wait_for_completion(instance_id text, timeout_seconds int default 30) -> text is a synchronous polling helper that returns the child's terminal status as plain text (completed / failed / cancelled), not a Durofut / future envelope.
Its own doc comment in src/dsl.rs describes it as "a helper function for pg_regress tests to simplify polling logic" — yet it lives in the df schema alongside the composable DSL primitives, has a name that strongly suggests a deferred wait, and is the only thing in the public surface that mentions waiting on another instance.
When a user (reasonably) drops it into a composer:
df.seq(
    df.wait_for_completion(<child_id>, 30),
    df.sql($$INSERT INTO audit(who) VALUES ('parent-post')$$)
)
…the text it returns at compose time is baked into the graph as the SQL body of a SQL node. The worker then tries to execute completed (or failed) as SQL and the parent dies with:
SQL execution failed: error returned from database: syntax error at or near "completed"
SQL execution failed: error returned from database: syntax error at or near "failed"
This is the same class of bug as #149 (df.setvar/unsetvar/clearvars returning 'OK'), but worse because the injected text is dynamic runtime status rather than a constant, and there is no other documented way to wait on a child instance.
Standalone, the function also silently blocks the calling backend for up to timeout_seconds (std::thread::sleep loop in src/dsl.rs). If the session disconnects mid-wait, nothing resumes — the "wait" is not durable.
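To make the blocking behaviour concrete, here is a minimal sketch of a sleep-based polling loop of the kind described above. It is not the code in src/dsl.rs: the `Status` enum, the `read_status` callback, the `interval` parameter, and the `None`-on-timeout return are all assumptions made for illustration.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Instance states used by this sketch (names are illustrative assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Running,
    Completed,
    Failed,
    Cancelled,
}

/// Polls `read_status` until it reports a non-running state or `timeout` elapses.
/// The calling thread is held for the whole wait and nothing is persisted,
/// so if the caller goes away mid-wait there is nothing left to resume.
pub fn poll_status_blocking(
    mut read_status: impl FnMut() -> Status,
    timeout: Duration,
    interval: Duration,
) -> Option<Status> {
    let deadline = Instant::now() + timeout;
    loop {
        let status = read_status();
        if status != Status::Running {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None; // timed out while the instance was still running
        }
        thread::sleep(interval);
    }
}
```

The only state of the wait is the loop's local `deadline`, which lives on the caller's stack. That is why such a wait cannot survive a session disconnect.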
Reproduction
CREATE EXTENSION pg_durable;
CREATE TABLE bb_audit (id bigserial PRIMARY KEY, who text, ts timestamptz default now());
CREATE TABLE bb_ids (name text PRIMARY KEY, instance_id text NOT NULL);

-- Child completes quickly
INSERT INTO bb_ids(name, instance_id)
SELECT 'child', df.start(
    df.seq(
        df.sql($$INSERT INTO bb_audit(who) VALUES ('child-step1')$$),
        df.sql($$INSERT INTO bb_audit(who) VALUES ('child-step2')$$)
    ), 'demo_child');
SELECT df.run(); SELECT pg_sleep(1.5);

-- Parent "waits" on already-completed child
INSERT INTO bb_ids(name, instance_id)
SELECT 'parent', df.start(
    df.seq(
        df.sql($$INSERT INTO bb_audit(who) VALUES ('parent-pre')$$),
        df.seq(
            df.wait_for_completion((SELECT instance_id FROM bb_ids WHERE name='child'), 30),
            df.sql($$INSERT INTO bb_audit(who) VALUES ('parent-post')$$)
        )
    ), 'demo_parent');
SELECT df.run(); SELECT pg_sleep(2);

SELECT execution_id, node_id, node_type, query, status, result
FROM df.instance_nodes((SELECT instance_id FROM bb_ids WHERE name='parent'), 5)
ORDER BY node_id;
Observed: parent fails; one SQL node has query = 'completed'; parent-post never runs.
Why this matters
- Footgun by name & location. A function literally called wait_for_completion in the df schema will be the first thing users reach for when looking for parent-waits-for-child.
- No HINT. The resulting syntax error at or near "completed" gives zero clue that the root cause is a function-shape mismatch.
- Standalone use is also surprising. Holding a backend in a thread::sleep loop defeats the whole point of a durable-function library, and a session disconnect silently abandons the wait.
Suggested actions
Pick one of these for the function itself:
- (A) Change df.wait_for_completion to return a real future envelope (e.g. {"node_type":"WAIT_FOR_INSTANCE","query":"<child_id>","timeout":30}) so the engine can suspend the parent durably and resume on child completion. Strongly preferred: it makes the name truthful and unblocks the composition use case.
- (B) Rename the current implementation to something like df.poll_status_blocking(instance_id, timeout_seconds), move it out of the composable DSL surface (or clearly document it as a session-blocking test/app helper), and stop granting it as part of the normal DSL ACL.
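As a rough illustration of option (A), the sketch below models a wait node that the engine re-evaluates from persisted state instead of sleeping in a backend. Everything here is hypothetical: `WaitForInstance`, `WaitStep`, `step`, and the "running" status string are invented for the example, and only the envelope shape is taken from the suggestion above.

```rust
/// Hypothetical node stored in the parent's graph for option (A).
#[derive(Debug, Clone, PartialEq)]
pub struct WaitForInstance {
    pub child_id: String,
    pub timeout_seconds: u64,
}

impl WaitForInstance {
    /// Renders the envelope shape suggested in option (A).
    /// Note: `child_id` is not JSON-escaped in this sketch.
    pub fn to_envelope(&self) -> String {
        format!(
            r#"{{"node_type":"WAIT_FOR_INSTANCE","query":"{}","timeout":{}}}"#,
            self.child_id, self.timeout_seconds
        )
    }
}

/// What the engine should do with the parent after one evaluation of the node.
#[derive(Debug, PartialEq)]
pub enum WaitStep {
    Suspend,        // child still running: park the parent, hold no session
    Resume(String), // child reached a terminal status: continue the parent
    TimedOut,       // child still running after the node's timeout
}

/// Decides the next step from persisted facts only (the child's status and
/// the time already waited), so any worker can re-run it after a restart.
pub fn step(node: &WaitForInstance, child_status: &str, waited_seconds: u64) -> WaitStep {
    match child_status {
        "completed" | "failed" | "cancelled" => WaitStep::Resume(child_status.to_string()),
        _ if waited_seconds >= node.timeout_seconds => WaitStep::TimedOut,
        _ => WaitStep::Suspend,
    }
}
```

Because `step` holds no in-memory state between calls, the wait survives a session disconnect or worker restart. The blocking helper lacks that property.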
Independently, add the composer-level argument-shape check proposed in #149 so any non-future argument to df.seq / df.join / df.race / etc. produces:
ERROR: argument N to df.seq is plain text 'completed', not a future envelope —
HINT: functions that return text cannot be composed; did you mean df.sql(...)
or a wait primitive?
That single check resolves this bug, #149, and any future siblings.
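A minimal sketch of such an argument-shape check follows. It assumes a future envelope is a JSON object carrying a "node_type" key, as in the example under option (A). The function names `looks_like_envelope` and `check_composer_arg` are hypothetical, and a real implementation would presumably parse the JSON properly instead of using this string test.

```rust
/// Cheap shape test: true when `arg` looks like a JSON object with a
/// "node_type" key. This is an assumption about the envelope format,
/// not a JSON parser.
fn looks_like_envelope(arg: &str) -> bool {
    let t = arg.trim();
    t.starts_with('{') && t.ends_with('}') && t.contains("\"node_type\"")
}

/// Rejects a composer argument that is not a future envelope, producing the
/// ERROR/HINT text proposed in this issue. `position` is the 1-based
/// argument index and `composer` is a name such as "df.seq".
pub fn check_composer_arg(composer: &str, position: usize, arg: &str) -> Result<(), String> {
    if looks_like_envelope(arg) {
        return Ok(());
    }
    Err(format!(
        "ERROR: argument {} to {} is plain text '{}', not a future envelope\n\
         HINT: functions that return text cannot be composed; \
         did you mean df.sql(...) or a wait primitive?",
        position, composer, arg
    ))
}
```

With a check like this, `check_composer_arg("df.seq", 1, "completed")` fails at compose time with a message that names the offending argument, so the worker never tries to execute the status text as SQL.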
Environment
- Originally reported on Azure HorizonDB cluster bertanari20260513hdb32v (PG 17.9, HorizonDB build c8e7b717d05) with pg_durable 0.1.1 during HorizonDB Hackathon Bug Bash Tier-2 (T2.3 / AI-04-D).
- Confirmed still present on main (v0.2.1): df.wait_for_completion is unchanged in src/dsl.rs and still in the granted DSL surface (src/lib.rs).
Related
- #149 (df.setvar/unsetvar/clearvars return 'OK' instead of futures). A composer-level argument-shape check fixes both.
- df.signal not propagated into child sub-orchestrations. Adjacent area (parent/child instance coordination) but different mechanism.