df.wait_for_completion returns plain text and silently corrupts composers (df.seq, df.join, …) #151

@pinodeca

Summary

df.wait_for_completion(instance_id text, timeout_seconds int default 30) -> text is a synchronous polling helper that returns the child's terminal status as plain text (completed / failed / cancelled), not a Durofut / future envelope.

Its own doc comment in src/dsl.rs describes it as "a helper function for pg_regress tests to simplify polling logic" — yet it lives in the df schema alongside the composable DSL primitives, has a name that strongly suggests a deferred wait, and is the only thing in the public surface that mentions waiting on another instance.

When a user (reasonably) drops it into a composer:

df.seq(
  df.wait_for_completion(<child_id>, 30),
  df.sql($$INSERT INTO audit(who) VALUES ('parent-post')$$)
)

…the text it returns at compose time is baked into the graph as the SQL body of a SQL node. The worker then tries to execute completed (or failed) as SQL and the parent dies with:

SQL execution failed: error returned from database: syntax error at or near "completed"
SQL execution failed: error returned from database: syntax error at or near "failed"

This is the same class of bug as #149 (df.setvar/unsetvar/clearvars returning 'OK'), but worse because the injected text is dynamic runtime status rather than a constant, and there is no other documented way to wait on a child instance.
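To make the failure mode concrete, here is a toy model of the compose-time evaluation order. It is not the pg_durable implementation: the Node enum, the "SQL" node_type value, and the rule "text that does not start with { is treated as a SQL body" are all assumptions made for illustration. The point it shows is that the helper runs eagerly, before the composer ever sees it, so the composer only receives the string it returned.

```rust
// Toy model only: names and the coercion rule are hypothetical, not pg_durable's API.
#[derive(Debug, PartialEq)]
enum Node {
    /// Argument that already looks like a future envelope (JSON object text).
    Envelope(String),
    /// Plain text, assumed here to be coerced into the body of a SQL node.
    Sql(String),
}

/// Stand-in for the blocking helper: by the time the composer is called,
/// all that is left of the "wait" is its text result.
fn wait_for_completion_stub() -> String {
    "completed".to_string()
}

/// Stand-in for df.sql: returns an envelope as text (no escaping, sketch only).
fn sql(body: &str) -> String {
    format!("{{\"node_type\":\"SQL\",\"query\":\"{}\"}}", body)
}

/// Stand-in for df.seq: arguments were evaluated before this function runs.
fn seq(args: &[String]) -> Vec<Node> {
    args.iter()
        .map(|a| {
            if a.trim_start().starts_with('{') {
                Node::Envelope(a.clone())
            } else {
                Node::Sql(a.clone())
            }
        })
        .collect()
}
```

Calling seq(&[wait_for_completion_stub(), sql("SELECT 1")]) in this model yields a first node of Node::Sql("completed"), which is the shape the worker later fails on.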

Standalone, the function also silently blocks the calling backend for up to timeout_seconds (std::thread::sleep loop in src/dsl.rs). If the session disconnects mid-wait, nothing resumes — the "wait" is not durable.
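For readers who have not opened src/dsl.rs, the blocking behaviour described above has roughly the following shape. This is a minimal std-only sketch, not the actual source: poll_until_terminal, the read_status callback, the poll interval, and returning None on timeout are all assumptions. What it illustrates is that the calling thread is parked in sleep for the whole wait and that no state survives if the caller goes away.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Terminal statuses named in this issue.
fn is_terminal(status: &str) -> bool {
    matches!(status, "completed" | "failed" | "cancelled")
}

/// Hypothetical polling helper. `read_status` stands in for however the
/// extension looks up an instance's status. Returns None on timeout
/// (an assumption; the real function's timeout behaviour may differ).
fn poll_until_terminal<F>(
    mut read_status: F,
    timeout: Duration,
    interval: Duration,
) -> Option<String>
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    loop {
        let status = read_status();
        if is_terminal(&status) {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None;
        }
        // The caller's thread (here: the backend) is blocked for this long.
        sleep(interval);
    }
}
```

Nothing in this loop is persisted, which is why a disconnect mid-wait leaves nothing to resume.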

Reproduction

CREATE EXTENSION pg_durable;
CREATE TABLE bb_audit (id bigserial PRIMARY KEY, who text, ts timestamptz default now());
CREATE TABLE bb_ids   (name text PRIMARY KEY, instance_id text NOT NULL);

-- Child completes quickly
INSERT INTO bb_ids(name, instance_id)
SELECT 'child', df.start(
  df.seq(
    df.sql($$INSERT INTO bb_audit(who) VALUES ('child-step1')$$),
    df.sql($$INSERT INTO bb_audit(who) VALUES ('child-step2')$$)
  ), 'demo_child');
SELECT df.run(); SELECT pg_sleep(1.5);

-- Parent "waits" on already-completed child
INSERT INTO bb_ids(name, instance_id)
SELECT 'parent', df.start(
  df.seq(
    df.sql($$INSERT INTO bb_audit(who) VALUES ('parent-pre')$$),
    df.seq(
      df.wait_for_completion((SELECT instance_id FROM bb_ids WHERE name='child'), 30),
      df.sql($$INSERT INTO bb_audit(who) VALUES ('parent-post')$$)
    )
  ), 'demo_parent');
SELECT df.run(); SELECT pg_sleep(2);

SELECT execution_id, node_id, node_type, query, status, result
FROM df.instance_nodes((SELECT instance_id FROM bb_ids WHERE name='parent'), 5)
ORDER BY node_id;

Observed: parent fails; one SQL node has query = 'completed'; parent-post never runs.

Why this matters

  1. Footgun by name & location. A function literally called wait_for_completion in the df schema will be the first thing users reach for when looking for parent-waits-for-child.
  2. No HINT. The resulting syntax error at or near "completed" gives zero clue that the root cause is a function-shape mismatch.
  3. Standalone use is also surprising. Holding a backend in a thread::sleep loop defeats the whole point of a durable-function library, and a session disconnect silently abandons the wait.

Suggested actions

Pick one of these for the function itself:

  • (A) Change df.wait_for_completion to return a real future envelope (e.g. {"node_type":"WAIT_FOR_INSTANCE","query":"<child_id>","timeout":30}) so the engine can suspend the parent durably and resume on child completion. Strongly preferred — it makes the name truthful and unblocks the composition use case.
  • (B) Rename the current implementation to something like df.poll_status_blocking(instance_id, timeout_seconds), move it out of the composable DSL surface (or clearly document it as a session-blocking test/app helper), and stop granting it as part of the normal DSL ACL.
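As a rough illustration of option (A), the compose-time function would only build the envelope text and return immediately, leaving the actual wait to the engine. The sketch below is std-only and hypothetical: wait_for_instance_envelope and json_escape are illustrative names, and the field layout simply copies the example envelope given in (A).

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical compose-time builder: no polling, no sleeping, just an
/// envelope describing the wait so the engine can suspend and resume.
fn wait_for_instance_envelope(instance_id: &str, timeout_seconds: i32) -> String {
    format!(
        "{{\"node_type\":\"WAIT_FOR_INSTANCE\",\"query\":\"{}\",\"timeout\":{}}}",
        json_escape(instance_id),
        timeout_seconds
    )
}
```

With this shape, the value passed into df.seq would be an envelope like any other node, and the timeout would become a property of the node rather than of the calling session.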

Independently, add the composer-level argument-shape check proposed in #149 so any non-future argument to df.seq / df.join / df.race / etc. produces:

ERROR: argument N to df.seq is plain text 'completed', not a future envelope
HINT:  functions that return text cannot be composed; did you mean df.sql(...)
       or a wait primitive?

That single check resolves this bug, #149, and any future siblings.
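A possible shape for that check, as a std-only sketch. It assumes envelopes are JSON object text carrying a "node_type" key (as in the option (A) example) and uses a deliberately crude string heuristic; a real implementation would presumably parse the JSON properly. check_composer_args and looks_like_envelope are hypothetical names.

```rust
/// Crude heuristic (assumption): an envelope is JSON object text that
/// contains a "node_type" key. A real check would parse the JSON.
fn looks_like_envelope(arg: &str) -> bool {
    let t = arg.trim();
    t.starts_with('{') && t.ends_with('}') && t.contains("\"node_type\"")
}

/// Hypothetical composer-level guard: reject the first argument that is
/// not envelope-shaped, reporting its 1-based position and the composer name.
fn check_composer_args(composer: &str, args: &[&str]) -> Result<(), String> {
    for (i, arg) in args.iter().enumerate() {
        if !looks_like_envelope(arg) {
            return Err(format!(
                "ERROR: argument {} to {} is plain text '{}', not a future envelope\n\
                 HINT:  functions that return text cannot be composed; \
                 did you mean df.sql(...) or a wait primitive?",
                i + 1,
                composer,
                arg
            ));
        }
    }
    Ok(())
}
```

Running this against the repro's inner df.seq arguments would fail at compose time on argument 1 ('completed') instead of at execution time inside the worker.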

Related

Environment

  • Originally reported on Azure HorizonDB cluster bertanari20260513hdb32v (PG 17.9, HorizonDB build c8e7b717d05) with pg_durable 0.1.1 during HorizonDB Hackathon Bug Bash Tier-2 (T2.3 / AI-04-D).
  • Confirmed still present on main (v0.2.1): df.wait_for_completion is unchanged in src/dsl.rs and still in the granted DSL surface (src/lib.rs).

Labels: bug
