Skip to content

SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348

Merged
mason-sharp merged 4 commits intomainfrom
task/SPOC-393-sync-event
Feb 27, 2026
Merged

SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348
mason-sharp merged 4 commits intomainfrom
task/SPOC-393-sync-event

Conversation

@rasifr
Copy link
Copy Markdown
Member

@rasifr rasifr commented Feb 17, 2026

Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.

This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.

The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds/changes spock.wait_for_sync_event procedures and public signature, advances replication origin for empty commits in apply logic, adds a regression test for sync_event/wait behavior, updates Makefile test targets, and adds installation/limitations documentation (including a duplicated subsection).

Changes

Cohort / File(s) Summary
Build configuration
Makefile
Added sync_event to the REGRESS test targets.
SQL: upgrade script (5.0.4→6.0.0-devel)
sql/spock--5.0.4--6.0.0-devel.sql
Drops prior overloads and adds two spock.wait_for_sync_event overloads (origin_id oid and origin name) with per-iteration subscription validation, progress checks via replication origin status, timeout handling, sleep/rollback loop, and updated DECLARE/DROP statements.
SQL: public signature & implementation (6.0.0-devel)
sql/spock--6.0.0-devel.sql
Changes exported spock.wait_for_sync_event parameter type from oid to name (internally resolves to origin_id), introduces clock-based timeout, per-iteration sub_enabled checks, slot/progress retrieval via pg_replication_origin_status, and delegates to internal implementation when resolving names.
Apply logic (C)
src/spock_apply.c
In handle_commit, added an else branch for empty transactions to call replorigin_session_advance(end_lsn, InvalidXLogRecPtr) when a replication origin is active, ensuring origin progress advances for empty commits.
Tests
tests/regress/sql/sync_event.sql
New regression test exercising spock.sync_event() and spock.wait_for_sync_event() (by node name and OID), verifies propagation, timeout and error cases, captures/uses LSNs, and performs cleanup.
Docs: installation notes
docs/install_spock.md
Adds guidance about replication-origin sizing across PostgreSQL versions (max_replication_slots vs max_active_replication_origins) and recommended sizing considerations.
Docs: limitations (duplicate content)
docs/limitations.md
Adds “Mixed Spock and Native Logical Replication” subsection describing ambiguity between Spock node_id and Postgres RepOriginId; the subsection appears twice and should be deduplicated.

Poem

🐇 I twitch my whiskers at the log,
I chase the LSN along the fog.
Empty hops now nudge the stream,
Tests confirm the waking dream.
Small rabbit cheers — sync on, bright!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and precisely describes the main change: switching wait_for_sync_event to use pg_replication_origin_status instead of the previous approach.
Description check ✅ Passed The description is clearly related to the changeset, explaining the motivation for the change and the technical solution involving pg_replication_origin_status and replorigin_session_advance.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch task/SPOC-393-sync-event

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
sql/spock--6.0.0-devel.sql (1)

444-508: Same timeout precision and NULL progress_lsn observations apply here as in the migration SQL.

This is the base install version of the same procedure. The recommended refactor to use clock_timestamp() for timeout tracking (instead of the incremental elapsed_time counter) applies here as well for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 444 - 508, The wait loop in
procedure spock.wait_for_sync_event uses an incremental elapsed_time counter and
compares progress_lsn directly to lsn without handling NULL; change timeout
handling to use clock_timestamp() by recording a start_ts := clock_timestamp()
before the loop and replace elapsed_time checks with clock_timestamp() -
start_ts >= (timeout || ' seconds')::interval, and guard the progress_lsn vs lsn
comparison to handle NULL (e.g. only compare when progress_lsn IS NOT NULL,
otherwise treat as not reached). Update references in the procedure to use
start_ts, progress_lsn and timeout interval comparison and leave the pg_sleep
loop otherwise unchanged.
sql/spock--5.0.4--6.0.0-devel.sql (2)

236-238: No row in pg_replication_origin_status silently leaves progress_lsn as NULL.

If the apply worker hasn't started yet or the origin doesn't exist in pg_replication_origin_status, progress_lsn stays NULL and the loop spins until timeout. This is likely acceptable, but a debug RAISE NOTICE on the first miss (or a NOT FOUND check) would help users diagnose why wait_for_sync_event is timing out.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 236 - 238, The SELECT INTO
from pg_replication_origin_status can leave progress_lsn NULL when no row
exists; update the wait_for_sync_event logic to detect that case by checking the
PL/pgSQL NOT FOUND condition (or testing progress_lsn IS NULL) immediately after
the SELECT and emit a single RAISE NOTICE on the first miss to aid debugging;
introduce a local boolean flag (e.g., first_miss) initialized true, RAISE NOTICE
including sub_slot and that pg_replication_origin_status had no row, then set
first_miss := false so subsequent loop iterations don’t spam notices, and
otherwise continue the existing timeout/loop behavior.

190-254: Timeout tracking uses loop-count-based estimation, not wall-clock time.

elapsed_time increments by 0.2 each iteration, but the actual time per iteration includes query execution overhead (the subscription check + the origin status query + ROLLBACK). Under load, the real elapsed time could significantly exceed the requested timeout. Consider using clock_timestamp() for accurate wall-clock tracking, similar to wait_for_apply_worker in the same extension.

♻️ Proposed fix for accurate timeout tracking
 DECLARE
 	target_id		oid;
-	elapsed_time	numeric := 0;
+	start_time		timestamptz := clock_timestamp();
 	progress_lsn	pg_lsn;
 	sub_is_enabled	bool;
 	sub_slot		name;
 ...
-		elapsed_time := elapsed_time + .2;
-		IF timeout <> 0 AND elapsed_time >= timeout THEN
+		IF timeout <> 0 AND EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN
 			result := false;
 			RETURN;
 		END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 190 - 254, The timeout logic
in spock.wait_for_sync_event is using a loop-count-based elapsed_time += 0.2
which underestimates real wall-clock time; change to record a start timestamp
(e.g., start_ts := clock_timestamp()) at the top of the procedure and replace
the elapsed_time numeric increments with a wall-clock check like extract(epoch
FROM (clock_timestamp() - start_ts)) to compare against timeout (seconds),
keeping the existing timeout semantics and the rest of the loop (subscription
checks, progress_lsn query, ROLLBACK, pg_sleep) unchanged; update or remove the
elapsed_time variable and ensure comparisons use the new timestamp-based elapsed
calculation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 236-238: The SELECT INTO from pg_replication_origin_status can
leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic
to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing
progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE
NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g.,
first_miss) initialized true, RAISE NOTICE including sub_slot and that
pg_replication_origin_status had no row, then set first_miss := false so
subsequent loop iterations don’t spam notices, and otherwise continue the
existing timeout/loop behavior.
- Around line 190-254: The timeout logic in spock.wait_for_sync_event is using a
loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time;
change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the
top of the procedure and replace the elapsed_time numeric increments with a
wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to
compare against timeout (seconds), keeping the existing timeout semantics and
the rest of the loop (subscription checks, progress_lsn query, ROLLBACK,
pg_sleep) unchanged; update or remove the elapsed_time variable and ensure
comparisons use the new timestamp-based elapsed calculation.

In `@sql/spock--6.0.0-devel.sql`:
- Around line 444-508: The wait loop in procedure spock.wait_for_sync_event uses
an incremental elapsed_time counter and compares progress_lsn directly to lsn
without handling NULL; change timeout handling to use clock_timestamp() by
recording a start_ts := clock_timestamp() before the loop and replace
elapsed_time checks with clock_timestamp() - start_ts >= (timeout || '
seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL
(e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not
reached). Update references in the procedure to use start_ts, progress_lsn and
timeout interval comparison and leave the pg_sleep loop otherwise unchanged.

Copy link
Copy Markdown
Contributor

@danolivo danolivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary to address the general issues mentioned before we allow this core integration to spread further throughout our code.

Comment thread src/spock_apply.c
if (replorigin_session_origin != InvalidRepOriginId &&
replorigin_session_origin != DoNotReplicateId)
{
replorigin_session_advance(end_lsn, InvalidXLogRecPtr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this machinery, Spock refers to the maximum number of replication states defined by the GUC max_active_replication_origins (which can be changed only at restart). We need to document it and provide recommendations for its use, as we do for other important parameters (see, for example, max_replication_slots).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is added in PG 18+. Added a note in docs.

Comment thread src/spock_apply.c
* pg_replication_origin_status.remote_lsn is updated.
*/
if (replorigin_session_origin != InvalidRepOriginId &&
replorigin_session_origin != DoNotReplicateId)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replorigin_session_origin refers to shared memory. The same does Postgres logical replication. The address it points to is based on an OriginID value. The originID spaces for pgoutput and Spock are separate, and generated values can accidentally overlap. In this case, we would have concurrent access to and rewriting of local_lsn and remote_lsn with values from entirely different address spaces.
It looks like to use replorigin_session_advance() correctly, you must guarantee that the originIDs of the two tools do not intersect, or disable pgoutput altogether.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very low probability of actually happening. Spock and pgoutput both use PostgreSQL's pg_replication_origin catalog—there's no separate origin ID space. They also use different naming prefixes (spk_* vs pg_*), so no chance of collision. PostgreSQL protects against concurrent access via ReplicationOriginLock. The replorigin_session_advance() call only executes within Spock's apply worker and advances the slot owned by that worker. Each subscription creates its own slot with a unique origin; there's no chance of collision.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rasifr , just look how the session_replication_state is initialised. It refers to the shared array element - replication_states[i]. The 'ownership' is identified by the originID value. The 'uniqueness' of the Origin ID is separate in Spock and Native LR.

Aside from the question of collision probability between the Spock and upstream algorithms, there is an obvious case in which the native LR will 'advance' the same shmem area as Spock, with totally unrelated data.

It seems quite a typical case when a user downloads some data via LR from a legacy cloud provider to a cutting-edge multimaster Spock configuration and experiences strange replication failures. I do strictly recommend at least reducing the chance to the bare minimum, and let us identify potential cases by using a specific range of origin IDs.

But generally, we should avoid referring to any native LR structures to avoid potential collisions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking deeper into this, I think I understand your concern now. You are pointing out how Spock replaces PG native origin id with Spock's node_id, right?

That's a valid concern. This is a limitation of the architecture. a design decision since 4.x (commit 3165452). I will create a ticket to track this. For now, I've documented the limitation in docs/limitations.md.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rasifr , commit, you're referring to is a good commit that have no correlation to the issue I pointed out.
The problem I want to draw your attention to is the unavoidable conflict between native LR and the Spock one, introduced recently with reference to replication_states[].

Just for clarity, if it is ok to have potential interference with native LR, there is no reason to generate subscription IDs separately - just reuse SubscriptionRelationId/GetNewOidWithIndex() and have standard subscription OIDs instead of dangerously unstable hash values, right? But we don't do it and manage subscriptions on our own. So, why do we allow the danger of inconsistent slot advancement?

Comment thread sql/spock--5.0.4--6.0.0-devel.sql Outdated
@rasifr rasifr force-pushed the task/SPOC-393-sync-event branch from 4ed0d10 to 563616f Compare February 24, 2026 10:39
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
sql/spock--5.0.4--6.0.0-devel.sql (1)

186-250: Same concerns as in spock--6.0.0-devel.sql apply here.

The procedure body is identical to the base extension file. The issues flagged there (replication origin external_id vs sub_slot assumption, timeout drift, NULL error message) apply equally to this migration copy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 186 - 250, The
spock.wait_for_sync_event procedure duplicates the same issues: it assumes
pg_replication_origin_status.external_id equals sub_slot, uses elapsed_time with
sleep causing timeout drift, and raises exceptions with NULL values; fix by (1)
resolving the correct origin lookup (match pg_replication_origin_status using
external_id = sub_slot only after validating sub_slot is NOT NULL and/or use the
appropriate origin identifier stored for the subscription), (2) replace the
elapsed_time accumulation with a monotonic timestamp check (record start :=
clock_timestamp() and compare clock_timestamp() - start >= make_interval(secs =>
timeout)), and (3) avoid including raw possibly-NULL identifiers in RAISE
EXCEPTION messages (coalesce or format the message to handle NULLs), updating
references in this procedure to sub_slot, progress_lsn, elapsed_time, timeout,
and the SELECT from pg_replication_origin_status accordingly.
🧹 Nitpick comments (1)
sql/spock--6.0.0-devel.sql (1)

498-505: Timeout tracks only sleep time, not wall-clock time.

elapsed_time accumulates only the fixed 0.2s sleep increments and ignores the time spent executing queries and the ROLLBACK. Under load, the actual wall-clock duration could significantly exceed the requested timeout. Note that the sibling function wait_for_apply_worker (Line 668) uses clock_timestamp() for accurate wall-clock tracking.

Use clock_timestamp() for accurate timeout
 DECLARE
 	target_id		oid;
-	elapsed_time	numeric := 0;
+	start_time		timestamptz := clock_timestamp();
 	progress_lsn	pg_lsn;
 	sub_is_enabled	bool;
 	sub_slot		name;
...
-		elapsed_time := elapsed_time + .2;
-		IF timeout <> 0 AND elapsed_time >= timeout THEN
+		IF timeout <> 0 AND
+		   EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN
 			result := false;
 			RETURN;
 		END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 498 - 505, The loop currently
increments elapsed_time by fixed 0.2 and thus tracks only sleep time; change the
timeout check to use wall-clock time via clock_timestamp(): record a start_ts
(e.g., start_ts := clock_timestamp()) before the loop and inside the loop
compute elapsed := EXTRACT(EPOCH FROM (clock_timestamp() - start_ts)) (or update
elapsed_time from that expression) and use that elapsed value for the IF timeout
<> 0 AND elapsed >= timeout THEN check so ROLLBACK and query execution time are
included (see the sibling function wait_for_apply_worker for the precise
approach).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 457-459: The exception message uses origin_id when origin_id IS
NULL so it always prints '<NULL>'; update the RAISE EXCEPTION to include a
useful identifier instead (for example use the name-based variable origin or
other contextual variable already computed earlier) or omit the % placeholder
and provide a clear static message; locate the block checking origin_id (the
variables origin_id and origin are in scope) and change the RAISE EXCEPTION to
reference origin (the node name) or add both origin_id and origin name to the
message so it no longer prints only '<NULL>'.

---

Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 186-250: The spock.wait_for_sync_event procedure duplicates the
same issues: it assumes pg_replication_origin_status.external_id equals
sub_slot, uses elapsed_time with sleep causing timeout drift, and raises
exceptions with NULL values; fix by (1) resolving the correct origin lookup
(match pg_replication_origin_status using external_id = sub_slot only after
validating sub_slot is NOT NULL and/or use the appropriate origin identifier
stored for the subscription), (2) replace the elapsed_time accumulation with a
monotonic timestamp check (record start := clock_timestamp() and compare
clock_timestamp() - start >= make_interval(secs => timeout)), and (3) avoid
including raw possibly-NULL identifiers in RAISE EXCEPTION messages (coalesce or
format the message to handle NULLs), updating references in this procedure to
sub_slot, progress_lsn, elapsed_time, timeout, and the SELECT from
pg_replication_origin_status accordingly.

---

Nitpick comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 498-505: The loop currently increments elapsed_time by fixed 0.2
and thus tracks only sleep time; change the timeout check to use wall-clock time
via clock_timestamp(): record a start_ts (e.g., start_ts := clock_timestamp())
before the loop and inside the loop compute elapsed := EXTRACT(EPOCH FROM
(clock_timestamp() - start_ts)) (or update elapsed_time from that expression)
and use that elapsed value for the IF timeout <> 0 AND elapsed >= timeout THEN
check so ROLLBACK and query execution time are included (see the sibling
function wait_for_apply_worker for the precise approach).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ed0d10 and 563616f.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/sync_event.out is excluded by !**/*.out
📒 Files selected for processing (6)
  • Makefile
  • docs/install_spock.md
  • sql/spock--5.0.4--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
  • src/spock_apply.c
  • tests/regress/sql/sync_event.sql
✅ Files skipped from review due to trivial changes (1)
  • docs/install_spock.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/regress/sql/sync_event.sql
  • src/spock_apply.c
  • Makefile

Comment thread sql/spock--6.0.0-devel.sql
Copy link
Copy Markdown
Contributor

@ibrarahmad ibrarahmad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @rasifr Please resolve @danolivo comments

@rasifr rasifr force-pushed the task/SPOC-393-sync-event branch from 563616f to cec39c9 Compare February 25, 2026 09:50
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
sql/spock--5.0.4--6.0.0-devel.sql (2)

241-244: ⚠️ Potential issue | 🟡 Minor

Same timeout IS NULL silent infinite-loop risk as in the fresh-install script.

The identical procedure body at Line 241 carries the same timeout <> 0 NULL-evaluation bug flagged in sql/spock--6.0.0-devel.sql Lines 499-503. Apply the same fix here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 241 - 244, The IF that checks
timeout currently uses "timeout <> 0" which evaluates to NULL when timeout is
NULL and silently skips the timeout branch; update the condition in the
procedure body that uses variables timeout, start_time, result and
clock_timestamp() to explicitly guard against NULL (e.g. use "timeout IS NOT
NULL AND timeout <> 0" or "timeout IS DISTINCT FROM 0") so the timeout check
behaves identically to the fix in spock--6.0.0-devel.sql; keep the rest of the
block (setting result := false; RETURN;) unchanged.

205-207: Same dead sub_is_enabled initial fetch as in the fresh-install script.

Applies identically here; see the optional refactor comment on sql/spock--6.0.0-devel.sql Lines 463-465.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 205 - 207, The SELECT that
fetches sub_enabled and sub_slot_name into variables sub_is_enabled and sub_slot
is redundant here (same dead fetch as in the fresh-install script); remove or
refactor it so you only read the subscription fields when actually needed—either
delete the SELECT entirely if sub_is_enabled isn't used later, or change it to
fetch only sub_slot_name INTO sub_slot (or perform the lookup lazily where
used). Adjust references to sub_is_enabled/sub_slot accordingly to match the
optional refactor used in sql/spock--6.0.0-devel.sql.
🧹 Nitpick comments (1)
sql/spock--6.0.0-devel.sql (1)

463-465: sub_is_enabled fetched initially but never used before it is overwritten.

The pre-loop SELECT sub_enabled, sub_slot_name INTO sub_is_enabled, sub_slot populates sub_is_enabled, but the loop unconditionally re-fetches sub_is_enabled at Line 474 before the value is ever read. The initial fetch of sub_is_enabled is dead — only sub_slot matters from that query. Dropping sub_is_enabled from the initial SELECT removes the misleading dead assignment.

♻️ Proposed refactor
-	SELECT sub_enabled, sub_slot_name INTO sub_is_enabled, sub_slot
-		FROM spock.subscription
-		WHERE sub_origin = origin_id AND sub_target = target_id;
+	SELECT sub_slot_name INTO sub_slot
+		FROM spock.subscription
+		WHERE sub_origin = origin_id AND sub_target = target_id;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 463 - 465, The initial SELECT into
sub_is_enabled and sub_slot from spock.subscription (SELECT sub_enabled,
sub_slot_name INTO sub_is_enabled, sub_slot WHERE sub_origin = origin_id AND
sub_target = target_id) assigns sub_is_enabled but it is overwritten later in
the loop; remove sub_is_enabled from this pre-loop SELECT so only sub_slot is
fetched (SELECT sub_slot_name INTO sub_slot ...), leaving the later in-loop
retrieval of sub_is_enabled intact; update the INTO list and selected columns
accordingly to avoid the dead assignment while keeping the later logic that
re-fetches sub_is_enabled unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/limitations.md`:
- Around line 170-174: The file contains a duplicated subsection titled "Mixed
Spock and Native Logical Replication"; locate the second (identical) occurrence
of that subsection and remove it so the document only contains a single
instance. Ensure you remove the entire duplicated block including the heading
and the recommendation paragraph, and keep the first original occurrence intact
to preserve the intended content.

In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-503: Replace the NULL-sensitive condition so explicit NULL
timeouts don't bypass the timeout check: change the clause that currently reads
"IF timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >=
timeout THEN" to use COALESCE(timeout, 0) (or explicitly check timeout IS NOT
NULL) — i.e., use "IF COALESCE(timeout, 0) <> 0 AND EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)) >= COALESCE(timeout, 0) THEN" (referencing the
timeout, start_time and result variables in this procedure) so NULL passed for
timeout is treated as 0 and the infinite loop is prevented.

---

Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 241-244: The IF that checks timeout currently uses "timeout <> 0"
which evaluates to NULL when timeout is NULL and silently skips the timeout
branch; update the condition in the procedure body that uses variables timeout,
start_time, result and clock_timestamp() to explicitly guard against NULL (e.g.
use "timeout IS NOT NULL AND timeout <> 0" or "timeout IS DISTINCT FROM 0") so
the timeout check behaves identically to the fix in spock--6.0.0-devel.sql; keep
the rest of the block (setting result := false; RETURN;) unchanged.
- Around line 205-207: The SELECT that fetches sub_enabled and sub_slot_name
into variables sub_is_enabled and sub_slot is redundant here (same dead fetch as
in the fresh-install script); remove or refactor it so you only read the
subscription fields when actually needed—either delete the SELECT entirely if
sub_is_enabled isn't used later, or change it to fetch only sub_slot_name INTO
sub_slot (or perform the lookup lazily where used). Adjust references to
sub_is_enabled/sub_slot accordingly to match the optional refactor used in
sql/spock--6.0.0-devel.sql.

---

Nitpick comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 463-465: The initial SELECT into sub_is_enabled and sub_slot from
spock.subscription (SELECT sub_enabled, sub_slot_name INTO sub_is_enabled,
sub_slot WHERE sub_origin = origin_id AND sub_target = target_id) assigns
sub_is_enabled but it is overwritten later in the loop; remove sub_is_enabled
from this pre-loop SELECT so only sub_slot is fetched (SELECT sub_slot_name INTO
sub_slot ...), leaving the later in-loop retrieval of sub_is_enabled intact;
update the INTO list and selected columns accordingly to avoid the dead
assignment while keeping the later logic that re-fetches sub_is_enabled
unchanged.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 563616f and cec39c9.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/sync_event.out is excluded by !**/*.out
📒 Files selected for processing (4)
  • docs/install_spock.md
  • docs/limitations.md
  • sql/spock--5.0.4--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/install_spock.md

Comment thread docs/limitations.md
Comment on lines +170 to +174
### Mixed Spock and Native Logical Replication

Spock uses its own 16-bit node_id (derived from node name hash) to track transaction origins in commit timestamps, rather than PostgreSQL's RepOriginId. Both ID spaces overlap (0-65535), which can cause ambiguity if Spock and native PostgreSQL logical replication run on the same database.

**Recommendation:** Avoid running Spock and native logical replication subscriptions on the same database.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Duplicate subsection — remove the second copy.

The AI-generated summary for this file explicitly notes that this "Mixed Spock and Native Logical Replication" subsection is inserted a second time later in the document with identical content. Please remove the duplicate occurrence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/limitations.md` around lines 170 - 174, The file contains a duplicated
subsection titled "Mixed Spock and Native Logical Replication"; locate the
second (identical) occurrence of that subsection and remove it so the document
only contains a single instance. Ensure you remove the entire duplicated block
including the heading and the recommendation paragraph, and keep the first
original occurrence intact to preserve the intended content.

Comment on lines +499 to 503
IF timeout <> 0 AND
EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
result := false;
RETURN;
END IF;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Silent infinite loop when timeout is explicitly passed as NULL.

timeout <> 0 evaluates to NULL (not TRUE) when timeout IS NULL, so the timeout branch is never taken and the procedure loops forever. The parameter has DEFAULT 0, so ordinary callers are unaffected, but a caller who explicitly passes NULL hits a deadlock-style hang with no error.

🛡️ Proposed fix
-		IF timeout <> 0 AND
-		   EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
+		IF (timeout IS NOT NULL AND timeout <> 0) AND
+		   EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
IF timeout <> 0 AND
EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
result := false;
RETURN;
END IF;
IF (timeout IS NOT NULL AND timeout <> 0) AND
EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
result := false;
RETURN;
END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 499 - 503, Replace the
NULL-sensitive condition so explicit NULL timeouts don't bypass the timeout
check: change the clause that currently reads "IF timeout <> 0 AND EXTRACT(EPOCH
FROM (clock_timestamp() - start_time)) >= timeout THEN" to use COALESCE(timeout,
0) (or explicitly check timeout IS NOT NULL) — i.e., use "IF COALESCE(timeout,
0) <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >=
COALESCE(timeout, 0) THEN" (referencing the timeout, start_time and result
variables in this procedure) so NULL passed for timeout is treated as 0 and the
infinite loop is prevented.

Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.

This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.

The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rasifr rasifr force-pushed the task/SPOC-393-sync-event branch from cec39c9 to ef65317 Compare February 25, 2026 10:40
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
sql/spock--6.0.0-devel.sql (1)

499-503: Silent infinite loop when timeout is explicitly passed as NULL.

timeout <> 0 evaluates to NULL (not TRUE) when timeout IS NULL, so the timeout branch is never taken and the procedure loops forever.

Proposed fix
-		IF timeout <> 0 AND
-		   EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
+		IF COALESCE(timeout, 0) <> 0 AND
+		   EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 499 - 503, The IF branch that checks
timeout currently uses "timeout <> 0" which yields NULL when timeout IS NULL and
prevents the timeout logic from running; update the condition in that IF (the
block referencing timeout, start_time, result and clock_timestamp()) to handle
NULL explicitly — e.g. use COALESCE(timeout, 0) <> 0 AND EXTRACT(...) >= timeout
or add an explicit "timeout IS NOT NULL AND timeout <> 0 AND ..." so a NULL
timeout no longer causes a silent infinite loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-503: The IF branch that checks timeout currently uses "timeout
<> 0" which yields NULL when timeout IS NULL and prevents the timeout logic from
running; update the condition in that IF (the block referencing timeout,
start_time, result and clock_timestamp()) to handle NULL explicitly — e.g. use
COALESCE(timeout, 0) <> 0 AND EXTRACT(...) >= timeout or add an explicit
"timeout IS NOT NULL AND timeout <> 0 AND ..." so a NULL timeout no longer
causes a silent infinite loop.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cec39c9 and ef65317.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/sync_event.out is excluded by !**/*.out
📒 Files selected for processing (7)
  • Makefile
  • docs/install_spock.md
  • docs/limitations.md
  • sql/spock--5.0.4--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
  • src/spock_apply.c
  • tests/regress/sql/sync_event.sql
🚧 Files skipped from review as they are similar to previous changes (4)
  • sql/spock--5.0.4--6.0.0-devel.sql
  • Makefile
  • tests/regress/sql/sync_event.sql
  • docs/limitations.md

rasifr and others added 3 commits February 25, 2026 16:33
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The two wait_for_sync_event() procedure overloads differed only in how
they received the origin parameter (name vs OID). Refactor to have the
name-based version resolve the node name to OID, then delegate to the
OID-based version, reducing code duplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add note about max_active_replication_origins for PostgreSQL 18+
  (separate GUC from max_replication_slots)
- Document limitation: avoid mixing Spock with native logical
  replication on the same database due to overlapping origin ID spaces

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rasifr rasifr force-pushed the task/SPOC-393-sync-event branch from ef65317 to 893c530 Compare February 25, 2026 11:34
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
sql/spock--6.0.0-devel.sql (1)

499-500: ⚠️ Potential issue | 🟡 Minor

Guard against explicit NULL timeout to avoid silent unbounded waits.

On Line 499, timeout <> 0 evaluates to NULL when timeout is explicitly NULL, so the timeout path is skipped forever.

🛡️ Proposed fix
 BEGIN
 	IF origin_id IS NULL THEN
 		RAISE EXCEPTION 'Invalid NULL origin_id';
 	END IF;
+	IF timeout IS NULL THEN
+		RAISE EXCEPTION 'Invalid NULL timeout';
+	END IF;
 	target_id := node_id FROM spock.node_info();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 499 - 500, The IF guard currently
uses "timeout <> 0" which yields NULL when timeout is explicitly NULL and
therefore skips the timeout check; update the condition to explicitly handle
NULL (e.g., check timeout IS NOT NULL AND timeout <> 0, or use COALESCE(timeout,
0) <> 0) so the branch executes only for a real non-zero timeout before
evaluating EXTRACT(EPOCH FROM (clock_timestamp() - start_time)); reference the
timeout variable and start_time/clock_timestamp() call in the IF condition to
locate and update the logic.
sql/spock--5.0.4--6.0.0-devel.sql (1)

241-243: ⚠️ Potential issue | 🟡 Minor

Keep upgrade path consistent: same nullable-timeout trap exists here.

Line 241 has the same timeout <> 0 NULL-sensitive behavior, so explicit NULL can bypass timeout handling in upgraded installations too.

🛡️ Proposed fix
 BEGIN
 	IF origin_id IS NULL THEN
 		RAISE EXCEPTION 'Invalid NULL origin_id';
 	END IF;
+	IF timeout IS NULL THEN
+		RAISE EXCEPTION 'Invalid NULL timeout';
+	END IF;
 	target_id := node_id FROM spock.node_info();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 241 - 243, The timeout check
is NULL-sensitive and can be bypassed if timeout is NULL; update the condition
in the block that currently reads "IF timeout <> 0 AND EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)) >= timeout THEN" to explicitly handle NULLs
(e.g., "IF timeout IS NOT NULL AND timeout <> 0 AND EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)) >= timeout THEN" or use COALESCE(timeout, 0)
in the comparison) so upgraded installations cannot bypass the timeout; modify
the conditional where variables timeout, start_time, and result are used
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 241-243: The timeout check is NULL-sensitive and can be bypassed
if timeout is NULL; update the condition in the block that currently reads "IF
timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout
THEN" to explicitly handle NULLs (e.g., "IF timeout IS NOT NULL AND timeout <> 0
AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN" or use
COALESCE(timeout, 0) in the comparison) so upgraded installations cannot bypass
the timeout; modify the conditional where variables timeout, start_time, and
result are used accordingly.

In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-500: The IF guard currently uses "timeout <> 0" which yields
NULL when timeout is explicitly NULL and therefore skips the timeout check;
update the condition to explicitly handle NULL (e.g., check timeout IS NOT NULL
AND timeout <> 0, or use COALESCE(timeout, 0) <> 0) so the branch executes only
for a real non-zero timeout before evaluating EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)); reference the timeout variable and
start_time/clock_timestamp() call in the IF condition to locate and update the
logic.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef65317 and 893c530.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/sync_event.out is excluded by !**/*.out
📒 Files selected for processing (6)
  • Makefile
  • docs/install_spock.md
  • docs/limitations.md
  • sql/spock--5.0.4--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
  • tests/regress/sql/sync_event.sql
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/regress/sql/sync_event.sql
  • Makefile
  • docs/install_spock.md

@mason-sharp mason-sharp merged commit 294bf5e into main Feb 27, 2026
10 checks passed
@mason-sharp mason-sharp deleted the task/SPOC-393-sync-event branch February 27, 2026 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants