Spoc 388: Z0DAN should fail quickly instead of waiting for the timeout (plpgsql only) #330
Conversation
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1844-1895: Update the DROP PROCEDURE statement for
spock.present_final_cluster_state to match the new signature that includes
src_dsn as the first parameter: change the old DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(integer, boolean) to DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(text, integer, boolean) so the cleanup matches
the revised procedure signature.
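In SQL terms, assuming the revised signature described above:

```sql
-- Old cleanup, which no longer matches the procedure:
-- DROP PROCEDURE IF EXISTS spock.present_final_cluster_state(integer, boolean);

-- New cleanup, with src_dsn (text) as the first parameter:
DROP PROCEDURE IF EXISTS spock.present_final_cluster_state(text, integer, boolean);
```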
In `@tests/tap/t/016_zodan_3n_timeout.pl`:
- Around line 43-78: Capture and assert the add_node call actually failed
instead of only checking elapsed time: wrap the CALL spock.add_node invocation
(where scalar_query is used) so you record any error/exception (e.g. via eval {
scalar_query(...) } and check $@ or by using a scalar_query variant that returns
status), then add an assertion that the call failed (e.g. ok(!$result || $@,
"...add_node failed as expected") and optionally assert the error message
contains "spock.sync_event" or "function ... does not exist"); keep the existing
timing check (ok($elapsed_time < 600,...)). Apply the same pattern to the other
add_node tests referenced (the blocks calling scalar_query for spock.add_node at
the other locations) and leave the cleanup calls (CALL spock.remove_node)
unchanged.
- Around line 80-117: The timer currently starts before the successful add_node
for n2 so $elapsed_time includes that operation; move the start_time = time()
assignment to immediately before the failing CALL spock.add_node for n3 (i.e.
right before the scalar_query(3, qq{ CALL spock.add_node ... }) invocation) so
the measured duration only covers the failing attempt, then compute
$elapsed_time and assert it is < 600 as before; remove or relocate the earlier
$start_time assignment placed just after renaming pg_replication_slot_advance to
avoid double-timing.
unfortunately needs a file rename and rebase
Procedure present_final_cluster_state should execute a COMMIT command to let the newly created subscriptions (from the working nodes to the new node) update their state. Also, the final cluster state should check all subscriptions across the cluster, not only on the new node.
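A hedged sketch of that idea: the procedure name is illustrative, the COMMIT placement is the point, and the single-node check stands in for the cluster-wide check (which would iterate nodes via dblink, as zodan.sql does elsewhere):

```sql
CREATE OR REPLACE PROCEDURE demo_present_final_cluster_state()
LANGUAGE plpgsql AS $$
DECLARE
    disabled_count int;
BEGIN
    -- Procedures (unlike functions) may COMMIT; this releases the current
    -- snapshot so state updates from newly created subscriptions are visible.
    COMMIT;
    SELECT count(*) INTO disabled_count
      FROM spock.subscription
     WHERE NOT sub_enabled;
    IF disabled_count > 0 THEN
        RAISE EXCEPTION '% subscription(s) not yet enabled', disabled_count;
    END IF;
END;
$$;
```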
- Change RAISE NOTICE to RAISE EXCEPTION in error handlers to properly propagate errors instead of silently continuing
- Move sub_name assignment to the correct location inside the subscription creation block where it is actually used
- Add new test 016_zodan_2n_timeout for 2-node timeout scenario

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change RAISE NOTICE to RAISE EXCEPTION in zodan.sql error handlers to ensure add_node fails immediately rather than continuing silently and eventually timing out. Affected operations:
- Replication lag monitoring
- Creating replication slots on other nodes
- Enabling subscriptions
- Creating subscriptions on other nodes
- Triggering and waiting for sync events
- Checking commit LSN for slot advancement
- Advancing replication slots
- Triggering sync_event on source node

Also remove unreachable code (CONTINUE/RAISE statements after RAISE EXCEPTION).

Add a comprehensive 3-node cluster test that verifies quick failure when:
- spock.sync_event() is missing
- pg_replication_slot_advance() is missing
- spock.sub_create() is missing

The test also verifies successful cluster formation with 3 nodes and 2 enabled subscriptions per node.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scalar_query() uses backticks, which don't throw Perl exceptions on error - they just return empty output and set $? with the exit code. Now properly check the psql exit code ($? >> 8) after each add_node call that is expected to fail, and assert a non-zero exit code to verify the operation actually failed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
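A minimal Perl sketch of that pattern; scalar_query and the connection index are the test's existing helpers, and the add_node arguments are elided placeholders:

```perl
# Time the call that is expected to fail fast, then check psql's exit
# status explicitly: backticks don't die on error, they only set $?.
my $start_time = time();
scalar_query(3, qq{ CALL spock.add_node(...) });   # placeholder arguments
my $exit_code = $? >> 8;                           # psql exit status
my $elapsed_time = time() - $start_time;
ok($exit_code != 0, "add_node failed as expected");
ok($elapsed_time < 600, "add_node failed before the timeout");
```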
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
samples/Z0DAN/zodan.sql (2)
300-311: ⚠️ Potential issue | 🔴 Critical

create_sub silently swallows exceptions when verb=false — fails to propagate errors.

The exception handler at lines 303–309 only raises when verb is true. When verb=false (which is how the test and likely production calls use it), a failure in spock.sub_create via dblink is silently caught and the procedure returns as if it succeeded. This directly undermines the PR's goal of fast failure: errors from subscription creation on remote nodes won't propagate, and the caller (e.g., create_sub_on_new_node_to_src_node) will continue as though the subscription was created successfully.

Suggested fix — always re-raise

```diff
 EXCEPTION WHEN OTHERS THEN
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Subscription "%" creation failed on remote node! Error: %
-    ', subscription_name, SQLERRM;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Subscription "%" creation failed on remote node! Error: %
+    ', subscription_name, SQLERRM;
 END;
```
547-568: ⚠️ Potential issue | 🔴 Critical

Same verb-gated exception issue in create_node.

Both the NULL-joinid check (line 556) and the WHEN OTHERS handler (line 563) only raise exceptions when verb=true. When verb=false, node creation failures are silently ignored.

Suggested fix — always re-raise

```diff
 IF joinid IS NOT NULL THEN
     IF verb THEN
         RAISE NOTICE 'Node % created remotely with id % and DSN: %', node_name, joinid, dsn;
     END IF;
 ELSE
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Node "%" creation failed remotely!
-    ', node_name;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Node "%" creation failed remotely!
+    ', node_name;
 END IF;
 EXCEPTION WHEN OTHERS THEN
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Node "%" creation failed remotely! Error: %
-    ', node_name, SQLERRM;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Node "%" creation failed remotely! Error: %
+    ', node_name, SQLERRM;
 END;
```
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1284-1288: The EXCEPTION block raises an exception and then has an unreachable CONTINUE. To allow the loop to continue on error, replace the RAISE EXCEPTION with a non-fatal logger (e.g., RAISE WARNING or RAISE NOTICE) that includes SQLERRM and rec.node_name; alternatively, remove the CONTINUE if you intend to abort. Update the EXCEPTION handler around the block that references SQLERRM and rec.node_name so the error is logged with the same message but without terminating execution (see the plpgsql sketch after this list).
- Around line 1141-1145: The EXCEPTION block contains an unreachable "RAISE;"
after a terminating "RAISE EXCEPTION" call; remove the redundant "RAISE;"
statement so the handler only logs/raises once. Locate the EXCEPTION section
that includes the RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' ||
new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ') and delete the
subsequent standalone "RAISE;" line (inside the same block) to eliminate dead
code.
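A minimal plpgsql sketch of the log-and-continue option from the first item above; the DO block, the node query, and PERFORM 1 are illustrative stand-ins for the real loop body in zodan.sql:

```sql
DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT node_name FROM spock.node LOOP
        BEGIN
            PERFORM 1;  -- stand-in for the per-node work that may fail
        EXCEPTION WHEN OTHERS THEN
            -- Non-fatal: log the failure and keep iterating, so the
            -- CONTINUE is reachable again.
            RAISE WARNING 'Failed on node %: %', rec.node_name, SQLERRM;
            CONTINUE;
        END;
    END LOOP;
END;
$$;
```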
In `@tests/tap/t/017_zodan_3n_timeout.pl`:
- Around line 75-80: The remove_node calls are building DSN strings that include
user=$db_user but omit password=$db_password, so scalar_query("CALL
spock.remove_node(...)") will fail under non-trust auth; update each remove_node
invocation (the scalar_query calls invoking spock.remove_node in this file) to
include password=$db_password in the target_node_dsn argument (matching the
add_node DSN format used elsewhere) so the cleanup executes with proper
credentials; ensure all occurrences (the remove_node calls near the scalar_query
invocations) are updated consistently.
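A hedged sketch of one such cleanup call; remove_node's full argument list is not shown in this thread, and the host/port variables are assumed to exist in the test:

```perl
# Match the add_node DSN format, including the password, so cleanup also
# works under non-trust authentication.
scalar_query(3, qq{
    CALL spock.remove_node(
        target_node_dsn := 'host=$host port=$port dbname=$db_name user=$db_user password=$db_password'
    );
});
```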
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
samples/Z0DAN/zodan.sql (2)
300-311: ⚠️ Potential issue | 🔴 Critical

create_sub silently swallows errors when verb is false — breaks the "fail fast" objective.

The EXCEPTION handler at lines 304-309 only raises when verb is true. When verb is false (as in the TAP test), a dblink failure in sub_create is silently discarded and create_sub returns normally. This means the outer caller (create_sub_on_new_node_to_src_node, etc.) never sees the error, and add_node continues instead of aborting.

This directly undermines the PR's goal and will cause the third TAP test scenario (renamed sub_create on n1) to silently succeed (exit 0) instead of failing fast.

Proposed fix — always propagate the error

```diff
 BEGIN
     SELECT * FROM dblink(node_dsn, remotesql) AS t(sid oid) INTO sid;
 EXCEPTION WHEN OTHERS THEN
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Subscription "%" creation failed on remote node! Error: %
-    ', subscription_name, SQLERRM;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Subscription "%" creation failed on remote node! Error: %
+    ', subscription_name, SQLERRM;
 END;
```
547-568: ⚠️ Potential issue | 🟠 Major

create_node has the same conditional-raise pattern — errors swallowed when verb is false.

Both the joinid IS NULL branch (line 556) and the WHEN OTHERS handler (line 563) guard RAISE EXCEPTION with IF verb. When verb is false, a remote node-creation failure is silently ignored and the procedure returns normally, allowing add_node to proceed on a phantom node.

Proposed fix — raise unconditionally

```diff
 IF joinid IS NOT NULL THEN
     IF verb THEN
         RAISE NOTICE 'Node % created remotely with id % and DSN: %', node_name, joinid, dsn;
     END IF;
 ELSE
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Node "%" creation failed remotely!
-    ', node_name;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Node "%" creation failed remotely!
+    ', node_name;
 END IF;
 EXCEPTION WHEN OTHERS THEN
-    IF verb THEN
-        RAISE EXCEPTION E'
-    [STEP 3] Node "%" creation failed remotely! Error: %
-    ', node_name, SQLERRM;
-    END IF;
+    RAISE EXCEPTION E'
+    [STEP 3] Node "%" creation failed remotely! Error: %
+    ', node_name, SQLERRM;
 END;
```
🤖 Fix all issues with AI agents
In `@tests/tap/t/017_zodan_3n_timeout.pl`:
- Around line 124-143: create_sub currently swallows exceptions from the dblink
call to spock.sub_create when the parameter verb is false, causing add_node to
succeed silently; update the exception handler inside the create_sub function
(in zodan.sql) to re-raise the caught exception (including its message and
detail) rather than suppressing it so that failures in spock.sub_create
propagate back to add_node and cause a non-zero exit; ensure the handler
references the original exception (e.g., using RAISE with SQLERRM or the stored
exception variable) so tests asserting add_node failure (calling add_node with
verb := false) now observe a failure.
ibrarahmad left a comment
We also need to update the Python script.
```sql
RAISE NOTICE ' ✓ Total nodes in cluster: %', node_count;
END IF;

-- Validate that we're not trying to remove the last node
```
Why have we removed that?
As we discussed earlier, there is a problem: if add_node fails to add n3 in earlier steps, the remove script thinks that n3 is the last node (there is not enough data to detect other nodes). So we need something like that, or some kind of 'forced removal' flag.
Yeah. That means enrolling in a Python course for me. Let's try to shift this responsibility to Claude ;).
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
samples/Z0DAN/zodan.py (2)
494-526: ⚠️ Potential issue | 🟡 Minor

Misleading log message in the 3+ node path.

Line 496 prints "3+ node scenario: sync event stored, skipping disabled subscriptions" but the code immediately below (lines 520–526) does create disabled subscriptions. This appears to be a copy-paste leftover from the 2-node message on line 491.

Suggested fix

```diff
-        self.notice("  - 3+ node scenario: sync event stored, skipping disabled subscriptions")
+        self.notice(f"  - 3+ node scenario: triggering sync event on node {rec['node_name']}")
```
968-1030: ⚠️ Potential issue | 🟡 Minor

Duplicate method show_all_subscription_status — this definition is dead code.

There are two definitions of show_all_subscription_status in this class: one here (line 968) and another at line 1346. In Python, the second definition silently shadows the first, making this block unreachable dead code. The version at line 1346 includes the new status labels ([DISABLED], [INITIALIZING], [SYNCING]) from this PR.

Remove this dead definition to avoid confusion.
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.py`:
- Around line 528-531: The log phase numbers are out of order because add_node
calls trigger_sync_on_other_nodes_and_wait_on_source before
create_source_to_new_subscription; update the notice messages so they match
execution order: change trigger_sync_on_other_nodes_and_wait_on_source's notice
from "Phase 5: Triggering sync events on other nodes..." to "Phase 4: ..." and
change create_source_to_new_subscription's notice from "Phase 4: Creating source
to new node subscription" to "Phase 5: ...". Ensure both functions
(trigger_sync_on_other_nodes_and_wait_on_source and
create_source_to_new_subscription) use the corrected phase numbers so output is
monotonically increasing.
- Around line 810-816: The code calls self.run_psql(new_node_dsn, sql) for the
spock.wait_for_sync_event call but does not check the result so failures are
ignored; replace the direct run_psql call with the existing wait_for_sync_event
helper (or at minimum check run_psql's return value and raise on non-zero) so
errors in wait_for_sync_event are detected and propagated; update both the
2-node path (where sync_lsn, new_node_name, src_node_name, timeout_ms are used)
and the multi-node path (the similar block) to call
wait_for_sync_event(new_node_dsn, src_node_name, sync_lsn, timeout_ms) or assert
the returned result and raise an exception on failure.
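A standalone sketch of the check-the-return option; run_psql_checked is a hypothetical name, and the psql flags mirror the invocation style used elsewhere in zodan.py:

```python
import subprocess

def run_psql_checked(dsn: str, sql: str) -> str:
    """Run one statement via psql and raise instead of ignoring a failure."""
    result = subprocess.run(["psql", dsn, "-X", "-A", "-t", "-c", sql],
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"psql failed on {dsn}: {result.stderr.strip()}")
    return result.stdout.strip()
```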
🧹 Nitpick comments (3)
samples/Z0DAN/zodan.py (3)
229-266: Fragile exception discrimination and inconsistent subprocess usage.

Several concerns in this validation block:

Fragile string matching (lines 262-264): if "disabled subscription" in str(e) is brittle — if the error message at line 258 is ever reworded, this silently swallows the exception and falls through to re-raise as a generic failure. Consider using a custom exception class instead.

Inconsistent subprocess usage (lines 237-249): This block shells out with subprocess.run directly, duplicating logic already in self.run_psql and self.get_spock_nodes. Using the existing helpers would reduce duplication and ensure consistent error handling.

Ambiguous variable name (line 240): l is flagged by Ruff (E741). Rename to something descriptive like line.

Proposed fix for the fragile exception pattern

Define a specific exception (at class or module level):

```python
class SubscriptionDisabledException(Exception):
    pass
```

Then use it:

```diff
-                        if sub_enabled != 't':
-                            raise Exception(f"Node {node_nm} has disabled subscription {sub_nm}")
+                        if sub_enabled != 't':
+                            raise SubscriptionDisabledException(f"Node {node_nm} has disabled subscription {sub_nm}")
 ...
-        except Exception as e:
-            if "disabled subscription" in str(e):
-                raise
-            self.format_notice("✗", f"Failed to check subscription statuses: {str(e)}")
-            raise
+        except SubscriptionDisabledException:
+            raise
+        except Exception as e:
+            self.format_notice("✗", f"Failed to check subscription statuses: {e!s}")
+            raise
```

Fix ambiguous variable name on line 240

```diff
-            lines = [l.strip() for l in result.stdout.strip().split('\n') if l.strip()]
+            lines = [line.strip() for line in result.stdout.strip().split('\n') if line.strip()]
```
828-830: Use raise ... from e to preserve the exception chain.

Both lines 829 and 864 create a new Exception wrapping the caught one but discard the traceback chain. This makes debugging harder since the original stack trace is lost.

Proposed fix

```diff
 except Exception as e:
-    raise Exception(f"Enabling subscription {sub_name} (error: {e})")
+    raise Exception(f"Enabling subscription {sub_name} (error: {e})") from e
```

Apply the same change at both line 829 and line 864.

Also applies to: 863-864
936-944: Subscription status display only queries the source node.

Lines 940–944 run spock.sub_show_status() against self.src_dsn only, while the replication-readiness loop above (lines 900–909) correctly iterates all nodes. This means the "Subscription Status" section may miss subscriptions on non-source nodes. Phase 13 (show_all_subscription_status) covers all nodes, but the output here could mislead an operator checking just this section.

Consider querying all nodes here as well for consistency, or adding a note that this is a source-only view.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.py`:
- Around line 494-496: The log message inside the loop over other_nodes
incorrectly says "skipping disabled subscriptions" while the code actually
creates disabled subscriptions and will repeat per node; update the notice()
call to accurately reflect the action (e.g., "3+ node scenario: sync event
stored; creating disabled subscriptions" or move it outside the loop so it logs
once), or remove/adjust it to avoid per-node repetition — locate the loop that
iterates over other_nodes and the notice(...) invocation and make the message
and placement consistent with the subsequent disabled-subscription creation
logic.
- Around line 229-266: The DSN query in dsns_sql currently excludes the local
node so the source node's subscriptions aren't checked; modify dsns_sql (used
where cmd is constructed) to include local node(s) (remove the WHERE node_id NOT
IN (...) filter or adjust it to select all nodes) so the loop that runs sub_sql
against each node covers the source as well, and also fix the l variable in the
list comprehension (lines = [l.strip() ...]) to a non-ambiguous name like
line_str or line to satisfy linters (ruff E741) and update subsequent references
(lines, parts, node_dsn_val, node_nm) accordingly. Ensure existing error
handling around the subscription check (the try/except that inspects "disabled
subscription") remains unchanged.
🧹 Nitpick comments (2)
samples/Z0DAN/zodan.py (2)
571-572: new_node_dsn parameter is unused.

Static analysis flags this (ruff ARG002). All other phase methods accept this parameter for interface consistency, so if that's intentional, consider prefixing with _ to signal it's unused: _new_node_dsn.
827-828: Re-raised exceptions lose the original traceback — use raise ... from e.

Both exception handlers wrap and re-raise but drop the original cause, making debugging harder. This is flagged by ruff B904.

Suggested fix

```diff
 except Exception as e:
-    raise Exception(f"Enabling subscription {sub_name} (error: {e})")
+    raise Exception(f"Enabling subscription {sub_name} (error: {e})") from e
```

Apply the same pattern at line 860.

Also applies to: 859-860
samples/Z0DAN/zodan.py (Outdated)
```python
# Check: all nodes in the cluster have only enabled subscriptions
try:
    # Get DSNs of all remote nodes from source
    dsns_sql = """
        SELECT if_dsn, node_name
        FROM spock.node JOIN spock.node_interface ON (if_nodeid = node_id)
        WHERE node_id NOT IN (SELECT node_id FROM spock.local_node)
    """
    cmd = ["psql", src_dsn, "-X", "-A", "-F", "|", "-t", "-c", dsns_sql]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        lines = [l.strip() for l in result.stdout.strip().split('\n') if l.strip()]
        for line in lines:
            parts = line.split('|')
            if len(parts) >= 2:
                node_dsn_val = parts[0].strip()
                node_nm = parts[1].strip()
                # Check subscriptions on this node
                sub_sql = "SELECT sub_name, sub_enabled FROM spock.subscription"
                sub_cmd = ["psql", node_dsn_val, "-X", "-A", "-F", "|", "-t", "-c", sub_sql]
                sub_result = subprocess.run(sub_cmd, capture_output=True, text=True)
                if sub_result.returncode == 0:
                    sub_lines = [sl.strip() for sl in sub_result.stdout.strip().split('\n') if sl.strip()]
                    for sub_line in sub_lines:
                        sub_parts = sub_line.split('|')
                        if len(sub_parts) >= 2:
                            sub_nm = sub_parts[0].strip()
                            sub_enabled = sub_parts[1].strip()
                            if sub_enabled != 't':
                                raise Exception(f"Node {node_nm} has disabled subscription {sub_nm}")
                            elif self.verbose:
                                self.format_notice("OK:", f"Node with DSN {node_dsn_val} has enabled subscription {sub_nm}")
    self.format_notice("OK:", "Checking each Spock node has only active subscriptions")
except Exception as e:
    if "disabled subscription" in str(e):
        raise
    self.format_notice("✗", f"Failed to check subscription statuses: {str(e)}")
    raise
```
Pre-check only inspects remote nodes — the source node's own subscriptions are never verified.
The query at line 234 excludes the local node (WHERE node_id NOT IN (SELECT node_id FROM spock.local_node)), so if the source node itself has a disabled subscription, this check will pass silently. If the PR goal is to verify all nodes in the cluster have only enabled subscriptions, the source node should be included.
Also, line 240 uses l as a variable name, which is flagged by linters (ruff E741) as ambiguous — easily confused with 1.
Suggested fix for variable name

```diff
-            lines = [l.strip() for l in result.stdout.strip().split('\n') if l.strip()]
+            lines = [line.strip() for line in result.stdout.strip().split('\n') if line.strip()]
```

🧰 Tools
🪛 Ruff (0.14.14)
[error] 238-238: subprocess call: check for execution of untrusted input (S603)
[error] 240-240: Ambiguous variable name: l (E741)
[error] 249-249: subprocess call: check for execution of untrusted input (S603)
[warning] 258-258: Abstract raise to an inner function (TRY301)
[warning] 258-258: Create your own exception (TRY002)
[warning] 258-258: Avoid specifying long messages outside the exception class (TRY003)
[warning] 265-265: Use explicit conversion flag; replace with conversion flag (RUF010)
samples/Z0DAN/zodan.py (Outdated)
```python
for rec in other_nodes:
    # Trigger sync event on origin node and store LSN
    self.notice("  - 3+ node scenario: sync event stored, skipping disabled subscriptions")
```
Misleading log message — says "skipping disabled subscriptions" but then creates them.
Line 496 prints "3+ node scenario: sync event stored, skipping disabled subscriptions", but the code immediately below (lines 519–526) creates disabled subscriptions. This appears to be copy-pasted from the 2-node branch at line 491. The message will also repeat once per node in the loop, which is likely unintended.
Suggested fix

```diff
-        for rec in other_nodes:
-            # Trigger sync event on origin node and store LSN
-            self.notice("  - 3+ node scenario: sync event stored, skipping disabled subscriptions")
-            sync_lsn = self.sync_event(rec['dsn'])
+        self.notice("  - 3+ node scenario")
+        for rec in other_nodes:
+            # Trigger sync event on origin node and store LSN
+            sync_lsn = self.sync_event(rec['dsn'])
```
Fix Z0DAN error handling: Change RAISE NOTICE to RAISE EXCEPTION across all error handlers in zodan.sql so that add_node fails immediately on error instead of silently logging and eventually timing out. Affected operations: node creation, sync event triggering/waiting, replication slot creation/advancement, subscription creation/enabling, and replication lag monitoring.
Fix subscription name assignment: Move sub_name assignment to the correct location inside the subscription creation block where it is actually used.
Fix Z0DAN initialization: Add COMMIT in present_final_cluster_state to let newly created subscriptions update their state; check all subscriptions across the cluster, not only the new node.
Fix status label: Correct [INITIALIZING] to [SYNCING] for subscriptions in syncing state.
Allow removing the last Spock node via remove_node.
Remove unreachable code: Drop CONTINUE/RAISE statements after RAISE EXCEPTION.
Add 3-node cluster TAP test (016_zodan_3n_timeout).