Skip to content

TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only exception row capture#400

Merged
mason-sharp merged 9 commits intomainfrom
spoc-484
Apr 2, 2026
Merged

TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only exception row capture#400
mason-sharp merged 9 commits intomainfrom
spoc-484

Conversation

@danolivo
Copy link
Copy Markdown
Contributor

@danolivo danolivo commented Mar 24, 2026

  • In TRANSDISCARD and SUB_DISABLE modes, replace per-DML subtransaction replay with a single read-only transaction. Each DML handler logs a row to spock.exception_log instead of attempting the operation. Only the record that originally caused the error carries the real error message; other records get an empty string.
  • Identify the failed record by xact_action_counter (saved as failed_action in SpockExceptionLog), which is unique per DML within a transaction.
  • In DISCARD mode, each DML still runs inside a subtransaction; only the failed ones are logged.
  • Add exception_row_capture regression test covering all three exception modes with different breakage methods (absent table, truncated table, deleted row) and an INSERT_EXISTS conflict to verify that dry-run modes leave spock.resolutions empty while DISCARD populates it.

…-run replay

In TRANSDISCARD and SUB_DISABLE modes, the second (retry) pass replays
the transaction as read-only.  Each DML handler logs a row to
spock.exception_log instead of attempting the operation.  The record
that originally caused the error carries the real error message; other
records get an empty string.

The failed record is identified by xact_action_counter (saved as
failed_action in SpockExceptionLog), which is unique per DML within a
transaction.

In DISCARD mode, each DML still runs inside a subtransaction; only the
failed ones are logged.

Add dry_run_logging regression test covering TRANSDISCARD (absent
table), DISCARD (truncated table), and SUB_DISABLE (deleted row), with
INSERT_EXISTS conflict on drl_t1 to verify that dry-run modes leave
spock.resolutions empty while DISCARD populates it.
@danolivo danolivo self-assigned this Mar 24, 2026
@danolivo danolivo added the enhancement New feature or request label Mar 24, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

This PR implements exception row capture functionality in Spock by introducing a failed_action field to track which action triggered an error, configuring read-only transactions during exception replay, restructuring DML and SQL handling in try-block modes, updating fallback error messages, and adding comprehensive regression test coverage for exception behavior validation.

Changes

Cohort / File(s) Summary
Exception Handler Struct and Implementation
include/spock_exception_handler.h, src/spock_exception_handler.c
Added uint32 failed_action field to SpockExceptionLog struct to record the xact_action_counter at error time. Changed fallback exception log message from "unknown" to "unavailable" when error message is unavailable.
Apply Logic Exception Handling
src/spock_apply.c
Major refactoring of exception behavior in try-block mode: sets transaction_read_only for TRANSDISCARD/SUB\_DISABLE replay, removes parent-transaction abort/discard path, restructures DML execution to skip actual operations while logging exceptions, reworks queued SQL handling, attaches real error message only to the action that caused it, suppresses COMMIT LSN advancement for SUB\_DISABLE retry, and triggers SUB\_DISABLE disable via ERROR after successful commit.
Regression Test Suite
Makefile, tests/regress/sql/replication_set.sql, tests/regress/sql/exception_row_capture.sql
Updated Makefile to replace replication_set test with exception_row_capture. Enhanced replication_set.sql with connection switching, spock.exception_log truncation, and additional validation queries. Added new comprehensive exception_row_capture.sql test script covering three exception\_behaviour modes (transdiscard, discard, sub\_disable) with conflict scenarios, DML failures, and assertion validation of exception logs and resolutions.

Poem

🐰 An error hops, we catch its track,
With failed_action on its back,
Read-only burrows, safe and sound,
Where lost replication rows are found,
Exception logs with messages clear—
No more mysteries down here! 🌙

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: switching TRANSDISCARD/SUB_DISABLE from subtransaction replay to read-only exception row capture, matching the core feature described in the changeset.
Description check ✅ Passed The pull request description is directly related to the changeset, providing a clear explanation of the implementation approach for TRANSDISCARD/SUB_DISABLE modes and test coverage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch spoc-484

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as outdated.

…producer

Replace wait_slot_confirm_lsn with sync_event/wait_for_sync_event to
eliminate replication race conditions when querying exception_log on the
subscriber.

Add a TRANSDISCARD sub-test that mixes DML with a replicate_ddl call
inside a single failing transaction.  This reproduces a duplicate DDL
logging bug in handle_queued_ddl detected by CodeRabbit during review:
the TRANSDISCARD/SUB_DISABLE branch logs via add_entry_to_exception_log,
then falls through to the generic logging block which logs again.

Remove test section numbering per project conventions.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/regress/sql/dry_run_logging.sql`:
- Around line 107-110: Update the test comment that currently reads "Expect
entries with duplicated DDL record" to reflect the intended behavior of no
duplicate DDL logging; modify the comment above the SELECT querying
spock.exception_log (the query using table_name, operation, (error_message <>
'') AS has_error, ddl_statement ordered by command_counter) to state that the
test expects NO duplicate DDL entries (i.e., a single DDL record per command).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc49989d-5676-4a35-890d-8275efe21d33

📥 Commits

Reviewing files that changed from the base of the PR and between f359f4d and 936e414.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/regress/sql/dry_run_logging.sql

…ISABLE

The TRANSDISCARD/SUB_DISABLE branch logged via add_entry_to_exception_log
then fell through to the generic logging block which called
should_log_exception and logged again, producing duplicate DDL entries.

Merge both call sites into a single add_entry_to_exception_log after the
if/else branches.  The TRANSDISCARD/SUB_DISABLE branch now only extracts
the SQL text; actual logging is done in the shared block.

Add Assert(my_exception_log_index >= 0) before accessing exception_log_ptr
since use_try_block guarantees the index was set in handle_begin.

Bug reported by CodeRabbit during review.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/spock_apply.c (3)

1287-1308: Consider adding defensive assertion before accessing exception_log_ptr.

The code at line 2302 includes Assert(my_exception_log_index >= 0) before accessing exception_log_ptr, but the DML handlers (INSERT, UPDATE, DELETE) don't have this check. While my_exception_log_index should be valid when use_try_block is true (set during handle_begin), adding the assertion would provide consistent defensive validation.

🛡️ Suggested defensive assertion
 		if (exception_behaviour == TRANSDISCARD ||
 			exception_behaviour == SUB_DISABLE)
 		{
+			Assert(my_exception_log_index >= 0);
+
 			/*
 			 * TRANSDISCARD and SUB_DISABLE: skip the DML and log the
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 1287 - 1308, Add a defensive Assert that
my_exception_log_index is valid before dereferencing exception_log_ptr in the
DML handlers (the INSERT/UPDATE/DELETE branches where exception_log_ptr[...] is
accessed and log_insert_exception(...) is called). Specifically, in the block
using exception_log_ptr[my_exception_log_index] (and before setting
exception_log_ptr[my_exception_log_index].local_tuple or reading
.initial_error_message), insert Assert(my_exception_log_index >= 0) to mirror
the check used in handle_begin/other code paths so the handlers (where
use_try_block is expected) validate the index before use.

1299-1304: Optional: Extract repeated error_msg pattern to helper function.

The same logic for computing error_msg is repeated in four handlers (INSERT, UPDATE, DELETE, SQL):

char *error_msg =
    (xact_action_counter ==
     exception_log_ptr[my_exception_log_index].failed_action &&
     exception_log_ptr[my_exception_log_index].initial_error_message[0]) ?
    exception_log_ptr[my_exception_log_index].initial_error_message :
    NULL;

This could be extracted to a helper function to improve maintainability and ensure consistent behavior if the logic ever needs to change.

♻️ Suggested helper function
/*
 * Get the error message to log for the current action during dry-run replay.
 * Returns the initial error message only for the action that caused the
 * original failure; returns NULL for all other actions.
 */
static inline char *
get_dry_run_error_msg(void)
{
    Assert(my_exception_log_index >= 0);
    
    if (xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action &&
        exception_log_ptr[my_exception_log_index].initial_error_message[0] != '\0')
        return exception_log_ptr[my_exception_log_index].initial_error_message;
    
    return NULL;
}

Also applies to: 1477-1482, 1594-1599, 2307-2312

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 1299 - 1304, Extract the repeated error_msg
computation into a small static helper (e.g. static inline char
*get_dry_run_error_msg(void)) that reads my_exception_log_index,
xact_action_counter and exception_log_ptr and returns
exception_log_ptr[my_exception_log_index].initial_error_message when
xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action
and the initial_error_message is non-empty, otherwise returns NULL; replace the
duplicated ternary in the four handlers (the blocks around the places that set
error_msg in INSERT/UPDATE/DELETE/SQL handlers) with a call to
get_dry_run_error_msg() and keep the Assert(my_exception_log_index >= 0) in the
helper.

871-878: Consider clearing initial_error_message for consistency.

Lines 840-843 (TRANSDISCARD path) clear initial_error_message[0] = '\0', but lines 871-876 (SUB_DISABLE path) do not. While this is harmless since the worker exits after spock_disable_subscription(), adding the clear would maintain consistency between the two paths.

♻️ Suggested change for consistency
 			exception_log = &exception_log_ptr[my_exception_log_index];
 			exception_log->commit_lsn = InvalidXLogRecPtr;
+			exception_log->initial_error_message[0] = '\0';
 			MySpockWorker->restart_delay = 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 871 - 878, In the SUB_DISABLE path add the
same clearing of the saved initial error string as done in the TRANSDISCARD
path: after assigning exception_log = &exception_log_ptr[my_exception_log_index]
and before calling spock_disable_subscription()/elog(ERROR...), set
exception_log->initial_error_message[0] = '\0' so the SpockExceptionLog
(exception_log) is consistent with the TRANSDISCARD handling; keep the rest
(commit_lsn, MySpockWorker->restart_delay, spock_disable_subscription(), elog)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/spock_apply.c`:
- Around line 1287-1308: Add a defensive Assert that my_exception_log_index is
valid before dereferencing exception_log_ptr in the DML handlers (the
INSERT/UPDATE/DELETE branches where exception_log_ptr[...] is accessed and
log_insert_exception(...) is called). Specifically, in the block using
exception_log_ptr[my_exception_log_index] (and before setting
exception_log_ptr[my_exception_log_index].local_tuple or reading
.initial_error_message), insert Assert(my_exception_log_index >= 0) to mirror
the check used in handle_begin/other code paths so the handlers (where
use_try_block is expected) validate the index before use.
- Around line 1299-1304: Extract the repeated error_msg computation into a small
static helper (e.g. static inline char *get_dry_run_error_msg(void)) that reads
my_exception_log_index, xact_action_counter and exception_log_ptr and returns
exception_log_ptr[my_exception_log_index].initial_error_message when
xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action
and the initial_error_message is non-empty, otherwise returns NULL; replace the
duplicated ternary in the four handlers (the blocks around the places that set
error_msg in INSERT/UPDATE/DELETE/SQL handlers) with a call to
get_dry_run_error_msg() and keep the Assert(my_exception_log_index >= 0) in the
helper.
- Around line 871-878: In the SUB_DISABLE path add the same clearing of the
saved initial error string as done in the TRANSDISCARD path: after assigning
exception_log = &exception_log_ptr[my_exception_log_index] and before calling
spock_disable_subscription()/elog(ERROR...), set
exception_log->initial_error_message[0] = '\0' so the SpockExceptionLog
(exception_log) is consistent with the TRANSDISCARD handling; keep the rest
(commit_lsn, MySpockWorker->restart_delay, spock_disable_subscription(), elog)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 031139ab-7009-4644-ab84-545c7b0c6bb2

📥 Commits

Reviewing files that changed from the base of the PR and between 936e414 and 7f4beea.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/regress/sql/dry_run_logging.sql
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/regress/sql/dry_run_logging.sql

Also, harden the PG_CATCH block: explicitly switch to ApplyOperationContext
before CopyErrorData so edata lands in a known context, call
FlushErrorState before RollbackAndReleaseCurrentSubTransaction to clear
the error stack, and move ReleaseCurrentSubTransaction into PG_TRY so
errors during release are caught.
Copy link
Copy Markdown
Member

@mason-sharp mason-sharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, it is much better than what we were doing before. I will revisit after the read-only PR.

RollbackAndReleaseCurrentSubTransaction() switches CurrentMemoryContext
to the parent transaction context via AtSubCleanup_Memory().  The
PG_CATCH blocks in handle_update, handle_delete, and handle_sql_or_exception
called CopyErrorData() after the rollback, allocating the error copy in
the wrong context.  They also did not call FlushErrorState().

Standardize all four DML/DDL exception handlers (INSERT, UPDATE, DELETE,
SQL) to follow the same sequence:

  1. MemoryContextSwitchTo(ApplyOperationContext)
  2. CopyErrorData()
  3. FlushErrorState()
  4. RollbackAndReleaseCurrentSubTransaction()
  5. MemoryContextSwitchTo(ApplyOperationContext)  -- restore after rollback

The post-rollback context restore ensures the Assert at the end of
handle_insert (CurrentMemoryContext == ApplyOperationContext) passes
on the error path.

Also change exception_log error_message from empty string to
"unavailable" when the message is NULL, to avoid user confusion
between empty strings and NULL values when querying.
@danolivo danolivo requested a review from mason-sharp March 30, 2026 13:15
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/spock_apply.c (1)

546-546: ⚠️ Potential issue | 🟠 Major

Potential off-by-one array access in exception_log slot search.

The loop conditions at lines 546 and 584 use i <= SpockCtx->total_workers, iterating up to and including index total_workers. However, exception_log_shmem_size() allocates sizeof(SpockExceptionLog) * nworkers where nworkers = SpockCtx->total_workers, meaning valid indices are 0 to total_workers - 1.

If a match is found at i == total_workers, my_exception_log_index will be set to an out-of-bounds index, causing memory corruption when accessed later (lines 668, 741, 820, 873, 881, 889, and throughout DML handlers).

🐛 Proposed fix

Change both loops from <= to <:

-		for (int i = 0; i <= SpockCtx->total_workers; i++)
+		for (int i = 0; i < SpockCtx->total_workers; i++)

Apply at line 546 and line 584.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` at line 546, The loops that search the exception_log array
iterate with "for (int i = 0; i <= SpockCtx->total_workers; i++)" and can index
past the allocated range because exception_log_shmem_size() allocates
SpockCtx->total_workers entries; change those loop conditions to "i <
SpockCtx->total_workers" so iteration is 0..total_workers-1. Update both
occurrences that set/use my_exception_log_index when scanning
SpockCtx->exception_log (and any similar searches) to use "<" instead of "<=" to
avoid out-of-bounds access and memory corruption.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/spock_apply.c`:
- Line 546: The loops that search the exception_log array iterate with "for (int
i = 0; i <= SpockCtx->total_workers; i++)" and can index past the allocated
range because exception_log_shmem_size() allocates SpockCtx->total_workers
entries; change those loop conditions to "i < SpockCtx->total_workers" so
iteration is 0..total_workers-1. Update both occurrences that set/use
my_exception_log_index when scanning SpockCtx->exception_log (and any similar
searches) to use "<" instead of "<=" to avoid out-of-bounds access and memory
corruption.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab67b0c4-411b-464f-be0d-ff348173ab60

📥 Commits

Reviewing files that changed from the base of the PR and between a069f87 and c5b4612.

⛔ Files ignored due to path filters (2)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
  • tests/regress/expected/replication_set.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • src/spock_exception_handler.c
✅ Files skipped from review due to trivial changes (1)
  • src/spock_exception_handler.c

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/spock_apply.c (1)

1197-1221: ⚠️ Potential issue | 🟠 Major

failed_action can still snapshot the wrong operation metadata.

This field only works if xact_action_counter is advanced before anything that can fail. handle_insert() still reaches spock_read_insert() before bumping the counter, and handle_sql_or_exception() never bumps it at all. A missing-table INSERT or a transaction with multiple queued SQL/DDL messages can therefore record the previous action number, so replay may attach the initial error to the wrong exception_log row(s).

Also applies to: 2304-2315, 3297-3303

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/spock_apply.c`:
- Around line 2307-2313: The code currently uses JsonbToCString on
queued_message->message->root which produces JSON-formatted text; replace this
with the same extraction used in handle_sql(): validate the JSONB root with
JB_ROOT_IS_SCALAR, extract the scalar text value and duplicate it with pnstrdup
(as handle_sql() does around JB_ROOT_IS_SCALAR and pnstrdup), and assign that
raw SQL string to sql before incrementing exception_command_counter so logged
SQL is the original statement text rather than JSON.
- Around line 1524-1545: The handlers handle_update, handle_delete, and
handle_sql_or_exception copy ErrorData into ApplyOperationContext during
exception handling but never free that context; add a call to
MemoryContextReset(ApplyOperationContext) at the end of each of these functions
(just before their closing brace) mirroring the pattern used in handle_insert to
release allocations from CopyErrorData and related structures so they don't
accumulate for the worker lifetime; locate the exception code paths that call
CopyErrorData() and ensure MemoryContextReset(ApplyOperationContext) is executed
unconditionally at function exit.
- Around line 446-453: handle_truncate() must mirror the skip-and-log behavior
used by handle_insert/handle_update/handle_delete/handle_sql for TRANSDISCARD
and SUB_DISABLE modes: detect if exception_behaviour is TRANSDISCARD or
SUB_DISABLE (same condition used in begin_replication_step() that sets
transaction_read_only) and instead of calling ExecuteTruncateGuts(), log the
truncated relation(s) as skipped and return without executing; ensure the log
message clearly identifies the TRUNCATE and the mode (TRANSDISCARD/SUB_DISABLE)
and avoid side effects by not invoking ExecuteTruncateGuts() when in those
modes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7be7ea71-025a-41c9-9b88-8e95255e532c

📥 Commits

Reviewing files that changed from the base of the PR and between a069f87 and c5b4612.

⛔ Files ignored due to path filters (2)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
  • tests/regress/expected/replication_set.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • src/spock_exception_handler.c
✅ Files skipped from review due to trivial changes (1)
  • src/spock_exception_handler.c

Add the same skip-and-log guard used by the DML handlers so that TRUNCATE
is recorded as discarded and the table contents are preserved.
…ql_or_exception

In DISCARD mode, these handlers call CopyErrorData() into ApplyOperationContext
inside their PG_CATCH blocks but never reset that context afterward. This leaks
ErrorData across every failed operation within a transaction replay. Add
MemoryContextReset(ApplyOperationContext) at function exit, matching the
existing pattern in handle_insert.
@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Mar 31, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 5 duplication

Metric Results
Duplication 5

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/spock_apply.c (2)

3319-3340: ⚠️ Potential issue | 🔴 Critical

Guard my_exception_log_index before writing retry metadata.

exception_log_ptr != NULL is not a sufficient guard here. If anything throws before handle_begin() finds or creates the worker slot, my_exception_log_index is still -1, and Lines 3321-3340 write before the start of the shared-memory array.

🛡️ Minimal guard
-		if (exception_log_ptr != NULL)
+		if (exception_log_ptr != NULL && my_exception_log_index >= 0)
 		{
 			...
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 3319 - 3340, The writes to
exception_log_ptr[...] must only occur when my_exception_log_index is valid; add
a guard checking that my_exception_log_index >= 0 (and exception_log_ptr !=
NULL) before writing initial_error_message, initial_operation, and
failed_action, or defer these writes until handle_begin() has been called and a
slot allocated; specifically update the block around my_exception_log_index,
exception_log_ptr, and the use of errcallback_arg.action_name in
src/spock_apply.c so you only access exception_log_ptr[my_exception_log_index]
when my_exception_log_index is non-negative (or ensure handle_begin() sets it
first).

1197-1221: ⚠️ Potential issue | 🟠 Major

Set INSERT retry metadata before relation lookup.

spock_read_insert() is still the first failure point here, but errcallback_arg.action_name and xact_action_counter are only assigned afterwards. A missing-table INSERT will therefore capture the previous action in initial_operation / failed_action, and the retry path falls back to the generic relation error instead of marking the triggering INSERT as the failed row.

🛠️ Suggested direction
-	rel = spock_read_insert(s, RowExclusiveLock, &newtup);
+	errcallback_arg.action_name = "INSERT";
+	xact_action_counter++;
+	rel = spock_read_insert(s, RowExclusiveLock, &newtup);
 	if (unlikely(rel == NULL))
 	{
 		...
-		log_insert_exception(true, "Spock can't find relation", NULL,
+		log_insert_exception(true,
+							 (xact_action_counter ==
+							  exception_log_ptr[my_exception_log_index].failed_action &&
+							  exception_log_ptr[my_exception_log_index].initial_error_message[0] != '\0') ?
+							 exception_log_ptr[my_exception_log_index].initial_error_message :
+							 "Spock can't find relation",
+							 NULL,
 							 NULL, NULL, "INSERT");
 		...
 	}
 
-	errcallback_arg.action_name = "INSERT";
 	errcallback_arg.rel = rel;
-	xact_action_counter++;

Also applies to: 1225-1227

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 1197 - 1221, Before calling spock_read_insert
move the assignment of retry metadata so the failing INSERT is recorded: set
errcallback_arg.action_name to "INSERT" (or the appropriate action string) and
bump xact_action_counter (and assign initial_operation/failed_action as your
retry bookkeeping expects) prior to invoking spock_read_insert; this ensures a
missing-table INSERT records the correct failed_action and errcallback data used
by the retry path instead of inheriting a previous action. Apply the same change
to the analogous block referenced at the other occurrence (lines ~1225-1227) so
UPDATE/other operations also set their action metadata before doing relation
lookups.
♻️ Duplicate comments (1)
src/spock_apply.c (1)

2335-2347: ⚠️ Potential issue | 🟠 Major

Log the raw queued statement here, not its JSON wrapper.

JsonbToCString() stores a quoted/escaped JSON rendering, not the same plain SQL text that handle_sql() executes, and this path no longer enforces the JB_ROOT_IS_SCALAR validation from the execution path. Dry-run ddl_statement values will diverge from normal logging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 2335 - 2347, The code is logging a
JSON-quoted SQL string via JsonbToCString for the TRANSDISCARD/SUB_DISABLE
branch, which diverges from the execution path; instead, validate that the
queued_message->message->root is a scalar (use JB_ROOT_IS_SCALAR) and extract
the raw SQL text the same way handle_sql() does (reuse its extraction helper or
replicate its logic to pull the scalar string without JSON quoting) before
incrementing exception_command_counter so dry-run logs match executed SQL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/spock_apply.c`:
- Around line 3319-3340: The writes to exception_log_ptr[...] must only occur
when my_exception_log_index is valid; add a guard checking that
my_exception_log_index >= 0 (and exception_log_ptr != NULL) before writing
initial_error_message, initial_operation, and failed_action, or defer these
writes until handle_begin() has been called and a slot allocated; specifically
update the block around my_exception_log_index, exception_log_ptr, and the use
of errcallback_arg.action_name in src/spock_apply.c so you only access
exception_log_ptr[my_exception_log_index] when my_exception_log_index is
non-negative (or ensure handle_begin() sets it first).
- Around line 1197-1221: Before calling spock_read_insert move the assignment of
retry metadata so the failing INSERT is recorded: set
errcallback_arg.action_name to "INSERT" (or the appropriate action string) and
bump xact_action_counter (and assign initial_operation/failed_action as your
retry bookkeeping expects) prior to invoking spock_read_insert; this ensures a
missing-table INSERT records the correct failed_action and errcallback data used
by the retry path instead of inheriting a previous action. Apply the same change
to the analogous block referenced at the other occurrence (lines ~1225-1227) so
UPDATE/other operations also set their action metadata before doing relation
lookups.

---

Duplicate comments:
In `@src/spock_apply.c`:
- Around line 2335-2347: The code is logging a JSON-quoted SQL string via
JsonbToCString for the TRANSDISCARD/SUB_DISABLE branch, which diverges from the
execution path; instead, validate that the queued_message->message->root is a
scalar (use JB_ROOT_IS_SCALAR) and extract the raw SQL text the same way
handle_sql() does (reuse its extraction helper or replicate its logic to pull
the scalar string without JSON quoting) before incrementing
exception_command_counter so dry-run logs match executed SQL.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 63f85652-7c76-4585-964c-e121cf5bf7ac

📥 Commits

Reviewing files that changed from the base of the PR and between a069f87 and 5d3a782.

⛔ Files ignored due to path filters (2)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
  • tests/regress/expected/replication_set.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • src/spock_exception_handler.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/spock_exception_handler.c

-- Common scenario: three tables on provider, one broken on subscriber.
-- A single transaction with DMLs on all three tables triggers an error
-- on the broken table. The first DML (INSERT into drl_t1) also creates
-- a conflict (INSERT_EXISTS) to verify that dry-run modes do not log
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat nit picky, but I think we should use a different term since dry-run implies "what would happen". How about we use "log-only replay"?

After changing throughout and the read-only PR gets in, we can get this merged in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mason-sharp . It is up to you as a native speaker. I've never seen 'log-only replay' and used the 'dry-run mode' as a standard term in the industry. Feel free to commit your change at the top of the branch.

The TRANSDISCARD and SUB_DISABLE modes step through each operation
in a failed transaction without executing DML, capturing every row
to exception_log. "Exception row capture" describes this behavior
more precisely than "dry-run", which implies a trial execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/regress/sql/exception_row_capture.sql (2)

116-118: Missing sync after DDL cleanup could cause test ordering issues.

The DROP TABLE IF EXISTS public.drl_dummy DDL at line 118 has no sync before the reset section begins. While the subsequent reset block (lines 124-131) performs its own DDL that will serialize properly, adding a sync here would make the test more robust against potential timing issues.

♻️ Optional: Add sync after cleanup DDL
 -- Cleanup the dummy table
 \c :provider_dsn
 SELECT spock.replicate_ddl('DROP TABLE IF EXISTS public.drl_dummy');
+SELECT spock.sync_event() AS sync_lsn \gset
+\c :subscriber_dsn
+CALL spock.wait_for_sync_event(NULL, 'test_provider', :'sync_lsn', 30);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/exception_row_capture.sql` around lines 116 - 118, After
issuing the cleanup DDL via spock.replicate_ddl('DROP TABLE IF EXISTS
public.drl_dummy'), add an explicit replication sync call immediately after that
statement so the DROP is fully applied before the test reset block starts;
locate the spock.replicate_ddl call in the file and follow it with the test
suite's existing sync mechanism (e.g., the same sync helper used elsewhere in
these tests) to serialize the DDL and avoid ordering/timing issues.

78-81: Consider using error_message = '' for clearer intent or explicit value checking.

The condition error_message <> '' relies on non-error records storing an empty string. Per the code, NULL error messages become 'unavailable', which would make this check return true for both actual errors and the fallback case.

If the implementation passes empty string '' for non-error records (as stated in the PR description), this works correctly. However, for test robustness, consider checking for the actual expected values or adding a comment clarifying the expected error_message values for each row.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/exception_row_capture.sql` around lines 78 - 81, The test's
boolean expression (error_message <> '') is ambiguous given the code may set
NULLs to 'unavailable'; update the SELECT against spock.exception_log to
explicitly check the expected values—either replace (error_message <> '') AS
has_error with a clear equality (error_message = '') AS has_no_error or, better,
use an explicit membership check like (error_message NOT IN ('', 'unavailable'))
AS has_error to only flag real errors; also add a brief inline comment
explaining which error_message sentinel values the test expects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/regress/sql/exception_row_capture.sql`:
- Around line 116-118: After issuing the cleanup DDL via
spock.replicate_ddl('DROP TABLE IF EXISTS public.drl_dummy'), add an explicit
replication sync call immediately after that statement so the DROP is fully
applied before the test reset block starts; locate the spock.replicate_ddl call
in the file and follow it with the test suite's existing sync mechanism (e.g.,
the same sync helper used elsewhere in these tests) to serialize the DDL and
avoid ordering/timing issues.
- Around line 78-81: The test's boolean expression (error_message <> '') is
ambiguous given the code may set NULLs to 'unavailable'; update the SELECT
against spock.exception_log to explicitly check the expected values—either
replace (error_message <> '') AS has_error with a clear equality (error_message
= '') AS has_no_error or, better, use an explicit membership check like
(error_message NOT IN ('', 'unavailable')) AS has_error to only flag real
errors; also add a brief inline comment explaining which error_message sentinel
values the test expects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f80b64f2-08ae-4a57-bae7-b4edbf23a1f0

📥 Commits

Reviewing files that changed from the base of the PR and between a069f87 and 6a9d5b6.

⛔ Files ignored due to path filters (2)
  • tests/regress/expected/exception_row_capture.out is excluded by !**/*.out
  • tests/regress/expected/replication_set.out is excluded by !**/*.out
📒 Files selected for processing (4)
  • Makefile
  • src/spock_apply.c
  • src/spock_exception_handler.c
  • tests/regress/sql/exception_row_capture.sql
✅ Files skipped from review due to trivial changes (2)
  • src/spock_exception_handler.c
  • Makefile
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/spock_apply.c

@mason-sharp mason-sharp changed the title SPOC-484: TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only dry-run logging TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only exception row capture Apr 2, 2026
@mason-sharp mason-sharp merged commit 47524cf into main Apr 2, 2026
10 checks passed
@mason-sharp mason-sharp deleted the spoc-484 branch April 2, 2026 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants