You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On a Spock mesh, the armed Ricart-Agrawala bakery does not serialize multiple concurrent writers per node, on ≥2 nodes at once, to the same
cold table. Some commits hit Conflict_409 (CatalogCommitConflicts) at
Lakekeeper — the exact outcome the bakery exists to prevent — and the losing
writes error out (rows lost unless the app retries).
This affects any cold write through the mesh bakery — reproduced on both
a decoupled (iceberg-only) table and a genuinely tiered table (is_iceberg_only=false, hot_table set). Hot-tier writes (Postgres heap) are unaffected.
The single-writer-per-node case is solid. This is a coverage gap: the mesh
bakery has never been tested above 1 concurrent writer per node.
Related to but distinct from #17 Bug 2 — that was the vanilla single-node
advisory-lock path on slow Azure ADLS. This is the armed R-A mesh path
(v_armed = true) on fast local SeaweedFS, with multiple writers per node
across nodes. #17 also noted the slower the storage round-trip, the worse the
race — which is why this matters more in production, not less (below).
Scope: is this local-only, or real in production?
Real in production — likely more readily than locally.
The 409 is Lakekeeper's optimistic-concurrency check
(CatalogCommitConflicts / "snapshot has changed") — how every Iceberg
REST catalog enforces atomic commits. Backend- and deployment-independent;
identical against real S3/Azure/GCS.
In a real distributed deployment, nodes sit across regions/clouds → wider
Spock replication lag → the window between "my claim committed" and "peer
applied + acked" is larger, so the race is at least as likely.
Caveat: reproduced locally (2 nodes, one host); production frequency is
unquantified and timing-dependent. But there is no basis to think production
is immune.
What works (calibration — the bakery is real)
Workload
Result
Write on node A, read on node B (cross-node visibility)
✅
1 writer per node, concurrent, cross-node, same table
✅ (this is story_mesh)
N concurrent writers on a single node
✅ (clause (a) same-node-min serializes them)
N writers per node, on ≥2 nodes, same cold table, at once
❌ Conflict_409
Evidence
2-node Spock mesh (MESH=on, db1=snowflake 691, db2=109), one shared
Lakekeeper + SeaweedFS. Fire per-node concurrent cold writes on both nodes.
Decoupled table (iceberg-only events_lake):
Test
Result
1 writer per node × 5 rounds
10/10 land, 0 err
3 writers, same node only
3/3 land, 0 err
5 writers per node, cross-node
7/10 land, 12 err
3 writers per node, cross-node
5/6 land, 4 err
5 writers per node, cross-node, peer_alive_window_ms=60000
7/10, 12 err (raising the window does not help)
Tiered table (events, is_iceberg_only=false, hot_table="public"._events,
provisioned via the archiver, watermark replicated to the peer):
Test
Result
1 cold write per node, cross-node
0 err
5 cold writes per node, cross-node
12 err — same CatalogCommitConflicts (409)
Client error (identical on both table types):
ERROR: (PGDuckDB/DuckdbXactCallback_Cpp) TransactionContext Error: Failed to commit:
Failed to commit Iceberg transaction: Request to '.../transactions/commit' returned a
non-200 status code (Conflict_409).
message: CatalogCommitConflicts, context: { expected: <snap>, found: <snap> }
=> Requirement failed: Branch or tag `main`'s snapshot has changed
type: CatalogCommitConflicts
reason: Conflict
The coldfront.claim_acks trail shows tickets issued and acked by both nodes,
so the claim machinery runs — the failure is the ordering guarantee not
holding, not a missing ack.
Reproduction
On a formed mesh (e.g. ci/topo/mesh.sh --mode decoupled or --mode tiered)
with a cold table registered on each node:
# fire >=3 concurrent writers PER node, on >=2 nodes, to the SAME table, at once.# (tiered: use a ts older than the archive_watermark cutoff so the write routes cold)foriin$(seq 1 5);do
psql -h db1 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);"&
psql -h db2 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);"&donewait# several clients fail with Conflict_409 / CatalogCommitConflicts; some rows never land
Root cause (hypothesis)
The wait loop in coldfront._claim_iceberg_lock
(extension/coldfront/coldfront--1.0.sql:1844-1879) exits on:
(a) same-node-min (1845–1849): no same-node claim on this table has a
smaller ticket, and
(b) peer-ack (1850–1877): every alive peer has acked my ticket (alive =
walsender heartbeat within coldfront.peer_alive_window_ms, default 5s).
Both clauses are individually correct, and same-node serialization (a) works
in isolation (3-same-node → 3/3 clean). The failure appears when both nodes
have multiple outstanding claims at once — the cross-node ack/defer interaction
under real timing (async Spock replication lag, the gap between a claim
committing and the peer applying it) lets two commits reach Lakekeeper
together. The release path documents a related window
(coldfront--1.0.sql:1889-1891: "the next writer may briefly proceed while our
iceberg commit is still in flight").
Both decoupled and tiered cold writes go through this same function
(coldfront.c cold rewrite → _exec_iceberg_with_claim; coldfront--1.0.sql:740-745 cold branch → _claim_iceberg_lock), which is why
both reproduce identically.
Ruled out as causes:
Not snowflake.node config — it is per-instance by design; the alignment
check (coldfront--1.0.sql:1782-1791) requires snowflake.node == hashtext(spock_node_name) & 1023. Distinct per-connection
ids would break clause (a) and the dead-peer mapping.
Not peer_alive_window_ms — raising it to 60s does not help (see table).
This is bakery/protocol territory, which per repo policy is TLA+-model-first
(docs/formal/); flagging rather than proposing a code change.
CI coverage gap
ci/journey.sh:story_mesh (the mesh cross-node concurrency story) fires exactly 1 writer per node. The 8-way concurrency story
(story_decoupled_concurrency) runs on the vanilla topology (single node,
long advisory lock), not the mesh. So multi-writer-per-node on the mesh is
untested for both table types.
Suggested next steps
Decide whether multi-writer-per-node (across nodes, same cold table) is a supported mesh workload or explicitly out of scope.
If supported: add a mesh concurrency story that fires ≥3 writers per node on
≥2 nodes to the same cold table (decoupled and tiered), and extend the
TLA+ model to cover concurrent multi-claim-per-node ordering.
If out of scope: document the per-node cold-write concurrency limit for mesh,
and consider surfacing a clearer error / built-in retry than a raw Conflict_409.
Summary
On a Spock mesh, the armed Ricart-Agrawala bakery does not serialize
multiple concurrent writers per node, on ≥2 nodes at once, to the same
cold table. Some commits hit
Conflict_409(CatalogCommitConflicts) atLakekeeper — the exact outcome the bakery exists to prevent — and the losing
writes error out (rows lost unless the app retries).
This affects any cold write through the mesh bakery — reproduced on both
a decoupled (iceberg-only) table and a genuinely tiered table (
is_iceberg_only=false,hot_tableset). Hot-tier writes (Postgres heap) are unaffected.The single-writer-per-node case is solid. This is a coverage gap: the mesh
bakery has never been tested above 1 concurrent writer per node.
Related to but distinct from #17 Bug 2 — that was the vanilla single-node
advisory-lock path on slow Azure ADLS. This is the armed R-A mesh path
(
v_armed = true) on fast local SeaweedFS, with multiple writers per nodeacross nodes. #17 also noted the slower the storage round-trip, the worse the
race — which is why this matters more in production, not less (below).
Scope: is this local-only, or real in production?
Real in production — likely more readily than locally.
(
CatalogCommitConflicts/ "snapshot has changed") — how every IcebergREST catalog enforces atomic commits. Backend- and deployment-independent;
identical against real S3/Azure/GCS.
CatalogCommitConflictsfamily on realAzure ADLS, and observed it was hidden on fast local SeaweedFS and
exposed on slower storage. So this is not a SeaweedFS/emulator artifact.
Spock replication lag → the window between "my claim committed" and "peer
applied + acked" is larger, so the race is at least as likely.
unquantified and timing-dependent. But there is no basis to think production
is immune.
What works (calibration — the bakery is real)
story_mesh)Evidence
2-node Spock mesh (
MESH=on, db1=snowflake 691, db2=109), one sharedLakekeeper + SeaweedFS. Fire
per-nodeconcurrent cold writes on both nodes.Decoupled table (iceberg-only
events_lake):peer_alive_window_ms=60000Tiered table (
events,is_iceberg_only=false,hot_table="public"._events,provisioned via the archiver, watermark replicated to the peer):
CatalogCommitConflicts (409)Client error (identical on both table types):
The
coldfront.claim_ackstrail shows tickets issued and acked by both nodes,so the claim machinery runs — the failure is the ordering guarantee not
holding, not a missing ack.
Reproduction
On a formed mesh (e.g.
ci/topo/mesh.sh --mode decoupledor--mode tiered)with a cold table registered on each node:
Root cause (hypothesis)
The wait loop in
coldfront._claim_iceberg_lock(
extension/coldfront/coldfront--1.0.sql:1844-1879) exits on:smaller ticket, and
walsender heartbeat within
coldfront.peer_alive_window_ms, default 5s).Both clauses are individually correct, and same-node serialization (a) works
in isolation (3-same-node → 3/3 clean). The failure appears when both nodes
have multiple outstanding claims at once — the cross-node ack/defer interaction
under real timing (async Spock replication lag, the gap between a claim
committing and the peer applying it) lets two commits reach Lakekeeper
together. The release path documents a related window
(
coldfront--1.0.sql:1889-1891: "the next writer may briefly proceed while ouriceberg commit is still in flight").
Both decoupled and tiered cold writes go through this same function
(
coldfront.ccold rewrite →_exec_iceberg_with_claim;coldfront--1.0.sql:740-745cold branch →_claim_iceberg_lock), which is whyboth reproduce identically.
Ruled out as causes:
snowflake.nodeconfig — it is per-instance by design; the alignmentcheck (
coldfront--1.0.sql:1782-1791) requiressnowflake.node == hashtext(spock_node_name) & 1023. Distinct per-connectionids would break clause (a) and the dead-peer mapping.
peer_alive_window_ms— raising it to 60s does not help (see table).This is bakery/protocol territory, which per repo policy is TLA+-model-first
(
docs/formal/); flagging rather than proposing a code change.CI coverage gap
ci/journey.sh:story_mesh(the mesh cross-node concurrency story) fires exactly1 writer per node. The 8-way concurrency story
(
story_decoupled_concurrency) runs on the vanilla topology (single node,long advisory lock), not the mesh. So multi-writer-per-node on the mesh is
untested for both table types.
Suggested next steps
supported mesh workload or explicitly out of scope.
≥2 nodes to the same cold table (decoupled and tiered), and extend the
TLA+ model to cover concurrent multi-claim-per-node ordering.
and consider surfacing a clearer error / built-in retry than a raw
Conflict_409.Environment
docker/Dockerfile.duckdb15(PG 18),MESH=oncoldfront.iceberg_async_parquet=on,coldfront.iceberg_bakery_patch=on(image defaults)archiver
examples/walkthrough) demo; thedemo itself stays at 1 writer/node/round (the proven-safe regime).