Skip to content

Mesh R-A bakery: concurrent multi-writer-per-node cross-node cold writes race → Conflict_409 (decoupled + tiered; CI covers only 1 writer/node) #35

Description

@AntTheLimey

Summary

On a Spock mesh, the armed Ricart-Agrawala bakery does not serialize
multiple concurrent writers per node, on ≥2 nodes at once, to the same
cold table. Some commits hit Conflict_409 (CatalogCommitConflicts) at
Lakekeeper — the exact outcome the bakery exists to prevent — and the losing
writes error out (rows lost unless the app retries).

This affects any cold write through the mesh bakery — reproduced on both
a decoupled (iceberg-only) table and a genuinely tiered table
(is_iceberg_only=false,
hot_table set). Hot-tier writes (Postgres heap) are unaffected.

The single-writer-per-node case is solid. This is a coverage gap: the mesh
bakery has never been tested above 1 concurrent writer per node.

Related to but distinct from #17 Bug 2 — that was the vanilla single-node
advisory-lock path on slow Azure ADLS. This is the armed R-A mesh path
(v_armed = true) on fast local SeaweedFS, with multiple writers per node
across nodes. #17 also noted the slower the storage round-trip, the worse the
race — which is why this matters more in production, not less (below).

Scope: is this local-only, or real in production?

Real in production — likely more readily than locally.

  • The 409 is Lakekeeper's optimistic-concurrency check
    (CatalogCommitConflicts / "snapshot has changed") — how every Iceberg
    REST catalog enforces atomic commits. Backend- and deployment-independent;
    identical against real S3/Azure/GCS.
  • Azure ADLS Gen2 backend — 3 bugs found: app role LocalFileSystem, concurrent writer 409, compactor manifest format-version mismatch #17 Bug 2 already hit the same CatalogCommitConflicts family on real
    Azure ADLS
    , and observed it was hidden on fast local SeaweedFS and
    exposed on slower storage. So this is not a SeaweedFS/emulator artifact.
  • In a real distributed deployment, nodes sit across regions/clouds → wider
    Spock replication lag
    → the window between "my claim committed" and "peer
    applied + acked" is larger, so the race is at least as likely.
  • Caveat: reproduced locally (2 nodes, one host); production frequency is
    unquantified and timing-dependent. But there is no basis to think production
    is immune.

What works (calibration — the bakery is real)

Workload Result
Write on node A, read on node B (cross-node visibility)
1 writer per node, concurrent, cross-node, same table ✅ (this is story_mesh)
N concurrent writers on a single node ✅ (clause (a) same-node-min serializes them)
N writers per node, on ≥2 nodes, same cold table, at once Conflict_409

Evidence

2-node Spock mesh (MESH=on, db1=snowflake 691, db2=109), one shared
Lakekeeper + SeaweedFS. Fire per-node concurrent cold writes on both nodes.

Decoupled table (iceberg-only events_lake):

Test Result
1 writer per node × 5 rounds 10/10 land, 0 err
3 writers, same node only 3/3 land, 0 err
5 writers per node, cross-node 7/10 land, 12 err
3 writers per node, cross-node 5/6 land, 4 err
5 writers per node, cross-node, peer_alive_window_ms=60000 7/10, 12 err (raising the window does not help)

Tiered table (events, is_iceberg_only=false, hot_table="public"._events,
provisioned via the archiver, watermark replicated to the peer):

Test Result
1 cold write per node, cross-node 0 err
5 cold writes per node, cross-node 12 err — same CatalogCommitConflicts (409)

Client error (identical on both table types):

ERROR:  (PGDuckDB/DuckdbXactCallback_Cpp) TransactionContext Error: Failed to commit:
Failed to commit Iceberg transaction: Request to '.../transactions/commit' returned a
non-200 status code (Conflict_409).
 message: CatalogCommitConflicts, context: { expected: <snap>, found: <snap> }
   => Requirement failed: Branch or tag `main`'s snapshot has changed
 type: CatalogCommitConflicts
 reason: Conflict

The coldfront.claim_acks trail shows tickets issued and acked by both nodes,
so the claim machinery runs — the failure is the ordering guarantee not
holding, not a missing ack.

Reproduction

On a formed mesh (e.g. ci/topo/mesh.sh --mode decoupled or --mode tiered)
with a cold table registered on each node:

# fire >=3 concurrent writers PER node, on >=2 nodes, to the SAME table, at once.
# (tiered: use a ts older than the archive_watermark cutoff so the write routes cold)
for i in $(seq 1 5); do
  psql -h db1 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);" &
  psql -h db2 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);" &
done
wait
# several clients fail with Conflict_409 / CatalogCommitConflicts; some rows never land

Root cause (hypothesis)

The wait loop in coldfront._claim_iceberg_lock
(extension/coldfront/coldfront--1.0.sql:1844-1879) exits on:

  • (a) same-node-min (1845–1849): no same-node claim on this table has a
    smaller ticket, and
  • (b) peer-ack (1850–1877): every alive peer has acked my ticket (alive =
    walsender heartbeat within coldfront.peer_alive_window_ms, default 5s).

Both clauses are individually correct, and same-node serialization (a) works
in isolation (3-same-node → 3/3 clean). The failure appears when both nodes
have multiple outstanding claims at once — the cross-node ack/defer interaction
under real timing (async Spock replication lag, the gap between a claim
committing and the peer applying it) lets two commits reach Lakekeeper
together. The release path documents a related window
(coldfront--1.0.sql:1889-1891: "the next writer may briefly proceed while our
iceberg commit is still in flight").

Both decoupled and tiered cold writes go through this same function
(coldfront.c cold rewrite → _exec_iceberg_with_claim;
coldfront--1.0.sql:740-745 cold branch → _claim_iceberg_lock), which is why
both reproduce identically.

Ruled out as causes:

  • Not snowflake.node config — it is per-instance by design; the alignment
    check (coldfront--1.0.sql:1782-1791) requires
    snowflake.node == hashtext(spock_node_name) & 1023. Distinct per-connection
    ids would break clause (a) and the dead-peer mapping.
  • Not peer_alive_window_ms — raising it to 60s does not help (see table).

This is bakery/protocol territory, which per repo policy is TLA+-model-first
(docs/formal/); flagging rather than proposing a code change.

CI coverage gap

ci/journey.sh:story_mesh (the mesh cross-node concurrency story) fires exactly
1 writer per node. The 8-way concurrency story
(story_decoupled_concurrency) runs on the vanilla topology (single node,
long advisory lock), not the mesh. So multi-writer-per-node on the mesh is
untested for both table types.

Suggested next steps

  1. Decide whether multi-writer-per-node (across nodes, same cold table) is a
    supported mesh workload or explicitly out of scope.
  2. If supported: add a mesh concurrency story that fires ≥3 writers per node on
    ≥2 nodes to the same cold table (decoupled and tiered), and extend the
    TLA+ model to cover concurrent multi-claim-per-node ordering.
  3. If out of scope: document the per-node cold-write concurrency limit for mesh,
    and consider surfacing a clearer error / built-in retry than a raw
    Conflict_409.

Environment

  • 2-node Spock mesh, docker/Dockerfile.duckdb15 (PG 18), MESH=on
  • pg_duckdb 1.5.4 + patched duckdb-iceberg; coldfront.iceberg_async_parquet=on,
    coldfront.iceberg_bakery_patch=on (image defaults)
  • SeaweedFS (local S3), shared Lakekeeper; tiered table provisioned via the
    archiver
  • Surfaced while building the Distributed (examples/walkthrough) demo; the
    demo itself stays at 1 writer/node/round (the proven-safe regime).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions