Mesh R-A bakery: concurrent multi-writer-per-node cross-node cold writes race → Conflict_409 (decoupled + tiered; CI covers only 1 writer/node)

## Summary

On a Spock **mesh**, the armed Ricart-Agrawala bakery does **not** serialize
**multiple concurrent writers per node, on ≥2 nodes at once**, to the same
cold table. Some commits hit `Conflict_409` (`CatalogCommitConflicts`) at
Lakekeeper — the exact outcome the bakery exists to prevent — and the losing
writes error out (rows lost unless the app retries).

This affects **any cold write through the mesh bakery** — reproduced on **both
a decoupled (iceberg-only) table and a genuinely tiered table** (`is_iceberg_only=false`,
`hot_table` set). Hot-tier writes (Postgres heap) are unaffected.

The single-writer-per-node case is solid. This is a **coverage gap**: the mesh
bakery has never been tested above 1 concurrent writer per node.

Related to but **distinct from #17 Bug 2** — that was the *vanilla single-node*
advisory-lock path on *slow Azure ADLS*. This is the *armed R-A mesh* path
(`v_armed = true`) on *fast local SeaweedFS*, with multiple writers per node
across nodes. #17 also noted the slower the storage round-trip, the worse the
race — which is why this matters more in production, not less (below).

## Scope: is this local-only, or real in production?

**Real in production — likely more readily than locally.**

- The 409 is **Lakekeeper's optimistic-concurrency check**
  (`CatalogCommitConflicts` / "snapshot has changed") — how *every* Iceberg
  REST catalog enforces atomic commits. Backend- and deployment-independent;
  identical against real S3/Azure/GCS.
- **#17 Bug 2 already hit the same `CatalogCommitConflicts` family on real
  Azure ADLS**, and observed it was *hidden* on fast local SeaweedFS and
  *exposed* on slower storage. So this is not a SeaweedFS/emulator artifact.
- In a real distributed deployment, nodes sit across regions/clouds → **wider
  Spock replication lag** → the window between "my claim committed" and "peer
  applied + acked" is *larger*, so the race is at least as likely.
- Caveat: reproduced locally (2 nodes, one host); production *frequency* is
  unquantified and timing-dependent. But there is no basis to think production
  is immune.

## What works (calibration — the bakery is real)

| Workload | Result |
|---|---|
| Write on node A, read on node B (cross-node visibility) | ✅ |
| 1 writer per node, concurrent, cross-node, same table | ✅ (this is `story_mesh`) |
| N concurrent writers on a **single** node | ✅ (clause (a) same-node-min serializes them) |
| **N writers per node, on ≥2 nodes, same cold table, at once** | ❌ **Conflict_409** |

## Evidence

2-node Spock mesh (`MESH=on`, db1=snowflake 691, db2=109), one shared
Lakekeeper + SeaweedFS. Fire `per-node` concurrent cold writes on both nodes.

**Decoupled table** (iceberg-only `events_lake`):

| Test | Result |
|---|---|
| 1 writer per node × 5 rounds | 10/10 land, 0 err |
| 3 writers, same node only | 3/3 land, 0 err |
| **5 writers per node, cross-node** | **7/10 land, 12 err** |
| 3 writers per node, cross-node | 5/6 land, 4 err |
| 5 writers per node, cross-node, `peer_alive_window_ms=60000` | **7/10, 12 err** (raising the window does not help) |

**Tiered table** (`events`, `is_iceberg_only=false`, `hot_table="public"._events`,
provisioned via the archiver, watermark replicated to the peer):

| Test | Result |
|---|---|
| 1 cold write per node, cross-node | 0 err |
| **5 cold writes per node, cross-node** | **12 err — same `CatalogCommitConflicts (409)`** |

Client error (identical on both table types):

```
ERROR:  (PGDuckDB/DuckdbXactCallback_Cpp) TransactionContext Error: Failed to commit:
Failed to commit Iceberg transaction: Request to '.../transactions/commit' returned a
non-200 status code (Conflict_409).
 message: CatalogCommitConflicts, context: { expected: <snap>, found: <snap> }
   => Requirement failed: Branch or tag `main`'s snapshot has changed
 type: CatalogCommitConflicts
 reason: Conflict
```

The `coldfront.claim_acks` trail shows tickets issued and acked by both nodes,
so the claim machinery runs — the failure is the ordering guarantee not
holding, not a missing ack.

## Reproduction

On a formed mesh (e.g. `ci/topo/mesh.sh --mode decoupled` or `--mode tiered`)
with a cold table registered on each node:

```bash
# fire >=3 concurrent writers PER node, on >=2 nodes, to the SAME table, at once.
# (tiered: use a ts older than the archive_watermark cutoff so the write routes cold)
for i in $(seq 1 5); do
  psql -h db1 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);" &
  psql -h db2 -U coldfront -d coldfront -c "INSERT INTO <cold_table> VALUES (...);" &
done
wait
# several clients fail with Conflict_409 / CatalogCommitConflicts; some rows never land
```

## Root cause (hypothesis)

The wait loop in `coldfront._claim_iceberg_lock`
(`extension/coldfront/coldfront--1.0.sql:1844-1879`) exits on:

- **(a) same-node-min** (1845–1849): no same-node claim on this table has a
  smaller ticket, and
- **(b) peer-ack** (1850–1877): every *alive* peer has acked my ticket (alive =
  walsender heartbeat within `coldfront.peer_alive_window_ms`, default 5s).

Both clauses are individually correct, and same-node serialization (a) works
in isolation (3-same-node → 3/3 clean). The failure appears when *both* nodes
have multiple outstanding claims at once — the cross-node ack/defer interaction
under real timing (async Spock replication lag, the gap between a claim
committing and the peer applying it) lets two commits reach Lakekeeper
together. The release path documents a related window
(`coldfront--1.0.sql:1889-1891`: "the next writer may briefly proceed while our
iceberg commit is still in flight").

Both decoupled and tiered cold writes go through this same function
(`coldfront.c` cold rewrite → `_exec_iceberg_with_claim`;
`coldfront--1.0.sql:740-745` cold branch → `_claim_iceberg_lock`), which is why
both reproduce identically.

**Ruled out as causes:**

- **Not `snowflake.node` config** — it is per-instance by design; the alignment
  check (`coldfront--1.0.sql:1782-1791`) requires
  `snowflake.node == hashtext(spock_node_name) & 1023`. Distinct per-connection
  ids would break clause (a) and the dead-peer mapping.
- **Not `peer_alive_window_ms`** — raising it to 60s does not help (see table).

This is bakery/protocol territory, which per repo policy is TLA+-model-first
(`docs/formal/`); flagging rather than proposing a code change.

## CI coverage gap

`ci/journey.sh:story_mesh` (the mesh cross-node concurrency story) fires exactly
**1 writer per node**. The 8-way concurrency story
(`story_decoupled_concurrency`) runs on the **vanilla** topology (single node,
long advisory lock), not the mesh. So multi-writer-per-node on the mesh is
untested for both table types.

## Suggested next steps

1. Decide whether multi-writer-per-node (across nodes, same cold table) is a
   **supported** mesh workload or explicitly out of scope.
2. If supported: add a mesh concurrency story that fires ≥3 writers per node on
   ≥2 nodes to the same cold table (decoupled **and** tiered), and extend the
   TLA+ model to cover concurrent multi-claim-per-node ordering.
3. If out of scope: document the per-node cold-write concurrency limit for mesh,
   and consider surfacing a clearer error / built-in retry than a raw
   `Conflict_409`.

## Environment

- 2-node Spock mesh, `docker/Dockerfile.duckdb15` (PG 18), `MESH=on`
- pg_duckdb 1.5.4 + patched duckdb-iceberg; `coldfront.iceberg_async_parquet=on`,
  `coldfront.iceberg_bakery_patch=on` (image defaults)
- SeaweedFS (local S3), shared Lakekeeper; tiered table provisioned via the
  archiver
- Surfaced while building the Distributed (`examples/walkthrough`) demo; the
  demo itself stays at 1 writer/node/round (the proven-safe regime).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mesh R-A bakery: concurrent multi-writer-per-node cross-node cold writes race → Conflict_409 (decoupled + tiered; CI covers only 1 writer/node) #35

Summary

Scope: is this local-only, or real in production?

What works (calibration — the bakery is real)

Evidence

Reproduction

Root cause (hypothesis)

CI coverage gap

Suggested next steps

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Workload	Result
Write on node A, read on node B (cross-node visibility)	✅
1 writer per node, concurrent, cross-node, same table	✅ (this is `story_mesh`)
N concurrent writers on a single node	✅ (clause (a) same-node-min serializes them)
N writers per node, on ≥2 nodes, same cold table, at once	❌ Conflict_409

Test	Result
1 writer per node × 5 rounds	10/10 land, 0 err
3 writers, same node only	3/3 land, 0 err
5 writers per node, cross-node	7/10 land, 12 err
3 writers per node, cross-node	5/6 land, 4 err
5 writers per node, cross-node, `peer_alive_window_ms=60000`	7/10, 12 err (raising the window does not help)

Test	Result
1 cold write per node, cross-node	0 err
5 cold writes per node, cross-node	12 err — same `CatalogCommitConflicts (409)`

Uh oh!

Mesh R-A bakery: concurrent multi-writer-per-node cross-node cold writes race → Conflict_409 (decoupled + tiered; CI covers only 1 writer/node) #35

Description

Summary

Scope: is this local-only, or real in production?

What works (calibration — the bakery is real)

Evidence

Reproduction

Root cause (hypothesis)

CI coverage gap

Suggested next steps

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions