Today `ShardCoordinator` keeps the remembered-entities list in memory: `entitiesPerShard` is rebuilt from `EntityStarted` / `EntityStopped` notifications during the cluster's lifetime. That works while the cluster runs, but a full cluster restart (deploy, outage) loses the entire entity registry — `rememberEntities: true` becomes a no-op until messages re-arrive.
Strategy: persist `entitiesPerShard` to the existing journal under a well-known persistenceId (`sharding-coordinator-{typeName}`). In the coordinator's `preStart`, replay the journal to rebuild the registry before processing the first message. This builds on issue #39 (Persistent ShardCoordinator state) but is more narrowly scoped: it covers only the entity registry, not the full allocation map.
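
To make the replay step concrete, here is a minimal sketch of folding persisted deltas back into an `entitiesPerShard` map. The `EntityDelta` shape and the helper names (`applyDelta`, `replayDeltas`, `coordinatorPersistenceId`) are assumptions for illustration, not the existing coordinator API, and the journal is modelled as a plain array.

```typescript
// Hypothetical delta events, named after the EntityStarted / EntityStopped notifications.
type EntityDelta =
  | { kind: "EntityStarted"; shardId: string; entityId: string }
  | { kind: "EntityStopped"; shardId: string; entityId: string };

// shardId -> set of remembered entityIds
type EntitiesPerShard = Map<string, Set<string>>;

// Well-known persistenceId, following the format proposed in the strategy.
function coordinatorPersistenceId(typeName: string): string {
  return `sharding-coordinator-${typeName}`;
}

// Fold a single delta into the registry, dropping shards that become empty.
function applyDelta(registry: EntitiesPerShard, delta: EntityDelta): void {
  const entities = registry.get(delta.shardId) ?? new Set<string>();
  if (delta.kind === "EntityStarted") {
    entities.add(delta.entityId);
  } else {
    entities.delete(delta.entityId);
  }
  if (entities.size > 0) {
    registry.set(delta.shardId, entities);
  } else {
    registry.delete(delta.shardId);
  }
}

// Rebuild the registry from the journal, in the order the deltas were persisted.
function replayDeltas(deltas: readonly EntityDelta[]): EntitiesPerShard {
  const registry: EntitiesPerShard = new Map();
  for (const delta of deltas) {
    applyDelta(registry, delta);
  }
  return registry;
}

// Example: two entities start on shard-1, then one of them stops.
const journal: EntityDelta[] = [
  { kind: "EntityStarted", shardId: "shard-1", entityId: "order-1" },
  { kind: "EntityStarted", shardId: "shard-1", entityId: "order-2" },
  { kind: "EntityStopped", shardId: "shard-1", entityId: "order-1" },
];
const rebuilt = replayDeltas(journal);
const persistenceId = coordinatorPersistenceId("Order");
```

Persisting deltas rather than full snapshots keeps each write small; the trade-off is that replay time grows with journal length, so a real implementation would probably want periodic snapshots or compaction.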
Components:
| File | Task |
| --- | --- |
| `src/cluster/sharding/ShardCoordinator.ts` | On `handleEntityStarted` / `handleEntityStopped`, persist a delta to the journal. |
| `src/cluster/sharding/RememberEntitiesStore.ts` (new) | Tiny abstraction so users can plug in a custom backend (default: same Journal as `PersistentActor`). |
| `tests/multi-node/sharding-remember-entities.test.ts` (new) | Multi-node test: spawn entities, full cluster restart, entities recreated on the new coordinator without user messages. |
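
One possible shape for the `RememberEntitiesStore` abstraction is sketched below. The method names (`append`, `load`), the `EntityDelta` type, and the in-memory backend are all assumptions for illustration; the default backend described above would delegate to the same Journal as `PersistentActor`, whose API is not shown here.

```typescript
// Hypothetical delta events, named after the EntityStarted / EntityStopped notifications.
type EntityDelta =
  | { kind: "EntityStarted"; shardId: string; entityId: string }
  | { kind: "EntityStopped"; shardId: string; entityId: string };

// Assumed pluggable interface: append-only writes plus a full read for replay.
interface RememberEntitiesStore {
  append(persistenceId: string, delta: EntityDelta): Promise<void>;
  load(persistenceId: string): Promise<readonly EntityDelta[]>;
}

// In-memory backend, useful for unit tests only: it does not survive a restart.
class InMemoryRememberEntitiesStore implements RememberEntitiesStore {
  private readonly logs = new Map<string, EntityDelta[]>();

  async append(persistenceId: string, delta: EntityDelta): Promise<void> {
    const log = this.logs.get(persistenceId) ?? [];
    log.push(delta);
    this.logs.set(persistenceId, log);
  }

  async load(persistenceId: string): Promise<readonly EntityDelta[]> {
    // Return a copy so callers cannot mutate the stored log.
    return [...(this.logs.get(persistenceId) ?? [])];
  }
}
```

Keeping the interface to two methods means a custom backend (a SQL table, a key-value store) only has to provide ordered appends and an ordered read per persistenceId.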
Estimate: 3-4 days.
Verification:
- Restart the entire cluster: the new coordinator loads the persisted entity registry and re-issues `RememberedEntities` to the regions, which respawn entities.
- Stress test: 10k entities across 3 nodes, restart cluster, every entity is reachable in < 10 s.
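
The "reachable in < 10 s" criterion could be checked with a small polling helper like the sketch below. The `ReachabilityProbe` callback is hypothetical (in the real test it might send a ping to each entity through its region), and `unreachableAfter` is an illustrative name, not an existing test utility.

```typescript
// Hypothetical probe: resolves true once the entity responds.
type ReachabilityProbe = (entityId: string) => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until every entity is reachable or the deadline passes.
// Returns the ids still unreachable at the end; an empty array means the check passed.
async function unreachableAfter(
  entityIds: readonly string[],
  probe: ReachabilityProbe,
  timeoutMs: number,
  pollIntervalMs = 100,
): Promise<string[]> {
  const deadline = Date.now() + timeoutMs;
  let pending = entityIds.slice();
  while (pending.length > 0) {
    // Probe only the entities that have not answered yet.
    const results = await Promise.all(pending.map((id) => probe(id)));
    pending = pending.filter((_, i) => !results[i]);
    if (pending.length === 0 || Date.now() >= deadline) {
      break;
    }
    await sleep(pollIntervalMs);
  }
  return pending;
}
```

In the stress test this would be called with the 10k entity ids and `timeoutMs` set to `10_000` right after the cluster restart, asserting that the returned array is empty.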
Out of scope: