Scaling Beyond the 50MB Monolithic CRDT Limit #31

kavinsood · Maintainer · 2026-03-24T20:14:24Z

YAOS currently uses a monolithic vault model: one vault maps to one shared Y.Doc containing markdown content, metadata, folder
structure, blob references, and tombstones. This is a deliberate V1 choice, not an accident.

That architecture gives YAOS its strongest product properties:

atomic folder renames
cross-file structural consistency
simple sync semantics
a single shared collaboration surface
straightforward snapshotting and recovery

It also creates a real ceiling. The current docs describe roughly 40-50 MB of raw markdown text as a comfortable target for the
monolith. That is not a hard crash line, but it is the point where CPU cost, memory pressure, and mobile startup behavior become worth
treating as first-class design concerns.

This RFC does not propose re-architecting YAOS immediately.
It defines:

what problem we are actually solving
what the current architecture already guarantees
which scaling paths are real
which ones are mostly illusions
and what the pragmatic roadmap should be if the monolith stops being the right default

Why this RFC Exists

monolith.md already explains why YAOS chose the monolith.
That document is the rationale for the current design.

This RFC is different.

Its purpose is to answer the next question:

If YAOS succeeds and users push beyond the comfortable monolithic ceiling, what is the least-wrong path forward that preserves YAOS’s
core moat of atomic structural integrity?

This is a future-scaling decision framework, not a restatement of current architecture.

Motivation

There are three separate pressures here.

1. Server-side memory and compute ceilings

The current checkpoint+journal storage engine solved write amplification. It did not remove the cost of holding and operating on one
large in-memory Y.Doc.

Large vaults still pay for:

Y.encodeStateAsUpdate(...)
Y.encodeStateVector(...)
merge/apply work during load and reconnect
larger cold-start replay and checkpoint operations

2. Client-side startup and mobile cost

Even if transport and storage are efficient, a large monolithic update still has to be:

downloaded
parsed
applied
indexed into editor/runtime state

Mobile devices will feel this first.

3. Competitive ceiling

YAOS is intentionally optimized for normal human note vaults, not 20 GB archival datasets. That is fine. But if YAOS becomes the
default recommendation for serious PKM users, the “what happens when my vault gets huge?” question stops being theoretical and becomes
a product-boundary question.

Current Architecture

Today YAOS uses:

one shared Y.Doc per vault
markdown text as fileId -> Y.Text
metadata as CRDT maps
intentional markdown/blob tombstones to block resurrection
chunked checkpoint+journal persistence on the server
schema-version gating and local reset paths on the client

Important current properties:

folder renames are batched into a single transaction
file IDs stay stable across rename waves
deleted paths remain tombstoned to prevent stale offline resurrection
local cache reset already exists
full “nuclear reset” already exists
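The following dependency-free TypeScript sketch makes the rename property concrete. It assumes the structure is a path-to-file-ID map plus a reverse index. Plain Map objects stand in for the CRDT maps inside the Y.Doc, and renameFolder is a hypothetical helper, not YAOS code. The sketch shows that a folder rename only rewrites paths: every file ID survives unchanged, and all moves are committed together, the way a single Y.Doc transaction would publish them.

```typescript
// Hypothetical, dependency-free model of the vault's structural state.
// Plain Maps stand in for the CRDT maps that live inside the single Y.Doc.
type FileId = string;

interface VaultStructure {
  pathToId: Map<string, FileId>; // current path -> stable file ID
  idToPath: Map<FileId, string>; // reverse index, file ID -> current path
}

// Move every path under `fromPrefix` so it lives under `toPrefix`.
// Returns the number of files moved.
function renameFolder(vault: VaultStructure, fromPrefix: string, toPrefix: string): number {
  // Plan phase: compute every move before mutating anything.
  const moves: Array<[string, string, FileId]> = [];
  vault.pathToId.forEach((id, path) => {
    if (path.startsWith(fromPrefix)) {
      moves.push([path, toPrefix + path.slice(fromPrefix.length), id]);
    }
  });

  // Commit phase: remove all old keys before writing new ones, so a new path
  // that equals some old path is not erased by that old path's delete.
  for (const [oldPath] of moves) vault.pathToId.delete(oldPath);
  for (const [, newPath, id] of moves) {
    vault.pathToId.set(newPath, id);
    vault.idToPath.set(id, newPath); // the file ID itself never changes
  }
  return moves.length;
}
```

In the real monolith, the commit loop would run inside one doc.transact(...) call. Peers would then receive the whole rename wave as a single update rather than as a series of partial moves.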

That means YAOS already has two important ingredients for a future scaling plan:

a tested structural-consistency baseline
an existing UX precedent for “your local state is too stale, throw it away and resync”

Problem Statement

The scaling problem is often described too vaguely. In practice, there are three different kinds of “history” involved:

A. Storage history

This is the checkpoint+journal layer on the server.

YAOS already compacts this. That solved the old “rewrite the whole doc on every save” failure mode.

B. Yjs causal/history state

This is the in-memory CRDT state that grows with long-lived editing churn.

This is the real monolith ceiling.

C. YAOS application-level tombstones

These are explicit file/blob deletion markers stored in the CRDT to prevent resurrection from stale devices.

These are intentional correctness data. They are not the same thing as generic Yjs causal history, and they cannot be casually
vacuumed away.
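The following sketch shows why layer C cannot be vacuumed. It assumes tombstones are stored as explicit markers keyed by file ID. MarkdownIndex, applyRemoteUpsert, and the marker fields are hypothetical names chosen for illustration, and plain Maps stand in for the CRDT maps.

```typescript
// Hypothetical sketch: application-level tombstones as explicit deletion markers.
// Plain Maps stand in for CRDT maps; none of these names are YAOS APIs.
interface LiveFile {
  fileId: string;
  path: string;
  text: string;
}

interface Tombstone {
  fileId: string;
  path: string;
  deletedAt: number; // informational only in this sketch
}

class MarkdownIndex {
  readonly live = new Map<string, LiveFile>();
  readonly tombstones = new Map<string, Tombstone>();

  delete(fileId: string, now: number): void {
    const file = this.live.get(fileId);
    if (!file) return;
    this.live.delete(fileId);
    // The marker is retained indefinitely: dropping it would let a stale
    // device that never saw this delete re-create the file on its next sync.
    this.tombstones.set(fileId, { fileId, path: file.path, deletedAt: now });
  }

  // Returns false when the upsert targets a deleted file and is refused.
  applyRemoteUpsert(file: LiveFile): boolean {
    if (this.tombstones.has(file.fileId)) return false;
    this.live.set(file.fileId, file);
    return true;
  }
}
```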

Any RFC that talks about “garbage collection” must keep these three layers separate.

Critical Observation

A naive “vacuum” is not actually a vacuum.

It is tempting to think this works:

call Y.encodeStateAsUpdate(doc)
apply that update to a fresh Y.Doc
replace the old doc

That does not meaningfully reset causal history by itself.

In a local synthetic Yjs experiment, re-encoding and re-applying the same update preserved essentially the same encoded size, while
rebuilding a brand-new doc from the materialized final text shrank it dramatically. In other words:

checkpoint rewrite is storage compaction
snapshotting is backup/recovery
neither automatically implies CRDT-history vacuuming

If YAOS wants an actual epoch reset, it must rebuild semantic state into a causally new document, not just replay the old update into
a fresh shell.
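The distinction can be illustrated with a toy model, written in plain TypeScript rather than Yjs, in which each file carries both its final text and an opaque history list. materialize is a hypothetical helper. It keeps exactly the semantic state a rebuild would need (file IDs, paths, final text, and tombstones) and drops the history, which is the part that replaying the old update into a fresh doc would fail to remove.

```typescript
// Toy model, not Yjs: each file carries its visible text plus the edit
// history a CRDT would keep internally. All names here are hypothetical.
interface FileState {
  fileId: string;
  path: string;
  text: string;
  history: string[]; // stand-in for accumulated causal/history state (layer B)
}

interface Tombstone {
  fileId: string;
  path: string;
}

interface VaultState {
  files: FileState[];
  tombstones: Tombstone[];
}

interface MaterializedVault {
  files: Array<{ fileId: string; path: string; text: string }>;
  tombstones: Tombstone[];
}

// Extract only semantic state: file IDs, paths, final text, and deletion markers.
function materialize(vault: VaultState): MaterializedVault {
  return {
    // History is deliberately not copied; a rebuild starts a new causal era.
    files: vault.files.map(({ fileId, path, text }) => ({ fileId, path, text })),
    // Tombstones (layer C) are correctness data and must survive the rebuild.
    tombstones: vault.tombstones.map((t) => ({ ...t })),
  };
}
```

With Yjs, the rebuild would then create a new Y.Doc and insert each file's text into a fresh Y.Text, for example with ytext.insert(0, text). It would not call Y.applyUpdate with the output of Y.encodeStateAsUpdate(oldDoc), because that is the replay the observation above warns against.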

Non-Goals

This RFC does not propose:

replacing the monolith in V1
sacrificing atomic folder renames by default
claiming “infinite scale” for YAOS today
solving distributed sharding across multiple servers
treating LiveSync’s architecture as the target to copy
adding a high-complexity multiplexed subdocument architecture immediately

Design Principles

Any future scaling path must be judged against these rules:

Preserve atomic structural operations by default.
Never casually reintroduce deleted files from stale clients.
Prefer explicit operator-controlled escape hatches over magical background GC.
Use measured thresholds, not fear-driven refactors.
Separate “support larger vaults” from “support arbitrarily large vaults.”
Do not trade correctness for scale unless the tradeoff is explicit and user-visible.

Approaches Considered

1. Stay Monolithic, Add Instrumentation and Thresholds

This is the immediate path and should happen first regardless of everything else.

Add measurement for:

encoded CRDT bytes
checkpoint bytes
journal bytes
active markdown path count
tombstoned path count
startup sync duration
cold-start replay mode and journal size
client-side local load / provider sync timing

Also add user-facing thresholds:

healthy
warning
danger zone

This does not solve the ceiling, but it turns “50MB-ish” into an actual operational signal instead of a vibe.
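The following sketch shows how a measured value could be mapped onto those three states. It uses live markdown bytes as its only input purely for illustration, since the right metric is still an open question. The 40 MB and 50 MB cut-offs are placeholders taken from the comfort range quoted earlier, not tuned values.

```typescript
type VaultHealth = "healthy" | "warning" | "danger";

interface HealthThresholds {
  warnBytes: number;
  dangerBytes: number;
}

const MB = 1_000_000;

// Placeholder cut-offs based on the rough 40-50 MB comfort range, not measured values.
const DEFAULT_THRESHOLDS: HealthThresholds = {
  warnBytes: 40 * MB,
  dangerBytes: 50 * MB,
};

function classifyVault(markdownBytes: number, thresholds: HealthThresholds = DEFAULT_THRESHOLDS): VaultHealth {
  if (markdownBytes >= thresholds.dangerBytes) return "danger";
  if (markdownBytes >= thresholds.warnBytes) return "warning";
  return "healthy";
}
```

Keeping the thresholds as a parameter lets them be replaced once real measurements exist, without changing callers.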

Verdict: Mandatory first step.

2. Epoch-Fenced Rebuild

This is the most pragmatic short-term escape hatch.

Mechanically, this would mean:

materialize the current semantic vault state
rebuild it into a causally new Y.Doc
increment a syncEpoch
reject clients from older epochs
force stale clients to clear local cache and rehydrate from the new epoch

This is important:

the epoch cutover must be explicit
stale clients must not be allowed to delta-merge old causal history into the new epoch
the server and client protocol would need to carry epoch information, not just schema version
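The following is a minimal server-side sketch of the epoch check. It assumes a hypothetical handshake message that carries both schemaVersion and syncEpoch. The field names, the result shape, and the handling of a client that claims a newer epoch than the server are all assumptions made for illustration.

```typescript
// Hypothetical handshake payload sent by a client when it connects.
interface ClientHello {
  schemaVersion: number;
  syncEpoch: number;
}

type HandshakeResult =
  | { kind: "accept" }
  | { kind: "reset"; serverEpoch: number } // client must clear local cache and rehydrate
  | { kind: "reject"; reason: string };

function checkHandshake(hello: ClientHello, serverSchema: number, serverEpoch: number): HandshakeResult {
  // Simplified stand-in for the existing schema-version gate.
  if (hello.schemaVersion !== serverSchema) {
    return { kind: "reject", reason: "schema version mismatch" };
  }
  if (hello.syncEpoch < serverEpoch) {
    // Stale epoch: this client's causal history predates the rebuild, so no
    // delta sync is attempted; the only allowed path is reset and rejoin.
    return { kind: "reset", serverEpoch };
  }
  if (hello.syncEpoch > serverEpoch) {
    // A client ahead of the server suggests a restored or misconfigured server.
    return { kind: "reject", reason: "client epoch is ahead of server" };
  }
  return { kind: "accept" };
}
```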

What this buys us:

preserves monolithic semantics within an epoch
gives an operational reset button when a vault becomes unhealthy
avoids a full architectural rewrite as the first response

What it costs:

cold devices may require a reset: the client-side plugin will detect the epoch mismatch, automatically back up unsynced local changes to a recovery folder, and then pull the new epoch state
“automatic safe GC when all devices are past X” is not something YAOS can currently prove, because there is no durable per-device
acknowledged high-water-mark registry today
operator tooling and UX will need to explain epoch cutovers clearly
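The recovery-folder behavior described in the first cost above can be sketched as a pure planning step that runs before any local state is discarded. The LocalFile shape, the dirty flag, and the "YAOS Recovery" folder name are hypothetical.

```typescript
// Hypothetical shape of a local file as seen by the client plugin.
interface LocalFile {
  path: string;
  text: string;
  dirty: boolean; // has local edits the server never acknowledged
}

interface RecoveryPlan {
  backups: Array<{ path: string; text: string }>;
  clearLocalCache: boolean;
}

// Hypothetical folder name; the real location is a UX decision.
const RECOVERY_ROOT = "YAOS Recovery";

function planEpochRecovery(local: LocalFile[], oldEpoch: number, newEpoch: number): RecoveryPlan {
  const folder = `${RECOVERY_ROOT}/epoch-${oldEpoch}-to-${newEpoch}`;
  const backups = local
    .filter((f) => f.dirty)
    .map((f) => ({ path: `${folder}/${f.path}`, text: f.text }));
  // The caller writes `backups` to disk first and only then clears the local
  // CRDT cache and rehydrates, so unsynced work survives a failed rejoin.
  return { backups, clearLocalCache: true };
}
```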

The crucial nuance is that this is not “garbage collect in place.”
It is “rebuild and start a new causal era.”

Verdict: Best short-term scaling escape hatch.

3. Two-Tier Hybrid Model: Graph Doc + Leaf Docs

This is the strongest long-term scalable architecture currently on the table.

Structure:

a small monolithic Graph Doc contains file IDs, paths, metadata, structural state, blob refs, and deletion markers
each markdown file’s text lives in its own Leaf Doc
the client loads and subscribes only to the leaf docs it needs
the graph remains always-on and small
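The following sketch shows how a client might resolve a file under this split. It assumes the graph maps file IDs to path entries and that leaf texts are loaded lazily by file ID. The names are hypothetical, and plain Maps stand in for the Graph Doc and the Leaf Docs. The graph stays authoritative: a missing leaf yields an explicit pending state, and a leaf that belongs to a deleted file is ignored.

```typescript
// Hypothetical two-tier model. The graph is always loaded; leaves arrive lazily.
interface GraphEntry {
  path: string;
  deleted: boolean;
}

type FileView =
  | { status: "ready"; path: string; text: string }
  | { status: "pending"; path: string } // graph knows the file, leaf not synced yet
  | { status: "absent" }; // unknown or deleted according to the graph

function resolveFile(
  graph: Map<string, GraphEntry>,
  loadedLeaves: Map<string, string>,
  fileId: string,
): FileView {
  const entry = graph.get(fileId);
  if (!entry || entry.deleted) return { status: "absent" };
  const text = loadedLeaves.get(fileId);
  // Surfacing "pending" explicitly avoids rendering an unsynced leaf as an empty file.
  return text === undefined
    ? { status: "pending", path: entry.path }
    : { status: "ready", path: entry.path, text };
}
```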

What this preserves:

atomic folder/path structure inside the graph
stable file IDs
scalable text capacity by paging content docs in and out

What this gives up:

truly atomic multi-note content mutations across leaf docs
simple single-doc mental model
trivial server routing

What it requires:

multiplexed room/subscription transport
lazy load/unload policies
LRU or similar eviction for inactive leaf docs
consistency handling between graph updates and content updates
explicit handling for intermediate “tearing” states during partial sync
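For the eviction requirement, a minimal LRU over open leaf docs could look like the following. It relies on JavaScript's Map preserving insertion order. The dispose callback is a hypothetical hook that stands in for whatever teardown a real leaf would need, such as unsubscribing from its room and destroying the doc.

```typescript
// Minimal LRU cache for open leaf docs, keyed by file ID.
class LeafCache<T> {
  private readonly entries = new Map<string, T>();
  private readonly capacity: number;
  private readonly dispose: (id: string, leaf: T) => void;

  constructor(capacity: number, dispose: (id: string, leaf: T) => void) {
    this.capacity = capacity;
    this.dispose = dispose;
  }

  has(id: string): boolean {
    return this.entries.has(id);
  }

  get(id: string): T | undefined {
    const leaf = this.entries.get(id);
    if (leaf !== undefined) {
      // Re-insert to mark as most recently used (Map keeps insertion order).
      this.entries.delete(id);
      this.entries.set(id, leaf);
    }
    return leaf;
  }

  put(id: string, leaf: T): void {
    const existing = this.entries.get(id);
    if (existing !== undefined && existing !== leaf) this.dispose(id, existing);
    this.entries.delete(id);
    this.entries.set(id, leaf);
    while (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value as string;
      const evicted = this.entries.get(oldest) as T;
      this.entries.delete(oldest);
      this.dispose(oldest, evicted);
    }
  }
}
```

A real policy would probably also pin the leaves behind currently open editors so they are never evicted.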

This is probably the correct V2 architecture if YAOS ever needs to scale beyond the monolith while preserving its structural moat.

Verdict: Best long-term research direction.

4. Pure Per-File Sharding / Subdocuments Everywhere

This is the most obvious answer and the most dangerous to hand-wave.

Yes, it scales text capacity.
No, it does not preserve YAOS’s strongest guarantee.

Problems:

folder rename becomes multi-doc orchestration
cross-file mutations tear
partial sync exposes semantically invalid intermediate states
WebSocket/provider complexity rises sharply
native Yjs subdocuments do not magically solve transactionality
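The tearing problem can be shown with a toy simulation. It assumes each file's path lives in its own per-file doc, so a folder rename becomes one independent update per file. The helpers are hypothetical. isTorn checks whether the folder's files are visible under both the old and the new prefix at the same time, and it assumes no unrelated files already live under the new prefix.

```typescript
// Toy model of pure per-file sharding: each file's path lives in its own doc,
// so a folder rename is N independent updates with no shared transaction.
interface PathUpdate {
  fileId: string;
  newPath: string;
}

function renameAsPerFileUpdates(paths: Map<string, string>, from: string, to: string): PathUpdate[] {
  const updates: PathUpdate[] = [];
  paths.forEach((path, fileId) => {
    if (path.startsWith(from)) updates.push({ fileId, newPath: to + path.slice(from.length) });
  });
  return updates;
}

// What a peer sees after receiving only the first `received` per-file updates.
function applyPrefix(paths: Map<string, string>, updates: PathUpdate[], received: number): Map<string, string> {
  const view = new Map(paths);
  updates.slice(0, received).forEach((u) => view.set(u.fileId, u.newPath));
  return view;
}

// Torn: files of the one renamed folder are visible under both prefixes at once.
function isTorn(view: Map<string, string>, from: string, to: string): boolean {
  const visible = Array.from(view.values());
  return visible.some((p) => p.startsWith(from)) && visible.some((p) => p.startsWith(to));
}
```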

This may still be viable for a future “scale mode,” but it should not be the default path, and it should not be described as
equivalent to the current architecture.

Verdict: Not recommended as the first scale-up move.

Comparative View

Approach	Preserves atomic structure	Extends ceiling	Complexity	Recommended
Instrumentation + thresholds	Yes	No	Low	Yes, now
Epoch-fenced rebuild	Yes, within epoch	Medium	Medium	Yes, next
Graph + leaf docs	Mostly graph-level	High	High	Yes, research
Pure per-file/subdocs	No, not fully	High	High	Not first

Recommendation

The recommendation is:

1. Do not replace the monolith now

The current monolith is coherent, tested, and still the correct default for YAOS.

2. Add observability first

Before any refactor, teach YAOS to measure and report monolith health.

3. Build an epoch-fenced rebuild path as the first real escape hatch

This should be the first scaling feature that actually changes behavior.

Not because it is glamorous, but because it preserves YAOS’s strongest product property: atomic structural integrity.

4. Treat the Graph + Leaf model as V2 research

That is the serious long-term architecture if YAOS needs to serve much larger text vaults without giving up its identity.

Proposed Roadmap

Phase 1: Instrumentation

Add server and client metrics for:

encoded document size
checkpoint size
journal size
replay mode on load
active markdown paths
tombstoned markdown paths
startup sync duration
local IndexedDB load time
provider sync time

Add a debug/diagnostics surface that makes vault health visible.
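One way to collect the timing metrics from that list is a small phase timer with an injectable clock, which also makes it unit-testable. The Phase names are hypothetical labels for the startup sync, local IndexedDB load, and provider sync timings listed above.

```typescript
// Hypothetical labels for the Phase 1 timing metrics.
type Phase = "startupSync" | "localLoad" | "providerSync";

class StartupTimer {
  private readonly starts = new Map<Phase, number>();
  readonly durations = new Map<Phase, number>();
  private readonly now: () => number;

  // The clock is injectable so tests can drive time deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  start(phase: Phase): void {
    this.starts.set(phase, this.now());
  }

  // Records and returns the elapsed milliseconds for a started phase.
  end(phase: Phase): number {
    const t0 = this.starts.get(phase);
    if (t0 === undefined) throw new Error(`phase ${phase} was never started`);
    const elapsed = this.now() - t0;
    this.durations.set(phase, elapsed);
    return elapsed;
  }
}
```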

Phase 2: Danger-Zone UX

Add warning thresholds and user-facing messaging for large vaults.

Possible actions:

show current vault text footprint
warn when vault is entering a monolithic danger zone
suggest snapshots before risky operations
explain that old idle devices may need a reset after future maintenance

Phase 3: Epoch-Fenced Rebuild

Implement:

syncEpoch
epoch-aware handshake
stale-epoch rejection path
“reset local cache and rejoin” UX
operator-triggered rebuild flow
clear safety docs around what happens to stale devices

This should initially be manual and explicit.

Phase 4: Hybrid Research Track

Explore:

graph document scope
leaf document storage layout
multiplexed transport design
eviction policy for inactive content docs
consistency model for graph/content races
whether this is opt-in per vault or a future default for large vaults

Acceptance Criteria

This RFC should be considered meaningfully implemented when:

YAOS can measure monolith health instead of guessing
users get explicit visibility before a vault becomes unhealthy
operators have a safe rebuild path that does not silently corrupt semantics
stale clients cannot resurrect pre-epoch history into a rebuilt vault
current rename atomicity remains intact for the default path
the codebase has a documented research direction for post-monolith scaling

Open Questions

What exact metric should define the danger zone:
encoded CRDT bytes, live markdown bytes, startup latency, or some composite score?
Should epoch rebuild be:
manual only, suggested, or eventually automatic under strict thresholds?
In a hybrid model, what belongs in the Graph Doc:
just path/file ID mappings, or also metadata, blob refs, and tombstones?
If YAOS ever introduces a “scale mode,” should that be:
automatic, per-vault, or explicit at vault creation time?
