feat: stabilizing encoding #219

zxch3n · 2023-12-14T15:00:31Z

This PR implements a new encoding schema that is more extensible and more compact. It is also simpler, produces a smaller binary, and takes less effort to maintain. It is inspired by the Automerge encoding format.

The main motivation is extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, since that would make our WASM binary much larger. We need a stable and extensible encoding schema for v1.0.

This PR also exposes the ops that compose the current container state. For example, you can now quickly query which operation produced a certain character. The new snapshot encoding requires this behavior, so it is included in this PR.

Encoding Schema

Header

The header has 22 bytes.

(bytes 0-3) Magic Bytes: The encoding starts with loro as magic bytes.
(bytes 4-19) Checksum: MD5 checksum of the encoded data from byte 20 onward, stored as a 16-byte array. The magic bytes and checksum fields themselves are excluded when calculating the checksum.
(bytes 20-21) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
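
The layout above can be sketched in Rust as follows. This is a minimal illustration, not the code from this PR: the names encode, decode, Parsed, HeaderError, and toy_digest are hypothetical, and toy_digest is a deliberately trivial stand-in that is NOT MD5. A real implementation would pass an MD5 function in its place.

```rust
/// Hypothetical sketch of the 22-byte header: 4 magic bytes, 16 checksum bytes, 2 mode bytes.
const MAGIC: &[u8; 4] = b"loro";
const HEADER_LEN: usize = 22;

#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
    ChecksumMismatch,
}

#[derive(Debug, PartialEq)]
struct Parsed<'a> {
    mode: u16,
    body: &'a [u8],
}

/// Stand-in digest for illustration only. This is NOT MD5.
fn toy_digest(data: &[u8]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, b) in data.iter().enumerate() {
        out[i % 16] ^= b.wrapping_add(i as u8);
    }
    out
}

/// Builds `magic | checksum | mode | body`. The checksum covers everything
/// from byte 20 onward, so the magic bytes and the checksum itself are excluded.
fn encode(mode: u16, body: &[u8], digest: impl Fn(&[u8]) -> [u8; 16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + body.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&[0u8; 16]); // checksum placeholder, filled in below
    out.extend_from_slice(&mode.to_be_bytes()); // 2 bytes, big endian
    out.extend_from_slice(body);
    let sum = digest(&out[20..]);
    out[4..20].copy_from_slice(&sum);
    out
}

/// Validates the magic bytes and checksum, then returns the mode and the body.
fn decode<'a>(
    bytes: &'a [u8],
    digest: impl Fn(&[u8]) -> [u8; 16],
) -> Result<Parsed<'a>, HeaderError> {
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::TooShort);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    if digest(&bytes[20..]).as_slice() != &bytes[4..20] {
        return Err(HeaderError::ChecksumMismatch);
    }
    Ok(Parsed {
        mode: u16::from_be_bytes([bytes[20], bytes[21]]),
        body: &bytes[HEADER_LEN..],
    })
}
```

Taking the digest as a parameter keeps the sketch free of external crates; it also makes the covered byte range (20 onward) explicit in one place.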

Encode Mode: Updates

This mode encodes only the ops, that is, the history. Document states are excluded.

Like Automerge's format, we employ columnar encoding for operations and changes.

Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. Sorting operations by container first, however, makes the columns more compressible.
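
A rough sketch of why container-first ordering helps, assuming the sort key mentioned in commit fc7b149 (container idx, prop, lamport, peer id). The Op and OpColumns types, their field widths, and the run_count helper are hypothetical and only illustrate the idea; the real encoder is more involved.

```rust
/// Hypothetical, simplified op record; the field types are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct Op {
    container_idx: u32,
    prop: i32,
    lamport: u32,
    peer: u64,
}

/// One vector per field: the columnar layout.
#[derive(Default, Debug, PartialEq)]
struct OpColumns {
    container_idx: Vec<u32>,
    prop: Vec<i32>,
    lamport: Vec<u32>,
    peer: Vec<u64>,
}

/// Sorts ops by (container idx, prop, lamport, peer) and splits them into columns.
fn to_columns(mut ops: Vec<Op>) -> OpColumns {
    ops.sort_by_key(|op| (op.container_idx, op.prop, op.lamport, op.peer));
    let mut cols = OpColumns::default();
    for op in &ops {
        cols.container_idx.push(op.container_idx);
        cols.prop.push(op.prop);
        cols.lamport.push(op.lamport);
        cols.peer.push(op.peer);
    }
    cols
}

/// Counts runs of equal adjacent values. Fewer runs means a run-length
/// encoder has less to write for that column.
fn run_count<T: PartialEq>(col: &[T]) -> usize {
    if col.is_empty() {
        return 0;
    }
    1 + col.windows(2).filter(|w| w[0] != w[1]).count()
}
```

With ops from two containers interleaved in history order, the container column alternates on every row; after sorting, it collapses into one run per container.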

Encode Mode: Snapshot

This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.

Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
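
The two import paths can be sketched with a toy model. Everything here is hypothetical (Doc, Snapshot, ImportPath, import_snapshot), and the "CRDT" is a stand-in where the state is simply the characters ordered by op id. The point is only the branch: an empty document adopts the snapshot state directly, while a non-empty one treats the snapshot's ops as updates and recomputes.

```rust
use std::collections::BTreeMap;

/// Toy snapshot: the history (ops) plus the precomputed state.
struct Snapshot {
    ops: Vec<(u32, char)>,
    state: String,
}

/// Toy document. In this stand-in model the state is the chars ordered by op id.
#[derive(Default)]
struct Doc {
    ops: BTreeMap<u32, char>,
    state: String,
}

#[derive(Debug, PartialEq)]
enum ImportPath {
    FromState,
    AsUpdates,
}

impl Doc {
    fn import_snapshot(&mut self, snap: &Snapshot) -> ImportPath {
        if self.ops.is_empty() {
            // Empty doc: take the state as-is, no recalculation needed.
            self.ops = snap.ops.iter().copied().collect();
            self.state = snap.state.clone();
            ImportPath::FromState
        } else {
            // Non-empty doc: merge the snapshot's ops like ordinary updates,
            // then recompute the state from the merged history.
            for &(id, ch) in &snap.ops {
                self.ops.entry(id).or_insert(ch);
            }
            self.state = self.ops.values().collect();
            ImportPath::AsUpdates
        }
    }
}
```

Because the snapshot bytes already contain the ops in an updates-compatible form, the second branch does not need a separate decoder.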

Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
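
As an illustration of tagging the ops that make up the state, here is a toy map container. It assumes last-writer-wins ordering by (lamport, peer), which is an assumption of this sketch rather than something stated in the PR; SetOp, state_op_indices, and state_from_tagged are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical map "set" op.
#[derive(Clone, Debug, PartialEq)]
struct SetOp {
    lamport: u32,
    peer: u64,
    key: String,
    value: i64,
}

/// Returns the indices into `history` of the ops that currently win for each
/// key (assumed rule: the largest (lamport, peer) wins). These are the ops
/// that constitute the container state.
fn state_op_indices(history: &[SetOp]) -> Vec<usize> {
    let mut winner: BTreeMap<&str, usize> = BTreeMap::new();
    for (i, op) in history.iter().enumerate() {
        let key = op.key.as_str();
        let replace = match winner.get(key) {
            Some(&j) => (history[j].lamport, history[j].peer) < (op.lamport, op.peer),
            None => true,
        };
        if replace {
            winner.insert(key, i);
        }
    }
    winner.into_values().collect()
}

/// Rebuilds the state by reading only the tagged ops; no replay of the
/// full history is needed.
fn state_from_tagged(history: &[SetOp], tagged: &[usize]) -> BTreeMap<String, i64> {
    tagged
        .iter()
        .map(|&i| (history[i].key.clone(), history[i].value))
        .collect()
}
```

Since the values already live in the encoded ops, the snapshot only needs to record which ops are tagged instead of storing the state a second time.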

Performance Changes

Speed

Performance change compared to the main branch: https://app.warp.dev/block/PgfIexIuOyIgLvw8rUFd01

WASM binary size change

Name	WASM Size	Compressed WASM Size
Old	1.1MB	434KB
New	1.0MB	392KB

Benchmarks on Drawing Scenario

Comparing the drawing scenario simulation with the current main branch.

Note: Compression on updates has been removed in the new version, leading to larger update sizes. Users can now select their preferred compression algorithm in the application code. GZip achieves approximately a 1:2 compression rate for Loro's exported snapshots or updates.

New

Commit id: c5a9de9

task	action_size	peer_num	ops_num	changes_num	snapshot_size	updates_size	apply_duration	encode_snapshot_duration	encode_update_duration	decode_snapshot_duration	decode_update_duration
async draw	100	1	331	1	4331	3249	1.7253749999999999	0.444375	0.136125	0.270375	0.098417
async draw	1000	1	3183	1	41244	30924	8.089292	3.608	0.8943329999999999	1.706042	0.677125
async draw	10000	1	30425	1	412193	304779	51.759791	21.762458	5.855875	10.804625	4.471084
async draw	100	5	331	6	4306	3268	2.621417	0.211708	0.06962499999999999	0.125666	0.049458
async draw	1000	5	3183	40	41700	31300	24.57975	2.176625	0.7471669999999999	1.120708	0.450083
async draw	10000	5	30425	423	399620	298549	259.627834	25.277917000000002	7.533333	11.203167	4.434083
async draw	1000	10	3183	74	42253	31876	62.091167000000006	2.730542	1.007916	1.249708	0.485583
async draw	10000	10	30425	802	409375	307095	706.9671669999999	30.747792	12.96475	12.208833	4.6555
async draw	100000	10	300974	8020	4391001	3277488	7845.397167	412.351958	163.365458	197.313416	71.150375
async draw	100000	10	301649	8025	4393930	3279154	7741.7285	389.887416	164.647583	175.813834	58.29075
realtime draw	100	5	183	19	2611	2009	3.116041	0.158208	0.063375	0.092542	0.03625
realtime draw	1000	5	2224	225	30679	23151	36.664875	2.621333	0.7188329999999999	0.9147919999999999	0.350334
realtime draw	10000	5	19958	2078	279096	207387	340.255042	18.365584000000002	6.899958000000001	8.970832999999999	3.4971249999999996
realtime draw	1000	10	2224	247	30811	23345	88.070708	2.722125	0.7929999999999999	0.916875	0.355667
realtime draw	10000	10	19958	2267	280460	209220	906.6364169999999	28.438458999999998	9.982083	9.192499999999999	3.402541
realtime draw	100000	10	201536	22807	3020382	2250934	9471.699084000002	265.325334	97.806834	112.942625	42.071833000000005
realtime draw	100000	10	200147	22689	2996669	2231459	9421.214875	255.862458	109.221375	111.309667	39.806416999999996

Old

Commit id: 727b5c2

task	action_size	peer_num	ops_num	changes_num	snapshot_size	updates_size	apply_duration	encode_snapshot_duration	encode_update_duration	decode_snapshot_duration	decode_update_duration
async draw	100	1	331	1	6309	3196	2.3123750000000003	0.152416	0.068583	0.163917	0.043292
async draw	1000	1	3183	1	65346	30412	3.6163749999999997	0.857167	0.563375	1.0254999999999999	0.259208
async draw	10000	1	30425	1	689366	127846	33.938833	9.000291	27.895834	12.676708999999999	3.2680830000000003
async draw	100	5	331	6	5841	3195	2.337041	0.10370900000000001	0.059500000000000004	0.125	0.035042000000000004
async draw	1000	5	3183	40	64579	30964	23.193790999999997	0.886	0.666792	1.04525	0.287791
async draw	10000	5	30425	423	648862	134323	270.331667	10.477291	30.871875	14.26975	4.268083
async draw	1000	10	3183	74	64215	32056	73.371667	1.158	1.283958	1.5499159999999998	0.39008400000000004
async draw	10000	10	30425	802	656302	139013	768.7101250000001	8.794791	34.85925	13.085458	3.876375
async draw	100000	10	300974	8020	7473115	1375519	7457.90225	169.94775	390.182417	186.63025	43.447292
async draw	100000	10	301649	8025	7463445	1374789	7288.698917	144.877208	409.18995800000005	182.909583	54.961083
realtime draw	100	5	183	19	3246	1897	1.795458	0.070958	0.06179199999999999	0.08354199999999999	0.027834
realtime draw	1000	5	2224	225	46273	22680	21.180457999999998	0.574958	0.6417499999999999	0.82675	0.24195799999999998
realtime draw	10000	5	19958	2078	428831	203476	190.994834	7.591584	6.350334	8.236833	2.2846670000000002
realtime draw	1000	10	2224	247	45682	22777	47.032917	0.5868340000000001	0.743958	0.966208	1.161958
realtime draw	10000	10	19958	2267	429854	205141	459.522916	7.437708000000001	9.953875	8.139208	2.444708
realtime draw	100000	10	201536	22807	5020772	1001571	5092.416	98.40154199999999	260.887208	130.19129199999998	39.607583000000005
realtime draw	100000	10	200147	22689	4981825	998571	5026.834375	92.946041	256.480167	109.06908299999999	34.260958
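
To read the size columns at a glance, the following snippet computes new/old ratios for one matching pair of rows (async draw, action_size 100000, peer_num 10, ops_num 300974). The figures are copied from the New and Old tables; the constant and function names are made up for this example.

```rust
/// Sizes copied from the "async draw, 100000 actions, 10 peers, 300974 ops"
/// rows of the New and Old tables.
const NEW_SNAPSHOT: u64 = 4_391_001;
const OLD_SNAPSHOT: u64 = 7_473_115;
const NEW_UPDATES: u64 = 3_277_488;
const OLD_UPDATES: u64 = 1_375_519;

/// Size of the new encoding relative to the old one.
fn ratio(new: u64, old: u64) -> f64 {
    new as f64 / old as f64
}

/// Formats both ratios rounded to two decimals.
fn summary() -> String {
    format!(
        "snapshot: {:.2}x of old, updates: {:.2}x of old",
        ratio(NEW_SNAPSHOT, OLD_SNAPSHOT),
        ratio(NEW_UPDATES, OLD_UPDATES)
    )
}
```

For this row the new snapshot is about 0.59x the old size, while the new updates are about 2.38x the old size, which is consistent with the note that built-in compression on updates was removed.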

BREAKING CHANGE: encoding schema is changed

Leeeon233

What’s the change in document size?

feat: new encode method (reordered)

2e26a7d

zxch3n marked this pull request as draft December 14, 2023 15:02

zxch3n added 6 commits December 15, 2023 15:38

feat: add encode/decode mark start

edc03f9

refactor: remove InnerMapContent

b77f213

feat: decode map and list

6e0f8a7

refactor: extract decode op and encode op

b37a2fa

feat: extract header encode and decode, add checksum to encoding

60f65c6

BREAKING CHANGE: encoding schema is changed

fix: use encode reordered by default and pass all tests

c6ebe5f

Leeeon233 reviewed Dec 15, 2023

zxch3n added 21 commits December 17, 2023 13:33

refactor: rm encode schema version

436d8a7

it's representable in encode mode already

fix: warnings & add incompatible future encoding err

ff2742c

fix: should sort ops before encoding

fc7b149

based on - container idx - prop - lamport - peerid

feat: link ids to richtext state

c6279cc

feat: add opid to list state entries

714b889

feat: add last move op id to tree state

d1a8372

perf: rm counter and lamport from change encoding table

6b098ca

refactor: add to snapshot ops and from snapshot ops to states

a2dc734

refactor: refine id_int_map impl

740374c

refactor: refine state snapshot api

c51b953

refactor: make updates encoding more snapshot-friendly

df4d24b

refactor: refine updates encoder

65e1469

refactor: extract encode mod

4753de0

feat: impl new snapshot encoding (bk)

4cbdcb0

feat(utils): id_int_map now can get values in given spans

dff0bc8

feat: decode snapshot bk (buggy)

8c7b724

fix: snapshot encoding

62edc34

richtext state textchunk merge err

fix: list decode snapshot err

1e7ab1b

fix: empty state err

d0b5ab6

fix: tree snapshot encoding

8d0a16b

feat: decode snapshot by updates if cannot reset the current doc

c37f46d

zxch3n marked this pull request as ready for review December 28, 2023 10:16

zxch3n changed the title ~~WIP: feat: stabilizing encoding~~ feat: stabilizing encoding Dec 28, 2023

zxch3n requested a review from Leeeon233 December 28, 2023 10:38

chore: sort ops by prop first in snapshot mode

cdadc13

Leeeon233 approved these changes Dec 28, 2023

zxch3n and others added 11 commits December 28, 2023 22:13

test: add new fuzz target for the drawing task

68487c2

test: refine drawing task simulation

315215d

fix: a few issues related to comments

a88db6a

fix: type err

d9a5281

test: add import fuzz

bf6e765

test: add new fuzz target

c786e6d

fix: a pending change import issue

ec7f83c

refactor: reuse the action code in draw task

a6489f4

test: expand sync coverage

3ba3a49

fix: tree retreat last move op

314fab6

docs: update encoding readme

b83b92d

zxch3n added the refactor label Dec 29, 2023

zxch3n added 10 commits December 31, 2023 23:08

test: add a failed case

dda2856

fix: encoding issue and simplify

2e025eb

chore: update scripts

6ed516a

chore: add loro_value!() macro

d1d14ce

fix: encode/decode deep values

9eafe31

fix: fix a few warnings

b0105bd

fix: import pending when decoding snapshot

142f977

test: add minify utils to new fuzzing tests

5e80587

fix: should update frontiers of oplog after applying pending changes

61aaf7b

docs: add comments

b2d66a9

zxch3n merged commit bc27a47 into main Jan 2, 2024
1 check passed

zxch3n deleted the feat-encode-stable branch January 2, 2024 09:03

zxch3n mentioned this pull request Jan 24, 2024

Add benchmarks for Loro dmonad/crdt-benchmarks#24

Merged
