feat(state-keeper): implement asynchronous RocksDB cache #1256
Conversation
I need to look deeper, but from a first glance I really like it - it's very well abstracted from the state keeper.

I wanted to take a step back and think about the `CachedStorageState` state transitions.

- Maybe naive, but do we really need a "not initialized" status? Can we start catching up and determine the correct state in `new` already?
- Note that during state keeper operation, the cache cannot fall behind by more than one block: while the state keeper is running, there is no other process that could add blocks to the L1 batches table. This is not the case on startup, where we might be way behind (reason (1): the SSD failed and we replaced it; reason (2): we are recovering from a snapshot). Thus, I think we don't need logic that would switch from RocksDB to Postgres - only the other way around. With this I think we can simplify the logic.

I might definitely be off here - also curious what @slowli thinks.
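The simplification proposed above could be sketched roughly as follows. This is purely illustrative (the names and the one-block tolerance check are assumptions, not code from the PR): the state is decided once in the constructor, and the only transition is Postgres → RocksDB.

```rust
// Hypothetical sketch of the simplified transitions suggested above:
// determine the state in `new` and only ever switch Postgres -> RocksDB.
#[derive(Debug, PartialEq)]
enum CachedStorageState {
    /// RocksDB lags Postgres (e.g. fresh SSD or snapshot recovery): serve
    /// reads from Postgres while a background task catches RocksDB up.
    CatchingUp,
    /// RocksDB is current. During normal operation it can lag by at most
    /// one block, so we never need to switch back to Postgres.
    RocksDb,
}

impl CachedStorageState {
    /// Decide the initial state in the constructor, avoiding a separate
    /// "not initialized" status.
    fn new(rocksdb_block: u64, postgres_block: u64) -> Self {
        if rocksdb_block + 1 >= postgres_block {
            Self::RocksDb
        } else {
            Self::CatchingUp
        }
    }
}
```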
Looks great! Please wait for @slowli's final approval before merging.
Looks good; apart from the point about spawning the syncing task, I only have minor nits.
Very nice.
No performance difference detected (anymore).
🤖 I have created a release *beep* *boop*

## [22.1.0](core-v22.0.0...core-v22.1.0) (2024-03-28)

### Features

* Drop prover tables in core database ([#1436](#1436)) ([0d78122](0d78122))
* **en:** consistency checker persistent cursor ([#1466](#1466)) ([03496e6](03496e6))
* **en:** Make snapshot syncing future-proof ([#1441](#1441)) ([8c26a7a](8c26a7a))
* **genesis:** Using genesis config only during the genesis ([#1423](#1423)) ([4b634fd](4b634fd))
* **node_framework:** Add a task to handle sigint ([#1471](#1471)) ([2ba6527](2ba6527))
* **node-framework:** Add circuit breaker checker layer to framework ([#1452](#1452)) ([2c7a6bf](2c7a6bf))
* **prover:** export prover traces through OTLP ([#1427](#1427)) ([16dce75](16dce75))
* sigint initialization only after snapshots is applied ([#1356](#1356)) ([c7c7356](c7c7356))
* Split witness generator timeout configs by round ([#1505](#1505)) ([8074d01](8074d01))
* **state-keeper:** implement asynchronous RocksDB cache ([#1256](#1256)) ([da41f63](da41f63))
* **state-keeper:** Refactor persistence in `StateKeeper` ([#1411](#1411)) ([e26091a](e26091a))
* **state-keeper:** Remove `WitnessBlockState` generation from state keeper ([#1507](#1507)) ([8ae0355](8ae0355))
* Switch contract verification API to axum and get rid of actix-web usage ([#1467](#1467)) ([e7a9d61](e7a9d61))

### Bug Fixes

* **api:** `filters_disabled` should only affect HTTP endpoints ([#1493](#1493)) ([8720568](8720568))
* **api:** Fix API server shutdown flow ([#1425](#1425)) ([780f6b0](780f6b0))
* **prover:** Remove FriProtocolVersionId ([#1510](#1510)) ([6aa51b0](6aa51b0))
* **prover:** Remove redundant LoadingMode ([#1496](#1496)) ([e7583f4](e7583f4))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
What ❔

This PR adds a stateful storage abstraction `CachedStorage` that encapsulates both RocksDB and Postgres behind short-lived `Box<dyn StorageRead>` connections. Internally, `CachedStorage` keeps track of whether RocksDB is behind Postgres and tries to catch it up in the background if so.

Apologies if the code is a little hairy; I tried to preserve the existing invariants as much as possible to limit the impact of this on other parts of the codebase.

Tested this locally by running load tests for 3 hours (~3k miniblocks), deleting the RocksDB state, and then restarting. It seems to be working fine, but it catches up really fast and I am not sure how to generate more data faster, so I am hoping to test this on testnet/staging with one of the external nodes.

I am also trying to come up with a way to unit test this, so that is WIP for now. In the meantime, any feedback is welcome!

Note: a metric for the RocksDB-Postgres lag already exists.
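The abstraction described above could look roughly like this. This is a minimal sketch, not the PR's actual code: the `StorageRead` trait, the backend structs, and the `rocksdb_synced` flag are all illustrative stand-ins for whatever the real `CachedStorage` uses internally.

```rust
// Minimal sketch of the abstraction described above: both backends hidden
// behind a common read trait, handed out as short-lived boxed connections.
trait StorageRead {
    fn read_value(&self, key: &str) -> Option<String>;
}

struct PostgresReader;
impl StorageRead for PostgresReader {
    fn read_value(&self, _key: &str) -> Option<String> {
        Some("from-postgres".into())
    }
}

struct RocksdbReader;
impl StorageRead for RocksdbReader {
    fn read_value(&self, _key: &str) -> Option<String> {
        Some("from-rocksdb".into())
    }
}

struct CachedStorage {
    rocksdb_synced: bool,
}

impl CachedStorage {
    /// Hand out a short-lived connection; callers never need to know which
    /// backend they are talking to.
    fn access_storage(&self) -> Box<dyn StorageRead> {
        if self.rocksdb_synced {
            Box::new(RocksdbReader)
        } else {
            Box::new(PostgresReader)
        }
    }
}
```

The point of the boxed trait object is that the rest of the state keeper is insulated from which backend is currently serving reads.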
Why ❔

The main design goal here was to ensure liveness of the system by providing a way to get a `ReadStorage` implementation while avoiding blocking operations as much as possible. If RocksDB ever gets wiped for whatever reason, the State Keeper will keep working on Postgres-backed `StorageRead` connections for a few hours while `CachedStorage` is trying to catch RocksDB up.

Checklist

- `zk fmt` and `zk lint`.
- `zk spellcheck`.
- `zk linkcheck`.
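The fallback behavior described under "Why ❔" can be sketched roughly as follows. This is illustrative only: plain threads and an atomic flag stand in for the async tasks and state tracking in the actual code, and all names are assumptions.

```rust
// Illustrative sketch (not the PR's actual code) of serving reads from
// Postgres while RocksDB catches up in the background.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

struct CachedStorage {
    rocksdb_up_to_date: Arc<AtomicBool>,
}

impl CachedStorage {
    fn new() -> Self {
        Self {
            rocksdb_up_to_date: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Spawn the catch-up task; reads keep flowing from Postgres meanwhile,
    /// so the state keeper never blocks on the cache being rebuilt.
    fn spawn_catch_up(&self) -> thread::JoinHandle<()> {
        let flag = Arc::clone(&self.rocksdb_up_to_date);
        thread::spawn(move || {
            // ...replay missed L1 batches from Postgres into RocksDB here...
            flag.store(true, Ordering::SeqCst);
        })
    }

    /// Which backend a new read connection would currently use.
    fn read_backend(&self) -> &'static str {
        if self.rocksdb_up_to_date.load(Ordering::SeqCst) {
            "rocksdb"
        } else {
            "postgres"
        }
    }
}
```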