
feat(state-keeper): implement asynchronous RocksDB cache #1256

Merged
58 commits merged into main from daniyar-pla-747-state-keeper-async-cache on Mar 27, 2024

Conversation

itegulov (Contributor)

What ❔

This PR adds a stateful storage abstraction, CachedStorage, that encapsulates both RocksDB and Postgres behind short-lived Box<dyn ReadStorage> connections. Internally, CachedStorage tracks whether RocksDB is behind Postgres and, if so, catches it up in the background.
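Roughly, the moving parts look like this (a minimal sketch; apart from CachedStorage and ReadStorage, the names below are illustrative rather than the PR's actual code):

```rust
use tokio::task::JoinHandle;

// Stand-in for the real `zksync_state::ReadStorage` trait.
trait ReadStorage: Send {}

/// Sketch of the cache's internal state machine (illustrative variant and
/// field names; the PR's own definitions may differ).
enum CachedStorageState {
    /// We don't yet know whether RocksDB is up to date with Postgres.
    NotInitialized,
    /// RocksDB is behind; a background task is replaying the missing batches.
    CatchingUp {
        catch_up_task: JoinHandle<anyhow::Result<()>>,
    },
    /// RocksDB is up to date and can serve reads directly.
    CaughtUp,
}

struct CachedStorage {
    state: CachedStorageState,
}

impl CachedStorage {
    /// Hands out a short-lived read-only handle: RocksDB-backed when the
    /// cache has caught up, Postgres-backed otherwise.
    async fn access_storage(&mut self) -> anyhow::Result<Box<dyn ReadStorage>> {
        match self.state {
            CachedStorageState::CaughtUp => todo!("return a RocksDB-backed handle"),
            _ => todo!("return a Postgres-backed handle; spawn catch-up if needed"),
        }
    }
}
```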

Apologies if the code is a little hairy - I tried to preserve the existing invariants as much as possible to limit this change's impact on other parts of the codebase.

Tested this locally by running load tests for 3 hours (~3k miniblocks), deleting the RocksDB state and then restarting. It seems to work fine, but the cache catches up very quickly and I am not sure how to generate enough data for the catch-up to take meaningful time. So I am hoping to test this on testnet/staging with one of the external nodes.

I am also trying to come up with a way to unit test this, so that is WIP for now. In the meantime, any feedback is welcome!

Note: a metric for the RocksDB-Postgres lag already exists

Why ❔

The main design goal here was to ensure liveness of the system by providing a way to get a ReadStorage implementation while avoiding blocking operations as much as possible. If RocksDB ever gets wiped for whatever reason, the State Keeper will keep working on Postgres-backed ReadStorage connections for a few hours while CachedStorage catches RocksDB up.
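Concretely, the state keeper never waits on the cache. Here is a usage sketch building on the types above (a hypothetical driver loop, not the PR's actual integration):

```rust
// Each iteration asks `CachedStorage` for whichever backend is currently
// available, so a wiped RocksDB never stalls batch execution.
async fn run_state_keeper(mut storage: CachedStorage) -> anyhow::Result<()> {
    loop {
        let read_storage = storage.access_storage().await?;
        // ... execute the next L1 batch against `read_storage` ...
        drop(read_storage); // handles are short-lived by design
    }
}
```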

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.
  • Spellcheck has been run via zk spellcheck.
  • Linkcheck has been run via zk linkcheck.

@itegulov itegulov marked this pull request as draft February 27, 2024 06:16
@RomanBrodetski (Collaborator) left a comment

I need to look deeper, but I really like it at first glance - it's very well abstracted from the state keeper.

I wanted to take a step back and think about the CachedStorageState state transitions.

  • Maybe naive, but do we really need the "not initialized" status? Can we start catching up and determine the correct state in `new` already?
  • Note that during state keeper operation, the cache cannot fall behind by more than one block: while the state keeper is running, there is no other process that could add blocks to the L1 batches table. This is not the case on startup - there we might be way behind, either because (1) the SSD failed and was replaced, or (2) we are recovering from a snapshot. Thus, I think we don't need logic that switches from RocksDB to Postgres - only the other way around. With this, I think we can simplify the logic (see the sketch after this comment).

I might well be off here - also curious what @slowli thinks
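A sketch of the simplification being suggested, assuming the state is decided eagerly in the constructor (all names illustrative; this is the reviewer's idea, not the PR's final code):

```rust
/// Hypothetical check comparing RocksDB's last processed L1 batch with
/// Postgres's head.
async fn rocksdb_is_at_postgres_head() -> anyhow::Result<bool> {
    todo!()
}

impl CachedStorage {
    /// Decide the initial state in `new` instead of keeping a separate
    /// `NotInitialized` variant; the only remaining transition is then
    /// CatchingUp -> CaughtUp.
    async fn new(/* pool, RocksDB path, ... */) -> anyhow::Result<Self> {
        let state = if rocksdb_is_at_postgres_head().await? {
            CachedStorageState::CaughtUp
        } else {
            CachedStorageState::CatchingUp {
                catch_up_task: tokio::spawn(async {
                    // Replay the missing L1 batches from Postgres into RocksDB.
                    Ok(())
                }),
            }
        };
        Ok(Self { state })
    }
}
```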

(Four review threads on core/lib/zksync_core/src/state_keeper/cached_storage.rs, since resolved.)
@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch 3 times, most recently from ae0399e to 438cb40, March 4, 2024 09:03
@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch 2 times, most recently from 8f9de62 to 8b91205, March 4, 2024 10:44
@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch from bd74c9b to 7294f16, March 5, 2024 03:40
@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch from 7294f16 to 74a2c1e, March 5, 2024 03:55
@itegulov itegulov requested a review from slowli March 13, 2024 06:25
RomanBrodetski previously approved these changes Mar 13, 2024

@RomanBrodetski (Collaborator) left a comment
Looks great! Please wait for @slowli's final approval before merging.

@slowli (Contributor) left a comment

Looks good; apart from spawning the syncing task, I only have minor nits.

@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch from 397c801 to f6516bb, March 16, 2024 06:47
@itegulov itegulov requested a review from slowli March 19, 2024 08:12
slowli previously approved these changes Mar 26, 2024

@slowli (Contributor) left a comment

Very nice.

@itegulov itegulov force-pushed the daniyar-pla-747-state-keeper-async-cache branch from 354839d to 92cbc0c, March 26, 2024 22:01
Contributor

No performance difference detected (anymore)

@itegulov itegulov added this pull request to the merge queue Mar 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2024
@itegulov itegulov added this pull request to the merge queue Mar 27, 2024
Merged via the queue into main with commit da41f63 Mar 27, 2024
29 checks passed
@itegulov itegulov deleted the daniyar-pla-747-state-keeper-async-cache branch March 27, 2024 10:01
RomanBrodetski pushed a commit that referenced this pull request Mar 28, 2024
🤖 I have created a release *beep* *boop*
---


## [22.1.0](core-v22.0.0...core-v22.1.0) (2024-03-28)


### Features

* Drop prover tables in core database ([#1436](#1436)) ([0d78122](0d78122))
* **en:** consistency checker persistent cursor ([#1466](#1466)) ([03496e6](03496e6))
* **en:** Make snapshot syncing future-proof ([#1441](#1441)) ([8c26a7a](8c26a7a))
* **genesis:** Using genesis config only during the genesis ([#1423](#1423)) ([4b634fd](4b634fd))
* **node_framework:** Add a task to handle sigint ([#1471](#1471)) ([2ba6527](2ba6527))
* **node-framework:** Add circuit breaker checker layer to framework ([#1452](#1452)) ([2c7a6bf](2c7a6bf))
* **prover:** export prover traces through OTLP ([#1427](#1427)) ([16dce75](16dce75))
* sigint initialization only after snapshots is applied ([#1356](#1356)) ([c7c7356](c7c7356))
* Split witness generator timeout configs by round ([#1505](#1505)) ([8074d01](8074d01))
* **state-keeper:** implement asynchronous RocksDB cache ([#1256](#1256)) ([da41f63](da41f63))
* **state-keeper:** Refactor persistence in `StateKeeper` ([#1411](#1411)) ([e26091a](e26091a))
* **state-keeper:** Remove `WitnessBlockState` generation from state keeper ([#1507](#1507)) ([8ae0355](8ae0355))
* Switch contract verification API to axum and get rid of actix-web usage ([#1467](#1467)) ([e7a9d61](e7a9d61))


### Bug Fixes

* **api:** `filters_disabled` should only affect HTTP endpoints ([#1493](#1493)) ([8720568](8720568))
* **api:** Fix API server shutdown flow ([#1425](#1425)) ([780f6b0](780f6b0))
* **prover:** Remove FriProtocolVersionId ([#1510](#1510)) ([6aa51b0](6aa51b0))
* **prover:** Remove redundant LoadingMode ([#1496](#1496)) ([e7583f4](e7583f4))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).