fix(db): Fix root cause of RocksDB misbehavior #301

Merged

Conversation

slowli
Contributor

@slowli slowli commented Oct 24, 2023

What ❔

Fixes (hopefully) the root cause of the recent RocksDB misbehavior. Since the recent refactoring, background compactions / flushes are improperly cancelled each time a RocksDB instance is dropped, so they are constantly interrupted. (After the refactoring, RocksDB instances are Cloneable and act essentially as Arc wrappers; in particular, a new instance is created and dropped when processing each chunk of L1 batches.) The proper solution, implemented by this PR, is to cancel background work only when the internal, non-Arc'd DB instance is dropped.
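For illustration, here is a minimal sketch of the ownership pattern described above. It is not the code from this PR: `Db`, `DbInner`, `CancelCounter` and `simulate` are hypothetical names, and the real "cancel background work" call is replaced by an atomic counter so the example has no external dependencies. The point is only that `Drop` lives on the inner value behind the `Arc`, so short-lived clones no longer trigger a cancellation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Counts simulated "cancel all background work" calls.
/// Hypothetical stand-in for the real RocksDB call.
#[derive(Debug, Default)]
struct CancelCounter(AtomicUsize);

impl CancelCounter {
    fn get(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical inner, non-`Arc`'d database state.
struct DbInner {
    cancellations: Arc<CancelCounter>,
}

impl Drop for DbInner {
    fn drop(&mut self) {
        // Runs exactly once, when the last `Db` handle goes away.
        self.cancellations.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Cheaply cloneable handle; deliberately has no `Drop` impl of its own.
#[derive(Clone)]
struct Db {
    inner: Arc<DbInner>,
}

impl Db {
    fn new(cancellations: Arc<CancelCounter>) -> Self {
        Self {
            inner: Arc::new(DbInner { cancellations }),
        }
    }

    /// Number of live handles sharing the same inner instance.
    fn handle_count(&self) -> usize {
        Arc::strong_count(&self.inner)
    }
}

/// Simulates processing `chunks` chunks, each using a short-lived clone.
/// Returns the cancellation count while the original handle is still alive
/// and the count after the last handle has been dropped.
fn simulate(chunks: usize) -> (usize, usize) {
    let counter = Arc::new(CancelCounter::default());
    let db = Db::new(Arc::clone(&counter));
    for _ in 0..chunks {
        let per_chunk = db.clone();
        assert_eq!(per_chunk.handle_count(), 2);
        drop(per_chunk); // must not cancel background work
    }
    let while_alive = counter.get();
    drop(db); // last handle: the inner instance is dropped here
    (while_alive, counter.get())
}

fn main() {
    let (while_alive, after_drop) = simulate(3);
    println!("while alive: {while_alive}, after last drop: {after_drop}");
}
```

Had the `Drop` impl been placed on the cloneable `Db` handle instead, the counter would have been bumped once per chunk, which matches the misbehavior described above.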

Why ❔

That's a bug that makes RocksDB behave very unstably, as witnessed by recent issues with it.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Code has been formatted via zk fmt and zk lint.

@slowli slowli changed the title refactor(db): Make stall metrics more meaningful fix(db): Fix root cause of RocksDB misbehavior Oct 24, 2023
@slowli
Contributor Author

slowli commented Oct 24, 2023

In hindsight, the issue was obvious: recent RocksDB logs are littered with "Shutdown: canceling all background work" messages, which I couldn't attribute previously.

@codecov

codecov bot commented Oct 24, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (f048485) 35.63% compared to head (72f2ae7) 35.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
- Coverage   35.63%   35.58%   -0.06%     
==========================================
  Files         520      520              
  Lines       28350    28359       +9     
==========================================
- Hits        10103    10092      -11     
- Misses      18247    18267      +20     
Files Coverage Δ
core/lib/storage/src/metrics.rs 25.00% <0.00%> (-1.67%) ⬇️
core/lib/storage/src/db.rs 50.46% <45.45%> (-1.45%) ⬇️

... and 24 files with indirect coverage changes


@slowli slowli marked this pull request as ready for review October 25, 2023 06:04
@slowli slowli requested a review from a team as a code owner October 25, 2023 06:04
popzxc
popzxc previously approved these changes Oct 25, 2023
Member

@popzxc popzxc left a comment

I hoped that the PR would be titled RocksDB Wars: Episode IV -- The Stalled Writes Strike Back

Nice and certainly makes sense. If this works out, would it make sense to roll back the new code added in the previous 3 PRs? Less code is generally better than more code.

@slowli
Contributor Author

slowli commented Oct 25, 2023

@popzxc We have write stalls reported for the state keeper cache, so I think we should leave this logic as-is for now. The new RocksDB metrics also make sense, IMO; for one, they'll allow us to estimate whether this fix works.

@slowli slowli enabled auto-merge October 25, 2023 06:14
@slowli slowli disabled auto-merge October 25, 2023 06:14
@slowli slowli enabled auto-merge October 25, 2023 06:15
@slowli slowli requested a review from a team as a code owner October 25, 2023 06:54
@slowli slowli force-pushed the aov-pla-629-investigate-write-stalls-in-rocksdb-cleanup branch from 0adf69b to ef461d6 Compare October 25, 2023 08:04
auto-merge was automatically disabled October 25, 2023 10:54

Merge queue setting changed

@Deniallugo Deniallugo added this pull request to the merge queue Oct 26, 2023
Merged via the queue into main with commit d6c30ab Oct 26, 2023
25 checks passed
@Deniallugo Deniallugo deleted the aov-pla-629-investigate-write-stalls-in-rocksdb-cleanup branch October 26, 2023 11:49
github-merge-queue bot pushed a commit that referenced this pull request Oct 26, 2023
🤖 I have created a release *beep* *boop*
---


## [16.2.0](core-v16.1.0...core-v16.2.0) (2023-10-26)


### Features

* **basic_witness_producer_input:** Add Basic Witness Producer Input
component ([#156](#156))
([3cd24c9](3cd24c9))
* **core:** adding pubdata to statekeeper and merkle tree
([#259](#259))
([1659c84](1659c84))


### Bug Fixes

* **db:** Fix root cause of RocksDB misbehavior
([#301](#301))
([d6c30ab](d6c30ab))
* **en:** gracefully shutdown en waiting for reorg detector
([#270](#270))
([f048485](f048485))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).