Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds RocksDB support #2197

Merged
merged 14 commits into from Aug 28, 2019

Conversation

@wezrule
Copy link
Collaborator

commented Aug 4, 2019

This PR allows the node to use RocksDB instead of LMDB for the ledger by setting the CMake flag -DNANO_ROCKSDB=ON (config/flag options will be added later). There is no upgrade path, so bootstrapping will have to be done from scratch.

RocksDB does not allow getting counts (only estimates), due to the way LSM trees work, multiple puts to the same key will result in duplications before the compaction phase. From: google/leveldb#119 it is recommended to store counts separately which I've done in a separate cached_counts table.

The write transactions used by RocksDB will fail on commit () if there is a conflicting key being modified. It is now a requirement to state which ones are modified during a tx_begin_write () call.
Each column family has a lock, and if there can be concurrent modifications of the same keys then they must be added to the "requires lock" collection, otherwise they can be added to the "no lock required" collection. This now requires some thought, and it is also more visible what transactions are modifying. Confirmation height is one such table which does not require a lock, even though there is concurrent modification of the table, only new ones are added in 1 thread, and modifications to existing ones in another thread. There may be other ones like this, but they have not been investigated yet (a pessimistic approach was taken).

Notes: When backing up using --snapshot CLI option, it is currently set up to do incremental backups, which reduces copying the need to copy the whole database. However if deleting the original files then the backup directory should also be deleted otherwise there can be inconsistencies (Put in documentation?)

RocksDB is an external dependency as there was no clear way to incorporate it in our build.
Windows:
Recommended way:

  • add set (VCPKG_LIBRARY_LINKAGE static) to the top of %VCPKG_DIR%\ports\rocksdb\portfile.cmake
  • vcpkg install rocksdb:x64-windows

Ubuntu:
apt-get install librocksdb-dev
apt-get install zlib1g-dev (may be required if system does not have zlib installed already)
Or

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
make static_lib
make install

Note: If running into:
/usr/bin/ld: ../node/libnode.a(rocksdb.cpp.o):(.data.rel.ro._ZTIN7rocksdb11StackableDBE[_ZTIN7rocksdb11StackableDBE]+0x10): undefined reference to 'typeinfo for rocksdb::DB'
Then it may be necessary to build with USE_RTTI=1:
facebook/rocksdb#4329

Mac:
brew install rocksdb

For more instructions:
https://github.com/facebook/rocksdb/blob/master/INSTALL.md

The following CMake options can be used to specify where the RocksDB and zlib libraries are (another dependency):

ROCKSDB_INCLUDE_DIRS
ROCKSDB_LIBRARIES
ZLIB_LIBRARY
ZLIB_INCLUDE_DIR

I've noted some things that may/should be done on top of this:

  • Run TSAN/ASAN/Valgrind and if appropriate use suppressions (similarly to LMDB). TSAN/ASAN has been run on Clang/Ubuntu and there is 1 recurring TSAN error, however the same is reported here ethereum/aleth#4740 so I don't think there is necessarily anything to worry about. However I haven't been able to suppress it, but leaving that for another task.
  • Try WriteBatch instead of using write transactions (we do explicit locking ourselves)
  • Config options for user customizability. Memtable size, number of memtables, compaction threads, bloom filter memory usage etc.. Care must be taken to ensure backwards/forwards compatibility though.
  • RPC calls currently lock all tables, they should only lock necessary ones
  • Use read-modify-write merge operator available in RocksDB to update the new cached_counts table more efficiently without requiring separate get/put operations.
  • Debug mode on MSVC is very slow when creating column families, can anything be done about this?
  • Support in ci for running both LMDB/RocksDB tests (currently using only RocksDB)
  • Benchmarking/Profiling comparing to LMDB
  • Test snapshots using different RocksDB versions on another system
  • Try and reduce locking of column families if unnecessary.
  • Calling Delete has to do a get currently to make sure it exists, this may not be necessary in some places.
  • get always makes a copy
  • Some gets are one offs so they can be turned off for block caching.
  • Add db time tracker like we have for LMDB?
  • There is a way to merge results from 2 tables in RocksDB, rather than doing a lot of the stuff manually in rocksdb_iterator.hpp
  • Better suited options per column family, such as memtable size, compaction style etc.
  • Combine some of the smaller column families together and use prefix extraction to write/obtain the values.
  • Check if removing all peers in peers column family can be relegated to just MSVC/Debug again since all the new optimizations. Previously write commits also failed in ongoing_peer_store
  • Support automatic backups that were added in #2198

12 threads 3.4GHZ CPU, 16GB RAM, SSD. Time taken (only taking into account bootstrapping):

--- LMDB RocksDB
Full sync (live) 3h 15m 4h 15m

1vCPU, 2GB RAM, SSD. (Note: There was 6tps going on during later half of the LMDB sync)

--- LMDB RocksDB
3.5M checked blocks 7h 15m 5h 30m

Comparison

LMDB RocksDB
Tested with the node for many years Unreleased
1 file (data.ldb) 100+ files
15GB live ledger 11GB (after flushing WAL)
Not many options to configure Very configurable
Many optimization made already by us Many optimizations can be made in the future to improve
Part of the build process Required external dep (incl recent version 5.13?+ which may not be available in your system repo)
More I/O Less I/O (writes are flushed in bulk)

Some test notes, try:
Run tests with LMDB & RocksDB tests
Test snapshot/vacuum with RocksDB compaction
RPC calls with RocksDB
CLI write commands, e.g peer_clear should give an appropriate error message if the node is already running.

@wezrule wezrule added this to the V20.0 milestone Aug 4, 2019
@wezrule wezrule requested review from argakiig, SergiySW and cryptocode Aug 4, 2019
@wezrule wezrule self-assigned this Aug 4, 2019
@wezrule wezrule removed the incomplete label Aug 9, 2019
Copy link
Collaborator

left a comment

LGTM, tested with rocksdb 6.1.2 (brew)

Copy link
Collaborator

left a comment

lgtm, tested with vcpkg pulled rocksdb6.1.2 static with msvc-14.1x64

wezrule added 2 commits Aug 12, 2019
@SergiySW

This comment has been minimized.

Copy link
Collaborator

commented Aug 22, 2019

LGTM, tested with rocksdb 6.2.2 static git build, Debian 10

wezrule added 5 commits Aug 22, 2019
…tabase to explicitly flush them
@wezrule wezrule requested a review from SergiySW Aug 27, 2019
@wezrule wezrule merged commit bcb7f39 into nanocurrency:master Aug 28, 2019
1 of 2 checks passed
1 of 2 checks passed
continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@wezrule wezrule deleted the wezrule:add_rocksdb_support branch Aug 28, 2019
@zhyatt zhyatt referenced this pull request Aug 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
V20
Awaiting triage
4 participants
You can’t perform that action at this time.