
os/bluestore: [RFC] merge multiple TransContext into a batch to improve performance #34

Open
wants to merge 8 commits into wip-bluestore-kv-finisher

Conversation

@ifed01 commented May 15, 2017

This is a somewhat hacky approach to merge multiple TransContexts into a single batch and submit them in a single step. The code relies on RocksDB WriteBatch internals to merge multiple batches into one. Unfortunately, the more elegant solution of collecting operations outside of the RocksDB WriteBatch and then building it from scratch isn't beneficial from a performance POV, as it overburdens kv_sync_thread (see the code at https://github.com/ifed01/ceph/tree/wip-bluestore-large-batch-prod).
This PR is really a question of whether we want to proceed this way, along with a summary of the numbers obtained.
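
For reference, a minimal sketch of the batch-merging idea, written against the public WriteBatch accessors only. The 12-byte header (8-byte sequence number plus 4-byte little-endian record count) is an assumption about the internal batch format and may vary across RocksDB versions; merge_batches and kBatchHeader are names made up here, not what this PR's code actually uses (the PR works on WriteBatch internals directly).

```cpp
#include <cstdint>
#include <string>

#include <rocksdb/write_batch.h>

// Sketch only: append src's records onto dst's serialized representation and
// patch the record count. Assumes the usual WriteBatch layout of an 8-byte
// sequence number plus a 4-byte little-endian count before the records.
static const size_t kBatchHeader = 12;   // assumed header size, version-dependent

rocksdb::WriteBatch merge_batches(const rocksdb::WriteBatch& dst,
                                  const rocksdb::WriteBatch& src) {
  std::string rep = dst.Data();                     // dst header + dst records
  rep.append(src.Data(), kBatchHeader,
             src.Data().size() - kBatchHeader);     // src records only
  uint32_t count = static_cast<uint32_t>(dst.Count() + src.Count());
  for (int i = 0; i < 4; ++i) {                     // rewrite the count field
    rep[8 + i] = static_cast<char>((count >> (8 * i)) & 0xff);
  }
  return rocksdb::WriteBatch(rep);                  // rebuild from the raw rep
}
```
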
The results below are from the FIO bluestore plugin for both highly concurrent 4K appends and overwrites using 2 NVMe drives (min_alloc_size=4K as well). Each case was run 3 times.

Original code (kv_finisher thread PR + intrusive lists for TransContexts)
Append:
Run 1 = 370 Mb/s, Run 2 = 367 Mb/s, Run 3 = 358 Mb/s
Overwrite:
Run 1 = 329 Mb/s, Run 2 = 333 Mb/s, Run 3 = 329 Mb/s

This PR's code on top of the above-mentioned original code
Append:
Run 1 = 419 Mb/s, Run 2 = 413 Mb/s, Run 3 = 420 Mb/s
Overwrite:
Run 1 = 338 Mb/s, Run 2 = 338 Mb/s, Run 3 = 345 Mb/s

Failed approach: collecting KV operations outside of RocksDB and building the resulting batch before the submit
Append:
Run 1 = 368 Mb/s, Run 2 = 375 Mb/s, Run 3 = 373 Mb/s
Overwrite:
Run 1 = 270 Mb/s, Run 2 = 272 Mb/s, Run 3 = 278 Mb/s

liewegas and others added 8 commits April 6, 2017 11:27
The kv_sync_thread is a bottleneck; making it do less work improves
performance on fast devices.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Sage Weil <sage@redhat.com>
No reason to push them onto *another* Finisher thread.

Signed-off-by: Sage Weil <sage@redhat.com>
This means that the completion thread could in theory get backed up.
However, it resolves hard-to-hit deadlock cases where the completion is
blocked by pg->lock and the lock holder is blocked on a throttle.

Signed-off-by: Sage Weil <sage@redhat.com>
…cific dir

This hack only works for 1 osd, but it lets you put it on a specific
directory/device.

Signed-off-by: Sage Weil <sage@redhat.com>
We aren't using these anymore.

We may want to shard the finalizer thread, but that can work a bit
differently.

Signed-off-by: Sage Weil <sage@redhat.com>
We aren't using these anymore.

We may want to shard the finalizer thread, but that can work a bit
differently.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>

yet another way to merge rocksdb transactions into a batch
liewegas force-pushed the wip-bluestore-kv-finisher branch 3 times, most recently from 65330a1 to bedcbcd on May 19, 2017 16:28
@liewegas (Owner) commented:

This looks pretty straightforward! I think if we go down this path, though, we'd want to patch upstream rocksdb properly to implement the transaction append. And make txc0 reserve enough memory based on what we expect to see for the whole list, etc.

Alternatively, rocksdb could provide a submit_transaction call that takes a vector of writebatches and does the same thing internally without the memcpy. (That would probably be more efficient, fwiw.)
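
A hypothetical shape for such a call, not an API RocksDB actually exposes; SubmitWriteBatches and the rocksdb_ext namespace are made up for illustration, and the loop body is only a naive caller-side stand-in (the real proposal is to do the grouping inside the engine).

```cpp
#include <vector>

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

namespace rocksdb_ext {

// Hypothetical API, purely illustrative: take all per-TransContext batches in
// one call so the engine could merge them into a single write group internally
// (no caller-side memcpy, one WAL sync). The loop below is just a placeholder
// with the proposed signature.
inline rocksdb::Status SubmitWriteBatches(
    rocksdb::DB* db, const rocksdb::WriteOptions& options,
    const std::vector<rocksdb::WriteBatch*>& batches) {
  for (rocksdb::WriteBatch* b : batches) {
    rocksdb::Status s = db->Write(options, b);
    if (!s.ok()) {
      return s;
    }
  }
  return rocksdb::Status::OK();
}

}  // namespace rocksdb_ext
```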

Or, we could batch explicitly in bluestore. (Maybe this is what Adam's approach did? I don't remember.)

  • attach the transaction to the OpSequencer
  • give it a mutex and a bool sealed flag
  • when we create a TransContext, use the osr's txc if it is not sealed; otherwise, grab a new one.
  • hold the mutex over prepare_transaction
  • in kv sync thread, claim ownership of the transaction, seal it, submit it, etc.

The downside to that approach is that prepare_transaction might be slow (especially if there is a read/modify/write), and that may block up the kv sync thread. I don't think there is a way around that without doing a memcpy like your approach does. Which probably means that's the way to go...
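
Not the PR's code and not the real BlueStore classes; below is a minimal C++ sketch of the shape of the explicit-batching idea above, with made-up names (SharedTxc, prepare_write, kv_sync_submit) and RocksDB used directly in place of the KeyValueDB layer.

```cpp
#include <memory>
#include <mutex>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// A per-sequencer "open" transaction that several writers may append to
// until kv_sync_thread seals it.
struct SharedTxc {
  std::mutex lock;            // held while appending or sealing
  bool sealed = false;        // set by kv_sync_thread when it claims the txc
  rocksdb::WriteBatch bat;    // accumulated KV operations
};

struct OpSequencer {
  std::mutex qlock;
  std::shared_ptr<SharedTxc> cur;   // currently open (unsealed) txc, if any

  // Writer side: reuse the osr's open txc if it is not sealed, else start a new one.
  std::shared_ptr<SharedTxc> get_txc() {
    std::lock_guard<std::mutex> l(qlock);
    if (!cur || cur->sealed) {
      cur = std::make_shared<SharedTxc>();
    }
    return cur;
  }
};

// Writer side: hold the txc mutex over the "prepare" work so this request's
// ops land in the shared batch atomically (stand-in for prepare_transaction).
void prepare_write(OpSequencer& osr, const std::string& key,
                   const std::string& val) {
  auto txc = osr.get_txc();
  std::lock_guard<std::mutex> l(txc->lock);
  if (txc->sealed) {
    return;  // lost the race with kv_sync_thread; real code would retry on a new txc
  }
  txc->bat.Put(key, val);
}

// kv_sync_thread side: claim ownership of the open txc, seal it, submit it.
void kv_sync_submit(OpSequencer& osr, rocksdb::DB* db) {
  std::shared_ptr<SharedTxc> txc;
  {
    std::lock_guard<std::mutex> l(osr.qlock);
    txc = std::move(osr.cur);   // subsequent writers will open a fresh txc
  }
  if (!txc) {
    return;
  }
  {
    std::lock_guard<std::mutex> l(txc->lock);
    txc->sealed = true;         // no further appends allowed
  }
  db->Write(rocksdb::WriteOptions(), &txc->bat);  // error handling omitted
}
```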

liewegas pushed a commit that referenced this pull request Oct 3, 2017
93f760c57c Merge pull request #40 from ivancich/wip-change-client-rec-init
824d92dd3d Merge pull request #38 from ivancich/wip-improve-next-request-return
941d1bef54 Change initialization of IndIntruHeapData to C++'s value-initialization to better future-proof the code. Since at the moment they are scalars, they'll be zero-initialized (i.e., to zero). However, if they ever become something more complex, their default constructors will be called.
19153d979f Merge pull request #39 from ivancich/wip-delta-rho-plugin
a94c4e086c Allow the calculations of rho and delta to be handled by a "tracker" specified via template parameter (i.e., by static polymorphism). The tracker follows a simple interface consisting of three functions and one static function.
856a26c466 Clarify code surrounding the return value of do_next_request.
b632cfda4f Merge pull request #37 from ivancich/wip-fix-uninit-data
e6df585153 The coverity scan published in ceph-devel on 2017-09-21 revealed some uninitialized data in a constructor. This fixes that.
165a02542d Merge pull request #34 from TaewoongKim/anticipate
72e4df95cf Make anticipation_timeout configurable with config file
2f06d632d5 Add anticipation duration that keeps from resetting tag values to the current time

git-subtree-dir: src/dmclock
git-subtree-split: 93f760c57c75b9eb88382bcba29fcac3ce365e7f