
os/bluestore: [RFC] merge multiple TransContext into a batch to improve performance #34

Open
wants to merge 8 commits into wip-bluestore-kv-finisher

Conversation

@ifed01 commented May 15, 2017

This is a somewhat hacky approach to merge multiple TransContexts into a single batch and submit them in a single step. The code relies on RocksDB WriteBatch internals to merge multiple batches into one. Unfortunately, the more elegant solution of collecting operations outside of the RocksDB WriteBatch and then building it from scratch isn't beneficial from a performance POV, as it overburdens kv_sync_thread (see the code at https://github.com/ifed01/ceph/tree/wip-bluestore-large-batch-prod).
This PR is really a question of whether we want to proceed this way, along with a summary of the numbers obtained.
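
For reference, a minimal sketch of the batch-merging idea, written against the public WriteBatch accessors only. The 12-byte header (8-byte sequence number plus 4-byte little-endian record count) is an assumption about the internal batch format and may vary across RocksDB versions; merge_batches and kBatchHeader are names made up here, not what this PR's code actually uses (the PR works on WriteBatch internals directly).

```cpp
#include <cstdint>
#include <string>

#include <rocksdb/write_batch.h>

// Sketch only: append src's records onto dst's serialized representation and
// patch the record count. Assumes the usual WriteBatch layout of an 8-byte
// sequence number plus a 4-byte little-endian count before the records.
static const size_t kBatchHeader = 12;   // assumed header size, version-dependent

rocksdb::WriteBatch merge_batches(const rocksdb::WriteBatch& dst,
                                  const rocksdb::WriteBatch& src) {
  std::string rep = dst.Data();                     // dst header + dst records
  rep.append(src.Data(), kBatchHeader,
             src.Data().size() - kBatchHeader);     // src records only
  uint32_t count = static_cast<uint32_t>(dst.Count() + src.Count());
  for (int i = 0; i < 4; ++i) {                     // rewrite the count field
    rep[8 + i] = static_cast<char>((count >> (8 * i)) & 0xff);
  }
  return rocksdb::WriteBatch(rep);                  // rebuild from the raw rep
}
```
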
The results below are from the FIO bluestore plugin for both highly concurrent 4K appends and overwrites using 2 NVMe drives (min_alloc_size=4K as well). Each case was run 3 times.

Original code (kv_finisher thread PR + intrusive lists for TransContexts)
Append:
Run 1 = 370 Mb/s, Run 2 = 367 Mb/s, Run 3 = 358 Mb/s
Overwrite:
Run 1 = 329 Mb/s, Run 2 = 333 Mb/s, Run 3 = 329 Mb/s

This PR's code on top of the above-mentioned original code
Append:
Run 1 = 419 Mb/s, Run 2 = 413 Mb/s, Run 3 = 420 Mb/s
Overwrite:
Run 1 = 338 Mb/s, Run 2 = 338 Mb/s, Run 3 = 345 Mb/s

Failed approach: collecting KV operations outside of RocksDB and building the resulting batch before the submit
Append:
Run 1 = 368 Mb/s, Run 2 = 375 Mb/s, Run 3 = 373 Mb/s
Overwrite:
Run 1 = 270 Mb/s, Run 2 = 272 Mb/s, Run 3 = 278 Mb/s

liewegas and others added 8 commits April 6, 2017 11:27
The kv_sync_thread is a bottleneck; making it do less work improves
performance on fast devices.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Sage Weil <sage@redhat.com>
No reason to push them onto *another* Finisher thread.

Signed-off-by: Sage Weil <sage@redhat.com>
This means that the completion thread could in theory get backed up.
However, it resolves hard-to-hit deadlock cases where the completion is
blocked by pg->lock and the lock holder is blocked on a throttle.

Signed-off-by: Sage Weil <sage@redhat.com>
…cific dir

This hack only works for 1 osd, but it lets you put it on a specific
directory/device.

Signed-off-by: Sage Weil <sage@redhat.com>
We aren't using these anymore.

We may want to shard the finalizer thread, but that can work a bit
differently.

Signed-off-by: Sage Weil <sage@redhat.com>
We aren't using these anymore.

We may want to shard the finalizer thread, but that can work a bit
differently.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>

yet another way to merge rocksdb transactions into a batch
liewegas force-pushed the wip-bluestore-kv-finisher branch 3 times, most recently from 65330a1 to bedcbcd on May 19, 2017 16:28
@liewegas (Owner) commented:

This looks pretty straightforward! I think if we go down this path, though, we'd want to patch upstream rocksdb properly to implement the transaction append. And make txc0 reserve enough memory based on what we expect to see for the whole list, etc.

Alternatively, rocksdb could provide a submit_transaction call that takes a vector of writebatches and does the same thing internally without the memcpy. (That would probably be more efficient, fwiw.)
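
A hypothetical shape for such a call, not an API RocksDB actually exposes; SubmitWriteBatches and the rocksdb_ext namespace are made up for illustration, and the loop body is only a naive caller-side stand-in (the real proposal is to do the grouping inside the engine).

```cpp
#include <vector>

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

namespace rocksdb_ext {

// Hypothetical API, purely illustrative: take all per-TransContext batches in
// one call so the engine could merge them into a single write group internally
// (no caller-side memcpy, one WAL sync). The loop below is just a placeholder
// with the proposed signature.
inline rocksdb::Status SubmitWriteBatches(
    rocksdb::DB* db, const rocksdb::WriteOptions& options,
    const std::vector<rocksdb::WriteBatch*>& batches) {
  for (rocksdb::WriteBatch* b : batches) {
    rocksdb::Status s = db->Write(options, b);
    if (!s.ok()) {
      return s;
    }
  }
  return rocksdb::Status::OK();
}

}  // namespace rocksdb_ext
```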

Or, we could batch explicitly in bluestore. (Maybe this is what Adam's approach did? I don't remember.)

  • attach the transaction to the OpSequencer
  • give it a mutex and a bool sealed flag
  • when we create a TransContext, use the osr's txc if it is not sealed; otherwise, grab a new one.
  • hold the mutex over prepare_transaction
  • in kv sync thread, claim ownership of the transaction, seal it, submit it, etc.

The downside to that approach is that prepare_transaction might be slow (especially if there is a read/modify/write), and that may block up the kv sync thread. I don't think there is a way around that without doing a memcpy like your approach does. Which probably means that's the way to go...
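
Not the PR's code and not the real BlueStore classes; below is a minimal C++ sketch of the shape of the explicit-batching idea above, with made-up names (SharedTxc, prepare_write, kv_sync_submit) and RocksDB used directly in place of the KeyValueDB layer.

```cpp
#include <memory>
#include <mutex>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// A per-sequencer "open" transaction that several writers may append to
// until kv_sync_thread seals it.
struct SharedTxc {
  std::mutex lock;            // held while appending or sealing
  bool sealed = false;        // set by kv_sync_thread when it claims the txc
  rocksdb::WriteBatch bat;    // accumulated KV operations
};

struct OpSequencer {
  std::mutex qlock;
  std::shared_ptr<SharedTxc> cur;   // currently open (unsealed) txc, if any

  // Writer side: reuse the osr's open txc if it is not sealed, else start a new one.
  std::shared_ptr<SharedTxc> get_txc() {
    std::lock_guard<std::mutex> l(qlock);
    if (!cur || cur->sealed) {
      cur = std::make_shared<SharedTxc>();
    }
    return cur;
  }
};

// Writer side: hold the txc mutex over the "prepare" work so this request's
// ops land in the shared batch atomically (stand-in for prepare_transaction).
void prepare_write(OpSequencer& osr, const std::string& key,
                   const std::string& val) {
  auto txc = osr.get_txc();
  std::lock_guard<std::mutex> l(txc->lock);
  if (txc->sealed) {
    return;  // lost the race with kv_sync_thread; real code would retry on a new txc
  }
  txc->bat.Put(key, val);
}

// kv_sync_thread side: claim ownership of the open txc, seal it, submit it.
void kv_sync_submit(OpSequencer& osr, rocksdb::DB* db) {
  std::shared_ptr<SharedTxc> txc;
  {
    std::lock_guard<std::mutex> l(osr.qlock);
    txc = std::move(osr.cur);   // subsequent writers will open a fresh txc
  }
  if (!txc) {
    return;
  }
  {
    std::lock_guard<std::mutex> l(txc->lock);
    txc->sealed = true;         // no further appends allowed
  }
  db->Write(rocksdb::WriteOptions(), &txc->bat);  // error handling omitted
}
```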

liewegas pushed a commit that referenced this pull request Oct 3, 2017
93f760c57c Merge pull request #40 from ivancich/wip-change-client-rec-init
824d92dd3d Merge pull request #38 from ivancich/wip-improve-next-request-return
941d1bef54 Change initialization of IndIntruHeapData to C++'s value-initialization to better future-proof the code. Since at the moment they are scalars, they'll be zero-initialized (i.e., to zero). However, if they ever become something more complex, their default constructors will be called.
19153d979f Merge pull request #39 from ivancich/wip-delta-rho-plugin
a94c4e086c Allow the calculations of rho and delta to be handled by a "tracker" specified via template parameter (i.e., by static polymorphism). The tracker follows a simple interface consisting of three functions and one static function.
856a26c466 Clarify code surrounding the return value of do_next_request.
b632cfda4f Merge pull request #37 from ivancich/wip-fix-uninit-data
e6df585153 The coverity scan published in ceph-devel on 2017-09-21 revealed some uninitialized data in a constructor. This fixes that.
165a02542d Merge pull request #34 from TaewoongKim/anticipate
72e4df95cf Make anticipation_timeout configurable with config file
2f06d632d5 Add anticipation duration that keeps from resetting tag values to the current time

git-subtree-dir: src/dmclock
git-subtree-split: 93f760c57c75b9eb88382bcba29fcac3ce365e7f