FEAT: async post-commit #381

Closed
pbalcer opened this Issue Dec 23, 2016 · 3 comments

pbalcer commented Dec 23, 2016

Asynchronous post-commit

Rationale

The post-commit phase of a transaction, whose purpose is to clean up the undo logs used during the transaction, can take a lot of time, especially when the transaction has performed many TX_FREE operations, because the bulk of the free operation happens once the transaction finishes. However, the time frame in which this phase completes is completely irrelevant to the correctness of the system. Currently it is performed sequentially after the transaction finishes, after the pre-commit phase (flushing of data) and after the transaction has been marked as committed (in the on-media layout).

Description

The idea is to run the post-commit phase in a separate worker thread in the background. The worker would perform everything that is currently done in the post-commit phase of the transaction (cleaning up the vectors, performing the TX_FREE operations and so on) and would additionally release the lane once finished.

This mechanism would be opt-in and the user would have to provide the library with a thread that will be allowed to run for the entire life of the application.

This optimization would make the transactional free operation almost free from the perspective of the calling thread, as the actual deallocation would be offloaded to a completely separate thread.

API Changes

Two new CTL entry points would be added:
"tx.post_commit.thread" and "tx.post_commit.queue_depth"

The thread entry point takes a pointer to an existing thread that runs the worker function, and the queue depth defines how many transactions can wait in the worker's queue before any other calling thread has to wait. This limits the potential problem of using too many lanes (as lanes would only be released once the worker thread finishes the task).
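From the application side, setup could look something like the sketch below. Only the entry-point names come from this issue; `pmemobj_ctl_set()` is the library's CTL interface, but the argument types and the pthread-based wiring are assumptions.

```c
/*
 * Hypothetical usage sketch: only the entry-point names come from this
 * issue; the argument types and the pthread-based setup are assumptions.
 */
#include <libpmemobj.h>
#include <pthread.h>

static int
setup_async_post_commit(PMEMobjpool *pop, pthread_t *worker)
{
	/* how many transactions may wait in the worker's queue */
	int depth = 16;
	if (pmemobj_ctl_set(pop, "tx.post_commit.queue_depth", &depth) != 0)
		return -1;

	/*
	 * Hand the library a thread that is allowed to run for the
	 * entire life of the application.
	 */
	return pmemobj_ctl_set(pop, "tx.post_commit.thread", worker);
}
```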

Implementation details

The worker thread would be implemented using a multiple-producer single-consumer queue (a circular buffer whose size is the queue depth). Because the actual performance of the worker doesn't matter that much, it would spend most of its time sleeping and only periodically wake up to check whether there is work to be performed.
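For illustration, a minimal sketch of such a queue (not the actual implementation): a fixed-size circular buffer guarded by a mutex, where the producers are the committing threads and the single consumer sleeps on a timed condition wait so it wakes periodically to check for work.

```c
/*
 * Minimal sketch of the described queue; all names are hypothetical
 * and this is not the libpmemobj implementation. Producers are the
 * committing threads, the single consumer is the post-commit worker.
 */
#include <pthread.h>
#include <time.h>

#define QUEUE_DEPTH 16 /* corresponds to "tx.post_commit.queue_depth" */

struct pc_queue {
	void *jobs[QUEUE_DEPTH]; /* pending post-commit work (one per lane) */
	size_t head;
	size_t tail;
	size_t count;
	pthread_mutex_t lock;
	pthread_cond_t nonempty;
};

/*
 * Producer side: returns -1 when the queue is full, in which case the
 * calling thread performs the post-commit cleanup synchronously.
 */
static int
pc_enqueue(struct pc_queue *q, void *job)
{
	pthread_mutex_lock(&q->lock);
	if (q->count == QUEUE_DEPTH) {
		pthread_mutex_unlock(&q->lock);
		return -1;
	}
	q->jobs[q->tail] = job;
	q->tail = (q->tail + 1) % QUEUE_DEPTH;
	q->count++;
	pthread_cond_signal(&q->nonempty);
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/*
 * Consumer side: the worker sleeps most of the time and wakes up
 * periodically (or when signaled) to drain the queue.
 */
static void *
pc_worker(void *arg)
{
	struct pc_queue *q = arg;

	for (;;) {
		pthread_mutex_lock(&q->lock);
		while (q->count == 0) {
			struct timespec ts;
			clock_gettime(CLOCK_REALTIME, &ts);
			ts.tv_sec += 1; /* periodic wakeup */
			pthread_cond_timedwait(&q->nonempty, &q->lock, &ts);
		}
		void *job = q->jobs[q->head];
		q->head = (q->head + 1) % QUEUE_DEPTH;
		q->count--;
		pthread_mutex_unlock(&q->lock);

		/* ... run post-commit cleanup for job, release its lane ... */
		(void)job;
	}

	return NULL;
}
```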

@pbalcer pbalcer added this to the 1.4 milestone Dec 23, 2016

@pbalcer pbalcer self-assigned this Dec 23, 2016

@pbalcer pbalcer referenced this issue in pmem/pmdk Feb 13, 2017

Merged

obj: async post-commit #1671

krzycz commented Feb 22, 2017

  1. What is the expected performance gain, i.e., for our tree examples/benchmarks?
  2. It looks like in some cases (many threads, but only one worker) the performance could actually be worse.
  3. Even though TX_FREE is offloaded to another thread, it would still touch the heap metadata, so it might contend with other threads doing TX_ALLOC, right?
pbalcer commented Feb 27, 2017

I've implemented a simple benchmark that preallocates objects and then creates workers that free them inside a transaction. Each worker frees oids_per_worker objects in total, and each of its transactions frees oids_per_tx objects.
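The core of such a benchmark might look like this sketch; only the parameter names oids_per_worker and oids_per_tx come from the description above, the helper and its assumptions are mine.

```c
/*
 * Sketch of the per-worker benchmark loop described above; only the
 * parameter names come from the comment, the rest is assumed.
 * Assumes oids_per_worker is a multiple of oids_per_tx.
 */
#include <libpmemobj.h>

static void
worker_run(PMEMobjpool *pop, PMEMoid *oids,
	size_t oids_per_worker, size_t oids_per_tx)
{
	for (size_t i = 0; i < oids_per_worker; i += oids_per_tx) {
		TX_BEGIN(pop) {
			for (size_t j = 0; j < oids_per_tx; ++j)
				pmemobj_tx_free(oids[i + j]);
		} TX_END
	}
}
```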

As the data below shows, the benefit can be quite significant when a single transaction performs a lot of work (up to a 6x improvement). The biggest benefit is for workloads that perform big transactions and then move on to a different task. There are also no significant adverse effects when the CPU is overprovisioned (i.e., more threads than CPU cores); the post-commit workers simply sleep most of the time (assuming the queue length is smaller than the number of lanes).

For very tiny transactions, the benefit is very small because we add communication overhead to an already small amount of work.

There are also very noticeable diminishing returns when increasing the number of post-commit workers; this is because the worker threads performing the transactions eventually hit the limit of how fast they can execute them.

oids_per_worker;oids_per_tx;workers;post-commit workers;time elapsed [s]

4000000;100;1;0;2.591
4000000;100;1;1;1.289
4000000;100;1;2;1.587

4000000;1000;1;0;2.549
4000000;1000;1;1;1.112
4000000;1000;1;2;0.922
4000000;1000;1;3;0.864
4000000;1000;1;4;0.798

4000000;10000;1;0;2.396
4000000;10000;1;1;0.390
4000000;10000;1;2;0.390

4000000;100;2;0;2.586
4000000;100;2;1;1.761
4000000;100;2;2;1.710

4000000;1000;2;0;2.496
4000000;1000;2;1;1.679
4000000;1000;2;2;1.272
4000000;1000;2;8;0.909

4000000;10000;2;0;2.424
4000000;10000;2;1;0.805
4000000;10000;2;2;0.581

4000000;1000;8;0;3.115
4000000;1000;8;1;3.120
4000000;1000;8;4;3.106

4000000;1;1;1;2.837
4000000;1;1;1;2.837
4000000;1;1;2;2.993

As for tree benchmarks, they are not a relevant workload for this feature, at least not in the way removes are implemented in those benchmarks.

  1. If you set the queue depth to a value lower than the number of lanes, that won't be a problem, because the threads will then perform the post-commit cleanup synchronously (see the sketch after this list).

  2. The free operation of our allocator scales linearly, and I have a patch in the works that makes it completely lock-free. Currently the only way for contention to happen is when two threads try to concurrently free or allocate memory from the same run (which shouldn't happen very often).
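To make point 1 concrete, the synchronous fallback could look like the following, reusing the hypothetical pc_queue sketch from the issue description; post_commit_cleanup() stands in for the real post-commit routine.

```c
/*
 * Hypothetical glue reusing the pc_queue sketch above;
 * post_commit_cleanup() stands in for the real post-commit routine.
 */
void post_commit_cleanup(void *lane_job);

static void
tx_finish(struct pc_queue *q, void *lane_job)
{
	if (pc_enqueue(q, lane_job) != 0) {
		/*
		 * Queue full: clean up synchronously on the calling
		 * thread, so held lanes stay bounded by the queue depth.
		 */
		post_commit_cleanup(lane_job);
	}
}
```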


pbalcer commented Mar 30, 2017

[image]

@krzycz krzycz closed this May 17, 2017
