
[Future] Implement lock-less reading + epoch based GC #345

Open

LionsAd opened this issue Nov 19, 2018 · 8 comments


LionsAd commented Nov 19, 2018

Motivation

Once #337 is implemented, we will have one rwlock per hash table entry.

A hash table entry is a pointer to a list head, and the list head then contains a pointer to further elements.

Lockless lists are possible and not even that complicated:

  • Items can be consumed from the start by N consumers at the same time [atomic cmpxchg]
  • Items can be added at the end by one producer [atomic cmpxchg]

Reading of the list is always safe.
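For illustration, a minimal sketch of those two operations on a singly linked list, using GCC's __sync_bool_compare_and_swap builtin (my choice of primitive here; the real code could use any cmpxchg-style atomic):

typedef struct node {
    struct node *next;
} node_t;

/* 1 producer: walk to the end and CAS the new node into the last
   next pointer.  Assumes popped nodes are not reclaimed while the
   producer may still touch them -- the GC problem discussed below. */
void list_append(node_t **head, node_t *n) {
    node_t **pp = head;
    n->next = NULL;
    while (!__sync_bool_compare_and_swap(pp, NULL, n)) {
        while (*pp != NULL)
            pp = &(*pp)->next;
    }
}

/* N consumers: pop from the start with a cmpxchg retry loop.
   Note: this naive pop is exactly where the ABA/reclamation issues
   discussed later in this thread come in. */
node_t *list_pop(node_t **head) {
    node_t *h;
    do {
        h = *head;
        if (h == NULL)
            return NULL;
    } while (!__sync_bool_compare_and_swap(head, h, h->next));
    return h;
}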

We have two operations that can currently happen on the list:

  • Insert
  • Delete

Update is currently implemented as Delete + Insert and hence is the same for the purpose of this issue.

  • Insert just means inserting at the end of the list:

For that, a write lock is obtained on the hash bucket, the item is atomically appended, and the write lock is released again:

A --> B --> NULL => A --> B --> C --> NULL

  • Delete works the same:

Again the write lock is obtained and we want to delete B:

A --> B --> C --> NULL => A --> C --> NULL

So a read lock is not necessary at all -- even now (without the more granular locking) we could have lock-less reading + a normal write-lock mutex.

Implementation

  • Abstract the list functions into purely inline list_insert() / inline list_delete() (a hypothetical sketch follows below)
  • All list modification operations need to be atomic and use a write lock
  • Reads will automatically be correct
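As a hypothetical sketch of those helpers (the names, the entry_t layout, and the locking discipline here are illustrative, not the project's actual interface): the write lock serializes writers, while readers need no lock because every modification boils down to a single atomic pointer store:

typedef struct entry entry_t;
struct entry {
    entry_t *next;
    /* key, hash, data ... */
};

/* Caller holds the bucket's write lock. */
static inline void list_insert(entry_t **bucket, entry_t *new_entry) {
    entry_t **pp = bucket;
    while (*pp != NULL)
        pp = &(*pp)->next;
    new_entry->next = NULL;
    /* One atomic store publishes the entry: readers see either the
       old NULL or the fully initialized new entry. */
    __atomic_store_n(pp, new_entry, __ATOMIC_RELEASE);
}

/* Caller holds the bucket's write lock. */
static inline void list_delete(entry_t **bucket, entry_t *victim) {
    entry_t **pp = bucket;
    while (*pp != victim)
        pp = &(*pp)->next;
    /* One atomic store unlinks it: A --> B --> C becomes A --> C. */
    __atomic_store_n(pp, victim->next, __ATOMIC_RELAXED);
    /* victim must NOT be freed yet -- a concurrent reader may still
       hold a pointer to it; that is the GC problem discussed below. */
}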

Reference:

Lockless lists: https://en.wikipedia.org/wiki/Non-blocking_linked_list

(split up from #333)


nikic commented Nov 19, 2018

The hard part about lock-less lists is not so much the algorithm itself, but the garbage collection mechanism. Unless we want to rely on heuristics like "if an item has been removed for N seconds, it's unlikely to be still used by any process", properly handling this without read-locks in a non-GC language would require implementing epoch-based reclamation, or a similar mechanism.


LionsAd commented Nov 19, 2018

Okay, I got it -- will mark as future. I don't think epoch-based algorithms are that bad, and we could replace the current GC with an epoch-based one as well, which would remove the lingering-item-on-the-GC-list issue.

LionsAd changed the title from "Implement lock-less reading" to "[Future] Implement lock-less reading + epoch based GC" on Nov 19, 2018

LionsAd commented Nov 19, 2018

Just documenting possible solutions:

  • RCU with https://liburcu.org/ (Kernel support for sys_membarrier since 2015 / Linux 4.3.0)
  • Epoch -- no real library exists, but it feels pretty suitable for our use case.



LionsAd commented Nov 19, 2018

Actually, there is a sequence we can use that should work for our data structures, as long as we keep our normal reference counting (which we use right now anyway):

retry:
  entry = head;
  inc_ref(entry);
  if (head != entry) {
    dec_ref(entry);
    goto retry;
  }

Most of the more sophisticated lock-less schemes also avoid ATOMIC_ADD / ATOMIC_DEL during reads and hence need more elaborate machinery.

But the currently used reader-writer locks also use at least one atomic operation for entering the read section, so performance should be about the same -- as long as there are not too many items in the list.
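Spelled out with C11 atomics, the sequence would look roughly like this (a sketch; the _Atomic ref_count field and the entry_t layout are assumptions):

#include <stdatomic.h>

typedef struct entry entry_t;
struct entry {
    _Atomic int ref_count;
    entry_t *next;
};

entry_t *acquire_head(entry_t *_Atomic *head) {
    for (;;) {
        entry_t *e = atomic_load(head);
        if (e == NULL)
            return NULL;
        atomic_fetch_add(&e->ref_count, 1);
        if (atomic_load(head) == e)          /* still the head? we own a ref */
            return e;
        atomic_fetch_sub(&e->ref_count, 1);  /* lost the race; retry */
    }
}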


nikic commented Nov 19, 2018

@LionsAd This would still have the issue that between entry = head and inc_ref(entry), the entry may have been deleted (and due to ABA, checking for a pointer difference is not enough to detect that, even if we discount that the inc_ref might corrupt some now-unrelated memory).


LionsAd commented Nov 19, 2018

Ohhh right, that is the reason why the hazard pointer must not live in the data structure itself.


LionsAd commented Nov 20, 2018

After studying several lockless schemes, I agree that "pure" lockless is pretty difficult to achieve and has its own drawbacks, but as you said, it's a GC problem and we can take advantage of that.

Let's assume the following structure (simplified C):

struct entry_t {
  entry_t *next;
  data_t *data;
};

struct data_t {
  volatile int ref_count;
  int memsize;
};

The list would be traversed as usual, and if the hash and key of the data entry match, the data entry is returned and its ref_count incremented.

As we have seen, the list structure itself is unproblematic for insert and update, but problematic for GC on delete.

Assuming we keep the READ() lock [for simplicity a global one], we can just use an atomic operation for delete, put the item on the free list, and take the WLOCK() once before we attempt to do GC.

This ensures that no list is being accessed by anyone anymore. That means we do not need to take the wlock for each delete operation, but just once for GC purposes (which can be infrequent, timed, and even use a TRY-LOCK CS). The WLOCK() is our synchronize step, which ensures no one can possibly be operating on the deleted item anymore.

This will reduce the number of wlocks taken dramatically, and delete is hopefully the least frequent operation in practice.

But even then, performance is already better than now.
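A rough sketch of that split (unlink_atomically, push_free_list, and free_list_reclaim_unreferenced are hypothetical helpers; TRY_WLOCK()/WUNLOCK() mirror the locking macros mentioned above):

/* Delete: unlink atomically and defer reclamation; no wlock needed. */
void entry_delete(entry_t **bucket, entry_t *victim) {
    unlink_atomically(bucket, victim);   /* as in list_delete() above */
    push_free_list(victim);              /* deferred; not freed yet */
}

/* GC: runs infrequently.  Taking the write lock once is the
   synchronize step: once acquired, no reader that could have seen a
   now-unlinked entry is still inside the list. */
void gc_run(void) {
    if (!TRY_WLOCK())
        return;                          /* fine to skip this round */
    WUNLOCK();                           /* lock + unlock is the barrier */
    free_list_reclaim_unreferenced();    /* free entries with ref_count == 0 */
}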

For updates, the same is true:

Because updates use the normal incref, we can just use cmpxchg() with old_data and new_data.
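For example (a sketch; assumes the entry's data pointer is only ever swapped atomically, and push_free_list_data is a hypothetical helper):

void entry_update(entry_t *entry, data_t *old_data, data_t *new_data) {
    /* Swap the data pointer in one atomic step; readers see either
       the old or the new data, never a mix. */
    if (__sync_bool_compare_and_swap(&entry->data, old_data, new_data))
        push_free_list_data(old_data);   /* old data reclaimed later by GC */
}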

Again we have the GC problem:

  • A process could still be stuck just before increasing the inc_ref()

Again, we just take the WLOCK() once during GC and ensure all inc_ref() operations have completed.
Then we check for ref_count == 0.

The HUGE advantage of the barrier-based approach is that in ~99% of cases we do not take the WLOCK(), and when we do take it, we directly unlock it again. It's only used to synchronize, i.e. to ensure that all readers have left the critical section.

And that gives us kinda RCU-like semantics without needing an extra library.

And given that we already work pretty okay with reader-writer locks, I presume that is all that is needed for our purposes.


In fact, thinking more about it, I think we do not even need the extra data structure, because all we need is to ensure the inc_ref is synchronized, which is guaranteed by the WLOCK().

Therefore, read-lock-protected lists + refcount-protected entries + a WLOCK() to synchronize during GC will work.


And if we ever wanted to implement EPOCH + ref count, that would also be simple based on this -- as long as we have mutual exclusion on the GC:

The implementation of RLOCK / RUNLOCK would be:

int epoch = atomic_read(&global_epoch);
// -- barrier
atomic_inc(&lock[epoch]);
// --- list operation
atomic_dec(&lock[epoch]);

GC:

// enter mutex CS

// cmpxchg() the global gc list with a local one; gc'd items carry a gc_next pointer

int old_epoch = global_epoch;
int epoch = (old_epoch + 1) % 2; // As we busy-wait, we only need two states ...

// The atomic operation only ensures that all threads see the new value.
cmpxchg(&global_epoch, old_epoch, epoch);

while (lock[old_epoch] > 0) {
  // wait ...
  usleep(0);
}

// Now GC all entries with ref_count == 0

// leave mutex CS

The busy waiting is not nice, but for our use case it is potentially more performant than the RWLOCK, as processes should leave the read-locked CS pretty fast.

It's probably not more performant than pure epoch or RCU, but it should be good enough for our use case and definitely better than what we have now :).
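Put together, a compilable sketch of this two-state scheme with C11 atomics (shared-memory placement, the GC mutex, and the gc-list stealing are elided):

#include <stdatomic.h>
#include <unistd.h>

static _Atomic int global_epoch;
static _Atomic int lock[2];          /* per-epoch reader counts */

int rlock(void) {                    /* returns the epoch we entered under */
    int e = atomic_load(&global_epoch);
    atomic_fetch_add(&lock[e], 1);   /* doubles as the barrier */
    return e;
}

void runlock(int e) {
    atomic_fetch_sub(&lock[e], 1);
}

void gc(void) {
    /* caller holds the GC mutex and has stolen the global gc list */
    int old_epoch = atomic_load(&global_epoch);
    int epoch = (old_epoch + 1) % 2;
    atomic_compare_exchange_strong(&global_epoch, &old_epoch, epoch);
    while (atomic_load(&lock[old_epoch]) > 0)
        usleep(0);                   /* busy-wait for stragglers */
    /* now free all stolen entries with ref_count == 0 */
}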


Oh, I see how to implement epochs with even less busy waiting; we just need 3 GC lists:

// enter mutex CS

int old_epoch = global_epoch;
int epoch = (old_epoch + 1) % 3;
int check_epoch = (old_epoch == 0 ? 2 : old_epoch - 1);

// The atomic operation only ensures that all threads see the new value.
cmpxchg(&global_epoch, old_epoch, epoch);

// We still wait, but if GC is not called too frequently, this will always succeed in practice.
while (UNLIKELY(lock[check_epoch] > 0)) {
  // wait ...
  usleep(0);
}

// Now GC all entries with ref_count == 0 from gc_list[check_epoch].
// Their ref_count being 0 is very likely, as much time has passed since they were added to the gc_list.

local_gc_list        = { entry ∈ gc_list[check_epoch] | entry.ref_count == 0 }
gc_list[check_epoch] = { entry ∈ gc_list[check_epoch] | entry.ref_count > 0 }

// leave mutex CS

// Now clean up the local gc list
gc_real(local_gc_list)

Usually this would not be enough, but as we use reference counters, it does work.

In case we never hand entries out to PHP, obviously just holding the read lock longer would work, too.


LionsAd commented Nov 25, 2018

The theoretical read-only performance of lockless is 0.3 vs. 1.5 seconds [current] in my benchmark with 21 (or 210) limited keys and a 50000-iteration loop, i.e. 5x faster.

Using atomic_inc / atomic_dec on a single global counter (a bottleneck) reduces that to 0.5 seconds, so "only" 3x faster.

[Ironically, a++ without ATOMIC_INC can actually be slower. The reason is likely that the cache line is first read (cache miss) and then updated, vs. exclusively locked and updated in one step.

That part, however, only feels relevant for a synthetic setup.]

Using atomic_inc / atomic_dec on the per-entry reference counts has no measurable performance effect in this benchmark, however, so splitting up the locks per hash table bucket (or at a larger granularity) is still worth it -- even for a GC version with lockless updates using RCU or similar.
