
[Future] Implement lock-less reading + epoch based GC #345

Open

LionsAd opened this issue Nov 19, 2018 · 8 comments


LionsAd commented Nov 19, 2018

Motivation

Once #337 is implemented, we will have one rwlock per hash table entry.

A hash table entry is a pointer to a list head, and the list head then contains a pointer to further elements.

Lockless lists are possible and not even that complicated:

  • Items can be consumed from the start by N consumers at the same time [atomic cmpxchg]
  • Items can be added at the end by one producer [atomic cmpxchg]

Reading of the list is always safe.
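For illustration, a minimal sketch of those two operations on a singly linked list, using GCC's __sync_bool_compare_and_swap builtin (my choice of primitive here; the real code could use any cmpxchg-style atomic):

typedef struct node {
    struct node *next;
} node_t;

/* 1 producer: walk to the end and CAS the new node into the last
   next pointer.  Assumes popped nodes are not reclaimed while the
   producer may still touch them -- the GC problem discussed below. */
void list_append(node_t **head, node_t *n) {
    node_t **pp = head;
    n->next = NULL;
    while (!__sync_bool_compare_and_swap(pp, NULL, n)) {
        while (*pp != NULL)
            pp = &(*pp)->next;
    }
}

/* N consumers: pop from the start with a cmpxchg retry loop.
   Note: this naive pop is exactly where the ABA/reclamation issues
   discussed later in this thread come in. */
node_t *list_pop(node_t **head) {
    node_t *h;
    do {
        h = *head;
        if (h == NULL)
            return NULL;
    } while (!__sync_bool_compare_and_swap(head, h, h->next));
    return h;
}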

We have two operations that can currently happen on the list:

  • Insert
  • Delete

Update is currently implemented as Delete + Insert and hence is the same for the purpose of this issue.

  • Insert just means inserting at the end of the list:

For that, a write lock is obtained on the hash bucket, the item is atomically appended, and the write lock is released again:

A --> B --> NULL => A --> B --> C --> NULL

  • Delete works the same:

Again the write lock is obtained and we want to delete B:

A --> B --> C --> NULL => A --> C --> NULL

So a read lock is not necessary at all -- even now (without the more granular locking) we could have lock-less reading + a normal write-lock mutex.

Implementation

  • Abstract the list functions into purely inline list_insert() / inline list_delete() (a hypothetical sketch follows below)
  • All list modification operations need to be atomic and use a write lock
  • Reads will automatically be correct
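As a hypothetical sketch of those helpers (the names, the entry_t layout, and the locking discipline here are illustrative, not the project's actual interface): the write lock serializes writers, while readers need no lock because every modification boils down to a single atomic pointer store:

typedef struct entry entry_t;
struct entry {
    entry_t *next;
    /* key, hash, data ... */
};

/* Caller holds the bucket's write lock. */
static inline void list_insert(entry_t **bucket, entry_t *new_entry) {
    entry_t **pp = bucket;
    while (*pp != NULL)
        pp = &(*pp)->next;
    new_entry->next = NULL;
    /* One atomic store publishes the entry: readers see either the
       old NULL or the fully initialized new entry. */
    __atomic_store_n(pp, new_entry, __ATOMIC_RELEASE);
}

/* Caller holds the bucket's write lock. */
static inline void list_delete(entry_t **bucket, entry_t *victim) {
    entry_t **pp = bucket;
    while (*pp != victim)
        pp = &(*pp)->next;
    /* One atomic store unlinks it: A --> B --> C becomes A --> C. */
    __atomic_store_n(pp, victim->next, __ATOMIC_RELAXED);
    /* victim must NOT be freed yet -- a concurrent reader may still
       hold a pointer to it; that is the GC problem discussed below. */
}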

Reference:

Lockless lists: https://en.wikipedia.org/wiki/Non-blocking_linked_list

(split up from #333)


nikic commented Nov 19, 2018

The hard part about lock-less lists is not so much the algorithm itself, but the garbage collection mechanism. Unless we want to rely on heuristics like "if an item has been removed for N seconds, it's unlikely to be still used by any process", properly handling this without read-locks in a non-GC language would require implementing epoch-based reclamation, or a similar mechanism.


LionsAd commented Nov 19, 2018

Okay, I got it -- will mark as future. I don't think epoch-based algorithms are that bad, and we could replace the current GC with an epoch-based one as well, which would remove the lingering-item-on-the-GC-list issue.

LionsAd changed the title from "Implement lock-less reading" to "[Future] Implement lock-less reading + epoch based GC" on Nov 19, 2018

LionsAd commented Nov 19, 2018

Just documenting possible solutions:

  • RCU with https://liburcu.org/ (Kernel support for sys_membarrier since 2015 / Linux 4.3.0)
  • Epoch -- no real library exists, but it feels pretty suitable for our use case.



LionsAd commented Nov 19, 2018

Actually, there is a sequence we can use that should work for our data structures, as long as we keep our normal reference counting (which we use right now anyway):

retry:
  entry = head;
  inc_ref(entry);
  if (head != entry) {
    dec_ref(entry);
    goto retry;
  }

Most of the more sophisticated lock-less schemes also avoid ATOMIC_ADD / ATOMIC_DEL during reads and hence need more elaborate machinery.

But the currently used reader-writer locks also use at least one atomic operation for entering the read section, so performance should be about the same -- as long as there are not too many items in the list.
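Spelled out with C11 atomics, the sequence would look roughly like this (a sketch; the _Atomic ref_count field and the entry_t layout are assumptions):

#include <stdatomic.h>

typedef struct entry entry_t;
struct entry {
    _Atomic int ref_count;
    entry_t *next;
};

entry_t *acquire_head(entry_t *_Atomic *head) {
    for (;;) {
        entry_t *e = atomic_load(head);
        if (e == NULL)
            return NULL;
        atomic_fetch_add(&e->ref_count, 1);
        if (atomic_load(head) == e)          /* still the head? we own a ref */
            return e;
        atomic_fetch_sub(&e->ref_count, 1);  /* lost the race; retry */
    }
}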


nikic commented Nov 19, 2018

@LionsAd This would still have the issue that between entry = head and inc_ref(entry), the entry may have been deleted (and due to ABA, checking for a pointer difference is not enough to detect that, even if we discount that the inc_ref might corrupt some now-unrelated memory).


LionsAd commented Nov 19, 2018

Ohhh right, that is the reason why the hazard pointer must not live in the data structure itself.


LionsAd commented Nov 20, 2018

After studying several lockless schemes, I agree that "pure" lockless is pretty difficult to achieve and has its own drawbacks, but as you said, it's a GC problem and we can take advantage of that.

Let's assume the following structure (simplified C):

struct entry_t {
  entry_t *next;
  data_t *data;
};

struct data_t {
  volatile int ref_count;
  int memsize;
};

The list would be traversed as usual, and if the hash and key of the data entry match, the data entry is returned and its ref_count incremented.

As we have seen, the list structure itself is unproblematic for insert and update, but problematic for GC on delete.

Assuming we keep the READ() lock [for simplicity a global one], we can just use an atomic operation for delete, put the item on the free list, and take the WLOCK() once before we attempt to do GC.

This ensures that no list is being accessed by anyone anymore. That means we do not need to take the wlock for each delete operation, but just once for GC purposes (which can be infrequent, timed, and even use a TRY-LOCK CS). The WLOCK() is our synchronize step, which ensures no one can possibly be operating on the deleted item anymore.

This will reduce the number of wlocks taken dramatically, and delete is hopefully the least frequent operation in practice.

But even then, performance is already better than now.
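A rough sketch of that split (unlink_atomically, push_free_list, and free_list_reclaim_unreferenced are hypothetical helpers; TRY_WLOCK()/WUNLOCK() mirror the locking macros mentioned above):

/* Delete: unlink atomically and defer reclamation; no wlock needed. */
void entry_delete(entry_t **bucket, entry_t *victim) {
    unlink_atomically(bucket, victim);   /* as in list_delete() above */
    push_free_list(victim);              /* deferred; not freed yet */
}

/* GC: runs infrequently.  Taking the write lock once is the
   synchronize step: once acquired, no reader that could have seen a
   now-unlinked entry is still inside the list. */
void gc_run(void) {
    if (!TRY_WLOCK())
        return;                          /* fine to skip this round */
    WUNLOCK();                           /* lock + unlock is the barrier */
    free_list_reclaim_unreferenced();    /* free entries with ref_count == 0 */
}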

For updates, the same is true:

Because updates use the normal incref, we can just use cmpxchg() with old_data and new_data.
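For example (a sketch; assumes the entry's data pointer is only ever swapped atomically, and push_free_list_data is a hypothetical helper):

void entry_update(entry_t *entry, data_t *old_data, data_t *new_data) {
    /* Swap the data pointer in one atomic step; readers see either
       the old or the new data, never a mix. */
    if (__sync_bool_compare_and_swap(&entry->data, old_data, new_data))
        push_free_list_data(old_data);   /* old data reclaimed later by GC */
}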

Again we have the GC problem:

  • A process could still be stuck just before increasing the inc_ref()

Again, we just take the WLOCK() once during GC and ensure all inc_ref() operations have completed.
Then we check for ref_count == 0.

The HUGE advantage of the barrier-based approach is that in ~99% of cases we do not take the WLOCK(), and when we do take it, we directly unlock it again. It's only used to synchronize, i.e. to ensure that all readers have left the critical section.

And that gives us kinda RCU-like semantics without needing an extra library.

And given that we already work pretty okay with reader-writer locks, I presume that is all that is needed for our purposes.


In fact, thinking more about it, I think we do not even need the extra data structure, because all we need is to ensure the inc_ref is synchronized, which is guaranteed by the WLOCK().

Therefore, read-lock-protected lists + refcount-protected entries + a WLOCK() to synchronize during GC will work.


And if we ever wanted to implement EPOCH + ref count, that would also be simple based on this -- as long as we have mutual exclusion on the GC:

The implementation of RLOCK / RUNLOCK would be:

int epoch = atomic_read(&global_epoch);
// -- barrier
atomic_inc(&lock[epoch]);
// --- list operation
atomic_dec(&lock[epoch]);

GC:

// enter mutex CS

// cmpxchg() the global gc list with a local one; gc'd items carry a gc_next pointer

int old_epoch = global_epoch;
int epoch = (old_epoch + 1) % 2; // As we busy-wait, we only need two states ...

// The atomic operation only ensures that all threads see the new value.
cmpxchg(&global_epoch, old_epoch, epoch);

while (lock[old_epoch] > 0) {
  // wait ...
  usleep(0);
}

// Now GC all entries with ref_count == 0

// leave mutex CS

The busy waiting is not nice, but for our use case it is potentially more performant than the RWLOCK, as processes should leave the read-locked CS pretty fast.

It's probably not more performant than pure epoch or RCU, but it should be good enough for our use case and definitely better than what we have now :).
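Put together, a compilable sketch of this two-state scheme with C11 atomics (shared-memory placement, the GC mutex, and the gc-list stealing are elided):

#include <stdatomic.h>
#include <unistd.h>

static _Atomic int global_epoch;
static _Atomic int lock[2];          /* per-epoch reader counts */

int rlock(void) {                    /* returns the epoch we entered under */
    int e = atomic_load(&global_epoch);
    atomic_fetch_add(&lock[e], 1);   /* doubles as the barrier */
    return e;
}

void runlock(int e) {
    atomic_fetch_sub(&lock[e], 1);
}

void gc(void) {
    /* caller holds the GC mutex and has stolen the global gc list */
    int old_epoch = atomic_load(&global_epoch);
    int epoch = (old_epoch + 1) % 2;
    atomic_compare_exchange_strong(&global_epoch, &old_epoch, epoch);
    while (atomic_load(&lock[old_epoch]) > 0)
        usleep(0);                   /* busy-wait for stragglers */
    /* now free all stolen entries with ref_count == 0 */
}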


Oh, I see how to implement epochs with even less busy waiting; we just need 3 GC lists:

// enter mutex CS

int old_epoch = global_epoch;
int epoch = (old_epoch + 1) % 3;
int check_epoch = (old_epoch == 0 ? 2 : old_epoch - 1);

// The atomic operation only ensures that all threads see the new value.
cmpxchg(&global_epoch, old_epoch, epoch);

// We still wait, but if GC is not called too frequently, this will always succeed in practice.
while (UNLIKELY(lock[check_epoch] > 0)) {
  // wait ...
  usleep(0);
}

// Now GC all entries with ref_count == 0 from gc_list[check_epoch].
// Their ref_count being 0 is very likely, as much time has passed since they were added to the gc_list.

local_gc_list        = { entry ∈ gc_list[check_epoch] | entry.ref_count == 0 }
gc_list[check_epoch] = { entry ∈ gc_list[check_epoch] | entry.ref_count > 0 }

// leave mutex CS

// Now clean up the local gc list
gc_real(local_gc_list)

Usually this would not be enough, but as we use reference counters, it does work.

In case we never hand entries out to PHP, obviously just holding the read lock longer would work, too.


LionsAd commented Nov 25, 2018

The theoretical read-only performance of lockless is 0.3 vs. 1.5 seconds [current] in my benchmark with 21 (or 210) limited keys and a 50000-iteration loop, i.e. 5x faster.

Using atomic_inc / atomic_dec on a single global counter (a bottleneck) reduces that to 0.5 seconds, so "only" 3x faster.

[Ironically, a++ without ATOMIC_INC can actually be slower. The reason is likely that the cache line is first read (cache miss) and then updated, vs. exclusively locked and updated in one step.

That part, however, only feels relevant for a synthetic setup.]

Using atomic_inc / atomic_dec on the per-entry reference counts has no measurable performance effect in this benchmark, however, so splitting up the locks per hash table bucket (or at a larger granularity) is still worth it -- even for a GC version with lockless updates using RCU or similar.
