Alternative lockless Mutex implementation (2nd try) #11520

Closed
bill-myers wants to merge 9 commits

Conversation

@bill-myers
Contributor

This is yet another attempt to create a satisfying implementation of Mutex (first in #11462, second in #11509), opened because I believe I have managed to devise a working lockless algorithm.

I think the algorithm is correct, but please check; if it is, then this approach has none of the drawbacks of the previous pull requests.

This code is based on Alex Crichton's code, but replaces the "held" AtomicBool variable with a more sophisticated MutexState struct consisting of an atomic (lockers, queue_size) pair.

Also, we replace the intrusive MPSC queue implementation with one using list reversal, to remove the need to yield there and the need to store a stub node in the Mutex.
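As a rough illustration only (this is not the code in the PR; the field layout, names, and memory orderings are assumptions), the packed (lockers, queue_size) pair described above can be manipulated with single atomic operations and small CAS loops along these lines, written against today's std::sync::atomic API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical layout: queue_size in the high half of the word, lockers in
// the low half. Names and layout are illustrative, not taken from the PR.
const HALF: u32 = usize::BITS / 2;
const LOCKERS_MASK: usize = (1 << HALF) - 1;

struct MutexState(AtomicUsize);

impl MutexState {
    // Announce one more task trying to lock. If `lockers` was zero before,
    // the caller is first in line and may take the OS mutex directly.
    fn add_locker(&self) -> bool {
        let prev = self.0.fetch_add(1, Ordering::Acquire);
        (prev & LOCKERS_MASK) == 0
    }

    // Move the current task from "locker" to "queued" in one atomic step:
    // decrement `lockers` and increment `queue_size` together, so other
    // threads always observe a consistent pair.
    fn locker_to_queued(&self) {
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let next = cur - 1 + (1 << HALF);
            match self.0.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => return,
                Err(observed) => cur = observed,
            }
        }
    }
}
```

This is only meant to show the shape of the technique: because both counters live in one word, every transition appears to other threads as a single consistent update.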

Features:

  • No yields!
  • Only one OS mutex
  • Smallest Mutex struct size ever, since the new MPSC implementation no longer needs a stub node
  • Fair even when combining native and green tasks
  • Provides the "can zap the Mutex just after unlock" guarantee, since we can now provide it for free

Non-features:

  • Recursive locking will deadlock; this can be fixed, but it would enlarge the mutex or require an OS mutex whose trylock can detect recursion (then again, you can already deadlock by taking two mutexes in opposite orders from two threads, so I'm not sure this is essential)

Non-issues:

  • Uses cmpxchg loops rather than loop-free atomic operations; however, the previous version used yield loops
  • Green tasks can block on the OS mutex, but only when it is known that another thread is about to release it; again, the previous version used yield loops for the same purpose, which is worse
  • unlock is O(n) in the number of contending tasks in the worst case, due to the list reversal (sketched below) when green tasks are used; however, mutex lock is already fundamentally O(n) due to the obvious contention properties of mutexes, so this shouldn't really matter
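To make the list-reversal point concrete, here is a minimal sketch of the general technique (not the PR's actual queue): producers push nodes onto a lock-free LIFO list, and the single consumer detaches the whole list with one swap and reverses it to recover FIFO order, so no stub node is needed.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

// Multi-producer, single-consumer queue: producers push onto a lock-free LIFO
// list; the single consumer detaches the whole list with one swap and reverses
// it to recover FIFO order. (Send/Sync impls and a Drop impl for leftover
// nodes are omitted for brevity.)
struct MpscQueue<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> MpscQueue<T> {
    fn new() -> Self {
        MpscQueue { head: AtomicPtr::new(ptr::null_mut()) }
    }

    // Any thread may push concurrently.
    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut cur = self.head.load(Ordering::Relaxed);
        loop {
            unsafe { (*node).next = cur };
            match self.head.compare_exchange_weak(cur, node, Ordering::Release, Ordering::Relaxed) {
                Ok(_) => return,
                Err(observed) => cur = observed,
            }
        }
    }

    // Single consumer only: take everything pushed so far, in FIFO order.
    fn pop_all(&self) -> Vec<T> {
        // Detach the whole LIFO list in one atomic swap.
        let mut cur = self.head.swap(ptr::null_mut(), Ordering::Acquire);
        let mut out = Vec::new();
        while !cur.is_null() {
            let boxed = unsafe { Box::from_raw(cur) };
            let node = *boxed; // move the node out of its heap allocation
            cur = node.next;
            out.push(node.value);
        }
        out.reverse(); // LIFO order -> FIFO order: this is the O(n) reversal
        out
    }
}
```

The reversal in pop_all is the O(n) step mentioned above; it happens at most once per batch of queued tasks.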

The only known issue is that, as implemented now, this only works with up to 2^16 tasks on 32-bit and up to 2^32 tasks on 64-bit, since it needs to store two task-count fields in a single atomic variable.

This shouldn't be a problem for two reasons:

  1. As long as tasks use at least 64KB (in user stack, kernel stack, etc.), and as long as 64-bit platforms only provide 48-bit addressing (as x86-64 does right now), the 2^16/2^32 limitation cannot be overrun: 2^16 tasks × 64KB already fills the entire 4GB 32-bit address space, and 2^32 tasks × 64KB fills the entire 2^48-byte space
  2. Anyway, Pentium Pro and later and ARMv6 and later provide 64-bit atomics in 32-bit mode (cmpxchg8b, ldrexd/strexd, etc.), and x86-64 processors beyond the early Opterons, as well as ARMv8, provide 128-bit atomics in 64-bit mode (cmpxchg16b), so it's only a matter of implementing support for and using those if desired

Currently Rust uses 2MB stacks by default, so programs that don't specify the stack size manually already cannot trip this limitation.

@omasanori
Contributor

> you can anyway deadlock by taking two mutexes in opposite orders from two threads

Couldn't we provide something like std::lock in C++? (maybe off-topic, though)

@bill-myers bill-myers closed this Jan 13, 2014
@bill-myers bill-myers reopened this Jan 13, 2014
@bill-myers
Contributor Author

> Couldn't we provide something like std::lock in C++? (maybe off-topic, though)

Maybe as a macro (a sketch of the idea follows below); in any case, it doesn't help with the user incorrectly calling it recursively unless you ban all recursive locking (which is limiting, since it prevents composition).

That said, I'm not sure whether we should detect recursion, or even have the option of allowing recursion, and it's mostly independent of the mutex implementation algorithm.
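For illustration only, here is one way such a helper could look in today's Rust: lock the two mutexes in a globally consistent (address-based) order, so that two threads locking the same pair in opposite textual order cannot deadlock. C++'s std::lock itself uses a try-and-back-off strategy instead; the function lock_both here is hypothetical, not an existing API.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical helper (not a std API): acquire two mutexes in a globally
// consistent, address-based order so that two threads locking the same pair
// in opposite textual order cannot deadlock. Passing the same mutex twice
// still self-deadlocks; that is the recursion caveat discussed above.
fn lock_both<'a, T, U>(
    a: &'a Mutex<T>,
    b: &'a Mutex<U>,
) -> (MutexGuard<'a, T>, MutexGuard<'a, U>) {
    let addr_a = a as *const Mutex<T> as usize;
    let addr_b = b as *const Mutex<U> as usize;
    if addr_a <= addr_b {
        let guard_a = a.lock().unwrap();
        let guard_b = b.lock().unwrap();
        (guard_a, guard_b)
    } else {
        let guard_b = b.lock().unwrap();
        let guard_a = a.lock().unwrap();
        (guard_a, guard_b)
    }
}
```

As noted above, this addresses lock ordering between distinct mutexes but does nothing for recursive locking of the same mutex.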

@alexcrichton
Member

This looks promising, thanks for implementing this!

I'll take a look at this tomorrow.

@alexcrichton
Member

In general, I'd prefer a cleaner history, so could you replace the original mutex commit with your mutex commit? Dealing with the fallout will probably be similar.

@@ -0,0 +1,111 @@
/* Copyright (c) 2010-2011 Dmitry Vyukov. All rights reserved.
Member

This license is the wrong license if you've re-implemented it.

I'm certainly no lock-free expert, so to me this queue looks correct, but I don't consider myself qualified to officially say so. Do you know of any external implementations of this flavor of queue that I could look at as well? Any literature/papers would also be a good reference point.

Contributor Author

Well, it has been implemented by drastically modifying the previous version, so whether it's a derived work or not could be debatable.

The conditions say that the copyright notice must not be removed if it is a derived work, so the simplest course of action is to leave it as is, and add an extra copyright line if desired.

Member

If this is implemented from memory, I'm much more uncomfortable with it than with using the previous version. Why did you re-implement the queue? Was it to remove the Thread::yield_now() in the inconsistent case?

// have or are about to take the OS mutex using a blocking lock call.
//
// It is now as fair as the OS mutex allows, even when mixing green and native tasks,
// since native threads will queue like green tasks, if any green task is queued.
Member

I'd like to see a lot more explanation here. There's still no mention of the general protocol, or of the general idea of why native tasks are blocked on the queue. This also doesn't justify why it's OK to block native tasks on the queue.

Contributor Author

Otherwise native tasks will either always have priority over green tasks, or green tasks will always have priority over native tasks, depending on whether unlock prefers to unlock or dequeue when lockers > 1 && queue_size != 0.

@bill-myers
Contributor Author

> I don't think this is true. Each thread will have only one &mut pointer, the problem is that it is aliased across threads. That has nothing to do with LLVM optimizations.

First of all, &mut guarantees that it is not aliased AT ALL, that's the whole point of it!

If aliased &mut between tasks are allowed in general then memory safety and freedom from data races are both broken.

Also, it's incorrect that each thread will have only one &mut to the Mutex.

Let's say you have a MutexArc implemented using Mutex and make a clone of it.

Now, you lock the first one, which returns the guard object, which must ultimately contain a LockGuard and thus an &mut to the Mutex.

Then you recursively attempt to lock the clone: at this point that recursive lock operation is using an &mut which is the same as the one in the LockGuard of the first lock.

This breaks Rust semantics.
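In today's std types, the shape of that scenario is roughly the following (std::sync::Arc and Mutex are used purely for illustration; in the 2014 API under discussion, lock took &mut self, so the second lock would alias the &mut held inside the first guard):

```rust
use std::sync::{Arc, Mutex};

fn main() {
    let first = Arc::new(Mutex::new(0u32));
    let second = Arc::clone(&first); // two handles, one underlying mutex

    let guard = first.lock().unwrap();
    // With the guard from `first` still alive, locking `second` reaches the
    // same mutex again from the same thread; a non-recursive mutex
    // self-deadlocks here, and in an API where lock takes &mut self the
    // second lock would also alias the &mut held inside the first guard.
    // let guard2 = second.lock().unwrap(); // would deadlock if uncommented
    drop(guard);

    let guard2 = second.lock().unwrap(); // fine once the first guard is gone
    drop(guard2);
}
```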

> Again, any illusion of safety is purely an illusion, there is no code in the mutex that is truly safe

The Mutex is safe to use (except for destroy, which is unsafe and maybe shouldn't be there in the first place).

Let's say you have tasks and you want each to make three beeps on the computer speaker at a distinctive frequency, by calling a hypothetical speaker::beep(frequency: uint) three times.

You don't want the beeps to interleave, so you put a Mutex in an Arc, and use that to make sure it doesn't happen.

This is a perfectly valid and memory-safe use for Mutex that doesn't need unsafe code.

> I believe that usage of &mut will protect against recursive locking if you're using a Mutex

No, because if you are locking the Mutex as part of a MutexArc, for instance, then the MutexArc is clonable so you can do the recursive locking by locking the other clone.

If you have a "naked" Mutex to which other paths cannot be created, you can't have several tasks simultaneously accessing it, which makes it pointless to use one.

> > For comparison, Cell::set also mutates the underlying value, but it doesn't take an &mut self.

> This is an entirely different use case where you're capturing interior mutability, it has nothing to do with mutexes or atomics. Using a Cell is considered a smell and all of Rust favors inherited mutability.

Cell and Atomic (let's assume the latter exists) are both mechanisms for mutating aliased memory; the only difference is that for Cell, sharing the aliases between tasks either must not be allowed, or will result in undefined values.

Cell is thus strictly more restrictive in its preconditions than Atomic, so it makes no sense for Cell to require a self pointer that provides fewer guarantees than Atomic's.

And of course, it breaks the non-aliasing property of &mut as already argued.

That said, the fix to the mutability of self is mostly separate from the rest, but it conflicts with the rest of the code in a non-trivial way (because lock as implemented no longer borrow-checks with &mut self), so I think introducing Mutex with the correct &self pointers is simpler than doing it in two commits.

@alexcrichton
Member

> First of all, &mut guarantees that it is not aliased AT ALL, that's the whole point of it!

Correct. That is not the only point of it. For atomics we are using it to indicate "this is being mutated"

> If aliased &mut between tasks are allowed in general then memory safety and freedom from data races are both broken.

Correct. That is why it's unsafe to create multiple &mut aliases to an atomic.

> Let's say you have a MutexArc implemented using Mutex and make a clone of it.

The point that you're missing is that it's unsafe to clone it.

> This breaks Rust semantics.

So does sharing the memory of the mutex itself, but that's what must be done.

> The Mutex is safe to use (except for destroy, which is unsafe and maybe shouldn't be there in the first place).

Correct. It must be unsafely shared. This is not a first-class primitive for everyone everywhere to use. This is a primitive to build other concurrency primitives. For example, a MutexArc would contain one of these inside of it, and it would protect some shared data. You should never be interacting with a Mutex directly.

> No, because if you are locking the Mutex as part of a MutexArc, for instance, then the MutexArc is clonable so you can do the recursive locking by locking the other clone.

Again, cloning a MutexArc is unsafe in terms of the usage of the mutex.

If you want to remove &mut self in atomics, please do so in a separate PR or issue; it is not up for debate as part of this PR (it has enough going on as-is). If you would like to debate whether Mutex should take &self or &mut self: I think it should take &mut self, because that's what it's truly doing, and it doesn't really matter from a usability perspective because no one should be using this directly anyway.

@bill-myers
Contributor Author

> Correct. That is why it's unsafe to create multiple &mut aliases to an atomic.

I think having multiple &mut aliases is not merely unsafe: it needs to be defined as resulting in undefined behavior if they are actually used, and thus unsafe code must not do so, just as unsafe code must not transmute random integers or zero to &T or ~T and then access those pointers, and so on.

The alternative of having multiple &mut aliases not be undefined behavior means that Rust programs will be slower because some optimizations cannot be applied (specifically, telling LLVM using TBAA or another mechanism that there cannot be any aliases and thus that memory accesses can be moved with no restrictions and that loads can be freely merged or duplicated).

Anyway, it's pretty easy to remove all the changes from & -> &mut self if desired.

@bill-myers
Contributor Author

Unfortunately, I'm not sure if I'll have time to do further work myself on this.

I believe the current code in the pull request is an acceptable design and mergeable, except for potential bugs, cosmetic issues, and the fact that things like the &mut change and the Mutex/Cond split could perhaps be split into separate commits (at the cost of more "churn" in the series as things are written and then changed by subsequent commits).

We might want to switch to 64-bit atomics on 32-bit and perhaps 128-bit atomics on 64-bit, but that can be done later, since the task limit can't be hit anyway without non-default stack sizes so small that calling C functions would be impossible.

Another thing to do is to replace critical sections and events with SRW locks and condition variables on Windows Vista and later: both of those structures are pointer-sized and statically initializable, just like the atomic uints in the current code, so it's trivial to switch at runtime between the current code on Windows XP and reinterpreting those fields as SRW locks and condition variables on Vista and later.

Finally, an rwlock is probably needed as well; it should be possible to implement one by extending the algorithm to use a (queue_size, readers, writers) tuple and storing the read/write mode in the queued nodes (64/128-bit atomics will allow the triple to fit); a sketch of such a packed word follows below.

Native rwlocks don't always support a truly fair FIFO mode (the one in Linux glibc NPTL, at least, lets either all readers or all writers get ahead), so it might be a good idea to let the user specify which they prefer, or whether they want the rwlock to be FIFO, and to avoid blocking native tasks on the native rwlock at all if the user selects FIFO.
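Purely as an illustration of the packing idea above (field widths, names, and orderings are assumptions, not code from this PR), a (queue_size, readers, writers) word could be manipulated like this:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative field layout: 21 bits per counter packed into one 64-bit word.
const FIELD_BITS: u32 = 21;
const FIELD_MASK: u64 = (1 << FIELD_BITS) - 1;

struct RwState(AtomicU64);

impl RwState {
    fn pack(queue_size: u64, readers: u64, writers: u64) -> u64 {
        (queue_size << (2 * FIELD_BITS)) | (readers << FIELD_BITS) | writers
    }

    fn unpack(word: u64) -> (u64, u64, u64) {
        (word >> (2 * FIELD_BITS), (word >> FIELD_BITS) & FIELD_MASK, word & FIELD_MASK)
    }

    // Fast path for a read lock: register one more reader only when no writer
    // holds the lock and nothing is queued (so queued writers are not starved);
    // otherwise the caller falls back to the queueing slow path.
    fn try_add_reader(&self) -> bool {
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let (queued, readers, writers) = Self::unpack(cur);
            if writers != 0 || queued != 0 {
                return false;
            }
            let next = Self::pack(queued, readers + 1, writers);
            match self.0.compare_exchange_weak(cur, next, Ordering::Acquire, Ordering::Relaxed) {
                Ok(_) => return true,
                Err(observed) => cur = observed,
            }
        }
    }
}
```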

@alexcrichton
Member

> Unfortunately, I'm not sure if I'll have time to do further work myself on this.

Would you like me to take it from here?

@bill-myers
Contributor Author

> Would you like me to take it from here?

Yes, if you'd like, thanks.

I've opened issue #11583 for mutability in atomics (I used an issue since the code is not complete: it doesn't fix the actual intrinsics).

@alexcrichton
Member

Thanks again for all your work!

@alexcrichton
Member

Closing for now; I'll reopen once I've gotten through all the rebasing.

@alexcrichton
Member

I got around to benchmarking this code, and here are the results I got. The cxchg columns are this mutex and the held columns are the original version that I wrote. Each test was run with 1, 2, 4, 6, and 8 threads, and the test was simply locking a mutex N times and then immediately unlocking it. I used this test to get an idea of the overhead of using a lock, and also of what happens under contention with lots of threads running around.
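For reference, a lock/unlock micro-benchmark of the kind described would look roughly like this in today's Rust (illustrative only, using std::sync::Mutex rather than the harness actually used for these numbers):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

// `threads` workers each lock and immediately unlock a shared mutex `iters`
// times; the whole run is timed.
fn bench(threads: usize, iters: usize) {
    let mutex = Arc::new(Mutex::new(()));
    let start = Instant::now();

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let m = Arc::clone(&mutex);
            thread::spawn(move || {
                for _ in 0..iters {
                    let _guard = m.lock().unwrap(); // lock ...
                } // ... and immediately unlock when the guard drops
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    println!("{} threads: {:?}", threads, start.elapsed());
}

fn main() {
    for threads in [1usize, 2, 4, 6, 8] {
        bench(threads, 1_000_000);
    }
}
```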

The caveat here is that the green numbers aren't too significant. We currently need to seriously tune the green schedulers for workloads like this, because the numbers are all over the place. The numbers provided here are all from runs with RUST_THREADS=1, which essentially serializes the whole program (Rust tasks never yield to one another). Again, these numbers don't mean much in terms of green, but they do in terms of native.

In this table, these are all runtimes, so lower is better. All tests were run in an Arch Linux VM with 4 cores.

| # threads | pthreads      | native held   | native cxchg   | green held | green cxchg |
|-----------|---------------|---------------|----------------|------------|-------------|
| 1         | 0.207         | 0.195         | 0.358          | 0.204      | 0.325       |
| 2         | 1.113         | 1.098         | 1.644          | 0.394      | 0.644       |
| 4         | 3.437 - 4.530 | 2.945 - 4.199 | 8.066 - 10.656 | 0.774      | 1.284       |
| 6         | 4.859 - 6.070 | 5.547         | 14.090         | 1.136      | 1.909       |
| 8         | 7.009         | 6.391         | 17.365         | 1.614      | 2.575       |

My conclusions from this are that doing pretty much any atomics beyond pthread_mutex_lock itself is just not going to cut it. If we truly want to get right next to pthreads performance, we're going to have to pretty much eliminate all extra atomics. The green threads aren't a good enough case to look at just yet, because scheduling among them is in such an un-tuned state right now that there's no way we can compare alternate mutex implementations.

With this in mind, I'm going to abandon this implementation and go back to the original version. Once the green scheduler is tuned, we can revisit to see what the impact is for green threads in finer detail.

@brson
Contributor

brson commented Jan 17, 2014

@alexcrichton Thanks for the measurements. That's a little disappointing since I liked the elegance of this solution.

@bill-myers
Contributor Author

Added an optimization to the cmpxchg loops, plus benchmarks; see https://github.com/bill-myers/rust/tree/mutex-lockless for the updated code.
