Use consume #71

Closed
wants to merge 7 commits into from

Conversation

@pitdicker (Contributor) commented Nov 2, 2019

The Consume atomic ordering is meant for data structures that are rarely written but often read. That sounds like a description of once_cell 😄. On all modern processors it can be implemented without using any synchronization fences.

This PR is an experiment to see whether it really gives a performance improvement. It is very much a proof of concept, not cleaned up or intended for merging.

To use consume ordering, there has to be a data dependency between the atomic and the data it synchronizes. In other words: without reading the atomic it must be impossible for this thread to know where the data lives that has to be synchronized.

The queue of Waiters in the std implementation is a perfect fit: the atomic contains a pointer to the first node, and every node contains a pointer to the next. This is a textbook example of a data-dependency chain that could be used with Consume. But as that part is not performance-sensitive, we can just leave it as is.
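For illustration, this is roughly the shape of such a waiter queue (a hypothetical sketch with made-up names, not the actual types from the std implementation):

```rust
use std::sync::atomic::AtomicPtr;

// Hypothetical sketch of the waiter-queue shape described above.
struct Waiter {
    // Each node points to the next one; the only way to reach a node is
    // through the atomic `head`, which is what forms the dependency chain.
    next: *const Waiter,
    // ...thread handle, "signaled" flag, etc. omitted
}

struct WaitQueue {
    // Pointer to the first waiter node (null when nobody is waiting).
    head: AtomicPtr<Waiter>,
}
```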

To create a data dependency of value on state, I made state encode an offset from the address of the OnceCell struct to the value field. Encoding a raw pointer is not possible, as the OnceCell can be moved. This trick relies on the compiler not to change the layout of an initialized struct (which seems like something we can rely on).

Rust doesn't expose an Ordering::Consume, but nothing special is needed: just using Ordering::Relaxed is enough, and that is what Crossbeam uses. We then have to use the loaded value as a pointer, array index, pointer offset, etc.
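As a rough illustration of the read path this describes (hypothetical names, not the PR's actual code, and note the caveats about compiler optimizations raised later in this thread):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicIsize, Ordering};

// Hypothetical sketch: `state` is negative while uninitialized, and holds
// the byte offset of `value` within the struct once initialization is done.
struct ConsumeCell<T> {
    state: AtomicIsize,
    value: UnsafeCell<Option<T>>,
}

impl<T> ConsumeCell<T> {
    fn get(&self) -> Option<&T> {
        // Relaxed load; the synchronization is supposed to come from the
        // fact that the address read below is computed from the loaded
        // value (an address/data dependency).
        let offset = self.state.load(Ordering::Relaxed);
        if offset < 0 {
            return None;
        }
        // Derive the address of `value` from the loaded offset.
        let ptr = (self as *const Self as *const u8).wrapping_offset(offset)
            as *const UnsafeCell<Option<T>>;
        unsafe { (*(*ptr).get()).as_ref() }
    }
}
```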

To test that the ordering works, I added a break_it test in examples. It can reliably detect insufficient synchronization when I run it in release mode on my phone, which is easy using dinghy. If I intentionally weaken the synchronization of the current implementation, it detects a few races; with consume it works. To make sure it doesn't work by accident, I turned all other orderings down to the weakest possible for now.

In the existing benchmark, which does nothing but read from a OnceCell in a loop, there is no real difference I can measure. In a changed benchmark where all threads also do some work, like writing some data, a difference becomes clearer. These are the results of 10 runs of the modified bench.rs:

| x86_64 acquire | x86_64 consume | aarch64 acquire | aarch64 consume |
|---|---|---|---|
| 171.6ms | 107.5ms | 204.5ms | 155.2ms |
| 158.7ms | 115.4ms | 204.1ms | 179.5ms |
| 184.3ms | 108.0ms | 207.7ms | 156.3ms |
| 107.2ms | 116.3ms | 204.0ms | 181.0ms |
| 143.2ms | 114.3ms | 204.0ms | 181.6ms |
| 164.5ms | 116.5ms | 204.8ms | 154.3ms |
| 113.6ms | 114.8ms | 205.9ms | 157.3ms |
| 107.4ms | 116.1ms | 203.7ms | 155.0ms |
| 181.0ms | 114.6ms | 205.5ms | 177.1ms |
| 176.0ms | 105.2ms | 202.2ms | 155.7ms |

To make state in the std implementation encode everything, I set up the following scheme (a rough sketch of the encoding follows the list below):

| state value | meaning |
|---|---|
| >= 0 | INITIALIZED (offset to data) |
| -1 | EMPTY |
| -2 | RUNNING |
| < -2 | RUNNING, encoded waiter pointer: -(waiterptr >> 1) |
- One goal was to make the check on whether a OnceCell is initialized cheap: just a comparison against zero, with no bitmasks or even another integer to compare against.
- The offset in the initialized case can be any value in the range 0..isize::MAX. It is doubtful such a large struct can even be allocated, let alone have padding that large! The maximum alignment that can be specified using #[repr(align)] is 2^29, and with large alignments the compiler currently changes the layout of the struct to put state after value, which gives an offset of 0. Anyway, reserving this whole range seems like a safe enough choice.
- All negative values indicate the OnceCell is not initialized. We can control the alignment of the Waiter nodes, which I set to 8. This causes the pointer to a Waiter node to be in the range 8..usize::MAX. After encoding it with -(waiterptr >> 1) it will be in the range -(4..isize::MAX). That still leaves -1, -2 and -3 to encode the states EMPTY and RUNNING (without other threads waiting).
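A minimal sketch of that encoding, with made-up constant and function names (not the PR's actual code):

```rust
const EMPTY: isize = -1;
const RUNNING: isize = -2;

// The cheap "is it initialized?" check: a single comparison against zero.
fn is_initialized(state: isize) -> bool {
    state >= 0
}

// Waiter nodes are aligned to 8, so a real waiter pointer is >= 8 and the
// encoded value -(ptr >> 1) always ends up below -3, leaving -1/-2/-3 free.
fn encode_waiter(ptr: usize) -> isize {
    debug_assert!(ptr >= 8 && ptr % 8 == 0);
    -((ptr >> 1) as isize)
}

fn decode_waiter(state: isize) -> usize {
    debug_assert!(state < RUNNING);
    ((-state) as usize) << 1
}
```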

I will start cleaning the code up and add more comments. Don't put too much time into reviewing yet. But I am interested in your opinion! Is this worth pursuing?

@pitdicker (Contributor, Author) commented:

Added a commit that does the same thing for the parking_lot implementation.

@matklad (Owner) commented Nov 2, 2019

This definitely sounds really exciting! I must confess that the only thing I know about the Consume ordering is that it exists. So I won't be able to review this, for that reason (and there's a problem that I have a ton of PRs to review at rust-analyzer as well...).

I wonder what the best way to proceed here is. Even if I were able to convince myself that consume is OK here, I wouldn't feel confident just changing the implementation. I want once_cell to be, first and foremost, a conservative and dependable crate, and doing some weak memory order shenanigans seems risky.

At the same time, I would very much like to unlock and encourage experimentation here. Perhaps we can hide this behind a feature flag? That way, we can actually publish this for folks to easily try it out.

@pitdicker (Contributor, Author) commented:

> I want once_cell to be, first and foremost, a conservative and dependable crate, and doing some weak memory order shenanigans seems risky.

Big 👍.

Shall I make a separate PR with the two extra tests/benchmarks first?

I think rust-lang/rust#65719 should allow others to take care of some review work. If the standard library likes some of the changes, that would give a little more confidence for changes to the atomic orderings in once_cell? I would like to make a PR at some point that at least removes the UB from aliasing a mutable pointer.

> At the same time, I would very much like to unlock and encourage experimentation here. Perhaps we can hide this behind a feature flag? That way, we can actually publish this for folks to easily try it out.

Thank you, good to hear!

Using consume ordering requires a different abstraction between lib.rs and imp_std/imp_pl. Shall I try to figure something out that works for the current implementation and for something using consume, and make a PR for that first?

BTW, what do you think about a vendored MaybeUninit module like parking_lot now has? Otherwise I will probably have to duplicate the unchecked unwrapping of an Option in a couple of places, or make it into some function. It seems nicer to instead move towards what the future is going to be.

> and there's a problem that I have a ton of PRs to review at rust-analyzer as well...

Thank you for all the work. This is just a bit of a hobby for me.

@matklad (Owner) commented Nov 2, 2019

> Shall I make a separate PR with the two extra tests/benchmarks first?

Yep, tests and benches are always useful!

> I would like to make a PR at some point that at least removes the UB from aliasing a mutable pointer.

👍

> Using consume ordering requires a different abstraction between lib.rs and imp_std/imp_pl. Shall I try to figure something out that works for the current implementation and for something using consume, and make a PR for that first?

I think I would prefer for this to be a completely separate module under a cfg flag, even if it means some code duplication. So, cfg-ing the top-level sync module seems like it would do the trick.

> BTW, what do you think about a vendored MaybeUninit module like parking_lot now has?

I see that it has a T: Copy bound, so I feel like we won't be able to use this. But if it works, 👍 from me.

> Thank you for all the work. This is just a bit of a hobby for me.

I believe you have commit access to the repo, so feel free to move this project in directions you find interesting, as long as it requires opt-in via a feature flag :) For things that change the "official" implementation I hope to provide reviews without too much lag. Also, how would you feel about implementing the once_cell RFC (sending a PR to rust-lang/rust and pestering bors and T-lib until it is merged) ^^?

@pitdicker (Contributor, Author) commented:

> I think I would prefer for this to be a completely separate module under a cfg flag, even if it means some code duplication. So, cfg-ing the top-level sync module seems like it would do the trick.

I hope the necessary changes are minor, but that is also a good option. Would you be okay with splitting the unsync and sync modules out of lib.rs?

> I see that it has a T: Copy bound, so I feel like we won't be able to use this. But if it works, 👍 from me.

I hit that problem right around the time of your comment. Now I took an approach where a vendored MaybeUninit is just a wrapper around Option. Terrible.
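Roughly something like this (a hypothetical shim to show the idea, not the actual vendored code):

```rust
// Hypothetical shim: a MaybeUninit look-alike that is really just an Option.
pub struct MaybeUninit<T>(Option<T>);

impl<T> MaybeUninit<T> {
    pub const fn uninit() -> Self {
        MaybeUninit(None)
    }

    pub fn set(&mut self, value: T) {
        self.0 = Some(value);
    }

    /// Safety: `set` must have been called first.
    pub unsafe fn get_ref(&self) -> &T {
        match self.0 {
            Some(ref value) => value,
            None => std::hint::unreachable_unchecked(),
        }
    }
}
```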

> I believe you have commit access to the repo, so feel free to move this project in directions you find interesting, as long as it requires opt-in via a feature flag :) For things that change the "official" implementation I hope to provide reviews without too much lag.

Thank you! Let's start small...

> Also, how would you feel about implementing the once_cell RFC (sending a PR to rust-lang/rust and pestering bors and T-lib until it is merged) ^^?

No promises yet, but that sounds like fun 😄. But only after some experimenting first.

@matklad (Owner) commented Jan 3, 2020

I'd like to come back to this! #83 made me think that this is actually an API-design question!

Specifically, consider this program:

```rust
use once_cell::sync::OnceCell;
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};

static FLAG: AtomicBool = AtomicBool::new(false);
static CELL: OnceCell<()> = OnceCell::new();

// thread1
CELL.get_or_init(|| FLAG.store(true, Relaxed));

// thread2
if CELL.get().is_some() {
    assert!(FLAG.load(Relaxed))
}
```

With Acquire, the assert will never fire, but with Consume it could (at least, as far as my understanding of memory models goes)! So the synchronization we use is actually part of the crate's public API, and I think this is a case where "why not both" is useful.

Specifically, I think it's important that we do provide the Acquire guarantee by default, but maybe we should have a way to opt out of it. I think a good practical way to do this would be to add a separate top-level consume module, which provides the same signatures as sync, but with weaker synchronization guarantees. We'd also use an optional dependency on crossbeam-utils to get AtomicConsume.

A nice property of this plan is that, as this becomes opt-in for the user, I won't have to worry about breaking clients by relying on implementation-defined semantics. @pitdicker wdyt?
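For illustration, the read path of such an opt-in module could look something like this (just a sketch: the type name and the pointer-based layout are made up here, not once_cell's actual design; only the crossbeam-utils AtomicConsume trait and its load_consume method are real):

```rust
use crossbeam_utils::atomic::AtomicConsume;
use std::sync::atomic::AtomicPtr;

// Illustrative only: a pointer-based cell, null until initialized.
pub struct ConsumeOnceCell<T> {
    inner: AtomicPtr<T>,
}

impl<T> ConsumeOnceCell<T> {
    pub fn get(&self) -> Option<&T> {
        // load_consume is a relaxed load plus whatever the platform needs
        // to preserve the address dependency on `ptr`.
        let ptr = self.inner.load_consume();
        if ptr.is_null() {
            None
        } else {
            unsafe { Some(&*ptr) }
        }
    }
}
```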

@matklad reopened this Jan 3, 2020
@pitdicker (Contributor, Author) commented:

That is a good issue. I wrote my reply on the RFC thread too soon, I'll think some more.

@HadrienG2 commented Jan 25, 2020

Another reason for consume semantics to be optional is that they are very shady territory as far as language specifications go. There's a reason why Rust does not expose Ordering::Consume natively. Even in the C++ world, compilers currently implement memory_order_consume as an alias for memory_order_acquire, and cppreference currently features the following ominous statement on its memory model page:

> The specification of release-consume ordering is being revised, and the use of memory_order_consume is temporarily discouraged.

In particular, the implementation that is proposed in this PR is incorrect, because it fails to take compiler optimizations into account. Given this kind of pseudocode, which the PR's definition of consume morally reduces to...

```rust
let ptr = atomic_ptr.load(Ordering::Relaxed);
let dependent = unsafe { *ptr };
```

...the compiler is allowed to optimize out the load from "ptr", either partially or entirely, and use cached data from previous loads instead. This would violate the synchronization guarantees that consume ordering intends to provide.

Therefore, although this code currently works, it is not guaranteed to keep working in future compiler releases or in all use cases of OnceCell (e.g. for all type parameters T, for all code surrounding the OnceCell calls...), and should not be used.

Crossbeam's AtomicConsume tries to work around this problem by using an Acquire compiler_fence, which should resolve the problem in most cases, but...

- It enforces a stronger compiler memory barrier than desired: all previous loads are prevented from being reordered with all subsequent loads and stores. This is even stronger than an Acquire load, which only affects the load tagged with Acquire ordering and subsequent loads and stores!
- It leaves many important things unspecified. For example, no one can currently tell you precisely how dependencies in code translate into dependencies in the emitted assembly, beyond basic facts that every sane person agrees upon, such as "clearly we can't say that array[idx-idx] depends on idx". What about situations where it's "obvious" to the compiler that a load from memory can be elided? Should you use volatile loads all the time just to be safe?

TL;DR: Consume is currently a memory model specification farce. If you truly need it for performance, please at least make it optional and use (something like) crossbeam's implementation, which tries to tell the compiler that you're doing something shady.
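For reference, crossbeam's load_consume boils down to roughly the following (a simplified sketch under my understanding of crossbeam-utils, which selects the strategy per target architecture):

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

// Approximate shape of crossbeam-utils' load_consume, simplified.
fn load_consume(a: &AtomicUsize) -> usize {
    #[cfg(any(target_arch = "arm", target_arch = "aarch64"))]
    {
        // Relaxed load, then an Acquire *compiler* fence: no CPU fence is
        // emitted, but the compiler may not move later memory accesses
        // above the load.
        let v = a.load(Ordering::Relaxed);
        compiler_fence(Ordering::Acquire);
        v
    }
    #[cfg(not(any(target_arch = "arm", target_arch = "aarch64")))]
    {
        // Fallback: an ordinary Acquire load.
        a.load(Ordering::Acquire)
    }
}
```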
