
internal/refcount.h: remove fence in CRYPTO_DOWN_REF. #6900

Closed
wants to merge 3 commits into from

Conversation

dot-asm
Contributor

@dot-asm dot-asm commented Aug 9, 2018

Fencing is about data visibility in a multi-processor system and
ordering of loads and stores. When the reference counter hits zero,
the current thread is the last one to hold the reference, and it's
actually about to discard the object. It doesn't have to care about
data becoming visible to other processors. Nor is there reason to
assume that speculative loads in the destructor would yield stale
data, because the object was already assumed to be "stable" when it
was used just before the reference counter hit zero.

This is a spin-off from #6883, where it was said that the original code was based on assertions made here. I don't quite accept their premise, hence this request.

First some definitions. What's a "release fence"? Ensuring that all preceding stores are complete and visible to all processors. What's an "acquire fence"? Ensuring that none of the following loads are initiated before the fence. The wording at the referred page suggests that an object is discarded while it's still being used by another thread. But the thing is, if that were actually the case, i.e. the object really is still in use by another thread, then no amount of fencing would help. The whole point of reference counters is that when the current thread observes zero, that is supposed to be actual proof that nobody else is using the object. Which means that all stores are long committed, and one doesn't have to worry about loads. Even if some were speculatively performed earlier, the outcome would be the same: if the reference counter didn't hit zero, the result will be discarded, and if it did hit zero, the result is still valid.

@kroeckx
Member

kroeckx commented Aug 9, 2018

A related thread that I did not fully read yet is:
http://compgroups.net/comp.programming.threads/acquire-barrier-for-atomic-reference/2187894

@kroeckx
Member

kroeckx commented Aug 9, 2018

So I think the current code is correct, and I would like to see a reference to some other document that says this fence is not needed.

@kroeckx
Member

kroeckx commented Aug 9, 2018

This page might also be interesting to read: https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html

@kroeckx
Member

kroeckx commented Aug 9, 2018

So my understanding for the fence is to make sure the free() happens after the reference counter becomes 0.

@dot-asm
Contributor Author

dot-asm commented Aug 10, 2018

So my understanding for the fence is to make sure the free() happens after the reference counter becomes 0.

??? free happens after the reference count hits zero without any fences...

Formally speaking, the only possibility is this. Consider store ptr, if (--ref_count==0) { free(load ptr) }. Now imagine T1:store ptr, T1:speculative load ptr(*), T2:store ptr, T2:--ref_count, T1:--ref_count==0, T1:free("load" ptr). Observing this sequence you can say "that's it, we need a fence in order to get rid of the speculative load." But recall that we are looking at a shared object. There are two possible ways to handle stores to a shared object (other than to the reference counter itself): a) perform stores under lock; b) make those stores and loads atomic by themselves. Now, if it's a), then the real sequence is lock, store ptr, unlock, if (--ref_count==0) { free(load ptr) }. In this case the speculative load(*) is bound to be canceled(*), because unlock incurs a release fence that ensures that data is consistent on all processors. And if it's b), then ptr itself should be fenced accordingly, so that the load is not speculated. Either way, the reference counter itself doesn't have to be fenced.

Of course one can say that above line of reasoning is "practitioner's view", and more specifically "bound to be canceled" might have no formal support in language standard wording...

(*) Just in case, "who" performs speculative load? Most importantly it couldn't have been compiler, it's not in position to move it. And not only because of if, but even because compiler reorders no loads or stores around atomic operations. It could only be hardware! But then note that hardware is also in position to cancel it and re-execute it. And it actually does so when time comes.

@kroeckx
Member

kroeckx commented Aug 11, 2018

My current understanding, which might be wrong (I wish someone properly explained it), is that without it, the compiler is allowed to move the free() before the decrement, and that that would cause undefined behaviour. The fence is only there to prevent the compiler from doing that, so that there is no undefined behaviour; we don't care about the CPU executing that fence, but there is no way to tell the compiler not to do it without that fence.

@kroeckx
Member

kroeckx commented Aug 11, 2018

A copy & paste from that:

True. If the object itself is never accessed in the "delete" branch
(e.g. because the destructor is trivial) then you don't need an acquire
fence for that.

If the destructor is not trivial then presumably it does read the data
in some fashion, and thus needs the acquire fence.

@kroeckx
Member

kroeckx commented Aug 11, 2018

So my current understanding is that the other thread might have written something to that structure we're going to free, and we need to make sure that all data (other than the reference count) written by the other thread is visible if we're going to read it. But I have a hard time thinking of objects for which we could have a reference counter where we modify more than the reference counter itself.

@kroeckx
Member

kroeckx commented Aug 11, 2018

This seems to be the best thread about this issue that I've found:
https://groups.google.com/a/isocpp.org/forum/#!topic/std-discussion/OHv-oNSuJuk

So I think I have to mostly agree with your change, I'm still not really sure about the relaxed vs release thing.

kroeckx
kroeckx previously approved these changes Aug 11, 2018
@kroeckx kroeckx added the approval: done This pull request has the required number of approvals label Aug 11, 2018
@dot-asm
Contributor Author

dot-asm commented Aug 11, 2018

But I have a hard time thinking of objects for which we could have a reference counter where we modify more than the reference counter itself.

The easiest example to imagine is when 'ptr' in the above example is the head of a linked list that you manage in a lock-free manner, with atomic_compare_exchange. I mean, operations on the shared object would involve growing this list, and then the destructor would have to traverse it. We, as OpenSSL developers, don't do this; it's just an example that is [relatively] easy to wrap your head around. But once again, since you would manage the list in a lock-free manner, access to its head in the destructor would have to be fenced accordingly. And fencing around the reference counter would be redundant. Just as it is redundant without writable fields, i.e. always.

without it, the compiler is allowed to move the free() before the decrement

No. Fencing is not about how the compiler orders operations, but about what message is conveyed to hardware. In the sense that the compiler won't ever move free before the decrement, but it can tell the hardware to hold back speculative loads if you ask for acquire (or wait out pending stores if you ask for release). As for operation ordering per se, I fail to see why the compiler would actually be allowed to reorder loads and stores around atomic operations. Indeed, consider the following snippet

long foo(long * __restrict ptr, long val, int * __restrict ref)
{
    *ptr = val;
#ifdef ATOMIC
    __atomic_fetch_add(ref,1,__ATOMIC_RELAXED);
#else
    (*ref)++;
#endif
    return *ptr;
}

If compiled as is, it's compiled as

    mov  val,ret
    mov  val,*(ptr)
    add  1,*(ref)

As you can see, the load of *ptr in the return statement is elided, as the value is copied straight to the return register. But if compiled with the additional -DATOMIC flag, it's compiled as

    mov  val,*(ptr)
    lock add 1,*(ref)
    mov  *(ptr),ret

Note that it refused to cache *ptr and scrupulously preserved the order of loads and stores on the different sides of the atomic operation. Just in case: if you play with values other than __ATOMIC_RELAXED, you won't see any difference in the generated x86 code. If you want it to be more illustrative, use an alpha compiler ;-)

@kroeckx
Member

kroeckx commented Aug 11, 2018

@kroeckx kroeckx dismissed their stale review August 11, 2018 22:10

After looking at that talk again, I'm removing my approval

@kroeckx
Member

kroeckx commented Aug 11, 2018

So my current understanding: The release makes sure that the other thread sees the value before the if; without the release you're not sure when that write will be visible to the other threads, only that it will be decremented at some point. The release does not prevent part of the free() from already being executed before the decrement. If another thread did modify something, and did a release, and we want to read that in the free() call, we need an acquire.

@dot-asm dot-asm removed the approval: done This pull request has the required number of approvals label Aug 12, 2018
@dot-asm
Contributor Author

dot-asm commented Aug 12, 2018

OK, I managed to observe a regular load ending up on different sides of an atomic operation depending on the explicit memory order used, more specifically relaxed vs. anything else. Not with gcc ;-) This means that the assertion that the compiler never reorders non-atomic loads and stores around any atomic was incorrect. Sorry about the confusion. The compiler actually can reorder stuff at least around relaxed operations, and the other memory orders ought to constrain it in one direction. This in turn means that if the compiler[!] does a speculative load prior to the reference counter decrement, it can pick up a field's stale value. Unless[!] the field is atomic. Because if it is (and it should be), then the compiler won't speculatively load it, because the compiler won't reorder atomic operations relative to each other, even relaxed ones. The standard does say at least that. [Note that hardware still can speculatively initiate the load, but then it would have to cancel it.]

The release makes sure that the other thread sees the value before the if,

I'd argue that it's the wrong way to think about it. Release is not about the value, it's really about the preceding stores. Indeed, you have no control over when another thread observes any of the stored values. The only thing you can control is that by the time another thread sees the "magic" value at a specific location, the rest of the object you are releasing is "stable". Maybe it's just me, but a common problem I have with posts, blogs, etc. is that they leave you with the impression that an acquire in one thread waits for a release in another. It doesn't. Both are really about whether or not the current thread leaves behind, or observes, a consistent state. And just like 'release' is about preceding stores, 'acquire' is about following loads.

And there are two sides to "about". One side is what the compiler can reorder, and the other is what it tells the hardware. For example, what's a releasing store on a "weak" architecture? It's a store memory barrier followed by the actual store. Yes, you fence the preceding stores, but not the "magic" store itself. What's an acquiring load? It's a load followed by a load memory barrier.

Note the following asymmetry. Let's say one thread makes a releasing store and another an acquiring load. Now imagine that there are some stores to shared memory locations prior to the releasing store, and some "corresponding" loads after the acquiring load. Now imagine that the acquiring load happens just before the releasing store. Since you have no control over when the stores become visible to the acquiring thread, loads past the acquiring one can pick up new values, which will result in an inconsistent view. So the releasing thread does leave a consistent picture behind, yet that is no guarantee that the acquiring thread observes one. The solution to this kind of situation is of course locked access, i.e. mutexes around these loads and stores. Note that in such case the "magic" load doesn't have to be acquiring, because mutex lock implies at least a load memory barrier, i.e. 'acquire'. Nor does the "magic" store have to be releasing, because mutex unlock implies at least a store memory barrier, i.e. 'release'. "At least" in the last two sentences means that in real life both lock and unlock are full barriers.

... without the release you're not sure when that write will be visible to the other threads, only that it will be decremented at some point.

I'd say that this is exactly the kind of wrong impression mentioned above. You are not sure when the outcome of the reference counter modification becomes visible to another thread, regardless of whether it's releasing, acquiring, both or neither. It's really about the loads and stores around it.

The release does not prevent part of the free() from already being executed before the decrement.

It's a tricky formulation. Does "already be executed" refer to speculative execution by hardware (which can always happen, and is regularly canceled in case of misprediction), or does it refer to the compiler actually scheduling parts of the destructor ahead of the decrement? Just to be clear, if we are talking about the actual free(3) call, the compiler is not in a position to do any of it ahead of the decrement. The only ambiguity is what happens prior to the call, and even more specifically, which pointer is picked up for passing down.

If another thread did modify something, and did a release, and we want to read that in the free() call, we need an acquire.

As mentioned earlier, if one is in the business of field modifications (other than the reference counter), then access to these fields has to be orchestrated, independently of the reference counter. And this means that a consistent view would be maintained independently of the reference counter. Going back to the store ptr, if (--ref_count==0) { free(load ptr) }(*) example. It will just have to be [at least] atomic_store ptr, if (--ref_count==0) { free(atomic_load ptr) }. But then, since all are atomics, the compiler won't load ptr prior to --ref_count, even if it's relaxed, and you are actually set. One can say that it can also be lock, store ptr, unlock, if (--ref_count==0) { free(load ptr) }. The question here is whether one can consider the lack of lock/unlock around load ptr a bug or not. One might be able to say that it's not needed, because the thread that is designated to run free(ptr) is supposed to be the last one to access the object. But since the compiler[!] can bring the load ahead of a relaxed[!] --ref_count, the unlocked load can end up with a stale ptr value. In other words, one can actually say that acquire semantics would be appropriate for the reference counter decrement if we want to encourage/support questionable programming practices. Well, it should be noted that this is an internal header, not something we expose to applications, so the question also is whether any of this is actually our case :-)

(*) Just in case, ptr is not the pointer to the object, but a member! The complete sequence is actually if (--ref_count == 0) { free(load ptr), free(object) }. Note that free(object) is never a problem; the dispute is exclusively about what happens to the fields.

@dot-asm
Contributor Author

dot-asm commented Aug 12, 2018

Complete sequence is actually if (--ref_count == 0) { free(load ptr), free(object) }.

I think it becomes even more apparent if one puts it like this. Let's say that besides ptr we even have a pointer to a lock object as part of the current object. In that case the complete sequence would be if (--ref_count == 0) { free(load ptr), free(lock), free(object) }. Note that free(lock) can't be problematic either, because this pointer is constant/"stable" throughout the whole lifetime of the object. In other words, it's only a problem to handle fields that change past constructor time. And since it's a shared object, the updates have to be orchestrated one way or another. And no amount of fencing of the reference counter itself would replace the said orchestration. Except for the corner case when one decides that the final accord of the said orchestration is not needed in the destructor. If such a case is considered legitimate, that is :-)

@kroeckx
Member

kroeckx commented Aug 12, 2018

(I had written a long post, but it seems to mostly put what you've said in other words. Since we both seem to agree on that, no need to repeat it.)

The solution to this kind of situation is of course locked access, i.e. mutexes around these loads and stores. Note that in such case the "magic" load doesn't have to be acquiring, because mutex lock implies at least a load memory barrier, i.e. 'acquire'. Nor does the "magic" store have to be releasing, because mutex unlock implies at least a store memory barrier, i.e. 'release'. "At least" in the last two sentences means that in real life both lock and unlock are full barriers.

I think that if you want to make counters relaxed, it would mean that we need to 'acquire' the lock before calling FOO_free() (and then release it again).

Note that in case we don't use atomics, we do a lock/acquire + unlock/release to modify the counter, so in that case there would be no need to do an additional lock/acquire.

@dot-asm
Contributor Author

dot-asm commented Aug 13, 2018

So what do we settle for? No change, just close? Relaxed decrement as in this PR? Or relaxed decrement with a conditional acquire fence? Or an acquiring decrement? To summarize, it's argued that at least the release fence is not required, because post-constructor modifications to the shared structure will have to be orchestrated, and said orchestration implies a release fence. It's disputed whether or not unlocked or non-atomic access to modified structure members in the destructor is acceptable practice. If it is considered acceptable, then an acquire fence would be appropriate. And if it's not considered acceptable, then a relaxed decrement is all we need.

@kroeckx
Member

kroeckx commented Aug 13, 2018 via email

paulidale
paulidale previously approved these changes Aug 14, 2018
Contributor

@paulidale paulidale left a comment

I can't think of a problem using relaxed ordering here.
The fence seems unnecessary.

    *ret = __atomic_fetch_sub(val, 1, __ATOMIC_RELEASE) - 1;
    if (*ret == 0)
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    *ret = __atomic_fetch_sub(val, 1, __ATOMIC_RELAXED) - 1;
Contributor

Why not __atomic_sub_fetch here which saves the - 1 at the end?
Likewise for atomic_fetch_sub_explicit and the add equivalents.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

I can't think of a problem using relaxed ordering here. The fence seems unnecessary.

It's discussed in sufficient detail above. The problem is that whether or not an acquire fence is appropriate depends on the object and on how it's destructed (and on whether or not we can consider unserialized access to members in the destructor appropriate). And one simply can't make such specific assumptions in a macro like the one in question. Hence the desire to make some kind of "one-size-fits-all".

@paulidale
Contributor

I've read through the commentary. I still don't see how a problem will occur.

Is more than one thread going to decrement the counter to zero? I don't see how this is possible with an atomic operation in place. Sure, we don't know which thread will see the zero but one will definitely see it. That thread will then de-allocate and continue. Speculative execution is a furphy, the speculation will be wound back once the true value is known.

I can envisage scenarios where there might be an issue (one thread is decrementing the counter while another is incrementing it). The question here is how did the second thread learn about the reference? I.e. there is likely a third party which holds the reference too. Even if that isn't true, the various fences are not going to solve the problem -- fences don't cause other threads to wait.

If we're concerned enough, locking is the only solution. Lock free data structures are difficult and don't involve acquire and release semantics.

@paulidale
Contributor

We are talking about reference counts here. If other fields in an object are being modified as well, a lock is the only way to go (not strictly but practically).

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

There was "T1:store ptr, T1:speculative load ptr, T2:store ptr, T2:--ref_count, T1:--ref_count==0, T1:free("load" ptr)" mentioned above. This speculative load is formally permitted to be moved ahead of --ref_count if it, --ref_count, doesn't claim an acquire fence. It was argued that it should not have happened, because "load ptr" should have been atomic by itself (in which case the compiler would leave it alone), or locked. Even in the destructor!!! But a programmer can as well argue: why would I lock or atomically access members in the destructor? Indeed, it's the very last use of the object, right? Except that that speculative load can ruin the party... And so the question boils down to whether we can view the practice of not locking or atomizing an otherwise lockable or atomic member, specifically in the destructor, as acceptable. If we don't view it as acceptable, then a relaxed decrement suffices. And if we accept it, then acquire would be appropriate. Note that this is not necessarily about what we actually currently do. It's rather about not making specific assumptions in refcount.h and keeping it general-purpose.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

As already said, there are two sides to the memory_order thing: a) what the compiler can reorder, and b) what it tells the hardware. One might tend to think about b) and say "hey, a memory barrier instruction doesn't solve anything here". And that's totally correct, it doesn't. It's about a). Ideally one needs something that would prevent the compiler from taking loads ahead of --ref_count, but without telling the hardware anything.

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

So you're now talking about:

??? I was talking about it for a while. It's an option.

But it doesn't seem to guarantee that the writes from the other thread are visible.

My point is that somebody else will just have to guarantee it; it's just not the counter's responsibility. Not providing that guarantee elsewhere would be a bug elsewhere.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

it's just not counter's responsibility.

One can also put it as follows. If it was the counter's responsibility, you'd have to release even up-ref...

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

If it was counter's responsibility you'd have to release even up-ref...

This is misleading, disregard the remark :-)

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

Ideally one needs something that would prevent compiler to take loads ahead of --ref_count, but without telling hardware anything.

There is such a thing, and it's a call to a function that the compiler has no source code for. One can give extra "fright" to the compiler by passing the pointer to the object to the said function, which wouldn't have to do anything with it. For example

void bar(void);   /* opaque to the compiler: no source code available */

long foo(long * __restrict p, long v)
{
    *p = v;
    bar();
    return *p;
}

was observed to be compiled without the load. But if you pass the pointer, i.e. bar(p), then the load is generated. As it also is if you omit __restrict.

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@paulidale
Contributor

My understanding is that memory barriers are designed more for hardware executing out of order than for compilers. For compilers, asm volatile("" ::: "memory") should suffice.

Atomic operations only guarantee that the specific operation is atomic; there is no guarantee that other operations aren't scheduled before or afterwards. Thus:

use p->foo;
atomic_decrement p->ref_count

won't guarantee the execution ordering, and another thread could do a second decrement and free before p->foo is resolved. A memory barrier will help here, but it would have to be on every operation, not just the decrement to zero. Fences don't go cross-thread; they guarantee consistency for the current thread and global visibility of its accesses. Thus, making the memory accesses of the thread that decrements to zero visible won't prevent another thread from trying to load out of order (i.e. after its decrement is finished).

When a compiler sees a function call, the C standard demands a sequence point, i.e. all memory operations that could be impacted by the call have been done. Instead of a function call, I think asm volatile("" ::: "memory") would work here too and be cheaper. The case above with the restricted pointer is special: the restrict keyword indicates that the pointer does not alias anything, and thus bar() cannot modify or use what it points to. v is on the stack and again is inaccessible to bar().

The safest solution is to use atomic operations (relaxed) and introduce a memory barrier after all decrements. Atomic operations by themselves will guarantee that the decrement works properly, and consequently the free, but they can't stop out-of-order execution of related accesses.

Or am I babbling nonsense again?

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@paulidale paulidale dismissed their stale review August 15, 2018 00:00

Further thought is required...

@paulidale
Contributor

It had been a while since I last read through these :( Concurrency is hard.

A releasing decrement and an acquiring fence on zero? The release guarantees that everything done before is visible, the acquire prevents things being done early. This is what the original code does.

I am unsure if the acquire fence is unnecessary. Nothing else has a reference to this and, apart from the free which we're doing, no code should try to access this memory after giving up their reference. This means there are no loads or stores that other threads might be yet to do. However, there could be unseen changes from other threads that we're not seeing. Acquire lets us see them.

Can we replace the release atomic with a relaxed one? I'm beginning to think not. We need the guarantee it provides. I.e. that no stores have been moved past that point.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

memory barriers are designed more for hardware executing out of order than for compilers.

No, it's really both.

For compilers volatile asm("" ::: "memory") should suffice.

It's not portable. [Unfortunately.]

Atomic operation only guarantee that the specific operation is atomic, there is no guarantee that other operations aren't scheduled before or afterwards.

Depends on the chosen memory_order and operation. Then, the compiler is prohibited from changing the order of the atomic operations themselves, independent of memory_order.

A releasing decrement and an acquiring fence on zero? The release guarantees that everything done before is visible, the acquire prevents things being done early.

As already said, if you are in the business of changing fields in a shared structure (besides the reference counter), it will have to be orchestrated one way or another. Not doing so would be a severe bug. And the point is that any kind of orchestration implies a release fence. This means that by the time the reference counter is decremented, the other data is already visible on other processors.

However, there could be unseen changes from other threads that we're not seeing. Acquire lets us see them.

It's the wrong way to think about it. An acquire fence doesn't let you see unseen changes. It only means that the following loads are executed past it, that's all. In a sense, "acquire fence, load" is a meaningless operation by itself. "If (there are changes) { acquire fence, load }" is meaningful (provided that the "there are changes" condition is met after all the changes are visible on all processors), but not "acquire fence, load" by itself.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

Restored the conditional acquire fence.

Just in case: #6883 adds an MSC-specific section. Unfortunately C doesn't provide such nuanced intrinsics; those available and used are full fences.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

I moved _MSC_VER-specific section from #6883 and covered ARM. Even re-formatted...

{
    *ret = _InterlockedExchangeAdd_nf(val, -1) - 1;
    if (*ret == 0)
        __dmb(_ARM_BARRIER_ISH);
Member

Why ISH and not SY?

Contributor Author

That's what compilers generate.

Member

I actually found a comment to say it's safe to do so ...


static __inline int CRYPTO_UP_REF(volatile int *val, int *ret, void *lock)
{
    *ret = _InterlockedExchangeAdd(val, 1) + 1;
Member

I wonder if _InterlockedIncrement() is useful instead. (Note that _InterlockedIncrement returns the new value, not the old value.)

Contributor Author

Curious fact. Unlike other architectures, where the generated code is all the same, on Itanium ExchangeAdd and Increment generate different instruction sequences... There is no loop in the Increment sequence...

static __inline int CRYPTO_DOWN_REF(volatile int *val, int *ret, void *lock)
{
    *ret = _InterlockedExchangeAdd(val, -1) - 1;
    return 1;
Member

Do we still need an acquire fence here? Or is that a full barrier?

Contributor Author

@dot-asm dot-asm Aug 15, 2018

No. On x86 it's not needed, because all interlocked instructions are full barriers.

EDIT. Just in case: "no" refers to the first question. As in "do we need the fence?", "no."

Member

"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

Member

(That is not in all copies of their documentation, but it is at least in some.)

Contributor Author

Right. Just in case, the thing is that there is no _nf on x86, because there is no need for it; interlocked instructions are full barriers.

Contributor Author

there is no _nf on x86, because there is no need for it; interlocked instructions are full barriers.

Or rather, there is no way to implement _nf: interlocked instructions are the only option, and they are unconditionally full barriers.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

Editorial note. I'll change commit message to

internal/refcount.h: overhaul fencing and add _MSC_VER section.

Relax the memory_order on the counter decrement itself, because mutable
members of the reference-counted structure should be visible on all
processors independently of the counter. [Even re-format and minimize
dependency on other headers.]

dot-asm pushed a commit to dot-asm/openssl that referenced this pull request Aug 16, 2018
Relax the memory_order on the counter decrement itself, because mutable
members of the reference-counted structure should be visible on all
processors independently of the counter. [Even re-format and minimize
dependency on other headers.]

Reviewed-by: Kurt Roeckx <kurt@roeckx.be>
(Merged from openssl#6900)
@dot-asm dot-asm closed this Aug 16, 2018