
internal/refcount.h: remove fence in CRYPTO_DOWN_REF. #6900

Closed
wants to merge 3 commits into from

Conversation

dot-asm
Contributor

@dot-asm dot-asm commented Aug 9, 2018

Fencing is about data visibility in a multi-processor system and
ordering of loads and stores. When the reference counter hits zero,
the current thread is the last one to hold the reference, and it's
actually about to discard the object. It doesn't have to care about
data becoming visible to other processors. Nor is there reason to
assume that speculative loads in the destructor would yield stale
data, because the object was already assumed to be "stable" when it
was used just before the reference counter hit zero.

This is a spin-off from #6883, where it was said that the original code was based on assertions made here. I don't quite accept their premise, hence this request.

First some definitions. What's a "release fence"? Ensuring that all preceding stores are complete and visible to all processors. What's an "acquire fence"? Ensuring that none of the following loads are initiated before the fence. The wording at the referred page suggests that an object is discarded while it's still being used by another thread. But the thing is, if that were actually the case, i.e. the object really is still in use by another thread, then no amount of fencing would help. The whole point of reference counters is that when the current thread observes zero, that is supposed to be actual proof that nobody else is using the object. Which means that all stores are long committed, and one doesn't have to worry about loads. Even if some were speculatively performed earlier, the outcome would be the same: if the reference counter didn't hit zero, the result will be discarded, and if it did hit zero, the result is still valid.

@kroeckx
Member

kroeckx commented Aug 9, 2018

A related thread that I did not fully read yet is:
http://compgroups.net/comp.programming.threads/acquire-barrier-for-atomic-reference/2187894

@kroeckx
Member

kroeckx commented Aug 9, 2018

So I think the current code is correct, and I would like to see a reference to some other document that says this fence is not needed.

@kroeckx
Member

kroeckx commented Aug 9, 2018

This page might also be interesting to read: https://www.boost.org/doc/libs/1_66_0/doc/html/atomic/thread_coordination.html

@kroeckx
Member

kroeckx commented Aug 9, 2018

So my understanding for the fence is to make sure the free() happens after the reference counter becomes 0.

@dot-asm
Contributor Author

dot-asm commented Aug 10, 2018

So my understanding for the fence is to make sure the free() happens after the reference counter becomes 0.

??? free happens after the reference count hits zero without any fences...

Formally speaking, the only possibility is this. Consider store ptr, if (--ref_count==0) { free(load ptr) }. Now imagine T1:store ptr, T1:speculative load ptr(*), T2:store ptr, T2:--ref_count, T1:--ref_count==0, T1:free("load" ptr). Observing this sequence you can say "that's it, we need a fence in order to get rid of the speculative load." But recall that we are looking at a shared object. There are two possible ways to handle stores to a shared object (other than to the reference counter itself): a) perform stores under lock; b) make those stores and loads atomic by themselves. Now, if it's a), then the real sequence is lock, store ptr, unlock, if (--ref_count==0) { free(load ptr) }. In this case the speculative load(*) is bound to be canceled(*), because unlock incurs a release fence that ensures that data is consistent on all processors. And if it's b), then ptr itself should be fenced accordingly, so that the load is not speculated. Either way, the reference counter itself doesn't have to be fenced.

Of course one can say that above line of reasoning is "practitioner's view", and more specifically "bound to be canceled" might have no formal support in language standard wording...

(*) Just in case, "who" performs speculative load? Most importantly it couldn't have been compiler, it's not in position to move it. And not only because of if, but even because compiler reorders no loads or stores around atomic operations. It could only be hardware! But then note that hardware is also in position to cancel it and re-execute it. And it actually does so when time comes.

@kroeckx
Member

kroeckx commented Aug 11, 2018

My current understanding, which might be wrong (I wish someone properly explained it), is that without it, the compiler is allowed to move the free() before the decrement, and that that would cause undefined behaviour. The fence is only there to prevent the compiler from doing that, so that there is no undefined behaviour; we don't care about the CPU executing that fence, but there is no way to tell the compiler not to do it without that fence.

@kroeckx
Member

kroeckx commented Aug 11, 2018

A copy & paste from that:

True. If the object itself is never accessed in the "delete" branch
(e.g. because the destructor is trivial) then you don't need an acquire
fence for that.

If the destructor is not trivial then presumably it does read the data
in some fashion, and thus needs the acquire fence.

@kroeckx
Member

kroeckx commented Aug 11, 2018

So my current understanding is that the other thread might have written something to that structure we're going to free, and we need to make sure that all data (other than the reference count) written by the other thread is visible if we're going to read it. But I have a hard time thinking of objects for which we could have a reference counter where we modify more than the reference counter itself.

@kroeckx
Member

kroeckx commented Aug 11, 2018

This seems to be the best thread about this issue that I've found:
https://groups.google.com/a/isocpp.org/forum/#!topic/std-discussion/OHv-oNSuJuk

So I think I have to mostly agree with your change, I'm still not really sure about the relaxed vs release thing.

kroeckx
kroeckx previously approved these changes Aug 11, 2018
@kroeckx kroeckx added the approval: done This pull request has the required number of approvals label Aug 11, 2018
@dot-asm
Contributor Author

dot-asm commented Aug 11, 2018

But I have a hard time thinking of objects for which we could have a reference counter where we modify more than the reference counter itself.

The easiest example to imagine is when 'ptr' in the above example is the head of a linked list that you manage in a lock-free manner, with atomic_compare_exchange. I mean, operations on the shared object would involve growing this list, and then the destructor would have to traverse it. We, as OpenSSL developers, don't do this; it's just an example that is [relatively] easy to wrap your head around. But once again, since you would manage the list in a lock-free manner, access to its head in the destructor would have to be fenced accordingly. And fencing around the reference counter would be redundant. Just as it is redundant without writable fields, i.e. always.

without it, the compiler is allowed to move the free() before the decrement

No. Fencing is not about how the compiler orders operations, but about what message is conveyed to hardware. In the sense that the compiler won't ever move free before the decrement, but it can tell the hardware to hold back speculative loads if you ask for acquire (or wait out pending stores if you ask for release). As for operation ordering per se, I fail to see why the compiler would actually be allowed to reorder loads and stores around atomic operations. Indeed, consider the following snippet

long foo(long * __restrict ptr, long val, int * __restrict ref)
{
    *ptr = val;
#ifdef ATOMIC
    __atomic_fetch_add(ref,1,__ATOMIC_RELAXED);
#else
    (*ref)++;
#endif
    return *ptr;
}

If compiled as is, it's compiled as

    mov  val,ret
    mov  val,*(ptr)
    add  1,*(ref)

As you can see, the load of *ptr in the return statement is elided, as the value is copied straight to the return register. But if compiled with the additional -DATOMIC flag, it's compiled as

    mov  val,*(ptr)
    lock add 1,*(ref)
    mov  *(ptr),ret

Note that it refused to cache *ptr and scrupulously preserved the order of loads and stores on the different sides of the atomic operation. Just in case: if you play with values other than __ATOMIC_RELAXED, you won't see any difference in the generated x86 code. If you want it to be more illustrative, use an alpha compiler ;-)

@kroeckx
Member

kroeckx commented Aug 11, 2018

@kroeckx kroeckx dismissed their stale review August 11, 2018 22:10

After looking at that talk again, I'm removing my approval

@kroeckx
Member

kroeckx commented Aug 11, 2018

So my current understanding: The release makes sure that the other thread sees the value before the if; without the release you're not sure when that write will be visible to the other threads, only that it will be decremented at some point. The release does not prevent part of the free() from already being executed before the decrement. If another thread did modify something, and did a release, and we want to read that in the free() call, we need an acquire.

@dot-asm dot-asm removed the approval: done This pull request has the required number of approvals label Aug 12, 2018
@dot-asm
Contributor Author

dot-asm commented Aug 12, 2018

OK, I managed to observe a regular load ending up on different sides of an atomic operation depending on the explicit memory order used, more specifically relaxed vs. anything else. Not with gcc ;-) This means that the assertion that the compiler never reorders non-atomic loads and stores around any atomic was incorrect. Sorry about the confusion. The compiler actually can reorder stuff at least around relaxed operations, and the other memory orders ought to constrain it in one direction. This in turn means that if the compiler[!] does a speculative load prior to the reference counter decrement, it can pick up a field's stale value. Unless[!] the field is atomic. Because if it is (and it should be), then the compiler won't speculatively load it, because the compiler won't reorder atomic operations relative to each other, even relaxed ones. The standard does say at least that. [Note that hardware still can speculatively initiate the load, but then it would have to cancel it.]

The release makes sure that the other thread sees the value before the if,

I'd argue that it's the wrong way to think about it. Release is not about the value, it's really about the preceding stores. Indeed, you have no control over when another thread observes any of the stored values. The only thing you can control is that by the time another thread sees the "magic" value at a specific location, the rest of the object you are releasing is "stable". Maybe it's just me, but a common problem I have with posts, blogs, etc. is that they leave you with the impression that an acquire in one thread waits for a release in another. It doesn't. Both are really about whether or not the current thread leaves behind, or observes, a consistent state. And just like 'release' is about preceding stores, 'acquire' is about following loads.

And there are two sides to "about". One side is what the compiler can reorder, and the other is what it tells the hardware. For example, what's a releasing store on a "weak" architecture? It's a store memory barrier followed by the actual store. Yes, you fence the preceding stores, but not the "magic" store itself. What's an acquiring load? It's a load followed by a load memory barrier.

Note the following asymmetry. Let's say one thread makes a releasing store and another an acquiring load. Now imagine that there are some stores to shared memory locations prior to the releasing store, and some "corresponding" loads after the acquiring load. Now imagine that the acquiring load happens just before the releasing store. Since you have no control over when the stores become visible to the acquiring thread, loads past the acquiring one can pick up new values, which will result in an inconsistent view. So the releasing thread does leave a consistent picture behind, yet that is no guarantee that the acquiring thread observes one. The solution to this kind of situation is of course locked access, i.e. mutexes around these loads and stores. Note that in such case the "magic" load doesn't have to be acquiring, because mutex lock implies at least a load memory barrier, i.e. 'acquire'. Nor does the "magic" store have to be releasing, because mutex unlock implies at least a store memory barrier, i.e. 'release'. "At least" in the last two sentences means that in real life both lock and unlock are full barriers.

... without the release you're not sure when that write will be visible to the other threads, only that it will be decremented at some point.

I'd say that this is exactly the kind of wrong impression mentioned above. You are not sure when the outcome of the reference counter modification becomes visible to another thread, regardless of whether it's releasing, acquiring, both or neither. It's really about the loads and stores around it.

The release does not prevent part of the free() from already being executed before the decrement.

It's a tricky formulation. Does "already be executed" refer to speculative execution by hardware (which can always happen, and is regularly canceled in case of misprediction), or does it refer to the compiler actually scheduling parts of the destructor ahead of the decrement? Just to be clear, if we are talking about the actual free(3) call, the compiler is not in a position to do any of it ahead of the decrement. The only ambiguity is what happens prior to the call, and even more specifically, which pointer is picked up for passing down.

If another thread did modify something, and did a release, and we want to read that in the free() call, we need an acquire.

As mentioned earlier, if one is in the business of field modifications (other than the reference counter), then access to these fields has to be orchestrated, independently of the reference counter. And this means that a consistent view would be maintained independently of the reference counter. Going back to the store ptr, if (--ref_count==0) { free(load ptr) }(*) example. It will just have to be [at least] atomic_store ptr, if (--ref_count==0) { free(atomic_load ptr) }. But then, since all are atomics, the compiler won't load ptr prior to --ref_count, even if it's relaxed, and you are actually set. One can say that it can also be lock, store ptr, unlock, if (--ref_count==0) { free(load ptr) }. The question here is whether one can consider the lack of lock/unlock around load ptr a bug or not. One might be able to say that it's not needed, because the thread that is designated to run free(ptr) is supposed to be the last one to access the object. But since the compiler[!] can bring the load ahead of a relaxed[!] --ref_count, the unlocked load can end up with a stale ptr value. In other words, one can actually say that acquire semantics would be appropriate for the reference counter decrement if we want to encourage/support questionable programming practices. Well, it should be noted that this is an internal header, not something we expose to applications, so the question also is whether any of this is actually our case :-)

(*) Just in case, ptr is not the pointer to the object, but a member! The complete sequence is actually if (--ref_count == 0) { free(load ptr), free(object) }. Note that free(object) is never a problem; the dispute is exclusively about what happens to the fields.

@dot-asm
Contributor Author

dot-asm commented Aug 12, 2018

Complete sequence is actually if (--ref_count == 0) { free(load ptr), free(object) }.

I think it becomes even more apparent if one puts it like this. Let's say that besides ptr we even have a pointer to a lock object as part of the current object. In that case the complete sequence would be if (--ref_count == 0) { free(load ptr), free(lock), free(object) }. Note that free(lock) can't be problematic either, because this pointer is constant/"stable" throughout the whole lifetime of the object. In other words, it's only a problem to handle fields that change past constructor time. And since it's a shared object, the updates have to be orchestrated one way or another. And no amount of fencing of the reference counter itself would replace the said orchestration. Except for the corner case when one decides that the final accord of the said orchestration is not needed in the destructor. If such a case is considered legitimate, that is :-)

@kroeckx
Member

kroeckx commented Aug 12, 2018

(I had written a long post, but it seems to mostly put what you've said in other words. Since we both seem to agree on that, no need to repeat it.)

The solution to this kind of situation is of course locked access, i.e. mutexes around these loads and stores. Note that in such case the "magic" load doesn't have to be acquiring, because mutex lock implies at least a load memory barrier, i.e. 'acquire'. Nor does the "magic" store have to be releasing, because mutex unlock implies at least a store memory barrier, i.e. 'release'. "At least" in the last two sentences means that in real life both lock and unlock are full barriers.

I think that if you want to make counters relaxed, it would mean that we need to 'acquire' the lock before calling FOO_free() (and then release it again).

Note that in case we don't use atomics, we do a lock/acquire + unlock/release to modify the counter, so in that case there would be no need to do an additional lock/acquire.

@dot-asm
Contributor Author

dot-asm commented Aug 13, 2018

So what do we settle for? No change, just close? Relaxed decrement as in this PR? Or relaxed decrement with a conditional acquire fence? Or an acquiring decrement? To summarize, it's argued that at least the release fence is not required, because post-constructor modifications to the shared structure will have to be orchestrated, and said orchestration implies a release fence. It's disputed whether or not unlocked or non-atomic access to modified structure members in the destructor is acceptable practice. If it is considered acceptable, then an acquire fence would be appropriate. And if it's not considered acceptable, then a relaxed decrement is all we need.

@kroeckx
Member

kroeckx commented Aug 13, 2018 via email

paulidale
paulidale previously approved these changes Aug 14, 2018
Contributor

@paulidale paulidale left a comment

I can't think of a problem using relaxed ordering here.
The fence seems unnecessary.

    *ret = __atomic_fetch_sub(val, 1, __ATOMIC_RELEASE) - 1;
    if (*ret == 0)
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    *ret = __atomic_fetch_sub(val, 1, __ATOMIC_RELAXED) - 1;
Contributor

Why not __atomic_sub_fetch here which saves the - 1 at the end?
Likewise for atomic_fetch_sub_explicit and the add equivalents.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

I can't think of a problem using relaxed ordering here. The fence seems unnecessary.

It's discussed in sufficient detail above. The problem is that whether or not an acquire fence is appropriate depends on the object and on how it's destructed (and on whether or not we can consider unserialized access to members in the destructor appropriate). And one simply can't make such specific assumptions in a macro like the one in question. Hence the desire to make some kind of "one-size-fits-all".

@paulidale
Contributor

I've read through the commentary. I still don't see how a problem will occur.

Is more than one thread going to decrement the counter to zero? I don't see how this is possible with an atomic operation in place. Sure, we don't know which thread will see the zero but one will definitely see it. That thread will then de-allocate and continue. Speculative execution is a furphy, the speculation will be wound back once the true value is known.

I can envisage scenarios where there might be an issue (one thread is decrementing the counter while another is incrementing it). The question here is how did the second thread learn about the reference? I.e. there is likely a third party which holds the reference too. Even if that isn't true, the various fences are not going to solve the problem -- fences don't cause other threads to wait.

If we're concerned enough, locking is the only solution. Lock free data structures are difficult and don't involve acquire and release semantics.

@paulidale
Contributor

We are talking about reference counts here. If other fields in an object are being modified as well, a lock is the only way to go (not strictly but practically).

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

There was "T1:store ptr, T1:speculative load ptr, T2:store ptr, T2:--ref_count, T1:--ref_count==0, T1:free("load" ptr)" mentioned above. This speculative load is formally permitted to be moved ahead of --ref_count if it, --ref_count, doesn't claim an acquire fence. It was argued that it should not have happened, because "load ptr" should have been atomic by itself (in which case the compiler would leave it alone), or locked. Even in the destructor!!! But a programmer can as well argue: why would I lock or atomically access members in the destructor? Indeed, it's the very last use of the object, right? Except that that speculative load can ruin the party... And so the question boils down to whether we can view the practice of not locking or atomizing an otherwise lockable or atomic member, specifically in the destructor, as acceptable. If we don't view it as acceptable, then a relaxed decrement suffices. And if we accept it, then acquire would be appropriate. Note that this is not necessarily about what we actually currently do. It's rather about not making specific assumptions in refcount.h and keeping it general-purpose.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

As already said, there are two sides to the memory_order thing: a) what the compiler can reorder, and b) what it tells the hardware. One might tend to think about b) and say "hey, a memory barrier instruction doesn't solve anything here". And that's totally correct, it doesn't. It's about a). Ideally one needs something that would prevent the compiler from taking loads ahead of --ref_count, but without telling the hardware anything.

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

So you're now talking about:

??? I was talking about it for a while. It's an option.

But it doesn't seem to guarantee that the writes from the other thread are visible.

My point is that somebody else will just have to guarantee it; it's just not the counter's responsibility. Not providing that guarantee elsewhere would be a bug elsewhere.

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

it's just not counter's responsibility.

One can also put it as follows. If it was the counter's responsibility, you'd have to release even up-ref...

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

If it was counter's responsibility you'd have to release even up-ref...

This is misleading, disregard the remark :-)

@dot-asm
Contributor Author

dot-asm commented Aug 14, 2018

Ideally one needs something that would prevent compiler to take loads ahead of --ref_count, but without telling hardware anything.

There is such a thing, and it's a call to a function that the compiler has no source code for. One can give extra "fright" to the compiler by passing the pointer to the object to the said function, which wouldn't have to do anything with it. For example

void bar(void);   /* opaque to the compiler: no source code available */

long foo(long * __restrict p, long v)
{
    *p = v;
    bar();
    return *p;
}

was observed to be compiled without the load. But if you pass the pointer, i.e. bar(p), then the load is generated. As it also is if you omit __restrict.

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@paulidale
Contributor

My understanding is that memory barriers are designed more for hardware executing out of order than for compilers. For compilers, asm volatile("" ::: "memory") should suffice.

Atomic operations only guarantee that the specific operation is atomic; there is no guarantee that other operations aren't scheduled before or afterwards. Thus:

use p->foo;
atomic_decrement p->ref_count

won't guarantee the execution ordering, and another thread could do a second decrement and free before p->foo is resolved. A memory barrier will help here, but it would have to be on every operation, not just the decrement to zero. Fences don't go cross-thread; they guarantee consistency for the current thread and global visibility of its accesses. Thus, making the memory accesses of the thread that decrements to zero visible won't prevent another thread from trying to load out of order (i.e. after its decrement is finished).

When a compiler sees a function call, the C standard demands a sequence point, i.e. all memory operations that could be impacted by the call have been done. Instead of a function call, I think asm volatile("" ::: "memory") would work here too and be cheaper. The case above with the restricted pointer is special: the restrict keyword indicates that the pointer does not alias anything, and thus bar() cannot modify or use what it points to. v is on the stack and again is inaccessible to bar().

The safest solution is to use atomic operations (relaxed) and introduce a memory barrier after all decrements. Atomic operations by themselves will guarantee that the decrement works properly, and consequently the free, but they can't stop out-of-order execution of related accesses.

Or am I babbling nonsense again?

@kroeckx
Member

kroeckx commented Aug 14, 2018 via email

@paulidale paulidale dismissed their stale review August 15, 2018 00:00

Further thought is required...

@paulidale
Contributor

It had been a while since I last read through these :( Concurrency is hard.

A releasing decrement and an acquiring fence on zero? The release guarantees that everything done before is visible, the acquire prevents things being done early. This is what the original code does.

I am unsure if the acquire fence is unnecessary. Nothing else has a reference to this and, apart from the free which we're doing, no code should try to access this memory after giving up their reference. This means there are no loads or stores that other threads might be yet to do. However, there could be unseen changes from other threads that we're not seeing. Acquire lets us see them.

Can we replace the release atomic with a relaxed one? I'm beginning to think not. We need the guarantee it provides. I.e. that no stores have been moved past that point.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

memory barriers are designed more for hardware executing out of order than for compilers.

No, it's really both.

For compilers volatile asm("" ::: "memory") should suffice.

It's not portable. [Unfortunately.]

Atomic operation only guarantee that the specific operation is atomic, there is no guarantee that other operations aren't scheduled before or afterwards.

Depends on the chosen memory_order and operation. Then, the compiler is prohibited from changing the order of the atomic operations themselves, independent of memory_order.

A releasing decrement and an acquiring fence on zero? The release guarantees that everything done before is visible, the acquire prevents things being done early.

As already said, if you are in the business of changing fields in a shared structure (besides the reference counter), it will have to be orchestrated one way or another. Not doing so would be a severe bug. And the point is that any kind of orchestration implies a release fence. This means that by the time the reference counter is decremented, the other data is already visible on other processors.

However, there could be unseen changes from other threads that we're not seeing. Acquire lets us see them.

It's the wrong way to think about it. An acquire fence doesn't let you see unseen changes. It only means that the following loads are executed past it, that's all. In a sense, "acquire fence, load" is a meaningless operation by itself. "If (there are changes) { acquire fence, load }" is meaningful (provided that the "there are changes" condition is met after all the changes are visible on all processors), but not "acquire fence, load" by itself.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

Restored the conditional acquire fence.

Just in case: #6883 adds an MSC-specific section. Unfortunately C doesn't provide such nuanced intrinsics; those available and used are full fences.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

I moved _MSC_VER-specific section from #6883 and covered ARM. Even re-formatted...

{
    *ret = _InterlockedExchangeAdd_nf(val, -1) - 1;
    if (*ret == 0)
        __dmb(_ARM_BARRIER_ISH);
Member

Why ISH and not SY?

Contributor Author

That's what compilers generate.

Member

I actually found a comment to say it's safe to do so ...


static __inline int CRYPTO_UP_REF(volatile int *val, int *ret, void *lock)
{
    *ret = _InterlockedExchangeAdd(val, 1) + 1;
Member

I wonder if _InterlockedIncrement() is useful instead. (Note that _InterlockedIncrement returns the new value, not the old value.)

Contributor Author

Curious fact. Unlike other architectures, where the generated code is all the same, on Itanium ExchangeAdd and Increment generate different instruction sequences... There is no loop in the Increment sequence...

static __inline int CRYPTO_DOWN_REF(volatile int *val, int *ret, void *lock)
{
    *ret = _InterlockedExchangeAdd(val, -1) - 1;
    return 1;
Member

Do we still need an acquire fence here? Or is that a full barrier?

Contributor Author

@dot-asm dot-asm Aug 15, 2018

No. On x86 it's not needed, because all interlocked instructions are full barriers.

EDIT. Just in case: "no" refers to the first question. As in "do we need the fence?", "no."

Member

"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

Member

(That is not in all copies of their documentation, but it is at least in some.)

Contributor Author

Right. Just in case, the thing is that there is no _nf on x86, because there is no need for it; interlocked instructions are full barriers.

Contributor Author

there is no _nf on x86, because there is no need for it; interlocked instructions are full barriers.

Or rather, there is no way to implement _nf: interlocked instructions are the only option, and they are unconditionally full barriers.

@dot-asm
Contributor Author

dot-asm commented Aug 15, 2018

Editorial note. I'll change commit message to

internal/refcount.h: overhaul fencing and add _MSC_VER section.

Relax the memory_order on the counter decrement itself, because mutable
members of the reference-counted structure should be visible on all
processors independently of the counter. [Even re-format and minimize
dependency on other headers.]

dot-asm pushed a commit to dot-asm/openssl that referenced this pull request Aug 16, 2018
Relax the memory_order on the counter decrement itself, because mutable
members of the reference-counted structure should be visible on all
processors independently of the counter. [Even re-format and minimize
dependency on other headers.]

Reviewed-by: Kurt Roeckx <kurt@roeckx.be>
(Merged from openssl#6900)
@dot-asm dot-asm closed this Aug 16, 2018