
<xatomic.h>: Consider adding _mm_pause on x86/x64 #680

Closed
AlexGuteniev opened this issue Apr 3, 2020 · 16 comments · Fixed by #1146
Labels: fixed (Something works now, yay!), performance (Must go faster)

Comments

@AlexGuteniev
Contributor

I'm referring to this part:

STL/stl/inc/xatomic.h

Lines 26 to 40 in 260cfaf

#if defined(_M_CEE_PURE) || defined(_M_IX86) || defined(_M_X64)
#define _INTRIN_RELAXED(x) x
#define _INTRIN_ACQUIRE(x) x
#define _INTRIN_RELEASE(x) x
#define _INTRIN_ACQ_REL(x) x
#define _YIELD_PROCESSOR()
#elif defined(_M_ARM) || defined(_M_ARM64)
#define _INTRIN_RELAXED(x) _CONCAT(x, _nf)
#define _INTRIN_ACQUIRE(x) _CONCAT(x, _acq)
#define _INTRIN_RELEASE(x) _CONCAT(x, _rel)
// We don't have interlocked intrinsics for acquire-release ordering, even on
// ARM32/ARM64, so fall back to sequentially consistent.
#define _INTRIN_ACQ_REL(x) x
#define _YIELD_PROCESSOR() __yield()

In particular, this line:

#define _YIELD_PROCESSOR()

In "winnt.h" it is defined as follows:
#define YieldProcessor _mm_pause

The documentation for YieldProcessor describes the use of the pause instruction:

This macro can be called on all processor platforms where Windows is supported, but it has no effect on some platforms. The definition varies from platform to platform. The following are some definitions of this macro in Winnt.h:

#define YieldProcessor() __asm { rep nop }
#define YieldProcessor _mm_pause
#define YieldProcessor __yield

Other synchronization libraries also make use of the pause instruction.
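
Concretely, the change being requested would be along these lines (a sketch of the proposal, not an actual patch):

// Sketch only: make _YIELD_PROCESSOR() a real pause on x86/x64,
// mirroring winnt.h's YieldProcessor.
#if defined(_M_IX86) || defined(_M_X64)
#define _YIELD_PROCESSOR() _mm_pause()
#elif defined(_M_ARM) || defined(_M_ARM64)
#define _YIELD_PROCESSOR() __yield()
#endif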

@BillyONeal explained in #613 (comment):

The answer I've heard from folks that supposedly know better is "never _mm_pause ever" -- that macro was defined long before my time but it was defined as empty for x86/x64 and I don't think we would want to change that without extremely strong profiling evidence that it was a good idea.

Though I understand that there are some folks who know something, it seems highly suspicious that this STL is the only library I know of that avoids pause.

@StephanTLavavej changed the title from "<xatomic.h>: Consider adding _mm_pause on x86-64" to "<xatomic.h>: Consider adding _mm_pause on x64" on Apr 3, 2020
@StephanTLavavej added the labels "performance (Must go faster)" and "info needed (We need more info before working on this)" on Apr 3, 2020
@StephanTLavavej
Member

Marking as info needed; I agree with Billy, I'd want to see performance benchmarks that indicate this is a significant win and never harmful.

@StephanTLavavej changed the title from "<xatomic.h>: Consider adding _mm_pause on x64" to "<xatomic.h>: Consider adding _mm_pause on x86/x64" on Apr 3, 2020
@BillyONeal
Member

I pinged whatever folks I might have heard this from, here: https://twitter.com/MalwareMinigun/status/1246187718645698562; we'll see what they say.

@AlexGuteniev
Contributor Author

@BillyONeal, it seems that one of the people you are asking is the same Olivier who made the reference implementation of atomic wait with _mm_pause in its spin loop :-)

If the concern is unpredictable duration then, well, let's not rely on the duration and not do more than one _mm_pause per spin iteration (and this STL does not, as far as I can see).

I agree that a benchmark should be provided. I disagree that you can improve / not make worse every case -- there's always a tradeoff.

@AlexGuteniev
Contributor Author

I can provide a test program that shows _mm_pause is better.
Though I cannot prove it is legitimate, it should be enough to at least start considering _mm_pause:

#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <intrin.h>

struct MammothAtomicValue
{
    unsigned v;

    void increment()
    {
        ++v;
    }

    bool is_equal(const MammothAtomicValue& other) const
    {
        return v == other.v;
    }

    bool is_greater_or_equal(const MammothAtomicValue& other) const
    {
        return v >= other.v;
    }
};

template<bool use_pause>
struct MammothAtomic
{
    void lock()
    {
        while (spin_mutex.exchange(true, std::memory_order_acquire))
        {
            if (use_pause)
            {
                _mm_pause();
            }
            else
            {
                // nothing
            }
        }
    }

    void unlock()
    {
        spin_mutex.store(false, std::memory_order_release);
    }

    MammothAtomicValue load()
    {
        lock();
        MammothAtomicValue result = value;
        unlock();
        return result;
    }

    MammothAtomicValue cas(const MammothAtomicValue& desired,
                           const MammothAtomicValue& expected)
    {
        lock();
        MammothAtomicValue result = value;
        if (value.is_equal(expected))
        {
            value = desired;
        }
        unlock();
        return result;
    }

    MammothAtomicValue value;
    std::atomic<bool>  spin_mutex = false;

    MammothAtomicValue increment()
    {
        MammothAtomicValue expected = load();
        MammothAtomicValue result = expected;
        for (;;)
        {
            result.increment();
            result = cas(result, expected);
            if (result.is_equal(expected))
            {
                break;
            }
            expected = result;
        }
        return result;
    }
};


template<bool use_pause>
std::chrono::steady_clock::duration run_benchmark()
{
    MammothAtomicValue initial = { 0 };
    MammothAtomicValue final = { 2000000 };

    MammothAtomic<use_pause> value = { initial };

    std::atomic<bool> race_start_flag = false;

    auto worker = [&] {
        while (!race_start_flag.load()) {}

        while (!value.increment().is_greater_or_equal(final)) {}
    };

    // with two workers, _mm_pause is a double win; with one worker, no effect;
    // with more than two, some sort of win
    std::thread workers[2];

    for (auto& w : workers) {
        w = std::thread(worker);
    }

    auto start_time = std::chrono::steady_clock::now();
    race_start_flag.store(true);
    for (auto& w : workers) {
        w.join();
    }

    return std::chrono::steady_clock::now() - start_time;
}

int main()
{
    namespace chrono = std::chrono;

    for (int i = 0; i < 4; i++)
    {
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
          run_benchmark<false>()).count() << "ms without _mm_pause\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
          run_benchmark<true>()).count() << "ms with _mm_pause" << std::endl;
    }
}

@StephanTLavavej
Member

I disagree that you can improve / not make worse every case -- there's always a tradeoff.

Some things are pure wins - e.g. given compiler support, if constexpr utterly annihilates tag dispatch. But yes, when there's a tradeoff, we need to make choices.
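
For example, given if constexpr, a toy advance needs no dispatch helpers at all (a sketch for illustration, not the STL's actual code):

#include <iterator>
#include <type_traits>

// Tag dispatch: two helpers plus a dispatcher.
template <class It>
void advance_impl(It& it, int n, std::random_access_iterator_tag) { it += n; }
template <class It>
void advance_impl(It& it, int n, std::input_iterator_tag) {
    while (n-- > 0) ++it;
}
template <class It>
void my_advance_dispatch(It& it, int n) {
    advance_impl(it, n, typename std::iterator_traits<It>::iterator_category{});
}

// if constexpr: one function, the branch is chosen at compile time.
template <class It>
void my_advance(It& it, int n) {
    if constexpr (std::is_base_of_v<std::random_access_iterator_tag,
                      typename std::iterator_traits<It>::iterator_category>) {
        it += n;
    } else {
        while (n-- > 0) ++it;
    }
}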

@chuggafan

_mm_pause also has the issue of taking a different number of cycles depending on which Intel arch you're using: 10 cycles on older cores versus around 140 cycles on the newer ones. So if people using atomics rely on a very fast atomic check time, they could see a significant drop in performance: https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/ The people who work on .NET found this out too. For some extremely high-thread-count, high-contention atomics this could explode the time it takes, as that article explains.

So adding _mm_pause can negatively affect some workloads while positively affecting others, so it'd probably be best to find out which type of workload is the most common and optimize for that, while informing others of the change (if the change happens).
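
If you want to observe that latency difference on your own machine, a rough sketch like this works (__rdtsc counts TSC ticks rather than core cycles, so treat the result as approximate):

#include <intrin.h>
#include <iostream>

int main() {
    constexpr int n = 10000000;
    const unsigned long long start = __rdtsc();
    for (int i = 0; i < n; ++i) {
        _mm_pause(); // ~10 cycles on older cores, ~140 on Skylake per the article above
    }
    const unsigned long long ticks = __rdtsc() - start;
    std::cout << ticks / n << " ticks per pause\n";
}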

@StephanTLavavej added the label "decision needed (We need to choose something before working on this)" and removed "info needed (We need more info before working on this)" on Apr 8, 2020
@AlexGuteniev
Contributor Author

AlexGuteniev commented Apr 10, 2020

@BillyONeal , I have an idea.

As part of #593 (atomic waits), I can add some minimal level of Windows XP support:
that is, just spinning on wait, first actively, then using SwitchToThread, with no-ops for notify_one and notify_all. Then atomic wait could be used to resolve #370, and also for the internal atomic spinlock.

Then the question about _YIELD_PROCESSOR would become moot: _YIELD_PROCESSOR would only be called when waiting on odd-sized atomics and on old operating systems.

(Note that WaitOnAddress itself spins before going to a kernel wait, and it spins with pause, but that is no longer the STL's responsibility.)
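
Roughly this shape (the names and the spin threshold here are invented):

#include <Windows.h>
#include <atomic>

// Sketch of the described fallback: spin actively first, then yield the
// rest of the timeslice to other ready threads.
void fallback_wait(const std::atomic<unsigned long>& word, unsigned long old_value) {
    for (unsigned spins = 0; word.load(std::memory_order_relaxed) == old_value; ++spins) {
        if (spins < 4096) {
            YieldProcessor(); // _mm_pause on x86/x64
        } else {
            SwitchToThread(); // let another ready thread run
        }
    }
}
// notify_one / notify_all would be no-ops in this mode.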

What do you think?

@MikeGitb

MikeGitb commented Apr 10, 2020

So adding _mm_pause can negatively affect some workloads while positively affecting others, so it'd probably be best to find out which type of workload is the most common and optimize for that, while informing others of the change (if the change happens).

Correct me if I'm wrong, but the problem in the linked article was that .NET used something like 50 pauses in the first iteration and then tripled that in each iteration. Of course that leads to very long backoff times very quickly. That doesn't mean that something like

while (!flag) {
    _mm_pause();
}

isn't still better than

while (!flag) {
    ;
}

It is only when you try to figure out the "optimal" number of pause instructions that this becomes an issue, but e.g. one pause is certainly still better than zero.
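
The problematic pattern from the article looks roughly like this (a sketch with illustrative numbers; flag is a std::atomic<bool> as in the snippets above):

int pauses = 50;
while (!flag.load(std::memory_order_relaxed)) {
    for (int i = 0; i < pauses; ++i) {
        _mm_pause();
    }
    pauses *= 3; // unbounded growth; at ~140 cycles per pause this explodes quickly
}
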
Also, if exact perf behavior is important enough that I worry about a spinlock granularity of 100 instead of 10 cycles, I'm very unlikely to use an STL construct anyway, whose implementation might change from one toolchain version to another and certainly differs from one STL to another.

And generally speaking, I'd certainly expect a _YIELD_PROCESSOR macro to do at least something on every architecture, and not just be a no-op.

@chuggafan

@MikeGitb Well, that might have been the case, but just to make sure, I ran a modified version of @AlexGuteniev's test program with 60 worker threads and a total of 30 runs, averaging the results at the end of the program. This was on an i7-7700HQ, built for x64 release and run from Visual Studio.

[results attached as a screenshot] So it seems that on average _mm_pause is legitimately better, but in individual cases it can be a toss-up. Of course, this isn't extensive testing with other, real-world programs, but it's nice that there are now "real" results, where on average _mm_pause wins by approximately 2.3 seconds or so.

The main reason I threw the article out there was to warn against deciding something like this too quickly, because there have been problems with _mm_pause before due to Intel shenanigans.

Of course, there should still be more testing and a more "concrete" answer found, but I want to warn against premature changes where it's not explicitly a bug but can affect performance. As with anything performance-related: test before making claims.

@AlexGuteniev
Contributor Author

One of the other issues (#166) mentions an interesting document, the Intel Architecture Optimization Manual:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

Section 2.3.4, "Pause Latency in Skylake Microarchitecture", covers this subject.

It mentions the growth in cycle count for the pause instruction, yet still recommends using pause for busy waiting, and even recommends exponential backoff on pause.
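
That recommendation boils down to a test-and-test-and-set loop of roughly this shape (a sketch based on the manual's description):

#include <atomic>
#include <intrin.h>

void lock(std::atomic<bool>& spin_mutex) {
    int backoff = 1;
    constexpr int max_backoff = 64;
    while (spin_mutex.exchange(true, std::memory_order_acquire)) {
        // Wait on a plain load to avoid hammering the cache line with RMWs,
        // pausing an exponentially growing (but capped) number of times.
        while (spin_mutex.load(std::memory_order_relaxed)) {
            for (int i = backoff; i != 0; --i) {
                _mm_pause();
            }
            backoff = backoff < max_backoff ? backoff << 1 : max_backoff;
        }
    }
}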

@AlexGuteniev
Contributor Author

Here I came across an explanation of a strong reason to have at least one _mm_pause:
https://stackoverflow.com/a/12904645/2945027

@AlexGuteniev
Contributor Author

Can a decision be made here?
Is some other benchmark needed, or exact quotes from authoritative sources?
Or do you want to just ignore the issue until vNext, where you can avoid spinning in the STL code altogether?

@BillyONeal
Member

I think your benchmark program demonstrates that we should add the pause. (I was reassigned to the vcpkg team between March and June, and so did not see that.)

@BillyONeal
Member

The increasing backoff example in the Skylake manual you reference is probably a good idea too.

@StephanTLavavej removed the label "decision needed (We need to choose something before working on this)" on Jul 21, 2020
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Aug 4, 2020
@AlexGuteniev
Contributor Author

Updated benchmark:

#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <intrin.h>

struct MammothAtomicValue
{
    unsigned v;

    void increment()
    {
        ++v;
    }

    bool is_equal(const MammothAtomicValue& other) const
    {
        return v == other.v;
    }

    bool is_greater_or_equal(const MammothAtomicValue& other) const
    {
        return v >= other.v;
    }
};

enum use_pause_t
{
    no_pause_no_load,
    no_pause_load,
    one_pause_no_load,
    one_pause_load,
    recommended_method,
};


template<use_pause_t use_pause>
struct MammothAtomic
{
    void lock()
    {
        switch (use_pause)
        {
        case no_pause_no_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                // nothing
            }
            break;

        case no_pause_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    // nothing
                }
            }
            break;

        case one_pause_no_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                _mm_pause();
            }
            break;

        case one_pause_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    _mm_pause();
                }
            }
            break;

        case recommended_method:
        {
            int mask = 1;
            int const max = 64;
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    for (int i = mask; i; --i)
                    {
                        _mm_pause();
                    }
                    mask = mask < max ? mask << 1 : max;
                }
            }
            break;
        }
        }
    }

    void unlock()
    {
        spin_mutex.store(false, std::memory_order_release);
    }

    MammothAtomicValue load()
    {
        lock();
        MammothAtomicValue result = value;
        unlock();
        return result;
    }

    MammothAtomicValue cas(const MammothAtomicValue& desired,
        const MammothAtomicValue& expected)
    {
        lock();
        MammothAtomicValue result = value;
        if (value.is_equal(expected))
        {
            value = desired;
        }
        unlock();
        return result;
    }

    MammothAtomicValue value;
    std::atomic<bool>  spin_mutex = false;

    MammothAtomicValue increment()
    {
        MammothAtomicValue expected = load();
        MammothAtomicValue result = expected;
        for (;;)
        {
            result.increment();
            result = cas(result, expected);
            if (result.is_equal(expected))
            {
                break;
            }
            expected = result;
        }
        return result;
    }
};


template<use_pause_t use_pause>
std::chrono::steady_clock::duration run_benchmark()
{
    MammothAtomicValue initial = { 0 };
    MammothAtomicValue final = { 2000000 };

    MammothAtomic<use_pause> value = { initial };

    std::atomic<bool> race_start_flag = false;

    auto worker = [&] {
        while (!race_start_flag.load()) {}

        while (!value.increment().is_greater_or_equal(final)) {}
    };

    // with two workers, _mm_pause is a double win; with one worker, no effect;
    // with more than two, some sort of win
    std::thread workers[4];

    for (auto& w : workers) {
        w = std::thread(worker);
    }

    auto start_time = std::chrono::steady_clock::now();
    race_start_flag.store(true);
    for (auto& w : workers) {
        w.join();
    }

    return std::chrono::steady_clock::now() - start_time;
}

int main()
{
    namespace chrono = std::chrono;

    for (int i = 0; i < 4; i++)
    {
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<no_pause_no_load>()).count() << " none\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<no_pause_load>()).count() << " load\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<one_pause_no_load>()).count() << " pause\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<one_pause_load>()).count() << " both\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<recommended_method>()).count() << " as recommended" << std::endl;
    }
}

Results:

709.133 none    209.563 load    471.057 pause   123.12 both     30.3102 as recommended
694.65 none     211.071 load    459.964 pause   134.984 both    29.807 as recommended
662.293 none    198.458 load    465.107 pause   142.421 both    30.088 as recommended
721.094 none    214.071 load    472.891 pause   124.019 both    30.6598 as recommended

@cbezault
Contributor

cbezault commented Aug 6, 2020

On an ARM64 device.

646.903 none    564.265 load    632.973 pause   582.179 both    579.654 as recommended
672.233 none    574.724 load    701.4 pause     583.214 both    577.396 as recommended
685.813 none    561.642 load    668.817 pause   574.117 both    592.333 as recommended
727.697 none    552.126 load    628.689 pause   568.138 both    567.249 as recommended
651.309 none    578.399 load    604.984 pause   566.021 both    568.212 as recommended
660.482 none    576.616 load    635.601 pause   556.175 both    572.447 as recommended
700.489 none    546.835 load    680.566 pause   551.833 both    565.911 as recommended
687.385 none    602.884 load    641.061 pause   593.046 both    601.97 as recommended
645.611 none    622.082 load    661.407 pause   608.765 both    611.843 as recommended
697.712 none    641.034 load    694.685 pause   588.235 both    645.334 as recommended
667.783 none    631.312 load    683.622 pause   566.966 both    593.436 as recommended
655.962 none    570.809 load    640.184 pause   561.283 both    585.693 as recommended
612.771 none    554.762 load    647.957 pause   602.216 both    554.297 as recommended
607.908 none    580.776 load    526.137 pause   525.39 both     579.019 as recommended
700.611 none    580.204 load    663.573 pause   567.024 both    576.747 as recommended
599.543 none    590.085 load    625.307 pause   589.871 both    592.454 as recommended
650.27 none     559.459 load    577.195 pause   581.821 both    585.806 as recommended
588.967 none    550.916 load    556.084 pause   564.167 both    580.734 as recommended
602.823 none    555.853 load    591.483 pause   555.066 both    581.193 as recommended
