
<xatomic.h>: Consider adding _mm_pause on x86/x64 #680

Closed
AlexGuteniev opened this issue Apr 3, 2020 · 16 comments · Fixed by #1146
Labels: fixed (Something works now, yay!), performance (Must go faster)

Comments

@AlexGuteniev
Contributor

I'm referring to this part:

STL/stl/inc/xatomic.h

Lines 26 to 40 in 260cfaf

#if defined(_M_CEE_PURE) || defined(_M_IX86) || defined(_M_X64)
#define _INTRIN_RELAXED(x) x
#define _INTRIN_ACQUIRE(x) x
#define _INTRIN_RELEASE(x) x
#define _INTRIN_ACQ_REL(x) x
#define _YIELD_PROCESSOR()
#elif defined(_M_ARM) || defined(_M_ARM64)
#define _INTRIN_RELAXED(x) _CONCAT(x, _nf)
#define _INTRIN_ACQUIRE(x) _CONCAT(x, _acq)
#define _INTRIN_RELEASE(x) _CONCAT(x, _rel)
// We don't have interlocked intrinsics for acquire-release ordering, even on
// ARM32/ARM64, so fall back to sequentially consistent.
#define _INTRIN_ACQ_REL(x) x
#define _YIELD_PROCESSOR() __yield()

In particular, this line:

#define _YIELD_PROCESSOR()

In "winnt.h" it is defined as follows:
#define YieldProcessor _mm_pause

The documentation for YieldProcessor describes the use of the pause instruction:

This macro can be called on all processor platforms where Windows is supported, but it has no effect on some platforms. The definition varies from platform to platform. The following are some definitions of this macro in Winnt.h:

#define YieldProcessor() __asm { rep nop }
#define YieldProcessor _mm_pause
#define YieldProcessor __yield

Other synchronization libraries also make use of the pause instruction.
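
Concretely, the change being requested would be along these lines (a sketch of the proposal, not an actual patch):

// Sketch only: make _YIELD_PROCESSOR() a real pause on x86/x64,
// mirroring winnt.h's YieldProcessor.
#if defined(_M_IX86) || defined(_M_X64)
#define _YIELD_PROCESSOR() _mm_pause()
#elif defined(_M_ARM) || defined(_M_ARM64)
#define _YIELD_PROCESSOR() __yield()
#endif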

@BillyONeal explained in #613 (comment):

The answer I've heard from folks that supposedly know better is "never _mm_pause ever" -- that macro was defined long before my time but it was defined as empty for x86/x64 and I don't think we would want to change that without extremely strong profiling evidence that it was a good idea.

Though I understand that there are some folks who know something, it seems highly suspicious that this STL is the only library I know of that avoids pause.

@StephanTLavavej changed the title from "<xatomic.h>: Consider adding _mm_pause on x86-64" to "<xatomic.h>: Consider adding _mm_pause on x64" on Apr 3, 2020
@StephanTLavavej added the labels "performance (Must go faster)" and "info needed (We need more info before working on this)" on Apr 3, 2020
@StephanTLavavej
Member

Marking as info needed; I agree with Billy, I'd want to see performance benchmarks that indicate this is a significant win and never harmful.

@StephanTLavavej changed the title from "<xatomic.h>: Consider adding _mm_pause on x64" to "<xatomic.h>: Consider adding _mm_pause on x86/x64" on Apr 3, 2020
@BillyONeal
Member

I pinged whatever folks I might have heard this from, here: https://twitter.com/MalwareMinigun/status/1246187718645698562; we'll see what they say.

@AlexGuteniev
Contributor Author

@BillyONeal, it seems that one of the people you are asking is the same Olivier who made the reference implementation of atomic wait with _mm_pause in its spin loop :-)

If the concern is unpredictable duration then, well, let's not rely on the duration and not do more than one _mm_pause per spin iteration (and this STL does not, as far as I can see).

I agree that a benchmark should be provided. I disagree that you can improve / not make worse every case -- there's always a tradeoff.

@AlexGuteniev
Contributor Author

I can provide a test program that shows _mm_pause is better.
Though I cannot prove it is legitimate, it should be enough to at least start considering _mm_pause:

#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <intrin.h>

struct MammothAtomicValue
{
    unsigned v;

    void increment()
    {
        ++v;
    }

    bool is_equal(const MammothAtomicValue& other) const
    {
        return v == other.v;
    }

    bool is_greater_or_equal(const MammothAtomicValue& other) const
    {
        return v >= other.v;
    }
};

template<bool use_pause>
struct MammothAtomic
{
    void lock()
    {
        while (spin_mutex.exchange(true, std::memory_order_acquire))
        {
            if (use_pause)
            {
                _mm_pause();
            }
            else
            {
                // nothing
            }
        }
    }

    void unlock()
    {
        spin_mutex.store(false, std::memory_order_release);
    }

    MammothAtomicValue load()
    {
        lock();
        MammothAtomicValue result = value;
        unlock();
        return result;
    }

    MammothAtomicValue cas(const MammothAtomicValue& desired,
                           const MammothAtomicValue& expected)
    {
        lock();
        MammothAtomicValue result = value;
        if (value.is_equal(expected))
        {
            value = desired;
        }
        unlock();
        return result;
    }

    MammothAtomicValue value;
    std::atomic<bool>  spin_mutex = false;

    MammothAtomicValue increment()
    {
        MammothAtomicValue expected = load();
        MammothAtomicValue result = expected;
        for (;;)
        {
            result.increment();
            result = cas(result, expected);
            if (result.is_equal(expected))
            {
                break;
            }
            expected = result;
        }
        return result;
    }
};


template<bool use_pause>
std::chrono::steady_clock::duration run_benchmark()
{
    MammothAtomicValue initial = { 0 };
    MammothAtomicValue final = { 2000000 };

    MammothAtomic<use_pause> value = { initial };

    std::atomic<bool> race_start_flag = false;

    auto worker = [&] {
        while (!race_start_flag.load()) {}

        while (!value.increment().is_greater_or_equal(final)) {}
    };

    // with two workers, _mm_pause is a double win; with one worker, no effect;
    // with more than two, some sort of win
    std::thread workers[2];

    for (auto& w : workers) {
        w = std::thread(worker);
    }

    auto start_time = std::chrono::steady_clock::now();
    race_start_flag.store(true);
    for (auto& w : workers) {
        w.join();
    }

    return std::chrono::steady_clock::now() - start_time;
}

int main()
{
    namespace chrono = std::chrono;

    for (int i = 0; i < 4; i++)
    {
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
          run_benchmark<false>()).count() << "ms without _mm_pause\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
          run_benchmark<true>()).count() << "ms with _mm_pause" << std::endl;
    }
}

@StephanTLavavej
Member

I disagree that you can improve / not make worse every case -- there's always a tradeoff.

Some things are pure wins - e.g. given compiler support, if constexpr utterly annihilates tag dispatch. But yes, when there's a tradeoff, we need to make choices.
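
For example, given if constexpr, a toy advance needs no dispatch helpers at all (a sketch for illustration, not the STL's actual code):

#include <iterator>
#include <type_traits>

// Tag dispatch: two helpers plus a dispatcher.
template <class It>
void advance_impl(It& it, int n, std::random_access_iterator_tag) { it += n; }
template <class It>
void advance_impl(It& it, int n, std::input_iterator_tag) {
    while (n-- > 0) ++it;
}
template <class It>
void my_advance_dispatch(It& it, int n) {
    advance_impl(it, n, typename std::iterator_traits<It>::iterator_category{});
}

// if constexpr: one function, the branch is chosen at compile time.
template <class It>
void my_advance(It& it, int n) {
    if constexpr (std::is_base_of_v<std::random_access_iterator_tag,
                      typename std::iterator_traits<It>::iterator_category>) {
        it += n;
    } else {
        while (n-- > 0) ++it;
    }
}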

@chuggafan

_mm_pause also has the issue of taking a different number of cycles depending on which Intel arch you're using: 10 cycles on older cores versus around 140 cycles on the newer ones. So if people using atomics rely on a very fast atomic check time, they could see a significant drop in performance: https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/ The people who work on .NET found this out too. For some extremely high-thread-count, high-contention atomics this could explode the time it takes, as that article explains.

So adding _mm_pause can negatively affect some workloads while positively affecting others, so it'd probably be best to find out which type of workload is the most common and optimize for that, while informing others of the change (if the change happens).
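
If you want to observe that latency difference on your own machine, a rough sketch like this works (__rdtsc counts TSC ticks rather than core cycles, so treat the result as approximate):

#include <intrin.h>
#include <iostream>

int main() {
    constexpr int n = 10000000;
    const unsigned long long start = __rdtsc();
    for (int i = 0; i < n; ++i) {
        _mm_pause(); // ~10 cycles on older cores, ~140 on Skylake per the article above
    }
    const unsigned long long ticks = __rdtsc() - start;
    std::cout << ticks / n << " ticks per pause\n";
}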

@StephanTLavavej added the label "decision needed (We need to choose something before working on this)" and removed "info needed (We need more info before working on this)" on Apr 8, 2020
@AlexGuteniev
Contributor Author

AlexGuteniev commented Apr 10, 2020

@BillyONeal , I have an idea.

As part of #593 (atomic waits), I can add some minimal level of Windows XP support:
that is, just spinning on wait, first actively, then using SwitchToThread, with no-ops for notify_one and notify_all. Then atomic wait could be used to resolve #370, and also for the internal atomic spinlock.

Then the question about _YIELD_PROCESSOR would become moot: _YIELD_PROCESSOR would only be called when waiting on odd-sized atomics and on old operating systems.

(Note that WaitOnAddress itself spins before going to a kernel wait, and it spins with pause, but that is no longer the STL's responsibility.)
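
Roughly this shape (the names and the spin threshold here are invented):

#include <Windows.h>
#include <atomic>

// Sketch of the described fallback: spin actively first, then yield the
// rest of the timeslice to other ready threads.
void fallback_wait(const std::atomic<unsigned long>& word, unsigned long old_value) {
    for (unsigned spins = 0; word.load(std::memory_order_relaxed) == old_value; ++spins) {
        if (spins < 4096) {
            YieldProcessor(); // _mm_pause on x86/x64
        } else {
            SwitchToThread(); // let another ready thread run
        }
    }
}
// notify_one / notify_all would be no-ops in this mode.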

What do you think?

@MikeGitb

MikeGitb commented Apr 10, 2020

So adding _mm_pause can negatively affect some workloads while positively affecting others, so it'd probably be best to find out which type of workload is the most common and optimize for that, while informing others of the change (if the change happens).

Correct me if I'm wrong, but the problem in the linked article was that .NET used something like 50 pauses in the first iteration and then tripled that in each iteration. Of course that leads to very long backoff times very quickly. That doesn't mean that something like

while (!flag) {
    _mm_pause();
}

isn't still better than

while (!flag) {
    ;
}

It is only when you try to figure out the "optimal" number of pause instructions that this becomes an issue, but e.g. one pause is certainly still better than zero.
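
The problematic pattern from the article looks roughly like this (a sketch with illustrative numbers; flag is a std::atomic<bool> as in the snippets above):

int pauses = 50;
while (!flag.load(std::memory_order_relaxed)) {
    for (int i = 0; i < pauses; ++i) {
        _mm_pause();
    }
    pauses *= 3; // unbounded growth; at ~140 cycles per pause this explodes quickly
}
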
Also, if exact perf behavior is important enough that I worry about a spinlock granularity of 100 instead of 10 cycles, I'm very unlikely to use an STL construct anyway, whose implementation might change from one toolchain version to another and certainly differs from one STL to another.

And generally speaking, I'd certainly expect a _YIELD_PROCESSOR macro to do at least something on every architecture, and not just be a no-op.

@chuggafan

@MikeGitb Well, that might have been the case, but just to make sure, I ran a modified version of @AlexGuteniev's test program with 60 worker threads and a total of 30 runs, averaging the results at the end of the program. This was on an i7-7700HQ, built for x64 release and run from Visual Studio.

[results attached as a screenshot] So it seems that on average _mm_pause is legitimately better, but in individual cases it can be a toss-up. Of course, this isn't extensive testing with other, real-world programs, but it's nice that there are now "real" results, where on average _mm_pause wins by approximately 2.3 seconds or so.

The main reason I threw the article out there was to warn against deciding something like this too quickly, because there have been problems with _mm_pause before due to Intel shenanigans.

Of course, there should still be more testing and a more "concrete" answer found, but I want to warn against premature changes where it's not explicitly a bug but can affect performance. As with anything performance-related: test before making claims.

@AlexGuteniev
Contributor Author

One of the other issues (#166) mentions an interesting document, the Intel Architecture Optimization Manual:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

Section 2.3.4, "Pause Latency in Skylake Microarchitecture", covers this subject.

It mentions the growth in cycle count for the pause instruction, yet still recommends using pause for busy waiting, and even recommends exponential backoff on pause.
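
That recommendation boils down to a test-and-test-and-set loop of roughly this shape (a sketch based on the manual's description):

#include <atomic>
#include <intrin.h>

void lock(std::atomic<bool>& spin_mutex) {
    int backoff = 1;
    constexpr int max_backoff = 64;
    while (spin_mutex.exchange(true, std::memory_order_acquire)) {
        // Wait on a plain load to avoid hammering the cache line with RMWs,
        // pausing an exponentially growing (but capped) number of times.
        while (spin_mutex.load(std::memory_order_relaxed)) {
            for (int i = backoff; i != 0; --i) {
                _mm_pause();
            }
            backoff = backoff < max_backoff ? backoff << 1 : max_backoff;
        }
    }
}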

@AlexGuteniev
Contributor Author

Here I came across an explanation of a strong reason to have at least one _mm_pause:
https://stackoverflow.com/a/12904645/2945027

@AlexGuteniev
Contributor Author

Can a decision be made here?
Is some other benchmark needed, or exact quotes from authoritative sources?
Or do you want to just ignore the issue until vNext, where you can avoid spinning in the STL code altogether?

@BillyONeal
Member

I think your benchmark program demonstrates that we should add the pause. (I was reassigned to the vcpkg team between March and June, and so did not see that.)

@BillyONeal
Member

The increasing backoff example in the Skylake manual you reference is probably a good idea too.

@StephanTLavavej removed the label "decision needed (We need to choose something before working on this)" on Jul 21, 2020
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Aug 4, 2020
@AlexGuteniev
Contributor Author

Updated benchmark:

#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <intrin.h>

struct MammothAtomicValue
{
    unsigned v;

    void increment()
    {
        ++v;
    }

    bool is_equal(const MammothAtomicValue& other) const
    {
        return v == other.v;
    }

    bool is_greater_or_equal(const MammothAtomicValue& other) const
    {
        return v >= other.v;
    }
};

enum use_pause_t
{
    no_pause_no_load,
    no_pause_load,
    one_pause_no_load,
    one_pause_load,
    recommended_method,
};


template<use_pause_t use_pause>
struct MammothAtomic
{
    void lock()
    {
        switch (use_pause)
        {
        case no_pause_no_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                // nothing
            }
            break;

        case no_pause_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    // nothing
                }
            }
            break;

        case one_pause_no_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                _mm_pause();
            }
            break;

        case one_pause_load:
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    _mm_pause();
                }
            }
            break;

        case recommended_method:
        {
            int mask = 1;
            int const max = 64;
            while (spin_mutex.exchange(true, std::memory_order_acquire))
            {
                while (spin_mutex.load(std::memory_order_relaxed))
                {
                    for (int i = mask; i; --i)
                    {
                        _mm_pause();
                    }
                    mask = mask < max ? mask << 1 : max;
                }
            }
            break;
        }
        }
    }

    void unlock()
    {
        spin_mutex.store(false, std::memory_order_release);
    }

    MammothAtomicValue load()
    {
        lock();
        MammothAtomicValue result = value;
        unlock();
        return result;
    }

    MammothAtomicValue cas(const MammothAtomicValue& desired,
        const MammothAtomicValue& expected)
    {
        lock();
        MammothAtomicValue result = value;
        if (value.is_equal(expected))
        {
            value = desired;
        }
        unlock();
        return result;
    }

    MammothAtomicValue value;
    std::atomic<bool>  spin_mutex = false;

    MammothAtomicValue increment()
    {
        MammothAtomicValue expected = load();
        MammothAtomicValue result = expected;
        for (;;)
        {
            result.increment();
            result = cas(result, expected);
            if (result.is_equal(expected))
            {
                break;
            }
            expected = result;
        }
        return result;
    }
};


template<use_pause_t use_pause>
std::chrono::steady_clock::duration run_benchmark()
{
    MammothAtomicValue initial = { 0 };
    MammothAtomicValue final = { 2000000 };

    MammothAtomic<use_pause> value = { initial };

    std::atomic<bool> race_start_flag = false;

    auto worker = [&] {
        while (!race_start_flag.load()) {}

        while (!value.increment().is_greater_or_equal(final)) {}
    };

    // with two workers, _mm_pause is a double win; with one worker, no effect;
    // with more than two, some sort of win
    std::thread workers[4];

    for (auto& w : workers) {
        w = std::thread(worker);
    }

    auto start_time = std::chrono::steady_clock::now();
    race_start_flag.store(true);
    for (auto& w : workers) {
        w.join();
    }

    return std::chrono::steady_clock::now() - start_time;
}

int main()
{
    namespace chrono = std::chrono;

    for (int i = 0; i < 4; i++)
    {
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<no_pause_no_load>()).count() << " none\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<no_pause_load>()).count() << " load\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<one_pause_no_load>()).count() << " pause\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<one_pause_load>()).count() << " both\t";
        std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
            run_benchmark<recommended_method>()).count() << " as recommended" << std::endl;
    }
}

Results:

709.133 none    209.563 load    471.057 pause   123.12 both     30.3102 as recommended
694.65 none     211.071 load    459.964 pause   134.984 both    29.807 as recommended
662.293 none    198.458 load    465.107 pause   142.421 both    30.088 as recommended
721.094 none    214.071 load    472.891 pause   124.019 both    30.6598 as recommended

@cbezault
Contributor

cbezault commented Aug 6, 2020

On an ARM64 device.

646.903 none    564.265 load    632.973 pause   582.179 both    579.654 as recommended
672.233 none    574.724 load    701.4 pause     583.214 both    577.396 as recommended
685.813 none    561.642 load    668.817 pause   574.117 both    592.333 as recommended
727.697 none    552.126 load    628.689 pause   568.138 both    567.249 as recommended
651.309 none    578.399 load    604.984 pause   566.021 both    568.212 as recommended
660.482 none    576.616 load    635.601 pause   556.175 both    572.447 as recommended
700.489 none    546.835 load    680.566 pause   551.833 both    565.911 as recommended
687.385 none    602.884 load    641.061 pause   593.046 both    601.97 as recommended
645.611 none    622.082 load    661.407 pause   608.765 both    611.843 as recommended
697.712 none    641.034 load    694.685 pause   588.235 both    645.334 as recommended
667.783 none    631.312 load    683.622 pause   566.966 both    593.436 as recommended
655.962 none    570.809 load    640.184 pause   561.283 both    585.693 as recommended
612.771 none    554.762 load    647.957 pause   602.216 both    554.297 as recommended
607.908 none    580.776 load    526.137 pause   525.39 both     579.019 as recommended
700.611 none    580.204 load    663.573 pause   567.024 both    576.747 as recommended
599.543 none    590.085 load    625.307 pause   589.871 both    592.454 as recommended
650.27 none     559.459 load    577.195 pause   581.821 both    585.806 as recommended
588.967 none    550.916 load    556.084 pause   564.167 both    580.734 as recommended
602.823 none    555.853 load    591.483 pause   555.066 both    581.193 as recommended
