faster ascii check #2
Conversation
|
Just realized the unit test doesn't test |
|
The benchmark is correct on a properly configured server. It is definitely not correct on any random machine. I’ll wait for your test. Thanks.
|
You can easily disable OS-controlled frequency scaling, but I'm not sure it's possible to totally override the throttling internal to the CPU that's governed by thermal/electrical limits. Maybe you could peg the CPU to some low frequency (here just LFM, but AVX2 base freq if you make a 256-bit version) and correct for the difference between that true frequency and the base frequency that rdtsc counts at? Curious how you configured the server to correctly measure. Code was fine, added a test. Thanks!
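For reference, the OS-controlled part of this is straightforward on Linux with the intel_pstate driver. A minimal sketch (standard sysfs paths, requires root; note this does nothing about the CPU-internal thermal/electrical throttling discussed above):

```shell
# Disable turbo and pin every core's governor to a fixed policy
# (intel_pstate driver assumed; paths differ for acpi-cpufreq).
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g"
done
```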
|
Great code. |
|
Name added to AUTHORS. On my test machine, Linux counters agree closely (less than 1% error) with rdtsc on sizeable arrays. See 654e660. The downside of Linux counters is that they only work... on Linux.
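The Linux-counters approach can be sketched like this (a hypothetical reconstruction, not the actual code in 654e660): open the core-cycle counter via `perf_event_open(2)`, which, unlike rdtsc, counts actual core clock cycles, at the cost of syscall overhead around each measurement.

```c
// Hypothetical sketch: count real core cycles with perf_event_open(2).
// Returns 0 where perf is unavailable (containers, perf_event_paranoid).
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_cycles_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  // core cycles, not rdtsc's reference cycles
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Cycles spent in fn(), or 0 if the counter couldn't be opened/read.
static uint64_t measured_cycles(void (*fn)(void)) {
    int fd = open_cycles_counter();
    if (fd < 0)
        return 0;
    uint64_t count = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != sizeof(count))
        count = 0;
    close(fd);
    return count;
}

// Example workload for trying it out.
static void spin(void) {
    for (volatile int i = 0; i < 200000; i++) {}
}
```

This is also where the overhead mentioned below comes from: the ioctl/read round trips dwarf a bare rdtsc, so it only pays off on sizeable inputs.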
|
Cool, thanks, and thanks for that commit. |
|
I also expect Linux counters to have more overhead. So I would only use them for large tasks. |
That is probably true of the SSE implementation today. It may not continue to be true if you move to higher-power AVX/AVX2/AVX-512 instructions. The CPU can and does downclock to meet thermal limits under high AVX load ("Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits"). I think @zbjornson's concern has merit, for benchmarking an eventual AVX implementation.
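For context, the SSE implementation under discussion presumably has roughly this shape (an assumed sketch, not necessarily the PR's exact code): OR all 16-byte blocks together, then test the high bit of every byte once at the end.

```c
// Sketch of an SSE2 ASCII check (assumed shape, not the PR's exact code).
// Note: treats 0x00 as valid ASCII, since only bit 7 is inspected.
#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

static int is_ascii_sse2(const uint8_t *buf, size_t len) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= len; i += 16)
        acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *)(buf + i)));
    // movemask extracts bit 7 of each byte; nonzero means some byte >= 0x80.
    int bad = _mm_movemask_epi8(acc);
    for (; i < len; i++)  // scalar tail for the last < 16 bytes
        bad |= buf[i] & 0x80;
    return bad == 0;
}
```

Accumulating with OR keeps the loop body to one load and one `por` per 16 bytes, deferring the (comparatively slow) movemask to a single use outside the loop.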
|
Fair point, but some things to take into account...
So though I agree with everything you guys wrote, I would urge some caution. Can you measure these slowdowns on benchmarking servers? Note that if someone wants to collaborate with me on this, I can make available servers (by ssh). |
|
@pcordes do you want to post your frequency-measurement asm test on SKL-EP with AVX and AVX-512, or mind if I post it?
|
Mysticial reports that SKX clock throttling might only go down to AVX512 levels when FP uops are scheduled to port 5. This is compatible with my measurements.

@zbjornson sure, I can summarize what I did last year on Google Cloud VMs, where we didn't have access to performance counters or even actual P-state information (so we couldn't even look at …).

This was before full commercial release of Skylake-AVX512 silicon (Xeon Bronze/Silver/Gold/Platinum, and i9-79xx), and before full public access even on Google Compute Engine, so I don't know that these chips even behave qualitatively the same way as we'll find on full production hardware. They were probably engineering samples; their max turbo (3.8GHz) is certainly lower than that of any of the 28-core Xeon chips. This was during the limited-access period, so I think we really did have whole physical machines to ourselves, especially when we configured one with max CPUs and RAM (and thus we really were seeing max turbo). Measurement noise was very low compared to the Haswell VMs.

Instead of messing around with …: there are two levels of throttling below max turbo that I detected on Google's early-access SKX VMs with (probably) pre-production CPUs:
The CPU info says the baseline clock frequency is 2.0GHz. Measurement overhead or whatever led to my numbers being slightly lower than even multiples of 100MHz; presumably it's really 2.7 / 2.4 / 2.0GHz.

The VM reported the CPU as having 45MB of L3 cache. That's more L3 cache than Intel ended up putting in their real 28-core Xeon Platinum chips (38.5MB), but maybe that's KVM falsely combining the L3 caches of multiple sockets into one fake CPUID result. If it was really 22MB per socket, the HW might have been similar to a 16-core Xeon Platinum 8153 at 2.0 base / 2.8 max turbo.

The code I used has comments with total time and inferred CPU frequency for different single instructions or blocks. Assemble and link this with … The AVX512 results: …

You can find more latency/throughput results at http://instlatx64.atw.hu/, including a spreadsheet with timings from real hardware, and uop->port mappings extracted from IACA and/or Intel's published material.
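The inference trick described above can be sketched in C with inline asm (a hypothetical reconstruction, not pcordes' actual test): run a long serial chain of 1-cycle `add` instructions, divide by elapsed rdtsc ticks, and multiply the ratio by the known TSC frequency to get the true core frequency.

```c
// Sketch (x86-only): infer core clock from a known-latency dependency chain.
// Each `add $1, %0` has 1-cycle latency and depends on the previous one,
// so the loop runs at ~1 core cycle per iteration on out-of-order x86 cores.
#include <x86intrin.h>  // __rdtsc
#include <stdint.h>

static double core_cycles_per_tsc_tick(uint64_t iters) {
    uint64_t x = 0;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile("add $1, %0" : "+r"(x));  // serial 1-cycle chain
    uint64_t t1 = __rdtsc();
    // true core MHz ~= ratio * tsc MHz (tsc MHz from dmesg "tsc: Refined TSC").
    return (double)iters / (double)(t1 - t0);
}
```

Swapping the `add` for a block of the AVX/AVX-512 instructions under test (with their known latencies) is what exposes the per-instruction-class clock levels reported above.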
|
@lemire your blog post says you tested on an Intel Xeon W-2104 CPU @ 3.20GHz. That CPU is special for 3 reasons:
Using the port5 FMA may be what triggers throttling, so CPUs without the port5 FMA unit may never do the full AVX512 throttling. (IDK if anyone's done better experiments to figure out exactly what triggers it.)

AVX512 throttling only seems to go down to the rated baseline sticker clock speed, i.e. 3.2GHz for your CPU. The higher clocks are for the common case where there's thermal/power headroom, not a guarantee, I don't think. I'm not totally clear on this point, but I think AVX512 can only slow your CPU down to the same minimum frequency as hitting TDP limits from running heavy 256-bit FMA code on all cores, and that frequency is the sticker frequency the CPU advertises. i.e. the sticker frequency is a guaranteed baseline unless your CPU seriously overheats from inadequate cooling, in which case a different kind of throttling kicks in. Anyway, if that's the case, then a CPU without Turbo at all will have max AVX512 frequency = AVX2 FMA frequency = sticker frequency, so you won't ever see anything on that CPU :P.

There should be a Stack Overflow chat archive where Mysticial was telling me about overclocking his i9-79xx SKX desktop. It didn't turn up in some quick googling, but I do remember he said it had a configurable AVX512 frequency in the BIOS; I don't remember if there was also an AVX frequency below the max turbo for workloads with no FMA, or no AVX at all.

If you aren't already using 512-bit vectors extensively, it's not always worth it to vectorize a small part of your program with 512-bit vectors. I think the clock-speed hit lasts something like milliseconds, i.e. millions of clock cycles. If we can confirm the hit only applies to 512-bit FMA instructions, then 512-bit integer instructions could be used without concern if you're already using 256-bit vectors heavily enough that your CPU is already limited to the AVX clock. (Even a single 512-bit vector-integer instruction reduced the max turbo down to AVX2-FMA levels.)
Changing frequency stalls the CPU for thousands or tens of thousands of cycles, if the discrepancy between …

Also note that when any 512-bit uops are in the scheduler, the vector execution units on port 1 are shut down. (This only applies to operations on ZMM registers, not to all AVX-512 instructions, e.g. …)
|
@pcordes Note that I always configure testing servers so that they have turbo disabled. |
That's useful for microbenchmarking, especially if you want to use RDTSC instead of perf counters, and it's what you want to look at for tuning / profiling. But in real life, people run code with turbo and power-saving enabled, so it's also useful to know how fast different implementations run in wall-clock time. e.g. in software that doesn't otherwise use AVX512 for anything, it's useful to know how an AVX512 implementation compares against an AVX2 implementation, at their respective max clock speeds. Or with all cores loaded, at whatever clock speed the CPU manages...

This is the kind of thing people need to take into account when deciding which version to use in their program on their server. Although really that's so specific that most people should just benchmark it themselves, because the % of time that different programs spend in these functions will be very different. But it would still be interesting to see how AVX2 vs. AVX512 stacks up on the same hardware, without any artificial clock-speed limitations for either.
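A wall-clock comparison like the one described here could be sketched as follows (function and workload names are made up for illustration): time each implementation with `CLOCK_MONOTONIC` and keep the best of several reps, letting the CPU clock however it likes.

```c
// Sketch of a wall-clock harness: measures what users actually experience,
// turbo, power saving, AVX-induced downclocking and all.
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

static uint64_t ns_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Best-of-N wall-clock time in nanoseconds.
static uint64_t bench_ns(void (*fn)(void), int reps) {
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < reps; r++) {
        uint64_t t0 = ns_now();
        fn();
        uint64_t t1 = ns_now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

// Hypothetical workload standing in for the AVX2 / AVX512 implementations.
static void sample_work(void) {
    volatile uint64_t s = 0;
    for (int i = 0; i < 100000; i++) s += (uint64_t)i;
}
```

Calling `bench_ns(avx2_impl, N)` and `bench_ns(avx512_impl, N)` back to back is the simplest way to see the two at their respective achieved clock speeds, though the AVX512 run's downclock can bleed into whatever runs next, so ordering matters.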
|
I agree with you. What I want to stress is that one needs to measure these things. Otherwise, we risk making decisions based not on engineering facts but on stuff we read on the Internet ("AVX-512 slows down your server"). That's no good. Also, Intel's documentation should not be trusted on its own. One needs to measure and verify.
|
@pcordes I see you live in Halifax. I was once a professor at Acadia University nearby, and I still have friends in Halifax. Nova Scotia is the one place in the world where "brown or white bread?" does not mean what I've come to expect. |
with 4096, 50k iterations, Skylake-EP
I think this is correct... unless a zero byte is considered not valid ASCII.
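A scalar reference for the unit test makes the zero-byte question explicit. This sketch takes the interpretation that "ASCII" means every byte is below 0x80, so 0x00 is valid:

```c
// Scalar reference check: valid iff no byte has bit 7 set (0x00 passes).
#include <stddef.h>
#include <stdint.h>

static int is_ascii_scalar(const uint8_t *buf, size_t len) {
    uint8_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc |= buf[i];           // any high bit survives the ORs
    return (acc & 0x80) == 0;
}
```

If the project instead wants to reject NUL bytes, that's a different predicate (byte in 0x01..0x7F) and the test should pin down whichever definition is intended.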
Also, rdtsc counts reference cycles, not core clock cycles, so it's not really a correct benchmark. For comparing two implementations it's okayish (assuming they don't affect the core clock speed differently).