
Conversation

@zbjornson (Contributor) commented May 16, 2018

with 4096, 50k iterations, Skylake-EP

# before:
validate_ascii_fast(data, N)   :  0.102 cycles per operation (best)    0.118 cycles per operation (avg)
# after:
validate_ascii_fast(data, N)   :  0.086 cycles per operation (best)    0.101 cycles per operation (avg) 

I think this is correct... unless a zero byte is considered not valid ASCII.


Also, rdtsc counts reference cycles, not core clock cycles, so it's not really a correct benchmark. For comparing two implementations it's okayish (assuming they don't affect the core clock speed differently).
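The PR's code isn't reproduced in the thread. As a sketch of the kind of check being benchmarked — assuming SSE2 intrinsics and an OR-accumulate approach (a guess at the technique, not the PR's actual code) — ASCII validation reduces to checking that no byte has its high bit set, which also answers the zero-byte question: a zero byte has its high bit clear, so it passes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical sketch, not the PR's actual code: OR all 16-byte chunks
 * together and test the sign bits once at the end, instead of testing
 * every chunk. A zero byte has its high bit clear, so it counts as
 * valid ASCII here. */
static int is_ascii_sse2(const unsigned char *src, size_t len) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + i));
        acc = _mm_or_si128(acc, chunk);   /* accumulate set bits */
    }
    unsigned tail = 0;
    for (; i < len; i++)
        tail |= src[i];                   /* scalar tail */
    return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
}
```

Deferring the movemask to a single test outside the loop is the usual way to shave cycles off a check like this, since the loop body becomes pure load+OR.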

@zbjornson (Contributor, Author)

Just realized the unit test doesn't test validate_ascii_fast, so I haven't actually tested this. I need to run right now, but will come back to this.

@lemire (Owner) commented May 16, 2018

The benchmark is correct on a properly configured server. It is definitely not correct on any random machine.

I’ll wait for your test. Thanks.

@zbjornson commented May 16, 2018

You can easily disable OS-controlled frequency scaling, but I'm not sure it's possible to totally override the throttling internal to the CPU that's governed by thermal/electrical limits. Maybe you could peg the CPU to some low frequency (here just LFM, but the AVX2 base frequency if you make a 256-bit version) and correct for the difference between that true frequency and the base frequency that rdtsc counts at? Curious how you configured the server to measure correctly.


Code was fine, added a test. Thanks!

@lemire lemire merged commit ffd20a1 into lemire:master May 16, 2018
@lemire commented May 16, 2018

Great code.

@lemire commented May 16, 2018

Name added to AUTHORS.

@lemire commented May 16, 2018

You can easily disable OS-controlled frequency scaling, but I'm not sure that it's possible to totally override the throttling internal to the CPU that's governed by thermal/electrical limits?

On my test machine, Linux counters agree closely (less than 1% error) with rdtsc on sizeable arrays. See 654e660

The downside of Linux counters is that they only work... on Linux.
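The commit referenced above isn't reproduced here, but a minimal sketch of the Linux-counter approach — an assumption about the technique, not the commit's actual code — is to read the CPU-cycles counter via perf_event_open, the same counter perf stat uses. Unlike rdtsc, PERF_COUNT_HW_CPU_CYCLES follows the actual core clock.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>

long syscall(long number, ...);  /* prototype, in case _GNU_SOURCE wasn't seen first */

/* Count core clock cycles spent in fn(). Returns -1 if counters are
 * unavailable (permissions, some VMs, kernel config). */
static long long cycles_for(void (*fn)(void)) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();                                 /* region under measurement */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = -1;
    if (read(fd, &count, sizeof count) != sizeof count) count = -1;
    close(fd);
    return count;
}

static volatile long long sink;
static void busy_loop(void) { for (long long i = 0; i < 100000; i++) sink += i; }
```

Note that unprivileged use typically requires /proc/sys/kernel/perf_event_paranoid to allow it; the per-read overhead is also why this fits large tasks better than tiny ones, as mentioned below.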

@zbjornson zbjornson deleted the faster branch May 16, 2018 21:38
@zbjornson (Contributor, Author)

Cool, thanks, and thanks for that commit.

@lemire commented May 16, 2018

I also expect Linux counters to have more overhead. So I would only use them for large tasks.

@cemeyer commented May 16, 2018

On my test machine, Linux counters agree exactly (less than 1% error) with rdtsc on sizeable arrays. See 654e660

That is probably true of the SSE implementation today. It may not continue to be true if you move to higher-power AVX/AVX2/AVX-512 instructions. The CPU can and does downclock to meet thermal limits under high AVX load ("Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits").

I think @zbjornson's concern has merit, for benchmarking an eventual AVX implementation.

@lemire commented May 16, 2018

@cemeyer

Fair point, but some things to take into account...

So though I agree with everything you guys wrote, I would urge some caution. Can you measure these slowdowns on benchmarking servers?

Note that if someone wants to collaborate with me on this, I can make available servers (by ssh).

@zbjornson (Contributor, Author)

@pcordes do you want to post your measured frequency asm test on SKL-EP with AVX and AVX-512, or mind if I post it?

@pcordes commented May 17, 2018

Mysticial reports that SKX clock throttling might only go down to AVX512 levels when FP uops are scheduled to port 5. This is compatible with my measurements.


@zbjornson sure, I can summarize what I did last year on Google Cloud VMs where we didn't have access to performance counters or even actual P-state information (so we couldn't even look at /proc/cpuinfo to see the real current clock speeds).

This was before full commercial release of Skylake-AVX512 silicon (Xeon Bronze/Silver/Gold/Platinum, and i9-79xx), and before full public access even on Google Compute Engine, so I don't know whether these chips even behave qualitatively the same way as full production hardware. They were probably engineering samples; their max turbo was certainly lower than that of the 28-core Xeon chips (3.8GHz). This was during the limited-access period, so I think we really did have whole physical machines to ourselves, especially when we configured one with max CPUs and RAM (and thus we really were seeing max turbo). Measurement noise was very low compared to the Haswell VMs.

Instead of messing around with rdtsc directly, I made loops that run at a known 1 cycle per iteration, or one per 4 clocks bottlenecked on FMA latency. Put that in a static executable that just runs the loop and exits, and time the whole process with time or perf stat. rdtsc and wall-clock time are exactly equivalent here, but using wall-clock time avoids needing to know the actual rdtsc frequency.
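The arithmetic behind this is worth making explicit: with a known cycles-per-iteration cost, the true core clock falls out of the wall-clock time directly. A small helper (using the timings quoted in this thread as examples):

```c
/* Infer the true core clock from a loop with a known cycles-per-iteration
 * cost, timed with plain wall-clock time (no rdtsc needed):
 *   frequency = iterations * cycles_per_iteration / seconds            */
static double inferred_ghz(double iters, double cycles_per_iter, double secs) {
    return iters * cycles_per_iter / secs / 1e9;
}
/* e.g. the empty 1-cycle loop: 2e9 iters in 0.744 s -> ~2.688 GHz (max turbo)
 * the FMA latency chain: 4 cycles/iter, 2e9 iters in 3.348 s -> ~2.389 GHz   */
```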

There are two levels of throttling below max turbo that I detected on google's early-access SKX VMs with (probably) pre-production CPUs:

  • 2688 MHz: max turbo, integer-only or light-weight AVX2
  • 2392 MHz: heavy AVX2 (high throughput FMA) or light AVX512 (latency-bottlenecked VFMADD zmm, or just vxorps zmm)
  • 1984 MHz: heavy AVX512 (1/clock VFMADD zmm or VADDPS)

The CPU info says the baseline clock frequency is 2.0GHz. Measurement overhead or whatever led to my numbers being slightly lower than exact multiples of 100MHz; presumably it's really 2.7 / 2.4 / 2.0 GHz.

The VM reported the CPU as having 45MB of L3 cache. That's more L3 cache than Intel ended up putting in their real 28-core Xeon Platinum chips (38.5MB), but maybe that's KVM falsely combining the L3 caches of multiple sockets into one fake CPUID result. If it was really 22MB per socket, the HW might have been similar to a 16-core Xeon Platinum 8153 at 2.0 base / 2.8 max turbo.

The code I used has comments with total time and inferred CPU frequency for different single instructions or blocks:

global _start
_start:
mov ecx, 2000000000

align 64
.loop:
  ;; AVX2
  ;vxorps ymm2, ymm2, ymm2      ;; 0m0.744s  = 2688 MHz  (same as with no insns)                                                                           

  ;vfmadd132ps ymm2, ymm2, ymm2 ;; 0m2.976s -> latency=4 -> 2688 MHz                                                                                       
 
  ;vxorps xmm2, xmm2, xmm2   ; dependency-breaking
  ;vfmadd132ps ymm2, ymm2, ymm2 ;; 0m2.976s -> 2392 MHz (throttle only with high throughput)

   ; and more, see below
  dec ecx
  jne .loop

mov eax, 231   ; __NR_exit_group
syscall

Assemble and link this with nasm -felf64 foo.asm -g -Fdwarf && ld -o foo foo.o, creating a static binary. No dynamic linker or CRT startup code runs before or after this loop; these are the only instructions that run (in user space) for time ./foo.

The AVX512 results:

  ;; AVX512 single instructions                                                                                                                            
  ;kxnorw  k1, k0,k0           ;; 0m0.744s  = 2688 MHz  (same as with no insns)
  ;vmovaps zmm2, zmm0          ;; 0m0.836s  = 2392 MHz
  ;vxorps zmm2, zmm1, zmm0     ;; 0m0.836s
  ;vxorps zmm2, zmm2, zmm2     ;; 0m0.836s
  ;vfmadd132ps zmm2, zmm2, zmm2 ;; 0m3.348s -> latency=4 -> 2389 MHz


  ;;; pairs/groups of instructions
  ;vxorps zmm2, zmm2, zmm2
  ;vfmadd132ps zmm2, zmm2, zmm2 ;; 0m1.008s  = 1984 MHz  (power limit?)

  ;vxorps zmm2, zmm2, zmm2 ; whether this is hoisted or not
  ;vaddps zmm1, zmm2, zmm2      ;; 0m1.008s  = 1984 MHz

  ;kunpckwd  k4, k0, k1
  ;kunpckwd  k5, k2, k3
  ;kunpckdq  k6, k4, k5       ;; 0m2.232s = 2688 MHz bottlenecked on 1 per clock throughput

  ;kunpckbw  k0, k0, k0
  ;kunpckwd  k0, k0, k0
  ;kunpckdq  k0, k0, k0      ;; 0m8.932s = 2688 MHz at 12c latency (4 per kunpck)

You can find more latency/throughput results at http://instlatx64.atw.hu/, including a spreadsheet with timings from real hardware, and uop->port mappings extracted from IACA and/or Intel's published material.

@pcordes commented May 17, 2018

@lemire your blog post says you tested on an Intel Xeon W-2104 CPU @ 3.20GHz.

That CPU is special for 3 reasons:

  • It doesn't support turbo so max freq = baseline all the time.
  • It only has one 512-bit FMA unit.
  • It's a quad-core (maybe based on the same ring-bus interconnect as regular Skylake-client?).

Using the port5 FMA may be what triggers throttling, so CPUs without the port5 FMA unit may never do the full AVX512 throttling. (IDK if anyone's done better experiments to figure out exactly what triggers throttling).

AVX512 throttling only seems to go down to the rated baseline sticker clock speed, i.e. 3.2GHz for your CPU. The higher clocks are for the common case where there's thermal/power headroom, not a guarantee, I think.

I'm not totally clear on this point, but I think AVX512 can only slow your CPU down to the same minimum frequency as hitting TDP limits from running heavy 256-bit FMA code on all cores, and that frequency is the sticker frequency the CPU advertises itself as. I.e. the sticker frequency is a guaranteed baseline, unless your CPU seriously overheats from inadequate cooling and a different kind of throttling kicks in. Anyway, if that's the case, then a CPU without Turbo at all will have max AVX512 frequency = AVX2 FMA frequency = sticker frequency, so you won't ever see anything on that CPU :P

There should be a Stack Overflow chat archive where Mysticial was telling me about overclocking his i9-79xx SKX desktop. It didn't turn up in some quick googling, but I do remember he said it had a configurable AVX512 frequency in the BIOS, but I don't remember if there was also an AVX frequency below the max turbo for workloads with no FMA or no AVX at all.


If you aren't already using 512-bit vectors extensively, it's not always worth it to vectorize a small part of your program with 512-bit vectors. I think the clock-speed hit lasts something like milliseconds, i.e. millions of clock cycles. If we can confirm the hit only applies for 512-bit FMA instructions, then 512-bit integer instructions could be used without concern if you're already heavily using 256-bit vectors enough that your CPU is already limited to the AVX clock. (Even a single 512-bit vector-integer instruction reduced the max turbo down to AVX2-FMA levels.)

Changing frequency stalls for thousands or tens of thousands of cycles, if the discrepancy between rdtsc vs. the reference_cycles performance counter on P-state transitions is an accurate measure of transition latency. So even if your AVX512 usage could be bursty, it's not worth it unless you save more cycles than that.


Also note that when any 512-bit uops are in the scheduler, the vector execution units on port1 are shut down. (Only applies to operations on ZMM registers, not to all AVX-512 instructions. e.g. vfmadd132ps ymm31, ymm30, ymm29 requires an AVX512 EVEX prefix to use high registers, but it's a 256-bit instruction, not 512-bit.)

@lemire commented May 17, 2018

@pcordes Note that I always configure testing servers so that they have turbo disabled.

@pcordes commented May 18, 2018

Note that I always configure testing servers so that they have turbo disabled.

That's useful for microbenchmarking, especially if you want to use RDTSC instead of perf counters, and what you want to look at for tuning / profiling.

But in real life, people run code with turbo and power-saving enabled, so it's also useful to know how fast different implementations run in wall-clock time. e.g. in software that doesn't otherwise use AVX512 for anything, it's useful to know how an AVX512 implementation compares against an AVX2 implementation, at their respective max clock speeds.

Or with all cores loaded at whatever clock speed the CPU manages...

This is the kind of thing people need to take into account when deciding which version to use in their program on their server. Although really that's so specific that most people should just benchmark themselves, because the % of time that different programs spend in these functions will be very different. But it would still be interesting to see how AVX2 vs. AVX512 stacks up on the same hardware, without any artificial clock-speed limitations for either.

@lemire commented May 18, 2018

@pcordes

I agree with you. What I want to stress is that one needs to measure these things. Otherwise, we risk making decisions based not on engineering facts but on stuff we read on the Internet ("AVX-512 slows down your server"). That's no good. Also, Intel's documentation should not be trusted blindly. One needs to measure and verify.

@lemire commented May 18, 2018

@pcordes I see you live in Halifax. I was once a professor at Acadia University nearby, and I still have friends in Halifax. Nova Scotia is the one place in the world where "brown or white bread?" does not mean what I've come to expect.
