In this code I collect experiments with a code generator that emits C code. The code generator (https://github.com/plops/cl-cpp-generator) is very simple. A single function consumes a lisp like language and emits a string with C code.
Although this approach is very simple, it gives access to the expressiveness of Commmon Lisp macros. I can easily play around with different code structures, try ideas fast, test the performance and throw non-working attempts away without regret.
Currently, I want to understand 2D Fast Fourier Transform. First on the CPU and eventually on the GPU.
How should memory accesses be ordered?
Should twiddle factors be recomputed on the fly or tabulated? Or is a mixed approach faster?
First I made a 16 element FFT with radix 4 work. It computes row-wise DFTs of a 4x4 matrix, transposes and does it again. This works on CPU.
Then I use this function to implement a 256 element FFT. Currently, this function isn’t working.
file | generated code | description | ||
---|---|---|---|---|
gen.lisp | cuda_try.cu | read out some details about the cuda device, also tried and aborted porting fft code | ||
gen-cpu.lisp | cpu_try.c | implement dft, fft. here i first understood how cooley-tukey split a 1d fft into ffts of 2d matrices and transpositions | ||
gen-simd.lisp | simd_try.c | i tried to express the code from gen-cpu for vectorization. i gave up because it looks like a mess and will be much slower than gpu | ||
hex.lisp | -none- | hex representation for floating point constants. i added this to cl-cpp-generator and use it for constants like twiddle factors for fft | ||
gen-pycuda-colab.lisp | pycuda_colab.py | simple example that generates python containing a string with cuda code that can be run on google colab | ||
gen-pycuda-colab2.lisp | pycuda_colab2.py | 2 stage fft that runs on google colab. looks quite messy becaus cuda has no complex numbers |
task | priority (1 .. high) | reason | comment |
---|---|---|---|
measure performance with nvidia nsight compute | 1 | learn which counters are interesting | tried on vast.ai and google colab. can’t get it to work |
read performance metrics from inside cuda | 2 | guide optimization of tile size | is that even possible? |
port https://github.com/plops/satellite-plot to cuda | 3 | decompress raw satellite payload | |
learn SAR focussing | 1 | ||
learn satellite orbit | 4 | ||
SAR focussing with terrain | 4 | ||
measure surface movement with SAR sequences | 5 | ||
store global surface timeline with 10km resolution | 6 |
## Intel Core i5 CPU M 520 @ 2.40GHz
L1I | 64 KiB | 2x32 KiB | 4-way set associative | write-back |
L1D | 64 KiB | 2x32 KiB | 8-way set associative | write-back |
L2 | 512 KiB | 2x256 KiB | 8-way set associative | write-back |
L3 | 3 MiB | 2x1.5 MiB | 12-way set associative | write-back |
## Intel x7-x8750
L1I | 128 KiB | 4x32 KiB 8-way set associative (per core) |
L1D | 96 KiB | 4x24 KiB 6-way set associative (per core) |
L2 | 2 MiB | 2x1 MiB 16-way set associative (per 2 cores) |
L3 | 0 KiB | No L3 |
## Intel Core i5-7400 CPU @ 3.00GHz
L1I | 128 KiB | 4x32 KiB | 8-way set associative | |
L1D | 128 KiB | 4x32 KiB | 8-way set associative | write-back |
L2 | 1 MiB | 4x256 KiB | 4-way set associative | write-back |
L3 | 6 MiB | 4x1.5 MiB | 12-way set associative | write-back |
## Intel Xeon E3 1245-v5 @ 3.5GHz
L1 | 256 KiB | |||
L1I | 128 KiB | 4x32 KiB | 8-way set associative | |
L1D | 128 KiB | 4x32 KiB | 8-way set associative | write-back |
L2 | 1 MiB | 4x256 KiB | 4-way set associative | write-back |
L3 | 8 MiB | 4x2 MiB | write-back |
## Intel Xeon (Skylake, IBRS) @ 2.10 GHz
- GenuineIntel Family 6 Model 85 Stepping 4
- https://browser.geekbench.com/geekbench3/8539129
- https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
- rdpmc instruction
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214
- PAPI overheads are typically in excess of 2000 cycles to read a single counter.
- “perf stat” command (or similar) will sometimes use these fixed function counters and will disable them on exit
- use kernel parameters isolcpus=0 to free one of the cpus for benchmarking
- https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf
- cuda complex library https://github.com/jtravs/cuda_complex/blob/master/cuda_complex.hpp
- Vasily Volkov FFT on GPU (2008) https://people.eecs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
- https://github.com/vetter/shoc/blob/master/src/cuda/level1/fft/fftlib.cu
- http://www.bealto.com/gpu-fft_intro.html
- https://mc.stanford.edu/cgi-bin/images/7/75/SC08_FFT_on_GPUs.pdf
- eliminate power-of-two memory stride for better caching https://www.davidhbailey.com/dhbpapers/fftzp.pdf
- how to split dft into 2d https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm
- gcc vectorization https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
- profile mem access (2007) https://lwn.net/Articles/257209/
- opannotate .. lists the source or assembler code of the program and shows the instructions where the event was recognized
- how to use perf https://www.youtube.com/watch?v=M6ldFtwWup0
- https://github.com/RRZE-HPC/likwid/wiki
- http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
- good explanation how to measure and interprete timing http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
- https://github.com/jdmccalpin/low-overhead-timers
- https://github.com/jdmccalpin/periodic-performance-counters
- offloading https://www.youtube.com/watch?v=kIA_UtdVabQ
- #pragma omp target teams distribute parallel for collapse(2) map(to:zr,zi,xscale,yscale) map(from:results[0:npixels])
- https://software.intel.com/en-us/forums/intel-isa-extensions/topic/289038
- working prefetch increases the cache misses (but speeds up execution)
- resource stall ratio: RS_FULL ratio and ROB_FULL ratio
- Frequent resource stalls indicate the presence of tuning opportunities like frequent cache misses, long execution paths, and memory order buffer (MOB) stalls
- https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf
- Performance Application Programming Interface http://icl.cs.utk.edu/papi/Perfmon2 http://perfmon2.sourceforge.net/
- Andi Kleen’s pmu-tools https://github.com/andikleen/pmu-tools
- toplev .. Estimate on which part of the CPU pipeline a workload bottlenecks using the TopDown mode
- sudo ./toplev.py –all –core C0 taskset -c 0 ~/stage/cl-gen-cuda-try/source/cpu_try_gcc
- http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
- turning hyperthreading off gives 8 (instead of 4 general purpose counters) since sandy bridge
- for profiling disable virtualization, turbo and smt
- affinity: taskset0x<mask> in Linux or start /affinity 0x<mask> in Windows
- Performance Analysis in Modern Multicores (cpu architect intel) http://cs.haifa.ac.il/~yosi/PARC/yasin.pdf
- Butterfly FFT http://cs.haifa.ac.il/courses/verilog/p2011.html
- LikwidPerfCtr https://github.com/rrze-likwid/likwid
- gcc simd https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
- https://news.ycombinator.com/item?id=16050812
- some simd helpers https://godbolt.org/g/N9VvXZ
- how to load https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions
- complex number https://stackoverflow.com/questions/41417061/how-to-enable-sse3-autovectorization-in-gcc
- complex https://github.com/Roger-luo/CSIMD
- complex https://nullprogram.com/blog/2015/07/10/
- complex with eigen
https://stackoverflow.com/questions/45298855/how-to-write-portable-simd-code-for-complex-multiplicative-reduction
- store real and imaginary parts in a way that they can be directly loaded one packet at once, interleaved is not good
- -mfma -mavx512f
- https://github.com/p12tic/libsimdpp
- https://www.ti.uni-bielefeld.de/downloads/publications/templateSIMD.pdf
- http://www.ti.uni-bielefeld.de/html/people/moeller/tsimd_warpingsimd.html
- only academic license
- https://www.agner.org/optimize/vectorclass.pdf
- https://github.com/google/dimsum
- ispc
- how risc-v does simd https://www.youtube.com/watch?v=GzZ-8bHsD5s
- handmade ray 03 https://www.youtube.com/watch?v=dpvrPYdTkPw
- -fmerge-constants Attempt to merge identical constants (string constants and floating point constants) across compilation units. This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.