GitHub - plops/cl-gen-cuda-try: try to generate some cuda code using https://github.com/plops/cl-cpp-generator and https://github.com/plops/cl-py-generator. Currently I use google colab and hope to progress towards signal processing code for synthetic aperture radar data of the ESA Copernicus satellites.

In this code I collect experiments with a code generator that emits C code. The code generator (https://github.com/plops/cl-cpp-generator) is very simple. A single function consumes a lisp like language and emits a string with C code.

Although this approach is very simple, it gives access to the expressiveness of Commmon Lisp macros. I can easily play around with different code structures, try ideas fast, test the performance and throw non-working attempts away without regret.

Currently, I want to understand 2D Fast Fourier Transform. First on the CPU and eventually on the GPU.

How should memory accesses be ordered?

Should twiddle factors be recomputed on the fly or tabulated? Or is a mixed approach faster?

First I made a 16 element FFT with radix 4 work. It computes row-wise DFTs of a 4x4 matrix, transposes and does it again. This works on CPU.

Then I use this function to implement a 256 element FFT. Currently, this function isn’t working.

file	generated code	description
gen.lisp	cuda_try.cu	read out some details about the cuda device, also tried and aborted porting fft code
gen-cpu.lisp	cpu_try.c	implement dft, fft. here i first understood how cooley-tukey split a 1d fft into ffts of 2d matrices and transpositions
gen-simd.lisp	simd_try.c	i tried to express the code from gen-cpu for vectorization. i gave up because it looks like a mess and will be much slower than gpu
hex.lisp	-none-	hex representation for floating point constants. i added this to cl-cpp-generator and use it for constants like twiddle factors for fft
gen-pycuda-colab.lisp	pycuda_colab.py	simple example that generates python containing a string with cuda code that can be run on google colab
gen-pycuda-colab2.lisp	pycuda_colab2.py	2 stage fft that runs on google colab. looks quite messy becaus cuda has no complex numbers

task	priority (1 .. high)	reason	comment
measure performance with nvidia nsight compute	1	learn which counters are interesting	tried on vast.ai and google colab. can’t get it to work
read performance metrics from inside cuda	2	guide optimization of tile size	is that even possible?
port https://github.com/plops/satellite-plot to cuda	3	decompress raw satellite payload
learn SAR focussing	1
learn satellite orbit	4
SAR focussing with terrain	4
measure surface movement with SAR sequences	5
store global surface timeline with 10km resolution	6

## Intel Core i5 CPU M 520 @ 2.40GHz

https://en.wikichip.org/wiki/intel/core_i5/i5-520m

L1I	64 KiB	2x32 KiB	4-way set associative	write-back
L1D	64 KiB	2x32 KiB	8-way set associative	write-back
L2	512 KiB	2x256 KiB	8-way set associative	write-back
L3	3 MiB	2x1.5 MiB	12-way set associative	write-back

## Intel x7-x8750

https://en.wikichip.org/wiki/intel/atom_x7/z8750

L1I	128 KiB	4x32 KiB 8-way set associative (per core)
L1D	96 KiB	4x24 KiB 6-way set associative (per core)
L2	2 MiB	2x1 MiB 16-way set associative (per 2 cores)
L3	0 KiB	No L3

## Intel Core i5-7400 CPU @ 3.00GHz

https://en.wikichip.org/wiki/intel/core_i5/i5-7400

L1I	128 KiB	4x32 KiB	8-way set associative
L1D	128 KiB	4x32 KiB	8-way set associative	write-back
L2	1 MiB	4x256 KiB	4-way set associative	write-back
L3	6 MiB	4x1.5 MiB	12-way set associative	write-back

## Intel Xeon E3 1245-v5 @ 3.5GHz

https://en.wikichip.org/wiki/intel/xeon_e3/e3-1245_v5

L1	256 KiB
L1I	128 KiB	4x32 KiB	8-way set associative
L1D	128 KiB	4x32 KiB	8-way set associative	write-back
L2	1 MiB	4x256 KiB	4-way set associative	write-back
L3	8 MiB	4x2 MiB		write-back

## Intel Xeon (Skylake, IBRS) @ 2.10 GHz

GenuineIntel Family 6 Model 85 Stepping 4
https://browser.geekbench.com/geekbench3/8539129
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

rdpmc instruction https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214
- PAPI overheads are typically in excess of 2000 cycles to read a single counter.
- “perf stat” command (or similar) will sometimes use these fixed function counters and will disable them on exit
use kernel parameters isolcpus=0 to free one of the cpus for benchmarking
https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf

cuda complex library https://github.com/jtravs/cuda_complex/blob/master/cuda_complex.hpp
Vasily Volkov FFT on GPU (2008) https://people.eecs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
https://github.com/vetter/shoc/blob/master/src/cuda/level1/fft/fftlib.cu
http://www.bealto.com/gpu-fft_intro.html
https://mc.stanford.edu/cgi-bin/images/7/75/SC08_FFT_on_GPUs.pdf
eliminate power-of-two memory stride for better caching https://www.davidhbailey.com/dhbpapers/fftzp.pdf
how to split dft into 2d https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm
gcc vectorization https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
profile mem access (2007) https://lwn.net/Articles/257209/
- opannotate .. lists the source or assembler code of the program and shows the instructions where the event was recognized
how to use perf https://www.youtube.com/watch?v=M6ldFtwWup0
https://github.com/RRZE-HPC/likwid/wiki
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

good explanation how to measure and interprete timing http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
https://github.com/jdmccalpin/low-overhead-timers
https://github.com/jdmccalpin/periodic-performance-counters
offloading https://www.youtube.com/watch?v=kIA_UtdVabQ
- #pragma omp target teams distribute parallel for collapse(2) map(to:zr,zi,xscale,yscale) map(from:results[0:npixels])
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/289038
- working prefetch increases the cache misses (but speeds up execution)
- resource stall ratio: RS_FULL ratio and ROB_FULL ratio
- Frequent resource stalls indicate the presence of tuning opportunities like frequent cache misses, long execution paths, and memory order buffer (MOB) stalls
https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf
- Performance Application Programming Interface http://icl.cs.utk.edu/papi/Perfmon2 http://perfmon2.sourceforge.net/
- Andi Kleen’s pmu-tools https://github.com/andikleen/pmu-tools
  - toplev .. Estimate on which part of the CPU pipeline a workload bottlenecks using the TopDown mode
  - sudo ./toplev.py –all –core C0 taskset -c 0 ~/stage/cl-gen-cuda-try/source/cpu_try_gcc
  - http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
    - turning hyperthreading off gives 8 (instead of 4 general purpose counters) since sandy bridge
    - for profiling disable virtualization, turbo and smt
    - affinity: taskset0x<mask> in Linux or start /affinity 0x<mask> in Windows
  - Performance Analysis in Modern Multicores (cpu architect intel) http://cs.haifa.ac.il/~yosi/PARC/yasin.pdf
  - Butterfly FFT http://cs.haifa.ac.il/courses/verilog/p2011.html
- LikwidPerfCtr https://github.com/rrze-likwid/likwid

gcc simd https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
- https://news.ycombinator.com/item?id=16050812
- some simd helpers https://godbolt.org/g/N9VvXZ
- how to load https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions
- complex number https://stackoverflow.com/questions/41417061/how-to-enable-sse3-autovectorization-in-gcc
- complex https://github.com/Roger-luo/CSIMD
- complex https://nullprogram.com/blog/2015/07/10/
- complex with eigen https://stackoverflow.com/questions/45298855/how-to-write-portable-simd-code-for-complex-multiplicative-reduction
  - store real and imaginary parts in a way that they can be directly loaded one packet at once, interleaved is not good
  - -mfma -mavx512f
https://github.com/p12tic/libsimdpp
- http://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/
https://www.ti.uni-bielefeld.de/downloads/publications/templateSIMD.pdf
- http://www.ti.uni-bielefeld.de/html/people/moeller/tsimd_warpingsimd.html
- only academic license
https://www.agner.org/optimize/vectorclass.pdf
https://github.com/google/dimsum
ispc
how risc-v does simd https://www.youtube.com/watch?v=GzZ-8bHsD5s
handmade ray 03 https://www.youtube.com/watch?v=dpvrPYdTkPw
-fmerge-constants Attempt to merge identical constants (string constants and floating point constants) across compilation units. This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

source

source

README.org

README.org

gen-cpu.lisp

gen-cpu.lisp

gen-pycuda-colab.lisp

gen-pycuda-colab.lisp

gen-pycuda-colab2.lisp

gen-pycuda-colab2.lisp

gen-simd.lisp

gen-simd.lisp

gen.lisp

gen.lisp

hex.lisp

hex.lisp

Repository files navigation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 198 Commits
source		source
README.org		README.org
gen-cpu.lisp		gen-cpu.lisp
gen-pycuda-colab.lisp		gen-pycuda-colab.lisp
gen-pycuda-colab2.lisp		gen-pycuda-colab2.lisp
gen-simd.lisp		gen-simd.lisp
gen.lisp		gen.lisp
hex.lisp		hex.lisp

plops/cl-gen-cuda-try

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages