
# plops/cl-gen-cuda-try


In this repository I collect experiments with a code generator that emits C code. The code generator (https://github.com/plops/cl-cpp-generator) is very simple: a single function consumes a Lisp-like language and emits a string of C code.
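To give a flavor of the approach (this is a hypothetical toy sketch in Python, not the actual cl-cpp-generator, which is written in Common Lisp and handles far more forms), an s-expression-to-C emitter can be a single recursive function:

```python
def emit(form):
    # Toy emitter: a form is either an atom (emitted verbatim)
    # or a (operator, *arguments) tuple. Returns a C expression string.
    if not isinstance(form, tuple):
        return str(form)
    op, *args = form
    if op in ('+', '-', '*', '/'):
        # infix arithmetic, fully parenthesized
        return '(' + (' ' + op + ' ').join(emit(a) for a in args) + ')'
    if op == 'setf':
        # assignment
        return emit(args[0]) + ' = ' + emit(args[1])
    # everything else becomes a C function call
    return op + '(' + ', '.join(emit(a) for a in args) + ')'

print(emit(('setf', 'y', ('+', 'a', ('*', 'b', 'x')))))  # y = (a + (b * x))
```

Because the input is just nested data, Lisp macros can assemble and transform these forms before emission, which is where the expressiveness comes from.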

Although this approach is very simple, it gives access to the expressiveness of Common Lisp macros. I can easily play around with different code structures, try ideas fast, test the performance, and throw non-working attempts away without regret.

Currently, I want to understand the 2D Fast Fourier Transform: first on the CPU and eventually on the GPU.

How should memory accesses be ordered?

Should twiddle factors be recomputed on the fly or tabulated? Or is a mixed approach faster?
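The two options in the question above can be made concrete with a short sketch (plain Python for illustration; in the generated C the table would be an array of constants):

```python
import cmath

def twiddle(N, k):
    # Recompute the twiddle factor W_N^k = exp(-2*pi*i*k/N) on the fly.
    # Costs a sin/cos pair per use, but no memory traffic.
    return cmath.exp(-2j * cmath.pi * k / N)

def twiddle_table(N):
    # Tabulate all N twiddle factors once; the FFT then only does
    # table lookups, trading transcendental-function cost for loads.
    return [twiddle(N, k) for k in range(N)]

table = twiddle_table(16)
# spot checks: W_16^0 = 1, W_16^4 = -i, W_16^8 = -1
```

A mixed approach would tabulate the factors for the small fixed-size sub-transforms and recompute only the stage-coupling twiddles.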

First I made a 16-element radix-4 FFT work. It computes row-wise DFTs of a 4x4 matrix, transposes, and does it again. This works on the CPU.

Then I use this function to implement a 256-element FFT. Currently, this function isn't working.
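The row-DFT / twiddle / transpose / row-DFT scheme is the four-step Cooley-Tukey factorization. The sketch below is my own plain-Python reconstruction of it (not the repository's C code); with N1 = N2 = 4 it gives the 16-point transform, and the same function with N1 = N2 = 16 gives the 256-point one:

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT, used both as the row transform and as a reference.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_four_step(x, N1, N2):
    # Four-step Cooley-Tukey for N = N1*N2:
    # view x as a matrix A[n1][n2] = x[n1 + N1*n2], DFT the rows,
    # multiply by twiddles W_N^(n1*k2), transpose, DFT the rows again,
    # then read the result out as X[N2*k1 + k2].
    N = N1 * N2
    A = [[x[n1 + N1 * n2] for n2 in range(N2)] for n1 in range(N1)]
    A = [dft(row) for row in A]                        # length-N2 row DFTs
    A = [[A[n1][k2] * cmath.exp(-2j * cmath.pi * n1 * k2 / N)
          for k2 in range(N2)] for n1 in range(N1)]    # twiddle factors
    A = [list(col) for col in zip(*A)]                 # transpose -> A[k2][n1]
    A = [dft(row) for row in A]                        # length-N1 row DFTs
    return [A[k2][k1] for k1 in range(N1) for k2 in range(N2)]
```

Replacing `dft` with the hand-written radix-4 kernel reproduces the 16-element FFT described above; feeding that 16-point routine back into the same four-step structure is exactly the 256-element construction being debugged.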

| file | generated code | description |
|------|----------------|-------------|
| gen.lisp | cuda_try.cu | read out some details about the CUDA device; also tried and aborted porting FFT code |
| gen-cpu.lisp | cpu_try.c | implement DFT and FFT; here I first understood how Cooley-Tukey splits a 1D FFT into FFTs of 2D matrices and transpositions |
| gen-simd.lisp | simd_try.c | I tried to express the code from gen-cpu for vectorization; I gave up because it looks like a mess and would be much slower than a GPU |
| hex.lisp | (none) | hex representation for floating-point constants; I added this to cl-cpp-generator and use it for constants like twiddle factors for the FFT |
| gen-pycuda-colab.lisp | pycuda_colab.py | simple example that generates Python containing a string of CUDA code that can be run on Google Colab |
| gen-pycuda-colab2.lisp | pycuda_colab2.py | two-stage FFT that runs on Google Colab; looks quite messy because CUDA has no complex numbers |
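The hex representation mentioned for hex.lisp refers to C99-style hexadecimal float literals, which encode the mantissa bits exactly, so a twiddle factor emitted into generated C survives the round-trip through source text without decimal rounding error. Python can demonstrate the idea directly (a small illustration, not the hex.lisp code itself):

```python
import math

# A twiddle-factor component as computed at generation time:
w = math.cos(2 * math.pi / 16)

# Its hexadecimal literal, e.g. a string of the form '0x1.....p-1'.
# In generated C the analogous C99 literal reproduces the exact bits,
# unlike a decimal literal printed with too few digits.
lit = w.hex()

# Round-trip is exact:
assert float.fromhex(lit) == w
print(lit)
```

The same guarantee holds in C99, where `printf("%a", w)` produces such a literal and the compiler parses it back bit-exactly.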
| task | priority (1 = high) | reason | comment |
|------|---------------------|--------|---------|
| measure performance with NVIDIA Nsight Compute | 1 | learn which counters are interesting | tried on vast.ai and Google Colab; can't get it to work |
| read performance metrics from inside CUDA | 2 | guide optimization of tile size | is that even possible? |
| port https://github.com/plops/satellite-plot to CUDA | 3 | decompress raw satellite payload | |
| learn SAR focusing | 1 | | |
| learn satellite orbits | 4 | | |
| SAR focusing with terrain | 4 | | |
| measure surface movement with SAR sequences | 5 | | |
| store a global surface timeline with 10 km resolution | 6 | | |

## Intel Core i5 CPU M 520 @ 2.40GHz

| level | total size | organization | associativity | write policy |
|-------|------------|--------------|---------------|--------------|
| L1I | 64 KiB | 2x32 KiB | 4-way set associative | write-back |
| L1D | 64 KiB | 2x32 KiB | 8-way set associative | write-back |
| L2 | 512 KiB | 2x256 KiB | 8-way set associative | write-back |
| L3 | 3 MiB | 2x1.5 MiB | 12-way set associative | write-back |

## Intel Atom x7-Z8750

| level | total size | organization | associativity |
|-------|------------|--------------|---------------|
| L1I | 128 KiB | 4x32 KiB | 8-way set associative (per core) |
| L1D | 96 KiB | 4x24 KiB | 6-way set associative (per core) |
| L2 | 2 MiB | 2x1 MiB | 16-way set associative (per 2 cores) |
| L3 | none | | no L3 cache |

## Intel Core i5-7400 CPU @ 3.00GHz

| level | total size | organization | associativity | write policy |
|-------|------------|--------------|---------------|--------------|
| L1I | 128 KiB | 4x32 KiB | 8-way set associative | |
| L1D | 128 KiB | 4x32 KiB | 8-way set associative | write-back |
| L2 | 1 MiB | 4x256 KiB | 4-way set associative | write-back |
| L3 | 6 MiB | 4x1.5 MiB | 12-way set associative | write-back |

## Intel Xeon E3-1245 v5 @ 3.50GHz

| level | total size | organization | associativity | write policy |
|-------|------------|--------------|---------------|--------------|
| L1 (combined) | 256 KiB | | | |
| L1I | 128 KiB | 4x32 KiB | 8-way set associative | |
| L1D | 128 KiB | 4x32 KiB | 8-way set associative | write-back |
| L2 | 1 MiB | 4x256 KiB | 4-way set associative | write-back |
| L3 | 8 MiB | 4x2 MiB | | write-back |

## Intel Xeon (Skylake, IBRS) @ 2.10 GHz

## About

Try to generate CUDA code using https://github.com/plops/cl-cpp-generator and https://github.com/plops/cl-py-generator. Currently I use Google Colab and hope to progress towards signal-processing code for synthetic aperture radar data from the ESA Copernicus satellites.
