CUDA memory access recorder and visualizer
This is a simple tool that records all memory accesses and timestamps of the accesses in a CUDA program. It is done by writing every memory access index and streaming multiprocessor clock cycle value during the access into global device memory. Depending on the amount of memory accesses, this might require quite a lot of space and makes this tool usable only on very small datasets.
The example directory contains some screen recordings of access pattern animations for the examples.
To generate access patterns for the examples seen above, go to
cd examples/v0 make && ./bin/main
Start a local web server for the animation app:
cd ../../web && python3 -m http.server
Go to http://0.0.0.0:8000 and submit the generated
You should now see the access pattern from the first gif.
To use the v1 kernel, open
examples/v0/main.cu and define the
kernel_v1 macro instead of
If it does not work
Some things to try:
- Make sure you are running wrapper
- The wrapper objects support only array indexing, pointer arithmetic etc. is not available.
- Make sure the wrapper object calls
enter_kernelonce somewhere at the beginning of a kernel before the first memory access.
- Make sure you call
cudaDeviceSynchronizeafter the kernel call so that the unified memory pointers are accessible at the host.
- Define the value of macro
PR_VERBOSITYas 1 before including
pattern_recorder.cuh. This will trigger some asserts.
- If you are getting a warning of possibly using too much device memory, try reducing the number of required memory accesses by using a smaller data sample.