- GPU-side gamma correction kernel avoids copying full float4 accum buffer to host each frame;
only the small uint8 display image (3 bytes/pixel vs 16) is transferred back
- Accumulation buffer stays permanently on GPU; camera reset uses cudaMemset instead of free/realloc
- Cache cudaGetDeviceProperties result instead of querying every frame
- Flatten CRTP material dispatch into direct switch in scatter_material(), reducing register pressure
- BVH child ordering uses ray direction sign along split axis (one comparison) instead of
two length_squared() distance computations per interior node
- BVH nodes padded to 64-byte cache-line alignment for single-transaction fetches
- Add --motion-samples CLI parameter (default 10) for minimum samples during camera motion
- Update optimization plan with status tracking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CUDA_OPTIMIZATION_PLAN.md (30 additions, 9 deletions)
@@ -3,15 +3,17 @@
## Context
RayON's CUDA renderer is significantly slower than a comparable Vulkan RT raytracer (RayTracingInVulkan) primarily because it performs BVH traversal and intersection in software on shader cores, while Vulkan uses dedicated RT cores. This plan catalogs actionable optimizations and assesses OptiX migration.
See `explanations/VULKAN_VS_CUDA_PERFORMANCE.md` for the detailed comparison.
Uncomment `--use_fast_math` in CMakeLists.txt line ~208. Enables fast `rsqrtf`, fused multiply-add, relaxed denormals. Negligible visual impact for a renderer.
### Option 1: Enable `--use_fast_math` (DONE)
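Enables fast `rsqrtf`, fused multiply-add, and relaxed (flush-to-zero) denormal handling; per the nvcc documentation the flag implies `--ftz=true --prec-div=false --prec-sqrt=false --fmad=true`. Negligible visual impact.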
Split monolithic kernel into separate stages (ray gen → intersect → shade per material → bounce). Eliminates most warp divergence. Major architectural change.
@@ -40,6 +41,26 @@ Replace 1:1 pixel-thread mapping with fixed thread count pulling from global que
### Option 9: Migrate to OptiX (Hard, ~5-10x speedup)
Use NVIDIA OptiX SDK to access hardware RT cores for BVH traversal and intersection. This is the only path to match Vulkan RT performance. See detailed assessment below.
Accumulation buffer stays on GPU. GPU-side `gammaCorrectKernel` produces uint8 display image directly. Only the small uint8 image (3 bytes/pixel) is copied to host instead of the full float4 buffer (16 bytes/pixel). Also uses `cudaMemset` instead of free/realloc on camera change.
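The per-pixel work and the transfer-size arithmetic described above can be sketched on the host as follows. This is a minimal illustration written as plain C++ so it compiles without the CUDA toolkit; in a kernel, each loop iteration would be one thread's work. `Float4`, `to_display_byte`, `gamma_correct`, the explicit `sample_count` parameter, and the gamma value of 2.2 are assumptions for illustration, not taken from `gammaCorrectKernel`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CUDA's float4 (16 bytes per pixel).
struct Float4 { float x, y, z, w; };
static_assert(sizeof(Float4) == 16, "accumulation pixel is 16 bytes");

// Average one accumulated channel, clamp to [0, 1], apply gamma, quantize to 8 bits.
inline std::uint8_t to_display_byte(float accumulated, int sample_count, float gamma) {
    assert(sample_count > 0);
    const float mean = accumulated / static_cast<float>(sample_count);
    const float clamped = std::min(1.0f, std::max(0.0f, mean));
    const float corrected = std::pow(clamped, 1.0f / gamma);
    return static_cast<std::uint8_t>(corrected * 255.0f + 0.5f);
}

// Host-side model of the kernel: one iteration corresponds to one GPU thread.
// Output is tightly packed RGB, 3 bytes per pixel. Gamma 2.2 is an assumed default.
inline std::vector<std::uint8_t> gamma_correct(const std::vector<Float4>& accum,
                                               int sample_count,
                                               float gamma = 2.2f) {
    std::vector<std::uint8_t> rgb(accum.size() * 3);
    for (std::size_t i = 0; i < accum.size(); ++i) {
        rgb[3 * i + 0] = to_display_byte(accum[i].x, sample_count, gamma);
        rgb[3 * i + 1] = to_display_byte(accum[i].y, sample_count, gamma);
        rgb[3 * i + 2] = to_display_byte(accum[i].z, sample_count, gamma);
    }
    return rgb;
}

// Bytes that would cross the bus per frame for each strategy.
constexpr std::size_t accum_bytes(std::size_t pixels)   { return pixels * sizeof(Float4); }
constexpr std::size_t display_bytes(std::size_t pixels) { return pixels * 3; }
```

At 1920x1080, `accum_bytes` gives about 33.2 MB per frame against about 6.2 MB for `display_bytes`, which is the 16-to-3 ratio the text refers to.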
`getOptimalBlockSize()` caches result in static variable instead of calling `cudaGetDeviceProperties()` every frame.
- **File**: `renderer_cuda_device.cu`
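The caching pattern described above can be sketched with a function-local static. This is an illustrative host-only model: `DeviceProps`, `query_device_props`, `cached_device_props`, `optimal_block_size`, and the numeric values are hypothetical stand-ins, and the real code would call `cudaGetDeviceProperties()` where the stub is.

```cpp
#include <cassert>

// Hypothetical subset of the device properties this sketch needs.
struct DeviceProps {
    int max_threads_per_block;
    int warp_size;
};

// Counts how often the "driver" is queried, so the caching is observable.
inline int& query_count() {
    static int count = 0;
    return count;
}

// Hypothetical stand-in for cudaGetDeviceProperties(); values are illustrative only.
inline DeviceProps query_device_props() {
    ++query_count();
    return DeviceProps{1024, 32};
}

// Function-local static: initialized on the first call only
// (initialization is thread-safe since C++11).
inline const DeviceProps& cached_device_props() {
    static const DeviceProps props = query_device_props();
    return props;
}

// Assumed policy for illustration: prefer 256 threads, cap at the device
// limit, and round down to a whole number of warps.
inline int optimal_block_size() {
    const DeviceProps& p = cached_device_props();
    assert(p.warp_size > 0);
    int size = 256;
    if (size > p.max_threads_per_block) size = p.max_threads_per_block;
    return size - size % p.warp_size;
}
```

Calling `optimal_block_size()` once per frame then costs a static read rather than a driver query. The trade-off is that the cached value would go stale if the active device changed at run time.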
### Option C: CUDA streams for async display copy (Medium, ~10-15% latency hiding)
Overlap kernel execution with display buffer transfer using CUDA streams. Currently the pipeline is fully synchronous.
- **Files**: `renderer_cuda_device.cu`
### Option E: BVH child ordering by ray direction sign (Medium, ~5-15% speedup)
Replace expensive distance-to-center heuristic with ray direction sign along split axis. One comparison instead of two `length_squared()` computations per interior node.
- **Files**: `cuda_raytracer.cuh`
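The sign-based ordering in Option E can be sketched as below. The node layout, the names `BvhInterior` and `ordered_children`, and the convention that the builder stores the lower-coordinate child in `left` are assumptions for illustration; the actual structures in `cuda_raytracer.cuh` may differ.

```cpp
#include <cassert>
#include <utility>

struct Vec3 {
    float x, y, z;
    // Component access by axis index (0 = x, 1 = y, 2 = z).
    float operator[](int axis) const { return axis == 0 ? x : (axis == 1 ? y : z); }
};

// Hypothetical interior node: child indices plus the axis the builder split on.
// Assumption: `left` is the child on the low side of the split along `split_axis`.
struct BvhInterior {
    int left;
    int right;
    int split_axis;
};

// Returns {near, far}. A ray travelling in the negative direction along the
// split axis reaches the high-side child first, so the pair is swapped.
// Cost: one indexed load and one comparison, with no distance computation.
inline std::pair<int, int> ordered_children(const BvhInterior& node, const Vec3& ray_dir) {
    assert(node.split_axis >= 0 && node.split_axis < 3);
    const bool negative = ray_dir[node.split_axis] < 0.0f;
    return negative ? std::make_pair(node.right, node.left)
                    : std::make_pair(node.left, node.right);
}
```

A traversal loop would push `far` onto the stack and descend into `near` first, so a hit in the near child can shorten the ray before the far child is tested.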
### Option F: Flatten material dispatch in ray_color (Medium, ~5-15% speedup)
61
+
Remove CRTP lambda dispatch (`dispatch_material_bool`) and replace with explicit switch. Reduces register pressure and gives `nvcc` better optimization control.
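A flattened dispatch of the kind described might look like the following sketch. The material kinds, the `Material` and `ScatterResult` layouts, and all function names here are hypothetical, and the scatter results are reduced to a flag and a scalar so the control flow stays visible.

```cpp
#include <cassert>
#include <cstdint>

// Assumed set of material kinds, for illustration only.
enum class MaterialType : std::uint8_t { Lambertian, Metal, Dielectric, Emissive };

// Plain tagged struct: no templates or lambdas stand between the call site
// and the per-material code.
struct Material {
    MaterialType type;
    float albedo;
};

struct ScatterResult {
    bool scattered;
    float attenuation;
};

// Per-material handlers, kept trivial on purpose.
inline ScatterResult scatter_lambertian(const Material& m) { return {true, m.albedo}; }
inline ScatterResult scatter_metal(const Material& m)      { return {true, m.albedo}; }
inline ScatterResult scatter_dielectric(const Material&)   { return {true, 1.0f}; }
inline ScatterResult scatter_emissive(const Material&)     { return {false, 0.0f}; }

// One explicit switch replaces the templated dispatch, so the compiler sees
// every branch directly at the call site.
inline ScatterResult scatter(const Material& m) {
    switch (m.type) {
        case MaterialType::Lambertian: return scatter_lambertian(m);
        case MaterialType::Metal:      return scatter_metal(m);
        case MaterialType::Dielectric: return scatter_dielectric(m);
        case MaterialType::Emissive:   return scatter_emissive(m);
    }
    assert(false && "unknown material type");
    return {false, 0.0f};
}
```

Whether this actually lowers register use depends on the compiler and the real handler bodies, so the per-kernel register counts reported by `nvcc` are worth comparing before and after.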