
Commit dafa369

pmudry and claude committed
CUDA progressive renderer optimizations: eliminate D2H round-trip, flatten materials, improve BVH
- GPU-side gamma correction kernel avoids copying the full float4 accum buffer to host each frame; only the small uint8 display image (3 bytes/pixel vs 16) is transferred back
- Accumulation buffer stays permanently on GPU; camera reset uses cudaMemset instead of free/realloc
- Cache the cudaGetDeviceProperties result instead of querying every frame
- Flatten CRTP material dispatch into a direct switch in scatter_material(), reducing register pressure
- BVH child ordering uses the ray direction sign along the split axis (one comparison) instead of two length_squared() distance computations per interior node
- BVH nodes padded to 64-byte cache-line alignment for single-transaction fetches
- Add --motion-samples CLI parameter (default 10) for minimum samples during camera motion
- Update optimization plan with status tracking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1f5929a commit dafa369

File tree

9 files changed: +324 −152 lines changed

CUDA_OPTIMIZATION_PLAN.md

Lines changed: 30 additions & 9 deletions
@@ -3,15 +3,17 @@
 ## Context
 RayON's CUDA renderer is significantly slower than a comparable Vulkan RT raytracer (RayTracingInVulkan) primarily because it performs BVH traversal and intersection in software on shader cores, while Vulkan uses dedicated RT cores. This plan catalogs actionable optimizations and assesses OptiX migration.
 
+See `explanations/VULKAN_VS_CUDA_PERFORMANCE.md` for the detailed comparison.
+
 ## Optimization Options
 
-### Option 1: Enable `--use_fast_math` (Easy, ~10-30% speedup)
-Uncomment `--use_fast_math` in CMakeLists.txt line ~208. Enables fast `rsqrtf`, fused multiply-add, relaxed denormals. Negligible visual impact for a renderer.
+### Option 1: Enable `--use_fast_math` — DONE
+Enables fast `rsqrtf`, fused multiply-add, relaxed denormals. Negligible visual impact.
 - **File**: `CMakeLists.txt`
 
-### Option 2: Fix accumulation buffer memory layout (Easy, ~5-15% speedup)
-Change from 3 separate float writes per pixel to `float4` coalesced writes.
-- **Files**: `gpu_renderers/shaders/render_acc_kernel.cu`, `renderer_cuda_device.cu`, `renderer_cuda_progressive_host.hpp`
+### Option 2: Fix accumulation buffer memory layout — DONE
+Kernel already uses `float4` coalesced reads/writes.
+- **Files**: `gpu_renderers/shaders/render_acc_kernel.cu`
 
 ### Option 3: Increase occupancy / tune kernel launch (Medium, ~10-20% speedup)
 Profile with `ncu`, test 512 threads/block, evaluate register pressure vs. occupancy tradeoff.
@@ -22,12 +24,11 @@ Bind BVH node and geometry arrays as CUDA texture objects for better cache behav
 - **Files**: `renderer_cuda_device.cu`, `scene_builder_cuda.cu`, `cuda_raytracer.cuh`
 
 ### Option 5: Compact BVH node layout (Medium, ~10-20% speedup)
-Pack BVH nodes to 64-byte cache-line alignment. Consider MBVH (4-wide) to reduce tree depth.
+Pack BVH nodes to 64-byte cache-line alignment. Store child AABBs together so both children can be tested with a single cache line fetch.
 - **Files**: `cuda_scene.cuh`, `cuda_raytracer.cuh`, `scene_builder_cuda.cu`, `scenes/scene_description.hpp`
 
-### Option 6: Russian roulette from bounce 1 (Easy, ~5-10% speedup)
-Start Russian roulette termination earlier (currently bounce 3) with energy compensation.
-- **Files**: `gpu_renderers/cuda_raytracer.cuh`
+### Option 6: Russian roulette from bounce 1 — DONE
+Already starts at bounce 1 with energy compensation in `cuda_raytracer.cuh`.
 
 ### Option 7: Wavefront path tracing (Hard, ~30-50% speedup)
 Split monolithic kernel into separate stages (ray gen → intersect → shade per material → bounce). Eliminates most warp divergence. Major architectural change.
@@ -40,6 +41,26 @@ Replace 1:1 pixel-thread mapping with fixed thread count pulling from global que
 ### Option 9: Migrate to OptiX (Hard, ~5-10x speedup)
 Use NVIDIA OptiX SDK to access hardware RT cores for BVH traversal and intersection. This is the only path to match Vulkan RT performance. See detailed assessment below.
 
+### Option A: Eliminate D2H round-trip in progressive renderer — DONE
+Accumulation buffer stays on GPU. GPU-side `gammaCorrectKernel` produces uint8 display image directly. Only the small uint8 image (3 bytes/pixel) is copied to host instead of the full float4 buffer (16 bytes/pixel). Also uses `cudaMemset` instead of free/realloc on camera change.
+- **Files**: `render_acc_kernel.cu`, `renderer_cuda_device.cu`, `renderer_cuda_progressive_host.hpp`
+
+### Option B: Cache device properties — DONE
+`getOptimalBlockSize()` caches result in static variable instead of calling `cudaGetDeviceProperties()` every frame.
+- **File**: `renderer_cuda_device.cu`
+
+### Option C: CUDA streams for async display copy (Medium, ~10-15% latency hiding)
+Overlap kernel execution with display buffer transfer using CUDA streams. Currently the pipeline is fully synchronous.
+- **Files**: `renderer_cuda_device.cu`
+
+### Option E: BVH child ordering by ray direction sign (Medium, ~5-15% speedup)
+Replace expensive distance-to-center heuristic with ray direction sign along split axis. One comparison instead of two `length_squared()` computations per interior node.
+- **Files**: `cuda_raytracer.cuh`
+
+### Option F: Flatten material dispatch in ray_color (Medium, ~5-15% speedup)
+Remove CRTP lambda dispatch (`dispatch_material_bool`) and replace with explicit switch. Reduces register pressure and gives `nvcc` better optimization control.
+- **Files**: `cuda_raytracer.cuh`
+
 ## OptiX Migration Assessment
 
 ### What OptiX Provides

src/rayon/gpu_renderers/cuda_raytracer.cuh

Lines changed: 99 additions & 84 deletions
@@ -407,35 +407,35 @@ __device__ inline bool hit_scene(const CudaScene::Scene &scene, const ray_simple
         }
         else
         {
-            // Interior node: push children onto stack
-            // Push farther child first for better traversal order
+            // Interior node: push children, near child last (processed first)
+            // Use split axis + ray direction sign to determine near/far child
+            // This is a single comparison vs. two length_squared() computations
             int left_child = node.data.interior.left_child;
             int right_child = node.data.interior.right_child;
 
-            // Simple heuristic: test which child is closer
-            f3 left_center = (scene.bvh_nodes[left_child].bounds_min + scene.bvh_nodes[left_child].bounds_max) * 0.5f;
-            f3 right_center =
-                (scene.bvh_nodes[right_child].bounds_min + scene.bvh_nodes[right_child].bounds_max) * 0.5f;
-
-            float dist_left = (left_center - r.orig).length_squared();
-            float dist_right = (right_center - r.orig).length_squared();
-
-            if (dist_left < dist_right)
+            // Determine which child is "near" based on ray direction along split axis
+            float dir_component;
+            switch (node.split_axis)
             {
-                // Right is farther, push it first
-                if (stack_ptr < 32)
-                    stack[stack_ptr++] = right_child;
-                if (stack_ptr < 32)
-                    stack[stack_ptr++] = left_child;
-            }
-            else
-            {
-                // Left is farther, push it first
-                if (stack_ptr < 32)
-                    stack[stack_ptr++] = left_child;
-                if (stack_ptr < 32)
-                    stack[stack_ptr++] = right_child;
+            case 0:
+                dir_component = r.dir.x;
+                break;
+            case 1:
+                dir_component = r.dir.y;
+                break;
+            default:
+                dir_component = r.dir.z;
+                break;
             }
+
+            // If ray goes in positive direction along split axis, left child is near
+            int near_child = dir_component >= 0.0f ? left_child : right_child;
+            int far_child = dir_component >= 0.0f ? right_child : left_child;
+
+            if (stack_ptr < 32)
+                stack[stack_ptr++] = far_child;
+            if (stack_ptr < 32)
+                stack[stack_ptr++] = near_child;
         }
     }
 }
@@ -474,10 +474,69 @@ __device__ inline bool hit_scene(const CudaScene::Scene &scene, const ray_simple
 }
 
 /**
- * @brief Ray color computation using new material system
+ * @brief Inline material scatter — flat switch, no CRTP dispatch overhead.
  *
- * This version uses compile-time material dispatch via CRTP templates.
- * The compiler generates optimized code for each material type with zero overhead.
+ * Handles emission accumulation and scatter in a single switch.
+ * Returns true if the ray was scattered, false if absorbed/emissive.
+ */
+__device__ __forceinline__ bool scatter_material(const hit_record_simple &rec, const ray_simple &current_ray,
+                                                 ray_simple &scattered_ray, f3 &attenuation, f3 &emitted,
+                                                 curandState *state)
+{
+    using namespace Materials;
+
+    switch (rec.material)
+    {
+    case LAMBERTIAN:
+    {
+        Lambertian mat(LambertianParams{rec.color});
+        emitted = f3(0.0f, 0.0f, 0.0f);
+        return mat.scatter(current_ray, rec, attenuation, scattered_ray, state);
+    }
+    case MIRROR:
+    {
+        Mirror mat(MirrorParams{rec.color});
+        emitted = f3(0.0f, 0.0f, 0.0f);
+        return mat.scatter(current_ray, rec, attenuation, scattered_ray, state);
+    }
+    case ROUGH_MIRROR:
+    {
+        RoughMirror mat(RoughMirrorParams{rec.color, rec.roughness});
+        emitted = f3(0.0f, 0.0f, 0.0f);
+        return mat.scatter(current_ray, rec, attenuation, scattered_ray, state);
+    }
+    case GLASS:
+    {
+        Glass mat(GlassParams{rec.refractive_index});
+        emitted = f3(0.0f, 0.0f, 0.0f);
+        return mat.scatter(current_ray, rec, attenuation, scattered_ray, state);
+    }
+    case LIGHT:
+    {
+        emitted = rec.emission * g_light_intensity;
+        return false;
+    }
+    case CONSTANT:
+    {
+        emitted = rec.color;
+        return false;
+    }
+    case SHOW_NORMALS:
+    {
+        emitted = 0.5f * (rec.normal + f3(1.0f, 1.0f, 1.0f));
+        return false;
+    }
+    default:
+        emitted = f3(0.0f, 0.0f, 0.0f);
+        return false;
+    }
+}
+
+/**
+ * @brief Ray color computation with flattened material dispatch
+ *
+ * Uses a direct switch for material scatter/emission instead of CRTP template
+ * dispatch, reducing register pressure and giving nvcc better optimization control.
  */
 __device__ inline f3 ray_color(const ray_simple &r, const CudaScene::Scene &scene, curandState *state, int depth
 #ifdef DIAGS
@@ -486,8 +545,6 @@ __device__ inline f3 ray_color(const ray_simple &r, const CudaScene::Scene &scen
 #endif
 )
 {
-    using namespace Materials;
-
     f3 accumulated_color(0.0f, 0.0f, 0.0f);
     f3 accumulated_attenuation(1.0f, 1.0f, 1.0f);
     ray_simple current_ray = r;
@@ -501,69 +558,27 @@ __device__ inline f3 ray_color(const ray_simple &r, const CudaScene::Scene &scen
 
         if (hit_scene(scene, current_ray, 0.001f, FLT_MAX, rec))
         {
-            // Create material descriptor from hit record
-            // TODO: This is a temporary adapter - ideally hit_scene would return MaterialDescriptor directly
-            MaterialDescriptor mat_desc;
+            f3 attenuation;
+            ray_simple scattered_ray;
+            f3 emitted;
 
-            switch (rec.material)
+            bool did_scatter = scatter_material(rec, current_ray, scattered_ray, attenuation, emitted, state);
+
+            if (emitted.length_squared() > 0.0f)
             {
-            case LAMBERTIAN:
-                mat_desc = MaterialDescriptor::makeLambertian(rec.color);
-                break;
-            case MIRROR:
-                mat_desc = MaterialDescriptor::makeMirror(rec.color);
-                break;
-            case ROUGH_MIRROR:
-                mat_desc = MaterialDescriptor::makeRoughMirror(rec.color, rec.roughness);
-                break;
-            case GLASS:
-                mat_desc = MaterialDescriptor::makeGlass(rec.refractive_index);
-                break;
-            case LIGHT:
-                mat_desc = MaterialDescriptor::makeLight(rec.emission);
-                break;
-            case CONSTANT:
-                mat_desc = MaterialDescriptor::makeConstant(rec.color);
-                break;
-            case SHOW_NORMALS:
-                mat_desc = MaterialDescriptor::makeShowNormals(rec.normal);
-                break;
+                accumulated_color = accumulated_color + accumulated_attenuation * emitted;
             }
 
-            // Dispatch to appropriate material using compile-time template magic
-            bool scattered = dispatch_material_bool(
-                mat_desc,
-                [&](auto material) -> bool
-                {
-                    // Check if emissive first
-                    f3 emitted = material.emission();
-                    if (emitted.length_squared() > 0.0f)
-                    {
-                        accumulated_color = accumulated_color + f3(accumulated_attenuation.x * emitted.x,
-                                                                   accumulated_attenuation.y * emitted.y,
-                                                                   accumulated_attenuation.z * emitted.z);
-                        return false; // Light materials don't scatter
-                    }
-
-                    // Scatter the ray
-                    f3 attenuation;
-                    ray_simple scattered_ray;
-                    if (material.scatter(current_ray, rec, attenuation, scattered_ray, state))
-                    {
-                        current_ray = scattered_ray;
-                        accumulated_attenuation =
-                            f3(accumulated_attenuation.x * attenuation.x, accumulated_attenuation.y * attenuation.y,
-                               accumulated_attenuation.z * attenuation.z);
-                        return true;
-                    }
-                    return false;
-                });
-
-            if (!scattered)
+            if (!did_scatter)
             {
                 return accumulated_color;
            }
 
+            current_ray = scattered_ray;
+            accumulated_attenuation =
+                f3(accumulated_attenuation.x * attenuation.x, accumulated_attenuation.y * attenuation.y,
+                   accumulated_attenuation.z * attenuation.z);
+
             // Russian Roulette path termination (from bounce 1 for early path culling)
             if (bounce > 0)
             {

src/rayon/gpu_renderers/cuda_scene.cuh

Lines changed: 19 additions & 5 deletions
@@ -166,9 +166,22 @@ struct Geometry
 // BVH STRUCTURES (for Phase 5)
 //==============================================================================
 
-struct BVHNode
+/**
+ * @brief Cache-line-aligned BVH node (64 bytes)
+ *
+ * Packed to exactly one 64-byte cache line so that each node fetch loads
+ * all needed data in a single memory transaction. Layout:
+ *   bytes  0-11: bounds_min (f3)
+ *   bytes 12-23: bounds_max (f3)
+ *   bytes 24-27: left_child / first_geom_idx
+ *   bytes 28-31: right_child / geom_count
+ *   byte     32: is_leaf
+ *   byte     33: split_axis
+ *   bytes 34-63: padding (reserved for future use)
+ */
+struct alignas(64) BVHNode
 {
-    f3 bounds_min, bounds_max;
+    f3 bounds_min, bounds_max; // 24 bytes
 
     union NodeData
     {
@@ -185,10 +198,11 @@ struct BVHNode
         } leaf;
 
         __host__ __device__ NodeData() {} // Empty constructor for union
-    } data;
+    } data; // 8 bytes
 
-    bool is_leaf;
-    uint8_t split_axis;
+    bool is_leaf;       // 1 byte
+    uint8_t split_axis; // 1 byte
+    uint8_t _pad[30];   // Pad to 64 bytes
 };
 
 //==============================================================================

0 commit comments