Skip to content

Commit 671f69d

Browse files
committed
Block rendering with CUDA
Final version to be cleaned-up 38e9 rays in 1 minute
1 parent c987c4d commit 671f69d

File tree

12 files changed

+881
-248
lines changed

12 files changed

+881
-248
lines changed

CMakeLists.txt

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -99,6 +99,20 @@ endif()
9999
# Executables
100100
add_executable(v0_single_threaded ${EXTERNAL} ${SOURCE_V0_SINGLE_THREADED})
101101

102+
# Find and link SDL2 for real-time display (optional)
103+
find_package(SDL2 QUIET)
104+
if(SDL2_FOUND)
105+
include_directories(${SDL2_INCLUDE_DIRS})
106+
target_link_libraries(v0_single_threaded ${SDL2_LIBRARIES})
107+
add_definitions(-DSDL2_FOUND)
108+
message(STATUS "SDL2 found and linked for real-time display")
109+
else()
110+
message(WARNING "SDL2 not found. Real-time display will be disabled. Install SDL2 to enable real-time rendering.")
111+
message(WARNING "On Ubuntu/Debian: sudo apt-get install libsdl2-dev")
112+
message(WARNING "On macOS: brew install sdl2")
113+
message(WARNING "On Windows: Download from https://www.libsdl.org/download-2.0.php")
114+
endif()
115+
102116
# Link CUDA libraries if CUDA is found
103117
# Create res directory in output directory
104118
file(MAKE_DIRECTORY ${CMAKE_BINARY_DIR}/res)

README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,15 @@ It uses [single-file public domain (or MIT licensed) libraries for C/C++](https:
1313
## Using CMAKE
1414

1515
```bash
16-
cmake .
16+
cmake
1717
make -j
1818
./v0_single_threaded
1919
```
2020

2121
Or, all at once :
2222

2323
```bash
24-
cmake . && make -j 24 && ./v0_single_threaded
24+
cmake . && make -j && ./v0_single_threaded
2525
```
2626

2727
## Whitin VSCode
Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
# Explanation of `renderPixelsCUDA` and `renderPixelsKernel`
2+
3+
This document provides a detailed explanation of the CUDA implementation for rendering pixels in the ray tracer, focusing on the `renderPixelsCUDA` host function and the `renderPixelsKernel` device kernel found in `src/v0_single_threaded/camera_cuda.cu`.
4+
5+
## Overview
6+
7+
The rendering process is parallelized using CUDA by assigning the computation of each pixel's color to a separate GPU thread. This allows for a massive speedup compared to a single-threaded CPU implementation. The process involves three main stages:
8+
9+
1. **Setup on the Host (CPU)**: The C++ code prepares the rendering parameters and allocates memory on the GPU.
10+
2. **Execution on the Device (GPU)**: A CUDA kernel is launched, where thousands of threads concurrently trace rays to compute pixel colors.
11+
3. **Data Transfer back to Host (CPU)**: The final rendered image is copied from GPU memory back to CPU memory to be saved.
12+
13+
---
14+
15+
## Host Function: `renderPixelsCUDA`
16+
17+
This function is the C++ entry point that orchestrates the entire CUDA rendering pipeline. It runs on the CPU.
18+
19+
```cpp
20+
extern "C" unsigned long long renderPixelsCUDA(unsigned char* image, int width, int height,
21+
double cam_center_x, double cam_center_y, double cam_center_z,
22+
double pixel00_x, double pixel00_y, double pixel00_z,
23+
double delta_u_x, double delta_u_y, double delta_u_z,
24+
double delta_v_x, double delta_v_y, double delta_v_z,
25+
int samples_per_pixel, int max_depth)
26+
```
27+
28+
### Key Steps and CUDA Specifics:
29+
30+
1. **GPU Memory Allocation (`cudaMalloc`)**:
31+
* `cudaMalloc(&d_image, image_size)`: Allocates a buffer on the GPU's VRAM to store the final image data. `d_image` is a pointer to this device memory.
32+
* `cudaMalloc(&d_ray_count, sizeof(unsigned long long))`: Allocates memory for a single 64-bit integer on the device. This will be used as an atomic counter to track the total number of rays traced by all threads.
33+
34+
2. **GPU Memory Initialization (`cudaMemset`)**:
35+
* `cudaMemset(d_ray_count, 0, ...)`: Initializes the ray counter on the device to zero.
36+
* `cudaMemset(d_image, 0, ...)`: Clears the device image buffer.
37+
38+
3. **Execution Configuration (Grid and Block Dimensions)**:
39+
* `dim3 block_size(32, 4);`: Defines the dimensions of a **thread block**. Here, each block contains `32 * 4 = 128` threads arranged in a 2D layout. Using rectangular blocks is a common heuristic to improve performance by optimizing memory access patterns and avoiding artifacts that can sometimes arise from perfectly square configurations.
40+
* `dim3 grid_size(...)`: Defines the dimensions of the **grid of blocks**. The calculation `(width + block_size.x - 1) / block_size.x` is a standard CUDA idiom to ensure enough blocks are launched to cover every pixel of the image, even if the image dimensions are not perfect multiples of the block dimensions.
41+
42+
4. **Kernel Launch (`<<<...>>>`)**:
43+
* `renderPixelsKernel<<<grid_size, block_size>>>(...);`: This is the most critical part. It launches the `renderPixelsKernel` function on the GPU.
44+
* The `<<<grid_size, block_size>>>` syntax tells the CUDA runtime how many threads to launch and how to group them. In this case, it launches a 2D grid of thread blocks.
45+
* All parameters (camera data, image dimensions, device pointers) are passed from the host to the kernel. Note that `double` precision values from the host are cast to `float`, as the kernel is optimized to use single-precision arithmetic, which is much faster on most consumer GPUs.
46+
47+
5. **Synchronization and Error Checking**:
48+
* `cudaGetLastError()`: Since kernel launches are asynchronous (the CPU code continues immediately without waiting for the GPU to finish), this function is called to check for any errors that might have occurred when launching the kernel.
49+
* `cudaDeviceSynchronize()`: This is a blocking call that pauses the CPU thread until all previously issued commands on the GPU have completed. This is essential to ensure the rendering is finished before we try to copy the results back.
50+
51+
6. **Data Transfer from Device to Host (`cudaMemcpy`)**:
52+
* `cudaMemcpy(image, d_image, ..., cudaMemcpyDeviceToHost)`: Copies the rendered pixel data from the GPU's memory (`d_image`) back to the host's main memory (`image`).
53+
* `cudaMemcpy(&host_ray_count, d_ray_count, ...)`: Copies the final ray count from the GPU back to a host variable.
54+
55+
7. **Cleanup (`cudaFree`)**:
56+
* `cudaFree(d_image)` and `cudaFree(d_ray_count)`: Releases the memory that was allocated on the GPU, preventing memory leaks in VRAM.
57+
58+
---
59+
60+
## Device Kernel: `renderPixelsKernel`
61+
62+
This function runs on the GPU. A separate instance of this kernel (a thread) is executed for each pixel in the output image.
63+
64+
```cpp
65+
__global__ void renderPixelsKernel(unsigned char* image, int width, int height, ..., unsigned long long* ray_count)
66+
```
67+
68+
### Key Steps and CUDA Specifics:
69+
70+
1. **`__global__` Specifier**: This keyword declares the function as a "kernel" that can be called from the host (CPU) and is executed on the device (GPU).
71+
72+
2. **Global Thread-to-Pixel Mapping**:
73+
* `int x = blockIdx.x * blockDim.x + threadIdx.x;`
74+
* `int y = blockIdx.y * blockDim.y + threadIdx.y;`
75+
* This is the standard CUDA pattern for computing a unique global ID for each thread. `blockIdx` gives the ID of the current block in the grid, `blockDim` gives the size of the block, and `threadIdx` gives the ID of the current thread within its block. This calculation maps each thread to a unique `(x, y)` pixel coordinate.
76+
77+
3. **Random State Initialization (`curand_init`)**:
78+
* Each thread must have its own independent random number generator state to avoid visual artifacts.
79+
* `curand_init(...)`: Initializes the cuRAND library's state for the current thread. The seed is made unique for each pixel by combining its coordinates and the system clock, ensuring that each pixel's anti-aliasing and material scattering calculations are statistically independent.
80+
81+
4. **Ray Tracing Loop**:
82+
* For each sample per pixel, the thread calculates a unique ray direction with a random offset for anti-aliasing.
83+
* It calls the `ray_color` device function, which recursively traces the ray through the scene.
84+
85+
5. **Atomic Operations (`atomicAdd`)**:
86+
* Inside `ray_color`, the global ray counter is incremented using `atomicAdd(ray_count, 1)`.
87+
* An atomic operation is crucial here because thousands of threads are trying to increment the same memory location (`d_ray_count`) simultaneously. `atomicAdd` ensures that these operations are serialized, preventing race conditions and guaranteeing a correct final count.
88+
89+
6. **Writing Output**:
90+
* After accumulating the color from all samples, the thread performs gamma correction and converts the final floating-point color value to an 8-bit RGB triplet.
91+
* It then writes these three bytes directly to the correct location in the global image buffer (`d_image`). Since each thread is responsible for a unique pixel, there are no write conflicts between threads at this stage.
Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
# Real-time CUDA Ray Tracer Display
2+
3+
This document explains how to use the real-time display feature of the CUDA ray tracer.
4+
5+
## Prerequisites
6+
7+
To enable real-time display, you need to install SDL2:
8+
9+
### Ubuntu/Debian
10+
```bash
11+
sudo apt-get update
12+
sudo apt-get install libsdl2-dev
13+
```
14+
15+
### macOS
16+
```bash
17+
brew install sdl2
18+
```
19+
20+
### Windows
21+
Download the development libraries from: https://www.libsdl.org/download-2.0.php
22+
23+
## How It Works
24+
25+
The real-time display feature renders the image in small tiles (64x64 pixels by default) and updates the display after each tile is completed. This allows you to see the rendering progress in real-time rather than waiting for the entire image to finish.
26+
27+
### Key Features:
28+
- **Progressive Rendering**: Watch the image build up tile by tile
29+
- **Interactive Window**: Close the window anytime with the X button
30+
- **Same Quality**: Uses the same CUDA rendering engine as the offline version
31+
- **Memory Efficient**: Only renders one tile at a time on the GPU
32+
33+
## Usage
34+
35+
1. **Build the project** (after installing SDL2):
36+
```bash
37+
mkdir build
38+
cd build
39+
cmake ..
40+
make
41+
```
42+
43+
2. **Run the program**:
44+
```bash
45+
./v0_single_threaded
46+
```
47+
48+
3. **Choose rendering mode**:
49+
- Option 1: CPU Parallel (original)
50+
- Option 2: CUDA GPU (original)
51+
- Option 3: **CUDA GPU with Real-time Display** (new)
52+
53+
4. **Watch the rendering**:
54+
- A window will open showing the progressive rendering
55+
- Each tile appears as it's completed
56+
- The window remains open after rendering is complete
57+
- Close the window to exit
58+
59+
## Technical Details
60+
61+
### Tile-Based Rendering
62+
- **Tile Size**: 64x64 pixels (configurable in `camera.h`)
63+
- **Rendering Order**: Left-to-right, top-to-bottom
64+
- **Memory Management**: Full image buffer maintained on both CPU and GPU
65+
- **Synchronization**: Each tile is synchronized before display update
66+
67+
### Performance Considerations
68+
- **GPU Memory**: Uses same amount as full rendering (not tile-optimized yet)
69+
- **Display Updates**: Small delay (10ms) between tiles for visibility
70+
- **Event Handling**: Window remains responsive during rendering
71+
72+
### Code Structure
73+
- `renderPixelsCUDART()`: Main real-time rendering method in `camera.h`
74+
- `renderPixelsCUDATile()`: Host function for tile rendering in `camera_cuda.cu`
75+
- `renderPixelsTileKernel()`: CUDA kernel for tile-based rendering
76+
77+
## Troubleshooting
78+
79+
### SDL2 Not Found
80+
If you see warnings about SDL2 not being found:
81+
1. Install SDL2 using the commands above
82+
2. Re-run `cmake ..` and `make`
83+
3. The real-time option will be automatically enabled
84+
85+
### Window Doesn't Appear
86+
- Check that your display is properly configured
87+
- Try running from a graphical terminal
88+
- On Linux, ensure X11 forwarding is working if using SSH
89+
90+
### Performance Issues
91+
- Reduce tile size in `camera.h` for more frequent updates
92+
- Increase samples per pixel for better quality
93+
- The real-time version uses the same CUDA optimization as the offline version
94+
95+
## Future Improvements
96+
97+
Potential enhancements for the real-time display:
98+
- Adaptive tile sizing based on GPU memory
99+
- Parallel tile rendering for faster updates
100+
- Preview mode with lower quality tiles
101+
- Progress bar and timing estimates
102+
- Save intermediate results
File renamed without changes.
File renamed without changes.
File renamed without changes.

src/v0_single_threaded/camera.h

Lines changed: 122 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,10 @@
77
#include <atomic>
88
#include <chrono>
99

10+
#ifdef SDL2_FOUND
11+
#include <SDL2/SDL.h>
12+
#endif
13+
1014
#pragma once
1115

1216
inline double degrees_to_radians(double degrees)
@@ -23,12 +27,12 @@ class Camera
2327
const int image_width = static_cast<int>(aspect_ratio * image_height);
2428

2529
double vfov = 35.0; // vertical field of view in degrees
26-
point3 lookfrom = point3(2, 2.5, 3); // Point camera is looking from
27-
point3 lookat = point3(-1, 0, -1); // Point camera is looking at
30+
point3 lookfrom = point3(-2, 2, 5); // Point camera is looking from
31+
point3 lookat = point3(-2, -0.5, -1); // Point camera is looking at
2832
vec3 vup = vec3(0, 1, 0); // Camera-relative "up" direction
2933

3034
int samples_per_pixel = 1;
31-
const int max_depth = 16; // Maximum ray bounce depth
35+
const int max_depth = 24; // Maximum ray bounce depth
3236

3337
std::atomic<long long> n_rays{0}; // Number of rays traced so far with this cam (thread-safe)
3438

@@ -105,6 +109,121 @@ class Camera
105109

106110
cout << "CUDA rendering completed in " << duration.count() << " milliseconds" << endl;
107111
}
112+
113+
#ifdef SDL2_FOUND
114+
void renderPixelsCUDART(vector<unsigned char> &image)
115+
{
116+
// Initialize SDL
117+
if (SDL_Init(SDL_INIT_VIDEO) < 0) {
118+
std::cerr << "SDL could not initialize! SDL_Error: " << SDL_GetError() << std::endl;
119+
return;
120+
}
121+
122+
// Create window
123+
SDL_Window* window = SDL_CreateWindow("CUDA Ray Tracer - Real-time",
124+
SDL_WINDOWPOS_UNDEFINED, SDL_WINDOWPOS_UNDEFINED,
125+
image_width, image_height, SDL_WINDOW_SHOWN);
126+
if (window == nullptr) {
127+
std::cerr << "Window could not be created! SDL_Error: " << SDL_GetError() << std::endl;
128+
SDL_Quit();
129+
return;
130+
}
131+
132+
// Create renderer
133+
SDL_Renderer* renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED);
134+
if (renderer == nullptr) {
135+
std::cerr << "Renderer could not be created! SDL_Error: " << SDL_GetError() << std::endl;
136+
SDL_DestroyWindow(window);
137+
SDL_Quit();
138+
return;
139+
}
140+
141+
// Create texture for the image
142+
SDL_Texture* texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_RGB24,
143+
SDL_TEXTUREACCESS_STREAMING,
144+
image_width, image_height);
145+
if (texture == nullptr) {
146+
std::cerr << "Texture could not be created! SDL_Error: " << SDL_GetError() << std::endl;
147+
SDL_DestroyRenderer(renderer);
148+
SDL_DestroyWindow(window);
149+
SDL_Quit();
150+
return;
151+
}
152+
153+
auto start_time = std::chrono::high_resolution_clock::now();
154+
155+
// Render in tiles for real-time display
156+
const int tile_size = 128; // Size of each tile
157+
const int tiles_x = (image_width + tile_size - 1) / tile_size;
158+
const int tiles_y = (image_height + tile_size - 1) / tile_size;
159+
160+
std::cout << "Rendering in " << tiles_x << "x" << tiles_y << " tiles..." << std::endl;
161+
162+
for (int tile_y = 0; tile_y < tiles_y; ++tile_y) {
163+
for (int tile_x = 0; tile_x < tiles_x; ++tile_x) {
164+
int start_x = tile_x * tile_size;
165+
int start_y = tile_y * tile_size;
166+
int end_x = std::min(start_x + tile_size, image_width);
167+
int end_y = std::min(start_y + tile_size, image_height);
168+
169+
// Render this tile
170+
unsigned long long cuda_ray_count = ::renderPixelsCUDATile(
171+
image.data(), image_width, image_height,
172+
camera_center.x(), camera_center.y(), camera_center.z(),
173+
pixel00_loc.x(), pixel00_loc.y(), pixel00_loc.z(),
174+
pixel_delta_u.x(), pixel_delta_u.y(), pixel_delta_u.z(),
175+
pixel_delta_v.x(), pixel_delta_v.y(), pixel_delta_v.z(),
176+
samples_per_pixel, max_depth, start_x, start_y, end_x, end_y);
177+
178+
n_rays.fetch_add(cuda_ray_count, std::memory_order_relaxed);
179+
180+
// Update texture with the new tile
181+
SDL_UpdateTexture(texture, nullptr, image.data(), image_width * 3);
182+
183+
// Clear and render
184+
SDL_RenderClear(renderer);
185+
SDL_RenderCopy(renderer, texture, nullptr, nullptr);
186+
SDL_RenderPresent(renderer);
187+
188+
// Handle events to keep window responsive
189+
SDL_Event event;
190+
while (SDL_PollEvent(&event)) {
191+
if (event.type == SDL_QUIT) {
192+
goto cleanup;
193+
}
194+
}
195+
196+
// Small delay to make the progressive rendering visible
197+
SDL_Delay(1);
198+
}
199+
}
200+
201+
cleanup:
202+
auto end_time = std::chrono::high_resolution_clock::now();
203+
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
204+
205+
std::cout << "Real-time CUDA rendering completed in " << duration.count() << " milliseconds" << std::endl;
206+
std::cout << "Press any key to close the window..." << std::endl;
207+
208+
// Wait for user to close window
209+
bool quit = false;
210+
while (!quit) {
211+
SDL_Event event;
212+
while (SDL_PollEvent(&event)) {
213+
if (event.type == SDL_QUIT) {
214+
quit = true;
215+
}
216+
}
217+
SDL_Delay(1);
218+
}
219+
220+
// Cleanup
221+
SDL_DestroyTexture(texture);
222+
SDL_DestroyRenderer(renderer);
223+
SDL_DestroyWindow(window);
224+
SDL_Quit();
225+
}
226+
#endif
108227

109228
void renderPixelsParallelWithTiming(const hittable &scene, vector<unsigned char> &image)
110229
{

0 commit comments

Comments
 (0)