m-schuetz/CuRast

CuRast: Cuda-Based Software Rasterization for Billions of Triangles

[Paper]

About: Nanite has demonstrated that small triangles can be rasterized more efficiently with custom compute shaders than with the fixed-function hardware pipeline. Building on this insight, we explore how far this advantage can be pushed for real-time rendering of massive triangle datasets without relying on precomputed LODs or acceleration structures.

Method: A three-stage rasterization pipeline handles small triangles efficiently in stage 1 and falls back to later stages for increasingly larger triangles. Stage 1 assumes triangles are small and renders each one directly with a single thread. Triangles that are too large for stage 1 are instead queued for stage 2, which uses one warp per triangle to bring more compute power to bear. Triangles that are still too large are split up and queued for stage 3.
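To make the routing between stages concrete, here is a minimal CPU-side sketch that classifies a screen-space triangle by the size of its pixel bounding box. The thresholds, the names (`Triangle2D`, `Stage`, `classify`) and the choice of bounding-box extent as the size measure are all assumptions made for illustration; they are not taken from the CuRast kernels.

```cpp
#include <algorithm>
#include <cmath>

// Screen-space triangle: three 2D vertex positions in pixels.
struct Triangle2D { float x[3]; float y[3]; };

enum class Stage { PerThread = 1, PerWarp = 2, Split = 3 };

// Hypothetical thresholds (NOT taken from CuRast): the largest bounding-box
// extent, in pixels, that stage 1 and stage 2 are willing to handle.
constexpr int kMaxExtentStage1 = 4;
constexpr int kMaxExtentStage2 = 128;

// Pick a stage from the longer side of the pixel-aligned bounding box.
Stage classify(const Triangle2D& t) {
    float minX = std::min({t.x[0], t.x[1], t.x[2]});
    float maxX = std::max({t.x[0], t.x[1], t.x[2]});
    float minY = std::min({t.y[0], t.y[1], t.y[2]});
    float maxY = std::max({t.y[0], t.y[1], t.y[2]});

    // Number of pixel columns and rows the bounding box touches.
    int w = int(std::floor(maxX)) - int(std::floor(minX)) + 1;
    int h = int(std::floor(maxY)) - int(std::floor(minY)) + 1;
    int extent = std::max(w, h);

    if (extent <= kMaxExtentStage1) return Stage::PerThread; // small: one thread
    if (extent <= kMaxExtentStage2) return Stage::PerWarp;   // larger: one warp
    return Stage::Split;                                     // too large: split
}
```

In a GPU implementation the stage 2 and stage 3 results would be appended to queues rather than returned, but the decision itself can stay this simple.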

Results: With CUDA, we can render large models with hundreds of millions of unique triangles 2-5x faster than Vulkan, or up to 12x faster when it comes to instanced triangles. For smaller models producing large triangles, or models with numerous meshes with few triangles, Vulkan remains 10x faster.

Limitations: We currently focus on dense, opaque meshes like those you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently.

Future Work: To make it suitable for games, we intend to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast.

  • Zorah rendered in 67.3ms into a 3840x2160 framebuffer (RTX 5090). 13.5 billion triangles in view frustum.
  • Venice (400M triangles) rendered in 7.98ms (1920x1080, RTX 5090).
  • 3000 instances with 1M triangles each, rendered in 9.8ms (1920x1080, RTX 5090).
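For a rough sense of scale, the frame times above can be converted into triangles per second. The sketch below only divides the stated triangle counts by the stated frame times; it says nothing about how many of those triangles were actually rasterized (for Zorah the count refers to triangles in the view frustum, for Venice to the whole model). The function and constant names are made up for this example.

```cpp
// Triangles per second for a given triangle count and frame time in ms.
constexpr double trianglesPerSecond(double triangles, double milliseconds) {
    return triangles / (milliseconds * 1e-3);
}

// Figures taken from the captions above.
constexpr double kZorah     = trianglesPerSecond(13.5e9, 67.3);        // in frustum
constexpr double kVenice    = trianglesPerSecond(400e6, 7.98);         // whole model
constexpr double kInstances = trianglesPerSecond(3000.0 * 1e6, 9.8);   // instanced
```

This works out to roughly 200 billion triangles per second for Zorah, about 50 billion for Venice, and about 306 billion for the instanced scene.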

Installing

Windows

Dependencies:

  • CUDA 13.1
  • Visual Studio 2026
  • An RTX 4090

Create Visual Studio solution files in a build folder via cmake:

mkdir build
cd build
cmake ../

Compile and run with Visual Studio 2026. Drag and drop glb or gltf files to load them.

Linux

TODO.

Main challenge: We're using the Windows API for memory mapping (easily read from files) and unbuffered IO (efficiently read from files). mmap on Linux should be straightforward, but what about fast sequential SSD reads without buffering overhead? io_uring?

Getting Started

You can either drag&drop glb or gltf files into the application, or modify initScene() in main.cpp to load a scene at startup and get some control over the settings. Note that glb support is limited; some or many glb files may not work. For data sets like Zorah, drag&drop won't work, as Zorah is too large to fit in VRAM and requires loading with .compress = true. For Venice, we also have .useJpegTextures enabled, which keeps textures jpeg-compressed on the GPU to save some VRAM.

Data Sets

Some test data sets we've been using, with download link if available.

Data Set | Triangles | Description
Komainu Kobe | 60M | Original images courtesy of Gildas Sidobre, NRHK, distributed by Open Heritage 3D.
Hakone Lantern | 1M | Created with Reality Scan, simplified with Meshoptimizer.
Sponza | 262k | We use the sponza-png.glb modified by Ludicon. Original authors and modifications over the years by Marko Dabrovic, Frank Meinl, Crytek, Hans-Kristian Arntzen, Morgan McGuire.
Zorah | 18.9B | We use the original zorah_main_public.gltf data set, which has since been replaced by v2. The newer version is compressed; perhaps Meshoptimizer can decompress it?
Venice | 400M | Courtesy of Iconem and the Fondazione Musei Civici di Venezia.

Program

File | Role
src/main.cpp | Entry point and the place to define hardcoded startup scenes.
src/CuRast.h |
src/CuRastSettings.h | Some runtime settings, but also the place where we put the USE_VULKAN_SHARED_MEMORY macro if we want to enable Vulkan.
src/kernels/triangles_visbuffer.cu | CUDA kernels for triangle rasterization
src/kernels/resolve.cu | Transforms visibility buffer to color texture for display
src/CuRast_render.h | Host-side draw code that launches the kernels.

Known Issues

  • Our glb loader is targeted towards loading Zorah fast and compressing it on the fly. This led to design decisions like having 16 threads, each of which allocates as much host memory as the size of the largest index buffer. This can cause issues on systems without enough RAM, or with data sets that have enormous index buffers.
  • If compiled with Vulkan support (see CuRastSettings.h), you can only switch the rasterizer from CUDA to Vulkan, but not back. That is because we implemented converting from CUDA textures to Vulkan, but not the other way around.
  • Can only drag&drop one glb per session. Needs restart to load a new glb.
  • We don't handle "frames in flight" yet. While draw data is assembled on the CPU, the GPU may be idle and wait. In the future, while the GPU finishes drawing the current frame, the CPU should already be preparing the next frame.
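The "frames in flight" idea from the last item can be sketched with two counters and a small ring of per-frame data slots: the CPU may start filling a slot as long as fewer than N frames are still pending on the GPU. This is a bookkeeping sketch only, with a hypothetical `FrameRing` type; it performs no GPU work, and a real implementation would advance `completed` from a GPU completion signal (for example CUDA events) rather than from an explicit call.

```cpp
#include <cstdint>

// Tracks which of N per-frame data slots the CPU may fill while the GPU
// is still consuming earlier ones. Counters only; no real GPU work here.
template <int N>
struct FrameRing {
    uint64_t submitted = 0;  // frames the CPU has handed to the GPU
    uint64_t completed = 0;  // frames the GPU has finished drawing

    // The CPU may prepare another frame while fewer than N are in flight.
    bool canBegin() const { return submitted - completed < uint64_t(N); }

    // Returns the slot to fill for the next frame, or -1 if the CPU must wait.
    int begin() {
        if (!canBegin()) return -1;
        return int(submitted++ % N);
    }

    // Called when the GPU reports that the oldest in-flight frame is done.
    void complete() {
        if (completed < submitted) ++completed;
    }
};
```

With `FrameRing<2>`, the CPU can assemble draw data for the next frame while the GPU is still drawing the current one, and only stalls once both slots are occupied.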

References and Further Reading

  • Nanite: Clustered LODs and software rasterization.
  • FreePipe: The first to propose using atomicMin for direct rasterization without the need to sort.
  • CUDARaster: An efficient, hierarchical software rasterization pipeline for CUDA.
  • cuRE: A CUDA rendering engine (cuRE) based on a streaming pipeline that processes multiple rasterization stages simultaneously, rather than one after the other.
  • Meshoptimizer: Optimizes the arrangement of vertices and triangles to improve locality and/or vertex reuse, and also features hierarchical clustered LOD construction.
  • "Billions of triangles in minutes": A blog post describing the clustered LOD construction algorithm in meshoptimizer, and the road to reducing the preprocessing time for the entire Zorah data set down to just about two and a half minutes.
  • "Learning from failure": A talk about the architecture and software rasterization process of the PS4 game Dreams. [video]
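The atomicMin technique credited to FreePipe above can be illustrated on the CPU: depth and triangle id are packed into one 64-bit value so that a single atomic minimum per pixel keeps the nearest fragment, with no sorting. The sketch below is an illustration of that general idea, not CuRast's kernel code. It assumes non-negative depth values and uses a compare-exchange loop as a stand-in for CUDA's 64-bit `atomicMin`; all type and function names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack depth into the high 32 bits and the triangle id into the low 32 bits.
// For non-negative floats the IEEE-754 bit pattern orders like the value,
// so the smallest packed integer belongs to the nearest fragment.
uint64_t packFragment(float depth, uint32_t triangleId) {
    uint32_t depthBits;
    std::memcpy(&depthBits, &depth, sizeof(depthBits));
    return (uint64_t(depthBits) << 32) | triangleId;
}

// CPU stand-in for a 64-bit atomic minimum, built from compare-exchange.
void atomicMin64(std::atomic<uint64_t>& target, uint64_t value) {
    uint64_t current = target.load(std::memory_order_relaxed);
    while (value < current &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak refreshed `current`; loop re-checks it.
    }
}

// One packed 64-bit entry per pixel, cleared to "infinitely far".
struct VisibilityBuffer {
    int width, height;
    std::vector<std::atomic<uint64_t>> pixels;

    VisibilityBuffer(int w, int h)
        : width(w), height(h), pixels(size_t(w) * size_t(h)) {
        for (auto& p : pixels) p.store(UINT64_MAX, std::memory_order_relaxed);
    }

    // Keep the fragment only if it is closer than what is already stored.
    void write(int x, int y, float depth, uint32_t triangleId) {
        atomicMin64(pixels[size_t(y) * width + x], packFragment(depth, triangleId));
    }

    // Low 32 bits hold the id of the nearest triangle (0xFFFFFFFF if empty).
    uint32_t triangleAt(int x, int y) const {
        return uint32_t(pixels[size_t(y) * width + x].load(std::memory_order_relaxed));
    }
};
```

A later resolve pass can then read the triangle id per pixel and shade it, which is the role the file table above assigns to resolve.cu.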
