CuRast: Cuda-Based Software Rasterization for Billions of Triangles

About: Nanite has demonstrated that small triangles can be rasterized more efficiently with custom compute shaders than with the fixed-function hardware pipeline. Building on this insight, we explore how far this advantage can be pushed for real-time rendering of massive triangle datasets without relying on precomputed LODs or acceleration structures.

Method: A three-stage rasterization pipeline handles small triangles efficiently in stage 1 and falls back to later stages for progressively larger triangles. Stage 1 assumes triangles are small and renders each one directly with a single thread. Triangles that turn out to be too large for that are queued for stage 2, which renders each one with an entire warp for more compute power. Triangles that are still too large are split up and queued for stage 3.
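The routing between stages can be pictured with a small host-side C++ sketch. It classifies a screen-space triangle by the longer side of its bounding box. The Tri2D type, the 8 px and 128 px thresholds, and the tile-based splitting rule are all assumptions made for illustration. They are not CuRast's actual criteria or code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical screen-space triangle: vertex positions in pixels after projection.
struct Tri2D { float x[3]; float y[3]; };

// Stage 1: one thread per triangle; stage 2: one warp per triangle;
// stage 3: the triangle is split into smaller pieces first.
enum class Stage { Thread = 1, Warp = 2, Split = 3 };

// Assumed thresholds on the longer bounding-box side, for illustration only.
constexpr float kThreadMaxExtent = 8.0f;
constexpr float kWarpMaxExtent = 128.0f;

struct Extent { float w, h; };

// Width and height of the triangle's screen-space bounding box, in pixels.
Extent bboxExtent(const Tri2D& t) {
    float w = std::max({t.x[0], t.x[1], t.x[2]}) - std::min({t.x[0], t.x[1], t.x[2]});
    float h = std::max({t.y[0], t.y[1], t.y[2]}) - std::min({t.y[0], t.y[1], t.y[2]});
    return {w, h};
}

// Route a triangle to the cheapest stage that can handle its size.
Stage classify(const Tri2D& t) {
    Extent e = bboxExtent(t);
    float longest = std::max(e.w, e.h);
    if (longest <= kThreadMaxExtent) return Stage::Thread;
    if (longest <= kWarpMaxExtent) return Stage::Warp;
    return Stage::Split;
}

// Assumed splitting rule: cover the bounding box with square tiles of
// kWarpMaxExtent pixels and return how many pieces that produces.
int splitPieceCount(const Tri2D& t) {
    assert(classify(t) == Stage::Split);
    Extent e = bboxExtent(t);
    int tilesX = std::max(1, static_cast<int>(std::ceil(e.w / kWarpMaxExtent)));
    int tilesY = std::max(1, static_cast<int>(std::ceil(e.h / kWarpMaxExtent)));
    return tilesX * tilesY;
}
```

In a GPU implementation, a test like classify() would presumably run per triangle inside the stage-1 kernel, with warp-sized and oversized triangles appended to queues that the later stages consume.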

Results: With CUDA, we can render large models with hundreds of millions of unique triangles 2-5x faster than Vulkan, and up to 12x faster for instanced triangles. For smaller models that produce large triangles, or for scenes made of numerous meshes with few triangles each, Vulkan remains 10x faster.

Limitations: We currently focus on dense, opaque meshes like those typically obtained from photogrammetry/3D reconstruction. Blending and transparency are not yet supported, and scenes with thousands of low-poly meshes are not yet handled efficiently.

Future Work: To make CuRast suitable for games, we intend to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, and (3) add support for transparency, likely in its own stage so that opaque rasterization stays untouched and fast.


Zorah rendered in 67.3 ms into a 3840x2160 framebuffer (RTX 5090); 13.5 billion triangles in the view frustum.
Venice (400M triangles) rendered in 7.98 ms (1920x1080, RTX 5090).
3000 instances with 1M triangles each, rendered in 9.8 ms (1920x1080, RTX 5090).
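As a rough, back-of-the-envelope reading of these captions (not a separate measurement), dividing each triangle count by its frame time gives the following effective rates. For Zorah, the count covers all triangles in the view frustum, not only those that end up covering pixels.

```latex
\begin{aligned}
\text{Zorah:}     &\quad \frac{13.5\times 10^{9}}{67.3\,\text{ms}} \approx 2.0\times 10^{11}\ \text{triangles/s}\\
\text{Venice:}    &\quad \frac{400\times 10^{6}}{7.98\,\text{ms}} \approx 5.0\times 10^{10}\ \text{triangles/s}\\
\text{Instances:} &\quad \frac{3000\times 10^{6}}{9.8\,\text{ms}} \approx 3.1\times 10^{11}\ \text{triangles/s}
\end{aligned}
```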

Installing

Windows

Dependencies:

CUDA 13.1
Visual Studio 2026
An RTX 4090

Create Visual Studio solution files in a build folder via CMake:

mkdir build
cd build
cmake ../

Compile and run with Visual Studio 2026. Drag and drop glb or gltf files into the application to load them.

Linux

TODO.

Main challenge: We use the Windows API for memory mapping (convenient file access) and unbuffered I/O (fast file reads). mmap should make the former straightforward on Linux, but it is not yet clear how best to get fast sequential SSD reads without buffering overhead. Perhaps io_uring?
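On Linux, the memory-mapping half could look roughly like the POSIX sketch below. It uses only standard calls (open, fstat, mmap, madvise, munmap); the MappedFile type and the mapFile/unmapFile helpers are hypothetical names, not part of CuRast. It does not settle the unbuffered-I/O question: reads that bypass the page cache would need something like O_DIRECT with aligned buffers, or io_uring, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical read-only view of a whole file mapped into memory.
struct MappedFile {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

// Map `path` read-only; returns an empty MappedFile on failure or for empty files.
MappedFile mapFile(const char* path) {
    MappedFile mf;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return mf;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t size = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            // Hint that access will be mostly sequential so the kernel reads ahead.
            madvise(p, size, MADV_SEQUENTIAL);
            mf.data = static_cast<const unsigned char*>(p);
            mf.size = size;
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return mf;
}

// Release the mapping and reset the view.
void unmapFile(MappedFile& mf) {
    if (mf.data) munmap(const_cast<unsigned char*>(mf.data), mf.size);
    mf = {};
}
```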

Getting Started

You can either drag and drop glb or gltf files into the application, or modify initScene() in main.cpp to load files at startup with more control over the settings. Note that glb support is limited; some or many glb files may not work. Drag and drop does not work for data sets like Zorah, which is too large to fit in VRAM and must be loaded with .compress = true. For Venice, we also enable .useJpegTextures, which keeps textures JPEG-compressed on the GPU to save VRAM.

Data Sets

Some test data sets we've been using, with download links where available.

Data Set	Triangles	Description
Komainu Kobe	60M	Original images courtesy of Gildas Sidobre, NRHK, distributed by Open Heritage 3D.
Hakone Lantern	1M	Created with Reality Scan, simplified with Meshoptimizer.
Sponza	262k	We use the sponza-png.glb modified by Ludicon. Original authors and modifications over the years by Marko Dabrovic, Frank Meinl, Crytek, Hans-Kristian Arntzen, Morgan McGuire.
Zorah	18.9B	We use the original zorah_main_public.gltf data set, which has since been replaced by v2. The newer version is compressed; perhaps Meshoptimizer can decompress it?
Venice	400M	Courtesy of Iconem and the Fondazione Musei Civici di Venezia.

Program

File	Role
src/main.cpp	Entry point and the place to define hardcoded startup scenes.
src/CuRast.h
src/CuRastSettings.h	Runtime settings; also where to define the USE_VULKAN_SHARED_MEMORY macro to enable Vulkan.
src/kernels/triangles_visbuffer.cu	CUDA kernels for triangle rasterization
src/kernels/resolve.cu	Transforms visibility buffer to color texture for display
src/CuRast_render.h	Host-side draw code that launches the kernels.

Known Issues

Our glb loader is targeted towards loading Zorah fast and compressing it on the fly. This led to design decisions like using 16 threads, each of which allocates as much host memory as the largest index buffer. This can cause issues on systems with insufficient RAM, or with data sets that have enormous index buffers.
If compiled with Vulkan support (see CuRastSettings.h), you can switch the rasterizer from CUDA to Vulkan, but not back. That is because we implemented conversion from CUDA textures to Vulkan, but not the other way around.
Only one glb can be loaded via drag and drop per session; loading another glb requires a restart.
We don't handle "frames in flight" yet: while draw data is assembled on the CPU, the GPU may sit idle waiting for it. In the future, while the GPU finishes drawing the current frame, the CPU should already be preparing the next one.
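The frames-in-flight idea from the last item can be sketched on the CPU with std::async: while frame i is being drawn, the draw data for frame i+1 is already prepared on another thread. DrawData, renderFrames, and the prepare/draw callbacks are hypothetical placeholders. In a real renderer, draw would enqueue GPU work asynchronously, and per-frame buffers would need to be duplicated so the CPU never overwrites data the GPU is still reading.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical per-frame draw data assembled on the CPU.
struct DrawData {
    int frame;
    std::vector<unsigned> drawCommands;
};

// Overlap CPU preparation of frame i+1 with drawing frame i.
// `prepare(int) -> DrawData` runs on a worker thread; `draw(const DrawData&)`
// runs on the calling thread, in frame order.
template <typename Prepare, typename Draw>
void renderFrames(int frameCount, Prepare prepare, Draw draw) {
    if (frameCount <= 0) return;
    std::future<DrawData> next = std::async(std::launch::async, prepare, 0);
    for (int i = 0; i < frameCount; ++i) {
        DrawData current = next.get();  // wait until frame i is prepared
        if (i + 1 < frameCount) {
            // Start preparing the next frame before drawing this one.
            next = std::async(std::launch::async, prepare, i + 1);
        }
        draw(current);  // stand-in for submitting frame i to the GPU
    }
}
```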

References and Further Reads

Nanite: Clustered LODs and software rasterization.
FreePipe: The first to propose using atomicMin for direct rasterization without the need to sort.
CUDARaster: An efficient, hierarchical software rasterization pipeline for CUDA.
cuRE: A CUDA rendering engine (cuRE) based on a streaming pipeline that processes multiple rasterization stages simultaneously, rather than one after the other.
Meshoptimizer: Optimizes the arrangement of vertices and triangles to improve locality and/or vertex reuse, and also features hierarchical clustered LOD construction.
"Billions of triangles in minutes": A blog post describing the clustered LOD construction algorithm in meshoptimizer, and the road to reducing the preprocessing time for the entire Zorah data set down to just about two and a half minutes.
"Learning from failure": A talk about the architecture and software rasterization process of the PS4 game Dreams. [video]
