GPU-Accelerated Dynamic Fracture in the Browser - CIS 565 Final Project
Jiatong He, Kai Ninomiya
Our goal was to create a GPU-accelerated, interactive, real-time mesh fracture application that runs in the browser. We used WebCL to parallelize the fracture algorithm and CubicVR to render and simulate the rigid bodies.
Controls:
- click + drag: Rotate camera view
- alt + click + drag: Pan camera view
- mouse scroll: Zoom camera
- click on object + drag: Move selected object around
- F + click on object: Fracture object
- W + click: Toggle wireframe
- D + click: Toggle fracture pattern
Based on Real Time Dynamic Fracture with Volumetric Approximate Convex Decompositions by Müller, Chentanez, and Kim.
Table of Contents
- Algorithm Overview
- Implementation Details
- Performance Analysis
At a high level, fracturing is implemented by performing a boolean intersection between each cell of a fracture pattern and the object to be fractured. The fracture pattern can be pre-generated, as is the case in our implementation.
A pre-generated test fracture pattern (series of solid meshes):
The first step is to align the fracture pattern with the point of impact.
The wireframe is the impacted mesh, the solid is the fracture pattern
We use the point where the user clicks on the object as the point of impact, and translate the fracture mesh accordingly (all meshes are centered at the origin).
The mesh must then be intersected with the fracture mesh, resulting in one shard per cell of the fracture pattern. A simple way to do this is to clip the mesh against each face of the cell, for each cell in the fracture pattern.
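Clipping a mesh against one cell face comes down to clipping each triangle against a plane. A minimal sketch of that per-triangle step (Sutherland-Hodgman applied to a single triangle; the function names here are illustrative, not from our actual kernels):

```javascript
function dot(a, b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

function lerp(a, b, t) {
  return [a[0] + t*(b[0]-a[0]), a[1] + t*(b[1]-a[1]), a[2] + t*(b[2]-a[2])];
}

// Keep the part of `tri` on the side of the plane where dot(n, p) <= d.
// Returns 0, 3, or 4 vertices (culled, triangle, or a quad that must
// be re-triangulated into two triangles).
function clipTriangleByPlane(tri, n, d) {
  const out = [];
  for (let i = 0; i < 3; i++) {
    const a = tri[i], b = tri[(i + 1) % 3];
    const da = dot(n, a) - d, db = dot(n, b) - d;
    if (da <= 0) out.push(a);               // a is inside: keep it
    if (da * db < 0) {                      // edge crosses the plane
      out.push(lerp(a, b, da / (da - db))); // add the intersection point
    }
  }
  return out;
}
```

Repeating this for every face of a cell carves out that cell's shard.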
A solid render of the 3D fracture pattern we are using
If a shard completely fills a cell, then it can be replaced with the cell's geometry. This reduces the number of triangles produced by the intersection.
If clipping results in disconnected pieces within a cell, island detection should be used to split those pieces into multiple shards instead of one; otherwise, disconnected pieces would move together as though they were a single mesh.
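One common way to detect islands is a union-find pass over the shard's triangles, grouping triangles that share vertices. A sketch under that assumption (`findIslands` and its input shape are hypothetical, not our implementation):

```javascript
// Group triangles that share vertex indices using union-find, so
// disconnected pieces inside one cell become separate shards.
// `tris` is an array of [i, j, k] vertex-index triples.
function findIslands(tris) {
  const parent = new Map();
  const find = (x) => {
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x))); // path halving
      x = parent.get(x);
    }
    return x;
  };
  const union = (a, b) => { parent.set(find(a), find(b)); };

  for (const t of tris) for (const v of t) {
    if (!parent.has(v)) parent.set(v, v);
  }
  for (const [i, j, k] of tris) { union(i, j); union(j, k); }

  // Bucket triangles by the root of their first vertex.
  const islands = new Map();
  for (const t of tris) {
    const root = find(t[0]);
    if (!islands.has(root)) islands.set(root, []);
    islands.get(root).push(t);
  }
  return [...islands.values()];
}
```

Note this treats vertices as connected only if they share an index, so it assumes welded (deduplicated) vertices.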
Partial fracture occurs if we limit the area of effect of the fracture to some distance around the point of impact.
Notice how the bottom and right sides of the wall stay intact while the upper-left is fractured
Rather than allowing the entire mesh to be fractured, we only fully shard the cells within the area of effect. Shards in cells outside of the area of effect can be merged together back into a single mesh.
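The area-of-effect test itself is just a distance check per cell. A minimal sketch (names hypothetical; we assume each cell carries a precomputed center point):

```javascript
// Only cells whose centers fall within `radius` of the impact point
// get their own shard; everything else is lumped into one merged mesh.
function partitionCells(cells, impact, radius) {
  const dist2 = (a, b) =>
    (a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2;
  const fractured = [], merged = [];
  for (const cell of cells) {
    (dist2(cell.center, impact) <= radius * radius ? fractured : merged)
      .push(cell);
  }
  return { fractured, merged };
}
```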
* : not implemented.
A fractured torus
The fracturing algorithm was our greatest challenge. It is naturally sequential: clipping polygons usually requires some knowledge of neighboring geometry and other connectivity information. However, we devised a method that successfully targets independent pieces of the algorithm at the cost of some accuracy.
Our intersection algorithm is simple plane clipping: for each cell, the mesh is clipped by each of the cell's faces to produce the shard. The interesting part is how we parallelized it.
Our strategy for parallelizing the intersection was to treat the mesh as a set of disconnected triangles. By doing so, we could parallelize over every (cell, triangle) pair.
Diagram of the parallel algorithm we used to clip meshes
For each face of the cell, we clip each triangle in the mesh by that face independently, then create the new faces. We can process all cells at once, iterating a total number of times equal to the maximum number of faces in any single cell.
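The iteration structure can be sketched on the CPU as follows. This is a toy emulation of the kernel launches, not our WebCL code: `fractureParallelSketch` and the injected `clipTriangleByCellFace` are hypothetical stand-ins, and the inner `flatMap` is the embarrassingly parallel part that maps onto WebCL work-items.

```javascript
// One "kernel launch" per face index f; within a launch, every
// (cell, triangle) pair is clipped independently.
// `clipTriangleByCellFace(tri, cell, f)` is assumed to return an
// array of zero or more surviving triangles.
function fractureParallelSketch(cells, mesh, clipTriangleByCellFace) {
  // Start with one full copy of the mesh (as a triangle soup) per cell.
  let work = cells.map((cell) => ({ cell, tris: mesh.slice() }));
  const maxFaces = Math.max(...cells.map((c) => c.faces.length));
  for (let f = 0; f < maxFaces; f++) {
    work = work.map(({ cell, tris }) => ({
      cell,
      tris: tris.flatMap((tri) =>
        f < cell.faces.length ? clipTriangleByCellFace(tri, cell, f) : [tri]
      ),
    }));
  }
  return work; // one shard (triangle soup) per cell
}
```

Because every triangle is processed without knowledge of its neighbors, the resulting shards are open triangle soups rather than closed meshes, which is the accuracy trade-off mentioned above.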
Our implementation uses stream compaction to remove culled triangles after each iteration in order to keep the number of triangles under control (otherwise it could grow at a rate of 2^n). We tried both a sequential and a parallel version of this step to see which performed better. The sequential implementation simply iterates through the list and pushes non-culled triangles into a new array.
The reason we wanted to do stream compaction on the GPU was to reduce the amount of memory transfer between the CPU and GPU. Each time our plane-clipping kernel returned, we would need to copy the entire output back to the CPU, remove bad values, add new faces, and upload everything back to the GPU. If stream compaction were parallelized, we would not have that problem.
We implemented stream compaction in WebCL, but ran into performance issues that made it much slower than the copy-and-process-on-CPU method. As a result, we abandoned the stream compaction (the code is still in the stream-compaction branch) and are now removing bad values sequentially. The performance analysis section further below contains more details about this issue.
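For reference, the core scatter logic of scan-based stream compaction looks like the following CPU emulation (the real kernels live in the stream-compaction branch; `compact` here is just an illustrative sketch, with the scan done serially even though on the GPU it is itself a parallel kernel):

```javascript
// `keep[i]` is 1 if triangle i survived the clip, 0 if it was culled.
function compact(items, keep) {
  // Exclusive prefix sum of the keep flags gives each survivor its
  // output index.
  const scan = new Array(keep.length);
  let sum = 0;
  for (let i = 0; i < keep.length; i++) { scan[i] = sum; sum += keep[i]; }
  // Scatter: every work-item writes its own slot, with no contention.
  const out = new Array(sum);
  for (let i = 0; i < items.length; i++) {
    if (keep[i]) out[scan[i]] = items[i];
  }
  return out;
}
```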
This feature is noteworthy because we technically cheated: instead of properly combining faces or doing any real processing, we just group all the fragments outside the area of effect into a single mesh. This means that said mesh has (1) several times more geometry than the other fragments, (2) faces inside of the mesh, and (3) slightly overlapping/disconnected edges on the surface.
The body on the upper-left is the merged mesh. See how its individual components are clearly visible in the wireframe?
Working with WebCL
Because our target was an in-browser experience, we were limited to two choices for GPU acceleration: WebGL and WebCL. While WebGL runs natively in most browsers, it does not yet support compute shaders, so we would have had to hack a solution together using textures and feedback. WebCL, on the other hand, is not natively supported by any browser, but Nokia has a plugin that can run it. We chose WebCL for its flexibility compared to WebGL.
WebCL Performance Issues
Integration into an Existing Renderer/Rigid Body Simulator (CubicVR)
Because our main focus was creating the fractured geometry, we looked for an existing renderer and rigid body simulator. CubicVR (with physics built on ammo.js, which is compiled from Bullet) provides a very simple-to-use library for both, though we ran into some issues here as well. The performance issues we had with CubicVR are detailed in the Performance Analysis section of this readme.
Performance measurements were taken on one of the following setups. (Performance comparisons are on one machine.)
- Arch Linux, Intel i5-4670 + NVIDIA GTX 750 (CPU/GPU comparisons)
- Windows 8.1, Intel i7-4700HQ (CPU-only measurements)
We began by implementing fracture with a naive sequential algorithm as a proof of concept. It runs somewhat differently from the parallel algorithm (it applies the clipping planes sequentially to the entire mesh, and keeps the mesh closed), but it serves as a useful basis of comparison.
Intersection: GPU vs. CPU, Parallel vs. Sequential
We compared the runtime of our code on the GPU vs. the CPU, as well as the parallel and sequential implementations of the intersection.
Runtime of the fracture algorithm on the CPU compared to the GPU
The main thing to notice here is that the CPU and GPU times are nearly identical. We have no firm explanation for why reads/writes, as well as kernel executions, take the same time on both; it may simply be that our data set is too small.
Runtime of the parallel algorithm compared to a naive sequential version
Ignoring the part of the parallel graph where it dips down, it is clear that the parallel algorithm's runtime grows slowly with increasing triangle count, while the sequential algorithm takes far more time with each step. While this graph does not show it, the sequential implementation's runtime quickly surpasses the parallel implementation's at higher triangle counts, as expected.
Stream Compaction: Parallel vs. Sequential
Runtime of the stream compaction algorithm vs. a sequential remove
Strangely enough, the stream compaction (though a naive implementation) performs far worse than the sequential remove, which requires a lot of back-and-forth between the CPU and GPU. We weren't sure what was causing this, so we investigated the performance of WebCL itself to see if that was the issue (stream compaction makes a LOT of WebCL calls).
WebCL runtimes on the GPU and CPU. Note how they are nearly identical
As expected from the intersection performance analysis, the GPU and CPU calls take nearly identical time. The only possible error is in the "Run Kernel" values, since they may be affected by our work-size setup. The runtimes of the other calls average around 0.8 ms each. This quickly adds up when you need to make some 50 of these calls per face-per-cell, along with running the actual kernel. We suspect that the poor performance of the parallel algorithms is due to this per-call overhead.
CubicVR's Limits in Real-Time Simulation
CubicVR posed some challenges to our simulation, specifically in how our fractured shards were added to the scene. The main cost is computing the collision surface for each shard. Using convex hull collision, the processing time from CubicVR (~80% of the total time) was far higher than the time taken by our fracture algorithm, even though the convex hulls themselves ran quickly in the simulator once built. In order to prevent long freezes, we switched over to mesh collision and got the following results:
CubicVR's contribution to the total time of our algorithm
CubicVR is no longer the bottleneck of our algorithm, but mesh collision runs much more slowly than convex hull collision in the simulator.
The issue here is the large number of triangles we generate. Convex hulls take a long time to compute for meshes with large triangle counts. We could reduce the computation time by implementing optimizations that reduce the geometry of each shard (shards end up with far more faces than necessary), such as the "welding" step.
- Matthias Müller, Nuttapong Chentanez, and Tae-Yong Kim. 2013. Real time dynamic fracture with volumetric approximate convex decompositions. ACM Trans. Graph. 32, 4, Article 115 (July 2013), 10 pages. DOI=10.1145/2461912.2461934 http://doi.acm.org/10.1145/2461912.2461934
- stats.js. Copyright 2009-2012 Mr.doob. Used under MIT License. https://github.com/mrdoob/stats.js