
Add robustness to GPU shaders #537

Merged: 2 commits into main from robustness, Apr 1, 2024

Conversation

raphlinus (Contributor):

Make each stage quit early if a previous stage has failed.

The CPU shaders are minimally changed to be layout compatible. For the most part, they'll panic on a bounds check if sizes are exceeded. That's arguably useful for debugging, but a case can be made they should have the same behavior as the GPU shaders.

Work towards #366
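
The general pattern, sketched in WGSL (this mirrors the guard quoted from coarse later in the thread; treat it as illustrative rather than the exact diff):

```wgsl
// Early-out at the top of a stage: if any prior stage has recorded a
// failure, mark this stage as skipped and do no work.
let failed = atomicLoad(&bump.failed) &
    (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_PATH_COARSE);
if failed != 0u {
    if wg_id.x == 0u && local_id.x == 0u {
        atomicOr(&bump.failed, STAGE_COARSE);
    }
    return;
}
```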

raphlinus (Contributor, Author):

Possibly this can be merged as-is, but perhaps more work should be done so that clients can handle the failures reasonably. For now, rendering appears to freeze (zoom into longpathdash, then zoom back out; rendering will resume).

DJMcNab (Collaborator) left a comment:

A few nits, but generally this looks like a decent improvement

I think the CPU shaders probably do need some rethinking, but I'm not sure the best way to do so

shader/binning.wgsl (outdated; resolved)
@@ -63,7 +64,18 @@ fn main(
for (var i = 0u; i < N_SLICE; i += 1u) {
atomicStore(&sh_bitmaps[i][local_id.x], 0u);
DJMcNab (Collaborator):

I know this isn't related to this PR, but as far as I can tell, this is already guaranteed to be zeroed. If this is to work around a driver/naga bug, we should have a comment here

raphlinus (Contributor, Author):

Ah, I didn't realize that was a strong guarantee. In WebGPU world, it's probably worth skipping this explicit zeroing, but in native world it might be worth compiling with zeroing by infrastructure disabled, in which case we would need this.
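
For concreteness, a minimal WGSL sketch of the two regimes (the array dimensions here are assumptions, not copied from the shader):

```wgsl
// WebGPU guarantees zero-initialization of workgroup memory, so this
// declaration alone means sh_bitmaps starts at 0 for every workgroup:
var<workgroup> sh_bitmaps: array<array<atomic<u32>, N_TILE>, N_SLICE>;

// The explicit store only matters if the module is compiled with
// zero-initialization disabled (a plausible native-only optimization):
for (var i = 0u; i < N_SLICE; i += 1u) {
    atomicStore(&sh_bitmaps[i][local_id.x], 0u);
}
```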

DJMcNab (Collaborator):

Good point - for us it's not impactful, but e.g. before #363 this would have mattered for the MSL conversion

shader/binning.wgsl (resolved)
Comment on lines +76 to +78
// This sets us up for technical UB, as lots of threads will be writing
// to the same locations. But I think it's fine, and predicating the
// writes would probably slow things down.
DJMcNab (Collaborator):

Would it be reasonable to have new_cmd = cmd_offset here? I think that would avoid the UB - instead we'd just overwrite the same location on each loop iteration.

raphlinus (Contributor, Author):

Intriguing idea! However, that won't quite avoid UB, as cmd_offset will edge into the allocation following this one. Setting it to cmd_limit - (PTCL_INCREMENT - PTCL_HEADROOM) almost works, but only if it's not in its initial segment. I can't think of a good solution in that case, as we still ideally want the limit where it is so it accurately allocates as if there were enough memory, for the purposes of reporting the size back. My gut feeling is that if we were very concerned about the technical UB, we should in fact predicate the writes.
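
For reference, a hedged sketch of what predicating the writes could look like (ptcl, cmd_limit, and the write itself are assumptions shaped by the discussion above, not the actual diff):

```wgsl
// Hypothetical predicated PTCL write: only store while the cursor is
// inside the current allocation, at the cost of a branch per write.
if new_cmd < cmd_limit {
    ptcl[new_cmd] = tag;
}
// The offset keeps advancing regardless, so the final value still
// reports how much memory the scene would have needed.
new_cmd += 1u;
```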

DJMcNab (Collaborator):

Ah, because the allocations are variably sized? I can't say I'm that happy about adding UB, but I do agree that it's unlikely to cause a problem in practice.

I wonder how bad the cost of writing to the same location is in terms of memory bandwidth/cache coherency?
I agree that this is probably fine as-is, though

if (failed & (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_PATH_COARSE)) != 0u {
if failed != 0u {
if wg_id.x == 0u && local_id.x == 0u {
atomicOr(&bump.failed, STAGE_COARSE);
DJMcNab (Collaborator):

What does this represent? I would think this should be STAGE_{BEFORE_COARSE}, and then only if the check on seg_counts failed

raphlinus (Contributor, Author):

I'll make the change, now that I'm thinking these flags will be interpreted by the client, not just used to early-out downstream.
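
For illustration, the failure word is a per-stage bitmask; the constants below are a sketch with made-up bit values (the real definitions live in the shared WGSL source):

```wgsl
// Hypothetical bit assignments for bump.failed: one bit per stage, so a
// client can tell exactly which stage ran out of memory.
const STAGE_BINNING = 0x1u;
const STAGE_TILE_ALLOC = 0x2u;
const STAGE_PATH_COARSE = 0x4u;
const STAGE_FLATTEN = 0x8u;
const STAGE_PATH_COUNT = 0x10u;
const STAGE_COARSE = 0x20u;
```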

let lines = atomicLoad(&bump.lines);
indirect.count_x = (lines + (WG_SIZE - 1u)) / WG_SIZE;
if atomicLoad(&bump.failed) != 0u {
indirect.count_x = 0u;
DJMcNab (Collaborator):

I'm impressed that this works. Reading the specs suggests it's fine. :shipit:

armansito (Collaborator):

Yes, this works and it's the only way I'm aware of that allows you to "abort" this type of indirect dispatch (there are more sophisticated ways with bindless, see for example: https://developer.apple.com/documentation/metal/indirect_command_encoding/encoding_indirect_command_buffers_on_the_gpu?language=objc).

Interestingly, I couldn't find any explicit wording in the WebGPU, Metal, Vulkan, or D3D12 docs that this is the expected behavior but "0" falls within the accepted range for all of them. See also this past discussion: gpuweb/gpuweb#1045
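
For context, a sketch of the indirect dispatch argument buffer this trick relies on (the struct and binding are assumptions; only the count_x field name comes from the quoted code):

```wgsl
// Indirect dispatch arguments: three workgroup counts consumed by the GPU
// when the host later issues an indirect dispatch from this buffer.
struct IndirectCount {
    count_x: u32,
    count_y: u32,
    count_z: u32,
}

@group(0) @binding(0)
var<storage, read_write> indirect: IndirectCount;

// Zeroing count_x makes the downstream indirect dispatch launch no
// workgroups at all, which is the "abort" behavior discussed above.
```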

Commit: Clarify some nits, and also make a distinction between reporting failure in path_count and coarse.

DJMcNab (Collaborator) left a comment:

In its current state, the fact that there's no feedback at all when we miss a frame isn't ideal. I guess for debugging we could have a feature which slightly increases the effective alpha (or applies a red-tinged overlay?) when rendering fails.

Although that would have to increase each frame (which might actually be ideal, to give a view on the latency we get in dealing with under-allocation in #366).

armansito (Collaborator) left a comment:

Approved with one question.

@@ -152,11 +155,19 @@ fn main(
// We need to check only prior stages, as if this stage has failed in another workgroup,
// we still want to know this workgroup's memory requirement.
if local_id.x == 0u {
var failed = atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_FLATTEN);
if atomicLoad(&bump.seg_counts) > config.seg_counts_size {
failed |= STAGE_PATH_COUNT;
armansito (Collaborator):

Why not set this in path_count?

raphlinus (Contributor, Author):

Basically because path_count doesn't bind config. I'm also a bit wary of divergence, but there's probably no meaningful impact on performance. I think it comes down to a style question of whether you tick the flag there or later.

If you can really rely on buffer robustness, then maybe at some point you can drop the write predication and just look at the read after the fact. One thing in the back of my mind is the possibility of a wrapping u32, but I'll choose not to worry about that too much right now.
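
A sketch of that alternative, assuming robust buffer access turns out-of-bounds stores into no-ops (the seg_counts write is hypothetical; only the counter check mirrors the quoted diff):

```wgsl
// Unpredicated bump allocation: write unconditionally and let robustness
// semantics discard any store past the end of the buffer.
let idx = atomicAdd(&bump.seg_counts, 1u);
seg_counts[idx] = count; // silently dropped if idx is out of bounds

// Afterwards, a single thread compares the requested total against the
// actual capacity and raises the stage's failure bit.
if atomicLoad(&bump.seg_counts) > config.seg_counts_size {
    atomicOr(&bump.failed, STAGE_PATH_COUNT);
}
// Caveat raised above: the counter is a u32, so an extreme scene could
// wrap it and make this check pass incorrectly.
```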

raphlinus added this pull request to the merge queue on Apr 1, 2024
Merged via the queue into main with commit aad292d on Apr 1, 2024
15 checks passed
raphlinus deleted the robustness branch on Apr 1, 2024 at 16:37

armansito added a commit that referenced this pull request on Apr 20, 2024:
Following #537 it is possible for the flatten stage to fail and flag
a failure. In some cases this can cause invalid / corrupt bounding box
data to propagate downstream, leading to a hang in the per-tile
backdrop calculation loop.

Triggering this is highly subtle, so I don't have a test case as part of
vello scenes that can reliably reproduce this. Regardless, it makes
sense to check for the upstream failures and terminate the work in
general.
armansito added a commit that referenced this pull request on May 9, 2024 (same message as above).
github-merge-queue bot pushed a commit that referenced this pull request on May 9, 2024:
Following #537 it is possible for the flatten stage to fail and flag a
failure. In some cases this can cause invalid / corrupt bounding box
data to propagate downstream, leading to a hang in the per-tile backdrop
calculation loop.

Triggering this is highly subtle, so I don't have a test case as part of
vello scenes that can reliably reproduce this. Regardless, it makes
sense to check for the upstream failures and terminate the work in
general.

I made backdrop_dyn check for any upstream failure and I didn't make it
signal its own failure flag. I also didn't change the logic in the CPU
shader since the other stages I checked (flatten, coarse) do not
implement error signaling in their CPU counterparts. Let me know if
you'd like me to work on those.
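
A minimal sketch of the guard that commit describes, assuming backdrop_dyn can read the shared failure word (the struct layout, binding index, and workgroup size are assumptions):

```wgsl
struct BumpAllocators {
    failed: atomic<u32>,
    // ... other bump-allocated counters elided ...
}

@group(0) @binding(1)
var<storage, read_write> bump: BumpAllocators; // shared failure state

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    // If any upstream stage failed, bounding boxes may be corrupt, so
    // bail out before the per-tile backdrop loop can spin on bad data.
    if atomicLoad(&bump.failed) != 0u {
        return;
    }
    // ... per-tile backdrop calculation ...
}
```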