
Fragment shader not writing to depth buffer correctly #314

Closed
s-perron opened this issue Nov 1, 2019 · 9 comments

Comments

s-perron (Contributor) commented Nov 1, 2019

I have noticed a shader that does not write to the depth buffer correctly. It seems that two OpKills combined with an image load are needed to reproduce it. I find it very odd that the image load makes a difference.

I will attach a reduced version of the SPIR-V and the GCN that was generated. In the GCN, there is a sample trace through the code that shows the problem. Please let me know if I have misunderstood any of the instructions; I'm still learning GCN.

s-perron (Contributor, Author) commented Nov 1, 2019

s-perron (Contributor, Author) commented Nov 1, 2019

I need to go back to the drawing board. The first instruction in BB0_4 is an OR, but I treated it like an AND.

linqun (Member) commented Nov 1, 2019

What will happen if we change the sample function from texture to textureLod? Using textureLod avoids having to set whole quad mode (s_wqm_b64) and simplifies the generated GCN code.

s-perron (Contributor, Author) commented Nov 1, 2019

After looking at it more, I believe the implicit LOD is required to see the problem. The s_wqm_b64 instruction is executed at the start of the function, presumably because the implicit LOD needs all threads in the quad to be active.

However, after the code related to the first OpKill, some threads will be turned off. When the image_sample_b is reached, not every thread in quads with active threads will be active.

Does this sound right?

s-perron (Contributor, Author) commented Nov 1, 2019

After looking at it further, I believe that the "SI Whole Quad Mode" pass is not complete.

Non-uniform control flow can cause a wave that is in whole quad mode to fall out of it, because part of a quad could go one way and the rest another. However, no instruction is emitted to put it back into whole quad mode.
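The way non-uniform control flow breaks whole quad mode can be sketched with a toy exec-mask model in Python (the lane count, the wqm helper, and the mask values here are illustrative, not real GCN semantics):

```python
# Toy model of "whole quad mode" (WQM) on a 16-lane wave.

def wqm(exec_mask, lanes=16):
    """If any lane of a 2x2 quad is live, enable the whole quad."""
    out = 0
    for quad in range(lanes // 4):
        if (exec_mask >> (quad * 4)) & 0xF:
            out |= 0xF << (quad * 4)
    return out

# Lanes 0-5 are covered by the primitive.
exec_mask = 0b0000000000111111
# s_wqm_b64: lanes 6 and 7 become helper lanes so quad 1 is complete.
assert wqm(exec_mask) == 0b0000000011111111

# A non-uniform branch (e.g. the code around an OpKill) turns off
# lane 5 only:
exec_mask = wqm(exec_mask) & ~(1 << 5)
# The wave is no longer in whole quad mode: quad 1 is partially
# active, and nothing later re-applies wqm() to repair it.
assert wqm(exec_mask) != exec_mask
```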

I can take a look at it next week if nobody is looking at it yet.

Here is another example:

               OpCapability Shader
               OpCapability SampledBuffer
               OpCapability StorageImageExtendedFormats
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %ShadePixel "main" %gl_FragCoord
               OpExecutionMode %ShadePixel OriginUpperLeft
               OpSource HLSL 600
               OpName %ShadePixel "ShadePixel"
               OpDecorate %gl_FragCoord BuiltIn FragCoord
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %bool = OpTypeBool
    %float_0 = OpConstant %float 0
%_ptr_Input_v4float = OpTypePointer Input %v4float
       %void = OpTypeVoid
         %26 = OpTypeFunction %void
%gl_FragCoord = OpVariable %_ptr_Input_v4float Input
 %ShadePixel = OpFunction %void None %26
         %27 = OpLabel
         %28 = OpLoad %v4float %gl_FragCoord
         %29 = OpCompositeExtract %float %28 0
         %30 = OpFOrdLessThan %bool %29 %float_0
               OpSelectionMerge %42 None
               OpBranchConditional %30 %42 %31
         %31 = OpLabel
         %37 = OpDPdx %v4float %28
         %38 = OpCompositeExtract %float %37 0
         %39 = OpFOrdLessThan %bool %38 %float_0
               OpSelectionMerge %40 None
               OpBranchConditional %39 %41 %40
         %41 = OpLabel
               OpKill
         %40 = OpLabel
               OpBranch %42
         %42 = OpLabel
               OpReturn
               OpFunctionEnd
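As pseudocode (Python standing in for shader code, with a hypothetical ddx_x parameter representing the OpDPdx result on gl_FragCoord.x), the control flow of the SPIR-V above is roughly:

```python
# Rough control-flow skeleton of the SPIR-V example. The kill sits
# inside a branch that is not uniform across a quad, while the
# derivative feeding its condition needs the whole quad.

def shade_pixel(frag_coord_x, ddx_x):
    """Returns 'killed' if the fragment is discarded, else 'kept'."""
    if not (frag_coord_x < 0.0):      # OpBranchConditional %30 %42 %31
        if ddx_x < 0.0:               # %39: depends on OpDPdx (%37)
            return 'killed'           # OpKill in non-uniform control flow
    return 'kept'                     # merge block %42

assert shade_pixel(-1.0, -1.0) == 'kept'
assert shade_pixel(1.0, -1.0) == 'killed'
assert shade_pixel(1.0, 1.0) == 'kept'
```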

s-perron (Contributor, Author) commented Nov 1, 2019

Here is the code that LLPC generates for the new example:

Disassembly of section .text:
_amdgpu_ps_main:
        s_mov_b64 s[0:1], exec                                     // 000000000000: BE80017E 
        s_wqm_b64 exec, exec                                       // 000000000004: BEFE077E 
        v_cmp_ngt_f32_e32 vcc, 0, v2                               // 000000000008: 7C960480 
        s_and_saveexec_b64 s[2:3], vcc                             // 00000000000C: BE82206A 
        s_xor_b64 s[2:3], exec, s[2:3]                             // 000000000010: 8882027E 
        s_cbranch_execz BB0_4                                      // 000000000014: BF88000D 

BB0_1:
        v_mov_b32_e32 v0, v2                                       // 000000000018: 7E000302 
        s_nop 1                                                    // 00000000001C: BF800001 
        v_mov_b32_dpp v0, v0  quad_perm:[1,1,1,1] row_mask:0xf bank_mask:0xf bound_ctrl:0// 000000000020: 7E0002FA FF085500 
        s_nop 1                                                    // 000000000028: BF800001 
        v_subrev_f32_dpp v0, v2, v0  quad_perm:[0,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0// 00000000002C: 060000FA FF080002 
        v_mov_b32_e32 v0, v0                                       // 000000000034: 7E000300 
        v_cmp_gt_f32_e32 vcc, 0, v0                                // 000000000038: 7C880080 
        s_and_saveexec_b64 s[4:5], vcc                             // 00000000003C: BE84206A 
        s_xor_b64 s[4:5], exec, s[4:5]                             // 000000000040: 8884047E 

BB0_2:
        s_mov_b64 exec, 0                                          // 000000000044: BEFE0180 

BB0_3:
        s_or_b64 exec, exec, s[4:5]                                // 000000000048: 87FE047E 

BB0_4:
        s_or_b64 exec, exec, s[2:3]                                // 00000000004C: 87FE027E 
        s_and_b64 exec, exec, s[0:1]                               // 000000000050: 86FE007E 
        v_mov_b32_e32 v0, 0                                        // 000000000054: 7E000280 
        exp mrt0 v0, off, off, off done vm                         // 000000000058: C4001801 00000000 
        s_endpgm                                                   // 000000000060: BF810000 
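The scalar-mask arithmetic of the listing can be replayed on a toy 8-lane wave (the mask values and the wqm helper are illustrative, not real hardware semantics). Note that the final restore sequence in BB0_4 produces the right export mask; the problem is the derivative computed earlier in BB0_1 with a partially active quad:

```python
# Replaying the exec-mask operations of the listing above on a toy
# 8-lane wave (real waves are 64 lanes wide).

def wqm(mask, lanes=8):
    # s_wqm: if any lane of a 2x2 quad is live, enable the whole quad
    out = 0
    for q in range(lanes // 4):
        if (mask >> (q * 4)) & 0xF:
            out |= 0xF << (q * 4)
    return out

live   = 0b00000111            # lanes covered by the primitive
exec_m = live
s0     = exec_m                # s_mov_b64  s[0:1], exec
exec_m = wqm(exec_m)           # s_wqm_b64  exec, exec  -> 0b00001111

vcc = 0b00000101               # outer v_cmp: non-uniform inside quad 0
s2  = exec_m                   # s_and_saveexec_b64 s[2:3], vcc
exec_m &= vcc
s2 ^= exec_m                   # s_xor_b64  s[2:3], exec, s[2:3]

# BB0_1: the DPP-based derivative runs here with a partially active
# quad -- the wave is no longer in whole quad mode.
assert wqm(exec_m) != exec_m

vcc = 0b00000001               # inner v_cmp: lane 0 takes the kill
s4  = exec_m                   # s_and_saveexec_b64 s[4:5], vcc
exec_m &= vcc
s4 ^= exec_m                   # s_xor_b64  s[4:5], exec, s[4:5]

exec_m = 0                     # BB0_2: s_mov_b64 exec, 0 (the kill)
exec_m |= s4                   # BB0_3: s_or_b64  exec, exec, s[4:5]
exec_m |= s2                   # BB0_4: s_or_b64  exec, exec, s[2:3]
exec_m &= s0                   #        s_and_b64 exec, exec, s[0:1]

# Final export mask = live lanes minus the killed lane: correct.
assert exec_m == 0b00000110
```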

perlfu (Contributor) commented Nov 2, 2019

The "SI Whole Quad Mode" implementation currently only supports Vulkan's model of implicit derivatives. The behaviour of these is only defined for uniform control flow.

From the SPIR-V 1.5 spec:

2.20 Code Motion
Texturing instructions in the Fragment Execution Model that rely on an implicit derivative cannot be moved into control flow that is not known to be uniform control flow within each derivative group.

I haven't stared hard at the shader here, but I suspect this limitation might be what you are running into.

perlfu (Contributor) commented Nov 2, 2019

If this is the case, then you probably need to use the DemoteToHelperInvocationEXT extension, substituting OpDemoteToHelperInvocationEXT for OpKill instructions.
Note that OpDemoteToHelperInvocationEXT is not a terminator, so it is not always a simple substitution, as branches must converge.
I've been looking at adding a pass to do this automatically for shaders that rely on this kind of behaviour.
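The semantic difference between the two can be sketched with a toy per-lane state model (the state names and helper functions are illustrative, not spec text):

```python
# Toy contrast of OpKill vs OpDemoteToHelperInvocationEXT
# for one 2x2 quad.

lanes = ['live'] * 4           # per-lane state: 'live', 'helper', 'dead'

def kill(i):
    # OpKill: the lane stops contributing to anything,
    # including implicit derivatives.
    lanes[i] = 'dead'

def demote(i):
    # OpDemoteToHelperInvocationEXT: the lane keeps computing values
    # for derivative purposes, but its outputs are discarded.
    lanes[i] = 'helper'

def derivatives_defined():
    # Implicit derivatives need every lane of the quad to still be
    # producing values ('live' or 'helper').
    return all(s != 'dead' for s in lanes)

kill(3)
assert not derivatives_defined()

lanes = ['live'] * 4
demote(3)
assert derivatives_defined()
assert lanes.count('live') == 3   # only live lanes export results
```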

s-perron (Contributor, Author) commented Nov 4, 2019

Thanks for the explanation. That sounds exactly like what is going on. This was an HLSL shader being compiled for Vulkan instead of DirectX. I now think this should be fixed in DXC: it should generate the new instruction instead of an OpKill.

Also, the pass you are looking at writing might fit better in SPIRV-Tools, where it could be used by anybody instead of being specific to AMD hardware. I'll see whether it will be needed on our side.

@s-perron s-perron closed this as completed Nov 4, 2019