Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shaders: semantics of "discard" #361

Closed
dneto0 opened this issue Jul 9, 2019 · 19 comments
Closed

shaders: semantics of "discard" #361

dneto0 opened this issue Jul 9, 2019 · 19 comments
Labels
copyediting Pure editorial stuff (copyediting, *.bs file syntax, etc.) programming model wgsl WebGPU Shading Language Issues
Milestone

Comments

@dneto0
Copy link
Contributor

dneto0 commented Jul 9, 2019

GLSL and HLSL have long had a useful "discard" primitive. MSL has discard_fragment().
GLSL's discard maps to OpKill in SPIR-V.

In D3D discard, the invocation continues execution so it can later participate in computing derivatives.

In SPIR-V OpKill stops later side effects from occurring, and control flow becomes non-uniform. So you can't compute derivatives after executing OpKill.

SPIR-V has a recently proposed multi-vendor extension to add D3D-style discard, via SPV_EXT_demote_to_helper_invocation. See KhronosGroup/SPIRV-Registry#43 for details.

The MSL discard_fragment() builtin is similar to OpKill:

Multiple fragment threads or helper threads associated with a fragment thread execute
together to compute derivatives. If any (but not all) of these threads executes the
discard_fragment function, the behavior of any derivative computations (explicit or implicit)
is undefined.

@krogovin
Copy link

krogovin commented Aug 8, 2019

I'd advocate the following:

  • the default behavior is that derivative ops after discard might be correct or not.
  • as an extension, have a shader property that can specify to follow D3D's style discard.

This way, the MSL is safe. My personal 2 cents is that the D3D discard style can leave some performance on the floor for situations where a fragment shader has an early out via discard.

@grorg grorg added the wgsl WebGPU Shading Language Issues label Jun 23, 2020
@grorg
Copy link
Contributor

grorg commented Jul 7, 2020

Discussed at the 2020-07-07 meeting.

@kdashg
Copy link
Contributor

kdashg commented Jul 7, 2020

Resolved: discard/kill breaks uniform control flow. Since derivatives must be within UCF, this is safe on all backends.

@kdashg kdashg mentioned this issue Jul 7, 2020
@dneto0
Copy link
Contributor Author

dneto0 commented Jul 9, 2020

Revisiting this:

  1. In this weeks meeting @litherum raised the issue of blending of fragment outputs. From the minutes:

MM: The three different backend languages have different definitions of discard. Discard is different then an early return because the blending step after the fragment shader doesn’t execute if you hit a discard statement. Early return would still blend after the fragment shader.

I want to highlight that as an important requirement I hadn't captured in the summary above.
OpKill is an early return from the entry point, and also throws away the fragment so it has no further effects in the pipeline.
Vulkan 25. Fragment operations says:

Fragments can also be programmatically discarded in a fragment shader by executing OpKill.

  1. Reviewing the MSL spec, I see that in Metal 2.3 the behaviour of discard_fragment() flipped from GLSL-discard semantics to HLSL-discard semantics:

Multiple fragment threads or helper threads associated with a fragment thread execute
together to compute derivatives. Prior to Metal 2.3, if any (but not all) of these threads executes
the discard_fragment function, the thread is terminated and the behavior of any derivative
computations (explicit or implicit) is undefined. Since Metal 2.3, discard_fragment marks
the fragment is terminated while continuing to execute in parallel to make derivative
computations well defined. Even though execution continues, the write behavior remains the
same as before. The fragment will discard the fragment output and discard all writes to buffer
or texture after discard_fragment.

Sure, that looks like it doesn't matter for WGSL because of our plan for an extra statically-checked uniformity rule.

However, there is a separating case which does come up in real life. Schematically in HLSL:

  while (true) {
      discard;
  }
  • If you have semantics of "super-return", i.e. terminate the invocation right away, then it terminates. Yay.
  • If you have "demote to helper" then you either have:
    • run forever. (boo)
    • look like you run forever, except that the implementation sees that when all invocations in the quad have been demoted to helper, then they all terminate themselves.

With a slightly more complex example you can coax one of the 4 invocations in a quad to run forever, hanging up the other ones.

int quad_idx = (int(gl_FragCoord.x) & 1) + (int(gl_FragCoord.y) & 1) * 2;
if (quad_idx == 0) {
   terminate; 
}
while (quad_idx < 8) {
   quad_idx *= 2;
}

The quad_idx will be 0 for a "top-left" fragment, and non-zero for the other 3 it's partnered with.
So with the demote-to-helper semantics you get infinite loop for the top-left, and its partners are stuck there as well.

My point is that there are considerations that go beyond uniformity-for-the-purpose-of-derivatives.

I think whatever path we choose, there will be work to do in translation to different back-ends.

@magcius
Copy link

magcius commented Jul 9, 2020

Just to be clear, while (true) { discard; } in HLSL is an infinite loop, because HLSL's discard is not a "super-return".

@magcius
Copy link

magcius commented Jul 9, 2020

Sure, that looks like it doesn't matter for WGSL because of our plan for an extra statically-checked uniformity rule.

So what you're saying is that code like this is invalid?

Output main(texcoord : TEXCOORD0) {
    float4 alphaTex = AlphaTex.Sample(texcoord);
    if (alphaTex.a <= 0.25) {
        // No need to process the rest of the shader, do expensive lighting, etc.
        discard;
        return;
    }

    float4 diffuse = DiffuseTex.Sample(texcoord);
    out.color = doExpensiveLighting(diffuse);
}

How would you suggest we write it? The goal is to kill the shader if all four helper invocations are dead, but still let them continue to interact for correct derivatives. Will we just be tanking performance and have to evaluate expensive lighting on all fragments?

@dneto0
Copy link
Contributor Author

dneto0 commented Jul 9, 2020

Yes, that code is invalid in WGSL by committee decision from at least San Diego F2F in January 2019. There would be a conservative static analysis that would reject such code before being run.

I noted at the time that there were going to be applications that would not work, or would suffer. We left it at that. We didn't have metrics on hand, and the sentiment at the time was to error on consistency and portability.

Please do challenge that.

@dneto0
Copy link
Contributor Author

dneto0 commented Jul 9, 2020

Just to be clear, while (true) { discard; } in HLSL is an infinite loop, because HLSL's discard is not a "super-return".

Except that some implementations will terminate. So whether it's spec'd or not, you get inconsistent behaviour from implementations.

@magcius
Copy link

magcius commented Jul 9, 2020

Except that some implementations will terminate.

That's surprising to me. It will eventually terminate in a TDR, but I expect this to be something MS tests for compatibility. I know of a port house that had to emulate the DX12 behavior in Vulkan by translating all discards into while loops while porting games over.

@dneto0
Copy link
Contributor Author

dneto0 commented Jul 9, 2020

Specifically, I'm told that NVIDIA drivers will terminate all the quad invocations when all four in the quad become helpers.

@kvark
Copy link
Contributor

kvark commented Jul 9, 2020

@magcius that's a good piece of alpha-testing code. I think it's fine for us to disallow it, in accordance to the decision we made for "discard" (that is makes control flow non-uniform). The users can work around it:

Output main(texcoord : TEXCOORD0) {
    float2 gradX = ddx(texcoord);
    float2 gradY = ddy(texcoord);
    float4 alphaTex = AlphaTex.SampleGrad(texcoord, gradX, gradY);
    if (alphaTex.a <= 0.25) {
        // No need to process the rest of the shader, do expensive lighting, etc.
        discard;
        return;
    }

    float4 diffuse = DiffuseTex.SampleGrad(texcoord, gradX, gradY);
    out.color = doExpensiveLighting(diffuse);
}

To clarify, this would work because SampleGrad doesn't rely on uniform control flow, and the gradients are obtained while we still are in the control flow, so calling ddx and ddy there is valid.

@magcius
Copy link

magcius commented Jul 9, 2020

@kvark that's going to be more expensive. Submitting gradients alongside texture coordinates is double the amount of data going to the sampler unit. On Apple platforms, I believe it was substantially slower (I believe about 20% overhead?)

@kvark
Copy link
Contributor

kvark commented Jul 9, 2020

We can talk great length about the performance aspects of this workaround, but it is nevertheless a rather non-intrusive workaround that people can do. The other alternative is having non-portable shader code, which is not even on the table.

@krogovin
Copy link

The idea that some pretty standard shader constructs allowed in all other shading languages are not allowed in WGSL is going raise the barrier to entry IMO.

The examples in these threads all have really simple control flow; things can get much nastier later with more complicated control flow. I confess that I think that the goal of shaders are guaranteed to be portable across all GPU vendors (especially in mobile) is only possible on paper. Shader compilers in drivers are often strictly restricted with respect to control flow "depth" where the computation of that depth is GPU architecture specific and the limits also being GPU architecture specific. When that depth is exceeded one gets lovely messages like "internal compiler error" or "shader too complicated to compile".

Trying to write something that will detect non-uniform control flow I think is going to fail in that some uniform flows will be misidentified. I have a large battery of techniques that guarantee that the flow control is uniform within a primitive, but the flow control is generated by data read (per fragment). The upshot being that the shader is portable with the data I feed it, but not on arbitrary data.

Making the WGSL require that using derivative ops is uniform flow control for all data fed to it will make a fair number of techniques that I use (and have used all the way back to GL3) not possible.

To play devil's advocate, what the current situation with WebGL's discard and uniform flow control when it maps to Direct3D?

@dneto0
Copy link
Contributor Author

dneto0 commented Jul 21, 2020

To play devil's advocate, what the current situation with WebGL's discard and uniform flow control when it maps to Direct3D?

That is to be discussed in #921

I confess that I think that the goal of shaders are guaranteed to be portable across all GPU vendors (especially in mobile) is only possible on paper.

This has been discussed at the community group. The cost of attaining consistent behaviour across implementation depends a lot on the particular feature you're talking about. We're dealing with them on a case by case basis.

We're trying to iterate on this. Regarding uniformity vs. discard, we've chosen on a path which is strictly conservative and which we know, as you point out, will reject techniques that are known to work in other APIs. We do know that there is a performance cost to it; we have not quantified the performance/power cost, and don't have a good way to understand the prevalence/importance of those techniques.

I can see a future where W3C releases an initial WebGPU which is more restrictive, and over time folks like you can quantify the importance for us, which then lets us revisit particular decisions with more data in hand.

@magcius
Copy link

magcius commented Jul 22, 2020

This is where I will suggest "please talk to IHVs". You have no ability to meaningfully profile as an ISV, as the shader compiler can do magic beyond your recognition and thwart pretty much any benchmark. Only IHVs with knowledge of what things compile to can make informed decisions.

Metal intentionally switched to D3D semantics because it's what everybody wants. I have suggested before that emulation of D3D-like semantics is a good idea, and I still partially believe it. Perhaps opt-in?

@krogovin
Copy link

This is where I will suggest "please talk to IHVs". You have no ability to meaningfully profile as an ISV, as the shader compiler can do magic beyond your recognition and thwart pretty much any benchmark. Only IHVs with knowledge of what things compile to can make informed decisions.

This is spot on. Control flow is the most miserable part of making a shader compiler and some of the weirdness that can happen from control flow can be very GPU specific.

@Kangz Kangz added this to the V1.0 milestone Feb 2, 2022
@kdashg kdashg added the wgsl resolved Resolved - waiting for a change to the WGSL specification label Apr 26, 2022
@Kangz Kangz added the copyediting Pure editorial stuff (copyediting, *.bs file syntax, etc.) label Sep 20, 2022
@Kangz Kangz removed the wgsl resolved Resolved - waiting for a change to the WGSL specification label Sep 20, 2022
@teoxoy
Copy link
Member

teoxoy commented Jan 10, 2023

Considering #3176 landed, should this be closed?

@kdashg
Copy link
Contributor

kdashg commented Jan 10, 2023

I think so.

@kdashg kdashg closed this as completed Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
copyediting Pure editorial stuff (copyediting, *.bs file syntax, etc.) programming model wgsl WebGPU Shading Language Issues
Projects
None yet
Development

No branches or pull requests

8 participants