shaders: semantics of "discard" #361
Comments
I'd advocate the following:
This way, the MSL is safe. My personal 2 cents is that the D3D discard style can leave some performance on the floor for situations where a fragment shader has an early out via discard. |
Discussed at the 2020-07-07 meeting. |
Resolved: discard/kill breaks uniform control flow. Since derivatives must be within UCF, this is safe on all backends. |
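As a rough illustration of why derivatives are tied to uniform control flow, here is a toy Python model of a 2x2 quad. It is only a sketch: `ddx_in_quad` and `run_quad` are hypothetical names, and real hardware computes derivatives in implementation-specific ways. The point is just that a horizontal difference needs both lanes of a pair to still hold a value.

```python
from typing import List, Optional

# A 2x2 quad laid out as [top-left, top-right, bottom-left, bottom-right].
# Each lane holds the value it computed, or None if it stopped executing.
Quad = List[Optional[float]]


def ddx_in_quad(quad: Quad, lane: int) -> Optional[float]:
    """Horizontal difference for `lane`, taken across its row of the quad.

    Returns None when either lane of the pair has no value, which models
    an undefined derivative.
    """
    row_start = lane & ~1  # 0 for the top row, 2 for the bottom row
    left, right = quad[row_start], quad[row_start + 1]
    if left is None or right is None:
        return None
    return right - left


def run_quad(u: List[float], discard_mask: List[bool], keep_helpers: bool) -> Quad:
    """Values visible to a derivative that is taken after the discard.

    keep_helpers=True models a discard where the lane keeps running as a
    helper; keep_helpers=False models a discard that stops the lane.
    """
    return [
        value if (keep_helpers or not discarded) else None
        for value, discarded in zip(u, discard_mask)
    ]


u = [0.0, 0.25, 0.0, 0.25]
mask = [True, False, False, False]  # only the top-left lane discards

killed = run_quad(u, mask, keep_helpers=False)
demoted = run_quad(u, mask, keep_helpers=True)
```

In this model, `ddx_in_quad(killed, 1)` is `None` because the top-left partner is gone, while `ddx_in_quad(demoted, 1)` is still `0.25`. Requiring derivatives to sit in uniform control flow rules out the first situation on every backend.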
Revisiting this:
I want to highlight that as an important requirement I hadn't captured in the summary above.
Sure, that looks like it doesn't matter for WGSL because of our plan for an extra statically-checked uniformity rule. However, there is a separating case which does come up in real life. Schematically in HLSL:
With a slightly more complex example you can coax one of the 4 invocations in a quad to run forever, hanging up the other ones.

    int quad_idx = (int(gl_FragCoord.x) & 1) + (int(gl_FragCoord.y) & 1) * 2;
    if (quad_idx == 0) {
        terminate;
    }
    while (quad_idx < 8) {
        quad_idx *= 2;
    }

The quad_idx will be 0 for a "top-left" fragment, and non-zero for the other 3 it's partnered with. My point is that there are considerations that go beyond uniformity-for-the-purpose-of-derivatives. I think whatever path we choose, there will be work to do in translation to different back-ends. |
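The quad example above can be traced with a small Python sketch. Everything here is an assumption made for illustration: `run_loop` and `simulate_quad` are hypothetical helpers, a step cap (`max_steps`) stands in for "runs forever", and the flag `terminate_stops_lane` chooses between a terminate that really ends the lane and one that lets it keep running as a helper.

```python
from typing import List


def run_loop(quad_idx: int, max_steps: int = 64) -> bool:
    """Run `while (quad_idx < 8) quad_idx *= 2`.

    Returns True if the loop exits within max_steps iterations.
    """
    for _ in range(max_steps):
        if quad_idx >= 8:
            return True
        quad_idx *= 2
    return quad_idx >= 8


def simulate_quad(terminate_stops_lane: bool, max_steps: int = 64) -> List[str]:
    """Per-lane outcome for quad_idx 0..3: 'terminated', 'finished' or 'stuck'."""
    results = []
    for quad_idx in range(4):
        if quad_idx == 0 and terminate_stops_lane:
            # The lane really ends at `terminate` and never enters the loop.
            results.append("terminated")
        elif run_loop(quad_idx, max_steps):
            results.append("finished")
        else:
            # quad_idx == 0 stays 0 after every doubling, so it never reaches 8.
            results.append("stuck")
    return results


def quad_hangs(results: List[str]) -> bool:
    """If the lanes execute together, one stuck lane holds up the whole quad."""
    return "stuck" in results
```

With `terminate_stops_lane=True` the top-left lane ends and the other three finish after a few doublings. With `terminate_stops_lane=False` the top-left lane keeps multiplying zero by two, which is the hang described above.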
Just to be clear, |
So what you're saying is that code like this is invalid?

    Output main(texcoord : TEXCOORD0) {
        float4 alphaTex = AlphaTex.Sample(texcoord);
        if (alphaTex.a <= 0.25) {
            // No need to process the rest of the shader, do expensive lighting, etc.
            discard;
            return;
        }
        float4 diffuse = DiffuseTex.Sample(texcoord);
        out.color = doExpensiveLighting(diffuse);
    }

How would you suggest we write it? The goal is to kill the shader if all four helper invocations are dead, but still let them continue to interact for correct derivatives. Will we just be tanking performance and have to evaluate expensive lighting on all fragments? |
Yes, that code is invalid in WGSL by committee decision, dating back at least to the San Diego F2F in January 2019. A conservative static analysis would reject such code before it is ever run. I noted at the time that there were going to be applications that would not work, or would suffer. We left it at that. We didn't have metrics on hand, and the sentiment at the time was to err on the side of consistency and portability. Please do challenge that. |
Except that some implementations will terminate. So whether it's spec'd or not, you get inconsistent behaviour from implementations. |
That's surprising to me. It will eventually terminate in a TDR, but I expect this to be something MS tests for compatibility. I know of a port house that had to emulate the DX12 behavior in Vulkan by translating all discards into while loops while porting games over. |
Specifically, I'm told that NVIDIA drivers will terminate all the quad invocations when all four in the quad become helpers. |
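To make the performance question concrete, here is a toy cost model in Python that counts how many lanes of one 2x2 quad would run the expensive part of the shader under three discard policies. It is a sketch under stated assumptions: the function name and the policy names are hypothetical, and it does not describe any particular driver.

```python
from typing import Sequence


def lighting_evaluations(discards: Sequence[bool], policy: str) -> int:
    """Count lanes of one 2x2 quad that run the expensive lighting code.

    discards: four booleans, True where the lane executed `discard`.
    policy:
      "terminate"        - a discarding lane stops immediately
      "helper"           - discarding lanes keep running as helpers
      "helper_quad_exit" - like "helper", but the quad retires once all
                           four lanes have become helpers
    """
    if len(discards) != 4:
        raise ValueError("a quad has exactly four lanes")
    if policy == "terminate":
        return sum(1 for d in discards if not d)
    if policy == "helper":
        return 4
    if policy == "helper_quad_exit":
        return 0 if all(discards) else 4
    raise ValueError(f"unknown policy: {policy}")
```

In this model the quad-exit policy recovers the early out only when the whole quad fails the alpha test. A quad with a single surviving lane still pays for four lighting evaluations, which is where the two discard styles differ in cost.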
@magcius that's a good piece of alpha-testing code. I think it's fine for us to disallow it, in accordance with the decision we made for "discard" (that it makes control flow non-uniform). Users can work around it:

    Output main(texcoord : TEXCOORD0) {
        float2 gradX = ddx(texcoord);
        float2 gradY = ddy(texcoord);
        float4 alphaTex = AlphaTex.SampleGrad(texcoord, gradX, gradY);
        if (alphaTex.a <= 0.25) {
            // No need to process the rest of the shader, do expensive lighting, etc.
            discard;
            return;
        }
        float4 diffuse = DiffuseTex.SampleGrad(texcoord, gradX, gradY);
        out.color = doExpensiveLighting(diffuse);
    }

To clarify, this would work because |
@kvark that's going to be more expensive. Submitting gradients alongside texture coordinates is double the amount of data going to the sampler unit. On Apple platforms, I believe it was substantially slower (I believe about 20% overhead?) |
We can talk at great length about the performance aspects of this workaround, but it is nevertheless a rather non-intrusive workaround that people can do. The other alternative is having non-portable shader code, which is not even on the table. |
The idea that some pretty standard shader constructs allowed in all other shading languages are not allowed in WGSL is going to raise the barrier to entry IMO. The examples in these threads all have really simple control flow; things can get much nastier later with more complicated control flow.

I confess that I think the goal of shaders being guaranteed portable across all GPU vendors (especially in mobile) is only possible on paper. Shader compilers in drivers are often strictly restricted with respect to control flow "depth", where both the computation of that depth and its limits are GPU architecture specific. When that depth is exceeded one gets lovely messages like "internal compiler error" or "shader too complicated to compile".

Trying to write something that will detect non-uniform control flow is, I think, going to fail in that some uniform flows will be misidentified. I have a large battery of techniques that guarantee that the flow control is uniform within a primitive, but the flow control is generated by data read (per fragment). The upshot is that the shader is portable with the data I feed it, but not on arbitrary data. Making WGSL require that derivative ops are in uniform flow control for all data fed to the shader will make a fair number of techniques that I use (and have used all the way back to GL3) impossible.

To play devil's advocate, what is the current situation with WebGL's discard and uniform flow control when it maps to Direct3D? |
That is to be discussed in #921
This has been discussed at the community group. The cost of attaining consistent behaviour across implementations depends a lot on the particular feature you're talking about. We're dealing with them on a case by case basis. We're trying to iterate on this. Regarding uniformity vs. discard, we've chosen a path which is strictly conservative and which we know, as you point out, will reject techniques that are known to work in other APIs. We do know that there is a performance cost to it; we have not quantified the performance/power cost, and don't have a good way to understand the prevalence/importance of those techniques. I can see a future where W3C releases an initial WebGPU which is more restrictive, and over time folks like you can quantify the importance for us, which then lets us revisit particular decisions with more data in hand. |
This is where I will suggest "please talk to IHVs". You have no ability to meaningfully profile as an ISV, as the shader compiler can do magic beyond your recognition and thwart pretty much any benchmark. Only IHVs with knowledge of what things compile to can make informed decisions. Metal intentionally switched to D3D semantics because it's what everybody wants. I have suggested before that emulation of D3D-like semantics is a good idea, and I still partially believe it. Perhaps opt-in? |
This is spot on. Control flow is the most miserable part of making a shader compiler and some of the weirdness that can happen from control flow can be very GPU specific. |
Considering #3176 landed, should this be closed? |
I think so. |
GLSL and HLSL have long had a useful "discard" primitive. MSL has discard_fragment().
GLSL's discard maps to OpKill in SPIR-V.
In D3D discard, the invocation continues execution so it can later participate in computing derivatives.
In SPIR-V OpKill stops later side effects from occurring, and control flow becomes non-uniform. So you can't compute derivatives after executing OpKill.
SPIR-V has a recently proposed multi-vendor extension to add D3D-style discard, via SPV_EXT_demote_to_helper_invocation. See KhronosGroup/SPIRV-Registry#43 for details.
The MSL discard_fragment() builtin is similar to OpKill: