
FP16 support #658

Closed
litherum opened this issue Mar 30, 2020 · 27 comments
Labels
investigation · wgsl resolved (Resolved - waiting for a change to the WGSL specification) · wgsl (WebGPU Shading Language Issues)

@litherum
Contributor

litherum commented Mar 30, 2020

FP16 provides significant benefits over FP32:

  • It uses half the memory (of course). This is particularly important on mobile devices, where available memory is limited. Specifically, one of the most common and strongest requests the Metal team gets from third parties is for features that help decrease memory use.
  • Because it uses half the size, memory bandwidth is more effectively used.
  • Even without the bandwidth increase, devices often have increased ALU performance for half-precision operations.
  • Power consumption is decreased on some devices, leading to better battery life.

I wanted to characterize the ALU performance, so I made a small Metal benchmark to execute on iOS. Here are the shaders:

constant unsigned int iterations [[function_constant(0)]];

// Multiply in a tight fp32 loop; storing the bit pattern through an
// atomic keeps the compiler from eliminating the loop.
kernel void aluFP32(constant float& seed, device atomic_uint& result) {
    float localSeed = seed;
    float counter = seed;
    for (unsigned int i = 0; i < iterations; ++i)
        counter *= localSeed;
    atomic_store_explicit(&result, as_type<unsigned int>(counter), memory_order_relaxed);
}

// The same loop in fp16; the half result is widened back to fp32
// before being bit-cast for the store.
kernel void aluFP16(constant float& seed, device atomic_uint& result) {
    half localSeed = seed;
    half counter = seed;
    for (unsigned int i = 0; i < iterations; ++i)
        counter *= localSeed;
    atomic_store_explicit(&result, as_type<unsigned int>(static_cast<float>(counter)), memory_order_relaxed);
}

When running on an iPhone 11 Pro, here are the results:

[Screenshot: FP32 vs FP16 benchmark results, 2020-03-30]

As you can see, FP16 is a demonstrable 24.9% progression. Theoretically, it could be a 50% progression on this device. This, coupled with the significant decrease in memory footprint, indicates the feature is important to include in WGSL.

@litherum litherum added the wgsl WebGPU Shading Language Issues label Mar 30, 2020
@kainino0x
Contributor

cc #230

@kainino0x
Contributor

Definite +1 from us for fp16/fp64 extensions; we've been expecting them for a long time. Possibly separate extensions for 16-bit (and 8-bit) reads/writes from memory.

@Kangz
Contributor

Kangz commented Mar 31, 2020

+CC @qjia7, who was investigating the very same thing from the TF.js side. Unfortunately, I don't think Vulkan requires support for FP16, so it would have to be an extension. Also, one thing to be careful about: being able to load FP16 data from buffers and having FP16 ALUs are two separate capabilities, so we have to figure out how we want to expose them.

@dj2
Member

dj2 commented Mar 31, 2020

So, if fp16 has to be an extension in WebGPU, should it be enabled through an extension in WGSL so it can't be used accidentally?

@kainino0x
Contributor

Two options IMO:

  • WebGL style: using shaders with the extension requires the API extension to also be enabled
  • API extension automatically enables a shader extension

@kdashg
Contributor

kdashg commented Mar 31, 2020

A third option is GLSL-style minimum precision guarantees. This is particularly valuable for doing basic math on lowp float (~9-bit float) data for colors, while keeping colors packed into 32 bits.

@kainino0x
Contributor

For shader variables that makes sense to me, though technically it does generate a portability issue since it's not possible to know you're testing with the lowest possible precision. WebGL users run into this occasionally, but not often, since hardware is in practice not that inconsistent.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

I agree FP16 is a highly desirable feature.

Sadly, support is not universal among Vulkan devices (but growing). So I agree this would have to be an extension.

Let's keep it simple and have a single extension to enable the feature. Vulkan split it into a storage (load/store) feature for certain storage classes, and a distinct arithmetic feature. The apparent motivation was that some devices supported one and not the other, and vice versa. Let's avoid that. Let's make FP16 one feature in WebGPU.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

A third option is GLSL-style minimum precision guarantees.

it does generate a portability issue since it's not possible to know you're testing with the lowest possible precision.

SPIR-V opted to model GLSL lowp and mediump as:

  • 32-bit float type (in particular for load/store, so no memory bandwidth advantage)
  • but variables and arithmetic operations can be decorated with RelaxedPrecision, which permits intermediate results to be computed as if with fp16.

Some implementations do take advantage of this RelaxedPrecision feature to attain better performance and energy usage.

If I were king, I'd rather the world adopt FP16 instead. RelaxedPrecision feels like a half-step we should avoid with WebGPU.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

I forgot to mention where to get more info:

The Vulkan 16bit float arithmetic feature bit is "shaderFloat16" from "VK_KHR_shader_float16_int8"

The Vulkan 16bit storage features are from "VK_KHR_16bit_storage"

@kdashg
Contributor

kdashg commented Mar 31, 2020

It would be valuable to let devs tag things as lower precision than full float32, whether that's float16 or something vaguer (but perhaps more flexible).

Does Qualcomm Android really not have a non-float32 arithmetic path for spir-v? That seems surprising to me. I had thought I'd seen cases where moving the same GLSL from desktop to mobile caused lack-of-precision artifacts! Maybe this was related to input/output load/stores, not arithmetic?

@litherum
Contributor Author

litherum commented Apr 2, 2020

If you unroll the loops, you can get much closer to the theoretical maximum:

[Screenshot: unrolled-loop benchmark results, 2020-04-01]

This represents a 44% progression on that same iPhone 11 Pro.

@xhcao

xhcao commented Apr 2, 2020

Recently I rewrote the dawn/examples/ComputeBoids example with FP16 arithmetic and FP16 data load/store, and could also get nearly 50% performance improvement on the Vulkan backend.
Minimum-precision data types like GLSL lowp and mediump are a good idea: developers don't need to author multiple shaders, and more devices are supported. But sometimes developers want the true FP16 feature; they care about performance more than precision. Is there a method for WebGPU that could take account of both, the way HLSL min16float maps to float16_t in -enable-16bit-types mode? https://github.com/microsoft/DirectXShaderCompiler/wiki/16-Bit-Scalar-Types

@litherum litherum added this to Discussion in WGSL Apr 4, 2020
@Jasper-Bekkers

Like @dneto0 points out, as a developer f16 support would be amazing to have. However, support is not as universal as we'd like, so we typically only ship f16-based code on fixed platforms (console/phone) where we can guarantee its availability.

@dneto0
Contributor

dneto0 commented Apr 7, 2020

There are a couple of things going on here.
Fundamentally, you can be limited by memory bandwidth, or by ALU.

If you're only limited by memory bandwidth, then load/store of fp16 values is what you need.
For a long time GLSL (and hence SPIR-V) has had a core feature where you can unpack a 32-bit uint into two 32-bit floats, by interpreting the uint as two 16-bit floats and doing the conversion.
See "vec2 unpackHalf2x16(uint v)" in the GLSL spec, or UnpackHalf2x16 in the SPIR-V extended instruction set GLSL.std.450: https://www.khronos.org/registry/spir-v/specs/unified1/GLSL.std.450.html
So the pattern is: load as uints, possibly vectors of them, then do the unpack conversions.
(There are only scalar forms of the unpacks; maybe nobody ever needed the vector forms.)
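For reference, the unpack half of that pattern can be sketched in host-side C++. This is only an emulation of GLSL's unpackHalf2x16 semantics; the function names here (halfBitsToFloat, unpackHalf2x16) are illustrative, not from any API:

```cpp
#include <cstdint>
#include <cmath>

// Decode one IEEE-754 binary16 value from its raw bits.
float halfBitsToFloat(uint16_t h) {
    uint32_t sign = (h >> 15) & 1u;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    float value;
    if (exp == 0) {
        // Zero or subnormal: frac * 2^-24.
        value = std::ldexp((float)frac, -24);
    } else if (exp == 0x1F) {
        // Infinity or NaN.
        value = frac ? NAN : INFINITY;
    } else {
        // Normal: (1024 + frac) * 2^(exp - 25) == 1.frac * 2^(exp - 15).
        value = std::ldexp((float)(0x400u | frac), (int)exp - 25);
    }
    return sign ? -value : value;
}

// Emulates GLSL unpackHalf2x16: low 16 bits -> out[0], high 16 bits -> out[1].
void unpackHalf2x16(uint32_t packed, float out[2]) {
    out[0] = halfBitsToFloat((uint16_t)(packed & 0xFFFFu));
    out[1] = halfBitsToFloat((uint16_t)(packed >> 16));
}
```

On the GPU the driver lowers the built-in to hardware conversion instructions, but the numeric result is the same as this bit-level decode.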

So that might explain why there wasn't enough pressure early enough to force this issue in the Android space.

But I certainly believe there are also gains when you only do the arithmetic in 16bits.

@dneto0
Contributor

dneto0 commented Apr 7, 2020

I'm just explaining backstory here. I'm not advocating any particular path.

Regarding mediump and lowp:

As I understand it, mediump and lowp caused a lot of grief due to variability between devices.

In the move to Vulkan, they were remapped to SPIR-V RelaxedPrecision.
(https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_vulkan_glsl.txt#L642 )
SPIR-V RelaxedPrecision says the storage footprint is still 32bits but you can do the arithmetic in any precision between 16 and 32bits. https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_relaxedprecisionsection_a_relaxed_precision

Yes, this still allows some painful variability. (I've helped customers through this.) A slight mitigation is that you can clamp the precision of a result to 16 bits with the funky OpQuantizeToF16: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#OpQuantizeToF16
So you can make your powerful desktop GPU emulate the weakest-spec device by (having a compilation flow) littering those quantizations all over the place.
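As an illustration of what that quantization does, here is a numeric C++ sketch of OpQuantizeToF16's effect (not the SPIR-V instruction itself; as a simplification it flushes fp16 subnormals to zero instead of quantizing them exactly):

```cpp
#include <cmath>

// Emulate the effect of SPIR-V OpQuantizeToF16 numerically: clamp to
// the fp16 range and round the significand to 11 bits.
float quantizeToF16(float x) {
    if (!std::isfinite(x)) return x;                         // pass through inf/NaN
    float a = std::fabs(x);
    if (a < 6.103515625e-05f) return std::copysign(0.0f, x); // below 2^-14: flush (sketch)
    if (a > 65504.0f) return std::copysign(INFINITY, x);     // above fp16 max
    int e;
    float m = std::frexp(x, &e);                 // x = m * 2^e, 0.5 <= |m| < 1
    m = std::nearbyint(m * 2048.0f) / 2048.0f;   // keep 11 significant bits
    return std::ldexp(m, e);
}
```

Running every intermediate result through something like this is exactly the "litter quantizations everywhere" flow: it forces a full-precision device to produce fp16-shaped answers.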

@petamoriken

Currently, a TypedArray for FP16 is at Stage 1 of the TC39 proposal process, but without active progress.
https://es.discourse.group/t/float16-on-typedarrays-dataview-math-hfround/303

I think it's an important proposal for this issue, so could we help move it forward?

@kdashg
Contributor

kdashg commented Apr 13, 2020

Thanks for the heads-up, though JS FP16 support is not required (and not directly important) for WebGPU. (WebGL already has FP16 resources.) Indeed, most native work is done in languages (all of them?) without first-class FP16 support. Manual F16<->double conversion is pretty similar in C++ and in JS, which is proof-of-MVP to me.
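As a sketch of what that manual conversion looks like, here is a float-to-binary16 conversion with round-to-nearest-even in C++ (floatToHalfBits is an illustrative name, not a standard API; NaN payloads are collapsed to a single quiet NaN):

```cpp
#include <cstdint>
#include <cstring>

// Convert a float to IEEE-754 binary16 bits, rounding to nearest even.
uint16_t floatToHalfBits(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);              // bit-cast without UB
    uint32_t sign = (bits >> 16) & 0x8000u;
    uint32_t e32  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;
    if (e32 == 0xFF)                                   // inf or NaN
        return (uint16_t)(sign | 0x7C00u | (frac ? 0x200u : 0u));
    int32_t exp = (int32_t)e32 - 127 + 15;             // rebias the exponent
    if (exp >= 0x1F)                                   // overflow -> infinity
        return (uint16_t)(sign | 0x7C00u);
    if (exp <= 0) {                                    // fp16 subnormal or zero
        if (exp < -10) return (uint16_t)sign;          // too small: +/-0
        frac |= 0x800000u;                             // restore the implicit 1
        uint32_t shift = (uint32_t)(14 - exp);         // 14..24
        uint16_t sub = (uint16_t)(frac >> shift);
        uint32_t rem = frac & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (sub & 1u))) ++sub;
        return (uint16_t)(sign | sub);
    }
    uint16_t out = (uint16_t)(sign | ((uint32_t)exp << 10) | (frac >> 13));
    uint32_t rem = frac & 0x1FFFu;
    // Round to nearest even; a carry out of the fraction bumps the
    // exponent, which is still correct in this bit layout.
    if (rem > 0x1000u || (rem == 0x1000u && (out & 1u))) ++out;
    return out;
}
```

The JS version is essentially the same bit manipulation on a Uint32Array view, which is why the lack of first-class FP16 in the host language isn't a blocker.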

@xhcao

xhcao commented Apr 13, 2020

@litherum, you did not check whether Metal supports native FP16 in your demo. I read the Metal spec, and Metal does not state any requirements (for example a Metal version or a hardware query result) for native FP16 support. Do all Mac devices support native 16-bit float in Metal?
D3D12 supports native 16-bit float with shader model 6.2 or higher; you should also check the D3D12_FEATURE_D3D12_OPTIONS4 feature to see whether the hardware supports native FP16.

@grorg
Contributor

grorg commented Apr 14, 2020

Discussed at the 2020-04-14 meeting.

@grorg
Contributor

grorg commented Apr 14, 2020

Resolution was: FP16 supported as an optional extension covering computation and storage, but with some follow-up issues to be raised (e.g. @dneto0 on quantize).

Interpolation is not included.

@grorg grorg added wgsl resolved Resolved - waiting for a change to the WGSL specification investigation and removed for wgsl meeting investigation labels Apr 14, 2020
@dneto0
Contributor

dneto0 commented Apr 15, 2020

I said in the meeting that:

  • I really like this feature, as a single feature.
  • But this does not meet the "minimum" aspect of a "minimum viable product".

Spec, testing, and tooling work will compete for staff time with the actually-minimum-viable work.

@litherum
Contributor Author

litherum commented Apr 18, 2020

We believe this is, actually, part of the minimum set. In our team's experience, many shaders just straight-up won't run at reasonable speeds without FP16 support. Its absence makes many apps unusable.

@kainino0x
Contributor

Do you have a proposal to achieve that? Relaxed precision?

@litherum
Contributor Author

litherum commented Apr 18, 2020

According to the conversation during this week's call, the group seemed to agree that an extension was the right direction.

(I'm not saying FP16 should be part of core. I'm saying FP16 should be usable on iPhones in the first software release of WebGPU)

@kainino0x
Contributor

Ok, sorry, I thought "minimum set" meant core and you were disagreeing with the result from the meeting.

Agreed with having it at release, but I think that's a vendor decision. (I hope we will be able to do it as well.)

@jzm-intel
Contributor

Closing this issue, as the F16 extension against spec PR #2696 has been merged.

WGSL automation moved this from Resolved: Needs Specification Work to Done Apr 21, 2022
ben-clayton pushed a commit to ben-clayton/gpuweb that referenced this issue Sep 6, 2022
…pect* (gpuweb#658)

- Remove cases that do not specify an aspect for sampling a multiplanar
  format
- Choose 'uint' as the GPUTextureSampleType when sampling stencil
- Pass the depthStencilFormat to the render bundle encoder to
  match the GPURenderPassDescriptor

Bug: crbug.com/dawn/993