
FP16 support #658

Closed
litherum opened this issue Mar 30, 2020 · 27 comments
Labels
investigation · wgsl resolved (Resolved - waiting for a change to the WGSL specification) · wgsl (WebGPU Shading Language Issues)

@litherum
Contributor

litherum commented Mar 30, 2020

FP16 provides significant benefits over FP32:

  • It uses half the memory (of course). This is particularly important on mobile devices, where available memory is limited. Specifically, one of the most common and strongest requests the Metal team gets from third parties is for features that help decrease memory use.
  • Because it uses half the size, memory bandwidth is more effectively used.
  • Even without the bandwidth increase, devices often have increased ALU performance for half-precision operations.
  • Power consumption is decreased on some devices, leading to better battery life.

I wanted to characterize the ALU performance, so I made a small Metal benchmark to execute on iOS. Here are the shaders:

constant unsigned int iterations [[function_constant(0)]];

// Multiply in a tight fp32 loop; storing the bit pattern through an
// atomic keeps the compiler from eliminating the loop.
kernel void aluFP32(constant float& seed, device atomic_uint& result) {
    float localSeed = seed;
    float counter = seed;
    for (unsigned int i = 0; i < iterations; ++i)
        counter *= localSeed;
    atomic_store_explicit(&result, as_type<unsigned int>(counter), memory_order_relaxed);
}

// The same loop in fp16; the half result is widened back to fp32
// before being bit-cast for the store.
kernel void aluFP16(constant float& seed, device atomic_uint& result) {
    half localSeed = seed;
    half counter = seed;
    for (unsigned int i = 0; i < iterations; ++i)
        counter *= localSeed;
    atomic_store_explicit(&result, as_type<unsigned int>(static_cast<float>(counter)), memory_order_relaxed);
}

When running on an iPhone 11 Pro, here are the results:

[Screenshot: FP32 vs FP16 benchmark results, 2020-03-30]

As you can see, FP16 is a demonstrable 24.9% progression. Theoretically, it could be a 50% progression on this device. This, coupled with the significant decrease in memory footprint, indicates the feature is important to include in WGSL.

@litherum litherum added the wgsl WebGPU Shading Language Issues label Mar 30, 2020
@kainino0x
Contributor

cc #230

@kainino0x
Contributor

Definite +1 from us for fp16/fp64 extensions; we've been expecting them for a long time. Possibly separate extensions for 16-bit (and 8-bit) reads/writes from memory.

@Kangz
Contributor

Kangz commented Mar 31, 2020

+CC @qjia7, who was investigating the very same thing from the TF.js side. Unfortunately, I don't think Vulkan requires support for FP16, so it would have to be an extension. Also, one thing to be careful about: being able to load FP16 data from buffers and having FP16 ALUs are two separate capabilities, so we have to figure out how we want to expose them.

@dj2
Member

dj2 commented Mar 31, 2020

So, if fp16 has to be an extension in WebGPU, should it be enabled through an extension in WGSL so it can't be used accidentally?

@kainino0x
Contributor

Two options IMO:

  • WebGL style: using shaders with the extension requires the API extension to also be enabled
  • API extension automatically enables a shader extension

@kdashg
Contributor

kdashg commented Mar 31, 2020

A third option is GLSL-style minimum precision guarantees. This is particularly valuable for doing basic math on lowp float (~9-bit float) data for colors, while keeping colors packed into 32 bits.

@kainino0x
Contributor

For shader variables that makes sense to me, though technically it does generate a portability issue since it's not possible to know you're testing with the lowest possible precision. WebGL users run into this occasionally, but not often, since hardware is in practice not that inconsistent.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

I agree FP16 is a highly desirable feature.

Sadly, support is not universal among Vulkan devices (but growing). So I agree this would have to be an extension.

Let's keep it simple and have a single extension to enable the feature. Vulkan split it into a storage (load/store) feature for certain storage classes, and a distinct arithmetic feature. The apparent motivation was that some devices supported one and not the other, and vice versa. Let's avoid that. Let's make FP16 one feature in WebGPU.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

A third option is GLSL-style minimum precision guarantees.

it does generate a portability issue since it's not possible to know you're testing with the lowest possible precision.

SPIR-V opted to model GLSL lowp and mediump as:

  • 32-bit float type (in particular for load/store, so no memory bandwidth advantage)
  • but variables and arithmetic operations can be decorated with RelaxedPrecision, which permits intermediate results to be computed as if with fp16.

Some implementations do take advantage of this RelaxedPrecision feature to attain better performance and energy usage.

If I were king, I'd rather the world adopt FP16 instead. RelaxedPrecision feels like a half-step we should avoid with WebGPU.

@dneto0
Contributor

dneto0 commented Mar 31, 2020

I forgot to mention where to get more info:

The Vulkan 16bit float arithmetic feature bit is "shaderFloat16" from "VK_KHR_shader_float16_int8"

The Vulkan 16bit storage features are from "VK_KHR_16bit_storage"

@kdashg
Contributor

kdashg commented Mar 31, 2020

It would be valuable to let devs tag things as lower precision than full float32, whether that's float16 or something vaguer (but perhaps more flexible).

Does Qualcomm Android really not have a non-float32 arithmetic path for spir-v? That seems surprising to me. I had thought I'd seen cases where moving the same GLSL from desktop to mobile caused lack-of-precision artifacts! Maybe this was related to input/output load/stores, not arithmetic?

@litherum
Contributor Author

litherum commented Apr 2, 2020

If you unroll the loops, you can get much closer to the theoretical maximum:

[Screenshot: unrolled-loop benchmark results, 2020-04-01]

This represents a 44% progression on that same iPhone 11 Pro.

@xhcao

xhcao commented Apr 2, 2020

Recently I rewrote the dawn/examples/ComputeBoids example with FP16 arithmetic and FP16 data load/store, and could also get nearly 50% performance improvement on the Vulkan backend.
Minimum-precision data types like GLSL lowp and mediump are a good idea: developers don't need to author multiple shaders, and more devices are supported. But sometimes developers want the true FP16 feature; they care about performance more than precision. Is there a method for WebGPU that could take account of both, the way HLSL min16float maps to float16_t in -enable-16bit-types mode? https://github.com/microsoft/DirectXShaderCompiler/wiki/16-Bit-Scalar-Types

@litherum litherum added this to Discussion in WGSL Apr 4, 2020
@Jasper-Bekkers

Like @dneto0 points out, as a developer f16 support would be amazing to have. However, support is not as universal as we'd like, so we typically only ship f16-based code on fixed platforms (console/phone) where we can guarantee its availability.

@dneto0
Contributor

dneto0 commented Apr 7, 2020

There are a couple of things going on here.
Fundamentally, you can be limited by memory bandwidth, or by ALU.

If you're only limited by memory bandwidth, then load/store of fp16 values is what you need.
For a long time GLSL (and hence SPIR-V) has had a core feature where you can unpack a 32-bit uint into two 32-bit floats, by interpreting the uint as two 16-bit floats and doing the conversion.
See "vec2 unpackHalf2x16(uint v)" in the GLSL spec, or UnpackHalf2x16 in the SPIR-V extended instruction set GLSL.std.450: https://www.khronos.org/registry/spir-v/specs/unified1/GLSL.std.450.html
So the pattern is: load as uints, possibly vectors of them, then do the unpack conversions.
(There are only scalar forms of the unpacks; maybe nobody ever needed the vector forms.)
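For reference, the unpack half of that pattern can be sketched in host-side C++. This is only an emulation of GLSL's unpackHalf2x16 semantics; the function names here (halfBitsToFloat, unpackHalf2x16) are illustrative, not from any API:

```cpp
#include <cstdint>
#include <cmath>

// Decode one IEEE-754 binary16 value from its raw bits.
float halfBitsToFloat(uint16_t h) {
    uint32_t sign = (h >> 15) & 1u;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    float value;
    if (exp == 0) {
        // Zero or subnormal: frac * 2^-24.
        value = std::ldexp((float)frac, -24);
    } else if (exp == 0x1F) {
        // Infinity or NaN.
        value = frac ? NAN : INFINITY;
    } else {
        // Normal: (1024 + frac) * 2^(exp - 25) == 1.frac * 2^(exp - 15).
        value = std::ldexp((float)(0x400u | frac), (int)exp - 25);
    }
    return sign ? -value : value;
}

// Emulates GLSL unpackHalf2x16: low 16 bits -> out[0], high 16 bits -> out[1].
void unpackHalf2x16(uint32_t packed, float out[2]) {
    out[0] = halfBitsToFloat((uint16_t)(packed & 0xFFFFu));
    out[1] = halfBitsToFloat((uint16_t)(packed >> 16));
}
```

On the GPU the driver lowers the built-in to hardware conversion instructions, but the numeric result is the same as this bit-level decode.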

So that might explain why there wasn't enough pressure early enough to force this issue in the Android space.

But I certainly believe there are also gains when you only do the arithmetic in 16bits.

@dneto0
Contributor

dneto0 commented Apr 7, 2020

I'm just explaining backstory here. I'm not advocating any particular path.

Regarding mediump and lowp:

As I understand it, mediump and lowp caused a lot of grief due to variability between devices.

In the move to Vulkan, they were remapped to SPIR-V RelaxedPrecision.
(https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_vulkan_glsl.txt#L642 )
SPIR-V RelaxedPrecision says the storage footprint is still 32bits but you can do the arithmetic in any precision between 16 and 32bits. https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_relaxedprecisionsection_a_relaxed_precision

Yes, this still allows some painful variability. (I've helped customers through this.) A slight mitigation is that you can clamp the precision of a result to 16 bits with the funky OpQuantizeToF16: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#OpQuantizeToF16
So you can make your powerful desktop GPU emulate the weakest-spec device by (having a compilation flow) littering those quantizations all over the place.
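As an illustration of what that quantization does, here is a numeric C++ sketch of OpQuantizeToF16's effect (not the SPIR-V instruction itself; as a simplification it flushes fp16 subnormals to zero instead of quantizing them exactly):

```cpp
#include <cmath>

// Emulate the effect of SPIR-V OpQuantizeToF16 numerically: clamp to
// the fp16 range and round the significand to 11 bits.
float quantizeToF16(float x) {
    if (!std::isfinite(x)) return x;                         // pass through inf/NaN
    float a = std::fabs(x);
    if (a < 6.103515625e-05f) return std::copysign(0.0f, x); // below 2^-14: flush (sketch)
    if (a > 65504.0f) return std::copysign(INFINITY, x);     // above fp16 max
    int e;
    float m = std::frexp(x, &e);                 // x = m * 2^e, 0.5 <= |m| < 1
    m = std::nearbyint(m * 2048.0f) / 2048.0f;   // keep 11 significant bits
    return std::ldexp(m, e);
}
```

Running every intermediate result through something like this is exactly the "litter quantizations everywhere" flow: it forces a full-precision device to produce fp16-shaped answers.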

@petamoriken

Currently, a TypedArray for FP16 is at Stage 1 of the TC39 proposal process, but without active progress.
https://es.discourse.group/t/float16-on-typedarrays-dataview-math-hfround/303

I think it's an important proposal for this issue, so could we help move it forward?

@kdashg
Contributor

kdashg commented Apr 13, 2020

Thanks for the heads-up, though JS FP16 support is not required (and not directly important) for WebGPU. (WebGL already has FP16 resources.) Indeed, most native work is done in languages (all of them?) without first-class FP16 support. Manual F16<->double conversion is pretty similar in C++ and in JS, which is proof-of-MVP to me.
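As a sketch of what that manual conversion looks like, here is a float-to-binary16 conversion with round-to-nearest-even in C++ (floatToHalfBits is an illustrative name, not a standard API; NaN payloads are collapsed to a single quiet NaN):

```cpp
#include <cstdint>
#include <cstring>

// Convert a float to IEEE-754 binary16 bits, rounding to nearest even.
uint16_t floatToHalfBits(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);              // bit-cast without UB
    uint32_t sign = (bits >> 16) & 0x8000u;
    uint32_t e32  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;
    if (e32 == 0xFF)                                   // inf or NaN
        return (uint16_t)(sign | 0x7C00u | (frac ? 0x200u : 0u));
    int32_t exp = (int32_t)e32 - 127 + 15;             // rebias the exponent
    if (exp >= 0x1F)                                   // overflow -> infinity
        return (uint16_t)(sign | 0x7C00u);
    if (exp <= 0) {                                    // fp16 subnormal or zero
        if (exp < -10) return (uint16_t)sign;          // too small: +/-0
        frac |= 0x800000u;                             // restore the implicit 1
        uint32_t shift = (uint32_t)(14 - exp);         // 14..24
        uint16_t sub = (uint16_t)(frac >> shift);
        uint32_t rem = frac & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (sub & 1u))) ++sub;
        return (uint16_t)(sign | sub);
    }
    uint16_t out = (uint16_t)(sign | ((uint32_t)exp << 10) | (frac >> 13));
    uint32_t rem = frac & 0x1FFFu;
    // Round to nearest even; a carry out of the fraction bumps the
    // exponent, which is still correct in this bit layout.
    if (rem > 0x1000u || (rem == 0x1000u && (out & 1u))) ++out;
    return out;
}
```

The JS version is essentially the same bit manipulation on a Uint32Array view, which is why the lack of first-class FP16 in the host language isn't a blocker.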

@xhcao

xhcao commented Apr 13, 2020

@litherum, you did not check whether Metal supports native FP16 in your demo. I read the Metal spec, and Metal does not state any requirements (for example a Metal version or a hardware query result) for native FP16 support. Do all Mac devices support native 16-bit float in Metal?
D3D12 supports native 16-bit float with shader model 6.2 or higher; you should also check the D3D12_FEATURE_D3D12_OPTIONS4 feature to see whether the hardware supports native FP16.

@grorg
Contributor

grorg commented Apr 14, 2020

Discussed at the 2020-04-14 meeting.

@grorg
Contributor

grorg commented Apr 14, 2020

Resolution was: FP16 supported as an optional extension covering computation and storage, but with some follow-up issues to be raised (e.g. @dneto0 on quantize).

Interpolation is not included.

@grorg grorg added wgsl resolved Resolved - waiting for a change to the WGSL specification investigation and removed for wgsl meeting investigation labels Apr 14, 2020
@dneto0
Contributor

dneto0 commented Apr 15, 2020

I said in the meeting that:

  • I really like this feature, as a single feature.
  • But this does not meet the "minimum" aspect of a "minimum viable product".

Spec, testing, and tooling work will compete for staff time with the actually-minimum-viable work.

@litherum
Contributor Author

litherum commented Apr 18, 2020

We believe this is, actually, part of the minimum set. In our team's experience, many shaders just straight-up won't run at reasonable speeds without FP16 support. Its absence makes many apps unusable.

@kainino0x
Contributor

Do you have a proposal to achieve that? Relaxed precision?

@litherum
Contributor Author

litherum commented Apr 18, 2020

According to the conversation during this week's call, the group seemed to agree that an extension was the right direction.

(I'm not saying FP16 should be part of core. I'm saying FP16 should be usable on iPhones in the first software release of WebGPU)

@kainino0x
Contributor

Ok, sorry, I thought "minimum set" meant core and you were disagreeing with the result from the meeting.

Agreed with having it at release, but I think that's a vendor decision. (I hope we will be able to do it as well.)

@jzm-intel
Contributor

Closing this issue, as the F16 extension against spec PR #2696 has been merged.

WGSL automation moved this from Resolved: Needs Specification Work to Done Apr 21, 2022
ben-clayton pushed a commit to ben-clayton/gpuweb that referenced this issue Sep 6, 2022
…pect* (gpuweb#658)

- Remove cases that do not specify an aspect for sampling a multiplanar
  format
- Choose 'uint' as the GPUTextureSampleType when sampling stencil
- Pass the depthStencilFormat to the render bundle encoder to
  match the GPURenderPassDescriptor

Bug: crbug.com/dawn/993