diff --git a/rfcs/20230309-e4m3b11.md b/rfcs/20230309-e4m3b11.md
new file mode 100644
index 0000000000..eab42b67ea
--- /dev/null
+++ b/rfcs/20230309-e4m3b11.md
@@ -0,0 +1,50 @@
# RFC: E4M3B11FNUZ in XLA

## Summary

Google has hardware that features a floating-point format similar to the one
recommended in Sun et al.[^1]:

- 1 sign bit
- 4 exponent bits
- 3 significand bits

At first glance, this format seems similar to the E4M3FN[^2] format already
present in XLA and the E4M3FNUZ[^3] format in LLVM, but there are some
important differences. Let's take a look at the details.

## Details

Let's compare some values representable in E4M3B11FNUZ with E4M3FN and
E4M3FNUZ:

|                   |E4M3FN                      |E4M3FNUZ                     |E4M3B11FNUZ                 |
|-------------------|----------------------------|-----------------------------|----------------------------|
|Bias               |7                           |8                            |11                          |
|Min Normal Value   |±0001.000 = 1.0 * 2^-6      |±0001.000 = 1.0 * 2^-7       |±0001.000 = 1.0 * 2^-10     |
|Max Normal Value   |±1111.110 = 1.75 * 2^8 = 448|±1111.111 = 1.875 * 2^7 = 240|±1111.111 = 1.875 * 2^4 = 30|
|Min Subnormal Value|±0000.001 = 1.0 * 2^-9      |±0000.001 = 1.0 * 2^-10      |±0000.001 = 1.0 * 2^-13     |
|Max Subnormal Value|±0000.111 = 0.875 * 2^-6    |±0000.111 = 0.875 * 2^-7     |±0000.111 = 0.875 * 2^-10   |
|Infinity           |N/A                         |N/A                          |N/A                         |
|NaN                |±1111.111                   |-0000.000                    |-0000.000                   |
|-0                 |-0000.000                   |N/A                          |N/A                         |

These differences are caused by:

- A difference in exponent bias, which changes the range of representable
numbers.
- E4M3FN reserves the all-ones exponent and trailing significand field to
represent NaN; the other formats reuse the bit pattern that would otherwise
encode negative zero as NaN.
- E4M3FN can represent negative zero in the usual way; the other formats
cannot represent negative zero at all.

## Changes in XLA

Adding this type will be mostly along the same lines as the
[FP8 RFC](https://github.com/openxla/xla/discussions/22): a new type added
alongside the formats already supported, with scaling represented in the same
way it is for the other FP8 formats. Additionally, this type would also be
added to LLVM's APFloat class and to MLIR.

[^1]: [Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks by Sun et al.](https://dl.acm.org/doi/10.5555/3454287.3454728)
[^2]: [FP8 Formats for Deep Learning by Micikevicius et al.](https://arxiv.org/abs/2209.05433)
[^3]: [8-bit Numerical Formats for Deep Neural Networks by Noune et al.](https://arxiv.org/abs/2206.02915)
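
To make the encoding rules in the table concrete, here is a minimal reference
decoder sketch in Python. It is illustrative only: the function name
`decode_e4m3b11fnuz` is made up for this example and is not part of any
existing XLA or LLVM API; it simply restates the bias-11, subnormal, and NaN
conventions described above.

```python
def decode_e4m3b11fnuz(byte: int) -> float:
    """Decode an 8-bit E4M3B11FNUZ pattern (1 sign, 4 exponent, 3 significand bits, bias 11)."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 3) & 0xF
    significand = byte & 0x7

    # The pattern that would be negative zero (0x80) is the single NaN
    # encoding; the format has no infinity and no -0.
    if byte == 0x80:
        return float("nan")

    if exponent == 0:
        # Subnormal: no implicit leading 1, effective exponent is 1 - 11 = -10.
        return sign * (significand / 8.0) * 2.0 ** -10

    # Normal: implicit leading 1, unbiased exponent is `exponent - 11`.
    return sign * (1.0 + significand / 8.0) * 2.0 ** (exponent - 11)


# A couple of values from the table above:
assert decode_e4m3b11fnuz(0x7F) == 30.0        # max normal: +1111.111
assert decode_e4m3b11fnuz(0x08) == 2.0 ** -10  # min normal: +0001.000
```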