[RFC] Add a new type to StableHLO: E4M3B11FNUZ (#1308)
This is a proposal to add a new floating-point type to StableHLO; see rfcs/20230309-e4m3b11.md for more details.
# RFC: E4M3B11FNUZ in XLA

## Summary

Google has hardware which features a floating-point format similar to the one
recommended in Sun et al.[^1]:

- 1 sign bit
- 4 exponent bits
- 3 significand bits

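
To make the bit layout concrete, here is a minimal decoding sketch in Python.
It is illustrative code written for this RFC (the helper name
`decode_e4m3b11fnuz` is not an existing XLA or LLVM API), assuming the FNUZ
convention that the single NaN occupies the would-be negative-zero encoding:

```python
def decode_e4m3b11fnuz(byte: int) -> float:
    """Decode one E4M3B11FNUZ byte: 1 sign, 4 exponent, 3 mantissa bits, bias 11."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if byte == 0x80:
        # FNUZ: the sole NaN is the encoding negative zero would otherwise use.
        return float("nan")
    if exponent == 0:
        # Subnormal: implicit leading 0; the exponent is pinned to 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 11)
    # Normal: implicit leading 1.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 11)
```

For example, `decode_e4m3b11fnuz(0x7F)` yields 1.875 * 2<sup>4</sup> = 30, the
format's maximum normal value.
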
At first glance, this format seems similar to the E4M3FN[^2] format already
present in XLA and the E4M3FNUZ[^3] format in LLVM, but there are some
important differences. Let's take a look at the details.

## Details

Let's compare some values representable in E4M3B11FNUZ with E4M3FN and
E4M3FNUZ:

|                   |E4M3FN                                |E4M3FNUZ                               |E4M3B11FNUZ                           |
|-------------------|--------------------------------------|---------------------------------------|--------------------------------------|
|Bias               |7                                     |8                                      |11                                    |
|Min Normal Value   |±0001.000 = 1.0 * 2<sup>-6</sup>      |±0001.000 = 1.0 * 2<sup>-7</sup>       |±0001.000 = 1.0 * 2<sup>-10</sup>     |
|Max Normal Value   |±1111.110 = 1.75 * 2<sup>8</sup> = 448|±1111.111 = 1.875 * 2<sup>7</sup> = 240|±1111.111 = 1.875 * 2<sup>4</sup> = 30|
|Min Subnormal Value|±0000.001 = 1.0 * 2<sup>-9</sup>      |±0000.001 = 1.0 * 2<sup>-10</sup>      |±0000.001 = 1.0 * 2<sup>-13</sup>     |
|Max Subnormal Value|±0000.111 = 0.875 * 2<sup>-6</sup>    |±0000.111 = 0.875 * 2<sup>-7</sup>     |±0000.111 = 0.875 * 2<sup>-10</sup>   |
|Infinity           |N/A                                   |N/A                                    |N/A                                   |
|NaN                |±1111.111                             |-0000.000                              |-0000.000                             |
|-0                 |-0000.000                             |N/A                                    |N/A                                   |

These differences are caused by:

- A difference in exponent bias changes the range of representable numbers.
- E4M3FN reserves the all-ones exponent and trailing significand field to
  represent NaN; the other formats repurpose the encoding that negative zero
  would otherwise use.
- E4M3FN can represent negative zero in the usual way; the other formats
  cannot represent negative zero. (The sketch after this list checks these
  properties exhaustively.)

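
As a spot check on the table and the points above, enumerating all 256
encodings with the hypothetical decoder from the Summary confirms the FNUZ
properties (again, illustrative code for this RFC, not XLA's implementation):

```python
import math

values = [decode_e4m3b11fnuz(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]

# Exactly one NaN (at 0x80) and no infinities.
assert sum(math.isnan(v) for v in values) == 1
assert all(math.isfinite(v) for v in finite)
# All 255 non-NaN encodings decode to distinct values, so there is no -0
# (a redundant -0.0 would collapse with 0.0 in the set).
assert len(set(finite)) == 255
# Extremes match the table: max normal ±30, min subnormal 2**-13.
assert max(finite) == 30.0 and min(finite) == -30.0
assert min(v for v in finite if v > 0) == 2.0 ** -13
```
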
## Changes in XLA

Adding this type will mostly follow the same lines as the
[FP8 RFC](https://github.com/openxla/xla/discussions/22): a new type is added
to the set of formats already supported, and scaling is represented in the
same way as for the other FP8 formats. Additionally, this type would also be
added to LLVM's APFloat class and MLIR.

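
For illustration, once the type is plumbed through, user-level code could
reach it from NumPy via a package such as `ml_dtypes`. This is a hedged
sketch: the dtype spelling `float8_e4m3b11fnuz` is an assumption about how
such a package would name the format, not a committed API.

```python
import numpy as np
import ml_dtypes  # assumed to expose the new dtype once this RFC lands

# Conversions round to the nearest representable E4M3B11FNUZ value:
# 0.3 becomes 0.3125; 30.0 and 2**-13 are exactly representable.
x = np.array([0.3, 30.0, 2.0 ** -13], dtype=ml_dtypes.float8_e4m3b11fnuz)
print(x.astype(np.float32))
```
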
[^1]: [Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks by Sun et al.](https://dl.acm.org/doi/10.5555/3454287.3454728)
[^2]: [FP8 Formats for Deep Learning by Micikevicius et al.](https://arxiv.org/abs/2209.05433)
[^3]: [8-bit Numerical Formats for Deep Neural Networks by Noune et al.](https://arxiv.org/abs/2206.02915)