[RFC] Add a new type to StableHLO: E4M3B11FNUZ (#1308)
This is a proposal to add a new floating-point type to StableHLO; see rfcs/20230309-e4m3b11.md for more details.
# RFC: E4M3B11FNUZ in XLA

## Summary

Google has hardware which features a floating-point format similar to the one
recommended in Sun et al.[^1]:

- 1 sign bit
- 4 exponent bits
- 3 significand bits

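
To make the bit layout concrete, here is a minimal decoding sketch in Python.
It is illustrative code written for this RFC (the helper name
`decode_e4m3b11fnuz` is not an existing XLA or LLVM API), assuming the FNUZ
convention that the single NaN occupies the would-be negative-zero encoding:

```python
def decode_e4m3b11fnuz(byte: int) -> float:
    """Decode one E4M3B11FNUZ byte: 1 sign, 4 exponent, 3 mantissa bits, bias 11."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if byte == 0x80:
        # FNUZ: the sole NaN is the encoding negative zero would otherwise use.
        return float("nan")
    if exponent == 0:
        # Subnormal: implicit leading 0; the exponent is pinned to 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 11)
    # Normal: implicit leading 1.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 11)
```

For example, `decode_e4m3b11fnuz(0x7F)` yields 1.875 * 2<sup>4</sup> = 30, the
format's maximum normal value.
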
At first glance, this format seems similar to the E4M3FN[^2] format already
present in XLA and the E4M3FNUZ[^3] format in LLVM, but there are some
important differences. Let's take a look at the details.

## Details

Let's compare some values representable in E4M3B11FNUZ with E4M3FN and
E4M3FNUZ:

|                   |E4M3FN                                |E4M3FNUZ                               |E4M3B11FNUZ                           |
|-------------------|--------------------------------------|---------------------------------------|--------------------------------------|
|Bias               |7                                     |8                                      |11                                    |
|Min Normal Value   |±0001.000 = 1.0 * 2<sup>-6</sup>      |±0001.000 = 1.0 * 2<sup>-7</sup>       |±0001.000 = 1.0 * 2<sup>-10</sup>     |
|Max Normal Value   |±1111.110 = 1.75 * 2<sup>8</sup> = 448|±1111.111 = 1.875 * 2<sup>7</sup> = 240|±1111.111 = 1.875 * 2<sup>4</sup> = 30|
|Min Subnormal Value|±0000.001 = 1.0 * 2<sup>-9</sup>      |±0000.001 = 1.0 * 2<sup>-10</sup>      |±0000.001 = 1.0 * 2<sup>-13</sup>     |
|Max Subnormal Value|±0000.111 = 0.875 * 2<sup>-6</sup>    |±0000.111 = 0.875 * 2<sup>-7</sup>     |±0000.111 = 0.875 * 2<sup>-10</sup>   |
|Infinity           |N/A                                   |N/A                                    |N/A                                   |
|NaN                |±1111.111                             |-0000.000                              |-0000.000                             |
|-0                 |-0000.000                             |N/A                                    |N/A                                   |

These differences are caused by:

- A difference in exponent bias changes the range of representable numbers.
- E4M3FN reserves the all-ones exponent and trailing significand field to
  represent NaN; the other formats repurpose the encoding that negative zero
  would otherwise use.
- E4M3FN can represent negative zero in the usual way; the other formats
  cannot represent negative zero. (The sketch after this list checks these
  properties exhaustively.)

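
As a spot check on the table and the points above, enumerating all 256
encodings with the hypothetical decoder from the Summary confirms the FNUZ
properties (again, illustrative code for this RFC, not XLA's implementation):

```python
import math

values = [decode_e4m3b11fnuz(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]

# Exactly one NaN (at 0x80) and no infinities.
assert sum(math.isnan(v) for v in values) == 1
assert all(math.isfinite(v) for v in finite)
# All 255 non-NaN encodings decode to distinct values, so there is no -0
# (a redundant -0.0 would collapse with 0.0 in the set).
assert len(set(finite)) == 255
# Extremes match the table: max normal ±30, min subnormal 2**-13.
assert max(finite) == 30.0 and min(finite) == -30.0
assert min(v for v in finite if v > 0) == 2.0 ** -13
```
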
## Changes in XLA

Adding this type will mostly follow the same lines as the
[FP8 RFC](https://github.com/openxla/xla/discussions/22): a new type is added
to the set of formats already supported, and scaling is represented in the
same way as for the other FP8 formats. Additionally, this type would also be
added to LLVM's APFloat class and MLIR.

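
For illustration, once the type is plumbed through, user-level code could
reach it from NumPy via a package such as `ml_dtypes`. This is a hedged
sketch: the dtype spelling `float8_e4m3b11fnuz` is an assumption about how
such a package would name the format, not a committed API.

```python
import numpy as np
import ml_dtypes  # assumed to expose the new dtype once this RFC lands

# Conversions round to the nearest representable E4M3B11FNUZ value:
# 0.3 becomes 0.3125; 30.0 and 2**-13 are exactly representable.
x = np.array([0.3, 30.0, 2.0 ** -13], dtype=ml_dtypes.float8_e4m3b11fnuz)
print(x.astype(np.float32))
```
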
[^1]: [Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks by Sun et al.](https://dl.acm.org/doi/10.5555/3454287.3454728)
[^2]: [FP8 Formats for Deep Learning by Micikevicius et al.](https://arxiv.org/abs/2209.05433)
[^3]: [8-bit Numerical Formats for Deep Neural Networks by Noune et al.](https://arxiv.org/abs/2206.02915)