[RFC] Add a new type to StableHLO: E4M3B11FNUZ (#1308)
This is a proposal to add a new floating point type to StableHLO, please
see rfcs/20230309-e4m3b11.md for more details.
majnemer committed Mar 24, 2023
1 parent ef7a111 commit 4664754
Showing 1 changed file with 50 additions and 0 deletions.
50 changes: 50 additions & 0 deletions rfcs/20230309-e4m3b11.md
@@ -0,0 +1,50 @@
# RFC: E4M3B11FNUZ in XLA

## Summary

Google has hardware which features a floating-point format similar to the one
recommended in Sun et al.[^1]:

- 1 sign bit
- 4 exponent bits
- 3 significand bits
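
As a minimal sketch of this layout, the snippet below splits a byte into the
three fields. It assumes the conventional ordering (sign in the most
significant bit, then the exponent, then the significand), which the RFC does
not state explicitly; `split_fields` is a hypothetical helper name used only
for illustration.

```python
def split_fields(byte: int) -> tuple[int, int, int]:
    """Split an 8-bit pattern into (sign, exponent, significand) fields.

    Assumes the layout S EEEE MMM, most significant bit first.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    sign = byte >> 7               # 1 sign bit
    exponent = (byte >> 3) & 0xF   # 4 exponent bits
    significand = byte & 0x7       # 3 significand bits
    return sign, exponent, significand


print(split_fields(0b1_0101_011))  # (1, 5, 3)
```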

At first glance, this format seems similar to the E4M3FN[^2] format already
present in XLA and the E4M3FNUZ[^3] format in LLVM, but there are some
important differences. Let's take a look at the details.

## Details

Let's compare some values representable in E4M3B11FNUZ with E4M3FN and E4M3FNUZ:

| |E4M3FN |E4M3FNUZ |E4M3B11FNUZ |
|-------------------|--------------------------------------|---------------------------------------|--------------------------------------|
|Bias |7 |8 |11 |
|Min Normal Value |±0001.000 = 1.0 * 2<sup>-6</sup> |±0001.000 = 1.0 * 2<sup>-7</sup> |±0001.000 = 1.0 * 2<sup>-10</sup> |
|Max Normal Value |±1111.110 = 1.75 * 2<sup>8</sup> = 448|±1111.111 = 1.875 * 2<sup>7</sup> = 240|±1111.111 = 1.875 * 2<sup>4</sup> = 30|
|Min Subnormal Value|±0000.001 = 1.0 * 2<sup>-9</sup> |±0000.001 = 1.0 * 2<sup>-10</sup> |±0000.001 = 1.0 * 2<sup>-13</sup> |
|Max Subnormal Value|±0000.111 = 0.875 * 2<sup>-6</sup> |±0000.111 = 0.875 * 2<sup>-7</sup> |±0000.111 = 0.875 * 2<sup>-10</sup> |
|Infinity |N/A |N/A |N/A |
|NaN |±1111.111 |-0000.000 |-0000.000 |
|-0 |-0000.000 |N/A |N/A |

These differences are caused by:

- The exponent bias differs, which shifts the range of representable numbers.
- E4M3FN reserves the bit pattern with an all-ones exponent and an all-ones
trailing significand field to represent NaN; the other formats instead use the
bit pattern that would otherwise encode negative zero.
- E4M3FN can represent negative zero in the usual way; the other formats
cannot represent negative zero.
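
The following Python sketch decodes a byte under each of the three formats
using only the bias and NaN rules listed above. It is an illustration, not the
XLA or LLVM implementation: the `FORMATS` table and the `decode` function are
hypothetical names, and the bit layout (sign, then exponent, then significand)
is assumed.

```python
import math

# Per-format parameters taken from the comparison table:
# (exponent bias, True if NaN is encoded as the "negative zero" pattern).
FORMATS = {
    "E4M3FN": (7, False),
    "E4M3FNUZ": (8, True),
    "E4M3B11FNUZ": (11, True),
}


def decode(byte: int, fmt: str) -> float:
    """Decode an 8-bit pattern (assumed layout S EEEE MMM) to a Python float."""
    bias, nan_is_negative_zero = FORMATS[fmt]
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> 3) & 0xF
    significand = byte & 0x7

    if nan_is_negative_zero:
        # FNUZ formats: the only NaN is sign=1 with every other bit zero.
        if byte == 0x80:
            return math.nan
    elif exponent == 0xF and significand == 0x7:
        # E4M3FN: NaN is the all-ones exponent and significand, either sign.
        return math.nan

    if exponent == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * (significand / 8) * 2.0 ** (1 - bias)
    return sign * (1 + significand / 8) * 2.0 ** (exponent - bias)


print(decode(0x7E, "E4M3FN"))       # 448.0
print(decode(0x7F, "E4M3FNUZ"))     # 240.0
print(decode(0x7F, "E4M3B11FNUZ"))  # 30.0
```

Note that the same byte `0x80` decodes to negative zero in E4M3FN but to NaN in
the two FNUZ formats, which is the third difference listed above.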

## Changes in XLA

Adding this type will mostly follow the same lines as the
[FP8 RFC](https://github.com/openxla/xla/discussions/22): a new type is added
alongside the formats already supported, and scaling will be represented in
the same way as for the other FP8 formats. Additionally, this type would also
be added to LLVM's APFloat class and MLIR.

[^1]: [Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks by Sun et al.](https://dl.acm.org/doi/10.5555/3454287.3454728)
[^2]: [FP8 Formats for Deep Learning by Micikevicius et al.](https://arxiv.org/abs/2209.05433)
[^3]: [8-bit Numerical Formats for Deep Neural Networks by Noune et al.](https://arxiv.org/abs/2206.02915)
