[RFC] Add a new type to StableHLO: E4M3B11FNUZ #1308

Merged Mar 24, 2023

File: rfcs/20230309-e4m3b11.md (new file, 50 additions)
# RFC: E4M3B11FNUZ in XLA

## Summary

Google has hardware that features a floating-point format similar to the one
recommended by Sun et al.[^1]:

- 1 sign bit
- 4 exponent bits
- 3 significand bits
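
For illustration (this helper is mine, not part of the RFC), the 1/4/3 bit
split can be extracted from a raw byte like so:

```python
def split_fields(byte: int) -> tuple[int, int, int]:
    """Split a raw 8-bit E4M3-family encoding into its bit fields."""
    sign = (byte >> 7) & 0x1       # 1 sign bit (most significant bit)
    exponent = (byte >> 3) & 0xF   # 4 exponent bits
    significand = byte & 0x7       # 3 significand bits
    return sign, exponent, significand
```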

At first glance, this format seems similar to the E4M3FN[^2] format already
present in XLA and the E4M3FNUZ[^3] format in LLVM, but there are some
important differences. Let's take a look at the details.

## Details

Let's compare some values representable in E4M3B11FNUZ with E4M3FN and E4M3FNUZ:

|Property |E4M3FN |E4M3FNUZ |E4M3B11FNUZ |
|-------------------|--------------------------------------|---------------------------------------|--------------------------------------|
|Bias |7 |8 |11 |
|Min Normal Value |±0001.000 = 1.0 * 2<sup>-6</sup> |±0001.000 = 1.0 * 2<sup>-7</sup> |±0001.000 = 1.0 * 2<sup>-10</sup> |
|Max Normal Value |±1111.110 = 1.75 * 2<sup>8</sup> = 448|±1111.111 = 1.875 * 2<sup>7</sup> = 240|±1111.111 = 1.875 * 2<sup>4</sup> = 30|
|Min Subnormal Value|±0000.001 = 1.0 * 2<sup>-9</sup> |±0000.001 = 1.0 * 2<sup>-10</sup> |±0000.001 = 1.0 * 2<sup>-13</sup> |
|Max Subnormal Value|±0000.111 = 0.875 * 2<sup>-6</sup> |±0000.111 = 0.875 * 2<sup>-7</sup> |±0000.111 = 0.875 * 2<sup>-10</sup> |
|Infinity |N/A |N/A |N/A |
|NaN |±1111.111 |-0000.000 |-0000.000 |
|-0 |-0000.000 |N/A |N/A |

These differences are caused by:

- A difference in exponent bias, which changes the range of representable
numbers.
- E4M3FN reserving the all-ones exponent and trailing significand field to
represent NaN; the other formats instead use the encoding that would otherwise
be negative zero to represent NaN.
- E4M3FN being able to represent negative zero in the usual way; the other
formats cannot represent negative zero at all.
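
To make these rules concrete, here is a minimal Python decoder sketch (names
and structure are my own, not part of the RFC) that maps a raw byte to a
Python float under each format's conventions and reproduces the
"Max Normal Value" row of the table:

```python
def decode_fp8(byte: int, bias: int, fn_style: bool) -> float:
    """Decode an 8-bit E4M3-family float.

    fn_style=True follows the E4M3FN conventions (all-ones exponent and
    significand is NaN, -0 exists); fn_style=False follows the FNUZ
    conventions (0x80, the would-be -0 encoding, is NaN; no -0).
    """
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 3) & 0xF
    significand = byte & 0x7

    if fn_style:
        if exponent == 0xF and significand == 0x7:
            return float("nan")  # E4M3FN: S.1111.111 is NaN
    elif byte == 0x80:
        return float("nan")      # FNUZ: negative zero's encoding is NaN

    if exponent == 0:
        # Subnormal: no implicit leading 1; exponent is fixed at 1 - bias.
        return sign * (significand / 8.0) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + significand / 8.0) * 2.0 ** (exponent - bias)

# Reproduce the "Max Normal Value" row of the table:
print(decode_fp8(0b0_1111_110, bias=7, fn_style=True))    # 448.0 (E4M3FN)
print(decode_fp8(0b0_1111_111, bias=8, fn_style=False))   # 240.0 (E4M3FNUZ)
print(decode_fp8(0b0_1111_111, bias=11, fn_style=False))  # 30.0  (E4M3B11FNUZ)
```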

## Changes in XLA

Adding this type will mostly follow the same lines as the
[FP8 RFC](https://github.com/openxla/xla/discussions/22): a new type is added
alongside the formats already supported, and scaling is represented in the
same way as for the other FP8 formats. Additionally, this type would also be
added to LLVM's APFloat class and to MLIR.
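
As a usage sketch, assuming a version of the ml_dtypes package that ships this
format (ml_dtypes is not mentioned in the RFC itself), the extremes from the
table above can be checked like this:

```python
import numpy as np
import ml_dtypes  # assumption: a release that includes float8_e4m3b11fnuz

# Round-trip values from the table above; all are exactly representable.
x = np.array([1.0, 30.0, 2.0 ** -10], dtype=np.float32)
y = x.astype(ml_dtypes.float8_e4m3b11fnuz)
print(y.astype(np.float32))  # [1.0, 30.0, 0.0009765625]
print(ml_dtypes.finfo(ml_dtypes.float8_e4m3b11fnuz).max)  # 30.0
```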

[^1]: [Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks by Sun et al.](https://dl.acm.org/doi/10.5555/3454287.3454728)
[^2]: [FP8 Formats for Deep Learning by Micikevicius et al.](https://arxiv.org/abs/2209.05433)
[^3]: [8-bit Numerical Formats for Deep Neural Networks by Noune et al.](https://arxiv.org/abs/2206.02915)