# Custom floating point formats in machine learning 

It seems like every few weeks someone is announcing a new floating point format. This (hopefully evolving) blog post is to help me keep track of them.

First a quick recap of how floating point works.

## How floating point works

The standard layout for floating point is:
1. A sign bit
2. Some exponent bits
3. Some significand bits
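
To make the layout concrete, here is a minimal sketch that pulls the three fields out of a number stored as FP32 (1 sign bit, 8 exponent bits, 23 significand bits). The helper name `fp32_fields` is made up for this post; it only relies on Python's standard `struct` module.

```python
import struct

def fp32_fields(x: float) -> tuple:
    """Return (sign, exponent_bits, significand_bits) of x stored as FP32."""
    # Reinterpret the 4 bytes of the FP32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # top bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits (stored with a bias of 127)
    significand = bits & 0x7FFFFF     # low 23 bits (the leading 1 is implicit)
    return sign, exponent, significand

sign, exponent, significand = fp32_fields(-0.15625)
print(f"{sign:01b} {exponent:08b} {significand:023b}")
# 1 01111100 01000000000000000000000
```

Here -0.15625 is -1.25 × 2^-3, so the stored exponent is -3 + 127 = 124 and the significand bits encode the fractional part 0.25.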

Each floating point format defines a **base** and a **precision** such that: 

$$
n = \text{sign} \times \text{significand} \times \text{base}^{\text{exponent}}
$$

Here, precision is the number of significant digits in the significand. E.g. for precision 4, the significand has the form `d.ddd`.

If we are in base 10, with precision 3, the number 1 can be written:

$$
1 \times 1.00 \times 10^{0}
$$
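
Hardware formats use base 2 rather than base 10, but the decomposition is the same. As a rough sketch (the function name `decompose` is mine, and zero, infinities and NaN are deliberately not handled), Python's `math.frexp` can be used to recover the three parts of the formula:

```python
import math

def decompose(x: float):
    """Split x into (sign, significand, exponent) with 1 <= significand < 2."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only finite, non-zero values are handled in this sketch")
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))    # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, m * 2, e - 1    # rescale so the significand looks like 1.ddd...

sign, significand, exponent = decompose(-6.5)
print(sign, significand, exponent)   # -1 1.625 2
assert sign * significand * 2**exponent == -6.5
```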

If you want more detail, I highly recommend [this visual explanation](https://fabiensanglard.net/floating_point_visually_explained/).

The core thing to understand is that, for a fixed number of bits, we trade off **range** (number of exponent bits) against **precision** (number of significand bits).

If we assume most values in deep learning lie in the range 0-1, we might hope to get away with very few exponent bits and dedicate the rest to higher precision (we'll check this assumption later; note that values close to zero still need large negative exponents). There is a cost to this though: the physical size of a hardware multiplier scales with the _square_ of the significand width [1].
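
To see the trade-off in numbers, the sketch below computes the largest finite value, the smallest normal value and the gap between 1 and the next representable number from the two bit counts. It assumes IEEE 754-style conventions (a bias of 2^(e-1) - 1, the all-ones exponent reserved for infinity/NaN, an implicit leading 1); formats that break those conventions will give different numbers. `format_stats` is a hypothetical helper, and the bit counts are the ones listed for each format in the sections below.

```python
def format_stats(exponent_bits: int, significand_bits: int) -> dict:
    """Range and precision of a binary format, assuming IEEE 754-style rules."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exp = (2 ** exponent_bits - 2) - bias   # all-ones exponent is reserved
    min_exp = 1 - bias                          # all-zeros exponent is subnormal
    epsilon = 2.0 ** -significand_bits          # gap between 1.0 and the next value
    return {
        "max": (2 - epsilon) * 2.0 ** max_exp,
        "min_normal": 2.0 ** min_exp,
        "epsilon": epsilon,
    }

formats = {"FP16": (5, 10), "FP32": (8, 23), "BrainFloat16": (8, 7), "TensorFloat32": (8, 10)}
for name, (e, m) in formats.items():
    print(name, format_stats(e, m))
```

More exponent bits widen the gap between `min_normal` and `max`; more significand bits shrink `epsilon`.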

### FP16

![](figs/fp16.pdf)
- 1 sign bit
- 5 exponent bits
- 10 significand bits

Range: 5.96e-8 to 65504
Precision at X:
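
One way to get a feel for FP16 precision at a given value is to round-trip a number through the format. This is only a sketch: it uses the half-precision `e` format code of Python's `struct` module, and `to_fp16` is a name I made up.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(0.1))       # 0.0999755859375, the closest FP16 value to 0.1
print(to_fp16(65504.0))   # 65504.0, the largest finite FP16 value
print(to_fp16(2049.0))    # 2048.0, FP16 values are 2 apart between 2048 and 4096
```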

### FP32

- 1 sign bit
- 8 exponent bits
- 23 significand bits

Range: 1e-38 to 3e38
Precision at X:
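
For FP32, the spacing between neighbouring values at a given point can be measured by stepping the bit pattern by one. A minimal sketch, assuming `x` is positive, finite and below the largest FP32 value (`fp32_spacing` is a hypothetical helper):

```python
import struct

def fp32_spacing(x: float) -> float:
    """Gap between x (rounded to FP32) and the next larger FP32 value."""
    # For positive floats, adding 1 to the integer bit pattern gives the next value up.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (current,) = struct.unpack("<f", struct.pack("<I", bits))
    (following,) = struct.unpack("<f", struct.pack("<I", bits + 1))
    return following - current

for x in (1.0, 1024.0, 16777216.0):
    print(x, fp32_spacing(x))
```

The gap doubles every time the value crosses a power of two; at 16777216 (2^24) it is already 2.0, so odd integers above that can no longer be stored.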

### [BrainFloat16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus)

![](figs/BrainFloat16.pdf)
- 1 sign bit
- 8 exponent bits
- 7 significand bits

Range: 1e-38 to 3e38
Precision at X:

> The physical size of a hardware multiplier scales with the _square_ of the mantissa width.
>
> neural networks are far more sensitive to the size of the exponent than that of the mantissa
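
Because BrainFloat16 has the same sign and exponent layout as FP32, it can be emulated by keeping only the top 16 bits of an FP32 encoding. The sketch below truncates for simplicity; I'd expect a real conversion to round to nearest instead, so treat the function (`to_bf16_truncate`, my own name) as an illustration rather than a reference implementation.

```python
import struct

def to_bf16_truncate(x: float) -> float:
    """Emulate BrainFloat16 by zeroing the low 16 bits of the FP32 encoding of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep 1 sign + 8 exponent + 7 significand bits; drop the other 16 significand bits.
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

print(to_bf16_truncate(3.14159))   # 3.140625: only 2-3 decimal digits survive
print(to_bf16_truncate(1e38))      # still finite: the FP32 exponent range is kept
```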

### [Nvidia TensorFloat32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/)

![](figs/TensorFloat32.pdf)
- 1 sign bit
- 8 exponent bits
- 10 significand bits

Range: 1e-38 to 3e38
Precision at X:

That's only 19 bits (1 + 8 + 10)?!

> TF32 uses the same 10-bit mantissa as the half-precision (FP16) math, shown to have more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32 so it can support the same numeric range.
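
The "FP16 precision, FP32 range" claim can be checked with a small emulation that keeps the top 19 bits of an FP32 encoding. This is a sketch under my own assumptions: it truncates rather than rounds, and `to_tf32_truncate` is a made-up name, not an Nvidia API.

```python
import struct

def to_tf32_truncate(x: float) -> float:
    """Emulate TF32 by zeroing the low 13 significand bits of the FP32 encoding of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep 1 sign + 8 exponent + 10 significand bits.
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))
    return y

print(to_tf32_truncate(1 + 2**-10))   # 1.0009765625: the 10th significand bit is kept
print(to_tf32_truncate(1 + 2**-11))   # 1.0: the 11th bit is dropped, as in FP16
print(to_tf32_truncate(1e10))         # finite, whereas FP16 tops out at 65504
```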

### [DALL-E](https://arxiv.org/abs/2102.12092)

> "Adam moments are stored in 1-6-9 for the running mean (1-bit for the sign, 6-bits for the exponent, and 9-bits for the significand), and 0-6-10 for the variance"

### [Graphcore AIFloat](https://docs.graphcore.ai/projects/ai-float-white-paper/en/latest/ai-float.html#the-ipu-ai-floattm-format)
- confusing


### [Tesla CFP8](???)
Can't find any info


## Further reading
- [An Nvidia guide](http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf)
- [A visual explanation](https://fabiensanglard.net/floating_point_visually_explained/)
- [Demystifying floating point](https://blog.demofox.org/2017/11/21/floating-point-precision/)

## Bibliography

[1] https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus