# <span style="color:green"> Objective </span>

- To understand the fundamentals of floating-point representation
- To understand the IEEE-754 Floating Point Standard
- CUDA GPU Floating-point speed, accuracy and precision
    - Cause of errors
    - Algorithm considerations
    - Deviations from IEEE-754
    - Accuracy of device runtime functions
    - `-use_fast_math` compiler option
    - Future performance considerations

<hr style="height:2px">

# <span style="color:green"> What is IEEE floating-point format? </span>


![alt tag](img/3.png)

<hr style="height:2px">

# <span style="color:green"> Normalized Representation </span>

![alt tag](img/4.png)
<hr style="height:2px">

# <span style="color:green"> Exponent Representation </span>

![alt tag](img/5.png)
<hr style="height:2px">

# <span style="color:green"> A simple, hypothetical 5-bit FP format </span>

![alt tag](img/6.png)
<hr style="height:2px">

# <span style="color:green"> Representable Numbers </span>

![alt tag](img/7.png)
<hr style="height:2px">

# <span style="color:green"> Representable Numbers of a 5-bit Hypothetical IEEE Format </span>

![alt tag](img/8.png)
<hr style="height:2px">

# <span style="color:green"> Flush to Zero </span>

- Treat all bit patterns with E=0 as 0.0
    - This takes away several representable numbers near zero and lumps them all into 0.0
    - For a representation with large M, a large number of representable numbers will be removed

![alt tag](img/9.png)

![alt tag](img/10.png)
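
The effect of flush to zero can be sketched in software. The helper below is hypothetical (it is not part of any CUDA or standard API) and assumes the host compiler keeps subnormal values, i.e. it is not built with a fast-math option that already flushes them.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: emulate flush-to-zero by replacing any value whose
// exponent field is 0 (a subnormal) with a zero of the same sign.
float flush_to_zero(float x) {
    if (std::fpclassify(x) == FP_SUBNORMAL) {
        return std::copysign(0.0f, x);
    }
    return x;
}

void check_flush_to_zero() {
    const float tiny = std::numeric_limits<float>::denorm_min();  // smallest subnormal
    assert(tiny > 0.0f);                  // representable with gradual underflow
    assert(flush_to_zero(tiny) == 0.0f);  // lost when E = 0 is treated as zero
    assert(flush_to_zero(1.5f) == 1.5f);  // normalized values pass through unchanged
}
```

In single precision, every nonzero bit pattern with E=0 is removed this way, which is why a format with a large M loses so many values.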

<hr style="height:2px">

# <span style="color:green"> Why is flushing to zero problematic? </span>

- Many physical model calculations work on values that are very close to zero
    - Dark (but not totally black) sky in movie rendering
    - Small distance fields in electrostatic potential calculation
    - ...
- Without denormalization, these calculations tend to create artifacts that compromise the integrity of the models

<hr style="height:2px">

# <span style="color:green"> Denormalized Numbers </span>

- The actual method adopted by the IEEE standard is called “denormalized numbers” or “gradual underflow”.
    - The method relaxes the normalization requirement for numbers very close to 0.
    - Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX. Rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is $0.M \times 2^{-2^{n-1}+2}$

![alt tag](img/12.png)
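
As a minimal sketch of the rule above for single precision (n = 8, so the exponent is -126), the following decodes a bit pattern with E=0 by hand and compares it with the hardware interpretation. The function names are illustrative assumptions, not library calls.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: interpret a 32-bit pattern whose exponent field is 0
// as a denormalized value, 0.M * 2^(-2^(n-1) + 2) with n = 8.
float decode_denormal(std::uint32_t bits) {
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    const std::uint32_t mantissa = bits & 0x7FFFFFu;
    assert(exponent == 0 && "only patterns with E = 0 are denormalized");
    // 0.M is the 23-bit mantissa read as a fraction: M / 2^23.
    const float fraction = std::ldexp(static_cast<float>(mantissa), -23);
    const float magnitude = std::ldexp(fraction, -126);  // -2^(8-1) + 2 = -126
    return (bits >> 31) ? -magnitude : magnitude;
}

// Reinterpret the same bits as a float, for comparison with the formula.
float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, `decode_denormal(0x00000001u)` and `bits_to_float(0x00000001u)` should both give the smallest positive denormalized value.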
<hr style="height:2px">

# <span style="color:green"> Denormalization </span>

![alt tag](img/13.png)
<hr style="height:2px">

# <span style="color:green"> IEEE 754 Format and Precision </span>

- Single Precision
    - 1-bit sign, 8-bit exponent (excess-127 bias), 23-bit fraction
- Double Precision
    - 1-bit sign, 11-bit exponent (excess-1023 bias), 52-bit fraction
    - The largest error for representing a number is reduced to $1/2^{29}$ of that of the single precision representation, because the fraction has 52 - 23 = 29 more bits
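
The single precision layout can be inspected directly. This is a sketch that assumes `float` is the 32-bit IEEE 754 format on the host; `FloatFields` and `split_float` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical struct holding the three fields of a single-precision value.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with excess-127 bias
    std::uint32_t fraction;  // 23 bits, the implicit leading 1 is not stored
};

FloatFields split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

void check_split_float() {
    assert(split_float(1.0f).exponent == 127);       // 2^0 is stored as 0 + 127
    assert(split_float(1.5f).fraction == 0x400000u); // 1.1 (binary): top fraction bit set
    assert(split_float(-2.0f).sign == 1);
}
```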

<hr style="height:2px">

# <span style="color:green"> Special Bit Patterns </span>

![alt tag](img/15.png)
<hr style="height:2px">

# <span style="color:green"> Floating Point Accuracy and Rounding </span>

- The accuracy of a floating point arithmetic operation is measured by the maximal error introduced by the operation.
- The most common source of error in floating point arithmetic is when the operation generates a result that cannot be exactly represented and thus requires rounding.
- Rounding occurs if the mantissa of the result value needs too many bits to be represented exactly.
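
A small sketch of when rounding is needed, using a double as the "exact" value and single precision as the target format. The helper names are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

// Returns true when storing `exact` in single precision required rounding,
// i.e. the float nearest to `exact` differs from `exact` itself.
bool needs_rounding(double exact) {
    const float rounded = static_cast<float>(exact);
    return static_cast<double>(rounded) != exact;
}

// Rounding error introduced by the conversion, measured in double precision.
double rounding_error(double exact) {
    return std::fabs(static_cast<double>(static_cast<float>(exact)) - exact);
}

void check_rounding() {
    assert(!needs_rounding(0.5));         // 0.1 (binary) fits in the mantissa
    assert(!needs_rounding(16777216.0));  // 2^24 has a single 1 bit, so it fits
    assert(needs_rounding(16777217.0));   // 2^24 + 1 needs 25 mantissa bits
    assert(needs_rounding(0.1));          // more significant bits than a float holds
}
```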

<hr style="height:2px">

# <span style="color:green"> Rounding and Error </span>

![alt tag](img/17.png)



![alt tag](img/18.png)

<hr style="height:2px">

# <span style="color:green"> Error Measure </span>

- If a hardware adder has at least two more bit positions than the total (both implicit and explicit) number of mantissa bits, the error would never be more than half of the place value of the least significant mantissa bit
    - 0.001 (binary) in our 5-bit format
- We refer to this as 0.5 ULP (Units in the Last Place)
    - If the hardware is designed to perform arithmetic and rounding operations perfectly, the error introduced should be no more than 0.5 ULP
    - In this case the error is limited only by the precision of the format
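
One way to measure an error in ULPs on the host is sketched below. It takes the gap to the next larger float as one ULP, which is an assumption of this example (definitions differ slightly near powers of two), and `ulp_of` and `error_in_ulps` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Size of one ULP at x: the gap between |x| and the next larger float.
float ulp_of(float x) {
    assert(std::isfinite(x) && "ULP is only defined here for finite values");
    const float magnitude = std::fabs(x);
    return std::nextafter(magnitude, std::numeric_limits<float>::infinity()) - magnitude;
}

// Error of a float result against a higher-precision reference, in ULPs.
double error_in_ulps(float result, double reference) {
    return std::fabs(static_cast<double>(result) - reference) / ulp_of(result);
}
```

Converting 0.1 to single precision, for instance, should give an error of at most 0.5 ULP when measured with `error_in_ulps(static_cast<float>(0.1), 0.1)`.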

<hr style="height:2px">

# <span style="color:green"> Order of Operations Matters </span>

- Floating point operations are not strictly associative
- The root cause is that sometimes a very small number can disappear when added to or subtracted from a very large number.
    - (Large + Small) + Small ≠ Large + (Small + Small)
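
The inequality can be reproduced in single precision with Large = 2^24 and Small = 1. This is a minimal host-side sketch: the function names are illustrative, and `volatile` is used only to force each intermediate result to be rounded to float.

```cpp
#include <cassert>

// (large + small) + small: each small addend is rounded away on its own.
float sum_left_to_right(float large, float small) {
    volatile float partial = large + small;  // volatile forces rounding to float
    volatile float total = partial + small;
    return total;
}

// large + (small + small): the small values accumulate first and survive.
float sum_small_first(float large, float small) {
    volatile float smalls = small + small;
    volatile float total = large + smalls;
    return total;
}

void check_order_matters() {
    const float large = 16777216.0f;  // 2^24, where adjacent floats are 2 apart
    assert(sum_left_to_right(large, 1.0f) == 16777216.0f);  // both 1s disappear
    assert(sum_small_first(large, 1.0f) == 16777218.0f);    // 1 + 1 = 2 is kept
}
```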

<hr style="height:2px">

# <span style="color:green"> Algorithm Considerations </span>

![alt tag](img/21.png)
<hr style="height:2px">

# <span style="color:green"> Runtime Math Library </span>

![alt tag](img/22.png)
<hr style="height:2px">

# <span style="color:green"> Make your program float-safe! </span>

- Newer GPU hardware has double precision support
    - Double precision will have additional performance cost
    - Careless use of double or undeclared types may run more slowly
- Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
    - Add the ‘f’ suffix to float literals:
    ```cpp
    foo = bar * 0.123;   // double assumed
    foo = bar * 0.123f;  // float explicit
    ```
    
    - Use the float versions of standard library functions:
    ```cpp
    foo = sin(bar);   // double assumed
    foo = sinf(bar);  // single precision explicit
    ```
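
The effect of the literal suffix can be checked at compile time. The sketch below uses host C++ type traits to show which precision each expression is evaluated in; `scale_and_sine` is a hypothetical function written in the float-safe style.

```cpp
#include <cassert>
#include <math.h>
#include <type_traits>

// An unsuffixed literal is a double, so the whole product is promoted to double.
static_assert(std::is_same<decltype(1.0f * 0.123), double>::value,
              "float * double literal is evaluated in double precision");

// With the 'f' suffix the product stays in single precision.
static_assert(std::is_same<decltype(1.0f * 0.123f), float>::value,
              "float * float literal stays in single precision");

// sinf takes and returns float.
static_assert(std::is_same<decltype(sinf(1.0f)), float>::value,
              "sinf is the single-precision sine");

// Hypothetical float-safe helper: no double appears anywhere in the expression.
float scale_and_sine(float bar) {
    return sinf(bar * 0.123f);
}

void check_float_safe() {
    assert(sizeof(1.0f * 0.123f) == sizeof(float));
    assert(scale_and_sine(0.0f) == 0.0f);
}
```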

<hr style="height:2px">

# <span style="color:green"> CUDA Deviations from IEEE-754 </span>

- Addition and multiplication are IEEE 754 compliant
    - Maximum 0.5 ULP (units in the last place) error
- Division accuracy is not fully compliant (2 ULP)
- Not all rounding modes are supported
- No mechanism to detect floating-point exceptions (yet)

<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>