# Errors and Floating Point Numbers

## Binary Numbers

The computer stores numbers in binary. Most people know how to express the integer part of a number in binary (for example, $(10)_{10}=(1\times 2^3 + 0\times 2^2 + 1\times 2^1 + 0\times 2^0)_{10}=(1010)_2$). The fractional part is expressed in the same way, just with negative exponents.

For example, let's convert $(0.875)_{10}$ to binary:

In [21]:
import numpy as np

a = 0.875
iters = 0
bits = []  # binary digits after the point, most significant first
while a != 0 and iters <= 16:
    a = a * 2
    iters += 1
    if a < 1:
        # The current bit is zero: it contributes 0 * 2^-iters
        bits.append("0")
    else:
        # The current bit is one: it contributes 1 * 2^-iters
        bits.append("1")
        a -= 1

# Join the digits as text so that leading zeros (e.g. 0.001 for 0.125) are kept
print(f"Binary number result is 0.{''.join(bits)}")

Binary number result is 0.111


This confirms that $(0.875)_{10}=(1\times 2^{-1}+1\times 2^{-2}+1\times 2^{-3})_{10}=(0.111)_2$.

However, some numbers cannot be expressed finitely in binary (such as 0.1 and 0.8). This happens whenever the denominator of the fraction in lowest terms is not a power of 2 (here $0.1=1/10$ and $0.8=4/5$). The decimal system behaves analogously: a fraction has infinitely many decimal places if its denominator in lowest terms has a prime factor other than 2 or 5 (for example, $1/3=0.333\ldots$).
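As a quick check of this claim, the sketch below uses Python's standard `fractions.Fraction` to recover the exact rational value that is actually stored for a float. The helper name `is_power_of_two` is ours and is introduced only for illustration.

```python
from fractions import Fraction

def is_power_of_two(m):
    # True when the positive integer m has exactly one bit set
    return m > 0 and (m & (m - 1)) == 0

cases = [(0.875, Fraction(7, 8)), (0.1, Fraction(1, 10)), (0.8, Fraction(4, 5))]
for value, exact in cases:
    stored = Fraction(value)  # exact rational value of the stored double
    print(value, stored == exact, is_power_of_two(stored.denominator))
```

Only 0.875 is stored exactly. For 0.1 and 0.8 the stored value is a nearby fraction whose denominator is a power of two.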

## IEEE 754 Floating Point Standard
Now, let's look more closely at how the computer stores a number. IEEE 754 specifies how floating point numbers are stored and rounded (which matters especially for numbers that cannot be expressed finitely). It defines binary formats of various widths, from 16-bit to 256-bit. Binary numbers are represented in a form similar to scientific notation:
$$\pm 1.bbb\ldots b\times 2^p$$
where $\pm$ is the sign, each $b$ is a mantissa bit (either 0 or 1), and $p$ is the exponent.

For example, $(9)_{10}=(1001)_2=+(1.001)_2\times 2^3$.

The bit allocation of the IEEE 754 single (32-bit) and double (64-bit) precision formats is shown below:
<center>

| Format | Sign (bits) | Mantissa (bits) | Exponent (bits) |
| :--- | :---: | :---: | :---: |
| **Single (32-bit)** | 1 | 23 | 8 |
| **Double (64-bit)** | 1 | 52 | 11 |

</center>
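To make the double precision row concrete, the following sketch splits a Python float into its three bit fields using the standard `struct` module. Two details are not stated above and are assumptions taken from the IEEE 754 double format: the fields are stored in the order sign, exponent, mantissa, and the stored exponent is offset by a bias of 1023. The function name `decompose_double` is ours.

```python
import struct

def decompose_double(x):
    # Reinterpret the 8 bytes of a double as one 64-bit unsigned integer
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023 (assumed)
    mantissa = bits & ((1 << 52) - 1)  # 52 bits after the implicit leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = decompose_double(9.0)
# Show the sign, the unbiased exponent, and the first 8 mantissa bits
print(sign, exponent - 1023, f"{mantissa:052b}"[:8])
```

For 9.0 the unbiased exponent is 3 and the mantissa starts with the bits 001, in agreement with $1.001\times 2^3$.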

With normalized floating-point numbers, the spacing between adjacent representable numbers is very small when the magnitude is small and very large when the magnitude is large, while the relative spacing stays roughly constant. This is an acceptable trade-off between range and precision: when we work with a large number, we rarely need high absolute precision. Hence, it is a reasonable way to represent numbers.
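This trade-off can be observed with `np.spacing`, which returns the distance from a number to the adjacent double. The sample values below are arbitrary choices for illustration.

```python
import numpy as np

for x in [1e-10, 1.0, 1e10, 1e20]:
    gap = np.spacing(x)  # distance from x to the next larger double
    print(f"x = {x:.0e}   gap = {gap:.3e}   gap / x = {gap / x:.3e}")
```

The absolute gap grows with the magnitude of `x`, while the ratio `gap / x` stays between $2^{-53}$ and $2^{-52}$.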

## Machine Epsilon
Definition of **Machine Epsilon**: $\epsilon_{mach}$ is the distance between 1 and the smallest floating point number greater than 1. It is set by the number of mantissa bits: with $d$ mantissa bits, $\epsilon_{mach}=2^{-d}$.

For example, $\epsilon_{mach}$ for IEEE 754 double precision numbers is:
$$\epsilon_{mach}=1\times 2^{-52}$$

More specifically, this is called the "interval machine epsilon". The "rounding machine epsilon", which is half of it, will be used below.
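The interval machine epsilon can be recovered in several ways. The sketch below compares the definition (via `np.nextafter`), a classic halving loop, and the value NumPy reports in `np.finfo`. It assumes that Python floats are IEEE 754 doubles, which holds on common platforms.

```python
import numpy as np

# Definition: gap between 1 and the next larger double
eps_from_neighbor = np.nextafter(1.0, 2.0) - 1.0

# Halve eps until adding half of it no longer changes 1.0
eps_loop = 1.0
while 1.0 + eps_loop / 2 > 1.0:
    eps_loop /= 2

print(eps_from_neighbor, eps_loop, np.finfo(np.float64).eps)
```

All three values should equal $2^{-52}$.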

## Chopping and Rounding
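**Chopping**: Discard all bits beyond the last bit of the mantissa.

**Rounding to Nearest**: Let bit *n* be the last bit kept in the mantissa. If bit *n+1* is 0, truncate after bit *n* (round down). If bit *n+1* is 1 and at least one later bit is 1, add 1 to bit *n* (round up). If bit *n+1* is 1 and all later bits are 0, the number lies exactly halfway between two floating-point numbers; in that case add 1 to bit *n* if bit *n* is 1 and truncate if bit *n* is 0, so that the result always ends in an even bit.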

<center>

| Bit n+1 | Bits n+2, n+3, ... | Bit n (last kept bit) | Final Action |
| :--- | :--- | :--- | :--- |
| **0** | Any values | Any value | **Truncate** (Round Down) |
| **1** | **At least one "1"** | Any value | **Add 1** (Round Up) |
| **1** | **All "0"s** | **1** | **Add 1** (Round to Even) |
| **1** | **All "0"s** | **0** | **Truncate** (Stay Even) |

</center>

***Note that the default rounding mode of IEEE 754 is round to nearest, with ties rounded to even, as in the table above.***
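The four rows of the table can be reproduced with ordinary double arithmetic near 1.0, where the last kept bit has weight $2^{-52}$. This is a minimal sketch that assumes Python floats are IEEE 754 doubles in the default rounding mode; the variable names are ours.

```python
u = 2.0 ** -53  # half the gap between 1.0 and the next double (weight of bit n+1)

below_half = 1.0 + u / 2        # bit n+1 is 0: truncate
above_half = 1.0 + (u + u / 8)  # bit n+1 is 1 and a later bit is 1: round up
tie_even = 1.0 + u              # exact tie, last kept bit is 0: stay even
tie_odd = (1.0 + 2 * u) + u     # exact tie, last kept bit is 1: round up to even

print(below_half == 1.0)
print(above_half == 1.0 + 2 * u)
print(tie_even == 1.0)
print(tie_odd == 1.0 + 4 * u)
```

Each comparison should print `True` if the rounding rule is applied as described.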

## Floating-Point Error

Floating-point rounding satisfies:
$$\text{fl}(x)=x(1+\delta),\;|\delta|\le\frac{\epsilon_{mach}}{2}$$
$$\frac{|\text{fl}(x)-x|}{|x|}\le\frac{\epsilon_{mach}}{2}$$

Here, the rounding machine epsilon is used.
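The bound can be checked exactly with rational arithmetic. In this sketch, `Fraction(num, den)` plays the role of the real number $x$ and `Fraction(num / den)` is the exact value of its double precision representation $\text{fl}(x)$. The test values are arbitrary.

```python
from fractions import Fraction

half_eps = Fraction(1, 2 ** 53)  # rounding machine epsilon for doubles

for num, den in [(1, 10), (1, 3), (4, 5), (22, 7)]:
    x = Fraction(num, den)      # the exact real number
    fl_x = Fraction(num / den)  # exact value of the double nearest to x
    delta = (fl_x - x) / x      # relative rounding error
    print(f"{num}/{den}: |delta| = {float(abs(delta)):.3e}, "
          f"within bound: {abs(delta) <= half_eps}")
```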

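The result of every floating-point operation ($+,-,\times,\div$) is rounded to nearest, so each operation can introduce a relative error of up to $\epsilon_{mach}/2$. These errors can accumulate and grow when many operations are chained in a numerically unstable computation. Multiplication and division are well conditioned in the relative sense. Addition and subtraction are well conditioned unless the result is much smaller in magnitude than the operands (for example, when subtracting two nearly equal numbers), in which case cancellation amplifies the relative error.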
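A small illustration of why subtraction needs care: subtracting two nearly equal numbers exposes the rounding error made earlier, whereas a single multiplication stays within the rounding bound. The value `1e-15` is an arbitrary choice for this sketch.

```python
from fractions import Fraction

x = 1e-15

# Subtraction of nearly equal numbers: the exact answer would be x itself
computed = (1.0 + x) - 1.0
rel_err_sub = abs(computed - x) / x

# One multiplication, compared against the exact rational product
rel_err_mul = abs(Fraction(3.0 * x) - 3 * Fraction(x)) / (3 * Fraction(x))

print(f"subtraction:    relative error = {rel_err_sub:.3e}")
print(f"multiplication: relative error = {float(rel_err_mul):.3e}")
```

The subtraction loses about one significant digit out of roughly sixteen before the difference is even formed, and that loss becomes a relative error of around 10% in the result.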

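The following experiment shows how such errors can grow. For each $n$, the update $x\leftarrow(n+1)x-1$ leaves $x=1/n$ unchanged in exact arithmetic, but in floating point it multiplies any rounding error in $x$ by $n+1$ at every step. The printed columns are $n$, $x=1/n$, the computed $\hat{x}$ after $K=30$ steps, the absolute error, and the relative error: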

In [22]:
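N = 10  # test values n = 1, ..., N
K = 30  # number of iterations per test value
n = []
x = []        # reference values 1/n (already rounded to doubles)
x_hat = []    # values after K iterations
err_abs = []
err_rel = []

for i in range(N):
    n.append(i + 1)
    x.append(1 / n[i])
    x_hat.append(x[i])
    for j in range(K):
        # 1/n is a fixed point, but each step scales the error by n + 1
        x_hat[i] = (n[i] + 1) * x_hat[i] - 1
    err_abs.append(np.abs(x_hat[i] - x[i]))
    err_rel.append(err_abs[i] / np.abs(x[i]))

# Columns: n, x, x_hat, absolute error, relative error
for i in range(N):
    print(f"{n[i]}  "
          f"{x[i]:.4e}  "
          f"{x_hat[i]:.4e}  "
          f"{err_abs[i]:.4e}    "
          f"{err_rel[i]:.4e}    ")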


1  1.0000e+00  1.0000e+00  0.0000e+00    0.0000e+00    
2  5.0000e-01  5.0000e-01  0.0000e+00    0.0000e+00    
3  3.3333e-01  -2.1000e+01  2.1333e+01    6.4000e+01    
4  2.5000e-01  2.5000e-01  0.0000e+00    0.0000e+00    
5  2.0000e-01  6.5451e+06  6.5451e+06    3.2726e+07    
6  1.6667e-01  -4.7664e+08  4.7664e+08    2.8599e+09    
7  1.4286e-01  -9.8171e+09  9.8171e+09    6.8719e+10    
8  1.2500e-01  1.2500e-01  0.0000e+00    0.0000e+00    
9  1.1111e-01  4.9343e+12  4.9343e+12    4.4409e+13    
10  1.0000e-01  1.4089e+14  1.4089e+14    1.4089e+15    
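The size of these errors can be predicted. Subtracting $1/n$ from both sides of the update $x_{k+1}=(n+1)x_k-1$ gives $x_{k+1}-1/n=(n+1)(x_k-1/n)$, so the initial rounding error is multiplied by $n+1$ in every step. The sketch below checks this for $n=3$ with exact rational arithmetic; it ignores the rounding errors made during the iteration itself, which is an assumption of the prediction.

```python
from fractions import Fraction

n, K = 3, 30
x_true = Fraction(1, n)           # the exact value 1/3
x_float = 1 / n                   # its double precision approximation
e0 = Fraction(x_float) - x_true   # exact initial rounding error
predicted = (n + 1) ** K * e0     # error after K steps if every step were exact

x_hat = x_float
for _ in range(K):
    x_hat = (n + 1) * x_hat - 1
observed = Fraction(x_hat) - x_true

print(float(e0), float(predicted), float(observed))
```

The predicted and observed errors should both be close to $-64/3\approx-21.33$, which matches the absolute error in the row for $n=3$. For $n=1,2,4,8$ the value $1/n$ is stored exactly, so there is no initial error to amplify.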


## Forward and Backward Error

The forward error measures the error in the output, while the backward error measures the error in the input that would account for it.

The absolute *output (forward)* error is:
$$|\delta y|=|\hat{y}-y_{true}|=|\hat{f}(x_{true})-f(x_{true})|$$

The absolute *input (backward)* error is:
$$|\delta x|=|\hat{x}-x_{true}|$$

The two are linked by asking which perturbed input $\hat{x}$ the exact function $f$ would need in order to reproduce the computed output $\hat{y}$:

$$\hat{y}=\hat{f}(x_{true})=f(\hat{x})$$
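As a worked example, take $f(x)=\sqrt{x}$ with $x_{true}=2$ and suppose an approximate method returns $\hat{y}=1.4$. The function and the numbers are our own illustrative choices. The input that the exact $f$ maps to $\hat{y}$ is $\hat{x}=\hat{y}^2$.

```python
import math

x_true = 2.0
y_true = math.sqrt(x_true)  # f(x_true), used as the reference value
y_hat = 1.4                 # a deliberately crude approximation of sqrt(2)

forward_error = abs(y_hat - y_true)   # error in the output
x_hat = y_hat ** 2                    # input for which the exact f returns y_hat
backward_error = abs(x_hat - x_true)  # error in the input

print(f"forward error  = {forward_error:.4f}")
print(f"backward error = {backward_error:.4f}")
```

Here the forward error is about 0.0142 and the backward error is about 0.04: the crude result $1.4$ is the exact square root of the slightly perturbed input $1.96$.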

## Conditioning
Conditioning measures how sensitive the solution is to small changes in the input. It bounds the output error in terms of the input error:
$$(\text{relative output error})\le\kappa(\text{relative input error})$$

The *condition number* $\kappa$ tells us whether a problem is well conditioned:
$$\kappa=\text{cond}=\max_{\text{input}}\frac{\text{relative output error}}{\text{relative input error}}$$
A problem is well conditioned if the condition number is small and ill conditioned if it is large.
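The ratio in this definition can be estimated numerically by perturbing the input slightly. The sketch below is only a local estimate at a single point with one perturbation size, not the maximum over all inputs; the function name `estimate_cond` and the step `rel_step` are ours.

```python
import math

def estimate_cond(f, x, rel_step=1e-6):
    # Ratio of relative output change to relative input change at x
    dx = x * rel_step
    rel_out = abs(f(x + dx) - f(x)) / abs(f(x))
    rel_in = abs(dx) / abs(x)
    return rel_out / rel_in

cond_sqrt = estimate_cond(math.sqrt, 2.0)          # square root at x = 2
cond_sub = estimate_cond(lambda x: x - 1.0, 1.001) # x - 1 close to x = 1

print(f"sqrt(x) at x = 2:     {cond_sqrt:.3f}")
print(f"x - 1   at x = 1.001: {cond_sub:.1f}")
```

The square root is well conditioned (the estimate is close to 0.5), while $x-1$ near $x=1$ is ill conditioned (the estimate is close to 1000) because the output is much smaller than the input.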

## Other Sources of Errors
1. Numerical errors: Numerical methods often compute integrals and derivatives with finite approximations. Each approximation introduces an error of its own, which then propagates through the rest of the computation.
2. Modelling errors: Mathematical and physical models always rest on assumptions and approximations, which introduce errors. *"All models are wrong, but some are useful."*
3. Input errors: Parameters are never known exactly, so setting them introduces errors.
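The first item can be illustrated with a forward difference approximation of a derivative. The choice of $\sin$ at $x=1$ and the step sizes are arbitrary, and `forward_difference` is our own helper name.

```python
import math

def forward_difference(f, x, h):
    # Finite approximation of f'(x) with step size h
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # true derivative of sin at x = 1
for h in [1e-1, 1e-3, 1e-5, 1e-8, 1e-13]:
    approx = forward_difference(math.sin, 1.0, h)
    print(f"h = {h:.0e}   error = {abs(approx - exact):.3e}")
```

Reducing `h` first reduces the approximation error. For very small `h` the floating-point rounding error discussed earlier typically takes over, so the total error grows again.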