---

## Floating-point arithmetic

Real numbers are approximated on a computer by floating-point numbers. The IEEE 754 floating-point standard defines three commonly used binary formats:

1. **half precision** using 16 bits (Julia type: `Float16`)
2. **single precision** using 32 bits (Julia type: `Float32`)
3. **double precision** using 64 bits (Julia type: `Float64`)

Julia also has an **arbitrary precision** floating-point data type called `BigFloat`. It is useful when you need more precision than `Float64` provides, but it is much slower because its arithmetic is carried out in software.
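As a quick illustration of the storage sizes listed above, the following sketch queries them with `sizeof` and `precision`. The 512-bit setting is an arbitrary value chosen only for this demonstration, and the variable names are hypothetical.

```julia
# Storage size and significand precision (including the implicit leading bit)
for T in (Float16, Float32, Float64)
    println(T, ": ", 8 * sizeof(T), " bits stored, ", precision(T), " bits of precision")
end

# BigFloat precision is adjustable; this queries the current default
default_bits = precision(BigFloat)

# Temporarily compute with 512 bits (arbitrary choice for this sketch)
third_512 = setprecision(BigFloat, 512) do
    BigFloat(1) / 3
end
precision(third_512)
```

Note that `precision(Float64)` reports 53 rather than 52 because it counts the implicit leading bit of the significand.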

In [None]:
?AbstractFloat

In [None]:
subtypes(AbstractFloat)

---

## Description of IEEE double floating-point format (`Float64`)

Suppose $x$ is a floating-point number stored in the following 64-bits:

$$
\begin{array}{|c|c|c|c|c|c|c|}
\hline
1 & 2 & \cdots & 12 & 13 & \cdots & 64 \\
\hline
s & e_{10} & \cdots & e_0 & f_1 & \cdots & f_{52} \\
\hline
\end{array}
$$

- 1 bit $s$ represents the **sign**
- 11 bits $e_{10} \cdots e_{0}$ represent the **exponent**
- 52 bits $f_1 \cdots f_{52}$ represent the **fraction** (the bits of the significand, a.k.a. the mantissa, that follow the leading 1)

Then

$$ x = (-1)^s \left[1.f_1 \cdots f_{52}\right]_2 \times 2^{(e-1023)}.$$

Notes: 

- $x$ is **normalized** to have its first binary digit nonzero. Since that digit is always 1, it is not stored.
- $e = \left[e_{10} \cdots e_{0}\right]_2 = e_{10} 2^{10} + \cdots + e_1 2^1 + e_0 2^0 \in \left[0, 2^{11}-1\right] = [0, 2047]$
- $e = 0$ and $e = 2047$ are reserved for special floating-point values, so 

$$e \in [1, 2046]$$

- the "$-1023$" in the exponent is called the **bias**:  $e-1023 \in [-1022,1023]$
- $\left[1.f_1 \cdots f_{52}\right]_2 = 1 + \frac{f_1}{2^1} + \frac{f_2}{2^2} + \cdots + \frac{f_{52}}{2^{52}}$
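The formula and notes above can be checked directly on the bit pattern. The sketch below is one possible way to do it, assuming a normalized input ($1 \leq e \leq 2046$); the helper names `decode_float64` and `rebuild_float64` are hypothetical and not part of Julia.

```julia
# Split a Float64 into its sign bit, biased exponent, and 52-bit fraction field
function decode_float64(x::Float64)
    bits = reinterpret(UInt64, x)
    s = Int(bits >> 63)                  # sign bit
    e = Int((bits >> 52) & 0x7ff)        # 11 exponent bits (still biased)
    f = bits & 0x000fffffffffffff        # 52 fraction bits as an integer
    return s, e, f
end

# Rebuild a normalized Float64 from (-1)^s * [1.f]_2 * 2^(e - 1023)
function rebuild_float64(s::Integer, e::Integer, f::Integer)
    significand = 1 + ldexp(Float64(f), -52)    # [1.f_1 ... f_52]_2
    magnitude = ldexp(significand, e - 1023)    # scale by 2^(e - 1023)
    return s == 1 ? -magnitude : magnitude
end

s, e, f = decode_float64(-13.625)
rebuild_float64(s, e, f) == -13.625
```

Here `ldexp(a, n)` computes $a \times 2^n$ exactly, which avoids any rounding in the reconstruction.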


---

## Example

$$
\begin{split}
x & = -[1.101101]_2 \times 2^{(1026-1023)} \\
  & = -[1.101101]_2 \times 2^{3} \\
  & = -[1101.101]_2 \\
  & = -\left(1 \cdot 8 + 1 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 + 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{8}\right)  \\
  & = -13.625
\end{split}
$$

In [None]:
?bitstring

In [None]:
s = bitstring(-13.625)

In [None]:
s[1], s[2:12], s[13:64]

In [None]:
Int(0b10000000010)

In [None]:
function mybitstring(x::Float64)
    s = bitstring(x)
    return s[1], s[2:12], s[13:64]
end

In [None]:
mybitstring(-13.625)

In [None]:
methods(mybitstring)

In [None]:
# 10 is an Int, and mybitstring only has a Float64 method, so this throws a MethodError
mybitstring(10)

---

## Example

Even if a number can be represented exactly in base-10 with a finite number of digits, it may require an infinite number of digits in base-2.

$$
0.1 = \left[0.000110011001\ldots\right]_2 = \left[1.\overline{1001}\right]_2 \times 2^{-4}
$$

Therefore, $0.1$ cannot be represented exactly as a floating-point number.

In [None]:
# Show that 0.1 is not represented exactly as a Float64

x = 0.1
mybitstring(x)

In [None]:
y = BigFloat(x)

In [None]:
BigFloat("0.1")

In [None]:
BigFloat("0.1") - 0.1
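Another way to see that $0.1$ is not stored exactly is to convert the `Float64` to an exact rational number. This is a small sketch using `Rational`; the variable name `r` is arbitrary.

```julia
# Exact rational value of the Float64 nearest to 0.1
r = Rational(0.1)

# The denominator is a power of 2, so r cannot equal 1//10
denominator(r) == 2^55       # true

# The stored value is slightly larger than 1/10
r > 1//10                    # true

# Consequence: decimal identities can fail in floating-point arithmetic
0.1 + 0.2 == 0.3             # false
0.1 + 0.2 ≈ 0.3              # true: isapprox compares with a relative tolerance
```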

---

## Limits of floating-point numbers

- **Largest** `Float64` $= \left(2 - 2^{-52}\right) \times 2^{1023} \approx 1.8 \times 10^{308}$
- **Smallest positive normalized** `Float64` $= 2^{-1022} \approx 2.2 \times 10^{-308}$
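The two formulas above can be compared against `floatmax` and `floatmin`. This sketch builds the values with `ldexp` so that every step is exact; the variable names are hypothetical.

```julia
# (2 - 2^-52) * 2^1023, built exactly with ldexp(a, n) = a * 2^n
largest = ldexp(2 - ldexp(1.0, -52), 1023)

# 2^-1022
smallest_normal = ldexp(1.0, -1022)

largest == floatmax(Float64), smallest_normal == floatmin(Float64)
```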

In [None]:
?floatmax

In [None]:
# Experiment with floatmax
x = floatmax(Float64)

In [None]:
mybitstring(x)

In [None]:
(x + 1.0) - x

In [None]:
10.0*x

In [None]:
(10.0*x)/10.0

In [None]:
-10.0*x

---

In [None]:
?floatmin

In [None]:
# Experiment with floatmin
x = floatmin(Float64)

In [None]:
mybitstring(x)

In [None]:
x/10.0

In [None]:
mybitstring(x/10.0)

In [None]:
x/1e16

In [None]:
(x/1e16)*1e16

---

## De-normalized floating-point numbers

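The IEEE floating-point standard also allows **de-normalized** (a.k.a. **subnormal**) numbers that are smaller in magnitude than `floatmin(Float64)`. De-normalized floats are represented by $e = 0$ with a nonzero fraction; the leading digit is then 0 instead of 1, and the value is $(-1)^s \left[0.f_1 \cdots f_{52}\right]_2 \times 2^{-1022}$.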

In [None]:
# Halve x until it underflows to 0.0; the last nonzero value was 2.0^(n+1)

x = 1.0
n = 0
while x != 0.0
    x /= 2.0
    n -= 1
end

x, n

In [None]:
2.0^-1075

In [None]:
x = 2.0^-1074

In [None]:
mybitstring(x)

In [None]:
(x/2.0)*2.0 == x
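Julia can also report directly whether a value is de-normalized. The following sketch uses `issubnormal` and `nextfloat`; the name `tiny` is arbitrary.

```julia
tiny = nextfloat(0.0)                 # smallest positive Float64
tiny == ldexp(1.0, -1074)             # true: it equals 2^-1074

issubnormal(tiny)                     # true
issubnormal(floatmin(Float64))        # false: smallest normalized value
issubnormal(floatmin(Float64) / 2)    # true

# De-normalized numbers carry fewer significant bits, so relative spacing degrades
eps(tiny) / tiny                      # 1.0, versus eps(1.0) / 1.0 ≈ 2.2e-16
```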

---

## Other special floats

- `0.0` and `-0.0`: $$e_{10} \cdots e_0 = 0 \cdots 0 \quad \text{and} \quad f_1 \cdots f_{52} = 0 \cdots 0$$
- `Inf` and `-Inf`: $$e_{10} \cdots e_0 = 1 \cdots 1 \quad \text{and} \quad f_1 \cdots f_{52} = 0 \cdots 0$$
- `NaN` (not-a-number): $$e_{10} \cdots e_0 = 1 \cdots 1 \quad \text{and} \quad f_1 \cdots f_{52} \neq 0$$

From [Julia Manual: Mathematical Operations and Elementary Functions](https://docs.julialang.org/en/v1/manual/mathematical-operations/):

- `Inf` is equal to itself and greater than everything else except `NaN`.
- `-Inf` is equal to itself and less than everything else except `NaN`.
- `NaN` is not equal to, not less than, and not greater than anything, including itself.
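The comparison rules above can be tried out as follows. This is a short sketch that also shows `isnan`, `isequal`, and `signbit`, which are the usual tools for telling special values apart.

```julia
NaN == NaN                   # false
isnan(NaN)                   # true: the reliable way to test for NaN
isequal(NaN, NaN)            # true: isequal treats NaN as equal to itself

Inf > floatmax(Float64)      # true
Inf > NaN                    # false
-Inf < -floatmax(Float64)    # true

# Signed zeros compare equal with == but can still be distinguished
isequal(0.0, -0.0)           # false
signbit(-0.0)                # true
1.0 / -0.0                   # -Inf
```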

In [None]:
# Experiment with 0.0, -0.0, Inf, -Inf, and NaN

mybitstring(0.0)

In [None]:
mybitstring(-0.0)

In [None]:
Inf - Inf

In [None]:
0.0*Inf

In [None]:
1.0/0.0

In [None]:
mybitstring(NaN)

In [None]:
-0.0 < 0.0

In [None]:
-0.0 == 0.0

In [None]:
-0.0 === 0.0

In [None]:
?===

---

## Machine epsilon `eps(Float64)` and the unit roundoff $\eta$

- `1.0 + eps(Float64)` is the first `Float64` that is larger than `1.0`

$$\mathtt{eps(Float64)} = 2^{-52} \approx 2.2 \times 10^{-16}$$

- $\eta = $ `eps(Float64)/2.0` is the largest possible **relative error** due to roundoff

$$\eta = 2^{-53} \approx 1.1 \times 10^{-16}$$

In [None]:
?eps

In [None]:
# Experiment with eps
ϵ = eps()

In [None]:
η = ϵ/2.0

In [None]:
1.0 + ϵ

In [None]:
(1.0 + ϵ) - 1.0

In [None]:
1.0 + η

In [None]:
(1.0 + η) - 1.0

In [None]:
?nextfloat

In [None]:
?prevfloat

In [None]:
mybitstring(1.0)

In [None]:
mybitstring(1.0 + eps())

In [None]:
nextfloat(1.0) - 1.0

In [None]:
2.0 - prevfloat(2.0)

In [None]:
nextfloat(2.0) - 2.0

In [None]:
x = 2.0^50.0
nextfloat(x) - x

In [None]:
x = 2.0^51.0
nextfloat(x) - x

In [None]:
x = 2.0^52.0
nextfloat(x) - x

In [None]:
x = 2.0^53.0
nextfloat(x) - x
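The experiments above suggest that the gap between consecutive floats doubles at each power of 2. The sketch below uses `eps(x)`, which returns that gap for a given `x`; the helper `relative_gap` and the sample values are chosen only for illustration.

```julia
# eps(x) is the gap between x and the next float of larger magnitude
eps(1.0) == nextfloat(1.0) - 1.0      # true
eps(2.0^52) == 1.0                    # true: consecutive floats are 1 apart here
eps(2.0^53) == 2.0                    # true: odd integers are no longer representable
2.0^53 + 1.0 == 2.0^53                # true: the sum rounds back to 2^53

# The gap relative to x stays between 2^-53 and 2^-52 for normalized x
relative_gap(x) = eps(x) / x
samples = (1.0, 1.5, 3.7, 1e10, 1e-10, 2.0^52)
all(x -> ldexp(1.0, -53) < relative_gap(x) <= ldexp(1.0, -52), samples)
```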

---

## Roundoff error example

Suppose we are using a base-10 floating-point system with 4 significant digits, using `RoundNearest`:

$$
\begin{split}
\left( 1.112 \times 10^1 \right) \times \left( 1.112 \times 10^2 \right)
& = 1.236544 \times 10^3 \\
& \rightarrow 1.237 \times 10^3 = 1237
\end{split}
$$

The absolute error is $1237 - 1236.544 = 0.456$.

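The relative error is $$\frac{0.456}{1236.544} \approx 0.0004 = 0.04 \%,$$ which is below the unit roundoff $\frac{1}{2} \times 10^{-3} = 0.05 \%$ of this 4-digit decimal system.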

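The default rounding mode is `RoundNearest` (round to the nearest floating-point number). Writing $\mathrm{fl}(x)$ for the floating-point number that represents a real number $x$ in the normalized range, this implies that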

$$ \frac{|x - \mathrm{fl}(x)|}{|x|} \leq \eta.$$

If `RoundToZero` is used (a.k.a. **chopping**), then

$$ \frac{|x - \mathrm{fl}(x)|}{|x|} \leq 2 \eta.$$

`RoundNearest` is the default because its worst-case roundoff error is half that of chopping.
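The 4-digit decimal example can be imitated in Julia with `round` and its `sigdigits` keyword. This is only a sketch: the input is itself a `Float64`, so it mimics the decimal system rather than implementing it, and the names `nearest`, `chopped`, `relerr`, and `unit_roundoff_10` are hypothetical.

```julia
x = 1236.544

# Round to 4 significant decimal digits with the two rounding modes
nearest = round(x, RoundNearest; sigdigits=4)   # 1237.0
chopped = round(x, RoundToZero; sigdigits=4)    # 1236.0

relerr(approx) = abs(x - approx) / abs(x)

# Unit roundoff for 4 significant decimal digits: (1/2) * 10^(1-4)
unit_roundoff_10 = 0.5e-3

relerr(nearest) <= unit_roundoff_10, relerr(chopped) <= 2 * unit_roundoff_10
```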

---

## Roundoff error accumulation

When performing arithmetic operations on floats, extra **guard digits** are used to ensure **exact rounding**. This guarantees that the relative error of a floating-point operation (**flop**) is small. More precisely, for floating-point numbers $x$ and $y$, we have

$$
\begin{split}
\mathrm{fl}(x \pm y) &= (x \pm y)(1 + \varepsilon_1) \\
\mathrm{fl}(x \times y) &= (x \times y)(1 + \varepsilon_2) \\
\mathrm{fl}(x \div y) &= (x \div y)(1 + \varepsilon_3) \\
\end{split}
$$

where $|\varepsilon_i| \leq \eta$ for $i = 1,2,3$, and $\eta$ is the unit roundoff.
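The bound on a single flop can be checked numerically against a higher-precision reference. The sketch below assumes the default 256-bit `BigFloat` precision, which is enough to hold the sum and product of these two `Float64` values exactly; the helper `relerr` and the test values are arbitrary choices.

```julia
η = eps(Float64) / 2

# Relative error of a computed Float64 result against a BigFloat reference
relerr(computed::Float64, exact::BigFloat) = abs(big(computed) - exact) / abs(exact)

x, y = 0.1, 0.3

# big(x) converts the stored Float64 exactly, so these references are exact
ε_add = relerr(x + y, big(x) + big(y))
ε_mul = relerr(x * y, big(x) * big(y))

ε_add <= η, ε_mul <= η
```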

Although the relative error of each flop is small, it is possible to have the roundoff error accumulate and create significant error in the final result. If $E_n$ is the error after $n$ flops, then:

- **linear roundoff error accumulation** is when $E_n \approx c_0 n E_0$
- **exponential roundoff error accumulation** is when $E_n \approx c_1^n E_0$, for some $c_1 > 1$

In general, linear roundoff error accumulation is unavoidable. On the other hand, exponential roundoff error accumulation is not acceptable and is an indication of an **unstable algorithm**. (See Example 1.6 in Ascher-Greif for an example of exponential roundoff error accumulation, and see Exercise 5 in Section 1.4 for a numerically stable method to accomplish the same task.)
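As an illustration of exponential accumulation, consider the integrals $y_n = \int_0^1 \frac{x^n}{x+10}\,dx$, which satisfy $y_n = \frac{1}{n} - 10\,y_{n-1}$ with $y_0 = \ln(11/10)$. This is a commonly used textbook illustration chosen here as an assumption; it is not claimed to match the cited example. Each step multiplies the inherited error by $-10$, so $c_1 = 10$. The function name `unstable_recurrence` is hypothetical.

```julia
# Forward recurrence y_n = 1/n - 10*y_{n-1}, evaluated in floating-point type T
function unstable_recurrence(::Type{T}, N::Integer) where {T<:AbstractFloat}
    y = log(T(11) / T(10))          # y_0
    for n in 1:N
        y = one(T) / n - 10 * y     # the error in y is multiplied by -10
    end
    return y
end

y_float = unstable_recurrence(Float64, 30)
y_big = unstable_recurrence(BigFloat, 30)   # 256-bit reference value

# The true y_30 lies between 1/(11*31) and 1/(10*31); the Float64 result does not
y_float, Float64(y_big)
```

The same recurrence in `BigFloat` is still unstable, but its initial error is so small that 30 steps do not make it visible.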

---

## General advice

1. Adding $x + y$ when $|x| \gg |y|$ can cause the information in $y$ to be 'lost' in the summation.

2. Dividing by very small numbers or multiplying by very large numbers can **magnify error**.

3. Subtracting numbers that are almost equal produces **cancellation error**.

4. An **overflow** occurs when the result is too large in magnitude to be representable as a float. The result becomes either `Inf` or `-Inf`. Overflows should be avoided.

5. An **underflow** occurs when the result is too small in magnitude to be representable as a float. The result becomes either `0.0` or `-0.0`.
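Each item in the list above can be reproduced with a one-line experiment. The values below are arbitrary choices for illustration.

```julia
# 1. Absorption: y is lost when |x| >> |y|
1e16 + 1.0 == 1e16                    # true

# 2. Magnification: a roundoff error of about 5.6e-17 grows when divided by 1e-16
(0.1 + 0.2 - 0.3) / 1e-16             # exact answer would be 0

# 3. Cancellation: subtracting nearly equal numbers
x = 1e-15
computed = (1.0 + x) - 1.0
abs(computed - x) / x                 # about 0.11, an 11% relative error

# 4. Overflow
2.0 * floatmax(Float64)               # Inf

# 5. Underflow
1e-200 * 1e-200                       # 0.0
```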


---

## Example (summation order)

This next example shows that summation order can make a difference. We will compute

$$
s = \sum_{n = 1}^{1,000,000} \frac{1}{n}
$$

in two different ways: from largest to smallest and from smallest to largest.

In [None]:
# Sum from largest to smallest

s1 = 0.0
for n = 1:1_000_000
    s1 += 1/n
end
s1

In [None]:
# Sum from smallest to largest

s2 = 0.0
for n = 1_000_000:-1:1
    s2 += 1/n
end
s2

In [None]:
s = BigFloat(0)
for n = 1_000_000:-1:1
    s += BigFloat(1)/n
end
# Show both the Float64-rounded sum and the full BigFloat sum
Float64(s), s

```
14.392726722865723 63138112749318858767664480001374431165341843304581295850751387

14.39272672286 5724 (BigFloat)
14.39272672286 4989 (small to large)
14.39272672286 5772 (large to small)
```

In [None]:
abs(Float64(s) - s1)

In [None]:
abs(Float64(s) - s2)
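The effect of summation order is easier to see in lower precision. The following sketch repeats the experiment in `Float32` and uses a `Float64` sum as the reference; the function `harmonic_sum` and its `backward` keyword are hypothetical names introduced for this illustration.

```julia
# Sum 1/n for n = 1..N in floating-point type T, in either order
function harmonic_sum(::Type{T}, N::Integer; backward::Bool=false) where {T<:AbstractFloat}
    s = zero(T)
    indices = backward ? (N:-1:1) : (1:1:N)
    for n in indices
        s += one(T) / T(n)
    end
    return s
end

N = 1_000_000
forward32 = harmonic_sum(Float32, N)                     # largest terms first
backward32 = harmonic_sum(Float32, N; backward=true)     # smallest terms first
reference = harmonic_sum(Float64, N; backward=true)

abs(forward32 - reference), abs(backward32 - reference)
```

In the forward sum, the small terms are added to a partial sum that is already large, so most of their digits are rounded away.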

---

## Example (cancellation error)

Show that 

$$
\ln\left( x - \sqrt{x^2-1} \right) = -\ln\left( x + \sqrt{x^2-1} \right).
$$

Which formula is more suitable for numerical computation?
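One way to verify the identity, for $x \geq 1$, is to multiply the two arguments of the logarithm:

$$
\left( x - \sqrt{x^2-1} \right)\left( x + \sqrt{x^2-1} \right) = x^2 - \left(x^2 - 1\right) = 1,
$$

so that

$$
\ln\left( x - \sqrt{x^2-1} \right) = \ln\left( \frac{1}{x + \sqrt{x^2-1}} \right) = -\ln\left( x + \sqrt{x^2-1} \right).
$$

For large $x$, the numbers $x$ and $\sqrt{x^2-1}$ are almost equal, so the left-hand formula subtracts nearly equal numbers, while the right-hand formula only adds positive numbers.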

In [None]:
# Experiment with both formulas

x = 1e6
fl = log(x - sqrt(x^2 - 1))
fr = -log(x + sqrt(x^2 - 1))
fl, fr

In [None]:
x = BigFloat("1e6")
fl = log(x - sqrt(x^2 - 1))
fr = -log(x + sqrt(x^2 - 1))
fl, fr

### Float64:
```
fl = -14.50865_012405984
fr = -14.50865_7738523969
```
### BigFloat:
```
fl = -14.50865773852396941352518075581436181363002573279952559729981405735_302986080484
fr = -14.50865773852396941352518075581436181363002573279952559729981405735_612094668172
```
### Comparison:
```
fr = -14.508657738523969 (Float64)
fr = -14.50865773852396941352518075581436181363002573279952559729981405735_612094668172 (BigFloat)
```

---

## Example (avoiding overflow)

Overflow is possible when squaring a large number. This needs to be avoided when computing the Euclidean norm (a.k.a. the $2$-norm) of a vector $x$:

$$
\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.
$$

If some $x_i$ is very large, it is possible that $x_i^2$ will overflow, causing the final result to be `Inf`. We can avoid this as follows.

Let 
$$\bar{x} = \max_{i=1:n} |x_i|.$$
Then
$$
\|x\|_2 = \bar{x} \sqrt{\left(\frac{x_1}{\bar{x}}\right)^2 + \left(\frac{x_2}{\bar{x}}\right)^2 + \cdots + \left(\frac{x_n}{\bar{x}}\right)^2}.
$$
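Since $|x_i/\bar{x}| \leq 1$ for all $i$, each squared term is at most $1$ and their sum is at most $n$, so no overflow will occur inside the square root (assuming $\bar{x}$ is nonzero and finite). Underflow may occur, but this is harmless.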


In [None]:
# Experiment with both formulas
n = 100

x = rand(n)
x[1] = 1e200

using LinearAlgebra
norm(x)

In [None]:
# Naive formula: x[1]^2 = 1e400 overflows, so the result is Inf
s = 0.0
for i = 1:n
    s += x[i]^2
end
s = sqrt(s)

In [None]:
x̄ = maximum(abs.(x))

s = 0.0
for i = 1:n
    s += (x[i]/x̄)^2
end
s = x̄*sqrt(s)
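The scaled formula can be wrapped in a function. The sketch below adds guards for the empty, all-zero, and infinite cases, which the derivation above does not cover; `scaled_norm` is a hypothetical name, and in practice `norm` from `LinearAlgebra` already handles this.

```julia
# 2-norm computed by scaling with the largest entry in magnitude
function scaled_norm(x::AbstractVector{<:Real})
    isempty(x) && return 0.0
    x̄ = float(maximum(abs, x))
    # Guard the division: a zero vector has norm 0, an infinite entry gives Inf
    (x̄ == 0 || isinf(x̄)) && return x̄
    return x̄ * sqrt(sum(xi -> (xi / x̄)^2, x))
end

v = [3e200, 4e200]
sqrt(sum(abs2, v))    # Inf: the squares overflow
scaled_norm(v)        # finite, approximately 5e200
```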

---