---
# Section 2.5: Roundoff Errors and Backward Stability
---

## Base-10 floating point numbers

Consider the following six-digit numbers written in base-10 floating-point notation:

$$
\begin{align}
1.23456 \times 10^3    &= 1234.56 \\
1.23456 \times 10^{-2} &= 0.0123456 \\
\end{align}
$$

We can use this floating-point notation to write down both large and small numbers very compactly:

$$
1.23 \times 10^{50}, \qquad -1.234 \times 10^{-100}.
$$

The first is a number with 3 significant digits, and the second has 4 significant digits.
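As a small sketch, the same notation can be typed in Julia as `e`-literals. The `@sprintf` format string below is an illustrative choice and does not come from the notes.

```julia
using Printf  # provides @sprintf

# Scientific-notation literals: 1.23456e3 means 1.23456 × 10^3.
a = 1.23456e3
b = 1.23456e-2

# Each literal is read as the same Float64 as its plain decimal spelling.
same_large = (a == 1234.56)
same_small = (b == 0.0123456)

# Print 1234.56 with 6 significant digits in scientific notation.
s = @sprintf("%.5e", 1234.56)
println(s)
```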

---

## Roundoff error

Suppose we compute in base 10 and keep only 4 significant digits:

$$
(1.112 \times 10^1) \times (1.112 \times 10^2) = \fbox{1.236}544 \times 10^3 \xrightarrow{\text{roundoff}} 1.237 \times 10^3
$$

Therefore, we have made the following error:

$$
\delta x = 1.237 \times 10^3 - 1.236544 \times 10^3 = 0.456
$$

The relative error is:

$$
\frac{\delta x}{x} = \frac{0.456}{1.236544 \times 10^3} \approx 3.7 \times 10^{-4} \approx 0.04 \%
$$
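The same calculation can be reproduced in Julia. This is only a sketch: it assumes a Julia version where `round` accepts the `sigdigits` keyword, and it uses rational numbers for the exact product so that binary rounding does not blur the base-10 illustration.

```julia
# Exact product 11.12 × 111.2 = 1236.544, held as a rational number.
exact = (1112 // 100) * (1112 // 10)

# Keep only 4 significant decimal digits.
rounded = round(Float64(exact), sigdigits = 4)

# Absolute and relative roundoff error.
abs_err = rounded - Float64(exact)
rel_err = abs_err / Float64(exact)

println((rounded, abs_err, rel_err))
```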

---

## Floating-point arithmetic

Real numbers are stored on a computer using a base-2 floating-point notation that follows the IEEE 754 floating-point standard. The most common formats are:

1. **half precision** using 16 bits (`Float16`)
2. **single precision** using 32 bits (`Float32`)
3. **double precision** using 64 bits (`Float64`)

Julia also has an **arbitrary precision** floating-point data type called `BigFloat`. It is excellent if you need more precision, but it is also much slower.

In [1]:
?AbstractFloat

search: AbstractFloat



No documentation found.

**Summary:**

```julia
abstract AbstractFloat <: Real
```

**Subtypes:**

```julia
BigFloat
Float16
Float32
Float64
```


---

## Description of IEEE double floating-point format (`Float64`)

Suppose $x$ is a floating-point number stored in the following 64 bits:

| 1 | 2 | $\cdots$ | 12 | 13 | $\cdots$ | 64 |
|:-:|:-:|:--------:|:--:|:--:|:--------:|:--:|
|$s$|$e_{10}$| $\cdots$ |$e_0$|$f_1$|$\cdots$|$f_{52}$|

- 1 bit $s$ represents the **sign**
- 11 bits $e_{10} \cdots e_{0}$ represent the **exponent**
- 52 bits $f_1 \cdots f_{52}$ represent the **fraction** (the stored part of the **significand**, also called the **mantissa**)

Then

$$ x = (-1)^s \left[1.f_1 \cdots f_{52}\right]_2 \times 2^{(e-1023)}.$$

Note that $x$ is **normalized**: its leading binary digit is always $1$, so that digit does not need to be stored.

The exponent $e$ is defined by

$$e = \left[e_{10} \cdots e_{0}\right]_2 = e_{10} 2^{10} + \cdots + e_1 2^1 + e_0 2^0.$$
  
Thus, $e \in \left[0, 2^{11}-1\right] = [0, 2047]$. Since $e = 0$ and $e = 2047$ are reserved for special floating-point values, we actually have $e \in [1, 2046]$. 

The $1023$ subtracted in the exponent is called the **bias**, so the actual exponent satisfies $e-1023 \in [-1022,1023]$.

Finally, we have

$$\left[1.f_1 \cdots f_{52}\right]_2 = 1 + \frac{f_1}{2^1} + \frac{f_2}{2^2} + \cdots + \frac{f_{52}}{2^{52}}.$$
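The layout above can be checked with a few bit operations. The following is a minimal sketch; the helper names `decode_float64` and `rebuild_float64` are hypothetical and chosen only for this illustration. It handles normalized numbers only.

```julia
# Split a Float64 into its sign bit s, biased exponent e, and 52-bit fraction f.
function decode_float64(x::Float64)
    w = reinterpret(UInt64, x)          # the raw 64 bits
    s = Int(w >> 63)                    # bit 1
    e = Int((w >> 52) & 0x7ff)          # bits 2 to 12
    f = w & 0x000fffffffffffff          # bits 13 to 64
    return s, e, f
end

# Rebuild x = (-1)^s [1.f]_2 × 2^(e - 1023) for a normalized number.
function rebuild_float64(s::Int, e::Int, f::UInt64)
    1 <= e <= 2046 || error("only normalized numbers are handled in this sketch")
    significand = 1 + f / 2.0^52
    return (-1.0)^s * significand * 2.0^(e - 1023)
end

s, e, f = decode_float64(6.5)           # 6.5 = [1.101]_2 × 2^2
println((s, e, f, rebuild_float64(s, e, f)))
```

Every step in `rebuild_float64` is exact in `Float64` arithmetic, so the rebuilt value agrees bit for bit with the input.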

---

## Example

The 64-bit floating-point number

$$
\begin{array}{|c|c|c|}
\hline
1 & 10000000010 & 1011010000000000000000000000000000000000000000000000 \\
\hline
\end{array}
$$

is

$$
\begin{split}
x & = -[1.101101]_2 \times 2^{(1026-1023)} \\
  & = -[1.101101]_2 \times 2^{3} \\
  & = -[1101.101]_2 \\
  & = -\left(1 \cdot 8 + 1 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 + 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{8}\right)  \\
  & = -13.625
\end{split}
$$

In [2]:
?bits

search: bits bitstype isbits flipbits! combinations bitbroadcast



```
bits(n)
```

A string giving the literal bit representation of a number.


In [3]:
x = -13.625

-13.625

In [4]:
b = bits(-13.625)

"1100000000101011010000000000000000000000000000000000000000000000"

In [5]:
s, e, f = b[1], b[2:12], b[13:64]

('1',"10000000010","1011010000000000000000000000000000000000000000000000")

In [6]:
0b10000000010

0x0402

In [7]:
Int(ans)

1026

---

## Example

Even if a number can be represented exactly in base-10 with a finite number of digits, it may require an infinite number of digits in base-2.

$$
0.1 = \left[0.000110011001\ldots\right]_2 = \left[1.\overline{1001}\right]_2 \times 2^{-4}
$$

Therefore, $0.1$ cannot be represented exactly as a binary floating-point number; the literal `0.1` is rounded to the nearest `Float64`.

In [1]:
x = 0.1

0.1

In [2]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"01111111011","1001100110011001100110011001100110011001100110011010")

In [3]:
big(0.1)

1.000000000000000055511151231257827021181583404541015625000000000000000000000000e-01

In [4]:
big(1)/10

1.000000000000000000000000000000000000000000000000000000000000000000000000000002e-01

In [5]:
(big(0.1) - big(1)/10)/0.1

5.551115123125782393969367238496382179237673776462814776562549859295662966997906e-17
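A well-known consequence of this rounding is that decimal identities can fail in `Float64`. The sketch below uses only standard Julia functions (`nextfloat`, `isapprox`); the choice of `0.1 + 0.2` as the test case is ours.

```julia
total = 0.1 + 0.2

# The sum is not the Float64 nearest to 0.3 ...
exactly_equal = (total == 0.3)

# ... it is the very next Float64 above it.
one_ulp_off = (total == nextfloat(0.3))

# Comparing with a tolerance is the usual remedy.
close_enough = isapprox(total, 0.3)

println((exactly_equal, one_ulp_off, close_enough))
```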

---

## Largest `Float64`

$$\left(2 - 2^{-52}\right) \times 2^{1023} \approx 1.8 \times 10^{308}$$

In [6]:
?realmax

search: realmax realmin readdlm ReadOnlyMemoryError



```
realmax(T)
```

The highest finite value representable by the given floating-point DataType `T`.


In [7]:
realmax(BigFloat)

5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282

In [8]:
x = realmax(Float64)

1.7976931348623157e308

In [9]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"11111111110","1111111111111111111111111111111111111111111111111111")
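The formula for the largest `Float64` can be checked directly. This sketch assumes a recent Julia release, where `realmax` is spelled `floatmax`.

```julia
# All 52 fraction bits set gives [1.11...1]_2 = 2 - 2^(-52); the largest exponent is 1023.
largest = (2 - 2.0^-52) * 2.0^1023

matches_floatmax = (largest == floatmax(Float64))

# The next representable value above it is Inf.
next_is_inf = (nextfloat(largest) == Inf)

println((largest, matches_floatmax, next_is_inf))
```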

---

## Exercise

Suggest some calculations with `x = realmax(Float64)` that lead to strange results.

In [10]:
x = realmax(Float64)

1.7976931348623157e308

In [11]:
2*x

Inf

In [12]:
bits(Inf)

"0111111111110000000000000000000000000000000000000000000000000000"

In [13]:
x + 1

1.7976931348623157e308

In [14]:
(x + 1) - x

0.0

In [15]:
1/x

5.562684646268003e-309

In [16]:
1/(2*x)

0.0

In [17]:
(1/x)/2

2.781342323134e-309

In [18]:
?promote

search: promote promote_type promote_rule promote_shape pointer_from_objref



```
promote(xs...)
```

Convert all arguments to their common promotion type (if any), and return them all (as a tuple).


---

## Smallest (normalized) `Float64`

$$2^{-1022} \approx 2.2 \times 10^{-308}$$

In [19]:
x = realmin(Float64)

2.2250738585072014e-308

In [20]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"00000000001","0000000000000000000000000000000000000000000000000000")

---

## Exercise

Suggest some calculations with `x = realmin(Float64)` that lead to strange results.

In [21]:
x = realmin(Float64)

2.2250738585072014e-308

In [22]:
x/2

1.1125369292536007e-308

In [23]:
1/x

4.49423283715579e307

In [24]:
x^(-1000)

Inf

In [25]:
x^1000

0.0

In [27]:
sqrt(x)

1.4916681462400413e-154

In [34]:
x/2^50

2.0e-323

---

## Denormalized `Float64`

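The IEEE floating-point standard also allows **denormalized** (subnormal) numbers that are smaller in magnitude than `realmin(Float64)`. Denormalized floats are stored with $e = 0$ and a nonzero fraction; the leading digit is then $0$ instead of $1$, and the value is $x = (-1)^s \left[0.f_1 \cdots f_{52}\right]_2 \times 2^{-1022}$.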

In [35]:
# Let's compute the smallest Float64 that is not zero

e = -1022
x = 2.0^e

while x != 0.0
    @printf "%24.16e = 2.0^(%5d)\n" x e
    x /= 2.0
    e -= 1
end

x, e

 2.2250738585072014e-308 = 2.0^(-1022)
 1.1125369292536007e-308 = 2.0^(-1023)
 5.5626846462680035e-309 = 2.0^(-1024)
 2.7813423231340017e-309 = 2.0^(-1025)
 1.3906711615670009e-309 = 2.0^(-1026)
 6.9533558078350043e-310 = 2.0^(-1027)
 3.4766779039175022e-310 = 2.0^(-1028)
 1.7383389519587511e-310 = 2.0^(-1029)
 8.6916947597937554e-311 = 2.0^(-1030)
 4.3458473798968777e-311 = 2.0^(-1031)
 2.1729236899484389e-311 = 2.0^(-1032)
 1.0864618449742194e-311 = 2.0^(-1033)
 5.4323092248710971e-312 = 2.0^(-1034)
 2.7161546124355486e-312 = 2.0^(-1035)
 1.3580773062177743e-312 = 2.0^(-1036)
 6.7903865310888714e-313 = 2.0^(-1037)
 3.3951932655444357e-313 = 2.0^(-1038)
 1.6975966327722179e-313 = 2.0^(-1039)
 8.4879831638610893e-314 = 2.0^(-1040)
 4.2439915819305446e-314 = 2.0^(-1041)
 2.1219957909652723e-314 = 2.0^(-1042)
 1.0609978954826362e-314 = 2.0^(-1043)
 5.3049894774131808e-315 = 2.0^(-1044)
 2.6524947387065904e-315 = 2.0^(-1045)
 1.3262473693532952e-315 = 2.0^(-1046)
 6.6312368467664760e-316 

(0.0,-1075)

In [37]:
2.0^-1072

2.0e-323

In [39]:
x = 2.0^-1074

5.0e-324

In [40]:
x/2

0.0

In [38]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"00000000000","0000000000000000000000000000000000000000000000000001")

In [41]:
x = 2.0^-1075

0.0

In [42]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"00000000000","0000000000000000000000000000000000000000000000000000")

In [43]:
x = 2.0^-1074

5.0e-324

In [44]:
1/x

Inf

In [45]:
1/realmin(Float64)

4.49423283715579e307

In [47]:
1/(realmin(Float64)/2^2)

Inf

---

## Smallest (denormalized) `Float64`

$$2^{-1074} \approx 5 \times 10^{-324}$$
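The smallest denormalized number can also be reached without a loop. This is a sketch assuming a recent Julia release, where `realmin` is spelled `floatmin` and `issubnormal` is available in `Base`.

```julia
# The first Float64 above zero.
tiny = nextfloat(0.0)

is_two_to_minus_1074 = (tiny == 2.0^-1074)

# Its bit pattern has exponent bits all zero and only the last fraction bit set.
raw = reinterpret(UInt64, tiny)

# Denormalized numbers sit strictly between 0 and floatmin(Float64).
tiny_is_subnormal = issubnormal(tiny)
floatmin_is_subnormal = issubnormal(floatmin(Float64))

# Halving the smallest denormalized number underflows to zero.
underflows = (tiny / 2 == 0.0)

println((tiny, raw, tiny_is_subnormal, floatmin_is_subnormal, underflows))
```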

---

## Other special floats

- `0.0` and `-0.0`: $$e_{10} \cdots e_0 = 0 \cdots 0 \quad \text{and} \quad f_1 \cdots f_{52} = 0 \cdots 0$$
- `Inf` and `-Inf`: $$e_{10} \cdots e_0 = 1 \cdots 1 \quad \text{and} \quad f_1 \cdots f_{52} = 0 \cdots 0$$
- `NaN` (not-a-number): $$e_{10} \cdots e_0 = 1 \cdots 1 \quad \text{and} \quad f_1 \cdots f_{52} \neq 0$$

From [Julia Manual: Mathematical Operations and Elementary Functions](http://julia.readthedocs.org/en/latest/manual/mathematical-operations/):

- Positive zero is equal but not greater than negative zero.
- `Inf` is equal to itself and greater than everything else except `NaN`.
- `-Inf` is equal to itself and less than everything else except `NaN`.
- `NaN` is not equal to, not less than, and not greater than anything, including itself.

In [48]:
bits(0.0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [49]:
bits(-0.0)

"1000000000000000000000000000000000000000000000000000000000000000"

In [52]:
-0.0 == 0.0

true

In [51]:
-0.0 === 0.0

false

In [53]:
?===

search: === == .== !==



```
is(x, y) -> Bool
===(x,y) -> Bool
≡(x,y) -> Bool
```

Determine whether `x` and `y` are identical, in the sense that no program could distinguish them. Compares mutable objects by address in memory, and compares immutable objects (such as numbers) by contents at the bit level. This function is sometimes called `egal`.

```rst
..  ===(x, y)
           ≡(x,y)

See the :func:`is` operator
```


In [54]:
bits(Inf)

"0111111111110000000000000000000000000000000000000000000000000000"

In [55]:
bits(-Inf)

"1111111111110000000000000000000000000000000000000000000000000000"

In [56]:
bits(NaN)

"0111111111111000000000000000000000000000000000000000000000000000"

In [57]:
NaN

NaN

In [58]:
NaN == NaN

false

In [59]:
NaN != NaN

true

In [60]:
x = [1, 2, 3, NaN]

4-element Array{Float64,1}:
   1.0
   2.0
   3.0
 NaN  

In [61]:
sum(x)

NaN

In [62]:
0.0/0.0

NaN

In [66]:
A = rand(5, 5)
A[3,3] = NaN
A

5x5 Array{Float64,2}:
 0.602226  0.248673     0.994511  0.701301  0.00587779
 0.591325  0.729359     0.931586  0.534998  0.647471  
 0.525481  0.506816   NaN         0.611615  0.163879  
 0.979074  0.0712258    0.538318  0.147754  0.749601  
 0.576588  0.0828709    0.765546  0.161188  0.314293  

In [67]:
x = rand(5)

A*x

5-element Array{Float64,1}:
   0.951309
   0.925458
 NaN       
   0.902684
   0.791582
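Because a single `NaN` propagates through sums and matrix products, it is often useful to test for it explicitly. The sketch below uses a small fixed matrix (our own choice, not the random one above) together with the standard functions `isnan`, `any`, and `findall`.

```julia
A = [1.0 2.0;
     NaN 4.0]
x = [1.0, 1.0]

y = A * x                        # only the second row involves the NaN

has_nan = any(isnan, y)          # true if any entry is NaN
where_nan = findall(isnan, y)    # indices of the NaN entries

println((y, has_nan, where_nan))
```

Note that `y .== NaN` would not work for this purpose, since `NaN` is not equal to anything, including itself.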

---

## Exercise

Suggest some calculations involving `0.0`, `-0.0`, `Inf`, `-Inf`, and `NaN`.

In [70]:
-1/2^1000

-Inf

In [71]:
1/-Inf

-0.0

In [75]:
sign(0.0)

0.0

In [76]:
?sign

search: sign signif signed Signed signbit significand Unsigned unsigned flipsign



```
sign(x)
```

Return zero if `x==0` and $x/|x|$ otherwise (i.e., ±1 for real `x`).


In [77]:
sign(-0.0)

-0.0

In [78]:
NaN === NaN

true

---

## Machine epsilon `eps(Float64)`

`1.0 + eps(Float64)` is the first `Float64` that is larger than `1.0`:

$$\mathtt{eps(Float64)} = 2^{-52} \approx 2.2 \times 10^{-16}$$

In [80]:
x = nextfloat(1.0)

1.0000000000000002

In [81]:
bits(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

In [82]:
bits(x)

"0011111111110000000000000000000000000000000000000000000000000001"

In [84]:
x - 1.0

2.220446049250313e-16

In [85]:
?eps

search: eps RepString @elapsed indexpids expanduser escape_string peakflops



```
eps(::DateTime) -> Millisecond
eps(::Date) -> Day
```

Returns `Millisecond(1)` for `DateTime` values and `Day(1)` for `Date` values.

```
eps(x)
```

The distance between `x` and the next larger representable floating-point value of the same `DataType` as `x`.

```
eps(T)
```

The distance between 1.0 and the next larger representable floating-point value of `DataType` `T`. Only floating-point types are sensible arguments.

```
eps()
```

The distance between 1.0 and the next larger representable floating-point value of `Float64`.


In [86]:
eps()

2.220446049250313e-16

In [87]:
eps(Float16)

Float16(0.000977)

In [88]:
eps(Float32)

1.1920929f-7

In [91]:
typeof(1e2)

Float64

In [93]:
typeof(1f2)

Float32

In [94]:
1.0 + eps() > 1.0

true

In [95]:
1.0 + eps()/2 > 1.0

false

In [96]:
1.0 + eps()/2 == 1.0

true

In [98]:
x = 1.0 + eps()

1.0000000000000002

In [99]:
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"01111111111","0000000000000000000000000000000000000000000000000001")

In [100]:
x = 1.0
b = bits(x)
s, e, f = b[1], b[2:12], b[13:64]

('0',"01111111111","0000000000000000000000000000000000000000000000000000")
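Machine epsilon can also be found experimentally by repeated halving. The function name `machine_epsilon` is hypothetical; the sketch assumes the default round-to-nearest mode.

```julia
# Halve ϵ until adding ϵ/2 to 1 no longer changes the result.
function machine_epsilon(::Type{T}) where {T<:AbstractFloat}
    ϵ = one(T)
    while one(T) + ϵ / 2 > one(T)
        ϵ /= 2
    end
    return ϵ
end

for T in (Float16, Float32, Float64)
    println(T, ": ", machine_epsilon(T), " (built-in eps: ", eps(T), ")")
end
```

The loop stops at $2^{-52}$ for `Float64` because $1 + 2^{-53}$ lies exactly halfway between `1.0` and the next float and is rounded back to `1.0`.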

---

## The unit roundoff $u$

$u = $ `eps(Float64)/2.0` is the largest possible relative error when a real number in the normalized range is rounded to the nearest `Float64`:

$$u = 2^{-53} \approx 1.1 \times 10^{-16}$$

This is why `Float64` carries only about **16 significant decimal digits**.

In [101]:
u = eps()/2.0

1.1102230246251565e-16
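The bound can be tested by rounding a few high-precision numbers to `Float64`. This is a sketch: the helper `rounding_error` and the sample values are our own choices, and `BigFloat` values stand in for "real numbers".

```julia
u = eps(Float64) / 2

# Relative error made when the BigFloat x is rounded to the nearest Float64.
function rounding_error(x::BigFloat)
    return Float64(abs(Float64(x) - x) / abs(x))
end

samples = [big(1) / 3, big(2) / 7, sqrt(big(2)), big(π)]
errors = [rounding_error(x) for x in samples]

println(errors)
println(all(errors .<= u))
```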

---

## Floating-point error bounds

Let $\mathrm{fl}(C)$ denote the value of $C$ computed in floating-point arithmetic, and let $\varepsilon$ be its relative error.

Then

$$
\frac{\mathrm{fl}(C) - C}{C} = \varepsilon 
\qquad \implies \qquad
\mathrm{fl}(C) = C(1 + \varepsilon)
$$

For floating-point numbers $x$ and $y$, the IEEE floating-point standard guarantees (barring overflow and underflow):

$$
\begin{split}
\mathrm{fl}(x \pm y) &= (x \pm y)(1 + \varepsilon_1) \\
\mathrm{fl}(x \times y) &= (x \times y)(1 + \varepsilon_2) \\
\mathrm{fl}(x \div y) &= (x \div y)(1 + \varepsilon_3) \\
\end{split}
$$

where $|\varepsilon_i| \leq u$ for $i = 1,2,3$, and $u$ is the unit roundoff.
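These guarantees can be spot-checked by comparing each `Float64` operation with a high-precision reference. This is a sketch under the assumption that `BigFloat` arithmetic at its default precision is accurate enough to serve as the "exact" result; the helper `relative_error` and the inputs are ours.

```julia
u = eps(Float64) / 2

# Relative error of the Float64 result of op(x, y) against a BigFloat reference.
function relative_error(op, x::Float64, y::Float64)
    exact = op(big(x), big(y))      # reference value in extended precision
    computed = op(x, y)             # ordinary Float64 arithmetic
    return Float64(abs(computed - exact) / abs(exact))
end

x, y = 0.1, 0.3
errors = [relative_error(op, x, y) for op in (+, -, *, /)]

println(errors)
println(all(errors .<= u))
```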


---

## Roundoff error accumulation

Suppose we already have errors in $x$ and $y$:

$$
\hat{x} = x(1 + \varepsilon_1), \qquad |\varepsilon_1| \ll 1
$$

$$
\hat{y} = y(1 + \varepsilon_2), \qquad |\varepsilon_2| \ll 1
$$

**Multiplication:**

$$
\begin{split}
\hat{x}\hat{y} 
&= x(1 + \varepsilon_1) y(1 + \varepsilon_2)\\
&= xy (1 + \varepsilon_1 + \varepsilon_2 + \varepsilon_1 \varepsilon_2)\\
&= xy (1 + \hat{\varepsilon}),\\
\end{split}
$$

where $\hat{\varepsilon} = \varepsilon_1 + \varepsilon_2 + \varepsilon_1 \varepsilon_2$, so $|\hat{\varepsilon}| \ll 1$.

Now let's compare the true value of $xy$ with the computed value of $\mathrm{fl}(\hat{x}\hat{y})$.

We have

$$
\begin{split}
\mathrm{fl}(\hat{x}\hat{y}) 
&= \hat{x}\hat{y}(1 + \varepsilon_3) \qquad \qquad (|\varepsilon_3| \leq u) \\
&= xy (1 + \hat{\varepsilon})(1 + \varepsilon_3) \\
&= xy (1 + \hat\varepsilon + \varepsilon_3 + \hat\varepsilon \varepsilon_3)\\
&= xy (1 + \varepsilon),\\
\end{split}
$$

where $\varepsilon = \hat\varepsilon + \varepsilon_3 + \hat\varepsilon \varepsilon_3$, so $|\varepsilon| \ll 1$.

**Division:**

$$
\begin{split}
\frac{\hat{x}}{\hat{y}} 
&= \frac{x(1 + \varepsilon_1)}{y(1 + \varepsilon_2)} \\
&\approx \frac{x}{y}(1 + \varepsilon_1)(1 - \varepsilon_2) \\
&= \frac{x}{y} (1 + \varepsilon_1 - \varepsilon_2 - \varepsilon_1 \varepsilon_2)\\
&= \frac{x}{y} (1 + \hat{\varepsilon}),\\
\end{split}
$$

where we used $\frac{1}{1 + \varepsilon_2} \approx 1 - \varepsilon_2$ for $|\varepsilon_2| \ll 1$, and $\hat{\varepsilon} = \varepsilon_1 - \varepsilon_2 - \varepsilon_1 \varepsilon_2$, so $|\hat{\varepsilon}| \ll 1$.

Therefore, using a similar argument as above, we have

$$
\mathrm{fl}\left(\frac{\hat{x}}{\hat{y}}\right) = \frac{x}{y} (1 + \varepsilon), \qquad \text{where $|\varepsilon| \ll 1$}.
$$

**Addition and subtraction:**

$$
\begin{split}
\hat{x} + \hat{y} 
&= x(1 + \varepsilon_1) + y(1 + \varepsilon_2) \\
&= (x + y) + x \varepsilon_1 + y \varepsilon_2 \\
&= (x + y)\left(1 + \frac{x}{x+y} \varepsilon_1 + \frac{y}{x+y} \varepsilon_2 \right) \\
&= (x + y)\left(1 + \hat\varepsilon \right), \\
\end{split}
$$

where $\hat\varepsilon = \frac{x}{x+y} \varepsilon_1 + \frac{y}{x+y} \varepsilon_2$.

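If $|x + y|$ is very small compared to $|x|$ or $|y|$, then the factors $\frac{x}{x+y}$ and $\frac{y}{x+y}$ are large, so $\hat\varepsilon$ can be much larger than $\varepsilon_1$ and $\varepsilon_2$.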

---

### Example:

Suppose

$$
\begin{align}
x &= 1.23450, & \hat{x} &= 1.23451, \\
y &= -1.23460, & \hat{y} &= -1.23459. \\
\end{align}
$$

Then:

$$
\hat{x} = x(1 + \varepsilon_1), \qquad \text{where $|\varepsilon_1| \approx 8 \times 10^{-6}$};
$$

$$
\hat{y} = y(1 + \varepsilon_2), \qquad \text{where $|\varepsilon_2| \approx 8 \times 10^{-6}$}.
$$

In [3]:
x = 1.23450
xhat = 1.23451
ɛ₁ = (xhat - x)/x
abs(ɛ₁)

8.100445524556916e-6

In [4]:
y = -1.23460
yhat = -1.23459
ɛ₂ = (yhat - y)/y
abs(ɛ₂)

8.09978940534867e-6

In [5]:
xhat + yhat

-8.000000000008001e-5

In [6]:
x + y

-9.999999999998899e-5

In [7]:
ɛ = ((xhat + yhat) - (x + y))/(x + y)

-0.19999999999911183

Therefore,

$$
\hat{x} + \hat{y} = (x+y)(1 + \varepsilon), \qquad \text{where $|\varepsilon| \approx 2 \times 10^{-1}$}.
$$

The relative error has jumped from about $8 \times 10^{-6}$ in the inputs to about $0.2$ in the sum. This is called **catastrophic cancellation** and can lead to a sudden loss of accuracy in a calculation.
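The same effect appears with nothing but roundoff as the source of error. In the sketch below (our own example, not from the notes), the tiny rounding error made when forming `1.0 + small` is exposed by subtracting `1.0` again.

```julia
small = 1e-15

# 1.0 + small is rounded to the nearest Float64, which is 1 + 5*eps().
# The subtraction is then exact, so it returns 5*eps() instead of small.
computed = (1.0 + small) - 1.0

rel_err = abs(computed - small) / small

println((computed, rel_err))
```

The addition alone has a relative error below the unit roundoff; the subtraction magnifies it to roughly 11 percent of the result.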

---

## Backward error analysis

We say that a computation $C(x_1,\ldots,x_n)$ is **backward stable** if

$$
\mathrm{fl}(C(x_1,\ldots,x_n)) = C(\bar{x}_1,\ldots,\bar{x}_n),
$$

where the relative error in each of $\bar{x}_1,\ldots,\bar{x}_n$ is a **small multiple** of the unit roundoff $u$. In other words, the computed result is the exact result for slightly perturbed inputs.

---

## Example:

Suppose the computation $C(A, b)$ returns the solution $x$ to $Ax = b$ and 

$$
\hat{x} = \mathrm{fl}(C(A, b)) = C(\hat{A}, \hat{b}).
$$

Then $\hat{A} \hat{x} = \hat{b}$.

If $C(A, b)$ is **backward stable**, then $(A + \delta A)\hat{x} = b + \delta b$ with $\delta A = \hat{A} - A$ and $\delta b = \hat{b} - b$, where

$$
\frac{\lVert \delta A \rVert}{\lVert A \rVert} \quad \text{and} \quad
\frac{\lVert \delta b \rVert}{\lVert b \rVert} \quad \text{are small multiples of $u$}.
$$

If $A$ is also **well-conditioned** (that is, $\kappa(A)$ is not large), then the first-order perturbation bound for $\delta x = \hat{x} - x$,

$$
\frac{\lVert \delta x \rVert}{\lVert x \rVert}
\lesssim 
\kappa(A) \left( \frac{\lVert \delta A \rVert}{\lVert A \rVert} + \frac{\lVert \delta b \rVert}{\lVert b \rVert} \right),
$$

implies that $\frac{\lVert \delta x \rVert}{\lVert x \rVert}$ is also a small multiple of $u$, so $\hat{x}$ is an **accurate approximation** of the true solution $x$.

---

## Residual test

After computing $\hat{x} = \mathrm{fl}(C(A, b))$, we can check that the computation is **backward stable** by finding the **residual**,

$$
\hat{r} = b - A \hat{x}.
$$

Then $A \hat{x} = b + \delta b$, where $\delta b = -\hat{r}$.

If $\frac{\lVert \delta b \rVert}{\lVert b \rVert}$ is a small multiple of $u$, then the computation of $\hat{x}$ was **backward stable** (with $\delta A = 0$).
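The residual test takes only a few lines in Julia. The following is a sketch, not a prescribed implementation: the helper `relative_residual`, the $2 \times 2$ system, and the use of the backslash solver as the computation $C(A, b)$ are all our own choices.

```julia
using LinearAlgebra  # provides norm

# Relative size of the residual r = b - A*xhat, measured against b.
relative_residual(A, b, xhat) = norm(b - A * xhat) / norm(b)

A = [4.0 1.0;
     1.0 3.0]
b = [1.0, 2.0]

xhat = A \ b                     # computed solution; the exact one is [1/11, 7/11]

u = eps(Float64) / 2
ratio = relative_residual(A, b, xhat) / u   # residual as a multiple of u

println((xhat, ratio))
```

If `ratio` is a modest number, the computed solution passes the residual test.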

---

## Summary

To verify the accuracy of $\hat{x}$, we need to check **two things**:

1. $\hat{x} = \mathrm{fl}(C(A, b))$ is **backward stable**.
2. $A$ is **well-conditioned**.
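Both checks matter, as an ill-conditioned example shows. The sketch below uses the $8 \times 8$ Hilbert matrix, a standard ill-conditioned test matrix that is our own choice and does not appear in the notes. The residual is expected to be tiny even though the condition number is enormous, so the residual test alone says little about the error in $\hat{x}$.

```julia
using LinearAlgebra  # provides norm and cond

n = 8
H = [1 / (i + j - 1) for i in 1:n, j in 1:n]   # Hilbert matrix

x_true = ones(n)
b = H * x_true                  # right-hand side built from a known solution
xhat = H \ b                    # computed solution

rel_residual = norm(b - H * xhat) / norm(b)           # check 1: backward stability
kappa = cond(H)                                       # check 2: conditioning
rel_error = norm(xhat - x_true) / norm(x_true)        # actual error in xhat

println((rel_residual, kappa, rel_error))
```

Comparing `rel_error` with `rel_residual` shows how much of the accuracy is lost to conditioning rather than to the solver.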

---