---
# Section 2.5: Roundoff Errors and Backward Stability

---

## Base-10 floating point numbers

Consider the following six-digit numbers written in base-10 floating-point notation:

$$
\begin{align}
1.23456 \times 10^3    &= 1234.56 \\
1.23456 \times 10^{-2} &= 0.0123456 \\
\end{align}
$$

We can use this floating-point notation to write down both large and small numbers very compactly:

$$
1.23 \times 10^{50}, \qquad -1.234 \times 10^{-100}.
$$

The first is a number with 3 significant digits, and the second has 4 significant digits.

---

## Roundoff error

Suppose we work in base 10 and keep only 4 significant digits:

$$
(1.112 \times 10^1) \times (1.112 \times 10^2) = \fbox{1.236}544 \times 10^3 \xrightarrow{\text{roundoff}} 1.237 \times 10^3
$$

Rounding has therefore introduced the following absolute error:

$$
\delta x = 1.237 \times 10^3 - 1.236544 \times 10^3 = 0.456
$$

The relative error is:

$$
\frac{\delta x}{x} = \frac{0.456}{1.236544 \times 10^3} \approx 4 \times 10^{-4} = 0.04 \%
$$
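
The arithmetic above can be reproduced in a few lines of Julia. This is only a sketch: it relies on `round` with the `sigdigits` keyword to imitate 4-digit decimal arithmetic, and the product is formed in `Float64`, whose own roundoff is negligible at this scale. The variable names are chosen here for illustration.

```julia
# Imitate base-10 arithmetic with 4 significant digits.
a = 11.12                  # 1.112 × 10^1
b = 111.2                  # 1.112 × 10^2
product = a * b            # ≈ 1236.544 (Float64 roundoff is negligible here)
rounded = round(product, sigdigits = 4)   # keep 4 significant digits

δx = rounded - product     # absolute roundoff error
relerr = δx / product      # relative roundoff error

println("rounded = ", rounded)
println("δx      = ", δx)
println("δx / x  = ", relerr)
```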

---

## Machine epsilon `eps(Float64)`

`1.0 + eps(Float64)` is the first `Float64` that is larger than `1.0`:

$$\mathtt{eps(Float64)} = 2^{-52} \approx 2.2 \times 10^{-16}$$
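
A quick way to check this definition is to compare `eps(Float64)` with the gap to the next representable number, which Julia exposes through `nextfloat`. The sketch below also shows that adding half of that gap to `1.0` rounds back to `1.0`; the name `machine_eps` is just a local label.

```julia
machine_eps = eps(Float64)

println(machine_eps == 2.0^(-52))               # eps is exactly 2^-52
println(nextfloat(1.0) == 1.0 + machine_eps)    # 1.0 + eps is the next Float64
println(1.0 + machine_eps / 2 == 1.0)           # half the gap rounds back to 1.0
```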

---

## The unit roundoff $u$

$u = $ `eps(Float64)/2.0` is the largest possible relative error made when a real number in the normal range is rounded to the nearest `Float64`:

$$u = 2^{-53} \approx 1.1 \times 10^{-16}$$

It is for this reason that `Float64` accuracy is limited to about **16 significant decimal digits**.

In [None]:
u = eps()/2.0
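
To see the bound in action, one can round a few values to `Float64` and measure the relative error against a higher-precision reference. The sketch below assumes that `BigFloat` (256 bits by default) is accurate enough to stand in for the exact value; the helper `rounding_error` and the sample values are illustrative choices.

```julia
u = eps(Float64) / 2

# Relative error made when rounding a high-precision value to the nearest Float64.
rounding_error(x::BigFloat) = Float64(abs(big(Float64(x)) - x) / abs(x))

samples = [big(1) / 3, big(1) / 10, big(2) / 7, sqrt(big(2))]
for x in samples
    err = rounding_error(x)
    println("relative error = ", err, "   ≤ u ? ", err <= u)
end
```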

---

## Floating-point error bounds

Let $\varepsilon$ be the relative error incurred when computing $C$ in floating-point arithmetic.

Then

$$
\frac{\mathrm{fl}(C) - C}{C} = \varepsilon 
\qquad \implies \qquad
\mathrm{fl}(C) = C(1 + \varepsilon)
$$

For floating-point numbers $x$ and $y$, and barring overflow or underflow, the IEEE floating-point standard guarantees:

$$
\begin{split}
\mathrm{fl}(x \pm y) &= (x \pm y)(1 + \varepsilon_1) \\
\mathrm{fl}(x \times y) &= (x \times y)(1 + \varepsilon_2) \\
\mathrm{fl}(x \div y) &= (x \div y)(1 + \varepsilon_3) \\
\end{split}
$$

where $|\varepsilon_i| \leq u$ for $i = 1,2,3$, and $u$ is the unit roundoff.
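
These guarantees can be spot-checked numerically. The sketch below compares each `Float64` operation with the same operation carried out in `BigFloat`, which is assumed to be accurate enough to serve as the exact result; the helper `relerr` and the input pairs are illustrative and not part of the standard.

```julia
u = eps(Float64) / 2

# Relative error of a Float64 result against a BigFloat reference.
relerr(computed::Float64, reference::BigFloat) =
    Float64(abs(big(computed) - reference) / abs(reference))

inputs = [(0.1, 0.3), (1 / 3, 7.7), (1.0e10, 3.3e-5)]
for (x, y) in inputs
    bx, by = big(x), big(y)   # exact conversions of the Float64 inputs
    errs = (relerr(x + y, bx + by), relerr(x - y, bx - by),
            relerr(x * y, bx * by), relerr(x / y, bx / by))
    println((x, y), " => all relative errors ≤ u: ", all(e -> e <= u, errs))
end
```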


---

## Roundoff error accumulation

Suppose we already have errors in $x$ and $y$:

$$
\hat{x} = x(1 + \varepsilon_1), \qquad |\varepsilon_1| \ll 1
$$

$$
\hat{y} = y(1 + \varepsilon_2), \qquad |\varepsilon_2| \ll 1
$$

**Multiplication:**

$$
\begin{split}
\hat{x}\hat{y} 
&= x(1 + \varepsilon_1) y(1 + \varepsilon_2)\\
&= xy (1 + \varepsilon_1 + \varepsilon_2 + \varepsilon_1 \varepsilon_2)\\
&= xy (1 + \hat{\varepsilon}),\\
\end{split}
$$

where $\hat{\varepsilon} = \varepsilon_1 + \varepsilon_2 + \varepsilon_1 \varepsilon_2$, so $|\hat{\varepsilon}| \ll 1$.

Now let's compare the true value of $xy$ with the computed value of $\mathrm{fl}(\hat{x}\hat{y})$.

We have

$$
\begin{split}
\mathrm{fl}(\hat{x}\hat{y}) 
&= \hat{x}\hat{y}(1 + \varepsilon_3) \qquad \qquad (|\varepsilon_3| \leq u) \\
&= xy (1 + \hat{\varepsilon})(1 + \varepsilon_3) \\
&= xy (1 + \hat\varepsilon + \varepsilon_3 + \hat\varepsilon \varepsilon_3)\\
&= xy (1 + \varepsilon),\\
\end{split}
$$

where $\varepsilon = \hat\varepsilon + \varepsilon_3 + \hat\varepsilon \varepsilon_3$, so $|\varepsilon| \ll 1$.

**Division:**

$$
\begin{split}
\frac{\hat{x}}{\hat{y}} 
&= \frac{x(1 + \varepsilon_1)}{y(1 + \varepsilon_2)} \\
&\approx \frac{x}{y}(1 + \varepsilon_1)(1 - \varepsilon_2) \\
&= \frac{x}{y} (1 + \varepsilon_1 - \varepsilon_2 - \varepsilon_1 \varepsilon_2)\\
&= \frac{x}{y} (1 + \hat{\varepsilon}),\\
\end{split}
$$

where we used $\frac{1}{1 + \varepsilon_2} \approx 1 - \varepsilon_2$ (valid to first order, since $|\varepsilon_2| \ll 1$) and $\hat{\varepsilon} = \varepsilon_1 - \varepsilon_2 - \varepsilon_1 \varepsilon_2$, so $|\hat{\varepsilon}| \ll 1$.

Therefore, by the same argument as for multiplication, we have

$$
\mathrm{fl}\left(\frac{\hat{x}}{\hat{y}}\right) = \frac{x}{y} (1 + \varepsilon), \qquad \text{where $|\varepsilon| \ll 1$}.
$$
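
A small numerical experiment shows that relative errors roughly add under multiplication and subtract under division. The values of `x`, `y`, `ε₁` and `ε₂` below are hypothetical, and `BigFloat` is used so that roundoff in the experiment itself does not obscure the effect.

```julia
# Hypothetical exact values and small relative perturbations.
x, y   = big"1.2345", big"6.7890"
ε₁, ε₂ = big"1e-6", big"-2e-6"

xhat = x * (1 + ε₁)
yhat = y * (1 + ε₂)

ε_mul = (xhat * yhat - x * y) / (x * y)   # relative error of the product
ε_div = (xhat / yhat - x / y) / (x / y)   # relative error of the quotient

println("product:  ", Float64(ε_mul), "   vs   ε₁ + ε₂ = ", Float64(ε₁ + ε₂))
println("quotient: ", Float64(ε_div), "   vs   ε₁ - ε₂ = ", Float64(ε₁ - ε₂))
```

In both cases the result stays of the same size as the input errors, which is why multiplication and division are considered safe operations.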

**Addition and subtraction:**

$$
\begin{split}
\hat{x} + \hat{y} 
&= x(1 + \varepsilon_1) + y(1 + \varepsilon_2) \\
&= (x + y) + x \varepsilon_1 + y \varepsilon_2 \\
&= (x + y)\left(1 + \frac{x}{x+y} \varepsilon_1 + \frac{y}{x+y} \varepsilon_2 \right) \\
&= (x + y)\left(1 + \hat\varepsilon \right), \\
\end{split}
$$

where $\hat\varepsilon = \frac{x}{x+y} \varepsilon_1 + \frac{y}{x+y} \varepsilon_2$.

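If $|x + y|$ is very small compared to $|x|$ or $|y|$, then the factors $\frac{x}{x+y}$ and $\frac{y}{x+y}$ are large, so $\hat\varepsilon$ can be much larger than $\varepsilon_1$ and $\varepsilon_2$.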

---

### Example:

Suppose

$$
\begin{align}
x &= 1.23450, & \hat{x} &= 1.23451, \\
y &= -1.23460, & \hat{y} &= -1.23459. \\
\end{align}
$$

Then:

$$
\hat{x} = x(1 + \varepsilon_1), \qquad \text{where $|\varepsilon_1| \approx 8 \times 10^{-6}$};
$$

$$
\hat{y} = y(1 + \varepsilon_2), \qquad \text{where $|\varepsilon_2| \approx 8 \times 10^{-6}$}.
$$

In [None]:
x = 1.23450
xhat = 1.23451
ɛ₁ = (xhat - x)/x   # relative error in xhat, since xhat = x(1 + ɛ₁)
abs(ɛ₁)

In [None]:
y = -1.23460
yhat = -1.23459
ɛ₂ = (yhat - y)/y   # relative error in yhat, since yhat = y(1 + ɛ₂)
abs(ɛ₂)

In [None]:
xhat + yhat

In [None]:
x + y

In [None]:
ɛ = ((xhat + yhat) - (x + y))/(x + y)   # relative error in the computed sum

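Therefore,

$$
\hat{x} + \hat{y} = (x+y)(1 + \varepsilon), \qquad \text{where $|\varepsilon| \approx 2 \times 10^{-1}$},
$$

which is more than four orders of magnitude larger than the relative errors in $\hat{x}$ and $\hat{y}$.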

This is called **catastrophic cancellation** and can lead to a sudden loss of accuracy in a calculation.
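
The size of this error is exactly what the formula $\hat\varepsilon = \frac{x}{x+y} \varepsilon_1 + \frac{y}{x+y} \varepsilon_2$ predicts. The sketch below evaluates the two amplification factors for the numbers in this example and compares the predicted error with the observed one; the names `amp_x`, `amp_y`, `predicted` and `observed` are introduced here for illustration.

```julia
x, xhat = 1.23450, 1.23451
y, yhat = -1.23460, -1.23459

ε₁ = (xhat - x) / x
ε₂ = (yhat - y) / y

# Amplification factors multiplying ε₁ and ε₂ in the formula for the sum.
amp_x = x / (x + y)
amp_y = y / (x + y)

predicted = amp_x * ε₁ + amp_y * ε₂
observed  = ((xhat + yhat) - (x + y)) / (x + y)

println("amplification factors: ", amp_x, ", ", amp_y)
println("predicted = ", predicted, ",  observed = ", observed)
```

Both amplification factors have magnitude of about $1.2 \times 10^4$, which accounts for the loss of accuracy.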

---

## Backward error analysis

We say that a computation $C(x_1,\ldots,x_n)$ is **backward stable** if

$$
\mathrm{fl}(C(x_1,\ldots,x_n)) = C(\bar{x}_1,\ldots,\bar{x}_n),
$$

where the relative error in each $\bar{x}_i$, as an approximation of $x_i$, is a **small multiple** of the unit roundoff $u$.
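
Floating-point addition is perhaps the simplest instance of this definition: since $\mathrm{fl}(x + y) = (x + y)(1 + \varepsilon) = x(1 + \varepsilon) + y(1 + \varepsilon)$ with $|\varepsilon| \leq u$, the computed sum is the exact sum of slightly perturbed inputs. The sketch below checks this for one pair of inputs, using `BigFloat` as a stand-in for exact arithmetic; `xbar` and `ybar` play the role of $\bar{x}_1$ and $\bar{x}_2$.

```julia
u = eps(Float64) / 2
x, y = 0.1, 0.2

s = x + y                       # fl(x + y), computed in Float64
exact = big(x) + big(y)         # exact sum of the stored inputs
ε = (big(s) - exact) / exact    # relative roundoff error of the addition

# Perturbed inputs whose exact sum reproduces the computed result.
xbar = big(x) * (1 + ε)
ybar = big(y) * (1 + ε)

println("|ε| ≤ u ?            ", abs(ε) <= u)
println("xbar + ybar - fl(s): ", Float64(xbar + ybar - big(s)))
```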

---

## Example:

Suppose the computation $C(A, b)$ returns the solution $x$ to $Ax = b$ and 

$$
\hat{x} = \mathrm{fl}(C(A, b)) = C(\hat{A}, \hat{b}).
$$

Then $\hat{A} \hat{x} = \hat{b}$.

If $C(A, b)$ is **backward stable**, then writing $\hat{A} = A + \delta A$ and $\hat{b} = b + \delta b$ gives $(A + \delta A)\hat{x} = b + \delta b$, where

$$
\frac{\lVert \delta A \rVert}{\lVert A \rVert} \quad \text{and} \quad
\frac{\lVert \delta b \rVert}{\lVert b \rVert} \quad \text{are small multiples of $u$}.
$$

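If $A$ is also **well-conditioned**, so that the condition number $\kappa(A)$ is not too large, then the first-order perturbation bound for $\delta x = \hat{x} - x$,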

$$
\frac{\lVert \delta x \rVert}{\lVert x \rVert}
\leq 
\kappa(A) \left( \frac{\lVert \delta A \rVert}{\lVert A \rVert} + \frac{\lVert \delta b \rVert}{\lVert b \rVert} \right)
$$

implies that $\frac{\lVert \delta x \rVert}{\lVert x \rVert}$ is also a small multiple of $u$, so $\hat{x}$ is an **accurate approximation** of the true solution $x$.
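
As an illustration, the sketch below solves a small well-conditioned system with Julia's backslash operator and measures the forward error against a `BigFloat` reference solution of the same stored data. The matrix and right-hand side are hypothetical values chosen for this example, and the reference is assumed to be accurate enough to play the role of the true solution $x$.

```julia
using LinearAlgebra

u = eps(Float64) / 2

# Hypothetical well-conditioned test system (values chosen for illustration).
A = [2.0 1.0 0.3;
     1.0 3.1 0.7;
     0.3 0.7 4.2]
b = [0.1, 0.2, 0.3]

xhat = A \ b                  # computed solution in Float64
xref = big.(A) \ big.(b)      # high-precision reference for the same data
fwd  = Float64(norm(big.(xhat) - xref) / norm(xref))

println("κ(A)              = ", cond(A))
println("forward error     = ", fwd)
println("forward error / u = ", fwd / u)
```

With a small condition number, the forward error is expected to be a modest multiple of $u$.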

---

## Residual test

After computing $\hat{x} = \mathrm{fl}(C(A, b))$, we can check that the computation is **backward stable** by finding the **residual**,

$$
\hat{r} = b - A \hat{x}.
$$

Then $A \hat{x} = b + \delta b$, where $\delta b = -\hat{r}$ (and $\delta A = 0$).

If $\frac{\lVert \delta b \rVert}{\lVert b \rVert}$ is a small multiple of $u$, then the computation of $\hat{x}$ was **backward stable**.
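
The residual test takes only a few lines. In the sketch below the system is a hypothetical example, and the residual is formed in `BigFloat` so that roundoff in the check itself is negligible; forming it directly in `Float64` is the usual practice and normally gives a similar answer.

```julia
using LinearAlgebra

u = eps(Float64) / 2

# Hypothetical test system (values chosen for illustration).
A = [2.0 1.0 0.3;
     1.0 3.1 0.7;
     0.3 0.7 4.2]
b = [0.1, 0.2, 0.3]

xhat = A \ b

# Residual r = b - A*xhat, evaluated in high precision.
r = big.(b) - big.(A) * big.(xhat)
relres = Float64(norm(r) / norm(big.(b)))

println("relative residual     = ", relres)
println("relative residual / u = ", relres / u)
```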

---

## Summary

To verify the accuracy of $\hat{x}$, we need to check **two things**:

1. The computation $\hat{x} = \mathrm{fl}(C(A, b))$ is **backward stable**.
2. $A$ is **well-conditioned**.
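
Both checks matter, because a small residual alone does not guarantee an accurate solution. The sketch below uses the $10 \times 10$ Hilbert matrix, a standard ill-conditioned example that is not taken from the text above: the relative residual is typically tiny, yet the forward error is many orders of magnitude larger. The `BigFloat` solve is assumed to be an adequate reference for the true solution.

```julia
using LinearAlgebra

u = eps(Float64) / 2
n = 10
A = [1 / (i + j - 1) for i in 1:n, j in 1:n]   # Hilbert matrix (ill-conditioned)
b = A * ones(n)

xhat = A \ b
xref = big.(A) \ big.(b)    # high-precision reference for the stored A and b

r = big.(b) - big.(A) * big.(xhat)
relres = Float64(norm(r) / norm(big.(b)))
fwd    = Float64(norm(big.(xhat) - xref) / norm(xref))

println("κ(A)              = ", cond(A))
println("relative residual = ", relres)
println("forward error     = ", fwd)
```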

---