# More on Floating Point Arithmetic 

---

Today we discuss some important issues in floating point arithmetic.
We will start with some examples. Here is one that Texas Instruments calculators of
the 1980s got wrong.

__Example__

Credited to W.J. "Jim" Cody of Argonne National Laboratory.

Suppose we are working in a decimal floating point system with base $\beta =10$ and
exponent range $e \in [-99,99]$. We want to find the modulus (length) of the complex number

$$
z = x + i y.
$$

The formula is

$$
|z| = \sqrt{x^2 + y^2}.
$$

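This formula is not resistant to underflow and overflow: the intermediate squares $x^2$ and $y^2$ can fall outside the exponent range even when $|z|$ itself is representable. Gradual underflow was not common on computing devices of that period.

For instance, take the values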

$$
x= 0.3 \times 10^{-50}, \quad y=0.4 \times 10^{-50}.
$$

Using the formula we get

\begin{eqnarray*}
x^2 = 0.09 \times 10^{-100} \quad \mbox{ underflows to $0$}, \\
y^2 = 0.16 \times 10^{-100} \quad \mbox{ underflows to $0$}.
\end{eqnarray*}

So we get

$$
|z| = \sqrt{0+0} = 0.
$$

The correct answer is $|z| = 0.5 \times 10^{-50}$, which is well within range. Rounding error is not an issue at all here; the problem with this algorithm is underflow in the intermediate squares.
Fortunately, it is correctable.

Let

$$
M = \max \{ |x|,|y|\}, \quad m = \min \{ |x|,|y| \}, \quad r = m/M.
$$ 

Then

$$
|z| = M \sqrt{1+r^2}.
$$

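Since $r \le 1$, the factor $\sqrt{1+r^2}$ lies between $1$ and $\sqrt{2}$, so this gives a good result as long as $|z|$ itself is in range and $M \neq 0$. If $r^2$ underflows, it is negligible compared with $1$ and nothing is lost. For the values above, $M = 0.4 \times 10^{-50}$ and $r = 0.75$, so $|z| = 0.4 \times 10^{-50} \times 1.25 = 0.5 \times 10^{-50}$.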
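The rescaling above can be sketched in Julia as follows. This is only a sketch in IEEE double precision, not the $\beta = 10$, $e \in [-99,99]$ system of the example, so the inputs are moved down to about $10^{-200}$ to trigger the same underflow. The names `naive_modulus` and `scaled_modulus` are hypothetical, and the `M == 0` guard is an added assumption to avoid $0/0$.

```julia
# Direct formula: the squares x^2 and y^2 can underflow or overflow.
naive_modulus(x, y) = sqrt(x^2 + y^2)

# Rescaled formula |z| = M*sqrt(1 + r^2) with r = m/M <= 1.
function scaled_modulus(x, y)
    M = max(abs(x), abs(y))
    m = min(abs(x), abs(y))
    M == 0 && return M          # assumed guard: both inputs are zero, avoid 0/0
    r = m / M
    return M * sqrt(1 + r^2)
end

# Double precision analogue of the example: both squares underflow to zero.
x, y = 0.3e-199, 0.4e-199
println(naive_modulus(x, y))    # underflow makes this 0.0
println(scaled_modulus(x, y))   # close to 0.5e-199
```

Julia's built-in `hypot(x, y)` is intended for the same purpose and would normally be preferred over a hand-written version.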


Now here are some examples where rounding error is the issue.


__Example__

Suppose we wish to evaluate
$$
f(x) = \tan \: x - \sin \: x
$$

In [1]:
f(x)=tan(x)-sin(x)
for x in [1e-5,1e-6,1e-7,1e-8,1e-9,1e-10]
    println((x,f(x)))
end

(1.0e-5,5.000001606324245e-16)
(1.0e-6,4.999611971168508e-19)
(1.0e-7,5.029258124322408e-22)
(1.0e-8,0.0)
(1.0e-9,0.0)
(1.0e-10,0.0)


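We are interested in $x$ near zero, say, $x = 10^{-10}$. As the output shows, the direct formula returns $0$ for $x \le 10^{-8}$: the computed values of $\tan x$ and $\sin x$ are identical, so the subtraction cancels every digit.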

In MATLAB the direct computation might look like this.

```
>> x = 1e-10;
>> tx = tan(x)
>> sx = sin(x)
>> fx = tx - sx
```
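To see why the difference collapses to zero, it may help to compare the true size of $f(x)$ with the spacing of doubles near $x$. The sketch below uses Julia's `eps(x)`, which returns the gap between `x` and the next larger floating point number. The variable names mirror the MATLAB lines and are otherwise arbitrary.

```julia
x = 1e-10
tx = tan(x)
sx = sin(x)

println(tx == sx)    # the two computed values coincide
println(eps(x))      # spacing of doubles near x
println(x^3 / 2)     # leading term of tan(x) - sin(x), far below that spacing
```

Since $\tan x$ and $\sin x$ differ from $x$ by much less than half of this spacing, both are rounded to the same double and no information about their difference survives.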

Below are two different ways to handle this.


_Use Trigonometric Identities_

\begin{align*}
f(x) &=\tan \: x - \sin \: x = \tan \: x (1 - \cos \: x ) \\
&= \tan \: x (1-\cos \: x)(1+\cos \: x)/(1+\cos \: x) = \tan \: x (1-\cos^2 \: x)/(1+\cos \: x) \\
&= \tan \: x \sin^2 \: x/(1+\cos \: x)
\end{align*}

In [2]:
f(x)=tan(x)*sin(x)^2/(1+cos(x))
for x in [1e-5,1e-6,1e-7,1e-8,1e-9,1e-10]
    println((x,f(x)))
end

(1.0e-5,5.000000000125e-16)
(1.0e-6,5.000000000001249e-19)
(1.0e-7,5.000000000000011e-22)
(1.0e-8,5.0000000000000005e-25)
(1.0e-9,5.000000000000001e-28)
(1.0e-10,5.0e-31)




For this expression, we get the correct answer
$$
f(10^{-10})= 5 \times 10^{-31}.
$$


_Use Taylor Series_ 

We only need a few terms.

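\begin{align*}
\tan \: x &= x + \frac{x^3}{3} + \frac{2x^5}{15} + O(x^7) \\
\sin \: x &= x -\frac{x^3}{6} + \frac{x^5}{120}+O(x^7) \\
f(x)&= \tan \: x - \sin \: x = \frac{x^3}{2} + \frac{x^5}{8} +O(x^7)
\end{align*}

Here the $x^5$ coefficient is $\frac{2}{15} - \frac{1}{120} = \frac{15}{120} = \frac{1}{8}$. This series will compute $f(10^{-10})$ just as well.

In [3]:
f(x)=x^3*(0.5+x^2/8)
for x in [1e-5,1e-6,1e-7,1e-8,1e-9,1e-10]
    println((x,f(x)))
end

For $x \le 10^{-8}$ the term $x^2/8$ is smaller than half the spacing of doubles near $0.5$ and is lost in the addition, so only the leading term $x^3/2$ contributes:

(1.0e-8,5.0000000000000005e-25)
(1.0e-9,5.000000000000001e-28)
(1.0e-10,5.0e-31)

For the larger inputs the $x^5$ term matters; for example, $f(10^{-5}) \approx 5.000000000125 \times 10^{-16}$, in agreement with the trigonometric form.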
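One way to check the three formulations against each other is to measure their relative error against a higher precision reference. The sketch below assumes 256-bit `BigFloat` arithmetic is accurate enough to serve as "exact", and takes the $x^5$ coefficient of the series to be $\frac{2}{15} - \frac{1}{120} = \frac{1}{8}$. The names `direct`, `identity_form`, `series`, `reference` and `relerr` are hypothetical.

```julia
# Three ways to evaluate tan(x) - sin(x) in double precision.
direct(x)        = tan(x) - sin(x)
identity_form(x) = tan(x) * sin(x)^2 / (1 + cos(x))
series(x)        = x^3 * (0.5 + x^2 / 8)   # assumes x^5 coefficient 2/15 - 1/120 = 1/8

# Reference value computed in 256-bit BigFloat arithmetic (assumed "exact").
function reference(x)
    setprecision(BigFloat, 256) do
        bx = big(x)
        tan(bx) - sin(bx)
    end
end

# Relative error of a double precision result, reported as a Float64.
relerr(approx, exact) = Float64(abs(approx - exact) / abs(exact))

for x in [1e-5, 1e-7, 1e-10]
    ref = reference(x)
    println((x, relerr(direct(x), ref), relerr(identity_form(x), ref), relerr(series(x), ref)))
end
```

At $x = 10^{-10}$ the direct formula returns $0$, so its relative error is $1$, while the other two stay near machine precision.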


