# Floating Point Arithmetic and Errors

## Absolute and relative error

Let $\alpha$ approximate $a$. The _absolute error_ and the _relative error_ (defined for $a\neq 0$) are

$$err=|a-\alpha|, \qquad relerr=\frac{err}{|a|}=\frac{|a-\alpha|}{|a|}.$$

In [1]:
# Try α=a:0.01:2a
a=5.0
α=5.0001
err=abs(a-α)
relerr=err/abs(a)
α, err, relerr

(5.0001, 9.999999999976694e-5, 1.9999999999953388e-5)

## Floating Point Arithmetic

Useful book on IEEE Floating Point standard:

M. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM Publications, Philadelphia, 2001.

Useful article: 

[David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html).

### Floating Point Number System

$x$ is a floating point number if it has the form

$$
	x = \pm d \cdot \beta^e \quad \beta \in \{ 2,10 \}
$$

Base 2 is for general purpose computers, base 10 is for pocket calculators.

$e$ is the exponent and satisfies

$$
	e_{\min} \leq e \leq e_{\max},\quad
	e_{\min} < 0 < e_{\max}
$$

We will assume that arithmetic is in base 2, but will usually give examples in base 10.

Mantissa $d$ has the form

\begin{align*}
	d &= 0.d_1 \dots d_t = d_1 \beta^{-1} + d_2 \beta^{-2}
	+ \dots + d_t \beta^{-t}\\
	d_i  &\in \{ 0,1\} \qquad \mbox{ for } \beta = 2 \\
	d_1 &= 1 \qquad \mbox{ normalized }   \\
	d_1 &= 0 \qquad \mbox{ unnormalized }
\end{align*}

Standard form for floating point numbers is normalized except at the
bottom of the exponent range.

During input and output numbers are converted from binary to decimal
and back.

Computer arithmetic is standardized by the IEEE 754 standard
for binary arithmetic.  All but a few modern computers follow this standard.
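
The representation $x=\pm d\cdot \beta^e$ can be inspected directly. The following is a minimal sketch using Julia's `significand` and `exponent` functions. Note that Julia normalizes the significand to $[1,2)$, so the convention $0.d_1d_2\dots$ used above corresponds to half that significand and an exponent larger by one. The sample value `6.5` is an arbitrary choice.

```julia
# Decompose a Float64 as x = s * 2^e with 1 <= |s| < 2
x = 6.5
s = significand(x)   # 1.625
e = exponent(x)      # 2
println((s, e, s * 2.0^e == x))

# The convention 0.d_1 d_2 ... with d_1 = 1 corresponds to s/2 and e+1
d = s / 2
println((d, e + 1, d * 2.0^(e + 1) == x))
```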

### Machine unit and machine precision

The set

$$
\{ x \colon \lfloor \log_2 \: |x| \rfloor \in [e_{min},e_{max}] \}
$$

is the set of real numbers that are in the normalized range of floating point numbers.
$fl(x)$ is the floating point round of $x$.

_Machine unit_ $\epsilon_M$ is the maximum relative distance
between a real number in the floating point range and the nearest floating point number,

$$
	\epsilon_M = \max_{\lfloor \log_2 
    \:|x|\rfloor \in
	[e_{\min},e_{\max}]} \frac{|x - fl(x)|}{|x|}  = 2^{-t}.
$$

_Machine precision_ $\epsilon$ is the relative distance between two neighbouring floating point numbers.
For $\beta=2$ we have $\epsilon=2^{1-t}=2\epsilon_M$.

Important examples include

_IEEE Standard Single Precision (Float32)_  $\beta = 2$, $t = 24$

\begin{align*}
	\epsilon_M  &= 2^{-24} \approx	5.9605 \times 10^{-8}\\
    \epsilon &=2^{-23} \approx 1.1920 \times 10^{-7} \\
	e_{\min} &= - 126,\quad e_{\max} = 127.
\end{align*}


_IEEE Standard Double Precision (Float64)_ $\beta =2$, $t = 53$

\begin{align*}
	\epsilon_M &= 2^{-53} \approx 1.1102 \times 10^{-16}\\
    \epsilon &=2^{-52} \approx 2.2204 \times 10^{-16}\\
    e_{\min} &= -1022,\quad e_{\max} = 1023.
\end{align*}

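Let us compute $\epsilon$ as the smallest power of $2$ such that
$1+\epsilon\neq 1$. The loop below halves `a` until `b+a` equals `b`, so the last printed value is $\epsilon_M=2^{-53}$ and the one before it is $\epsilon=2^{-52}$.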

In [2]:
b=1.0
a=2.0
while (b+a)!=b
    a=a/2
    println(a)
end

1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
6.103515625e-5
3.0517578125e-5
1.52587890625e-5
7.62939453125e-6
3.814697265625e-6
1.9073486328125e-6
9.5367431640625e-7
4.76837158203125e-7
2.384185791015625e-7
1.1920928955078125e-7
5.960464477539063e-8
2.9802322387695312e-8
1.4901161193847656e-8
7.450580596923828e-9
3.725290298461914e-9
1.862645149230957e-9
9.313225746154785e-10
4.656612873077393e-10
2.3283064365386963e-10
1.1641532182693481e-10
5.820766091346741e-11
2.9103830456733704e-11
1.4551915228366852e-11
7.275957614183426e-12
3.637978807091713e-12
1.8189894035458565e-12
9.094947017729282e-13
4.547473508864641e-13
2.2737367544323206e-13
1.1368683772161603e-13
5.684341886080802e-14
2.842170943040401e-14
1.4210854715202004e-14
7.105427357601002e-15
3.552713678800501e-15
1.7763568394002505e-15
8.881784197001252e-16
4.440892098500626e-16
2.220446049250313e-16
1.1102230246251565e-16


The MATLAB command `eps` and the Julia function `eps()` return $\epsilon \approx 2.2204 \times 10^{-16}$.

In [3]:
eps()

2.220446049250313e-16
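
The relation between machine precision and machine unit can be checked with built-in functions. This is a small sketch for `Float64` only; the variable names `ϵ` and `ϵM` are chosen here for illustration.

```julia
ϵ  = eps(Float64)      # machine precision, 2^-52
ϵM = ϵ / 2             # machine unit, 2^-53

println(nextfloat(1.0) - 1.0 == ϵ)   # spacing of the floating point numbers just above 1
println(1.0 + ϵ  != 1.0)             # ϵ is still visible when added to 1
println(1.0 + ϵM == 1.0)             # the tie is rounded back to 1
println((ϵ == 2.0^-52, ϵM == 2.0^-53))
```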

In [4]:
# What is this?
eps(64.0)

1.4210854715202004e-14

In Julia's type system, `Float64` is a subtype of `AbstractFloat`, which has four subtypes. 
In addition to `Float64` and `Float32`, there is the type `Float16`, which uses only two bytes of computer memory, and the type `BigFloat`, which by default has a 256-bit mantissa.  

In [5]:
supertype(Float64)

AbstractFloat

In [6]:
subtypes(AbstractFloat)

4-element Array{Any,1}:
 BigFloat
 Float16
 Float32
 Float64

In [7]:
for T in (Float16, Float32, Float64, BigFloat)
    println(eps(T))
end

0.000977
1.1920929e-7
2.220446049250313e-16
1.727233711018888925077270372560079914223200072887256277004740694033718360632485e-77


In [8]:
2^(-10), 2^(-23), 2^(-52), 2^(-255)

(0.0009765625, 1.1920928955078125e-7, 2.220446049250313e-16, 1.727233711018889e-77)

### Basic Floating Point Operations

We begin with the four basic arithmetic operations, addition ($+$), subtraction ($-$), multiplication ($*$),
and division ($/$). Suppose that $\odot$ is an operation such that

$$
\odot \in \{ + , - , *,/\}.
$$

Then, in floating point arithmetic with machine unit $\epsilon_M$, it is reasonable to expect that
for any two floating point numbers $x$ and $y$, we have

$$
	fl(x\odot y) = (x \odot y)\;(1 + \xi),\quad
	|\xi| \leq \epsilon_M.
$$

For division, we assume $y \neq 0$.
Any IEEE standard computer must follow this rule.  Rounding is one of two limitations that floating point
arithmetic has and real arithmetic does not. It follows from the above rule that as long as we only add numbers of the same sign, multiply, and divide, floating point results will almost always come very close to the corresponding real arithmetic results. The difficulty occurs if either $x$ or $y$ is already rounded and we add numbers of different signs or subtract numbers of the same sign. 
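
The rounding rule can be checked numerically. The sketch below uses `BigFloat` results as a stand-in for exact arithmetic (an assumption that is harmless here, since the 256-bit error is far below $\epsilon_M$). The helper `rounding_error` and the sample values are hypothetical, introduced only for illustration.

```julia
# Relative error of one Float64 operation, measured against a BigFloat reference
function rounding_error(op, x::Float64, y::Float64)
    exact = op(big(x), big(y))        # 256-bit reference, treated as exact here
    computed = op(x, y)               # rounded Float64 result
    return Float64(abs(computed - exact) / abs(exact))
end

x, y = 0.1, 0.3
ϵM = eps() / 2                        # machine unit for Float64
for op in (+, -, *, /)
    ξ = rounding_error(op, x, y)
    println((op, ξ, ξ <= ϵM))
end
```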

That is, suppose we have

$$
\tilde{x}= x(1+\delta_x), \quad \tilde{y} = y(1+\delta_y),
$$

where $x$ and $y$ are the exact real results of some computation and $\tilde{x}$ and $\tilde{y}$ are rounded floating point results with $|\delta_x|, |\delta_y| \leq \delta$ for some small $\delta$.  Suppose also that $x$ and $y$ have the same sign. Let

$$
z=x-y,\quad  \tilde{z} = fl(\tilde{x} -\tilde{y}).
$$

Then, 

\begin{align*}
\tilde{z} &=(\tilde{x}-\tilde{y})(1+\xi)= x(1+\delta_x)(1+\xi) -y(1+\delta_y)(1+\xi) 
=x-y + \delta_z,
\end{align*}

where $|\xi| \leq \epsilon_M$ and

$$
\delta_z = (x-y)\xi + (x\delta_x -y\delta_y)(1+\xi).
$$

The best available bound on $|\delta_z|$ is

\begin{align*}
|\delta_z| &\leq |x-y||\xi| + (|x||\delta_x| + |y||\delta_y|)(1+|\xi|) \\
& \leq |x-y| \epsilon_M + (|x|+|y|)\,\delta\,(1+\epsilon_M).
\end{align*}

Thus, the relative error in $z$ is 

\begin{align*}
\frac{|\tilde{z}-z|}{|z|}&=\frac{|\delta_z|}{|z|} 
\leq \epsilon_M + (1+\epsilon_M)\,\delta\,\frac{|x|+|y|}{|x-y|}\approx \delta \,\frac{|x|+|y|}{|x-y|}.
\end{align*}

If $|x-y| \ll |x|+|y|$, the effect of the round in the subtraction is not important, but the error from
previous computations on $x$ and $y$ can have a huge effect. The effect is called __propagation__. It can dramatically change the result of a computation! We will see this issue with some examples later in this lecture.
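
As a rough illustration of the bound above, the following sketch perturbs $x$ by a relative error of about $\delta$, keeps $y$ exact, and compares the relative error of the computed difference with $\delta\,(|x|+|y|)/|x-y|$. The values of `x`, `y` and `δ` are arbitrary choices, and `BigFloat` serves as the reference.

```julia
x, y = 1.000001, 1.0
δ = 1e-10
x_tilde = x * (1 + δ)              # x with a relative error of about δ; y is kept exact

z = big(x) - big(y)                # reference difference
z_tilde = x_tilde - y              # computed difference
relerr = Float64(abs(z_tilde - z) / abs(z))

amplification = (abs(x) + abs(y)) / abs(x - y)
println((relerr, δ * amplification))
```

With these values the relative error of the difference is several orders of magnitude larger than `δ`, while staying below the bound.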

Rounding is the first important limitation of floating point arithmetic.  A second limitation is the number range.

### Number Ranges

Floating point arithmetic has a largest and smallest computer number. First, the largest one.

__Largest Computer Number__ $\Omega$

In base $2$, with a $t$ bit mantissa, the largest computer number is

$$
	\Omega = (1 - 2^{-t}) \cdot 2^{e_{\max}+1}
$$

When numbers exceed $\Omega$, they are stored as `Inf` ($\infty$) or
`-Inf` ($-\infty$). We say that an _overflow_ occurred.


_IEEE Standard Single Precision_ (`Float32`)

$$
\quad \Omega = 3.4028\times 10^{38}
$$

_IEEE Standard Double Precision_ (`Float64`)

$$
\Omega = 1.7977 \times 10^{308}
$$

The MATLAB command `realmax` and the Julia function `floatmax()` show this number.
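
The formula for $\Omega$ and the overflow behaviour can be checked in `Float64`. This is a small sketch; it uses the equivalent form $(2-2^{1-t})\cdot 2^{e_{\max}}$, because $2^{e_{\max}+1}$ itself is not representable.

```julia
Ω = floatmax(Float64)
println(Ω == (2 - eps()) * 2.0^1023)   # same value as (1 - 2^-53) * 2^1024
println(prevfloat(Inf) == Ω)           # Ω is the last finite number before Inf
println((2Ω, -2Ω))                     # overflow in both directions
println(Ω + 1.0 == Ω)                  # a small addend is rounded away, no overflow
```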

__Smallest Computer Number__ $\omega$

The definition of the smallest computer number is somewhat more complex.

The smallest computer number is given by

$$
\omega = 2^{1-t} 2^{e_{\min}}.
$$

If a computation produces a number smaller in magnitude than $\omega$, an 
_underflow_ occurs and the result is set to $0$ or
$-0$.  If the programmer chooses, an underflow can result in an error, but in most computations, underflows are not harmful.


_IEEE Standard Single Precision_ (`Float32`):

$$
\omega = 2^{-23- 126} = 2^{-149} \approx  1.4013 \times 10^{-45}.
$$

In MATLAB, this comes from the command `omega= eps('single')*realmin('single')`.


_IEEE Standard Double Precision_ (`Float64`):

$$
\omega= 2^{-1022-52} = 2^{-1074} \approx  4.9407 \times 10^{-324}
$$

The appropriate MATLAB command to get this value is 
`omega = eps*realmin` and the equivalent Julia command is `floatmin()*eps()`.
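
In Julia, the same number is also the successor of zero. The sketch below checks this for `Float64` and shows an underflow; `issubnormal` reports whether a number is stored in unnormalized form.

```julia
ω = nextfloat(0.0)                       # smallest positive Float64
println(ω == floatmin() * eps())         # 2^-1022 * 2^-52 = 2^-1074
println((ω, issubnormal(ω), issubnormal(floatmin())))
println(ω / 2 == 0.0)                    # underflow to zero
println(-ω / 2)                          # underflow keeps the sign
```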


_Important and Subtle Point_ 

Numbers at the bottom of the exponent
range are not normalized.

MATLAB function `realmin` yields

$$
\omega_{useful} \approx 2.2251 \times 10^{-308}.
$$

Some people call this the smallest USEFUL floating point
 number since

$$
1/\omega_{useful} \leq \Omega
$$

and $\omega_{useful}$ is normalized.

The smallest floating point number, $\omega$, has the form

$$
0.0 \cdots 01 \times 2^{e_{\min}} \quad \cdots\quad
\mbox{Gradual Underflow}
$$

Before the IEEE standard most computers had the smallest
floating point number as

$$
	0.10 \cdots 0 \times 2^{e_{\min}} \qquad \cdots
	\mbox{ normalized}
$$

Earlier (pre-1985) computers set numbers below this smallest 'useful' floating point number to zero.
The change to gradual underflow was one of the more controversial features of the IEEE standard.

__Example.__ $\beta = 10$, $t=4$, $-5 \leq e \leq 5$

\begin{align*}
	x & = 0.1957 \times 10^{-5}   \\
	y & = 0.1942 \times 10^{-5}
\end{align*}

Compute  $fl(x - y)$. What happens?

$$
0.1957 \times 10^{-5}-0.1942 \times 10^{-5}  =0.0015 \times 10^{-5}
$$

Pre-1985 philosophy was to set $fl(x - y)=0$.

Gradual Underflow stores $fl(x - y)=0.0015 \times 10^{-5}$, that is, Gradual Underflow guarantees that for any two floating point numbers $x$ and $y$

$$
fl(x - y) = 0 \mbox{ if and only if } x = y.
$$
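
The same behaviour can be observed in `Float64`. In this sketch the two numbers are chosen near the bottom of the normalized range (an arbitrary choice), so their difference is smaller than `floatmin()` and is stored as an unnormalized number instead of zero.

```julia
x = 1.5 * floatmin()
y = floatmin()
d = x - y                        # 0.5 * floatmin(), below the normalized range
println((d, issubnormal(d)))
println((x != y, d != 0.0))
```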

In [9]:
for T in (Float16, Float32, Float64, BigFloat)
    println((floatmin(T),floatmax(T)))
end

(Float16(6.104e-5), Float16(6.55e4))
(1.1754944f-38, 3.4028235f38)
(2.2250738585072014e-308, 1.7976931348623157e308)
(2.382564904887951073216169781732674520415196125559239787955023752600945386104324e-323228497, 2.09857871646738769240435811688383907063809796547335262778664622571024044777575e+323228496)


In [10]:
1/floatmin(Float32),floatmax(Float32)

(8.507059f37, 3.4028235f38)

In [11]:
for T in (Float16, Float32, Float64)
    println((floatmin(T)*eps(T)))
end

6.0e-8
1.0e-45
5.0e-324


###  Special Quantities  $0$, $-0$, `Inf`, `-Inf` and `NaN`

Zero has a sign:

In [12]:
a=1.0
b=0.0
c=-b
c,b==c

(-0.0, true)

In [13]:
a/b

Inf

In [14]:
d=a/b
e=a/c
d==e, 1/d==1/e

(false, true)

In [15]:
b/c

NaN

`NaN` (Not a Number) can be generated by the indeterminate forms (cf. Calculus 1):

In [16]:
Inf+(-Inf),0*Inf, Inf/Inf, 0.0/0.0

(NaN, NaN, NaN, NaN)

IEEE Arithmetic is a closed system:

$\big\{$ floating point numbers,`Inf`,`-Inf`, `NaN`$\big\}$ 
$\stackrel{\odot}{\rightarrow}$ 
$\big\{$ floating point numbers,`Inf`,`-Inf`, `NaN` $\big\}$

no matter what the operation $\odot$ is.

Clever programmers take advantage of these features. However, in the coding assignments in this course, if you get
`NaN` or `Inf` or `-Inf`, you have probably made an error.
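
Special quantities can be detected with the built-in predicates `isnan`, `isinf`, `isfinite` and `signbit`. The sketch below also defines a hypothetical guard `check`, which is not part of Julia and is only one possible way to stop a computation on a non-finite result.

```julia
x = 0.0 / 0.0
println(isnan(x))
println(x == x)                                       # false: NaN is not equal to itself
println((isinf(1.0 / 0.0), isfinite(1.0 / 0.0)))
println((0.0 == -0.0, signbit(0.0), signbit(-0.0)))   # the zeros compare equal, signbit tells them apart

# Hypothetical guard for a computed result
check(r) = isfinite(r) ? r : error("non-finite result: $r")
println(check(1.0 / 3.0))
```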

### Binary Representation

In [17]:
bitstring(0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [18]:
bitstring(1)

"0000000000000000000000000000000000000000000000000000000000000001"

In [19]:
bitstring(0.0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [20]:
bitstring(-0.0)

"1000000000000000000000000000000000000000000000000000000000000000"

In [21]:
bitstring(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

In [22]:
bitstring(Float16(1.0))

"0011110000000000"

In [23]:
bitstring(2.0)

"0100000000000000000000000000000000000000000000000000000000000000"

__Problem.__ Explain the above binary representations. 

### Examples

__Using difference of squares__

Compute

$$
f(x) = \sqrt{1 + x^2} - 1, \quad \mbox{$x$ is near zero}.
$$

This formula in standard double precision yields
$f(10^{-12}) = 0$.

In [24]:
f(x)=sqrt(1+x^2)-1
x=1e-6
for k=1:4
    println((x,f(x)))
    x=x/10
end

(1.0e-6, 5.000444502911705e-13)
(1.0e-7, 4.884981308350689e-15)
(1.0e-8, 0.0)
(1.0e-9, 0.0)


The difference-of-squares trick yields

\begin{align*}
f(x) & \equiv (\sqrt{1 + x^2} - 1) \left( \frac{\sqrt{1 + x^2} + 1}{\sqrt{1 + x^2} + 1}\right) \\
& = \frac{x^2}{\sqrt{1+x^2} + 1}\equiv f_1(x),
\end{align*}

that is,  $f_1(10^{-12}) = 0.5 \cdot 10^{-24}$. 
This answer is as accurate as we can expect in standard double precision.

In [25]:
f₁(x)=x^2/(1+sqrt(1+x^2))
x=1e-6
for k=1:10
    println((x, f₁(x)))
    x=x/10
end

(1.0e-6, 4.99999999999875e-13)
(1.0e-7, 4.999999999999987e-15)
(1.0e-8, 5.0000000000000005e-17)
(1.0e-9, 5.0e-19)
(1.0e-10, 5.0000000000000005e-21)
(1.0000000000000001e-11, 5.000000000000001e-23)
(1.0000000000000002e-12, 5.000000000000001e-25)
(1.0000000000000002e-13, 5.0000000000000016e-27)
(1.0000000000000002e-14, 5.0000000000000015e-29)
(1.0e-15, 5.0e-31)


In [26]:
x=1e-12

1.0e-12

In [27]:
BigFloat(x)

9.999999999999999798866476292556153672528435061295226660149637609720230102539062e-13

In [28]:
f(BigFloat(x))

4.99999999999999979886647504255615569526325357670149402898679501001332484014577e-25

__Quadratic equation__

In exact arithmetic, the quadratic equation

$$ ax^2 + bx+c=0$$

has roots

\begin{align*}
x_1&=\frac{-b-\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}{2a} \\
x_2&\equiv\frac{-b+\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}{2a}= \frac{-b+\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}{2a}\cdot \frac{-b-\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}{-b-\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}
\\ &= \frac{2c}{-b-\mathop{\mathrm{sign}}(b)\sqrt{b^2-4ac}}\equiv x_3.
\end{align*}

In [29]:
a=2.0
b=123456789.0
c=4.0

x₁=(-b-sqrt(b*b-4*a*c))/(2.0*a)
x₂=(-b+sqrt(b*b-4*a*c))/(2.0*a)
x₃=(2*c)/(-b-sqrt(b*b-4*a*c))
x₁,x₂,x₃

(-6.172839449999997e7, -3.3527612686157227e-8, -3.240000029484002e-8)

Let us check using `BigFloat`:

In [30]:
a=BigFloat(a)
b=BigFloat(b)
c=BigFloat(c)
x₂=(-b+sqrt(b*b-4*a*c))/(2.0*a)

-3.240000029484001968915648868258452417675753633383540995167795107129921671968718e-08
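
The formulas for $x_1$ and $x_3$ can be combined into one routine that works for either sign of $b$. This is a minimal sketch, assuming $a\neq 0$, real roots ($b^2-4ac\geq 0$), and that $b$ and $c$ are not both zero. The name `stable_roots` is hypothetical, and `copysign` is used in place of $\mathop{\mathrm{sign}}(b)$ so that $b=0$ is treated as positive.

```julia
# Hypothetical helper: both roots of a*x^2 + b*x + c = 0 without cancellation
function stable_roots(a, b, c)
    # -b and the square root term have the same sign, so nothing cancels
    q = -b - copysign(sqrt(b * b - 4 * a * c), b)
    return q / (2 * a), (2 * c) / q
end

x1, x2 = stable_roots(2.0, 123456789.0, 4.0)
println((x1, x2))
```

The second root agrees with the `BigFloat` result above to nearly full double precision.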

__Tangent and sine__

In [31]:
x=1e-10
tan(x)-sin(x)

0.0

However, the 
trigonometric identities give

\begin{align*}
\tan x - \sin x & = \tan x (1 - \cos x ) 
= \tan x (1-\cos x)\frac{1+\cos x}{1+\cos x}\\ & = \tan x \frac{1-\cos^2 x}{1+\cos x} \\
&= \tan x \sin^2 x \frac{1}{1+\cos x},
\end{align*}

and the Taylor formula gives

\begin{align*}
\tan x &= x + \frac{x^3}{3} + \frac{2x^5}{15} + O(x^7) \\
\sin x &= x -\frac{x^3}{6} + \frac{x^5}{120}+O(x^7) \\
\tan x - \sin x &= \frac{x^3}{2} + \frac{x^5}{8} +O(x^7).
\end{align*}

Both formulas give an accurate result:

In [32]:
tan(x)*sin(x)^2/(1+cos(x)), x^3/2+x^5/8

(5.0e-31, 5.0e-31)

__Absolute value of a complex number__

To avoid underflow or overflow, instead of using the standard formula 

$$
|z|=|x+iy|=\sqrt{x^2+y^2}
$$

we must use the following formulas (Explain!):

$$
M = \max \{ |x|,|y|\}, \quad m = \min \{ |x|,|y| \}, \quad r = \frac{m}{M}, \quad 
|z| = M \sqrt{1+r^2}.
$$

These formulas are encoded in the function `abs()`.

In [33]:
z=2e+170+3e+175*im

2.0e170 + 3.0e175im

In [34]:
sqrt(real(z)^2+imag(z)^2), abs(z)

(Inf, 3.0000000000666667e175)

In [35]:
z=2e-170+3e-175*im
sqrt(real(z)^2+imag(z)^2), abs(z)

(0.0, 2.000000000225e-170)
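
The scaled formulas can be written out directly. The function `scaled_abs` below is a hypothetical helper for illustration, not the actual implementation of `abs()`; it is compared with the built-in `hypot`.

```julia
# Hypothetical helper implementing |z| = M*sqrt(1 + r^2) with r = m/M
function scaled_abs(x::Real, y::Real)
    M = max(abs(x), abs(y))
    m = min(abs(x), abs(y))
    M == 0 && return M          # avoid 0/0 for z = 0
    r = m / M                   # 0 <= r <= 1, so r^2 cannot overflow
    return M * sqrt(1 + r^2)
end

println((scaled_abs(2e170, 3e175), hypot(2e170, 3e175)))
println((scaled_abs(2e-170, 3e-175), hypot(2e-170, 3e-175)))
```

Since $r\leq 1$, the only quantity that is squared is at most one, so the intermediate results stay within the number range whenever $|z|$ itself does.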

__Problem.__ Study the function [hypot](https://en.wikipedia.org/wiki/Hypot) and the  BLAS 1 function `dnrm2.f`.

In [36]:
real(z)^2, imag(z)^2

(0.0, 0.0)