# Floating Point Arithmetic

---

A useful book on the IEEE floating point standard:

M. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM Publications, Philadelphia, 2001.


__Floating Point Number System__ 

$x$ is a floating point number if it has the form

$$
	x = \pm d \cdot \beta^e \quad \beta \in \{ 2,10 \}
$$

Base 2 is for general purpose computers, base 10 is for pocket calculators.

$e$ is the exponent and satisfies

$$
	e_{\min} \leq e \leq e_{\max}\quad,
	e_{\min} < 0 < e_{\max}
$$

We will assume that arithmetic is in base 2, but will usually give examples in base 10.

Mantissa $d$ has the form

\begin{align*}
	d &= 0.d_1 \dots d_t = d_1 \beta^{-1} + d_2 \beta^{-2}
	+ \dots + d_t \beta^{-t}\\
	d_i &\in \{ 0,1\}\\
	d_1 &= 1 \qquad \mbox{ normalized }   \\
	d_1 &= 0 \qquad \mbox{ unnormalized }   \\
\end{align*}

Standard form for floating point numbers is normalized except at the
bottom of the exponent range.
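
As a small illustration of this representation, the sketch below splits a `Float64` into a mantissa and exponent in the form used above. Julia's `significand` and `exponent` functions use the convention $1 \leq |m| < 2$, so the helper `decompose` (a name introduced here for illustration, not a Base function) halves the mantissa and increments the exponent to match the $0.d_1 \dots d_t$ form. It assumes a finite, nonzero, normalized input.

```julia
# Split a normalized Float64 into (d, e) with x = d * 2^e and 1/2 <= |d| < 1.
# `decompose` is a hypothetical helper name, not part of Base.
function decompose(x::Float64)
    m = significand(x)      # 1 <= |m| < 2, carries the sign of x
    e = exponent(x)         # x == m * 2.0^e
    return (m / 2, e + 1)   # rescale to the 0.d_1 d_2 ... d_t convention
end

d, e = decompose(6.0)       # 6 = 0.75 * 2^3 = (0.11)_2 * 2^3
println((d, e))
println(bitstring(6.0))     # sign bit, 11 exponent bits, 52 stored mantissa bits
```

The leading mantissa bit $d_1 = 1$ of a normalized number is not stored, which is why only 52 mantissa bits appear in the bit string although $t = 53$ for `Float64`.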

During input and output, numbers are converted from binary to decimal
and back.

Computer arithmetic is standardized by the IEEE 754 standard
for binary arithmetic.  All but a few modern computers follow this standard.

__Machine unit__

$$
	\epsilon_M = \max_{\lfloor \log_2 
    \:|x|\rfloor \in
	[e_{\min},e_{\max}]} \frac{|x - fl(x)|}{|x|}  = 2^{-t}
$$

The set

$$
\{ x \colon \lfloor \log_2 \: |x| \rfloor \in [e_{min},e_{max}] \}
$$

is the set of real numbers that are in the normalized range of floating point numbers.
$fl(x)$ is the floating point round of $x$. Thus the __machine unit__ is the maximum relative distance
between a real number in the floating point range and the nearest floating point number.

Important examples include

__IEEE Standard Single Precision (Float32)__  $\beta = 2$, $t = 24$

\begin{align*}
	\epsilon_M  &= 2^{-24} \approx	5.9605 \times 10^{-8}\\
	e_{\min} &= - 126,\quad e_{\max} = 128
\end{align*}


__IEEE Standard Double Precision (Float64)__ $\beta = 2$, $t = 53$

\begin{align*}
	\epsilon_M &= 2^{-53} \approx 1.1102 \times 10^{-16}\\
    e_{\min} &= -1022,\quad e_{\max} = 1024
\end{align*}

The MATLAB `eps` command and the Julia `eps()` function return $2^{-52} \approx 2.2204 \times
10^{-16}$, which is the spacing between $1$ and the next larger floating point number, and hence the maximum relative spacing between two
adjacent floating point numbers. As you can easily deduce, this number is $2\epsilon_M$.
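
The relation between `eps` and the machine unit can be checked directly. The following is a minimal sketch, assuming round-to-nearest-even (the IEEE default); the names `epsM`, `x`, and `relerr` are introduced here for illustration. It also measures the relative rounding error of the decimal number $0.1$ against a `BigFloat` reference.

```julia
epsM = 2.0^-53                      # machine unit for Float64 (t = 53)
println(eps(Float64) == 2epsM)      # spacing of the floats just above 1.0
println(1.0 + epsM == 1.0)          # a tie; round-to-nearest-even returns 1.0
println(1.0 + eps(Float64) > 1.0)   # the next float above 1.0

# Relative rounding error of the decimal number 0.1, measured in BigFloat.
x = big"0.1"                        # 256-bit approximation of 1/10
relerr = abs(BigFloat(Float64(x)) - x) / abs(x)
println(relerr <= epsM)             # consistent with the definition of the machine unit
```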

Julia, in particular, has a type system in which `Float64` is a subtype of `AbstractFloat`, which has four subtypes.
In addition to `Float64` and `Float32`, there is a type `Float16`, which uses only two bytes of computer memory, and a type `BigFloat`, which by default has a 256-bit mantissa.


In [1]:
supertype(Float64)

AbstractFloat

In [2]:
subtypes(AbstractFloat)

4-element Array{Any,1}:
 BigFloat
 Float16 
 Float32 
 Float64 

In [3]:
for T in (Float16, Float32,Float64,BigFloat)
    println(eps(T))
end

0.000977
1.1920929f-7
2.220446049250313e-16
1.727233711018888925077270372560079914223200072887256277004740694033718360632485e-77


__Basic Floating Point Operations__

We begin with the four basic arithmetic operations: addition ($+$), subtraction ($-$), multiplication ($*$),
and division ($/$). Suppose that __op__ is an operation such that

$$
	op \in \{ + , - , *,/\}.
$$

Then, in floating point arithmetic with machine unit $\epsilon_M$, it is reasonable to expect that
for any two floating point numbers $x$ and $y$, we have

$$
	fl(x\;op\;y) = (x \; op\; y)\;(1 + \xi),\quad
	|\xi| \leq \epsilon_M
$$

For division, we assume $y \neq 0$.
Any IEEE standard computer must follow this rule.  Rounding is one of two limitations that floating point
arithmetic has that real arithmetic does not have. It follows from the above rule that as long as all we do is add numbers of the same sign, multiply, and divide, floating point results will almost always come very close to the corresponding real arithmetic results. The difficulty occurs when $x$ or $y$ is itself a rounded value and we either add numbers of opposite sign or subtract numbers of the same sign.
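
The rounding rule above can be checked on specific inputs by comparing each floating point result with the exact result in rational arithmetic. This is only a sketch: `rounding_error` is a hypothetical helper name, the inputs `0.1` and `0.3` are arbitrary, and the check assumes the results stay in the normalized range.

```julia
# Relative error xi of a single floating point operation, computed exactly
# in rational arithmetic. `rounding_error` is a hypothetical helper name.
function rounding_error(op, x::Float64, y::Float64)
    exact = op(Rational{BigInt}(x), Rational{BigInt}(y))  # exact real result
    computed = Rational{BigInt}(op(x, y))                 # fl(x op y), converted exactly
    return (computed - exact) / exact
end

epsM = 1 // big(2)^53               # machine unit for Float64 as an exact rational
for op in (+, -, *, /)
    xi = rounding_error(op, 0.1, 0.3)
    println((op, Float64(xi), abs(xi) <= epsM))
end
```

Here `x` and `y` are the floating point numbers nearest to $0.1$ and $0.3$, so the check concerns only the rounding of the operation itself, not the rounding of the decimal inputs.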

To make this precise, suppose we have

$$
\tilde{x}= x(1+\delta_x), \quad \tilde{y} = y(1+\delta_y)
$$

where $x$ and $y$ are the exact real results of some computation and $\tilde{x}$ and $\tilde{y}$ are the rounded floating point results, with $|\delta_x|, |\delta_y| \leq \delta$ for some small $\delta$.  Suppose also that $x$ and $y$ have the same sign. If

$$
z=x-y, \quad \tilde{z} = fl(\tilde{x} -\tilde{y})
$$

Then

\begin{align*}
\tilde{z} &=(\tilde{x}-\tilde{y})(1+\xi), \quad |\xi| \leq \epsilon_M\\
&= x(1+\delta_x)(1+\xi) -y(1+\delta_y)(1+\xi) \\
&= x-y + \Delta_z
\end{align*}

where

$$
\Delta_z = (x-y)\xi + (x\delta_x -y\delta_y)(1+\xi)
$$

The best available bound on $|\Delta_z|$ is

\begin{align*}
|\Delta_z| &\leq |x-y||\xi| + (|x||\delta_x| + |y||\delta_y|)(1+|\xi|) \\
& \leq |x-y| \epsilon_M + (|x|+|y|)\delta(1+\epsilon_M)
\end{align*}

Thus

\begin{align*}
|\delta_z| &= \frac{|\tilde{z}-z|}{|z|}\\
&\leq \epsilon_M + (1+\epsilon_M)\delta\frac{|x|+|y|}{|x-y|}
\end{align*}

If $|x-y| \ll |x|+|y|$, the effect of the round in the subtraction is not important, but the error from
previous computations on $x$ and $y$ can have a huge effect. The effect is called __propagation__. It can dramatically change the result of a computation! We will see this issue with some examples later in this lecture.
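
A small numerical illustration of propagation, offered as a sketch: take $x = 1 + 10^{-15}$ and $y = 1$, so that $x$ must be rounded before the subtraction. The variable names and the choice $\delta = \epsilon_M$ are assumptions made for this example.

```julia
epsM = 2.0^-53
y  = 1.0
xt = 1.0 + 1e-15        # rounded: 1 + 1e-15 is not a Float64
zt = xt - y             # the subtraction itself is exact here
z  = 1e-15              # value of x - y (up to the rounding of the literal)

relerr = abs(zt - z) / abs(z)
# Bound from the text with delta = epsM, |x| + |y| ≈ 2, and |x - y| ≈ 1e-15.
bound = epsM + (1 + epsM) * epsM * (2 + 1e-15) / 1e-15
println((zt, relerr, bound))
```

The rounding error in `xt` is below $\epsilon_M$ in relative terms, yet the computed difference is off by roughly ten percent, within the bound derived above.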

Rounding is the first important limitation of floating point arithmetic.  A second limitation is the number range.



__Number Ranges__

Floating point arithmetic has a largest and smallest computer number. First, the largest one.

__Largest Computer Number__ $\Omega$

In base $2$, with a $t$ bit mantissa, the largest computer number is

$$
	\Omega = (1 - 2^{-t}) \cdot 2^{e_{\max}}
$$

When numbers exceed $\Omega$, they are stored as `Inf` ($\infty$) or
`-Inf` ($-\infty$).


__IEEE Standard Single Precision (Float32)__

$$
e_{\max} = 128,  \quad \Omega \approx 3.4028 \times 10^{38}
$$


__IEEE Standard Double Precision (Float64)__

$$
e_{\max} = 1024, \quad \Omega \approx 1.7977 \times 10^{308}
$$

The MATLAB `realmax` command displays this number.
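
In Julia 1.x the corresponding function is `floatmax` (older versions called it `realmax`). The sketch below checks the formula for $\Omega$ in double precision and shows overflow to `Inf`. Since $2^{1024}$ is itself not representable, the formula is rearranged as $(2 - 2^{-52}) \cdot 2^{1023}$; the variable name `Omega` is chosen here for illustration.

```julia
Omega = floatmax(Float64)                           # largest finite Float64
# (1 - 2^-53) * 2^1024, rewritten as (2 - 2^-52) * 2^1023 to avoid overflow
println(Omega == ldexp(2 - eps(Float64), 1023))
println(2Omega)                                     # overflows to Inf
println(-2Omega)                                    # overflows to -Inf
println(prevfloat(Inf) == Omega)                    # Omega is the last finite number
```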



__Smallest Computer Number__ $\omega$

The definition of the smallest computer number is somewhat more complex.


__IEEE Floating Point Standard__

The smallest computer number is given by

$$
\omega = 2^{1-t} 2^{e_{\min}}.
$$

If a computation produces a nonzero number smaller in magnitude than $\omega$, the result is called an
__underflow__ and is set to $0$ or
$-0$.  If the programmer chooses, an underflow can result in an error, but in most computations, underflows are not harmful.



__IEEE Standard Single Precision (Float32)__

$$
\omega = 2^{-23- 126} = 2^{-149} \approx  1.4013 \times 10^{-45}.
$$

In MATLAB, this comes from the command

$$
omega= eps('single')*realmin('single');
$$


__IEEE Standard Double Precision (Float64)__

$$
\omega= 2^{-1022-52} = 2^{-1074} \approx  4.9407 \times 10^{-324}
$$

The appropriate MATLAB command to get this value is

$$
omega = eps*realmin.
$$

_Important and Subtle Point_ --- Numbers at the bottom of the exponent
range are not normalized.

MATLAB function `realmin` yields

$$
	\underline{\omega_{useful} \approx 2.2251 \times 10^{-308}}
$$

Some people call this the smallest USEFUL floating point
 number since
 
$$
1/\omega_{useful} \leq \Omega
$$

and $\omega_{useful}$ is normalized.

The smallest floating point number, $\omega$, has the form

$$
	0.0 \cdots 01 \times 2^{e_{\min}} \quad \cdots\quad
	\underline{\mbox{Gradual Underflow}}
$$

Before the IEEE standard, most computers had the smallest
floating point number as

$$
	0.10 \cdots 0 \times 2^{e_{\min}} \qquad \cdots
	\mbox{ normalized}
$$

Earlier (pre-1985) computers set numbers below this smallest 'useful' floating point number to zero.
The switch to gradual underflow was one of the more controversial features of the IEEE standard.

In [4]:
for T in (Float16, Float32,Float64,BigFloat)
    println((realmin(T),realmax(T)))
end

(Float16(6.104e-5),Float16(6.55e4))
(1.1754944f-38,3.4028235f38)
(2.2250738585072014e-308,1.7976931348623157e308)
(2.382564904887951073216169781732674520415196125559239787955023752600945386104324e-323228497,2.09857871646738769240435811688383907063809796547335262778664622571024044777575e+323228496)


The next set of code gives the actual smallest floating point number for each of these three precisions.

In [5]:
for T in (Float16, Float32,Float64)
    println((realmin(T)*eps(T)))
end

6.0e-8
1.0f-45
5.0e-324
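
In Julia 1.x, `realmin` and `realmax` are named `floatmin` and `floatmax`. With those names, the smallest number and its unnormalized status can be checked as follows. This is a sketch for `Float64` only, assuming round-to-nearest-even; the variable name `omega` is chosen here to match the text.

```julia
omega = nextfloat(0.0)                               # smallest positive Float64
println(omega == eps(Float64) * floatmin(Float64))   # 2^-52 * 2^-1022
println(omega == ldexp(1.0, -1074))                  # exactly 2^-1074
println(issubnormal(omega))                          # true: not normalized
println(issubnormal(floatmin(Float64)))              # false: smallest normalized number
println(omega / 2)                                   # underflows to 0.0
```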


Here is an example (in decimal arithmetic) which shows why this feature made it into the IEEE standard.


__Example__ $\beta = 10$, $t = 4$, $-5 \leq e \leq 5$

\begin{eqnarray*}
	x & = 0.1957 \times 10^{-5}   \\
	y & = 0.1942 \times 10^{-5}
\end{eqnarray*}


Compute $fl(x - y)$. What happens?

$$
0.1957 \times 10^{-5}-0.1942 \times 10^{-5}  =0.0015 \times 10^{-5}
$$


Pre-1985 philosophy --- set $x - y$ to zero

Gradual Underflow stores $x - y$ as $0.0015 \times 10^{-5}$

Gradual underflow guarantees that for any two floating point numbers $x$ and $y$, 

$$
fl(x - y) = 0 \mbox{ if and only if } x = y.
$$
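
The same situation as in the decimal example can be reproduced in binary double precision. In this sketch, `x` and `y` are two distinct normalized numbers at the bottom of the exponent range whose difference lies below the normalized range; the specific values are chosen here for illustration.

```julia
y = floatmin(Float64)        # smallest normalized number, 2^-1022
x = 1.5 * y                  # also normalized
d = x - y                    # 2^-1023, below the normalized range
println((x != y, d != 0.0, issubnormal(d)))
println(d == y / 2)          # the difference is stored exactly
```

On a system that set such results to zero, `x - y` would be `0.0` even though `x != y`.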

Now for some examples.

__Interesting floating point computations__


__Example__

$$
f(x) = \sqrt{1 + x^2} - 1 \quad \mbox{$x$ near zero}.
$$

$\underline{f(10^{-12}) = 0}$  using the formula in
IEEE double.

In [6]:
f(x)=sqrt(1+x^2)-1
x=1e-6
for k=1:4
    println((x,f(x)))
    x=x/10
end

(1.0e-6,5.000444502911705e-13)
(1.0e-7,4.884981308350689e-15)
(1.0e-8,0.0)
(1.0e-9,0.0)


__Better!!__

Use an old "difference of two squares" trick.

\begin{eqnarray*}
	f(x) & = (\sqrt{1 + x^2} - 1) \left( \frac{\sqrt{1 + x^2} + 1}{\sqrt{1 + x^2} + 1}\right)   \\
	& = \frac{x^2}{\sqrt{1+x^2} + 1}
\end{eqnarray*}

$\underline{f(10^{-12}) = 0.5 \cdot 10^{-24}}$

This answer is as accurate as we can expect to get in this precision.
We can even take this example a little farther!

In [7]:
f(x)=x^2/(1+sqrt(1+x^2))
x=1e-6
for k=1:10
    println((x, f(x)))
    x=x/10
end

(1.0e-6,4.99999999999875e-13)
(1.0e-7,4.999999999999987e-15)
(1.0e-8,5.0000000000000005e-17)
(1.0e-9,5.0e-19)
(1.0e-10,5.0000000000000005e-21)
(1.0000000000000001e-11,5.000000000000001e-23)
(1.0000000000000002e-12,5.000000000000001e-25)
(1.0000000000000002e-13,5.0000000000000016e-27)
(1.0000000000000002e-14,5.0000000000000015e-29)
(1.0e-15,5.0e-31)
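
To quantify the difference between the two formulas, one can compare each against a higher precision reference. The following is a sketch, assuming the default 256-bit `BigFloat` precision is ample for these inputs; the names `f_naive`, `f_stable`, and `relerr` are introduced here for illustration.

```julia
f_naive(x)  = sqrt(1 + x^2) - 1
f_stable(x) = x^2 / (1 + sqrt(1 + x^2))

# Relative error of g at the Float64 point x, using BigFloat as the reference.
function relerr(g, x::Float64)
    ref = f_stable(big(x))          # same formula evaluated in BigFloat
    return Float64(abs(g(x) - ref) / abs(ref))
end

for x in (1e-6, 1e-8, 1e-12)
    println((x, relerr(f_naive, x), relerr(f_stable, x)))
end
```

A relative error of `1.0` for the naive formula means that every digit has been lost, which is the case once $1 + x^2$ rounds to $1$.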




__Example: Back to the Quadratic Equation__

Find the roots of

$$
	ax^2 + bx + c = 0
$$

Our classic formulas are

\begin{eqnarray*}
	x_1 & = \frac{-b - \mathrm{sign} (b) \sqrt{b^2 - 4ac}}{2a}
	\\
	x_2 & =\frac{-b +\mathrm{sign} (b) \sqrt{b^2 - 4ac}}{2a}
\end{eqnarray*}

when $\underline{b^2 -4 ac \geq 0}$.

But we can use the "difference of two squares" trick again.

\begin{eqnarray*}
	x_2 & = x_2 \cdot \left( \frac{b + \mathrm{sign}(b) \sqrt{b^2 - 4ac}}{b+ \mathrm{sign} (b) \sqrt{b^2 - 4ac}}\right)
	\\
	& = \frac{-b^2 + b^2 - 4ac}{2a \cdot (b + \mathrm{sign}(b)\sqrt{b^2 - 4ac})}
	\\
	& = \frac{-4ac}{2a\cdot (b + \mathrm{sign}(b) \sqrt{b^2 -4ac})} = \frac{-2c}{b + \mathrm{sign}(b) \sqrt{b^2 -4ac}}
\end{eqnarray*}

In [8]:
a=1
b=1e18
c=3
disc=b^2-4*a*c
x1=(-b-sign(b)*sqrt(disc))/(2a)
x2=-2c/(b+sign(b)*sqrt(disc))
x1,x2

(-1.0e18,-3.0e-18)
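
The two formulas can be packaged into a small function and compared with the textbook formula for $x_2$. This is a minimal sketch, not a robust solver: `stable_roots` and `naive_x2` are hypothetical names, real roots and $a \neq 0$ are assumed, and overflow in $b^2$ is not guarded against. It uses `copysign` in place of `sign(b)` so that $b = 0$ does not zero out the square root.

```julia
# Roots of a*x^2 + b*x + c = 0 using the cancellation-free formulas.
# `stable_roots` is a hypothetical helper; it assumes a != 0 and real roots.
function stable_roots(a, b, c)
    disc = b^2 - 4a*c
    disc >= 0 || throw(DomainError(disc, "complex roots are not handled in this sketch"))
    q = b + copysign(sqrt(disc), b)   # both terms have the sign of b: no cancellation
    return (-q / (2a), -2c / q)       # (x1, x2)
end

# Textbook formula for x2, which subtracts two nearly equal numbers.
naive_x2(a, b, c) = (-b + sign(b) * sqrt(b^2 - 4a*c)) / (2a)

println(stable_roots(1, 1e18, 3))
println(naive_x2(1, 1e18, 3))         # cancellation leaves 0.0
println(stable_roots(1, -3, 2))       # roots of (x - 1)(x - 2)
```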