In [1]:
using DrWatson;
@quickactivate "NumericalAnalysis";

# Floating Point Numbers

In reading these notes, it is helpful to also watch and refer to [this video](https://www.youtube.com/watch?v=97Gb9TS3MJs) on floating point numbers and roundoff error. A really great video on floating point arithmetic in Julia may be found [here](https://www.youtube.com/watch?v=fL8vYG69EhE&t=14s).

The set of all real numbers $\mathbb{R}$ is continuous and unbounded. Due to considerations of memory and efficiency, it is not practical to store the exact value of each real number when carrying out numerical computations on a machine. Thus, we construct a discrete, finite subset $\mathbb{P}$ of $\mathbb{R}$, called **floating point numbers**[^1], and a function $\text{fl}:\mathbb{R} \rightarrow \mathbb{P}$ called **rounding** that sends each real number to its floating point approximation. (In fact, $\mathbb{P}$ will be a subset of the set of all rational numbers $\mathbb{Q}$.) Then, we approximate values and operations in $\mathbb{R}$ with values and operations in $\mathbb{P}$. Thus, we will discretize $\mathbb{R}$, and since in general $|\text{fl}(x) - x| > 0$ for $x\in \mathbb{R}$, this produces a type of discretization error called **roundoff error**. It is important for us to learn to assess and control for roundoff error.

[^1]: Note that the textbook uses $\mathbb{F}$ to denote the set of floating point numbers. Since $\mathbb{F}$ is often used to denote a set with the algebraic structure of a field, and floating point numbers do not form a field, we prefer to use $\mathbb{P}$ instead.

Rounding and roundoff error can lead to some surprising consequences. In order to illustrate what can happen as a result of roundoff, we will look at an oversimplified toy model. Suppose for the sake of simplicity that we round off all results of arithmetic to three significant digits. Thus a number such as $1.234$ would become $1.23$ after rounding, while $2.345$ would become $2.35$. In the following calculation, we use $=$ to denote the result of an exact calculation and $\approx$ to denote the result of a calculation after rounding. Then

$$
\begin{align*}
&(2.31 + 0.00312) + 0.00312 = 2.31312 + 0.00312 \\
&\approx 2.31 + 0.00312 = 2.31312 \approx 2.31
\end{align*}
$$

while 

$$
\begin{align*}
&2.31 + (0.00312 + 0.00312) = 2.31 + 0.00624 \\
&= 2.31624 \approx 2.32
\end{align*}
$$
   

 
Thus, $(2.31 + 0.00312) + 0.00312$ is rounded to $2.31$ while $2.31 + (0.00312 + 0.00312)$ is rounded to $2.32$. Rounding, that is, discretizing $\mathbb{R}$, has resulted in the loss of the associative property of addition that is satisfied by abstract real number arithmetic.
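The toy computation above can be reproduced in Julia. The following is a minimal sketch: the helper `round3` is a name introduced here for illustration (it does not appear elsewhere in these notes), and it uses `round` with the `sigdigits` keyword to imitate three-significant-digit rounding after every addition.

```julia
# Hypothetical helper: round to three significant decimal digits,
# imitating the toy rounding model described above.
round3(x) = round(x, sigdigits=3)

# Round after every addition, as in the hand calculation.
left  = round3(round3(2.31 + 0.00312) + 0.00312)   # (2.31 + 0.00312) + 0.00312
right = round3(2.31 + round3(0.00312 + 0.00312))   # 2.31 + (0.00312 + 0.00312)

println("Grouping on the left:  ", left)
println("Grouping on the right: ", right)
println("Same result? ", left == right)
```

The two groupings disagree in the third significant digit, which is exactly the loss of associativity seen in the hand calculation.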

The goal of these notes is to explain in detail the discretization $\text{fl}:\mathbb{R} \rightarrow \mathbb{P}$ and explore those aspects and consequences of this discretization that are most relevant in the context of numerical analysis. Before we construct the discretization $\text{fl}:\mathbb{R} \rightarrow \mathbb{P}$, it is helpful to reconsider $\mathbb{R}$ from a perspective that may be unfamiliar, that is, we consider the binary rather than decimal representation of real numbers.  

## Binary Representation of Real Numbers

From calculus and other math courses you are likely used to dealing with real numbers (*i.e.*, elements of $\mathbb{R}$) as decimal numbers. That is, we write a positive element $x \in \mathbb{R}$ as


$$x = a_{N}10^N + a_{N-1}10^{N-1} + \cdots + a_{1} 10^1 + a_{0}10^0 + a_{-1}10^{-1} + a_{-2}10^{-2} + \cdots ,$$

where $N \in \mathbb{Z}$, each $a_{i}\in \{0,1,2,3,4,5,6,7,8,9\}$ with $a_{N}\neq 0$, and the expression may contain an infinite number of terms with negative powers of 10. Of course $0$ simply has all coefficients equal to zero, and to get negative real numbers we just "change the sign" of a positive real number by multiplying by $-1$. As a simple example,  

$$3.14 = 3 \times 10^0 + 1 \times 10^{-1} + 4 \times 10^{-2}.$$

Observe that we can write our general expression more succinctly as

$$x = \sum_{k=-\infty}^{N} a_{k}10^k,$$ 


or even 

$$x = \pm a \left(1 + \sum_{j=1}^{\infty}a_{j}10^{-j} \right)10^{N} = \pm a (1 + f)10^{N},$$


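where $f = \sum_{j=1}^{\infty}a_{j}10^{-j}$, with $a\in \{1,2,3,4,5,6,7,8,9\}$ the leading digit of $x$, and $a_{j}\in \{0,1,2,3,4,5,6,7,8,9\}$. (Here the $a_{j}$ are the digits of $f$, which in general are not the same as the digits of $x$ itself.)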

For example,

$$24.6 = 2\times 10^1 + 4 \times 10^0 + 6 \times 10^{-1} = +2(1 + 2 \times 10^{-1} + 3 \times 10^{-2})10^{1}.$$

Let's use Julia to confirm our result:

In [2]:
x = 2*(1 + 2*10.0^-1 + 3*10.0^-2)*10.0^1

24.6

Take note that $24.6$ is a rational number with a finite decimal expansion. On the other hand, a number such as $\frac{1}{3}$ has an infinite but repeating decimal expansion, and there are even irrational numbers (*e.g.*, $\sqrt{2}$) that have an infinite and nonrepeating decimal expansion. 

Decimal expansion uses base 10 in the sense that every number is represented as a (potentially infinite) sum of products of powers of 10 by elements from $\{0,1,2,3,4,5,6,7,8,9\}$. This is a choice that, while familiar, is arbitrary. As just one alternative, one may instead represent real numbers in base 2, that is, as a (potentially infinite) sum of products of powers of 2 by elements from $\{0,1\}$. In this case, we can write a real number as


$$x = \pm (1 + f)2^{N},$$

where $f = \sum_{j=1}^{\infty}a_{j}2^{-j}$, with $a_{j}\in \{0,1\}$. 

For example,

$$
\begin{align*}
 17.25 &= 1\times 2^4 + 0 \times 2^3 + 0\times 2^2 + 0\times 2^1 + 1\times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} \\
 &= +(1 + 0 \times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} + 0 \times 2^{-5} + 1 \times 2^{-6})2^{4} \\
 &= +(1 + f)2^{4},
\end{align*}
$$
where $f=0 \times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} + 0 \times 2^{-5} + 1 \times 2^{-6}$. 

Again we can confirm this result in Julia, this time using a little bit of code cleverness:

In [3]:
c = [0,0,0,1,0,1]; # the coefficients of f
p = -1:-1:-6; # the relevant powers of 2 in f
t = 2.0.^p; # raise 2 to the relevant powers
f = sum(c.*t); # compute f
x = (1 + f)*2.0^4 # compute x

17.25

Now we will look at an example that illustrates the general method for finding the binary representation of a decimal number. Consider $3.1$. It is probably easiest to break this up into two steps. Write $3.1 = 3.0 + 0.1$ and find the binary expansions for the integer part, *i.e.*, $3$, and the fractional part, *i.e.*, $0.1$, separately. The integer part is straightforward:
$$
\begin{align*}
3 &= 2 + 1 \\
&= 1\times 2^1 + 1 \times 2^{0}. 
\end{align*}
$$

To find the binary expansion of the fractional part $0.1$ we proceed by "repeated multiplication by 2":
$$
\begin{align*}
0.1 \times 2 &= \color{red}{0}.2 \rightarrow \color{red}{0} \\
0.2 \times 2 &= \color{red}{0}.4 \rightarrow \color{red}{0} \\
0.4 \times 2 &= \color{red}{0}.8 \rightarrow \color{red}{0} \\ 
0.8 \times 2 &= \color{red}{1}.6 \rightarrow \color{red}{1} \\
0.6 \times 2 &= \color{red}{1}.2 \rightarrow \color{red}{1} \\
0.2 \times 2 &= \color{red}{0}.4 \rightarrow \color{red}{0} \\
0.4 \times 2 &= \color{red}{0}.8 \rightarrow \color{red}{0} \\ 
&\text{this will now continue to repeat in the same pattern.}
\end{align*}
$$

Thus, we see that
$$
\begin{align*}
0.1 &= 0\times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} + 1\times 2^{-5} \\
    &+ 0\times 2^{-6} + 0\times 2^{-7} + 1\times 2^{-8} + 1\times 2^{-9} \\
    &+ \text{the pattern } 0,0,1,1 \text{ repeats}.
\end{align*}
$$

Then,
$$
\begin{align*}
3.1 &= 1\times 2^1 + 1 \times 2^{0} + 0\times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} + 1\times 2^{-5} + \cdots \\
&= +(1+f)2^{1}, 
\end{align*}
$$
where
$$
f = 1 \times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 0\times 2^{-4} + 1\times 2^{-5} + 1\times 2^{-6} + \cdots.
$$

Note that 3.1 does not have a finite binary expansion. Thus, we cannot compute 3.1 exactly using its binary representation in Julia. However, we can at least perform a sanity check:

In [4]:
c = [1,0,0,0,1,1]; # some of the coefficients of f
p = -1:-1:-6; # some of the relevant powers of 2 in f
t = 2.0.^p; # raise 2 to the relevant powers
f = sum(c.*t); # compute partial f
x = (1 + f)*2.0^1 # compute partial x

3.09375

Carrying out our expansion further will get us closer to 3.1:

In [5]:
c = [1; 0; repeat([0,0,1,1], 10)]; # some of the coefficients of f: 1, 0, then the block 0,0,1,1 repeated
p = -1:-1:-42; # some of the relevant powers of 2 in f
t = 2.0.^p; # raise 2 to the relevant powers
f = sum(c.*t); # compute partial f
x = (1 + f)*2.0^1 # compute partial x

3.099999999999909

Let's compute the error and accurate digits between 3.1 and our truncated binary expansion of 3.1:

In [6]:
abs_err = abs(3.1 - x);
rel_err = abs_err/abs(3.1);
acc_dig = -log10(rel_err);
println("Absolute error ", round(abs_err, sigdigits=3))
println("Relative error ", round(rel_err, sigdigits=3))
println("Accurate digits ", round(acc_dig, sigdigits=3))

Absolute error 9.1e-14
Relative error 2.94e-14
Accurate digits 13.5


Take a second to think about why our method of "repeated multiplication by 2" works to find the binary expansion of the fractional part of a number. 

Here is the essential idea: We are assuming that
$$
0.1 = a_{1}2^{-1} + a_{2}2^{-2} + a_{3}2^{-3} + \cdots, 
$$
for some unique values for the coefficients $a_{1},a_{2},\ldots$, with each one taking a value of either 0 or 1. Then multiplying by 2 gives:
$$
2\times 0.1 = a_{1} + a_{2}2^{-1} + a_{3}2^{-2} + \cdots,
$$
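and this allows us to determine $a_{1}$: it is the integer part of $2\times 0.1$, because the remaining terms sum to a value in $[0,1)$. Subtracting $a_{1}$ and multiplying by 2 again will allow us to determine $a_{2}$, and so on. 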
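The procedure just described can be sketched as a short Julia function. This is only an illustration: the name `binary_fraction_bits` is made up here, and the input is taken as a `Rational` so that the repeated doubling is carried out exactly rather than in floating point.

```julia
# Hypothetical helper: first n binary digits of a fraction 0 <= x < 1,
# found by repeated multiplication by 2.
function binary_fraction_bits(x::Rational, n::Integer)
    bits = Int[]
    for _ in 1:n
        x *= 2                 # shift the binary point one place to the right
        a = floor(Int, x)      # the integer part is the next binary digit
        push!(bits, a)
        x -= a                 # keep only the fractional part
    end
    return bits
end

println(binary_fraction_bits(1//10, 12))   # digits a_1, ..., a_12 of 0.1
println(binary_fraction_bits(1//4, 4))     # 0.25 has a finite binary expansion
```

Using exact rational arithmetic avoids the circularity of asking a binary floating point number for its own digits.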

Now we have seen that for any $x\in \mathbb{R}$, we may write $x=\pm(1+f)2^{N}$, where $N$ is an integer (*i.e.*, an element of $\mathbb{Z}$), and $f=\sum_{k=1}^{\infty}a_{k}2^{-k}$. In the next section, we will discretize $\mathbb{R}$ by *truncating* the binary expansion of real numbers. 

## Floating Point and Finite Precision Numbers

For a nonzero real number $x\in \mathbb{R}$, we have seen that
$$
x = \pm\left(1 + \sum_{k=1}^{\color{red}{\infty}}a_{k}2^{-k} \right)2^{\color{red}{N}}, \ \ N\in \mathbb{Z}, \ \ a_{k}\in\{0,1\}.
$$
Then we define $\text{fl}(x)$ to be "the closest value" to $x$ that satisfies  
$$
\text{fl}(x) = \pm\left(1 + \sum_{k=1}^{\color{red}{d}}a_{k}2^{-k} \right)2^{\color{red}{E}}, \ \ \text{for some } E\in \{-N_{-},-N_{-}+1,\ldots,-1,0,1,\ldots,N_{+}\}, \ \ a_{k}\in\{0,1\},
$$
where $d$, $N_{-}$, and $N_{+}$ are **fixed positive integers**. The positive integer $d$ determines the number of significant binary digits. 

For example, if $d=2$, $N_{-}=1$, and $N_{+}=2$, then define the set
$$
\begin{align*}
\mathbb{P}_{\text{example}} &= \left\{\pm\left(1 + \sum_{k=1}^{\color{red}{2}}a_{k}2^{-k} \right)2^{\color{red}{E}} :\ a_{k}\in\{0,1\},\ \ E \in \{-1,0,1,2\}\right\} \\
&= \pm\left\{\frac{1}{2},\frac{5}{8},\frac{3}{4},\frac{7}{8},1,\frac{5}{4},\frac{3}{2},\frac{7}{4},2,\frac{5}{2},3,\frac{7}{2},4,5,6,7\right\}. 
\end{align*}
$$
(We leave it as a homework exercise for you to show that the last equality is true.)  

In this case, rounding real numbers to values in $\mathbb{P}_{\text{example}}$ gives, for example, 
$$
\begin{align*}
x &= 2 \Rightarrow \text{fl}(x) = 2 = +(1 + 0\times 2^{-1} + 0\times 2^{-2})2^1, \\
x &= 0.5 \Rightarrow \text{fl}(x) = \frac{1}{2} = +(1+0\times 2^{-1} + 0\times 2^{-2})2^{-1}, \\
x &= 1.6 \Rightarrow \text{fl}(x) = \frac{3}{2} = +(1 + 1\times 2^{-1} + 0\times 2^{-2})2^0, \\
x &= 1.7 \Rightarrow \text{fl}(x) = \frac{7}{4} = +(1 + 1\times 2^{-1} + 1\times 2^{-2})2^0, \\
x &= 4.2 \Rightarrow \text{fl}(x) = 4 = +(1 + 0\times 2^{-1} + 0 \times 2^{-2})2^2, \\
x &= 4.6 \Rightarrow \text{fl}(x) = 5 = +(1 + 0\times 2^{-1} + 1 \times 2^{-2})2^2. 
\end{align*}
$$
As homework, you should compute the errors and accurate digit values for each of the previous numbers. 
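To experiment with this small system, here is a minimal Julia sketch that builds the positive part of $\mathbb{P}_{\text{example}}$ and rounds to the nearest element. The names `P_pos` and `fl_example` are introduced here for illustration only, and the sketch does not treat ties or values outside the range of the set in any special way.

```julia
d, Nminus, Nplus = 2, 1, 2

# All possible values of f = a_1*2^-1 + a_2*2^-2 with a_k in {0, 1}.
fracs = [k / 2^d for k in 0:2^d-1]

# Positive elements (1 + f) * 2^E, sorted in increasing order.
P_pos = sort([(1 + f) * 2.0^E for f in fracs for E in -Nminus:Nplus])

# Hypothetical rounding function: nearest element of P_pos to a positive x.
fl_example(x) = P_pos[argmin(abs.(P_pos .- x))]

println(P_pos)
for x in (2, 0.5, 1.6, 1.7, 4.2, 4.6)
    println("x = ", x, "  fl(x) = ", fl_example(x))
end
```

The printed values can be compared with the list of elements and the rounding examples above.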

You may wonder, what would happen to a value such as $x=0.04$, which is a lot smaller than the smallest positive number in $\mathbb{P}_{\text{example}}$, or $x=10$, which is a lot larger than the largest positive value in $\mathbb{P}_{\text{example}}$.  This is an excellent question. We will add some additional values to the floating point numbers in order to handle such situations. For a particular choice of positive integers $d$, $N_{-}$, and $N_{+}$, we call the set
$$
\begin{align*}
\mathbb{P} &= \left\{\pm\left(1 + \sum_{k=1}^{\color{red}{d}}a_{k}2^{-k} \right)2^{\color{red}{E}} :\ a_{k}\in\{0,1\},\ \ E \in \{-N_{-},\ldots,N_{+}\}\right\}, 
\end{align*}
$$
a set of **floating point numbers**. Notice that we can denote an element $p \in \mathbb{P}$ by $p=\pm(1+f)2^E$, where $f = \sum_{k=1}^{d}a_{k}2^{-k}$.

Given a set of floating point numbers $\mathbb{P}$, we define the **finite precision values** to be elements of the set 
$$
\text{finite precision values} = \mathbb{P} \cup \{0,\pm\text{Inf},\text{NaN}\},
$$
where $0$ is zero and $\pm\text{Inf},\text{NaN}$ are symbols which will be explained shortly. 

In practical implementations, the standard choice for $d$, $N_{-}$, and $N_{+}$ is given by the double-precision format of the [**IEEE-754 standard**](https://en.wikipedia.org/wiki/IEEE_754), where $d=52$, $N_{-}=1022$, and $N_{+}=1023$. The symbol $\text{Inf}$ is a value greater than any element of $\mathbb{P}$, and $-\text{Inf}$ is a value less than any element of $\mathbb{P}$. The symbol $\text{NaN}$ (which stands for "not a number") is used to represent indeterminate forms such as $\frac{0}{0}$. 
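One way to see the pieces $\pm$, $E$, and $f$ inside an actual `Float64` is Julia's `bitstring` function. The sketch below relies on a fact about the IEEE-754 double-precision layout that these notes have not stated: the 64 bits consist of 1 sign bit, 11 exponent bits that store $E + 1023$, and 52 bits holding the coefficients $a_{1},\ldots,a_{52}$ of $f$.

```julia
bits = bitstring(17.25)          # 64-character string of 0s and 1s

sign_bit      = bits[1]          # '0' means +, '1' means -
exponent_bits = bits[2:12]       # stores E + 1023 (assumed IEEE-754 layout)
fraction_bits = bits[13:64]      # a_1, a_2, ..., a_52

E = parse(Int, exponent_bits; base=2) - 1023

println("sign bit:   ", sign_bit)
println("exponent E: ", E)
println("first six coefficients of f: ", fraction_bits[1:6])
```

For $17.25 = +(1+f)2^{4}$ the recovered exponent and leading coefficients of $f$ can be checked against the expansion worked out earlier.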

## Machine Precision

From the definition of $\mathbb{P}$, observe the following:

1. We have $1+f \in [1, 2)$, and this implies that for each $E$, $(1+f)2^{E} \in [2^{E}, 2^{E+1})$.

2. The smallest number of $\mathbb{P}$ that is greater than $1$ is $1 + 2^{-d}$. We define **machine epsilon** to be $\epsilon_{\text{mach}} = 2^{-d}$. (Note that machine epsilon is **not** the smallest positive floating point number; that value is $2^{-N_{-}}$, while the largest floating point number is just under $2^{N_{+} + 1}$.) 

3. For each exponent $E\in \{-N_{-},-N_{-}+1,\ldots,-1,0,1,\ldots,N_{+}\}$, there are $2^{d}$ evenly spaced floating point values in $[2^E,2^{E+1})$.

4. For each exponent $E\in \{-N_{-},-N_{-}+1,\ldots,-1,0,1,\ldots,N_{+}\}$, the distance between consecutive floating point numbers in $[2^E,2^{E+1})$ is $\Delta = \frac{2^{E+1} - 2^E}{2^d} = 2^{E-d}$.

In our example case where $d=2$, $N_{-}=1$, and $N_{+}=2$ we have that $\epsilon_{\text{mach}} = 2^{-2} = \frac{1}{4}$ and hence $1+\epsilon_{\text{mach}} = \frac{5}{4}$. However, the smallest positive value of $\mathbb{P}_{\text{example}}$ is $\frac{1}{2}$. 

In the IEEE 754 double-precision standard, $\epsilon_{\text{mach}} = 2^{-52}$, which can be accessed in Julia as:

In [7]:
eps(Float64)

2.220446049250313e-16

Compare that with

In [8]:
2.0^-52

2.220446049250313e-16
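The spacing observations listed above can also be checked directly. The following sketch uses Julia's `nextfloat` and the one-argument form `eps(x)`, which returns the gap between `x` and the next larger `Float64`; the sample points `1.5 * 2.0^E` are an arbitrary choice made here.

```julia
# The distance from 1.0 to the next larger Float64 is machine epsilon.
gap_at_one = nextfloat(1.0) - 1.0
println("gap at 1.0 equals eps(Float64)? ", gap_at_one == eps(Float64))

# For x in [2^E, 2^(E+1)), the spacing should be 2^(E-d) with d = 52.
for E in (-3, 0, 3, 10)
    x = 1.5 * 2.0^E            # a sample point inside [2^E, 2^(E+1))
    println("E = ", E, ": eps(x) = ", eps(x), ", 2^(E-52) = ", 2.0^(E - 52))
end
```

Note how the absolute spacing grows with $E$ while the spacing relative to $2^{E}$ stays fixed at $\epsilon_{\text{mach}}$.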

Furthermore, we have the following estimates for the smallest and largest positive floating point values in Julia:

In [9]:
2.0^-1022

2.2250738585072014e-308

and

In [10]:
(1.0 + 2.0^-1 + 2.0^-2 + 2.0^-3)*2.0^1023

1.6853373139334212e308
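Julia also exposes these extreme values directly through `floatmin` and `floatmax`. As a hedged sketch, the comparison below checks them against the formulas $2^{-N_{-}}$ and $(2 - 2^{-d})2^{N_{+}}$ with $d=52$, $N_{-}=1022$, $N_{+}=1023$; note that `floatmin` returns the smallest positive *normalized* value, which is the notion of "smallest" used in these notes.

```julia
smallest = 2.0^-1022
largest  = (2.0 - 2.0^-52) * 2.0^1023   # all 52 coefficients of f equal to 1

println("floatmin(Float64) == 2^-1022?             ", floatmin(Float64) == smallest)
println("floatmax(Float64) == (2 - 2^-52)*2^1023?  ", floatmax(Float64) == largest)
println("floatmax(Float64) = ", floatmax(Float64))
```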

Notice what happens if we try to compute a value beyond our largest floating point number:

In [11]:
2.0^1024

Inf

This produces a so-called **overflow**. There is also a phenomenon called **underflow**, where a nonzero result that is too small in magnitude to be represented is rounded to zero. Finally, examine the following result:

In [12]:
0.0/0.0

NaN

These last examples help to explain the meaning of the symbols $\pm\text{Inf}$ and $\text{NaN}$ that were mentioned previously. 
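A few more one-line experiments round out the picture. This is a small sketch: the factor `2.0^-60` is an arbitrary choice made here to push a value far below the smallest positive floating point number.

```julia
# Underflow: a positive result too small to represent is rounded to zero.
tiny = floatmin(Float64) * 2.0^-60
println("underflows to zero? ", tiny == 0.0)

# Inf and NaN arise from overflow and indeterminate forms.
println("1.0 / 0.0 = ", 1.0 / 0.0)
println("Inf - Inf = ", Inf - Inf)

# NaN is not equal to anything, including itself, so test with isnan.
println("NaN == NaN?       ", NaN == NaN)
println("isnan(0.0 / 0.0)? ", isnan(0.0 / 0.0))
```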


You should understand and memorize the following important result: 
> **Theorem:** Let $x$ be a positive real number that lies in $[2^{E},2^{E+1})$ for some exponent $E\in \{-N_{-},\ldots,N_{+}\}$. Then
$$
|\text{fl}(x) - x| \leq \frac{1}{2}2^{E-d}
$$
> and
$$
\frac{|\text{fl}(x) - x|}{|x|} \leq \frac{\frac{1}{2}2^{E-d}}{2^{E}} \leq \frac{1}{2}\epsilon_{\text{mach}}. 
$$
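The relative error bound can be spot-checked numerically. The sketch below is an illustration under stated assumptions: the helper `relative_rounding_error` is a name made up here, and it uses Julia's `BigFloat` (via `big`) as a stand-in for exact real arithmetic, which is far more accurate than `Float64` but still not exact.

```julia
# Hypothetical helper: relative error of fl(num/den), measured in extended precision.
function relative_rounding_error(num::Integer, den::Integer)
    exact   = big(num) / big(den)   # BigFloat stand-in for the real number num/den
    rounded = big(num / den)        # fl(num/den), converted to BigFloat without error
    return abs(rounded - exact) / abs(exact)
end

bound = eps(Float64) / 2
for (num, den) in ((1, 10), (31, 10), (1, 3), (2, 7))
    err = relative_rounding_error(num, den)
    println(num, "/", den, ": relative error = ", Float64(err),
            ", within eps/2? ", err <= bound)
end
```

A handful of test values is of course not a proof, but it is a useful sanity check on the statement of the theorem.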

## Floating Point Arithmetic

On a machine (*i.e.*, a computer), arithmetic is performed on floating point numbers and returns a finite precision value. Moving forward, our assumption will be that there is a set of floating point operations such as addition, multiplication, etc. that are analogous to the abstract arithmetic operations on real numbers. Furthermore, we will suppose that each floating point operation is carried out such that the resulting relative error is bounded by machine epsilon, $\epsilon_{\text{mach}}$. For example, if $\bigoplus$ is used to denote floating point addition and if $x,y\in \mathbb{P}$ with $x+y\neq 0$, then we have
$$
\frac{|(x\bigoplus y) - (x+y)|}{|x+y|} \leq \epsilon_{\text{mach}}. 
$$
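This assumption can be tested on a familiar example. In the sketch below, `big` converts each `Float64` to a `BigFloat` so that the sum of the two stored values can be formed without further rounding; treating that `BigFloat` sum as the exact value of $x+y$ is an assumption of the sketch.

```julia
x, y = 0.1, 0.2                  # here x and y denote the stored Float64 values

computed = x + y                 # floating point addition
exact    = big(x) + big(y)       # sum of the stored values in extended precision

rel_err = abs(big(computed) - exact) / abs(exact)

println("computed sum:      ", computed)
println("relative error:    ", Float64(rel_err))
println("bounded by eps?    ", rel_err <= eps(Float64))
println("computed == 0.3?   ", computed == 0.3)
```

The computed sum is not the `Float64` closest to $0.3$, yet its relative error with respect to the sum of the two stored inputs is still within the assumed bound.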

## Finite Precision as Perturbations

We may view finite precision numbers and arithmetic as perturbations to exact real numbers and exact real number arithmetic. That is, if $x \in \mathbb{R}$, we may consider 
$$
\text{fl}(x) = x(1 + \delta), \text{where } |\delta| \leq \frac{\epsilon_{\text{mach}}}{2}, 
$$
and
$$
\begin{align*}
\text{fl}(\text{fl}(x) \pm \text{fl}(y)) &= \left(\text{fl}(x) \pm \text{fl}(y)\right)(1 + \delta_{1}), \\
\text{fl}(\text{fl}(x) \times \text{fl}(y)) &= \left(\text{fl}(x) \times \text{fl}(y)\right)(1 + \delta_{2}), \\
\text{fl}(\text{fl}(x) \div \text{fl}(y)) &= \left(\text{fl}(x) \div \text{fl}(y)\right)(1 + \delta_{3}), 
\end{align*}
$$
where each of $\delta_{1}$, $\delta_{2}$, and $\delta_{3}$ is bounded in absolute value by $\epsilon_{\text{mach}}$.

Here is the important point: since any time we solve a problem on a computer we are introducing a (typically small) perturbation, it is important to understand how sensitive our solution or method of solution is to small perturbations. This is addressed by the notions of stability and conditioning, which we will take up in the next set of notes.
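As a worked illustration of how these perturbations combine, consider addition. The symbols $\delta_{x}$ and $\delta_{y}$ are introduced here (they are not used elsewhere in these notes) for the rounding perturbations of the inputs, so that $\text{fl}(x) = x(1+\delta_{x})$ and $\text{fl}(y) = y(1+\delta_{y})$. Then
$$
\begin{align*}
\text{fl}(\text{fl}(x) + \text{fl}(y)) &= \left(x(1+\delta_{x}) + y(1+\delta_{y})\right)(1 + \delta_{1}) \\
&= (x + y) + x\delta_{x} + y\delta_{y} + (x+y)\delta_{1} + (x\delta_{x} + y\delta_{y})\delta_{1}.
\end{align*}
$$
The last term is a product of two small quantities, so to first order the computed sum differs from $x+y$ by $x\delta_{x} + y\delta_{y} + (x+y)\delta_{1}$.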