# IEEE Arithmetic and More Floating Point Examples

__IEEE Arithmetic__

__Book__

M.L. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM Publications,
Philadelphia, PA, 2001.

Good book for general computer scientists.

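See also D. Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic,"
ACM Computing Surveys 23(1), 1991.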

IEEE arithmetic supports the following "special numbers."

```
Inf
-Inf
0
-0
NaN
```

In [5]:
1/0.0, 1/Inf, 1/(-0.0), 1/(-Inf)

(Inf,0.0,-Inf,-0.0)

`NaN` (not a number) can be generated by

In [6]:
Inf+(-Inf),0*Inf, Inf/Inf, 0.0/0.0

(NaN,NaN,NaN,NaN)
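Because `NaN` compares unequal to everything, including itself, and the two zeros compare equal under `==`, the special numbers are best detected with predicates rather than comparisons. A minimal sketch using only functions from Julia's `Base` (the variable names are illustrative):

```julia
# NaN is never equal to anything, not even itself, so use isnan to detect it.
nan_equals_itself = (NaN == NaN)
nan_detected      = isnan(0.0 / 0.0)

# The two zeros compare equal under ==, but isequal and signbit tell them apart.
zeros_compare_equal = (0.0 == -0.0)
zeros_are_identical = isequal(0.0, -0.0)
negative_zero_sign  = signbit(-0.0)

# isinf detects both infinities; the sign is recovered with an ordinary comparison.
inf_detected     = isinf(1 / 0.0)
neg_inf_detected = isinf(1 / (-0.0)) && 1 / (-0.0) < 0
```

Here `nan_equals_itself` and `zeros_are_identical` are `false`, and the remaining variables are `true`.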

IEEE Arithmetic is a closed system.

$$
\{ \textrm{floating point numbers}, \textrm{Inf}, -\textrm{Inf}, \textrm{NaN}\}  \stackrel{\textrm{operations}}{\rightarrow}
\{ \textrm{floating point numbers}, \textrm{Inf}, -\textrm{Inf}, \textrm{NaN}\}
$$

no matter what the operations are.

Clever programmers take advantage of these features. However, in the coding assignments in this course, if you get
`NaN` or `Inf` or `-Inf`, you have probably made an error.
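One way to catch such errors early is to check intermediate results with `isfinite`, which is `false` for `NaN`, `Inf`, and `-Inf`. The helper below is a hypothetical sketch, not part of the course code; the name `assert_finite` and its arguments are assumptions made for illustration.

```julia
# Hypothetical helper (not part of the course code): stop as soon as a result
# contains Inf, -Inf or NaN, instead of letting it propagate silently.
function assert_finite(label, values)
    bad = findall(!isfinite, values)
    isempty(bad) || error("$label has non-finite entries at positions $bad")
    return values
end

# All entries are finite, so the input is returned unchanged.
ok = assert_finite("ok", [1.0, 2.0, 3.0])

# 0.0/0.0 is NaN, so this call throws; the try block records that it did.
caught = try
    assert_finite("bad", [1.0, 0.0 / 0.0])
    false
catch err
    err isa ErrorException
end
```

Calling such a check after each major step of an assignment narrows down where the first non-finite value appears.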

In [7]:
f(x)=sec(x)-tan(x)

f (generic function with 1 method)

In [9]:
x=π/2
f(x),sec(x),tan(x)

(0.0,1.633123935319537e16,1.633123935319537e16)

In [10]:
g(x)=1/(sec(x)+tan(x))

g (generic function with 1 method)

In [11]:
g(x)

3.061616997868383e-17

In [12]:
h(x)=cos(x)/(1+sin(x))

h (generic function with 1 method)

In [13]:
h(x)

3.061616997868383e-17

In [14]:
y=x-eps()
f(y),g(y),h(y)


(0.0,1.416384724411995e-16,1.416384724411995e-16)
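The three functions `f`, `g`, and `h` are algebraically identical. A short derivation, added here for reference, uses $\sec^2 x - \tan^2 x = 1$:

$$
\sec x - \tan x = \frac{(\sec x - \tan x)(\sec x + \tan x)}{\sec x + \tan x}
= \frac{1}{\sec x + \tan x} = \frac{\cos x}{1+\sin x}
$$

Near $x = \pi/2$ the values of `sec(x)` and `tan(x)` are both about $1.6\times 10^{16}$ and agree in every stored digit, so the subtraction in `f` cancels to `0.0`. The forms `g` and `h` involve no subtraction of nearly equal quantities, which is why they return the same nonzero value.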

In [16]:
typeof(π)

Irrational{:π}

In [21]:
x=BigFloat(π)/2

1.570796326794896619231321691639751442098584699687552910487472296153908203143099

In [23]:
sec(x),tan(x),f(x),g(x),h(x)

(1.82329127542575665758945097744587056181050935898819084966946418442954497061682e+77,1.82329127542575665758945097744587056181050935898819084966946418442954497061682e+77,0.000000000000000000000000000000000000000000000000000000000000000000000000000000,2.742293602448380191855326565989245052626895591271719877947514292480356721283385e-78,2.742293602448380191855326565989245052626895591271719877947514292480356721283385e-78)

The following class of problems gives rise to two separate types of issues: one that we have
already discussed and one that we have not. Below, $\epsilon$ is the value generated by the `eps()` command.



__Linear Systems of Equations__

\begin{eqnarray*}
\displaystyle\frac{\epsilon}{10} x_1 + x_2 = 1 \\
x_1 + x_2 = 2
\end{eqnarray*}

A good approximate answer is $x_1 = x_2 = 1$. Use the augmented matrix approach, eliminating with the first row as the pivot row.

\begin{eqnarray*}
&\left(\begin{array}{cc|c} \displaystyle\frac{\epsilon}{10} & 1 & 1 \\ 1 & 1 & 2 \end{array} \right) \\
&\left(\begin{array}{cc|c} \displaystyle\frac{\epsilon}{10} & 1 & 1 \\ 0 & 1-\displaystyle\frac{10}{\epsilon} & 2-\displaystyle\frac{10}{\epsilon}\end{array}\right)\\
\approx &\left(\begin{array}{cc|c} \displaystyle\frac{\epsilon}{10} & 1 & 1 \\ 0 & -\displaystyle\frac{10}{\epsilon} & -\displaystyle\frac{10}{\epsilon}\end{array}\right)\\ &\textrm{to machine precision}
\end{eqnarray*}

The very significant "1" and "2" in the last line are _rounded away_!

Back solve to get $x_1 = 0$, $x_2 = 1$. If you put these values back into the original system, note
that $x_1+x_2 = 1$, whereas the second equation requires $x_1+x_2 = 2$, so this is "way off."

In [27]:
[eps()/10 1;0 1-10/eps()]\[1; 2-10/eps()]

2-element Array{Float64,1}:
 0.0
 1.0
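The elimination above can also be carried out in code rather than by hand. The function below is a minimal sketch of Gaussian elimination without any row exchanges; the name `solve_no_pivot` is hypothetical, and the routine assumes a square system with nonzero pivots.

```julia
# Hypothetical helper: Gaussian elimination WITHOUT pivoting, then back substitution.
# Assumes A is square and that no zero pivot is encountered.
function solve_no_pivot(A, b)
    U = float.(A)          # broadcasting makes working copies
    c = float.(b)
    n = length(c)
    for k in 1:n-1
        for i in k+1:n
            m = U[i, k] / U[k, k]      # multiplier; huge when the pivot is tiny
            for j in k:n
                U[i, j] -= m * U[k, j]
            end
            c[i] -= m * c[k]
        end
    end
    x = zeros(eltype(c), n)
    for i in n:-1:1                    # back substitution
        s = c[i]
        for j in i+1:n
            s -= U[i, j] * x[j]
        end
        x[i] = s / U[i, i]
    end
    return x
end

x_naive = solve_no_pivot([eps()/10 1; 1 1], [1.0, 2.0])
```

With the tiny pivot `eps()/10` the multiplier is about `4.5e16`, and `x_naive` comes out as `[0.0, 1.0]`, the same inaccurate answer as the hand computation.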

Again there is a "fix," called partial pivoting: before eliminating in a column, swap rows so that
the uneliminated entry of largest magnitude in that column sits in the pivot (diagonal) position.
\begin{eqnarray*}
&\left(\begin{array}{cc|c} 1 & 1 & 2 \\ \displaystyle\frac{\epsilon}{10} & 1 & 1 \end{array}\right) \\
&\left(\begin{array}{cc|c} 1 & 1 & 2 \\ 0 & 1-\displaystyle\frac{\epsilon}{10} & 1-\displaystyle\frac{\epsilon}{5}\end{array}\right)\\
\approx &\left(\begin{array}{cc|c} 1 & 1 & 2 \\ 0 & 1 & 1 \end{array}\right)
\end{eqnarray*}

In [29]:
[1 1;0 1-eps()/10]\[2; 1-eps()/5]

2-element Array{Float64,1}:
 1.0
 1.0

This is the correct solution to machine precision.
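The row exchange can be automated. The function below is a minimal sketch of Gaussian elimination with partial pivoting; the name `solve_partial_pivot` is hypothetical, and the routine assumes a square, nonsingular system. Julia's built-in `\` also uses a pivoted factorization for a general dense square matrix, so it is called on the unswapped system for comparison.

```julia
# Hypothetical helper: Gaussian elimination WITH partial pivoting.
# Assumes A is square and nonsingular.
function solve_partial_pivot(A, b)
    U = float.(A)          # broadcasting makes working copies
    c = float.(b)
    n = length(c)
    for k in 1:n-1
        # Row index of the largest-magnitude entry in column k, rows k..n.
        p = k - 1 + argmax(abs.(U[k:n, k]))
        if p != k
            U[[k, p], :] = U[[p, k], :]   # swap rows k and p
            c[[k, p]] = c[[p, k]]
        end
        for i in k+1:n
            m = U[i, k] / U[k, k]         # |m| <= 1 after pivoting
            for j in k:n
                U[i, j] -= m * U[k, j]
            end
            c[i] -= m * c[k]
        end
    end
    x = zeros(eltype(c), n)
    for i in n:-1:1                       # back substitution
        s = c[i]
        for j in i+1:n
            s -= U[i, j] * x[j]
        end
        x[i] = s / U[i, i]
    end
    return x
end

A = [eps()/10 1; 1 1]
b = [1.0, 2.0]
x_pivot   = solve_partial_pivot(A, b)
x_builtin = A \ b
```

Both `x_pivot` and `x_builtin` agree with the solution $x_1 = x_2 = 1$ to machine precision.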

Sometimes changing the algorithm does _no good at all_! Again, $\epsilon$ is the value generated
by the `eps()` command.

\begin{eqnarray*}
(1+2\epsilon)x_1 + (1+2\epsilon)x_2 = 2 \\
(1+\epsilon)x_1 + x_2 =2
\end{eqnarray*}

Using the augmented matrix approach, we start from

$$
\left(\begin{array}{cc|c}
(1+2\epsilon)&     (1+2\epsilon )&     2 \\
(1+\epsilon)&     1   &2 \end{array}\right)
$$

Multiply the first row by $\alpha = (1+\epsilon)/(1+2\epsilon)= 1-\epsilon + O(\epsilon^2)$,
subtract the result from the second row, and you get (dropping $O(\epsilon^2)$ terms)

$$
\left(\begin{array}{cc|c}
(1+2\epsilon)&     (1+2\epsilon )&     2 \\
0           &     -\epsilon & 2\epsilon \end{array}\right)
$$
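Back substitution gives $x_2 = -2$ from the second row and $x_1 + x_2 = 2/(1+2\epsilon) \approx 2$ from
the first, so the solution is $x_1 = 4$ and $x_2 =-2$. This is correct to machine precision.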

A small change in the right hand side yields

$$
\left(\begin{array}{cc|c}
(1+2\epsilon)&     (1+2\epsilon )&     2+4\epsilon \\
(1+\epsilon)&     1   &2 +\epsilon \end{array}\right)
$$

The correct answer is $x_1=x_2=1$, but with rounding I get $x_1 =0$ and $x_2 =2$. Why?
Every trick I know, except increasing the precision, yields similar wrong answers.

In [31]:
[1+2*eps() 1+2*eps(); 1+eps() 1]\[2+4*eps(); 2+eps()]

2-element Array{Float64,1}:
 0.0
 2.0

In [36]:
[BigFloat(1)+2*eps() 1+2*eps(); 1+eps() 1]\[BigFloat(2)+4*eps(); 2+eps()]

2-element Array{BigFloat,1}:
 0.000000000000000000000000000000000000000000000000000000000000000000000000000000
 2.000000000000000000000000000000000000000000000000000000000000000000000000000000
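In the cell above, entries such as `2+eps()` are still evaluated in Float64 before they are promoted to `BigFloat`, so the extra precision arrives too late. A minimal sketch that promotes `eps()` first, so that every entry is formed in extended precision (the names `ebig`, `A`, `b`, `x`, and `err` are illustrative):

```julia
# Promote eps() to BigFloat BEFORE building the entries, so 2 + ebig is not rounded to 2.
ebig = BigFloat(eps())

A = [1+2*ebig 1+2*ebig; 1+ebig 1]
b = [2+4*ebig, 2+ebig]

x = A \ b

# Largest deviation of the computed solution from the exact answer x1 = x2 = 1.
err = maximum(abs.(x .- 1))
```

With the default `BigFloat` precision, `err` is many orders of magnitude below Float64 machine epsilon, so this solve recovers $x_1 = x_2 = 1$.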

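__Reason__ In Float64 arithmetic, $2+\epsilon$ rounds to $2$. The entry is therefore already rounded when
`2+eps()` is evaluated, before any solver (including the BigFloat solve above) sees it. IEEE arithmetic
rounds this system to

$$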
\left(\begin{array}{cc|c}
(1+2\epsilon)&     (1+2\epsilon )&     2+4\epsilon \\
(1+\epsilon)&     1   &2           \end{array}\right)
$$

which has the solution $x_1=0$ and $x_2=2$. This problem is very close to the singular system

$$
\left(\begin{array}{cc|c} 1 & 1 & 2 \\ 1 & 1 & 2\end{array}\right)
$$

which has the solutions

$$
\mathbf{x} =(x_1,x_2)^T = (1,1)^T+ \beta\,(-1,1)^T, \quad \beta \in \mathbb{R}.
$$

Note that $(x_1,x_2)^T= (1,1)^T$ (with $\beta=0$) and $(x_1,x_2)^T=(0,2)^T$ (with $\beta=1$) are two of those solutions.
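One way to see why no Float64 algorithm can tell the two candidates apart is to evaluate the residual $A\mathbf{x}-\mathbf{b}$ of the stored system for both. A small sketch (the variable names are illustrative):

```julia
# The system as it is actually stored in Float64; 2+eps() is rounded to 2.0.
A = [1+2*eps() 1+2*eps(); 1+eps() 1]
b = [2+4*eps(), 2+eps()]

rhs_was_rounded = (2 + eps() == 2.0)

# Residuals of the intended solution (1,1) and of the computed solution (0,2).
r_intended = A * [1.0, 1.0] - b
r_computed = A * [0.0, 2.0] - b
```

Both residuals are exactly zero in Float64 arithmetic, so each vector solves the stored system as well as floating point can measure.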