# Computer arithmetic and errors

## Absolute and relative error

Let $\alpha$ be an approximation of $a$. The absolute error and, for $a \neq 0$, the relative error are

\begin{align*}
err &= |a-\alpha|, \\
relerr &= \frac{err}{|a|}=\frac{|a-\alpha|}{|a|}.
\end{align*}

In [1]:
# Try this for α = a:0.01:2a
a=5.0
α=5.1
err=abs(a-α)
relerr=err/abs(a)
α, err, relerr

(5.1, 0.09999999999999964, 0.019999999999999928)
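The sweep suggested in the comment of the cell above can be sketched as follows. The helper name `errors` and the coarser step `0.5` are illustrative choices made here, not part of the original notebook.

```julia
# Absolute and relative error of an approximation α of a (a ≠ 0 is assumed).
errors(a, α) = (abs(a - α), abs(a - α) / abs(a))

a = 5.0
for α in a:0.5:2a              # coarser step than 0.01 to keep the output short
    err, relerr = errors(a, α)
    println("α = ", α, "  err = ", err, "  relerr = ", relerr)
end
```

The relative error grows from 0 at α = a to 1 at α = 2a, while the absolute error grows from 0 to |a|.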

## Floating point arithmetic

A useful book on the IEEE floating point standard:

M. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM Publications, Philadelphia, 2001.

A useful article:

[David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html).

### Floating point numbers

$x$ is a floating point number if it has the form
$$
	x = \pm d \cdot \beta^e, \quad \beta \in \{ 2,10 \}.
$$

Base 2 is used by general-purpose computers, base 10 by pocket calculators.

$e$ is the exponent and satisfies

$$
	e_{\min} \leq e \leq e_{\max},\quad
	e_{\min} < 0 < e_{\max}.
$$

We will assume base 2 arithmetic, but the examples will mostly be in base 10.

The mantissa $d$ has the form

\begin{align*}
	d &= 0.d_1 \dots d_t = d_1 \beta^{-1} + d_2 \beta^{-2}
	+ \dots + d_t \beta^{-t}\\
d_i  &\in \{ 0,1\}\\
	d_1 &= 1 \qquad \mbox{ normalized }   \\
	d_1 &= 0 \qquad \mbox{ unnormalized }   \\
\end{align*}

The standard form of a floating point number is normalized, except at the bottom of the exponent range.

On input and output, numbers are converted between decimal and binary representation.

Computer arithmetic is standardized by the IEEE 754 standard for binary arithmetic.
All but a few modern computers follow this standard.
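As a rough illustration of the form $x = \pm d \cdot 2^e$ with $d = 0.d_1 \dots d_t$, Julia's `frexp` returns a mantissa in $[1/2, 1)$ together with the matching exponent, which is the normalization used in the text. The functions `significand` and `exponent` use the other common convention, with the mantissa in $[1, 2)$. The value `6.5` is just a convenient example.

```julia
x = 6.5
d, e = frexp(x)            # d in [1/2, 1) and x == d * 2^e
println((d, e))            # (0.8125, 3), and 0.8125 = 0.1101 in binary
println(d * 2.0^e == x)    # true

# The same number in the 1.f convention of `significand` and `exponent`
println((significand(x), exponent(x)))   # (1.625, 2)
```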

### Machine precision

$$
	\epsilon_M = \max_{\lfloor \log_2 
    \:|x|\rfloor \in
	[e_{\min},e_{\max}]} \frac{|x - fl(x)|}{|x|}  = 2^{-t}
$$

The set

$$
\{ x \colon \lfloor \log_2 \: |x| \rfloor \in [e_{\min},e_{\max}] \}
$$

is the set of all real numbers within the normalized range of floating point numbers.
$fl(x)$ is $x$ rounded to the nearest floating point number. Hence, the _machine precision_ is
the largest relative distance between a real number in the floating point range and the nearest floating point number.
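A minimal sketch of this definition for `Float64`, using `BigFloat` as a stand-in for the real number $x$ (an approximation, since `BigFloat` is itself a finite-precision type). The choice $x = 1/3$ is arbitrary.

```julia
x = big(1) / 3                      # reference value with a 256-bit mantissa
flx = Float64(x)                    # x rounded to the nearest Float64
relerr = abs(x - flx) / abs(x)      # |x - fl(x)| / |x|, evaluated in BigFloat
ϵM = 2.0^-53                        # machine precision for t = 53

println(Float64(relerr))
println(relerr <= ϵM)               # true
```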

Important examples are

__IEEE standard single precision__ (`Float32`):  $\beta = 2$, $t = 24$,

\begin{align*}
	\epsilon_M  &= 2^{-24} \approx	5.9605 \times 10^{-8}\\
	e_{\min} &= - 126,\quad e_{\max} = 128.
\end{align*}


__IEEE standard double precision__ (`Float64`): $\beta =2$, $t = 53$,

\begin{align*}
	\epsilon_M &= 2^{-53} \approx 1.1102 \times 10^{-16}\\
    e_{\min} &= -1022,\quad e_{\max} = 1024
\end{align*}

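Let us compute the machine precision by halving $a$, starting from $a=2$, until $1+a=1$ in floating point arithmetic. The last value printed is $\epsilon_M$, and the value before it is the smallest power of two whose sum with $1$ is still different from $1$: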

In [2]:
b=1.0
a=2.0
while (b+a)!=b
    a=a/2
    println(a)
end

1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
6.103515625e-5
3.0517578125e-5
1.52587890625e-5
7.62939453125e-6
3.814697265625e-6
1.9073486328125e-6
9.5367431640625e-7
4.76837158203125e-7
2.384185791015625e-7
1.1920928955078125e-7
5.960464477539063e-8
2.9802322387695312e-8
1.4901161193847656e-8
7.450580596923828e-9
3.725290298461914e-9
1.862645149230957e-9
9.313225746154785e-10
4.656612873077393e-10
2.3283064365386963e-10
1.1641532182693481e-10
5.820766091346741e-11
2.9103830456733704e-11
1.4551915228366852e-11
7.275957614183426e-12
3.637978807091713e-12
1.8189894035458565e-12
9.094947017729282e-13
4.547473508864641e-13
2.2737367544323206e-13
1.1368683772161603e-13
5.684341886080802e-14
2.842170943040401e-14
1.4210854715202004e-14
7.105427357601002e-15
3.552713678800501e-15
1.7763568394002505e-15
8.881784197001252e-16
4.440892098500626e-16
2.220446049250313e-16
1.1102230246251565e-16


The MATLAB command `eps` and the Julia function `eps()` return $2.2204 \times
10^{-16}$, which is the largest relative spacing between two adjacent floating point numbers.
It is easy to see that this number equals $2\epsilon_M$.

In [3]:
eps()

2.220446049250313e-16

In [4]:
# What is this?
eps(200.0)

2.842170943040401e-14
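One way to read `eps(200.0)`: it is the spacing between `200.0` and the next larger `Float64`. Since $2^7 \leq 200 < 2^8$, that spacing is $2^7$ times `eps()`. A quick check with the standard functions `nextfloat` and `exponent`:

```julia
x = 200.0
println(nextfloat(x) - x == eps(x))           # true
println(eps(x) == 2.0^exponent(x) * eps())    # true, since exponent(200.0) == 7
println(eps(1.0) == eps())                    # true: eps() is the spacing at 1.0
```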

Julia has a rich system of data types, in which `Float64` is a subtype of `AbstractFloat`, which has four subtypes. In addition to the standard types `Float64` and `Float32`, there are `Float16`, which uses only two bytes of memory, and `BigFloat`, whose mantissa has 256 bits by default.

In [5]:
supertype(Float64)

AbstractFloat

In [6]:
subtypes(AbstractFloat)

4-element Array{Any,1}:
 BigFloat
 Float16 
 Float32 
 Float64 

In [7]:
for T in (Float16, Float32, Float64, BigFloat)
    println(eps(T))
end

0.000977
1.1920929e-7
2.220446049250313e-16
1.727233711018888925077270372560079914223200072887256277004740694033718360632485e-77


### Basic floating point operations

We begin with the four basic operations: addition ($+$), subtraction ($-$), multiplication ($*$) and
division ($/$). Let $op$ be an operation such that

$$
op \in \{ + , - , *,/\}.
$$

In floating point arithmetic with machine precision $\epsilon_M$, it is reasonable to expect that for any two
floating point numbers $x$ and $y$ we have

$$
fl(x\;op\;y) = (x \; op\; y)\;(1 + \xi),\quad
|\xi| \leq \epsilon_M
$$

For division, we assume $y \neq 0$.
Any computer that conforms to the IEEE standard must follow this rule.  Rounding is one of two limitations that floating point
arithmetic has and real arithmetic does not. It follows quickly from the rule above that as long as all we do is add numbers of the same sign, multiply, and divide, floating point results will almost always be very close to the corresponding real arithmetic results. The difficulty arises when $x$ or $y$ is itself a rounded quantity and we either add numbers of opposite signs or subtract numbers of the same sign.
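The rounding rule $fl(x\;op\;y) = (x\;op\;y)(1+\xi)$ can be checked numerically for `Float64`. The sketch below uses `BigFloat` arithmetic as a reference for the exact result, which is an approximation in its own right (exact for these sums and products, rounded to 256 bits for the quotient). The helper name `rounding_error` and the operands `0.1` and `0.3` are illustrative.

```julia
ϵM = 2.0^-53                                  # machine precision for Float64

# Relative error |ξ| of the Float64 result of op(x, y), measured against BigFloat.
function rounding_error(op, x::Float64, y::Float64)
    computed = op(x, y)                       # fl(x op y)
    reference = op(big(x), big(y))            # same operands in 256-bit arithmetic
    return abs((computed - reference) / reference)
end

for op in (+, -, *, /)
    ξ = rounding_error(op, 0.1, 0.3)
    println(op, ": |ξ| = ", Float64(ξ), ", within ϵM: ", ξ <= ϵM)
end
```

Note that the operands here are the `Float64` numbers nearest to 0.1 and 0.3, so the check concerns the rounding of a single operation only, not the earlier rounding of the decimal inputs.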

That is, suppose we have

$$
\tilde{x}= x(1+\delta_x), \quad \tilde{y} = y(1+\delta_y)
$$

where $x$ and $y$ are the exact real results of some computation and $\tilde{x}$ and $\tilde{y}$ are rounded floating point results with $|\delta_x|, |\delta_y| \leq \delta$ for some small $\delta$.  Suppose also that $x$ and $y$ have the same sign. Let

$$
z=x-y, \quad \tilde{z} = fl(\tilde{x} -\tilde{y}).
$$

Then

\begin{align*}
\tilde{z} &=(\tilde{x}-\tilde{y})(1+\xi), \quad |\xi| \leq \epsilon_M\\
&= x(1+\delta_x)(1+\xi) -y(1+\delta_y)(1+\xi) \\
&= x-y + \Delta_z,
\end{align*}

where

$$
\Delta_z = (x-y)\xi + (x\delta_x -y\delta_y)(1+\xi)
$$

The best available bound on $|\Delta_z|$ is

\begin{align*}
|\Delta_z| &\leq |x-y||\xi| + (|x||\delta_x| + |y||\delta_y|)(1+|\xi|) \\
& \leq |x-y| \epsilon_M + (|x|+|y|)\delta(1+\epsilon_M)
\end{align*}

Thus

\begin{align*}
|\delta_z| &= \frac{|\tilde{z}-z|}{|z|}\\
&\leq \epsilon_M + (1+\epsilon_M)\delta\frac{|x|+|y|}{|x-y|}
\end{align*}

If $|x-y| \ll |x|+|y|$, the rounding in the subtraction itself is not important, but the errors from
previous computations of $x$ and $y$ can have a huge effect. This effect is called __propagation__. It can dramatically change the result of a computation! We will see this issue in some examples later in this lecture.
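A small numerical sketch of propagation, under the assumption that `BigFloat` values play the role of the exact $x$ and $y$ and their `Float64` roundings play the role of $\tilde{x}$ and $\tilde{y}$. The particular numbers are chosen only so that $|x-y| \ll |x|+|y|$.

```julia
# "Exact" data (BigFloat) and their Float64 roundings
x, y = big"1.000000000000001", big"1.0"
xt, yt = Float64(x), Float64(y)

z = x - y                  # reference difference, about 1e-15
zt = xt - yt               # Float64 subtraction of the rounded data

ϵM = 2.0^-53
δ = ϵM                     # |δ_x|, |δ_y| ≤ ϵM because xt, yt are correctly rounded
relerr = abs(zt - z) / abs(z)
bound = ϵM + (1 + ϵM) * δ * (abs(x) + abs(y)) / abs(x - y)

println("relative error of the computed difference: ", Float64(relerr))
println("bound from the estimate above:             ", Float64(bound))
```

Although each input carries a relative error of at most about $10^{-16}$, the difference is wrong by roughly 11%, which is within the bound of about 22% given by the estimate.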

Rounding is the first important limitation of floating point arithmetic.  A second limitation is the number range.



__Number Ranges__

Floating point arithmetic has a largest and smallest computer number. First, the largest one.

__Largest Computer Number__ $\Omega$

In base $2$, with a $t$ bit mantissa, the largest computer number is

$$
	\Omega = (1 - 2^{-t}) \cdot 2^{e_{\max}}
$$

When numbers exceed $\Omega$, they are stored as `Inf` ($\infty$) or
`-Inf` ($-\infty$).
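A short check of these statements for `Float64` ($t = 53$, $e_{\max} = 1024$). Because $2^{1024}$ itself overflows, the formula is evaluated in the equivalent form $(2 - 2^{-52}) \cdot 2^{1023}$.

```julia
Ω = floatmax(Float64)
println(Ω == (2 - 2.0^-52) * 2.0^1023)   # true: same value as (1 - 2^-53) * 2^1024
println(2Ω)                               # Inf
println(-2Ω)                              # -Inf
println(Ω + 1.0 == Ω)                     # true: 1.0 is far below the spacing at Ω
```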


__IEEE Standard Single Precision (Float32)__

$$
e_{\max} = 128,  \quad \Omega \approx 3.4028 \times 10^{38}
$$


__IEEE Standard Double Precision (Float64)__

$$
e_{\max} = 1024, \quad \Omega \approx 1.7977 \times 10^{308}
$$

The MATLAB command `realmax` displays this number; the Julia counterpart is `floatmax()`.



__Smallest Computer Number__ $\omega$

The definition of the smallest computer number is somewhat more complex.


__IEEE Floating Point Standard__

The smallest computer number is given by

$$
\omega = 2^{1-t} 2^{e_{\min}}.
$$

If a computation produces a nonzero number smaller in magnitude than $\omega$, the result is called an
__underflow__ and is set to $0$ or
$-0$.  If the programmer chooses, an underflow can raise an error, but in most computations underflows are harmless.



__IEEE Standard Single Precision (Float32)__

$$
\omega = 2^{-23- 126} = 2^{-149} \approx  1.4013 \times 10^{-45}.
$$

In MATLAB, this value is obtained with the command

`omega = eps('single')*realmin('single');`


__IEEE Standard Double Precision (Float64)__

$$
\omega= 2^{-1022-52} = 2^{-1074} \approx  4.9407 \times 10^{-324}
$$

The appropriate MATLAB command to get this value is

`omega = eps*realmin`
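In Julia, the same recipe can be sketched with `eps` and `floatmin`, which play the roles of MATLAB's `eps` and `realmin`. The call `nextfloat(zero(T))` returns the smallest positive number of type `T` directly, so it serves as a cross-check.

```julia
for T in (Float32, Float64)
    ω = eps(T) * floatmin(T)          # same recipe as the MATLAB commands
    println(T, ": ω = ", ω, ", equals nextfloat(zero(T)): ", ω == nextfloat(zero(T)))
end
```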

_Important and Subtle Point_ --- Numbers at the bottom of the exponent
range are not normalized.

MATLAB function `realmin` yields

$$
	\underline{\omega_{useful} \approx 2.2251 \times 10^{-308}}
$$

Some people call this the smallest USEFUL floating point
 number since
 
$$
1/\omega_{useful} \leq \Omega
$$

and $\omega_{useful}$ is normalized.

The smallest floating point number, $\omega$, has the form

$$
	0.0 \cdots 01 \times 2^{e_{\min}} \quad \cdots\quad
	\underline{\mbox{Gradual Underflow}}
$$

Before the IEEE standard, the smallest floating point number
on most computers was
$$
	0.10 \cdots 0 \times 2^{e_{\min}} \qquad \cdots
	\mbox{ normalized}
$$

Earlier (pre-1985) computers set numbers below this smallest 'useful' floating point number to zero.
The change to gradual underflow was one of the more controversial features of the IEEE standard.
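Gradual underflow can be observed directly in `Float64`. The sketch below halves the smallest normalized number until the result becomes zero; the helper name `halvings_to_zero` is illustrative.

```julia
# Count how many halvings it takes until x is rounded to zero.
function halvings_to_zero(x)
    n = 0
    while x != 0
        x /= 2
        n += 1
    end
    return n
end

x = floatmin(Float64)            # 2^-1022, the smallest normalized Float64
println(issubnormal(x))          # false
println(issubnormal(x / 2))      # true: nonzero, but with fewer significant bits
println(halvings_to_zero(x))     # 53: 52 unnormalized steps down to 2^-1074, then zero
```

On a machine without gradual underflow, the first halving would already give zero.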

In [12]:
1/floatmin(),floatmax()

(4.49423283715579e307, 1.7976931348623157e308)

In [8]:
for T in (Float16, Float32, Float64, BigFloat)
    println((floatmin(T),floatmax(T)))
end

(Float16(6.104e-5), Float16(6.55e4))
(1.1754944f-38, 3.4028235f38)
(2.2250738585072014e-308, 1.7976931348623157e308)
(8.50969131174083613912978790962048280567755996982969624908264897850135431080301e-1388255822130839284, 5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282)


## Special quantities $0$, $-0$, `Inf` and `NaN`



Zero has a sign:

In [4]:
a=1.0
b=0.0
c=-b
c,b==c

(-0.0, true)

In [5]:
d=a/b
e=a/c
d==e, 1/d==1/e

(false, true)

In [6]:
b/c

NaN
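A few more properties of the special quantities, checked with standard Julia functions. This is a small sampler rather than a complete list of the IEEE rules.

```julia
println(NaN == NaN)              # false: NaN compares unequal to everything, itself included
println(isnan(0.0 / 0.0))        # true
println(isnan(Inf - Inf))        # true
println(isnan(0 * Inf))          # true
println(0.0 == -0.0)             # true: == treats the two zeros as equal
println(isequal(0.0, -0.0))      # false: isequal distinguishes them
println(signbit(-0.0))           # true
```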

In [9]:
bitstring(0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [11]:
bitstring(1)

"0000000000000000000000000000000000000000000000000000000000000001"

In [12]:
bitstring(0.0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [13]:
bitstring(-0.0)

"1000000000000000000000000000000000000000000000000000000000000000"

In [14]:
bitstring(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

__Exercise.__ Explain the preceding binary representations.
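As a possible starting point, the 64 bits of a `Float64` can be split into the three IEEE 754 fields: 1 sign bit, 11 exponent bits (stored with a bias of 1023) and 52 fraction bits. The helper `fields` is an illustrative name, not a library function. Note also that `bitstring(0)` and `bitstring(1)` are applied to integers, so they show integer bit patterns rather than floating point encodings.

```julia
# Split the bit pattern of a Float64 into its sign, exponent and fraction fields.
function fields(x::Float64)
    s = bitstring(x)
    return (sign = s[1:1], exponent = s[2:12], fraction = s[13:64])
end

f = fields(1.0)
println(f)
# Stored exponent minus the bias 1023 gives the true exponent (0 for 1.0).
println(parse(Int, f.exponent; base = 2) - 1023)
```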

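## Machine precision $\varepsilon$

The machine precision $\varepsilon$ is the smallest power of two such that $1+\varepsilon\neq 1$ in floating point arithmetic. The loop below stops one halving later, at $\varepsilon/2$: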

In [16]:
b=1.0
a=2.0
while (b+a)!=b
    a=a/2
    println(a)
end

1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
6.103515625e-5
3.0517578125e-5
1.52587890625e-5
7.62939453125e-6
3.814697265625e-6
1.9073486328125e-6
9.5367431640625e-7
4.76837158203125e-7
2.384185791015625e-7
1.1920928955078125e-7
5.960464477539063e-8
2.9802322387695312e-8
1.4901161193847656e-8
7.450580596923828e-9
3.725290298461914e-9
1.862645149230957e-9
9.313225746154785e-10
4.656612873077393e-10
2.3283064365386963e-10
1.1641532182693481e-10
5.820766091346741e-11
2.9103830456733704e-11
1.4551915228366852e-11
7.275957614183426e-12
3.637978807091713e-12
1.8189894035458565e-12
9.094947017729282e-13
4.547473508864641e-13
2.2737367544323206e-13
1.1368683772161603e-13
5.684341886080802e-14
2.842170943040401e-14
1.4210854715202004e-14
7.105427357601002e-15
3.552713678800501e-15
1.7763568394002505e-15
8.881784197001252e-16
4.440892098500626e-16
2.220446049250313e-16
1.1102230246251565e-16


In [17]:
1+a==1.0

true

In [18]:
2a, 1+2a==1.0

(2.220446049250313e-16, false)

Most programming environments have a built-in command that returns $\varepsilon$:

In [19]:
eps()

2.220446049250313e-16

In [20]:
# What is this?
eps(200.0)

2.842170943040401e-14

In [21]:
methods(eps)

In [22]:
eps(Float64), 2.0^(-52)

(2.220446049250313e-16, 2.220446049250313e-16)

In [23]:
eps(Float32), 2.0^(-23)

(1.1920929f-7, 1.1920928955078125e-7)

In [24]:
eps(Float16), 2.0^(-10)

(Float16(0.000977), 0.0009765625)

In [25]:
eps(BigFloat), 2.0^(-255)

(1.727233711018888925077270372560079914223200072887256277004740694033718360632485e-77, 1.727233711018889e-77)
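The four comparisons above follow one pattern: `eps(T)` equals $2^{1-t}$, where $t$ is the number of mantissa bits of the type `T`. A compact sketch using the standard function `precision`, which returns $t$ including the leading bit:

```julia
for T in (Float16, Float32, Float64, BigFloat)
    t = precision(T)                  # 11, 24, 53 and (by default) 256
    println(T, ": t = ", t, ", eps(T) == 2^(1-t): ", eps(T) == T(2)^(1 - t))
end
```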

## Catastrophic cancellation

In exact arithmetic, the quadratic equation

$$ ax^2 + bx+c=0$$

has the solutions

\begin{align*}
x_1&=\frac{-b-\sqrt{b^2-4ac}}{2a} \\
x_2&=\frac{-b+\sqrt{b^2-4ac}}{2a}\equiv \frac{-b+\sqrt{b^2-4ac}}{2a}\cdot \frac{-b-\sqrt{b^2-4ac}}{-b-\sqrt{b^2-4ac}}
\\ &= \frac{2c}{-b-\sqrt{b^2-4ac}}=x_3
\end{align*}


In [26]:
a=2.0
b=123456789.0
c=4.0

x1=(-b-sqrt(b*b-4*a*c))/(2.0*a)
x2=(-b+sqrt(b*b-4*a*c))/(2.0*a)
x3=(2*c)/(-b-sqrt(b*b-4*a*c))
x1,x2,x3

(-6.172839449999997e7, -3.3527612686157227e-8, -3.240000029484002e-8)

Let us check with `BigFloat`:

In [27]:
a=BigFloat(a)
b=BigFloat(b)
c=BigFloat(c)
x2=(-b+sqrt(b*b-4*a*c))/(2.0*a)

-3.240000029484001968915648868258452417675753633383540995167795107129921671968718e-08
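The observation above suggests a common remedy: compute the root that involves no cancellation from the standard formula and the other one from the rewritten formula. The function below is a minimal sketch of that idea, not the notebook's own code. The name `stable_roots` is made up here, and it assumes $a \neq 0$ and $b^2-4ac > 0$.

```julia
# Roots of a*x^2 + b*x + c = 0, assuming a ≠ 0 and b^2 - 4ac > 0.
function stable_roots(a, b, c)
    s = sqrt(b * b - 4 * a * c)
    q = -(b + copysign(s, b)) / 2     # b and copysign(s, b) share a sign: no cancellation
    return q / a, c / q               # the second root uses the rewritten formula 2c / (-b ∓ s)
end

println(stable_roots(2.0, 123456789.0, 4.0))
```

For the coefficients used above, the second component should agree with `x3` and with the `BigFloat` value to roughly working precision.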

Another example:

In [28]:
x=1e-10
tan(x)-sin(x)

0.0

However,
trigonometric identities give:

\begin{align*}
\tan x - \sin x & = \tan x (1 - \cos x ) 
= \tan x (1-\cos x)\frac{1+\cos x}{1+\cos x}\\ & = \tan x \frac{1-\cos^2 x}{1+\cos x} \\
&= \tan x \sin^2 x \frac{1}{1+\cos x},
\end{align*}

and Taylor's formula gives:

\begin{align*}
\tan x &= x + \frac{x^3}{3} + \frac{2x^5}{15} + O(x^7) \\
\sin x &= x -\frac{x^3}{6} + \frac{x^5}{120}+O(x^7) \\
\tan x - \sin x &= \frac{x^3}{2} + \frac{x^5}{8} +O(x^7)
\end{align*}

Both formulas give a result that is accurate to working precision:

In [29]:
tan(x)*sin(x)^2/(1+cos(x)), x^3/2+x^5/8

(5.0e-31, 5.0e-31)
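As with the quadratic equation, the result can be cross-checked against a `BigFloat` evaluation of the original expression. This is a sketch that assumes the default 256-bit `BigFloat` precision, which leaves enough correct digits after the cancellation.

```julia
x = 1e-10
naive  = tan(x) - sin(x)                          # suffers from cancellation
stable = tan(x) * sin(x)^2 / (1 + cos(x))         # rearranged formula from the text
reference = Float64(tan(big(x)) - sin(big(x)))    # 256-bit evaluation, then rounded

println((naive, stable, reference))
println(abs(stable - reference) / abs(reference)) # relative error of the stable formula
```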