### Representation of numbers


#### Integers
 - Base $b$ system ($b\in\mathbb{Z}, b\geq 2$): $\forall x\in\mathbb{Z}^+.\ x={(d_nd_{n-1}\ldots d_0)}_b=\sum_{i=0}^n d_i\times b^i,\ 0\leq d_i < b,\ d_i\in \mathbb{Z}$
 - An extra sign digit represents positive / negative; in a computer this is a single bit, 0/1
 - Converting decimal to base $b$: repeatedly divide the decimal number $x$ by the base $b$, replacing $x$ by the quotient and recording the remainder each time, until the quotient is 0. The remainders, in order, are $d_0,\ldots,d_n$
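
The repeated-division procedure can be sketched in a few lines of Python. The function name `to_base` is hypothetical, chosen only for this illustration; digits are returned as integers, most significant first.

```python
def to_base(x: int, b: int) -> list[int]:
    """Return the base-b digits of a non-negative integer, most significant first."""
    if x < 0 or b < 2:
        raise ValueError("expects x >= 0 and b >= 2")
    digits = []
    while True:
        x, r = divmod(x, b)   # r is the next digit d_0, d_1, ...
        digits.append(r)
        if x == 0:
            break
    return digits[::-1]       # reverse so the result reads d_n ... d_0

print(to_base(13, 2))    # [1, 1, 0, 1]
print(to_base(255, 16))  # [15, 15]
```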

#### Reals
 - Base $b$ system: extend the integer representation to the fractional part: $.x={(.d_{-1}d_{-2}\ldots)}_b=\sum_{i=1}^\infty d_{-i}\times b^{-i}$
 - Every terminating binary fraction is also a terminating decimal fraction, but not conversely (e.g. $0.1_{10}$ has no finite binary expansion)
 - Converting decimal to binary: repeatedly multiply the decimal fraction $x$ by 2, take the integer part as the next digit and keep the fractional part, until the fractional part is 0 (this may never happen, so in practice stop after enough digits). The integer parts, in order, are $d_{-1},\ldots,d_{-n}$
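
A minimal sketch of the repeated-multiplication procedure, using exact rational arithmetic so that the example itself is free of round-off. The name `fraction_to_binary` and the `max_digits` cut-off are assumptions made for this illustration.

```python
from fractions import Fraction

def fraction_to_binary(x: Fraction, max_digits: int = 20) -> tuple[list[int], bool]:
    """Binary digits d_{-1}, d_{-2}, ... of 0 <= x < 1, and whether the expansion terminated."""
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= 2
        d = int(x)        # the integer part is the next binary digit
        digits.append(d)
        x -= d            # keep only the fractional part
    return digits, x == 0

print(fraction_to_binary(Fraction(5, 8)))      # ([1, 0, 1], True)
print(fraction_to_binary(Fraction(1, 10), 8))  # ([0, 0, 0, 1, 1, 0, 0, 1], False)
```

The second call shows that $0.1_{10}$ does not terminate in binary: the pattern $0011$ repeats forever.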

### Computer Representation of Numbers

#### Simplified form
$$x = (f)_b\times b^{(e)_b} \\ 
f = \pm(.d_1d_2...d_t)_b\:(mantissa \:|\: significand) \\
e = \pm (c_{s-1}c_{s-2}...c_0)_b\:(exponent\:|\: characteristic)$$
- **normalized floating-point number**  if $d_1\neq 0$ or $f=0$
- $0\leq|f|<1$
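
For $b=2$, Python's `math.frexp` happens to return exactly this kind of decomposition: a mantissa $f$ with $1/2\leq|f|<1$ (so $d_1\neq 0$) and an integer exponent $e$ with $x=f\times 2^e$. A small illustration:

```python
import math

f, e = math.frexp(6.0)     # 6.0 = 0.75 * 2**3 = (.11)_2 * 2^3
print(f, e)                # 0.75 3
print(f * 2**e == 6.0)     # True: the decomposition is exact
```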

#### IEEE Standard form
The IEEE Standard form is the way most computers represent floating-point numbers in binary, which is  
$$(-1)^q \times (d_0.d_1d_2...d_{t-1})_b\times 2^e \\
E_{min} \leq e \leq E_{max} (E_{min} = - E_{max} + 1), e = (-1)^p \times (c_{s-2}...c_1c_0)_b \\
q, p \in \{0,1\}, d_i, c_i \in \{0,1\}\\
$$
<table>
<thead>
<tr>
<td>Type of number</td>
<td>number of bits</td>
<td>$E_{min}$</td>
<td>$E_{max}$</td>
<td>t</td>
<td>$\epsilon_{mach}$</td>
</tr></thead>
<tbody>
<tr>
<td>single precision, binary32</td>
<td>32</td>
<td>-126</td>
<td>127</td>
<td>23+1 = 24</td>
<td>$1.2\times 10^{-7}$</td>
</tr>
<tr>
<td>double precision, binary64</td>
<td>64</td>
<td>-1022</td>
<td>1023</td>
<td>52 + 1 = 53</td>
<td>$2.2\times 10^{-16}$</td>
</tr>
<tr>
<td>quadruple precision, binary128</td>
<td>128</td>
<td>-16382</td>
<td>16383</td>
<td>112+1=113</td>
<td>$1.9\times 10^{-34}$</td>
</tr>
</tbody>
</table>
 
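 - The mantissa is normalized: $d_0$ is always 1, so it does not need to be stored (this is the "+1" in the $t$ column)
 - IEEE uses proper rounding (round to nearest, ties to even) by default
 - It includes "special" values (symbols) for results that are not finite numbers (e.g. $\infty, -\infty, NaN$)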
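
The double-precision row of the table can be checked from Python, assuming (as on essentially all current platforms) that a Python `float` is an IEEE binary64. Note that `sys.float_info` follows the C convention of a mantissa in $[1/2, 1)$, so its `min_exp` and `max_exp` are shifted by one relative to $E_{min}$ and $E_{max}$ above.

```python
import math
import sys

info = sys.float_info
t = info.mant_dig            # number of mantissa bits, including the hidden bit
e_min = info.min_exp - 1     # shift by one to match the d_0.d_1d_2... convention
e_max = info.max_exp - 1
eps = info.epsilon           # gap between 1.0 and the next larger double

print(t, e_min, e_max, eps)  # 53 -1022 1023 2.220446049250313e-16

# Special values: infinity exceeds every finite float, and NaN is unequal to itself.
print(math.inf > info.max, math.nan == math.nan)   # True False
```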

#### Limits
- The exponent $e$ is limited by $E_{min}\leq e\leq E_{max}, -E_{min}=E_{max}=(aa...a)_b, a = b-1$
- The largest floating point number **Overflow Level (OFL)** is $N_{max} = (.aa...a)_b\times b^{(aa...a)_b}$
- The smallest positive floating point number **Underflow Level (UFL)** is $N_{min}=(.100...0)_b\times b^{-(aa...a)_b}$ (normalized, what we often consider), $N_{min}=(.00...1)_b\times b^{-(aa...a)_b}$ (non-normalized)


#### Set of floating point numbers
- **$R_b(t,s)$** the set of all base b floating point numbers that can be represented by $t$ b-digits mantissa and $s$ b-digits exponent
- $R_b(t,s) \subset [-OFL, -UFL]\cup \{0\} \cup [UFL, OFL]$; a real number causes **overflow** if it lies in $(-\infty, -OFL)\cup(OFL,\infty)$ and **underflow** if it lies in $(-UFL,UFL)\setminus\{0\}$
- $R_b(t,s)$ is finite and discrete, and its elements are denser near 0
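
For IEEE doubles these levels and the uneven spacing can be inspected directly. This is a sketch assuming Python 3.9 or later (for `math.ulp`) and binary64 floats; the variable names are illustrative.

```python
import math
import sys

ofl = sys.float_info.max     # largest finite double, about 1.8e308
ufl = sys.float_info.min     # smallest positive normalized double, 2**-1022
tiny = math.ulp(0.0)         # smallest positive non-normalized double, 2**-1074

print(ofl * 2)               # inf: the result overflowed
print(tiny / 2)              # 0.0: the result underflowed

# Gap between a float and the next larger one, at three different magnitudes.
for x in (1e-10, 1.0, 1e10):
    print(x, math.ulp(x))
```

The gaps grow with the magnitude of `x`, which is what "denser near 0" means in practice.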

#### Rounding
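$\forall x \in \mathbb{R}$, $fl(x)\in\mathbb{R}_b(t,s)$ denotes the floating point number used to represent $x$. Let $d_{t+1}$ be the first mantissa digit of $x$ that is dropped. The common ways to compute $fl(x)$ are
 - Chopping: simply discard every digit after the $t$th digit of the mantissa (round toward 0)
 - Traditional rounding: round up if $d_{t+1}\geq b/2$, down if $d_{t+1} < b/2$
 - Proper (perfect) rounding: round up if $d_{t+1} > b/2$, down if $d_{t+1} < b/2$. If $d_{t+1} = b/2$, round up when any later digit is non-zero; when all later digits are 0 (an exact tie), round so that the $t$th digit is even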
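
The three rules can be compared with Python's `decimal` module, whose rounding modes `ROUND_DOWN`, `ROUND_HALF_UP` and `ROUND_HALF_EVEN` correspond to chopping, traditional rounding and proper rounding in base 10. The helper `fl` and the choice $t=3$ are assumptions for this illustration.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN

def fl(x: str, rounding: str, t: int = 3) -> Decimal:
    """Round the decimal string x to t significant digits with the given rule."""
    return Context(prec=t, rounding=rounding).create_decimal(x)

for x in ("2.345", "2.355", "2.3451"):
    print(x, fl(x, ROUND_DOWN), fl(x, ROUND_HALF_UP), fl(x, ROUND_HALF_EVEN))
# 2.345 2.34 2.35 2.34   (exact tie: proper rounding goes to the even digit 4)
# 2.355 2.35 2.36 2.36   (exact tie: proper rounding goes to the even digit 6)
# 2.3451 2.34 2.35 2.35  (not a tie: both rounding rules round up)
```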

### Representation Error

#### round-off error $fl(x) - x$
 - **relative round-off error** $\delta = \frac{fl(x)-x}{x}$, with $|\delta|\leq b^{1-t}$ (normalized, chopping) and $|\delta|\leq b^{1-t}/2$ (normalized, rounding)
 - The approximation $\hat{x}$ to $x$ is **correct in $r$ significant $b$-digits** if $|\frac{x-\hat{x}}{x}|=|\delta| \leq \frac{b^{1-r}}{2}$.<br>
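
As a concrete check of the rounding bound, the relative round-off error of storing $0.1$ as a double ($b=2$, $t=53$) can be computed exactly with rational arithmetic. This is only a sketch; it assumes Python floats are IEEE binary64.

```python
from fractions import Fraction

x = Fraction(1, 10)            # the real number 0.1
fl_x = Fraction(0.1)           # exact value of the double nearest to 0.1
delta = (fl_x - x) / x         # relative round-off error, computed exactly
bound = Fraction(1, 2 ** 53)   # b^(1-t)/2 with b = 2, t = 53

print(float(delta))            # about 5.55e-17
print(abs(delta) <= bound)     # True
```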

#### Error in arithmetic
 - Let $\circ$ be an arithmetic operator (or any function $f:\mathbb{R}_b(t,s)\rightarrow \mathbb{R}_b(t,s)$, e.g. $+,-,\times, /,\sin$), then the computer's floating-point operation $\bar{\circ}$ is constructed so that $x\bar{\circ} y = fl(x\circ y)$
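
For addition of doubles this property can be observed directly: the machine sum equals the exact sum of the two stored values, rounded once. The sketch below relies on `Fraction` for exact arithmetic and on the conversion `float(Fraction)` rounding to the nearest double.

```python
from fractions import Fraction

x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)   # exact sum of the two stored doubles

print(x + y == float(exact))        # True: machine sum is the rounded exact sum
print(Fraction(x + y) == exact)     # False: the exact sum is not itself a double
```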

#### Machine epsilon $(\epsilon_{mach})$
 - The smallest (non-normalized) floating-point number such that $fl(1+\epsilon_{mach}) > 1$. It bounds the relative round-off error, i.e. $|\delta| \leq \epsilon_{mach}$. Note $b^{1-t}/2 < \epsilon_{mach} = b^{1-t}/2 + b^{-t} \leq b^{1-t}$ under proper rounding.
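
A common experiment is to halve a candidate until adding half of it to 1 no longer changes the result. Note the hedge: this loop finds the smallest power of two $\epsilon$ with $fl(1+\epsilon)>1$, which for doubles is $b^{1-t}=2^{-52}$, the value listed in the table and reported by `sys.float_info.epsilon`.

```python
import sys

eps = 1.0
while 1.0 + eps / 2 > 1.0:   # stop once adding eps/2 is rounded away
    eps /= 2

print(eps)                              # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)    # True
```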

**Error propagation**
 As soon as an error arises, it may then be amplified or reduced in subsequent operations. <br>
 Let $x,y\in\mathbb{R}, fl(x) = x(1+\delta_x), fl(y)=y(1+\delta_y)$
 - Multiplication
$$fl(x)\bar{\times}fl(y) = fl(fl(x)\times fl(y)) \\
  = (x(1+\delta_x)y(1+\delta_y))(1+\delta_{xy}) \\ 
  = xy(1+\delta_x + \delta_y + \delta_{xy} + \delta_x \delta_y + \delta_x \delta_{xy} + \delta_{xy}\delta_y + \delta_x\delta_y\delta_{xy}) \\
  \approx xy(1+\delta_x+\delta_y+\delta_{xy}) \quad\text{the rest are their product, hence much smaller} \\
  |\delta_\times| \leq 3 \epsilon_{mach}
$$
 - Addition
 $$
     fl(x)\bar{+}fl(y) = fl(fl(x)+fl(y)) \\ 
     = (x(1+\delta_x) + y(1+\delta_y))(1+\delta_{x+y}) \\ 
     = x(1+\delta_x)(1+\delta_{x+y}) + y(1+\delta_y)(1+\delta_{x+y}) \\
     \approx x(1+\delta_x+\delta_{x+y}) + y(1+\delta_y+\delta_{x+y}) \\
     = (x+y)(1 + \frac{x(\delta_x+\delta_{x+y})}{x+y} + \frac{y(\delta_y+\delta_{x+y})}{x+y}) \\
     |\delta_+| \leq |\frac{x}{x+y}|2\epsilon_{mach}+|\frac{y}{x+y}|2\epsilon_{mach} = \frac{|x|+|y|}{|x+y|}2\epsilon_{mach} \\
     = 2\epsilon_{mach} \:|\: xy>0 \\ 
     = 2\epsilon_{mach}\frac{|x-y|}{|x+y|}\:|\: xy<0
 $$
As $x\rightarrow -y$, the factor $\frac{|x-y|}{|x+y|}$ grows without bound, so the relative error can become extremely large.
 - This loss of accuracy when adding nearly opposite numbers (equivalently, subtracting nearly equal numbers) is called **catastrophic cancellation**
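
A small double-precision illustration of cancellation: the value `x = 1e-15` is an arbitrary choice for this sketch. Adding it to 1 commits only a tiny relative error, but subtracting 1 again exposes that error relative to the much smaller true result.

```python
x = 1e-15
computed = (1.0 + x) - 1.0          # mathematically equal to x
rel_err = abs(computed - x) / x

print(computed)   # noticeably different from 1e-15
print(rel_err)    # on the order of 10%, far above machine epsilon
```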

#### Error Propagation
Consider some $x\in\mathbb{R}$ and output $f(x)\in\mathbb{R}$. <br>
Let $fl(x)=x(1+\delta_x)$ and assume $f$ is twice differentiable near $x$. Then, if $f(x)\neq 0$, Taylor's series gives $f(fl(x))=f(x)(1+\delta_f) + O(\delta_x^2)$. Ignoring the higher order terms, $\delta_f = \frac{xf'(x)}{f(x)}\delta_x$, so $|\delta_f|\leq |\frac{xf'(x)}{f(x)}|\epsilon_{mach}$ <br>
**(relative) condition number** $k_f=|\frac{xf'(x)}{f(x)}|$ measures the relative sensitivity of the computation of $f(x)$ on relatively small changes in the input $x$ (how the relative error in $x$ propagates in $f(x)$)
 - **Well-conditioned** if relatively small changes in the input produce relatively small changes in the output, otherwise **ill-conditioned**
 - Examples 
   - $f(x):=\sqrt{x}, k_f = |\frac{xf'(x)}{f(x)}| = 1/2$ is well-conditioned.
   - $f(x)=e^x, k_f = |\frac{xf'(x)}{f(x)}| = |x|$. This grows with $|x|$, but $e^x$ overflows long before $k_f$ becomes large (in double precision, near $x\approx 709$), so in practice it is well-conditioned.
 - Note that if $k_f > \epsilon_{mach}^{-1}$, we risk having no correct digits at all. $k_f$ is an inherent property of the function, not of the computation.
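
The two examples can be reproduced numerically. The helper `cond` is hypothetical and takes the derivative as an explicit argument; the last lines perturb the input of $e^x$ by a relative amount `dx` and compare the relative change in the output with $k_f$.

```python
import math

def cond(f, df, x):
    """Relative condition number |x f'(x) / f(x)|."""
    return abs(x * df(x) / f(x))

k_sqrt = cond(math.sqrt, lambda v: 0.5 / math.sqrt(v), 2.0)
k_exp = cond(math.exp, math.exp, 10.0)
print(k_sqrt, k_exp)          # approximately 0.5 and 10

# Perturbation check for exp at x = 10 with relative input change dx.
x, dx = 10.0, 1e-8
rel_out = abs(math.exp(x * (1 + dx)) - math.exp(x)) / math.exp(x)
ratio = rel_out / dx
print(ratio)                  # close to k_exp = 10
```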

#### Stability of algorithms 
 A numerical algorithm is **stable** if small changes in the algorithm input have a small effect on the output, otherwise **unstable**. 
 - Example: consider $(15.6-15.7)^2$ in 3 decimal digits with rounding, 
 $$(15.6-15.7)^2 = (-.1)^2 = .1\times 10^{-1} \\
 15.6^2 - 2(15.6)(15.7) + 15.7^2 = 243 - 490 + 246 = -.1\times 10^1$$ <br>
 Mathematically equivalent expressions are not necessarily computationally equivalent.
 - Ways to make an algorithm more stable
   - avoid adding nearly opposite numbers
   - minimize the number of operations
   - when adding several numbers, add them starting from the smallest and proceeding to the largest
   - be alert when adding numbers of very different scales
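
The $(15.6-15.7)^2$ example can be replayed with Python's `decimal` module, using a context with 3 significant digits and round-half-even so that every operation is rounded as in the hand computation. The variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)   # 3-digit decimal arithmetic
a, b = Decimal("15.6"), Decimal("15.7")

# Direct form: (a - b)^2
d = ctx.subtract(a, b)
direct = ctx.multiply(d, d)

# Expanded form: a^2 - 2ab + b^2, rounding after every operation
aa = ctx.multiply(a, a)                                # 243.36 -> 243
two_ab = ctx.multiply(Decimal(2), ctx.multiply(a, b))  # 244.92 -> 245, then 490
bb = ctx.multiply(b, b)                                # 246.49 -> 246
expanded = ctx.add(ctx.subtract(aa, two_ab), bb)

print(direct, expanded)   # 0.01 -1
```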

#### Forward and backward errors
Consider $y = f(x)$, assume $f^{-1}$ exists, assume instead of $y$, we compute $\hat{y}$ due to various errors. <br>
**Forward error** $y-\hat{y}$; it includes the effect of initial data error.<br>
Let $\hat{x} = f^{-1}(\hat{y})$, then $x-\hat{x}$ is the **backward error** <br>
We can also consider $\hat{y}$ as being the result of inexact computation $\hat{f}$, $\hat{y}=\hat{f}(x)$, therefore $\hat{f}(x) = f(\hat{x}), \hat{x}=f^{-1}(\hat{f}(x))$
- Relative forward error = $\frac{y-\hat{y}}{y}$, relative backward error = $\frac{x-\hat{x}}{x}$, and $|RFE|\approx k_f |RBE|$
- Example $x = 2, f(x):=\sqrt{x}, \hat{y}= 1.4$. 
   - $\hat{x} = f^{-1}(1.4) = 1.96$
   - Forward error $y - \hat{y} = 1.4142... - 1.4 = 0.0142...$
   - Relative forward error $\frac{y-\hat{y}}{y} = \frac{0.0142...}{1.4142...}\approx 0.01$
   - Backward error = $x-\hat{x} = 2 - 1.96 = 0.04$
   - Relative backward error = $\frac{x-\hat{x}}{x} = 0.04 / 2 = 0.02$
   - Condition number = 1/2, consistent with $|RFE|\approx k_f|RBE|$: $0.01\approx \frac{1}{2}\times 0.02$
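
The same numbers in Python, as a quick check that the ratio of relative forward to relative backward error is close to the condition number $1/2$:

```python
import math

x, y_hat = 2.0, 1.4
y = math.sqrt(x)
x_hat = y_hat ** 2            # f^{-1}(y_hat) for f = sqrt

rfe = (y - y_hat) / y         # relative forward error, about 0.01
rbe = (x - x_hat) / x         # relative backward error, about 0.02
print(rfe, rbe, rfe / rbe)    # the ratio is close to 0.5
```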


#### Truncation and rounding errors
**Truncation error** arises when a mathematical expression is approximated by another expression, assuming the approximate expression is evaluated in exact arithmetic. <br> 
Evaluating the approximate expression in finite-precision arithmetic introduces additional error, the **rounding error**.<br>
**Computational error** is the sum of the two.
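
One standard illustration, not taken from these notes, is the forward-difference approximation $f'(x)\approx\frac{f(x+h)-f(x)}{h}$: replacing the derivative by a difference quotient gives a truncation error that shrinks with $h$, while the rounding error in the subtraction grows as $h$ shrinks. The step sizes below are arbitrary choices for the sketch.

```python
import math

def forward_diff(f, x, h):
    """Forward-difference approximation to f'(x)."""
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)   # true derivative of sin at x = 1
errors = {h: abs(forward_diff(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-8, 1e-15)}

for h, err in errors.items():
    print(f"h={h:g}  error={err:.2e}")
```

With a large step the truncation error dominates, with a tiny step the rounding error dominates, and the total computational error is smallest somewhere in between.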


#### Total Error
Consider the input $x$, its computer representation $\hat{x}$, the target computation $f(x)$, and an approximating function $g(x)$. The computer actually evaluates $\hat{g}(\hat{x})$, so the **total error** is $$f(x)-\hat{g}(\hat{x}) = (f(x)-f(\hat{x})) + [(f(\hat{x})-g(\hat{x})) + (g(\hat{x}) - \hat{g}(\hat{x}))] \\
= (propagated) + [(truncation) + (rounding)] $$
 - Example: $\sin(\pi/8)$. Let $x=\pi, \hat{x}=3, f(x)=\sin(x/8), g(x)=x/8-(x/8)^3/3!$, with the computation done in 3-decimal-digit floating point arithmetic. Then $\hat{g}(\hat{x}) = fl(fl(3/8) - fl(fl(1/3!)fl(fl(3/8)^3))) \approx 0.366$
   - initial data error $x-\hat{x} = \pi - 3 \approx 0.14159...$ 
   - propagated error $f(x) - f(\hat{x}) = \sin(\pi/8) - \sin(3/8)\approx 0.0164$
   - truncation error $f(\hat{x}) - g(\hat{x}) = \sin(3/8) - (3/8 - (3/8)^3/6) \approx 0.0000616$
   - rounding error $g(\hat{x}) - \hat{g}(\hat{x}) = 3/8 - (3/8)^3/6 - fl(3/8 - (3/8)^3/6) = 0.0002109$
   - Total error $f(x) - \hat{g}(\hat{x}) = 0.38268343... - 0.366 = 0.01668343 \approx 0.0164 + 0.0000616 + 0.0002109$
 - General conclusions
   - mathematical operations are not always equivalent to the respective computational operations 
   - Not all mathematical formulas and other properties hold in computer arithmetic
   - in most cases only small errors arise, but we should stay alert to the cases where they become large
   - using maximum available precision is suggested
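
The $\sin(\pi/8)$ decomposition can be reproduced in Python, taking $f(x)=\sin(x/8)$ so that $f(\pi)$ is the target value. The helper `fl` is an assumption of this sketch: it imitates 3-digit decimal arithmetic by rounding a double to 3 significant digits through string formatting, which is adequate here but not a general-purpose decimal arithmetic.

```python
import math

def fl(v: float) -> float:
    """Round v to 3 significant decimal digits (stand-in for 3-digit arithmetic)."""
    return float(f"{v:.2e}")

def f(x):                      # target computation
    return math.sin(x / 8)

def g(x):                      # truncated Taylor approximation of f
    return x / 8 - (x / 8) ** 3 / 6

x, x_hat = math.pi, 3.0
u = fl(x_hat / 8)                            # fl(3/8) = 0.375
g_hat = fl(u - fl(fl(1 / 6) * fl(u ** 3)))   # g evaluated in 3-digit arithmetic

propagated = f(x) - f(x_hat)
truncation = f(x_hat) - g(x_hat)
rounding = g(x_hat) - g_hat
total = f(x) - g_hat

print(g_hat)                                 # 0.366
print(propagated, truncation, rounding)      # about 0.0164, 0.0000616, 0.0002109
print(total)                                 # about 0.01668
```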
   

### Taylor's Theorem

Let $k\in\mathbb{Z}^+, a\in\mathbb{R}, x\in\mathbb{R}$. Assume $f:\mathbb{R}\rightarrow\mathbb{R}$ is $k+1$ times differentiable on $(a, x)$ and $f^{(k)}$ is continuous on $[a, x]$. Then $\exists\epsilon\in(a,x)$ s.t. $f(x) = (\sum_{i=0}^k \frac{f^{(i)}(a)}{i!}(x-a)^i) + \frac{f^{(k+1)}(\epsilon)}{(k+1)!} (x-a)^{k+1}$. <br>
Then $t_k(x) = \sum_{i=0}^k \frac{f^{(i)}(a)}{i!}(x-a)^i$ is a good approximation to $f(x)$ when $x$ is close to $a$. <br>
$R_{k+1}(x) = \frac{f^{(k+1)}(\epsilon)}{(k+1)!} (x-a)^{k+1}$ is the truncation error. <br>
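
As a sketch of how the remainder bounds the truncation error, take $f=\sin$ and $a=0$. Every derivative of $\sin$ is bounded by 1 in absolute value, so $|R_{k+1}(x)|\leq\frac{|x|^{k+1}}{(k+1)!}$. The function name `taylor_sin` and the values of `x` and `k` are choices made for this illustration.

```python
import math

def taylor_sin(x: float, k: int) -> float:
    """Degree-k Taylor polynomial t_k(x) of sin about a = 0."""
    total = 0.0
    for i in range(1, k + 1, 2):     # even-order terms vanish at a = 0
        total += (-1) ** (i // 2) * x ** i / math.factorial(i)
    return total

x, k = 0.375, 3
approx = taylor_sin(x, k)                           # x - x^3/3!
remainder = math.sin(x) - approx                    # actual truncation error
bound = abs(x) ** (k + 1) / math.factorial(k + 1)   # from |f^(k+1)| <= 1

print(approx, remainder, bound)
print(abs(remainder) <= bound)                      # True
```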