# Scientific Computing

## 01.01 Scientific Computing

Given a mathematical relationship $y = f(x)$, do one of the following:
* evaluate
    * given an input $x$, compute the output $y$
* solve
    * find an input $x$ that produces a given output $y$
* optimize
    * find an input $x$ that produces an extreme value (max or min) of $y$

---

How to analyze a numerical solution:
* discrete vs. continuous
* linear vs. nonlinear
* finite or infinite dimensional
* purely algebraic or via derivatives or via integrals

---

**General Approach.** Replace a difficult problem with an easier one that has the same, or a closely related, solution.

## 01.02 Approximations

Sources of approximations.
* Before computation.
    * modelling
    * measurements
    * inherited from previous computations used as input
* During computation.
    * truncation / discretization
    * rounding

---

Measuring error.
* **absolute error** = approximate value - true value
* **relative error** = absolute error / true value
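
As a small illustration of the two definitions above, the sketch below approximates $\pi$ by 3.14. The values and the helper names `absolute_error` and `relative_error` are chosen here for illustration and are not part of the notes.

```python
import math

def absolute_error(approx, true):
    """Absolute error: approximate value minus true value."""
    return approx - true

def relative_error(approx, true):
    """Relative error: absolute error divided by the (nonzero) true value."""
    return absolute_error(approx, true) / true

approx, true = 3.14, math.pi
abs_err = absolute_error(approx, true)
rel_err = relative_error(approx, true)

# The sign shows the direction of the error: negative means the
# approximation is too small.
print(abs_err, rel_err)
```

The relative error is undefined when the true value is zero, which is one reason absolute error is still useful.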

---

Types of error.

$
\begin{aligned}
\text{total error} &= \hat{f}(\hat{x}) - f(x) \\
&= \text{computational error} + \text{propagated data error} \\
\\
\text{computational error} &= \hat{f}(\hat{x}) - f(\hat{x}) \\
\text{propagated data error} &= f(\hat{x}) - f(x) \\
\end{aligned}
$

where
* $\hat{f}(\hat{x})$ approximate function with approximate input
* $f(x)$ true function with true input
* $f(\hat{x})$ true function with approximate input


The *computational error* is the difference between the results of the approximate function and the true function, both evaluated at the same (approximate) input.

The *propagated data error* is the difference between the results of the true function evaluated at the approximate input and at the true input.
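
The decomposition above can be checked numerically. The sketch below assumes an illustrative setup that is not in the notes: $f(x) = \sin(x)$, true input $x = \pi/8$, approximate input $\hat{x} = 3/8$, and approximate function $\hat{f}(x) = x$ (the first term of the Taylor series).

```python
import math

def f(x):
    """True function (assumed for illustration)."""
    return math.sin(x)

def f_hat(x):
    """Approximate function: first term of the Taylor series of sin(x)."""
    return x

x, x_hat = math.pi / 8, 3 / 8  # true input and approximate input

total_error = f_hat(x_hat) - f(x)
computational_error = f_hat(x_hat) - f(x_hat)
propagated_data_error = f(x_hat) - f(x)

print(total_error, computational_error, propagated_data_error)
```

In this example the two components have opposite signs, so the total error is smaller in magnitude than either one.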

---

Types of computational error.

* truncation error
    * difference between the true result and the result the algorithm would produce using exact arithmetic
        * source: mathematical approximations such as truncating an infinite series or replacing a continuous problem with a discrete one
* rounding error
    * difference between the result the algorithm produces using exact arithmetic and the result it produces using limited-precision arithmetic
        * source: inexact representation of real numbers and of arithmetic operations on them

In practice, usually **one** of these dominates.
* truncation error usually dominates in continuous problems (derivatives, integrals)
* rounding error usually dominates in purely algebraic problems
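
A common way to see both kinds of computational error at once is a finite-difference derivative. The sketch below is an illustration chosen here, not taken from the notes: it approximates the derivative of $\sin(x)$ at $x = 1$ with a forward difference. Large step sizes $h$ suffer from truncation error, very small ones from rounding error.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # true derivative of sin at x

errors = {}
for k in range(1, 16):
    h = 10.0 ** -k
    errors[k] = abs(forward_difference(math.sin, x, h) - exact)
    print(f"h = 1e-{k:02d}  error = {errors[k]:.3e}")

# Exponent k of the step size h = 10**-k with the smallest total error.
best_k = min(errors, key=errors.get)
print("best h = 1e-%02d" % best_k)
```

The error first shrinks as $h$ decreases (truncation error falling) and then grows again (rounding error rising), so the best step size lies between the two extremes.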

---

## 01.03 Forward and Backward Error

Forward Error: $\Delta{y} = \hat{y} - y$

Backward Error: $\Delta{x} = \hat{x} - x$

where
* $\hat{y}$ is the computed result
* $y$ is the true result
* $\hat{x}$ is the input for which the true function gives the computed result, i.e. $f(\hat{x}) = \hat{y}$
* $x$ is the true input

In practice, **backward error** is often easier to estimate than **forward error**, since the true result $y$ is usually unknown.
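
A minimal sketch of both quantities, assuming an example that is not in the notes: approximate $\cos(1)$ by the truncated series $\hat{f}(x) = 1 - x^2/2$. The backward error asks which input the true cosine would need in order to produce the computed result.

```python
import math

x = 1.0
y = math.cos(x)            # true result
y_hat = 1 - x**2 / 2       # computed result from the truncated series

# Input for which the true function reproduces the computed result:
# cos(x_hat) == y_hat.
x_hat = math.acos(y_hat)

forward_error = y_hat - y
backward_error = x_hat - x

print(forward_error, backward_error)
```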

---

## 01.04 Conditioning, Stability, and Accuracy

A problem is **well-posed** if its solution:
* exists
* is unique
* depends continuously on the problem data

A problem is **ill-conditioned** if the relative change in the solution can be much larger than the relative change in the input data.
* The term **sensitivity** refers to how ill-conditioned a problem is.

---

#### Condition Number

The condition number relates backward error to the forward error.

$$
|\text{relative forward error}| = \text{cond} \times |\text{relative backward error}|
$$

More formally.

$
\begin{aligned}
\text{cond} &= \frac{|[f(\hat{x}) - f(x)] / f(x)|}{|(\hat{x} - x) / x|} \\
&= \frac{|\Delta{y} / y|}{|\Delta{x} / x|} \\
&\approx \frac{|x f'(x)|}{|f(x)|}
\end{aligned}
$

The last step uses $f(\hat{x}) - f(x) \approx f'(x) \Delta{x}$ for small $\Delta{x}$.

where
* *cond* is the condition number
* $f(\hat{x})$ is the true function with approximate input
* $f(x)$ is the true function with true input
* $\hat{x}$ is the approximate input
* $x$ is the true input

A problem is **ill-conditioned** when $\text{cond} \gg 1$.
* That is, the relative change in the output is much larger than the relative change in the input.

The condition number of the inverse of $f$ is the reciprocal of the condition number of $f$, evaluated at corresponding points $y = f(x)$.

$
\text{cond}(f^{-1}) = \frac{1}{\text{cond}(f)}
$
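
The formulas above can be tried out with a short sketch. The helper names `cond_estimate` and `cond_formula`, the perturbation size, and the test functions are assumptions made here for illustration.

```python
import math

def cond_estimate(f, x, rel_dx=1e-9):
    """Estimate cond as |relative change in output| / |relative change in input|."""
    x_hat = x * (1 + rel_dx)
    rel_forward = (f(x_hat) - f(x)) / f(x)
    rel_backward = (x_hat - x) / x
    return abs(rel_forward) / abs(rel_backward)

def cond_formula(f, fprime, x):
    """Evaluate cond = |x f'(x)| / |f(x)| from a known derivative."""
    return abs(x * fprime(x)) / abs(f(x))

# Well-conditioned: the square root has cond = 1/2 everywhere.
cond_sqrt = cond_estimate(math.sqrt, 2.0)

# Ill-conditioned: the tangent near pi/2.
x_bad = 1.57079
cond_tan_est = cond_estimate(math.tan, x_bad)
cond_tan = cond_formula(math.tan, lambda t: 1 / math.cos(t) ** 2, x_bad)

# Inverse functions: squaring (at x = 2) and square root (at y = 4).
cond_square = cond_formula(lambda t: t * t, lambda t: 2 * t, 2.0)
cond_root = cond_formula(math.sqrt, lambda t: 0.5 / math.sqrt(t), 4.0)

print(cond_sqrt, cond_tan_est, cond_tan, cond_square * cond_root)
```

The finite-difference estimate and the derivative formula agree closely here because the perturbation is small relative to the distance from the singularity of the tangent.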

---

**Stability** is analogous to conditioning, but in the context of *computational error*.
* Stability refers to the effect of computational error on the result computed by an algorithm.
* In contrast, conditioning refers to the effect of data error on the solution to a problem.

In terms of error analysis, we say an algorithm is stable when the solution it produces has a relatively small backward error, i.e. it is the exact solution to a nearby problem.

---

**Accuracy** is the closeness of the computed solution to the true solution and depends on:
* conditioning of problem
* stability of algorithm


| Algorithm Stability | Problem Conditioning | Result | 
| ------------------- | -------------------- | ------ |
| stable | well-conditioned | accurate |
| stable | ill-conditioned | not accurate |
| unstable | well-conditioned | not accurate |
| unstable | ill-conditioned | not accurate |
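
The third row of the table (unstable algorithm, well-conditioned problem) can be illustrated with a sketch. The function $f(x) = e^x - 1$ for small $x$ is an example chosen here, not taken from the notes. Its condition number is close to 1, yet the obvious algorithm loses most of its digits, while the standard library's `math.expm1` does not.

```python
import math

x = 1e-12

# Reference value from the Taylor series; the omitted terms are far below
# double precision at this x.
reference = x + x**2 / 2 + x**3 / 6

naive = math.exp(x) - 1.0   # subtracts two nearly equal numbers
stable = math.expm1(x)      # library routine designed for small x

naive_rel_error = abs(naive - reference) / reference
stable_rel_error = abs(stable - reference) / reference

# Condition number |x f'(x) / f(x)| of the problem itself.
cond = abs(x * math.exp(x) / math.expm1(x))

print(naive_rel_error, stable_rel_error, cond)
```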

---

## 01.05 Floating-Point Numbers

#### Representation

Components of a floating point number system $\mathbb{F}$:
* $\beta$ base or radix
* $p$ precision
* $[L, U]$ lower and upper exponent

The floating point number $x \in \mathbb{F}$ has the form:

$
x = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{p-1}}{\beta^{p-1}} \right) \beta^{E}
$

where
* $d_i$ is an integer $0 \leq d_i \leq \beta - 1, \qquad i=0,...,p-1$
* $E$ is an integer $L \leq E \leq U$

The **mantissa**, m, refers to the parenthesized expression:

$
m = \displaystyle \sum_{i=0}^{p - 1} \left( \frac{d_i}{\beta^i} \right)
$

The mantissa is **normalized** when $1 \leq m \lt \beta$.  This provides the following benefits:
* Makes each number unique.
* Eliminates any leading zeros, maximizing available precision. 

The **exponent** $E$ takes one of $U - L + 1$ integer values in $[L, U]$ and determines the range of representable magnitudes.

* Number of normalized floating point numbers, $N$, in a system (the $+1$ counts zero):

$
N = 2 (\beta - 1) \beta^{p -1} (U - L + 1) +1
$

* Underflow level, UFL, the smallest positive normalized number:

$
UFL = \beta^L
$

* Overflow level, OFL, the largest representable number:

$
OFL = \beta^{U+1} (1 - \beta^{-p})
$
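
The three formulas above can be checked by brute force on a toy system. The sketch below assumes a small system with $\beta = 2$, $p = 3$, $L = -1$, $U = 1$ (parameters chosen here for illustration), enumerates every normalized number exactly with `fractions.Fraction`, and then compares the UFL and OFL formulas for IEEE double precision with Python's `sys.float_info`.

```python
import sys
from fractions import Fraction
from itertools import product

def enumerate_system(beta, p, L, U):
    """Return the set of all normalized numbers in the system, plus zero."""
    values = {Fraction(0)}
    for E in range(L, U + 1):
        # A normalized mantissa has a nonzero leading digit d_0.
        for d0 in range(1, beta):
            for rest in product(range(beta), repeat=p - 1):
                digits = (d0,) + rest
                m = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
                values.add(m * Fraction(beta) ** E)
                values.add(-m * Fraction(beta) ** E)
    return values

def ufl(beta, L):
    """Underflow level: beta**L."""
    return Fraction(beta) ** L

def ofl(beta, p, U):
    """Overflow level: beta**(U+1) * (1 - beta**(-p))."""
    return Fraction(beta) ** (U + 1) * (1 - Fraction(beta) ** (-p))

toy = {'beta': 2, 'p': 3, 'L': -1, 'U': 1}
values = enumerate_system(**toy)
toy_count = len(values)
toy_ufl = min(v for v in values if v > 0)
toy_ofl = max(values)
print(toy_count, toy_ufl, toy_ofl)

# IEEE double precision: the formulas should reproduce the library constants.
dp_ufl = float(ufl(2, -1022))
dp_ofl = float(ofl(2, 53, 1023))
matches_double = (dp_ufl == sys.float_info.min and dp_ofl == sys.float_info.max)
print(dp_ufl, dp_ofl, matches_double)
```

Exact rational arithmetic is used so that the enumeration itself introduces no rounding.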

---

The parameters of the IEEE single-precision (SP) and double-precision (DP) formats are listed in the table below:


| System | $\beta$ | $p$ | L | U |
| ------ | ------- | ------ | - | - |
| IEEE SP | 2 | 24 | -126 | 127 |
| IEEE DP | 2 | 53 | -1022 | 1023 |

---

The number of normalized numbers in single and double precision:
* single precision ≈ $4.3 \times 10^{9}$ (order of giga)
* double precision ≈ $1.8 \times 10^{19}$ (about 18 exa)

Compute the number-of-numbers for a given representation.

In [1]:
def number_of_numbers(beta, p, L, U):
    """
    Return the number of numbers in the given floating point system.
    """
    return 2 * (beta-1) * beta**(p-1) * (U-L+1) + 1


ieee_half_precision = {
    'beta': 2,
    'p': 11,
    'L': -14,
    'U': 15
}

ieee_single_precision = {
    'beta': 2,
    'p': 24,
    'L': -126,
    'U': 127,
}

ieee_double_precision = {
    'beta': 2,
    'p': 53,
    'L': -1022,
    'U': 1023,
}

print('{:,}'.format(number_of_numbers(**ieee_half_precision)))
print('{:,}'.format(number_of_numbers(**ieee_single_precision)))
print('{:,}'.format(number_of_numbers(**ieee_double_precision)))

61,441
4,261,412,865
18,428,729,675,200,069,633


Convert a binary string to (single precision) floating point.

In [2]:
import math

def binary_to_float(binary, beta=2, p=24, L=-126, U=127):
    """
    Return the normalized floating point number encoded by a binary string.

    Zero, subnormal, Inf, and NaN encodings are not handled.
    """
    # Parse sign bit: 0 is positive, 1 is negative.
    sign = -1 if binary[0] == '1' else 1
    # Parse exponent as unsigned and remove the bias by adding L-1 (-127 for single precision).
    exp_bits = math.ceil(math.log(U-L+1, beta))
    exp = int(binary[1:1+exp_bits], base=beta) + (L-1)
    # Parse the p-1 stored fraction digits and prepend the implicit leading 1.
    mantissa = '1' + binary[1+exp_bits:1+exp_bits+p-1]
    sig, denom = 0., 1.
    for d in mantissa:
        sig += int(d, base=beta) * denom
        denom /= beta
    return sign * sig * beta**exp

binary_str = '01000001110010000100010000001000'
print(binary_to_float(binary_str))

# Compare to library method for converting binary to single precision.
import struct
expected = struct.unpack('f', struct.pack('I', int(binary_str,2)))[0]
assert(expected == binary_to_float(binary_str))

25.033218383789062


Compute the machine epsilon.

In [3]:
def epsilon():
    """
    Compute the machine epsilon and precision.
    """
    eps, p = 1.0, 0
    while (1 + eps) > 1:
        eps /= 2.
        p += 1
    return eps*2., p  # Undo the final halving: the last eps with 1 + eps > 1.

eps, p = epsilon()
print(eps, p)

# Compare to library constants.
import sys
assert(eps == sys.float_info.epsilon)
assert(p == sys.float_info.mant_dig)

2.220446049250313e-16 53


#### Exceptional Values

**Inf** results from dividing a nonzero finite number by zero, e.g. $1/0$.

**NaN** results from an undefined or indeterminate operation, e.g. $0/0$, $0 \times \text{Inf}$, $\text{Inf} / \text{Inf}$.
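
A short sketch of how these values behave in Python floats. One caveat worth knowing: plain Python raises `ZeroDivisionError` for float division by zero instead of returning Inf or NaN, so the examples below start from `float("inf")`.

```python
import math

inf = float("inf")

nan_from_product = 0.0 * inf    # 0 x Inf
nan_from_quotient = inf / inf   # Inf / Inf

print(nan_from_product, nan_from_quotient)
print(math.isnan(nan_from_product), math.isnan(nan_from_quotient))

# NaN compares unequal to everything, including itself.
nan_equals_itself = (nan_from_product == nan_from_product)
print(nan_equals_itself)

# Python raises instead of producing Inf for float division by zero.
try:
    1.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)
```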

## 01.06 Floating-Point Arithmetic

How is floating-point arithmetic performed?
* addition
  * shift the mantissa of the smaller number until the exponents match
  * possible loss of some or all digits of the smaller number
* multiplication
  * product of two $p$-digit mantissas
  * possible loss of digits because the exact product can have up to $2p$ digits
* division
  * quotient of two $p$-digit mantissas
  * possible loss of digits because the exact quotient can have more than $p$ digits (e.g. $1/3$ in binary or decimal)

In general, **overflow** is worse than **underflow**.
  * Overflow: No good approximations to arbitrarily large magnitudes.
  * Underflow: Zero is reasonable approximation to small magnitudes.
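
A minimal sketch of both cases in IEEE double precision, using the limits exposed by `sys.float_info`. The specific operands are chosen here for illustration.

```python
import sys

largest = sys.float_info.max    # OFL for IEEE double precision
smallest = sys.float_info.min   # UFL, smallest positive normalized number

overflowed = largest * 2.0          # exceeds OFL and becomes inf
underflowed = smallest * smallest   # far below UFL and becomes 0.0

print(overflowed, underflowed)

# After underflow, later arithmetic is usually still reasonable:
# adding the tiny exact product to 1 would round to 1 anyway.
print(1.0 + underflowed)

# After overflow, the magnitude information is gone for good.
print(overflowed / largest)
```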

---

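Floating point arithmetic is **not** associative. For a positive $\epsilon$ slightly smaller than half of the machine epsilon computed above:

$
\begin{aligned}
(1 + \epsilon) + \epsilon &= 1 \\
1 + (\epsilon + \epsilon) &> 1
\end{aligned}
$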
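
The same effect can be reproduced in double precision. The sketch below assumes $\epsilon$ equal to half of `sys.float_info.epsilon`, so that adding it to 1 once is rounded away but adding it twice is not.

```python
import sys

eps = sys.float_info.epsilon / 2   # assumption: 2**-53 for IEEE double

left = (1.0 + eps) + eps    # each addition rounds back to 1.0
right = 1.0 + (eps + eps)   # eps + eps is exact and large enough to register

print(left, right, left == right)
```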

---

**Cancellation** is the result of subtracting two numbers of similar magnitude and the same sign.
  * The most significant (leading) digits cancel, so the result has fewer significant digits than the operands.
  * Compare to **rounding**, where the least significant (trailing) digits are lost.

Demonstrations.

* Example 1.

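Subtract two numbers which differ by $2 \epsilon$, where $\epsilon > 0$ is small enough that $1 + \epsilon$ and $1 - \epsilon$ both round to 1. The answer is $2 \epsilon$ in real arithmetic, but in floating-point arithmetic: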

$
(1 + \epsilon) - (1 - \epsilon) = 1 - 1 = 0
$
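
A sketch of this example in double precision, assuming $\epsilon$ is one eighth of `sys.float_info.epsilon` so that both operands round to exactly 1.

```python
import sys

eps = sys.float_info.epsilon / 8   # assumption: small enough that 1 +/- eps rounds to 1

computed = (1.0 + eps) - (1.0 - eps)   # both operands have already been rounded to 1.0
exact = 2 * eps                        # answer in real arithmetic

print(computed, exact)
```

The subtraction itself is performed exactly. The information was lost earlier, when the operands were rounded, and cancellation only exposes that loss.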

* Example 2.

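Summing the Taylor series for $e^x$ when $x < 0$: the terms alternate in sign, and large intermediate terms must cancel to leave a small result.

$
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots
$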
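
The sketch below sums the series directly for $x = -20$ and compares it with a rearrangement that avoids the alternating signs, $e^{x} = 1 / e^{-x}$. The value of $x$, the number of terms, and the helper name `exp_series` are assumptions chosen for illustration, and the reciprocal trick is a standard remedy rather than something stated in the notes.

```python
import math

def exp_series(x, n_terms=150):
    """Sum the first n_terms terms of the Taylor series of e**x."""
    total, term = 1.0, 1.0
    for n in range(1, n_terms):
        term *= x / n   # term is now x**n / n!
        total += term
    return total

x = -20.0
reference = math.exp(x)

naive = exp_series(x)               # alternating terms, heavy cancellation
reciprocal = 1.0 / exp_series(-x)   # all terms positive, no cancellation

naive_rel_error = abs(naive - reference) / reference
reciprocal_rel_error = abs(reciprocal - reference) / reference

print(naive, reciprocal, reference)
print(naive_rel_error, reciprocal_rel_error)
```

The intermediate terms reach magnitudes of about $10^{7}$, so their rounding errors alone are comparable to the final result of about $10^{-9}$.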


---