###### <table>
 <tr align=left><td><img align=left src="./images/CC-BY.png">
 <td>Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Kyle T. Mandli</td>
</table>

In [None]:
from __future__ import print_function

%matplotlib inline
import numpy
import matplotlib.pyplot as plt

# Sources of Error

Error can come from many sources when formulating problems and/or applying numerical methods:
 - Model/Data Error
 - Discretization Error
 - Floating Point Error
 - Convergence Error

 

### Objectives

* Understand the different sources of error
* Explore some simple approaches to error analysis
* Quantify errors
    * absolute error
    * relative error
    * precision
* Long term goals 
    * Use error estimates to control accuracy/reliability of solutions
    * Understand errors so you can **believe** and **justify** your solutions

## Model and Data Error

Errors in fundamental formulation
 - SIR model 
     - simplistic averaged model of interactions
     $$ \dot{I} = \alpha SI - \beta I$$
     - basic model predicts a single peak 
 - Data Error - Inaccuracy in measurement or uncertainties in parameters
 
Unfortunately we cannot control model and data error directly, but we can use methods that may be more robust in the presence of these types of errors.

## Discretization or Truncation Error

Errors arising from approximating a function with a simpler function, e.g. using the approximation $\sin(x) \approx x$ when $|x| \approx 0$. 

## Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers and arithmetic.

## Convergence Error

In some instances an algorithm takes a current approximation and then finds an improvement on it. The errors generated in each individual step can accumulate or become magnified after repeating the algorithm a number of times. 

## Basic Definitions

Before exploring the different kinds of error, it is important to first define the ways that error is measured. Given a true value of a function $f$ and an approximate solution $F$ define:

Absolute Error:  
$$
    e = | f - F |
$$

Relative Error:  
$$
    r = \frac{e}{|f|} = \frac{|f - F|}{|f|}
$$

Note: these definitions assume $f$ and $F$ are scalar valued.  However, these definitions are readily extended to more complicated objects such as vectors or matrices with appropriate norms.
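
As a minimal sketch of that extension, the absolute value can be replaced by a norm. The helper names `absolute_error` and `relative_error` below are our own (not part of numpy), and the choice of the 2-norm is an assumption; any appropriate norm could be substituted.

```python
import numpy

def absolute_error(f, F):
    """Absolute error ||f - F|| in the 2-norm (reduces to |f - F| for scalars)."""
    return numpy.linalg.norm(numpy.atleast_1d(f) - numpy.atleast_1d(F))

def relative_error(f, F):
    """Relative error ||f - F|| / ||f||; assumes f is not zero."""
    return absolute_error(f, F) / numpy.linalg.norm(numpy.atleast_1d(f))

# Scalar case: matches the definitions above
print(relative_error(numpy.pi, 3.14))

# Vector case: a true value f_vec and a perturbed approximation F_vec
f_vec = numpy.array([1.0, 2.0, 2.0])
F_vec = numpy.array([1.0, 2.0, 2.03])
print(absolute_error(f_vec, F_vec), relative_error(f_vec, F_vec))
```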

### Decimal Precision

This definition of relative error provides a convenient estimate for the number of digits of decimal precision $p$:

given a relative error $r$, the precision $p$ is the largest integer such that
$$
    r \leq 5\times 10^{-p}
$$

Example
* $r = 0.001 < 5\times10^{-3}$ gives $p=3$ significant digits
* $r = 0.006 < 5\times10^{-2}$ gives $p=2$ significant digits (because this error would cause rounding up) 
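
The definition can be turned into a small helper. This is only a sketch: `decimal_precision` is a hypothetical name, and we use `math.floor` so that the result is still "the largest integer $p$" when $r > 5$ (where $p$ is negative). Values of $r$ sitting exactly on a boundary $5\times10^{-p}$ may be affected by floating point rounding.

```python
import math

def decimal_precision(r):
    """Largest integer p with r <= 5 * 10**(-p), for a relative error r > 0."""
    return math.floor(-math.log10(r / 5.0))

for r in (0.001, 0.006, 0.08):
    print('r = {}, p = {}'.format(r, decimal_precision(r)))
```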

### Example

let
$$
    f = e^1,\quad F=2.71
$$

In [None]:
f = numpy.exp(1.)
F = 2.71
print('f = {}'.format(f))
print('F = {}'.format(F))

In [None]:
e = numpy.abs(f - F)
r = e/numpy.abs(f)
print('Absolute Error: {}'.format(e))
print('Relative Error: {}'.format(r))

In [None]:
p = int(-numpy.log10(r/5.))
print('Decimal precision: {}'.format(p))

### Big-O Notation

In many situations an approximation will have a parameter associated with it, and the value of the parameter is often chosen to ensure that the error is reasonable in a given situation. In such circumstances we often want to know the impact on the error if we change the value of the parameter. This leads to the definition of Big-O notation: 

$$
    f(x) =  O(g(x)) \quad \text{as} \quad x \rightarrow a
$$ 
if and only if 
$$
    |f(x)| \leq M |g(x)| \quad \text{for all}\quad  |x - a| < \delta \quad \text{where} \quad M,\delta > 0.
$$ 

In practice we use Big-O notation to say something about how the terms we may have left out of a series might behave.  We saw an example earlier of this with the Taylor's series approximations.

#### Example:
Consider approximating a differentiable function $f(x)$ by its Taylor polynomial (truncated Taylor series) expanded around $x_0=0$, i.e. 

$$
F(x) = T_N(x_0 + \Delta x) = \sum^N_{n=0} f^{(n)}(x_0)  \frac{\Delta x^n}{n!}
$$

where
$$
f(x)=\lim_{N\rightarrow\infty} T_N
$$ 
assuming the Taylor series converges

or for the case $f(x)=\sin(x)$ expanded around $x_0=0$

$$
T_N(\Delta x) = \sum^N_{n=0} (-1)^{n} \frac{\Delta x^{2n+1}}{(2n+1)!}
$$



For $N=2$, we can then write $F(x)$  as

$$F(\Delta x) = \Delta x - \frac{\Delta x^3}{6} + \frac{\Delta x^5}{120}$$

so our true function is

$$
    f(x) = F(\Delta x) + O(\Delta x^7)
$$

or the absolute error

$$
    e = | f -F | \sim O(\Delta x^7)
$$
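
One way to check this claim numerically is sketched below, under the assumption that $\Delta x$ is small enough for the leading error term to dominate but not so small that rounding takes over. If $e = O(\Delta x^7)$ then $e / \Delta x^7$ should stay bounded; for $\sin$ the first neglected term is $\Delta x^7 / 7!$, so the ratio should approach $1/5040$. The name `T_2` is our own.

```python
import numpy

def T_2(dx):
    """Taylor polynomial of sin about 0 with N = 2 (terms up to dx**5)."""
    return dx - dx**3 / 6.0 + dx**5 / 120.0

dx = numpy.array([0.4, 0.2, 0.1, 0.05])
error = numpy.abs(numpy.sin(dx) - T_2(dx))

# If e = O(dx**7), then e / dx**7 stays bounded (here it approaches 1/7!)
print(error / dx**7)
print(1.0 / 5040.0)

# Halving dx should reduce the error by roughly 2**7 = 128
print(error[:-1] / error[1:])
```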

**We can also develop rules for error propagation based on Big-O notation:**

In general, there are two rules, stated here without proof, that hold when the value of $x$ is large: 

Let
$$\begin{aligned}
    f(x) &= p(x) + O(x^n) \\
    g(x) &= q(x) + O(x^m) \\
    k &= \max(n, m)
\end{aligned}$$
then
$$
    f+g = p + q + O(x^k)
$$
and
\begin{align}
    f \cdot g &= p \cdot q + p O(x^m) + q O(x^n) + O(x^{n + m}) \\
    &= p \cdot q + O(x^{n+m})
\end{align}

On the other hand, if we are interested in small values of x, say $\Delta x$, the above expressions can be modified as follows: 

\begin{align}
    f(\Delta x) &= p(\Delta x) + O(\Delta x^n) \\
    g(\Delta x) &= q(\Delta x) + O(\Delta x^m) \\
    r &= \min(n, m)
\end{align}
then
$$
    f+g = p + q + O(\Delta x^r)
$$
and
\begin{align}
    f \cdot g &= p \cdot q + p \cdot O(\Delta x^m) + q \cdot O(\Delta x^n) + O(\Delta x^{n+m}) \\
    &= p \cdot q + O(\Delta x^r)
\end{align}

**Note:** In this case we suppose that at least the polynomial with $k = \max(n, m)$ has the following form: 

$$
    p(\Delta x) = 1 + p_1 \Delta x + p_2 \Delta x^2 + \ldots
$$
or 
$$
    q(\Delta x) = 1 + q_1 \Delta x + q_2 \Delta x^2 + \ldots
$$

so that there is an $\mathcal{O}(1)$ term that guarantees the existence of $\mathcal{O}(\Delta x^r)$ in the final product. 
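
A quick numerical sanity check of the small-$\Delta x$ rules, as a sketch only. We pick (our choice, not from the text above) $f = \cos(\Delta x)$ with $p = 1 - \Delta x^2/2$, so $n = 4$, and $g = e^{\Delta x}$ with $q = 1 + \Delta x$, so $m = 2$. Both the sum and the product should then show an observed order close to $r = \min(n, m) = 2$. The helper `observed_order` is hypothetical.

```python
import numpy

f = numpy.cos                      # f = p + O(dx**4), so n = 4
p = lambda dx: 1.0 - dx**2 / 2.0
g = numpy.exp                      # g = q + O(dx**2), so m = 2
q = lambda dx: 1.0 + dx

def observed_order(err, dx):
    """Estimate s in err(dx) ~ C dx**s by comparing dx with dx / 2."""
    return numpy.log2(err(dx) / err(dx / 2.0))

sum_error = lambda dx: abs((f(dx) + g(dx)) - (p(dx) + q(dx)))
prod_error = lambda dx: abs(f(dx) * g(dx) - p(dx) * q(dx))

for dx in (1e-1, 1e-2, 1e-3):
    print(dx, observed_order(sum_error, dx), observed_order(prod_error, dx))
```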

To get a sense of why we care most about the power on $\Delta x$ when considering convergence, the following figure shows how different powers in the convergence rate affect how quickly we converge to our solution.  Note that here we are plotting the same data two different ways.  Plotting the error as a function of $\Delta x$ is a common way to show that a numerical method is doing what we expect and exhibits the correct convergence behavior.  Since errors can get small quickly, it is very common to plot these on a log-log scale to easily visualize the results.  Note that if a method is truly of order $n$, its error will be a linear function in log-log space with slope $n$.

#### Behavior of error as a function of $\Delta x$

In [None]:
dx = numpy.linspace(1.0, 1e-4, 100)

fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2.0)
axes = []
axes.append(fig.add_subplot(1, 2, 1))
axes.append(fig.add_subplot(1, 2, 2))

# Raw strings (r"...") keep the LaTeX backslashes from being read as escape sequences
for n in range(1, 5):
    axes[0].plot(dx, dx**n, label=r"$\Delta x^%s$" % n)
    axes[1].loglog(dx, dx**n, label=r"$\Delta x^%s$" % n)

axes[0].legend(loc=2)
axes[1].set_xticks([10.0**(-n) for n in range(5)])
axes[1].set_yticks([10.0**(-n) for n in range(16)])
axes[1].legend(loc=4)
for n in range(2):
    axes[n].set_title(r"Growth of Error vs. $\Delta x^n$")
    axes[n].set_xlabel(r"$\Delta x$")
    axes[n].set_ylabel("Estimated Error")

plt.show()
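
The slope in log-log space can also be estimated directly rather than read off a plot. The sketch below fits a straight line to $\log_{10} e$ versus $\log_{10} \Delta x$ with `numpy.polyfit`. The error data here is synthetic, generated for a hypothetical second-order method with constant $C = 3$, purely to illustrate the procedure.

```python
import numpy

dx = numpy.array([10.0**(-k) for k in range(1, 5)])

# Synthetic errors for a hypothetical second-order method: e = C * dx**2
C = 3.0
error = C * dx**2

# Fit log10(e) = n * log10(dx) + log10(C); the slope n is the observed order
slope, intercept = numpy.polyfit(numpy.log10(dx), numpy.log10(error), 1)
print('observed order n = {:.3f}'.format(slope))
print('estimated constant C = {:.3f}'.format(10.0**intercept))
```

With measured errors the fit is usually restricted to the range of $\Delta x$ where the curve is actually straight.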

## Discretization Error

**Taylor's Theorem:**  Let $f(x) \in C^{N+1}[a,b]$ and $x_0 \in [a,b]$, then for all $x \in (a,b)$ there exists a number $c = c(x)$ that lies between $x_0$ and $x$ such that

$$ f(x) = T_N(x) + R_N(x)$$

where $T_N(x)$ is the Taylor polynomial approximation

$$T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0)\cdot(x-x_0)^n}{n!}$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}$$

### Note


The residual:

$$
    R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}
$$

depends on the $(N+1)$th-order derivative of $f$ evaluated at an **unknown** value $c$ between $x_0$ and $x$.  

If we knew the value of $c$ we would know the exact value of $R_N(x)$ and therefore the function $f(x)$.  In general we don't know this value but we can use $R_N(x)$ to put upper bounds on the error **and** to understand how the error changes as we move away from $x_0$.

Start by replacing $x - x_0$ with $\Delta x$.  The primary idea here is that the residual $R_N(x)$ becomes smaller as $\Delta x \rightarrow 0$ (at which point $T_N(x) = f(x_0)$).

$$
    T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0)\cdot\Delta x^n}{n!}
$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$
    |R_N(x)| = \left|\frac{f^{(N+1)}(c) \cdot \Delta x^{N+1}}{(N+1)!}\right| \leq M |\Delta x|^{N+1} = O(\Delta x^{N+1})
$$

where $M$ is an upper bound on 
$$
    \left|\frac{f^{(N+1)}(c)}{(N+1)!}\right|
$$

#### Example 1

$f(x) = e^x$ with $x_0 = 0$ on the interval $x\in(-1,1)$

Using this we can find expressions for the relative and absolute error as a function of $x$ assuming $N=2$.

Derivatives:
$$\begin{aligned}
    f'(x) &= e^x \\
    f''(x) &= e^x \\ 
    f^{(n)}(x) &= e^x
\end{aligned}$$

Taylor polynomials:
$$\begin{aligned}
    T_N(x) &= \sum^N_{n=0} e^0 \frac{x^n}{n!} \Rightarrow \\
    T_2(x) &= 1 + x + \frac{x^2}{2}
\end{aligned}$$

Remainders:
$$\begin{aligned}
    R_N(x) &= e^c \frac{x^{N+1}}{(N+1)!} \\
    R_2(x) &= e^c \cdot \frac{x^3}{6} \leq \frac{e^1}{6} \approx 0.5
\end{aligned}$$

Accuracy:
\begin{align}
    \exp(1) &= 2.718\ldots \\
    T_2(1) &= 2.5 
\end{align}

$$
\Rightarrow e \approx 0.2,\quad r \approx 0.08,\quad p = ?
$$

We can also use the package sympy which has the ability to calculate Taylor polynomials built-in!

In [None]:
import sympy
sympy.init_printing(pretty_print=True)
x = sympy.symbols('x')
f = sympy.exp(x)
f.series(x0=0, n=3)

Let's plot this numerically for a range of $x$.

In [None]:
x = numpy.linspace(-1, 1, 100)
f = numpy.exp(x)
T_N = 1.0 + x + x**2 / 2.0
R_N = numpy.exp(1) * x**3 / 6.0

In [None]:
fig = plt.figure(figsize=(8,6))
axes = fig.add_subplot(1,1,1)
axes.plot(x, T_N, 'r', x, f, 'k', x, numpy.abs(R_N), 'b')
axes.plot(x,numpy.abs(numpy.exp(x)-T_N),'g--')
axes.plot(0.0, 1.0, 'o', markersize=10)

axes.grid()
axes.set_xlabel("x",fontsize=16)
axes.set_ylabel("$f(x)$, $T_N(x)$, $|R_N(x)|$", fontsize=16)
axes.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "e(x)"], loc=2)
plt.show()

#### Example 2

Approximate
$$
    f(x) = \frac{1}{x} \quad x_0  = 1,
$$
using $x_0 = 1$ to the 3rd Taylor series term on the interval $x\in[1,\infty)$

$$
\begin{matrix}
f'(x) = -\frac{1}{x^2}, &  f''(x) = \frac{2}{x^3}, & f'''(x) = -\frac{6}{x^4}, & \ldots, & f^{(n)}(x) = \frac{(-1)^n n!}{x^{n+1}}
\end{matrix}
$$

$$
\begin{aligned}
    T_N(x) &= \sum^N_{n=0} (-1)^n (x-1)^n \Rightarrow \\
    T_2(x) &= 1 - (x - 1) + (x - 1)^2
\end{aligned}
$$


$$
\begin{aligned}
    R_N(x) &= \frac{(-1)^{N+1}(x - 1)^{N+1}}{c^{N+2}} \Rightarrow \\
    R_2(x) &= \frac{-(x - 1)^{3}}{c^{4}}
\end{aligned}
$$

### Plot this problem

In [None]:
x = numpy.linspace(0.8, 2, 100)
f = 1.0 / x
T_N = 1.0 - (x-1) + (x-1)**2
R_N = -(x-1.0)**3 / (1.**4)

In [None]:
plt.figure(figsize=(8,6))
plt.plot(x, T_N, 'r', x, f, 'k', x, numpy.abs(R_N), 'b')
plt.plot(x,numpy.abs(f - T_N),'g--')
plt.plot(1.0, 1.0, 'o', markersize=10)
plt.grid(True)
plt.xlabel("x",fontsize=16)
plt.ylabel("$f(x)$, $T_N(x)$, $R_N(x)$",fontsize=16)
plt.title('$f(x) = 1/x$',fontsize=18)
plt.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", '$e(x)$'], loc='best')
plt.show()

### Computational Issue #1: Accuracy... how many terms?

Given a Taylor polynomial approximation of an arbitrary function $f(x)$, how do we determine how many terms are required such that $|R_N(x)| < \text{tol}$?  And how do we determine the tolerance?
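
For one specific case the first question has a simple answer. The sketch below assumes $f(x) = e^x$ expanded about $x_0 = 0$ on $|x| \leq x_{\max}$, where every derivative is bounded by $e^{x_{\max}}$, so that $|R_N(x)| \leq e^{x_{\max}} x_{\max}^{N+1} / (N+1)!$. The function name `terms_needed` is our own; for a general $f$ a bound on $f^{(N+1)}$ would have to be supplied instead.

```python
import math

def terms_needed(tol, x_max=1.0):
    """Smallest N with exp(x_max) * x_max**(N+1) / (N+1)! < tol, for f(x) = exp(x), x_0 = 0."""
    N = 0
    while math.exp(x_max) * x_max**(N + 1) / math.factorial(N + 1) >= tol:
        N += 1
    return N

for tol in (1e-2, 1e-4, 1e-8):
    N = terms_needed(tol)
    T_N = sum(1.0 / math.factorial(n) for n in range(N + 1))   # T_N evaluated at x = 1
    print('tol = {:.0e}: N = {}, actual error at x = 1: {:.2e}'.format(tol, N, abs(math.e - T_N)))
```

Because the bound uses the worst case $c = x_{\max}$, the actual error is typically somewhat smaller than the tolerance.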

### Computational Issue #2: Efficiency... Operation counts for polynomial evaluation

Given 

$$P_N(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_N x^N$$ 

or

$$P_N(x) = p_0 x^N + p_1 x^{N-1} + p_2 x^{N-2} + \ldots + p_{N}$$

what is the most **efficient way**  to evaluate $P_N(x)$? (i.e. minimize number of floating point operations)

Consider two ways to write $P_3$

* The standard way:

$$ P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$

* using nested multiplication (aka **Horner's Method**):

$$ P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Consider how many operations it takes for each...

$$ P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$

$$P_3(x) = \overbrace{p_0 \cdot x \cdot x \cdot x}^3 + \overbrace{p_1\cdot x \cdot x}^2 + \overbrace{p_2 \cdot x}^1 + p_3$$

Note:  here we're just counting multiplications as they will dominate the flop count

Adding up all the operations we can in general think of this as a pyramid (it's really the triangle numbers)

$$
    \sum_{n=1}^N n = \frac{N(N+1)}{2} 
$$

![Original Count](./images/horners_method_big_count.png)

Thus we can estimate that the algorithm written this way will take approximately $O(N^2 / 2)$ operations to complete.

Looking at nested multiplication, however:

$$ P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Here we find that the method is $O(N)$ compared to the first evaluation, which is $O(N^2)$ (we usually drop the 2 in these cases).  That's a huge difference for large $N$!
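
To make the counts concrete, the sketch below instruments both evaluation orders with a multiplication counter. The function names `naive_eval` and `horner_eval` are hypothetical, and the naive version deliberately forms each power by repeated multiplication, exactly as counted above, rather than using `x**k`.

```python
def naive_eval(p, x):
    """Evaluate p[0] x**N + ... + p[N] term by term, counting multiplications."""
    N = len(p) - 1
    total, count = 0.0, 0
    for k, coeff in enumerate(p):
        term = coeff
        for _ in range(N - k):        # coeff * x * x * ... * x
            term *= x
            count += 1
        total += term
    return total, count

def horner_eval(p, x):
    """Evaluate the same polynomial in nested form, counting multiplications."""
    y, count = p[0], 0
    for coeff in p[1:]:
        y = y * x + coeff             # one multiplication per coefficient
        count += 1
    return y, count

p = [1.0, 2.0, 3.0, 4.0]              # P_3(x) = x**3 + 2 x**2 + 3 x + 4
print(naive_eval(p, 2.0))
print(horner_eval(p, 2.0))
```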

#### Algorithm

Fill in the function and implement Horner's method:
```python
def eval_poly(p, x):
    '''Evaluates a polynomial using Horner's method given coefficients p at x
    
      The polynomial is defined as
    
        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
        
    Parameters:
        p: list or numpy array of coefficients 
        x:  scalar float
        
    returns:
        P(x):  value of the polynomial at point x (float)
    '''
    pass
```

In [None]:
def eval_poly(p, x):
    '''Evaluates a polynomial using Horner's method given coefficients p at x
    
      The polynomial is defined as
    
        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
        
    Parameters:
        p: list or numpy array of coefficients 
        x:  scalar float or numpy array (this version is more robust to floating point error)
        
    returns:
        P(x):  value of the polynomial at point x, P will return as the same type as x
    
    '''
    
    
    if isinstance(x,numpy.ndarray):
        y = p[0]*numpy.ones(x.shape)
    elif isinstance(x,float):
        y = p[0]
    else:
        raise TypeError
        
    for element in p[1:]:
        y = y*x + element
        
    return y 

In [None]:
# Scalar test

p = [1, 2, 3]
x = 1.
test = eval_poly(p,x)
answer = numpy.array([x**2, x, 1]).dot(p)

print('test = {} ({}), answer = {} ({})'.format(test,type(test),answer,type(answer)))
numpy.testing.assert_allclose(test,answer)
print('success')

In [None]:
# Vectorized test with x a numpy array

p = [1, -3, 10, 4, 5, 5]
x = numpy.linspace(-10, 10, 100)
P = eval_poly(p,x)
print('x: {}, P(x): {}'.format(type(x),type(P)))

In [None]:
plt.plot(x, P)
plt.xlabel('x')
plt.ylabel('P(x)')
plt.title('{}th order polynomial, p={}'.format(len(p)-1,p))
plt.grid()
plt.show()

## Convergence Error

In some circumstances a formula or algorithm is applied repeatedly as a way to obtain a final approximation. Usually, the errors that occur at each individual step are small. By repeating the algorithm, though, the errors can sometimes grow or become magnified. 

An example of this phenomenon is given below. The values of the terms in a difference equation are calculated,
$$
   \begin{align}
      y_0 &= 1, \\
      y_1 &= \frac{1}{5}, \\
      y_{n+1} &= \frac{16}{5} y_n - \frac{3}{5} y_{n-1}.
   \end{align}
$$

The true solution to the difference equation is $y_n = \left(\frac{1}{5}\right)^n$, where $n = 0, 1, 2, \ldots$

In [None]:
# Choose the number of iterations
N = 40
y = numpy.empty(N+1)            # Allocate an empty vector with N+1 entries

# Now use the difference equation to generate the numbers in the sequence
y[0] = 1
y[1] = 1/5
for n in range(2,N+1):
    y[n] = 16/5*y[n-1] - 3/5*y[n-2]


And plot the result

In [None]:
# Now plot the result
n = numpy.arange(0,N+1)
fig = plt.figure(figsize=(10.0, 5.0))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(n,y, 'rx', markersize=5, label='$y_n$')
axes.semilogy(n,(1/5)**n,'b.', label='$y_{true}$')
axes.grid()
axes.set_title("Calculated Values Of A Difference Equation",fontsize=18)
axes.set_xlabel("$n$",fontsize=16)
axes.set_ylabel("$y_n$",fontsize=16)
axes.legend(loc='best', shadow=True)
plt.show()

Simply looking at the exact solution, the sequence of numbers generated by the difference equation above should get very close to zero. Instead, the numbers in the sequence initially get closer to zero, but at some point they begin to grow and get larger. An underlying problem is that the computer is not able to store the numbers exactly. The second number in the sequence, $y_1=\frac{1}{5}$ has a small error, and the computer stores it as $y_1 = \frac{1}{5}+\epsilon$ where $\epsilon$ is some small error.

Each time a new number in the loop is generated, the error is multiplied. For example, after the first iteration $y_2$ is
$$
    \begin{align}
       y_2 &= \frac{16}{5} \left( \frac{1}{5}+\epsilon \right) 
               - \frac{3}{5} \left( 1 \right), \\
           &= \frac{1}{5^2} + \frac{16}{5} \epsilon.
    \end{align}
$$

After the second time through the loop, the value of $y_3$ is 

$$ 
y_3=\frac{1}{5^3} + \frac{241}{25}\epsilon
$$

Even though the value of $\epsilon$ is very close to zero, every iteration makes the error grow.  Repeated multiplication will result in a very large number. 

The error associated with the initial representation of the number $\frac{1}{5}$ is a problem with the way a digital computer stores floating point numbers. In most instances the computer cannot represent a number exactly, and the small error in approximating a given number can give rise to other problems. 
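
The growth rate of the error can be made precise. The recurrence has characteristic polynomial $r^2 - \frac{16}{5} r + \frac{3}{5}$ with roots $\frac{1}{5}$ and $3$, so its general solution is $y_n = A \left(\frac{1}{5}\right)^n + B \cdot 3^n$. Exact initial data give $B = 0$, but any rounding error introduces a small nonzero $B$ that is then multiplied by $3$ at every step. The sketch below checks both statements numerically; the index range used for the ratios is an arbitrary choice made after the growing component has taken over.

```python
import numpy

# Roots of the characteristic polynomial r**2 - (16/5) r + 3/5
roots = numpy.roots([1.0, -16.0 / 5.0, 3.0 / 5.0])
print(roots)

N = 40
y = numpy.empty(N + 1)
y[0], y[1] = 1.0, 1.0 / 5.0
for n in range(2, N + 1):
    y[n] = 16.0 / 5.0 * y[n - 1] - 3.0 / 5.0 * y[n - 2]

error = numpy.abs(y - (1.0 / 5.0)**numpy.arange(N + 1))

# Once the 3**n component dominates, successive errors grow by a factor of about 3
print(error[21:31] / error[20:30])
```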

## Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers

$$\pi \approx 3.14$$

or $\frac{1}{3} \approx 0.333333333$ in decimal, result from having only a finite number of digits (bits) available to represent each number.


### Floating Point Systems

Numbers in floating point systems are represented as a series of bits that represent different pieces of a number.  In *normalized floating point systems* there are some standard conventions for what these bits are used for.  In general the numbers are stored by breaking them down into the form  

$$F = \pm d_1 . d_2 d_3 d_4 \ldots d_p \times \beta^E$$

where
1. $\pm$ is a single bit  representing the sign of the number
2. $d_1 . d_2 d_3 d_4 \ldots d_p$ is called the *mantissa*.  Note that technically the decimal could be moved but generally, using scientific notation, the decimal can always be placed at this location.  The digits $d_2 d_3 d_4 \ldots d_p$ are called the *fraction* with $p$ digits of precision.  Normalized systems specifically put the decimal point in the front like we have and assume $d_1 \neq 0$ unless the number is exactly $0$.
3. $\beta$ is the *base*.  For binary $\beta = 2$, for decimal $\beta = 10$, etc.
4. $E$ is the *exponent*, an integer in the range $[E_{\min}, E_{\max}]$

The important points about any floating point system are that
1. There exists a discrete and finite set of representable numbers
2. These representable numbers are not evenly distributed on the real line
3. Arithmetic in floating point systems yields different results from infinite precision arithmetic (i.e. "real" math)

#### Properties of Floating Point Systems
All floating-point systems are characterized by several important numbers
 - Smallest normalized number (underflow if below - related to subnormal numbers around zero)
 - Largest normalized number (overflow if above)
 - Zero
 - Machine $\epsilon$ or $\epsilon_{\text{machine}}$
 - `inf` and `nan`, infinity and **N**ot **a** **N**umber respectively

##### Example:  Toy System
Consider the toy 2-digit precision decimal system (normalized)
$$f = \pm d_1 . d_2 \times 10^E$$
with $E \in [-2, 0]$.

**Number and distribution of numbers**
1. How many numbers can we represent with this system?

2. What is the distribution on the real line?

3. What are the underflow and overflow limits?

4. What is the smallest number $\epsilon_{mach}$ such that $1+\epsilon_{mach} > 1$?

How many numbers can we represent with this system?

$$
    f = \pm d_1 . d_2 \times 10^E ~~~ \text{with} ~~~ E \in [-2, 0]
$$

* sign bit: 2

* $d_1$: 9   (normalized numbers $d_1\neq 0$)
* $d_2$: 10  

* $E$: 3 

* zero: 1

total:
$$ 
    2 \times 9 \times 10 \times 3 + 1 = 541
$$
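
The count can be confirmed by brute force. The sketch below enumerates every normalized value exactly using `fractions.Fraction`; the helper `normalized_floats` and its parameter names are our own. It works for any base, so the same call with `base=2` can be used for small binary systems.

```python
from fractions import Fraction

def normalized_floats(base, digits, E_min, E_max):
    """All values +/- d_1.d_2...d_p x base**E with d_1 != 0, plus zero, as exact fractions."""
    values = {Fraction(0)}
    # Mantissas with d_1 != 0 are the integers base**(p-1) ... base**p - 1, scaled by base**-(p-1)
    for m in range(base**(digits - 1), base**digits):
        for E in range(E_min, E_max + 1):
            value = Fraction(m, base**(digits - 1)) * Fraction(base)**E
            values.update((value, -value))
    return sorted(values)

toy = normalized_floats(base=10, digits=2, E_min=-2, E_max=0)
positive = [v for v in toy if v > 0]
print(len(toy))                                   # number of representable values
print(float(positive[0]), float(positive[-1]))    # smallest and largest positive values
```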

What is the distribution on the real line? $$f = \pm d_1 . d_2 \times 10^E ~~~ \text{with} ~~~ E \in [-2, 0]$$

In [None]:
d_1_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
E_values = [0, -1, -2,]

fig = plt.figure(figsize=(10.0,1.5))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot( (d1 + d2 * 0.1) * 10**E, 0.0, 'r|', markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, 'r|', markersize=20)
            
axes.plot(0.0, 0.0, '|', markersize=20)
axes.plot([-1., 1.], [0.0, 0.0], 'k|', markersize=30)

axes.plot([-10.0, 10.0], [0.0, 0.0], 'k')

axes.set_title("Distribution of Values $[-10, 10]$")
axes.set_yticks([])
ticks = [i for i in range(-10,11,1)]
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-10, 10])
plt.show()

In [None]:
fig = plt.figure(figsize=(10.0,1.5))
axes = fig.add_subplot(1,1,1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot( (d1 + d2 * 0.1) * 10**E, 0.0, 'r+', markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, 'r+', markersize=20)
            
axes.plot(0.0, 0.0, '+', markersize=20)
axes.plot([-0.1, 0.1], [0.0, 0.0], 'k|', markersize=30)
axes.plot([-1, 1], [0.0, 0.0], 'k')

axes.set_title("Close up $[-1, 1]$")
axes.set_yticks([])
ticks = numpy.linspace(-1.,1.,21)
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-1, 1])
#fig.tight_layout(h_pad=1, w_pad=5)

plt.show()

What are the underflow and overflow limits?

* Smallest number that can be represented is the underflow:  $1.0 \times 10^{-2} = 0.01$
* Largest number that can be represented is the overflow:  $9.9 \times 10^0 = 9.9$

What is the smallest number $\epsilon_{mach}$ such that $1+\epsilon_{mach} > 1$?

* $\epsilon_{mach} = 0.1$

### Binary Systems
Consider the 2-digit precision base 2 system:

$$
    f=\pm d_1 . d_2 \times 2^E \quad \text{with} \quad E \in [-1, 1]
$$

#### Number and distribution of numbers
1. How many numbers can we represent with this system?

2. What is the distribution on the real line?

3. What are the underflow and overflow limits?

4. What is $\epsilon_{mach}$?


How many numbers can we represent with this system?

$$f=\pm d_1 . d_2 \times 2^E ~~~~ \text{with} ~~~~ E \in [-1, 1]$$

$$ 2 \times 1 \times 2 \times 3 + 1 = 13$$

What is the distribution on the real line?

In [None]:
d_1_values = [1]
d_2_values = [0, 1]
E_values = [1, 0, -1]

fig = plt.figure(figsize=(10.0, 1.0))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot( (d1 + d2 * 0.5) * 2**E, 0.0, 'r+', markersize=20)
            axes.plot(-(d1 + d2 * 0.5) * 2**E, 0.0, 'r+', markersize=20)
            
axes.plot(0.0, 0.0, 'r+', markersize=20)
axes.plot([-4.5, 4.5], [0.0, 0.0], 'k')

axes.set_title("Distribution of Values")
axes.set_yticks([])
axes.set_xticks(numpy.linspace(-4,4,9))
axes.set_xlabel("x")
axes.set_ylabel("")
axes.grid()
axes.set_xlim([-5, 5])
plt.show()

* Smallest number that can be represented is the underflow:  $1.0_2 \times 2^{-1} = 0.5$

* Largest number that can be represented is the overflow:  $1.1_2 \times 2^1 = 3$

* $\epsilon_{mach} = 0.1_2 = 2^{-1}= 1/2$

**Note**: these numbers are in a binary system.  

Quick rule of thumb:
$$
    2^3 2^2 2^1 2^0 . 2^{-1} 2^{-2} 2^{-3}
$$
correspond to
8s, 4s, 2s, 1s . halves, quarters, eighths, ...
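
The rule of thumb translates directly into code. The small helper below (`binary_to_decimal`, a name of our own choosing) sums the place values of a binary digit string; it is a sketch for non-negative inputs only.

```python
def binary_to_decimal(digits):
    """Convert a binary string such as '1.1' or '0.011' to a Python float."""
    whole, _, frac = digits.partition('.')
    value = int(whole, 2) if whole else 0
    for k, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0**(-k)     # halves, quarters, eighths, ...
    return value

for s in ('1.0', '1.1', '0.1', '11.0', '0.011'):
    print('{}_2 = {}'.format(s, binary_to_decimal(s)))
```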

### Real Systems - IEEE 754 Binary Floating Point Systems

#### Single Precision
 - Total storage allotted is 32 bits
 - Exponent is 8 bits $\Rightarrow E \in [-126, 127]$ (the two remaining exponent patterns are reserved for zero/subnormal numbers and for `inf`/`nan`)
 - Fraction 23 bits ($p = 24$)
 
```
s EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                     31
```

* Overflow $\approx 2^{128}\approx3.40\times10^{38}$ (the largest finite number is $(2 - 2^{-23}) \times 2^{127}$)
* Underflow $= 2^{-126} \approx 1.18 \times 10^{-38}$
* $\epsilon_{\text{machine}} = 2^{-23} \approx 1.19 \times 10^{-7}$


#### Double Precision
 - Total storage allotted is 64 bits
 - Exponent is 11 bits $\Rightarrow E \in [-1022, 1023]$ (the two remaining exponent patterns are reserved for zero/subnormal numbers and for `inf`/`nan`)
 - Fraction 52 bits ($p = 53$)
 
```
s EEEEEEEEEEE FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FF
0 1        11 12                                                     63
```
* Overflow $\approx 2^{1024} \approx 1.8 \times 10^{308}$ (the largest finite number is $(2 - 2^{-52}) \times 2^{1023}$)
* Underflow $= 2^{-1022} \approx 2.2 \times 10^{-308}$
* $\epsilon_{\text{machine}} = 2^{-52} \approx 2.2 \times 10^{-16}$
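
To see this layout for an actual number, the bits of a Python `float` (an IEEE 754 double on essentially all current platforms, which we assume here) can be unpacked with the standard `struct` module. The helper name `double_fields` is our own. Note that the exponent field is stored with a bias of 1023, so the true exponent is the stored value minus 1023.

```python
import struct

def double_fields(x):
    """Return (sign, biased exponent, fraction) bit fields of a Python float."""
    (bits,) = struct.unpack('>Q', struct.pack('>d', x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)      # 52 bits
    return sign, exponent, fraction

for x in (1.0, -2.0, 0.1):
    s, e, f = double_fields(x)
    print('{:5}: sign = {}, E = {:5d}, fraction = {:052b}'.format(x, s, e - 1023, f))
```

The repeating bit pattern in the fraction of `0.1` shows that it has no finite binary expansion and so must be rounded.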

### Python Access to IEEE Numbers

Access many important parameters, such as machine epsilon:

```python
import numpy
numpy.finfo(float).eps
```

In [None]:
print(numpy.finfo(numpy.float16))

In [None]:
print(numpy.finfo(numpy.float32))

In [None]:
print(numpy.finfo(float))

In [None]:
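# numpy.longdouble is the platform's extended precision type
# (numpy.float128 only exists on platforms that provide it)
print(numpy.finfo(numpy.longdouble))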

### Examples

In [None]:
eps = numpy.finfo(float).eps
MAX = numpy.finfo(float).max
print('eps = {}'.format(eps))
print('MAX = {}'.format(MAX))

Show that $(1 + \epsilon_{mach}) > 1$

In [None]:
print(MAX)

In [None]:
print(MAX*(1+ 0.4*eps))

In [None]:
print(1 + 0.4*eps == 1.0)
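
A few more checks along the same lines, as a sketch: $\epsilon_{\text{machine}}$ is exactly the gap between $1$ and the next representable double, and that gap grows with the magnitude of the number. `numpy.nextafter` and `numpy.spacing` are existing numpy functions; the sample values of `x` are arbitrary.

```python
import numpy

eps = numpy.finfo(float).eps

print(1.0 + eps > 1.0)            # eps is large enough to change 1.0
print(1.0 + 0.4 * eps == 1.0)     # 0.4 * eps is rounded away

# Gap between 1.0 and the next representable double
print(numpy.nextafter(1.0, 2.0) - 1.0 == eps)

# The spacing grows with magnitude: numpy.spacing(x) is the gap just above x
for x in (1.0, 1.0e8, 1.0e16):
    print(x, numpy.spacing(x))
```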

## Why should we care about this?

 - Floating point arithmetic is not associative or distributive, so the order of operations matters
 - Floating point errors compound; do not assume even double precision is enough!
 - Mixing precision can be dangerous

### Example 1: Simple Arithmetic
 
Simple arithmetic with $\delta < \epsilon_{\text{machine}}$. 

Compare

   $$1+\delta - 1 \quad vs. \quad 1 - 1 + \delta$$

In [None]:
eps = numpy.finfo(float).eps
delta = 0.5*eps
x = 1+delta  -1
y = 1 - 1 + delta
print('1 + delta - 1 = {}'.format(x))
print('1 - 1 + delta = {}'.format(y))
print( x == y)
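
The same effect shows up in ordinary sums. The sketch below uses an explicit left-to-right loop (the helper `left_to_right_sum` is our own) so that the order of the additions is unambiguous, and compares it with `math.fsum` from the standard library, which keeps track of the low-order bits that plain addition discards.

```python
import math

def left_to_right_sum(values):
    """Add the values one at a time, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))          # grouping changes the rounded result

values = [1.0] + [1.0e-16] * 10
print(left_to_right_sum(values))           # each 1e-16 is rounded away when added to 1.0
print(left_to_right_sum(values[::-1]))     # adding the small terms first lets them accumulate
print(math.fsum(values))                   # compensated summation
```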

### Example 2: Catastrophic cancellation 

Let us examine what happens when we add two numbers $x$ and $y$ where $x + y \neq 0$.  We can estimate a bound on the resulting error with some error analysis.  To do so we need to introduce the idea that each floating point operation introduces an error such that

$$
    \text{fl}(x ~\text{op}~ y) = (x ~\text{op}~ y) (1 + \delta)
$$

where $\text{fl}(\cdot)$ is a function that returns the floating point representation of the expression enclosed, $\text{op}$ is some operation (e.g. $+, -, \times, /$), and $\delta$ is the floating point error due to $\text{op}$.

Back to our problem at hand.  The floating point error due to addition is

$$
    \text{fl}(x + y) = (x + y) (1 + \delta).
$$

Comparing this to the true solution using a relative error we have

$$\begin{aligned}
    \frac{|(x + y) - \text{fl}(x + y)|}{|x + y|} &= \frac{|(x + y) - (x + y) (1 + \delta)|}{|x + y|} = |\delta|.
\end{aligned}$$

so that if $\delta = \mathcal{O}(\epsilon_{\text{machine}})$ we are not too concerned.

What if instead we consider the floating point errors in the representations of $x$ and $y$, with $x \neq y$, and let $\delta_x$ and $\delta_y$ be the relative errors in their representations?  Here we will assume this constitutes the floating point error rather than being associated with the operation itself.

Now consider the difference between the two numbers
$$\begin{aligned}
    \text{fl}(x - y) &= x (1 + \delta_x) - y (1 + \delta_y) \\
    &= x - y + x \delta_x - y \delta_y \\
    &= (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)
\end{aligned}$$

Again computing the relative error we then have

$$\begin{aligned}
    \frac{\left|(x - y) - (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right|}{|x - y|} &= 
   \left|1 - \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right|\\
   &=\frac{|x \delta_x - y \delta_y|}{|x - y|} \\
\end{aligned}$$

The important distinction here is that now the error is dependent on the values of $x$ and $y$ and more importantly, their difference.  Of particular concern is the relative size of $x - y$.  As it approaches zero relative to the magnitudes of $x$ and $y$ the error could be arbitrarily large.  This is known as **catastrophic cancellation**.

In [None]:
dx = numpy.array([10**(-n) for n in range(1, 16)])
x = 1.0 + dx
y = numpy.ones(x.shape)
error = numpy.abs(x - y - dx) / (dx)

In [None]:
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2)

# Left panel: the computed difference x - y, which should equal dx
axes = fig.add_subplot(1, 2, 1)
axes.loglog(dx, x - y, 'o-')
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel(r"$x - y$")
axes.set_title(r"$\Delta x$ vs. $x - y$")

# Right panel: relative error of the computed difference
axes = fig.add_subplot(1, 2, 2)
axes.loglog(dx, error, 'o-')
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel(r"$|x - y - \Delta x| / \Delta x$")
axes.set_title("Difference between $x$ and $y$ vs. Relative Error")

plt.show()
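
The relative error formula $\frac{|x \delta_x - y \delta_y|}{|x - y|}$ derived above can also be checked on a single pair of numbers. In this sketch we treat double precision values as "exact" and their single precision copies as the perturbed representations; the particular values of `x_true` and `y_true` are arbitrary choices.

```python
import numpy

x_true, y_true = 1.0 + 1.0e-5, 1.0          # treated as exact (double precision)
x32, y32 = numpy.float32(x_true), numpy.float32(y_true)

# Relative representation errors delta_x and delta_y of the single precision copies
delta_x = (float(x32) - x_true) / x_true
delta_y = (float(y32) - y_true) / y_true

# Relative error of the difference, measured directly and via the formula
measured = abs((float(x32) - float(y32)) - (x_true - y_true)) / abs(x_true - y_true)
predicted = abs(x_true * delta_x - y_true * delta_y) / abs(x_true - y_true)

print('|delta_x| = {:.2e}'.format(abs(delta_x)))
print('measured  = {:.2e}'.format(measured))
print('predicted = {:.2e}'.format(predicted))
```

The representation error in $x$ is tiny, but the relative error in $x - y$ is larger by roughly a factor of $|x| / |x - y|$.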

### Example 3: Function Evaluation

Consider the function
$$
    f(x) = \frac{1 - \cos x}{x^2}
$$
with $x\in[-10^{-3}, 10^{-3}]$.  

Taking the limit as $x \rightarrow 0$ we can see what behavior we would expect to see from evaluating this function:
$$
    \lim_{x \rightarrow 0} \frac{1 - \cos x}{x^2} = \lim_{x \rightarrow 0} \frac{\sin x}{2 x} = \lim_{x \rightarrow 0} \frac{\cos x}{2} = \frac{1}{2}.
$$

What does floating point representation do?

In [None]:
x = numpy.linspace(-1e-3, 1e-3, 100, dtype=numpy.float32)
f = 0.5
F = (1.0 - numpy.cos(x)) / x**2
rel_err = numpy.abs((f - F)) / f

In [None]:
fig = plt.figure(figsize=(8,6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_err, 'o')
axes.set_xlabel("x")
axes.grid()
axes.set_ylabel("Relative Error")
axes.set_title("$\\frac{1-\\cos{x}}{x^2} - \\frac{1}{2}$",fontsize=18)
plt.show()
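
One common remedy is to rewrite the expression so that the subtraction of nearly equal numbers never happens. The sketch below uses the half-angle identity $1 - \cos x = 2 \sin^2(x/2)$; this rewrite is our own addition for illustration, and the variable names are hypothetical.

```python
import numpy

x = numpy.linspace(-1e-3, 1e-3, 100, dtype=numpy.float32)
f = 0.5

# Direct evaluation: 1 - cos(x) subtracts two nearly equal single precision numbers
F_direct = (1.0 - numpy.cos(x)) / x**2

# Rewritten with 1 - cos(x) = 2 sin(x/2)**2, which involves no subtraction
F_stable = 2.0 * numpy.sin(x / 2.0)**2 / x**2

err_direct = numpy.max(numpy.abs(F_direct - f)) / f
err_stable = numpy.max(numpy.abs(F_stable - f)) / f
print('max relative error, direct: {:.2e}'.format(err_direct))
print('max relative error, stable: {:.2e}'.format(err_stable))
```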

### Example 4: Evaluation of a Polynomial

   $$f(x) = x^7 - 7x^6 + 21 x^5 - 35 x^4 + 35x^3-21x^2 + 7x - 1$$
   
Note: $f(1) = 0$ (and will be close to zero for $x\approx 1$)

Here we compare polynomial evaluation using naive powers with Horner's method as implemented in `eval_poly(p,x)` defined above.

In [None]:
x = numpy.linspace(0.988, 1.012, 1000, dtype=numpy.float16)
y = x**7 - 7.0 * x**6 + 21.0 * x**5 - 35.0 * x**4 + 35.0 * x**3 - 21.0 * x**2 + 7.0 * x - 1.0

# repeat using Horner's method from above
p = numpy.array([1, -7, 21, -35, 35, -21, 7, -1 ])
yh = eval_poly(p,x)

In [None]:
fig = plt.figure(figsize=(8,6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, 'r',label='naive')
axes.plot(x, yh, 'b',label='horner')
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x,yh-y,'g')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('$f_{horner} - f_n$')
axes.set_title('error')
plt.show()

In [None]:
def eval_polys(p, x):
    '''Evaluates a polynomial using Horner's method given coefficients p at x
    
      The polynomial is defined as
    
        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
        
    Parameters:
        p: list or numpy array of coefficients 
        x:  scalar float or numpy array this version is less careful about input type
        
    returns:
        P(x):  value of the polynomial at point x, P will return as the same type as x
    
    '''
   
    y = p[0]
    for element in p[1:]:
        y = y*x + element
        
    return y 

In [None]:

# repeat using different Horner's method from above
yh = eval_polys(p,x)

In [None]:
fig = plt.figure(figsize=(8,6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, 'r',label='naive')
axes.plot(x, yh, 'b',label='horner')
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x,yh-y,'g')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('$f_{horner} - f_n$')
axes.set_title('error')
plt.show()

### Example 5: Rational Function Evaluation
Compute $f(x) = x + 1$ by the function 

$$F(x) = \frac{x^2 - 1}{x - 1}$$

Do you expect there to be issues?

In [None]:
x = numpy.linspace(0.5, 1.5, 101, dtype=numpy.float32)
f_hat = (x**2 - 1.0) / (x - 1.0)
f = (x + 1.0)

In [None]:
fig = plt.figure()
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, numpy.abs(f - f_hat)/numpy.abs(f))
axes.set_xlabel("$x$")
axes.set_ylabel("Relative Error")
axes.grid()
plt.show()
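
A closer look at where the error comes from, as a minimal sketch. The two sample points below are chosen here for illustration and are not part of the original example. At $x = 1$ the formula is $0/0$, and right next to $1$ the subtraction $x^2 - 1$ cancels most of the significant digits of the rounded product.

```python
import numpy

# Hypothetical sample points chosen for illustration: x = 1 and x = 1 + 2**-12
xs = numpy.array([1.0, 1.0 + 2.0**-12], dtype=numpy.float32)
one = numpy.float32(1.0)

with numpy.errstate(invalid='ignore', divide='ignore'):
    F_vals = (xs * xs - one) / (xs - one)   # rational form, evaluated in float32
f_vals = xs + one                           # reference form x + 1

rel_err_near_one = numpy.abs(F_vals[1] - f_vals[1]) / numpy.abs(f_vals[1])
print("F(1) =", F_vals[0])                  # 0/0 is not a number
print("relative error next to 1:", rel_err_near_one)
print("float32 machine epsilon: ", numpy.finfo(numpy.float32).eps)
```

The relative error next to $x = 1$ is several orders of magnitude larger than machine epsilon, even though both formulas are algebraically identical away from $x = 1$.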

## Combination of Errors

In general we need to concern ourselves with the combination of both discretization error and floating point error.

### Reminder:
* **Discretization error**:  Errors arising from approximation of a function, truncation of a series...

$$\sin x \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} + O(x^7)$$

* **Floating-point Error**:  Errors arising from approximating real numbers with finite-precision numbers

$$\pi \approx 3.14$$

Similarly, $\frac{1}{3} \approx 0.333333333$ cannot be represented exactly in decimal or as a binary floating point number.
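
Both kinds of error can be measured directly. The following is a small sketch (the evaluation point $x = 0.1$ is an arbitrary choice made here): the truncation error of the three-term sine series is compared with the first dropped term, and the exact value stored for `1.0 / 3.0` is compared with the true fraction using the standard library `fractions` module.

```python
import math
from fractions import Fraction

# Discretization (truncation) error: keep three terms of the sine series
x_small = 0.1
taylor = x_small - x_small**3 / math.factorial(3) + x_small**5 / math.factorial(5)
truncation_error = abs(math.sin(x_small) - taylor)
first_dropped_term = x_small**7 / math.factorial(7)
print("truncation error:  ", truncation_error)
print("first dropped term:", first_dropped_term)

# Floating point error: the stored value of 1/3 is a nearby binary fraction
stored_third = Fraction(1.0 / 3.0)          # exact rational value of the float
representation_error = abs(stored_third - Fraction(1, 3))
print("representation error of 1/3:", float(representation_error))
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
```

The truncation error is controlled by how many terms we keep, while the representation error is fixed by the floating point format.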

### Example 1

Consider a finite difference approximation to the first derivative of a function

$$f^\prime(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

**Note**: in the limit $\Delta x\rightarrow 0$,  this is the standard definition of the first derivative.  However we're interested in the error for a *finite* $\Delta x$.

Moreover, (as we will see in future notebooks), there are many ways to approximate the first derivative.  For example we can write the "centered first derivative" as

$$f^\prime(x) \approx \frac{f(x + \Delta x) - f(x - \Delta x)}{2\Delta x}$$


Here we will just compare the error for the two different finite-difference formulas given

$$f(x) = f^\prime(x) = e^x$$ 

at $x=1$ for decreasing values of $\Delta x$. We will also introduce the idea of an 'inline' or `lambda` function in python.

In [None]:
f = lambda x: numpy.exp(x)
f_prime = lambda x: numpy.exp(x)

In [None]:
delta_x = numpy.array([2.0**(-n) for n in range(1, 60)])
x = 1.0

# Forward finite difference approximation to first derivative
f_hat_1 = (f(x + delta_x) - f(x)) / (delta_x)
# Centered finite difference approximation to first derivative
f_hat_2 = (f(x + delta_x) - f(x - delta_x)) / (2.0 * delta_x)

In [None]:
fig = plt.figure(figsize=(8,6))
axes = fig.add_subplot(1, 1, 1)
axes.loglog(delta_x, numpy.abs(f_hat_1 - f_prime(x)), 'o-', label="One-Sided")
axes.loglog(delta_x, numpy.abs(f_hat_2 - f_prime(x)), 's-', label="Centered")
axes.legend(loc=3,fontsize=14)
axes.set_xlabel(r"$\Delta x$",fontsize=16)
axes.set_ylabel("Absolute Error",fontsize=16)
axes.set_title("Finite Difference approximations to $df/dx$",fontsize=18)
axes.grid()
plt.show()
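
The slopes in the left part of such a log-log plot can be estimated numerically. This is a minimal sketch, assuming we restrict attention to step sizes between $2^{-4}$ and $2^{-12}$ (a range picked here so that truncation error dominates rounding error); the names `dx`, `x0`, and `order_*` are introduced only for this illustration.

```python
import numpy

f = lambda x: numpy.exp(x)
x0 = 1.0

# Step sizes where truncation error dominates rounding error (assumed range)
dx = numpy.array([2.0**(-n) for n in range(4, 13)])
err_forward = numpy.abs((f(x0 + dx) - f(x0)) / dx - f(x0))
err_centered = numpy.abs((f(x0 + dx) - f(x0 - dx)) / (2.0 * dx) - f(x0))

# Slope of a straight-line fit in log-log space estimates p in error ~ dx**p
order_forward = numpy.polyfit(numpy.log(dx), numpy.log(err_forward), 1)[0]
order_centered = numpy.polyfit(numpy.log(dx), numpy.log(err_centered), 1)[0]
print("forward difference order: ", order_forward)
print("centered difference order:", order_centered)
```

The fitted orders should come out close to 1 for the one-sided formula and close to 2 for the centered formula. For much smaller $\Delta x$ the fit would no longer be meaningful because floating point error takes over.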


### Example 2

Evaluate $e^x$ with its Taylor series.

$$e^x = \sum^\infty_{n=0} \frac{x^n}{n!}$$

Can we pick $N < \infty$ that can approximate $e^x$ over a given range $x \in [a,b]$ such that the relative error $E$ satisfies $E < 8 \cdot \varepsilon_{\text{machine}}$?

We can try simply evaluating the Taylor polynomial directly for various $N$

In [None]:
from scipy.special import factorial

def my_exp(x, N=10):
    value = 0.0
    for n in range(N + 1):
        value += x**n / float(factorial(n))
        
    return value

And test this

In [None]:
eps = numpy.finfo(numpy.float64).eps

x = numpy.linspace(-2, 50., 100, dtype=numpy.float64)
MAX_N = 300
for N in range(1, MAX_N + 1):
    rel_error = numpy.abs((numpy.exp(x) - my_exp(x, N=N)) / numpy.exp(x))
    if numpy.all(rel_error < 8.0 * eps): 
        break

In [None]:
fig = plt.figure(figsize=(8,6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_error/eps)
axes.set_xlabel("x")
axes.set_ylabel("Relative Error/eps")
axes.set_title('N = {} terms'.format(N))
axes.grid()
plt.show()
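
One reason direct summation can struggle is cancellation for negative arguments. The sketch below is an illustration under assumptions made here (the point $x = -30$, the term count, and the helper `taylor_exp` are not from the example above): the alternating terms grow very large before they decay, so their rounding errors swamp the tiny final result.

```python
import math

def taylor_exp(x, N):
    """Sum the first N + 1 terms of the Taylor series of exp about 0 (scalar x)."""
    return sum(x**n / math.factorial(n) for n in range(N + 1))

x_neg = -30.0
N_terms = 150          # enough terms that the truncation error is negligible here
approx = taylor_exp(x_neg, N_terms)
exact_value = math.exp(x_neg)
rel_err_neg = abs(approx - exact_value) / exact_value

# Rounding errors scale with the largest term, not with the final answer
largest_term = max(abs(x_neg)**n / math.factorial(n) for n in range(N_terms + 1))
eps64 = 2.0**-52       # double precision machine epsilon
print("relative error:      ", rel_err_neg)
print("largest term * eps:  ", largest_term * eps64)
print("exact value exp(-30):", exact_value)
```

Adding more terms does not help in this regime, since the error is set by floating point arithmetic rather than by truncation.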

### Can we do better?  

Note: 

the largest value of $x$ such that $e^x$ stays below MAX, the largest representable double precision number, is:

In [None]:
print(numpy.log(numpy.finfo(float).max))

and `numpy.exp` handles that just fine

In [None]:
print(numpy.exp(709,dtype=numpy.float64))
print(numpy.exp(-709,dtype=numpy.float64))

Your homework:  the great Exp Challenge

### One final example (optional): How to calculate Relative Error

Say we wanted to compute the relative error between two values $x$ and $y$ using $x$ as the normalizing value.  Algebraically the forms
$$
    E = \frac{x - y}{x}
$$
and
$$
    E = 1 - \frac{y}{x}
$$
are equivalent.  In finite precision what form might be expected to be more accurate and why?

*Example based on a [blog](https://nickhigham.wordpress.com/2017/08/14/how-and-how-not-to-compute-a-relative-error/) post by Nick Higham*

Using this model the original definition contains two floating point operations such that
$$\begin{aligned}
    E_1 = \text{fl}\left(\frac{x - y}{x}\right) &= \text{fl}(\text{fl}(x - y) / x) \\
    &= \left[ \frac{(x - y) (1 + \delta_1)}{x} \right ] (1 + \delta_2) \\
    &= \frac{x - y}{x}  (1 + \delta_1) (1 + \delta_2)
\end{aligned}$$

For the other formulation we have
$$\begin{aligned}
    E_2 = \text{fl}\left( 1 - \frac{y}{x} \right ) &= \text{fl}\left(1 - \text{fl}\left(\frac{y}{x}\right) \right) \\
    &= \left(1 - \frac{y}{x} (1 + \delta_1) \right) (1 + \delta_2)
\end{aligned}$$

If we assume that all $\text{op}$s have similar error magnitudes then we can simplify things by letting
$$
    |\delta_\ast| \le \epsilon.
$$

To compare the two formulations we again use the relative error between the true relative error $e$ and our computed versions $E_i$.

Original definition:
$$\begin{aligned}
    \frac{e - E_1}{e} &= \frac{\frac{x - y}{x} - \frac{x - y}{x}  (1 + \delta_1) (1 + \delta_2)}{\frac{x - y}{x}} \\
    &= 1 - (1 + \delta_1) (1 + \delta_2) \\
    \left| \frac{e - E_1}{e} \right| &\le (1 + \epsilon) (1 + \epsilon) - 1 = 2 \epsilon + \epsilon^2
\end{aligned}$$

Manipulated definition:

$$\begin{aligned}
    \frac{e - E_2}{e} &= \frac{e - \left[1 - \frac{y}{x}(1 + \delta_1) \right] (1 + \delta_2)}{e} \\
    &= \frac{e - \left[e - \frac{y}{x} \delta_1 \right] (1 + \delta_2)}{e} \\
    &= \frac{e - \left[e + e\delta_2 - \frac{y}{x} \delta_1 - \frac{y}{x} \delta_1 \delta_2 \right] }{e} \\
    &= - \delta_2 + \frac{1}{e} \frac{y}{x} \left(\delta_1 + \delta_1 \delta_2 \right) \\
    &= - \delta_2 + \frac{1 -e}{e} \left(\delta_1 + \delta_1 \delta_2 \right) \\
    \left| \frac{e - E_2}{e} \right| &\le \epsilon + \left |\frac{1 - e}{e}\right | (\epsilon + \epsilon^2)
\end{aligned}$$

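We see then that the floating point error of the second form depends on the magnitude of $e$: when $y$ is close to $x$, $|e|$ is small, the factor $|(1 - e)/e|$ is large, and $E_2$ can be far less accurate than $E_1$, whose bound does not depend on $e$.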
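
To get a feel for the two bounds, they can be evaluated for a few values of $e$. This is only a sketch: the function names are introduced here, and single precision machine epsilon is used as a stand-in for $\epsilon$.

```python
import numpy

eps32 = float(numpy.finfo(numpy.float32).eps)   # stand-in for epsilon in the bounds

def bound_original(eps):
    """Bound for fl((x - y) / x): 2 eps + eps**2, independent of e."""
    return 2.0 * eps + eps**2

def bound_manipulated(e, eps):
    """Bound for fl(1 - y / x): eps + |(1 - e) / e| (eps + eps**2)."""
    return eps + abs((1.0 - e) / e) * (eps + eps**2)

for e in [0.5, 1e-1, 1e-3, 1e-5]:
    ratio = bound_manipulated(e, eps32) / bound_original(eps32)
    print("e = {:8.1e}   bound ratio (manipulated / original) = {:10.3e}".format(e, ratio))
```

For $e = 1/2$ the two bounds coincide, and as $e$ shrinks the bound for the manipulated form grows roughly like $1/(2e)$ relative to the original.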

### Comparison of Relative Errors of estimates of Relative Error ;^)

In [None]:
# Based on the code by Nick Higham
# https://gist.github.com/higham/6f2ce1cdde0aae83697bca8577d22a6e
# Compares relative error formulations computed in single precision against a double precision reference

N = 501    # Note: Use 501 instead of 500 to avoid the zero value
d = numpy.finfo(numpy.float32).eps * 1e4
a = 3.0
x = a * numpy.ones(N, dtype=numpy.float32)
y = [x[i] + numpy.multiply((i - numpy.divide(N, 2.0, dtype=numpy.float32)), d, dtype=numpy.float32) for i in range(N)]

# Compute errors and "true" error
relative_error = numpy.empty((2, N), dtype=numpy.float32)
relative_error[0, :] = numpy.abs(x - y) / x
relative_error[1, :] = numpy.abs(1.0 - y / x)
exact = numpy.abs( (numpy.float64(x) - numpy.float64(y)) / numpy.float64(x))

# Compute differences between error calculations
error = numpy.empty((2, N))
for i in range(2):
    error[i, :] = numpy.abs((relative_error[i, :] - exact) / numpy.abs(exact))

fig = plt.figure(figsize=(8,6))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(y, error[0, :], '.', markersize=10, label="$|x-y|/|x|$")
axes.semilogy(y, error[1, :], '.', markersize=10, label="$|1-y/x|$")

axes.grid(True)
axes.set_xlabel("y")
axes.set_ylabel("Relative Error")
axes.set_xlim((numpy.min(y), numpy.max(y)))
axes.set_ylim((5e-9, numpy.max(error[1, :])))
axes.set_title("Relative Error Comparison: x,y {}".format(y[0].dtype))
axes.legend()
plt.show()

Some other links that might be helpful regarding IEEE Floating Point:
 - [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)
 - [IEEE 754 Floating Point Calculator](http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html)
 - [Numerical Computing with IEEE Floating Point Arithmetic](http://epubs.siam.org/doi/book/10.1137/1.9780898718072)

## Future issues with fp64 and High-Performance Computing


### The Issues

* In traditional high-performance computing (HPC), IEEE fp64 has become the standard precision necessary for accurate, reproducible calculations across a wide range of scientific computing (e.g. climate models, fusion, solid mechanics).
* Until recently, the needs of HPC drove the development of chips and hardware, so that commodity computers and supercomputers benefited from the same technology.
* However, with the rise of general purpose GPUs and AI, the landscape is changing rapidly.

### A brief history

[Dongarra et al., 2024](https://arxiv.org/abs/2411.12090)

<table style=width:100%>
    <tr>
<img src="./images/Dongarra_etal_2024_ArXiv_fp.png" width="800"/>
    </tr>
</table>



### A brief history of floating point hardware

[Dongarra et al., 2024](https://arxiv.org/abs/2411.12090)

* 1980s: Dedicated separate floating point co-processors (e.g. Intel 8087, Motorola 68881 co-processors)
* 1989: Introduction of the Intel 80486 (i486) CPU with a built-in floating point unit
* 1999: Introduction of the Nvidia GeForce 256, a separate "Graphics Processing Unit" (GPU) for low-precision, fast parallel graphics
* mid-2000s: Adoption of programmable general purpose GPUs for fp acceleration, addition of fp64 on GPUs
* 2006: Introduction of Nvidia CUDA for programmable GPUs; shift to GPUs for high-performance computing and ML/AI
* ~2020+: ML/AI revolution:  Deep learning algorithms driven by matrix multiplications that tolerate low precision 


### Current fp64 floating point performance for CPUs and GPUs

<table style=width:100%>
    <tr>
<img src="./images/CPUvGPU_fp64-performance.jpg" width="800"/>
    </tr>
</table>



### Current floating point formats for CPUs and GPUs

<table style=width:100%>
    <tr>
<img src="./images/Dongarra_etal_2024_figure01.jpg" width="800"/>
    </tr>
</table>



### Near Future NVIDIA GPU Floating Point roadmap ([StorageReview.com](https://www.storagereview.com/news/nvidias-gtc-2025-highlights-blackwell-gpus-dgx-systems-and-ai-q-framework))

| Specification | H100 | H200 | B100 | B200 | B300 |
|---|---|---|---|---|---|
| Max Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e |
| Memory Bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | 8 TB/s |
| FP4 Tensor Core | – | – | 14 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP6 Tensor Core | – | – | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| FP8 Tensor Core | 3958 TFLOPS (~4 PFLOPS) | 3958 TFLOPS (~4 PFLOPS) | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| INT 8 Tensor Core | 3958 TOPS | 3958 TOPS | 7 POPS | 9 POPS | 15 PFLOPS* |
| FP16/BF16 Tensor Core | 1979 TFLOPS (~2 PFLOPS) | 1979 TFLOPS (~2 PFLOPS) | 3.5 PFLOPS | 4.5 PFLOPS | 7.5 PFLOPS* |
| TF32 Tensor Core | 989 TFLOPS | 989 TFLOPS | 1.8 PFLOPS | 2.2 PFLOPS | 3.3 PFLOPS* |
| FP32 (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 Tensor Core (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 (Dense) | 34 TFLOPS | 34 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| Max Power Consumption | 700W | 700W | 700W | 1000W | Information Unknown |

### Beyond Blackwell

<table style=width:100%>
    <tr>
<img src="./images/NVIDIA_Roadmap.jpeg" width="800"/>
    </tr>
</table>


### Interesting times indeed

...the landscape of high-performance computation is increasingly complex...but there are important classes of problems that still need high-precision floating point.  Some Options:

* fp64 Emulation leveraging low-precision hardware
* clever mixed precision algorithms 
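
As one illustration of the mixed precision idea, here is a minimal sketch of classical iterative refinement. It is not necessarily what any particular library or vendor does, and the test matrix, its size, and the number of refinement steps are assumptions made for this example. The expensive solves use a float32 copy of the matrix, while only the cheap residual is computed in float64.

```python
import numpy

rng = numpy.random.default_rng(0)
n = 50
# Hypothetical test system: a large diagonal shift keeps it well conditioned
A = rng.standard_normal((n, n)) + n * numpy.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

# Low precision does the expensive work: solve with a float32 copy of A
A32 = A.astype(numpy.float32)
x_ref = numpy.linalg.solve(A32, b.astype(numpy.float32)).astype(numpy.float64)
err_single = numpy.linalg.norm(x_ref - x_true) / numpy.linalg.norm(x_true)

# High precision is used only for the residual and the update
for k in range(5):
    r = b - A @ x_ref                                       # float64 residual
    d = numpy.linalg.solve(A32, r.astype(numpy.float32))    # float32 correction
    x_ref = x_ref + d.astype(numpy.float64)
err_refined = numpy.linalg.norm(x_ref - x_true) / numpy.linalg.norm(x_true)

print("float32 solve only:", err_single)
print("after refinement:  ", err_refined)
```

In practice one would factor the low precision matrix once and reuse the factors for every correction rather than calling a full solve each time. The sketch keeps the repeated solve only for brevity.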


## Operation Counting

Discretization Error:  **Why not use more terms in the Taylor series?**

Floating Point Error: **Why not use the highest precision possible?**

### Example 1: Matrix-Vector Multiplication

Let $A, B \in \mathbb{R}^{N \times N}$ and $x \in \mathbb{R}^N$.  

1. Count the approximate number of operations it will take to compute $A x$.
2. Do the same for $A B$.

Matrix-vector product:  Defining $[A]_i$ as the $i$th row of $A$ and $A_{ij}$ as the $i$, $j$th entry then
$$
    (A x)_i = [A]_i \cdot x = \sum^N_{j=1} A_{ij} x_j, \quad i = 1, \ldots, N
$$

Take an explicit case, say $N = 3$, then the product is
$$
    A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \end{bmatrix} = \begin{bmatrix}
        A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 \\
        A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 \\
        A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3
    \end{bmatrix}
$$

This leads to 15 operations (6 additions and 9 multiplications).  

Take another case, say $N = 4$, then the product is
$$
    A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \\ [A]_4 \cdot x \end{bmatrix} = \begin{bmatrix}
        A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 + A_{14} \times x_4 \\
        A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 + A_{24} \times x_4 \\
        A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3 + A_{34} \times x_4 \\
        A_{41} \times x_1 + A_{42} \times x_2 + A_{43} \times x_3 + A_{44} \times x_4
    \end{bmatrix}
$$

This leads to 28 operations (12 additions and 16 multiplications).

Generalizing this there are $N^2$ multiplications and $N (N -1)$ additions for a total of

$$
    \text{operations} = N (N - 1) + N^2 = \mathcal{O}(N^2).
$$
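
The count can be checked by instrumenting a loop-based product. This is a sketch written for illustration (the helper `matvec_count` and the test data are introduced here), not an efficient way to multiply.

```python
import numpy

def matvec_count(A, x):
    """Compute A x with explicit loops, counting multiplications and additions."""
    N = A.shape[0]
    y = numpy.zeros(N)
    mults, adds = 0, 0
    for i in range(N):
        total = A[i, 0] * x[0]          # first product starts the row sum
        mults += 1
        for j in range(1, N):
            total += A[i, j] * x[j]     # one multiplication and one addition
            mults += 1
            adds += 1
        y[i] = total
    return y, mults, adds

for n in [3, 4, 10]:
    A_test = numpy.arange(1.0, n * n + 1.0).reshape(n, n)
    x_test = numpy.ones(n)
    y_test, mults, adds = matvec_count(A_test, x_test)
    assert numpy.allclose(y_test, A_test @ x_test)
    print("N = {:2d}: {} multiplications + {} additions = {} operations".format(
        n, mults, adds, mults + adds))
```

The counts for $N = 3$ and $N = 4$ reproduce the 15 and 28 operations found by hand.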

Matrix-Matrix product ($AB$):  Defining $[B]_j$ as the $j$th column of $B$ then
$$
    (A B)_{ij} = [A]_i \cdot [B]_j, \quad i, j = 1, \ldots, N
$$
The inner product of two vectors is represented by
$$
    a \cdot b = \sum^N_{i=1} a_i b_i
$$
which requires $N$ multiplications and $N - 1$ additions, or $\mathcal{O}(N)$ operations.  Since there are $N^2$ entries in the resulting matrix, we would have $\mathcal{O}(N^3)$ operations in total.
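
The same bookkeeping works for the matrix-matrix product. In this sketch (the helper name `matmul_count` and the random test matrices are assumptions for illustration) each of the $N^2$ entries is formed as an inner product, giving $N^2 (2N - 1)$ operations in total.

```python
import numpy

def matmul_count(A, B):
    """Compute A B entry by entry as inner products, counting operations."""
    N = A.shape[0]
    C = numpy.zeros((N, N))
    mults, adds = 0, 0
    for i in range(N):
        for j in range(N):
            # (A B)_ij is the inner product of row i of A with column j of B
            total = A[i, 0] * B[0, j]
            mults += 1
            for k in range(1, N):
                total += A[i, k] * B[k, j]
                mults += 1
                adds += 1
            C[i, j] = total
    return C, mults, adds

rng = numpy.random.default_rng(1)
for n in [2, 4, 8]:
    A_test = rng.standard_normal((n, n))
    B_test = rng.standard_normal((n, n))
    C_test, mults, adds = matmul_count(A_test, B_test)
    assert numpy.allclose(C_test, A_test @ B_test)
    print("N = {}: {} operations, N**2 (2 N - 1) = {}".format(
        n, mults + adds, n**2 * (2 * n - 1)))
```

Doubling $N$ multiplies the operation count by roughly eight, as expected for cubic growth.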

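There are methods for performing matrix-matrix multiplication faster.  In the following figure we see a collection of algorithms over time that have been able to bound the number of operations in certain circumstances.  Here the operation count is written as
$$
    \mathcal{O}(N^\omega)
$$
where a smaller exponent $\omega$ means fewer operations; the standard algorithm above has $\omega = 3$.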
![matrix multiplication operation bound](./images/bound_matrix_multiply.png)