# Number Representation and Precision

Real numbers are stored with a finite decimal precision (set by the mantissa) and a finite decimal exponent range. The mantissa contains the significant figures of the number (and thereby the precision of the number). A number like $(9.90625)_{10}$ in the decimal representation is given in a binary representation by

$(1001.11101)_2 = 1\times2^3 + 0\times2^2 + 0\times2^1 + 1\times2^0 + 1\times2^{-1} + 1\times2^{-2} + 1\times2^{-3} + 0\times2^{-4} + 1\times2^{-5}$

and it has an exact machine number representation since we need a finite number of bits to represent this number. This representation is however not very practical. Rather, we prefer to use a scientific notation. In the decimal system we would write a number like 9.90625 in what is called the normalized scientific notation. This means simply that the decimal point is shifted and appropriate powers of 10 are supplied. Our number could then be written as
$9.90625 = 0.990625 \times 10^1$,
and a real non-zero number could be generalized as
$x = \pm r \times 10^n$,
with $r$ a number in the range $1/10 \le r < 1$. In a similar way we can represent a binary number in scientific notation as
$x = \pm q \times 2^m$,
with $q$ a number in the range $1/2 \le q < 1$.
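
As a small check of the normalized binary form, Python's standard `math.frexp` returns exactly such a pair $(q, m)$ with $1/2 \le q < 1$. The sketch below applies it to the example value 9.90625 and also rebuilds the number from its binary digits; the variable names are illustrative only.

```python
import math

value = 9.90625

# Rebuild the value from its binary digits 1001.11101
# (five digits after the binary point, hence the division by 2**5).
from_bits = int("100111101", 2) / 2**5
print(from_bits)            # 9.90625

# frexp splits a float into q * 2**m with 0.5 <= q < 1.
q, m = math.frexp(value)
print(q, m)                 # 0.619140625 4

# ldexp is the inverse operation and recovers the original number.
print(math.ldexp(q, m))     # 9.90625
```

Here $q = 9.90625 / 2^4$, so no digits are lost in the split: only the exponent changes.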

In a typical computer, floating-point numbers are represented in the way described above, but with certain restrictions on q and m imposed by the available word length. In the machine, our number x is represented as

$x = (-1)^s \times \text{mantissa} \times 2^{\text{exponent}}$

where $s$ is the sign bit, and the exponent gives the available range. With a single-precision word, 32 bits, 8 bits would typically be reserved for the exponent, 1 bit for the sign and 23 for the mantissa. 
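
The 1 + 8 + 23 bit split can be inspected directly with the standard `struct` module. The sketch below is only an illustration, and the helper name `float32_fields` is made up for it. One assumption to keep in mind: IEEE 754 hardware normalizes the significand as $1.f$ with an implicit leading bit and stores the exponent with a bias of 127, rather than using the $1/2 \le q < 1$ form written above.

```python
import struct

def float32_fields(value):
    """Round value to single precision and return (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(9.90625)
print(sign, f"{exponent:08b}", f"{fraction:023b}")
# 0 10000010 00111101000000000000000

# Reassemble the number: implicit leading 1, unbiased exponent.
significand = 1 + fraction / 2**23
print((-1) ** sign * significand * 2 ** (exponent - 127))   # 9.90625
```

The stored fraction `00111101...` is the binary expansion `1001.11101` with the point moved behind the leading 1, which is why the exponent field decodes to $130 - 127 = 3$.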

## 32-bit: single precision

Sign bit: 1 bit

Exponent: 8 bits

Significand precision: 24 bits (23 explicitly stored)

This gives 6–9 significant decimal digits of precision.

## 64-bit: double precision

Sign bit: 1 bit

Exponent: 11 bits

Significand precision: 53 bits (52 explicitly stored)

This gives 15–17 significant decimal digits of precision.
This is the default for Python's `float` type.
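
The double-precision parameters can be read back from Python itself through `sys.float_info`. This is a minimal sketch, assuming the interpreter's `float` is an IEEE 754 double (the case on common CPython builds).

```python
import sys

info = sys.float_info
print(info.mant_dig)   # 53  bits in the significand
print(info.dig)        # 15  decimal digits that always survive a round trip
print(info.epsilon)    # 2.220446049250313e-16, the gap between 1.0 and the next float

# Anything smaller than half that gap is lost when added to 1.0.
print(1.0 + info.epsilon == 1.0)       # False
print(1.0 + info.epsilon / 2 == 1.0)   # True
```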


## 128-bit: quadruple precision

Sign bit: 1 bit

Exponent: 15 bits

Significand precision: 113 bits (112 explicitly stored)

This gives 33–36 significant decimal digits of precision.


## 256-bit: octuple precision

Sign bit: 1 bit

Exponent: 19 bits

Significand precision: 237 bits (236 explicitly stored)

This format is rarely implemented.


# Precision effects

One important consequence of rounding error is that you should **never use an `if` statement to test two floats for equality.** For instance, you should never, in any program, have a statement like:

In [2]:
x = 3 * 1.1
if x == 3.3:
    print(x)

print(x)

3.3000000000000003


If you need to branch on the value of a float, test whether it lies within a small tolerance of the target instead:

In [3]:
epsilon = 1e-12
if abs(x-3.3) < epsilon:
    print(x)

3.3000000000000003
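
The standard library offers the same idea ready-made in `math.isclose`. The sketch below is an alternative to the hand-written `epsilon` test, not a replacement the notes require. By default `math.isclose` uses a relative tolerance of `1e-9`, so an explicit `abs_tol` is needed when comparing against zero.

```python
import math

x = 3 * 1.1

print(x == 3.3)               # False: the two floats differ in the last bit
print(math.isclose(x, 3.3))   # True: equal to within the relative tolerance

# A relative tolerance alone can never match zero, so pass abs_tol there.
print(math.isclose(1e-15, 0.0))                  # False
print(math.isclose(1e-15, 0.0, abs_tol=1e-12))   # True
```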


## Which operations are most important in dealing with precision?

__Subtraction__ and __Derivatives__

## Subtraction

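$a = b - c$

We have: $fl(a) = fl(b) - fl(c) = a(1+\epsilon_a)$, or

$fl(a) = b(1+\epsilon_b) - c(1+\epsilon_c)$

So, $fl(a)/a = 1 + \epsilon_b (b/a) - \epsilon_c (c/a)$

If $b \approx c$, then $a$ is small compared with $b$ and $c$, so the factors $b/a$ and $c/a$ are large and the relative error on $fl(a)$ can be much larger than $\epsilon_b$ or $\epsilon_c$.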


If we have:

$x = 1000000000000000$

$y = 1000000000000001.2345678901234$

as far as the computer is concerned:
    

In [4]:
x = 1000000000000000
y = 1000000000000001.2345678901234
 
print(y-x) 

1.25


**The true result should be 1.2345678901234!**

In other words, instead of 16-figure accuracy the result is now good to only two or three figures, and the fractional error is about 1% of the true value. This is much worse than before!
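
The value 1.25 is not arbitrary: near $10^{15}$ neighbouring doubles are 0.125 apart, so $y$ is rounded to the nearest multiple of 0.125 before the subtraction ever happens. The sketch below illustrates this with `math.ulp` (available from Python 3.9, an assumption about the interpreter in use) and recovers the exact difference with the standard `decimal` module.

```python
import math
from decimal import Decimal

y = 1000000000000001.2345678901234

# Spacing between adjacent doubles near 1e15.
print(math.ulp(1e15))   # 0.125

# The value actually stored for y, written out exactly.
print(Decimal(y))       # 1000000000000001.25

# Decimal arithmetic on the original digit strings keeps every figure.
exact = Decimal("1000000000000001.2345678901234") - Decimal("1000000000000000")
print(exact)            # 1.2345678901234
```

The subtraction itself is exact here; the damage was done when $y$ was stored, and the subtraction merely exposes it.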


To see another example of this in practice, consider two numbers:

$x = 1$, and $y = 1+10^{-14}\sqrt 2$

It is easy to see that:

$10^{14} (y - x) = \sqrt 2$

Let us try the same calculation in Python:
 

In [5]:
from math import sqrt
x = 1.0
y = 1.0 + (1e-14)*sqrt(2)

print((1e14)*(y-x))
print(sqrt(2))

1.4210854715202004
1.4142135623730951


Again the result is off, this time by about half a percent. We need to be careful in how we code math!
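
Derivatives were named above alongside subtraction because a finite-difference quotient subtracts two nearly equal function values. The sketch below is an illustrative example, not part of the original exercise: it applies a forward difference to `math.sin` at $x = 1$ and prints the error against the exact derivative $\cos(1)$ for several step sizes. The helper name `forward_difference` and the chosen step sizes are assumptions made for the illustration.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)

# Shrinking h first reduces the truncation error, but for very small h
# the subtraction f(x + h) - f(x) cancels almost all significant digits.
for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-15):
    approx = forward_difference(math.sin, x0, h)
    print(f"h = {h:.0e}   error = {abs(approx - exact):.2e}")
```

The error does not keep falling as $h$ shrinks: past a certain point the cancellation in the numerator dominates and the error grows again.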

## Example 1:  Summing $1/n$ 

Consider the series:

$$s_1 = \sum_{n=1}^N \frac{1}{n}$$ which is finite when $N$ is finite. Then consider the same terms added in reverse order,

$$s_2 = \sum_{n=N}^1 \frac{1}{n}$$ which, when summed analytically, should give $s_2 = s_1$.

In [40]:
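# Write code to perform both of these sums for N = 1e8 and compare.
# (N = 9999 is used below so that the loops finish quickly.)
N = 9999
s1 = 0.0
s2 = 0.0

# forward sum: 1/1 + 1/2 + ... + 1/N (largest terms first)
for n in range(1, N + 1):
    s1 += 1/n

# backward sum: 1/N + ... + 1/2 + 1/1 (smallest terms first)
for n in range(N, 0, -1):
    s2 += 1/n

print(s1)
print(s2)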

9.787506036044348
9.787506036044386
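
The two results agree only to about 14 figures, even though the sums are analytically identical. To judge which ordering is closer, one option is to compare both against `math.fsum`, which returns the correctly rounded sum of the same floating-point terms. The sketch below assumes the same $N = 9999$ as the cell above; the variable names are illustrative.

```python
import math

N = 9999
terms = [1 / n for n in range(1, N + 1)]

# Largest terms first.
forward = 0.0
for t in terms:
    forward += t

# Smallest terms first.
backward = 0.0
for t in reversed(terms):
    backward += t

# fsum tracks the partial sums without intermediate rounding.
reference = math.fsum(terms)

print(forward - reference)
print(backward - reference)
```

Adding the small terms first usually loses less, because they can accumulate before meeting a much larger partial sum, but the size of the effect depends on $N$.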


## Example 2: $e^{-x}$

There are three possible algorithms for $e^{-x}$:

1) Simple: $$e^{-x} = \sum_{n=0}^{\infty} (-1)^n \; \frac{x^n}{n!}$$

2) Recursion: $$e^{-x} = \sum_{n=0}^{\infty} s_n = \sum_{n=0}^{\infty} (-1)^n \; \frac{x^n}{n!}$$ where $$s_n = -s_{n-1} \frac{x}{n}, \quad s_0 = 1$$

3) Inverse: $$e^{x} = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$ Then take the inverse: $$e^{-x} = \frac{1}{e^{x}}$$


In [162]:
import numpy as np

# Write a function to compute e^-x for each of the three methods,
# then check their output for x = 0 to 100, in steps of 10, and
# compare to the NumPy version of exp(-x), which is imported above.

def fact(n):
    # n! for a non-negative integer n (the empty product gives 0! = 1)
    f = 1
    for k in range(2, n + 1):
        f = f*k
    return f

def e_minusx_simple(x):
    # Method 1: sum the alternating series directly
    s = 0
    for i in range(0, 1000):
        s = s + ((-1)**i)*((x**i)/fact(i))
    return s

def e_minusx_rec(x):
    # Method 2: build each term from the previous one, s_n = -s_{n-1} * x/n,
    # so that no powers or factorials are computed explicitly
    term = 1.0  # s_0
    s = term
    for n in range(1, 1000):
        term = -term*x/n
        s = s + term
    return s

def e_minusx_inv(x):
    # Method 3: sum the all-positive series for e^x, then invert.
    # Only 100 terms are kept; the largest terms sit near n = x, so for
    # x >= 60 the truncated sum is too small and the result increasingly too large.
    s = 0
    for i in range(0, 100):
        s = s + ((x**i)/fact(i))
    return 1/s

# Method 1 for x = 0, 10, ..., 100
for x in range(0, 101, 10):
    print(e_minusx_simple(x))

# Method 3 for the same values of x
# (e_minusx_rec(x) can be tabulated with the same loop)
for x in range(0, 101, 10):
    print(e_minusx_inv(x))

# NumPy reference value for x = 1
np.exp(-1)

1.0
4.5399929433607724e-05
5.47810291652921e-10
-8.553016433669241e-05
0.1470264494805502
-7015.776232597128
-1223051118.0619795
15141759713408.848
6.772465846238849e+17
-7.884988118863822e+21
-2.8756582514726483e+26
1.0
4.539992976248486e-05
2.0611536224385583e-09
9.357622968840171e-14
4.248354255291594e-18
1.9287498485811295e-22
8.756523735728401e-27
3.977161397179805e-31
1.8362668153382484e-35
9.734161245088655e-40
7.643449333734118e-44


0.36787944117144233
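
To compare the series methods against a reference at every $x$, one possible layout is a small table of relative errors. The sketch below is only an illustration under stated assumptions: it uses the term recursion $s_n = -s_{n-1}\,x/n$ for the alternating series and the same recursion without the sign for $e^{x}$, keeps 400 terms (an arbitrary choice that is ample for $x \le 100$), and takes `math.exp` as the reference. The function names are made up for the example.

```python
import math

def exp_neg_series(x, nterms=400):
    """Alternating series for e^-x, each term built from the previous one."""
    total, term = 1.0, 1.0
    for n in range(1, nterms):
        term = -term * x / n
        total += term
    return total

def exp_neg_inverse(x, nterms=400):
    """Sum the all-positive series for e^x, then take the reciprocal."""
    total, term = 1.0, 1.0
    for n in range(1, nterms):
        term = term * x / n
        total += term
    return 1.0 / total

print("   x        exact   series err  inverse err")
for x in range(0, 101, 10):
    exact = math.exp(-x)
    rel_series = abs(exp_neg_series(x) - exact) / exact
    rel_inverse = abs(exp_neg_inverse(x) - exact) / exact
    print(f"{x:4d} {exact:12.4e} {rel_series:12.2e} {rel_inverse:12.2e}")
```

The alternating series subtracts terms far larger than the final answer, so its relative error grows rapidly with $x$, while the inverse method only ever adds positive terms and stays close to the reference.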