<div style="text-align: justify">
These are the lecture notes for CSC349A Numerical Analysis taught by
Rich Little. They roughly correspond to
the material covered in each lecture in the classroom but the actual
classroom presentation might deviate significantly from them depending
on the flow of the course delivery. They are provided as a reference to
the instructor as well as supporting material for students who miss
the lectures. They are simply notes to support the lecture so the text
is not detailed and they are not thoroughly checked. Use at your own
risk.
</div>

# 1 Overview
<div style="text-align: justify">
The material covered in this lecture is mostly based on handouts 3 and
4.  This material consists of the floating point number
representation, round off errors, and subtractive
cancellation. Several examples of using idealized floating point
arithmetic with ($b=10$, $k=4$, and chopping or rounding) were presented
in class. I strongly encourage you to practice several problems using
idealized floating point arithmetic to gain experience as it is
frequently part of midterm and final questions.
</div>

<div style="text-align: justify">
The only additional note I would like to make that is not mentioned in
the handouts is that, when calculating errors, in most cases the "true
value" used is simply an estimate with higher precision than the one
we are computing rather than the actual true value. For example, when
approximating $\pi$ it is impossible to write down the true value as it
has an infinite number of digits. Instead we calculate the error of a
poor approximation (let's say with 4 significant digits) against one
that is more accurate, like the one provided by a calculator (for
example 10 significant digits).
</div>

# 2 Round-off Error
<div style="text-align: justify">
Consider the interval between any 2 consecutive powers of the base $b$, say $b^{t-1}$ and $b^{t}$. The first floating-point number in that interval is $0.100\dots0 \times b^t=b^{t-1}$. The next higher number in that representation is $0.100\dots1 \times b^t$, and so on up to the largest number with exponent $t$, which is $0.(b-1)(b-1)\dots(b-1)\times b^t$. For example, in decimal that would be $0.999\dots9 \times 10^t$.
</div>

<div style="text-align: justify">
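All floating point numbers with exponent equal to $t$ are in the
interval $[b^{t-1}, b^t)$. With precision $k$, this interval contains exactly
  $(b-1)b^{k-1}$ distinct floating point numbers (the leading digit can take $b-1$ nonzero values and each of the remaining $k-1$ digits can take $b$ values) and they are equally
  spaced.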
</div>

<div style="text-align: justify">
The distance between any 2 consecutive numbers is: 
</div>

$$
\frac{b^t - b^{t-1}}{(b-1)b^{k-1}} = \frac{(b-1)b^{t-1}}{(b-1)b^{k-1}} = b^{t-k}\tag{1}
$$

<div style="text-align: justify">
The spacing between numbers gets larger as $t$ gets larger. In this
course we will frequently be using $b=10$ and precision $k=4$ as it
makes calculations by hand easier and more easily understood by
humans.
</div>
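
Equation (1) can be checked numerically by counting the floating-point numbers in one interval and dividing the interval width by that count. The sketch below is only an illustration; the function name `spacing` is our own and does not come from the handouts.

```python
from fractions import Fraction

def spacing(b: int, k: int, t: int) -> Fraction:
    """Distance between consecutive k-digit base-b floats in [b^(t-1), b^t)."""
    count = (b - 1) * b ** (k - 1)                      # floats with exponent t
    width = Fraction(b) ** t - Fraction(b) ** (t - 1)   # length of the interval
    return width / count

# Compare with the closed form b^(t-k) for b = 10, k = 4 and several exponents.
for t in (-1, 0, 1, 4):
    print(t, spacing(10, 4, t), Fraction(10) ** (t - 4))
```

Exact rational arithmetic (`Fraction`) is used so that the comparison is not itself affected by round-off.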

<div style="text-align: justify">
Generally, when we represent a real number by a floating-point number of some precision $k$, we have to decide which $k$ digits to use. We usually make this decision by either ${\bf chopping}$ all the digits to the right of the $k^{th}$ digit or ${\bf rounding}$ the $k^{th}$ digit up or down based on the value of the $(k+1)^{st}$ digit. Each of these creates a different round-off error. Let's consider chopping first.
</div>

<div style="text-align: justify">
${\bf Example.}$ Let $b=10,\ k=4$ and $p=2/3$.
$$
\begin{array}{c|c|c|c}
&\text{floating-point approximation} & \text{absolute error} & \text{relative error} \\
\hline
\text{chopping}& +0.6666\times 10^0 & 0.0000666... & 0.0001 \\
\hline
\text{rounding}& +0.6667\times 10^0 & 0.0000333... & 0.00005 
\end{array}
$$
</div>
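
The table can be reproduced with Python's `decimal` module, whose contexts compute an exact result and then round it to a fixed number of significant digits. This is only a sketch: the helper name `fl` and its parameters are our own choices, not something defined in the handouts.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP
from fractions import Fraction

def fl(p: Fraction, k: int = 4, rounding=ROUND_HALF_UP) -> Decimal:
    """Round (or chop) the exact rational p to k significant decimal digits."""
    ctx = Context(prec=k, rounding=rounding)
    return ctx.divide(Decimal(p.numerator), Decimal(p.denominator))

p = Fraction(2, 3)
for name, mode in (("chopping", ROUND_DOWN), ("rounding", ROUND_HALF_UP)):
    approx = fl(p, rounding=mode)
    abs_err = abs(p - Fraction(approx))   # exact absolute error
    rel_err = abs_err / abs(p)            # exact relative error
    print(name, approx, float(abs_err), float(rel_err))
```

`ROUND_DOWN` truncates toward zero, which is what these notes call chopping.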

Note that the above absolute errors are round-off errors (that is, they are the difference between a real number $p$ and a floating-point approximation to $p$).

<div style="text-align: justify">
For any real number $p$ in interval $[b^{t-1},b^t)$, the upper bound of the absolute true error of $p^*$, the floating-point representation of $p$ with chopping, is given by,
</div>

$$
|p-p^*|<b^{t-k}
$$

<div style="text-align: justify">
which is the distance between any two adjacent values in the interval.
</div>

<div style="text-align: justify">

To calculate the upper bound on the relative error, we need a bound on $1/|p|$. But, since $p \in [b^{t-1},b^t)$, we know that $p \ge b^{t-1}$ and thus $\frac{1}{p} \le \frac{1}{b^{t-1}}$. Combining this with what we know about the absolute error we get,
</div>

$$
\frac{|p-p^*|}{|p|} < \frac{b^{t-k}}{b^{t-1}} = b^{(t-k)-(t-1)}=b^{1-k}
$$

<div style="text-align: justify">
This quantity $b^{1-k}$ is called the ${\bf unit\ round-off}$ (or the ${\bf machine\ epsilon}$). Note that it is independent of $t$ and the magnitude of $p$. The number $k-1$ indicates approximately the number of significant base $b$ digits in a floating-point approximation to a real number $p$.
</div>

<div style="text-align: justify">
${\bf Example.}$ Let $b=2,\ k=24$ (a $32$ bit word)
    
The unit round-off is $b^{1-k}= 2^{-23}\approx 10^{-7}$, implying that in such a
floating-point system, a real number has about $23$ correct binary digits or $7$ correct decimal
digits.
</div>

<div style="text-align: justify">
${\bf With\ rounding}$ 
    
Similar to above, except now the absolute error satisfies
</div>

$$
|p-p^*|\le\frac{1}{2}b^{t-k}
$$

Thus, the relative error satisfies

$$
\frac{|p-p^*|}{|p|} < \frac{0.5 b^{t-k}}{b^{t-1}}=\frac{1}{2}b^{1-k}
$$

<div style="text-align: justify">
which is the unit round-off in this case. See (3.10) on page 68 of the 6th ed. or page 71 of the 7th ed.
</div>
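
As a sanity check on these formulas, the sketch below evaluates the unit round-off for a few systems. The function name `unit_roundoff` is hypothetical, and the last line assumes that Python floats are IEEE 754 doubles ($b=2$, $k=53$), which holds on common platforms but is not guaranteed by the language.

```python
import sys

def unit_roundoff(b: int, k: int, rounding: bool = False) -> float:
    """b^(1-k) for chopping, half of that for rounding."""
    u = float(b) ** (1 - k)
    return u / 2 if rounding else u

print(unit_roundoff(10, 4))                 # b = 10, k = 4, chopping
print(unit_roundoff(10, 4, rounding=True))  # b = 10, k = 4, rounding
print(unit_roundoff(2, 24))                 # the 32-bit example: 2^-23
# Assumption: Python floats are IEEE doubles, so b^(1-k) = 2^-52.
print(unit_roundoff(2, 53) == sys.float_info.epsilon)
```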

# 3 Floating-point arithmetic
<div style="text-align: justify">
Floating-point arithmetic is a simulation of real arithmetic. We will
use the notation ${\it fl}$ to denote the floating-point representation
of a real number $x$ as $fl(x)$ as well as the floating-point representation of arithmetic operations such as: 
</div>

$$
fl(a+b), fl(a-b), fl(a \times b), fl(a/b) 
$$

where $a$ and $b$ are floating-point numbers. 

<div style="text-align: justify">
The implementation of these floating-point operations (in either software or hardware) depends on several factors. These include, for example, whether rounding or chopping is used and how many significant digits are used for floating-point addition and subtraction.
</div>

<div style="text-align: justify">
For simplicity, we will consider only ${\bf ''idealized''\ floating-point\ arithmetic}$ which is defined as follows. Let $\bullet$ denote any of the basic arithmetic operations $+ - \times /$ and let $x$ and $y$ denote floating point numbers. $fl(x \bullet y)$ is obtained by performing ${\bf exact\ arithmetic}$ on $x$ and $y$, and then ${\bf rounding\ or\ chopping}$ this result to $k$ significant digits.
</div>

<div style="text-align: justify">
${\bf Note 1:}$ Although no actual digital computers or calculators implement floating-point arithmetic that way (it's too expensive as it would require a very long accumulator for doing addition and subtraction), idealized floating-point arithmetic:
    
- Behaves very much like any actual implementation 
    
- Is very simple to do in hand computations
    
- Has accuracy almost identical to that of any implementation
</div>

<div style="text-align: justify">
${\bf Note 2:}$ If $fl$ is applied to an arithmetic expression containing more than one arithmetic operation, then each of the arithmetic operations must be replaced by its corresponding floating-point operation. For example: 
</div>

$$
\begin{align}
fl(x + y - z) =& fl(fl(x+y)-z) \\ 
fl(xy + z /\cos(x)) =& fl(fl(x \times y)+fl(z/fl(\cos(x)))) 
\end{align}
$$

<div style="text-align: justify">
Each $fl$ operation is computed according to the rules of idealized floating-point arithmetic, that is, the exact value of the result is rounded or chopped to $k$ significant digits before proceeding with the rest of the computation. Note that we will compute $fl(\cos(x)), fl(\sqrt x), fl(e^x)$ and so on this way.
</div>

<div style="text-align: justify">
${\bf Note 3:}$ With idealized floating-point arithmetic, the maximum relative error in $fl(x \bullet y)$ is the same as the maximum relative error in converting a real number $z$ to floating-point form. Thus, for a ${\bf single}$ floating-point $+ - \times /$, the ${\bf relative\ error\ is\ very\ small}$: it is $< b^{1-k}$ with chopping, or $< \frac{1}{2} b^{1-k}$ with rounding. However, the relative error in a floating-point computation ${\bf might\ be\ large}$ if more than one floating-point operation is performed. For example, compute $fl(x + y + z)$ when 
</div>

$$
x = + 0.1234 \times 10 ^0, \;\;\; y = -0.5508 \times 10^{-4}, \;\;\; z = -0.1232 \times 10^0
$$

<div style="text-align: justify">
using base $b=10$, precision $k=4$, rounding idealized floating-point arithmetic.
</div>

$$
fl(x+y) = + 0.1233 \times 10^0\;\;\; \mbox{since} \;\; x +y = 0.12334492
$$

$$
fl(x+y+z) = +0.1000 \times 10^{-3} \;\;\; \mbox{since} \;\; .1233 - .1232 = 0.0001 
$$

<div style="text-align: justify">
Since the exact value of $x+y+z = 0.00014492$, the relative error is: 
</div>

$$
\left| \frac{0.00014492 - 0.0001}{0.00014492} \right| = 0.31 \;\;\; \mbox{or} \;\; 31\%
$$

<div style="text-align: justify">
Note, however, that this large relative error can be avoided by changing the order in which these 3 numbers are added together. Consider the evaluation of 
</div>

$$
fl(x+z+y) = fl(fl(x+z)+y)
$$

<div style="text-align: justify">
We obtain: 
</div>

$$
fl(x+z) = 0.0002 \;\;\; \mbox{or} \;\;\; 0.2000\times 10^{-3} 
$$

$$
fl(fl(x+z)+y) = 0.1449 \times 10^{-3} \;\;\; \mbox{since} \;\;\; 0.0002 - 0.00005508 = 0.00014492, 
$$

<div style="text-align: justify">
which has a relative error of only $0.000138$ or $0.0138\%$. 
</div>
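
The effect of the summation order can be replayed with Python's `decimal` module, where a context with `prec=4` and `ROUND_HALF_UP` plays the role of our $b=10$, $k=4$ rounding arithmetic. This is a sketch under that assumption; the variable and function names are ours.

```python
from decimal import Decimal, Context, ROUND_HALF_UP
from fractions import Fraction

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # b = 10, k = 4, rounding
wide = Context(prec=50)                          # wide enough to be exact here

x = Decimal("0.1234")
y = Decimal("-0.5508E-4")
z = Decimal("-0.1232")

exact = wide.add(wide.add(x, y), z)              # 0.00014492
left_to_right = ctx.add(ctx.add(x, y), z)        # fl(fl(x + y) + z)
reordered = ctx.add(ctx.add(x, z), y)            # fl(fl(x + z) + y)

def rel_err(approx: Decimal) -> float:
    """Relative error of approx with respect to the exact sum."""
    return float(abs(Fraction(exact) - Fraction(approx)) / abs(Fraction(exact)))

print(left_to_right, rel_err(left_to_right))
print(reordered, rel_err(reordered))
```

Adding the two numbers of similar magnitude first lets the cancellation happen on exact operands, so no rounding error is magnified.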

# 4 Subtractive cancellation


<div style="text-align: justify">
Subtractive cancellation refers to the loss of significant digits during a floating-point computation due to the subtraction of ${\it nearly}$ equal floating-point numbers. 
</div>

<div style="text-align: justify">
Note that if $\hat{x}$ is an approximation to $x>0$ and $\hat{y}$ is an approximation to $y>0$, and if for example $\hat{x}$ agrees with $x$ to 8 significant digits and $\hat{y}$ agrees with $y$ to 8 significant digits, then
</div>

$$
\begin{align*}
\hat{x} \times \hat{y} \approx x \times y \\
\hat{x}/\hat{y} \approx x/y \\
\hat{x} + \hat{y} \approx x + y
\end{align*}
$$

<div style="text-align: justify">
will also agree to about 8 significant digits.  However, this may not be true for subtraction: it is possible that none of the significant digits in $\hat{x} - \hat{y}$ and $x - y$ agree. 
</div>

<div style="text-align: justify">
The following examples illustrate subtractive cancellation, and show how it can be avoided in each of these cases.
</div>

<div style="text-align: justify">
${\bf Example\ 1:}$ The evaluation of $fl(\sqrt{x^2+1}-x)$ will be inaccurate if $x$ is large and positive. For example, using $b=10$, $k=4$ idealized rounding floating-point arithmetic and $x = 65.43$ we obtain the following (where $fl(\sqrt{z})$
is computed using idealized floating-point arithmetic; that is, the exact value of $\sqrt{z}$ is
rounded to 4 significant digits).
</div>

$$
\begin{align*}
&fl(x^2) = fl(4281.0849)=4281\qquad or \qquad 0.4281\times 10^4\\
&fl(x^2+1) = fl(4281+1)=4282 \\
&fl(\sqrt{x^2+1}) = fl(\sqrt{4282}) = fl(65.43699..)= 65.44\\
&fl(\sqrt{x^2+1} - x) = fl(65.44 - 65.43)=0.01\qquad or \qquad 0.1000\times 10^{-1} 
\end{align*}
$$

<div style="text-align: justify">
However, the true (exact) value of $\sqrt{x^2+1} - x$ is $0.0076413...$. The relative error in
$fl(\sqrt{x^2+1} - x)$ is about $0.31$ or $31\%$.
</div>

<div style="text-align: justify">
To avoid the subtractive cancellation above and to obtain an accurate floating-point result, note that
</div>

$$
(\sqrt{x^2+1} - x)(\frac{\sqrt{x^2+1} + x}{\sqrt{x^2+1} + x}) = \frac{1}{\sqrt{x^2+1} + x}
$$

<div style="text-align: justify">
The latter expression gives an extremely accurate result in floating-point arithmetic when
$x = 65.43$ (and indeed for all “large” positive values of $x$).
</div>

$$
\begin{align*}
&fl(x^2) = fl(4281.0849)=4281\qquad or \qquad 0.4281\times 10^4\\
&fl(x^2+1) = fl(4281+1)=4282 \\
&fl(\sqrt{x^2+1}) = fl(\sqrt{4282}) = fl(65.43699...)= 65.44\\
&fl(\sqrt{x^2+1} + x) = fl(65.44 + 65.43)=fl(130.87)=130.9\\
&fl(\frac{1}{\sqrt{x^2+1} + x}) = fl(\frac{1}{130.9})=fl(0.00763941...)=0.007639
\end{align*}
$$

<div style="text-align: justify">
which has a relative error of $0.0003$ or $0.03\%$.
</div>
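
Both evaluations of Example 1 can be replayed in Python. The sketch below assumes that a `decimal` context with `prec=4` and `ROUND_HALF_UP` models our idealized arithmetic, and it obtains $fl(\sqrt{z})$ by rounding a 50-digit square root; the function names are ours.

```python
from decimal import Decimal, Context, ROUND_HALF_UP
from fractions import Fraction

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # b = 10, k = 4, rounding
wide = Context(prec=50)                          # stand-in for exact arithmetic
ONE = Decimal(1)

def fl_sqrt(v: Decimal) -> Decimal:
    """Round a 50-digit square root to 4 significant digits."""
    return ctx.plus(wide.sqrt(v))

def naive(x: Decimal) -> Decimal:
    """fl(sqrt(x^2 + 1) - x), evaluated one operation at a time."""
    s = fl_sqrt(ctx.add(ctx.multiply(x, x), ONE))
    return ctx.subtract(s, x)

def rationalized(x: Decimal) -> Decimal:
    """fl(1 / (sqrt(x^2 + 1) + x)), which avoids the subtraction."""
    s = fl_sqrt(ctx.add(ctx.multiply(x, x), ONE))
    return ctx.divide(ONE, ctx.add(s, x))

x = Decimal("65.43")
reference = wide.subtract(wide.sqrt(wide.add(wide.multiply(x, x), ONE)), x)
for name, value in (("naive", naive(x)), ("rationalized", rationalized(x))):
    err = abs(Fraction(reference) - Fraction(value)) / Fraction(reference)
    print(name, value, float(err))
```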

<div style="text-align: justify">
${\bf Example\ 2:}$ The evaluation of $fl(x - \sin{x})$ will be inaccurate if $x$ is close to $0$. For example, using $b=10$, $k=4$ idealized chopping floating-point arithmetic and $x=0.01234$, we obtain the following (note that the argument for $\sin$ is in radians).
</div>

$$
\begin{align*}
&fl(\sin{x}) = fl(0.01233968...)=0.01233\qquad or \qquad 0.1233\times 10^{-1}\\
&fl(x-\sin{x}) = fl(0.01234 - 0.01233)=0.00001\qquad or \qquad 0.1000\times 10^{-4}
\end{align*}
$$

<div style="text-align: justify">
However, the true (exact) value of $x-\sin{x}$ is $0.313177\times 10^{-6}$ , giving a relative error
in the computed approximation of
</div>

$$
|1-\frac{0.00001}{0.313177\times 10^{-6}}|=30.93\qquad or \qquad 3093\%.
$$

<div style="text-align: justify">
To avoid the catastrophic loss of significant digits in this example, use the Taylor series
approximation for $f (x) = \sin {x}$ expanded about $x_0=0$ (see Chapter 4 of the textbook):
</div>

$$
x-\sin{x} = x - (x- \frac{x^3}{3!}+\frac{x^5}{5!}-\frac{x^7}{7!}+\frac{x^9}{9!}-...).
$$

<div style="text-align: justify">
Thus, if $x$ is close to $0$, a very good approximation to $x- \sin {x}$ is, for example,
</div>

$$
x-\sin{x} \approx \frac{x^3}{6}-\frac{x^5}{120}
$$

With $x = 0.01234$ as above, we obtain

$$
\begin{align*}
&fl(x^3) = fl(0.18790809...\times 10^{-5})= 0.1879\times 10^{-5}\\
&fl(x^3/6) = fl(0.1879\times 10^{-5}/6)=fl(0.3131666...\times 10^{-6})= 0.3131 \times 10^{-6}\\
&fl(x^5) = fl(0.28613817...\times 10^{-9}) = 0.2861\times 10^{-9}\\
&fl(x^5/120) = fl(0.2861\times 10^{-9}/120)=fl(0.2384166...\times 10^{-11}) = 0.2384\times 10^{-11}\\
&fl(x^3/6-x^5/120) = fl(0.3131 \times 10^{-6}- 0.2384\times 10^{-11})=fl(0.3130976...\times 10^{-6}) = 0.3130\times 10^{-6}
\end{align*}
$$

which has a very small relative error of

$$
|1-\frac{0.3130\times 10^{-6}}{0.313177\times 10^{-6}}|=0.000565\qquad or \qquad 0.0565\%.
$$
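
Example 2 can be replayed the same way with chopping. In this sketch `ROUND_DOWN` stands for chopping, $\sin x$ is taken from `math.sin` in double precision before being chopped, and, following the hand computation above, $x^3$ and $x^5$ are each formed exactly and then chopped once. All names are ours, and the reference value is the one quoted in these notes.

```python
import math
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)   # b = 10, k = 4, chopping
wide = Context(prec=50)                       # wide enough to be exact here

def fl(v: Decimal) -> Decimal:
    """Chop v to 4 significant digits."""
    return ctx.plus(v)

def exact_pow(x: Decimal, n: int) -> Decimal:
    """x^n by repeated multiplication in the wide context."""
    result = Decimal(1)
    for _ in range(n):
        result = wide.multiply(result, x)
    return result

x = Decimal("0.01234")

# Direct formula: fl(x - fl(sin x))
sin_x = fl(Decimal(math.sin(float(x))))
direct = ctx.subtract(x, sin_x)

# Two-term Taylor form: fl(fl(x^3 / 6) - fl(x^5 / 120))
t3 = ctx.divide(fl(exact_pow(x, 3)), Decimal(6))
t5 = ctx.divide(fl(exact_pow(x, 5)), Decimal(120))
taylor = ctx.subtract(t3, t5)

true_value = Decimal("0.313177E-6")           # value quoted in the notes
for name, value in (("direct", direct), ("taylor", taylor)):
    print(name, value, float(abs(1 - value / true_value)))
```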

<div style="text-align: justify">
${\bf Example\ 3:}$ The evaluation of $fl(1-\sin{x})$ will be inaccurate if $\sin{x}$ is close to $1$; for example, if $x \approx\pi / 2 = 1.5707963...$ (radians).

For example, this floating-point computation will be inaccurate if $x = 1.56$. The
cancellation of significant digits in this case can be seen since $\sin(1.56) =  0.99994172...$.
</div>

<div style="text-align: justify">
To avoid such a loss of significant digits whenever $x$ is close to $\pi / 2$ , note that
</div>

$$
(1-\sin{x})(\frac{1+\sin{x}}{1+\sin{x}}) = \frac{1-\sin^2{x}}{1+\sin{x}}=\frac{\cos^2{x}}{1+\sin{x}}
$$

<div style="text-align: justify">
Evaluation of this expression in floating-point arithmetic when $x$ is close to $\pi / 2$ will not result in any large loss of significant digits. For example, with $x = 1.56$ we obtain (using rounding) the following, which is very close to the true value of $0.00005827977...$.
</div>

$$
\begin{align*}
&fl(\cos{x}) = fl(0.010796...)= 0.01080\\
&fl(\cos^2{x}) = fl(0.00011664)= 0.0001166\\
&fl(\sin{x}) = fl(0.99994172...) = 0.9999\\
&fl(1+\sin{x}) = fl(1.9999)= 2.000\\
&fl(\cos^2{x}/(1+\sin{x})) = fl(0.0001166\ /\ 2.000) = 0.00005830
\end{align*}
$$
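
For comparison, the sketch below evaluates both $fl(1-\sin x)$ and the rewritten form at $x=1.56$. It assumes a `decimal` context with `prec=4` and `ROUND_HALF_UP` as our rounding arithmetic and takes $\sin x$ and $\cos x$ from Python's `math` module before rounding them; the names are ours and the reference value is the one quoted above.

```python
import math
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # b = 10, k = 4, rounding
ONE = Decimal(1)

def fl(value: float) -> Decimal:
    """Round a double-precision value to 4 significant decimal digits."""
    return ctx.plus(Decimal(value))

x = 1.56
sin_x = fl(math.sin(x))
cos_x = fl(math.cos(x))

direct = ctx.subtract(ONE, sin_x)                 # fl(1 - sin x)
stable = ctx.divide(ctx.multiply(cos_x, cos_x),   # fl(cos^2 x / (1 + sin x))
                    ctx.add(ONE, sin_x))

true_value = Decimal("0.00005827977")             # value quoted in the notes
for name, value in (("direct", direct), ("stable", stable)):
    print(name, value, float(abs(1 - value / true_value)))
```

The direct form keeps only one significant digit of the answer, while the rewritten form agrees with the true value to about three digits.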

<div style="text-align: justify">
${\bf Example\ 4:}$ Provided that $x\neq 1/ 2$ and $x \neq 2$,
$$\frac{1}{2x-1}-\frac{x+2}{x-2} = \frac{-2x(x+1)}{(2x-1)(x-2)}.$$

However, if evaluated in floating-point arithmetic, these two expressions may give very different results; that is, if we let

</div>

$$
f(x)= \frac{1}{2x-1}-\frac{x+2}{x-2}, \qquad g(x)=\frac{-2x(x+1)}{(2x-1)(x-2)}
$$

<div style="text-align: justify">
then for some values of $x,\ fl( f (x))$ and $fl(g(x))$ may differ greatly.
In each of the following cases, assume that our usual floating-point system with
$b= 10,\ k=4$ and rounding is used.
</div>

<div style="text-align: justify">
${\bf Case(i)}.$
Suppose that $x$ is an exact valid floating-point number (in whatever floating-point
system you are using) and that $x$ is close to (but not equal to) $-1$. Then the
evaluation of $fl( f (x))$ will be very inaccurate since
</div>

$$
fl\left(\frac{1}{2x-1}\right)\approx - \frac13, \qquad fl\left(\frac{x+2}{x-2}\right)\approx-\frac13
$$

<div style="text-align: justify">
so that $fl( f (x))$ is computed as the difference of two almost equal numbers, which will result in a loss of significant digits due to subtractive cancellation. However, this does not occur in the evaluation of $fl(g(x))$ when $x$ is close to $-1$.
</div>

<div style="text-align: justify">
For example, if $x =-0.9986$, then you can verify the following. The exact value
is $f (x) = g(x) = 0.0003111109...;\ fl( f (-0.9986)) = 0.0001000$ or $ 0.1000\times 10^{-3}$ is very inaccurate; $fl(g(-0.9986)) = 0.0003111$ is very accurate.
</div>

<div style="text-align: justify">
${\bf Case(ii).}$ Suppose that $x$ is an exact valid floating-point number (in whatever floating-point system you are using) and that $x$ is close to (but not equal to) $2$. Then the evaluation of both $fl( f (x))$ and $fl(g(x))$ will be very accurate because there is no subtractive cancellation in either expression. Although $fl(x - 2)$ occurs in the denominator of each expression, if $x$ is an exact valid floating-point number, then there is no round-off error in $fl(x - 2)$; that is, $fl(x - 2)$ is exactly equal to the value of $x - 2$. (To see this, consider values such as $1.997$ or $2.023$.) Thus, $fl(1/(x - 2))$ will be very accurate (as the round-off error in a single floating-point division is small). Since the value of $fl(1/(x - 2))$ is also very large relative to all other parts of the expressions for $f (x)$ and $g(x)$, the values of both $fl( f (x))$ and $fl(g(x))$ will be very accurate.
</div>

<div style="text-align: justify">
For example, if $x =1.997$, then you can verify the following. The exact value is
$f (x) = g(x) = 1332.6673...;\ fl( f (1.997)) = 1332$ and $fl(g(1.997)) = 1333$.
</div>

<div style="text-align: justify">
${\bf Case(iii).}$ Suppose that $x$ is NOT a valid floating-point number and that $x$ is close to $2$.
For example, if $b =10,\ k = 4$ suppose that $x = 2.001234$. Then both of $fl( f (x))$ and
$fl(g(x))$ will be very inaccurate because they both are computed using the value of
$fl(2.001234) = 2.001$ rather than the exact value of $x$. In such a case, note that the value of $fl\left(\frac{1}{x-2}\right)$ and the exact value of $\frac{1}{x-2}$ will differ greatly.
</div>

For example, using $x = 2.001234$,
$$fl\left(\frac{1}{x-2}\right)=1000\ \text{whereas}\ \frac{1}{x-2}=810.37277...$$

<div style="text-align: justify">
Using $x = 2.001234$, the exact value is $f (x) = g(x) = -3242.1580...$. Using
$fl(2.001234) = 2.001$, you can verify that $fl( f (2.001)) = - 4001$ and
$fl(g(2.001)) = - 4001$, both of which are very inaccurate.
</div>
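
The three cases of Example 4 can be checked with a short script. As before, this is a sketch that assumes a `decimal` context with `prec=4` and `ROUND_HALF_UP` models our $b=10$, $k=4$ rounding arithmetic; the function names `fl_f` and `fl_g` and the particular order of operations inside them are our own choices.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # b = 10, k = 4, rounding
ONE, TWO = Decimal(1), Decimal(2)

def fl_f(x: Decimal) -> Decimal:
    """fl(1/(2x - 1) - (x + 2)/(x - 2)), rounding after every operation."""
    left = ctx.divide(ONE, ctx.subtract(ctx.multiply(TWO, x), ONE))
    right = ctx.divide(ctx.add(x, TWO), ctx.subtract(x, TWO))
    return ctx.subtract(left, right)

def fl_g(x: Decimal) -> Decimal:
    """fl(-2x(x + 1) / ((2x - 1)(x - 2))), rounding after every operation."""
    two_x = ctx.multiply(TWO, x)
    num = ctx.multiply(two_x, ctx.add(x, ONE))
    den = ctx.multiply(ctx.subtract(two_x, ONE), ctx.subtract(x, TWO))
    return ctx.minus(ctx.divide(num, den))

for raw in ("-0.9986", "1.997", "2.001234"):
    x = ctx.plus(Decimal(raw))      # fl(x): 2.001234 is stored as 2.001
    print(raw, x, fl_f(x), fl_g(x))
```

In Case (iii) the damage is done by the very first line of the loop: once $x$ has been rounded to $2.001$, no rearrangement of the formula can recover the lost digits.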