# MTH 652: Advanced Numerical Analysis

## Lecture 7

### Topics

* Solvers and numerical linear algebra
* Conjugate gradient method

#### Introduction

Many of the problems we have considered until now in this course have resulted in **symmetric positive-definite** (SPD) matrices.
In fact, let $a(\cdot, \cdot)$ denote a symmetric, coercive bilinear form, and let $\{ \phi_i \}$ denote a basis for the discrete space $V_h$.
Then, the system matrix $A$ defined by
$$
   A_{ij} = a(\phi_i, \phi_j)
$$
is symmetric and positive-definite (why?).
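
As a quick numerical sanity check (not a proof), the sketch below assembles the stiffness matrix for $a(u, v) = \int_0^1 u' v' \, dx$ using piecewise-linear hat functions on a uniform mesh with homogeneous Dirichlet conditions, which gives the familiar matrix $\frac{1}{h} \operatorname{tridiag}(-1, 2, -1)$. The model problem and the helper name `stiffness_1d` are assumptions made for illustration only.

```python
import numpy as np

def stiffness_1d(n):
    """P1 stiffness matrix on (0, 1) with n interior nodes (hypothetical helper)."""
    h = 1.0 / (n + 1)
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    return (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h

A = stiffness_1d(8)
is_symmetric = np.allclose(A, A.T)
# A symmetric matrix is positive-definite iff its smallest eigenvalue is positive.
min_eig = np.linalg.eigvalsh(A).min()
print(is_symmetric, min_eig > 0)
```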

Last term, we looked at the Gauss-Seidel method, which is a simple iterative method for approximating the solution to linear systems
$$
   A x = b
$$
and is guaranteed to converge when $A$ is SPD.

The convergence of Gauss-Seidel is determined by the matrix norm of the iteration matrix.
We have shown that when $A$ is SPD then this matrix norm is strictly less than 1, and so the method converges.
However, we have not derived any estimates for **how quickly** Gauss-Seidel will converge (in terms of discretization parameters).
We did see last term that as the problem size grew (that is, as we refined the mesh), solving a system with the diffusion "stiffness matrix" required more iterations. In contrast, solving a system with the mass matrix (i.e. the matrix corresponding to the $L^2$ inner product) required a roughly constant number of iterations, even on more refined problems.

* Are there better methods that converge faster than Gauss-Seidel?
* Can we quantify or analyze the convergence with respect to discretization parameters?

These questions will motivate our study of the **conjugate gradient method**, which is one of the most well-known and widely used **Krylov methods**.

The system we are interested in solving is
$$
   Ax = b.
$$

Define the **quadratic form**
$$
   f(x) = \frac{1}{2} x^T A x - b^T x.
$$

Note that if $x$ is the solution (i.e. $Ax = b$), and $y$ is arbitrary, then
$$
   \begin{aligned}
      f(x + y) &= \frac{1}{2} (x^T + y^T) A (x + y) - b^T (x + y) \\
         &= \frac{1}{2} x^T A x + x^T A y + \frac{1}{2} y^T A y - b^T x - b^T y \\
         &= f(x) + \frac{1}{2} y^T A y,
   \end{aligned}
$$
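where we used $x^T A y = (Ax)^T y = b^T y$ (by symmetry of $A$). Since $y^T A y > 0$ for all $y \neq 0$ when $A$ is positive-definite, this implies that $x$ **minimizes the quadratic form**.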

We can also see this as follows.
The gradient of $f$ is given by
$$
   f'(x) = \begin{pmatrix}
      \frac{\partial f}{\partial x_1} \\
      \frac{\partial f}{\partial x_2} \\
      \vdots \\
      \frac{\partial f}{\partial x_n}
   \end{pmatrix}
$$

The form $f$ will reach its minimum when $f'$ is equal to zero.
It is relatively straightforward to see that
$$
   f'(x) = \frac{1}{2}A^T x + \frac{1}{2} A x - b = Ax - b,
$$
by symmetry.
So it is clear that $f$ has a critical point at the solution $x$.
When $A$ is positive-definite, the critical point is a minimum.
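
The two observations above can be checked numerically. The following is a minimal sketch, assuming a randomly generated SPD matrix (built as $M M^T + nI$, an illustrative choice); it compares $f(x + y)$ with $f(x) + \frac{1}{2} y^T A y$ and compares a finite-difference gradient with $Az - b$ at an arbitrary point $z$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # SPD by construction (illustrative choice)
b = rng.standard_normal(n)

def f(x):
    """Quadratic form f(x) = 1/2 x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

x = np.linalg.solve(A, b)        # exact solution of A x = b
y = rng.standard_normal(n)       # arbitrary perturbation

# Identity f(x + y) = f(x) + 1/2 y^T A y
lhs = f(x + y)
rhs = f(x) + 0.5 * y @ A @ y

# Central finite differences of f at a generic point z, compared with A z - b
z = rng.standard_normal(n)
eps = 1e-6
fd_grad = np.array([(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(n)])
exact_grad = A @ z - b
```

Because $f$ is quadratic, the central difference agrees with the exact gradient up to rounding error.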


#### Steepest Descent

We can try to minimize the form using a method known as steepest descent.
Note that the gradient $f'(x)$ points in the direction of the steepest increase in $f$.
So, we can move in the opposite direction to try to find a minimum.

Some notation (familiar from before). The **error** $e_{(i)}$ is
$$
   e_{(i)} = x_{(i)} - x,
$$
and the **residual** is
$$
   r_{(i)} = -A e_{(i)} = b - A x_{(i)}.
$$

Note that the residual is
$$
   r_{(i)} = -f'(x_{(i)})
$$
so the residual is pointing in the direction of steepest descent.

Knowing the direction of steepest descent allows us to perform a **line search**.
A line search is a procedure that chooses the best point along a particular line, i.e. finds the best $\alpha$ to define
$$
   x_{(i+1)} = x_{(i)} + \alpha r_{(i)}.
$$
This is a one-dimensional problem.
The function $f$ is minimized along this line when
$$
   \frac{d}{d\alpha} f(x_{(i+1)}) = 0.
$$
By the chain rule
$$
   \frac{d}{d\alpha} f(x_{(i+1)}) = f'(x_{(i+1)})^T \frac{d}{d\alpha} x_{(i+1)} = f'(x_{(i+1)})^T r_{(i)},
$$
and this is zero when $f'(x_{(i+1)})$ is orthogonal to $r_{(i)}$.

This is intuitively clear, because the rate of increase or decrease of $f$ along this line is given by the **projection** of the gradient $f'$ onto the search line.
When the projection is zero, the rate of increase or decrease along the line is zero, and this happens when the gradient is orthogonal to the search direction.

We now determine $\alpha$.
Note that
$$
   f'(x_{(i+1)}) = -r_{(i+1)},
$$
so we want
$$
\begin{aligned}
   r_{(i+1)}^T r_{(i)} &= 0 \\
   (b - Ax_{(i+1)})^T r_{(i)} &= 0 \\
   (b - A(x_{(i)} + \alpha r_{(i)}))^T r_{(i)} &= 0 \\
   (b - Ax_{(i)})^T r_{(i)} - \alpha (A r_{(i)})^T r_{(i)} &= 0 \\
   r_{(i)}^T r_{(i)} - \alpha (A r_{(i)})^T r_{(i)} &= 0 \\
   \alpha (A r_{(i)})^T r_{(i)} &= r_{(i)}^T r_{(i)} \\
   \alpha &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}.
\end{aligned}
$$
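
As a small worked instance of this formula (the matrix and starting point are chosen purely for illustration), take
$$
   A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
   b = 0, \qquad
   x_{(0)} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.
$$
Then $r_{(0)} = b - A x_{(0)} = (-2, -1)^T$, so $r_{(0)}^T r_{(0)} = 5$ and $r_{(0)}^T A r_{(0)} = 9$, which gives
$$
   \alpha = \frac{5}{9}, \qquad
   x_{(1)} = x_{(0)} + \alpha r_{(0)} = \begin{pmatrix} -1/9 \\ 4/9 \end{pmatrix}, \qquad
   r_{(1)} = b - A x_{(1)} = \begin{pmatrix} 2/9 \\ -4/9 \end{pmatrix}.
$$
One can check directly that $r_{(1)}^T r_{(0)} = -\frac{4}{9} + \frac{4}{9} = 0$, as the line search requires.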

Putting this together, we obtain an algorithm for the steepest descent method:

Start with an initial guess $x_{(0)}$.
Then, given $x_{(i)}$, compute
$$
   \begin{aligned}
      r_{(i)} &= b - Ax_{(i)} \\
      \alpha_{(i)} &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} \\
      x_{(i+1)} &= x_{(i)} + \alpha_{(i)} r_{(i)}
   \end{aligned}
$$
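
The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a tuned implementation: the function name `steepest_descent`, the stopping tolerance `tol`, the iteration cap `max_iter`, and the small $2 \times 2$ test system are all assumptions added here.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Steepest descent for SPD A; returns the final iterate and all residuals."""
    x = np.array(x0, dtype=float)
    residuals = []
    for _ in range(max_iter):
        r = b - A @ x                  # residual = direction of steepest descent
        residuals.append(r)
        rr = r @ r
        if np.sqrt(rr) <= tol:         # stop once the residual norm is small
            break
        alpha = rr / (r @ (A @ r))     # exact line search step length
        x = x + alpha * r
    return x, residuals

# Small SPD test system (illustrative); its exact solution is (2, -2).
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x, residuals = steepest_descent(A, b, np.zeros(2))
print(x, len(residuals))
```

Consecutive residuals returned by this sketch are orthogonal up to rounding, which is exactly the line search condition derived above.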

#### Convergence analysis of steepest descent

The main tool we will use to analyze iterative methods like steepest descent (and later conjugate gradients) will be eigenvectors and eigenvalues.

We first recall some facts about the eigenvalues and eigenvectors of SPD matrices:

* The eigenvalues of symmetric matrices are real
* The eigenvalues of SPD matrices are positive
* The eigenvectors of a symmetric matrix can be chosen to form an orthonormal basis

What happens if at some point in steepest descent the error $e_{(i)}$ is an eigenvector?
In this case, the residual is **also** an eigenvector since
$$
   r_{(i)} = -A e_{(i)} = -\lambda e_{(i)}.
$$
Then
$$
\begin{aligned}
   x_{(i+1)} &= x_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} r_{(i)} \\
      &= x_{(i)} - \frac{1}{\lambda}\frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T r_{(i)}} \lambda e_{(i)} \\
      &= x_{(i)} - e_{(i)} \\
      &= x,
\end{aligned}
$$
and so in this case, steepest descent will converge immediately.
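
This one-step convergence is easy to observe numerically. The sketch below is an illustration under assumed data (a small $2 \times 2$ SPD matrix and a starting point whose error is a scaled eigenvector); it performs a single steepest descent step and measures the remaining error.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # illustrative SPD matrix
x_exact = np.array([2.0, -2.0])
b = A @ x_exact

eigvals, eigvecs = np.linalg.eigh(A)      # eigenvalues in ascending order
x0 = x_exact + 0.7 * eigvecs[:, 0]        # the error e_(0) is an eigenvector of A

r = b - A @ x0
alpha = (r @ r) / (r @ (A @ r))           # equals 1 / lambda for this start
x1 = x0 + alpha * r
one_step_error = np.linalg.norm(x1 - x_exact)
print(alpha, 1.0 / eigvals[0], one_step_error)
```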


This is clearly a very special case. In the general case, we express $e_{(i)}$ as a linear combination of eigenvectors (recall that they form an orthonormal basis),
$$
   e_{(i)} = \sum_{j=1}^n \xi_j v_j.
$$
Then,
$$
\begin{aligned}
   r_{(i)} &= -A e_{(i)} = -\sum_j \xi_j \lambda_j v_j \\
   \| e_{(i)} \|^2 &= e_{(i)}^T e_{(i)} = \sum_j \xi_j^2 \\
   e_{(i)}^T A e_{(i)} &= \sum_j \xi_j^2 \lambda_j \\
   \| r_{(i)} \|^2 &= r_{(i)}^T r_{(i)} = \sum_j \xi_j^2 \lambda_j^2 \\
   r_{(i)}^T A r_{(i)} &= \sum_j \xi_j^2 \lambda_j^3
\end{aligned}
$$

This means that the error at the next step is given by
$$
   e_{(i+1)}
      = e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} r_{(i)}
      = e_{(i)} + \frac{ \sum_j \xi_j^2 \lambda_j^2 }{\sum_j \xi_j^2 \lambda_j^3} r_{(i)}
$$

If all the eigenvectors happen to have the same eigenvalue $\lambda$ (again, a **very special case**), then
$$
   e_{(i+1)} = e_{(i)} + \frac{ \lambda^2 \sum_j \xi_j^2 }{\lambda^3 \sum_j \xi_j^2} r_{(i)},
$$
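and since $r_{(i)} = -A e_{(i)} = -\lambda e_{(i)}$ in this case, we obtain $e_{(i+1)} = e_{(i)} - e_{(i)} = 0$, so we again have immediate convergence.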

We analyze the general case in the $A$-norm (also called the energy norm; why, and what is the connection with the energy norm from finite elements?)
$$
   \| e \|_A = \left( e^T A e \right)^{1/2}.
$$
We calculate
$$
\begin{aligned}
   \| e_{(i+1)} \|_A^2
      &= e_{(i+1)}^T A e_{(i+1)} \\
      &= (e_{(i)} + \alpha_{(i)}r_{(i)})^T A (e_{(i)} + \alpha_{(i)} r_{(i)}) \\
      &= e_{(i)}^T A e_{(i)} + 2\alpha_{(i)} r_{(i)}^T A e_{(i)} + \alpha_{(i)}^2 r_{(i)}^T A r_{(i)} \\
      &= \| e_{(i)} \|_A^2 - 2\frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}(r_{(i)}^T r_{(i)}) + \left( \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} \right)^2 r_{(i)}^T A r_{(i)} \\
      &= \| e_{(i)} \|_A^2 - \frac{(r_{(i)}^T r_{(i)})^2}{r_{(i)}^T A r_{(i)}} \\
      &= \| e_{(i)} \|_A^2 \left(1  - \frac{(r_{(i)}^T r_{(i)})^2}{r_{(i)}^T A r_{(i)} e_{(i)}^T A e_{(i)}} \right) \\
      &= \| e_{(i)} \|_A^2 \left(1  - \frac{ (\sum_j \xi_j^2 \lambda_j^2)^2 }{ ( \sum_j \xi_j^2 \lambda_j^3 ) (\sum_j \xi_j^2 \lambda_j)} \right) \\
      &= \| e_{(i)} \|_A^2 \omega^2
\end{aligned}
$$
where $\omega^2$ is defined by
$$
   \omega^2 = 1  - \frac{ (\sum_j \xi_j^2 \lambda_j^2)^2 }{ ( \sum_j \xi_j^2 \lambda_j^3 ) (\sum_j \xi_j^2 \lambda_j)}
$$
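
The expression for $\omega^2$ can be verified against a direct computation. The following sketch assumes a random SPD matrix and a random error vector (both illustrative); it computes $\omega^2$ from the eigen-coefficients $\xi_j$ and compares it with the ratio $\| e_{(i+1)} \|_A^2 / \| e_{(i)} \|_A^2$ obtained from one steepest descent step.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD by construction (illustrative choice)
lam, V = np.linalg.eigh(A)           # columns of V are orthonormal eigenvectors

e = rng.standard_normal(n)           # current error e_(i)
xi = V.T @ e                         # coefficients of e in the eigenbasis
r = -A @ e                           # residual r_(i) = -A e_(i)
alpha = (r @ r) / (r @ (A @ r))
e_next = e + alpha * r               # error after one step

omega_sq = 1.0 - np.sum(xi**2 * lam**2) ** 2 / (
    np.sum(xi**2 * lam**3) * np.sum(xi**2 * lam)
)
ratio = (e_next @ A @ e_next) / (e @ A @ e)
print(omega_sq, ratio)
```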

We will estimate $\omega^2$ in the case where $n = 2$.
Later we will provide a proof that this estimate extends to the case of general $n$.

We have two eigenvalues, and assume $\lambda_1 \geq \lambda_2$.
Then,
$$
   e_{(i)} = \xi_1 v_1 + \xi_2 v_2.
$$
Let
$$
   \kappa = \frac{\lambda_1}{\lambda_2}, \qquad
   \mu = \frac{\xi_2}{\xi_1}.
$$
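(Note that $\kappa \geq 1$ always, and that $\mu$ is well defined provided $\xi_1 \neq 0$; if $\xi_1 = 0$, the error is an eigenvector and we are in the special case of immediate convergence.)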
Then,
$$
\begin{aligned}
   \omega^2
      &= 1  - \frac{ (\xi_1^2\lambda_1^2 + \xi_2^2\lambda_2^2)^2 }{(\xi_1^2\lambda_1 + \xi_2^2\lambda_2)(\xi_1^2\lambda_1^3 + \xi_2^2\lambda_2^3)} \\
      &= 1 - \frac{ (\kappa^2 + \mu^2)^2 }{ (\kappa + \mu^2)(\kappa^3 + \mu^2)}
\end{aligned}
$$

Some tedious calculations show that $\omega$ is maximized when $\mu = \pm \kappa$, which gives the inequality
$$
\begin{aligned}
   \omega^2
      &\leq 1 - \frac{4\kappa^4}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
      &= \frac{\kappa^5 - 2\kappa^4 + \kappa^3}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
      &= \frac{(\kappa - 1)^2}{(\kappa + 1)^2},
\end{aligned}
$$
and so
$$
   \omega \leq \frac{\kappa - 1}{\kappa + 1}.
$$
When $\kappa$ is small (very close to 1), $\omega$ is small, leading to fast convergence.
When $\kappa$ is large, $\omega$ is very close to 1, leading to slow convergence.
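
The "tedious calculations" can be cross-checked by brute force. The sketch below evaluates $\omega^2$ as a function of $\mu$ on a fine grid for one fixed value of $\kappa$ (the value $\kappa = 10$ and the grid are arbitrary choices made for illustration) and compares the largest value found with $(\kappa - 1)^2 / (\kappa + 1)^2$.

```python
import numpy as np

def omega_sq(kappa, mu):
    """omega^2 for n = 2, written in terms of kappa and mu."""
    return 1.0 - (kappa**2 + mu**2) ** 2 / ((kappa + mu**2) * (kappa**3 + mu**2))

kappa = 10.0
mus = np.linspace(-50.0, 50.0, 200_001)    # grid spacing 5e-4, contains mu = +-kappa
values = omega_sq(kappa, mus)

mu_best = abs(mus[np.argmax(values)])       # location of the grid maximum
bound = ((kappa - 1.0) / (kappa + 1.0)) ** 2
print(mu_best, values.max(), bound)
```

On this grid the maximum is attained at $|\mu| \approx \kappa$ and matches the bound, consistent with the calculation above.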

In general, we define the **condition number** of an SPD matrix by
$$
   \kappa = \frac{\lambda_{\mathrm{max}}}{\lambda_{\mathrm{min}}},
$$
and obtain the convergence result for steepest descent
$$
   \| e_{(i)} \|_A \leq \left( \frac{\kappa - 1}{\kappa + 1} \right)^i \| e_{(0)} \|_A.
$$
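
To see the estimate in action, the sketch below runs steepest descent on a small SPD system and checks the bound at every step. The test matrix, the zero starting guess, and the number of iterations are assumptions chosen for illustration; a small relative slack is included to absorb rounding error.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # illustrative SPD matrix
x_exact = np.array([2.0, -2.0])
b = A @ x_exact

lam = np.linalg.eigvalsh(A)               # eigenvalues in ascending order
kappa = lam[-1] / lam[0]                  # condition number
rate = (kappa - 1.0) / (kappa + 1.0)

def a_norm(e):
    """Energy norm ||e||_A = sqrt(e^T A e)."""
    return np.sqrt(e @ A @ e)

x = np.zeros(2)
e0 = a_norm(x - x_exact)
bound_holds = True
for i in range(1, 21):
    r = b - A @ x
    x = x + (r @ r) / (r @ (A @ r)) * r   # one steepest descent step
    within = a_norm(x - x_exact) <= rate**i * e0 * (1.0 + 1e-12)
    bound_holds = bound_holds and bool(within)
print(kappa, rate, bound_holds)
```

For this system $\kappa = 3.5$, so the bound predicts an error reduction of at least a factor $2.5 / 4.5 \approx 0.56$ per step in the $A$-norm.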