# MTH 652: Advanced Numerical Analysis

## Lecture 8

### Topics

* Solvers and numerical linear algebra
* Conjugate gradient method

#### Method of conjugate directions

One downside of steepest descent is that it tends to repeat search directions over and over again.
In fact, in 2D, it is easy to see that steepest descent will always use only two search directions at right angles to each other.
Somehow this seems wasteful: we return to the same search directions over and over again, and so the line search isn't really getting us to the best place along the line from the point of view of convergence.

What if we had a set of orthogonal search directions and used each one **only once**, so that each step takes us to exactly the right place along that search direction?

In other words, we have a set of directions
$$
   \{ d_{(0)}, d_{(1)}, \ldots, d_{(n-1)} \},
$$
and at each step we choose a point
$$
   x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}.
$$

If the error $e_{(i+1)}$ is **orthogonal** to $d_{(i)}$, then that means that the error can be expressed as a linear combination of the **other** search directions.
In other words, we never need to step in the direction $d_{(i)}$ again.
This condition can be expressed as
$$
   \begin{aligned}
      d_{(i)}^T e_{(i+1)} &= 0 \\
      d_{(i)}^T ( e_{(i)} + \alpha_{(i)} d_{(i)} ) &= 0 \\
      \alpha_{(i)} &= -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}
   \end{aligned}
$$

Unfortunately, $\alpha_{(i)}$ isn't really computable using this expression: it depends upon knowing $e_{(i)}$, and if we knew $e_{(i)}$, we would be done.

The way around this problem is to choose search directions that, instead of being **orthogonal**, are **conjugate**, in other words, $A$-orthogonal, i.e.
$$
   d_{(i)}^T A d_{(j)} = 0
$$
whenever $i \neq j$.

If the directions are conjugate, then we want the new error $e_{(i+1)}$ to be $A$-orthogonal to the search direction, i.e.
$$
   \begin{aligned}
      d_{(i)}^T A e_{(i+1)} &= 0 \\
      d_{(i)}^T A (e_{(i)} + \alpha_{(i)} d_{(i)}) &= 0 \\
      \alpha_{(i)} &= -\frac{d_{(i)}^T A e_{(i)}}{d_{(i)}^T A d_{(i)}} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}},
   \end{aligned}
$$
which **is** computable, because it involves only the residual $r_{(i)} = b - Ax_{(i)} = -A e_{(i)}$ rather than the error itself.

This is actually equivalent to a line search along this direction:
$$
   \begin{aligned}
      \frac{d}{d\alpha} f(x_{(i+1)}) &= 0 \\
      f'(x_{(i+1)})^T \frac{d}{d\alpha} x_{(i+1)} &= 0 \\
      -r_{(i+1)}^T d_{(i)} &= 0 \\
      d_{(i)}^T A e_{(i+1)} &= 0
   \end{aligned}
$$

This is very similar to steepest descent, but we don't choose the search vectors according to the direction of steepest increase/decrease.

In exact arithmetic, this procedure is guaranteed to reach the exact solution in at most $n$ steps.
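The step-length formula can be tried out on a small system. The following is a minimal sketch, not a definitive implementation: the $2 \times 2$ matrix, the right-hand side, and the hand-picked pair of $A$-orthogonal directions are illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_directions(A, b, x0, directions):
    """Take one exact step along each A-orthogonal direction in turn."""
    x = list(x0)
    for d in directions:
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # r = b - A x
        alpha = dot(d, r) / dot(d, matvec(A, d))          # alpha = d^T r / d^T A d
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x


# Illustrative data (assumed): a small symmetric positive definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
# Hand-picked conjugate pair: d0^T A d1 = 4*1 + 1*(-4) = 0.
directions = [[1.0, 0.0], [1.0, -4.0]]

x = conjugate_directions(A, b, [0.0, 0.0], directions)
residual = [bi - ai for bi, ai in zip(b, matvec(A, x))]
```

After the two steps, `x` agrees with the exact solution $(1/11, 7/11)$ up to rounding, which is the $n$-step termination property for $n = 2$.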

#### Why is it guaranteed to work?

Express the starting error in terms of the search directions
$$
   e_{(0)} = \sum_{j=0}^{n-1} \delta_{(j)} d_{(j)}.
$$

What are the coefficients $\delta_{(j)}$?

$$
\begin{aligned}
   d_{(k)}^T A e_{(0)} &= \sum_j \delta_{(j)} d_{(k)}^T A d_{(j)} \\
   d_{(k)}^T A e_{(0)} &= \delta_{(k)} d_{(k)}^T A d_{(k)} \\
   \delta_{(k)} &= \frac{d_{(k)}^T A e_{(0)}}{d_{(k)}^T A d_{(k)}} \\
      &= \frac{d_{(k)}^T A (e_{(k)} - \sum_{i=0}^{k-1} \alpha_{(i)} d_{(i)})}{d_{(k)}^T A d_{(k)}} \\
      &= \frac{d_{(k)}^T A e_{(k)}}{d_{(k)}^T A d_{(k)}}
\end{aligned}
$$

Comparing with the step length $\alpha_{(i)} = -d_{(i)}^T A e_{(i)} / (d_{(i)}^T A d_{(i)})$, it follows that $\alpha_{(i)} = -\delta_{(i)}$.

This gives us another way of looking at the iterative procedure.
At every iteration, we cut down one component of the error term:

$$
\begin{aligned}
   e_{(i)} &= e_{(0)} + \sum_{j=0}^{i-1} \alpha_{(j)}d_{(j)} \\
      &= \sum_{j=0}^{n-1} \delta_{(j)} d_{(j)} - \sum_{j=0}^{i-1} \delta_{(j)} d_{(j)} \\
      &= \sum_{j=i}^{n-1} \delta_{(j)} d_{(j)}
\end{aligned}
$$
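The identity $\alpha_{(i)} = -\delta_{(i)}$ can be checked numerically. This is a small sketch under assumed data: the matrix, right-hand side, and conjugate directions are illustrative, and the exact solution `x_star` is supplied only so that the initial error can be formed (in practice it is unknown).

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


# Illustrative data (assumed); x_star solves A x = b for this system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = [1.0 / 11.0, 7.0 / 11.0]
directions = [[1.0, 0.0], [1.0, -4.0]]  # A-orthogonal pair

x = [0.0, 0.0]
e0 = [xi - si for xi, si in zip(x, x_star)]  # initial error e_0 = x_0 - x*

# Step lengths produced by the iteration, using only the residual.
alphas = []
for d in directions:
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    alpha = dot(d, r) / dot(d, matvec(A, d))
    alphas.append(alpha)
    x = [xi + alpha * di for xi, di in zip(x, d)]

# Expansion coefficients of e_0 in the basis of search directions.
Ae0 = matvec(A, e0)
deltas = [dot(d, Ae0) / dot(d, matvec(A, d)) for d in directions]
```

Each entry of `alphas` is the negative of the corresponding entry of `deltas`, so step $i$ removes exactly the $d_{(i)}$ component of the initial error.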

#### Optimality of the error term

Recall that steepest descent returns to the same search vectors over and over again.
In this sense, it is finding approximations that are not optimal in the space spanned by the search vectors.
On the other hand, the method of conjugate directions **does** find an optimal approximation in the search space.

Let $\mathcal{D}_i = \operatorname{span}\{ d_{(0)}, d_{(1)}, \ldots , d_{(i-1)} \}$.

The error $e_{(i)}$ lies in the affine space $e_{(0)} + \mathcal{D}_i$.
In fact, $e_{(i)}$ has the **minimum** energy norm among all vectors in this space.

Note that the square of the energy norm of $e_{(i)}$ can be written as
$$
\begin{aligned}
   \| e_{(i)} \|_A^2 &= \sum_{j=i}^{n-1} \sum_{k=i}^{n-1} \delta_{(j)} \delta_{(k)} d_{(j)}^T A d_{(k)}
\end{aligned}
$$
using the expression for the error derived above.
Then, by $A$-orthogonality of the search directions,
$$
   \| e_{(i)} \|_A^2 = \sum_{j=i}^{n-1} \delta_{(j)}^2 d_{(j)}^T A d_{(j)}.
$$

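Any **other** element of $e_{(0)} + \mathcal{D}_i$ also has a nonzero component $c_k d_{(k)}$ along at least one of the vectors $d_{(k)}$ with $k < i$, and by $A$-orthogonality each such component adds a positive term $c_k^2 \, d_{(k)}^T A d_{(k)}$ to the squared energy norm.
Since these components are all zero in the expansion of $e_{(i)}$, we know that $e_{(i)}$ is the minimizer in the $A$-norm.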
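The optimality claim can be illustrated for $i = 1$ by comparing $e_{(1)}$ against other elements of $e_{(0)} + \mathcal{D}_1$. The sketch below is only a spot check on assumed data: the matrix, initial error, direction, and the grid of trial step lengths are all illustrative choices, and the helper name `energy_sq` is hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def energy_sq(A, e):
    """Squared energy norm e^T A e."""
    return dot(e, matvec(A, e))


# Illustrative data (assumed): SPD matrix, an initial error, one direction.
A = [[4.0, 1.0], [1.0, 3.0]]
e0 = [-1.0 / 11.0, -7.0 / 11.0]
d0 = [1.0, 0.0]

# Conjugate-directions step length, written in terms of the error.
alpha0 = -dot(d0, matvec(A, e0)) / dot(d0, matvec(A, d0))
e1 = [a + alpha0 * dd for a, dd in zip(e0, d0)]
best = energy_sq(A, e1)

# Other elements e0 + c*d0 of the affine space, for a grid of step lengths c.
trial_steps = [alpha0 + 0.1 * k for k in range(-10, 11)]
trial_energies = [
    energy_sq(A, [a + c * dd for a, dd in zip(e0, d0)]) for c in trial_steps
]
```

No trial value falls below `best`, consistent with $e_{(1)}$ minimizing the energy norm over $e_{(0)} + \mathcal{D}_1$.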

#### Gram-Schmidt conjugation

Everything seems good about this method.
All we need now is to find some way to generate the $A$-orthogonal (conjugate) search directions $d_{(i)}$.

Suppose we have $n$ linearly independent vectors $u_0, u_1, \ldots, u_{n-1}$.
(We can take the standard basis vectors, for example).

1. Set $d_{(0)} = u_0$.
2. To find $d_{(i)}$, take $u_i$ and subtract off any components that are not $A$-orthogonal to $d_{(0)}, \ldots, d_{(i-1)}$.

That is, set
$$
   d_{(i)} := u_i + \sum_{k=0}^{i-1} \beta_{ik} d_{(k)}
$$
for appropriate $\beta_{ik}$, which are determined by requiring $d_{(i)}^T A d_{(j)} = 0$ for each $j < i$:

$$
\begin{aligned}
   d_{(i)}^T A d_{(j)} &= u_i^T A d_{(j)} + \sum_{k=0}^{i-1} \beta_{ik} d_{(k)}^T A d_{(j)} \\
   0 &= u_i^T A d_{(j)} + \beta_{ij} d_{(j)}^T A d_{(j)} \\
   \beta_{ij} &= - \frac{u_i^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}
\end{aligned}
$$

In general, this procedure is expensive, because it requires storing all of the previous search directions, and it will take $\mathcal{O}(n^3)$ operations to generate a full set of vectors.
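The conjugation procedure can be sketched directly from the formula for $\beta_{ij}$. This is a minimal, unoptimized illustration: the function name `gram_schmidt_conjugate`, the $3 \times 3$ matrix, and the choice of the standard basis for the $u_i$ are assumptions made for the example.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_conjugate(A, U):
    """Turn linearly independent vectors U into A-orthogonal directions."""
    D = []
    for u in U:
        d = list(u)
        # Every previous direction is needed here, which is the storage
        # cost discussed in the text.
        for dk in D:
            Adk = matvec(A, dk)
            beta = -dot(u, Adk) / dot(dk, Adk)  # beta_ik from the formula
            d = [di + beta * dki for di, dki in zip(d, dk)]
        D.append(d)
    return D


# Illustrative data (assumed): SPD matrix and the standard basis vectors.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

D = gram_schmidt_conjugate(A, U)
off_diagonal = [
    abs(dot(D[i], matvec(A, D[j])))
    for i in range(3)
    for j in range(3)
    if i != j
]
```

All entries of `off_diagonal` are zero up to rounding, so the resulting directions are mutually conjugate.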

#### The Method of Conjugate Gradients

We can **finally** describe the conjugate gradient method.
We just set
$$
   u_i = r_{(i)}.
$$

Why would we do this?

First, since the search directions are $A$-orthogonal to the error, the residual must be orthogonal to all previous search directions.
This means we will always get new, linearly independent vectors.

Since the search directions are built from the residuals, we have that
$$
   \mathcal{D}_i = \operatorname{span} \{ r_{(0)}, r_{(1)}, \ldots, r_{(i-1)} \}.
$$

$r_{(i)}$ is orthogonal to the search directions, so $r_{(i)}$ is orthogonal to $\mathcal{D}_i$, and therefore $r_{(i)}$ is orthogonal to all the other residuals,
$$
   r_{(i)}^T r_{(j)} = 0, \qquad i \neq j
$$

Also, note that the residual is given by
$$
   r_{(i+1)} = -A e_{(i+1)} = -A(e_{(i)} + \alpha_{(i)} d_{(i)}) = r_{(i)} - \alpha_{(i)} A d_{(i)},
$$
so $r_{(i+1)}$ is a linear combination of the previous residual and $A d_{(i)}$.

This means that the space $\mathcal{D}_{i+1}$ is the sum of $\mathcal{D}_i$ and $A \mathcal{D}_i$ (that is, the span of their union),
so
$$
\begin{aligned}
   \mathcal{D}_i &= \operatorname{span}\{ d_{(0)}, Ad_{(0)}, A^2 d_{(0)}, \ldots, A^{i-1} d_{(0)}\} \\
   &= \operatorname{span}\{ r_{(0)}, Ar_{(0)}, A^2 r_{(0)}, \ldots, A^{i-1} r_{(0)}\}
\end{aligned}
$$

This type of subspace is called a **Krylov** subspace.

This choice of search directions has a very important property:

Note that $r_{(i+1)}$ is orthogonal to $\mathcal{D}_{i+1}$.
But $A \mathcal{D}_i$ is a subspace of $\mathcal{D}_{i+1}$.
So, $r_{(i+1)}$ is orthogonal to $A \mathcal{D}_i$, and hence $r_{(i+1)}$ is $A$-orthogonal to $\mathcal{D}_i$.

This means that to generate a new search direction $d_{(i+1)}$ that is $A$-orthogonal to all the previous search directions, we only need to make $r_{(i+1)}$ $A$-orthogonal to $d_{(i)}$, since it is automatically $A$-orthogonal to the earlier ones.

Note that we had
$$
   r_{(i+1)} = r_{(i)} - \alpha_{(i)} A d_{(i)},
$$
and so
$$
   r_{(i)}^T r_{(j+1)} = r_{(i)}^T r_{(j)} - \alpha_{(j)} r_{(i)}^T A d_{(j)},
$$
therefore
$$
   \alpha_{(j)} r_{(i)}^T A d_{(j)} = r_{(i)}^T r_{(j)} - r_{(i)}^T r_{(j+1)}
$$

Since the residuals are mutually orthogonal, the right-hand side vanishes whenever $i$ is neither $j$ nor $j+1$.
Returning to the formula for $\beta_{ij}$ with $u_i = r_{(i)}$ and $j < i$, this means that only $\beta_{i,i-1}$ can be nonzero, and we don't need to store the old search directions to guarantee $A$-orthogonality.
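Putting the pieces together gives a short iteration. The sketch below is one common way to write it, not necessarily the form used later in the course. It assumes the standard simplifications $\alpha_{(i)} = r_{(i)}^T r_{(i)} / (d_{(i)}^T A d_{(i)})$ and $\beta_{i+1,i} = r_{(i+1)}^T r_{(i+1)} / (r_{(i)}^T r_{(i)})$, which follow from the orthogonality relations above but are not derived in the text. The test matrix, right-hand side, and tolerance are illustrative.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-12):
    """Run at most n CG steps; return the iterate and all residuals."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # r_0 = b - A x_0
    d = list(r)                                       # d_0 = r_0
    rr = dot(r, r)
    residuals = [r]
    for _ in range(len(b)):
        if rr <= tol * tol:
            break
        Ad = matvec(A, d)
        alpha = rr / dot(d, Ad)       # uses d^T r = r^T r
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        beta = rr_new / rr            # the only nonzero Gram-Schmidt coefficient
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
        residuals.append(r)
    return x, residuals


# Illustrative data (assumed): a small SPD system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x, residuals = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
final_residual = [bi - ai for bi, ai in zip(b, matvec(A, x))]
```

Only the current residual and the current search direction are carried between iterations, and the recorded residuals can be used to confirm that $r_{(i)}^T r_{(j)} \approx 0$ for $i \neq j$.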