In [None]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'sky',
              'transition': 'zoom',
              'start_slideshow_at': 'selected',
})

# Lecture 11. Iterative methods (advanced topics)

## (Approximate) Syllabus
- **Week 1:** Intro & basic integral equations (turning PDEs into IEs, typical kernels, Nystrom, collocation, Galerkin, quadrature for singular/hypersingular integrals).
- **Week 2:** Translation-invariant kernels and convolutions, FFT. Concept of close and far interactions, precorrected FFT. Barnes-Hut method.
- **Week 3:**  Fast multipole methods. Algebraic analogue of fast multipole method, hierarchical matrices
- **Week 4:**  Multigrid methods, domain decomposition

## Previous lecture: 

Theory of geometric multigrid, subspace correction iterations.

## Today's lecture

Today we will talk a little about iterative methods: most of this was covered in the NLA course, so we will refresh the basics and then cover some advanced topics.


## The "philosophy" of iterative methods
As you remember, an iterative method is a computation of the form
$$x_{k+1} = \Phi(x_k), \quad k = 0, 1, \ldots$$
If the mapping $\Phi(x)$ is contractive, then 
the iterative process converges to the unique stationary point
$$x_* = \Phi(x_*).$$
What are the additional requirements on $\Phi$, and how is this theorem proved?

## Fixed-point convergence

**Theorem**: Let $M$ be a complete metric space with distance $\rho$. Then, if the mapping $\Phi: M \rightarrow M$ is a contraction mapping, i.e. 
$$
   \rho(\Phi(x), \Phi(y)) \leq q \rho(x, y),
$$
where $q < 1$,
the fixed-point method converges to the unique solution of 

$$x = \Phi(x).$$

The proof is based on the observation that $x_k$ is a Cauchy sequence:

For any $m \geq k$

$$\rho(x_m, x_k) \leq \sum_{i = k}^{m-1} \rho(x_{i+1}, x_i) \leq \sum_{i=k}^{m-1} q^i \rho(x_1, x_0) \leq \rho(x_1, x_0) \frac{q^k}{1-q}.$$ 
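
The bound above can be checked numerically. The following is a minimal sketch, assuming the scalar map $g(x) = \cos x$ on $[0, 1]$ (where $|g'(x)| \leq \sin 1 < 1$, so $q = \sin 1$ is a valid contraction constant); the helper name `fixed_point_history` is introduced only for this illustration.

```python
import numpy as np

def fixed_point_history(g, x0, niters):
    """Return the iterates x_0, x_1, ..., x_niters of x_{k+1} = g(x_k)."""
    xs = [x0]
    for _ in range(niters):
        xs.append(g(xs[-1]))
    return np.array(xs)

# g(x) = cos(x) maps [0, 1] into itself and |g'(x)| = |sin(x)| <= sin(1) < 1 there
q = np.sin(1.0)
xs = fixed_point_history(np.cos, 1.0, 200)
x_star = xs[-1]  # numerically converged fixed point

k = np.arange(30)
errors = np.abs(xs[:30] - x_star)
# A priori bound: rho(x_*, x_k) <= q^k / (1 - q) * rho(x_1, x_0)
bounds = q ** k / (1.0 - q) * abs(xs[1] - xs[0])
print(np.all(errors <= bounds))
```

The bound is pessimistic here, because the actual contraction factor near the fixed point is smaller than the global constant $q$.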

## Reminder: simple iteration
For linear systems, we have the simplest iterative method
    $$x_{k+1} = x_k + B (f - Ax_k),$$
and it converges if $\rho (I - B A) < 1$, where $\rho$ is the spectral radius.
To prove the convergence under this condition, we need to use the Jordan form.
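
As an illustration, here is a small sketch of the simple iteration with $B = \tau I$. The test matrix (a 1D Laplacian stencil) and the choice $\tau = 2 / (\lambda_{\min} + \lambda_{\max})$ are assumptions made for this example, not something fixed by the lecture.

```python
import numpy as np

def simple_iteration(A, f, B, x0, niters):
    """Run x_{k+1} = x_k + B (f - A x_k) for a fixed number of steps."""
    x = x0.copy()
    for _ in range(niters):
        x = x + B.dot(f - A.dot(x))
    return x

n = 20
# 1D Laplacian stencil: symmetric positive definite test matrix (assumption of this sketch)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
lam = np.linalg.eigvalsh(A)          # eigenvalues in ascending order
tau = 2.0 / (lam[0] + lam[-1])       # classical parameter choice for B = tau * I
B = tau * np.eye(n)

# Spectral radius of the iteration matrix I - B A
rho = np.max(np.abs(np.linalg.eigvals(np.eye(n) - B.dot(A))))
x = simple_iteration(A, f, B, np.zeros(n), 2000)
print(rho, np.linalg.norm(A.dot(x) - f))
```

The spectral radius is below one but close to it, which is why so many iterations are needed even for this small system.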
 

## Speeding up the convergence of the fixed-point iteration
There are many ways to speed up the convergence of the fixed-point iteration.

Newton method is probably one of the most well-known for a non-linear case, 

when we rewrite the equation 

$$f(x) = 0$$

in the form

$$g(x) = x, $$

where $$g(x) = x - \alpha(x) f(x),$$

and the best local convergence is attained by setting

$$\alpha(x) = (f'(x))^{-1}.$$

Each step then requires the solution of a linear system, so it is not very useful for our purposes (although for non-linear PDEs it might be useful).
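
For a scalar equation the Newton step is just a division. A minimal sketch, assuming the test equation $f(x) = x - \cos x = 0$ (the function `newton` and its parameters are names chosen for this example):

```python
import numpy as np

def newton(f, fprime, x0, eps=1e-12, niters=50):
    """Scalar Newton iteration x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for k in range(niters):
        step = f(x) / fprime(x)   # alpha(x) f(x) with alpha(x) = 1 / f'(x)
        x = x - step
        if abs(step) < eps:
            break
    return x, k + 1

root, its = newton(lambda x: x - np.cos(x), lambda x: 1.0 + np.sin(x), 1.0)
print(root, its)
```

In the vector case the division becomes a linear solve with the Jacobian, which is the cost mentioned above.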

## Anderson acceleration

Given a fixed-point iteration

$$x_{k+1} = g(x_k),$$

how can we speed up the convergence?

One of the simplest (but often very effective) ways to do so is the so-called **Anderson acceleration** (AA).

It is very widely used in chemistry under the name "direct inversion in the iterative subspace" (DIIS).

## Anderson acceleration: how to derive it

In the standard fixed-point method we store only the previous iterate $x_k$.

The idea of AA comes from storing a certain **history** of the iterates with depth $m$: $x_k, x_{k-1}, \ldots, x_{k-m}$.

Then we take a **weighted sum** (also called mixing) to approximate the new iterate:

$$x_{k+1} = \sum_{j = 0}^m \alpha_j g(x_{k-j}).$$

In order to make the fixed point the same, we need

$$\sum_{j=0}^m \alpha_j = 1.$$



## Anderson acceleration: how to derive it (2)

The main question is how to select $\alpha$. 

We are solving the equation 

$$x - g(x) = f(x) = 0.$$

Then, $f(x_k)$ can be called the **residual** at the point $x_k$; we denote $f_k = f(x_k)$.

Then, in AA we select $\alpha$ in such a way that

$$ \Big\Vert \sum_{j=0}^m \alpha_j f_{k-j}  \Big\Vert_2 \rightarrow \min.$$

Subject to the constraint $\sum_{j=0}^m \alpha_j = 1$, this is a linear least squares problem.


## Anderson acceleration: final formula

The minimization problem can be rewritten in the equivalent form as

$$ \min_{\beta_1, \ldots, \beta_m} \Vert f_k - \sum_{j=1}^m \beta_j \left(f_{k-j+1} - f_{k-j}\right) \Vert_2,$$

without any additional restrictions on $\beta$.

Thus, we form a matrix $F$ with columns $f_{k-j+1} - f_{k-j}$, $j = 1, \ldots, m$, and solve a linear least squares problem with it.
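
To see why the constrained and the unconstrained formulations agree, one possible change of variables (a short check, with $f_k = f(x_k)$) is

$$\alpha_0 = 1 - \beta_1, \quad \alpha_j = \beta_j - \beta_{j+1}, \; j = 1, \ldots, m-1, \quad \alpha_m = \beta_m.$$

These weights sum to one for any $\beta$, since the sum telescopes:

$$\sum_{j=0}^m \alpha_j = (1 - \beta_1) + \sum_{j=1}^{m-1} (\beta_j - \beta_{j+1}) + \beta_m = 1.$$

Collecting the coefficients of each $f_{k-j}$ gives

$$\sum_{j=0}^m \alpha_j f_{k-j} = f_k - \sum_{j=1}^m \beta_j \left(f_{k-j+1} - f_{k-j}\right),$$

so minimizing over unconstrained $\beta$ is the same as minimizing over $\alpha$ with $\sum_j \alpha_j = 1$.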

## Demo of simple iteration & AA

In [43]:
import numpy as np
def fixed_point(x0, f, eps=1e-6, niters=1000):
    # Plain iteration x_{k+1} = f(x_k); stops when the update is smaller than eps
    k = 0
    er = 2 * eps
    x = x0.copy()
    while k < niters and er > eps:
        x1 = f(x)
        er = np.linalg.norm(x1 - x)
        x = x1.copy()
        print('it={0: d}, er={1: 3.2e}'.format(k, er))
        k += 1
    return x
        
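def anderson_acceleration(x1, f, eps=1e-6, m=5, warmup=5, niters=1000, condtol=1e12):
    # f is the fixed-point map g; x1 is a column vector of shape (n, 1)
    x0 = x1.copy()
    for i in range(warmup):  # a few plain fixed-point steps before acceleration
        x0 = f(x0)
    x = f(x0)
    xc = np.hstack((x0, x))  # history of iterates, stored as columns
    # History of residuals g(x) - x (opposite sign to f(x) = x - g(x) in the text)
    res = np.hstack((f(x0) - x0, f(x) - x))

    k = 0
    er = 2 * eps
    while k < niters and er > eps:
        er = np.linalg.norm(res[:, -1])
        print('it={0: d}, er={1: 3.2e}'.format(k, er))
        # Drop the oldest column if the history is too long or ill-conditioned
        if np.size(res, 1) > m or np.linalg.cond(res) > condtol:
            res = res[:, 1:]
            xc = xc[:, 1:]
        m0 = np.size(res, 1)
        ed = np.eye(m0)[:m0-1, :]
        zd = np.eye(m0)[1:, :]
        gd = zd - ed  # maps the columns to differences of consecutive columns
        Fk = res.dot(gd.T)  # residual differences
        rhs = res[:, -1]
        gm = np.linalg.lstsq(Fk, rhs, rcond=None)[0]  # unconstrained least squares
        dx = xc.dot(gd.T)  # iterate differences
        xnew = rhs + xc[:, -1] - (dx + Fk).dot(gm)  # mixed update
        xnew = np.reshape(xnew, (xc.shape[0], -1))
        xc = np.hstack((xc, xnew))
        res = np.hstack((res, f(xnew) - xnew))
        k += 1
    sol = xc[:, -1]
    return sol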


def f(x):
    return np.cos(x)

x0 = np.ones((1, 1))
sol = anderson_acceleration(x0, f)
print(sol)

#sol_fixed = fixed_point(x0, f)
#print(sol_fixed)

it= 0, er= 4.19e-02
it= 1, er= 3.49e-04
it= 2, er= 5.96e-07
[ 0.73908513]


## Anderson acceleration & GMRES
Anderson acceleration applied to a linear fixed-point iteration is closely related to the **generalized minimal residual method** (GMRES).

In this case, we have $$g(x) = Ax + b.$$

AA without truncation is essentially equivalent to the GMRES method applied to the linear system 

$$(I - A) x = b$$ starting with the same $x_0$. For details see the paper by [Walker and Ni](http://users.wpi.edu/~walker/Papers/Walker-Ni,SINUM,V49,1715-1735.pdf).

## Krylov subspaces 

To solve $Ax = b$ we take an initial vector $x_0$ (e.g. a random one) and solve the equation for the correction $y = x - x_0$:

$$A y = r_0 = b - A x_0.$$

## Krylov subspaces & GMRES

Now it is probably a good idea to recall the basic idea of Krylov methods and GMRES.

In GMRES, we look for the correction $y$ in the Krylov subspace

$$\mathcal{K}_i = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{i-1} r_0\}$$ that minimizes the residual

$$\Vert A y - r_0 \Vert_2 $$

over $\mathcal{K}_i$.

The length of the vector

$$r_i = r_0 - Ay $$ is minimal if and only if 

$$r_i \perp A \mathcal{K}_i.$$
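
The orthogonality condition can be verified directly on a small example. This is only a sketch for the correction equation $Ay = r_0$: the random test matrix is an assumption, and the Krylov basis is built naively (without orthogonalization), which is fine for a few vectors but not for a practical method.

```python
import numpy as np

rng = np.random.RandomState(0)
n, i = 30, 5
A = np.eye(n) + 0.1 * rng.randn(n, n)   # hypothetical well-conditioned test matrix
r0 = rng.randn(n)

# Columns r0, A r0, ..., A^{i-1} r0 span the Krylov subspace K_i
K = np.empty((n, i))
K[:, 0] = r0
for j in range(1, i):
    K[:, j] = A.dot(K[:, j - 1])

AK = A.dot(K)
c = np.linalg.lstsq(AK, r0, rcond=None)[0]   # minimize ||r0 - A K c||_2
ri = r0 - AK.dot(c)
print(np.linalg.norm(AK.T.dot(ri)))          # close to zero: r_i is orthogonal to A K_i
```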

## Geometric version of GMRES

In the **geometric version** of GMRES we construct vectors $q_1, \ldots, q_i$ that form a basis in $\mathcal{K}_i$, and vectors $p_1 = A q_1, \ldots, p_i = A q_i$ that form an orthogonal basis in $A \mathcal{K}_i$. 

The next vector $q_{i+1}$ of this sequence should satisfy the following properties:

$$q_{i+1} \notin \mathcal{K}_i, \quad q_{i+1}  \in \mathcal{K}_{i+1}, \quad p_{i+1} = A q_{i+1} \perp A \mathcal{K}_i.$$

This vector can be obtained by applying the orthogonalization procedure to the vector $p = Aq$, where $q = Aq_i$: we orthogonalize $p$ against $p_1, \ldots, p_i$ and apply the same linear combination to $q$.

In the geometric version one needs to store both $q$ and $p$ (two sequences of vectors).

## Algebraic version of GMRES

The algebraic approach uses only one sequence of vectors $q_1, \ldots, q_i$ that form an orthonormal basis in $\mathcal{K}_i$.

This idea comes from the paper by Saad and Schultz (which is also where the widely used GMRES acronym comes from).

We start from 
$$q_1 = \frac{r_0}{\Vert r_0 \Vert}.$$

In order to extend the basis, we compute the new vector $A q_i$ and orthogonalize it against all the previous ones:

$$q_{i+1}= \frac{\widehat{q}_{i+1}}{\Vert \widehat{q}_{i+1} \Vert}, \quad \widehat{q}_{i+1} = A q_i - h_{i i} q_i - \ldots - h_{i1} q_1,$$

where $h_{ik} = (A q_i, q_k)$, $h_{i+1 i} = \Vert \widehat{q}_{i+1} \Vert$.

## Matrix form

In the matrix form we have the following decomposition:

$$Aq_i=\begin{bmatrix} q_1 & \ldots & q_{i+1} \end{bmatrix} \begin{bmatrix} h_{i1} \\ \vdots \\ h_{i+1 i} \end{bmatrix}$$

Thus, after the first $i$ steps we have

$$AQ_i = Q_{i+1}  \widehat{H}_i,$$

where $\widehat{H}_i$ has the form

$$\widehat{H}_i = \begin{bmatrix} H_i \\ 0 & \ldots & h_{i+1 i} \end{bmatrix},$$

where $H_i$ is upper Hessenberg.
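
The relation $AQ_i = Q_{i+1} \widehat{H}_i$ can be checked with a short sketch of this orthogonalization process (usually called the Arnoldi process). The code assumes real arithmetic and no breakdown; the function name `arnoldi` and the random test matrix are choices made for this example. The array entry `H[k, j]` (0-based) stores the inner product $(A q_{j+1}, q_{k+1})$.

```python
import numpy as np

def arnoldi(A, r0, i):
    """Return Q (n x (i+1), orthonormal columns) and H ((i+1) x i) with A Q[:, :i] = Q H."""
    n = r0.shape[0]
    Q = np.zeros((n, i + 1))
    H = np.zeros((i + 1, i))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(i):
        w = A.dot(Q[:, j])
        for k in range(j + 1):           # modified Gram-Schmidt against columns 0..j
            H[k, j] = Q[:, k].dot(w)
            w = w - H[k, j] * Q[:, k]
        H[j + 1, j] = np.linalg.norm(w)  # assumed nonzero (no breakdown)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

rng = np.random.RandomState(1)
A = rng.randn(40, 40)
Q, H = arnoldi(A, rng.randn(40), 6)
print(np.linalg.norm(A.dot(Q[:, :6]) - Q.dot(H)))   # Arnoldi relation
print(np.linalg.norm(Q.T.dot(Q) - np.eye(7)))       # orthonormality of the basis
```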

## Matrix form (cont.)

Let $$\widehat{H}_i = U_i R_i, \quad U_i \in \mathbb{C}^{(i+1) \times i}, \quad R_i \in \mathbb{C}^{i \times i},$$
be the QR-factorization of $\widehat{H}_i$.

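The minimization problem 

$$\Vert r_0 - A y \Vert_2 \rightarrow \min, \quad y \in \mathcal{K}_i$$

is equivalent to the least squares problem for the coefficient vector $y_i \in \mathbb{C}^i$ (with $y = Q_i y_i$):

$$\Vert r_0 - A Q_i y_i \Vert_2 \rightarrow \min.$$

Since $A Q_i = Q_{i+1} \widehat{H}_i$ and $r_0 = \Vert r_0 \Vert Q_{i+1} e_1$, the **minimum** is obtained for $y_i$ that solves

$$R_i y_i = z_i, \quad z_i = \Vert r_0 \Vert U^*_i e_1.$$


Finally, $$x_i = x_0 + Q_i y_i = x_0 + Q_i R^{-1}_i z_i.$$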
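
Putting the pieces together gives a compact sketch of full (unrestarted) GMRES. It assumes real arithmetic, no breakdown handling, and recomputes the QR-factorization from scratch with `np.linalg.qr` instead of updating it; `gmres_sketch` and the test problem are names and data invented for this illustration.

```python
import numpy as np

def gmres_sketch(A, b, x0, i):
    """i steps of full GMRES: x_i = x_0 + Q_i R_i^{-1} z_i."""
    n = b.shape[0]
    r0 = b - A.dot(x0)
    beta = np.linalg.norm(r0)
    Q = np.zeros((n, i + 1))
    H = np.zeros((i + 1, i))
    Q[:, 0] = r0 / beta
    for j in range(i):                       # Arnoldi process
        w = A.dot(Q[:, j])
        for k in range(j + 1):
            H[k, j] = Q[:, k].dot(w)
            w = w - H[k, j] * Q[:, k]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    U, R = np.linalg.qr(H)                   # thin QR: U is (i+1) x i, R is i x i
    z = beta * U[0, :]                       # z_i = ||r0|| U^T e_1 (real case)
    y = np.linalg.solve(R, z)                # R is triangular; generic solve keeps the sketch short
    return x0 + Q[:, :i].dot(y)

rng = np.random.RandomState(2)
n = 50
A = np.eye(n) + 0.03 * rng.randn(n, n)       # hypothetical test matrix with clustered spectrum
b = rng.randn(n)
x0 = np.zeros(n)
res = [np.linalg.norm(b - A.dot(gmres_sketch(A, b, x0, i))) for i in (5, 10, 20)]
print(res)
```

The residual norms decrease as the subspace grows, at the price of storing all $i + 1$ basis vectors.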

## Note on the implementation

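We need to compute the QR-factorization of $\widehat{H}_i$. Since $\widehat{H}_i$ is upper Hessenberg and only gains one column per step, the factorization can be updated from step to step (e.g. by Givens rotations).

Also, the vectors $z_i$  and $y_i$ can be computed in $\mathcal{O}(i^2)$ operations.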

## Summary of GMRES

- We need to store $i$ vectors at the $i$-th iteration. 
- Only the product $Ax$ is needed.

If we want to store only a constant number of vectors, we need either:

- Special matrices (symmetric positive definite case works very well).
- BiCGStab method.


## Symmetric positive definite case: CG method

If $A = A^* > 0$ we have the conjugate gradient (CG) method, which requires only short storage and has wonderful convergence properties.

Again, given $x_0$ we consider the correction equation

$$Au_0 = r_0,$$

and look for the approximation $y$ that minimizes the $A$-norm of the error, i.e.

$$\Vert y - u_0 \Vert_A$$ 
over all vectors $y \in \mathcal{K}_i$.


This gives short recurrence relations.
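
A textbook form of the resulting recurrences is sketched below. It is not derived in this lecture; the test matrix (1D Laplacian stencil) and the stopping rule on the recurrence residual are assumptions of the example.

```python
import numpy as np

def conjugate_gradient(A, b, x0, eps=1e-10, niters=1000):
    """Textbook CG for symmetric positive definite A; stores only x, r, p and A p."""
    x = x0.copy()
    r = b - A.dot(x)
    p = r.copy()
    rr = r.dot(r)
    k = 0
    while k < niters and np.sqrt(rr) > eps:
        Ap = A.dot(p)
        alpha = rr / p.dot(Ap)        # step length along the search direction
        x = x + alpha * p
        r = r - alpha * Ap            # residual update without an extra matvec
        rr_new = r.dot(r)
        p = r + (rr_new / rr) * p     # new direction, A-conjugate to the previous one
        rr = rr_new
        k += 1
    return x, k

n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD test matrix (assumption)
b = np.ones(n)
x, its = conjugate_gradient(A, b, np.zeros(n))
print(its, np.linalg.norm(A.dot(x) - b))
```

Only a fixed number of vectors is kept, in contrast with GMRES, where the whole basis has to be stored.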


## Why Krylov subspaces

We have used quite specific subspaces (**Krylov subspaces**) to represent the solution.

Maybe it is better to compute matrix-by-vector products with other vectors (e.g. random ones)?


## Optimal subspaces

Consider an algorithm $\Phi$ that produces a sequence of **embedded subspaces**

$$L_{i+1} = \Phi(b, L_i, AL_i).$$

The quality of the algorithm is characterized by its index

$$m(\Phi, A, b) = \min \{i: \min_{y \in L_i} \Vert A y - b \Vert_2 \leq \epsilon\}.$$

A bad algorithm can have an infinite index.

## Nemirovsky and Judin result

In the late 1970s Judin and Nemirovsky obtained a result on the quasi-optimality of the Krylov subspaces.

The result states: For any non-singular matrix $A$ there exists a unitary matrix $Q$ such that

$$m(\Phi_k, A, b) \leq 2 m(\Phi, Q^* A Q, b) + 1,$$

where $\Phi_k$ is the algorithm that generates Krylov subspaces.

This gives the quasi-optimality of the Krylov subspaces.

## Preconditioning

Solve $B^{-1} A x = B^{-1} f$ instead of $Ax = f$, with the hope that the matrix $B^{-1} A$ is **better conditioned**.

Remember the difference between left and right preconditioners

$$BA x = Bf, \quad A B y = f, \; x = By,$$

and explicit and implicit preconditioners

$$BA x = Bf, \quad B^{-1} A x = B^{-1} f.$$
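
As a toy illustration of an explicit left preconditioner, the sketch below uses the simplest possible choice, $B = \mathrm{diag}(A)^{-1}$ (Jacobi), on a badly scaled matrix. Both the matrix and the preconditioner are assumptions picked for the example.

```python
import numpy as np

n = 50
d = np.logspace(0, 4, n)    # badly scaled diagonal (hypothetical example)
A = np.diag(d) + 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
B = np.diag(1.0 / np.diag(A))   # explicit Jacobi preconditioner, B ~ A^{-1}

cond_A = np.linalg.cond(A)
cond_BA = np.linalg.cond(B.dot(A))
print(cond_A, cond_BA)
```

Here the preconditioner removes the bad scaling, so the condition number of $BA$ is orders of magnitude smaller than that of $A$.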

## What can we use for preconditioning?

As a preconditioner, many different things can be used:


For sparse linear systems:

1. Incomplete LU preconditioners (ILUT, ILU(k), ILU2)
2. Multigrid preconditioners (both algebraic and geometric multigrid)
3. Domain decomposition (tomorrow)
4. H-matrix factorizations as approximate inverses (i.e. inverse operators for many PDEs can be well approximated by H-matrices)

## Preconditioners for dense matrices

For dense matrices in the shift-invariant case, circulant matrices can be used (they are easily inverted by the FFT).
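
A circulant matrix is diagonalized by the discrete Fourier transform, so applying its inverse costs a few FFTs. A minimal sketch, assuming a real circulant $C$ given by its first column `c` with no zero Fourier coefficients (the stencil below is made up for the example):

```python
import numpy as np

def circulant_solve(c, f):
    """Solve C x = f, where C is the circulant matrix with first column c."""
    return np.real(np.fft.ifft(np.fft.fft(f) / np.fft.fft(c)))

n = 64
c = np.zeros(n)
c[0], c[1], c[-1] = 3.0, -1.0, -1.0   # hypothetical shift-invariant stencil
# Dense circulant matrix (column j is c shifted by j), built only to check the result
C = np.array([np.roll(c, j) for j in range(n)]).T
f = np.random.RandomState(3).randn(n)
x = circulant_solve(c, f)
print(np.linalg.norm(C.dot(x) - f))
```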

For a general hierarchical matrix, its inverse can be approximated by an H-matrix as well.

## Summary

- Anderson acceleration
- GMRES
- And fast solvers for dense matrices



## Next lecture
- Domain decomposition (additive Schwarz, multiplicative Schwarz)
- Parallelization of fast sparse solvers.

In [40]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()