# TIES581 Project work

Mikael Myyrä  
`mikael.b.myyra@jyu.fi`  

In this document I implement and test Krylov subspace methods for
linear systems using NumPy, based on the descriptions of Saad (2003).

First is some setup code to import libraries and test problems as well as build
a tiny test runner to evaluate and compare methods.
Next are my implementations and short explanations of FOM, GMRES, and DIOM,
followed by SGS and ILU(0) preconditioning.
At the end I run tests on these methods and briefly analyze the results.

# Test problems and library setup

I use NumPy for matrix utilities and SciPy to read the Matrix Market matrix format.

The matrices are picked from the Harwell-Boeing and FIDAP collections on
[Matrix Market](https://math.nist.gov/MatrixMarket/).
Two of them, ORSIRR1 and FIDAP036, are also used by Saad (2003).
FIDAP005 is a smaller problem that is helpful in early testing
because you can reasonably print and read it.

Following Saad (2003) chapter 3.7, the right-hand side of $Ax = b$ is generated
as $b = Ae$ where $e = (1,1,\dots,1)^T$, and the initial guess $x_0$
is a vector of random values. Saad does not specify the range
or distribution of random values, so I am assuming the conventional
uniform distribution in the range $[0, 1)$.

The following code loads the test problems and sets up a simple test runner framework
for running and analyzing the methods implemented later.


In [1]:
import numpy as np
import scipy as sp
from scipy.io import mmread

import math
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable
from typing import Union


def default_rhs(A) -> np.ndarray:
    # `@` is always a matrix-vector product, also for sparse array types where `*` is elementwise
    return A @ np.ones((A.shape[1],))


def random_guess(A) -> np.ndarray:
    return np.random.random_sample((A.shape[1],))


@dataclass
class Equation:
    name: str
    A: np.ndarray
    b: np.ndarray

    def residual(self, x: np.ndarray) -> np.ndarray:
        # `@` is always a matrix-vector product, also for sparse array types where `*` is elementwise
        return self.b - self.A @ x


def loadeq(path: str, name: str) -> Equation:
    A = mmread(path)
    return Equation(name=name, A=A, b=default_rhs(A))


FIDAP005 = loadeq("test_matrices/fidap005.mtx", "FIDAP005")
FIDAP036 = loadeq("test_matrices/fidap036.mtx", "FIDAP036")
GR3030 = loadeq("test_matrices/gr_30_30.mtx", "GR3030")
ORSIRR1 = loadeq("test_matrices/orsirr_1.mtx", "ORSIRR1")


MAX_ITERATIONS = 300
EPSILON = 1e-9


@dataclass
class RunResult:
    """Return type of solution methods, containing the actual computed answer
    and performance information for analysis."""

    ans: np.ndarray
    iterations: int


@dataclass
class SolveMethod:
    """A solver method, along with some metadata for display."""

    name: str
    precond_name: str
    func: Callable[[Equation, np.ndarray], RunResult]
    extra_args: dict


@dataclass
class TestRun:
    """Runs a list of methods on a single test problem and prints results in a table."""

    eq: Equation
    methods: list[SolveMethod]

    def run(self):
        print(self.eq.name)
        print(f"size: {self.eq.A.shape[0]} x {self.eq.A.shape[1]}")
        print("")

        # pretty-printing results as a table
        CELL_SIZE = 15
        def fmt_cell(x: Union[str, float]) -> str:
            text = f"{x:.3e}" if isinstance(x, float) else str(x)
            return text.center(CELL_SIZE)
        def print_row(cells: list[Union[str, float]]):
            print("|".join([fmt_cell(c) for c in cells]))

        headers = ["method", "preconditioner", "iterations", "time (ms)", "residual"]
        print_row(headers)
        print_row(["-" * CELL_SIZE] * len(headers))

        for method in self.methods:
            # generate the initial guess before starting the timer so it isn't included in the measurement
            x0 = random_guess(self.eq.A)
            start_time = time.perf_counter_ns()

            result = method.func(self.eq, x0, **method.extra_args)

            duration_ns = time.perf_counter_ns() - start_time
            duration_ms = duration_ns // 1000000
            resid = np.linalg.norm(self.eq.residual(result.ans))
            print_row([method.name, method.precond_name, result.iterations,  duration_ms, resid])

        print("")


np.set_printoptions(precision=5)

# Methods and implementations

## FOM and GMRES

The Full Orthogonalization Method (FOM) and the Generalized Minimum Residual Method (GMRES)
are two closely related algorithms to approximately solve
a linear system $Ax = b$ using a Krylov subspace

$$
\mathcal{K}_m = \text{span}\{r_0, Ar_0, A^2r_0, \dots, A^{m-1}r_0\}
$$

where $m$ is the dimension of the subspace and $r_0 = b - Ax_0$ is the residual
of some initial guess $x_0$. $\mathcal{K}_m$ is related to the original problem
space by

$$
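AV_m = V_mH_m + w_me_m^T
$$

where the columns of $V_m$ form an orthonormal basis of $\mathcal{K}_m$, $H_m$ is an
$m \times m$ upper Hessenberg matrix, and $w_me_m^T$ is a rank-one matrix whose only
nonzero column is the last one. Since $w_m$ is orthogonal to the columns of $V_m$,
projecting onto $\mathcal{K}_m$ gives $V_m^TAV_m = H_m$.
In these methods the matrices are computed with a method known as Arnoldi orthogonalization
and the resulting Hessenberg system is solved in place of the original one.
FOM discards the rank-one term, whereas GMRES retains it as one extra row of the Hessenberg matrix.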

The benefit of this approach is that a problem based on $H_m$ is
smaller than the original problem (controllable by the choice of subspace dimension $m$)
and its Hessenberg structure makes it easier to solve with direct methods.
The tradeoff is that the smaller $m$ is, the less accurate the solution will be.
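As a quick sanity check of the relation between $A$, $V_m$ and $H_m$, here is a minimal sketch of Arnoldi orthogonalization on a small dense matrix. The helper `arnoldi`, the random test matrix and the seed are made up for illustration and are not part of the implementation later in this document. It checks that the basis is orthonormal, that $AV_m = V_{m+1}\bar{H}_m$, and that projecting onto the subspace leaves exactly $H_m$.

```python
import numpy as np


def arnoldi(A, r0, m):
    """Arnoldi with modified Gram-Schmidt.

    Returns V with m+1 orthonormal columns and the (m+1) x m Hessenberg matrix.
    """
    n = len(r0)
    V = np.zeros((n, m + 1))
    H_bar = np.zeros((m + 1, m))
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(m):
        w = A @ V[:, j]
        # orthogonalize against every previous basis vector
        for i in range(j + 1):
            H_bar[i, j] = w @ V[:, i]
            w -= H_bar[i, j] * V[:, i]
        H_bar[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H_bar[j + 1, j]
    return V, H_bar


rng = np.random.default_rng(0)  # arbitrary seed, illustration only
n, m = 8, 4
A = rng.random((n, n)) + n * np.eye(n)  # hypothetical small test matrix
r0 = rng.random(n)

V, H_bar = arnoldi(A, r0, m)
V_m, H_m = V[:, :m], H_bar[:m, :]

basis_is_orthonormal = np.allclose(V.T @ V, np.eye(m + 1))
full_relation_holds = np.allclose(A @ V_m, V @ H_bar)
projection_is_H_m = np.allclose(V_m.T @ A @ V_m, H_m)
print(basis_is_orthonormal, full_relation_holds, projection_is_H_m)
```

The sketch assumes no breakdown, that is, the subdiagonal entry never becomes zero, which is reasonable for a random matrix with $m < n$.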

Getting an accurate result in one iteration requires working in a
high-dimensional Krylov subspace. This is expensive: orthogonalization requires
dot products with all previously computed basis vectors, so the cost grows as $O(m^2n)$
for an $n \times n$ system, and all $m$ basis vectors must be kept in memory.
Saad (2003) presents two variants to alleviate this, the Restarted and Incomplete versions.

I originally implemented FOM and GMRES separately, but because of how similar they are
I chose to join them into one implementation as suggested by Saad (2003).
The practical difference between FOM and GMRES is that FOM solves the system

$$
H_my = ||r_0||e_1
$$

where $H_m$ is an $m \times m$ square matrix, whereas
GMRES solves an overdetermined least-squares problem

$$
\bar{H}_my = ||r_0||e_1
$$

where $\bar{H}_m$ is a $(m+1) \times m$ matrix with an additional row compared to FOM.
In more theoretical terms, FOM is an orthogonal projection method with
$\mathcal{K} = \mathcal{L} = \mathcal{K}_m$
and GMRES is an otherwise equivalent method but with $\mathcal{L}$
replaced with $A\mathcal{K}_m$ (Saad 2003).
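To make the difference concrete, the following sketch builds one Krylov subspace and solves both small problems with dense NumPy routines instead of the plane rotations used in my implementation. The test matrix, sizes and seed are arbitrary assumptions. It checks that the GMRES residual is never larger than the FOM residual, and that the FOM residual is orthogonal to $\mathcal{K}_m$.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, illustration only
n, m = 10, 5
A = rng.random((n, n)) + n * np.eye(n)  # hypothetical small test matrix
b = A @ np.ones(n)
x0 = rng.random(n)

r0 = b - A @ x0
beta = np.linalg.norm(r0)

# Arnoldi orthogonalization (modified Gram-Schmidt)
V = np.zeros((n, m + 1))
H_bar = np.zeros((m + 1, m))
V[:, 0] = r0 / beta
for j in range(m):
    w = A @ V[:, j]
    for i in range(j + 1):
        H_bar[i, j] = w @ V[:, i]
        w -= H_bar[i, j] * V[:, i]
    H_bar[j + 1, j] = np.linalg.norm(w)
    V[:, j + 1] = w / H_bar[j + 1, j]

rhs = np.zeros(m + 1)
rhs[0] = beta  # this is ||r_0|| e_1

# FOM: square system with the first m rows
y_fom = np.linalg.solve(H_bar[:m, :], rhs[:m])
# GMRES: least-squares problem with all m+1 rows
y_gmres, *_ = np.linalg.lstsq(H_bar, rhs, rcond=None)

V_m = V[:, :m]
resid_fom = b - A @ (x0 + V_m @ y_fom)
resid_gmres = b - A @ (x0 + V_m @ y_gmres)
res_fom = np.linalg.norm(resid_fom)
res_gmres = np.linalg.norm(resid_gmres)
print(f"initial {beta:.3e}, FOM {res_fom:.3e}, GMRES {res_gmres:.3e}")
```

The values printed depend on the random matrix, so I don't list them here; only the ordering of the two residual norms is guaranteed.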

### Restarted FOM and GMRES

Restarted FOM/GMRES simply runs the algorithm repeatedly with a small $m$
until a desired precision is achieved.

This implementation also supports preconditioning with a constant operation $M^{-1}$
(right-preconditioned variant). Running with the "identity" preconditioner
is equivalent to running with no preconditioning. Results of these are
compared in the "test runs" section later.
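A small sketch of what right preconditioning means: instead of $Ax = b$ we solve $AM^{-1}u = b$ and recover $x = M^{-1}u$. The Jacobi choice $M = D$, the helper `apply_M_inv` and the test matrix are assumptions made only for this example. The last check shows a convenient property of the right-preconditioned variant: the residual of the preconditioned system is the residual of the original system.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, illustration only
n = 6
A = rng.random((n, n)) + n * np.eye(n)  # hypothetical small test matrix
b = A @ np.ones(n)

# Jacobi preconditioner M = diag(A), chosen only because it is easy to invert
M_diag = np.diag(A).copy()


def apply_M_inv(v):
    return v / M_diag


# dense form of the right-preconditioned operator A M^{-1}
# (dividing by a vector scales column j by 1 / M_jj)
AM_inv = A / M_diag

# solve the preconditioned system for u, then map back to x
u = np.linalg.solve(AM_inv, b)
x = apply_M_inv(u)

# for any u, both systems have the same residual
u_trial = rng.random(n)
residual_original = b - A @ apply_M_inv(u_trial)
residual_preconditioned = b - AM_inv @ u_trial
print(np.allclose(x, np.ones(n)), np.allclose(residual_original, residual_preconditioned))
```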


In [2]:
class ArnoldiVariant(Enum):
    FOM = 0
    GMRES = 1


# Preconditioner is a function that generates a preconditioning operation M^{-1}
# from the matrix A.
# implementations are defined later in the document.
Preconditioner = Callable[[Equation], Callable[[np.ndarray], np.ndarray]]


def restarted_arnoldi(
    eq: Equation,
    x0: np.ndarray,
    variant: ArnoldiVariant,
    subsp_dim: int,
    preconditioner: Preconditioner,
) -> RunResult:
    """Approximately solve `Ax = b` using the Full Orthogonalization Method
    or the Generalized Minimum Residual Method (right-preconditioned variant)
    with Krylov subspace dimension `subsp_dim`."""

    # Krylov subspace can't be larger than the column count of A
    if eq.A.shape[1] < subsp_dim:
        subsp_dim = eq.A.shape[1]

    precond_op = preconditioner(eq)

    # stop condition from Saad (2003): reduce residual norm by a factor of 10^7
    initial_resid_norm = np.linalg.norm(eq.residual(x0))
    stop_resid_limit = 1e-7 * initial_resid_norm
    # flag needed to break the outer iteration from an inner loop
    has_converged = False
    x = np.copy(x0)
    iter_count = 0
    while not has_converged and iter_count < MAX_ITERATIONS:
        iter_count += 1
        resid = eq.residual(x)
        resid_norm = np.linalg.norm(resid)
        # it's possible for the algorithm to diverge to infinity (or NaN). bail if that happens
        # (in the test cases, this happens to FIDAP036 with FOM)
        if not math.isfinite(resid_norm):
            break
        # matrices stored as arrays, built incrementally
        # first basis vector of the Krylov subspace based on the residual
        V = [resid / resid_norm]
        # the upper triangular form of the Hessenberg matrix H
        R = []
        # `g` in the equation `Hy = g` (in FOM)
        # or the least-squares problem `argmin ||g - Hy||` (in GMRES)
        rhs = [resid_norm]
        # `s` and `c` in the rotation matrices used to transform H into upper triangular form
        rot_s = []
        rot_c = []

        for col in range(subsp_dim):
            # the next vector in the preconditioned Krylov subspace's basis
            # (before orthogonalization and normalization)
            next_v = eq.A @ precond_op(V[-1])
            # vector of the nonzero entries of the next column of R.
            # we first compute the corresponding column of H into this vector
            # and then apply rotations to arrive at R
            next_r = np.zeros(col+1)
            # orthogonalize the new basis vector and populate H (next_r)
            for prev_col in range(len(V)):
                next_r[prev_col] = next_v.dot(V[prev_col])
                next_v -= next_r[prev_col] * V[prev_col]
            # the entry of H below the diagonal, which we will eliminate with a rotation later
            h_subdiag = np.linalg.norm(next_v)

            # apply previous plane rotations to the new column of H
            for row in range(len(rot_s)):
                h1 = rot_c[row] * next_r[row] + rot_s[row] * next_r[row+1]
                h2 = -rot_s[row] * next_r[row] + rot_c[row] * next_r[row+1]
                next_r[row] = h1
                next_r[row+1] = h2

            # last rotation is skipped in FOM
            if variant == ArnoldiVariant.FOM and col == subsp_dim - 1:
                R.append(next_r)
                continue

            # compute and apply the next plane rotation matrix
            denom = math.sqrt(next_r[-1]**2 + h_subdiag**2)
            next_s = h_subdiag / denom
            next_c = next_r[-1] / denom
            # left-multiply H by it
            r_diag = next_c * next_r[-1] + next_s * h_subdiag
            # debug: check that this indeed eliminates the subdiagonal entry of H
            assert abs(-next_s * next_r[-1] + next_c * h_subdiag) < EPSILON
            next_r[-1] = r_diag
            # also left-multiply rhs by it (after appending a zero to it)
            next_rhs = -next_s * rhs[-1]
            rhs[-1] = next_c * rhs[-1]

            R.append(next_r)
            # the absolute value of `next_rhs` is the residual norm at this step.
            # if it's low enough, we can stop and move on to solving the triangular system
            if abs(next_rhs) < stop_resid_limit or math.isinf(next_rhs):
                has_converged = True
                break

            rot_s.append(next_s)
            rot_c.append(next_c)
            if col < subsp_dim-1:
                rhs.append(next_rhs)
                V.append(next_v / h_subdiag)

        # backwards substitution to solve the upper triangular system `Ry = g`
        y = np.zeros(len(R))
        for row in reversed(range(len(y))):
            y[row] = (
                (1.0 / R[row][row])
                * (rhs[row]
                    - sum([R[col][row] * y[col] for col in range(row + 1, len(y))]))
            )

        V = np.array(V).T
        x += precond_op(V @ y)

    return RunResult(
        ans=x,
        iterations=iter_count,
    )

### Direct Incomplete Orthogonalization Method (DIOM)

The basic idea of incomplete orthogonalization is simple: reduce the cost
of orthogonalization by orthogonalizing each new basis vector only against the last $k$ computed ones.
However, this only reduces computational costs while keeping memory costs
the same. With some additional work, a special LU factorization can be created
that does not need to store the entire basis $V_m$ in memory.
This is the Direct variant of FOM detailed in chapter 6.4.2 of Saad (2003).

The same idea can be made to work with GMRES as well,
but in the interest of getting this project done in a reasonable amount of time,
I chose to skip to preconditioning techniques after this.
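To illustrate what is lost and what is kept by truncating the orthogonalization, here is a sketch of the incomplete Arnoldi process on a small dense matrix. The helper `incomplete_arnoldi`, the sizes and the seed are hypothetical. With a window of $k$ vectors, the basis is only guaranteed to be orthonormal for vectors whose indices differ by at most $k$, and the Hessenberg matrix becomes banded.

```python
import numpy as np


def incomplete_arnoldi(A, r0, m, k):
    """Arnoldi process that orthogonalizes only against the last k basis vectors."""
    n = len(r0)
    V = np.zeros((n, m + 1))
    H_bar = np.zeros((m + 1, m))
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(m):
        w = A @ V[:, j]
        # truncated loop: only the k most recent vectors are used
        for i in range(max(0, j - k + 1), j + 1):
            H_bar[i, j] = w @ V[:, i]
            w -= H_bar[i, j] * V[:, i]
        H_bar[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H_bar[j + 1, j]
    return V, H_bar


rng = np.random.default_rng(3)  # arbitrary seed, illustration only
n, m, k = 12, 8, 3
A = rng.random((n, n)) + n * np.eye(n)  # hypothetical small test matrix
V, H_bar = incomplete_arnoldi(A, rng.random(n), m, k)

# orthonormality only holds inside the window |i - j| <= k
gram = V.T @ V
idx = np.arange(m + 1)
in_window = np.abs(idx[:, None] - idx[None, :]) <= k
locally_orthonormal = np.allclose(gram[in_window], np.eye(m + 1)[in_window])

# H has only k nonzero diagonals on and above the main diagonal
is_banded = np.allclose(np.triu(H_bar, k), 0.0)

# the Arnoldi-type relation still holds exactly
relation_holds = np.allclose(A @ V[:, :m], V @ H_bar)
print(locally_orthonormal, is_banded, relation_holds)
```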


In [3]:
from collections import deque

def direct_iom(eq: Equation, x0: np.ndarray, ortho_count: int) -> RunResult:
    """Approximately solve `Ax = b` using the Direct Incomplete
    Orthogonalization Method, orthogonalizing against `ortho_count` vectors."""

    initial_resid = eq.residual(x0)
    initial_resid_norm = np.linalg.norm(initial_resid)
    stop_resid_limit = 1e-7 * initial_resid_norm

    # the last `ortho_count` vectors of the basis of K_m.
    # deque automatically handles dropping values we don't need anymore
    V = deque([], ortho_count)
    # first basis vector from the initial residual
    V.append(initial_resid / initial_resid_norm)
    # P = VU^{-1}
    # one less entry because the `m`th vector is computed from `k-1` previous ones
    P = deque([], ortho_count - 1)
    # lower triangular part of the LU decomposition of H
    # one entry less because the final column does not have the subdiagonal
    L = deque([], ortho_count - 1)
    # zeta = L^{-1}(\beta e_1)
    # only one iteration of zeta needs to be stored
    zeta = 0
    # the rest of the state involved (H and U)
    # can be computed one vector at a time and don't need to be stored
    
    x = np.copy(x0)
    step_idx = 0
    # stop if we cover all of the dimension of A without converging
    while step_idx < eq.A.shape[1]:
        # check convergence
        resid_norm = np.linalg.norm(eq.residual(x))
        if resid_norm < stop_resid_limit or math.isinf(resid_norm):
            break

        # we don't actually need to know the indices where we are in the "full" matrix,
        # indexing into the incomplete arrays and vectors in the right way
        # gives us all the information we need

        next_v = eq.A @ V[-1]
        # only store the nonzero entries of H, of which there are at most
        # `ortho_count + 1` because H is banded Hessenberg
        next_h = np.zeros(len(V) + 1)
        for col in range(len(V)):
            next_h[col] = next_v.dot(V[col])
            next_v -= next_h[col] * V[col]
        # entry of H below the main diagonal
        next_h[len(V)] = np.linalg.norm(next_v)
        next_v /= next_h[len(V)]
        # next_v is V_{m+1} which isn't used until next iteration,
        # so don't append it to V until the end of the loop

        if step_idx == 0:
            # first step is a special case where the subdiagonal entry of L does not exist yet
            zeta = initial_resid_norm
            next_u = np.copy(next_h[0:-1])
            # compute the L value that will be used in the next step now
            # so we don't have to store an additional column of H
            L.append(next_h[-1] / next_u[-1])
        else:
            # LU factorization based on notes below
            # once again only storing nonzero entries in `ortho_count`-vectors for U
            # and scalars for L.
            next_u = np.copy(next_h[0:-1])
            for i in range(1, len(next_u)):
                next_u[i] -= L[i-1] * next_u[i-1]

            if abs(next_u[-1]) < EPSILON:
                break

            zeta *= -L[-1]

            # again, this value of L will be used in the next step
            L.append(next_h[-1] / next_u[-1])

        next_p = (
            (1.0 / next_u[-1])
            * (V[-1] - sum([next_u[i] * P[i] for i in range(len(next_u) - 1)]))
        )

        P.append(next_p)

        V.append(next_v)

        x += zeta * next_p

        step_idx += 1

    return RunResult(
        ans=x,
        iterations=step_idx,
    )

#### Updating the LU decomposition

Saad (2003) doesn't explicitly say how the LU decomposition of $H_m$ is computed,
so I'm going to try and figure it out myself. Say we have $k = 3$ and
at step $m = 10$ of the algorithm,

$$
H_{10} = \begin{bmatrix}
\ddots & \\
& h_{6,8} \\
& h_{7,8} & h_{7,9} \\
& h_{8,8} & h_{8,9} & h_{8,10} \\
& h_{9,8} & h_{9,9} & h_{9,10} \\
& & h_{10,9} & h_{10,10} \\
& & & h_{11,10} \\
\end{bmatrix}
$$

We've just computed the last column of $H_{10}$ and we also have the matrices

$$
L_{9} = \begin{bmatrix}
\ddots & \\
& l_{7,6} & 1 \\
& & l_{8,7} & 1 \\
& & & l_{9,8} & 1 \\
\end{bmatrix}
$$

and

$$
U_9 = \begin{bmatrix}
\ddots & \\
& u_{5,7} \\
& u_{6,7} & u_{6,8} \\
& u_{7,7} & u_{7,8} & u_{7,9} \\
& & u_{8,8} & u_{8,9} \\
& & & u_{9,9} \\
\end{bmatrix}.
$$

For the second to last subdiagonal entry of $H_{10}$ which we haven't yet considered
in the LU decomposition, we have

$$
h_{10,9} = l_{10,9}u_{9,9}
$$

from which we get the new subdiagonal entry of $L_{10}$

$$
l_{10,9} = \frac{h_{10,9}}{u_{9,9}}.
$$

Additionally the equations for the final column of $L_{10}U_{10} = H_{10}$ are

$$
\begin{align*}
h_{8,10} &= l_{8,7}u_{7,10} + u_{8,10} = u_{8,10} \iff u_{8,10} = h_{8,10} \\
h_{9,10} &= l_{9,8}u_{8,10} + u_{9,10} \iff u_{9,10} = h_{9,10} - l_{9,8}u_{8,10} \\
h_{10,10} &= l_{10,9}u_{9,10} + u_{10,10} \iff u_{10,10} = h_{10,10} - l_{10,9}u_{9,10} \\
\end{align*}
$$

From this we can deduce an iterative algorithm for the elements of $u_{*,10}$.

$$
\begin{align*}
u_{m-k+1,m} &= h_{m-k+1,m} \\
u_{i,m} &= h_{i,m} - l_{i,i-1}u_{i-1,m} \text{ for } i = m-k+2, \dots, m \\
\end{align*}
$$
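The recurrence can be checked on a full matrix. The sketch below builds a hypothetical banded Hessenberg matrix with $k = 3$, fills $L$ and $U$ column by column using only the update rules derived above, and verifies that $LU = H$. Unlike `direct_iom`, it stores the whole matrices, which is only done here to make the comparison easy. Indices in the code are 0-based, so the first row inside the band of column `col` is `col - k + 1`.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed, illustration only
m, k = 7, 3

# hypothetical banded Hessenberg matrix: column j is nonzero in rows j-k+1 .. j+1
H = np.zeros((m, m))
for j in range(m):
    for i in range(max(0, j - k + 1), min(m, j + 2)):
        H[i, j] = rng.random()
H += 4.0 * np.eye(m)  # strong diagonal keeps the pivots away from zero

L = np.eye(m)          # unit lower bidiagonal
U = np.zeros((m, m))   # upper triangular with k nonzero diagonals
for col in range(m):
    top = max(0, col - k + 1)
    # first row inside the band: nothing above it, so u = h
    U[top, col] = H[top, col]
    # remaining rows: u_{i,m} = h_{i,m} - l_{i,i-1} u_{i-1,m}
    for i in range(top + 1, col + 1):
        U[i, col] = H[i, col] - L[i, i - 1] * U[i - 1, col]
    # subdiagonal entry of L that will be used by the next column
    if col + 1 < m:
        L[col + 1, col] = H[col + 1, col] / U[col, col]

factorization_ok = np.allclose(L @ U, H)
print(factorization_ok)
```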

# Preconditioners

## Identity

This "preconditioner" sets $M = I$, representing the unpreconditioned version of the algorithm.
This way we can use the preconditioned implementation of GMRES/FOM also for testing the
unpreconditioned variants.


In [4]:
def p_identity(eq: Equation) -> Callable[[np.ndarray], np.ndarray]:
    return lambda x: x

## Symmetric Gauss-Seidel

The SGS preconditioner is equivalent to executing a single iteration of the
Symmetric Gauss-Seidel algorithm. The preconditioning matrix $M_{SGS}$ is equal
to $LU = (I - ED^{-1})(D - F)$ where $D$ is the diagonal of $A$ and $-F$ and $-E$ its
strictly upper and lower triangular parts, respectively.
$L$ is lower triangular, consisting of the strictly lower triangular part of $A$ with each column
divided by the corresponding diagonal entry, and ones on the main diagonal.
$U$ is upper triangular, consisting of the upper triangular part of $A$
including its main diagonal.

The preconditioning operation $M^{-1}x$ is computed by solving for $z$ in
$Mz = x$, which can be done with one sweep each of forward and backward substitution
thanks to the LU decomposition.
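The definitions above can be written out with dense matrices. This is a minimal sketch on a hypothetical small matrix with a nonzero diagonal, not the sparse implementation used in the tests. It uses `np.linalg.solve` for brevity where a real implementation would use dedicated forward and backward substitution.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed, illustration only
n = 5
A = rng.random((n, n)) + n * np.eye(n)  # hypothetical small test matrix

# splitting A = D - E - F
D = np.diag(np.diag(A))
E = -np.tril(A, -1)
F = -np.triu(A, 1)

L = np.eye(n) - E @ np.linalg.inv(D)  # unit lower triangular
U = D - F                             # upper triangular part of A
M = L @ U

# applying M^{-1} to a vector: solve Ly = x, then Uz = y
x = rng.random(n)
y = np.linalg.solve(L, x)
z = np.linalg.solve(U, y)

applies_inverse = np.allclose(M @ z, x)
same_as_product_form = np.allclose(M, (D - E) @ np.linalg.inv(D) @ (D - F))
print(applies_inverse, same_as_product_form)
```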


In [5]:
def p_symmetric_gauss_seidel(eq: Equation) -> Callable[[np.ndarray], np.ndarray]:
    # transform A to Compressed Sparse Row format for efficient row operations
    A = eq.A.tocsr()
    diag = A.diagonal(0)

    def impl(x: np.ndarray) -> np.ndarray:
        ans = np.copy(x)
        # forward solve of (I - ED^{-1})y = x
        for row in range(len(x)):
            for ptr in range(A.indptr[row], A.indptr[row+1]):
                col_idx = A.indices[ptr]
                if col_idx >= row:
                    break
                # some of the test cases have some zeroes on the diagonal.
                # pivoting could potentially be used to alleviate this, but that's complicated
                # and this doesn't need to be exact, so just ignore the corresponding elements
                if diag[col_idx] == 0.0:
                    continue
                ans[row] -= ans[col_idx] * A.data[ptr] / diag[col_idx]
            
        # backward solve of (D - F)z = y
        for row in reversed(range(len(x))):
            if diag[row] == 0.0:
                continue
            for ptr in range(A.indptr[row], A.indptr[row+1]):
                col_idx = A.indices[ptr]
                if col_idx <= row:
                    continue
                ans[row] -= ans[col_idx] * A.data[ptr]
            ans[row] /= diag[row]

        return ans

    return impl

## Incomplete LU Factorization

The ILU(0) factorization is of the form $A = LU - R$ where $L$ and $U$ are
lower and upper triangular, with nonzero entries only in positions where $A$ itself is nonzero, and
$R$ contains the "fill-in" part of the factorization where a complete LU
factorization would have nonzero entries that don't match the pattern of $A$.
$R$ is discarded and $LU$ is used as the preconditioning matrix $M$.
(Saad 2003)

Other, more accurate ILU factorizations could be computed by including some of the
fill-in terms that don't match the nonzero pattern of $A$. In general, an
ILU(p) factorization is produced by taking the ILU(0) factorization of $A$,
then performing another ILU(0) factorization using the zero pattern of $LU$,
and repeating this process p times (Saad 2003).

An efficient implementation of this does not repeatedly produce ILU(0) factorizations,
but maintains a "level of fill" matrix that can be computed from the adjacency graph of $A$.
However, this is too much complexity for the time I have for this project,
so I will only implement ILU(0) here.
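The defining property of ILU(0) is easiest to see on a dense copy of a small sparse matrix. The sketch below is an illustration under my own assumptions: the helper `ilu0_dense` and the 5-point Laplacian on a 3 x 3 grid are not part of the test setup. It runs the row-wise elimination restricted to the nonzero pattern of $A$ and checks that $LU$ reproduces $A$ exactly on that pattern, while the discarded fill-in shows up outside it.

```python
import numpy as np


def ilu0_dense(A):
    """ILU(0) on a dense array, storing L (without its unit diagonal) and U together."""
    n = A.shape[0]
    pattern = A != 0
    LU = A.astype(float).copy()
    for i in range(1, n):
        for k in range(i):
            if not pattern[i, k]:
                continue
            LU[i, k] /= LU[k, k]
            # update only the entries of row i that are inside the pattern
            for j in range(k + 1, n):
                if pattern[i, j]:
                    LU[i, j] -= LU[i, k] * LU[k, j]
    return LU, pattern


# 5-point Laplacian on a 3 x 3 grid, a hypothetical test problem
T = 4 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
S = -(np.eye(3, k=1) + np.eye(3, k=-1))
A = np.kron(np.eye(3), T) + np.kron(S, np.eye(3))

LU, pattern = ilu0_dense(A)
L = np.tril(LU, -1) + np.eye(A.shape[0])
U = np.triu(LU)

R = L @ U - A  # the discarded fill-in, so that A = LU - R
exact_on_pattern = np.allclose(R[pattern], 0.0)
fill_in_outside_pattern = bool(np.any(np.abs(R[~pattern]) > 1e-12))
print(exact_on_pattern, fill_in_outside_pattern)
```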


In [6]:
def p_ilu_zero(eq: Equation) -> Callable[[np.ndarray], np.ndarray]:
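    # starting with a copy of A, we compute the ILU factorization in place
    # and store both L and U in the same matrix.
    # this is possible because L has a unit diagonal, which we don't need to store.
    # the copy ensures eq.A is never modified, even if it is already stored in CSR format.
    lu = eq.A.tocsr(copy=True)
    # the loops below assume column indices are sorted within each row
    lu.sort_indices()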
    for row in range(1, eq.A.shape[0]):
        if lu[row, row] == 0.0:
            continue
        # making use of the CSR format of A to handle the condition (i, j) \in NZ(A)
        for ptr in range(lu.indptr[row], lu.indptr[row+1]):
            col = lu.indices[ptr]
            if col >= row or lu[col, col] == 0.0:
                break
            lu.data[ptr] /= lu[col, col]
            for other_ptr in range(ptr+1, lu.indptr[row+1]):
                other_col = lu.indices[other_ptr]
                lu.data[other_ptr] -= lu.data[ptr] * lu[col, other_col]
            

    def impl(x: np.ndarray) -> np.ndarray:
        ans = np.copy(x)
        # forward solve of Ly = x
        for row in range(len(x)):
            for ptr in range(lu.indptr[row], lu.indptr[row+1]):
                col_idx = lu.indices[ptr]
                if col_idx >= row:
                    break
                ans[row] -= ans[col_idx] * lu.data[ptr]
            
        # backward solve of Uz = y
        for row in reversed(range(len(x))):
            if lu[row, row] == 0.0:
                continue
            for ptr in range(lu.indptr[row], lu.indptr[row+1]):
                col_idx = lu.indices[ptr]
                if col_idx <= row:
                    continue
                ans[row] -= ans[col_idx] * lu.data[ptr]
            ans[row] /= lu[row, row]

        return ans

    return impl

# Tests

The following code runs the algorithms defined earlier
on each of the test problems loaded at the start.
Preconditioning methods are tested with GMRES.
A reference solution using `scipy.sparse.linalg.gmres`
with no preconditioning is also tested.

Preconditioning turned out to be slow enough to be unusable with the largest
problem, FIDAP036: a single run of GMRES(10) with SGS preconditioning took 12 minutes.
I tried my best to optimize the code by iterating by row in a row-major format
and doing only the minimum number of multiplications required, but I suspect the
main reason it takes this long is that the triangular solves loop over individual
matrix entries in interpreted Python. It would be interesting to try
implementing these routines in Rust or C and see how much that would speed them up.
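One option I did not try in the tests is to hand the triangular solves to SciPy instead of looping in Python. The following is only a sketch under assumptions: the helper name `sgs_operator` is made up, it takes a matrix rather than an `Equation`, and it assumes a nonzero diagonal, which does not hold for all of the test matrices here. I have not measured how much faster it is.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import spsolve_triangular


def sgs_operator(A):
    """Return a function applying the SGS preconditioner M^{-1} with sparse triangular solves."""
    A = sps.csr_matrix(A)
    d = A.diagonal()  # assumed to contain no zeros
    n = A.shape[0]
    # L = I - E D^{-1}: strictly lower part of A with columns scaled by 1 / d
    lower = (sps.identity(n, format="csr") + sps.tril(A, k=-1) @ sps.diags(1.0 / d)).tocsr()
    # U = D - F: upper triangular part of A including the diagonal
    upper = sps.triu(A, k=0, format="csr")

    def apply(x):
        y = spsolve_triangular(lower, x, lower=True)
        return spsolve_triangular(upper, y, lower=False)

    return apply


# check against a dense construction of M on a hypothetical small matrix
rng = np.random.default_rng(6)  # arbitrary seed, illustration only
n = 6
A_dense = rng.random((n, n)) + n * np.eye(n)
M_dense = (np.eye(n) + np.tril(A_dense, -1) / np.diag(A_dense)) @ np.triu(A_dense)

x = rng.random(n)
z = sgs_operator(A_dense)(x)
matches_dense = np.allclose(M_dense @ z, x)
print(matches_dense)
```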

## Reference Scipy GMRES


In [7]:
from scipy.sparse.linalg import gmres as sp_gmres

# counting iterations for comparison purposes.
# this needs to be global to be available in the scipy callback's scope
ref_iter_count = 0
def increment_iter(_):
    global ref_iter_count
    ref_iter_count += 1

def reference_gmres(eq: Equation, x0: np.ndarray, ortho_count: int) -> RunResult:
    global ref_iter_count
    ref_iter_count = 0
    ans, _ = sp_gmres(
        eq.A,
        eq.b,
        x0=x0,
        restart=ortho_count,
        maxiter=MAX_ITERATIONS,
        callback=increment_iter,
        callback_type="x",
    )
    return RunResult(ans=ans, iterations=ref_iter_count-1)

## Test runs


In [8]:
ref_gmres = [
    SolveMethod(
        name=f"Ref({dim})",
        precond_name="-",
        func=reference_gmres,
        extra_args={"ortho_count": dim},
    )
    for dim in [10, 30, 50]
]

fom = [
    SolveMethod(
        name=f"FOM({dim})",
        precond_name="-",
        func=restarted_arnoldi,
        extra_args={
            "variant": ArnoldiVariant.FOM,
            "subsp_dim": dim,
            "preconditioner": p_identity,
        },
    )
    for dim in [10, 30, 50]
]

diom = [
    SolveMethod(
        name=f"DIOM({count})",
        precond_name="-",
        func=direct_iom,
        extra_args={"ortho_count": count},
    )
    for count in [5, 10, 50]
]

gmres = [
    SolveMethod(
        name=f"GMRES({dim})",
        precond_name="-",
        func=restarted_arnoldi,
        extra_args={
            "variant": ArnoldiVariant.GMRES,
            "subsp_dim": dim,
            "preconditioner": p_identity,
        },
    )
    for dim in [10, 30, 50]
] 

gmres_sgs = [
    SolveMethod(
        name=f"GMRES({dim})",
        precond_name="SGS",
        func=restarted_arnoldi,
        extra_args={
            "variant": ArnoldiVariant.GMRES,
            "subsp_dim": dim,
            "preconditioner": p_symmetric_gauss_seidel,
        },
    )
    for dim in [10, 30, 50]
]

gmres_ilu0 = [
    SolveMethod(
        name=f"GMRES({dim})",
        precond_name="ILU(0)",
        func=restarted_arnoldi,
        extra_args={
            "variant": ArnoldiVariant.GMRES,
            "subsp_dim": dim,
            "preconditioner": p_ilu_zero,
        },
    )
    for dim in [10, 30, 50]
]

TestRun(
    eq=FIDAP005,
    methods=[*ref_gmres, *fom, *diom, *gmres, *gmres_sgs, *gmres_ilu0],
).run()

TestRun(
    eq=FIDAP036,
    # no preconditioning for FIDAP036
    # because the Python implementations are too slow
    methods=[*ref_gmres, *fom, *diom, *gmres],
).run()

TestRun(
    eq=GR3030,
    methods=[*ref_gmres, *fom, *diom, *gmres, *gmres_sgs, *gmres_ilu0],
).run()

TestRun(
    eq=ORSIRR1,
    methods=[*ref_gmres, *fom, *diom, *gmres, *gmres_sgs, *gmres_ilu0],
).run()

FIDAP005
size: 27 x 27

     method    | preconditioner|   iterations  |   time (ms)   |    residual   
---------------|---------------|---------------|---------------|---------------
    Ref(10)    |       -       |      300      |      124      |   1.212e+00   
    Ref(30)    |       -       |       1       |       1       |   2.230e-09   
    Ref(50)    |       -       |       1       |       1       |   1.249e-09   
    FOM(10)    |       -       |      300      |      210      |   1.112e+01   
    FOM(30)    |       -       |       1       |       2       |   1.572e-01   
    FOM(50)    |       -       |       1       |       1       |   2.549e-01   
    DIOM(5)    |       -       |       27      |       3       |   1.081e+02   
    DIOM(10)   |       -       |       27      |       4       |   9.689e-01   
    DIOM(50)   |       -       |       19      |       3       |   1.818e-01   
   GMRES(10)   |       -       |      300      |      209      |   3.083e+00   
   GMRES(30)   |

## Thoughts on the results

The reference implementation is consistently the fastest, which is not surprising —
it has certainly had more work put into optimization, and larger parts of it
are implemented in C. The slowest in terms of time are the preconditioned
GMRES variants, but as discussed earlier, the preconditioning is not representatively
fast because it's implemented in Python. Thus, time isn't a particularly interesting
metric here. Iteration counts, however, reveal some interesting information.

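FOM seems to have the most trouble converging, which isn't unexpected given that
its square system leaves out the last row of $\bar{H}_m$ that GMRES takes into account.
A surprising result is that on FIDAP036, FOM actually diverges to infinity,
and this happens faster the more we increase the Krylov subspace's dimension.
The other methods converge, though slowly. I haven't verified why only FOM is affected,
but a plausible explanation is that GMRES minimizes the residual norm over the subspace
and so never increases it, while FOM has no such guarantee and can take a very large
step when $H_m$ is close to singular.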

The iteration counts on DIOM correspond to the number of basis vectors computed
instead of the number of restarts. In most cases it fails to converge before
covering the entirety of the dimension of $A$, though it performs quite well
on GR3030. It is, however, quite fast to run and doesn't require much memory.
A restarted variant of this that ensures convergence could be a competitive method,
at least for some problems.

My implementation of GMRES seems to match the reference Scipy implementation
fairly closely, which tells me I probably got the math right.
Differences in iteration counts seem to be inversely correlated with differences in
the residual norm of the solution, which suggests the only major difference between
my implementation and Scipy's is the choice of stopping condition.

Preconditioning, while very slow in terms of execution time due to implementation issues covered earlier,
has quite a substantial effect on convergence. Most dramatically, solving FIDAP005 with
GMRES(10) fails to converge in 300 steps without preconditioning, but only takes
one step with the ILU(0) preconditioner. Even on the more challenging ORSIRR1
it results in more than an order of magnitude's improvement at worst and convergence
in a single iteration at best. ILU(0) significantly outperforms SGS in all cases where
SGS doesn't also cause immediate convergence. Since the only extra cost of ILU(0) compared to SGS
is a preprocessing operation done once at the start of the algorithm, and it significantly improves convergence,
it's certainly the better option in all but the simplest cases.

# Conclusion and future work

In this document I've implemented and tested three orthogonalization-based Krylov
subspace methods for linear systems, namely FOM, DIOM, and GMRES, as well as
the SGS and ILU(0) preconditioners. Of these, the clear winner in terms of convergence speed is
GMRES with ILU(0) preconditioning, converging in 5 steps or fewer for every problem I was
able to test it on.

The results are fairly convincing, but not comprehensive — this is a very small set of problems and methods.
Saad (2003) covers plenty of other methods for both solving and preconditioning that I didn't have time to try.
For instance, an interesting piece of future work would be to get some Symmetric Positive Definite test problems
and compare methods that exploit SPD (e.g. Conjugate Gradient) against the general methods I've implemented here.
Methods based on Lanczos biorthogonalization were also ignored here.
Another subject I find interesting is ILU(p) preconditioning.
For that I would also need more difficult test problems, since ILU(0) was already
maximally effective for the problems I had here.

# Sources

Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.