In [1]:
import numpy as np


#  Eigenvalues and where to find them

## Introduction

This IPython notebook is mostly a copy of the work done by Marc Khoury (https://marckhoury.github.io/numerical-algorithms-for-computing-eigenvectors/).
Most parts of it are direct reproductions; all credit belongs to him.

The eigenvalues and eigenvectors of a matrix are essential in many applications across the sciences. Despite their utility, students often leave their linear algebra courses with very little intuition for eigenvectors. In this notebook we describe several surprisingly simple algorithms for computing the eigenvalues and eigenvectors of a matrix, while attempting to convey as much geometric intuition as possible.

## The nice world of positive definite matrices

#### Basic definition
In linear algebra, a symmetric $n \times n$ real matrix $M$ is said to be positive definite if the scalar $z^{T}Mz$ is strictly positive for every non-zero column vector $z \in \mathbb{R}^n \setminus \{\vec{0}\}$. When interpreting $Mz$ as the output of an operator $M$ acting on an input $z$, positive definiteness means that the output always has a positive inner product with the input, as often observed in physical processes.

More generally, a complex $n\times n$ Hermitian matrix $M$ is said to be positive definite if the scalar $z^{*}Mz$ is strictly positive for every non-zero column vector $z \in \mathbb{C}^n \setminus \{\vec{0}\}$.
Here $z^{*}$ denotes the conjugate transpose of $z$, and we recall that a Hermitian matrix $M$ satisfies $M_{i,j} = \overline{M_{j,i}}$ (which implies that its diagonal is real). Note that $z^{*}Mz$ is automatically real since $M$ is Hermitian.

Positive semi-definite matrices are defined similarly, except that the above scalars $z^TMz$ or $z^{*}Mz$ must be positive or zero (i.e. non-negative). Negative definite and negative semi-definite matrices are defined analogously. A matrix that is not positive semi-definite and not negative semi-definite is called indefinite.
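
As a quick numerical illustration of these definitions, the sketch below classifies a real symmetric matrix by the signs of its eigenvalues, using the standard fact that a symmetric matrix is positive definite exactly when all of its eigenvalues are positive. The helper name `classify_definiteness` and the tolerance `tol` are assumptions made for this example only.

```python
import numpy as np

def classify_definiteness(M, tol=1e-12):
    """Classify a real symmetric matrix by the signs of its eigenvalues.

    Hypothetical helper for illustration; `tol` is an assumed threshold
    below which an eigenvalue is treated as zero.
    """
    M = np.asarray(M, dtype=float)
    if not np.allclose(M, M.T):
        raise ValueError("M must be symmetric")
    w = np.linalg.eigvalsh(M)  # real eigenvalues in ascending order
    if np.all(w > tol):
        return "positive definite"
    if np.all(w >= -tol):
        return "positive semi-definite"
    if np.all(w < -tol):
        return "negative definite"
    if np.all(w <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify_definiteness([[2.0, 0.0], [0.0, 1.0]]))
print(classify_definiteness([[1.0, 0.0], [0.0, 0.0]]))
print(classify_definiteness([[1.0, 0.0], [0.0, -1.0]]))
```

With exact arithmetic the tolerance would be unnecessary; it is only there because computed eigenvalues of a semi-definite matrix may come out as tiny non-zero numbers.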

#### Alternative definition through bilinear form / inner product

The matrix $M$ is positive definite if and only if the bilinear form $\langle z,w\rangle =z^{T}Mw$ is positive definite and similarly for a positive definite sesquilinear form in the complex case.

##### Bilinear form
A bilinear form on a vector space $V$ is a bilinear map $V \times V \to K$, where $K$ is the field of scalars. In other words, a bilinear form is a function $B : V \times V \to K$ that is linear in each argument separately:

\begin{align*}
  B(u + v, w) &= B(u, w) + B(v, w) & \text{and} \quad B(\lambda u, v) &= \lambda B(u, v) \\
  B(u, v + w) &= B(u, v) + B(u, w) & \text{and} \quad B(u, \lambda v) &= \lambda B(u, v)
\end{align*}

##### Sesquilinear form
The definition of a sesquilinear form extends this property to the complex case: the form is additive in both arguments, conjugate-linear in the first and linear in the second.

\begin{align*}
  \phi(x+y,z+w) &= \phi(x,z) + \phi (x,w) + \phi(y,z) + \phi (y,w) \\
  \phi(ax,by) &= \overline{a} b \, \phi(x,y)
\end{align*}

##### Inner product

A bilinear/sesquilinear form is called an inner product if and only if it has the following properties:
* Conjugate symmetry: $\langle x,y\rangle =\overline{\langle y,x\rangle }$
* Linearity in the second argument (matching the convention $\phi(ax,by) = \overline{a} b \, \phi(x,y)$ used above):
\begin{align*}
  \langle x,ay\rangle &= a \langle x,y\rangle \\
  \langle x,y+z\rangle &=\langle x,y\rangle +\langle x,z\rangle
\end{align*}
* Positive-definiteness: $\langle x,x\rangle >0,\quad x\in V\setminus \{\vec{0}\}$

An inner product endows a vector space with many useful properties. An inner product space is a special case of more general, less constrained structures:
![caption](data/250px-Mathematical_Spaces.png)

#### Eigenvalues and eigenvectors

Given a vector space $V$ over the field $F$ and a linear transformation $M : V \to V$, let's now define what an "eigenvalue" is.
\begin{align*}
  \lambda \in F \text{ is an eigenvalue of } M \iff \exists v \in V \setminus \{\vec{0}\} \text{ s.t. } M v = \lambda v
\end{align*}

$v$ is called an eigenvector corresponding to the eigenvalue $\lambda$. There can be several linearly independent eigenvectors associated with one eigenvalue; their number is called the geometric multiplicity of this eigenvalue.
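
To make the definition concrete, here is a minimal sketch that computes the eigenpairs of a small symmetric matrix with `np.linalg.eigh` and checks the relation $Mv = \lambda v$ for each pair. The matrix is an arbitrary example chosen for this illustration.

```python
import numpy as np

# Example symmetric matrix chosen for illustration only.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric/Hermitian matrices; it returns the
# eigenvalues in ascending order and the eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(M)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Each pair should satisfy the defining relation M v = lambda v.
    residual = np.linalg.norm(M @ v - lam * v)
    print(f"lambda = {lam:.6f}, residual = {residual:.2e}")
```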

#### Characteristic polynomial
From the previous definition, the eigenvalue $\lambda$ and the eigenvector $v$ can be seen as a solution to the following problem:

\begin{align}
  (\lambda I - M) v = 0
\end{align}

Since $v$ is non-zero, this means that the matrix $\lambda I - M$ is singular (non-invertible), which in turn means that its determinant is 0. Thus the roots of the function $\det(\lambda I - M)$ are the eigenvalues of $M$, and this determinant is a polynomial in $\lambda$ of degree $n$, called the characteristic polynomial.
This "trick" of requiring the determinant to vanish separates the search for eigenvalues from the search for their corresponding eigenvectors.
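
The link between the roots of $\det(\lambda I - M)$ and the eigenvalues can be checked numerically. The sketch below uses `np.poly`, which returns the coefficients of the characteristic polynomial of a square matrix, and `np.roots`; the $2 \times 2$ matrix is an arbitrary example.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Coefficients of det(lambda*I - M), highest degree first.
# For a 2x2 matrix this is lambda^2 - trace(M)*lambda + det(M).
coeffs = np.poly(M)

roots = np.sort(np.roots(coeffs))   # roots of the characteristic polynomial
eigs = np.linalg.eigvalsh(M)        # eigenvalues, ascending order

print(coeffs)
print(roots, eigs)
```

This is only meant as a sanity check on a tiny matrix, not as a suggested way of computing eigenvalues.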

An interesting property of the characteristic polynomial is that, thanks to the Cayley–Hamilton theorem, every square matrix over a commutative ring ($\mathbb{R}$ or $\mathbb{C}$ for instance) satisfies its own characteristic equation.
We briefly touched upon this topic in the notebook called "KrylovBasedMethods".
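
A small numerical check of the Cayley–Hamilton theorem, as a sketch: we evaluate the characteristic polynomial at the matrix itself with Horner's scheme and observe that the result is zero up to rounding error. The random symmetric test matrix and the seed are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = B + B.T  # symmetric, so the eigenvalues and coefficients are real

coeffs = np.poly(M)  # characteristic polynomial, highest degree first

# Evaluate p(M) with Horner's scheme, using matrix products.
I = np.eye(M.shape[0])
P = np.zeros_like(M)
for c in coeffs:
    P = P @ M + c * I

print(np.linalg.norm(P))  # close to zero, up to rounding error
```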

Unfortunately, finding the roots of a polynomial of arbitrary degree can be a hard task. In particular, the Abel–Ruffini theorem shows that there are polynomials of degree 5 whose roots cannot be expressed with radicals, and later on, in 1830, Évariste Galois characterized which equations can be solved by radicals, showing that most equations of degree higher than four cannot. Hence no closed form is available in general, which is why eigenvalues are computed with iterative algorithms.

#### The fundamental theorem of algebra

#### Eigenvalues of 
Let's now try to imagine





Lets recall what posit

Let $A$ be a symmetric positive definite matrix. Since $A$ is symmetric, all of the eigenvalues of $A$ are real and $A$ has a full set of orthogonal eigenvectors. Let $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n > 0$ denote the eigenvalues of $A$ and let $u_1, \ldots, u_n$ denote their corresponding eigenvectors. The fact that $A$ is positive definite means that $\lambda_i > 0$ for all $i$. This condition isn’t strictly necessary for the algorithms described below; I’m assuming it so that I can refer to the largest eigenvalue as opposed to the largest in magnitude eigenvalue.

All of my intuition for positive definite matrices comes from the geometry of the quadratic form $x^\top A x$. Figure 1 plots $x^\top A x$ in $\mathbb{R}^3$ for several $2 \times 2$ matrices. When $A$ is positive definite, the quadratic form $x^\top A x$ is shaped like a bowl. More rigorously, it has positive curvature in every direction, and the curvature at the origin in the direction of each eigenvector is proportional to the eigenvalue of that eigenvector. In $\mathbb{R}^3$, the two eigenvectors give the directions of the maximum and minimum curvature at the origin. These are also known as principal directions in differential geometry, and the curvatures in these directions are known as principal curvatures. I often shorten this intuition by simply stating that positive definite matrices are bowls, because this is always the picture I have in my head when discussing them.
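
The relation between curvature and eigenvalues can be checked with a finite difference. In this sketch (the matrix and the step `h` are arbitrary choices), the second derivative of $t \mapsto (t u)^\top A (t u)$ at $t = 0$ along a unit eigenvector $u$ comes out as twice the corresponding eigenvalue.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def quadratic_form(x):
    return x @ A @ x

eigenvalues, eigenvectors = np.linalg.eigh(A)
h = 1e-3  # step for the central finite difference (assumed value)

curvatures = []
for lam, u in zip(eigenvalues, eigenvectors.T):
    # Second derivative of t -> q(t * u) at t = 0.
    second = (quadratic_form(h * u) - 2.0 * quadratic_form(np.zeros(2))
              + quadratic_form(-h * u)) / h**2
    curvatures.append(second)
    print(f"eigenvalue {lam:.4f}: second derivative {second:.4f}")
```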


Figure 1: The geometry of the quadratic form $x^\top A x$ for, from left to right, a positive definite matrix, a positive semi-definite matrix, an indefinite matrix, and a negative definite matrix. When $A$ is positive definite it has positive curvature in every direction and is shaped like a bowl. The curvature at the origin in the direction of an eigenvector is proportional to the eigenvalue. A positive semi-definite matrix may have one or more eigenvalues equal to 0. This creates a flat (zero curvature) subspace of dimension equal to the number of eigenvalues with value equal to 0. An indefinite matrix has both positive and negative eigenvalues, and so has some directions with positive curvature and some with negative curvature, creating a saddle. A negative definite matrix has all negative eigenvalues and so the curvature in every direction is negative at every point.

Now suppose we wanted to compute a single eigenvector of $A$. This problem comes up more often than you’d think and it’s a crime that undergraduate linear algebra courses don’t often make this clear. The first algorithm that one generally learns, and the only algorithm in this post that I knew as an undergraduate, is an incredibly simple algorithm called Power Iteration. Starting from a random unit vector $v$ we simply compute $A^t v$ iteratively. For sufficiently large $t$, $A^t v$ converges to the eigenvector corresponding to the largest eigenvalue of $A$, hereafter referred to as the “top eigenvector”.

```python
def PowerIteration(A, max_iter):
  v = np.random.randn(A.shape[0])
  v /= np.linalg.norm(v)  # generate a uniformly random unit vector
  for t in range(max_iter):
    v = np.dot(A, v)  # compute Av
    v /= np.linalg.norm(v)  # renormalize to avoid overflow/underflow
  return v
```
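To see why Power Iteration converges to the top eigenvector of $A$ it helps to write $v$ in the eigenbasis of $A$ as $v = \sum_{i=1}^n \beta_i u_i$ for some coefficients $\beta_i$. Then we have that

\begin{align*}
A^t v &= A^t \left( \sum_{i=1}^n \beta_i u_i \right) = \sum_{i=1}^n \beta_i A^t u_i = \sum_{i=1}^n \beta_i \lambda_i^t u_i \\
&= \lambda_1^t \sum_{i=1}^n \beta_i \left( \frac{\lambda_i}{\lambda_1} \right)^t u_i = \lambda_1^t \left( \beta_1 u_1 + \sum_{i=2}^n \beta_i \left( \frac{\lambda_i}{\lambda_1} \right)^t u_i \right).
\end{align*}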
Since $\lambda_1$ is the largest eigenvalue, the fractions $\left( \frac{\lambda_i}{\lambda_1} \right)^t$ go to 0 as $t \to \infty$, for all $i \neq 1$. Thus the only component of $A^t v$ that has any weight is that of $u_1$. How quickly each of those terms goes to 0 depends on the ratio $\frac{\lambda_2}{\lambda_1}$. If this term is close to 1 then it may take many iterations to disambiguate between the top two (or more) eigenvectors. We say that the Power Iteration algorithm converges at a rate of $O\left( \left( \frac{\lambda_2}{\lambda_1} \right)^t \right)$, which for some unfortunate historical reason is referred to as “linear convergence”.
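
The linear rate can be observed directly. The following sketch builds a matrix with a known spectrum (an arbitrary choice for this example, with $\lambda_2 / \lambda_1 = 0.5$), runs Power Iteration, and prints the ratio of successive errors, which should approach $\lambda_2 / \lambda_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric positive definite matrix with known eigenvalues
# (the spectrum below is an arbitrary choice for this example).
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthonormal basis
A = Q @ np.diag(eigenvalues) @ Q.T
u1 = Q[:, 0]  # top eigenvector

v = rng.standard_normal(4)
v /= np.linalg.norm(v)
errors = []
for t in range(30):
    v = A @ v
    v /= np.linalg.norm(v)
    # Distance to the top eigenvector, ignoring the sign ambiguity.
    errors.append(min(np.linalg.norm(v - u1), np.linalg.norm(v + u1)))

ratios = np.array(errors[1:]) / np.array(errors[:-1])
print(ratios[-5:])  # approaches lambda_2 / lambda_1 = 0.5
```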


Figure 2: An illustration of the Power Iteration algorithm. The $i$th bar represents the component of the current iterate on the $i$th eigenvector, in order of decreasing eigenvalue. Notice that the components corresponding to the smallest eigenvalues decrease most rapidly, whereas the components on the largest eigenvalues take longer to converge. This animation represents 50 iterations of Power Iteration.

Power Iteration will give us an estimate of the top eigenvector $u_1$, but what about the other extreme? What if instead we wanted to compute $u_n$, the eigenvector corresponding to the smallest eigenvalue? It turns out there is a simple modification to the standard Power Iteration algorithm that computes $u_n$. Instead of multiplying by $A$ at each iteration, multiply by $A^{-1}$. This works because the eigenvalues of $A^{-1}$ are $\frac{1}{\lambda_i}$, and thus the smallest eigenvalue of $A$, $\lambda_n$, corresponds to the largest eigenvalue of $A^{-1}$, $\frac{1}{\lambda_n}$. Furthermore the eigenvectors of $A^{-1}$ are unchanged. This slight modification is called Inverse Iteration, and it exhibits the same convergence as Power Iteration, by the same analysis.

```python
import scipy.linalg  # provides lu_factor and lu_solve

def InverseIteration(A, max_iter):
  v = np.random.randn(A.shape[0])
  v /= np.linalg.norm(v)  # generate a uniformly random unit vector
  lu, piv = scipy.linalg.lu_factor(A)  # compute the LU factorization of A once
  for t in range(max_iter):
    v = scipy.linalg.lu_solve((lu, piv), v)  # compute A^(-1)v by solving Ax = v
    v /= np.linalg.norm(v)
  return v
```
Note that we don’t actually compute $A^{-1}$ explicitly. Instead we compute an LU factorization of $A$ and solve the system $LU v_{t+1} = v_t$. The matrix that we’re multiplying by does not change at each iteration, so we can compute the LU factorization once and quickly solve a linear system to compute $A^{-1} v$ at each iteration.
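
The two facts behind Inverse Iteration (the eigenvalues of $A^{-1}$ are $\frac{1}{\lambda_i}$, and the iteration converges to $u_n$) can be checked on a matrix with a known spectrum. This is a NumPy-only sketch with an arbitrary example spectrum; for brevity it calls `np.linalg.solve` at every step, which refactors $A$ each time, whereas the LU-based version factors only once.

```python
import numpy as np

rng = np.random.default_rng(1)
eigenvalues = np.array([5.0, 3.0, 2.0, 0.5])  # arbitrary example spectrum
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(eigenvalues) @ Q.T

# The inverse has eigenvalues 1 / lambda_i.
inv_eigs = np.linalg.eigvalsh(np.linalg.inv(A))
print(np.sort(inv_eigs), np.sort(1.0 / eigenvalues))

# Inverse iteration; np.linalg.solve refactors A on every call, which the
# LU-based version avoids.
v = rng.standard_normal(4)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.linalg.solve(A, v)
    v /= np.linalg.norm(v)

u_n = Q[:, -1]  # eigenvector of the smallest eigenvalue (0.5)
print(abs(v @ u_n))  # close to 1 when v is aligned with u_n
```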


Figure 3: The Inverse Iteration algorithm. Notice that in this case the algorithm converges to the eigenvector corresponding to the smallest eigenvalue.

Power Iteration and Inverse Iteration find the eigenvectors at the extremes of the spectrum of $A$, but sometimes we may want to compute a specific eigenvector corresponding to a specific eigenvalue. Suppose that we have an estimate $\mu$ of an eigenvalue. We can find the eigenvector corresponding to the eigenvalue of $A$ closest to $\mu$ by a simple modification to Inverse Iteration. Instead of multiplying by $A^{-1}$ at each iteration, multiply by $(\mu I_n - A)^{-1}$ where $I_n$ is the identity matrix. The eigenvalues of $(\mu I_n - A)^{-1}$ are $\frac{1}{\mu - \lambda_i}$. Thus the largest eigenvalue in magnitude of $(\mu I_n - A)^{-1}$ corresponds to the eigenvalue of $A$ whose value is closest to $\mu$. By the same analysis as Power Iteration, Shifted Inverse Iteration also exhibits linear convergence. However the better the estimate $\mu$ the larger $\left| \frac{1}{\mu - \lambda_i} \right|$ for the nearest eigenvalue and, consequently, the faster the convergence.

```python
import scipy.linalg  # provides lu_factor and lu_solve

def ShiftedInverseIteration(A, mu, max_iter):
  I = np.identity(A.shape[0])
  v = np.random.randn(A.shape[0])
  v /= np.linalg.norm(v)  # generate a uniformly random unit vector
  lu, piv = scipy.linalg.lu_factor(mu*I - A)  # compute LU factorization of (mu*I - A) once
  for t in range(max_iter):
    v = scipy.linalg.lu_solve((lu, piv), v)  # compute (mu*I - A)^(-1)v
    v /= np.linalg.norm(v)
  return v
```
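
To illustrate how the quality of the shift affects convergence, the sketch below counts the iterations needed for two different shifts targeting the same eigenvalue. The helper `iterations_to_converge`, its tolerance, and the example spectrum are assumptions of this example; it uses `np.linalg.solve` instead of a precomputed LU factorization to stay NumPy-only, and it uses $v^\top A v$ for unit $v$ as the eigenvalue estimate.

```python
import numpy as np

def iterations_to_converge(A, mu, v0, tol=1e-10, max_iter=1000):
    """Run shifted inverse iteration until ||A v - lam v|| < tol.

    Hypothetical helper for this example; returns (iterations, lam).
    """
    shifted = mu * np.eye(A.shape[0]) - A
    v = v0 / np.linalg.norm(v0)
    lam = v @ A @ v
    for t in range(1, max_iter + 1):
        v = np.linalg.solve(shifted, v)
        v /= np.linalg.norm(v)
        lam = v @ A @ v  # eigenvalue estimate for the unit vector v
        if np.linalg.norm(A @ v - lam * v) < tol:
            return t, lam
    return max_iter, lam

rng = np.random.default_rng(2)
eigenvalues = np.array([5.0, 3.0, 2.0, 0.5])  # arbitrary example spectrum
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(eigenvalues) @ Q.T
v0 = rng.standard_normal(4)

t_poor, lam_poor = iterations_to_converge(A, 2.6, v0)   # rough estimate of 3
t_good, lam_good = iterations_to_converge(A, 2.99, v0)  # accurate estimate of 3
print(t_poor, lam_poor)
print(t_good, lam_good)
```

Both runs target the eigenvalue 3, but the shift that is closer to it should need far fewer iterations.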

Figure 4: The Shifted Inverse Iteration algorithm. In this case we converge to the eigenvector corresponding to the eigenvalue nearest $\mu$.

Shifted Inverse Iteration converges quickly if a good estimate of the target eigenvalue is available. However if $\mu$ is a poor approximation of the desired eigenvalue, Shifted Inverse Iteration may take a long time to converge. In fact all of the algorithms we’ve presented so far have exactly the same convergence rate; they all converge linearly. If instead we could improve on the eigenvalue estimate at each iteration we could potentially develop an algorithm with a faster convergence rate. This is the main idea behind Rayleigh Quotient Iteration.

The Rayleigh quotient is defined as $\lambda_R(v) = \frac{v^\top A v}{v^\top v}$ for any nonzero vector $v$. There are many different ways in which we can understand the Rayleigh quotient. Some intuition that is often given is that the Rayleigh quotient is the scalar value that behaves most like an “eigenvalue” for $v$, even though $v$ may not be an eigenvector. What is meant is that the Rayleigh quotient is the minimizer of the optimization problem $\min_{\lambda \in \mathbb{R}} \|Av - \lambda v\|^2$. This intuition is hardly satisfying.
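
The least-squares characterization is easy to verify numerically. In this sketch (random test matrix and vector, chosen only for illustration) the one-unknown problem $\min_\lambda \|v \lambda - Av\|$ is handed to `np.linalg.lstsq` and its solution is compared with the Rayleigh quotient.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)  # symmetric positive definite example
v = rng.standard_normal(5)     # arbitrary vector, not an eigenvector

rayleigh = (v @ A @ v) / (v @ v)

# Solve min_lambda ||v * lambda - A v||_2 as a one-column least-squares problem.
solution, *_ = np.linalg.lstsq(v.reshape(-1, 1), A @ v, rcond=None)
lam_ls = solution[0]

print(rayleigh, lam_ls)
```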

Let’s return to the geometry of the quadratic forms $x^\top A x$ and $x^\top x$ which comprise the Rayleigh quotient, drawn in orange and blue respectively in Figure 5. Without loss of generality we can assume that $A$ is a diagonal matrix. (This is without loss of generality because we’re merely rotating the surface so that the eigenvectors align with the $x$ and $y$ axes, which does not affect the geometry of the surface. This is a common trick in the numerical algorithms literature.) In this coordinate system, the quadratic form $x^\top A x = \lambda_1 x_1^2 + \lambda_2 x_2^2$, where $\lambda_1$ and $\lambda_2$ are the diagonal entries, and thus the eigenvalues, of $A$.

Consider any vector $v$ and let $h = \text{span}\{v, (0, 0, 1)\}$ be the plane spanned by $v$ and the vector $(0, 0, 1)$. The intersection of $h$ with the quadratic forms $x^\top A x$ and $x^\top x$ is comprised of two parabolas, also shown in Figure 5. (This is a common trick in the geometric algorithms literature.) If $v$ is aligned with the $x$-axis, then, within the coordinate system defined by $h$, $x^\top A x$ can be parameterized by $y = \lambda_1 x^2$ and $x^\top x$ can be parameterized by $y = x^2$. (Note that here $y$ and $x$ refer to local coordinates within $h$ and are distinct from the vector $x$ used in $x^\top A x$.) Similarly if $v$ is aligned with the $y$-axis, then $x^\top A x$ can be parameterized by $y = \lambda_2 x^2$. (If $v$ is any other vector then $x^\top A x$ can be parameterized by $y = \kappa x^2$ for some $\kappa$ dependent upon $v$.) For $v$ aligned with the $x$-axis, the Rayleigh quotient at $v$ is $\lambda_R(v) = \frac{\lambda_1 x^2}{x^2} = \lambda_1$. The curvature of the parabola $y = \lambda_1 x^2$ at the origin is $2 \lambda_1$. Thus the Rayleigh quotient is proportional to the curvature of $x^\top A x$ in the direction $v$!


Figure 5: The quadratic form $x^\top A x$ is shown in orange and $x^\top x$ is shown in blue. Intersecting both surfaces with a plane $h$ gives two parabolas. Within the plane $h$ we can define a local coordinate system and parameterize both parabolas as $\kappa x^2$ and $x^2$. The Rayleigh quotient is equal to the ratio of the heights of the parabolas at any point, which is always equal to $\kappa$.

From this intuition it is clear that the value of the Rayleigh quotient is identical along any ray starting at, but not including, the origin. The length of $v$ corresponds to the value of $x$ in the coordinate system defined by $h$, which does not affect the Rayleigh quotient. We can also see this algebraically, by choosing a unit vector $v$ and parameterizing a ray in the direction $v$ as $\alpha v$ for $\alpha \in \mathbb{R}$ and $\alpha > 0$. Then we have that

\begin{align*}
\lambda_R(\alpha v) = \frac{(\alpha v^\top) A (\alpha v)}{\alpha^2 v^\top v} = \frac{v^\top A v}{v^\top v} = v^\top A v.
\end{align*}

Thus it is sufficient to consider the values of the Rayleigh quotient on the unit sphere.

For a unit vector $v$ the value of the Rayleigh quotient can be written in the eigenbasis as $v^\top A v = \sum_{i=1}^n \lambda_i \langle v, u_i \rangle^2$ where $\sum_{i=1}^n \langle v, u_i \rangle^2 = 1$. Thus the Rayleigh quotient is a convex combination of the eigenvalues of $A$ and so its value is bounded by the minimum and maximum eigenvalues, $\lambda_n \leq \lambda_R(v) \leq \lambda_1$ for all $v$. This fact is also easily seen from the geometric picture above, as the curvature at the origin is bounded by twice the minimum and maximum eigenvalues. It can be readily seen by either direct calculation or by the coefficients of the convex combination, that if $v$ is an eigenvector, then $\lambda_R(v)$ is the corresponding eigenvalue of $v$.
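
Both the convex-combination identity and the resulting bounds can be checked on random unit vectors. This is a minimal sketch; the test matrix, sample size, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)             # symmetric positive definite example
eigenvalues, U = np.linalg.eigh(A)  # ascending order; columns of U are eigenvectors

samples = rng.standard_normal((1000, 6))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)  # unit vectors

# Rayleigh quotient v^T A v for each unit vector (each row of `samples`).
quotients = np.einsum("ij,jk,ik->i", samples, A, samples)

# Convex-combination weights <v, u_i>^2, which sum to 1 for a unit vector.
weights = (samples @ U) ** 2

print(quotients.min(), quotients.max())
print(eigenvalues[0], eigenvalues[-1])
```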

Recall that a critical point of a function is a point where the derivative is equal to 0. It should come as no surprise that the eigenvalues are the critical values of the Rayleigh quotient and the eigenvectors are the critical points. What is less obvious is the special geometric structure of the critical points.

The gradient of the Rayleigh quotient is $\frac{2}{v^\top v} (Av - \lambda_R(v) v)$, from which it is easy to see that every eigenvector is a critical point of $\lambda_R$. The type of critical point is determined by the Hessian matrix, which at the critical point $u_i$ is $2(A - \lambda_i I)$. The eigenvalues of the Hessian are $2(\lambda_j - \lambda_i)$ for $j \in [1, n]$. Assuming for a moment that the eigenvalues are all distinct, the matrix $2(A - \lambda_i I)$ has $i - 1$ eigenvalues that are positive, one eigenvalue that is 0, and $n - i$ eigenvalues that are negative. The 0 eigenvalue represents the fact that the value of the Rayleigh quotient is unchanged along the ray $\alpha u_i$. The other eigenvalues represent the fact that at $u_i$, along the unit sphere, there are $i - 1$ directions in which we can walk to increase the value of the Rayleigh quotient, and $n - i$ directions that decrease the Rayleigh quotient. Thus each eigenvector gives rise to a different type of saddle, and there are exactly two critical points of each type on the unit sphere.
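
The gradient formula can be sanity-checked against central finite differences, and evaluated at an eigenvector to confirm that it vanishes there. The function names, test matrix, and step size `h` below are assumptions of this sketch.

```python
import numpy as np

def rayleigh_quotient(A, v):
    return (v @ A @ v) / (v @ v)

def rayleigh_gradient(A, v):
    # Gradient formula from the text: 2 / (v^T v) * (A v - lambda_R(v) v).
    return 2.0 / (v @ v) * (A @ v - rayleigh_quotient(A, v) * v)

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)  # symmetric positive definite example
v = rng.standard_normal(4)

# Central finite differences along each coordinate direction.
h = 1e-6
numeric = np.array([
    (rayleigh_quotient(A, v + h * e) - rayleigh_quotient(A, v - h * e)) / (2 * h)
    for e in np.eye(4)
])
print(np.max(np.abs(numeric - rayleigh_gradient(A, v))))

# At an eigenvector the gradient vanishes, so it is a critical point.
eigenvalues, U = np.linalg.eigh(A)
grad_at_eigenvector = rayleigh_gradient(A, U[:, 0])
print(np.linalg.norm(grad_at_eigenvector))
```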


Figure 6: Contours of the Rayleigh quotient on the unit sphere and the gradient of the Rayleigh quotient at each point. We clearly see one minimum in blue corresponding to the minimum eigenvalue, one saddle point, and one maximum in bright yellow corresponding to the maximum eigenvalue.

Finally we come to the crown jewel of the algorithms in this post. The Rayleigh Quotient Iteration algorithm simply updates the estimate $\mu$ at each iteration with the Rayleigh quotient. Other than this slight modification, the algorithm is exactly like Shifted Inverse Iteration.

```python
def RayleighQuotientIteration(A, max_iter):
  I = np.identity(A.shape[0])
  v = np.random.randn(A.shape[0])
  v /= np.linalg.norm(v)  # generate a uniformly random unit vector
  mu = np.dot(v, np.dot(A, v))  # initial Rayleigh quotient (v has unit norm)
  for t in range(max_iter):
    v = np.linalg.solve(mu * I - A, v)  # compute (mu*I - A)^(-1)v
    v /= np.linalg.norm(v)
    mu = np.dot(v, np.dot(A, v))  # compute Rayleigh quotient
  return (v, mu)
```
This slight modification drastically improves the convergence rate. Unlike the other algorithms in this post which converge linearly, Rayleigh quotient iteration exhibits local cubic convergence! This means that, assuming $\|v_t - u_i\| \leq \epsilon$ for some $u_i$, on the next iteration we will have that $\|v_{t+1} - u_i\| \leq \epsilon^3$. In practice this means that you should expect triple the number of correct digits at each iteration. It’s hard to overstate how crazy fast cubic convergence is, and, to the best of the author’s knowledge, algorithms that exhibit cubic convergence are rare in the numerical algorithms literature.
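
The speed of convergence is easiest to appreciate by printing the residual $\|Av - \mu v\|$ at each step. The sketch below is a self-contained variant of the iteration, not the notebook's function: the example spectrum, the stopping threshold, and the guard against an exactly singular system are assumptions added for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
eigenvalues = np.array([5.0, 3.0, 2.0, 0.5])  # arbitrary example spectrum
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(eigenvalues) @ Q.T
I = np.eye(4)

v = rng.standard_normal(4)
v /= np.linalg.norm(v)
mu = v @ A @ v
residuals = []
for t in range(20):
    try:
        v = np.linalg.solve(mu * I - A, v)
    except np.linalg.LinAlgError:
        break  # mu*I - A is exactly singular: mu is already an eigenvalue
    v /= np.linalg.norm(v)
    mu = v @ A @ v
    residuals.append(np.linalg.norm(A @ v - mu * v))
    if residuals[-1] < 1e-12:
        break  # converged to working precision (assumed threshold)

print(residuals)
print(mu)
```

Which eigenvalue `mu` ends up at depends on the random starting vector; only the rapid drop of the residuals is the point here.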


Figure 7: The Rayleigh Quotient Iteration algorithm. After only 6 iterations the eigenvalue estimate $\mu_t$ is so accurate that the resulting matrix $(\mu_t I_n - A)$ is singular up to machine precision and we can no longer solve the system for an inverse. Note that every other figure in this post shows 50 iterations.

Intuitively, the reason that Rayleigh Quotient Iteration exhibits cubic convergence is because, while the Shifted Inverse Iteration step converges linearly, the Rayleigh quotient is a quadratically good estimate of an eigenvalue near an eigenvector. To see this consider the Taylor series expansion of $\lambda_R$ near an eigenvector $u_i$.

\begin{align*}
\lambda_R(v) &= \lambda_R(u_i) + (v - u_i)^\top \nabla \lambda_R(u_i) + O(\|v - u_i\|^2) \\
&= \lambda_R(u_i) + O(\|v - u_i\|^2) \\
\lambda_R(v) - \lambda_R(u_i) &= O(\|v - u_i\|^2)
\end{align*}

The second step follows from the fact that $u_i$ is a critical point of $\lambda_R$ and so $\nabla \lambda_R(u_i) = 0$.
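
The quadratic accuracy of the Rayleigh quotient can be observed by perturbing an eigenvector by a small amount $\epsilon$ and measuring the eigenvalue error, which should shrink like $\epsilon^2$. The spectrum, the perturbation direction, and the values of `eps` in this sketch are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
eigenvalues = np.array([5.0, 3.0, 2.0, 0.5])  # arbitrary example spectrum
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(eigenvalues) @ Q.T
u = Q[:, 1]  # unit eigenvector with eigenvalue 3

d = rng.standard_normal(4)
d -= (d @ u) * u  # perturbation direction orthogonal to u
d /= np.linalg.norm(d)

errors = []
for eps in [1e-1, 1e-2, 1e-3]:
    v = u + eps * d
    error = abs((v @ A @ v) / (v @ v) - 3.0)
    errors.append(error)
    print(f"eps = {eps:.0e}: eigenvalue error = {error:.2e}")
```

Each tenfold reduction of `eps` should reduce the eigenvalue error by roughly a factor of one hundred.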

While Rayleigh Quotient Iteration exhibits very fast convergence, it’s not without its drawbacks. First, notice that the matrix $(\mu_t I - A)$ changes at each iteration. Thus we cannot precompute a factorization of this matrix and quickly solve the system using forward and backward substitution at each iteration, like we did in the Shifted Inverse Iteration algorithm. We need to solve a different linear system at each iteration, which is much more expensive. Second, Rayleigh Quotient Iteration gives no control over which eigenvector it converges to. The eigenvector it converges to depends on which basin of attraction the initial random vector $v_0$ falls into. Thus cubic convergence comes at a steep cost. This balance between an improved convergence rate and solving a different linear system at each iteration feels like mathematical poetic justice.
