The learning outcomes of this session are:

-   To understand the concept of eigenvectors and eigenvalues.

-   To make use of the NumPy libraries for computing eigenvalues/vectors.

-   To use the above to carry out PCA.

Eigenvalues/vectors
================


Import NumPy as before with

In [None]:
import numpy as np

In the first instance, check whether the example vectors on page 29 of the notes for topic 5 are eigenvectors.
That is, given that

$\mathbf{A} = 
\begin{pmatrix}
3&5\\
4&2 
\end{pmatrix}
$

and

$ 
\underline{u}_1 = 
\begin{pmatrix}
5 \\ 4
\end{pmatrix}\;\; ,
$

$ 
\underline{u}_2 = 
\begin{pmatrix}
4 \\ -1
\end{pmatrix} \;\;.
$

Check that $\underline{u}_1$ is an eigenvector of $\mathbf{A}$ and $\underline{u}_2$ is not.
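
One possible way to carry out this check is sketched below. The helper name `is_eigenvector` is hypothetical (it is not from the notes) and it relies on the vectors being 2-D: $\mathbf{A}\underline{u}$ is a multiple of $\underline{u}$ exactly when the two vectors are parallel.

```python
import numpy as np

A = np.array([[3, 5],
              [4, 2]])
u1 = np.array([5, 4])
u2 = np.array([4, -1])

def is_eigenvector(A, u, tol=1e-10):
    """Hypothetical helper: True if A @ u is parallel to the non-zero 2-D vector u."""
    v = A @ u
    # In 2-D, u and v are parallel exactly when the determinant of [u v] is zero.
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

print(A @ u1, is_eigenvector(A, u1))
print(A @ u2, is_eigenvector(A, u2))
```

If the check succeeds, the eigenvalue is the common ratio between the components of $\mathbf{A}\underline{u}$ and $\underline{u}$.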


Now use $\tt{numpy.linalg.eig}$ to compute the eigenvalues and eigenvectors of $\mathbf{A}$.

In [None]:
# A is the 2x2 matrix given above
A = np.array([[3, 5], [4, 2]])
w, E = np.linalg.eig(A)
print(w)
print(E)

Note that the eigenvectors are stored as the columns of $\tt{E}$, i.e. the first column of $\tt{E}$ is the eigenvector for the first eigenvalue in $\tt{w}$, and so on.

Check that the vectors in $\tt{E}$ really are eigenvectors of $\mathbf{A}$. 
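
A minimal sketch of one way to do this, assuming $\mathbf{A}$ is the matrix given above: each column of $\tt{E}$ should satisfy $\mathbf{A}\,\underline{e}_k = w_k\,\underline{e}_k$, up to floating-point rounding.

```python
import numpy as np

A = np.array([[3.0, 5.0],
              [4.0, 2.0]])
w, E = np.linalg.eig(A)

# Check column by column: A @ e_k should equal w_k * e_k.
for k in range(len(w)):
    print(k, np.allclose(A @ E[:, k], w[k] * E[:, k]))

# The same check for every column at once: A E = E diag(w).
# E * w scales column k of E by w[k] through broadcasting.
all_ok = np.allclose(A @ E, E * w)
print(all_ok)
```

$\tt{np.allclose}$ is used instead of an exact comparison because the computed eigenvectors are only accurate to rounding error.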

Since $\underline{u}_1$ is an eigenvector of $\mathbf{A}$, it should be a multiple of either the first or the second eigenvector in $\tt{E}$. How can you check this? Write some Python to do so.
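
One possible approach, given here only as a sketch: $\tt{numpy.linalg.eig}$ returns eigenvectors of unit length, so after scaling $\underline{u}_1$ to unit length it should agree with one column of $\tt{E}$ up to sign. The names `u1_hat` and `matches` are illustrative.

```python
import numpy as np

A = np.array([[3.0, 5.0],
              [4.0, 2.0]])
u1 = np.array([5.0, 4.0])
w, E = np.linalg.eig(A)

# Scale u1 to unit length so it can be compared with the unit-length columns of E.
u1_hat = u1 / np.linalg.norm(u1)

# A parallel column equals u1_hat or its negative.
matches = [np.allclose(u1_hat, E[:, k]) or np.allclose(u1_hat, -E[:, k])
           for k in range(E.shape[1])]
print(matches)

# Eigenvalue belonging to the matching column.
print(w[matches.index(True)])
```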

Compute the eigenvalues of the matrix 

$
\begin{pmatrix}
3&-1&5\\
2&1&0\\
4&1&2
\end{pmatrix}
$

on page 35 of the notes.  Show that one of them is zero.

Attempt to compute the inverse of this matrix. What happens if you perform SVD on it?
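
A sketch of how these experiments might be set up is given below; the variable name `B` is an assumption. Because of floating-point rounding, the zero eigenvalue may be printed as a very small number, and $\tt{numpy.linalg.inv}$ may either raise an error or return a matrix with extremely large entries, so both cases are handled.

```python
import numpy as np

B = np.array([[3.0, -1.0, 5.0],
              [2.0,  1.0, 0.0],
              [4.0,  1.0, 2.0]])

eigenvalues = np.linalg.eigvals(B)
print(eigenvalues)          # one entry should be zero up to rounding
print(np.linalg.det(B))     # the determinant should also be (near) zero

try:
    # For a singular matrix this either raises LinAlgError or,
    # if rounding hides the singularity, returns huge unreliable values.
    print(np.linalg.inv(B))
except np.linalg.LinAlgError as err:
    print("inverse failed:", err)

# The SVD exists for any matrix; singularity shows up as a zero singular value.
U, s, Vt = np.linalg.svd(B)
print(s)
```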

Compute the eigenvalues of the matrix

$
\begin{pmatrix}
3 & -4 \\
1 & 3 
\end{pmatrix}
$

on page 40 of the notes. 

Note - $j \;\; = \;\; \sqrt{-1}$.  
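
As a brief illustration (the variable name `G` is an assumption), NumPy returns complex eigenvalues for this matrix and prints the imaginary unit as `j`:

```python
import numpy as np

G = np.array([[3.0, -4.0],
              [1.0,  3.0]])

w = np.linalg.eigvals(G)
print(w)        # a complex-conjugate pair
print(w.real)   # real parts
print(w.imag)   # imaginary parts
```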

Principal Component Analysis 
========================

There is a set of libraries for doing PCA on data in Python (using scikit-learn), but here we will just use NumPy.

First we will need some additional libraries.

In [None]:
from numpy import genfromtxt
import math

For the PCA we will use a data set about cars. This is a standard data set that you can find in the statistical package R. The full data set can be found on the Moodle page in the file mtcars.csv. It has a variety of data for 32 cars. For this exercise we will use a version of the data set where

<ol>
  <li>all the label data have been removed,</li>

  <li>the columns with integer data have also been removed (you could include them if you wished).</li>

</ol>

This leaves us with 32 cars and 6 variables.

Note - the order of the data is still the same (i.e. the first row corresponds to the Mazda RX4). 

The file can be read in as follows.


In [None]:
mtcars = genfromtxt('cars.csv',delimiter=',')

Check the size of the array that has been read in.

In [None]:
M,N = np.shape(mtcars)

In [None]:
print(N)

In [None]:
print(M)

As discussed in the notes, given a matrix of data $\mathbf{T}$ we need to compute

$
x_{i,j} \;\; \equiv \;\;  t_{i,j} \,\, - \,\, \mu_i \;\; , 
$

and define

$
\mathbf{X}^\intercal \;\; \equiv \;\; 
\begin{pmatrix}
\vdots & \dots & \vdots \\
\underline{x}_1 & & \underline{x}_N \\
\vdots & \dots & \vdots 
\end{pmatrix}
$

This process is called "centring", as the data in each column of $\mathbf{X}^\intercal$ now have a mean of zero.

We define the function below to do this. 

In [None]:
def centreData( data ):
    "Compute means of columns of data and subtract that off data"
    means = np.mean(data,axis=0)
    return data - means

In [None]:
XT = centreData(mtcars)

You can check if the data is centred by computing the mean of the new data.

In [None]:
np.mean(XT,axis=0)

The covariance of the original data can be computed as 

$
\mathbf{C} \;\; = \;\; \frac{1}{M-1} \,  \mathbf{X} \, \mathbf{X}^\intercal \;\; . 
$

The relevant function for this is defined below.

In [None]:
def covariance( data ):
    "Compute covariance of a data matrix"
    XT = centreData(data)
    M = np.shape(data)[0]
    return (1.0/(M-1.)) * np.dot(np.transpose(XT),XT)

In [None]:
C = covariance(mtcars)

We can also compute the covariance matrix from the data using the function $\tt{numpy.cov}$. 

In [None]:
C1 = np.cov(mtcars,rowvar=False)

Show that C and C1 are the same.
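
A sketch of one way to make the comparison follows. To keep it self-contained it uses randomly generated stand-in data of the same shape (32 rows, 6 columns) rather than the real cars file, so the numbers themselves mean nothing; only the comparison pattern is of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(32, 6))   # stand-in data, NOT the real cars data

# Covariance from the definition: centre the columns, then X X^T / (M - 1).
XT = data - data.mean(axis=0)
M = data.shape[0]
C = XT.T @ XT / (M - 1)

# Covariance from NumPy; rowvar=False treats columns as variables.
C1 = np.cov(data, rowvar=False)

print(np.allclose(C, C1))          # equal up to rounding?
print(np.max(np.abs(C - C1)))      # size of the largest difference
```

Comparing with $\tt{np.allclose}$ rather than `==` avoids false mismatches caused by rounding in the last few digits.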

Having computed the covariance matrix, we want to express the data we have (written as $\mathbf{X}^\intercal$) in terms of a new set of coordinates $\mathbf{Y}^\intercal$ as

$
\mathbf{Y}^\intercal \;\; = \;\; \mathbf{X}^\intercal \, \mathbf{R}^\intercal  \;\;,
$

where the covariance matrix of $\mathbf{Y}^\intercal$, denoted $C_Y$, is diagonal:

$
C_Y \;\; = \;\; \mathbf{R} \, \mathbf{C} \, \mathbf{R}^\intercal \;\; .
$

From the notes we see that if $\mathbf{D}$ and $\mathbf{E}$ hold the eigenvalues and eigenvectors of $\mathbf{C}$, then

$
\mathbf{R} = \mathbf{E}^\intercal \;\; 
$
and the eigenvalues are the variances of the columns of $\mathbf{Y}^\intercal$.


In [None]:
D,E = np.linalg.eig(C)

In [None]:
YT = XT.dot(E)

We can check that YT has the same dimensions as XT.

In [None]:
np.shape(YT)
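
As a further sanity check, the covariance matrix of the transformed data should be diagonal, with the eigenvalues on the diagonal. The sketch below illustrates this on randomly generated stand-in data with correlated columns (not the real cars data), following the same steps as above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in data, NOT the real cars data: mixing the columns makes them correlated.
data = rng.normal(size=(32, 6)) @ rng.normal(size=(6, 6))

XT = data - data.mean(axis=0)      # centred data
C = np.cov(data, rowvar=False)     # covariance of the original variables
D, E = np.linalg.eig(C)            # eigenvalues and eigenvectors of C

YT = XT @ E                        # data in the new coordinates
CY = np.cov(YT, rowvar=False)      # should equal diag(D) up to rounding

off_diagonal = CY - np.diag(np.diag(CY))
print(np.max(np.abs(off_diagonal)))    # close to zero
print(np.allclose(np.diag(CY), D))     # variances match the eigenvalues
```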

Inspect the variances of $\mathbf{Y}^\intercal$. Do you think that using the first two components is a reasonable approximation? Why (or why not)?

In [None]:
print(D)
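
Note that $\tt{numpy.linalg.eig}$ does not guarantee any particular ordering of the eigenvalues, so it is worth confirming that the largest variances really do come first. The sketch below uses a hypothetical helper, `sort_eigenpairs`, and made-up illustrative values (not the cars results) to show how the eigenpairs could be sorted and the fraction of variance per component computed.

```python
import numpy as np

def sort_eigenpairs(D, E):
    """Hypothetical helper: eigenvalues in descending order, eigenvector columns reordered to match."""
    order = np.argsort(D)[::-1]
    return D[order], E[:, order]

# Illustrative values only, NOT the eigenvalues of the cars data.
D_demo = np.array([0.5, 4.0, 1.5])
E_demo = np.eye(3)

D_sorted, E_sorted = sort_eigenpairs(D_demo, E_demo)

# Fraction of the total variance carried by each component, and the running total.
explained = D_sorted / D_sorted.sum()
print(D_sorted)
print(np.cumsum(explained))
```

If the running total is already close to 1 after two components, the first two components capture most of the variation in the data.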

We can now plot the data using the new axes (i.e. in terms of $\mathbf{Y}^\intercal$). To do this we will need matplotlib. 

In [None]:
import matplotlib.pyplot as plt

We can now plot the components against each other. These are simply the columns of YT.

In [None]:
PC1 = np.transpose(YT)[0]

In [None]:
PC2 = np.transpose(YT)[1]

In [None]:
plt.plot(PC1, PC2, 'o', label='PC1 versus PC2', markersize=10)

There is a notable outlier in this plot. See if you can find the relevant row of YT which will correspond to the row of XT and then determine what car it is. Any possible reason it is an outlier?
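
One possible way to locate such a point, sketched here with made-up component scores rather than the real ones: since the data are centred, the outlier is the point furthest from the origin in the PC1/PC2 plane, and $\tt{np.argmax}$ returns its row index.

```python
import numpy as np

# Made-up scores for illustration only, NOT the real cars results.
PC1 = np.array([0.1, -0.3, 0.2, 9.0, -0.1, 0.4])
PC2 = np.array([0.2, 0.1, -0.4, 7.5, 0.3, -0.2])

# Distance of each point from the origin in the PC1/PC2 plane.
dist = np.hypot(PC1, PC2)

# Row index (counting from 0) of the most distant point.
outlier_row = int(np.argmax(dist))
print(outlier_row)
```

The same row index in the original data identifies the car, remembering that Python counts rows from zero.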



As a further check, plot PC1 against PC3. What is notable about the scale?