In [4]:
import numpy as np
import matplotlib.pyplot as plt

## First, let's initialize some low-dimensional (ground truth) data in a high-dimensional space...

In [5]:
# true dimensions of interest
# NOTE: in high dimensions, randomly chosen directions tend to be nearly orthogonal

num_data = 1_000    # number of data points (e.g. time steps in a recording)
dim = 100           # number of dimensions in data (e.g. number of neurons in the recording)

# generate two ground truth directions that data should fall along, up to slight (noisy) perturbations
v1 = np.random.normal(size=dim)
v1 /= np.linalg.norm(v1)

v2 = np.random.normal(size=dim)
v2 /= np.linalg.norm(v2)

noise_std = 0.05

# generate the data
X = np.zeros((num_data, dim))
for i in range(num_data):
    noise_vec = noise_std * np.random.normal(size=dim)

    if i < num_data / 2:
        X[i, :] = v1 * np.random.normal() + noise_vec
    else:
        X[i, :] = v2 * np.random.normal() + noise_vec


print(f'Number of data points: {num_data}')
print(f'Number of dimensions: {dim}')


Number of data points: 1000
Number of dimensions: 100
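
The note in the cell above says that random directions in high dimensions tend to be nearly orthogonal. Here is a minimal sketch of how one might check that numerically. The helper names (`random_unit_vector`, `mean_abs_cosine`), the seed, and the number of pairs are illustrative choices and not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and normalize it to unit length."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)


def mean_abs_cosine(dim, num_pairs, rng):
    """Average |cos(angle)| between independently drawn random unit vectors."""
    cosines = [
        abs(random_unit_vector(dim, rng) @ random_unit_vector(dim, rng))
        for _ in range(num_pairs)
    ]
    return float(np.mean(cosines))


low_dim_overlap = mean_abs_cosine(2, 2000, rng)     # directions in the plane
high_dim_overlap = mean_abs_cosine(100, 2000, rng)  # same dimensionality as the data above

print(f'Mean |cosine| in 2 dimensions:   {low_dim_overlap:.3f}')
print(f'Mean |cosine| in 100 dimensions: {high_dim_overlap:.3f}')
```

The typical overlap shrinks roughly like $1/\sqrt{\text{dim}}$, which is why `v1` and `v2` above are close to (but not exactly) orthogonal.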


## Do we suspect that the data is low dimensional?

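We can quantify this by first computing the covariance matrix $C = \frac1n X^T X \in \mathbb{R}^{\text{num_features } \times \text{ num_features}}$ (note that $X \in \mathbb{R}^{\text{num_data } \times \text{ num_features}}$, with $n$ = `num_data` and num_features = `dim` in the code above), and then calculating its **participation ratio** $\rho = \frac{(\text{Tr}(C))^2}{\text{Tr}(C^2)}$. This formula for $C$ treats the data as mean-zero, which holds in expectation for the data generated above.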

Here, the trace operator $\text{Tr}$ is the (linear) function that takes in an $N \times N$ matrix and outputs the sum of the diagonal elements of that matrix: $\text{Tr}(A) = \sum_{i=1}^N A_{ii}$. Interestingly, it turns out that the trace of a matrix is *also* the sum of its eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_N$.
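
As a quick sanity check of the trace/eigenvalue fact, here is a small sketch on a made-up symmetric matrix. The matrix `A`, the helper matrix `B`, and the seed are hypothetical and unrelated to the data above. A numerical check like this is not a proof, but it is a handy way to catch mistakes.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

# Build a small symmetric matrix (hypothetical example, not the data covariance).
B = rng.normal(size=(5, 5))
A = B.T @ B / 5

trace_A = np.trace(A)                # sum of the diagonal entries
eigvals_A = np.linalg.eigvalsh(A)    # eigenvalues of a symmetric matrix
sum_eigvals_A = eigvals_A.sum()

print(f'Trace:              {trace_A:.6f}')
print(f'Sum of eigenvalues: {sum_eigvals_A:.6f}')
print(f'Equal (up to floating point error): {np.isclose(trace_A, sum_eigvals_A)}')
```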

**Question:** Show that the eigenvalues of $C^2$ are $\lambda_1^2, \dots, \lambda_N^2$. Using this, and the fact that the trace operator returns the sum of the eigenvalues, show that $\rho = \frac{\left(\sum_{k=1}^N \lambda_k\right)^2}{\sum_{k=1}^N \lambda_k^2}$.

Collectively, the number of nonzero eigenvalues and their relative sizes tell us how much the data vary along the different eigenvector directions. In particular, if many of the eigenvalues are zero, or close to zero, we should expect the data to be low-dimensional!

**Question:** Suppose that $\lambda_1 = \lambda_2 = \dots = \lambda_N$. What does $\rho$ equal in this case? Now, suppose that $\lambda_1 \neq 0$ but that $\lambda_2 = \lambda_3 = \dots = \lambda_N = 0$; what is $\rho$ in this case? What if $\lambda_1 = \lambda_2 \neq 0$ and all other eigenvalues are zero?

Do you see how a lower participation ratio $\rho$ indicates lower data dimensionality?
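
To build intuition, here is a minimal sketch that evaluates the eigenvalue form of $\rho$ on a toy spectrum (deliberately different from the cases in the question above, so you can still work those out by hand). The function name `participation_ratio_from_eigvals` and the toy eigenvalues are illustrative assumptions.

```python
import numpy as np


def participation_ratio_from_eigvals(eigvals):
    """Compute (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals.sum() ** 2 / np.sum(eigvals ** 2)


# Hypothetical spectrum: one large eigenvalue, one smaller one, two zeros.
toy_eigvals = [3.0, 1.0, 0.0, 0.0]
toy_rho = participation_ratio_from_eigvals(toy_eigvals)

# (3 + 1)^2 / (3^2 + 1^2) = 16 / 10 = 1.6
print(f'Participation ratio of the toy spectrum: {toy_rho}')
```

Even though this toy spectrum lives in four dimensions, its participation ratio sits between 1 and 2, reflecting one dominant direction plus a weaker second one.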

## Now, compute the participation ratio of the data covariance matrix! Is it large or small relative to the total dimensionality $N$? Does this indicate we should try using PCA?

In [6]:
## ADD CODE HERE
# Hint: np.trace() is the numpy function that calculates the trace of a square matrix!

# C = ...

# rho = ...

# print results
print(f'The participation ratio is {rho}, and the total data dimensionality is {X.shape[1]}.')


## Now, compute the covariance matrix of the data, and compute its SVD / eigendecomposition!

In [7]:
## ADD CODE HERE
# Hint: there are many functions for computing the eigendecomposition and/or SVD.
# For example, np.linalg.eigh() (among others), and also some analogous scipy functions.
# Note: np.linalg.eigh() returns the eigenvalues in ascending order.

# C = ...

# U = ...

# top_eigvals = ...

print(U.shape)
print(top_eigvals)
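
One detail that is easy to trip over: `np.linalg.eigh()` returns eigenvalues in ascending order, with the matching eigenvectors as columns. The sketch below, on a made-up diagonal matrix `toy_C` (not the data covariance), shows one possible way to reorder them so that the largest eigenvalue comes first. All variable names here are illustrative.

```python
import numpy as np

# Hypothetical symmetric matrix with known eigenvalues 2, 5, and 1.
toy_C = np.array([[2.0, 0.0, 0.0],
                  [0.0, 5.0, 0.0],
                  [0.0, 0.0, 1.0]])

# eigh returns eigenvalues in ascending order; column i of eigvecs_asc
# is the eigenvector belonging to eigvals_asc[i].
eigvals_asc, eigvecs_asc = np.linalg.eigh(toy_C)

# Reorder so that the largest eigenvalue (and its eigenvector) comes first.
order = np.argsort(eigvals_asc)[::-1]
eigvals_desc = eigvals_asc[order]
eigvecs_desc = eigvecs_asc[:, order]

print(eigvals_asc)   # ascending
print(eigvals_desc)  # descending
```

If the eigenvector columns are not reordered along with the eigenvalues, the "first two" columns will correspond to the smallest eigenvalues instead of the largest.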


## Finally, take the top two eigenvectors and project the data onto their span!

In [1]:
# ADD CODE HERE

# X_proj = ...  (the shape should be num_data x 2)

print(X_proj.shape)

# plot the data projected onto the first two PCs, along with
# the two ground truth direction vectors
# (assumes the first two columns of U are the top two eigenvectors)

v1_proj = U.T @ v1
v2_proj = U.T @ v2

plt.figure(figsize=(10, 10))
plt.plot(X_proj[:, 0], X_proj[:, 1], 'o', zorder=1)

plt.quiver(0, 0, v1_proj[0], v1_proj[1],
           angles='xy', scale_units='xy', scale=1,
           color='r', width=0.005, headwidth=5, headlength=7, zorder=3)

plt.quiver(0, 0, v2_proj[0], v2_proj[1],
           angles='xy', scale_units='xy', scale=1,
           color='darkgreen', width=0.005, headwidth=5, headlength=7, zorder=3)

plt.gca().set_aspect('equal')
plt.xlabel('PC1', fontsize=16)
plt.ylabel('PC2', fontsize=16)
plt.show()
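
For reference, here is a rough sketch of the projection step on a separate toy dataset, so it does not fill in the exercise cell directly. The names `toy_X`, `toy_cov`, `top2_vecs`, and `toy_X_proj`, as well as the per-axis scales, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility

# Hypothetical data: 200 points in 5 dimensions, with most variance along two axes.
toy_X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 0.1, 0.1, 0.1])

# Covariance (treating the data as mean-zero, as in the text above).
toy_cov = toy_X.T @ toy_X / toy_X.shape[0]

# eigh returns ascending eigenvalues, so reverse the columns to put the top ones first.
toy_eigvals, toy_eigvecs = np.linalg.eigh(toy_cov)
top2_vecs = toy_eigvecs[:, ::-1][:, :2]   # shape (5, 2)

# Each row of toy_X_proj holds one data point's coordinates along the top two eigenvectors.
toy_X_proj = toy_X @ top2_vecs

print(toy_X_proj.shape)
```

The mean squared value of each projected coordinate equals the corresponding eigenvalue, which is one way to see why eigenvalues measure variance along eigenvector directions.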

## How much of the variance did we explain with two PCs?



Recall that, intuitively, the sizes of the eigenvalues of the covariance matrix $C = \frac1n X^T X$ (where $n$ is the number of data points) tell us to what extent the data varies along the different eigenvector directions. By using more principal components, we capture more and more of the variance along different eigen-directions.

To calculate what proportion of the variance in the data is accounted for with $k \leq N$ principal components, we arrange the eigenvalues in decreasing order as $\lambda_1, \lambda_2, \dots, \lambda_N$ and then compute the quantity $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^N \lambda_i}$. Let us now compute the proportion of the variance explained in our present case:

In [2]:
# ADD CODE HERE

# top_2_eigvals = ...

# total_eigval_sum = ...

# prop_var_explained = ...

print(f'Proportion of total variance in data explained with 2 PCs: {prop_var_explained}.')
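
The explained-variance formula above can also be sketched on a toy spectrum. The eigenvalues in `toy_spectrum` are made up for illustration and are not the eigenvalues of the data covariance.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order.
toy_spectrum = np.array([5.0, 3.0, 1.0, 1.0])

# Entry k-1 is the proportion of variance explained by the top k components.
cumulative_explained = np.cumsum(toy_spectrum) / toy_spectrum.sum()

# With k = 2: (5 + 3) / (5 + 3 + 1 + 1) = 8 / 10
print(f'Proportion explained by the top 2 components: {cumulative_explained[1]}')
print(f'Cumulative proportions for k = 1..4: {cumulative_explained}')
```

Plotting the cumulative proportions against $k$ is a common way to decide how many components to keep.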


**Conceptual Question:** Where is the remaining variance coming from? Is it important to capture this remaining variance?

## Congratulations, you are now a PCA wizard! Feel free to tinker around with any part of this exercise (e.g. the underlying data, the noise level, the number of PCs used, etc.) and see how the results change!