# Example Notebook

This notebook shows how to set up a notebook within the current directory structure.

After importing `init_notebook`, you can import `test_module` and its `test_function`, both located in the `src` directory. Files in the `data` folder can also be opened directly. All paths are relative to this directory, so files can be saved to it directly as well.

In [1]:
import init_notebook
from test_module import test_function
import pca
from pca import PCA, plot_data_with_pcs

import numpy as np

test_function()

I was imported!
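
For readers wondering how such an import hook can work, here is a minimal sketch of the general technique. It is an assumption, not the actual contents of `init_notebook`; the helper name `register_source_dir` and its parameters are hypothetical.

```python
import sys
from pathlib import Path


def register_source_dir(project_root, source_dir_name="src"):
    """Put <project_root>/<source_dir_name> on sys.path (hypothetical helper)."""
    source_dir = (Path(project_root) / source_dir_name).resolve()
    source_dir_str = str(source_dir)
    # Avoid duplicate entries when the notebook cell is re-run.
    if source_dir_str not in sys.path:
        sys.path.insert(0, source_dir_str)
    return source_dir


# Example: make modules under ./src importable from the current directory.
registered = register_source_dir(Path.cwd())
```

Once the directory is on `sys.path`, a plain `import` of any module inside it resolves without a package install.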


# Principal Component Analysis
We load the data from a file and perform PCA.

The first principal component accounts for over 99.3% of the variance in the data, which suggests that the other principal component is mostly noise.

In [2]:
with open("pca_dataset.txt", 'r') as file:
    # Load the data into a NumPy ndarray of shape (100, 2)
    data_matrix = np.loadtxt(file, delimiter=' ')

pca_result = PCA.pca(data_matrix)

U, S, Vh = pca_result
E = pca_result.energy

print(f'Energies: {E}\nsum should be 1: {np.sum(E)}\nSingular values ordered by magnitude: {S}')

Energies: [0.99314266 0.00685734]
sum should be 1: 1.0
Singular values ordered by magnitude: [9.94340494 0.82624201]
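
The printed energies are consistent with a common definition of energy: each squared singular value's share of the sum of all squared singular values. The sketch below recomputes them from the printed singular values. That `pca_result.energy` is computed this way is an assumption, since the `pca` module itself is not shown here.

```python
import numpy as np

# Singular values copied from the output above.
singular_values = np.array([9.94340494, 0.82624201])

# Assumption: energy_i = s_i**2 / sum_j s_j**2 (share of total variance).
squared = singular_values ** 2
energies = squared / squared.sum()

print(energies)
```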


# Reversing PCA to verify correctness

In [3]:
data_matrix_reconstructed = pca_result.reverse_pca(2, add_mean=True)

# Print original (o) and reconstructed (r) rows side by side for comparison
display_side_by_side = True
if display_side_by_side:
    for o, r in zip(data_matrix, data_matrix_reconstructed):
        print(f'o: {o}, r: {r}')

assert np.allclose(data_matrix, data_matrix_reconstructed)

AssertionError: 
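
As a reference point for the check above, a full-rank round trip can be sketched with NumPy alone. This does not use the notebook's `PCA` class; it assumes PCA is performed via an SVD of the mean-centered data, and the synthetic data and variable names are illustrative. With all components kept and the mean added back, the reconstruction should match the input up to floating-point error, so a failing assertion may point to a mismatch in centering, scaling, or truncation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 100 points near a line in 2D.
t = rng.normal(size=100)
noise = 0.05 * rng.normal(size=100)
points = np.column_stack([t, 0.5 * t + noise]) + np.array([3.0, -1.0])

# Center the data, then decompose: centered = U @ diag(S) @ Vh.
mean = points.mean(axis=0)
centered = points - mean
U, S, Vh = np.linalg.svd(centered, full_matrices=False)

# Reverse with all components and add the mean back.
reconstructed = (U * S) @ Vh + mean

print(np.allclose(points, reconstructed))
```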

# Plotting data

The plot looks strongly linear, which further supports the assumption that a linear model suits this dataset well. The green principal component appears to be just noise, with no discernible pattern, while the points align closely with the red principal component.

PCA works well for data that shows affine-linear behavior, as opposed to data lying on curved manifolds, which is why it makes sense to use PCA for this data.

In [None]:
plot_data_with_pcs(data_matrix, Vh)

# Approximate 1D

Dropping the lesser principal component and keeping only the first one approximates the data by a 1D line. This is done by setting the second singular value to 0 in the matrix of singular values given by `S`, and then reversing the PCA by multiplying the matrices `U`, `S`, and `Vh`.

In [None]:
approximated_data = pca_result.reverse_pca(1)

print(approximated_data)

plot_data_with_pcs(approximated_data, pca_result.Vh)
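
The truncation described in this section can be sketched with NumPy alone. This is not the implementation of `reverse_pca`; it assumes an SVD of the mean-centered data, and the synthetic data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2D data lying close to a line.
t = rng.normal(size=50)
points = np.column_stack([2.0 * t, t + 0.1 * rng.normal(size=50)])

mean = points.mean(axis=0)
U, S, Vh = np.linalg.svd(points - mean, full_matrices=False)

# Zero every singular value except the largest one.
S_truncated = S.copy()
S_truncated[1:] = 0.0

# Reverse the decomposition; the result lies on a line through the mean.
approximation = U @ np.diag(S_truncated) @ Vh + mean

# Component of the approximation along the dropped direction (should vanish).
off_axis = (approximation - mean) @ Vh[1]
# The approximation error in the Frobenius norm equals the dropped singular value.
error = np.linalg.norm(points - approximation)

print(np.abs(off_axis).max(), error, S[1])
```

The approximation error equals the discarded singular value, which is why a small second singular value means little information is lost.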