# STL10 PCA and GMM Experiment
#### by Pio Lauren T. Mendoza

In this notebook, Principal Component Analysis (PCA) and Gaussian Mixture Models (GMMs) are built from scratch using only pure Python and NumPy. Their performance is then evaluated under various conditions.

## Importing Modules

In [1]:
from numpy.random import default_rng
from PIL import Image
from torchvision import datasets

import matplotlib.pyplot as plt
import numpy as np

rng = default_rng()

## Principal Component Analysis

High-dimensional data is ubiquitous these days. Even a handheld smartphone can capture high-resolution RGB images, and every pixel of such an image is one dimension of the data, so a single image is a very high-dimensional data point. High-dimensional data can be difficult to store because of its size, and difficult to analyze and visualize because of the sheer number of dimensions. Fortunately, such data is usually highly redundant: many dimensions are correlated with one another, and this correlation can be exploited for dimensionality reduction. Principal component analysis is an algorithm for linear dimensionality reduction [\[1\]](#1). PCA looks for a small set of basis vectors that represents the data compactly. These basis vectors were called *"best fitting lines"* in Pearson's original 1901 paper [\[2\]](#2).
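The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not necessarily the implementation used later in this notebook: it assumes PCA is computed from the eigendecomposition of the sample covariance matrix, and the function names (`pca_fit`, `pca_transform`, `pca_inverse_transform`) and the toy data are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Fit PCA on X of shape (n_samples, n_features).

    Returns the feature mean, the top principal directions as columns,
    and the variance captured along each direction.
    """
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Sample covariance matrix of the centered data.
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]


def pca_transform(X, mean, components):
    """Project X onto the principal directions."""
    return (X - mean) @ components


def pca_inverse_transform(Z, mean, components):
    """Map low-dimensional codes back to the original feature space."""
    return Z @ components.T + mean


# Toy data (an assumption for illustration): 2-D points lying close to a line.
demo_rng = np.random.default_rng(0)
t = demo_rng.normal(size=(500, 1))
X_demo = np.hstack([t, 2 * t]) + 0.01 * demo_rng.normal(size=(500, 2))

mean, components, variances = pca_fit(X_demo, n_components=1)
Z = pca_transform(X_demo, mean, components)
X_rec = pca_inverse_transform(Z, mean, components)
```

Because the second coordinate of the toy data is almost exactly twice the first, a single principal direction is enough to reconstruct the points with very little error, which is the kind of redundancy PCA exploits.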

## Data Preparation

The `torchvision` library will be used for data loading. For more information about the `torchvision` API for the STL10 dataset [\[3\]](#3), see the [documentation](https://pytorch.org/vision/stable/_modules/torchvision/datasets/stl10.html). The STL10 unlabeled split has 100k images, but only 10k of them will be used in this experiment. These 10k images are sampled at random, without replacement, from the unlabeled split. Because of limited computational power, the RGB images are first converted to grayscale before being used.

In [2]:
ds_unlabeled = datasets.STL10(root="./data", split="unlabeled", download=True)
classes = ds_unlabeled.classes
print(f"STL10 unlabeled split size: {len(ds_unlabeled)}")

Files already downloaded and verified
STL10 unlabeled split size: 100000


In [3]:
# Sample 10k images without replacement and scale pixel values to [0, 1].
data = rng.choice(ds_unlabeled.data, size=10_000, replace=False) / 255
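The grayscale conversion mentioned above could look like the following sketch. It makes several assumptions: that the image array is laid out as `(N, 3, 96, 96)` (channels first), that the common ITU-R BT.601 luma weights are an acceptable conversion, and that each image is flattened into one row for PCA. The notebook's actual conversion may differ, for example by using `PIL`. A small random array stands in for `data` so that the example runs without downloading the dataset.

```python
import numpy as np

# Stand-in for `data` (an assumption): 4 random RGB images in channels-first
# layout, scaled to [0, 1] like the sampled STL10 images.
demo_rng = np.random.default_rng(0)
rgb = demo_rng.integers(0, 256, size=(4, 3, 96, 96), dtype=np.uint8) / 255

# ITU-R BT.601 luma weights for the R, G and B channels (they sum to 1).
weights = np.array([0.299, 0.587, 0.114])

# Weighted sum over the channel axis: (N, 3, H, W) -> (N, H, W).
gray = np.einsum("c,nchw->nhw", weights, rgb)

# Flatten each image into one row: (N, H, W) -> (N, H * W).
flat = gray.reshape(len(gray), -1)
```

With real STL10 images, each row of `flat` would have 96 × 96 = 9216 features instead of the 27648 values of the RGB version, which is what makes the later computations cheaper.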

## References

<a id='1'>[\[1\] M. P. Deisenroth, A. A. Faisal, and C. S. Ong, Mathematics for Machine Learning. Cambridge, United Kingdom: Cambridge University Press, 2020.](https://mml-book.com)</a>  
<a id='2'>[\[2\] K. Pearson, “On lines and planes of closest fit to systems of points in space”, Philosophical Magazine, vol. 2, pp. 559-572, 1901.](https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1056/2012/10/pearson1901.pdf)</a>  
<a id='3'>[\[3\] A. Coates, A. Ng, and H. Lee, “An Analysis of Single Layer Networks in Unsupervised Feature Learning”, in AISTATS, 2011.](https://cs.stanford.edu/~acoates/stl10/)</a>  
