# Notes on Chapter 5: Compressing Data via Dimensionality Reduction

## Summary

In Chapter 4, Raschka covered a number of methods for reducing the dimensions of a dataset through **feature selection**, including L1 regularization, SBS, and random forests. In different ways, these algorithms accomplish the same goal of reducing the **feature space** of a dataset by eliminating those features that provide the least amount of information to a classifier (or whose absence has the least impact on model performance). These techniques can help improve the efficiency of a model, or combat the "curse of dimensionality" when using nearest-neighbor methods.

As we saw when dealing with missing data, however, eliminating entire features from the dataset can also remove *too much* information, increasing a classifier's bias to an unacceptable degree. How might we reduce the dimensions of a dataset while preserving necessary information?

In this chapter, Raschka covers three techniques for reducing the feature space of a dataset through **transformation** (also referred to as **dimensionality reduction** or **compression**):

1. [Principal Component Analysis, or PCA](#Principal-Component-Analysis:-unsupervised-dimensionality-reduction) (unsupervised, linear)
2. Linear Discriminant Analysis, or LDA (supervised, linear, maximizes class separability)
3. Kernel PCA (unsupervised, nonlinear)

These techniques can drastically reduce the dimensions of a dataset while preserving as much information as possible. 

Throughout, I supplement notes on Raschka's writing with insight from *An Introduction to Statistical Learning*, which covers the math behind these techniques in greater detail.

## Principal Component Analysis: unsupervised dimensionality reduction

### Background

Principal Component Analysis (PCA) is an **unsupervised algorithm** for dimensionality reduction, meaning that it doesn't use class labels at all: it looks only at the features themselves. Using magic (AKA linear algebra), PCA identifies patterns in a dataset based on correlations between features, and transforms the data in such a way that the new dataset retains as much of the old dataset's variance as possible.

In PCA, **"variance"** refers to a different phenomenon than it does when we use it to analyze classifiers. Here, we're interested in maximizing the *statistical* variance of our dataset: the amount of "spread" represented by each feature.

At a high level, PCA works by finding the "directions" in the data that account for most of its variance, subject to the constraint that the directions not be correlated (i.e., that they are orthogonal to one another). Put another way, PCA looks for **a set of new axes** that we can project the data onto while preserving as much spread as possible. We call the new axes (or directions) "principal components." One of the nicest parts of PCA is that it finds directions in declining order of importance, and assigns a "score" to each direction: because of this, we can easily choose however many dimensions we want to project our data into.

Mathematically, we can define our goal as producing a $k$-dimensional subspace of a $d$-dimensional feature space by finding a $d \times k$ transformation matrix, $W$, such that:

$$ X \cdot W = z $$

Where $X$ is our dataset in $d$ dimensions, and $z$ is a dataset in $k$ dimensions.
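To make the shapes concrete, here is a minimal NumPy sketch. The sizes and the random matrices are made up for illustration, and `W` here is just a placeholder of the right shape, not a matrix that PCA would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 samples, d = 4 original features, k = 2 new dimensions
n, d, k = 6, 4, 2

X = rng.normal(size=(n, d))   # stand-in dataset in d dimensions
W = rng.normal(size=(d, k))   # placeholder transformation matrix (NOT yet PCA)

# Each row of X (one sample) is mapped to a row of Z with k entries
Z = X @ W

print(X.shape, W.shape, Z.shape)
```

The only point is the bookkeeping: an $n \times d$ dataset times a $d \times k$ matrix yields an $n \times k$ dataset.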

### Algorithm

In broad strokes, the steps to perform PCA are as follows:

1. Standardize the features of $X$ (PCA is sensitive to data scaling, since it looks for the directions of maximum variance, and features measured on larger scales would otherwise dominate)
2. Construct a **covariance matrix**, recording the covariance between each pair of features
3. Decompose the covariance matrix into its **eigenvectors and eigenvalues**
4. Select the $k$ best eigenvectors by choosing the ones that correspond to the $k$ largest eigenvalues
5. Build the transformation matrix $W$ from the $k$ best eigenvectors
6. Transform $X$ using $W$
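The six steps above can be sketched in a few lines of NumPy. This is my own rough sketch rather than Raschka's code: the function name `pca_project` and the toy data are made up for illustration, and I pass `bias=True` to `np.cov` so that it divides by $n$, matching the definition of covariance used in these notes.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the features (d x d); bias=True divides by n
    cov = np.cov(X_std, rowvar=False, bias=True)

    # 3. Eigendecomposition (eigh is meant for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenpairs by eigenvalue, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Build W (d x k) from the k leading eigenvectors, one per column
    W = eigvecs[:, :k]

    # 6. Transform the standardized data
    return X_std @ W, eigvals

# Toy data: 100 samples, 5 features, with feature 1 nearly a multiple of feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

Z, eigvals = pca_project(X, k=2)
print(Z.shape)                    # one row per sample, k columns
print(eigvals / eigvals.sum())    # share of total variance per direction
```

Dividing each eigenvalue by the sum of all eigenvalues gives one way to read the "score" of each direction: the fraction of the total variance that it accounts for.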

To understand this algorithm, we'll need to briefly cover some explanatory background on **covariance** and **eigenvectors**.

### The covariance matrix

**Covariance** is a statistic that quantifies how closely related two features are to each other. We define it mathematically as the *mean product of the differences from the means* of two features (here, $x$ and $y$, but when considering our dataset we'll represent the features as $X_{i}$ and $X_{j}$). That's a mouthful, but it's a lot simpler in algebraic notation:

$$ cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \mu_{x})(y^{(i)} - \mu_{y}) $$

Where $\mu_{x}$ and $\mu_{y}$ are the means of each feature. Since in PCA we standardize the features prior to analyzing their covariances, each feature has a mean of zero and the definition is even simpler:

$$ cov(x_{std}, y_{std}) = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}_{std} y^{(i)}_{std} $$

By taking the product of the differences from the means (phew), we can neatly capture two properties of the relationship between the features: 1) what direction the features trend in (think about the quadrants of a coordinate plane, centered around the mean) and 2) how tightly they cluster together (or: do values of $x$ far from $\mu_{x}$ imply values of $y$ far from $\mu_{y}$?).
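As a quick sanity check, the definition translates almost word for word into NumPy. The two small arrays and the helper name `covariance` are made up for illustration; the comparison against `np.cov` uses `bias=True` so that NumPy also divides by $n$.

```python
import numpy as np

# Two made-up features measured on the same five samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def covariance(x, y):
    """Mean product of the differences from the means."""
    return np.mean((x - x.mean()) * (y - y.mean()))

cov_xy = covariance(x, y)

# After standardizing, the means are zero, so the formula collapses
# to the mean of the elementwise product
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()
cov_std = np.mean(x_std * y_std)

print(cov_xy, np.cov(x, y, bias=True)[0, 1])
print(cov_std, np.corrcoef(x, y)[0, 1])
```

The second line of output illustrates a handy side effect of standardizing: the covariance of two standardized features is the same thing as their Pearson correlation.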

The **covariance matrix** of a dataset is a square matrix that records the covariance between every possible pair of features in the dataset. For a dataset in $d$ dimensions, with features indexed $1, 2, 3, \dots, d$, we can represent the covariance matrix $\Sigma$ as:

$$ \Sigma = \begin{bmatrix} cov(1,1) & cov(1,2) & cov(1,3) & \dots & cov(1,d) \\
                            cov(2,1) & cov(2,2) & cov(2,3) & \dots & cov(2,d) \\
                            cov(3,1) & cov(3,2) & cov(3,3) & \dots & cov(3,d) \\
                            \vdots & \vdots & \vdots & \ddots & \vdots \\
                            cov(d,1) & cov(d,2) & cov(d,3) & \dots & cov(d,d) \end{bmatrix} $$
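For standardized data, the whole matrix can be computed in one matrix product. Here is a small sketch on a made-up dataset with three features (the random data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                    # 50 samples, 3 features

# Standardize so every feature has mean 0 and variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
n = X_std.shape[0]

# Entry (i, j) is the mean of the products of features i and j
sigma = X_std.T @ X_std / n

print(sigma.shape)                              # square: one row and column per feature
print(np.allclose(sigma, sigma.T))              # cov(i, j) equals cov(j, i)
print(np.allclose(np.diag(sigma), 1.0))         # cov(i, i) is the variance of feature i
```

Two properties are worth noticing: the matrix is symmetric, and its diagonal holds the variance of each feature (all ones here, because the features were standardized).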

### What the heck's an eigenvector?





