# Principal Component Analysis (PCA)

## Curse of dimensionality
In essence, machine learning algorithms build predictive models by finding structure in a given dataset, and typically the more data an algorithm is fed, the better the model should work (provided the model stays general). However, for an algorithm to train properly and comprehensively on a given dataset, the data should cover as much of the 'space' of input feature values as we'd like the model to handle; thus, the amount of data required grows exponentially with the number of input features. This is known as the 'Curse of Dimensionality'.

For example, let's say we want to train a classifier using the simple K-nearest neighbors algorithm to identify whether a person enjoys playing basketball based on his/her height. We would collect the heights of $n$ subjects and feed them to the algorithm. Suppose it turns out that the prediction isn't good enough: a person's height isn't a good indicator of whether he/she enjoys playing the sport, so the natural thing is to include additional features in training the classifier, like age, weight, shoe size, stride length, etc.

Now, we are faced with a dilemma: the $n$ heights that we collected are well spread out (e.g., uniformly spread between 145cm and 190cm), but if we also include the weights of those subjects, the same $n$ datapoints must now cover a 2-D height-weight plane, so the coverage becomes sparser. That is, only a small number of subjects will have a similar height $H$, which means that the weights at this particular height aren't well spread out. Consequently, since the algorithm can only learn from what it's given, it makes predictions that are heavily influenced by the handful of weights at each height.

The image below reflects the dilemma that we encounter: empty spaces are introduced as we increase the number of features to train on, indicating a lack of the information the algorithm needs to make decent predictions. In 1-D we can see that the 4 segments are well populated with data, but the same data in 2-D populates the 4x4 = 16 regions less densely, and is even more sparsely spread in 3-D with 4x4x4 = 64 regions.

![](https://images.deepai.org/glossary-terms/curse-of-dimensionality-61461.jpg)

Image credit: https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality
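
The sparsity described above can be checked numerically. The following is a minimal sketch, assuming uniformly distributed synthetic points and 4 bins per axis as in the image; the function name `occupied_fraction` and the choice of 20 points are illustrative assumptions, not part of the original example.

```python
import random

def occupied_fraction(n_points, n_dims, bins_per_axis=4, seed=0):
    """Fraction of grid cells that contain at least one of n_points uniform samples."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Each coordinate is uniform in [0, 1); map it to a bin index 0..bins_per_axis-1.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** n_dims

# The same number of points covers a shrinking share of the cells as dimensions grow.
for d in (1, 2, 3, 6):
    frac = occupied_fraction(n_points=20, n_dims=d)
    print(f"{d}-D: {4 ** d} cells, {frac:.1%} occupied")
```

With a fixed number of points, the occupied fraction can never exceed `n_points / bins_per_axis ** n_dims`, so it is forced toward zero as features are added.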

## Principal component analysis (PCA) - a dimensionality reduction solution
### What is it and why is it useful?
Ideally, to address this issue, we want to reduce the dimensionality of the data, and the easiest way is to only take the features we know contribute most significantly to the outcome of the classifier. In the previous example, it's likely that the height of a subject has a bigger influence on his/her enjoyment in playing basketball than said subject's birth month, so we wouldn't select that as an input feature.

In reality, however, the input features that we select, to the best of our understanding, usually have some degree of importance to the prediction outcome (otherwise they would've been omitted to begin with). That being said, the key idea remains: **not all features are equally important**, which means that we can find some (linear) combination of input features to serve as our 'new' feature, and that is the gist of principal component analysis (PCA). Essentially, we pick proportions/fractions of all the original features and combine them to become new features (such that the combined fractions would simply be a 'whole' feature, like a sum of parts).

These new features are known as the principal components of the data due to the way they are obtained: the first principal component is the best-fit line that passes through the set of data points, and each subsequent principal component is orthogonal to all the other principal components. Therefore, for a dataset with $n$ features, there are at most $n$ principal components. Note that a principal component line differs from a regular linear regression line because the fit minimizes the *perpendicular distance from the line to all input data points*, not the *distance between the line's predictions and the output values*.
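
The difference between the two fits can be seen on a small synthetic 2-D example. This is a sketch under stated assumptions: the data are made up, NumPy is assumed to be available, and the helper names `vertical_sq` and `perpendicular_sq` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)

# Center both variables so that every candidate line passes through the centroid.
xc, yc = x - x.mean(), y - y.mean()

# Ordinary least squares of y on x: minimizes squared vertical distances.
ols_slope = (xc @ yc) / (xc @ xc)

# First principal component: eigenvector of the covariance matrix with the
# largest eigenvalue (np.linalg.eigh returns eigenvalues in ascending order).
cov = np.cov(np.stack([xc, yc]))
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]
pca_slope = pc1[1] / pc1[0]

def vertical_sq(slope):
    """Sum of squared vertical distances from the points to the line."""
    return np.sum((yc - slope * xc) ** 2)

def perpendicular_sq(slope):
    """Sum of squared perpendicular distances from the points to the line."""
    return np.sum((yc - slope * xc) ** 2) / (1 + slope ** 2)

print("OLS slope:", ols_slope, "PCA slope:", pca_slope)
print("vertical error      (OLS, PCA):", vertical_sq(ols_slope), vertical_sq(pca_slope))
print("perpendicular error (OLS, PCA):", perpendicular_sq(ols_slope), perpendicular_sq(pca_slope))
```

Each line wins on its own criterion: the regression line has the smaller vertical error, while the principal component line has the smaller perpendicular error.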
 
In short, PCA helps us to *identify patterns in data based on the correlation between features*, and using it as a dimensionality reduction tool allows us to *reduce the computational cost (driven by the amount of data that goes through the algorithm) with minimal loss of model accuracy*.

### How does it work?
PCA performs a change of basis for the data, creating a new basis whose vectors are linear combinations of the original (raw) basis vectors: the old basis vectors are the pure features while the new basis vectors are the principal components. Intuitively, PCA takes the set of features in a dataset and creates a new set of 'features' which are made up of proportions of the old features.

The first principal component is the line through the centroid of the dataset that minimizes the sum of squared perpendicular distances to all the points; this is equivalent to the line that points in the direction of maximum variance in the high-dimensional data. The remaining principal components are then the lines that are orthogonal to all the preceding principal components: the $k$-th principal component is orthogonal to the preceding $k-1$ principal components, and they are ranked in descending order of their fit to the data (i.e., the further a line is from the datapoints, and hence the less variance it captures, the lower it's ranked).
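
The maximum-variance property can be checked empirically. The following is a minimal sketch, assuming NumPy and a made-up correlated 3-feature dataset (the `mixing` matrix is purely illustrative); it compares the variance along the first principal component with the variance along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated dataset: 200 samples, 3 features (illustration only).
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
data = rng.normal(size=(200, 3)) @ mixing
centered = data - data.mean(axis=0)

# Eigen-decomposition of the covariance matrix, sorted by descending eigenvalue.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1 = eigvecs[:, 0]

# Variance of the data projected onto PC1 versus onto 1000 random unit directions.
var_pc1 = np.var(centered @ pc1, ddof=1)
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = np.var(centered @ random_dirs.T, axis=0, ddof=1)

print("variance along PC1:", var_pc1)
print("largest variance among random directions:", var_random.max())
print("variances along PC1..PC3 (eigenvalues):", eigvals)
```

No random direction should beat the first principal component, and the eigenvalues give the variance along each component in ranked order.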

#### Example
In a dataset that considers 3 features: height, weight, and width, the $i$-th datapoint is described by each feature individually, that is:

\begin{align*}
    \mathbf{d_i} &= (x, y, z)
\end{align*}
in units of height, weight, and width respectively, in 3-D space. PCA then changes the basis so that each new basis vector (a principal component) is a linear combination of the 3 existing axes. Then, instead of $\mathbf{d_i} = (x, y, z)$, we get a description for a datapoint as follows:

\begin{align*}
    \mathbf{\hat{d}_i} &= (\mathbf{a}\mathbf{d_i}^T, \mathbf{b}\mathbf{d_i}^T, \mathbf{c}\mathbf{d_i}^T) \\
    &= (a_1x + a_2y + a_3z, b_1x + b_2y + b_3z, c_1x + c_2y + c_3z)
\end{align*}
where $\mathbf{a} = (a_1, a_2, a_3)$, $\mathbf{b} = (b_1, b_2, b_3)$, and $\mathbf{c} = (c_1, c_2, c_3)$ are the new orthonormal basis vectors (they are also the eigenvectors of the covariance matrix of the features). Effectively, the projections onto $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are the new 'features' that PCA created by combining the height, weight, and width features. The coefficients of the first new basis vector (i.e., the proportions) are computed in the least-squares sense (minimizing squared perpendicular distances), while the remaining vectors are orthogonal to it and to each other; essentially, the first vector is a line fitted to the data that passes through the data's centroid.
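
The change of basis above can be sketched in a few lines. This example assumes NumPy, made-up (height, weight, width) measurements, and mean-centered data (a common PCA convention that the equations above do not state explicitly); the variable names mirror the symbols $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical measurements for 100 subjects; the numbers are invented.
height = rng.normal(170, 10, size=100)
weight = 0.9 * (height - 170) + rng.normal(70, 5, size=100)
width = 0.2 * (height - 170) + rng.normal(40, 2, size=100)
data = np.column_stack([height, weight, width])
centered = data - data.mean(axis=0)     # assumption: PCA on mean-centered data

# Eigenvectors of the covariance matrix, ordered by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
a, b, c = eigvecs[:, order].T           # each row is one new basis vector

# Describe the first datapoint d_i = (x, y, z) in the new basis.
x, y, z = centered[0]
d_hat = np.array([a @ centered[0], b @ centered[0], c @ centered[0]])
d_hat_expanded = np.array([
    a[0] * x + a[1] * y + a[2] * z,
    b[0] * x + b[1] * y + b[2] * z,
    c[0] * x + c[1] * y + c[2] * z,
])

basis = np.vstack([a, b, c])
print("new coordinates:", d_hat)
print("basis is orthonormal:", np.allclose(basis @ basis.T, np.eye(3)))
```

Because the basis is orthonormal, keeping all three components loses nothing: `basis.T @ d_hat` recovers the original centered datapoint.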

Finally, to reduce the dimensionality of the problem, we eliminate some of the new 'features', i.e., eliminate some basis vectors. In the example above, we can choose to use only vectors $\mathbf{a}$ and $\mathbf{b}$ to describe our data, which means the same datapoint is now described as

\begin{align*}
    \mathbf{\hat{d}_i}' &= (\mathbf{a}\mathbf{d_i}^T, \mathbf{b}\mathbf{d_i}^T) \\
    &= (a_1x + a_2y + a_3z, b_1x + b_2y + b_3z).
\end{align*}
Then we only need to train a model on the data projected onto the principal components $\mathbf{a}$ and $\mathbf{b}$. The decision of which principal components to keep is driven by the variance (eigenvalues) of the data along each principal component; the basis vectors corresponding to the $k$ highest variances are kept, with $k$ being a design parameter.
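
Keeping only the top $k$ components might look like the sketch below. It assumes NumPy and a synthetic dataset in which 3 features are driven by 2 underlying factors plus a little noise; the function name `pca_reduce` is hypothetical and this is one straightforward way to do it, not the only one.

```python
import numpy as np

def pca_reduce(data, k):
    """Project data of shape (n_samples, n_features) onto its top-k principal components."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # rank components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                  # shape (n_features, k)
    reduced = centered @ components              # shape (n_samples, k)
    explained = eigvals[:k].sum() / eigvals.sum()  # fraction of total variance kept
    return reduced, components, explained

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 2))
mixing = np.array([[3.0, 1.0, 0.5],
                   [0.0, 1.0, 0.2]])             # 2 hidden factors -> 3 features
data = latent @ mixing + rng.normal(scale=0.05, size=(300, 3))

reduced, components, explained = pca_reduce(data, k=2)
print("reduced shape:", reduced.shape)
print(f"fraction of variance kept: {explained:.3f}")
```

The fraction of variance kept is one practical way to choose $k$: increase $k$ until that fraction is acceptable for the task at hand.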

The following image shows the 2 principal components of a 2-D dataset:

![](https://miro.medium.com/max/1200/0*J-m5zAEFGOs-qnLo.png)

Image credit: https://medium.com/xebia-engineering/principal-component-analysis-autoencoder-257e90d08a4e


## Final words
PCA is an example of a linear dimensionality reduction tool that does not require label information (i.e., it is unsupervised). Autoencoders achieve the same functionality but allow nonlinear transformations (recall that PCA uses linear transformations for its change of basis).

LASSO is another popular tool for reducing the number of input features (it shrinks the coefficients of less useful features to zero), but it requires label information to train (i.e., it is supervised).



