# Dimensional Reduction

<center>
<img src="images/dimensional_reduction.png" width="500">
</center>

This lecture will introduce the concept of dimensional reduction and the benefits/challenges of high-dimensional data, along with a few key approaches to dimensional reduction.

* Dimensional reduction basics
    - Problem statement for dimensional reduction
    - The curse and blessing of dimensionality
    - Considerations in dimensional reduction
    - Assessing performance of dimensional reduction
* Familiar approaches
    - Summary statistics
    - Feature selection
* Principal component analysis
    - Linear PCA
    - Kernel PCA
    - Other variants
* Other approaches
    - Manifold-based models
    - Autoencoding
* Conclusions

## Dimensional reduction basics

### Problem statement for dimensional reduction

Dimensional reduction is the reduction of the number of independent variables under consideration while maintaining the properties of the statistical distribution of the underlying data.

Dimensional reduction can be either **supervised** or **unsupervised**, but this lecture will focus on unsupervised approaches since they are the most distinct. Dimensional reduction can be used to extract/generate features from data, to get a better intuitive understanding of data, or to compress data.

### The blessing and the curse of dimensionality

Dimensionality in mathematics is different from the dimensionality of the physical world. Engineers typically consider problems in 3 dimensions, but in data science the number of dimensions is equal to the number of features and can be extremely high (>10k). This leads to two phenomena:

* The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) refers to the fact that the volume of a high-dimensional space varies exponentially with the dimension. For example, consider the volume of a cube of length $L$ in dimension $d$:

$V_d = L^d$

If we want to sample this space with a resolution of $\Delta L = L/N$ we would need a total of $N^d$ points. If $N=10$ and $d=100$ (a moderate number of dimensions in data science) then the total number of samples needed is $10^{100}$, which is [greater than the number of atoms in the known universe](https://www.universetoday.com/36302/atoms-in-the-universe/)!

* The [blessing of dimensionality](https://arxiv.org/abs/1801.03421) is a somewhat lesser known phenomenon that occurs because of the data sparsity that arises from the curse of dimensionality. It essentially means that as the number of independent dimensions increases the data tends to be more easily separable and will look increasingly like well-separated points. This makes the data more well-suited to simplistic statistical assumptions such as being represented by a Gaussian distribution with equal covariance.

The curse of dimensionality always applies, but the blessing is not guaranteed. This means that in general it is more challenging to work in high dimensions.
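The sample-count arithmetic for the curse of dimensionality can be checked directly. The sketch below is only an illustration: the helper name `grid_points` is hypothetical, and the figure of roughly $10^{80}$ atoms is an assumed order-of-magnitude estimate rather than a precise value.

```python
def grid_points(n_per_axis, n_dims):
    """Number of points on a regular grid with n_per_axis points along each of n_dims axes."""
    return n_per_axis ** n_dims

# The required number of samples grows exponentially with the dimension
for d in (1, 2, 3, 10, 100):
    print(d, grid_points(10, d))

# Assumed rough estimate of the number of atoms in the known universe
ATOMS_ESTIMATE = 10 ** 80
print(grid_points(10, 100) > ATOMS_ESTIMATE)
```

Python integers have arbitrary precision, so $10^{100}$ is computed exactly here without overflow.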

### Considerations for dimensional reduction

* Matrix rank - how many independent dimensions are there?

* Linearity of the subspace - are patterns linear or non-linear?

* Projection - can a new high-dimensional point be projected onto the low-dimensional map?

* Inversion - can a new low-dimensional point be projected back into high-dimensional space?

* Supervised vs. unsupervised - are the training labels used to determine the dimensional reduction?
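The matrix rank consideration can be illustrated with a small synthetic example. This is a minimal sketch, assuming a made-up data matrix (`A_demo`, `B_demo` are hypothetical names): a feature that is a linear combination of other features adds a column but no new independent dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent random features for 100 data points
A_demo = rng.normal(size=(100, 3))

# A fourth feature that is a linear combination of the first two
redundant = A_demo[:, [0]] + 2 * A_demo[:, [1]]
B_demo = np.hstack([A_demo, redundant])

rank_A = np.linalg.matrix_rank(A_demo)
rank_B = np.linalg.matrix_rank(B_demo)

print(B_demo.shape)  # four columns...
print(rank_B)        # ...but only three independent dimensions
```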

### Assessing performance of dimensional reduction models

As with clustering it can be challenging to assess the performance of dimensional reduction models, especially when unsupervised. Nonetheless there are a few approaches that can be used. Selecting the right approach will depend on the problem, but using a variety is always a good idea.

* Variance

One common idea in dimensional reduction is to assess the "retained variance" of the low-dimensional data. This is common in techniques such as PCA.

* Distance

The "stress" function compares the distance between points $i$ and $j$ in a low-dimensional space to the distance in the full-dimensional space (similar to the cophenetic coefficient for clustering linkages).

$S(\vec{x}_{0}, \vec{x}_1, \vec{x}_2, ... \vec{x}_n) =  \left(\frac{\sum_{i=0}^n \sum_{i < j}(d_{ij} - ||x_i - x_j||)^2}{\sum_{i=0}^n \sum_{i < j} d_{ij}^2}\right)^{1/2}$

where $d_{ij}$ is the distance in the high-dimensional space and $\vec{x}$ is the vector in the low-dimensional space. Some approaches seek to minimize this directly (e.g. multi-dimensional scaling), but it can also be used as an accuracy metric.

* Visualization

Where possible, visualizing the data in the low-dimensional space and looking for patterns is a very powerful approach. However this becomes challenging if the low-dimensional space is $>$3 dimensions.

* Model performance

If you have labels for the data, one good approach is to construct a supervised model from both the low- and high-dimensional spaces and evaluate the accuracy of both. If the accuracy does not decrease then the key patterns are retained in the low-dimensional representation. This is also a pragmatic approach since one main use of dimensional reduction is to construct more efficient supervised models.
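The "retained variance" idea above can be sketched with the `scikit-learn` `PCA` class, which exposes the fraction of variance captured by each component as `explained_variance_ratio_`. This is an illustrative sketch on the hand-written digits data; the variable names and the choice of 10 components are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_demo, _ = load_digits(return_X_y=True)

# Fit a 10-component PCA and sum the per-component variance fractions
pca_demo = PCA(n_components=10).fit(X_demo)
retained = pca_demo.explained_variance_ratio_.sum()

# The same quantity by hand: variance of the projected data / total variance
Z_demo = pca_demo.transform(X_demo)
retained_manual = Z_demo.var(axis=0, ddof=1).sum() / X_demo.var(axis=0, ddof=1).sum()

print(retained, retained_manual)
```

A retained variance close to 1 means that little of the spread in the data was discarded by the reduction.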

## Familiar approaches

First we will look at some approaches that should be familiar. In order to illustrate dimensional reduction we will use one of the most famous problems in machine learning: a database of hand-written digits. We can load this in and take a look at the data.

In [None]:
%matplotlib inline
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

# With return_X_y=True the function returns the data array and the label vector directly
digits, classes = load_digits(return_X_y=True)
print(digits.shape)
print(classes.shape)
X_orig = np.array(digits)
y = classes

This is a dataset of 1797 data points with 64 dimensions. Let's see what they look like:

In [None]:
def show_image(digit_data, n, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    img = digit_data[n].reshape(8,8)
    ax.imshow(img,cmap='binary')
    
N = 3
show_image(X_orig, N)
print('Digit: {}, Min: {}, Max: {}'.format(y[N],X_orig.min(),X_orig.max()))

#### Question: What type of pre-processing would you recommend?

In [None]:
## Implement pre-processing
#X = (X_orig - X_orig.mean(axis=0))/(np.max(X_orig.std(axis=0)+0.001))
X = (X_orig - X_orig.min(axis=0))/(X_orig.max() - X_orig.min())
#X = X > 0.4

N = 9
show_image(X, N)
print('Digit: {}, Min: {}, Max: {}'.format(y[N],X.min(),X.max()))

### Summary Statistics

A very basic way to reduce dimensions is to take "summary statistics". For example, we could take the mean and standard deviation of each 64 dimensional datapoint in order to create a 2-dimensional representation:

In [None]:
def mean_std(X):
    # Summarize each row (one image) by its mean and standard deviation
    Xmean = X.mean(axis=1)
    Xstd = X.std(axis=1)
    X_ms = np.column_stack((Xmean,Xstd))
    return X_ms

print(X.shape)
X_ms = mean_std(X)
print(X_ms.shape)

Let's assess the distances between datapoints before/after the dimensional reduction:

In [None]:
from scipy.spatial.distance import pdist

def stress(X_reduced, X):
    D_red = pdist(X_reduced)
    D_tot = pdist(X)
    numerator = np.sum((D_tot - D_red)**2)
    denom = np.sum(D_tot**2)
    return np.sqrt(numerator/denom)

print(stress(X, X))
print(stress(X_ms, X))

#### Question: Is this a good dimensional reduction?

Since we have labels for the data we can also look at the low-dimensional data and see if it captures the distinguishing patterns.

In [None]:
fig, ax = plt.subplots()

ax.scatter(X_ms[:,0], X_ms[:,1], c=y)

def add_labels(ax, cmap=plt.cm.viridis):
    colors = [cmap((i/9)) for i in range(10)]
    xpos = 0.1
    for label in range(0,10):
        c = colors[label]
        ax.annotate(str(label), xy=[xpos, 1.1], xycoords='axes fraction', color=c, size=15)
        xpos += 0.07

add_labels(ax)

Unsurprisingly there are not any obvious patterns here. Summary statistics discard a considerable amount of information, but they are straightforward to use. Summary statistics are:

* Unsupervised - We did not use the labels to determine the statistics
* Projectable - It is also easy to project a new data point into the reduced dimensional space
* Not invertible - There is no way to go back from the reduced dimensional space to the original space.

### Feature Selection as dimensional reduction

Another approach to dimensional reduction is to utilize a supervised model to perform feature selection:

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

N = 10
kbest = SelectKBest(f_classif, k=N)
X_new = kbest.fit_transform(X, y)
print(X_new.shape)
print(stress(X_new,X))

It looks like this is working better than the summary statistics. We can visualize the first two selected features in a scatter plot:

In [None]:
fig, ax = plt.subplots()

ax.scatter(X_new[:,0], X_new[:,1], c=y)

The patterns are still not obvious, but there is clearly more separation. Let's take a look at the features that are selected:

In [None]:
fig, ax = plt.subplots()

selected = kbest.get_support().reshape(8,8)
ax.imshow(selected,cmap='binary')

#### Question: Which features are least likely to be selected as we continue to increase the dimensionality?

We could also assess the dimensional reduction by comparing the accuracy of a classification model trained on the full and reduced dimension models:

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

idxs = range(len(X))
train_idxs, test_idxs, y_train, y_test = train_test_split(idxs, y, test_size=0.5)
X_train_full = X[train_idxs]
X_test_full = X[test_idxs]

X_train_lowD = X_new[train_idxs]
X_test_lowD = X_new[test_idxs]

clf_full = svm.SVC()
clf_full.fit(X_train_full,y_train)
y_pred_full = clf_full.predict(X_test_full)
accuracy_full = accuracy_score(y_pred_full, y_test)

clf_lowD = svm.SVC()
clf_lowD.fit(X_train_lowD,y_train)
y_pred_lowD = clf_lowD.predict(X_test_lowD)
accuracy_lowD = accuracy_score(y_pred_lowD, y_test)

print(accuracy_full)
print(accuracy_lowD)

We can see that the feature selection approach works well, but it requires knowing the outputs. The feature selection approach is **supervised**. It is also projectable.

#### Question: Is feature selection invertible?

## Principal component analysis

Hopefully this is also familiar, but as a quick refresher, the principal components are obtained from the **eigenvalues** and eigenvectors of the **covariance matrix**:

In [None]:
C = np.cov(X.T)

fig,ax = plt.subplots()
ax.imshow(C)

We can also think of PCA from the perspective of an objective function. The objective function for PCA is:

$||\underline{\underline{X}} - \underline{\underline{A}}||_F$

where $||\underline{\underline{M}}||_F = \sqrt{\sum_i \sum_j M_{ij}^2}$ is the "Frobenius norm" and $\underline{\underline{A}}$ is a "low rank" approximation of $\underline{\underline{X}}$.

Obviously if this objective function is simply minimized for all the elements of $\underline{\underline{A}}$ it will give us $\underline{\underline{X}}$, which is rather trivial. The key is that we apply a constraint on the rank:

$\min_{\underline{\underline{A}}} ||\underline{\underline{X}} - \underline{\underline{A}}||_F$ subject to $rank(\underline{\underline{A}}) \leq k$

In other words we are looking for the closest rank-$k$ matrix to the original matrix $\underline{\underline{X}}$.

Note: We will not derive the connection between this minimization problem and the covariance matrix eigenvalues here, but details are available in Hastie 14.5.
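The rank-constrained minimization can be checked numerically. The sketch below relies on a standard linear algebra result that is not derived in this lecture (the Eckart-Young theorem): keeping the $k$ largest singular values of the SVD gives the closest rank-$k$ matrix in the Frobenius norm, and the remaining error is the root of the sum of the squared discarded singular values. The random matrix and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))
k_demo = 3

# Thin SVD: X_demo = U @ diag(s) @ Vt, with singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Keep only the k largest singular values to build the rank-k approximation
A_k = (U[:, :k_demo] * s[:k_demo]) @ Vt[:k_demo]

# Frobenius-norm error of the approximation
err = np.linalg.norm(X_demo - A_k, ord='fro')

# Error predicted from the discarded singular values
err_expected = np.sqrt(np.sum(s[k_demo:] ** 2))

print(np.linalg.matrix_rank(A_k), err, err_expected)
```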

#### What is the definition of matrix rank?

For a symmetric matrix such as the covariance matrix, the rank equals the number of non-zero eigenvalues. Another way to think about PCA for dimensional reduction is that it uses only the eigenvectors associated with the $k$ largest eigenvalues of the covariance matrix to reconstruct the data. If we remember back to earlier lectures, the eigenvalues of the covariance matrix also give us the variance of the data along the principal component axes. Therefore we find that PCA gives a rank-$k$ approximation of the data that **retains the maximum possible amount of variance**.

In [None]:
eig_vals, eig_vecs = np.linalg.eig(C)
eig_vecs = eig_vecs.T #<- note that the eigenvectors are the *columns* by default
order = np.argsort(eig_vals)[::-1] #<- they are also not sorted by default, so sort by descending eigenvalue

eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[order] #<- row i is the eigenvector for the i-th largest eigenvalue

fig, ax = plt.subplots()
ax.plot(eig_vals_sorted)

In [None]:
print(eig_vals_sorted)

#### Question: What is the rank of this matrix?

Let's take a look at the top 3 eigenvectors:

In [None]:
eig_matrix = np.vstack(eig_vecs_sorted) #<- np.row_stack is a deprecated alias of np.vstack in recent NumPy

fig,axes = plt.subplots(1,3,figsize=(15,5))

show_image(eig_matrix,0,ax=axes[0])
show_image(eig_matrix,1,ax=axes[1])
show_image(eig_matrix,2,ax=axes[2])

Now we can use the eigenvectors associated with the top $k$ eigenvalues to project the original data onto a $k$-dimensional space (the "axes" of this space are the eigenvectors).

In [None]:
k = 20
projector = np.column_stack(eig_vecs_sorted[:k]) #<- use the sorted eigenvectors so the top-k components are selected
print(projector.shape)
X_k = np.dot(X,projector)
print(X_k.shape)

print('Retained variance:', sum(eig_vals_sorted[:k])/sum(eig_vals_sorted))
print('Stress:', stress(X_k, X))

We can also visualize this in a 2-dimensional space:

In [None]:
k = 2
projector = np.column_stack(eig_vecs_sorted[:k]) #<- use the sorted eigenvectors so the top-k components are selected
X_k = np.dot(X,projector)

fig,ax = plt.subplots()

ax.scatter(X_k[:,0], X_k[:,1], c=y)
add_labels(ax)

We can see that there is substantial separation of the digits even in a 2-dimensional space.

Another very useful feature of PCA is that it is **invertible**. We can move between the low-dimensional and high-dimensional representations using the projection matrix, although the reconstruction is only approximate when $k$ is smaller than the original dimension.

In [None]:
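k = 3
projector = np.column_stack(eig_vecs_sorted[:k])
X_mean = X.mean(axis=0) #<- per-feature mean; the eigenvectors describe the centered data
X_k = np.dot(X - X_mean,projector)
X_reconstruct = np.dot(X_k,projector.T) + X_mean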


N = 0
fig,axes = plt.subplots(1,2,figsize=(10,5))
show_image(X, N, ax=axes[0])
show_image(X_reconstruct, N, ax=axes[1])

We can also compare the accuracy of a classification model as a function of number of components:

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

k = 5
projector = np.column_stack(eig_vecs_sorted[:k]) #<- use the sorted eigenvectors so the top-k components are selected
X_k = np.dot(X,projector)

idxs = range(len(X))
train_idxs, test_idxs, y_train, y_test = train_test_split(idxs, y, test_size=0.5)
X_train_full = X[train_idxs]
X_test_full = X[test_idxs]

X_train_lowD = X_k[train_idxs]
X_test_lowD = X_k[test_idxs]

clf_full = svm.SVC()
clf_full.fit(X_train_full,y_train)
y_pred_full = clf_full.predict(X_test_full)
accuracy_full = accuracy_score(y_pred_full, y_test)

clf_lowD = svm.SVC()
clf_lowD.fit(X_train_lowD,y_train)
y_pred_lowD = clf_lowD.predict(X_test_lowD)
accuracy_lowD = accuracy_score(y_pred_lowD, y_test)

print(accuracy_full)
print(accuracy_lowD)

Finally, we can combine PCA with other dimensional reduction techniques like feature selection:

In [None]:
k_PCA = 40
projector = np.column_stack(eig_vecs_sorted[:k_PCA]) #<- use the sorted eigenvectors so the top components are selected
X_k = np.dot(X,projector)
print(X_k.shape)

k_kB = 10
kbest = SelectKBest(f_classif, k=k_kB)
X_new = kbest.fit_transform(X_k, y)
print(X_new.shape)
print(kbest.get_support())


#### Question: Will feature selection select the largest variance principal components for every problem?

PCA is one of the most widely used techniques in dimensional reduction because it is:

* Unsupervised - We did not use the labels to determine the statistics
* Projectable - It is also easy to project a new data point into the reduced dimensional space
* Invertible - It is easy to move from the low-dimensional space to the high dimensional space

However, its weakness is that it is linear in the original space. It does not do well with non-linear patterns.
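The "projectable" and "invertible" properties listed above can be sketched with the `scikit-learn` `PCA` class by fitting on one part of the digits data and applying the map to points that were held out. The split fraction, number of components, and variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X_demo, _ = load_digits(return_X_y=True)

# Hold out some points that the PCA model never sees during fitting
X_fit, X_unseen = train_test_split(X_demo, test_size=0.2, random_state=0)

pca_demo = PCA(n_components=10).fit(X_fit)

# Projection: map unseen high-dimensional points to the low-dimensional space
Z_unseen = pca_demo.transform(X_unseen)

# Inversion: map the low-dimensional points back (approximate, since 54 components were dropped)
X_back = pca_demo.inverse_transform(Z_unseen)

rmse = np.sqrt(np.mean((X_unseen - X_back) ** 2))
print(Z_unseen.shape, X_back.shape, rmse)
```

The reconstruction error is non-zero because information along the discarded components is lost.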

### Kernel PCA

The solution to non-linearity in PCA is a familiar one: using a "kernel" to perform PCA in an even-higher dimensional space that captures non-linearities. The concept here is that rather than using the covariance matrix, the eigenvalues and eigenvectors of a "kernel matrix" are used:

$K_{ij} = \kappa(\vec{x}_i, \vec{x}_j)$ where $\kappa$ is a kernel function such as the radial basis function:

$\kappa_{rbf}(\vec{x}_i, \vec{x}_j) = \exp(-\gamma ||\vec{x}_i - \vec{x}_j||^2)$
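The kernel matrix defined by these two formulas can be computed in a few lines. The sketch below builds $K_{ij}$ by hand for a small random dataset and compares it with the `rbf_kernel` helper from `scikit-learn`; the data, the value of $\gamma$, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))
gamma_demo = 0.5

# Squared Euclidean distance between every pair of points
sq_dists = cdist(X_demo, X_demo, metric='sqeuclidean')

# K_ij = exp(-gamma * ||x_i - x_j||^2)
K = np.exp(-gamma_demo * sq_dists)

# Reference implementation for comparison
K_ref = rbf_kernel(X_demo, gamma=gamma_demo)

print(K.shape)
print(np.allclose(K, K_ref))
```

The matrix is symmetric with ones on the diagonal, since every point has zero distance to itself.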

We will not go into the details, but instead show an example of how it works with the `scikit-learn` implementation:

In [None]:
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA, PCA

k = 1
X_m, y_m = make_moons(n_samples=100, random_state=0)
kPCA = KernelPCA(n_components=k, kernel='rbf', gamma=15, fit_inverse_transform=True)
lPCA = PCA(n_components=k)

lPCA.fit(X_m)
X_PCA = lPCA.transform(X_m)

kPCA.fit(X_m)
X_kPCA = kPCA.transform(X_m)

fig,axes = plt.subplots(1,3,figsize=(15,5))
axes[0].scatter(X_m[:,0], X_m[:,1], c=y_m)
axes[1].scatter(X_PCA, np.zeros(X_PCA.size), c=y_m)
axes[2].scatter(X_kPCA, np.zeros(X_PCA.size), c=y_m)

We can see that while PCA fails to separate the two classes, kernel PCA is successful! However, we had to choose a hyperparameter ($\gamma$).

#### Question: How could $\gamma$ be selected?

Kernel PCA is also invertible, much like regular PCA, although the inverse mapping is only approximate:

In [None]:
X_PCA_reconstruct = lPCA.inverse_transform(X_PCA)
X_kPCA_reconstruct = kPCA.inverse_transform(X_kPCA)

fig,axes = plt.subplots(1,3,figsize=(15,5))
axes[0].scatter(X_m[:,0], X_m[:,1], c=y_m)
axes[1].scatter(X_PCA_reconstruct[:,0], X_PCA_reconstruct[:,1], c=y_m)
axes[2].scatter(X_kPCA_reconstruct[:,0], X_kPCA_reconstruct[:,1], c=y_m)

We can see that the reconstruction itself doesn't look particularly good, but some key features of the original structure are preserved with kPCA, while the PCA reconstruction looks drastically different.

### Other PCA variants

PCA is one of the most powerful dimensional reduction techniques and has many other variants. We will not go into the details, but a few worth mentioning are:

* Partial least squares - supervised regression-based PCA that maximizes covariance between input and output
* Linear discriminant analysis - supervised classification-based analogue of PCA that maximizes between-class variance relative to within-class variance
* Robust PCA - good for cases where there is sparse data and/or large errors/outliers

Let's quickly compare PCA, kernel PCA and LDA for the digits dataset:

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

k=2

lPCA = PCA(n_components=k)
lPCA.fit(X)
X_PCA = lPCA.transform(X)

kPCA = KernelPCA(n_components=k, kernel='rbf', gamma=0.5, fit_inverse_transform=True)
kPCA.fit(X)
X_kPCA = kPCA.transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X,y)
X_lda = lda.transform(X)

fig, axes = plt.subplots(1,3,figsize=(15,5))

axes[0].scatter(X_PCA[:,0], X_PCA[:,1],c=y)
axes[1].scatter(X_kPCA[:,0], X_kPCA[:,1],c=y)
axes[2].scatter(X_lda[:,0], X_lda[:,1],c=y)

add_labels(axes[1])

## Other dimensional reduction approaches

There are many other techniques that can be applied for dimensional reduction. Here we will briefly introduce two concepts: manifold learning and autoencoding.

### Manifold learning

Manifold learning approaches utilize distance metrics between points to define their similarity, and then seek to minimize the difference between distance metrics in the high- and low-dimensional spaces. The advantage of distance metrics over variance is that the **local structure** of data (distances between points) can be more easily exploited. This makes manifold learning techniques much better suited for highly non-linear datasets.

One prototype manifold learning technique is **multi-dimensional scaling**. The principle is that the "stress" metric which we introduced earlier is directly minimized. The stress is given by:

$S(\vec{x}_{0}, \vec{x}_1, \vec{x}_2, ... \vec{x}_n) =  \left(\frac{\sum_{i=0}^n \sum_{i < j}(d_{ij} - ||x_i - x_j||)^2}{\sum_{i=0}^n \sum_{i < j} d_{ij}^2}\right)^{1/2}$

where $d_{ij}$ is the distance in the high-dimensional space and $\vec{x}$ is the vector in the low-dimensional space. A typical choice is to use the Euclidean distance to compute $d_{ij}$, but there are variants of MDS that use other distance metrics. For example "non-metric" MDS uses distances based on ordering between different points, so that the relative ordering of distances is favored over the numerical value of distances.

The optimization problem is rather challenging, so we will just use the `scikit-learn` implementation to see how MDS works for the hand-written digits:

In [None]:
from sklearn.manifold import MDS

mds = MDS(n_components=2, n_init=1, max_iter=100) #<- note that we need to give some max_iteration and initial guess parameters since this is iterative
X_mds = mds.fit_transform(X) #<- note that there is no transform method. What does this mean?

fig, ax = plt.subplots()
ax.scatter(X_mds[:,0], X_mds[:,1], c=y)
add_labels(ax)
print('Stress: ', stress(X_mds, X))

There are clearly some clusters, but the separation is not much better than PCA or other approaches. Adding more iterations or initial configurations may improve things.

Another popular manifold-based method is tSNE, or t-distributed stochastic neighbor embedding. This uses a probabilistic similarity metric based on the t-distribution, which makes it somewhat better suited to retain both local and global structure. The details of this are well beyond this course, but we can see how it performs:

In [None]:
from sklearn.manifold import TSNE

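# The iteration count is left at its default (1000) because the argument name differs between scikit-learn versions
tsne = TSNE(n_components=2, perplexity=30.0, 
            early_exaggeration=12.0, 
            learning_rate=200.0, 
            init='random',
            method='exact')

X_tsne = tsne.fit_transform(X) #<- fit the tSNE model here, not the MDS model from the previous cell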

fig, ax = plt.subplots()
ax.scatter(X_tsne[:,0], X_tsne[:,1], c=y)
add_labels(ax)
print('Stress: ', stress(X_tsne, X))

There is slightly better separation than MDS and PCA, but still considerable overlap. There are also many additional hyper-parameters that don't have clear meaning (e.g. "perplexity"), and the outcome will depend on the initial guesses and the algorithms used. While tSNE can be powerful, it also typically requires substantial effort.

Some other common manifold-based techniques include:

* Isomap
* Locally linear embedding (LLE)
* Spectral embedding
* Local tangent space alignment (LTSA)

A [comparison](http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html):

<center>
<img src="images/manifold_techniques.png" width="800">
</center>

Manifold techniques can give powerful insight into the high-dimensional structure of data; however, most suffer from several key disadvantages:

* Not projectable - the low dimensional representation only applies to the training points.
* Not invertible - no way to move back to high-dimensional space
* Slow - manifold techniques use distance matrices and hence tend to scale as $N^2$

For these reasons manifold techniques are best for providing insight into the structure of the data, but usually need to be combined with other dimensional reduction approaches for model construction.
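The $N^2$ scaling of distance-based methods can be made concrete by counting pairwise distances. This is a rough sketch: the helper name `n_pairs` is hypothetical, and the memory estimate assumes 8 bytes per stored distance (double precision) in the condensed form returned by `pdist`.

```python
import numpy as np
from scipy.spatial.distance import pdist

def n_pairs(n_points):
    """Number of distinct pairs (i < j) among n_points points."""
    return n_points * (n_points - 1) // 2

# pdist returns exactly one distance per distinct pair
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 64))
print(len(pdist(pts)), n_pairs(200))

# Storage needed for the condensed distance matrix, assuming 8-byte floats
for n in (1797, 10_000, 100_000):
    gigabytes = n_pairs(n) * 8 / 1e9
    print(n, n_pairs(n), round(gigabytes, 3), 'GB')
```

Increasing the number of points by a factor of 10 increases the number of distances by roughly a factor of 100.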

### Autoencoding

The final approach we will discuss is "autoencoding", which is the use of neural networks for dimensional reduction. This is a relatively new approach without any implementation in `scikit-learn`, but it is conceptually different from others so it is included here.

<center>
<img src="images/autoencoder.png" width="500">
</center>

The idea is that you train a neural network with the same data as inputs and outputs, but use an intermediate hidden layer (or layers) with dimensionality smaller than the original data. This forces the data through a "bottleneck" where it is represented in a low-dimensional form. This has numerous advantages:

* projectable and invertible - the link between the high/low dimensional representation is defined by the neural net
* fast and scalable - neural networks are computationally efficient
* non-linear and unsupervised - the autoencoder learns the non-linear manifold without needing labels

However, the typical cautions of neural networks apply:

* extremely large training datasets needed
* architecture and hyperparameters need to be tuned/selected
* no intuitive link between low- and high-dimensional representations

This is a field of research on its own, but worth being aware of nonetheless.
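The bottleneck idea can be sketched without a deep learning library. Since `scikit-learn` has no dedicated autoencoder class, the example below uses `MLPRegressor` as a stand-in: it is trained with the digits as both inputs and outputs, and the hidden layer acts as the low-dimensional code. The bottleneck size, activation, and the `encode`/`decode` helper names are assumptions made for illustration, and no tuning has been done.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel values range from 0 to 16

k_demo = 8  # size of the bottleneck layer (assumed for this sketch)
ae = MLPRegressor(hidden_layer_sizes=(k_demo,), activation='tanh',
                  max_iter=500, random_state=0)
ae.fit(X_demo, X_demo)  # same data as input and output

def encode(X):
    # Input layer -> bottleneck, using the fitted weights and the tanh activation
    return np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])

def decode(Z):
    # Bottleneck -> output layer (MLPRegressor uses an identity output activation)
    return Z @ ae.coefs_[1] + ae.intercepts_[1]

Z_demo = encode(X_demo)  # low-dimensional representation
X_rec = decode(Z_demo)   # reconstruction in the original space

print(Z_demo.shape, X_rec.shape)
print(np.mean((X_demo - X_rec) ** 2))
```

Here `encode` plays the role of projection and `decode` the role of inversion; a practical autoencoder would typically use a dedicated neural network library with more layers.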

## Conclusions

* High-dimensional data is common in engineering: images, signals, etc.

* High-dimensional data has unique attributes:
    - Curse of dimensionality
    - Blessing of dimensionality
    
* Dimensional reduction techniques can be used to:
    - Visualize data
    - Engineer model features
    - Improve model efficiency
    - Compress data
    
* Various techniques can be used to assess the quality of reduced-dimensional data

* Principal component analysis is a prototype for dimensional reduction that is:
    - Unsupervised
    - Projectable
    - Invertible

* Principal component analysis can be thought of as:
    - determining the best rank-$k$ approximation of a matrix based on the Frobenius norm
    - determining the rank-$k$ matrix that retains the maximum possible variance of the original data
    - A projection onto the eigenvectors associated with the largest $k$ eigenvalues of the covariance matrix
   
* Three general approaches to dimensional reduction were introduced:
    - Retaining variance (PCA-based approaches, global structure considered)
    - Retaining distances (manifold approaches, local structure emphasized)
    - Intermediate representation (autoencoders, neural-network based)
    
* A combination of different dimensional reduction techniques is typically required in practice.

## Further Reading:

* Hastie Ch. 14.5-14.9
* Hastie Ch. 18
* [Sebastian Raschka PCA tutorial](https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)
* [Sebastian Raschka kernel PCA](https://sebastianraschka.com/Articles/2014_kernel_pca.html)