# Chapter 8 - Dimensionality Reduction

In [1]:
import pandas as pd
import numpy as np

In [2]:
import os, itertools

In [3]:
from sklearn import datasets

Many Machine Learning problems involve thousands or even millions of features for each training
instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. 

This problem is often referred to as the **curse of dimensionality**.

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. 

For example, consider the MNIST images: 

the pixels on the image borders are almost always white, so you could completely drop
these pixels from the training set without losing much information. 
Figure 7-6 confirms that these pixels are utterly unimportant for the classification task. 

Moreover, two neighboring pixels are often highly correlated: 

if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.

**WARNING**

Reducing dimensionality does **lose some information** (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse. 

It also makes your **pipelines** a bit more complex and thus harder to maintain. 

So you should **first try to train your system with the original** data before considering using
dimensionality reduction if training is too slow. 

In some cases, however, reducing the dimensionality of the training data may **filter out some noise and unnecessary** details and thus result in higher performance (but in general it won’t; it will just speed up training).

Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or *DataViz*). 

Reducing the number of dimensions down to two (or three) makes it possible to plot a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.

In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space. Then, we will present the two main approaches to dimensionality reduction (**projection and Manifold Learning**), and we will go through three of the most popular dimensionality reduction techniques: **PCA, Kernel PCA, and LLE**.

## The Curse of Dimensionality

We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our mind (see Figure 8-1), let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.

![8-1.PNG](attachment:8-1.PNG)

It turns out that many things behave very differently in high-dimensional space. 

For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). 
But in a 10,000-dimensional unit hypercube (a 1 × 1 × ... × 1 cube, with ten thousand 1s),
this probability is greater than 99.999999%. 

**Most points in a high-dimensional hypercube are very close to the *border*.**
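These figures follow from a short calculation: a point avoids every border only if each coordinate falls in [0.001, 0.999], which happens with probability 0.998 per dimension. A minimal check (the helper name `prob_near_border` is only illustrative):

```python
def prob_near_border(n_dims, margin=0.001):
    """Probability that a uniform random point in the unit hypercube lies
    within `margin` of at least one border."""
    # The point is "safe" only if every coordinate is in [margin, 1 - margin].
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims

p_square = prob_near_border(2)          # unit square
p_hypercube = prob_near_border(10_000)  # 10,000-dimensional unit hypercube

print(f"unit square: {p_square:.4%}")
print(f"10,000-D hypercube exceeds 99.999999%: {p_hypercube > 0.99999999}")
```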

Here is a more troublesome difference: 

if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. 

If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66. 

But what about two points picked randomly in a 1,000,000-dimensional hypercube? The average distance, believe it or not, will be about 408.25 (roughly $\sqrt{1,000,000/6}$)! 

This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? 
This fact implies that high-dimensional datasets are at risk
of being very **sparse**: 

**most training instances are likely to be far away from each other.**
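The average distances quoted above can be estimated with a quick Monte Carlo simulation. This is only a sketch: it uses 1,000 dimensions rather than 1,000,000 to keep memory use small, and compares the estimate with the same $\sqrt{d/6}$ approximation.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs, rng):
    """Monte Carlo estimate of the mean Euclidean distance between two
    points drawn uniformly at random from the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

rng = np.random.default_rng(42)
d2 = mean_pair_distance(2, 20_000, rng)        # unit square
d3 = mean_pair_distance(3, 20_000, rng)        # unit cube
d1000 = mean_pair_distance(1_000, 2_000, rng)  # 1,000-dimensional hypercube

print(d2, d3, d1000, np.sqrt(1_000 / 6))
```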

Of course, this also means that a new instance will likely be far away from any training instance, making predictions much **less reliable** than in lower dimensions, since they will be based on much larger extrapolations. 

In short: 

**the more dimensions the training set has, the greater the risk of overfitting it.**

In theory, one solution to the curse of dimensionality could be to **increase the size of the training set to reach a sufficient density of training instances**. Unfortunately, in practice, the **number of training instances required to reach a given density grows *exponentially*** with the number of dimensions:

With just 100 features (far fewer than in the MNIST problem), you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.

### Main Approaches for Dimensionality Reduction

There are two main approaches to reducing dimensionality:
* Projection
* Manifold Learning

## Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. 

Many features are **almost constant**, while others are **highly correlated**. 

As a result, all training instances actually **lie within (or close to) a much lower-dimensional subspace of the high-dimensional space**. 

Let’s look at an example. In Figure 8-2 you can see a 3D dataset represented by the circles.

![8-2.PNG](attachment:8-2.PNG)

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. 

Now if **we project every training instance perpendicularly onto this subspace** (as represented by the short lines connecting the instances to the plane), we get the new 2D
dataset shown in Figure 8-3. 

Ta-da! We have just reduced the dataset’s dimensionality from 3D to 2D.

Note that the axes correspond to new features $z_1$ and $z_2$ (the coordinates of the projections on the plane).

![8-3.PNG](attachment:8-3.PNG)

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may *twist* and *turn*, such as in the famous **Swiss roll** toy dataset represented in Figure 8-4.

![8-4.PNG](attachment:8-4.PNG)

**Simply projecting onto a plane** (e.g., by dropping $x_3$) would **squash different layers** of the Swiss roll together, as shown on the left of Figure 8-5. 

However, what you really want is to unroll the Swiss roll to obtain the 2D dataset on the right of Figure 8-5.
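To make the difference concrete, here is a sketch that assumes Scikit-Learn's `make_swiss_roll()` (the figures may have been generated with other parameters). That function also returns `t`, the position of each point along the roll, so the "unrolled" view can be built directly from it:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# noise=0.0 so that the relationship between X and t is exact
X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# Naive projection: drop x3. Points from different layers land on top of each other.
squashed = X[:, :2]

# Unrolled view: position along the roll (t) versus x2.
unrolled = np.c_[t, X[:, 1]]

print(X.shape, squashed.shape, unrolled.shape)
```

Plotting `squashed` and `unrolled` side by side (colored by `t`) gives pictures in the spirit of Figure 8-5.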

![8-5.PNG](attachment:8-5.PNG)

## Manifold Learning

The Swiss roll is an example of a 2D *manifold*. 

Put simply, a 2D manifold is a **2D shape that can be bent and twisted in a higher-dimensional space**. 

More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that **locally resembles** a d-dimensional hyperplane. 

In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but **it is rolled in the third dimension**.

Many dimensionality reduction algorithms work by **modeling the manifold on which the training instances lie**; this is called **Manifold Learning**. 

It relies on the manifold assumption, also called the *manifold hypothesis*:

Most real-world high-dimensional datasets **lie close to a much lower-dimensional manifold**. This assumption is very often empirically observed.

Once again, think about the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, they are more or less centered, and so on. 

If you *randomly* generated images, only a ridiculously tiny fraction of them would look like handwritten digits. 

In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you would have if you were allowed to generate any image you wanted.

These constraints tend to squeeze the dataset into a lower-dimensional manifold.

The manifold assumption is often accompanied by *another implicit assumption*: 

**The task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.**

For example, in the top row of Figure 8-6 the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary would be fairly complex, but in the 2D unrolled manifold space (on the right), the decision boundary is a simple straight line.


However, this assumption **does not always hold**. 

For example, in the bottom row of Figure 8-6, the decision boundary is located at $x_1 = 5$. 

This decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments).


In short, if you reduce the dimensionality of your training set before training a model, **it will usually speed up training**, but it may not always lead to a **better** or **simpler** solution; it all depends on the dataset.

Hopefully you now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds.

## PCA

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.

First it identifies the **hyperplane that lies closest to the data**, and then it **projects the data onto it.**

#### Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the **right hyperplane**. 

For example, a simple 2D dataset is represented on the left of Figure 8-7, along with
three different axes (i.e., one-dimensional hyperplanes). 

On the right is the result of the projection of the dataset onto each of these axes. 

As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto
the dashed line preserves an intermediate amount of variance.

![8-7.PNG](attachment:8-7.PNG)

It seems reasonable to **select the axis that preserves the maximum amount of variance**, as it will most likely **lose less information than the other projections**.

Another way to justify this choice is that it is the axis that **minimizes the mean squared distance between the original dataset and its projection** onto that
axis. This is the rather simple idea behind PCA.
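The equivalence between "maximum variance" and "minimum mean squared distance" can be checked numerically. The sketch below uses a made-up 2D dataset, scans candidate axes through the origin, and computes both quantities for each axis; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2D dataset, centered around the origin
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X = X - X.mean(axis=0)

angles = np.linspace(0, np.pi, 180, endpoint=False)
axes = np.c_[np.cos(angles), np.sin(angles)]   # one unit vector per candidate axis

coords = X @ axes.T                            # coordinate of each point along each axis
variance = coords.var(axis=0)                  # variance preserved by each projection

# Squared distance between each point and its projection onto each axis
projections = coords[:, :, None] * axes[None, :, :]
mean_sq_dist = ((X[:, None, :] - projections) ** 2).sum(axis=2).mean(axis=0)

best_by_variance = np.argmax(variance)
best_by_distance = np.argmin(mean_sq_dist)
total = (X ** 2).sum(axis=1).mean()            # mean squared norm of the centered data

print(np.degrees(angles[best_by_variance]), np.degrees(angles[best_by_distance]))
```

For centered data the two quantities always add up to the same total, which is why maximizing one minimizes the other.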

## Principal Components

PCA identifies the axis that accounts for the **largest amount of variance** in the training set. 

In Figure 8-7, it is the solid line. It also finds a second axis, **orthogonal to the first one**, that accounts for the **largest amount of remaining variance**.

In this 2D example there is no choice: it is the dotted line. 

If it were a higher-dimensional dataset, PCA would also find a third axis, *orthogonal to both previous axes*, and a fourth, a fifth, and so on: as many axes as the number of dimensions in the dataset.

The **unit vector** that defines the $i^{th}$ axis is called the $i^{th}$ **principal component (PC)**. 

In Figure 8-7, the 1st PC is $c_1$ and the 2nd PC is $c_2$. In Figure 8-2 the first two PCs are represented by the orthogonal arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).

**NOTE**

The **direction of the principal components is not stable**: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the **opposite direction** of the original PCs. However, they will generally *still lie on the same axes*. 

In some cases, a pair of PCs may even *rotate or swap*, but the plane they define will generally remain the same.

How can you find the principal components of a training set? 

Luckily, there is a standard matrix factorization technique called **Singular Value Decomposition (SVD)** that can decompose the training set matrix $X$ into the matrix product of three matrices $U \Sigma V^T$, where the columns of $V$ contain **all the principal components** that we are looking for, as shown in Equation 8-1.

![eq8-1.PNG](attachment:eq8-1.PNG)

The following Python code uses NumPy’s `svd()` function to obtain all the principal components of the training set, then extracts the first two PCs:

In [4]:
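# X is assumed to be an existing (m, n) NumPy array holding the training set
X_centered = X - X.mean(axis=0)  # PCA requires centered data
# Note: np.linalg.svd() returns the transpose of V as its third value,
# so the variable named V here actually holds V^T (one PC per row)
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]  # first principal component
c2 = V.T[:, 1]  # second principal component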


**WARNING**

PCA assumes that the dataset is **centered around the origin**. 

Scikit-Learn’s PCA classes take care of centering the data for you.

However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, **don’t forget to center the data first.**

## Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to $d$ dimensions by **projecting it onto the hyperplane defined by the first $d$ principal components**.

Selecting this hyperplane ensures that the projection will **preserve as much variance as possible**. 

For example, in Figure 8-2 the 3D dataset is projected down to the 2D plane defined by the first two principal components, *preserving a large part of the dataset’s variance*. 

As a result, the 2D projection looks very much like the original 3D dataset.

To **project the training set onto the hyperplane**, you can simply **compute the dot product of the training set matrix $X$ by the matrix $W_d$, defined as the *matrix containing the first $d$ principal components*** (i.e., the matrix composed of the first $d$ columns of $V$; in the code this is written `V.T[:, :d]` because NumPy’s `svd()` returns $V^T$), as shown in Equation 8-2.

![eq8-2.PNG](attachment:eq8-2.PNG)

The following Python code projects the training set onto the plane defined by the first two principal components:

In [5]:
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)

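The two snippets above assume that `X` already exists. Here is a self-contained sketch of the same steps on a made-up 3D dataset lying close to a plane (the dataset and its parameters are assumptions, not the one shown in Figure 8-2):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200

# Hypothetical dataset: 2D latent coordinates embedded in 3D, plus a little noise
z = rng.normal(size=(m, 2)) * [3.0, 1.0]
basis = np.array([[1.0, 0.5, 0.2],
                  [-0.3, 1.0, 0.4]])
X = z @ basis + 0.05 * rng.normal(size=(m, 3)) + [5.0, -2.0, 1.0]

# Center the data, then run SVD
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1, c2 = Vt[0], Vt[1]   # rows of Vt (columns of V) are the principal components
W2 = Vt[:2].T           # (3, 2) matrix whose columns are the first two PCs
X2D = X_centered @ W2   # projection onto the plane defined by c1 and c2

print(X2D.shape)
```

Naming the third return value `Vt` makes it explicit that NumPy hands back $V^T$, so `Vt[:2].T` plays the role of $W_2$.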

## Using Scikit-Learn

Scikit-Learn’s `PCA` class implements PCA using SVD, just like we did before. 

The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it *automatically takes care of centering the data*):

In [7]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)


After fitting the PCA transformer to the dataset, you can access the principal components using the `components_` attribute (note that it contains the PCs as **horizontal vectors**, so, for example, the first principal component is equal to `pca.components_.T[:, 0]`).

## Explained Variance Ratio

Another very useful piece of information is the ***explained variance ratio*** of each principal component, available via the `explained_variance_ratio_` attribute. 

It indicates the proportion of the dataset’s variance that lies along the **axis of each principal component**. 
For example, let’s look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 8-2:

In [8]:
pca.explained_variance_ratio_

For that dataset, the ratios show that 84.2% of the dataset’s variance lies along the first axis, and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.
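The link between Scikit-Learn's `PCA` and the manual SVD approach can be verified on a small made-up dataset (an assumption, so the ratios will differ from the 84.2% and 14.6% quoted above). The explained variance ratios are the squared singular values divided by their sum, and the components match the rows of $V^T$ up to sign:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [3.0, 1.0, 0.1]   # hypothetical 3D dataset

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)                        # centering is handled by PCA

# Same computation by hand
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
ratio_from_svd = s**2 / np.sum(s**2)

print(pca.explained_variance_ratio_, ratio_from_svd[:2])
```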

## Choosing the Right Number of Dimensions

Instead of *arbitrarily* choosing the number of dimensions to reduce down to, it is generally preferable to **choose the number of dimensions that add up to a sufficiently large portion of the variance** (e.g., 95%).

Unless, of course, you are reducing dimensionality for *data visualization* — in that case you will generally want to reduce the dimensionality down to 2 or 3.

The following code computes PCA without reducing dimensionality, then **computes the minimum number of dimensions required to preserve 95% of the training set’s variance**:

In [9]:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1


You could then set `n_components=d` and run PCA again. 

However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set `n_components` to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

In [10]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)


Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot `cumsum`; see Figure 8-8). 

There will usually be an elbow in the curve, where the **explained variance stops growing fast**. 

You can think of this as the *intrinsic dimensionality* of the dataset. 

In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn’t lose too much explained variance.

![8-8.PNG](attachment:8-8.PNG)
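Both ways of choosing the number of dimensions can be tried end to end. The sketch below uses Scikit-Learn's small built-in digits dataset (`load_digits()`, 64 features) as a stand-in for MNIST, which is an assumption made so the example runs without downloading anything; the resulting number of dimensions is therefore not the MNIST one.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data          # stand-in for the MNIST training set

# Option 1: fit a full PCA, then find the smallest d reaching 95% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

# Option 2: let PCA pick the number of components for the requested variance ratio
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_train)

print(d, pca_95.n_components_)
```

Plotting `cumsum` against the number of dimensions gives a curve of the kind shown in Figure 8-8.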

## PCA for Compression

After dimensionality reduction, the training set takes up much less space. 

For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So while most of the variance is preserved, the dataset is now less than 20% of its original size! 

This is a reasonable compression ratio, and you can see how this can speed up a classification algorithm (such as an SVM classifier) tremendously.

It is also possible to **decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection**.

Of course this won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be quite
close to the original data. 

*The mean squared distance between the original data and the reconstructed data*
(compressed and then decompressed) is called the **reconstruction error**. 

For example, the following code compresses the MNIST dataset down to 154 dimensions, then uses the `inverse_transform()` method to decompress it back to 784 dimensions. 

Figure 8-9 shows a few digits from the original training set (on the left), and the corresponding digits after compression and decompression. 

You can see that there is a slight image quality loss, but the digits are still mostly intact.

In [11]:
pca = PCA(n_components = 154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)


![8-9.PNG](attachment:8-9.PNG)

The equation of the inverse transformation is shown in Equation 8-3.

![eq8-3.PNG](attachment:eq8-3.PNG)
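As a rough illustration of compression, decompression, and the reconstruction error, the sketch below again substitutes the built-in digits dataset for MNIST and keeps 20 components (both are assumptions). It also checks that `inverse_transform()` amounts to multiplying by the transposed component matrix and adding the mean back, since Scikit-Learn centers the data:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data                    # stand-in for MNIST (64 features)

pca = PCA(n_components=20)
X_reduced = pca.fit_transform(X_train)          # compress
X_recovered = pca.inverse_transform(X_reduced)  # decompress

# Inverse transformation written out by hand (components_ holds one PC per row)
X_manual = X_reduced @ pca.components_ + pca.mean_

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X_train - X_recovered) ** 2, axis=1))
print(reconstruction_error)
```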

## Incremental PCA

One problem with the preceding implementation of PCA is that **it requires the whole training set to fit in memory in order for the SVD algorithm to run**. 

Fortunately, **Incremental PCA (IPCA)** algorithms have been developed: 

you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for **large training sets**, and also to apply **PCA online** (i.e., on the fly, as new instances arrive).


The following code splits the MNIST dataset into 100 mini-batches (using NumPy’s `array_split()` function) and feeds them to Scikit-Learn’s `IncrementalPCA` class to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). 

Note that you must call the `partial_fit()` method with each mini-batch rather than the `fit()` method with the whole training set:

In [12]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100

inc_pca = IncrementalPCA(n_components=154)

for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)

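A runnable variant of the mini-batch loop, assuming the built-in digits dataset as a small stand-in for MNIST and 20 components instead of 154 (each mini-batch must contain at least as many instances as `n_components`):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X_train = load_digits().data     # stand-in for MNIST: 1,797 instances, 64 features
n_batches = 10

inc_pca = IncrementalPCA(n_components=20)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)  # update the model with one mini-batch

X_reduced = inc_pca.transform(X_train)
print(X_reduced.shape)
```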

Alternatively, you can use NumPy’s `memmap` class, which allows you to **manipulate a large array stored in a binary file on *disk* as if it were entirely in *memory***; the class loads only the data it needs in memory, when it needs it. 

Since the `IncrementalPCA` class uses **only a small part** of the array at any given time,
the memory usage remains under control. This makes it possible to call the usual `fit()` method, as you can see in the following code:

In [13]:
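# Assumes `filename` is the path to a binary file containing the training set
# as float32 values, and that m and n are its number of rows and columns
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)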

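The `memmap` snippet presupposes a binary file on disk. The following sketch creates one in a temporary directory first; the file name, the digits stand-in dataset, and the 20 components are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X_train = load_digits().data.astype("float32")   # stand-in for MNIST
m, n = X_train.shape

# Write the training set to a binary file once
filename = os.path.join(tempfile.mkdtemp(), "digits.dat")
X_out = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))
X_out[:] = X_train
X_out.flush()
del X_out

# Later: map the file read-only and fit as if the array were in memory
X_mm = np.memmap(filename, dtype="float32", mode="r", shape=(m, n))
inc_pca = IncrementalPCA(n_components=20, batch_size=m // 10)
inc_pca.fit(X_mm)
X_reduced = inc_pca.transform(X_mm)
print(X_reduced.shape)
```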

## Randomized PCA

Scikit-Learn offers yet another option to perform PCA, called **Randomized PCA**. This is a stochastic algorithm that quickly *finds an **approximation** of the first d principal components*. 

Its computational complexity is $O(m × d^2) + O(d^3)$, instead of $O(m × n^2) + O(n^3)$, so it is dramatically faster than the previous algorithms when d is much smaller than n.

In [14]:
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)

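To get a feel for how good the approximation is, one can compare the randomized solver with the exact one. This sketch assumes the built-in digits dataset and 10 components; on a dataset this small the speed difference is negligible, so it only illustrates accuracy.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data     # stand-in for MNIST

full_pca = PCA(n_components=10, svd_solver="full").fit(X_train)
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=42).fit(X_train)

# Largest gap between the exact and approximate explained variance ratios
max_gap = np.max(np.abs(full_pca.explained_variance_ratio_
                        - rnd_pca.explained_variance_ratio_))
print(max_gap)
```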

## Kernel PCA

The **kernel trick** is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the **feature space**), enabling nonlinear classification and regression with Support Vector Machines. 

Recall that **a linear decision boundary in the high-dimensional** feature space corresponds to **a complex nonlinear decision boundary in the original space**.

It turns out that the same trick can be applied to PCA, making it possible to **perform complex nonlinear projections for dimensionality reduction**. 

This is called **Kernel PCA (kPCA)**. It is often good at **preserving clusters of instances** after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.


For example, the following code uses Scikit-Learn’s `KernelPCA` class to perform kPCA with an RBF kernel (see Chapter 5 for more details about the RBF kernel and the other kernels):

In [15]:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)

X_reduced = rbf_pca.fit_transform(X)


Figure 8-10 shows the Swiss roll, reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel (Logistic).

![8-10.PNG](attachment:8-10.PNG)

## Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, *there is no obvious performance measure to help you select the best kernel and hyperparameter values*. 

However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperparameters that **lead to the best performance on that task**. 

For example, the following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. Then it uses `GridSearchCV` to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline:

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)


The best kernel and hyperparameters are then available through the `best_params_` variable:

In [17]:
print(grid_search.best_params_)

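The grid search above needs a labeled dataset. Here is a self-contained sketch on a Swiss roll, where the binary labels `y` (a threshold on the position `t` along the roll) and the smaller grid of five gamma values are assumptions chosen to keep the run short:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, t = make_swiss_roll(n_samples=600, noise=0.2, random_state=42)
y = t > 6.9   # hypothetical binary target derived from the position along the roll

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 5),
    "kpca__kernel": ["rbf", "sigmoid"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```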

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that **yield the lowest reconstruction error**. 

However, **reconstruction is not as easy as with linear PCA**. 

Here’s why. Figure 8-11 shows the original Swiss roll 3D dataset (top left), and the resulting 2D dataset after kPCA is applied using an RBF kernel (top right). 

Thanks to the kernel trick, this is mathematically equivalent to **mapping the training set to an infinite-dimensional feature space** (bottom right) using the feature map φ, then **projecting the transformed training set down to 2D using linear PCA**. 

Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (e.g., like the one represented by an x in the diagram). 

Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error. 

Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. 

This is called the **reconstruction pre-image**. Once you have this pre-image, you can measure its squared distance to the original instance. 

You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.

![8-11.PNG](attachment:8-11.PNG)

You may be wondering how to perform this reconstruction. One solution is to train a supervised
regression model, with the projected instances as the training set and the original instances as the targets.

Scikit-Learn will do this automatically if you set `fit_inverse_transform=True`, as shown in the following code:

In [18]:
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.0433, fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)


**NOTE**

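By default, `fit_inverse_transform=False`, and `KernelPCA` cannot compute pre-images: depending on the Scikit-Learn version, the `inverse_transform()` method is either missing or raises an error when called. It only becomes usable when you set `fit_inverse_transform=True`.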

You can then compute the reconstruction pre-image error:
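A minimal sketch of that computation, assuming a Swiss roll generated with `make_swiss_roll()` and Scikit-Learn's `mean_squared_error()` as the error measure (it averages the squared differences over all coordinates, so it is proportional to the mean squared distance):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)   # back to the original 3D space

preimage_error = mean_squared_error(X, X_preimage)
print(preimage_error)
```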

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.

## LLE

Locally Linear Embedding (LLE) is another very powerful **nonlinear dimensionality reduction
(NLDR)** technique. 

It is a **Manifold Learning** technique that does not rely on projections like the previous
algorithms. 

In a nutshell, LLE works by first **measuring how each training instance linearly relates to its closest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved**. 


This makes it particularly good at *unrolling twisted manifolds*, especially when there is not too much noise.


For example, the following code uses Scikit-Learn’s `LocallyLinearEmbedding` class to unroll the Swiss roll. 

The resulting 2D dataset is shown in Figure 8-12. As you can see, the Swiss roll is completely
unrolled and the **distances between instances are locally well preserved**. 

However, distances are not preserved on a *larger scale*: the left part of the unrolled Swiss roll is squeezed, while the right part is stretched. 

Nevertheless, LLE did a pretty good job at modeling the manifold.

In [19]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)


![8-12.PNG](attachment:8-12.PNG)
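The LLE snippet above assumes that `X` already holds the Swiss roll. A self-contained version, assuming `make_swiss_roll()` with arbitrary sample size, noise level, and seeds:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=41)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_reduced = lle.fit_transform(X)

print(X_reduced.shape)
```

Coloring a scatter plot of `X_reduced` by `t` shows whether the roll has been unrolled.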

Here’s how LLE works:

* For each training instance $x^{(i)}$, the algorithm identifies its $k$ closest neighbors (in the preceding code $k = 10$).

* It then tries to reconstruct $x^{(i)}$ as a **linear function of these neighbors**. More specifically, it finds the weights $w_{i,j}$ such that the squared distance between $x^{(i)}$ and $\sum_{j=1}^{m} w_{i,j} x^{(j)}$ is as small as possible, assuming $w_{i,j} = 0$ if $x^{(j)}$ is not one of the $k$ closest neighbors of $x^{(i)}$.

Thus the first step of LLE is the *constrained optimization* problem described in Equation 8-4, where $W$ is the weight matrix containing all the weights $w_{i,j}$. 

The second constraint simply normalizes the weights for each training instance $x^{(i)}$.

![eq8-4.PNG](attachment:eq8-4.PNG)

After this step, the weight matrix $\widehat{W}$ (containing the weights $\hat{w}_{i,j}$) encodes the local linear relationships between the training instances.
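For a single instance, the weight-finding step can be solved in closed form: build the Gram matrix of the neighbors' offsets from $x^{(i)}$, solve a linear system, and rescale so the weights sum to 1. The sketch below illustrates this; the helper name `lle_weights` and the small regularization term are assumptions, and this is not Scikit-Learn's actual implementation.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights that best reconstruct x as a linear combination of its
    neighbors, under the constraint that the weights sum to 1."""
    offsets = neighbors - x                    # shape (k, n)
    gram = offsets @ offsets.T                 # local Gram matrix, shape (k, k)
    # Regularize so the system stays solvable when k > n (assumed scheme)
    gram = gram + reg * np.trace(gram) * np.eye(len(gram))
    w = np.linalg.solve(gram, np.ones(len(gram)))
    return w / w.sum()                         # enforce the sum-to-one constraint

# A point at the center of an equilateral triangle formed by its 3 neighbors
x = np.array([0.0, 0.0])
neighbors = np.array([[1.0, 0.0],
                      [-0.5, np.sqrt(3) / 2],
                      [-0.5, -np.sqrt(3) / 2]])
w = lle_weights(x, neighbors)
print(w)
```

By symmetry each neighbor receives the same weight here, and the weighted combination of the neighbors lands back on `x`.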

Now the second step is to **map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible**. 

If $z^{(i)}$ is the image of $x^{(i)}$ in this d-dimensional space, then we want the squared distance between $z^{(i)}$ and $\sum_{j=1}^{m} \hat{w}_{i,j} z^{(j)}$ to be as small as possible. 

This idea leads to the unconstrained optimization problem described in Equation 8-5. 

It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: **keeping the weights fixed and finding the optimal position of the instances’ images in the low-dimensional space**. Note that $Z$ is the matrix containing all $z^{(i)}$.

![eq8-5.PNG](attachment:eq8-5.PNG)

Scikit-Learn’s LLE implementation has the following computational complexity: $O(m \log(m)\, n \log(k))$ for finding the k nearest neighbors, $O(mnk^3)$ for optimizing the weights, and $O(dm^2)$ for constructing the low-dimensional representations. 

Unfortunately, the $m^2$ in the last term makes this algorithm scale poorly to very large datasets.

### Other Dimensionality Reduction Techniques

* **Multidimensional Scaling (MDS)** reduces dimensionality while trying to preserve the distances between the instances (see Figure 8-13).

* **Isomap** creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the *geodesic* distances between the instances (the geodesic distance between two nodes in a graph is the number of nodes on the shortest path between these nodes).

* **t-Distributed Stochastic Neighbor Embedding (t-SNE)** reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for *visualization*, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).

* **Linear Discriminant Analysis (LDA)** is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a **good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier**.
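All four techniques are available in Scikit-Learn and share the same `fit_transform()` interface. The sketch below applies each of them to the first 300 images of the built-in digits dataset; the dataset, the subset size, and the hyperparameters are assumptions chosen to keep the run short. Note that LDA is supervised, so it also needs the labels.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, TSNE, Isomap

digits = load_digits()
X, y = digits.data[:300], digits.target[:300]   # small subset to keep this fast

X_mds = MDS(n_components=2, random_state=42).fit_transform(X)
X_isomap = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

for name, Z in [("MDS", X_mds), ("Isomap", X_isomap), ("t-SNE", X_tsne), ("LDA", X_lda)]:
    print(name, Z.shape)
```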

**Exercises**

1. What are the main motivations for reducing a dataset’s dimensionality? What are the main drawbacks?

2. What is the curse of dimensionality?

3. Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

5. Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

6. In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

7. How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

8. Does it make any sense to chain two different dimensionality reduction algorithms?

9. Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?

10. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image’s target class. Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.