# Chapter 8 - Dimensionality Reduction
Curse of Dimensionality: A large number of features not only makes training extremely slow, it can also make it much harder to find a good solution. <br>
Dimensionality reduction not only speeds up training, it is also extremely useful for data visualization.

## The Curse of Dimensionality
High-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. => A new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions. => **The more dimensions the training set has, the greater the risk of overfitting it.**
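
The sparsity argument can be checked numerically. The sketch below is only an illustration (the seed, the number of pairs, and the helper name `mean_pair_distance` are arbitrary choices, not from the chapter): it draws random pairs of points in a unit hypercube and measures how far apart they are on average.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

def mean_pair_distance(n_dims, n_pairs=2000):
    """Average Euclidean distance between pairs of random points in a unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

distances = {n_dims: mean_pair_distance(n_dims) for n_dims in (2, 3, 100, 1000)}
for n_dims, dist in distances.items():
    print(f"{n_dims:>4} dimensions: mean distance = {dist:.2f}")
```

The average distance keeps growing with the number of dimensions (roughly in proportion to its square root), so a fixed number of instances covers the space more and more thinly.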

## Main Approaches for Dimensionality Reduction

### Projection
In most real-world problems, many features are almost constant, while others are highly correlated. <br>
As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.

### Manifold Learning
A *d*-dimensional manifold is a part of an *n*-dimensional space (where *d* < *n*) that locally resembles a *d*-dimensional hyperplane.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.

## PCA
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

### Preserving the Variance
Select a lower-dimensional hyperplane that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. <br>
Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
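
The two justifications pick the same axis, which a small experiment can illustrate. This is a sketch on a made-up 2D dataset (an elongated cloud rotated by 30 degrees; all names and values are assumptions for illustration): it tries candidate axes one degree apart and records, for each, the variance preserved and the mean squared distance to the projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D dataset: an elongated cloud rotated by 30 degrees
true_angle = np.pi / 6
rotation = np.array([[np.cos(true_angle), -np.sin(true_angle)],
                     [np.sin(true_angle),  np.cos(true_angle)]])
X = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ rotation.T
X_centered = X - X.mean(axis=0)

def variance_and_error(angle):
    """Project onto the axis at `angle`; return (preserved variance, mean squared distance)."""
    u = np.array([np.cos(angle), np.sin(angle)])  # unit vector of the candidate axis
    z = X_centered @ u                            # 1D coordinates along the axis
    X_proj = np.outer(z, u)                       # projected points, back in 2D
    mse = np.mean(np.sum((X_centered - X_proj) ** 2, axis=1))
    return z.var(), mse

angles = np.linspace(0, np.pi, 181)               # candidate axes, 1 degree apart
results = np.array([variance_and_error(a) for a in angles])
best_by_variance = angles[results[:, 0].argmax()]
best_by_error = angles[results[:, 1].argmin()]
print(np.degrees(best_by_variance), np.degrees(best_by_error))
```

For centered data, the preserved variance and the mean squared distance add up to the total variance, so maximizing one is the same as minimizing the other.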

### Principal Components
PCA finds as many axes as there are dimensions in the dataset, each one orthogonal to the previous ones. <br>
The unit vector that defines the $i^{th}$ axis is called the $i^{th}$ principal component (PC).

Singular Value Decomposition (SVD): decomposes the training set matrix $X$ into the matrix multiplication of three matrices $U \Sigma V^{T}$, where $V$ contains the unit vectors that define all the principal components that we are looking for.

In [None]:
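import numpy as np

# PCA assumes that the dataset is centered around the origin
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component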
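
As a sanity check of the decomposition, the following sketch rebuilds the centered matrix from the three factors. The dataset is random and purely illustrative; note that `np.linalg.svd()` returns the singular values as a 1D array, so $\Sigma$ has to be assembled by hand.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))  # hypothetical 3D dataset
X_centered = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_centered)

# Build the (m x n) Sigma matrix from the 1D array of singular values
m, n = X_centered.shape
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)

reconstruction_ok = np.allclose(X_centered, U @ Sigma @ Vt)
components_orthonormal = np.allclose(Vt @ Vt.T, np.eye(n))
print(reconstruction_ok, components_orthonormal)
```

The rows of `Vt` (the columns of $V$) are orthonormal, and the singular values in `s` come sorted in decreasing order, which is why the first columns of `Vt.T` are the first principal components.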

### Projecting Down to *d* Dimensions
Projecting the training set down to *d* dimensions, where $W_{d}$ is the matrix containing the first *d* columns of $V$:
$$X_{d-proj} = XW_{d}$$

In [None]:
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)

### Using Scikit-Learn
Scikit-Learn's `PCA` class implements PCA using SVD, and it automatically takes care of centering the data.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # n_components: the number of dimensions to keep
X2D = pca.fit_transform(X)
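
To see that the manual SVD route and the `PCA` class agree, the sketch below runs both on a random, purely illustrative dataset. It assumes the fitted `components_` attribute holds one principal component per row; the comparison uses absolute values because the sign of each component is arbitrary and may differ between the two approaches.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # hypothetical dataset

# Manual route: center, decompose, project onto the first two components
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
X2D_manual = X_centered @ Vt.T[:, :2]

# Scikit-Learn route: centering is handled internally
pca = PCA(n_components=2)
X2D_sklearn = pca.fit_transform(X)

same_axes = np.allclose(np.abs(pca.components_), np.abs(Vt[:2]), atol=1e-6)
same_projection = np.allclose(np.abs(X2D_sklearn), np.abs(X2D_manual), atol=1e-6)
print(same_axes, same_projection)
```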

### Explained Variance Ratio
Explained variance ratio: It is available via the `explained_variance_ratio_` attribute of a fitted `PCA` object. It indicates the proportion of the dataset's variance that lies along each principal component.
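
A small example may make the attribute more concrete. The dataset below is made up for illustration: 3D points lying close to a 2D plane, so nearly all the variance should fall along the first two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical 3D dataset lying close to a 2D plane
latent = rng.normal(size=(300, 2)) * [4.0, 1.5]
plane = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
X = latent @ plane + rng.normal(scale=0.1, size=(300, 3))

pca = PCA(n_components=2)
pca.fit(X)
ratios = pca.explained_variance_ratio_  # one value per principal component
lost = 1 - ratios.sum()                 # share of variance along the dropped axis
print(ratios, lost)
```

Here `lost` is the share of variance carried by the third axis, which is tiny because it only contains the added noise.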

### Choosing the Right Number of Dimensions
It is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).

In [None]:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1  # smallest d that preserves at least 95% of the variance
# then set n_components = d and run PCA again

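Alternatively, set `n_components` to a float between 0.0 and 1.0, indicating the ratio of variance to preserve: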

In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
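
Both approaches should select the same number of dimensions. The sketch below checks this on Scikit-Learn's bundled digits dataset, which is used here only as a convenient stand-in for `X_train`; it relies on the fitted `n_components_` attribute to read back the number of components that was kept.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data  # stand-in training set: 8x8 images, 64 features

# Approach 1: fit a full PCA, then find the smallest d reaching 95% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

# Approach 2: let Scikit-Learn pick the number of components
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_train)

print(d, pca_95.n_components_, X_reduced.shape)
```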

### PCA for Compression
It is also possible to decompress the reduced dataset back to the original number of dimensions by applying the inverse transformation of the PCA projection. <br>
The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the *reconstruction error*.

PCA inverse transformation, back to the original number of dimensions
$$X_{recovered} = X_{d-proj}W_{d}^{T}$$


In [None]:
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)
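
The reconstruction error can be computed directly from its definition. The sketch below uses the bundled digits dataset as a stand-in (an assumption, since the 154-component example above presumably targets a larger dataset) and compares several compression levels. It also reproduces `inverse_transform()` by hand; note that Scikit-Learn adds the training mean back, a step the centered formula above leaves implicit.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # stand-in dataset with 64 features

def reconstruction_error(n_components):
    """Mean squared distance between the original and the reconstructed instances."""
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    X_recovered = pca.inverse_transform(X_reduced)
    return np.mean(np.sum((X - X_recovered) ** 2, axis=1))

errors = {d: reconstruction_error(d) for d in (2, 10, 30, 64)}
for d, err in errors.items():
    print(f"d = {d:>2}: reconstruction error = {err:.3f}")

# Manual inverse transformation: X_recovered = X_d-proj W_d^T, plus the mean
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)
X_manual = X_reduced @ pca.components_ + pca.mean_
matches_sklearn = np.allclose(X_manual, pca.inverse_transform(X_reduced))
```

The error shrinks as more components are kept and is essentially zero when all 64 are kept, since nothing is discarded in that case.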

### Randomized PCA
Scikit-Learn uses a stochastic algorithm called *Randomized PCA* that quickly finds an approximation of the first *d* principal components. Its computational complexity is $O(m \times d^{2}) + O(d^{3})$, instead of $O(m \times n^{2}) + O(n^{3})$ for the full SVD approach, so it is dramatically faster when *d* is much smaller than *n*. <br>
To use it, set the `svd_solver` hyperparameter to `"randomized"`. <br>
By default, `svd_solver` is actually set to `"auto"`: Scikit-Learn automatically uses the randomized PCA algorithm if *m* or *n* is greater than 500 and *d* is less than 80% of *m* or *n*, or else it uses the full SVD approach.

In [None]:
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)
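
Since Randomized PCA is an approximation, it can be useful to compare it with the full SVD solver. The sketch below does so on the bundled digits dataset (a stand-in for `X_train`, with an arbitrary choice of 10 components and a fixed `random_state` so the run is repeatable).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data  # stand-in training set

full_pca = PCA(n_components=10, svd_solver="full").fit(X_train)
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=42).fit(X_train)

# The explained variance ratios should be very close, though not necessarily identical
print(full_pca.explained_variance_ratio_.round(4))
print(rnd_pca.explained_variance_ratio_.round(4))
```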

### Incremental PCA
*Incremental PCA* (IPCA) algorithms allow you to split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for large training sets that do not fit in memory, and for applying PCA online.

In [None]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch) # you must call the partial_fit() method

Alternatively, use NumPy's `memmap` class and call the usual `fit()` method:

In [None]:
# NumPy's memmap class allows you to manipulate a large array stored in a
# binary file on disk as if it were entirely in memory; the class loads only
# the data it needs in memory, when it needs it.

X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m,n))

batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
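
To get a feel for how close the incremental result is to regular PCA, the sketch below fits both on the bundled digits dataset (a small stand-in that easily fits in memory; the batch count and the 10 components are arbitrary choices) and compares the share of variance each one preserves.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA

X_train = load_digits().data  # stand-in training set
n_batches = 10

# Feed the training set one mini-batch at a time
inc_pca = IncrementalPCA(n_components=10)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train)

# Regular PCA on the whole training set, for comparison
pca = PCA(n_components=10).fit(X_train)
preserved_inc = inc_pca.explained_variance_ratio_.sum()
preserved_pca = pca.explained_variance_ratio_.sum()
print(f"IPCA: {preserved_inc:.3f}, PCA: {preserved_pca:.3f}")
```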

## Kernel PCA
Kernel PCA (kPCA) makes it possible to perform complex nonlinear projections for dimensionality reduction. It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.

In [None]:
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
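
The snippet above leaves `X` undefined; a self-contained variant is sketched below on a Swiss roll generated with `make_swiss_roll()` (the dataset and its parameters are assumptions for illustration). It also fits a linear kernel, which should reproduce plain PCA up to the sign of each axis, as a way to relate kPCA back to the linear case.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA

# Hypothetical dataset: a noisy 3D Swiss roll
X, t = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

lin_pca = KernelPCA(n_components=2, kernel="linear")
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)

X_lin = lin_pca.fit_transform(X)
X_rbf = rbf_pca.fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)

# With a linear kernel, kPCA matches regular PCA (signs may be flipped)
linear_matches_pca = np.allclose(np.abs(X_lin), np.abs(X_pca), atol=1e-3)
print(linear_matches_pca, X_rbf.shape)
```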

### Selecting a Kernel and Tuning Hyperparameters
There is no obvious performance measure to help you select the best kernel and hyperparameter values because kPCA is an unsupervised learning algorithm. <br>
However, dimensionality reduction is often a preparation step for a supervised learning task, so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression())
])

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)

Another approach is to select the kernel and hyperparameters that yield the lowest reconstruction error.

In [None]:
# Performing reconstruction

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

In [None]:
# Computing the reconstruction pre-image error

from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)
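
The two cells above evaluate a single value of `gamma`. A minimal sketch of the selection itself could loop over candidate values and keep the one with the lowest pre-image error; the Swiss roll dataset and the five candidate values below are assumptions for illustration, and a cross-validated search would be more robust than this plain loop.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

# Hypothetical dataset: a noisy 3D Swiss roll
X, _ = make_swiss_roll(n_samples=300, noise=0.2, random_state=42)

errors = {}
for gamma in np.linspace(0.03, 0.05, 5):
    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                        fit_inverse_transform=True)
    X_reduced = rbf_pca.fit_transform(X)
    X_preimage = rbf_pca.inverse_transform(X_reduced)
    errors[float(gamma)] = mean_squared_error(X, X_preimage)

best_gamma = min(errors, key=errors.get)  # gamma with the lowest pre-image error
print(best_gamma, errors[best_gamma])
```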

## LLE
*Locally Linear Embedding* (LLE) is a very powerful nonlinear dimensionality reduction (NLDR) technique. <br>
LLE works by first measuring how each training instance linearly relates to its closest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. <br>
This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

In [None]:
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

Computational complexity of Scikit-Learn's LLE implementation:
- $O(m \log(m) n \log(k))$ for finding the *k* nearest neighbors
- $O(mnk^{3})$ for optimizing the weights
- $O(dm^{2})$ for constructing the low-dimensional representations => the $m^{2}$ in this last term makes the algorithm scale poorly to very large datasets.
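
The first step of LLE, expressing each instance as a linear combination of its closest neighbors, can be sketched with NumPy for a single instance. This is an illustration rather than Scikit-Learn's actual implementation: the Swiss roll dataset, the instance index `i`, and the regularization constant `1e-3` are all assumptions. The weights minimize the squared distance between the instance and its weighted neighbors, subject to the weights summing to 1.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

X, _ = make_swiss_roll(n_samples=500, noise=0.1, random_state=42)
k = 10
i = 0  # index of the instance to reconstruct (arbitrary)

# Ask for k + 1 neighbors because the closest "neighbor" of X[i] is X[i] itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, idx = nn.kneighbors(X[i:i + 1])
neighbor_idx, neighbor_dists = idx[0, 1:], dists[0, 1:]

Z = X[neighbor_idx] - X[i]           # neighbors expressed relative to X[i]
C = Z @ Z.T                          # local Gram matrix, shape (k, k)
C += 1e-3 * np.trace(C) * np.eye(k)  # regularization (assumed value), needed since k > n
w = np.linalg.solve(C, np.ones(k))
w /= w.sum()                         # enforce the constraint: weights sum to 1

reconstruction = w @ X[neighbor_idx]
error = np.linalg.norm(X[i] - reconstruction)
print(w.round(3), error)
```

LLE repeats this for every instance, then looks for low-dimensional coordinates in which the same weights reconstruct each instance as well as possible.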

## Other Dimensionality Reduction Techniques
- *Multidimensional Scaling* (MDS) reduces dimensionality while trying to preserve the distances between the instances.
- *Isomap* creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.
- *t-Distributed Stochastic Neighbor Embedding* (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.
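- *Linear Discriminant Analysis* (LDA) is a classification algorithm that, during training, learns the most discriminative axes between the classes. Projecting the data onto these axes keeps the classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.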
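
All four techniques are available in Scikit-Learn with the same `fit_transform()` interface. The sketch below runs them on a small subset of the bundled digits dataset (the subset size, `n_neighbors=10`, and the random seeds are arbitrary choices made to keep the example fast and repeatable). LDA is the only supervised one, so it also needs the labels.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, TSNE, Isomap

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small subset to keep the example fast

X_mds = MDS(n_components=2, random_state=42).fit_transform(X)
X_isomap = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses labels

for name, X_2d in [("MDS", X_mds), ("Isomap", X_isomap),
                   ("t-SNE", X_tsne), ("LDA", X_lda)]:
    print(name, X_2d.shape)
```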