
# Chapter 8: Dimensionality Reduction

Dimensionality reduction can filter out noise and unnecessary details, and it also speeds up training.

It is useful for data visualization and for finding patterns and clusters.

It reduces the space needed to store the data.


## Curse of Dimensionality

The curse of dimensionality refers to the fact that many problems that do not
exist in low-dimensional space arise in high-dimensional space. 

In Machine Learning, one common manifestation is that randomly sampled high-dimensional
vectors are generally very sparse, which increases the risk of overfitting
and makes it very difficult to identify patterns in the data without plenty
of training data.

In other words, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other.
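A small numerical sketch can make this sparsity concrete. It samples pairs of random points in a unit hypercube and prints their average distance as the number of dimensions grows. The helper name `mean_pair_distance` and the sample sizes are illustrative choices, not part of the original notes.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs=2_000, seed=42):
    """Average Euclidean distance between random point pairs in a unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# The average distance keeps growing as dimensions are added.
for n_dims in (2, 10, 100, 1_000):
    print(n_dims, round(mean_pair_distance(n_dims), 2))
```

Because points drift apart as dimensions are added, a training set of fixed size covers the space more and more thinly.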

## PCA

PCA first identifies the hyperplane that lies closest to the data, and then
projects the data onto it while preserving the maximum
variance.

The first principal component is the axis that accounts for the largest amount of variance in the training set. It is also the axis that minimizes the mean squared distance
between the original dataset and its projection onto that axis.
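The principal components can also be obtained directly from a singular value decomposition of the centered data. The following is a minimal NumPy sketch on a small synthetic dataset (the data and the variable names are made up for illustration). It shows one standard way to compute PCA, not necessarily what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3D data that mostly varies along two directions (illustrative only).
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.1])

# PCA assumes the data is centered around the origin.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal components, ordered by decreasing variance.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T  # first two principal components, stored as columns

# Project the data onto the plane spanned by those two components.
X2D_demo = X_centered @ W2
print(X2D_demo.shape)
```

In this sketch the third direction carries very little variance, so projecting onto the first two components loses almost no information.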

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)

In [None]:
# The first principal component is the first row of components_
# (equivalently, the first column of its transpose).
pca.components_.T[:, 0]

#### Explained Variance Ratio
The explained variance ratio of each principal component is the proportion of the dataset’s variance that lies along the axis of that component.

In [None]:
print(pca.explained_variance_ratio_)

#### Choosing the Right Number of Dimensions

Rather than picking the number of dimensions arbitrarily, choose the number of dimensions that add up to a sufficiently
large portion of the variance (e.g., 95%). The exception is dimensionality reduction
for data visualization: in that case you will generally want to reduce the
dimensionality down to 2 or 3.

In [None]:
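import numpy as np

pca = PCA()
pca.fit(X)
# Cumulative share of variance explained by the first k components
cumsum = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of dimensions that preserves at least 95% of the variance
d = np.argmax(cumsum >= 0.95) + 1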

Instead of specifying the number of principal components you want to
preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the
ratio of variance you wish to preserve:

In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
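As a quick sanity check, the two routes can be compared on the same data. The sketch below uses a synthetic dataset (made up for illustration) and reads the number of components that PCA selected from its `n_components_` attribute. The two numbers should agree unless the cumulative sum lands exactly on the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 10 features with steadily decreasing spread (illustrative only).
scales = np.array([10, 8, 6, 4, 3, 2, 1, 0.5, 0.2, 0.1])
X_demo = rng.normal(size=(500, 10)) * scales

# Manual route: smallest d whose cumulative explained variance reaches 95%.
cumsum_demo = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
d_demo = np.argmax(cumsum_demo >= 0.95) + 1

# Shortcut: let PCA pick the number of components itself.
pca_95 = PCA(n_components=0.95).fit(X_demo)
print(d_demo, pca_95.n_components_)
```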

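Another option is to plot the explained variance as a function of the number of dimensions. There is usually an elbow in the
curve, where the explained variance stops growing fast. You can think of this point as the **intrinsic dimensionality of the dataset**.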

#### Reconstruction Error

Decompressing the reduced data does not give back the original data exactly, since the projection lost a bit of information (within the
5% variance that was dropped). The mean squared distance between the original data and the reconstructed data
(compressed and then decompressed) is called the reconstruction error.

**The inverse_transform() method** maps the reduced data back to the original number of dimensions:

In [None]:
pca = PCA(n_components = 154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)
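The reconstruction error itself can then be computed in one line. The sketch below uses a small synthetic dataset as a stand-in for real data; all names ending in `_demo` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a real dataset: 20 features driven by 5 factors plus noise.
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.05 * rng.normal(size=(300, 20))

pca_demo = PCA(n_components=5)
X_demo_reduced = pca_demo.fit_transform(X_demo)
X_demo_recovered = pca_demo.inverse_transform(X_demo_reduced)

# Mean squared distance between each instance and its reconstruction.
reconstruction_error = np.mean(np.sum((X_demo - X_demo_recovered) ** 2, axis=1))
print(reconstruction_error)
```

Keeping more components can only lower this error, and keeping all of them brings it to (numerically) zero.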


## Incremental PCA

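Incremental PCA (IPCA) lets you split the training
set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is
useful for large training sets, and also for applying PCA online (i.e., on the fly, as new instances arrive).

The following code splits the training set into mini-batches using NumPy’s array_split() function and feeds them to the IncrementalPCA class.

Note that you must call the **partial_fit() method** with each mini-batch, rather than the fit() method with the whole training set.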

In [None]:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
# Feed the training set to IPCA one mini-batch at a time
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)

X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, you can use NumPy’s memmap class, which allows you to manipulate a
large array stored in a binary file on disk as if it were entirely in memory; the class
loads only the data it needs in memory, when it needs it.

The IncrementalPCA class processes the array in batches, so you can call the usual fit() method:

In [None]:
# filename points to a binary file holding an m x n float32 array
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
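The cell above assumes that the binary file already exists. A self-contained sketch of the whole round trip might look as follows; the file location, the array size, and the random data are all hypothetical and chosen only so the example runs quickly.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical dataset size and file location, chosen only for this sketch.
m_demo, n_demo = 1_000, 20
filename_demo = os.path.join(tempfile.mkdtemp(), "demo_data.dat")

# Write the data to disk once (random numbers stand in for real data).
X_write = np.memmap(filename_demo, dtype="float32", mode="w+", shape=(m_demo, n_demo))
X_write[:] = np.random.default_rng(0).normal(size=(m_demo, n_demo))
X_write.flush()
del X_write

# Reopen read-only; the array is now backed by the file on disk.
X_mm_demo = np.memmap(filename_demo, dtype="float32", mode="r", shape=(m_demo, n_demo))
n_batches_demo = 10
inc_pca_demo = IncrementalPCA(n_components=5, batch_size=m_demo // n_batches_demo)
inc_pca_demo.fit(X_mm_demo)
X_mm_reduced = inc_pca_demo.transform(X_mm_demo)
print(X_mm_reduced.shape)
```

The `dtype` and `shape` passed when reopening the file must match the ones used when it was written, since the raw binary file does not store them.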

## Randomized PCA
Randomized PCA is a **stochastic algorithm** that quickly finds an approximation of the first d principal
components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it is dramatically **faster** than the previous algorithms **when d is much smaller than n.**

In [None]:

rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
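To get a feel for how close the approximation is, the randomized solver can be compared with the full SVD solver on the same data. This is a sketch on synthetic data with a few dominant directions (the data and the `_demo` names are made up); results on real datasets will vary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with 5 strong directions hidden in 200 features (illustrative only).
latent = rng.normal(size=(1_000, 5)) * np.array([10, 8, 6, 4, 2])
X_demo = latent @ rng.normal(size=(5, 200)) + 0.1 * rng.normal(size=(1_000, 200))

full_pca_demo = PCA(n_components=5, svd_solver="full").fit(X_demo)
rnd_pca_demo = PCA(n_components=5, svd_solver="randomized", random_state=42).fit(X_demo)

# Both solvers should report nearly the same explained variance ratios here.
print(full_pca_demo.explained_variance_ratio_)
print(rnd_pca_demo.explained_variance_ratio_)
```

Setting `random_state` makes the stochastic solver reproducible from run to run.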

## Kernel PCA

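The kernel trick, best known from SVMs, can also be applied to PCA, making it possible to perform
complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA).

It is often **good** at preserving clusters of instances after projection, or
sometimes even unrolling datasets that lie close to a twisted manifold.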

In [None]:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

## Selecting a Kernel and Tuning Hyperparameters

Since kPCA is an unsupervised learning algorithm, there is no obvious performance
measure to help you select the best kernel and hyperparameter values.

However, dimensionality reduction is often a preparation step for a supervised learning task
(e.g., classification), so you can simply use grid search to select the kernel and hyperparameters
that lead to the best performance on that task.

The following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. It then uses GridSearchCV to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline:

In [None]:
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

In [None]:
print(grid_search.best_params_)
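The cells above leave `X` and `y` undefined. One way to try the search end to end is on a Swiss roll with a binary label derived from the position along the roll. That setup is an assumption made for this sketch (the threshold 6.9 is arbitrary), and the grid is made smaller so it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumed setup: Swiss roll data, labeled by whether the position t along
# the roll exceeds an arbitrary threshold.
X_demo, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
y_demo = t > 6.9

clf_demo = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])
param_grid_demo = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 5),
    "kpca__kernel": ["rbf", "sigmoid"],
}]
grid_search_demo = GridSearchCV(clf_demo, param_grid_demo, cv=3)
grid_search_demo.fit(X_demo, y_demo)

print(grid_search_demo.best_params_)
print(grid_search_demo.best_score_)
```

The `kpca__` prefix in the parameter names tells GridSearchCV which pipeline step each hyperparameter belongs to.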

### Reconstruction Error

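Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters
that yield the lowest reconstruction error.

Reconstruction is not as easy as with linear PCA. A reduced instance corresponds to a point in the (possibly infinite-dimensional) feature space defined by the kernel, so the true reconstruction error cannot be computed. Instead, it is possible to find a point in the original space that would map close to the reconstructed point. This point is called the reconstruction pre-image, and its squared distance to the original instance is the pre-image error. You can then select the kernel
and hyperparameters that minimize this reconstruction pre-image error.

Scikit-Learn computes the pre-image automatically if you **set fit_inverse_transform=True**: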

In [None]:
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

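By default, fit_inverse_transform=False and KernelPCA has no
inverse_transform() method. The method is only created when you set fit_inverse_transform=True.

You can then compute the reconstruction pre-image error: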

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)
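Selecting a hyperparameter this way amounts to repeating the computation for each candidate and keeping the one with the lowest error. The sketch below does this for gamma with an RBF kernel on Swiss roll data. The dataset and the candidate values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

# Assumed data: a small Swiss roll; the candidate gamma values are arbitrary.
X_demo, _ = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

errors = {}
for gamma in np.linspace(0.01, 0.1, 5):
    kpca_demo = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                          fit_inverse_transform=True)
    X_demo_reduced = kpca_demo.fit_transform(X_demo)
    X_demo_preimage = kpca_demo.inverse_transform(X_demo_reduced)
    errors[gamma] = mean_squared_error(X_demo, X_demo_preimage)

# Keep the gamma with the lowest pre-image error.
best_gamma = min(errors, key=errors.get)
print(best_gamma, errors[best_gamma])
```

The same loop can be extended to other kernels, or wrapped in a grid search with a custom scoring function.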

## LLE
Locally Linear Embedding (LLE) is a nonlinear dimensionality
reduction (NLDR) technique.

LLE works by first measuring
how each training instance linearly relates to its closest neighbors (c.n.), and then
looking for a low-dimensional representation of the training set where these local
relationships are best preserved.
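The two steps can be written as optimization problems. This is the standard LLE formulation; the symbols are notation introduced here rather than taken from the notes above: $\mathbf{x}^{(i)}$ are the $m$ training instances, $w_{i,j}$ are the weights, $k$ is the number of neighbors, and $\mathbf{z}^{(i)}$ is the low-dimensional image of $\mathbf{x}^{(i)}$.

Step 1, linearly modeling local relationships:

$$
\hat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{i=1}^{m} \Big\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \Big\|^2
\quad \text{subject to} \quad
\begin{cases}
w_{i,j} = 0 & \text{if } \mathbf{x}^{(j)} \text{ is not one of the } k \text{ c.n. of } \mathbf{x}^{(i)} \\
\sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, \dots, m
\end{cases}
$$

Step 2, reducing dimensionality while preserving those relationships, with the weights $\hat{w}_{i,j}$ held fixed:

$$
\hat{\mathbf{Z}} = \underset{\mathbf{Z}}{\operatorname{argmin}} \sum_{i=1}^{m} \Big\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\,\mathbf{z}^{(j)} \Big\|^2
$$

In practice the $\mathbf{z}^{(i)}$ are also constrained (for example to zero mean and unit covariance) to rule out the trivial solution where every $\mathbf{z}^{(i)}$ is zero.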

Scikit-Learn provides the **LocallyLinearEmbedding class**:

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

Scikit-Learn’s LLE implementation has the following computational complexity (m instances, n features, k neighbors, d output dimensions):
O(m log(m) n log(k)) for finding the k nearest neighbors, O(m n k³) for optimizing the
weights, and O(d m²) for constructing the low-dimensional representations. Unfortunately,
the m² in the last term makes this algorithm scale poorly to very large datasets.

## Other Dimensionality Reduction Techniques

*   **Multidimensional Scaling (MDS)** reduces dimensionality while trying to preserve
the distances between the instances.

*   **Isomap** creates a graph by connecting each instance to its nearest neighbors, then
reduces dimensionality while trying to preserve the geodesic distances between
the instances.

* **t-Distributed Stochastic Neighbor Embedding (t-SNE)** reduces dimensionality
while trying to keep similar instances close and dissimilar instances apart. It is
mostly used for visualization, in particular to visualize clusters of instances in
high-dimensional space (e.g., to visualize the MNIST images in 2D).

*   **Linear Discriminant Analysis (LDA)** is actually a classification algorithm, but during
training it learns the most discriminative axes between the classes, and these
axes can then be used to define a hyperplane onto which to project the data. The
benefit is that the projection will keep classes as far apart as possible, so LDA is a
good technique to reduce dimensionality before running another classification
algorithm such as an SVM classifier.
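All four techniques are available in Scikit-Learn and follow the same fit_transform() pattern. The sketch below applies each of them to a small Swiss roll. The dataset, the three made-up classes used for LDA, and the hyperparameter values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, TSNE, Isomap

# Assumed data: a small Swiss roll, with three made-up classes obtained by
# cutting the position t along the roll into equal-sized bins (used by LDA only).
X_demo, t = make_swiss_roll(n_samples=300, noise=0.2, random_state=42)
y_demo = np.digitize(t, np.quantile(t, [1 / 3, 2 / 3]))

X_mds = MDS(n_components=2, random_state=42).fit_transform(X_demo)
X_isomap = Isomap(n_components=2, n_neighbors=10).fit_transform(X_demo)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_demo)

# LDA is supervised: it needs the labels, and it can output at most
# (number of classes - 1) dimensions.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_demo, y_demo)

for name, Z in [("MDS", X_mds), ("Isomap", X_isomap), ("t-SNE", X_tsne), ("LDA", X_lda)]:
    print(name, Z.shape)
```

Plotting each 2D result colored by `t` is a quick way to compare how well the different methods unroll the manifold.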