## CHAPTER 9
---
# DIMENSIONALITY REDUCTION USING FEATURE EXTRACTION

---
Not all features are created equal. The goal of feature extraction for dimensionality reduction is to reduce the number of features with only a small loss in our data’s ability to generate high-quality predictions.

One downside of the feature extraction techniques we discuss is that the new features they generate are not interpretable by humans. If we want to maintain our ability to interpret our models, dimensionality reduction through feature selection is a better option.

## 9.1 Reducing Features Using Principal Components

- Given a set of features, you want to reduce the number of features while retaining the variance in the data.
- Use **`principal component analysis`** with scikit-learn’s `PCA`.

In [1]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)

# Create a PCA that will retain 99% of variance
pca = PCA(n_components=0.99, whiten=True)

# Conduct PCA
features_pca = pca.fit_transform(features)

# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_pca.shape[1])

Original number of features: 64
Reduced number of features: 54


#### Discussion:
Principal component analysis (PCA) is a popular linear dimensionality reduction technique. PCA projects observations onto the (hopefully fewer) principal components of the feature matrix that retain the most variance. PCA is an unsupervised technique, meaning that it does not use the information from the target vector and instead only considers the feature matrix.

PCA is implemented in scikit-learn by the `PCA` class. Its most useful parameters are:
- `n_components`: if the value is between 0 and 1, `PCA` returns the minimum number of features that retain that much variance. It is common to use values of 0.95 and 0.99, meaning 95% and 99% of the variance of the original features is retained, respectively.
- `whiten=True` transforms the values of each principal component so that they have zero mean and unit variance.
- `svd_solver="randomized"` uses a stochastic algorithm to find the first principal components, often in significantly less time.
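
The `whiten` and `svd_solver` options above can be checked with a short sketch. It assumes the same standardized digits data as the solution; the component count of 10 and the `random_state` value are arbitrary choices for illustration. To our knowledge, the randomized solver expects an integer `n_components` rather than a fraction of variance, which is why an integer is used here.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same standardized feature matrix as in the solution
features = StandardScaler().fit_transform(datasets.load_digits().data)

# whiten=True rescales every retained component to zero mean, unit variance
whitened = PCA(n_components=0.99, whiten=True).fit_transform(features)
component_means = whitened.mean(axis=0)
component_stds = whitened.std(axis=0, ddof=1)
print("Largest absolute mean:", np.abs(component_means).max())
print("Std range:", component_stds.min(), component_stds.max())

# Randomized solver with an integer number of components
# (10 is an arbitrary value picked for this sketch)
pca_fast = PCA(n_components=10, svd_solver="randomized", random_state=1)
features_fast = pca_fast.fit_transform(features)
print("Reduced number of features:", features_fast.shape[1])
```

The randomized solver tends to pay off on large feature matrices when only the first few components are needed; on a dataset as small as digits the difference in run time is unlikely to be noticeable.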

The output of our solution shows that PCA let us reduce our dimensionality by 10 features while still retaining 99% of the information (variance) in the feature matrix.
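
To see where the reduced feature count comes from, one option is to fit a PCA that keeps every component and accumulate `explained_variance_ratio_`. The sketch below assumes the same standardized digits data; the variable names `cumulative` and `n_needed` are ours, not part of scikit-learn.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = StandardScaler().fit_transform(datasets.load_digits().data)

# Keep all components so every variance ratio is available
pca_full = PCA().fit(features)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.99
n_needed = int(np.argmax(cumulative > 0.99)) + 1

# Compare with what PCA chooses when given the fraction directly
pca_99 = PCA(n_components=0.99).fit(features)
print("Components needed for 99% variance:", n_needed)
print("Components kept by PCA(n_components=0.99):", pca_99.n_components_)
```

Inspecting `cumulative` directly is also a convenient way to compare several thresholds, such as 0.95 and 0.99, without refitting.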

## 9.2 Reducing Features When Data Is Linearly Inseparable

- You suspect you have linearly inseparable data and want to reduce the dimensions.
- Use an extension of principal component analysis that uses kernels (**`KernelPCA`**) to allow for non-linear dimensionality reduction.

In [2]:
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

# Create linearly inseparable data
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Apply kernel PCA with a radial basis function (RBF) kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kpca.shape[1])

Original number of features: 2
Reduced number of features: 1


#### Discussion:
Standard PCA uses linear projection to reduce the features. If the data is linearly separable (i.e., you can draw a straight line or hyperplane between different classes), then PCA works well. However, if your data is not linearly separable (e.g., you can only separate classes using a curved decision boundary), the linear transformation will not work as well.

Kernels allow us to project the linearly inseparable data into a higher dimension where it is linearly separable; this is called the kernel trick. A common kernel to use is the Gaussian radial basis function kernel (`rbf`), but other options are the polynomial kernel (`poly`) and the sigmoid kernel (`sigmoid`). We can even specify a linear projection (`linear`), which will produce the same results as standard PCA.
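
The claim about the linear kernel can be checked with a minimal sketch on the same `make_circles` data. The sign of a principal component is arbitrary, so the comparison uses absolute values; `eigen_solver="dense"` is our choice here, made only to keep the comparison deterministic.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Same linearly inseparable data as in the solution
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Standard PCA down to one component
pca_1d = PCA(n_components=1).fit_transform(features)

# Kernel PCA with a linear kernel, also down to one component
kpca_linear = KernelPCA(kernel="linear", n_components=1, eigen_solver="dense")
kpca_linear_1d = kpca_linear.fit_transform(features)

# Component signs are arbitrary, so compare absolute values
same_projection = np.allclose(np.abs(pca_1d), np.abs(kpca_linear_1d), atol=1e-6)
print("Linear KernelPCA matches PCA up to sign:", same_projection)
```

Swapping `kernel="linear"` for `"rbf"`, `"poly"`, or `"sigmoid"` in the same sketch is a quick way to see how much each kernel changes the projection.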

## 9.3 Reducing Features by Maximizing Class Separability

- You want to reduce the features to be used by a classifier.
- Try **`linear discriminant analysis`** (LDA) to project the features onto component axes that maximize the separation of classes.

In [3]:
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load Iris flower dataset:
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create and run an LDA, then use it to transform the features
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)

# Print the number of features
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_lda.shape[1])

Original number of features: 4
Reduced number of features: 1


We can use `explained_variance_ratio_` to view the amount of variance explained by each component. In our solution, the single component explained over 99% of the variance:

In [4]:
lda.explained_variance_ratio_

array([0.9912126])
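
One possible way to decide how many components to keep is to fit LDA with `n_components=None`, which returns the variance ratio of every available component, and then accumulate the ratios until a target is reached. This is a sketch under our own assumptions: the 0.95 threshold and the names `goal_variance` and `n_needed` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()

# n_components=None keeps every available discriminant component
lda_all = LinearDiscriminantAnalysis(n_components=None)
lda_all.fit(iris.data, iris.target)
ratios = lda_all.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Hypothetical target share of variance, picked for this sketch
goal_variance = 0.95

# Smallest number of components whose cumulative ratio reaches the target
n_needed = int(np.argmax(cumulative >= goal_variance)) + 1
print("Variance ratios:", ratios)
print("Components needed:", n_needed)
```

Note that LDA can produce at most one fewer component than there are classes, so the three-class Iris data yields at most two components.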