<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/xx_misc/dimensionality_reduction/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dimensionality Reduction

Principal Component Analysis (PCA) is one of the most common ways to perform dimensionality reduction. PCA takes a set of features (dimensions) and builds new variables, called principal components, each of which is a linear combination of the original features chosen to explain as much of the remaining variance as possible. In a regression or classification problem, that means reducing the features to a small number of aggregate components and discarding the components that add little to the model's predictive power. This is known as feature extraction, and it can help simplify your model.


### Load Packages

In [0]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## Why PCA?

The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) states that analyzing data with high dimensionality can lead to overly complex models that are inefficient, suffer from overfitting, and tend to have less predictive power. In machine learning, this often means that your feature space is too large. Maybe there are more features than rows of data, or perhaps your data is too sparse to draw any statistically significant inferences. PCA simplifies the feature set into a set of "principal components," which are linear combinations of the original features and are uncorrelated with one another. PCA may be undesirable in a case where you want your model to be interpretable using your original features, not the principal components.

As a rule of thumb, if the number of components you need is close to or equal to your original feature count, you probably shouldn't use PCA, because you gain little reduction and lose interpretability. It is all about finding the smallest set of components that still explains most of the variance in your data. Note that PCA does not pick a subset of your original features; it builds new features from them.

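PCA and other techniques for dimensionality reduction also help to visualize and analyze higher dimensional data in either 2D or 3D. You will sometimes see PCA mentioned together with Singular Value Decomposition (SVD). The two are closely related, and many implementations (including scikit-learn's) compute PCA through an SVD of the centered data, but we will call it PCA for now.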
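To make the link between PCA and SVD a little more concrete, here is a minimal sketch on made-up random data (the array `X_demo` is purely illustrative and is not the dataset used later in this Colab). It checks that the squared singular values of the centered data, divided by `n - 1`, match the eigenvalues of the covariance matrix, which are the variances that PCA reports for each component.

```python
import numpy as np

# Hypothetical data: 200 samples with 4 correlated features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# PCA works on centered data, so subtract each column's mean.
X_centered = X_demo - X_demo.mean(axis=0)
n_samples = X_centered.shape[0]

# Route 1: eigenvalues of the covariance matrix, sorted from high to low.
eig_vals = np.linalg.eigvalsh(np.cov(X_centered.T))[::-1]

# Route 2: singular values of the centered data (already sorted high to low).
singular_vals = np.linalg.svd(X_centered, compute_uv=False)
var_from_svd = singular_vals ** 2 / (n_samples - 1)

print(np.allclose(eig_vals, var_from_svd))
```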

If you'd like to take a deep dive into the math (and there is quite a bit of math!), read [these helpful lecture notes](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf) from a statistics course at Carnegie Mellon University.

## Data Preparation

PCA works best when features are roughly normally distributed and linearly related, and it is most useful when they show some [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity), because correlated features carry redundant information that a few components can summarize. Because PCA looks for the directions of greatest variance in N-dimensional space, features measured on large scales can dominate the components, so we typically need to standardize our data first. Standardizing puts every feature on a comparable scale (a mean of 0 and a standard deviation of 1), while other methods, such as min-max scaling, squeeze the range of values down to $[0, 1]$. Each method of scaling has its own data requirements, and there are several flavors of scaling and standardization. Therefore, you should first conduct a thorough data analysis of your features in order to make this assessment.

You may see the terms "scaling," "standardizing," "centering," and "normalizing" used interchangeably. This can be confusing, so let's break down these terms.

1. Scaling: Changes the range of the data but does not affect the distribution.

2. Standardizing: Rescales each feature to have a mean of zero and a standard deviation of one by calculating the standard score (z-score). The shape of the distribution does not change.

3. Centering: Shifts the distribution of the data so that the mean is zero.

4. Normalizing: [Normalizes](https://en.wikipedia.org/wiki/Normalization_(statistics)) the rows of your dataset, typically so that each row has unit length.

When using PCA to build a predictive model, we typically want to standardize the data with standard scalers. But some cases (e.g., cluster analysis or NLP) may require normalization of rows, not columns. There also may be other cases outside of PCA where you will need to scale or standardize.
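The four terms above can be sketched with scikit-learn's preprocessing classes. This is only an illustration on a tiny made-up array (`X_toy` is hypothetical), and `Normalizer` is an extra import that the rest of this Colab does not use.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: 4 samples (rows) and 2 features (columns).
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0],
                  [4.0, 500.0]])

# Scaling: squeeze each column into the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_toy)

# Centering: subtract each column's mean and leave the spread unchanged.
X_centered = StandardScaler(with_std=False).fit_transform(X_toy)

# Standardizing: center each column, then divide by its standard deviation.
X_standardized = StandardScaler().fit_transform(X_toy)

# Normalizing: rescale each row to unit Euclidean (L2) length.
X_normalized = Normalizer(norm='l2').fit_transform(X_toy)

print('Scaled min/max per column:', X_scaled.min(axis=0), X_scaled.max(axis=0))
print('Centered mean per column:', X_centered.mean(axis=0))
print('Standardized std per column:', X_standardized.std(axis=0))
print('Normalized row lengths:', np.linalg.norm(X_normalized, axis=1))
```

Notice that the first three transformations work column by column (per feature), while normalizing works row by row (per sample).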

[Here is a helpful guide](https://docs.google.com/spreadsheets/d/1woVi7wq13628HJ-tN6ApaRGVZ85OdmHsDBKLAf5ylaQ/edit?usp=sharing) for choosing which method is right for your data.

You can always check out the documentation for the implementation library, and scikit-learn's website has [another helpful guide](https://scikit-learn.org/stable/modules/preprocessing.html) to its preprocessing methods.





## Download Wine Data 

In [0]:
df_wine = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
    header=None)
display(df_wine.head())
display(df_wine.shape)

## Split Into Training and Test Data, and Then Standardize

In [0]:
# Split into training and testing sets.
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize the features.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

## Covariance

The [covariance](https://en.wikipedia.org/wiki/Covariance) of features $M$ and $N$ is defined as follows:

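$$ \sigma_{MN} = \frac{1}{n-1}\sum_{i=1}^{n} {(m_i-\mu_M)(n_i-\mu_N)}$$

- $m_i$ and $n_i$ are the values of features $M$ and $N$ for sample $i$, and $n$ is the number of samples.
- $\mu_M$ is the sample mean of feature $M$.
- $\mu_N$ is the sample mean of feature $N$.

This is the sample covariance, which divides by $n-1$. It is what `np.cov` computes by default.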

Covariance is an extension of variance: it measures how two features vary together, just as variance measures how a single feature varies on its own. A positive covariance means the features tend to increase together, and a negative covariance means one tends to decrease as the other increases. Don't worry too much about the math here. The implementation of PCA hides the details.
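If you want to see the covariance calculation spelled out, the sketch below computes the sample covariance of two made-up features by hand and compares it with `np.cov`. The helper `sample_covariance` and the arrays `feature_m` and `feature_n` are hypothetical names used only for this illustration. Note that `np.cov` divides by `n - 1` by default, so the helper does the same.

```python
import numpy as np

# Hypothetical features: feature_n is roughly twice feature_m plus noise.
rng = np.random.default_rng(0)
feature_m = rng.normal(size=50)
feature_n = 2.0 * feature_m + rng.normal(scale=0.5, size=50)


def sample_covariance(a, b):
    """Return the sample covariance of two equal-length 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Multiply the deviations from each mean, sum, and divide by n - 1.
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)


cov_manual = sample_covariance(feature_m, feature_n)

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(M, N).
cov_numpy = np.cov(feature_m, feature_n)[0, 1]

print(np.isclose(cov_manual, cov_numpy))
```

The covariance of a feature with itself is just its variance, which is why the diagonal of a covariance matrix holds the variances.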

## Eigenvectors and Values

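Eigenvectors of the covariance matrix are the directional vectors that we search for in the N-dimensional space; they become the principal components. Each eigenvalue is the amount of variance in the data along its eigenvector, so the Nth largest eigenvalue tells us how much variance is explained by the Nth principal component. With standardized data, every original feature has a variance of about 1, so a component with an eigenvalue of 1 carries no more information than a single original feature. That is why a common rule of thumb is to keep principal components with eigenvalues greater than 1.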

In [0]:
cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

Again, don't worry too much about how eigenvalues and eigenvectors are calculated; most of it is under the hood in scikit-learn.
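If you are curious, you can check the defining property of eigenpairs yourself. The sketch below uses made-up data (`X_demo` is hypothetical) and swaps in `np.linalg.eigh`, a variant of `np.linalg.eig` intended for symmetric matrices such as a covariance matrix. It verifies that multiplying the covariance matrix by an eigenvector only stretches that vector by its eigenvalue, and that the eigenvalues add up to the total variance.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features, two of them correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
X_demo[:, 1] += 0.8 * X_demo[:, 0]

# Standardize each column, as we did for the wine data.
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

cov_demo = np.cov(X_demo.T)

# eigh returns eigenvalues in ascending order; eigenvectors are the columns.
vals, vecs = np.linalg.eigh(cov_demo)

# Defining property: cov @ v equals eigenvalue * v for every eigenpair.
property_holds = all(
    np.allclose(cov_demo @ v, lam * v) for lam, v in zip(vals, vecs.T))

# The eigenvalues sum to the trace of the covariance matrix (total variance).
total_variance_matches = np.isclose(vals.sum(), np.trace(cov_demo))

print(property_holds, total_variance_matches)
```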

## How Does PCA Work?

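We use the covariance defined above to search for a first component: the direction along which the data varies the most (equivalently, the vector that minimizes the distance between itself and the data points). Each later component is the direction of greatest remaining variance that is orthogonal to all of the earlier ones. This process repeats until `n_components` principal components have been found. In scikit-learn, you can choose the number of components to solve for, or you can pass a fraction between 0 and 1 as `n_components` and let scikit-learn keep just enough components to explain that share of the variance.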

In [0]:
# Calculate cumulative sum of explained variances.
tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

fig, ax = plt.subplots(1, 1, figsize=(8, 4))
# Plot explained variances.
plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.show()

Using PCA, we can see the explained variance of each component. The most variance is explained by the first principal component and drops off around 4 PCs. We can also see that the cumulative explained variance hits approximately 90% with 8 PCs.
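Reading the threshold off a plot works, but you can also compute it. The sketch below is one possible way to find the smallest number of components that reaches a 90% cumulative explained variance. To stay self-contained, it assumes scikit-learn's bundled copy of the wine data (`load_wine`) and standardizes the whole dataset rather than only the training split, so the numbers may differ slightly from the plot above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: use the bundled wine data instead of downloading it.
X_wine, _ = load_wine(return_X_y=True)
X_wine_std = StandardScaler().fit_transform(X_wine)

threshold = 0.90

# Option 1: fit all components, then find where the cumulative sum
# of explained variance ratios first reaches the threshold.
pca_full = PCA(n_components=None).fit(X_wine_std)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cum_ratio, threshold) + 1)

# Option 2: pass the fraction as n_components and let scikit-learn decide.
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_wine_std)
k_auto = int(pca_auto.n_components_)

print('Components needed (manual):', k_manual)
print('Components needed (scikit-learn):', k_auto)
```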

## PCA for Feature Extraction

PCA is just one form of dimensionality reduction, and you will come across other related forms, as well as other types of dataset transformations. Transforming your dataset is a key technique in model-building, so don't get too attached to your original dataset.

By sorting the eigenpairs (vectors and their values) from the largest eigenvalue to the smallest and keeping only the top few eigenvectors, we can project the data into a lower dimensional space.

In [0]:
# Make a list of (eigenvalue, eigenvector) tuples.
eigen_pairs = [(np.abs(eigen_vals[i]),
                eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low.
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

The result is a 13x2 projection matrix that is created from the top two eigenvectors. We can now use this projection matrix, $W$, to map any 13-dimensional sample, $x$, to its 2-dimensional sample vector: $x' = xW$.

In [0]:
# Project training data onto PC1 and PC2.
X_train_pca = X_train_std.dot(w)

# Visualize projection.
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train==l, 0], 
                X_train_pca[y_train==l, 1], 
                c=c, label=l, marker=m) 
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

That is how you can implement PCA from scratch using a covariance matrix.

## Using PCA With scikit-learn

We can now use scikit-learn to implement PCA and to understand all of the explained variance per component. If we choose `n_components` to be `None`, then we will get a number of components equal to the number of features in our dataset.

In [0]:
pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
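As a sanity check, the from-scratch projection and scikit-learn's `PCA` should agree. The sketch below assumes scikit-learn's bundled copy of the wine data (`load_wine`) and uses `np.linalg.eigh` in place of `np.linalg.eig`, so it is a simplified stand-in for the cells above rather than an exact copy. The sign of an eigenvector is arbitrary, so the two results are compared up to a sign flip per component.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: use the bundled wine data and standardize all of it.
X_wine, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_wine)

# From scratch: eigendecomposition of the covariance matrix.
vals, vecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(vals)[::-1]   # indices from largest to smallest eigenvalue
w_manual = vecs[:, order[:2]]    # 13x2 projection matrix
X_manual = X_std @ w_manual

# scikit-learn: same projection in one call.
X_sklearn = PCA(n_components=2).fit_transform(X_std)

# Each component may be flipped in sign, so compare absolute values.
same_up_to_sign = all(
    np.allclose(np.abs(X_manual[:, j]), np.abs(X_sklearn[:, j]), atol=1e-6)
    for j in range(2))

print(same_up_to_sign)
```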

Now we can use our PCA in a logistic regression.

In [0]:
# Initialize pca and logistic regression model.
pca = PCA(n_components=2)
# multi_class is left at its default ('auto'); recent scikit-learn
# releases deprecate passing it explicitly.
lr = LogisticRegression(solver='liblinear', random_state=0)

# Fit and transform data.
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
lr.fit(X_train_pca, y_train)

We can visualize the fitted classifier by plotting its decision regions over the training data.

In [0]:
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # Setup marker generator and color map.
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Plot the decision surface.
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot class samples.
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.6, 
                    c=[cmap(idx)],
                    edgecolor='black',
                    marker=markers[idx], 
                    label=cl)

# Plot decision regions for training set.
plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower right')
plt.show()

Now let's plot the same decision regions over the test set and see if the classes are still separable by eye.

In [0]:
# Plot decision regions for test set.
plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower right')
plt.show()
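Plots are helpful, but it is also worth putting a number on how well the classifier does. The sketch below chains standardization, PCA, and logistic regression into a single scikit-learn pipeline and reports accuracy on both splits. It assumes scikit-learn's bundled copy of the wine data (`load_wine`) and the default logistic regression solver, so it is an illustration of the idea rather than a reproduction of the exact model above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use the bundled wine data with the same split settings.
X_wine, y_wine = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.3, stratify=y_wine, random_state=0)

# The pipeline fits the scaler and PCA on the training data only,
# then reuses those fitted steps when scoring the test data.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression())
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print('Training accuracy:', round(train_acc, 3))
print('Test accuracy:', round(test_acc, 3))
```

Wrapping the steps in a pipeline also guards against accidentally fitting the scaler or PCA on the test data.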

## Resources

- Examples adapted from [TDS](https://towardsdatascience.com/principal-component-analysis-for-dimensionality-reduction-115a3d157bad)
- [Tutorial](https://www.ics.forth.gr/mobile/pca.pdf)
- [Math](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf)
- [Feature Selection](http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf)




# Exercises

Watch this video from [Siraj Raval](https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A?feature=embeds_subscribe_title) in class.


In [0]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/jPmV3j1dAv4", width="560",
       height="315", frameborder="0")