# Dimensionality reduction with PCA

<table>
    <tr>
    <td><img src="http://www.acheronanalytics.com/uploads/9/8/6/3/98636884/editor/51764130_1.jpg?1491762379" style="float: center; width: 200px"></td>
    <td>
        <b>Principal components analysis (PCA) is one of the most popular methods available<br>for reducing the number of variables in a data set.</b><br><br>
        <li>We typically describe PCA as an unsupervised learning tool.</li><br>
        <li>But, dimensionality reduction techniques are useful for supervised learning, too.</li><br>
        <br><i>In this notebook,</i><br>we describe its use as a dimension reduction step for linear regression.
    </td>
    </tr>
</table>

We know that we can use linear regression to model the relationship between our dependent variable and one (or more) independent variables (i.e. 'features').
- Let's try using the principal components (the directions along which the data vary the most) as the features of our linear regression and see how it affects our prediction error.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

%matplotlib inline

In [None]:
# load the diabetes dataset
data = datasets.load_diabetes()

In [None]:
scaler = StandardScaler()

# define X
feature_matrix = pd.DataFrame(data.data, columns=data.feature_names)
scaled_feature_matrix = pd.DataFrame(scaler.fit_transform(data.data), columns=data.feature_names)

# define y
labels = data.target

Visualize correlations among the original (standardized) features using `PairGrid`.

In [None]:
g = sns.PairGrid(scaled_feature_matrix)
g = g.map_lower(sns.regplot)
# `shade` / `shade_lowest` are deprecated in recent seaborn releases; `fill=True` replaces them
g = g.map_upper(sns.kdeplot, cmap="Blues", fill=True)
g = g.map_diag(plt.hist)

plt.show()

In [None]:
# check linear regression scores before modifying data
linreg = LinearRegression()
orig_lr_scores = cross_val_score(linreg, scaled_feature_matrix, labels, scoring='neg_mean_squared_error', cv=25)

# scores are negative MSE values: flip the sign, average over folds, take the root to report RMSE
print(np.sqrt(-(orig_lr_scores).mean()))

In [None]:
# extract principal components

# if not specified: n_components = min(n_samples, n_features),
# which would be 10 here, since n_features = 10;
# we keep only the first 4 components instead
pca = PCA(n_components=4)
pca.fit(scaled_feature_matrix)
pca
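One way to judge how many components to keep is to look at the cumulative share of variance they explain. The sketch below is only illustrative: it rebuilds the scaled data so it can run on its own, and the names `X_scaled`, `pca_full`, and `cumulative_variance` are introduced here rather than taken from the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled feature matrix so this cell runs on its own
X_scaled = StandardScaler().fit_transform(datasets.load_diabetes().data)

# Fit with all 10 components to see how the variance is distributed
pca_full = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

for k, frac in enumerate(cumulative_variance, start=1):
    print(f'{k:2d} components: {frac:.3f} of total variance')
```

A cutoff such as `n_components=4` is a judgment call; the printed table simply shows how much variance each choice retains.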

Now, let's look at the principal component weighting vectors (i.e. eigenvectors).
- The principal components, which are eigenvectors of the covariance matrix of the features, can be thought of as weightings on the original variables that transform them into the new feature space.
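To make the eigenvector interpretation concrete, the following sketch compares `pca.components_`-style output against a direct eigen-decomposition of the sample covariance matrix. It is a self-contained check under the assumption that the data are standardized as above; `X_scaled`, `pca_check`, and the other variable names are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_diabetes().data)
pca_check = PCA(n_components=4).fit(X_scaled)

# Eigen-decomposition of the sample covariance matrix (features in columns)
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder to descending and keep the top 4 eigenvectors as rows
order = np.argsort(eigvals)[::-1][:4]
top_eigvals = eigvals[order]
top_eigvecs = eigvecs[:, order].T

# Eigenvectors are only defined up to sign, so compare absolute values
same_directions = np.allclose(np.abs(top_eigvecs), np.abs(pca_check.components_), atol=1e-6)
same_variances = np.allclose(top_eigvals, pca_check.explained_variance_)
print(same_directions, same_variances)
```

The sign ambiguity is why the comparison uses absolute values: flipping a component's sign changes neither the direction it spans nor the variance along it.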

In [None]:
pc_names = [f'PC{i+1}' for i in range(len(pca.components_))]

In [None]:
print(feature_matrix.columns)
for i, pc in enumerate(pc_names):
    print(pc, 'weighting vector:', pca.components_[i], '\n')

Transform the original data into the principal component space.

In [None]:
feat_mat_pcs = pd.DataFrame(pca.transform(scaled_feature_matrix), columns=pc_names)

Visualize correlations in the PCs using [`PairGrid`](https://seaborn.pydata.org/generated/seaborn.PairGrid.html).
- Confirm that correlations between variables have been eliminated.
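The same check can be done numerically rather than by eye. This is a minimal sketch that recomputes the component scores on its own (the names `X_scaled`, `pc_scores`, and `max_off_diagonal` are introduced here) and looks at the largest off-diagonal entry of their correlation matrix.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_diabetes().data)
pc_scores = PCA(n_components=4).fit_transform(X_scaled)

# Correlation matrix of the component scores (one column per PC)
corr = np.corrcoef(pc_scores, rowvar=False)

# Remove the diagonal of ones and find the largest remaining correlation
max_off_diagonal = np.abs(corr - np.eye(corr.shape[0])).max()
print(max_off_diagonal)
```

The value should be zero up to floating-point error, since the component scores are uncorrelated by construction on the data the PCA was fit to.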

In [None]:
g = sns.PairGrid(feat_mat_pcs)
g = g.map_lower(sns.regplot)
# `shade` / `shade_lowest` are deprecated in recent seaborn releases; `fill=True` replaces them
g = g.map_upper(sns.kdeplot, cmap="Blues", fill=True)
g = g.map_diag(plt.hist)

plt.show()

In [None]:
# now, check linear regression scores for the reduced data
linreg = LinearRegression()
pc_lr_scores = cross_val_score(linreg, feat_mat_pcs, labels, scoring='neg_mean_squared_error', cv=25)

print(pc_lr_scores)
print(np.sqrt(-(pc_lr_scores).mean()))

In the end, we arrived at a model with very similar performance to the larger model, but with the number of features reduced from 10 to 4.

#### Before we wrap up --
We should look at how we can use [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to 'merge' the two steps (PCA then LR) into a single object (see [example](https://scikit-learn.org/0.18/auto_examples/plot_digits_pipe.html) from docs). A pipeline passed to `cross_val_score` is refit on each training fold, so the PCA step never sees the held-out data.

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # keep the same 4 components used above; PCA() alone would keep all 10
    ('reduce_dim', PCA(n_components=4)),
    ('predict', LinearRegression())
])

In [None]:
pipe_scores = cross_val_score(pipe, scaled_feature_matrix, labels, scoring='neg_mean_squared_error', cv=25)
print(pipe_scores)
print(np.sqrt(-(pipe_scores).mean()))

> In practice, we could then go on to use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find the optimal number of components to use (if we were dealing with a larger data set). But we'll end there.
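For readers who want to try it, a search over the number of components might look roughly like the sketch below. It is an illustration, not part of the analysis above: the step name `'scale'`, the 5-fold setting, and the variables `search_pipe`, `param_grid`, `best_k`, and `best_rmse` are all assumptions made for this example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_diabetes()

# Scaling lives inside the pipeline so it is refit on each training fold
search_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA()),
    ('predict', LinearRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'reduce_dim__n_components': list(range(1, 11))}

search = GridSearchCV(search_pipe, param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(data.data, data.target)

best_k = search.best_params_['reduce_dim__n_components']
best_rmse = (-search.best_score_) ** 0.5  # best_score_ is a negative MSE
print(best_k, best_rmse)
```

The full table of results per candidate is available in `search.cv_results_` if you want to see how flat the error curve is across component counts.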