# Dimensionality Reduction

In [None]:
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
%matplotlib inline

We will continue to work with the Crime data introduced in Week 8.

In [None]:
from sklearn.model_selection import train_test_split
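# Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler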
from sklearn.pipeline import Pipeline

# Load some crime data
# read_csv no longer accepts squeeze=True; squeeze the single column instead
headers = pd.read_csv('comm_names.txt').squeeze('columns')
headers = headers.apply(lambda s: s.split()[1])
crime = (pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', 
                    header=None, na_values=['?'], names=headers)
         .iloc[:, 5:]
         .dropna()
         )

# Set target and predictors
target = 'ViolentCrimesPerPop'
predictors = [c for c in crime.columns if c != target]

# Train/test split
X = crime[predictors]
y = crime[[target]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

In [None]:
X_train.shape

In Week five we built a linear model with this data. There were too many variables, which made it too easy to overfit, so we used regularization to address the problem.

This week, we will use a different strategy. Instead of regularization, we will reduce overfitting by reducing the number of variables that are used in the model. First, we will make sure we understand how to use PCA.

### Learn the principal components of the training data

First, let's look at the covariance of the training data, which tells us about the relationships between the various variables.
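
As a minimal sketch of this step, the block below computes a feature-by-feature covariance matrix. It uses a small synthetic frame `X_demo` (an assumption standing in for `X_train`, with hypothetical column names) so that it runs without downloading the crime data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train (assumption): 200 rows, 5 correlated columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5)),
                      columns=list('abcde'))

# Feature-by-feature sample covariance; off-diagonal entries are nonzero
# whenever two features vary together
cov = X_demo.cov()
print(cov.round(2))
```

On the real data, the same call would be `X_train.cov()`.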

Total Variance
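
One way to compute the total variance, sketched on synthetic data (`X_demo` is an assumption, not the crime data): it is the sum of the per-column variances, which equals the trace of the covariance matrix.

```python
import numpy as np

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

cov = np.cov(X_demo, rowvar=False)

# Total variance: the diagonal of the covariance matrix holds each
# column's variance, so the trace is their sum
total_variance = np.trace(cov)
column_variance_sum = X_demo.var(axis=0, ddof=1).sum()
print(total_variance, column_variance_sum)
```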

Now let's transform the data into principal component space.
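
A minimal sketch of the transformation, assuming a synthetic matrix `X_demo` in place of `X_train`. With no arguments, `PCA` keeps every component, so the result has the same shape as the input.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA()                          # keep all components
X_pca = pca.fit_transform(X_demo)    # center the data, then rotate it
print(X_pca.shape)
```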

What does the covariance look like now?
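
The sketch below checks this on synthetic data (`X_demo` is an assumption): after the transformation the off-diagonal covariances should vanish up to floating-point error, and the variances on the diagonal should appear in decreasing order.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

X_pca = PCA().fit_transform(X_demo)
cov_pca = np.cov(X_pca, rowvar=False)

# Zero out the diagonal to isolate the covariances between components
off_diag = cov_pca - np.diag(np.diag(cov_pca))
print(np.abs(off_diag).max())        # largest remaining covariance
print(np.diag(cov_pca).round(3))     # per-component variances
```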

What about the total variance?

The explained variance is also an attribute of the PCA object.
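
Both points can be checked numerically. The sketch below (synthetic `X_demo` is an assumption) compares the total variance before and after the transformation, then compares `explained_variance_` with the per-column variance of the transformed data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA()
X_pca = pca.fit_transform(X_demo)

# Keeping every component is a rotation, so total variance is unchanged
total_before = X_demo.var(axis=0, ddof=1).sum()
total_after = X_pca.var(axis=0, ddof=1).sum()
print(total_before, total_after)

# explained_variance_ holds the variance of each transformed column
component_variance = X_pca.var(axis=0, ddof=1)
print(pca.explained_variance_.round(3))
```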

### Exercise

A typical rule of thumb is to keep the components that account for 90% of the total variance. Use PCA to create a new data frame `X_reduced`. The total variance in the columns of `X_reduced` should account for about 90% of the variance in `X_train`.

Fit PCA to the training data

Scree plot of explained variance

Decide how many components to use
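
A sketch of one way to make this decision, assuming synthetic data driven by a few latent factors (`X_demo` and the 0.90 threshold variable are illustrative): accumulate the explained variance ratio and take the first component count that reaches 90%.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption): 20 features built from
# 4 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.3 * rng.normal(size=(300, 20))

pca = PCA().fit(X_demo)

# Running total of the fraction of variance explained
cum_ratio = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.90
threshold = 0.90
num_comp = int(np.searchsorted(cum_ratio, threshold) + 1)
print(num_comp, cum_ratio[num_comp - 1])
```

In the notebook, passing `pca.explained_variance_ratio_` to `plt.plot` draws the scree plot from the same array.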

#### Method One

Create a new PCA object that is explicitly a dimension reducer.
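
A minimal sketch of this method on synthetic data (`X_demo` is an assumption, and `num_comp` is recomputed here so the block runs on its own): pass the chosen count as `n_components` so the fitted object only ever returns that many columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.3 * rng.normal(size=(300, 20))

# Number of components needed to reach 90% of the variance
cum_ratio = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
num_comp = int(np.searchsorted(cum_ratio, 0.90) + 1)

# A PCA that reduces to num_comp dimensions directly
reducer = PCA(n_components=num_comp)
X_reduced = reducer.fit_transform(X_demo)
print(X_reduced.shape, reducer.explained_variance_ratio_.sum())
```

The scikit-learn documentation also describes passing a fraction such as `n_components=0.9` with `svd_solver='full'`, which lets PCA choose the count itself.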

#### Method Two

Transform the data using the original PCA, but keep only the first num_comp columns
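
Sketched on synthetic data (`X_demo` is an assumption, and `num_comp` is recomputed so the block is self-contained): since the transformed columns are ordered by decreasing variance, slicing off the first `num_comp` of them keeps the highest-variance directions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.3 * rng.normal(size=(300, 20))

pca = PCA().fit(X_demo)
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
num_comp = int(np.searchsorted(cum_ratio, 0.90) + 1)

# Transform with the full PCA, then keep only the leading columns
X_reduced = pca.transform(X_demo)[:, :num_comp]
kept_variance = X_reduced.var(axis=0, ddof=1).sum()
print(X_reduced.shape, kept_variance)
```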

#### Method Three

The key to understanding PCA is the following equation:

$$
\underset{m \times n}M \approx \underset{m \times k}U \times \underset{k \times n}V^T
$$

- $M$ is the original data matrix (as long as each column has zero mean)
- $k$ is the dimension to reduce to
- $U$ maps from rows of $M$ to components
- $V^T$ maps from components to columns of $M$ (features)
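
The factorization can be sketched with NumPy's SVD. This is one possible reading of the equation, under the assumption that the singular values are absorbed into $U$ so that $U$ holds the component scores; the data (`X_demo`) and the choice `k = 4` are illustrative.

```python
import numpy as np

# Synthetic stand-in for X_train (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.3 * rng.normal(size=(300, 20))

k = 4
M = X_demo - X_demo.mean(axis=0)     # each column now has zero mean

# Thin SVD: M = W @ diag(s) @ Vt
W, s, Vt = np.linalg.svd(M, full_matrices=False)

U = W[:, :k] * s[:k]    # m x k: scores of each row on each component
V_T = Vt[:k, :]         # k x n: weight of each feature in each component

# Rank-k approximation of M and its relative error
M_approx = U @ V_T
rel_err = np.linalg.norm(M - M_approx) / np.linalg.norm(M)
print(U.shape, V_T.shape, rel_err)

# Projecting M onto the components gives the same scores
projected = M @ V_T.T
```

Here `projected` matches `U`, so the reduced data can be obtained by multiplying the centered data by the first `k` columns of $V$.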

### Exercise

The component matrix $V$ tells us the mapping between principal components and the original features. Use this matrix to try to interpret the first two principal components.

- The first component ranges from wealthy immigrant neighborhoods to poor native neighborhoods.
- The second component ranges from poor immigrant neighborhoods to wealthy native neighborhoods.
- These do seem to be orthogonal.
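
One way to read the component matrix, sketched on synthetic data with hypothetical feature names (`f0` to `f5` are placeholders, not columns of the crime data): put the weights in a labeled frame and sort the features by the size of their weight on a component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train with hypothetical column names (assumption)
rng = np.random.default_rng(0)
names = ['f{}'.format(i) for i in range(6)]
X_demo = pd.DataFrame(rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6)),
                      columns=names)

pca = PCA(n_components=2).fit(X_demo)

# Rows are features, columns are components, entries are weights
loadings = pd.DataFrame(pca.components_.T, index=X_demo.columns,
                        columns=['PC1', 'PC2'])

# Features with the largest absolute weight on the first component
order = loadings['PC1'].abs().sort_values(ascending=False).index
top_pc1 = loadings.loc[order[:3], 'PC1']
print(top_pc1)

# The dot product of two components is zero when they are orthogonal
dot = float(np.dot(loadings['PC1'], loadings['PC2']))
print(dot)
```

Features with large weights of opposite sign sit at opposite ends of a component, which is how descriptions such as "wealthy" versus "poor" are read off.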

## Application: Linear Regression

We tried regularized regression on this dataset before. Let's remind ourselves of the results.

#### Vanilla

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

# print is a function in Python 3
print('Train R-squared: {:.3}'.format(lr.score(X_train, y_train)))
print('Test R-squared: {:.3}'.format(lr.score(X_test, y_test)))

#### Lasso

In [None]:
from sklearn.linear_model import LassoCV
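# Search over a grid of regularization strengths from 2^-10 to 2^9
lasso = LassoCV(alphas=2. ** np.arange(-10, 10))
# LassoCV expects a 1-D target, so flatten the single-column frame
lasso.fit(X_train, y_train.values.ravel())

print('Train R-squared: {:.3}'.format(lasso.score(X_train, y_train)))
print('Test R-squared: {:.3}'.format(lasso.score(X_test, y_test)))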

### Using Dimensionality Reduction

The un-regularized model does so poorly because there are far too many features relative to the number of observations. Another way to regularize is simply to use fewer features.

### Exercise

Below are three approaches to reducing the number of features used in the model. For each approach:

1. Plot the R-squared (on `X_test`) vs. the number of features used.
2. Report the best R-squared.
3. Report the number of features used to get the best R-squared.

#### 1. Randomly choose subsets of features

This is the most naive approach to dimensionality reduction; just pick $k$ features at random.
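
A sketch of the random-subset approach, assuming a synthetic regression problem (`X_demo`, `y_demo`, and the coefficients are made up for illustration): for each $k$, draw $k$ column indices without replacement, fit a linear model, and record the test R-squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime data (assumption): only the first
# five of twenty features actually drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = (X_demo[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0])
          + rng.normal(size=300))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

scores = {}
for k in range(1, X_demo.shape[1] + 1):
    # Pick k distinct feature columns at random
    cols = rng.choice(X_demo.shape[1], size=k, replace=False)
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    scores[k] = model.score(X_te[:, cols], y_te)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Plotting `list(scores.values())` against `list(scores.keys())` gives the R-squared curve the exercise asks for.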

#### 2. Intelligently choose subsets of features

We can do a little better than randomly choosing sets of features. Use the sklearn function `f_regression` to rank the features in order of their correlation with the target, then pick the top $k$ features.
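
A sketch of the ranked approach on a synthetic regression problem (`X_demo`, `y_demo`, and the coefficients are assumptions). `f_regression` returns an F-statistic per feature, which grows with the strength of that feature's correlation with the target, so sorting by it gives the ranking.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = (X_demo[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0])
          + rng.normal(size=300))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# Rank features on the training data only, strongest first
F, _ = f_regression(X_tr, y_tr)
ranking = np.argsort(F)[::-1]

scores = {}
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]               # top-k features by F-statistic
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    scores[k] = model.score(X_te[:, cols], y_te)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

`f_regression` expects a 1-D target; with the notebook's single-column `y_train` frame, `y_train.values.ravel()` would supply one.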

#### 3. Intelligently choose a projection into principal component space

Lastly, we can use PCA to compress as much information into as few features as possible. The number of features to use is the number of principal components.
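
A sketch of the PCA approach, assuming synthetic data in which a few latent factors drive both the features and the target (`X_demo` and `y_demo` are made up). Wrapping the steps in a `Pipeline` means PCA is learned from the training split only; the `StandardScaler` step is an assumption and may be unnecessary for data that is already normalized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crime data (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.3 * rng.normal(size=(300, 20))
y_demo = latent @ np.array([2.0, -1.0, 1.0, 0.5]) + 0.5 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

scores = {}
for k in range(1, X_demo.shape[1] + 1):
    # Scale, project onto the first k components, then regress
    model = Pipeline([('scale', StandardScaler()),
                      ('pca', PCA(n_components=k)),
                      ('lr', LinearRegression())])
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```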

## Bonus: Kernel PCA

### Exercise

Instead of vanilla PCA, try using a nonlinear kernel. See if you can find kernel parameters that improve the regression model.

In [None]:
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
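
A possible starting point for the search, sketched on synthetic nonlinear data (`X_demo`, `y_demo`, and the parameter grid are assumptions chosen for illustration, not tuned values): chain `KernelPCA` with a linear regression and let `GridSearchCV` cross-validate over the kernel parameters.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with a nonlinear target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 6))
y_demo = (np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2
          + 0.1 * rng.normal(size=150))

# Kernel PCA followed by ordinary least squares on the components
pipe = Pipeline([('kpca', KernelPCA(kernel='rbf')),
                 ('lr', LinearRegression())])

# Illustrative grid over the number of components and the RBF width
param_grid = {'kpca__n_components': [5, 20, 40],
              'kpca__gamma': [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The default score for a regression pipeline is R-squared, so `best_score_` is directly comparable to the linear PCA results. Adding `'kpca__kernel'` to the grid would let the search compare kernels as well.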