# Deciding How Many Components To Use

> This is done by eyeballing the scree plot, and looking for a point at which the proportion of variance explained by each subsequent principal component drops off ... this type of visual analysis is inherently ad hoc. Unfortunately, there is no well-accepted objective way to decide how many principal components are enough.

- *An Introduction to Statistical Learning*, pg. 384

In [None]:
cd ..

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

from lib.preprocessing import BoxCoxTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso

In [None]:
X = pd.read_pickle('data/ames_features.p')
y = pd.read_pickle('data/ames_target.p')

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA()
pca.fit(X)

In [None]:
explained_var_ratio = pca.explained_variance_ratio_
cumulative_exp_var = np.cumsum(explained_var_ratio)

In [None]:
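plt.plot(explained_var_ratio, label='explained variance ratio')
plt.plot(cumulative_exp_var, label='cumulative explained variance')
plt.xlabel('principal component')
plt.legend()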

### LESSON: PCA is highly sensitive to the scale of the features

In [None]:
from sklearn.preprocessing import StandardScaler
# Standardize each feature to zero mean and unit variance, then refit
X_sc = StandardScaler().fit_transform(X)
pca.fit(X_sc)

In [None]:
explained_var_ratio = pca.explained_variance_ratio_
cumulative_exp_var = np.cumsum(explained_var_ratio)
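plt.plot(explained_var_ratio, label='explained variance ratio')
plt.plot(cumulative_exp_var, label='cumulative explained variance')
plt.xlabel('principal component')
plt.legend()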
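
To see why scaling changes the picture so much, here is a minimal sketch on synthetic data (not the Ames features). The two columns and their scales are assumptions made up for illustration: one feature with a large variance and one with a small variance, generated independently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
# (hypothetical stand-ins for something like square footage and a room count).
big = rng.normal(loc=1500.0, scale=500.0, size=1000)
small = rng.normal(loc=3.0, scale=1.0, size=1000)
X_demo = np.column_stack([big, small])

# Unscaled: the large-variance feature dominates the first component.
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# Scaled: both features contribute roughly equally.
X_demo_sc = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA().fit(X_demo_sc).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the raw data the first component captures almost all of the variance simply because one column has much larger units. After standardizing, the split is close to even, which is what we would expect for two unrelated features.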

In [None]:
plt.plot(explained_var_ratio)

> [We are] looking for a point at which the proportion of variance explained by each subsequent principal component drops off. This is often referred to as an *elbow* in the scree plot.

In [None]:
plt.plot(explained_var_ratio[:10])
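
The elbow is normally judged by eye, but a crude numerical stand-in can help when comparing notes. The sketch below is an assumption of ours, not a method from the quoted text: it reports how many components come before the largest single drop in explained variance ratio. The function name `largest_drop_elbow` and the example ratios are hypothetical.

```python
import numpy as np

def largest_drop_elbow(ratios):
    """Return the number of leading components before the largest drop."""
    ratios = np.asarray(ratios, dtype=float)
    drops = ratios[:-1] - ratios[1:]   # drop from component i to component i + 1
    return int(np.argmax(drops)) + 1   # count of components to keep

# Made-up explained variance ratios, sorted in decreasing order.
example_ratios = np.array([0.35, 0.30, 0.22, 0.05, 0.04, 0.02, 0.02])
k = largest_drop_elbow(example_ratios)
print(k)
```

On real data the largest drop often falls right after the first component, so a rule like this should be read alongside the scree plot rather than in place of it.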

> However, this type of visual analysis is inherently ad hoc. Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. In fact, **the question of how many principal components are enough is inherently ill-defined**, and will depend on the specific area of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data. If no interesting patterns are found in the first few principal components, then further principal components are unlikely to be of interest. Conversely, if the first few principal components are interesting, then we typically continue to look at subsequent principal components until no further interesting patterns are found. This is admittedly a subjective approach, and is reflective of the fact that PCA is generally used as a tool for exploratory data analysis.
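
Since no rule is objectively correct, one commonly used convention is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a threshold of 0.90, which is an arbitrary choice for illustration. The helper name `components_for_threshold` and the example ratios are hypothetical.

```python
import numpy as np

def components_for_threshold(ratios, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    # side="left" finds the first index where cumulative >= threshold
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    # Clamp in case rounding leaves the final cumulative sum just below threshold
    return min(idx + 1, len(cumulative))

# Made-up explained variance ratios, sorted in decreasing order.
example_ratios = np.array([0.35, 0.30, 0.22, 0.05, 0.04, 0.02, 0.02])
n_keep = components_for_threshold(example_ratios, threshold=0.90)
print(n_keep)
```

In the notebook the same helper could be applied to `explained_var_ratio`. scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`, which selects the number of components by explained variance in a similar way. Either way the threshold itself remains a judgment call, consistent with the quote above.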