This [great forum post][1] by Achal shows that the numerical features are quite strongly correlated with each other. I therefore wanted to explore how few principal components we can reduce the numerical feature space to without losing too much of the explained variance.


  [1]: https://www.kaggle.com/achalshah/allstate-claims-severity/allstate-feature-analysis-python

In [None]:
%matplotlib inline
import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('../input/train.csv')

**Checking how many numerical and categorical features there are and putting the column names in one list per data type**

In [None]:
numFeatures = []
catFeatures = []

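# Classify each column by the type of its value in the first row.
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for col, val in train.iloc[0, :].items():
    if isinstance(val, str):
        catFeatures.append(col)
    else:
        numFeatures.append(col)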
        
# Remove id and loss from the numFeatures
numFeatures.remove('id')
numFeatures.remove('loss')
        
print(len(numFeatures), 'Numerical Features:', numFeatures, "\n")
print(len(catFeatures), 'Categorical Features:', catFeatures)
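
As an aside, the same split can be made from the column dtypes instead of inspecting the first row. The snippet below is only a sketch on a made-up toy frame (the column names `cat1` and `cont1` are placeholders that mimic the layout, not a claim about the actual file), using `DataFrame.select_dtypes`:

```python
import pandas as pd

# Toy frame (hypothetical): an id, one categorical, one continuous column and the target
df = pd.DataFrame({
    'id': [1, 2, 3],
    'cat1': ['A', 'B', 'A'],
    'cont1': [0.1, 0.5, 0.9],
    'loss': [100.0, 250.5, 80.2],
})

# Numeric columns, minus the identifier and the target
num_cols = df.select_dtypes(include='number').columns.drop(['id', 'loss']).tolist()
# Everything that is not numeric is treated as categorical
cat_cols = df.select_dtypes(exclude='number').columns.tolist()

print(len(num_cols), 'Numerical Features:', num_cols)
print(len(cat_cols), 'Categorical Features:', cat_cols)
```

This looks at whole columns rather than a single row, so a column that happens to start with an unusual value is less likely to be misclassified.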

**Standardizing the numerical features before performing PCA**

In [None]:
sc = StandardScaler()
train_nums_std = sc.fit_transform(train[numFeatures])
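
PCA looks for directions of maximal variance, so a feature measured on a larger scale would dominate the components if we skipped this step. As a small sanity check, here is a sketch on made-up toy data (not the competition data) showing that `StandardScaler` amounts to subtracting each column's mean and dividing by its population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two columns on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.5, 0.1, 200),
    rng.normal(1000.0, 250.0, 200),
])

X_std = StandardScaler().fit_transform(X)

# The same transformation by hand: center, then divide by the std with ddof=0
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```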

**PCA**<br>
Setting `n_components` to `None` keeps all principal components, so we get the explained variance ratio of every one of them.

In [None]:
pca = PCA(n_components=None)
train_nums_pca = pca.fit_transform(train_nums_std)
varExp = pca.explained_variance_ratio_
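
To connect this back to the correlations mentioned at the top: on standardized data, the explained variance ratios are the eigenvalues of the correlation matrix divided by their sum, which is why strongly correlated features concentrate the variance in the first few components. A minimal sketch on synthetic data (the array `X` is made up for illustration) that checks this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated data (hypothetical): 4 columns mixed from 4 random sources
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Explained variance ratio of every component
ratios = PCA(n_components=None, svd_solver='full').fit(X_std).explained_variance_ratio_

# Eigenvalues of the correlation matrix, sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(X_std, rowvar=False))[::-1]

print(np.allclose(ratios, eigvals / eigvals.sum()))
print(ratios.sum())
```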

**Plot the cumulative explained variance as a function of the number of components**

In [None]:
# Cumulative explained variance after 1, 2, ..., n components
cumVarExplained = np.cumsum(varExp)
nb_components = np.arange(1, len(varExp) + 1)

plt.subplots(figsize=(8, 6))
plt.plot(nb_components, cumVarExplained, 'bo-')
plt.ylabel('Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylim([0.0, 1.1])
# The stop value is exclusive, so add 1 to get a tick for the last component too
plt.xticks(np.arange(1, len(nb_components) + 1))
plt.yticks(np.arange(0.0, 1.1, 0.10))

With 7 components we already explain more than 90% of the total variance in the numerical features. So we could reduce the numerical features to half of their original number.
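
Instead of reading the number off the plot, it can also be computed. The sketch below uses synthetic stand-in data (14 correlated columns that I generate from 5 made-up latent factors, so the resulting number is not the one from this notebook) and shows two ways to get the smallest number of components whose cumulative explained variance exceeds a threshold: from the cumulative sum, or by passing a float to `n_components`, which scikit-learn supports with `svd_solver='full'`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): 14 correlated columns from 5 latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 14))
X = latent @ mixing + 0.3 * rng.normal(size=(1000, 14))
X_std = StandardScaler().fit_transform(X)

threshold = 0.90

# Option 1: first position where the cumulative explained variance exceeds the threshold
full_pca = PCA(n_components=None, svd_solver='full').fit(X_std)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.argmax(cum_var > threshold)) + 1

# Option 2: let PCA choose the number of components for the same threshold
pca_90 = PCA(n_components=threshold, svd_solver='full').fit(X_std)

print(n_needed, pca_90.n_components_)
```

With the second option, `pca_90.transform(X_std)` directly returns the reduced feature matrix.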