For this notebook, we want to look at the features correlated with data processing and 

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

Let's take a look at the Iris dataset.

In [3]:
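# The raw file has no header row, so the column names are assigned manually.
data = pd.read_csv("iris.csv", header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
# Keep only the four numeric features as a NumPy array.
d_vals = data.drop(columns=['class']).values
d_list = d_vals[:, 0:4]  # all four feature columns (same contents as d_vals)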
data.head()

,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Now let's scale the independent features of the dataset and compute the correlation coefficients and covariances among all pairs of independent features.

In [4]:
data_scale = preprocessing.scale(d_vals)
data_scale = pd.DataFrame(data_scale,columns=['sepal length', 'sepal width', 'petal length', 'petal width'])

In [5]:
print("Correlation Coefficient Matrix:")
data_scale[['sepal length', 'sepal width', 'petal length', 'petal width']].corr()

Correlation Coefficient Matrix:


,sepal length,sepal width,petal length,petal width
sepal length,1.0,-0.109369,0.871754,0.817954
sepal width,-0.109369,1.0,-0.420516,-0.356544
petal length,0.871754,-0.420516,1.0,0.962757
petal width,0.817954,-0.356544,0.962757,1.0


In [6]:
print("Covariance pairs:")
print("SL & SW:", data_scale['sepal length'].cov(data_scale['sepal width']))
print("SL & PL:", data_scale['sepal length'].cov(data_scale['petal length']))
print("SL & PW:", data_scale['sepal length'].cov(data_scale['petal width']))

print("SW & PL:", data_scale['sepal width'].cov(data_scale['petal length']))
print("SW & PW:", data_scale['sepal width'].cov(data_scale['petal width']))

print("PL & PW:", data_scale['petal length'].cov(data_scale['petal width']))

Covariance pairs:
SL & SW: -0.11010327176239859
SL & PL: 0.877604856347186
SL & PW: 0.8234432550696282
SW & PL: -0.42333835208169923
SW & PW: -0.3589370029669187
PL & PW: 0.9692185540781538


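Of all the feature pairs, petal length and petal width are the most strongly related (correlation 0.963, covariance 0.969). Because the features were standardized, each covariance is almost identical to the corresponding correlation. The small gap is a factor of n/(n-1), which is 150/149 for the 150 samples of the standard Iris file: preprocessing.scale divides by the population standard deviation, while the pandas cov method uses the sample (n-1) estimator.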
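The covariances printed above sit slightly above the matching correlation coefficients. The sketch below illustrates one likely reason, using synthetic data rather than the Iris file: when features are standardized with the population standard deviation (as preprocessing.scale does) but the covariance uses the sample estimator, the two differ by a factor of n/(n-1). The variable names and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic, correlated two-feature data (illustrative only, not the Iris data).
rng = np.random.default_rng(0)
n = 150
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.5, size=n)
features = np.column_stack([x, y])

# Standardize with the population standard deviation (ddof=0),
# which is what sklearn.preprocessing.scale does.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Pearson correlation is unaffected by the choice of ddof.
corr = np.corrcoef(scaled, rowvar=False)[0, 1]

# np.cov, like pandas' cov, defaults to the sample estimator (divide by n - 1).
cov = np.cov(scaled, rowvar=False)[0, 1]

ratio = cov / corr
print("cov / corr:", ratio)
print("n / (n - 1):", n / (n - 1))
```

With n = 150 the factor is 150/149, which is consistent with the ratio between the covariance and correlation values reported in the notebook (for example 0.110103 / 0.109369).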

Now let's compute the PCA and show the principal components fitted to this dataset.

In [8]:
pc = PCA(n_components=4)
# Note: the PCA is fitted on the raw (unscaled) feature values in d_list.
# fit_transform fits the model and projects the data in a single call.
data_PCA = pc.fit_transform(d_list)
data_PCA = pd.DataFrame(data_PCA, columns=['PC1', 'PC2', 'PC3', 'PC4'])
print("Components:", pc.components_)
print("Variance:", pc.explained_variance_)
print("Singular values:", pc.singular_values_)

Components: [[ 0.36158968 -0.08226889  0.85657211  0.35884393]
 [ 0.65653988  0.72971237 -0.1757674  -0.07470647]
 [-0.58099728  0.59641809  0.07252408  0.54906091]
 [ 0.31725455 -0.32409435 -0.47971899  0.75112056]]
Variance: [4.22484077 0.24224357 0.07852391 0.02368303]
Singular values: [25.08986398  6.00785254  3.42053538  1.87850234]
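The explained variances and singular values reported by the PCA are two views of the same quantity: for centered data with n samples, each explained variance equals the squared singular value divided by n - 1, and also equals an eigenvalue of the sample covariance matrix. The sketch below checks this with plain NumPy on synthetic data, then applies the same formula to the numbers printed above, assuming the standard 150-row Iris file. All names in the block are illustrative.

```python
import numpy as np

# Synthetic correlated data standing in for a 150 x 4 feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
n_samples = X.shape[0]

# PCA works on mean-centered data.
Xc = X - X.mean(axis=0)

# Singular values of the centered data, in descending order.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Explained variance per component: s^2 / (n - 1).
explained_variance = s**2 / (n_samples - 1)

# The same numbers are the eigenvalues of the sample covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Share of total variance carried by each component.
explained_ratio = explained_variance / explained_variance.sum()

# Values copied from the notebook output; 150 samples is an assumption.
s_iris = np.array([25.08986398, 6.00785254, 3.42053538, 1.87850234])
var_iris = np.array([4.22484077, 0.24224357, 0.07852391, 0.02368303])
var_from_s = s_iris**2 / (150 - 1)

print("explained variance:", explained_variance)
print("covariance eigenvalues:", eigvals)
print("variance recomputed from the notebook's singular values:", var_from_s)
```

Dividing each variance by their sum gives the fraction of total variance per component, which is often easier to interpret than the raw values.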


Now we can compute the correlation coefficient between each original feature and each new feature generated by the PCA to see how they relate.

In [9]:
data_scale['PC1'] = data_PCA['PC1']
data_scale['PC2'] = data_PCA['PC2']
data_scale['PC3'] = data_PCA['PC3']
data_scale['PC4'] = data_PCA['PC4']
matrix = data_scale[['sepal length', 'sepal width', 'petal length', 'petal width', 'PC1', 'PC2', 'PC3', 'PC4']].corr()
matrix = matrix.drop(columns=['PC1', 'PC2', 'PC3', 'PC4'])
matrix

,sepal length,sepal width,petal length,petal width
sepal length,1.0,-0.109369,0.871754,0.817954
sepal width,-0.109369,1.0,-0.420516,-0.356544
petal length,0.871754,-0.420516,1.0,0.962757
petal width,0.817954,-0.356544,0.962757,1.0
PC1,0.897545,-0.389993,0.997854,0.966484
PC2,0.390231,0.828313,-0.04903,-0.04818
PC3,-0.196612,0.38545,0.011518,0.201607
PC4,0.058961,-0.115029,-0.041841,0.151465


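With that done, we can match each original feature to the principal component it correlates with most strongly. From the table, PC1 is strongly correlated with petal length (0.998), petal width (0.966), and sepal length (0.898), while PC2 is most strongly correlated with sepal width (0.828). PC3 and PC4 are only weakly correlated with every original feature (at most 0.385 in absolute value), so the columns do not map one-to-one onto PC1-PC4.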
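The feature-to-component correlations in the table can also be obtained in closed form: the correlation between feature j and component k is the component weight times the square root of that component's variance, divided by the feature's standard deviation. The sketch below compares this formula with directly computed correlations on synthetic data. It uses NumPy's eigendecomposition instead of sklearn's PCA, and every name in it is an assumption for illustration.

```python
import numpy as np

# Synthetic correlated data standing in for the four Iris features.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix, sorted by
# descending eigenvalue so that row k of `components` is the k-th PC.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order].T

# Project the centered data onto the principal axes (the PC scores).
scores = Xc @ components.T

# Direct route: correlate every PC score column with every feature.
full = np.corrcoef(np.hstack([X, scores]), rowvar=False)
direct = full[4:, :4]  # rows: PC1..PC4, columns: original features

# Closed form: weight * sqrt(component variance) / feature std (ddof=1).
feature_std = X.std(axis=0, ddof=1)
closed_form = components * np.sqrt(eigvals)[:, None] / feature_std[None, :]

print(np.round(direct, 4))
print(np.round(closed_form, 4))
```

Because correlation is unaffected by rescaling a variable, the result is the same whether the original features are correlated in raw or standardized form, which is why the notebook can mix scaled features with components fitted on unscaled values.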