Import pandas, Plotly Express, StandardScaler and PCA

> PCA is sensitive to the scale of the features: a feature measured on a larger scale has a larger variance and will dominate the principal components. To avoid this, we standardize the data first. StandardScaler rescales each feature so that its mean is 0 and its variance is 1. This preprocessing step is common in ML pipelines.


In [1]:
import pandas as pd
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Load iris and take a look at the dataset

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['sepal_length','sepal_width','petal_length','petal_width','target'])
df.head()

,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Split the dataset into the features (X) and the target (y)

In [3]:
X = df.drop(columns='target')  # the four numeric feature columns
y = df.target                  # the species labels

Apply the standardization to the X values

In [4]:
X = StandardScaler().fit_transform(X)
print(pd.DataFrame(X))

            0         1         2         3
0   -0.900681  1.032057 -1.341272 -1.312977
1   -1.143017 -0.124958 -1.341272 -1.312977
2   -1.385353  0.337848 -1.398138 -1.312977
3   -1.506521  0.106445 -1.284407 -1.312977
4   -1.021849  1.263460 -1.341272 -1.312977
..        ...       ...       ...       ...
145  1.038005 -0.124958  0.819624  1.447956
146  0.553333 -1.281972  0.705893  0.922064
147  0.795669 -0.124958  0.819624  1.053537
148  0.432165  0.800654  0.933356  1.447956
149  0.068662 -0.124958  0.762759  0.790591

[150 rows x 4 columns]
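
To make concrete what `StandardScaler` does, here is a minimal NumPy sketch of the same transformation, assuming the default settings: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The array `sample` holds the first three columns of the five rows shown by `df.head()`; the fourth column is left out because it is constant in those rows and would have zero standard deviation. The names `sample` and `standardized` are illustrative only, and the values will not match the output above, which was standardized over all 150 rows.

```python
import numpy as np

# First three feature columns of the five rows shown by df.head()
sample = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [4.7, 3.2, 1.3],
    [4.6, 3.1, 1.5],
    [5.0, 3.6, 1.4],
])

mean = sample.mean(axis=0)  # per-column mean
std = sample.std(axis=0)    # per-column population std (ddof=0)
standardized = (sample - mean) / std

print(standardized.mean(axis=0))  # each entry is approximately 0
print(standardized.var(axis=0))   # each entry is approximately 1
```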


Fit PCA on the standardized X and project it onto the first two principal components

In [5]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
pcaDF = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])
pcaDF

,PC1,PC2
0,-2.264542,0.505704
1,-2.086426,-0.655405
2,-2.367950,-0.318477
3,-2.304197,-0.575368
4,-2.388777,0.674767
...,...,...
145,1.870522,0.382822
146,1.558492,-0.905314
147,1.520845,0.266795
148,1.376391,1.016362
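
For intuition about what `fit_transform` returns, the sketch below computes principal components by hand with NumPy: center the data, take the eigenvectors of the covariance matrix, and project onto the ones with the largest eigenvalues. This is an illustration under stated assumptions, not scikit-learn's actual implementation (which is based on an SVD), and the sign of each component may differ from what `PCA` reports. The function `pca_sketch` and the random array `X_demo` are hypothetical names made up for this example.

```python
import numpy as np

def pca_sketch(X, n_components=2):
    """Project X onto its top principal components (illustrative only)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)    # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]    # top directions as columns
    scores = X_centered @ components          # coordinates in the new basis
    ratio = eigvals[:n_components] / eigvals.sum()
    return scores, ratio

# Synthetic stand-in data (not the iris dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] += 2 * X_demo[:, 0]  # make two features correlated

scores, ratio = pca_sketch(X_demo)
print(scores.shape)  # (50, 2)
print(ratio)         # share of total variance captured by each component
```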


Combine pcaDF with y to get a dataframe with both the principal components and the target labels

In [6]:
finalDf = pd.concat([pcaDF, y], axis = 1)
finalDf

,PC1,PC2,target
0,-2.264542,0.505704,Iris-setosa
1,-2.086426,-0.655405,Iris-setosa
2,-2.367950,-0.318477,Iris-setosa
3,-2.304197,-0.575368,Iris-setosa
4,-2.388777,0.674767,Iris-setosa
...,...,...,...
145,1.870522,0.382822,Iris-virginica
146,1.558492,-0.905314,Iris-virginica
147,1.520845,0.266795,Iris-virginica
148,1.376391,1.016362,Iris-virginica
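
One thing to keep in mind with `pd.concat(..., axis=1)` is that it aligns rows by index label, not by position. It works here because both `pcaDF` and `y` carry the default 0..149 index. The sketch below uses small made-up objects (`pcs` and `labels` are hypothetical) to show what happens when the indexes differ, and one possible way to handle it.

```python
import pandas as pd

# Toy stand-ins: component scores with the default index 0..2
pcs = pd.DataFrame({"PC1": [0.1, 0.2, 0.3], "PC2": [1.0, 2.0, 3.0]})
# Labels whose index does not match (e.g. left over from filtering)
labels = pd.Series(["a", "b", "c"], index=[10, 11, 12], name="target")

# Index labels do not overlap, so the result has 6 rows padded with NaN
misaligned = pd.concat([pcs, labels], axis=1)

# Resetting the index makes the rows line up by position
aligned = pd.concat([pcs, labels.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (6, 3)
print(aligned.shape)     # (3, 3)
```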


Plot the two principal components against each other, colored by species (you don't need to code this yourselves)

In [7]:
px.scatter(finalDf, x = 'PC1', y='PC2', color = 'target')

Take a look at the explained variance ratios for the two principal components

In [8]:
pca.explained_variance_ratio_

array([0.72770452, 0.23030523])
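
Each ratio is the share of the total variance captured by one component, so adding them tells us how much information the 2D projection keeps. A small sketch of that arithmetic, with the two values copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Values copied from pca.explained_variance_ratio_ above
ratios = np.array([0.72770452, 0.23030523])

cumulative = np.cumsum(ratios)  # running total across components
retained = cumulative[-1]       # variance kept by PC1 and PC2 together
lost = 1 - retained             # variance left in the dropped components

print(f"retained: {retained:.1%}, lost: {lost:.1%}")
```

The two components together retain about 95.8% of the variance in the standardized data, so the scatter plot loses only about 4.2%.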