## PCA from SAS® Viya® on Iris

### Source
This example is adapted from [Reduce Data Dimensionality using PCA – Python](https://www.geeksforgeeks.org/reduce-data-dimentionality-using-pca-python/) by user suvratarora06.

### Data Preparation
#### About the data set
The Iris data set comes from Ronald Fisher's 1936 paper "The use of multiple measurements in taxonomic problems". It contains 150 samples (50 from each of three iris species) with five attributes: the species label and the length and width of both sepals and petals. This version of the Iris data is available through scikit-learn.

In [None]:
from sklearn import datasets
import pandas as pd
import seaborn as sn
from sasviya.ml.decomposition import PCA

#### Importing the data set
To load the data, use `datasets.load_iris()` from scikit-learn. In this case, we also convert the four measurement columns into a pandas DataFrame.


In [None]:
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df.head(5)

#### Standardizing the data
PCA is sensitive to the scale of the input features, so it is commonly recommended to standardize the data first. The `PCA` class scales the input features by default, so there is no need for a separate scaling step.
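
For readers who want to see what standardization does, here is a minimal sketch of z-score scaling with pandas. It is an illustration only and makes no claim about how the `PCA` class scales data internally. The `toy` DataFrame and its column names are made-up values, not the Iris measurements, and using the population standard deviation (`ddof=0`) is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

# Toy measurements (made-up values, not the Iris data) on two different scales.
toy = pd.DataFrame({
    "length_cm": [5.1, 7.0, 6.3, 4.9, 5.8],
    "width_mm": [35.0, 32.0, 33.0, 30.0, 27.0],
})

# z-score: subtract each column's mean and divide by its standard deviation.
# ddof=0 selects the population standard deviation (a choice made for this sketch).
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates the analysis just because of its units.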

### Visualizing the data

Heatmaps provide a convenient way to visualize the correlation between variables. In a default seaborn heatmap, darker colors indicate lower (more negative) correlations and lighter colors indicate higher (more positive) correlations. Since petal length is highly correlated with other features, dimensionality reduction is a reasonable next step.

In [None]:
sn.heatmap(df.corr(), annot=True)
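
A heatmap is read by eye. As a complement, the following sketch lists the strongly correlated pairs numerically. The `toy` DataFrame holds made-up values rather than the Iris data, and the `threshold` of 0.8 is a hypothetical cutoff chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): "b" tracks "a" closely, "c" moves against it,
# and "d" is only weakly related to the others.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1],
    "c": [5.0, 4.2, 2.9, 2.1, 1.0],
    "d": [2.0, -1.0, 3.0, -2.0, 1.0],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Hypothetical threshold for calling a correlation "strong".
threshold = 0.8
strong = pairs[pairs.abs() > threshold].sort_values(key=np.abs, ascending=False)
print(strong)
```

The same pattern can be applied to `df.corr()` from the cell above to see which Iris features are candidates for being combined.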

### Applying Principal Component Analysis
To reduce dimensionality, we will use principal component analysis (PCA). Since the original data has 4 columns, we will use `n_components` to reduce it to 3.

For details about using the `PCA` class of the `sasviya` package, see the [PCA documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=n1hbrdco0inum2n1ddq5wv4ghifq.htm).

In [None]:
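# Fit a PCA model that keeps the first three principal components
pca = PCA(n_components=3)
pca.fit(df)

# Project the original data onto the components and label the new columns
data_pca = pca.transform(df).values
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])
data_pca.head()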

### Visualizing data after PCA
As the brief output shows, PCA transforms the original data. Let's examine the correlation matrix for the transformed data with a heatmap. We can see that the principal components have almost no correlation with each other, which is expected because PCA constructs them to be uncorrelated.

In [None]:
sn.heatmap(data_pca.corr(), annot=True)
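
The near-zero off-diagonal values can also be checked numerically rather than read off a heatmap. The sketch below uses synthetic data and an SVD-based PCA written for illustration. `max_offdiag` is a hypothetical helper defined here, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, deliberately correlated data (not the Iris measurements).
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, -0.6]]) + 0.3 * rng.normal(size=(200, 3))

# Center the columns, then obtain the principal directions from the SVD.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T


def max_offdiag(corr):
    """Largest absolute off-diagonal entry of a square correlation matrix."""
    off = corr - np.diag(np.diag(corr))
    return np.abs(off).max()


before = max_offdiag(np.corrcoef(X, rowvar=False))
after = max_offdiag(np.corrcoef(scores, rowvar=False))
print(f"before: {before:.3f}, after: {after:.2e}")
```

Applying a helper like this to `data_pca.corr().values` would give a single number summarizing how close the transformed columns are to being uncorrelated.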