# Dimensionality Reduction by PCA

## Learning Activity Q3

This notebook applies Principal Component Analysis (PCA) to the Iris dataset in order to reduce its dimensionality while preserving as much variance as possible. The transformed data is then visualized in a scatter plot to see how effectively PCA separates the different species of iris flowers.


In [2]:
import pandas as pd
import numpy as np

#a.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler


## Loading the Iris Dataset and Data Preprocessing

The Iris dataset is loaded from scikit-learn.
Before applying PCA, the data is scaled so that each feature has a mean of 0 and a standard deviation of 1.
Standardization is required because PCA is sensitive to the scale of the variables: features measured on larger scales would otherwise dominate the principal components.
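
As a quick check of what standardization does, the sketch below compares a manual z-score computation with `StandardScaler`. The names `X_demo`, `X_manual`, and `X_sklearn` are illustrative and not part of the notebook's own cells; the comparison relies on `StandardScaler` using the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Illustrative copy of the feature matrix (150 samples x 4 features)
X_demo = load_iris().data

# Manual z-score: subtract each column's mean, divide by its standard deviation
# (np.std defaults to ddof=0, the population standard deviation)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation via scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual and StandardScaler agree:", np.allclose(X_manual, X_sklearn))
print("Column means:", X_sklearn.mean(axis=0).round(6))
print("Column standard deviations:", X_sklearn.std(axis=0).round(6))
```

After scaling, every feature contributes on the same footing, so no single measurement dominates the components purely because of its units.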


In [None]:
#b.
#loading the Iris dataset in
iris = load_iris()

#Features (measurements)
X = iris.data

#Labels (species)
y = iris.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
        ...

## Principal Component Analysis (PCA)

PCA transforms the standardized data into a new coordinate system in which the first component (PC1) captures the greatest amount of variance and the second component (PC2), orthogonal to PC1, captures the next greatest amount.
Two components were chosen so that the data can be visualized in two dimensions (an x-y plot).
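
To make the idea of a "new coordinate system" concrete, here is a minimal sketch that reproduces the two-component projection by eigendecomposition of the covariance matrix and compares it with scikit-learn's `PCA`. This is one textbook route to PCA, not necessarily the solver scikit-learn uses internally, and the names `X_std`, `X_proj`, and `X_ref` are illustrative. The sign of each component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized features (mean 0, standard deviation 1 per column)
X_std = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the features (columns are variables)
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the two directions with the largest variance (PC1, PC2)
X_proj = X_std @ eigvecs[:, :2]

# Reference projection from scikit-learn
X_ref = PCA(n_components=2).fit_transform(X_std)

# Each component is only defined up to sign, so compare magnitudes
print("Projections match up to sign:", np.allclose(np.abs(X_proj), np.abs(X_ref)))
```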

In [None]:
#c.
from sklearn.decomposition import PCA

#Apply PCA with 2 components (for visualization)
pca = PCA(n_components = 2)
X_pca = pca.fit_transform(X_scaled)

## Explained Variance

The explained variance ratio indicates how much of the original data variance is captured by each principal component.
The total variance explained by the selected components helps determine how effective the dimensionality reduction was.
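
The sketch below shows where these ratios come from: each ratio is a covariance eigenvalue divided by the sum of all eigenvalues, and the cumulative sum shows how much variance is retained as components are added. Fitting a separate `PCA()` with all four components (`pca_full`) is an assumption made here for illustration; the notebook itself keeps only two.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Keep all components so the ratios sum to 1
pca_full = PCA().fit(X_std)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# The same ratios computed from the covariance eigenvalues (sorted descending)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X_std, rowvar=False)))[::-1]
manual_ratios = eigvals / eigvals.sum()

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: ratio = {r:.4f}, cumulative = {c:.4f}")

print("Matches eigenvalue computation:", np.allclose(ratios, manual_ratios))
```

Reading the cumulative column is a common way to judge how many components are worth keeping for a given variance target.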

In [None]:
#d.
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", np.sum(pca.explained_variance_ratio_))


## Visualization of PCA Results in a Scatter Plot

The scatter plot below shows the Iris dataset projected onto the first two principal components.
Each point represents a flower, and colors indicate different species.

In [None]:
#e.
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    X_pca[:, 0],  # PC1 value
    X_pca[:, 1],  # PC2 value
    c=y,
    cmap="viridis",
    edgecolor="k",
)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")

plt.legend(
    handles=scatter.legend_elements()[0],
    labels=iris.target_names,
    title="Species"
)

plt.show()
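
As a numeric complement to the scatter plot, the following sketch computes the mean position (centroid) of each species in the PC1-PC2 plane. It is an illustrative addition rather than part of the original activity; `iris_demo`, `scores`, and `centroids` are hypothetical names, and the pipeline is recomputed so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_demo = load_iris()

# Same pipeline as above: standardize, then project onto two components
scores = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris_demo.data)
)

# Average PC1/PC2 coordinates for each species
centroids = {}
for label, name in enumerate(iris_demo.target_names):
    centroids[name] = scores[iris_demo.target == label].mean(axis=0)
    pc1, pc2 = centroids[name]
    print(f"{name:<12} PC1 mean = {pc1:+.3f}, PC2 mean = {pc2:+.3f}")
```

Centroids that lie far apart along a component suggest that the component helps distinguish those species, while nearby centroids point to overlap in the plot.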

## Interpretation of Results

