## More on Principal Component Analysis 

## Overview

The Digits dataset consists of 1,797 images of handwritten digits, each an 8×8 pixel image flattened into a 64-dimensional feature vector. This high dimensionality makes the data hard to visualise and explore, and it can increase the complexity of models trained on it.

This notebook applies PCA to the Digits dataset and evaluates its ability to reduce the dataset's dimensionality while retaining valuable information.

## Import libraries 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Load and prepare dataset

In [2]:
# Load the dataset
digits = load_digits()
X = digits.data
y = digits.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
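
Before applying PCA, it can be worth confirming that standardisation did what we expect. The sketch below is a sanity check rather than part of the original workflow: it assumes that pixel columns which never vary are left at zero by `StandardScaler` instead of being rescaled, and the `varying` mask is a helper name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

# Columns that never vary cannot be rescaled to unit variance;
# they end up as all zeros after centring.
varying = X.std(axis=0) > 0

print("Shape:", X_scaled.shape)
print("Constant columns:", int((~varying).sum()))
print("Largest |column mean|:", np.abs(X_scaled.mean(axis=0)).max())
print("Std of varying columns (min, max):",
      X_scaled[:, varying].std(axis=0).min(),
      X_scaled[:, varying].std(axis=0).max())
```

Every column should have a mean close to 0, and every varying column a standard deviation close to 1.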

### Step 1

To reduce the dataset's dimensionality, let's transform the standardised dataset by applying PCA.

In [3]:
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
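
To make the transformation less of a black box, the following sketch reproduces it by hand. It assumes the default `PCA()` settings (no whitening), in which case the transformed data should equal the centred data projected onto the rows of `pca.components_`. The name `X_manual` is introduced here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Manual projection: centre the data, then project it onto the component axes.
X_manual = (X_scaled - pca.mean_) @ pca.components_.T

print("Transformed shape:", X_pca.shape)
print("Matches manual projection:", np.allclose(X_pca, X_manual))
```

Because no `n_components` is given, all 64 components are kept, so the transformed data has the same shape as the input. The dimensionality only drops once we choose how many components to retain.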

### Step 2

To understand which components carry the most information, we can assess how much of the dataset's variance is captured by each principal component.

We compute and print the explained variance ratio (`explained_variance_ratio_`) for each principal component, formatted to four decimal places.

In [4]:
ev_ratio = pca.explained_variance_ratio_

for i, ev in enumerate(ev_ratio):
    print(f"PC{i+1}: Explained Variance = {ev:.4f}")

PC1: Explained Variance = 0.1203
PC2: Explained Variance = 0.0956
PC3: Explained Variance = 0.0844
PC4: Explained Variance = 0.0650
PC5: Explained Variance = 0.0486
PC6: Explained Variance = 0.0421
PC7: Explained Variance = 0.0394
PC8: Explained Variance = 0.0339
PC9: Explained Variance = 0.0300
PC10: Explained Variance = 0.0293
PC11: Explained Variance = 0.0278
PC12: Explained Variance = 0.0258
PC13: Explained Variance = 0.0228
PC14: Explained Variance = 0.0223
PC15: Explained Variance = 0.0217
PC16: Explained Variance = 0.0191
PC17: Explained Variance = 0.0178
PC18: Explained Variance = 0.0164
PC19: Explained Variance = 0.0160
PC20: Explained Variance = 0.0149
PC21: Explained Variance = 0.0135
PC22: Explained Variance = 0.0127
PC23: Explained Variance = 0.0117
PC24: Explained Variance = 0.0106
PC25: Explained Variance = 0.0098
PC26: Explained Variance = 0.0094
PC27: Explained Variance = 0.0086
PC28: Explained Variance = 0.0084
PC29: Explained Variance = 0.0080
PC30: Explained Varianc
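
As a cross-check on where these ratios come from, the sketch below recomputes them from the transformed data. It assumes all components are kept (the default `PCA()`), so that each ratio is the sample variance along one principal axis divided by the sum over all axes. `component_variance` and `manual_ratio` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Sample variance of the data along each principal axis (ddof=1).
component_variance = X_pca.var(axis=0, ddof=1)

# Each ratio is one component's variance divided by the total variance.
manual_ratio = component_variance / component_variance.sum()

print("Matches explained_variance_ratio_:",
      np.allclose(manual_ratio, pca.explained_variance_ratio_))
print("Ratios sum to:", manual_ratio.sum())
```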

### Step 3

We can also evaluate how much total variance is captured as components are added incrementally. This can help us get a view of how many components are needed to capture a substantial proportion of the dataset's variance.

We determine the cumulative variance ratio by taking the running sum of the explained variance ratios, so the value at component k is the total for components 1 to k.

In [5]:
cv_ratio = np.cumsum(ev_ratio)

for i, cv in enumerate(cv_ratio):
    print(f"PC{i+1}: Cumulative Variance = {cv:.4f}")

PC1: Cumulative Variance = 0.1203
PC2: Cumulative Variance = 0.2159
PC3: Cumulative Variance = 0.3004
PC4: Cumulative Variance = 0.3654
PC5: Cumulative Variance = 0.4140
PC6: Cumulative Variance = 0.4561
PC7: Cumulative Variance = 0.4955
PC8: Cumulative Variance = 0.5294
PC9: Cumulative Variance = 0.5594
PC10: Cumulative Variance = 0.5887
PC11: Cumulative Variance = 0.6166
PC12: Cumulative Variance = 0.6423
PC13: Cumulative Variance = 0.6651
PC14: Cumulative Variance = 0.6874
PC15: Cumulative Variance = 0.7090
PC16: Cumulative Variance = 0.7281
PC17: Cumulative Variance = 0.7459
PC18: Cumulative Variance = 0.7623
PC19: Cumulative Variance = 0.7782
PC20: Cumulative Variance = 0.7931
PC21: Cumulative Variance = 0.8066
PC22: Cumulative Variance = 0.8193
PC23: Cumulative Variance = 0.8310
PC24: Cumulative Variance = 0.8416
PC25: Cumulative Variance = 0.8513
PC26: Cumulative Variance = 0.8608
PC27: Cumulative Variance = 0.8694
PC28: Cumulative Variance = 0.8778
PC29: Cumulative Variance = 0
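
A long printed list can be hard to scan, so one option is to plot the cumulative curve instead. The following is a minimal sketch, not part of the original notebook. It recomputes the ratios so that it runs on its own, and the figure size and styling are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
cv_ratio = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Components are numbered from 1, matching the printed PC labels.
component_numbers = np.arange(1, len(cv_ratio) + 1)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(component_numbers, cv_ratio, marker="o", markersize=3)
ax.set_xlabel("Number of principal components")
ax.set_ylabel("Cumulative explained variance ratio")
ax.set_title("Cumulative variance captured by PCA on Digits")
ax.grid(True, alpha=0.3)
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell. When running this as a script, add `plt.show()`.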

### Step 4

Based on the results from **Step 3**, it is possible to determine how many components are needed to capture at least 85% of the total variance.


In [11]:
# Calculate the cumulative variance ratio from the ratios computed in Step 2
cumulative_variance_ratio = np.cumsum(ev_ratio)

# Print the cumulative variance ratio
for i, cv in enumerate(cumulative_variance_ratio):
    print(f"PC{i+1}: Cumulative Variance = {cv:.4f}")

PC1: Cumulative Variance = 0.1203
PC2: Cumulative Variance = 0.2159
PC3: Cumulative Variance = 0.3004
PC4: Cumulative Variance = 0.3654
PC5: Cumulative Variance = 0.4140
PC6: Cumulative Variance = 0.4561
PC7: Cumulative Variance = 0.4955
PC8: Cumulative Variance = 0.5294
PC9: Cumulative Variance = 0.5594
PC10: Cumulative Variance = 0.5887
PC11: Cumulative Variance = 0.6166
PC12: Cumulative Variance = 0.6423
PC13: Cumulative Variance = 0.6651
PC14: Cumulative Variance = 0.6874
PC15: Cumulative Variance = 0.7090
PC16: Cumulative Variance = 0.7281
PC17: Cumulative Variance = 0.7459
PC18: Cumulative Variance = 0.7623
PC19: Cumulative Variance = 0.7782
PC20: Cumulative Variance = 0.7931
PC21: Cumulative Variance = 0.8066
PC22: Cumulative Variance = 0.8193
PC23: Cumulative Variance = 0.8310
PC24: Cumulative Variance = 0.8416
PC25: Cumulative Variance = 0.8513
PC26: Cumulative Variance = 0.8608
PC27: Cumulative Variance = 0.8694
PC28: Cumulative Variance = 0.8778
PC29: Cumulative Variance = 0

The results show that the cumulative variance grows with each added principal component, while each successive component contributes less than the one before it. For example, PC1 alone captures 12.03%, the first 10 components together capture 58.87%, and all 64 components together capture 100% of the variance. This helps identify how many components are needed to capture a significant portion of the dataset's variance.

The cumulative variance first reaches the 85% threshold at PC25 (84.16% at PC24, 85.13% at PC25), so 25 components are needed. Retaining only these 25 reduces the dimensionality from 64 to 25 while keeping 85.13% of the total variance.
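
Rather than reading the count off the printed list, it can be computed directly. The sketch below shows two ways to do this under the same preprocessing as above. `threshold`, `n_needed` and `pca_85` are names introduced here for illustration. The second approach relies on scikit-learn accepting a float between 0 and 1 for `n_components` when `svd_solver="full"`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
ev_ratio = PCA().fit(X_scaled).explained_variance_ratio_
cv_ratio = np.cumsum(ev_ratio)

threshold = 0.85  # target share of the total variance

# argmax returns the first index where the condition is True;
# add 1 because components are numbered from 1.
n_needed = int(np.argmax(cv_ratio >= threshold)) + 1
print(f"Components needed for {threshold:.0%} variance: {n_needed}")
print(f"Cumulative variance at that point: {cv_ratio[n_needed - 1]:.4f}")

# Alternative: let scikit-learn choose the count from a variance target.
pca_85 = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print("n_components_ chosen by PCA:", pca_85.n_components_)
```

Both approaches should agree with the count of 25 read from the cumulative variance output.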

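Using 25 components instead of the original 64 shrinks the feature space of any downstream model, which can reduce training cost and the risk of overfitting. Any gain in predictive performance should be checked empirically, and because each component is a linear combination of all 64 pixels, the reduced features are not necessarily easier to interpret than the raw pixels.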
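
Whether a smaller feature space actually helps a given model is an empirical question. As one way to check, the sketch below compares a classifier trained on all 64 scaled features with the same classifier trained on 25 components. The choice of logistic regression, 5-fold cross-validation and `max_iter=2000` are assumptions made for illustration and are not part of the original analysis.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Hypothetical downstream model: logistic regression on all 64 scaled features.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# The same model, fed only the first 25 principal components.
reduced_model = make_pipeline(
    StandardScaler(), PCA(n_components=25), LogisticRegression(max_iter=2000)
)

full_scores = cross_val_score(full_model, X, y, cv=5)
reduced_scores = cross_val_score(reduced_model, X, y, cv=5)

print(f"64 features:   mean accuracy = {full_scores.mean():.3f}")
print(f"25 components: mean accuracy = {reduced_scores.mean():.3f}")
```

Placing the scaler and PCA inside the pipeline means they are refitted on each training fold, which avoids leaking information from the validation fold.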

The leading components also make the data easier to visualise: plotting the first two or three principal components can give meaningful insights into class separation or clustering.
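
As an illustration of such a plot, the sketch below projects the standardised data onto the first two principal components and colours each point by its digit label. The colour map, marker size and figure size are arbitrary choices. How cleanly the classes separate in two dimensions is something to inspect rather than assume.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Keep only the first two principal components for a 2-D view.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(7, 5))
scatter = ax.scatter(
    X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=10, alpha=0.7
)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Digits projected onto the first two principal components")
fig.colorbar(scatter, ax=ax, ticks=list(range(10)), label="Digit")
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell. When running this as a script, add `plt.show()`.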

Therefore, the ability to capture over 85% of the variance with 25 components makes PCA a viable dimensionality reduction technique for this dataset, preserving most information while simplifying further analyses.