# Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and data analysis. 
- It aims to transform a high-dimensional dataset into a lower-dimensional representation while retaining most of the relevant information.

Suppose we have a group of students and their grades in an English course, on a scale from 0 to 10.

In [None]:
import numpy as np
import pandas as pd

grades = pd.DataFrame({"english": [8.0, 7.1, 10.0, 3.8, 1.4, 2.3]})
grades

Let's represent these values in a plot.

In [None]:
import matplotlib.pyplot as plt
_, ax = plt.subplots(figsize=(3,3))
ax.scatter(grades.english, np.full(len(grades), 2), c=[*'rrrbbb'])
plt.xlabel('english')
plt.yticks([])
plt.show()

Two distinct groups of students are evident: students 1, 2, and 3 (red) lie close to one another, as do students 4, 5, and 6 (blue), while the two groups are far apart.

Now let's add a second course.

In [None]:
grades['math'] = [7, 6.5, 6.1, 4, 3.7, 2]
grades

In [None]:
_, ax = plt.subplots(figsize=(3,3))
ax.scatter(grades.english, grades.math, c=[*'rrrbbb'])
plt.xlabel('english')
plt.ylabel('math')   
plt.show()

The students remain clustered when we consider the two dimensions together. Students 1, 2, and 3 are highly similar to one another, and students 4, 5, and 6 are likewise similar within their own group.

Now, let's add a third course.

In [None]:
grades['history'] = [10, 9.2, 8.7, 6.4, 6.0, 5.8]
grades

In [None]:
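from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3D projection on older matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot the data points in 3D, one axis per course
ax.scatter(grades.english, grades.math, grades.history, c=[*'rrrbbb'])
ax.set_xlabel('english')
ax.set_ylabel('math')
ax.set_zlabel('history')
plt.show()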

As you can see, the same students remain clustered together.
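The visual impression can also be checked numerically. The following is a minimal sketch, assuming Euclidean distance is a reasonable measure of similarity between students; the helper `pairwise` and the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Grades for english, math, history (rows: students 1 to 6).
X = np.array([
    [8.0, 7.0, 10.0],
    [7.1, 6.5, 9.2],
    [10.0, 6.1, 8.7],
    [3.8, 4.0, 6.4],
    [1.4, 3.7, 6.0],
    [2.3, 2.0, 5.8],
])
group_a, group_b = X[:3], X[3:]


def pairwise(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


# Distances inside each group (upper triangle only, to skip self-distances).
upper = np.triu_indices(3, k=1)
within_a = pairwise(group_a, group_a)[upper]
within_b = pairwise(group_b, group_b)[upper]
# Distances between a student of one group and a student of the other.
between = pairwise(group_a, group_b).ravel()

print("mean distance within students 1-3:", within_a.mean())
print("mean distance within students 4-6:", within_b.mean())
print("mean distance between the groups: ", between.mean())
```

If the groups are real, every between-group distance should be larger than every within-group distance, which is what the clustering in the plot suggests.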

Let's add one more course.

In [None]:
grades['chemistry'] = [9.3, 8.4, 7.9, 4.3, 4.2, 3.9]
grades

Now we can no longer represent the data in a plot, because we cannot visualize a 4D space.

Principal Component Analysis, or PCA, lets us transform a dataset with many dimensions into a new dataset with fewer dimensions while keeping the most important information.
- The principal components are new orthogonal axes that capture the maximum variance in the data.
- The first principal component is the line along which the projected points have the largest spread (variance).
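To make these two points concrete, here is a minimal NumPy sketch of one common way to compute principal components: an eigen-decomposition of the covariance matrix. This is an illustration under that assumption, not necessarily how any particular library implements PCA, and the variable names are chosen for this example only.

```python
import numpy as np

# The four-course grades from this notebook (rows: students).
X = np.array([
    [8.0, 7.0, 10.0, 9.3],
    [7.1, 6.5, 9.2, 8.4],
    [10.0, 6.1, 8.7, 7.9],
    [3.8, 4.0, 6.4, 4.3],
    [1.4, 3.7, 6.0, 4.2],
    [2.3, 2.0, 5.8, 3.9],
])

# 1. Center each feature: PCA works on deviations from the mean.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (4 x 4).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition. eigh returns eigenvalues in ascending order,
#    so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project the centered data onto the first principal axis.
pc1_axis = eigenvectors[:, 0]
pc1_scores = X_centered @ pc1_axis

print("Variance along PC1:", pc1_scores.var(ddof=1))
print("Largest eigenvalue:", eigenvalues[0])
print("Largest single-course variance:", X.var(axis=0, ddof=1).max())
```

The variance of the projected points equals the largest eigenvalue, and it is at least as large as the variance along any single course. The sign of the scores is arbitrary, so they may be mirrored relative to what a library returns.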

In this example, PCA lets us reduce the 4D grade information to a 1D representation, which we can plot and understand.

In [None]:
from sklearn.decomposition import PCA

# Perform PCA
pca = PCA(n_components=1)  # Select the number of components you want to keep
principal_components = pca.fit_transform(grades)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(principal_components, columns=['PC1'])

# Plot the single component along a horizontal line
plt.scatter(principal_df['PC1'], np.full(len(grades), 2), c=[*'rrrbbb'])
plt.xlabel('PC1')
plt.yticks([])
plt.title('PCA Scatter Plot')
plt.show()

Let's look at the explained variance of the principal component.

In [None]:
pca.explained_variance_

The variance of each original feature is:

In [None]:
grades.var()

With a total of:

In [None]:
grades.var().sum()

So, the explained variance ratio is:

In [None]:
pca.explained_variance_ratio_
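The ratio is the variance captured by a component divided by the total variance of the original features. The sketch below recomputes it by hand with NumPy for the grades data, assuming sample variances (`ddof=1`), which is also the default of `grades.var()`; the variable names are illustrative.

```python
import numpy as np

# The four-course grades from this notebook (rows: students).
X = np.array([
    [8.0, 7.0, 10.0, 9.3],
    [7.1, 6.5, 9.2, 8.4],
    [10.0, 6.1, 8.7, 7.9],
    [3.8, 4.0, 6.4, 4.3],
    [1.4, 3.7, 6.0, 4.2],
    [2.3, 2.0, 5.8, 3.9],
])

# Total variance: sum of the per-feature sample variances.
total_variance = X.var(axis=0, ddof=1).sum()

# Variance along each principal axis: eigenvalues of the covariance matrix,
# reordered from largest to smallest.
component_variances = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Share of the total variance explained by each component.
ratios = component_variances / total_variance

print("Total variance:", total_variance)
print("Variance per component:", component_variances)
print("Explained variance ratio:", ratios)
```

The component variances add up to the total variance, so the ratios sum to 1 when all four components are kept.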

Now let's try with two principal components.

In [None]:
# Perform PCA
pca = PCA(n_components=2)  # Select the number of components you want to keep
principal_components = pca.fit_transform(grades)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

# Plot the data in a 2D scatter plot
plt.scatter(principal_df['PC1'], principal_df['PC2'], c=[*'rrrbbb'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Scatter Plot')
plt.show()

Now let's look at the explained variance.

In [None]:
print(pca.explained_variance_)
pca.explained_variance_ratio_

Let's try with three principal components.

In [None]:
# Perform PCA
pca = PCA(n_components=3)  # Select the number of components you want to keep
principal_components = pca.fit_transform(grades)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2', 'PC3'])

pca.explained_variance_ratio_

To show PCA on a real-life example, we will use the digit image dataset that we used before. Here, each image is represented as a flat sequence of pixel values.

In [None]:
import pandas as pd
import numpy as np
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
import matplotlib.pyplot as plt

# Number of rows and columns for the grid
n_rows = 4
n_cols = 4

# Create a new figure with the desired grid size
fig, axes = plt.subplots(n_rows, n_cols, figsize=(5, 5))

# Iterate over the grid axes and show the first images with their labels
for i, ax in enumerate(axes.flatten()):
    ax.imshow(x_train[i], cmap="gray")
    ax.set_title(f"Number:{y_train[i]}")
    ax.axis('off')

# Adjust the spacing and layout
plt.tight_layout()

# Display the figure
plt.show()

Let's transform the features: each 28x28 image is flattened into a row of 784 values.

In [None]:
data = pd.DataFrame(x_train.reshape((-1, 28*28)))
data.head()
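The reshape step can be illustrated on a small stand-in array. In this sketch, `images` is a hypothetical substitute for `x_train` (random pixel values rather than real digits), used only to show where each pixel ends up and that the flattening is reversible.

```python
import numpy as np

# Hypothetical stand-in for x_train: 5 random 28x28 grayscale "images".
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image row by row into a vector of 784 values.
flat = images.reshape((-1, 28 * 28))
print(flat.shape)  # (5, 784)

# The pixel at (row r, column c) lands in column r * 28 + c.
r, c = 3, 7
same_pixel = (flat[:, r * 28 + c] == images[:, r, c]).all()
print("pixel mapping holds:", same_pixel)

# The operation loses nothing: reshaping back restores the images.
restored = flat.reshape((-1, 28, 28))
print("round trip exact:", np.array_equal(restored, images))
```

Each of the 784 columns is therefore one pixel position, and PCA treats every pixel position as a separate feature.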

We cannot represent this information in a plot because it has too many dimensions. Let's find the two most important principal components.

In [None]:
# Perform PCA
pca = PCA(n_components=2) 
principal_components = pca.fit_transform(data)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

In [None]:
total_variance = data.var().sum()
print("Total variance", total_variance)
print("Explained variance per PC:", pca.explained_variance_)
print("- relative:", pca.explained_variance_ratio_)

Remember, there are 784 features!

Now, let's plot the PCs.

In [None]:
plt.scatter(principal_df['PC1'], principal_df['PC2'], c=y_train, cmap="jet")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Scatter Plot')
plt.colorbar()
plt.show()

Let's filter some classes in order to analyze the components.

In [None]:
together = principal_df.copy()
together['digit'] = y_train

def plot_digits(digits):
    to_show = together[together.digit.isin(digits)]
    plt.scatter(to_show.PC1, to_show.PC2, c=to_show.digit, cmap="jet")
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA Scatter Plot')
    plt.colorbar()
    plt.show()

In [None]:
plot_digits({0, 1})

- PCA separates 0 and 1 quite well.
- There is more variety among the 0s than among the 1s.

In [None]:
plot_digits({0, 8})

In [None]:
plot_digits({6, 8})

These digits are not as well separated.

If we add more components, we get a more accurate representation, but we can no longer visualize it.

In [None]:
# Perform PCA
pca = PCA(n_components=10)  
principal_components = pca.fit_transform(data)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(principal_components, columns=[f'PC{n}' for n in range(1, 11)])

In [None]:
total_variance = data.var().sum()
print("Total variance", total_variance)
print("Explained variance per PC:", pca.explained_variance_)
relative_var = pca.explained_variance_ratio_
print("- relative:", relative_var)
print("- cumulative", np.cumsum(relative_var))

So, with only 10 principal components, we can explain almost half of the total variance of the original dataset with 784 features.
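One way to see that more components give a more accurate representation is to reconstruct the data from the first k components and measure the error. The sketch below uses hypothetical synthetic data (not the digit images) and computes the principal axes with an SVD in plain NumPy; the function `reconstruction_error` is an illustrative helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 samples, 20 correlated features
# generated from 5 hidden factors plus a little noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

mean = X.mean(axis=0)
X_centered = X - mean

# Rows of Vt are the principal axes, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)


def reconstruction_error(k):
    """Mean squared error after keeping only the first k components."""
    axes = Vt[:k]                    # (k, n_features)
    scores = X_centered @ axes.T     # project onto the k axes
    X_hat = scores @ axes + mean     # map back to the original space
    return np.mean((X - X_hat) ** 2)


ks = (1, 2, 5, 10, 20)
errors = [reconstruction_error(k) for k in ks]
for k, err in zip(ks, errors):
    print(f"k={k:2d}  MSE={err:.6f}")
```

The error can only go down as k grows, and it reaches zero (up to floating-point precision) when all components are kept. With scikit-learn, the same round trip can be done with `pca.inverse_transform(pca.transform(data))`.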

Summarizing:
- PCA transforms datasets from higher dimensions to lower dimensions while keeping the most important information (in terms of variance).
- It is a powerful tool for visualization (and for other methods too!).