## PCA: Step-by-Step

### Introduction to PCA
PCA stands for Principal Component Analysis, a widely used technique in data analysis and machine learning. At its core, PCA reduces the complexity of high-dimensional data by finding the directions along which the data varies most.

Imagine you have a dataset with many variables, such as age, height, weight, income, and education level, and you want to understand how these variables are related to each other. PCA can help you by finding the underlying structure of the data and identifying the key factors that explain most of the variation in the data.

To do this, PCA uses linear algebra to transform the data into a new coordinate system that captures the most important information in the data. The axes of this new coordinate system are called the principal components, and each principal component is a linear combination of the original variables.

By examining the principal components, you can identify the most important patterns in the data and understand how different variables contribute to these patterns. You can also use the principal components to visualize the data in a lower-dimensional space, which can help you identify clusters or groups of similar data points.

Overall, PCA is a powerful tool for exploring and analyzing complex datasets, and it can be applied to a wide range of fields, including biology, economics, psychology, and computer science.

### PCA Theory
PCA rests on linear algebra, specifically the eigenvalue decomposition of a covariance matrix. In simple terms, the covariance matrix records, for every pair of variables, how strongly the two vary together linearly. Note that PCA is applied to centered data, that is, data from which the mean of each variable has been subtracted.

Let X be an n x p matrix representing the centered data, where n is the number of observations and p is the number of variables. The covariance matrix of X is given by:

C = (1/n) * X^T * X

where ^T denotes the transpose of a matrix. The covariance matrix is symmetric and positive semi-definite, which means that it has p real, non-negative eigenvalues and a set of p mutually orthogonal eigenvectors.
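The definition above can be checked numerically. The following is a minimal sketch on randomly generated data (the array `X_raw` and the seed are illustrative assumptions, not part of the original example). It builds C with the 1/n convention from the formula and compares it with `np.cov`, which divides by n - 1 unless `bias=True` is passed.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, illustrative data only
X_raw = rng.normal(size=(100, 3))     # n = 100 observations, p = 3 variables

# Center each column so that every variable has zero mean.
X = X_raw - X_raw.mean(axis=0)
n, p = X.shape

# Covariance matrix with the 1/n convention used in the text.
C = (X.T @ X) / n

# np.cov divides by n - 1 by default; bias=True switches it to 1/n.
C_numpy = np.cov(X, rowvar=False, bias=True)

# Symmetry and positive semi-definiteness (up to rounding error).
is_symmetric = np.allclose(C, C.T)
eigenvalues = np.linalg.eigvalsh(C)   # real eigenvalues of a symmetric matrix
is_psd = bool(np.all(eigenvalues >= -1e-12))

print(is_symmetric, is_psd)
```

Using 1/n or 1/(n - 1) only rescales the eigenvalues by a constant factor; the eigenvectors, and therefore the principal directions, are the same.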

The eigendecomposition of C is given by:

C = V * Lambda * V^T

where V is a p x p matrix whose columns are the eigenvectors of C, and Lambda is a diagonal matrix whose entries are the corresponding eigenvalues.

The eigenvectors in V are sorted in descending order according to their corresponding eigenvalues in Lambda. The first principal component is the linear combination of the variables that corresponds to the eigenvector with the largest eigenvalue. The second principal component is the linear combination that corresponds to the eigenvector with the second largest eigenvalue, and so on.
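As a sketch of the decomposition and the sorting step, the snippet below uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, on made-up correlated data. The data and variable names are assumptions for illustration. Note that `eigh` returns eigenvalues in ascending order, so the descending order described above has to be imposed explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative data: mixing random columns makes the variables correlated.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
C = (X.T @ X) / X.shape[0]

# eigh is intended for symmetric matrices; eigenvalues come back ascending.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so that the largest eigenvalue (first principal component) is first.
order = np.argsort(eigenvalues)[::-1]
Lambda = np.diag(eigenvalues[order])
V = eigenvectors[:, order]

# C = V * Lambda * V^T should hold up to rounding error.
reconstructed_C = V @ Lambda @ V.T
print(np.allclose(reconstructed_C, C))
```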

To compute the principal components of the data, we multiply the centered data matrix X by the matrix of eigenvectors V:

Y = X * V

where Y is the n x p matrix of principal component scores. Each column of Y holds the values of one principal component, and each row corresponds to one observation.
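One useful consequence of Y = X * V is that the columns of Y are uncorrelated and their variances are the eigenvalues. The sketch below illustrates this on synthetic data (the data and names are assumptions chosen for the demonstration).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))  # correlated columns
X = X - X.mean(axis=0)                                   # center the data
n = X.shape[0]

C = (X.T @ X) / n
eigenvalues, V = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# Project the centered data onto the eigenvectors.
Y = X @ V

# Covariance of Y: diagonal, with the eigenvalues on the diagonal.
C_Y = (Y.T @ Y) / n
print(np.round(C_Y, 6))
```

The off-diagonal entries of `C_Y` vanish up to rounding error because V^T * C * V = Lambda.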

The proportion of variance explained by each principal component is given by its corresponding eigenvalue divided by the sum of all eigenvalues:

prop_i = lambda_i / (sum(lambda))

where prop_i is the proportion of variance explained by the i-th principal component.
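The formula can be sketched in a few lines. The eigenvalues below are hypothetical round numbers chosen so the arithmetic is easy to follow, and the 80% cumulative threshold used to pick k is an assumed rule of thumb, not something prescribed by the text.

```python
import numpy as np

def explained_variance_ratio(eigenvalues):
    """Return lambda_i / sum(lambda), with eigenvalues sorted descending."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return eigenvalues / eigenvalues.sum()

lambdas = [4.0, 3.0, 2.0, 1.0]            # hypothetical eigenvalues
prop = explained_variance_ratio(lambdas)  # proportion per component
cumulative = np.cumsum(prop)              # running total of explained variance

# Smallest k whose cumulative proportion reaches the assumed 80% threshold.
k = int(np.searchsorted(cumulative, 0.80) + 1)
print(prop, cumulative, k)
```

With these values the proportions are 0.4, 0.3, 0.2 and 0.1, and the first three components are needed to reach 80%.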

PCA can be used for data compression by keeping only the top k principal components, which are the ones that explain the most variance in the data. The data can then be approximately reconstructed by multiplying the matrix of selected principal components by the transpose of the matrix of the corresponding eigenvectors:

X_hat = Y_k * V_k^T

where Y_k = X * V_k is the n x k matrix of the k selected principal components, and V_k is the p x k matrix of the corresponding k eigenvectors. X_hat is the reconstructed data. It approximates the original data X, and it equals X exactly when k = p.
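A small experiment makes the quality of the reconstruction concrete. The sketch below generates data that is close to rank 2 (two latent factors plus a little noise, an assumption made purely for illustration) and measures the mean squared reconstruction error for different k. The helper `reconstruct` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(3)
# Nearly rank-2 data: two latent factors spread over five variables, plus noise.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
eigenvalues, V = np.linalg.eigh(C)
V = V[:, np.argsort(eigenvalues)[::-1]]   # columns sorted by descending eigenvalue

def reconstruct(X, V, k):
    """Compress X to k components and map it back: X_hat = Y_k * V_k^T."""
    V_k = V[:, :k]
    Y_k = X @ V_k
    return Y_k @ V_k.T

error_k1 = np.mean((X - reconstruct(X, V, 1)) ** 2)
error_k2 = np.mean((X - reconstruct(X, V, 2)) ** 2)
error_k5 = np.mean((X - reconstruct(X, V, 5)) ** 2)
print(error_k1, error_k2, error_k5)
```

The error drops sharply once k reaches the number of underlying factors and becomes zero, up to rounding, when all five components are kept.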

### Data Preparation for PCA
Suppose we have a dataset with n observations and p variables. Before applying PCA, we need to perform the following data preparation steps:

1. Standardization: PCA is sensitive to the scale of the variables, so we need to standardize the data to have zero mean and unit variance. This can be done by subtracting the mean of each variable and dividing by its standard deviation:
X_standardized = (X - mean(X)) / std(X) (where X is the original data matrix, mean(X) is the mean vector of X, and std(X) is the standard deviation vector of X)

2. Missing value imputation: If the dataset has missing values, we need to impute them before applying PCA. There are different imputation methods that can be used, such as mean imputation, regression imputation, or multiple imputation.

3. Outlier detection: Outliers can affect the results of PCA, so it is important to detect and handle them before applying PCA. One way to detect outliers is by computing the Mahalanobis distance of each observation from the mean of the data. Observations with a large Mahalanobis distance are considered outliers.

4. Variable selection: If the dataset has a large number of variables, it may be necessary to perform variable selection before applying PCA. This can be done using various methods, such as correlation analysis, mutual information, or feature importance scores.

After performing these data preparation steps, we can apply PCA to the standardized data X_standardized to obtain the principal components and perform exploratory data analysis.
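Three of the preparation steps can be sketched with NumPy alone. The helper names, the synthetic data, and the planted outlier are all assumptions for illustration. The sketch imputes before standardizing, because `mean` and `std` would otherwise propagate NaN values, and it uses mean imputation only because it is the simplest of the methods listed.

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with the mean of that column's observed values."""
    X = np.array(X, dtype=float)          # copy, so the caller's array is untouched
    column_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = column_means[cols]
    return X

def standardize(X):
    """Zero mean and unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def mahalanobis_distances(X):
    """Mahalanobis distance of every row from the column means."""
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

rng = np.random.default_rng(4)
X_raw = rng.normal(loc=[170.0, 70.0, 40.0], scale=[10.0, 15.0, 12.0], size=(50, 3))
X_raw[3, 1] = np.nan                      # one missing value
X_raw[10] = [250.0, 20.0, 95.0]           # one planted outlier

X_imputed = mean_impute(X_raw)
distances = mahalanobis_distances(X_imputed)
most_extreme_row = int(np.argmax(distances))   # candidate outlier
X_standardized = standardize(X_imputed)
print(most_extreme_row)
```

How large a distance has to be before an observation counts as an outlier is a judgment call that this sketch leaves open; it only ranks the observations.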

### Limitations of PCA
1. _Linearity assumption:_ PCA assumes that the relationships among the variables are linear. If the data has a nonlinear structure, PCA may not be the most appropriate technique.

2. _Loss of interpretability:_ After performing PCA, the principal components may not be directly interpretable in terms of the original variables. This can make it difficult to explain the results to non-technical stakeholders.

3. _Sensitivity to outliers:_ PCA is sensitive to outliers, which can distort the results and lead to incorrect conclusions.

4. _Sensitivity to scaling:_ PCA is sensitive to the scale of the variables, which can affect the results. It is important to standardize the variables before performing PCA.

5. _Difficulty in choosing the number of components:_ Choosing the number of components to retain can be a challenging task. If too few components are retained, important information may be lost. If too many components are retained, the extra components may mostly capture noise, so the results may not generalize well to new data.

6. _Correlation-based:_ PCA assumes that variables are linearly correlated with each other. If the variables are not correlated, PCA may not be the most appropriate technique.

7. _Lack of robustness:_ PCA is not a robust technique and can be affected by outliers and influential observations.
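The sensitivity to scaling from the list above is easy to demonstrate. The sketch below uses two made-up, independent variables on very different scales (height in metres and income in currency units, both assumptions for illustration) and compares the first principal direction before and after standardization. The helper `first_component` is a hypothetical name.

```python
import numpy as np

def first_component(X):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    X = X - X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]

rng = np.random.default_rng(5)
height_m = rng.normal(1.7, 0.1, size=500)        # spread of about 0.1
income = rng.normal(50_000, 15_000, size=500)    # spread of about 15,000
X = np.column_stack([height_m, income])

pc1_raw = first_component(X)
pc1_scaled = first_component((X - X.mean(axis=0)) / X.std(axis=0))
print(np.abs(pc1_raw), np.abs(pc1_scaled))
```

On the raw data the first component points almost entirely along income, simply because income has the larger numeric variance. After standardization both variables carry equal weight in the first component, which for two variables always means loadings of magnitude 1/sqrt(2).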

In [1]:
import numpy as np

In [2]:
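def pca_top_k_components(A, k):
    """Project the rows of A onto the top k principal components.

    A is an (n_samples, n_features) array; the result has shape (n_samples, k).
    """
    # Step 1: Standardize the data (zero mean, unit variance per column).
    # np.std uses ddof=0 by default.
    standardized_A = (A - np.mean(A, axis=0)) / np.std(A, axis=0)

    # Step 2: Compute the covariance matrix (rowvar=False: columns are variables).
    # np.cov uses ddof=1 by default, so the eigenvalues below sum to
    # p * n / (n - 1) rather than exactly p.
    covariance_matrix = np.cov(standardized_A, rowvar=False)

    # Step 3: Perform eigen decomposition of the covariance matrix.
    # For a symmetric matrix np.linalg.eigh is the more robust choice, but it
    # returns the eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

    # Step 4: Sort eigenvectors by eigenvalues in descending order.
    # The print calls show the intermediate results.
    print(eigenvalues)
    sorted_indices = np.argsort(eigenvalues)[::-1]
    print(sorted_indices)
    sorted_eigenvalues = eigenvalues[sorted_indices]
    print(sorted_eigenvalues)
    sorted_eigenvectors = eigenvectors[:, sorted_indices]
    print(eigenvectors)
    print(sorted_eigenvectors)

    # Step 5: Select the top k eigenvectors (the first k columns)
    top_k_eigenvectors = sorted_eigenvectors[:, :k]

    # Step 6: Project the standardized data onto the top k eigenvectors
    pca_projection = np.dot(standardized_A, top_k_eigenvectors)

    return pca_projection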

In [3]:
# Generate a random matrix A
np.random.seed(0)
A = np.random.rand(10, 5)  # 10 samples, 5 features
print("A:", A)

# Call the function to get the top k principal components
k = 2
top_k_components = pca_top_k_components(A, k)
print("Top", k, "principal components:")
print(top_k_components)

A: [[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548 ]
 [0.64589411 0.43758721 0.891773   0.96366276 0.38344152]
 [0.79172504 0.52889492 0.56804456 0.92559664 0.07103606]
 [0.0871293  0.0202184  0.83261985 0.77815675 0.87001215]
 [0.97861834 0.79915856 0.46147936 0.78052918 0.11827443]
 [0.63992102 0.14335329 0.94466892 0.52184832 0.41466194]
 [0.26455561 0.77423369 0.45615033 0.56843395 0.0187898 ]
 [0.6176355  0.61209572 0.616934   0.94374808 0.6818203 ]
 [0.3595079  0.43703195 0.6976312  0.06022547 0.66676672]
 [0.67063787 0.21038256 0.1289263  0.31542835 0.36371077]]
[2.55533375 1.4114298  0.70162366 0.4000991  0.48706926]
[0 1 2 4 3]
[2.55533375 1.4114298  0.70162366 0.48706926 0.4000991 ]
[[-0.48778004 -0.21458486  0.69108706 -0.38801997 -0.29641719]
 [-0.51062694 -0.0215367  -0.69789734 -0.4815427  -0.1408979 ]
 [ 0.38522227 -0.58037433 -0.16914554  0.06571509 -0.6941475 ]
 [-0.16421402 -0.78441966 -0.0478546   0.10696059  0.58650568]
 [ 0.57093635 -0.03661883  0.06661604 

Once we have the top k eigenvectors from step 5, they define an orthonormal basis for a k-dimensional subspace of the original feature space. Each eigenvector is a direction in that space, and together the top k are the directions along which the standardized data varies most.

To obtain the principal components, we take the standardized data matrix and project it onto these top k eigenvectors. This projection transforms the data from its original p-dimensional space to the k-dimensional space defined by the principal components.
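The two statements above, that the eigenvectors form a basis and that the projection is a change of coordinates, can be checked directly. The following is a sketch on random data of the same shape as the example (10 samples, 5 features). The seed and variable names are assumptions, and `np.linalg.eigh` is used here in place of `np.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.random((10, 5))                          # 10 samples, 5 features
Z = (A - A.mean(axis=0)) / A.std(axis=0)         # standardized data

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]            # descending eigenvalues
k = 2
V_k = eigenvectors[:, order[:k]]                 # top k eigenvectors as columns

# Projection: each row of Z is re-expressed in the k-dimensional basis.
projection = Z @ V_k

# The basis vectors have unit length and are mutually orthogonal.
gram = V_k.T @ V_k

# A single coordinate is just a dot product of an observation with an eigenvector.
first_coordinate = float(np.dot(Z[0], V_k[:, 0]))
print(projection.shape, np.round(gram, 6))
```

The signs of the eigenvectors are arbitrary, so a different routine may return a projection whose columns are flipped in sign. The subspace and the explained variance are unaffected.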