
# 1.4.1. Singular value decomposition
The Singular Value Decomposition (SVD) of a matrix is a factorization of the matrix into three matrices. SVD is a key component of principal component analysis (PCA), a useful and powerful dimension reduction technique.


The SVD of an m x n matrix A factors A into the product of three matrices: A = U * Σ * V^T

Note that the columns of U and V are orthonormal, and the m x n matrix Σ is diagonal with real, nonnegative entries (the singular values).

U: an m x m matrix whose columns are orthonormal eigenvectors of A*A^T.

V^T: the transpose of an n x n matrix V whose columns are orthonormal eigenvectors of A^T * A.


Σ: an m x n diagonal matrix whose r nonzero diagonal entries (r is the rank of A) are the square roots of the positive eigenvalues of A*A^T, or equivalently of A^T*A.
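
The link between the singular values and the eigenvalues of A^T*A can be checked numerically. The following is a minimal sketch, assuming the same 2 x 3 matrix as the example that follows; the names `eigvals` and `sigma_from_eig` are illustrative only.

```python
import numpy as np

A = np.array([[4.0, 0.0, 2.0],
              [1.0, 3.0, 0.0]])

# Singular values of A, returned in descending order
S = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of the symmetric matrix A^T A (eigvalsh returns them in ascending order)
eigvals = np.linalg.eigvalsh(A.T @ A)

# Keep the r largest eigenvalues (r = rank of A), in descending order;
# clip tiny negative values caused by floating-point round-off before the square root
r = np.linalg.matrix_rank(A)
sigma_from_eig = np.sqrt(np.clip(eigvals[::-1][:r], 0.0, None))

print("Singular values from np.linalg.svd:", S)
print("Square roots of eigenvalues of A^T A:", sigma_from_eig)
print("Match:", np.allclose(S[:r], sigma_from_eig))
```

A^T*A is 3 x 3 here, so it has one extra eigenvalue that is zero up to round-off; only the r positive eigenvalues correspond to singular values.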


In [None]:
# example in Python

import numpy as np

A = np.array([[4, 0, 2], [1, 3, 0]])

# Compute the SVD of A with np.linalg.svd
U, S, Vt = np.linalg.svd(A)

# np.linalg.svd returns the singular values S as a 1D array.
# To rebuild A, embed them on the diagonal of a zero matrix
# with the same shape as A (2 x 3).
S_matrix = np.zeros(A.shape)

np.fill_diagonal(S_matrix, S)

# reconstructing matrix A
A_reconstructed = U @ S_matrix @ Vt

print("Matrix A:")
print(A)
print("\nMatrix U:")
print(U)
print("\nSingular values (S):")
print(S)
print("\nMatrix Vt:")
print(Vt)
print("\nReconstruction of matrix A:")
print(A_reconstructed)


Matrix A:
[[4 0 2]
 [1 3 0]]

Matrix U:
[[-0.94362832 -0.33100694]
 [-0.33100694  0.94362832]]

Singular values (S):
[4.62635107 2.93204293]

Matrix Vt:
[[-0.88742081 -0.2146445  -0.40793632]
 [-0.1297387   0.96549915 -0.22578588]
 [-0.44232587  0.14744196  0.88465174]]

Reconstruction of matrix A:
[[4.00000000e+00 2.77574497e-16 2.00000000e+00]
 [1.00000000e+00 3.00000000e+00 6.00525422e-17]]


# 1.4.2. Low-rank matrix approximations

Aim of low-rank matrix approximation: use the SVD to build a "simplified" version of a matrix that still preserves the matrix's essential information. Only the largest singular values and their corresponding singular vectors are retained.

The rank of an m x n matrix is the number of linearly independent rows (equivalently, columns) in that matrix. Keeping only the k largest singular values during reconstruction produces a rank-k approximation of the original matrix.
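
As a rough illustration of how rank and truncation interact, the sketch below computes the rank of a small 4 x 4 matrix and of its truncated reconstruction. It also checks a standard property of the truncated SVD: the Frobenius-norm error equals the square root of the sum of the squared discarded singular values. The names `A_k` and `err` are illustrative.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0, 4.0],
              [1.0, 3.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 0.0],
              [5.0, 6.0, 7.0, 8.0]])

# The zero row means the rank can be at most 3
rank_A = np.linalg.matrix_rank(A)

U, S, Vt = np.linalg.svd(A)

# Keep the k largest singular values and their singular vectors
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
rank_A_k = np.linalg.matrix_rank(A_k)

# Frobenius-norm error of the approximation versus the discarded singular values
err = np.linalg.norm(A - A_k, ord="fro")
discarded = np.sqrt(np.sum(S[k:] ** 2))

print("rank(A):", rank_A)
print("rank(A_k):", rank_A_k)
print("Approximation error:", err)
print("sqrt(sum of discarded singular values squared):", discarded)
```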

Low-rank approximations can reduce the dimensionality of high dimensional data, making it easier to store and process. [*source*: https://www.geeksforgeeks.org/eigenvector-computation-and-low-rank-approximations/ ]
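
The storage saving can be made concrete by counting stored numbers. The sketch below uses a hypothetical 200 x 100 matrix that is close to rank 10 (synthetic data generated only for this illustration) and compares the m*n entries of the full matrix with the k*(m + n + 1) entries of the truncated factors.

```python
import numpy as np

# Hypothetical data: a rank-10 matrix plus a small amount of noise
rng = np.random.default_rng(0)
m, n, k = 200, 100, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A = A + 0.01 * rng.standard_normal((m, n))

# Reduced SVD: U is m x min(m, n), Vt is min(m, n) x n
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Number of stored values: full matrix versus the three truncated factors
full_storage = A.size
low_rank_storage = U_k.size + S_k.size + Vt_k.size

# Relative Frobenius-norm error of the rank-k reconstruction;
# U_k * S_k scales each column of U_k by its singular value
A_k = (U_k * S_k) @ Vt_k
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print("Values stored for A:", full_storage)
print("Values stored for the rank-k factors:", low_rank_storage)
print("Relative reconstruction error:", rel_err)
```

The saving only pays off when k is much smaller than both m and n and the discarded singular values are small, as they are in this synthetic example.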




In [None]:
import numpy as np

A = np.array([[3, 2, 2, 4],
              [1, 3, 1, 1],
              [0, 0, 0, 0],
              [5, 6, 7, 8]])

# Perform SVD with .svd on matrix A
U, S, Vt = np.linalg.svd(A)


# As above, place the singular values S on the diagonal of a zero matrix
S_matrix = np.zeros((U.shape[0], Vt.shape[0]))
np.fill_diagonal(S_matrix, S)

# Build the low-rank approximation by keeping only the top k components.
# Here k = 2; try k = 3 as well.
k = 2

# keep only the top k components for U, S, Vt
U_k = U[:, :k]
S_k = S_matrix[:k, :k]
Vt_k = Vt[:k, :]

# ** Construct the rank-k approximation of A **
A_approx = U_k @ S_k @ Vt_k

print("Original matrix A:")
print(A)
print("\nTruncated singular values:")
print(S[:k])
print("\nApproximated matrix A (Rank = 2):")
print(A_approx)


Original matrix A:
[[3 2 2 4]
 [1 3 1 1]
 [0 0 0 0]
 [5 6 7 8]]

Truncated singular values:
[14.60384575  2.04699042]

Approximated matrix A (Rank = 2):
[[2.46691183 1.78742843 2.79851757 3.86020766]
 [0.72329351 2.8896619  1.41448115 0.92743893]
 [0.         0.         0.         0.        ]
 [5.28661471 6.11428905 6.57067725 8.07515931]]


# 1.4.3. Principal component analysis

# 1.4.3.1 Covariance Matrix

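The covariance matrix summarizes the covariance (a measure of how much two random variables vary together) between every pair of variables in a dataset. It is a square, symmetric matrix with one row and one column per variable, and the variances of the individual variables lie on its diagonal.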


The covariance matrix is computed using Cov(X) = 1/(m-1) * X^T * X,
where X is the centered data matrix: m observations in rows, with the mean of each column subtracted.
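
The formula can be checked against NumPy's built-in routine. This is a minimal sketch, assuming a small 5 x 3 dataset with observations in rows; the names `cov_manual` and `cov_numpy` are illustrative.

```python
import numpy as np

# 5 observations (rows) of 3 variables (columns)
data = np.array([[2.5, 2.4, 3.2],
                 [0.5, 0.7, 1.1],
                 [2.2, 2.9, 3.5],
                 [1.9, 2.2, 2.8],
                 [3.1, 3.0, 3.6]])

m = data.shape[0]

# Center the data: subtract each column's mean
X = data - data.mean(axis=0)

# Covariance matrix from the formula 1/(m-1) * X^T * X
cov_manual = X.T @ X / (m - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
cov_numpy = np.cov(data, rowvar=False)

print(cov_manual)
print("Matches np.cov:", np.allclose(cov_manual, cov_numpy))
print("Symmetric:", np.allclose(cov_manual, cov_manual.T))
```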


# 1.4.3.2 Principal component analysis

Principal Component Analysis (PCA) is a technique used for dimensionality reduction, data compression, and feature extraction. It transforms a large set of data into a smaller one that still contains most of the original information.


Steps in PCA [*source:* https://www.geeksforgeeks.org/principal-component-analysis-pca/ ]:

1. Standardize the Data: Scale the data to have a mean of 0 and standard deviation of 1.
2. Compute the Covariance Matrix: Calculate the covariance matrix to understand the relationships between variables.
3. Compute Eigenvalues and Eigenvectors: Determine the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.
4. Select Principal Components: Choose the top k eigenvectors corresponding to the largest eigenvalues, which capture the most variance.
5. Transform the Data: Project the data onto the selected principal components to obtain the reduced dataset.
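
The five steps above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation; it assumes the same 5 x 3 dataset as the example that follows, and the names `Z`, `W`, `scores`, and `explained` are illustrative. Because the sign of an eigenvector is arbitrary, the projected columns may differ in sign from those produced by other implementations.

```python
import numpy as np

data = np.array([[2.5, 2.4, 3.2],
                 [0.5, 0.7, 1.1],
                 [2.2, 2.9, 3.5],
                 [1.9, 2.2, 2.8],
                 [3.1, 3.0, 3.6]])

# Step 1: standardize each column to mean 0 and standard deviation 1
# (population standard deviation, ddof=0, which is NumPy's default)
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort by decreasing eigenvalue and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]

# Step 5: project the standardized data onto the selected components
scores = Z @ W

# Fraction of the total variance captured by each kept component
explained = eigvals[:k] / eigvals.sum()

print("Projected data:")
print(scores)
print("Explained variance ratio:", explained)
```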




In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Dataset with 5 observations (rows) of 3 variables (columns)
data = np.array([[2.5, 2.4, 3.2],
                 [0.5, 0.7, 1.1],
                 [2.2, 2.9, 3.5],
                 [1.9, 2.2, 2.8],
                 [3.1, 3.0, 3.6]])

# Convert to a DataFrame
df = pd.DataFrame(data, columns=['Feature1', 'Feature2', 'Feature3'])

# Step 1: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Steps 2-5: fit PCA and project the data onto the principal components
pca = PCA(n_components=2)  # We choose 2 components for simplicity
principal_components = pca.fit_transform(scaled_data)

# Collect the principal components in a DataFrame
pca_df = pd.DataFrame(principal_components, columns=['Principal Component 1', 'Principal Component 2'])

print("Original Data:\n")
print(df)
print("\n\nPrincipal Components:\n")
print(pca_df)


Original Data:

   Feature1  Feature2  Feature3
0       2.5       2.4       3.2
1       0.5       0.7       1.1
2       2.2       2.9       3.5
3       1.9       2.2       2.8
4       3.1       3.0       3.6


Principal Components:

   Principal Component 1  Principal Component 2
0              -0.644572              -0.211700
1               3.203124              -0.058527
2              -0.988777               0.465536
3               0.145770               0.094697
4              -1.715544              -0.290006
