# <center>Principal Component Analysis (PCA) | Part 2</center>

#### 1. Problem Formulation

**Problem Formulation** in PCA involves setting up the problem to find the directions in the data that capture the most variation. The goal is to reduce the number of features while retaining the most important information. 

1. **Objective**: Identify a set of new features (principal components) that are uncorrelated and capture the most variance from the original data.
2. **Data Matrix**: Start with a dataset organized into a matrix where each row represents an observation and each column represents a feature.

---

#### 2. Covariance and Covariance Matrix

**Covariance** measures how two features change together. If two features tend to increase or decrease together, they have high positive covariance. If one increases while the other decreases, they have negative covariance.

**Covariance Matrix** is a square matrix that summarizes the covariance between each pair of features in the dataset. It provides a compact way to see how features relate to each other.

- **Sample Covariance**: For two features X and Y:
  \[
  \text{Cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})
  \]
  Where \(\bar{X}\) and \(\bar{Y}\) are the means of X and Y, and N is the number of data points.

- **Matrix Form**: For three features X, Y, and Z, the covariance matrix is symmetric, with each feature's variance on the diagonal:
  \[
  \text{Cov} = \begin{bmatrix}
  \text{Var}(X) & \text{Cov}(X, Y) & \text{Cov}(X, Z) \\
  \text{Cov}(Y, X) & \text{Var}(Y) & \text{Cov}(Y, Z) \\
  \text{Cov}(Z, X) & \text{Cov}(Z, Y) & \text{Var}(Z)
  \end{bmatrix}
  \]
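The formula and the matrix form can be checked numerically. The sketch below uses a small made-up dataset (the values in `X` are hypothetical) and compares a covariance matrix built directly from the definition with the result of `np.cov`.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 features (columns).
X = np.array([
    [2.0, 1.0, 0.5],
    [3.0, 2.5, 0.1],
    [4.0, 2.0, 1.5],
    [5.0, 4.5, 1.0],
    [6.0, 5.0, 2.0],
])

n_samples = X.shape[0]
X_centered = X - X.mean(axis=0)   # subtract each feature's mean

# Sample covariance from the definition: (1 / (N - 1)) * Xc^T Xc
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

# np.cov treats each ROW as a variable by default, so pass rowvar=False
# when the data is laid out as (observations, features).
cov_numpy = np.cov(X, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))   # True
```

The `rowvar=False` argument matters: without it, `np.cov` would treat the five observations as five variables and return a 5×5 matrix.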

---

#### 3. Eigenvectors and Eigenvalues

**Eigenvectors** and **eigenvalues** describe the structure of the covariance matrix:

- **Eigenvectors**: Directions that the covariance matrix only scales, without rotating. They give the directions of the principal components.
- **Eigenvalues**: The variance of the data along each eigenvector. A larger eigenvalue means the corresponding eigenvector captures more variance.

- **Finding Eigenvalues and Eigenvectors**: For a given covariance matrix, the eigenvectors are found by solving:
  \[
  \text{Cov} \cdot v = \lambda \cdot v
  \]
  Where \(\text{Cov}\) is the covariance matrix, \(v\) is the eigenvector, and \(\lambda\) is the eigenvalue.
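As a quick sanity check of the defining equation, the sketch below decomposes a small, hypothetical 2×2 covariance matrix and confirms that each eigenpair satisfies it. It uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, which is one reasonable choice here since a covariance matrix is always symmetric.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (symmetric by construction).
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and the eigenvectors
# as the COLUMNS of the second return value.
eigen_values, eigen_vectors = np.linalg.eigh(cov)

for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]       # i-th eigenvector (a column)
    lam = eigen_values[i]         # its eigenvalue
    # Cov . v should equal lambda . v up to floating-point rounding
    print(np.allclose(cov @ v, lam * v))
```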

---

#### 4. Visualizing Linear Transformations

**Visualizing Linear Transformations** helps understand how PCA projects data onto new axes:

1. **Original Data**: Imagine your data in 2D or 3D space.
2. **New Axes**: PCA finds new directions (principal components) that best capture the variance in the data.
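3. **Projection**: The data is projected onto these new axes, reducing dimensionality while preserving as much variance as possible.

- **Example**: If you start with an elongated cloud of points, PCA rotates the coordinate axes so that the first axis runs along the direction in which the cloud is most spread out, the second along the next most spread-out direction perpendicular to it, and so on. The cloud itself is not stretched; only the axes used to describe it change.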
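The change of axes can be illustrated without a plot. In the sketch below (synthetic data; the 0.8 slope and 0.3 noise level are arbitrary assumptions), a correlated 2D cloud is re-expressed in the eigenvector basis. The covariance matrix of the re-expressed data comes out diagonal, meaning the new features are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D cloud: the second feature follows the first plus noise.
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])
data_centered = data - data.mean(axis=0)

cov = np.cov(data_centered, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov)

# Express every point in the eigenvector basis. The eigenvector matrix is
# orthogonal, so this is a rotation (possibly with a reflection) of the axes.
rotated = data_centered @ eigen_vectors
cov_rotated = np.cov(rotated, rowvar=False)

# Off-diagonal entries vanish up to rounding; the diagonal holds the eigenvalues.
print(np.round(cov_rotated, 6))
```

The total variance (the trace of the covariance matrix) is the same before and after, which is why the change of axes by itself loses no information; information is only lost once some axes are dropped.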

---

#### 5. Eigendecomposition of a Covariance Matrix

**Eigendecomposition** is the process of decomposing the covariance matrix into its eigenvectors and eigenvalues:

1. **Compute Covariance Matrix**: Calculate the covariance matrix from the centered data.
2. **Decompose**: Solve for the eigenvalues and eigenvectors of that matrix.
3. **Sort**: Order the eigenvectors by their eigenvalues, largest first. The larger the eigenvalue, the more variance the eigenvector captures.
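The sorting step is easy to get wrong, so here is a minimal sketch of it. The matrix is the covariance matrix printed in the worked example later in this document; the variable names `order`, `sorted_values`, `sorted_vectors`, and `explained_ratio` are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Covariance matrix printed in the worked example of this document.
covariance_matrix = np.array([
    [ 1.05263158,  0.20397591, -0.28888004],
    [ 0.20397591,  1.05263158,  0.10956124],
    [-0.28888004,  0.10956124,  1.05263158],
])

eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)

# argsort is ascending, so reverse it to put the largest eigenvalue first.
order = np.argsort(eigen_values)[::-1]
sorted_values = eigen_values[order]
sorted_vectors = eigen_vectors[:, order]   # reorder COLUMNS, not rows

# Fraction of the total variance carried by each component.
explained_ratio = sorted_values / sorted_values.sum()
print(np.round(sorted_values, 4))
print(np.round(explained_ratio, 3))
```

Reordering the columns together with the eigenvalues keeps each eigenvector paired with its own eigenvalue, which `np.linalg.eig` does not return in any guaranteed order.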

---

#### 6. How to Solve PCA

To solve PCA, follow these steps:

1. **Center the Data**: Subtract the mean of each feature. This step is required. Scaling each feature to unit variance (standardization) is optional, but recommended when features are measured in different units.
2. **Compute Covariance Matrix**: Calculate the covariance matrix from the centered (or standardized) data.
3. **Perform Eigendecomposition**: Find the eigenvalues and eigenvectors of the covariance matrix.
4. **Sort Eigenvectors**: Order the eigenvectors by their eigenvalues, largest first.
5. **Select Principal Components**: Choose the top k eigenvectors, those with the largest eigenvalues.
6. **Transform Data**: Project the centered (or standardized) data onto the new axes defined by the selected principal components.
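The six steps fit in a short function. The following is a minimal sketch, not a replacement for a library implementation: the function name `pca_fit_transform` is hypothetical, the input data is random, and the sketch centers the data without scaling it.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                        # step 1: center
    cov = np.cov(X_centered, rowvar=False)                 # step 2: covariance matrix
    eigen_values, eigen_vectors = np.linalg.eigh(cov)      # step 3: eigendecomposition
    order = np.argsort(eigen_values)[::-1]                 # step 4: largest eigenvalue first
    components = eigen_vectors[:, order[:n_components]]    # step 5: keep top-k columns
    projected = X_centered @ components                    # step 6: project
    return projected, components, eigen_values[order]

rng = np.random.default_rng(23)
X = rng.normal(size=(40, 3))          # hypothetical data: 40 samples, 3 features
Z, components, variances = pca_fit_transform(X, n_components=2)
print(Z.shape)                        # (40, 2)
```

A useful check on any implementation: the sample variance of each projected column should equal the corresponding eigenvalue.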

---

#### 7. How to Transform Points

To **transform points** using PCA:

1. **Obtain Principal Components**: Take the eigenvectors with the largest eigenvalues and place them as the columns of a matrix, so that it has one row per original feature and one column per selected component.
2. **Multiply Data by Principal Components**: Project the centered data points onto the selected principal components by multiplying the centered data matrix by the matrix of principal components.
   \[
   \text{Transformed Data} = \text{Centered Data} \times \text{Principal Components}
   \]

- **Result**: The transformed data lives in the new feature space defined by the principal components, which captures the most variance in fewer dimensions. New points must be centered with the same feature means that were computed from the original data.
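To make the matrix shapes concrete, the sketch below projects a single unseen point and then maps it back to the original space. All data here is hypothetical, and the names `W`, `z`, and `approx` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                     # hypothetical training data
mean = X.mean(axis=0)

eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X - mean, rowvar=False))
order = np.argsort(eigen_values)[::-1]
W = eigen_vectors[:, order[:2]]                  # (3, 2): top-2 eigenvectors as columns

new_point = np.array([0.5, -1.0, 2.0])          # hypothetical unseen observation
z = (new_point - mean) @ W                       # coordinates in PC space, shape (2,)
approx = z @ W.T + mean                          # map back to the original 3D space

# What is lost by dropping the third component is orthogonal to the kept ones.
residual = new_point - approx
print(z.shape, approx.shape)
print(np.allclose(residual @ W, 0.0))            # True
```

The residual is the part of the point that the two kept components cannot represent; its size indicates how much was lost by reducing from three dimensions to two.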

---

### Summary

- **PCA** is a method for reducing the number of features by finding new directions (principal components) that capture the most variance in the data.
- **Covariance Matrix** shows how features relate to each other.
- **Eigenvectors and Eigenvalues** help identify the principal components and their importance.
- **Eigendecomposition** breaks down the covariance matrix to find eigenvectors and eigenvalues.
- **Solving PCA** involves centering (and optionally scaling) the data, computing the covariance matrix, performing eigendecomposition, selecting the principal components with the largest eigenvalues, and transforming the data.
- **Transforming Points** involves projecting the original data onto the new principal components to reduce dimensionality.

In [26]:
import numpy as np
import pandas as pd

np.random.seed(23)

# Generate Class 1 samples
mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20)

df = pd.DataFrame(class1_sample, columns=['feature1', 'feature2', 'feature3'])
df['target'] = 1

# Generate Class 2 samples
mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20)

df1 = pd.DataFrame(class2_sample, columns=['feature1', 'feature2', 'feature3'])
df1['target'] = 0

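# Concatenate both classes into one DataFrame.
# Note: the cells below operate on `df` (the 20 class 1 samples), not on df_combined.
df_combined = pd.concat([df, df1], ignore_index=True)

df.head(5)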

Unnamed: 0,feature1,feature2,feature3,target
0,0.666988,0.025813,-0.777619,1
1,0.948634,0.701672,-1.051082,1
2,-0.367548,-1.13746,-1.322148,1
3,1.772258,-0.347459,0.67014,1
4,0.322272,0.060343,-1.04345,1


In [27]:
import plotly.express as px
# Column names are resolved against df; the target is cast to str so it is treated as a category
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3',
              color=df['target'].astype('str'))
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

In [28]:
# step 1 - apply standard scaling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df.iloc[:, 0:3] = scaler.fit_transform(df.iloc[:,0:3])


In [29]:
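# step 2 - find Covariance matrix
# np.cov treats each row as a variable by default, so pass rowvar=False for (samples, features) data.
# The diagonal is 20/19 (about 1.0526) rather than 1 because StandardScaler divides by the
# population standard deviation (ddof=0), while np.cov uses the sample estimate (ddof=1).
covariance_matrix = np.cov(df.iloc[:, 0:3].to_numpy(), rowvar=False)
print('Covariance Matrix: \n', covariance_matrix)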

Covariance Matrix: 
 [[ 1.05263158  0.20397591 -0.28888004]
 [ 0.20397591  1.05263158  0.10956124]
 [-0.28888004  0.10956124  1.05263158]]


In [30]:
# Step 3 - find eigenvalues and eigenvectors
# For 3 features we get 3 eigenvalues and 3 eigenvectors. Eigenvector i is the COLUMN
# eigen_vectors[:, i], and the eigenvalues are not returned in sorted order.

eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)

In [31]:
eigen_values

array([0.64212617, 1.36120658, 1.15456198])

In [32]:
eigen_vectors

array([[-0.65172443,  0.74834128,  0.12345283],
       [ 0.48046517,  0.28140349,  0.8306415 ],
       [-0.58686326, -0.60066414,  0.54294945]])

In [33]:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# from mpl_toolkits.mplot3d import Axes3D

# # Example DataFrame (replace with your actual DataFrame)
# df = pd.DataFrame({
#     'feature1': np.random.rand(10),
#     'feature2': np.random.rand(10),
#     'feature3': np.random.rand(10)
# })

# # Example eigenvectors (replace with your actual eigenvectors)
# eigen_vectors1 = np.array([
#     [1, 0, 0],
#     [0, 1, 0],
#     [0, 0, 1]
# ])

# fig = plt.figure(figsize=(7,7))
# ax = fig.add_subplot(111, projection='3d')

# # Plot the data points
# ax.plot(df['feature1'], df['feature2'], df['feature3'], 'o', markersize=8, color='blue', alpha=0.2)

# # Plot the mean point
# mean_point = [df['feature1'].mean(), df['feature2'].mean(), df['feature3'].mean()]
# ax.plot([mean_point[0]], [mean_point[1]], [mean_point[2]], 'o', markersize=10, color='red', alpha=0.5)

# # Plot the eigenvectors
# for v in eigen_vectors1.T:
#     ax.quiver(mean_point[0], mean_point[1], mean_point[2], v[0], v[1], v[2], length=1, normalize=True, color='r')

# ax.set_xlabel('x_values')
# ax.set_ylabel('y_values')
# ax.set_zlabel('z_values')

# plt.title('Eigenvectors')

# plt.show()


In [34]:
# Eigenvectors are the COLUMNS of eigen_vectors, so select the two columns with the
# largest eigenvalues and transpose them into rows: pc has shape (2, 3).
top_two = np.argsort(eigen_values)[::-1][:2]
pc = eigen_vectors[:, top_two].T
pc

array([[ 0.74834128,  0.28140349, -0.60066414],
       [ 0.12345283,  0.8306415 ,  0.54294945]])

In [35]:
df

Unnamed: 0,feature1,feature2,feature3,target
0,0.473294,0.167622,-0.987345,1
1,0.724424,1.16935,-1.332261,1
2,-0.449148,-1.556529,-1.674155,1
3,1.458807,-0.385625,0.838706,1
4,0.165928,0.218801,-1.322635,1
5,-1.021938,0.784085,1.417307,1
6,-1.760335,-1.262038,-0.261118,1
7,0.810679,0.927003,1.017782,1
8,0.093558,-1.282404,-0.178411,1
9,1.008516,0.386715,-1.549422,1


In [36]:
transformed_df = np.dot(df.iloc[:,0:3],pc.T)
# (n_samples, 3) x (3, 2) -> (n_samples, 2)
new_df = pd.DataFrame(transformed_df,columns=['PC1','PC2'])
new_df['target'] = df['target'].values
new_df.head().round(3)

Unnamed: 0,PC1,PC2,target
0,0.994,-0.338,1
1,1.671,0.337,1
2,0.231,-2.257,1
3,0.479,0.315,1
4,0.98,-0.516,1


In [37]:
new_df['target'] = new_df['target'].astype('str')
fig = px.scatter(x=new_df['PC1'],
                 y=new_df['PC2'],
                 color=new_df['target'],
                 color_discrete_sequence=px.colors.qualitative.G10
                )

fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
     