# Notebook 14: Principal Component Analysis (PCA)

Welcome to the fourteenth notebook in our machine learning series. In this notebook, we will explore **Principal Component Analysis (PCA)**, a widely used dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible.

We'll cover the following topics:
- What is PCA?
- Key concepts: Variance, Eigenvalues, and Eigenvectors
- How PCA works
- Implementation using scikit-learn
- Advantages and limitations

## What is PCA?

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining most of the information (variance). It achieves this by projecting the data onto a new set of axes called principal components, which are orthogonal and ordered by the amount of variance they explain.

PCA is commonly used for data visualization, noise filtering, and as a preprocessing step before applying other machine learning algorithms.

## Key Concepts

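- **Variance:** A measure of the spread of data points. PCA treats variance as a proxy for information: each principal component is the direction that captures the most remaining variance while staying orthogonal to the components before it.
- **Eigenvalues and Eigenvectors:** PCA computes these from the covariance matrix of the data. Each eigenvector gives the direction of a principal component, and its eigenvalue gives the variance of the data along that direction.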
- **Principal Components:** Linear combinations of the original features that form a new coordinate system. The first principal component explains the most variance, the second explains the next most, and so on.
- **Dimensionality Reduction:** Reducing the number of features by keeping only the top k principal components, typically just enough of them to explain a chosen share of the total variance (for example, 95%).
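
The relationship between these concepts can be written compactly. The notation here is an assumption introduced for illustration: $X$ is the centered (and usually standardized) data matrix with $n$ rows and $d$ columns, $C$ is its covariance matrix, and $\mathbf{v}_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of $C$.

$$C = \frac{1}{n-1} X^\top X, \qquad C\,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$$

$$\text{explained variance ratio of component } i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}$$

Keeping the top $k$ components means projecting onto $\mathbf{v}_1, \dots, \mathbf{v}_k$, which retains the fraction $\sum_{i=1}^{k} \lambda_i \,/\, \sum_{j=1}^{d} \lambda_j$ of the total variance.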

## How PCA Works

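1. **Standardize the Data:** Center each feature by subtracting its mean, then scale it by dividing by its standard deviation so that features with larger scales don't dominate. Centering is required for PCA; scaling is optional but recommended when features are measured in different units.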
2. **Compute the Covariance Matrix:** Calculate the covariance matrix to understand how features vary together.
3. **Eigen Decomposition:** Perform eigen decomposition on the covariance matrix to find eigenvalues and eigenvectors.
4. **Sort Eigenvalues and Eigenvectors:** Order the eigenvectors by their corresponding eigenvalues in descending order to identify the principal components.
5. **Select Top Components:** Choose the top k eigenvectors and stack them as columns to form the projection matrix (where k is the desired number of dimensions).
6. **Transform the Data:** Project the original data onto the new feature space using the selected eigenvectors.
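
The six steps above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_from_scratch`, the variable names, and the small random dataset are all assumptions made for this example.

```python
import numpy as np


def pca_from_scratch(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: standardize each feature (zero mean, unit variance)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    X_std = (X - mean) / std

    # Step 2: covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigendecomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: eigh returns eigenvalues in ascending order, so reverse them
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 5: keep the top-k eigenvectors as columns of the projection matrix
    components = eigenvectors[:, :k]

    # Step 6: project the standardized data onto the selected components
    X_reduced = X_std @ components
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return X_reduced, components, explained_ratio


# Hypothetical demo data: 5 features, two of which are strongly correlated
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

X_reduced, components, explained_ratio = pca_from_scratch(X_demo, k=2)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", explained_ratio)
```

The sketch uses `np.linalg.eigh` because the covariance matrix is symmetric. The sign of each eigenvector is arbitrary, so the projected coordinates may be mirrored relative to another implementation without changing the explained variance.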

## Implementation Using scikit-learn

Let's implement PCA using scikit-learn to reduce the dimensionality of a dataset and visualize the results. We'll also use it as a preprocessing step for a classification task.

```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset with high dimensionality
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA to reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Visualize the reduced data
plt.figure(figsize=(8, 6))
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis', label='Train')
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=y_test, cmap='viridis', marker='x', label='Test')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: Data Reduced to 2 Dimensions')
plt.legend()
plt.show()

# Print explained variance ratio
print(f'Explained Variance Ratio for 2 components: {pca.explained_variance_ratio_}')
print(f'Total Explained Variance Ratio: {sum(pca.explained_variance_ratio_):.2f}')

# Use PCA as preprocessing for classification (reduce to 5 dimensions)
pca_classifier = PCA(n_components=5)
X_train_pca_classifier = pca_classifier.fit_transform(X_train_scaled)
X_test_pca_classifier = pca_classifier.transform(X_test_scaled)

# Train a Random Forest Classifier on reduced data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_pca_classifier, y_train)
y_pred = rf.predict(X_test_pca_classifier)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of Random Forest after PCA (5 components): {accuracy:.2f}')
```
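
The example above fixes the number of components at 2 and 5. One common alternative is to choose the smallest number of components whose cumulative explained variance reaches a target. The sketch below regenerates the same synthetic dataset so that it runs on its own; the 95% threshold and the names `threshold`, `cumulative`, and `k_95` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset and split as in the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA(svd_solver="full").fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.95  # assumed target, not a universal rule
k_95 = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {k_95}")

# Passing a float in (0, 1) asks scikit-learn to make the same selection itself
pca_95 = PCA(n_components=threshold, svd_solver="full").fit(X_train_scaled)
print(f"Components selected by scikit-learn: {pca_95.n_components_}")
```

Plotting `cumulative` against the component index gives the usual cumulative explained variance curve, which can help when no single threshold is obviously right.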

## Advantages and Limitations

**Advantages:**
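- Reduces the computational cost of downstream models by lowering the number of features.
- Helps in visualizing high-dimensional data by reducing it to 2 or 3 dimensions.
- Can improve model performance by discarding low-variance directions, which often contain noise, and by collapsing redundant (correlated) features into fewer components.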

**Limitations:**
- Captures only linear structure, so it may miss non-linear patterns in the data.
- Reduces interpretability, since each principal component is a combination of the original features rather than a single measured quantity.
- Is sensitive to feature scale, so the data usually needs to be standardized first. scikit-learn's `PCA` centers the data but does not scale it.
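
The effect of feature scale is easy to see on a small example. In the sketch below, the two features are independent but differ in scale by a factor of 1000; the data and variable names are hypothetical and chosen only to illustrate the point.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two independent features on very different scales (hypothetical units)
small = rng.normal(loc=0.0, scale=1.0, size=n)
large = rng.normal(loc=0.0, scale=1000.0, size=n)
X_raw = np.column_stack([small, large])

# Without scaling, the large-scale feature dominates the first component
ratio_raw = PCA(n_components=2).fit(X_raw).explained_variance_ratio_

# After standardization, both features contribute comparably
X_scaled = StandardScaler().fit_transform(X_raw)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("Explained variance ratio (raw):         ", ratio_raw)
print("Explained variance ratio (standardized):", ratio_scaled)
```

On the raw data the first component explains nearly all of the variance simply because one feature has larger numbers, not because it carries more information. After standardization the two ratios are close to each other.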

## Conclusion

Principal Component Analysis is a powerful tool for dimensionality reduction, making it easier to visualize and process high-dimensional data. While it has limitations in capturing non-linear relationships, it remains a fundamental technique in the machine learning preprocessing pipeline.

In the next notebook, we will explore another important topic to further enhance our machine learning skills.