# Class 2: Dimensionality Reduction with PCA

**Week 7: Unsupervised Learning and Advanced Data Analysis**

**Objective**: Learn dimensionality reduction and apply Principal Component Analysis (PCA) to simplify datasets.

**Agenda**:
- Understand why dimensionality reduction is useful.
- Explore PCA: how it works and what it reveals.
- Demo: Apply PCA to a dataset and visualize results.
- Exercise: Reduce dimensions of a dataset and interpret components.

Let’s simplify complex data and uncover its structure!

## 1. Why Dimensionality Reduction?

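- **Problem**: High-dimensional data (many features) is hard to visualize, analyze, and model.
  - **Curse of Dimensionality**: As the number of features grows, the data becomes sparse relative to the space it occupies. Models overfit more easily, distances between points become less informative, and computation gets more expensive.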
- **Solution**: Reduce dimensions while preserving important information.
- **Applications**:
  - Visualizing high-dimensional data in 2D/3D.
  - Speeding up machine learning models.
  - Removing redundant or noisy features.
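
One way to see the curse of dimensionality is to watch distances lose contrast as features are added. The sketch below is illustrative only: it uses uniformly random points, and the helper name `distance_contrast`, the sample size, and the dimensions are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def distance_contrast(n_points, n_dims):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare how distinguishable "near" and "far" are in low vs. high dimensions
contrasts = {d: distance_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

As the dimension grows, the nearest and farthest neighbors end up at nearly the same distance, which is one reason distance-based methods struggle on raw high-dimensional data.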

**Principal Component Analysis (PCA)** is a popular technique to achieve this by transforming data into a new set of features (principal components).

## 2. How PCA Works

**Goal**: Find new axes (principal components) that capture the most variance in the data.

**Intuition**:
- Imagine data as a cloud of points in high-dimensional space.
- PCA finds the directions (axes) where the cloud spreads out the most.
- The first principal component (PC1) captures the most variance, PC2 the second most, and so on.

**Steps**:
1. Standardize the data (zero mean, unit variance) so that features measured on larger scales do not dominate.
2. Compute the covariance matrix to capture how the features vary together.
3. Find the eigenvectors (directions) and eigenvalues (amount of variance along each direction) of the covariance matrix, and sort them from largest to smallest eigenvalue.
4. Project the data onto the top *k* eigenvectors (principal components).
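
The four steps above can be sketched directly in NumPy. This is a minimal illustration rather than how scikit-learn implements PCA internally (it uses an SVD-based solver); the variable names are assumptions made for this example. The final comparison ignores sign, because each principal component is only defined up to a flip in direction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Step 1: standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (rowvar=False means columns are features)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition; eigh is meant for symmetric matrices
# and returns eigenvalues in ascending order, so re-sort descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top k eigenvectors
k = 2
X_manual = X_std @ eigvecs[:, :k]

# Share of total variance carried by each component
explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", explained_ratio)

# Compare with scikit-learn, ignoring the arbitrary sign of each component
X_sklearn = PCA(n_components=k).fit_transform(X_std)
matches = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print("Matches scikit-learn up to sign:", matches)
```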

**Key Outputs**:
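- **Explained Variance Ratio**: The fraction of the total variance that each component captures. Across all components, the ratios sum to 1.
- **Scree Plot**: Plots the explained variance of each component in order, so you can spot the "elbow" where additional components stop adding much.
- **Transformed Data**: The lower-dimensional representation of the original samples.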
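
A common way to turn the explained variance ratio into a decision is to keep the smallest number of components whose cumulative ratio reaches a target. The sketch below assumes a 95% target, which is an illustrative choice, not a rule. It also uses scikit-learn's option of passing a fraction between 0 and 1 as `n_components`, which asks `PCA` to pick the number of components for you.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components and accumulate the explained variance
pca_all = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative)

# Smallest k whose cumulative ratio reaches the target
threshold = 0.95  # illustrative target, adjust to your needs
k = int(np.argmax(cumulative >= threshold)) + 1
print(f"Components needed for {threshold:.0%} of the variance: {k}")

# Let scikit-learn choose the number of components from the same target
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("Components chosen by PCA:", pca_auto.n_components_)
```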

**Applications**:
- Visualize customer data in 2D after reducing from many features.
- Preprocess data for clustering (like k-means from Class 1).

Let’s see PCA in action!

## 3. Demo: PCA on the Iris Dataset

We’ll use the Iris dataset (4 features: sepal length, sepal width, petal length, petal width) to demonstrate PCA, reducing it to 2D for visualization.

**Setup**: Ensure you have the required libraries installed (same as Class 1):
```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```

```python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame for visualization
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
# Store species names (not integer codes) so seaborn's legend labels match the colors
pca_df['Species'] = iris.target_names[y]

# Visualize the results (seaborn adds the legend from the hue column)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Species', palette='Set1', data=pca_df, s=100, alpha=0.7)
plt.title('PCA of Iris Dataset (2D Projection)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
```

```python
# Explained variance ratio
print('Explained Variance Ratio:', pca.explained_variance_ratio_)
print('Total Variance Explained:', pca.explained_variance_ratio_.sum())

# Scree plot
pca_full = PCA().fit(X_scaled)  # Fit PCA with all components
component_numbers = range(1, len(pca_full.explained_variance_ratio_) + 1)
plt.plot(component_numbers, pca_full.explained_variance_ratio_, 'bo-')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(component_numbers)
plt.show()
```
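
Another way to read explained variance is to ask how much of the data is lost when components are dropped. The sketch below is an illustration under the same setup as the demo (standardized Iris data): it projects onto *k* components, maps back to the original feature space with `inverse_transform`, and measures the mean squared reconstruction error. The names `errors` and `X_restored` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

errors = {}
for k in range(1, 5):
    pca_k = PCA(n_components=k).fit(X_scaled)
    # Compress to k dimensions, then map back to the 4 original features
    X_restored = pca_k.inverse_transform(pca_k.transform(X_scaled))
    errors[k] = np.mean((X_scaled - X_restored) ** 2)
    print(f"k={k}: mean squared reconstruction error = {errors[k]:.4f}")
```

Because the data is standardized, this error equals the share of variance carried by the dropped components, so it falls to essentially zero when all four components are kept.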

**Discussion**:
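- The scatter plot shows how the species fall in the 2D projection: setosa forms its own cluster, while versicolor and virginica sit close together and partly overlap. PCA never saw the labels; the grouping comes from the features alone.
- PC1 captures the most variance, PC2 the next most.
- The scree plot drops sharply after PC2: the first two components together explain roughly 96% of the variance in the standardized data.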
- How might this help with clustering (like k-means from Class 1)?
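
As a starting point for the clustering question, here is a minimal sketch that chains standardization, PCA, and k-means in one scikit-learn pipeline. It assumes Class 1 used `sklearn.cluster.KMeans`; the choice of 3 clusters and the `random_state` value are illustrative. The adjusted Rand index compares the clusters with the true species only because Iris happens to have labels; in a real clustering task there would be none.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Scale, reduce to 2 components, then cluster in the reduced space
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)

# Agreement between clusters and species (1.0 would be a perfect match)
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.2f}")
```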

## 4. Exercise: Apply PCA to a Dataset

Now it’s your turn! Apply PCA to a dataset and interpret the results.

**Task**:
- Use the Iris dataset (or synthetic data if you prefer).
- Reduce it to 2 dimensions using PCA.
- Visualize the results and check the explained variance.
- Bonus: Examine the PCA components to see which features contribute most.

**Instructions**:
1. Run the code below to load and standardize the data.
2. Apply PCA and plot the 2D projection.
3. Check the explained variance and scree plot.
4. (Optional) Interpret the components.
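
If you would rather work with synthetic data, one option is to generate features that are driven by a few hidden factors, so PCA has structure to find. The sketch below is an assumption-laden example: the two hidden factors, six observed features, noise level, and the names `X_synth` and `synth_feature_names` are all made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 300

# Two hidden factors drive six observed features, plus a little noise
latent = rng.normal(size=(n_samples, 2))
mixing = rng.normal(size=(2, 6))
noise = 0.1 * rng.normal(size=(n_samples, 6))
X_synth = latent @ mixing + noise

# Hypothetical feature names for labeling plots and tables
synth_feature_names = [f"feature_{i + 1}" for i in range(X_synth.shape[1])]

# Quick check: most of the variance should sit in the first two components
X_synth_scaled = StandardScaler().fit_transform(X_synth)
pca_synth = PCA().fit(X_synth_scaled)
print("Explained variance ratio:", pca_synth.explained_variance_ratio_)
```

To use it in the exercise, set `X_ex = X_synth` in the first exercise cell. For the bonus cell, pass `synth_feature_names` instead of `feature_names`, since the Iris names only cover four features.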

```python
# Load and standardize data (using Iris again for simplicity)
X_ex = iris.data
scaler_ex = StandardScaler()
X_scaled_ex = scaler_ex.fit_transform(X_ex)

# Your code: Apply PCA
pca_ex = PCA(n_components=2)
X_pca_ex = pca_ex.fit_transform(X_scaled_ex)

# Create DataFrame
pca_df_ex = pd.DataFrame(X_pca_ex, columns=['PC1', 'PC2'])

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(pca_df_ex['PC1'], pca_df_ex['PC2'], c='blue', s=100, alpha=0.7)
plt.title('Your PCA Projection')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
```

```python
# Your code: Check explained variance
print('Explained Variance Ratio:', pca_ex.explained_variance_ratio_)

# Scree plot
pca_full_ex = PCA().fit(X_scaled_ex)
component_numbers_ex = range(1, len(pca_full_ex.explained_variance_ratio_) + 1)
plt.plot(component_numbers_ex, pca_full_ex.explained_variance_ratio_, 'bo-')
plt.title('Your Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(component_numbers_ex)
plt.show()
```

```python
# Bonus: Inspect PCA components
# Components show the contribution of each original feature
components_df = pd.DataFrame(pca_ex.components_, columns=feature_names, index=['PC1', 'PC2'])
print('PCA Components:\n', components_df)

# Visualize as a heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(components_df, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Contributions to Principal Components')
plt.show()
```
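
To go from the heatmap to an interpretation, it can help to rank features by the absolute size of their weights, since a large negative weight matters as much as a large positive one. The sketch below rebuilds the components table so it runs on its own; the names `top_feature` and `is_orthonormal` are assumptions for this example. It also checks that the component vectors are orthonormal, which is what makes them a valid set of new axes.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# Rows are components, columns are the original features
components_df = pd.DataFrame(
    pca.components_, columns=iris.feature_names, index=['PC1', 'PC2']
)

# Rank features by absolute weight within each component
for pc in components_df.index:
    ranked = components_df.loc[pc].abs().sort_values(ascending=False)
    print(f"{pc} features by absolute weight:")
    print(ranked.round(3).to_string(), "\n")

# Feature with the largest absolute weight per component
top_feature = components_df.abs().idxmax(axis=1)

# Component vectors should have unit length and be mutually perpendicular
gram = pca.components_ @ pca.components_.T
is_orthonormal = np.allclose(gram, np.eye(2))
print("Components are orthonormal:", is_orthonormal)
```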

## 5. Wrap-Up

**Key Takeaways**:
- PCA reduces dimensions by finding directions of maximum variance.
- Explained variance and scree plots help decide how many components to keep.
- PCA is great for visualization and preprocessing (e.g., before clustering).

**Discussion Questions**:
- What patterns did you see in the 2D projection?
- Which features contributed most to PC1 and PC2?
- How could PCA help with the mall customer dataset?

**Homework**:
- Think about how PCA could simplify the mall customer dataset.
- Explore its features (e.g., age, income, spending score) and hypothesize what PC1 might represent.
- Bring ideas to Class 3!

Awesome work! Next, we’ll explore data distributions and feature selection.