# An Introduction to Principal Component Analysis (PCA)

## Introduction
What is Principal Component Analysis?
- Scientific datasets often consist of numerous correlated measurements: spectra (many wavelengths), climate grids (many locations), images (many pixels), or recordings from many sensors or neurons.
- PCA finds orthogonal directions (principal components) that explain the most variance, which helps visualization, noise reduction, compression, and discovering latent degrees of freedom.

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data while preserving as much variance as possible. Think of it as finding the best camera angle to capture the most information about a 3D object in a 2D photograph.

Key Concepts:
- Dimensionality Reduction: Transform high-dimensional data to lower dimensions
- Variance Preservation: Keep the most important patterns in the data
- Data Visualization: Make complex datasets interpretable
- Noise Reduction: Filter out less important variations

| Field | Application | Benefits |
|-------|-------------|----------|
| **Spectroscopy** | Identify key wavelengths in complex spectra | Reduce noise, find characteristic peaks |
| **Genomics** | Find patterns in gene expression data | Identify gene clusters, reduce computational complexity |
| **Climate Science** | Understand weather patterns from multiple variables | Visualize climate patterns, identify trends |
| **Chemistry** | Analyze molecular properties and reactions | Optimize reaction conditions, understand structure-property relationships |
| **Physics** | Process sensor data from experiments | Extract signals from noise, identify fundamental modes |


:::{exercise} Understanding Dimensionality

**Scenario**: A researcher has measurements of temperature, humidity, pressure, and wind speed from 1000 weather stations. They want to create a 2D map showing weather patterns.

**Question**: How could PCA help in this situation?

**Tasks**:
1. Identify the original dimensionality of the data
2. Explain what the principal components might represent
3. Discuss potential limitations of reducing to 2D
:::

In [None]:
# Visual demo: 2D correlated data and PC1
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x,y]).T
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = eigvals.argsort()[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

pc1 = eigvecs[:,0]
plt.figure(figsize=(6,6))
plt.scatter(Xc[:,0], Xc[:,1], s=15)
origin = np.zeros(2)
for s in [-3,3]:
    p = s * np.sqrt(eigvals[0]) * pc1
    plt.plot([origin[0], p[0]], [origin[1], p[1]], linewidth=3)
plt.xlabel('x (centered)')
plt.ylabel('y (centered)')
plt.title('2D correlated data and first principal component')
plt.axis('equal')
plt.grid(True)
plt.show()

## Mathematical Foundation

### Step 1: Data Centering
\begin{equation}
\hat X = X - \mu
\end{equation}

Where $\mu$ is the mean of each variable.

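**Why center the data?**
- Ensures PCA finds directions of maximum variance about the mean rather than about the origin
- Centering alone does not stop variables with larger scales from dominating; that requires standardization (also dividing each variable by its standard deviation)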
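
As a minimal sketch of the centering step, the snippet below subtracts the column means from a small, made-up data matrix (the numbers are purely illustrative and not from any dataset in this notebook). It also shows that centering leaves the spread of each variable unchanged.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 2 variables (columns)
X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 18.0],
              [8.0, 22.0],
              [10.0, 26.0]])

mu = X.mean(axis=0)   # mean of each variable (one value per column)
X_hat = X - mu        # broadcasting subtracts mu from every row

print("column means before centering:", mu)
print("column means after centering: ", X_hat.mean(axis=0))
# Centering shifts the data but does not rescale it
print("column std before / after:", X.std(axis=0), X_hat.std(axis=0))
```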

### Step 2: Covariance Matrix
\begin{equation}
C = \frac{1}{n-1}\, \hat X^T \hat X
\end{equation}

Where $n$ is the number of observations.

**The covariance matrix captures**:
- How much each variable varies (diagonal elements)
- How variables co-vary together (off-diagonal elements)
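
The covariance formula can be checked numerically. This sketch uses random placeholder data (an assumption for illustration only) and compares the matrix product of the centered data with NumPy's built-in `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))       # placeholder data: 50 observations, 3 variables

X_hat = X - X.mean(axis=0)         # Step 1: center each column
n = X_hat.shape[0]                 # number of observations
C = X_hat.T @ X_hat / (n - 1)      # Step 2: covariance matrix, shape (3, 3)

# np.cov centers internally; rowvar=False means "columns are variables"
C_np = np.cov(X, rowvar=False)

print("matches np.cov:", np.allclose(C, C_np))
print("diagonal equals per-variable variance:",
      np.allclose(np.diag(C), X.var(axis=0, ddof=1)))
print("symmetric:", np.allclose(C, C.T))
```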

### Step 3: Eigendecomposition
\begin{equation}
C \vec v = \lambda \vec v
\end{equation}
Where:
- $\vec v$ are eigenvectors (principal component directions)
- $\lambda$ are eigenvalues (variance captured by each component)
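
To make the eigenvalue equation concrete, here is a small sketch with a hypothetical 2×2 covariance matrix (chosen for illustration only). It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted in descending order as PCA conventionally does.

```python
import numpy as np

# Hypothetical covariance matrix of two positively correlated variables
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigvals)[::-1]        # indices for descending order
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Check C v = lambda v for every eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    print(f"lambda = {lam:.1f}, v = {np.round(v, 3)}, "
          f"C v == lambda v: {np.allclose(C @ v, lam * v)}")

# Eigenvectors of a symmetric matrix are orthonormal
print("V^T V is identity:", np.allclose(eigvecs.T @ eigvecs, np.eye(2)))
```

The sign of each eigenvector is arbitrary: both $\vec v$ and $-\vec v$ satisfy the equation, so different libraries may return mirrored directions.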

### Step 4: Variance Explained

The fraction of the total variance explained by the $i$-th principal component is

\begin{equation}
\text{Variance explained by } PC_i = \frac{\lambda_i}{\sum_j\lambda_j}
\end{equation}
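
The ratio above is a one-line computation once the eigenvalues are known. The eigenvalues in this sketch are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eigvals = np.array([2.5, 1.0, 0.4, 0.1])

explained = eigvals / eigvals.sum()   # fraction of variance per component
cumulative = np.cumsum(explained)     # running total

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} of variance (cumulative {c:.1%})")
```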



:::{exercise} Covariance Understanding
**Question**: If two variables have a covariance of 0, what does this mean for PCA?

Consider this covariance matrix:
```
C = [4.0  0.0]
    [0.0  1.0]
```

**Tasks**:
1. What are the eigenvalues?
2. What are the eigenvectors?
3. How much variance does each PC explain?
:::



## Step-by-Step PCA Process

### Example: Analyzing Plant Growth Data

Suppose we measure 4 variables for 100 plants:
- Height (cm)
- Leaf area (cm²)
- Root depth (cm)
- Biomass (g)

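```python
import numpy as np

# Step 1: Organize data matrix X (100 × 4)
# Rows = observations (plants)
# Columns = variables (measurements)
# Random placeholder values stand in for real plant measurements here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Step 2: Standardize data (often necessary when units differ)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 3: Compute covariance matrix (rowvar=False: columns are variables)
C = np.cov(X_scaled, rowvar=False)

# Step 4: Find eigenvalues and eigenvectors (eigh, because C is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 5: Sort by eigenvalues (descending)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 6: Choose number of components (e.g. 2 for a 2D plot)
k = 2

# Step 7: Transform data by projecting onto the first k eigenvectors
PC_scores = X_scaled @ eigenvectors[:, :k]
```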

:::{exercise} Plant Growth Interpretation

Given the following PCA results for plant data:

**PC1 Loadings**: Height (0.6), Leaf Area (0.5), Root Depth (0.4), Biomass (0.5)
**PC2 Loadings**: Height (-0.3), Leaf Area (0.2), Root Depth (0.8), Biomass (-0.5)

**Questions**:
1. What does PC1 represent biologically?
2. What does PC2 represent?
3. Which variables contribute most to each PC?
:::

In [None]:
# Animated PCA visualization
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x,y]).T
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = eigvals.argsort()[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]
V = eigvecs

fig, ax = plt.subplots(1, 2, figsize=(10,5))
sc1 = ax[0].scatter(Xc[:,0], Xc[:,1], s=15)
X_pca = Xc.dot(V)
sc2 = ax[1].scatter(X_pca[:,0], X_pca[:,1], s=15)
ax[0].set_title('Original space')
ax[1].set_title('PCA space')
for a in ax:
    a.set_xlim(-6,6)
    a.set_ylim(-6,6)
    a.grid(True)

line_pc1, = ax[0].plot([], [], 'r-', lw=2, label='PC1')
line_pc2, = ax[0].plot([], [], 'g-', lw=2, label='PC2')
ax[0].legend()

def init():
    line_pc1.set_data([], [])
    line_pc2.set_data([], [])
    return sc1, sc2, line_pc1, line_pc2

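def update(frame):
    theta = frame
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X_rot = Xc.dot(R.T)
    sc1.set_offsets(X_rot)
    # Rotate the principal directions together with the data so that the
    # lines labelled PC1/PC2 actually follow the principal components
    v1, v2 = R @ V[:, 0], R @ V[:, 1]
    line_pc1.set_data([0, v1[0]*3], [0, v1[1]*3])
    line_pc2.set_data([0, v2[0]*3], [0, v2[1]*3])
    return sc1, sc2, line_pc1, line_pc2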

ani = FuncAnimation(fig, update, frames=np.linspace(0, np.pi/2, 30),
                    init_func=init, blit=True, interval=100)
plt.close(fig)
HTML(ani.to_jshtml())

In [None]:
# Interactive 3D PCA viewer using plotly and ipywidgets
# If plotly or ipywidgets are not installed, this will instruct the user.
try:
    import plotly.express as px
    import ipywidgets as widgets
    from IPython.display import display, HTML
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
except ImportError as e:
    print('Missing required packages for interactive viewer. Please install: plotly, ipywidgets, scikit-learn.')
    print('Error:', e)
    raise

# Load data and compute PCA
data = load_iris()
X = data.data
y = data.target
target_names = data.target_names
pca = PCA(n_components=X.shape[1])
scores = pca.fit_transform(X)
explained = pca.explained_variance_ratio_

import pandas as pd
df_scores = pd.DataFrame(scores, columns=[f'PC{i+1}' for i in range(scores.shape[1])])
df_scores['target'] = [target_names[i] for i in y]

def make_3d_plot(pc_x=1, pc_y=2, pc_z=3):
    fig = px.scatter_3d(df_scores, x=f'PC{pc_x}', y=f'PC{pc_y}', z=f'PC{pc_z}',
                        color='target', symbol='target', title=f'PC{pc_x} vs PC{pc_y} vs PC{pc_z} (explained var: {explained[pc_x-1]:.2f}, {explained[pc_y-1]:.2f}, {explained[pc_z-1]:.2f})')
    fig.update_traces(marker=dict(size=5))
    fig.update_layout(width=800, height=600)
    return fig

pc_options = [1,2,3,4]
pc_x = widgets.Dropdown(options=pc_options, value=1, description='PC X:')
pc_y = widgets.Dropdown(options=pc_options, value=2, description='PC Y:')
pc_z = widgets.Dropdown(options=pc_options, value=3, description='PC Z:')

out = widgets.Output()

def update_plot(change=None):
    with out:
        out.clear_output(wait=True)
        fig = make_3d_plot(pc_x.value, pc_y.value, pc_z.value)
        #display(fig)
        fig.show()

pc_x.observe(update_plot, names='value')
pc_y.observe(update_plot, names='value')
pc_z.observe(update_plot, names='value')

controls = widgets.HBox([pc_x, pc_y, pc_z])
#display(controls)
display(controls, out)
update_plot()

## Interpretation Guidelines

### Choosing Number of Components

**Method 1: Cumulative Variance**
- Keep components explaining 80-95% of variance
- Depends on application requirements

**Method 2: Scree Plot (Elbow Method)**
- Plot eigenvalues vs. component number
- Look for "elbow" where decrease slows

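**Method 3: Kaiser Criterion**
- Keep components with eigenvalues > 1
- Only applies when variables are standardized: each standardized variable then contributes a variance of 1, so an eigenvalue above 1 means the component explains more than a single original variable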
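
The cumulative-variance and Kaiser rules can be sketched in a few lines. This example uses the Iris data from scikit-learn; the 90% threshold is an assumption picked for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_std)                      # keep all components

# Method 1: smallest k whose cumulative variance reaches the threshold
threshold = 0.90                            # assumed target, adjust per application
cumulative = np.cumsum(pca.explained_variance_ratio_)
k_cumvar = int(np.searchsorted(cumulative, threshold) + 1)

# Method 3: Kaiser criterion on eigenvalues of the correlation matrix.
# For standardized data these sum to the number of variables, so they
# can be recovered from the variance ratios.
eig_corr = pca.explained_variance_ratio_ * X_std.shape[1]
k_kaiser = int(np.sum(eig_corr > 1))

print("cumulative variance:", np.round(cumulative, 3))
print("components for >= 90% variance:", k_cumvar)
print("correlation-matrix eigenvalues:", np.round(eig_corr, 3))
print("components kept by Kaiser criterion:", k_kaiser)
```

The two rules need not agree, which is one reason to look at a scree plot as well rather than rely on a single criterion.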

### Loading Interpretation

**Loading Values** (a rough rule of thumb; the exact cut-offs vary by field):
- **|loading| > 0.7**: Strong relationship
- **0.3 ≤ |loading| ≤ 0.7**: Moderate relationship
- **|loading| < 0.3**: Weak relationship

**Signs Matter**:
- **Positive loading**: Variable increases as the PC score increases
- **Negative loading**: Variable decreases as the PC score increases
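
Thresholds like these are usually applied to loadings expressed as correlations between a variable and a component, which for standardized data equal the eigenvector entry multiplied by the square root of the eigenvalue. The sketch below, using the Iris data as an assumed example, computes these correlation loadings two ways and confirms that they agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)

eigvec = pca.components_.T     # shape (n_features, n_components), unit-length columns

# Route 1: direct correlation between each variable and each component score
corr_loadings = np.array([[np.corrcoef(X_std[:, j], scores[:, i])[0, 1]
                           for i in range(2)]
                          for j in range(X_std.shape[1])])

# Route 2: eigenvector entries scaled by sqrt(eigenvalue of the correlation matrix)
eig_corr = pca.explained_variance_ratio_ * X_std.shape[1]
scaled = eigvec * np.sqrt(eig_corr)

for name, row in zip(iris.feature_names, corr_loadings):
    print(f"{name:20s} PC1: {row[0]:+.2f}  PC2: {row[1]:+.2f}")
print("both routes agree:", np.allclose(corr_loadings, scaled))
```

Raw eigenvector entries (what `pca.components_` returns) shrink as the number of variables grows, so fixed cut-offs are easier to justify on the correlation scale.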

### Common Pitfalls

| **Mistake** | **Correct** |
|---|---|
|Treating PCs as original variables|PCs are linear combinations of original variables|
|Ignoring scaling/standardization|Consider whether to standardize based on variable units|
|Over-interpreting small components|Focus on components explaining meaningful variance|
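
The scaling pitfall is easy to reproduce. In this sketch two independent synthetic variables (an assumption for illustration) differ only in their units: one is multiplied by 1000. Without standardization PC1 simply follows the large-unit variable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
# Same kind of signal, but the first column is expressed in much "larger" units
X = np.column_stack([1000.0 * a, b])

raw = PCA().fit(X)
std = PCA().fit(StandardScaler().fit_transform(X))

print("raw data          variance ratios:", np.round(raw.explained_variance_ratio_, 4))
print("raw data          |PC1 direction|:", np.round(np.abs(raw.components_[0]), 4))
print("standardized data variance ratios:", np.round(std.explained_variance_ratio_, 4))
```

On the raw data PC1 explains nearly all the variance and points along the first variable, purely because of its units. After standardization the two variables contribute comparably.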


## A Detailed Worked Example - The Iris Dataset

The Iris dataset is a classic in machine learning and statistics. It contains 150 samples from three species of Iris flowers (Setosa, Versicolor, and Virginica). For each sample, four features were measured: sepal length, sepal width, petal length, and petal width.

Our goal is to see if we can use PCA to 'summarize' these four measurements and visualize the separation between the species.

### Setup - Importing Libraries

First, let's import all the necessary libraries for data manipulation, numerical computation, and plotting.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Set some default plotting styles for better looking visuals
sns.set_style('whitegrid')

### Loading data

In [None]:
# Load the data
iris = load_iris()
X = iris.data # The feature matrix
y = iris.target # The labels (species)
feature_names = iris.feature_names
target_names = iris.target_names

# Let's look at the first 5 rows of the data
print("Feature Names:", feature_names)
print("\nFirst 5 rows of data:\n", X[:5])

### Step 1: Standardize the Data

PCA is sensitive to the scale of the features. We need to standardize our data so that each feature has a mean of 0 and a standard deviation of 1.

In [None]:
X_scaled = StandardScaler().fit_transform(X)

print("First 5 rows of scaled data:\n", X_scaled[:5])

### Step 2: Perform PCA

Now we apply PCA. We'll ask `scikit-learn` to find the first two principal components.

In [None]:
# Initialize PCA and fit the scaled data
# n_components specifies how many dimensions we want to reduce to.
pca = PCA(n_components=2)

# Fit the model and transform the data to the new coordinate system
X_pca = pca.fit_transform(X_scaled)
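
As a sanity check, the scores from `scikit-learn` can be compared with the manual route from the Mathematical Foundation section (covariance matrix, eigendecomposition, projection). This is a sketch that reloads the Iris data so it runs on its own; the two results should agree up to the arbitrary sign of each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# scikit-learn route
scores_sklearn = PCA(n_components=2).fit_transform(X_std)

# Manual route: covariance -> eigendecomposition -> sort -> project
C = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order[:2]]                 # first two principal directions
scores_manual = X_std @ V

# Compare absolute values because each component's sign is arbitrary
same = np.allclose(np.abs(scores_sklearn), np.abs(scores_manual), atol=1e-6)
print("manual and scikit-learn scores agree up to sign:", same)
```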

### Step 3: Analyze and Visualize the Results

#### Scree Plot

First, let's see how much variance our two new components capture. A scree plot is perfect for this.

In [None]:
explained_variance = pca.explained_variance_ratio_

print(f"Variance explained by PC1: {explained_variance[0]:.2%}")
print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
print(f"Total variance explained by first two components: {np.sum(explained_variance):.2%}")

# To make a full scree plot, we can re-run PCA without specifying n_components
pca_full = PCA().fit(X_scaled)

plt.figure(figsize=(8, 6))
plt.bar(range(1, len(pca_full.explained_variance_ratio_) + 1), pca_full.explained_variance_ratio_ * 100, alpha=0.7, align='center', label='Individual explained variance')
plt.step(range(1, len(pca_full.explained_variance_ratio_) + 1), np.cumsum(pca_full.explained_variance_ratio_) * 100, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained Variance Percentage')
plt.xlabel('Principal Component Index')
plt.title('Scree Plot for Iris Dataset')
plt.xticks(range(1, len(pca_full.explained_variance_ratio_) + 1))
plt.legend(loc='best')
plt.show()

**Observation:** The first two components capture over 95% of the total variance in the data! This means our 2D plot will be a very good representation of the original 4D data.
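
One way to see what "capturing variance" means is to map the 2D scores back to the original four dimensions and measure what was lost. The sketch below (self-contained, reloading the Iris data) uses `inverse_transform` for the reconstruction; the lost fraction should equal one minus the explained variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca2 = PCA(n_components=2).fit(X_std)
X_back = pca2.inverse_transform(pca2.transform(X_std))   # 4D reconstruction from 2 PCs

# Fraction of the total sum of squares that the 2D representation misses
lost = np.sum((X_std - X_back) ** 2) / np.sum(X_std ** 2)
kept = pca2.explained_variance_ratio_.sum()

print(f"variance kept by two components: {kept:.2%}")
print(f"relative reconstruction error:   {lost:.2%}")
print("kept + lost == 1:", np.isclose(kept + lost, 1.0))
```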

#### Scores Plot

Next, we create a scores plot. This is a scatter plot of our samples in the new PCA space. We will color each point according to its true species.

In [None]:
plt.figure(figsize=(10, 8))
colors = ['navy', 'turquoise', 'darkorange']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=.8, lw=2,
                label=target_name)

plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA Scores Plot of Iris Dataset')
plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})')
plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})')
plt.show()

**Observation:** The species separate well in the PCA plot. *Setosa* forms a distinct cluster, while *Versicolor* and *Virginica* are mostly separated but overlap slightly.

#### Biplot

Finally, the biplot helps us understand *why* the samples are separated. It overlays the original feature vectors (loadings) on top of the scores plot. This tells us how the original variables contribute to the principal components.

We will use a helper function to create a clean biplot.

In [None]:
def biplot(score, coeff, labels=None):
    """
    Creates a biplot visualization.
    
    score: The transformed data (scores), e.g., X_pca.
    coeff: The eigenvectors (loadings), e.g., pca.components_.T.
    labels: The names of the original features.
    """
    plt.figure(figsize=(12, 10))
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    
    # Plot the scores
    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(xs[y == i], ys[y == i], color=color, alpha=0.7, label=target_name)

    # Plot the loadings
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0]*4, coeff[i, 1]*4, color='r', alpha=0.9, head_width=0.05)
        if labels is None:
            plt.text(coeff[i, 0] * 4.2, coeff[i, 1] * 4.2, "Var" + str(i + 1), color='black', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 4.2, coeff[i, 1] * 4.2, labels[i], color='black', ha='center', va='center', fontsize=12)
    
    plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})')
    plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})')
    plt.title('Biplot of Iris Dataset')
    plt.legend()
    plt.grid()

# Call the function with our data
# Note: we need to transpose pca.components_ to get the loadings in the right shape
biplot(X_pca, np.transpose(pca.components_), labels=feature_names)
plt.show()

**Interpretation of the Biplot:**

*   **PC1 (the horizontal axis):** `petal length`, `petal width`, and `sepal length` point strongly along this axis in the same direction, while `sepal width` points weakly the opposite way. PC1 is therefore dominated by petal size and sepal length and can be read roughly as **overall flower size**. Larger flowers (like Virginica) are on the right (high PC1 score), and smaller flowers (like Setosa) are on the left (low PC1 score). The overall sign of a component is arbitrary, so the picture may appear mirrored in other software; only the relative directions matter.
*   **PC2 (the vertical axis):** This axis is dominated by `sepal width`, with a smaller contribution from `sepal length`. The `petal width` and `petal length` vectors are almost horizontal and contribute very little. PC2 therefore mainly describes differences in sepal dimensions among flowers of similar overall size. Versicolor and Virginica are separated chiefly along PC1, not PC2.

## Advanced Details

### PCA Assumptions and Limitations

**Assumptions**:
- Linear relationships between variables
- Approximate multivariate normality helps some interpretations (uncorrelated components are then also independent), although PCA itself does not require it
- Variables are continuous

**Limitations**:
- **Linear combinations only**: Cannot capture nonlinear patterns
- **Variance-based**: May not preserve class separability
- **Global method**: Same transformation for all data points

**When PCA May Not Work Well**:
- Highly nonlinear data (consider kernel PCA)
- Categorical variables (consider correspondence analysis)
- When rare events are important (PCA focuses on major patterns)
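
The nonlinear case can be illustrated with two concentric rings, a standard synthetic example. This is a sketch, not a tuned analysis: the RBF kernel and `gamma=10.0` are assumptions chosen for illustration, and other settings may behave differently.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: the class structure is purely radial (nonlinear)
X, labels = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA can only rotate the plane, so neither direction stands out
linear = PCA(n_components=2).fit(X)
print("linear PCA variance ratios:", np.round(linear.explained_variance_ratio_, 3))

# Kernel PCA with an RBF kernel (gamma is an assumed, untuned value)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)

# Compare where each ring lands along the first kernel component
for cls in (0, 1):
    print(f"ring {cls}: mean of first kernel component = {Z[labels == cls, 0].mean():+.3f}")
```

If the two rings have clearly different means along the first kernel component, the kernel has turned the radial structure into something a single axis can express, which linear PCA cannot do here.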

### Robust PCA

**Problem**: Standard PCA is sensitive to outliers.

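**Solutions and related variants**:
- **Robust PCA**: Use median instead of mean, robust covariance estimation
- **Sparse PCA**: Assume many loadings are exactly zero (targets interpretability rather than outliers)
- **Kernel PCA**: Handle nonlinear relationships (targets nonlinearity rather than outliers)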
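
The outlier sensitivity of standard PCA can be demonstrated with synthetic data (an assumption for illustration): 100 points spread mainly along the x-axis, then the same points plus one extreme value along y.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Clean data: large spread along x, small spread along y
X = np.column_stack([rng.normal(scale=3.0, size=100),
                     rng.normal(scale=0.5, size=100)])

pc1_clean = PCA(n_components=1).fit(X).components_[0]

# Add a single gross outlier far away along the y direction
X_out = np.vstack([X, [0.0, 100.0]])
pc1_out = PCA(n_components=1).fit(X_out).components_[0]

print("PC1 without outlier:", np.round(pc1_clean, 3))   # close to the x-axis
print("PC1 with one outlier:", np.round(pc1_out, 3))    # swings toward the y-axis
```

A single point out of 101 is enough to redirect PC1, because variance grows with the square of the distance from the mean.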

### PCA vs. Other Techniques

| Method | Best For | Limitation |
|--------|----------|------------|
| **PCA** | Continuous data, linear relationships | Linear only |
| **Factor Analysis** | Identifying latent variables | Assumes specific model |
| **ICA** | Separating mixed signals | Assumes independence |
| **t-SNE** | Visualization, nonlinear | Embedding suits visualization only: no reusable mapping for new data, and distances between clusters are not reliable |
| **UMAP** | Large datasets, preserving structure | Complex parameter tuning |

---

## Application Exercises

Now it's your turn! Apply the techniques learned above to the following datasets. For each exercise, you'll need to:
1. Load the data using `pandas`.
2. Select the feature columns.
3. Standardize the features.
4. Perform PCA.
5. Create and interpret a scores plot and/or a biplot.

*(Note: You will need to have the `.csv` files in the same directory as this notebook for the code to work.)*

### Genomics - Cancer Subtype Identification
**Dataset:** `cancer_data.csv`
**Task:** Perform PCA on the gene expression data. Can you identify distinct clusters in a scores plot? What might they represent?

In [None]:
try:
    cancer_df = pd.read_csv('cancer_data.csv')
    print("Cancer data loaded successfully!")
    # Your code here
    # 1. Select all columns except 'Sample_ID' and 'Subtype' as your features (X)
    # X_cancer = ...
    
    # 2. Standardize X_cancer
    # X_cancer_scaled = ...
    
    # 3. Perform PCA (n_components=2)
    # pca_cancer = ...
    # X_cancer_pca = ...
    
    # 4. Create a scores plot. You can color by the 'Subtype' column to see if the clusters match.
    # plt.figure(...)
    
except FileNotFoundError:
    print("File 'cancer_data.csv' not found. Please make sure it's in the same directory as the notebook.")

### Chemistry - Classifying Olive Oils
**Dataset:** `olive_oil_spectra.csv`
**Task:** Use PCA on the spectral data. Can you distinguish oils from different regions? Which wavelengths (columns) are most important for this separation?

In [None]:
try:
    oil_df = pd.read_csv('olive_oil_spectra.csv')
    print("Olive oil data loaded successfully!")
    # Your code here
    # 1. Select the wavelength columns as features.
    # X_oil = ...

    # 2. Standardize and perform PCA.
    # ...

    # 3. Create a biplot. It might be too cluttered to label all the variables (wavelengths),
    #    so focus on the scores plot part and the general direction of the loadings cloud.
    #    Color the points by the 'Region' column.
    # ...
except FileNotFoundError:
    print("File 'olive_oil_spectra.csv' not found. Please make sure it's in the same directory as the notebook.")

### Environmental Science - Air Pollution Sources
**Dataset:** `air_pollution.csv`
**Task:** Use PCA to identify patterns in air quality data. Create a biplot and interpret the loadings. Can you hypothesize what physical processes PC1 and PC2 represent?

In [None]:
try:
    air_df = pd.read_csv('air_pollution.csv')
    print("Air pollution data loaded successfully!")
    # Your code here
    # 1. Select the pollutant and meteorological columns as features.
    # X_air = ...

    # 2. Standardize and perform PCA.
    # ...

    # 3. Create a biplot and interpret the loadings.
    #    Look at how variables like Ozone, NOx, and Temperature are related in the PCA space.
    # ...
except FileNotFoundError:
    print("File 'air_pollution.csv' not found. Please make sure it's in the same directory as the notebook.")

### Agriculture - Crop Yield Analysis
**Dataset:** `crop_data.csv`
**Task:** Perform PCA on the input variables (everything except `Crop_Yield`). Then, plot the PC1 scores against `Crop_Yield`. Is there a relationship? This demonstrates using PCA for feature engineering.

In [None]:
try:
    crop_df = pd.read_csv('crop_data.csv')
    print("Crop data loaded successfully!")
    # Your code here
    # 1. Select the input variables as features.
    # input_vars = ['Rainfall', 'Sunlight_Hours', 'Fertilizer_Amount', 'Soil_pH']
    # X_crop = crop_df[input_vars]
    # y_crop = crop_df['Crop_Yield']

    # 2. Standardize and perform PCA.
    # ...
    # X_crop_pca = ...

    # 3. Create a scatter plot of the first principal component vs. Crop_Yield.
    # plt.figure(...)
    # plt.scatter(X_crop_pca[:, 0], y_crop)
    # plt.xlabel('Principal Component 1')
    # plt.ylabel('Crop Yield')
    # plt.title('PC1 vs. Crop Yield')
    # plt.show()
    # What does the relationship (or lack thereof) tell you?

except FileNotFoundError:
    print("File 'crop_data.csv' not found. Please make sure it's in the same directory as the notebook.")

### Pharmaceutical Analysis
**Scenario**: A pharmaceutical company measured 50 chemical properties of 200 drug candidates to predict effectiveness.

**Data Structure**:
- Rows: 200 drug candidates
- Columns: 50 chemical properties (molecular weight, logP, polar surface area, etc.)

**Tasks**:
1. **Preprocessing**: Should you standardize the variables? Why?
2. **Analysis**: Apply PCA and determine how many components to retain
3. **Interpretation**: If PC1 loads heavily on molecular weight, logP, and size-related properties, what does this PC represent?
4. **Application**: How would you use PCA scores to select promising drug candidates?

**Expected Results**:
- PC1 (30%): "Molecular size" - larger, more lipophilic molecules
- PC2 (20%): "Polarity" - hydrophilic vs. hydrophobic character
- PC3 (15%): "Complexity" - structural complexity measures

### Protein Structure Analysis
**Scenario**: You have 3D coordinates for all atoms in a protein from molecular dynamics simulations (1000 time points).

**Challenge**: Identify main modes of protein flexibility.

**Tasks**:
1. **Data Setup**: How would you arrange the coordinate data for PCA?
2. **Preprocessing**: What preprocessing steps are crucial?
3. **Interpretation**: What do the principal components represent physically?
4. **Validation**: How would you verify your results make biological sense?

**Solution Approach**:
- **Data matrix**: time_points × (3 × number_of_atoms)
- **Preprocessing**: Center each structure, possibly align to remove rotation
- **PC1**: Often the "breathing" mode (overall expansion/contraction)
- **PC2-PC3**: Hinge motions, domain movements
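
The data arrangement in the solution approach can be sketched as follows. The trajectory here is random placeholder data with hypothetical sizes, not real simulation output, and rotational alignment is deliberately left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_atoms = 1000, 20     # 1000 time points as in the scenario; atom count is hypothetical
traj = rng.normal(size=(n_frames, n_atoms, 3))   # placeholder coordinates, NOT real MD data

# Remove overall translation: center each frame on its own centroid
traj_centered = traj - traj.mean(axis=1, keepdims=True)

# Data matrix: time_points x (3 * number_of_atoms)
X = traj_centered.reshape(n_frames, 3 * n_atoms)

# Standard PCA steps: center columns, covariance, eigendecomposition, sort
X_hat = X - X.mean(axis=0)
C = np.cov(X_hat, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvector reshapes into one displacement vector per atom
mode1 = eigvecs[:, 0].reshape(n_atoms, 3)
print("data matrix shape:", X.shape)
print("first mode shape (atoms x xyz):", mode1.shape)
```

With real trajectories, reshaping an eigenvector back to `(n_atoms, 3)` is what allows a mode to be drawn as arrows on the structure.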

### Astronomical Data
**Scenario**: Telescope survey measured brightness in 20 wavelength bands for 10,000 stars.

**Goal**: Classify star types using PCA.

**Questions**:
1. How many principal components might you expect to need?
2. What would the principal components represent physically?
3. How would you use PCA results for star classification?

**Hints**:
- Different star types have characteristic spectra
- Temperature affects overall brightness curve shape
- Chemical composition affects specific absorption lines
- PC1 likely relates to stellar temperature
- PC2-PC3 might capture metallicity, surface gravity

## Summary

**Key Takeaways**:

1. **PCA transforms data** to new coordinates that maximize variance
2. **Principal components are linear combinations** of original variables
3. **Interpretation requires examining loadings** and variance explained
4. **Standardization is crucial** when variables have different scales
5. **PCA is exploratory** - use it to understand data structure
6. **Limitations exist** - PCA assumes linear relationships

**When to Use PCA**:
- High-dimensional numerical data
- Variables are correlated
- Want to visualize or reduce dimensionality
- Need to identify main patterns

**When to Avoid PCA**:
- Variables are already uncorrelated
- All components are equally important
- Nonlinear relationships dominate
- Small sample size relative to variables

---

## Further Reading

**Books**:
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman
- "Pattern Recognition and Machine Learning" - Bishop
- "Applied Multivariate Statistical Analysis" - Johnson & Wichern

**Online Resources**:
- Scikit-learn PCA documentation
- StatQuest PCA videos (Josh Starmer)
- Andrew Ng's Machine Learning Course (PCA section)

**Research Papers**:
- Jolliffe, I.T. "Principal Component Analysis" (comprehensive review)
- Ringner, M. "What is principal component analysis?" (Nature Biotechnology)
- Domain-specific PCA applications in your field of interest