# An Introduction to Principal Component Analysis (PCA)

Imagine you are an environmental scientist studying a lake. At a single point in the lake, you have a probe that measures 50 different things simultaneously: water temperature, pH, dissolved oxygen, turbidity, conductivity, and concentrations of 45 different chemical pollutants. You collect this 50-dimensional data point every minute for a year.

Now you have millions of data points, each with 50 features. Your goal is simple: 'How did the lake's overall health change over the year?'

Answering this is impossible by plotting 50 separate graphs. Many of these features are redundant (e.g., several pollutants might come from the same source and always appear together). How can we boil these 50 correlated features down to just two or three 'meta-features' that capture the most important patterns of change? For instance, perhaps we could discover a 'Pollution Event Axis' and a 'Seasonal Temperature Axis'.

This is the core problem that Principal Component Analysis solves. It is a powerful method for finding the most meaningful 'meta-features' (the Principal Components) in complex, high-dimensional data, allowing us to visualize patterns and understand the hidden structure of our measurements.

## Dimensionality reduction methods
The common goal of dimensionality reduction is to transform high-dimensional data into a lower-dimensional space while preserving as much of the meaningful structure and properties of the original data as possible. This is done to simplify datasets, reduce computational and storage costs, and make data easier to visualize and analyze. The following list shows some of the most relevant methods:

- **Principal Component Analysis (PCA)** A linear technique that transforms the data into a new coordinate system of orthogonal axes called principal components. These components are ordered by the amount of variance they capture from the original data. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- **Linear Discriminant Analysis (LDA)** A supervised linear technique that finds a linear combination of features that best separates two or more classes of objects. It aims to maximize the between-class variance while minimizing the within-class variance. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
- **t-Distributed Stochastic Neighbor Embedding (t-SNE)** A non-linear method primarily used for visualizing high-dimensional data. It models data points by their similarities in both high and low dimensions, preserving the local structure of the data. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- **Uniform Manifold Approximation and Projection (UMAP)** A modern non-linear technique that constructs a graph representation of the data in high dimensions and then optimizes a similar graph in a lower dimension. It is known for its speed and ability to preserve both local and global data structure. Reference: https://umap-learn.readthedocs.io/en/latest/
- **Kernel PCA** An extension of PCA that uses kernel functions to perform non-linear dimensionality reduction. It implicitly maps the data to a higher-dimensional space where linear separation is possible, then applies PCA. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html
- **Autoencoders** A type of neural network used for unsupervised learning. It consists of an "encoder" that compresses the input into a low-dimensional code and a "decoder" that reconstructs the input from this code. The compressed code is the lower-dimensional representation. Reference: https://blog.keras.io/building-autoencoders-in-keras.html
- **Feature Selection** This is a category of methods that select a subset of the original features rather than creating new ones. Techniques include filter methods (e.g., chi-squared test), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO regularization). Reference: https://scikit-learn.org/stable/modules/feature_selection.html
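- **Isomap** A non-linear method that preserves geodesic distances along the data manifold. It estimates them with a neighborhood graph and then embeds the points with multidimensional scaling (MDS). Reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html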

A nice visualization to compare: https://projector.tensorflow.org/


| Method | Advantages | Disadvantages |
| :--- | :--- | :--- |
| **PCA** | Simple, fast, and removes correlated features. Effective for linear data. | Can be sensitive to data scaling. Performs poorly with non-linear data. |
| **LDA** | Improves classification performance by focusing on class separability. | Supervised (requires labels). Assumes data is normally distributed. |
| **t-SNE** | Excellent at visualizing local clusters in high-dimensional data. | Computationally expensive. Can lose global structure information. |
| **UMAP** | Very fast and scalable. Preserves both local and global data structure. | Can be sensitive to hyperparameters. Interpretation can be nuanced. |
| **Kernel PCA**| Can capture complex, non-linear relationships in the data. | More computationally intensive than PCA. Requires careful kernel selection. |
| **Autoencoders**| Can learn highly complex, non-linear representations. Flexible for various data types. | Can be prone to overfitting. Requires a large amount of training data. |
| **Feature Selection**| Improves model interpretability by using original features. Reduces overfitting. | May discard features that are useful only in combination with others. |
| **Isomap** | Captures global non-linear structure. Preserves manifold geometry well if its assumptions hold. | Sensitive to the neighborhood size and to noise. Computationally expensive for large datasets. |

## Introduction to PCA
What is Principal Component Analysis?
- Many scientific datasets have many correlated measurements: spectra (many wavelengths), climate grids (many locations), images (many pixels), recordings from many sensors or neurons.
- PCA finds orthogonal directions (principal components) that explain the most variance, which helps visualization, noise reduction, compression, and discovering latent degrees of freedom.

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data while preserving as much variance as possible. Think of it as finding the best camera angle to capture the most information about a 3D object in a 2D photograph.

Key Concepts:
- Dimensionality Reduction: Transform high-dimensional data to lower dimensions
- Variance Preservation: Keep the most important patterns in the data
- Data Visualization: Make complex datasets interpretable
- Noise Reduction: Filter out less important variations

| Field | Application | Benefits |
|-------|-------------|----------|
| **Spectroscopy** | Identify key wavelengths in complex spectra | Reduce noise, find characteristic peaks |
| **Genomics** | Find patterns in gene expression data | Identify gene clusters, reduce computational complexity |
| **Climate Science** | Understand weather patterns from multiple variables | Visualize climate patterns, identify trends |
| **Chemistry** | Analyze molecular properties and reactions | Optimize reaction conditions, understand structure-property relationships |
| **Physics** | Process sensor data from experiments | Extract signals from noise, identify fundamental modes |


:::{exercise} Understanding Dimensionality

**Scenario**: A researcher has measurements of temperature, humidity, pressure, and wind speed from 1000 weather stations. They want to create a 2D map showing weather patterns.

**Question**: How could PCA help in this situation?

**Tasks**:
1. Identify the original dimensionality of the data
2. Explain what the principal components might represent
3. Discuss potential limitations of reducing to 2D
:::

Now let's see an example.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Data generation
np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x,y]).T
Xc = X - X.mean(axis=0)


In [None]:
plt.figure(figsize=(6,6))
plt.scatter(Xc[:,0], Xc[:,1], s=15)
plt.xlabel('x (centered)')
plt.ylabel('y (centered)')
plt.title('2D data')
plt.axis('equal')
plt.grid(True)
plt.show()

This data is clearly positively correlated, as can also be seen from a seaborn pair plot:

In [None]:
import seaborn as sns
import pandas as pd

# Create a DataFrame with the centered data
df = pd.DataFrame(Xc, columns=['x', 'y'])

# Pairplot to visualize pairwise relationships and distributions
sns.pairplot(df)
plt.show()


Some info can be drawn from the so-called covariance matrix:

In [None]:
# Compute the covariance matrix
# rowvar=False means each column is a variable and each row an observation
C = np.cov(Xc, rowvar=False)

In [None]:
# Plot the covariance matrix as a heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(C, annot=True, cmap="coolwarm", cbar=True, square=True, fmt=".2f", 
            xticklabels=["x", "y"], yticklabels=["x", "y"])
plt.title("Covariance Matrix Heatmap")
plt.show()


The covariances depend on the units of each feature, so they are easier to compare after normalizing them into the correlation matrix. The correlation coefficient is defined as
\begin{equation}
\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.
\end{equation}
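
As a quick check of this definition, here is a minimal sketch that rebuilds the correlation matrix from the covariance matrix and compares it with `np.corrcoef`. It regenerates the same synthetic data as the cells above so that it can run on its own; the names `std` and `R` are only illustrative.

```python
import numpy as np

# Same synthetic data as in the cells above
np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T

C = np.cov(X, rowvar=False)      # 2x2 covariance matrix
std = np.sqrt(np.diag(C))        # standard deviation of each feature
R = C / np.outer(std, std)       # divide each C_ij by std_i * std_j

print("Correlation matrix built from the covariance matrix:\n", R)
print("np.corrcoef result:\n", np.corrcoef(X, rowvar=False))
```

The diagonal of `R` is 1 by construction, and the off-diagonal entry is the correlation coefficient between `x` and `y`.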

Also, the condition number
\begin{equation}
\kappa = \frac{\lambda_{\max}}{\lambda_{\min}},
\end{equation}
the ratio of the largest to the smallest eigenvalue of the covariance matrix, helps to interpret it. If $\kappa$ is very large, the matrix is ill-conditioned, which suggests multicollinearity (some features are nearly linear combinations of others).

The determinant of the covariance matrix measures the volume that the data occupies in feature space (it is the product of the eigenvalues), so a value close to zero indicates that the data is redundant.
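
Both quantities are easy to compute with NumPy. The following sketch, which again regenerates the synthetic data so it is self-contained, compares the eigenvalue ratio with `np.linalg.cond` and the determinant with the product of the eigenvalues. For a symmetric positive-definite matrix these pairs should agree.

```python
import numpy as np

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
C = np.cov(np.vstack([x, y]), rowvar=True)   # here each row is a variable

eigvals = np.linalg.eigvalsh(C)              # ascending order, real for symmetric C
kappa = eigvals[-1] / eigvals[0]             # lambda_max / lambda_min
det_C = np.linalg.det(C)

print(f"condition number (eigenvalue ratio): {kappa:.1f}")
print(f"np.linalg.cond:                      {np.linalg.cond(C):.1f}")
print(f"determinant: {det_C:.3f}, product of eigenvalues: {eigvals.prod():.3f}")
```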

Furthermore, the eigenvalues of the covariance matrix (the variance explained along each direction) and its eigenvectors (the principal components) are also meaningful:

In [None]:
# Visual demo: 2D correlated data and PC1

eigvals, eigvecs = np.linalg.eigh(C)
order = eigvals.argsort()[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Visualize data and eigen directions
pc1 = eigvecs[:,0]
plt.figure(figsize=(6,6))
plt.scatter(Xc[:,0], Xc[:,1], s=15)
origin = np.zeros(2)
c = {-3: 'r', 3 : 'k'}
for s in [-3,3]:
    p = s * np.sqrt(eigvals[0]) * pc1
    plt.plot([origin[0], p[0]], [origin[1], p[1]], linewidth=3, c=c[s])
plt.xlabel('x (centered)')
plt.ylabel('y (centered)')
plt.title('2D correlated data and first principal component')
plt.axis('equal')
plt.grid(True)
plt.show()

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import numpy as np

# Data generation
np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T
Xc = X - X.mean(axis=0)

# Compute the covariance matrix, and its eigen values and vectors
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = eigvals.argsort()[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]


# Check orthogonality: the dot product should be 0 for orthogonal vectors
print(f"{eigvecs[:, 0]=}   {eigvecs[:, 1]=}")
dot_product = np.dot(eigvecs[:, 0], eigvecs[:, 1])
print(f"Dot product of PC1 and PC2: {dot_product}")

# Print the eigenvalues normalized by their total sum
print(f"Eigenvals proportion: {eigvals/np.sum(eigvals)}")

# First principal component
pc1 = eigvecs[:, 0]

# Bokeh plot setup
output_notebook()  # Ensure the plot is rendered inline

# Create figure
p = figure(width=600, height=600, title="2D Correlated Data and PC1 and PC2", match_aspect=True)
p.scatter(Xc[:, 0], Xc[:, 1], size=8, color="blue", alpha=0.5)

# Plot the principal component lines
origin = np.array([0, 0])
color = {-3:"red", 3:"black"}
for s in [-3, 3]:
    p.line([origin[0], s * np.sqrt(eigvals[0]) * pc1[0]], 
           [origin[1], s * np.sqrt(eigvals[0]) * pc1[1]], 
           line_width=3, color=color[s], alpha=0.6)

# Plot the second principal component lines
pc2 = eigvecs[:, 1]
color = {-3:"orange", 3:"cyan"}
for s in [-3, 3]:
    p.line([origin[0], s * np.sqrt(eigvals[1]) * pc2[0]], 
           [origin[1], s * np.sqrt(eigvals[1]) * pc2[1]], 
           line_width=3, color=color[s], alpha=0.9)


# Labels and formatting
p.xaxis.axis_label = "x (centered)"
p.yaxis.axis_label = "y (centered)"
p.title.text_font_size = '32px'
p.grid.grid_line_color = "gray"  # Grid line color
p.grid.grid_line_alpha = 0.3  # Transparency of grid lines
p.grid.grid_line_width = 1  # Width of grid lines

# Show plot
show(p)


This is the key to the PCA method: the eigenvectors of the covariance matrix give the directions along which the data varies the most, and the eigenvalues tell us how much. Keeping only the directions with the largest eigenvalues reduces the dimensionality of the data (in this case, the first principal component is enough).
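
To make the reduction concrete, the sketch below projects the 2D points onto the first principal component (one number per sample) and then maps them back to 2D. It is a minimal illustration under the same synthetic data as above; `scores` and `X_rec` are illustrative names.

```python
import numpy as np

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]
scores = Xc @ pc1                  # 1D coordinates along PC1, shape (n,)
X_rec = np.outer(scores, pc1)      # back to 2D: points lying on the PC1 line

# Mean squared reconstruction error (with the n-1 convention of np.cov)
mse = np.sum((Xc - X_rec) ** 2) / (n - 1)
print(f"variance of the scores: {scores.var(ddof=1):.4f} (lambda_1 = {eigvals[0]:.4f})")
print(f"reconstruction error:   {mse:.4f} (lambda_2 = {eigvals[1]:.4f})")
```

The variance of the 1D scores equals the first eigenvalue, and what is lost in the reconstruction equals the second one, which is why a small second eigenvalue justifies dropping that direction.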

## Example applications to basic sciences
### Biology & Molecular Biology

- Higa, G. S., de Oliveira, L. G., Luns, K. F., de Jesus, R. E., Ferreira, R. C., da Silva, J. A., Cunha, P. H., & Pires, D. S. (2024). Application of Principal Component Analysis as a Prediction Model for Feline Sporotrichosis. Animals, 14(12), 1696.  <https://www.mdpi.com/2306-7381/12/1/32>

- Sfriso, P., & Crave, A. (2022). Principal Component Analysis and Related Methods for Investigating the Dynamics of Biological Macromolecules. J - Multidisciplinary Scientific Journal, 5(2), 298–317. <https://www.mdpi.com/2571-8800/5/2/21>

### Environmental Science & Earth Science
- Huang, J., Lu, D., Lin, W., & Yang, Q. (2024). Enhancing slope stability prediction through integrated PCA-SSA-SVM modeling: a case study of LongLian expressway. Frontiers in Earth Science, 12. DOI: https://doi.org/10.3389/feart.2024.1429601

- Kaur, L., & Godara, P. (2025). Application of Principal Component Analysis (PCA) in groundwater quality evaluation: A case study of arid region. Energy & Environment Management, 1(21). <https://ojs.awpublishing.org/eem/article/view/17>

- Kahangwa, C. (2022). Application of Principal Component Analysis, Cluster Analysis, Pollution Index and Geoaccumulation Index in Pollution Assessment with Heavy Metals from Gold Mining Operations, Tanzania. Journal of Geoscience and Environment Protection, 10(4), 303-317. <https://www.scirp.org/journal/paperinformation?paperid=116892>
 
### Chemistry

- Moreira, M., Hillenkamp, M., Divitini, G., Tizei, L. H. G., Ducati, C., Cotta, M. A., Rodrigues, V., & Ugarte, D. (2025). Improving Quantitative EDS Chemical Analysis of Alloy Nanoparticles by PCA Denoising: Part I, Reducing Reconstruction Bias. Microscopy and Microanalysis. DOI: <https://academic.oup.com/mam/article-abstract/28/2/338/6911396?redirectedFrom=fulltext>

### Physics

- Yao, Z., Yang, J., & Lin, H.-Q. (2025). Principal component analysis for percolation with and without data preprocessing. Physical Review E, 111(4). DOI: <https://journals.aps.org/pre/abstract/10.1103/PhysRevE.111.045303>


## Mathematical Foundation
PCA is often described as a search for the directions with the most variance. It can also be formulated as an optimization problem, which explains why the covariance matrix appears. First, let's assume that the data is centered, that is, every feature has mean 0. Now, let's take a unit vector $u$ that could be a more useful axis than the original ones. The projection of a feature vector $x$ onto $u$ is
\begin{equation}
x' = (u\cdot x) u = (u^T x) u = x_u u.
\end{equation}
The dot product $x_u = u^T x$ measures how much of $x$ is retained along $u$: its magnitude is largest when both vectors are parallel, and it is zero when they are orthogonal. Since it can be negative, we define the amount of preserved information as
\begin{equation}
(u\cdot x)^2,
\end{equation}
or, better, as its mean over all $n$ samples:
\begin{equation}
\frac{1}{n}\sum_{i=1}^n (u\cdot x_i)^2 = \frac{1}{n} \sum_{i=1}^n x_{i,u}^2 = \frac{1}{n} \sum_{i=1}^n (u^T x_i)(x_i^T u) = u^T \left( \frac{1}{n} \sum_{i=1}^n x_i x_i^T \right) u = u^T C u,
\end{equation}
where we have used the definition of the covariance matrix of centered data,
\begin{equation}
C = \frac{1}{n} \sum_{i=1}^n x_i x_i^T.
\end{equation}
Using $1/(n-1)$ instead of $1/n$ only rescales $C$ and does not change the directions found below. Therefore, to find the direction that preserves the most information about our original data, we must solve the optimization problem
\begin{equation}
\max_u \; u^T C u
\end{equation}
subject to
\begin{equation}
u^T u = 1.
\end{equation}
This can be done using the method of Lagrange multipliers. Introducing a Lagrange multiplier $\lambda$ for the constraint and differentiating with respect to $u$ (yes, it is a vector, see matrix calculus) gives
\begin{equation}
\frac{\partial}{\partial u} \left( u^T C u - \lambda (u^T u - 1) \right) = 0,
\end{equation}
\begin{equation}
2Cu - 2\lambda u = 0,
\end{equation}
and, finally,
\begin{equation}
Cu = \lambda u,
\end{equation}
which is exactly the eigenvalue problem for the covariance matrix. Multiplying this equation on the left by $u^T$ gives $u^T C u = \lambda$, so the information preserved along an eigenvector equals its eigenvalue. Therefore, the direction that preserves the most information about our original data is the eigenvector of the covariance matrix with the largest eigenvalue, and the remaining eigenvectors, ordered by decreasing eigenvalue, are the next principal components.
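
The result can be checked numerically. The sketch below evaluates the objective, written as the quadratic form $u^T C u$, on many random unit vectors and compares the best value found with the largest eigenvalue. The 3-feature dataset and the mixing matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-feature dataset with correlated columns
A = rng.normal(size=(500, 3))
X = A @ np.array([[2.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)                 # 1/n convention, as in the derivation

eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues
u_best = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue

# Evaluate u^T C u for many random unit vectors u (one per row of U)
U = rng.normal(size=(10_000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
objective = np.einsum("ij,jk,ik->i", U, C, U)

print(f"largest eigenvalue:      {eigvals[-1]:.4f}")
print(f"u_best^T C u_best:       {u_best @ C @ u_best:.4f}")
print(f"best random unit vector: {objective.max():.4f}")
```

No random direction should beat the top eigenvector, and the value reached by that eigenvector should match its eigenvalue.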








## General algorithm
### Step 1: Data Centering
\begin{equation}
\hat X = X - \mu
\end{equation}

Where $\mu$ is the mean of each variable.

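**Why center the data?**
- Ensures PCA finds directions of maximum variance around the mean of the data
- Centering does not change the scale of the variables; to prevent variables with larger scales from dominating, also standardize each variable to unit variance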
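
A minimal sketch of this step, using made-up temperature and pressure measurements (the units and values are assumptions for illustration only). It also shows that centering alone leaves the scale of each variable untouched, while dividing by the standard deviation puts them on an equal footing.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical measurements: temperature in kelvin and pressure in pascal
temperature = 290.0 + 5.0 * rng.normal(size=300)
pressure = 101_325.0 + 800.0 * rng.normal(size=300)
X = np.column_stack([temperature, pressure])

mu = X.mean(axis=0)
X_centered = X - mu                                # Step 1: subtract the column means
X_standard = X_centered / X.std(axis=0, ddof=1)    # optional: unit variance per column

print("column means after centering: ", X_centered.mean(axis=0))
print("variances after centering:    ", X_centered.var(axis=0, ddof=1))
print("variances after standardizing:", X_standard.var(axis=0, ddof=1))
```

After centering, the pressure column still has a much larger variance than the temperature column, so it would dominate the first principal component unless the data is also standardized.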

### Step 2: Covariance Matrix
\begin{equation}
C = \frac{1}{n-1}\, \hat X^T \hat X.
\end{equation}

Where $n$ is the number of observations.

**The covariance matrix captures**:
- How much each variable varies (diagonal elements)
- How variables co-vary together (off-diagonal elements)
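
The formula of this step can be written directly with a matrix product. The sketch below applies it to a small random dataset (hypothetical, three variables) and compares the result with `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # hypothetical data: 50 observations, 3 variables
X[:, 2] = X[:, 0] + 0.1 * X[:, 2]     # make the third variable track the first one

X_hat = X - X.mean(axis=0)            # centered data (Step 1)
n = X_hat.shape[0]
C = X_hat.T @ X_hat / (n - 1)         # Step 2 formula

print(np.round(C, 3))
print("matches np.cov:", np.allclose(C, np.cov(X, rowvar=False)))
```

The large positive entry linking the first and third variables reflects how the data was built.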

:::{exercise}
You are given the following three (centered) data points for two features, 'Gene A expression' and 'Gene B expression':

    Sample 1: (-2, -1)

    Sample 2: (0, 0)

    Sample 3: (2, 1)

Without using a calculator, do you expect the covariance between Gene A and Gene B to be positive, negative, or near zero? Why?
Calculate the 2x2 covariance matrix for this data. Does the result confirm your intuition?
:::

### Step 3: Eigendecomposition
\begin{equation}
C \vec v = \lambda \vec v
\end{equation}
Where:
- $\vec v$ are eigenvectors (principal component directions)
- $\lambda$ are eigenvalues (variance captured by each component)
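
In practice this step is one call to `np.linalg.eigh`, which is designed for symmetric matrices and returns the eigenvalues in ascending order. The sketch below uses a small made-up covariance matrix, re-sorts the result so that PC1 comes first, and verifies the eigenvalue equation for each pair.

```python
import numpy as np

C = np.array([[2.0, 0.8],
              [0.8, 0.6]])            # hypothetical covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)  # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]     # re-sort so that PC1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

for k in range(len(eigvals)):
    v = eigvecs[:, k]                 # eigenvectors are stored as columns
    ok = np.allclose(C @ v, eigvals[k] * v)
    print(f"PC{k + 1}: lambda = {eigvals[k]:.4f}, v = {v}, C v == lambda v: {ok}")
```

Note that the sign of each eigenvector is arbitrary: both $\vec v$ and $-\vec v$ satisfy the equation.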

### Step 4: Variance Explained

The fraction of the total variance explained by the $i$-th principal component is
\begin{equation}
\text{Variance explained by } PC_i = \frac{\lambda_i}{\sum_j\lambda_j}.
\end{equation}
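
A short sketch of this computation, together with the cumulative variance that is commonly used to decide how many components to keep. The eigenvalues and the 90% target are assumptions chosen for illustration, not a general rule.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order
eigvals = np.array([5.2, 2.1, 0.4, 0.2, 0.1])

ratio = eigvals / eigvals.sum()       # variance explained by each PC
cumulative = np.cumsum(ratio)         # variance explained by the first k PCs

threshold = 0.90                      # assumed target
# First k whose cumulative explained variance reaches the threshold
k = int(np.searchsorted(cumulative, threshold) + 1)

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.2%} (cumulative {c:.2%})")
print(f"components needed to reach {threshold:.0%}: {k}")
```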



:::{exercise} Covariance Understanding
**Question**: If two variables have a covariance of 0, what does this mean for PCA?

Consider this covariance matrix:
```
C = [4.0  0.0]
    [0.0  1.0]
```

- What are the eigenvalues?
- What are the eigenvectors?
- How much variance does each PC explain?
:::

:::{exercise} The scree plot decision
An ecologist runs a PCA on a dataset with 10 features related to forest health. The explained variance for each component is:

    PC1: 45%

    PC2: 30%

    PC3: 8%

    PC4: 7%

    PC5: 4%

    PC6-PC10: <1% each

- Create a scree plot (bar chart of explained variance).
- Using the "elbow method," how many principal components would you choose to retain for further analysis?
- What is the cumulative variance explained by your chosen number of components? Is this likely sufficient?
:::

## Advanced Details

### PCA Assumptions and Limitations

**Assumptions**:
- Linear relationships between variables
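- Data is well described by its means and covariances (for example, approximately multivariate normal); PCA can still be computed otherwise, but the components are harder to interpret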
- Variables are continuous

**Limitations**:
- **Linear combinations only**: Cannot capture nonlinear patterns
- **Variance-based**: May not preserve class separability
- **Global method**: Same transformation for all data points

**When PCA May Not Work Well**:
- Highly nonlinear data (consider kernel PCA)
- Categorical variables (consider correspondence analysis)
- When rare events are important (PCA focuses on major patterns)
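
The first limitation can be seen in a small experiment. In the sketch below, points on a noisy circle (a made-up dataset) are controlled by a single hidden variable, the angle, yet no single linear direction captures it: PCA assigns roughly half of the variance to each component.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)   # the single hidden degree of freedom
radius = 1.0 + 0.02 * rng.normal(size=1000)        # small radial noise
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # decreasing order
ratio = eigvals / eigvals.sum()
print("explained variance ratio:", np.round(ratio, 3))
```

Dropping either component here would discard about half of the variance, even though the data is essentially one-dimensional. Non-linear methods such as kernel PCA are meant for this kind of structure.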

### Robust PCA

**Problem**: Standard PCA is sensitive to outliers.

**Solutions and related variants**:
- **Robust PCA**: Use median instead of mean, robust covariance estimation
- **Sparse PCA**: Assume many loadings are exactly zero
- **Kernel PCA**: Handle nonlinear relationships
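
The sensitivity to outliers is easy to reproduce. The sketch below adds one corrupted point (a made-up value, far from the bulk of the data) to the synthetic dataset used earlier and measures how much the first principal component rotates. The helper `first_pc` is only illustrative; it is standard PCA, not a robust method.

```python
import numpy as np

def first_pc(X):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigvecs[:, -1]

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T

# Hypothetical corrupted measurement, far from the bulk of the data
X_out = np.vstack([X, [50.0, -50.0]])

pc_clean = first_pc(X)
pc_out = first_pc(X_out)
cos_angle = abs(pc_clean @ pc_out)    # the sign of an eigenvector is arbitrary
angle = np.degrees(np.arccos(np.clip(cos_angle, 0.0, 1.0)))

print(f"PC1 without the outlier: {pc_clean}")
print(f"PC1 with the outlier:    {pc_out}")
print(f"angle between them:      {angle:.1f} degrees")
```

A single point out of 201 is enough to rotate PC1 by a large angle, which is the motivation for the robust variants listed above.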

---

## PCA applied to the previous data using scikit-learn


In [None]:
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Data generation
np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T

# Standardize each feature to zero mean and unit variance
# (centering alone would be: Xc = X - X.mean(axis=0))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

explained_variance = pca.explained_variance_ratio_
print(f"\nVariance explained by PC1: {explained_variance[0]:.2%}")
print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
print(f"Total variance explained: {np.sum(explained_variance):.2%}")
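

Because the data is standardized first, this PCA is effectively an eigendecomposition of the correlation matrix rather than of the covariance matrix, so the explained variance can differ from the proportions obtained earlier from the centered data. For two features, the correlation matrix has eigenvalues $1 \pm \rho$, so PC1 should explain $(1 + |\rho|)/2$ of the variance. The sketch below checks this with NumPy only; the name `Z` is illustrative.

```python
import numpy as np

np.random.seed(0)
n = 200
x = np.random.normal(size=n)
y = 2.0 * x + 0.8 * np.random.normal(size=n)
X = np.vstack([x, y]).T

rho = np.corrcoef(X, rowvar=False)[0, 1]

# Standardize, then take the eigenvalues of the covariance of the standardized data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]   # decreasing order
ratio = eigvals / eigvals.sum()

print(f"correlation rho          = {rho:.4f}")
print(f"explained variance ratio = {ratio}")
print(f"(1 + |rho|) / 2          = {(1 + abs(rho)) / 2:.4f}")
```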


## PCA applied to real data 

The data comes from the Sloan Digital Sky Survey (SDSS). To generate it, go to <https://skyserver.sdss.org/dr14/en/tools/search/form/searchform.aspx> and write your own query or use the default one. To get the full data, use the following SQL query:
```sql
SELECT TOP 10000
    p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
    p.run, p.rerun, p.camcol, p.field,
    s.specobjid, s.class, s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6
    AND p.g BETWEEN 0 AND 20
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- 1. DATA ACQUISITION ---
# Load the data from the CSV file you downloaded from the SDSS SkyServer.
file_path = 'skyserverdata2.csv' # Or whatever you named the downloaded file
# url = "https://drive.google.com/file/d/1ZLnCVim0P1Ktewyhd1VcRNWTDCte-W0O/view?usp=drive_link"

try:
    # Use skiprows=1 to ignore the first line ("#Table1")
    sdss_df = pd.read_csv(file_path, skiprows=1)
    print("SDSS dataset loaded successfully and parsed correctly!")

except FileNotFoundError:
    print(f"File not found: '{file_path}'. Please make sure you have downloaded the data and placed it in the correct directory.")
    sdss_df = pd.DataFrame()

# Proceed only if the DataFrame was loaded successfully
if not sdss_df.empty:
    # --- 2. DATA EXPLORATION AND PREPARATION ---
    print("\nDataset Shape:", sdss_df.shape)
    print("\nColumns:", sdss_df.columns)

    # Define our features and the target variable for coloring the plot
    features = ['u', 'g', 'r', 'i', 'z']
    target = 'class'

    # --- Safety Check ---
    # Verify that all required columns exist in the DataFrame before proceeding.
    if target in sdss_df.columns and all(col in sdss_df.columns for col in features):
        
        X = sdss_df[features]
        y = sdss_df[target]

        print("\nNumber of samples for each class:")
        print(y.value_counts())

        # Standardize the feature data
        X_scaled = StandardScaler().fit_transform(X)
        print("\nData has been standardized.")

        # --- 3. APPLYING PCA ---
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X_scaled)

        explained_variance = pca.explained_variance_ratio_
        print(f"\nVariance explained by PC1: {explained_variance[0]:.2%}")
        print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
        print(f"Total variance explained: {np.sum(explained_variance):.2%}")

        # --- 4. VISUALIZATION AND INTERPRETATION ---

        # Create a new DataFrame for easier plotting with seaborn
        pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
        pca_df['class'] = y.values # Add the class labels for coloring

        # --- PLOT 1: SCORES PLOT (COLORED) ---
        # Scores mean the data is plotted on the new coordinate system : PC1xPC2
        plt.figure(figsize=(12, 9))
        sns.scatterplot(
            x='PC1', y='PC2',
            hue='class',
            data=pca_df,
            palette='viridis',
            alpha=0.7,
            s=40 # marker size
        )
        plt.title('PCA of SDSS Astronomical Objects (based on colors)', fontsize=16)
        plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})', fontsize=12)
        plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})', fontsize=12)
        plt.legend(title='Object Class')
        plt.grid(True)
        plt.show()

        # --- PLOT 2: LOADINGS PLOT ---
        plt.figure(figsize=(10, 8))
        # In scikit-learn, pca.components_ holds the principal axes (unit vectors), used here as loadings. We transpose for easier plotting.
        loadings = pca.components_.T
        
        for i, feature in enumerate(features):
            plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], head_width=0.05, head_length=0.05, color='red', alpha=0.8)
            plt.text(loadings[i, 0]*1.15, loadings[i, 1]*1.15, feature, color='black', ha='center', va='center', fontsize=14)

        plt.xlim(-0.8, 0.8)
        plt.ylim(-0.8, 0.8)
        plt.xlabel(f'PC1 ({explained_variance[0]:.2%})')
        plt.ylabel(f'PC2 ({explained_variance[1]:.2%})')
        plt.title('Loadings Plot', fontsize=16)
        plt.gca().set_aspect('equal', adjustable='box')
        plt.grid()
        plt.show()

    else:
        print("\n--- ERROR ---")
        print(f"The target column '{target}' or one of the feature columns was not found in the data.")
        print("Please ensure your downloaded file contains all necessary columns.")
        print("Required columns: 'u', 'g', 'r', 'i', 'z', 'class'")
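

A note on terminology: some texts reserve the word "loadings" for the principal axes scaled by the square root of their eigenvalues, while the cell above plots the unit-length axes from `pca.components_`. The sketch below shows both on a made-up dataset (a stand-in for five correlated magnitudes, not real SDSS data), assuming scikit-learn is installed as in the rest of this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Hypothetical stand-in for five correlated magnitudes (not real SDSS data)
base = rng.normal(size=(500, 1))
X = base + 0.3 * rng.normal(size=(500, 5))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA()                      # keep all components
pca.fit(X_scaled)

axes = pca.components_.T                             # unit-length directions, shape (5, 5)
loadings = axes * np.sqrt(pca.explained_variance_)   # scale column k by sqrt(lambda_k)

print("column norms of components_.T:  ", np.round(np.linalg.norm(axes, axis=0), 3))
print("column norms of scaled loadings:", np.round(np.linalg.norm(loadings, axis=0), 3))
```

With all components kept, the scaled loadings reproduce the covariance matrix of the standardized data (`loadings @ loadings.T`), which is one reason that convention is popular. Either choice gives arrows pointing in the same directions in a loadings plot; only their lengths differ.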

The following cell is a Bokeh version of the same analysis, with interactive plots and a printed summary of the loadings.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from bokeh.plotting import figure, show, output_notebook
from bokeh.palettes import Category10
from bokeh.models import ColumnDataSource, HoverTool
import warnings
warnings.filterwarnings('ignore')

# Enable Bokeh output in notebook
output_notebook()

# --- 1. DATA ACQUISITION ---
# Local copy downloaded from the SDSS SkyServer (uncomment to use it instead of the URL)
# file_path = 'skyserverdata2.csv'
# Remote copy of the data
file_path = "https://drive.usercontent.google.com/download?id=1ZLnCVim0P1Ktewyhd1VcRNWTDCte-W0O&export=download"

try:
    # Use skiprows=1 to ignore the first line ("#Table1")
    sdss_df = pd.read_csv(file_path, skiprows=1)
    print("SDSS dataset loaded successfully and parsed correctly!")
except OSError:
    # OSError covers a missing local file (FileNotFoundError) and download errors (urllib.error.URLError)
    print(f"Could not load '{file_path}'. Please make sure the file exists or the URL is reachable.")
    sdss_df = pd.DataFrame()

# Proceed only if the DataFrame was loaded successfully
if not sdss_df.empty:
    # --- 2. DATA EXPLORATION AND PREPARATION ---
    print("\nDataset Shape:", sdss_df.shape)
    print("\nColumns:", sdss_df.columns.tolist())

    # Define our features and the target variable for coloring the plot
    features = ['u', 'g', 'r', 'i', 'z']
    target = 'class'

    # --- Safety Check ---
    if target in sdss_df.columns and all(col in sdss_df.columns for col in features):
        
        X = sdss_df[features]
        y = sdss_df[target]

        print("\nNumber of samples for each class:")
        print(y.value_counts())

        # Remove any rows with NaN values
        mask = ~(X.isnull().any(axis=1) | y.isnull())
        X = X[mask]
        y = y[mask]
        print(f"\nAfter removing NaN values: {len(X)} samples")

        # Standardize the feature data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        print("\nData has been standardized.")

        # --- 3. APPLYING PCA ---
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X_scaled)

        explained_variance = pca.explained_variance_ratio_
        print(f"\nVariance explained by PC1: {explained_variance[0]:.2%}")
        print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
        print(f"Total variance explained: {np.sum(explained_variance):.2%}")

        # --- 4. VISUALIZATION AND INTERPRETATION ---

        # Create a new DataFrame for easier plotting with Bokeh
        pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
        pca_df['class'] = y.values  # Add the class labels for coloring

        # Get unique classes and assign colors
        unique_classes = sorted(y.unique())
        n_classes = len(unique_classes)
        colors = Category10[max(3, n_classes)][:n_classes]
        color_map = dict(zip(unique_classes, colors))
        
        # Add color column to dataframe
        pca_df['color'] = [color_map[cls] for cls in pca_df['class']]

        # --- PLOT 1: SCORES PLOT (COLORED) ---
        print("\nGenerating PCA scores plot...")
        
        # Create hover tool
        hover = HoverTool(tooltips=[
            ("Class", "@class"),
            ("PC1", "@PC1{0.00}"),
            ("PC2", "@PC2{0.00}")
        ])

        # Prepare the data for the scatter plot (scores plot)
        source = ColumnDataSource(pca_df)

        p1 = figure(
            width=800, 
            height=600, 
            title="PCA of SDSS Astronomical Objects (based on ugriz colors)",
            tools=[hover, "pan", "wheel_zoom", "box_zoom", "reset", "save"]
        )
        
        # Create scatter plot for each class separately to get proper legend
        for cls in unique_classes:
            class_data = pca_df[pca_df['class'] == cls]
            class_source = ColumnDataSource(class_data)
            p1.scatter(
                x='PC1', y='PC2', 
                source=class_source, 
                size=8, 
                color=color_map[cls],
                legend_label=str(cls), 
                alpha=0.7
            )
        
        p1.xaxis.axis_label = f'Principal Component 1 ({explained_variance[0]:.1%} variance)'
        p1.yaxis.axis_label = f'Principal Component 2 ({explained_variance[1]:.1%} variance)'
        p1.legend.title = 'Object Class'
        p1.legend.location = "top_right"
        p1.legend.click_policy = "hide"  # Allow hiding/showing classes by clicking legend
        
        show(p1)

        # --- PLOT 2: LOADINGS PLOT ---
        print("\nGenerating loadings plot...")
        
        # In scikit-learn, pca.components_ holds the principal axes (unit vectors), used here as loadings
        loadings = pca.components_.T  # Shape: (n_features, n_components)
        
        # Create the loadings plot
        p2 = figure(
            width=800, 
            height=600, 
            title="PCA Loadings Plot - Feature Contributions",
            tools="pan,wheel_zoom,box_zoom,reset,save"
        )
        
        # Draw arrows from origin to each loading point
        for i, feature in enumerate(features):
            x_end = loadings[i, 0]
            y_end = loadings[i, 1]
            
            # Draw arrow line
            p2.line([0, x_end], [0, y_end], line_width=2, color='red', alpha=0.8)
            
            # Draw arrow head (simple triangle marker)
            p2.scatter([x_end], [y_end], marker="triangle", size=10, color='red', alpha=0.8)
            
            # Add feature labels
            p2.text(
                x=[x_end * 1.1], y=[y_end * 1.1], 
                text=[feature], 
                text_font_size="12pt", 
                text_align="center",
                text_baseline="middle"
            )

        # Add grid lines at origin
        p2.line([-1, 1], [0, 0], line_color='gray', line_alpha=0.3)
        p2.line([0, 0], [-1, 1], line_color='gray', line_alpha=0.3)
        
        # Add unit circle for reference
        theta = np.linspace(0, 2*np.pi, 100)
        circle_x = np.cos(theta)
        circle_y = np.sin(theta)
        p2.line(circle_x, circle_y, line_color='gray', line_alpha=0.3, line_dash='dashed')

        p2.xaxis.axis_label = f'PC1 ({explained_variance[0]:.1%} variance)'
        p2.yaxis.axis_label = f'PC2 ({explained_variance[1]:.1%} variance)'
        
        # Use the same symmetric range on both axes so arrow directions are comparable
        max_range = max(abs(loadings.max()), abs(loadings.min())) * 1.2
        p2.x_range.start = -max_range
        p2.x_range.end = max_range
        p2.y_range.start = -max_range
        p2.y_range.end = max_range
        
        show(p2)

        # --- INTERPRETATION ---
        print("\n" + "="*60)
        print("INTERPRETATION OF RESULTS:")
        print("="*60)
        
        print(f"\nPCA Summary:")
        print(f"- Total variance explained by first 2 components: {np.sum(explained_variance):.1%}")
        print(f"- PC1 explains {explained_variance[0]:.1%} of variance")
        print(f"- PC2 explains {explained_variance[1]:.1%} of variance")
        
        print(f"\nLoadings (feature contributions):")
        loadings_df = pd.DataFrame(loadings, columns=['PC1', 'PC2'], index=features)
        print(loadings_df.round(3))
        
        print(f"\nTop contributors to PC1:")
        pc1_contrib = abs(loadings_df['PC1']).sort_values(ascending=False)
        for feature, contrib in pc1_contrib.items():
            print(f"  {feature}: {contrib:.3f}")
            
        print(f"\nTop contributors to PC2:")
        pc2_contrib = abs(loadings_df['PC2']).sort_values(ascending=False)
        for feature, contrib in pc2_contrib.items():
            print(f"  {feature}: {contrib:.3f}")
    
    else:
        print("\n--- ERROR ---")
        print(f"The target column '{target}' or one of the feature columns was not found in the data.")
        print("Please ensure your downloaded file contains all necessary columns.")
        print("Required columns: 'u', 'g', 'r', 'i', 'z', 'class'")
        if not sdss_df.empty:
            print(f"Available columns: {sdss_df.columns.tolist()}")

else:
    print("Cannot proceed without data. Please check the file path and try again.")

:::{exercise} The Importance of Standardization 
Take the original Python script and find the line where the data is scaled. Comment it out and run the PCA analysis again. What changes? Replot and analyze the results. How much variance is now captured by PC1 and PC2? Do you still see the quasars separated from the other classes in the scores plot?

:::

:::{exercise} Incorporating a Physical Measurement
Investigate how adding a non-color, physical feature like 'redshift' changes the PCA results and the interpretability of the components. Redshift is a measure of how much an object's light has been stretched as it travels through the expanding universe; it is strongly correlated with distance. In the original script, add 'redshift' to the list of features to be included in the analysis. Re-run the script.
- What is the new total explained variance captured by PC1 and PC2? Has it increased or decreased?
- Look at the new Scores Plot. Is the separation between the three classes (STAR, GALAXY, QSO) better or worse than in the original plot? Pay close attention to the Star/Galaxy overlap.
- Examine the new Loadings Plot. Where does the new 'redshift' arrow point? What does its direction and length tell you about its contribution to PC1 and PC2?
- Based on the new loadings, what do PC1 and PC2 now represent? Has the interpretation of "overall brightness" or "color contrast" changed? What physical reason might explain why adding redshift was so effective (or ineffective) at separating the classes?
:::

:::{exercise} What about more components?
- Plot the scores for the first three, and then the first four, principal components.
- Analyze how the classes (STAR, GALAXY, QSO) are distributed in the 3D space.
- Calculate the cumulative explained variance for the first 3 or 4 components. How much of the total variance is explained by these components?
- Does adding more components improve the separation between classes? Do the additional components add significant information?
:::

:::{exercise} Compare the performance of PCA with other dimensionality reduction methods like t-SNE or UMAP.
- Apply t-SNE or UMAP to the original feature set (without PCA) and visualize the results.
- Compare the ability of t-SNE/UMAP to separate the classes with PCA.
- Discuss the advantages and limitations of PCA versus non-linear dimensionality reduction methods like t-SNE and UMAP.
:::

## Interpretation Guidelines

### Choosing Number of Components

**Method 1: Cumulative Variance**
- Keep components explaining 80-95% of variance
- Depends on application requirements

**Method 2: Scree Plot (Elbow Method)**
- Plot eigenvalues vs. component number
- Look for "elbow" where decrease slows

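**Method 3: Kaiser Criterion**
- Keep components with eigenvalues > 1
- Only applies when variables are standardized: each standardized variable then contributes a variance of 1, so an eigenvalue above 1 means the component carries more variance than any single original variable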
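
The cumulative-variance rule and the Kaiser criterion can be compared directly in code. The sketch below is only an illustration on the standardized Iris data; the 90% threshold and the variable names (`pca_all`, `n_cumulative`, `n_kaiser`) are assumptions chosen for this example, not values prescribed by the method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four Iris features and keep every component
X_std = StandardScaler().fit_transform(load_iris().data)
pca_all = PCA().fit(X_std)

# Method 1: smallest number of components whose cumulative variance
# reaches the chosen threshold (0.90 is an assumed, application-dependent value)
threshold = 0.90
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
n_cumulative = int(np.argmax(cumulative >= threshold)) + 1

# Method 2: the numbers a scree plot would display (eigenvalues, largest first)
eigenvalues = pca_all.explained_variance_

# Method 3: Kaiser criterion, count the eigenvalues greater than 1
n_kaiser = int(np.sum(eigenvalues > 1.0))

print("Eigenvalues:", eigenvalues.round(3))
print("Cumulative variance:", cumulative.round(3))
print("Components kept (cumulative rule):", n_cumulative)
print("Components kept (Kaiser criterion):", n_kaiser)
```

The rules do not have to agree. When they differ, the choice usually comes down to how much information loss the application can tolerate.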

### Loading Interpretation

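**Loading Values** (rules of thumb, meaningful when loadings are expressed as correlations between a variable and a PC):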
- **|loading| > 0.7**: Strong relationship
- **0.3 < |loading| < 0.7**: Moderate relationship
- **|loading| < 0.3**: Weak relationship

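**Signs Matter**:
- **Positive loading**: Variable increases as the PC score increases
- **Negative loading**: Variable decreases as the PC score increases
- The overall sign of a component is arbitrary; only the relative signs of the loadings within a component carry meaning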
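
Thresholds such as 0.3 and 0.7 are easiest to apply when the loadings are correlations between each variable and each component. scikit-learn's `pca.components_` stores unit-length eigenvectors instead, so a rescaling step is needed. The following is a minimal sketch on the Iris data; the helper `strength` and the names `pca_iris` and `corr_loadings` are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca_iris = PCA(n_components=2).fit(X_std)

# Rows of components_ are unit-length eigenvectors; transpose so rows are features
eigenvectors = pca_iris.components_.T

# Correlation between feature j and PC k:
#   eigenvector[j, k] * sqrt(eigenvalue[k]) / std(feature j)
# ddof=1 matches the convention behind explained_variance_
feature_std = X_std.std(axis=0, ddof=1)
corr_loadings = eigenvectors * np.sqrt(pca_iris.explained_variance_) / feature_std[:, None]

corr_df = pd.DataFrame(corr_loadings, index=iris.feature_names, columns=["PC1", "PC2"])


def strength(value):
    """Label a correlation loading using the rule-of-thumb thresholds."""
    magnitude = abs(value)
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"


print(corr_df.round(3))
print(corr_df.apply(lambda col: col.map(strength)))
```

Because these values are correlations, they always lie between -1 and 1, regardless of how many variables the dataset has.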

## Common Pitfalls

| **Mistake** | **Correct** |
|---|---|
|Treating PCs as original variables|PCs are linear combinations of original variables|
|Ignoring scaling/standardization|Consider whether to standardize based on variable units|
|Over-interpreting small components|Focus on components explaining meaningful variance|
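
The scaling pitfall can be made concrete with a small experiment. The sketch below assumes a hypothetical situation in which one Iris feature (sepal width) was recorded in millimetres instead of centimetres, and checks which feature dominates the first principal axis with and without standardization. The helper `first_axis` is introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_names = load_iris().feature_names
X_cm = load_iris().data          # all four features in centimetres
X_mm = X_cm.copy()
X_mm[:, 1] *= 10                 # hypothetical: sepal width recorded in millimetres


def first_axis(data):
    """Return the first principal axis (a unit vector) of the given data."""
    return PCA(n_components=1).fit(data).components_[0]


axis_raw_cm = first_axis(X_cm)
axis_raw_mm = first_axis(X_mm)
axis_std_cm = first_axis(StandardScaler().fit_transform(X_cm))
axis_std_mm = first_axis(StandardScaler().fit_transform(X_mm))

# Without standardization, the feature with the largest numeric spread dominates PC1
print("Dominant feature, raw (cm):", feature_names[int(np.argmax(np.abs(axis_raw_cm)))])
print("Dominant feature, raw (mm):", feature_names[int(np.argmax(np.abs(axis_raw_mm)))])

# After standardization, the choice of units no longer matters
print("Standardized axes identical:", np.allclose(np.abs(axis_std_cm), np.abs(axis_std_mm)))
```

A change of units alone is enough to change what the unstandardized PC1 "means", which is why the decision to standardize should be made deliberately.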


## A Detailed Worked Example - The Iris Dataset

The Iris dataset is a classic in machine learning and statistics. It contains 150 samples from three species of Iris flowers (Setosa, Versicolor, and Virginica). For each sample, four features were measured: sepal length, sepal width, petal length, and petal width.

Our goal is to see if we can use PCA to 'summarize' these four measurements and visualize the separation between the species.

### Setup - Importing Libraries

First, let's import all the necessary libraries for data manipulation, numerical computation, and plotting.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Set some default plotting styles for better looking visuals
sns.set_style('whitegrid')

### Loading data

In [None]:
# Load the data
iris = load_iris()
X = iris.data # The feature matrix
y = iris.target # The labels (species)
feature_names = iris.feature_names
target_names = iris.target_names

# Let's look at the first 5 rows of the data
print("Feature Names:", feature_names)
print("\nFirst 5 rows of data:\n", X[:5])

### Step 1: Standardize the Data

PCA is sensitive to the scale of the features. We need to standardize our data so that each feature has a mean of 0 and a standard deviation of 1.

In [None]:
X_scaled = StandardScaler().fit_transform(X)

print("First 5 rows of scaled data:\n", X_scaled[:5])
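
To make the standardization step less of a black box, the same result can be reproduced by hand. This is a small sketch, assuming the Iris data as loaded above; `X_manual` is a name introduced here for illustration. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the column standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Matches StandardScaler:", np.allclose(X_scaled, X_manual))
print("Column means:", X_scaled.mean(axis=0).round(6))
print("Column standard deviations:", X_scaled.std(axis=0).round(6))
```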

### Step 2: Perform PCA

Now we apply PCA. We'll ask `scikit-learn` to find the first two principal components.

In [None]:
# Initialize PCA and fit the scaled data
# n_components specifies how many dimensions we want to reduce to.
pca = PCA(n_components=2)

# Fit the model and transform the data to the new coordinate system
X_pca = pca.fit_transform(X_scaled)
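
What `fit_transform` does can be cross-checked against a manual eigendecomposition of the covariance matrix. The sketch below is one way to do this and is not how scikit-learn computes PCA internally (it uses a singular value decomposition); names such as `scores_manual` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized data (np.cov uses ddof=1, as scikit-learn does)
cov = np.cov(X_scaled, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each score is a linear combination of the (already centered) features
scores_manual = X_scaled @ eigvecs[:, :2]

pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(X_scaled)

# Eigenvectors are only defined up to sign, so compare absolute values
print("Eigenvalues match:", np.allclose(eigvals[:2], pca.explained_variance_))
print("Scores match up to sign:", np.allclose(np.abs(scores_manual), np.abs(scores_sklearn)))
```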

### Step 3: Analyze and Visualize the Results

#### Scree Plot

First, let's see how much variance our two new components capture. A scree plot is perfect for this.

In [None]:
explained_variance = pca.explained_variance_ratio_

print(f"Variance explained by PC1: {explained_variance[0]:.2%}")
print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
print(f"Total variance explained by first two components: {np.sum(explained_variance):.2%}")

# To make a full scree plot, we can re-run PCA without specifying n_components
pca_full = PCA().fit(X_scaled)

plt.figure(figsize=(8, 6))
plt.bar(range(1, len(pca_full.explained_variance_ratio_) + 1), pca_full.explained_variance_ratio_ * 100, alpha=0.7, align='center', label='Individual explained variance')
plt.step(range(1, len(pca_full.explained_variance_ratio_) + 1), np.cumsum(pca_full.explained_variance_ratio_) * 100, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained Variance Percentage')
plt.xlabel('Principal Component Index')
plt.title('Scree Plot for Iris Dataset')
plt.xticks(range(1, len(pca_full.explained_variance_ratio_) + 1))
plt.legend(loc='best')
plt.show()

**Observation:** The first two components capture over 95% of the total variance in the data! This means our 2D plot will be a very good representation of the original 4D data.

#### Scores Plot

Next, we create a scores plot. This is a scatter plot of our samples in the new PCA space. We will color each point according to its true species.

In [None]:
plt.figure(figsize=(10, 8))
colors = ['navy', 'turquoise', 'darkorange']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=.8, lw=2,
                label=target_name)

plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA Scores Plot of Iris Dataset')
plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})')
plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})')
plt.show()

**Observation:** The three species are largely separated in the PCA plot. The *Setosa* species forms a distinct cluster, while *Versicolor* and *Virginica* are mostly separated from each other but overlap where the two groups meet.

#### Biplot

Finally, the biplot helps us understand *why* the samples are separated. It overlays the original feature vectors (loadings) on top of the scores plot. This tells us how the original variables contribute to the principal components.

We will use a helper function to create a clean biplot.

In [None]:
def biplot(score, coeff, labels=None):
    """
    Creates a biplot visualization.
    
    score: The transformed data (scores), e.g., X_pca.
    coeff: The eigenvectors (loadings), e.g., pca.components_.T.
    labels: The names of the original features.
    """
    plt.figure(figsize=(12, 10))
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    
    # Plot the scores
    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(xs[y == i], ys[y == i], color=color, alpha=0.7, label=target_name)

    # Plot the loadings (scaled by a factor of 4 so the arrows are visible next to the scores)
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0]*4, coeff[i, 1]*4, color='r', alpha=0.9, head_width=0.05)
        if labels is None:
            plt.text(coeff[i, 0] * 4.2, coeff[i, 1] * 4.2, "Var" + str(i + 1), color='black', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 4.2, coeff[i, 1] * 4.2, labels[i], color='black', ha='center', va='center', fontsize=12)
    
    plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})')
    plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})')
    plt.title('Biplot of Iris Dataset')
    plt.legend()
    plt.grid(True)  # a bare plt.grid() toggles the grid, which would switch off seaborn's whitegrid

# Call the function with our data
# Note: we need to transpose pca.components_ to get the loadings in the right shape
biplot(X_pca, np.transpose(pca.components_), labels=feature_names)
plt.show()

**Interpretation of the Biplot:**

*   **PC1 (the horizontal axis):** With the scikit-learn defaults used here, `petal length`, `petal width`, and `sepal length` point strongly to the right, while `sepal width` points slightly to the left (the sign of each axis is arbitrary, so other implementations may show a mirrored picture). This suggests PC1 is largely a measure of **overall flower size**, driven mainly by the petals. Larger flowers (like Virginica) are on the right (high PC1 score), and smaller flowers (like Setosa) are on the left (low PC1 score).
*   **PC2 (the vertical axis):** This axis is dominated by `sepal width`, with a smaller contribution from `sepal length`. The two petal arrows are almost horizontal, so the petal measurements contribute very little to it. PC2 therefore mostly describes how wide the sepals are, and it spreads out the samples within each species. The separation between Versicolor and Virginica happens mainly along PC1, not PC2.
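
Arrow directions in a biplot can be hard to read by eye, so it helps to check them against the numbers. The sketch below prints the loadings behind the arrows as a table, assuming the same standardized Iris data and two-component PCA as above; `loadings_df` is a name chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# One row per original feature, one column per principal component.
# The PC1 column is the horizontal extent of each arrow, PC2 the vertical extent.
loadings_df = pd.DataFrame(
    pca.components_.T,
    index=iris.feature_names,
    columns=["PC1", "PC2"],
)
print(loadings_df.round(3))

# Rank the features by how strongly they contribute to each component
for pc in loadings_df.columns:
    ranked = loadings_df[pc].abs().sort_values(ascending=False)
    print(f"\nContribution ranking for {pc}:")
    print(ranked.round(3))
```

A positive PC1 value means the arrow points right and a negative one means it points left; the same reasoning applies to PC2 and the vertical direction.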

## Application Exercises

Now it's your turn! Apply the techniques learned above to the following datasets. For each exercise, you'll need to:
1. Load the data using `pandas`.
2. Select the feature columns.
3. Standardize the features.
4. Perform PCA.
5. Create and interpret a scores plot and/or a biplot.

*(Note: You will need to have the `.csv` files in the same directory as this notebook for the code to work.)*

### Genomics - Cancer Subtype Identification
**Dataset:** `cancer_data.csv`
**Task:** Perform PCA on the gene expression data. Can you identify distinct clusters in a scores plot? What might they represent?

In [None]:
try:
    cancer_df = pd.read_csv('cancer_data.csv')
    print("Cancer data loaded successfully!")
    # Your code here
    # 1. Select all columns except 'Sample_ID' and 'Subtype' as your features (X)
    # X_cancer = ...
    
    # 2. Standardize X_cancer
    # X_cancer_scaled = ...
    
    # 3. Perform PCA (n_components=2)
    # pca_cancer = ...
    # X_cancer_pca = ...
    
    # 4. Create a scores plot. You can color by the 'Subtype' column to see if the clusters match.
    # plt.figure(...)
    
except FileNotFoundError:
    print("File 'cancer_data.csv' not found. Please make sure it's in the same directory as the notebook.")

### Chemistry - Classifying Olive Oils
**Dataset:** `olive_oil_spectra.csv`
**Task:** Use PCA on the spectral data. Can you distinguish oils from different regions? Which wavelengths (columns) are most important for this separation?

In [None]:
try:
    oil_df = pd.read_csv('olive_oil_spectra.csv')
    print("Olive oil data loaded successfully!")
    # Your code here
    # 1. Select the wavelength columns as features.
    # X_oil = ...

    # 2. Standardize and perform PCA.
    # ...

    # 3. Create a biplot. It might be too cluttered to label all the variables (wavelengths),
    #    so focus on the scores plot part and the general direction of the loadings cloud.
    #    Color the points by the 'Region' column.
    # ...
except FileNotFoundError:
    print("File 'olive_oil_spectra.csv' not found. Please make sure it's in the same directory as the notebook.")

### Environmental Science - Air Pollution Sources
**Dataset:** `air_pollution.csv`
**Task:** Use PCA to identify patterns in air quality data. Create a biplot and interpret the loadings. Can you hypothesize what physical processes PC1 and PC2 represent?

In [None]:
try:
    air_df = pd.read_csv('air_pollution.csv')
    print("Air pollution data loaded successfully!")
    # Your code here
    # 1. Select the pollutant and meteorological columns as features.
    # X_air = ...

    # 2. Standardize and perform PCA.
    # ...

    # 3. Create a biplot and interpret the loadings.
    #    Look at how variables like Ozone, NOx, and Temperature are related in the PCA space.
    # ...
except FileNotFoundError:
    print("File 'air_pollution.csv' not found. Please make sure it's in the same directory as the notebook.")

### Agriculture - Crop Yield Analysis
**Dataset:** `crop_data.csv`
**Task:** Perform PCA on the input variables (everything except `Crop_Yield`). Then, plot the PC1 scores against `Crop_Yield`. Is there a relationship? This demonstrates using PCA for feature engineering.

In [None]:
try:
    crop_df = pd.read_csv('crop_data.csv')
    print("Crop data loaded successfully!")
    # Your code here
    # 1. Select the input variables as features.
    # input_vars = ['Rainfall', 'Sunlight_Hours', 'Fertilizer_Amount', 'Soil_pH']
    # X_crop = crop_df[input_vars]
    # y_crop = crop_df['Crop_Yield']

    # 2. Standardize and perform PCA.
    # ...
    # X_crop_pca = ...

    # 3. Create a scatter plot of the first principal component vs. Crop_Yield.
    # plt.figure(...)
    # plt.scatter(X_crop_pca[:, 0], y_crop)
    # plt.xlabel('Principal Component 1')
    # plt.ylabel('Crop Yield')
    # plt.title('PC1 vs. Crop Yield')
    # plt.show()
    # What does the relationship (or lack thereof) tell you?

except FileNotFoundError:
    print("File 'crop_data.csv' not found. Please make sure it's in the same directory as the notebook.")

### Pharmaceutical Analysis
**Scenario**: A pharmaceutical company measured 50 chemical properties of 200 drug candidates to predict effectiveness.

**Data Structure**:
- Rows: 200 drug candidates
- Columns: 50 chemical properties (molecular weight, logP, polar surface area, etc.)

**Tasks**:
1. **Preprocessing**: Should you standardize the variables? Why?
2. **Analysis**: Apply PCA and determine how many components to retain
3. **Interpretation**: If PC1 loads heavily on molecular weight, logP, and size-related properties, what does this PC represent?
4. **Application**: How would you use PCA scores to select promising drug candidates?

**Expected Results**:
- PC1 (30%): "Molecular size" - larger, more lipophilic molecules
- PC2 (20%): "Polarity" - hydrophilic vs. hydrophobic character
- PC3 (15%): "Complexity" - structural complexity measures

### Protein Structure Analysis
**Scenario**: You have 3D coordinates for all atoms in a protein from molecular dynamics simulations (1000 time points).

**Challenge**: Identify main modes of protein flexibility.

**Tasks**:
1. **Data Setup**: How would you arrange the coordinate data for PCA?
2. **Preprocessing**: What preprocessing steps are crucial?
3. **Interpretation**: What do the principal components represent physically?
4. **Validation**: How would you verify your results make biological sense?

**Solution Approach**:
- **Data matrix**: time_points × (3 × number_of_atoms)
- **Preprocessing**: Center each structure, possibly align to remove rotation
- **PC1**: Often the "breathing" mode (overall expansion/contraction)
- **PC2-PC3**: Hinge motions, domain movements
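
The data arrangement described in the solution approach can be sketched with a synthetic trajectory. Everything in the example below is an assumption made for illustration: the sizes, the artificial "breathing" motion, and the noise level. Real trajectories would be read from simulation output and would normally also be rotationally aligned to a reference structure, a step that is omitted here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_frames, n_atoms = 1000, 30     # hypothetical sizes

# Synthetic trajectory: a reference structure, one collective "breathing"
# motion (radial expansion and contraction), and small random noise
reference = rng.normal(size=(n_atoms, 3))
breathing_dir = reference - reference.mean(axis=0)
amplitude = 0.2 * np.sin(np.linspace(0, 20 * np.pi, n_frames))
traj = (
    reference[None, :, :]
    + amplitude[:, None, None] * breathing_dir[None, :, :]
    + 0.02 * rng.normal(size=(n_frames, n_atoms, 3))
)

# Preprocessing: remove overall translation by centering each frame on its centroid
traj_centered = traj - traj.mean(axis=1, keepdims=True)

# Data matrix: one row per time point, 3 * n_atoms coordinate columns
X_md = traj_centered.reshape(n_frames, 3 * n_atoms)

pca_md = PCA(n_components=3)
scores_md = pca_md.fit_transform(X_md)

print("Data matrix shape:", X_md.shape)
print("Explained variance ratio:", pca_md.explained_variance_ratio_.round(3))
```

In this toy setup a single collective motion was built in, so PC1 should recover it and the PC1 score should track the breathing amplitude over time. Coordinates are not standardized here because they all share the same units.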

### Astronomical Data
**Scenario**: Telescope survey measured brightness in 20 wavelength bands for 10,000 stars.

**Goal**: Classify star types using PCA.

**Questions**:
1. How many principal components might you expect to need?
2. What would the principal components represent physically?
3. How would you use PCA results for star classification?

**Hints**:
- Different star types have characteristic spectra
- Temperature affects overall brightness curve shape
- Chemical composition affects specific absorption lines
- PC1 likely relates to stellar temperature
- PC2-PC3 might capture metallicity, surface gravity

## Summary

**Key Takeaways**:

1. **PCA transforms data** to new coordinates that maximize variance
2. **Principal components are linear combinations** of original variables
3. **Interpretation requires examining loadings** and variance explained
4. **Standardization is crucial** when variables have different scales
5. **PCA is exploratory** - use it to understand data structure
6. **Limitations exist** - PCA assumes linear relationships

**When to Use PCA**:
- High-dimensional numerical data
- Variables are correlated
- Want to visualize or reduce dimensionality
- Need to identify main patterns

**When to Avoid PCA**:
- Variables are already uncorrelated
- All components are equally important
- Nonlinear relationships dominate
- Small sample size relative to variables
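
The first two points in the list above can be seen in a quick simulation. The sketch below uses synthetic data (an assumption of this example, as are the sample size and noise level) to contrast uncorrelated features with features driven by one shared latent factor.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_samples, n_features = 2000, 5

# Case 1: independent features, there is no shared structure for PCA to find
X_indep = rng.normal(size=(n_samples, n_features))

# Case 2: every feature is the same latent factor plus a little independent noise
latent = rng.normal(size=(n_samples, 1))
X_corr = latent + 0.3 * rng.normal(size=(n_samples, n_features))

ratio_indep = PCA().fit(X_indep).explained_variance_ratio_
ratio_corr = PCA().fit(X_corr).explained_variance_ratio_

print("Uncorrelated features:", ratio_indep.round(3))
print("Correlated features:  ", ratio_corr.round(3))
```

With uncorrelated features the variance is spread almost evenly over the components, so dropping any of them loses a comparable share of information. With correlated features a single component captures most of the variance.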

---

## Further Reading

**Books**:
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman
- "Pattern Recognition and Machine Learning" - Bishop
- "Applied Multivariate Statistical Analysis" - Johnson & Wichern

**Online Resources**:
- Scikit-learn PCA documentation
- StatQuest PCA videos (Josh Starmer)
- Andrew Ng's Machine Learning Course (PCA section)

**Research Papers**:
- Jolliffe, I.T. "Principal Component Analysis" (comprehensive review)
- Ringner, M. "What is principal component analysis?" (Nature Biotechnology)
- Domain-specific PCA applications in your field of interest