
# PCA Lab: Finding Hidden Factors in Academic Performance

**Objective:** Understand how Principal Component Analysis (PCA) finds latent structure among correlated variables and, in the leading component, gives little weight to variables that are uncorrelated with the main pattern in the data.

**Scenario:** We measure four scores for 100 students:
- **Math** (0–100)
- **Science** (0–100)
- **English** (0–100)
- **Music** (0–100), intentionally designed to be **uncorrelated** with the other subjects

We expect Math/Science/English to be correlated via an underlying "academic ability" factor, while Music varies independently.


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)
pd.set_option('display.float_format', lambda x: f'{x:,.3f}')


## 1) Generate synthetic data

In [None]:

n = 100

# Hidden "academic ability" factor (drives the three academic subjects)
academic_ability = np.random.normal(loc=75, scale=10, size=n)

# Academic subjects are correlated through the shared factor + some noise
math = academic_ability + np.random.normal(0, 5, n)
science = academic_ability + np.random.normal(0, 5, n)
english = academic_ability + np.random.normal(0, 5, n)

# Music is an independent variable: uncorrelated with the academic subjects by construction
music = np.random.normal(loc=70, scale=10, size=n)

df = pd.DataFrame({
    'Math': math,
    'Science': science,
    'English': english,
    'Music': music
})

df.head()


## 2) Explore the data

In [None]:

# Summary statistics
df.describe()


In [None]:

# Correlation matrix
corr = df.corr(numeric_only=True)
corr
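
As a reference point for the sample correlations, the generating process implies a specific population correlation. Writing $A$ for the hidden ability factor with standard deviation $\sigma_a = 10$, and $\varepsilon_i$ for the subject-specific noise with standard deviation $\sigma_\varepsilon = 5$ (symbols introduced here for illustration), any two academic subjects $X_i = A + \varepsilon_i$ and $X_j = A + \varepsilon_j$ satisfy:

$$
\operatorname{Corr}(X_i, X_j)
= \frac{\operatorname{Cov}(X_i, X_j)}{\sqrt{\operatorname{Var}(X_i)\,\operatorname{Var}(X_j)}}
= \frac{\sigma_a^2}{\sigma_a^2 + \sigma_\varepsilon^2}
= \frac{100}{100 + 25}
= 0.8
$$

Music is drawn independently, so its population correlation with every academic subject is 0. With only 100 students, the sample values will scatter around these targets rather than match them exactly.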


In [None]:

# Scatter matrix to visualize pairwise relationships (matplotlib only)
axarr = scatter_matrix(df, figsize=(8,8), diagonal='hist')
plt.tight_layout()
plt.show()



## 3) Standardize the data

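PCA is sensitive to scale: variables with larger variance pull the leading components toward themselves. We standardize each column to mean 0 and standard deviation 1 so that every subject contributes on an equal footing.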


In [None]:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)  # returns a NumPy array with standardized columns
X_scaled[:5]
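
To see why scale matters, the sketch below rescales a Music-like column to a 0–1000 range and compares PC1 with and without standardization. It is a separate illustration, not part of the lab's pipeline: the names `rng`, `n_demo`, `ability_demo`, `X_demo`, and `Z_demo` are hypothetical, and the data come from a local random generator so the notebook's global random state is left alone.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data (assumption: same generating recipe as the lab,
# but drawn from a local RNG and with Music multiplied by 10).
rng = np.random.default_rng(0)
n_demo = 200
ability_demo = rng.normal(75, 10, n_demo)
X_demo = np.column_stack([
    ability_demo + rng.normal(0, 5, n_demo),  # Math-like
    ability_demo + rng.normal(0, 5, n_demo),  # Science-like
    ability_demo + rng.normal(0, 5, n_demo),  # English-like
    rng.normal(70, 10, n_demo) * 10,          # Music-like, rescaled to 0-1000
])

# Without standardization, the large-variance column dominates PC1.
pc1_raw = PCA().fit(X_demo).components_[0]

# After standardization every column has unit variance, so PC1 follows
# the shared correlation structure rather than the measurement scale.
Z_demo = StandardScaler().fit_transform(X_demo)
pc1_std = PCA().fit(Z_demo).components_[0]

# Absolute values are shown because the sign of a component is arbitrary.
print("PC1 weights, raw data:         ", np.round(np.abs(pc1_raw), 3))
print("PC1 weights, standardized data:", np.round(np.abs(pc1_std), 3))
```

On the raw data PC1 should point almost entirely along the rescaled Music column; after standardization it should weight the three academic columns roughly equally.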


## 4) Fit PCA and examine variance explained

In [None]:

pca = PCA()
pca.fit(X_scaled)

explained = pca.explained_variance_ratio_
ev = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(len(explained))],
    'Explained_Variance_Ratio': explained
})
ev
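
For comparison with the sample ratios, the following sketch computes the explained-variance ratios implied by the idealized population correlation matrix. It assumes a correlation of 0.8 between academic subjects, which follows from the simulation settings (100 / (100 + 25)), and a correlation of 0 with Music. The names `rho`, `R`, and `expected_ratio` are introduced here for illustration.

```python
import numpy as np

# Assumption: population correlation between academic subjects implied by
# ability sd = 10 and noise sd = 5, i.e. 100 / (100 + 25).
rho = 0.8

# Idealized correlation matrix, ordered Math, Science, English, Music.
R = np.eye(4)
R[:3, :3] = rho           # academic block shares one correlation
np.fill_diagonal(R, 1.0)  # restore the unit diagonal

# eigvalsh returns eigenvalues in ascending order, so reverse them.
eigenvalues = np.linalg.eigvalsh(R)[::-1]
expected_ratio = eigenvalues / eigenvalues.sum()

print(np.round(eigenvalues, 3))
print(np.round(expected_ratio, 3))
```

The eigenvalues are 2.6, 1.0, 0.2, and 0.2, so the idealized shares are about 65%, 25%, 5%, and 5%. The sample ratios from 100 students should land near these values, not exactly on them.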


In [None]:

# Scree plot (variance explained by each PC)
plt.figure(figsize=(6,4))
plt.plot(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_, marker='o')
plt.title('Scree Plot: Explained Variance by Principal Component')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.xticks(range(1, len(pca.explained_variance_ratio_)+1))
plt.grid(True, linestyle='--', linewidth=0.5)
plt.show()
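
A scree plot is often read alongside the cumulative share of variance. The sketch below shows one way to pick the smallest number of components that reaches a chosen threshold. The 90% threshold and the names `rng`, `X_demo`, `Z_demo`, `pca_demo`, `threshold`, and `k` are assumptions made for this example, and the data are a hypothetical stand-in generated with a local random generator.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data: three correlated columns plus one independent column.
rng = np.random.default_rng(1)
n_demo = 200
ability_demo = rng.normal(75, 10, n_demo)
X_demo = np.column_stack(
    [ability_demo + rng.normal(0, 5, n_demo) for _ in range(3)]
    + [rng.normal(70, 10, n_demo)]
)
Z_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(Z_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90  # assumption: an arbitrary cutoff chosen for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)

# scikit-learn can make the same choice when n_components is a float in (0, 1)
# and the full SVD solver is used.
pca_thresh = PCA(n_components=threshold, svd_solver='full').fit(Z_demo)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 90%:", k, "| chosen by PCA:", pca_thresh.n_components_)
```

The threshold is a judgment call; the point of the sketch is only that the manual cumulative sum and scikit-learn's fractional `n_components` should agree.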


## 5) Interpret component loadings (how variables contribute to each PC)

In [None]:

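# Component weights: unit-length eigenvectors of the covariance matrix of the standardized data.
# They are commonly called loadings, although some texts reserve that term for
# eigenvectors scaled by sqrt(eigenvalue).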
loadings = pd.DataFrame(
    pca.components_.T,
    index=df.columns,
    columns=[f'PC{i+1}' for i in range(len(df.columns))]
)
loadings
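
The columns of `pca.components_.T` are unit-length direction vectors. Scaling each column by the square root of its eigenvalue gives the correlation between each variable and each component, which is the quantity some texts call a loading. The sketch below checks that relationship on hypothetical demo data (`rng`, `X_demo`, `Z_demo`, `pca_demo`, and the other names are introduced here). The factor `sqrt((n - 1) / n)` is needed because `StandardScaler` divides by `n` while scikit-learn's `explained_variance_` divides by `n - 1`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data with the same structure as the lab's four scores.
rng = np.random.default_rng(0)
n_demo = 200
ability_demo = rng.normal(75, 10, n_demo)
X_demo = np.column_stack(
    [ability_demo + rng.normal(0, 5, n_demo) for _ in range(3)]
    + [rng.normal(70, 10, n_demo)]
)
Z_demo = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(Z_demo)

weights = pca_demo.components_.T             # shape (n_features, n_components)
col_norms = np.linalg.norm(weights, axis=0)  # each component has length 1

# Direct computation: correlation of every variable with every component score.
scores = pca_demo.transform(Z_demo)
p = Z_demo.shape[1]
corr_var_pc = np.array([
    [np.corrcoef(Z_demo[:, j], scores[:, c])[0, 1] for c in range(p)]
    for j in range(p)
])

# Same quantity from the weights: scale each column by sqrt(eigenvalue),
# then correct for the n versus n - 1 convention mismatch.
scaled_loadings = (
    weights
    * np.sqrt(pca_demo.explained_variance_)
    * np.sqrt((n_demo - 1) / n_demo)
)

print(np.round(col_norms, 6))
print(np.allclose(corr_var_pc, scaled_loadings))
```

Either convention supports the same reading of the components; the scaled version simply puts the entries on a correlation scale between -1 and 1.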


In [None]:

# Visualize loadings as a simple heatmap using matplotlib (no seaborn)
fig, ax = plt.subplots(figsize=(6,3))
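# Diverging colormap with symmetric limits, since weights are signed and centered on 0
im = ax.imshow(loadings.values, aspect='auto', cmap='coolwarm', vmin=-1, vmax=1)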
ax.set_xticks(range(loadings.shape[1]))
ax.set_yticks(range(loadings.shape[0]))
ax.set_xticklabels(loadings.columns)
ax.set_yticklabels(loadings.index)
plt.title('PCA Loadings (Variables x Components)')
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()


## 6) Project data onto the first two components

In [None]:

pca2 = PCA(n_components=2)
PCs = pca2.fit_transform(X_scaled)
pc_df = pd.DataFrame(PCs, columns=['PC1','PC2'])

plt.figure(figsize=(6,5))
plt.scatter(pc_df['PC1'], pc_df['PC2'], alpha=0.8)
plt.title('Students projected on first two principal components')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True, linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

pc_df.head()
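
Two properties of the projected scores are worth checking numerically: the component scores are uncorrelated with each other, and PC1 tracks the hidden factor that generated the academic scores. The sketch below does this on hypothetical demo data (`rng`, `ability_demo`, `X_demo`, `Z_demo`, and `scores` are names introduced here), using a local random generator so the notebook's random state is unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data: the hidden factor is kept so it can be compared to PC1.
rng = np.random.default_rng(2)
n_demo = 200
ability_demo = rng.normal(75, 10, n_demo)
X_demo = np.column_stack(
    [ability_demo + rng.normal(0, 5, n_demo) for _ in range(3)]
    + [rng.normal(70, 10, n_demo)]
)
Z_demo = StandardScaler().fit_transform(X_demo)

scores = PCA(n_components=2).fit_transform(Z_demo)

# Correlation between the PC1 and PC2 scores (zero up to rounding error).
score_corr = np.corrcoef(scores, rowvar=False)

# Correlation between PC1 and the hidden factor. The absolute value is used
# because the sign of a principal component is arbitrary.
pc1_vs_ability = abs(np.corrcoef(scores[:, 0], ability_demo)[0, 1])

print("Corr(PC1, PC2):", round(float(score_corr[0, 1]), 6))
print("|Corr(PC1, hidden ability)|:", round(float(pc1_vs_ability), 3))
```

In real data the hidden factor is not observed, so this comparison is only possible because the data are simulated.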



### 💬 In-class discussion prompts
1. **What does PC1 represent in plain language?**  
   *Hint: Look at the loadings of Math, Science, and English on PC1.*
2. **What does PC2 represent?**  
   *Hint: Does Music load heavily on PC2?*
3. **Why did we standardize the variables before PCA?**
4. **If we removed Music, how would the variance explained by PC1 change?**


## 7) Extension A: Re-run PCA *without* Music

In [None]:

X3 = df[['Math','Science','English']].values
# Use a fresh scaler so the one fitted on all four columns is not overwritten
X3_scaled = StandardScaler().fit_transform(X3)
pca3 = PCA().fit(X3_scaled)

ev3 = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(X3.shape[1])],
    'Explained_Variance_Ratio': pca3.explained_variance_ratio_
})
ev3


In [None]:

plt.figure(figsize=(6,4))
plt.plot(range(1, 1+len(pca3.explained_variance_ratio_)), pca3.explained_variance_ratio_, marker='o')
plt.title('Scree Plot (No Music)')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.xticks(range(1, 1+len(pca3.explained_variance_ratio_)))
plt.grid(True, linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()
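
PC1's share of variance should rise when Music is dropped, and the size of the change can be anticipated from the design. As a sketch under idealized assumptions, suppose $k$ academic subjects share one common pairwise correlation $\rho \ge 0$ and $m$ further variables are uncorrelated with everything (the symbols $k$, $m$, $\rho$, and $\lambda_1$ are introduced here for illustration). The largest eigenvalue of the correlation matrix and PC1's share of total variance are then:

$$
\lambda_1 = 1 + (k - 1)\rho,
\qquad
\frac{\lambda_1}{k + m} = \frac{1 + (k - 1)\rho}{k + m}
$$

With the simulation settings, $\rho = 100 / (100 + 25) = 0.8$ and $k = 3$, so $\lambda_1 = 2.6$. Including Music ($m = 1$) gives $2.6 / 4 = 0.65$; dropping it ($m = 0$) gives $2.6 / 3 \approx 0.867$. Removing Music leaves $\lambda_1$ unchanged and only shrinks the total variance in the denominator. Sample values from 100 students will differ somewhat from these idealized figures.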


## 8) Extension B: Add another correlated variable (Reading)

In [None]:

# Create a new correlated variable (Reading)
reading = academic_ability + np.random.normal(0, 5, n)
df_ext = df.copy()
df_ext['Reading'] = reading

# Re-run PCA
X_ext = df_ext.values
X_ext_scaled = StandardScaler().fit_transform(X_ext)  # fresh scaler; leaves the original `scaler` untouched
pca_ext = PCA().fit(X_ext_scaled)

loadings_ext = pd.DataFrame(
    pca_ext.components_.T,
    index=df_ext.columns,
    columns=[f'PC{i+1}' for i in range(df_ext.shape[1])]
)

ev_ext = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(df_ext.shape[1])],
    'Explained_Variance_Ratio': pca_ext.explained_variance_ratio_
})

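# display() renders each DataFrame as its own table in Jupyter (a bare tuple prints as plain text)
display(ev_ext)
display(loadings_ext)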


In [None]:

# Visualize new loadings
fig, ax = plt.subplots(figsize=(7,3))
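# Diverging colormap with symmetric limits, since weights are signed and centered on 0
im = ax.imshow(loadings_ext.values, aspect='auto', cmap='coolwarm', vmin=-1, vmax=1)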
ax.set_xticks(range(loadings_ext.shape[1]))
ax.set_yticks(range(loadings_ext.shape[0]))
ax.set_xticklabels(loadings_ext.columns)
ax.set_yticklabels(loadings_ext.index)
plt.title('PCA Loadings with Reading Added')
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
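
When a correlated variable such as Reading is added, the PC1 weights of the academic subjects tend to get smaller. That reflects the unit-length constraint on each component, not a drop in importance. The sketch below illustrates this with an idealized population correlation matrix. The helper `pc1_weights` and its parameters are hypothetical, the correlation of 0.8 follows from the simulation settings (100 / (100 + 25)), and the variables are ordered with the correlated block first and the independent variable last, which differs from the column order of `df_ext`.

```python
import numpy as np

def pc1_weights(n_academic, rho=0.8, n_independent=1):
    """Absolute PC1 weights for an idealized correlation matrix.

    Hypothetical helper: the first n_academic variables share pairwise
    correlation rho; the remaining n_independent variables are uncorrelated
    with everything.
    """
    p = n_academic + n_independent
    R = np.eye(p)
    R[:n_academic, :n_academic] = rho
    np.fill_diagonal(R, 1.0)
    # eigh returns eigenvalues in ascending order, so the last column of
    # the eigenvector matrix belongs to the largest eigenvalue.
    _, eigenvectors = np.linalg.eigh(R)
    # Absolute values, because the sign of an eigenvector is arbitrary.
    return np.abs(eigenvectors[:, -1])

w3 = pc1_weights(3)  # Math, Science, English + one independent variable
w4 = pc1_weights(4)  # the same, with a fourth correlated subject added

print(np.round(w3, 3))
print(np.round(w4, 3))
```

In this idealized setting each academic weight moves from 1/sqrt(3) (about 0.577) to 1/sqrt(4) = 0.5, while the independent variable stays at 0 on PC1. The sample weights in the lab should follow the same pattern approximately.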



## 9) Wrap-up

- **PC1** captures the shared variation of Math/Science/English (and Reading, when added), i.e., a *general academic ability* factor.
- **Music** is largely uncorrelated with those subjects, so it contributes primarily to a separate component (often **PC2**).  
- PCA is most useful when you have many correlated variables and want a smaller set of uncorrelated components for visualization, modeling, or interpretation.

**Next steps:** Try swapping in your own datasets (e.g., agricultural features, fitness measures) and repeat this workflow.
