# Feature Standardization in PCA: A Student Performance Example

## Introduction

This hands-on notebook demonstrates why feature standardization is crucial when performing Principal Component Analysis (PCA). We'll use a small, illustrative student performance dataset to show how different measurement scales can hide important patterns in your data.

## Learning Objectives

By the end of this notebook, you will understand:
- Why features measured on larger scales can dominate PCA
- How standardization lets PCA respond to relationships between features rather than to their units
- The practical impact of preprocessing decisions on analysis results

## Setup

Let's start by importing the necessary libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Set style for better plots
plt.style.use('seaborn-v0_8')
np.random.seed(42)  # For reproducible results

## 1. Creating Our Dataset

We'll analyze student performance data where:
- **Math scores**: Measured on a 0-100 scale
- **Reading scores**: Measured on a 0-10 scale

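Notice that the scales differ by a factor of 10! Because variance grows with the square of the scale, the variances of the two features can differ by a factor of about 100.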

In [None]:
# Create realistic student performance data
# Each row represents a student: [Math Score (0-100), Reading Score (0-10)]

student_data = np.array([
    [85, 9.2],  # Good at both
    [45, 8.8],  # Bad at math, good at reading  
    [92, 4.1],  # Good at math, bad at reading
    [78, 7.8],  # Average at both
    [35, 9.5],  # Bad at math, excellent at reading
    [95, 3.2],  # Excellent at math, poor at reading
    [60, 6.0],  # Below average at both
    [88, 5.5],  # Good at math, below average reading
    [52, 8.9],  # Below average math, good reading
    [75, 7.2],  # Decent at both
    [41, 9.1],  # Poor math, good reading
    [89, 4.8],  # Good math, poor reading
])

# Convert to DataFrame for easier handling
df = pd.DataFrame(student_data, columns=['Math_Score', 'Reading_Score'])
df['Student_ID'] = range(1, len(df) + 1)

print("Student Performance Data:")
print(df)
print(f"\nDataset shape: {df[['Math_Score', 'Reading_Score']].shape}")

In [None]:
# Let's examine the basic statistics
print("Descriptive Statistics:")
print(df[['Math_Score', 'Reading_Score']].describe())
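
The `std` row of the summary above already hints at the problem. As a minimal sketch (the scores are re-declared under the illustrative name `scores` so the cell runs on its own), the variances below show how a 10x difference in scale turns into a roughly 100x difference in variance, which is the quantity PCA actually works with:

```python
import numpy as np

# Same scores as student_data above: [Math (0-100), Reading (0-10)]
scores = np.array([
    [85, 9.2], [45, 8.8], [92, 4.1], [78, 7.8], [35, 9.5], [95, 3.2],
    [60, 6.0], [88, 5.5], [52, 8.9], [75, 7.2], [41, 9.1], [89, 4.8],
])

# Sample variance of each column (ddof=1 matches the pandas default)
variances = scores.var(axis=0, ddof=1)
variance_share = variances / variances.sum()

print(f"Math variance:    {variances[0]:.1f}")
print(f"Reading variance: {variances[1]:.1f}")
print(f"Variance ratio (Math / Reading): {variances[0] / variances[1]:.0f}x")
print(f"Math share of total variance:    {variance_share[0]:.1%}")
```

Since PCA looks for directions of maximum variance, a feature that holds almost all of the total variance will almost single-handedly define the first component.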

## 2. Visualizing the Raw Data

In [None]:
# Create a scatter plot of the raw data
plt.figure(figsize=(10, 6))

plt.scatter(df['Math_Score'], df['Reading_Score'], s=100, alpha=0.7)

# Add student ID labels
# (iterrows() upcasts each row to float, so cast the ID back to int for the label)
for _, row in df.iterrows():
    plt.annotate(f'S{int(row["Student_ID"])}',
                 (row['Math_Score'], row['Reading_Score']),
                 xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.xlabel('Math Score (0-100 scale)')
plt.ylabel('Reading Score (0-10 scale)')
plt.title('Student Performance: Raw Data\n(Notice the different scales!)')
plt.grid(True, alpha=0.3)
plt.show()

# Calculate correlation
correlation = df['Math_Score'].corr(df['Reading_Score'])
print(f"Correlation between Math and Reading scores: {correlation:.3f}")
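
Correlation is a useful reference point here because it does not depend on units, whereas covariance does. The sketch below illustrates this by re-expressing Reading on a 0-100 scale; the variable names are illustrative and the scores are copied from the dataset above:

```python
import numpy as np

math_scores = np.array([85, 45, 92, 78, 35, 95, 60, 88, 52, 75, 41, 89], dtype=float)
reading_scores = np.array([9.2, 8.8, 4.1, 7.8, 9.5, 3.2, 6.0, 5.5, 8.9, 7.2, 9.1, 4.8])

# A pure change of units: Reading on a 0-100 scale instead of 0-10
reading_rescaled = reading_scores * 10

r_original = np.corrcoef(math_scores, reading_scores)[0, 1]
r_rescaled = np.corrcoef(math_scores, reading_rescaled)[0, 1]
cov_original = np.cov(math_scores, reading_scores)[0, 1]
cov_rescaled = np.cov(math_scores, reading_rescaled)[0, 1]

print(f"Correlation: {r_original:.3f} (original) vs {r_rescaled:.3f} (rescaled)")
print(f"Covariance:  {cov_original:.2f} (original) vs {cov_rescaled:.2f} (rescaled)")
```

PCA on raw data is driven by covariances, so it inherits this sensitivity to units.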

## 3. PCA Without Standardization

Let's see what happens when we apply PCA directly to the raw data:

In [None]:
# Prepare data (remove Student_ID column)
X = df[['Math_Score', 'Reading_Score']].values

# Apply PCA without standardization
pca_raw = PCA()
X_pca_raw = pca_raw.fit_transform(X)

print("=== PCA WITHOUT STANDARDIZATION ===")
print(f"Explained variance ratio: {pca_raw.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.cumsum(pca_raw.explained_variance_ratio_)}")
print(f"\nFirst Principal Component weights:")
print(f"  Math Score: {pca_raw.components_[0][0]:.3f}")
print(f"  Reading Score: {pca_raw.components_[0][1]:.3f}")
print(f"\nSecond Principal Component weights:")
print(f"  Math Score: {pca_raw.components_[1][0]:.3f}")
print(f"  Reading Score: {pca_raw.components_[1][1]:.3f}")
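
To see where these numbers come from, PCA can be reproduced by hand as an eigen-decomposition of the covariance matrix. This is only a sketch of the idea (scikit-learn itself uses an SVD-based solver), with `X_demo` as an illustrative copy of the scores:

```python
import numpy as np
from sklearn.decomposition import PCA

X_demo = np.array([
    [85, 9.2], [45, 8.8], [92, 4.1], [78, 7.8], [35, 9.5], [95, 3.2],
    [60, 6.0], [88, 5.5], [52, 8.9], [75, 7.2], [41, 9.1], [89, 4.8],
])

# Sample covariance matrix of the two features (np.cov centers the data itself)
cov = np.cov(X_demo, rowvar=False)

# eigh returns eigenvalues in ascending order, so reverse to get PC1 first
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

manual_ratio = eigenvalues / eigenvalues.sum()
sklearn_ratio = PCA().fit(X_demo).explained_variance_ratio_

print("Covariance matrix:\n", cov.round(2))
print("Manual explained variance ratio: ", manual_ratio.round(4))
print("sklearn explained variance ratio:", sklearn_ratio.round(4))
print("PC1 direction (sign is arbitrary):", eigenvectors[:, 0].round(3))
```

The Math variance sits in the top-left entry of the covariance matrix and dwarfs every other entry, which is why PC1 points almost exactly along the Math axis.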

In [None]:
# Visualize PCA results without standardization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Original data with PC directions
ax1.scatter(X[:, 0], X[:, 1], s=100, alpha=0.7)

# Add principal component arrows
mean_point = np.mean(X, axis=0)
pc1_arrow = pca_raw.components_[0] * 40  # Scale for visibility
pc2_arrow = pca_raw.components_[1] * 20

ax1.arrow(mean_point[0], mean_point[1], pc1_arrow[0], pc1_arrow[1], 
          head_width=2, head_length=3, fc='red', ec='red', linewidth=2, 
          label=f'PC1 ({pca_raw.explained_variance_ratio_[0]:.1%} variance)')
ax1.arrow(mean_point[0], mean_point[1], pc2_arrow[0], pc2_arrow[1], 
          head_width=2, head_length=3, fc='blue', ec='blue', linewidth=2,
          label=f'PC2 ({pca_raw.explained_variance_ratio_[1]:.1%} variance)')

ax1.set_xlabel('Math Score (0-100)')
ax1.set_ylabel('Reading Score (0-10)')
ax1.set_title('Raw Data with Principal Components')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Transformed data (PC space)
ax2.scatter(X_pca_raw[:, 0], X_pca_raw[:, 1], s=100, alpha=0.7)
ax2.set_xlabel(f'First PC ({pca_raw.explained_variance_ratio_[0]:.1%} variance)')
ax2.set_ylabel(f'Second PC ({pca_raw.explained_variance_ratio_[1]:.1%} variance)')
ax2.set_title('Data in Principal Component Space')
ax2.grid(True, alpha=0.3)
ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax2.axvline(x=0, color='k', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

## 4. PCA With Standardization

Now let's see what happens when we standardize the features first:

In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to standardized data
pca_scaled = PCA()
X_pca_scaled = pca_scaled.fit_transform(X_scaled)

print("=== PCA WITH STANDARDIZATION ===")
print(f"Explained variance ratio: {pca_scaled.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.cumsum(pca_scaled.explained_variance_ratio_)}")
print(f"\nFirst Principal Component weights:")
print(f"  Math Score: {pca_scaled.components_[0][0]:.3f}")
print(f"  Reading Score: {pca_scaled.components_[0][1]:.3f}")
print(f"\nSecond Principal Component weights:")
print(f"  Math Score: {pca_scaled.components_[1][0]:.3f}")
print(f"  Reading Score: {pca_scaled.components_[1][1]:.3f}")
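
With only two standardized features, the explained variance ratios can be predicted directly from the correlation coefficient `r`: the eigenvalues of the 2x2 correlation matrix are `1 + |r|` and `1 - |r|`, so PC1 explains `(1 + |r|) / 2` of the variance. A short sketch to check this, with `X_demo` as an illustrative copy of the scores:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = np.array([
    [85, 9.2], [45, 8.8], [92, 4.1], [78, 7.8], [35, 9.5], [95, 3.2],
    [60, 6.0], [88, 5.5], [52, 8.9], [75, 7.2], [41, 9.1], [89, 4.8],
])

# Pearson correlation between Math and Reading
r = np.corrcoef(X_demo, rowvar=False)[0, 1]

# Prediction from the eigenvalues of [[1, r], [r, 1]]
predicted_ratio = np.array([1 + abs(r), 1 - abs(r)]) / 2

# What scikit-learn reports for the standardized data
X_std = StandardScaler().fit_transform(X_demo)
observed_ratio = PCA().fit(X_std).explained_variance_ratio_

print(f"r = {r:.3f}")
print("Predicted ratio:", predicted_ratio.round(4))
print("Observed ratio: ", observed_ratio.round(4))
```

In other words, after standardization the variance split tells you how strongly the two subjects are correlated, not which one uses bigger numbers.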

In [None]:
# Show the standardized data statistics
print("Standardized Data Statistics:")
df_scaled = pd.DataFrame(X_scaled, columns=['Math_Score_Scaled', 'Reading_Score_Scaled'])
print(df_scaled.describe())
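
One detail in the statistics above may look surprising: the reported `std` is slightly greater than 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The sketch below, using an illustrative copy of the scores, shows both views:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([
    [85, 9.2], [45, 8.8], [92, 4.1], [78, 7.8], [35, 9.5], [95, 3.2],
    [60, 6.0], [88, 5.5], [52, 8.9], [75, 7.2], [41, 9.1], [89, 4.8],
])

X_std = StandardScaler().fit_transform(X_demo)
n = X_std.shape[0]

population_std = X_std.std(axis=0, ddof=0)  # what StandardScaler normalizes to 1
sample_std = X_std.std(axis=0, ddof=1)      # what pandas' describe() reports

print("Population std (ddof=0):", population_std.round(4))
print("Sample std (ddof=1):    ", sample_std.round(4))
print(f"Expected sample std sqrt(n / (n - 1)): {np.sqrt(n / (n - 1)):.4f}")
```

The difference is a constant factor applied to every feature, so it does not change the principal directions or the explained variance ratios.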

In [None]:
# Visualize PCA results with standardization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Standardized data with PC directions
ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], s=100, alpha=0.7)

# Add principal component arrows
mean_scaled = np.mean(X_scaled, axis=0)
pc1_arrow_scaled = pca_scaled.components_[0] * 2  # Scale for visibility
pc2_arrow_scaled = pca_scaled.components_[1] * 2

ax1.arrow(mean_scaled[0], mean_scaled[1], pc1_arrow_scaled[0], pc1_arrow_scaled[1], 
          head_width=0.1, head_length=0.15, fc='red', ec='red', linewidth=2,
          label=f'PC1 ({pca_scaled.explained_variance_ratio_[0]:.1%} variance)')
ax1.arrow(mean_scaled[0], mean_scaled[1], pc2_arrow_scaled[0], pc2_arrow_scaled[1], 
          head_width=0.1, head_length=0.15, fc='blue', ec='blue', linewidth=2,
          label=f'PC2 ({pca_scaled.explained_variance_ratio_[1]:.1%} variance)')

ax1.set_xlabel('Math Score (standardized)')
ax1.set_ylabel('Reading Score (standardized)')
ax1.set_title('Standardized Data with Principal Components')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Transformed data (PC space)
ax2.scatter(X_pca_scaled[:, 0], X_pca_scaled[:, 1], s=100, alpha=0.7)
ax2.set_xlabel(f'First PC ({pca_scaled.explained_variance_ratio_[0]:.1%} variance)')
ax2.set_ylabel(f'Second PC ({pca_scaled.explained_variance_ratio_[1]:.1%} variance)')
ax2.set_title('Standardized Data in Principal Component Space')
ax2.grid(True, alpha=0.3)
ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax2.axvline(x=0, color='k', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Side-by-Side Comparison

Let's create a comprehensive comparison to highlight the differences:

In [None]:
# Create a summary comparison
comparison_data = {
    'Metric': [
        'PC1 Explained Variance',
        'PC2 Explained Variance', 
        'PC1 Math Weight',
        'PC1 Reading Weight',
        'PC2 Math Weight',
        'PC2 Reading Weight'
    ],
    'Without Standardization': [
        f"{pca_raw.explained_variance_ratio_[0]:.3f}",
        f"{pca_raw.explained_variance_ratio_[1]:.3f}",
        f"{pca_raw.components_[0][0]:.3f}",
        f"{pca_raw.components_[0][1]:.3f}",
        f"{pca_raw.components_[1][0]:.3f}",
        f"{pca_raw.components_[1][1]:.3f}"
    ],
    'With Standardization': [
        f"{pca_scaled.explained_variance_ratio_[0]:.3f}",
        f"{pca_scaled.explained_variance_ratio_[1]:.3f}",
        f"{pca_scaled.components_[0][0]:.3f}",
        f"{pca_scaled.components_[0][1]:.3f}",
        f"{pca_scaled.components_[1][0]:.3f}",
        f"{pca_scaled.components_[1][1]:.3f}"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("COMPARISON SUMMARY:")
print("=" * 60)
print(comparison_df.to_string(index=False))

In [None]:
# Create a visual comparison of explained variance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Without standardization
components = ['PC1', 'PC2']
variance_raw = pca_raw.explained_variance_ratio_
ax1.bar(components, variance_raw, color=['red', 'blue'], alpha=0.7)
ax1.set_title('Explained Variance\n(Without Standardization)')
ax1.set_ylabel('Explained Variance Ratio')
ax1.set_ylim(0, 1)
for i, v in enumerate(variance_raw):
    ax1.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

# With standardization  
variance_scaled = pca_scaled.explained_variance_ratio_
ax2.bar(components, variance_scaled, color=['red', 'blue'], alpha=0.7)
ax2.set_title('Explained Variance\n(With Standardization)')
ax2.set_ylabel('Explained Variance Ratio')
ax2.set_ylim(0, 1)
for i, v in enumerate(variance_scaled):
    ax2.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## 6. Interpreting the Results

In [None]:
print("🎯 KEY INSIGHTS:")
print("=" * 50)

print("\n📊 WITHOUT STANDARDIZATION:")
print(f"• PC1 explains {pca_raw.explained_variance_ratio_[0]:.1%} of variance")
print(f"• PC1 is dominated by Math scores (weight: {pca_raw.components_[0][0]:.3f})")
print(f"• Reading scores barely contribute (weight: {pca_raw.components_[0][1]:.3f})")
print("• This happens because Math scores have a much larger variance, a side effect of their wider scale!")

print("\n📊 WITH STANDARDIZATION:")
print(f"• PC1 explains {pca_scaled.explained_variance_ratio_[0]:.1%} of variance")
print("• Both subjects carry equal weight in magnitude (about 0.707 each,")
print("  as is always the case for two standardized, correlated features):")
print(f"  - Math weight: {pca_scaled.components_[0][0]:.3f}")
print(f"  - Reading weight: {pca_scaled.components_[0][1]:.3f}")

# Interpret the pattern
if pca_scaled.components_[0][0] * pca_scaled.components_[0][1] < 0:
    print("• PC1 reveals a 'trade-off' pattern: students tend to excel in one subject!")
else:
    print("• PC1 reveals students who are good/bad at both subjects!")

print(f"\n📈 VARIANCE DISTRIBUTION:")
print(f"• Without standardization: {pca_raw.explained_variance_ratio_[0]:.1%} vs {pca_raw.explained_variance_ratio_[1]:.1%}")
print(f"• With standardization: {pca_scaled.explained_variance_ratio_[0]:.1%} vs {pca_scaled.explained_variance_ratio_[1]:.1%}")
print("• After standardization, the variance split reflects the correlation between subjects, not their units!")
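
A caveat when reading the standardized weights: with exactly two standardized features, the PC1 weights have magnitude 1/√2 ≈ 0.707 whenever the features are correlated at all. The informative parts are the relative sign of the weights and the explained variance. The sketch below checks this on synthetic data; the seed, sample size, and target correlations are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
pc1_abs_loadings = []

for target_r in (0.2, -0.5, 0.9):
    # Synthetic two-feature data with roughly the requested correlation
    sample = rng.multivariate_normal(
        mean=[0.0, 0.0], cov=[[1.0, target_r], [target_r, 1.0]], size=200
    )
    # Principal directions of standardized data are the eigenvectors
    # of the correlation matrix
    corr = np.corrcoef(sample, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    pc1 = eigenvectors[:, -1]                         # largest eigenvalue is last
    pc1_abs_loadings.append(np.abs(pc1))
    print(f"r = {corr[0, 1]:+.2f} -> |PC1 weights| = {np.abs(pc1).round(3)}, "
          f"PC1 share = {eigenvalues[-1] / eigenvalues.sum():.1%}")
```

With three or more features the standardized weights are no longer forced to be equal, so their magnitudes become informative as well.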

## 7. Interactive Exercise

Try modifying the dataset and see how it affects the results:

In [None]:
# 🎮 EXERCISE: Create your own student data
# Try different patterns and see how standardization affects PCA results

# Example: What if students were generally good at both subjects?
exercise_data = np.array([
    [85, 8.5],  # Good at both
    [90, 9.0],  # Good at both  
    [78, 7.8],  # Good at both
    [82, 8.2],  # Good at both
    [88, 8.8],  # Good at both
    [75, 7.5],  # Decent at both
])

print("🎮 EXERCISE RESULTS:")
print("Try modifying the exercise_data array above with different patterns!")
print("Suggestions:")
print("1. All students good at both subjects")
print("2. Strong positive correlation between subjects") 
print("3. Random/no correlation between subjects")
print("\nRun PCA on your data and compare with/without standardization!")

# Uncomment below to test your exercise data:
# X_exercise = exercise_data
# pca_ex = PCA().fit(X_exercise)
# pca_ex_scaled = PCA().fit(StandardScaler().fit_transform(X_exercise))
# print(f"Raw data PC1 variance: {pca_ex.explained_variance_ratio_[0]:.3f}")
# print(f"Scaled data PC1 variance: {pca_ex_scaled.explained_variance_ratio_[0]:.3f}")
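
If you want to try several datasets quickly, the commented-out lines above can be wrapped in a small helper. `compare_pca` is a hypothetical name introduced here for convenience, not part of scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def compare_pca(data):
    """Return the PC1 explained-variance ratio for raw and standardized data."""
    data = np.asarray(data, dtype=float)
    raw_ratio = PCA().fit(data).explained_variance_ratio_[0]
    scaled = StandardScaler().fit_transform(data)
    scaled_ratio = PCA().fit(scaled).explained_variance_ratio_[0]
    return raw_ratio, scaled_ratio


# Same values as exercise_data above
exercise_demo = np.array([
    [85, 8.5], [90, 9.0], [78, 7.8], [82, 8.2], [88, 8.8], [75, 7.5],
])

raw_ratio, scaled_ratio = compare_pca(exercise_demo)
print(f"Raw data PC1 variance:    {raw_ratio:.3f}")
print(f"Scaled data PC1 variance: {scaled_ratio:.3f}")
```

In this particular example each Reading score is exactly one tenth of the Math score, so the two features are perfectly correlated and PC1 explains essentially all of the variance with or without standardization. Try adding some noise to one column to see the two results diverge.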

## 8. Key Takeaways

In [None]:
print("🎯 IMPORTANT LESSONS:")
print("=" * 50)
print()
print("1. 📏 SCALE MATTERS:")
print("   Features with larger variance (often just larger units) dominate PCA")
print("   This can hide important patterns in your data!")
print()
print("2. 🔧 STANDARDIZATION HELPS:")
print("   Rescaling every feature to mean=0, std=1 puts them on equal footing,")
print("   so PCA responds to correlations rather than measurement units")
print()
print("3. 🔍 INTERPRETATION CHANGES:")
print("   Raw data: PCA might just reflect measurement scales")
print("   Standardized: PCA reflects how the features vary together")
print()
print("4. ⚖️ BALANCE IN COMPONENTS:")
print("   Standardization often leads to more balanced variance")
print("   distribution across principal components")
print()
print("5. 🤔 WHEN TO STANDARDIZE:")
print("   ✅ Features have different units (dollars vs. years)")
print("   ✅ Features have very different scales")
print("   ✅ You want equal consideration of all features")
print("   ❌ Features are already on similar scales")
print("   ❌ The scale difference is meaningful for your analysis")
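
When you do decide to standardize, one common way to keep the scaling and the projection together is a scikit-learn pipeline. The sketch below is one possible setup, not the only one; `X_demo` and `new_student` are illustrative values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo = np.array([
    [85, 9.2], [45, 8.8], [92, 4.1], [78, 7.8], [35, 9.5], [95, 3.2],
    [60, 6.0], [88, 5.5], [52, 8.9], [75, 7.2], [41, 9.1], [89, 4.8],
])

# Scaling and PCA are fitted together, so they cannot get out of sync
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_projected = pca_pipeline.fit_transform(X_demo)

# A new observation is scaled with the training mean/std before projection
new_student = np.array([[70, 6.5]])  # hypothetical student, for illustration
new_projection = pca_pipeline.transform(new_student)

# make_pipeline names each step after its lowercased class name
fitted_pca = pca_pipeline.named_steps["pca"]
print("Explained variance ratio:", fitted_pca.explained_variance_ratio_.round(3))
print("New student in PC space: ", new_projection.round(3))
```

Bundling the steps this way helps ensure that new data is transformed with the statistics learned from the original data rather than being re-standardized on its own.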

## 9. Next Steps

In [None]:
print("🚀 WHAT'S NEXT?")
print("=" * 30)
print("• Learn about other preprocessing techniques (normalization, robust scaling)")
print("• Explore PCA with real datasets (iris, wine, digits)")
print("• Study how to choose the number of principal components")
print("• Practice interpreting principal components in different domains")
print("• Learn about other dimensionality reduction techniques (t-SNE, UMAP)")

## Summary

In this notebook, we demonstrated why feature standardization is crucial for PCA:

- **Without standardization**: PCA was dominated by the Math scores due to their larger scale (0-100 vs 0-10)
- **With standardization**: PCA weighted both subjects equally and exposed the trade-off between Math and Reading performance in this dataset

Remember: **Always consider the scales of your features before applying PCA!** Standardization ensures that each feature contributes fairly to the analysis, allowing you to discover meaningful patterns rather than artifacts of measurement scales.