# Factor Analysis vs PCA: Educational Assessment Data

This analysis demonstrates both Factor Analysis (FA) and Principal Component Analysis (PCA)
on student assessment data to identify latent constructs and understand dimensionality.

**Learning objectives:**
- Apply Factor Analysis to discover latent psychological constructs
- Apply PCA to identify underlying dimensions in multivariate data
- Understand communalities and uniquenesses in measurement models
- Use factor rotation to achieve simple structure
- Compare Factor Analysis with Principal Component Analysis
- Interpret factor loadings for construct validation
- Understand the fundamental differences between FA and PCA

## Import Libraries and Setup

**Task:** Set up your Python environment for both Factor Analysis and PCA by importing the necessary libraries. You'll need:
- pandas for data handling
- numpy for numerical operations
- matplotlib and seaborn for visualization
- scikit-learn's PCA and StandardScaler
- factor_analyzer package (including FactorAnalyzer, calculate_kmo, and calculate_bartlett_sphericity)
- Configure a basic logger for tracking analysis steps

In [None]:
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simple logger
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

## Data Loading and Exploration

**Task:** Load the educational assessment data from the CSV file in the current directory. Check if the file exists (if not, inform the user to run the fetch script and exit with code 1). Extract the assessment variables (excluding the Student ID column) and log basic information about the dataset dimensions and variable names.

In [None]:
script_dir = Path.cwd()
data_path = script_dir / "educational.csv"

if not data_path.exists():
    logger.error(f"Data file not found: {data_path}")
    logger.info("Run 'fetch_educational.py' to generate the required data file")
    sys.exit(1)

df = pd.read_csv(data_path)
logger.info(
    f"Loaded dataset: {len(df)} students, {len(df.columns) - 1} assessment variables"
)

# Extract assessment variables (exclude Student ID)
X = df.iloc[:, 1:]
variable_names = list(X.columns)

logger.info(f"Assessment variables: {variable_names}")
logger.info(f"Data shape: {X.shape}")

## Data Standardization

**Task:** Standardize the assessment data using StandardScaler so all variables have mean 0 and standard deviation 1. Log a confirmation message after standardization.

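Standardizing puts every variable on a common scale so that each contributes equally to the analysis,
regardless of its original measurement scale. This matters most for PCA, which is otherwise dominated
by high-variance variables. Factor Analysis fitted to a correlation matrix is scale-invariant, but
feeding the same standardized matrix to both methods keeps the results directly comparable.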

In [None]:
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

logger.info("Data standardized: mean ≈ 0, std ≈ 1 for all variables")
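
To make the effect of standardization concrete, here is a minimal NumPy sketch on made-up data (`scores_demo` and its three measurement scales are hypothetical, not the assessment file). It checks that the covariance matrix of z-scored data equals the correlation matrix of the raw data, which is why both methods effectively analyze correlations from this point on.

```python
import numpy as np

# Hypothetical scores on very different scales (not the assessment data)
rng = np.random.default_rng(0)
scores_demo = np.column_stack([
    rng.normal(500, 100, size=200),  # e.g. a test scored in the hundreds
    rng.normal(3.0, 0.5, size=200),  # e.g. a 1-5 rating
    rng.normal(70, 10, size=200),    # e.g. a percentage
])

# Z-score each column with the population standard deviation (ddof=0),
# the same convention StandardScaler uses
z_demo = (scores_demo - scores_demo.mean(axis=0)) / scores_demo.std(axis=0)

# Covariance of the standardized data (ddof=0) vs. correlation of the raw data
cov_z = np.cov(z_demo, rowvar=False, ddof=0)
corr_raw = np.corrcoef(scores_demo, rowvar=False)

print(np.round(cov_z, 3))
print("Matches raw correlation matrix:", np.allclose(cov_z, corr_raw))
```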

## Factor Analysis Assumptions Testing

**Task:** Test the statistical assumptions for Factor Analysis by calculating Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. Log the chi-square statistic and p-value for Bartlett's test (with interpretation whether p < 0.05). For KMO, log the overall measure and classify it as Excellent (>0.9), Good (>0.8), Acceptable (>0.6), or Unacceptable.

Before proceeding with Factor Analysis, we must verify that our data meets
key statistical assumptions for meaningful factor extraction.

In [None]:
# Test statistical assumptions
chi_square_value, p_value = calculate_bartlett_sphericity(X_standardized)
kmo_all, kmo_model = calculate_kmo(X_standardized)

logger.info("Factor Analysis Assumptions Testing:")
logger.info("\nBartlett's Test of Sphericity:")
logger.info(f"  Chi-square statistic: {chi_square_value:.3f}")
logger.info(f"  p-value: {p_value:.6f}")
if p_value < 0.05:
    logger.info("  ✓ Significant - variables are sufficiently correlated for FA")
else:
    logger.info("  ✗ Not significant - FA may not be appropriate")

logger.info("\nKaiser-Meyer-Olkin (KMO) Test:")
logger.info(f"  Overall Measure of Sampling Adequacy: {kmo_model:.3f}")
if kmo_model > 0.9:
    adequacy = "Excellent"
elif kmo_model > 0.8:
    adequacy = "Good"
elif kmo_model > 0.6:
    adequacy = "Acceptable"
else:
    adequacy = "Unacceptable"
logger.info(f"  Interpretation: {adequacy} sampling adequacy")
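
For intuition about what Bartlett's test measures, the statistic can be reproduced by hand from the determinant of the correlation matrix. The sketch below is an illustration on synthetic data, not a replacement for `calculate_bartlett_sphericity`; the helper name `bartlett_statistic` and both demo arrays are assumptions made for this example.

```python
import numpy as np

def bartlett_statistic(data):
    """Return Bartlett's chi-square statistic and its degrees of freedom."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet is numerically safer than log(det(...)) for near-singular matrices
    _, log_det = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) / 2
    return chi2, dof

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))

# Four variables driven by one shared factor vs. four unrelated variables
correlated = 0.8 * latent + 0.6 * rng.normal(size=(300, 4))
independent = rng.normal(size=(300, 4))

chi2_corr, dof = bartlett_statistic(correlated)
chi2_ind, _ = bartlett_statistic(independent)

print(f"Correlated data:  chi2 = {chi2_corr:.1f} (df = {dof:.0f})")
print(f"Independent data: chi2 = {chi2_ind:.1f} (df = {dof:.0f})")
```

A correlation matrix close to the identity has a determinant near 1, so its log is near 0 and the statistic stays small. Strong correlations push the determinant toward 0 and the statistic up.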

### Individual Variable Adequacy

**Task:** Extract and log the individual KMO (Measure of Sampling Adequacy) value for each variable. Identify any variables with MSA < 0.6 and flag them as problematic, or confirm that all variables show adequate sampling adequacy.

Each variable's individual KMO value indicates how well it can be predicted
from the other variables in the analysis.

In [None]:
logger.info("\nIndividual Variable Sampling Adequacy:")
for i, var_name in enumerate(variable_names):
    msa_value = kmo_all[i]
    logger.info(f"  {var_name}: {msa_value:.3f}")

# Flag any problematic variables
low_msa_vars = [
    var_name for i, var_name in enumerate(variable_names) if kmo_all[i] < 0.6
]
if low_msa_vars:
    logger.warning(f"Variables with low MSA (<0.6): {low_msa_vars}")
else:
    logger.info("All variables show adequate sampling adequacy (MSA ≥ 0.6)")
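
The KMO measure compares squared correlations with squared partial correlations. The following NumPy sketch shows one way to compute it by hand on synthetic single-factor data; `kmo_by_hand` and `demo_data` are hypothetical names, and the function is meant to illustrate the formula rather than mirror the internals of `calculate_kmo`.

```python
import numpy as np

def kmo_by_hand(data):
    """Return (per-variable MSA, overall KMO) for a 2-D data array."""
    corr = np.corrcoef(data, rowvar=False)
    inv_corr = np.linalg.inv(corr)

    # Partial correlations come from the scaled inverse correlation matrix
    scale = np.sqrt(np.diag(inv_corr))
    partial = -inv_corr / np.outer(scale, scale)

    # Only off-diagonal entries enter the KMO ratio
    np.fill_diagonal(corr, 0.0)
    np.fill_diagonal(partial, 0.0)

    r2 = corr**2
    q2 = partial**2
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + q2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + q2.sum())
    return per_variable, overall

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))
demo_data = 0.8 * latent + 0.6 * rng.normal(size=(300, 5))

msa_demo, kmo_demo = kmo_by_hand(demo_data)
print("Per-variable MSA:", np.round(msa_demo, 3))
print(f"Overall KMO: {kmo_demo:.3f}")
```

When variables share a common factor, controlling for the other variables removes most of each pairwise correlation, so the partial correlations shrink and the ratio approaches 1.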

## Factor Extraction with Principal Axis Factoring

**Task:** Extract 3 factors using Principal Axis Factoring (method='principal') without rotation. Use the FactorAnalyzer class, fit it to the standardized data, and verify that the extraction succeeded by checking if loadings were produced. Log the eigenvalues for the extracted factors rounded to 3 decimal places.

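We'll extract factors using Principal Axis Factoring (PAF), which in its textbook form:
- Focuses on shared variance among variables (common factors)
- Estimates communalities iteratively
- Distinguishes between common and unique variance

Note that `method="principal"` in `factor_analyzer` appears to be a non-iterative principal-factor extraction based on an SVD of the standardized data, so its loadings can be very close to PCA loadings. If iterative communality estimation matters for your analysis, compare the results against `method="minres"`, the package default.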

In [None]:
# Determine number of factors to extract
n_factors = 3  # Based on theoretical expectation of quantitative, verbal, and interpersonal factors

fa_unrotated = FactorAnalyzer(n_factors=n_factors, rotation=None, method="principal")
fa_unrotated.fit(X_standardized)

# Verify successful extraction
if fa_unrotated.loadings_ is None:
    logger.error("Factor extraction failed - no loadings produced")
    sys.exit(1)

logger.info(f"Factor Analysis Results ({n_factors} factors extracted):")
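# get_eigenvalues() returns a tuple: (eigenvalues of the original correlation
# matrix, common-factor eigenvalues). Index 0 selects the original eigenvalues.
eigenvalues_fa = fa_unrotated.get_eigenvalues()[0]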
logger.info(f"Eigenvalues: {np.round(eigenvalues_fa[:n_factors], 3)}")

### Communalities and Variance Decomposition

**Task:** Extract the communalities from the factor analysis and calculate uniquenesses (1 - communality). For each variable, log its communality (h²) and uniqueness (u²) values rounded to 3 decimal places. Then calculate and log the total common variance (sum of communalities), the proportion of total variance explained by factors, and the average communality.

Each variable's variance is decomposed into:
- **Communality (h²)**: Variance explained by common factors
- **Uniqueness (u²)**: Variance unique to the variable (including error)

In [None]:
communalities = fa_unrotated.get_communalities()
uniquenesses = 1 - communalities

logger.info("\nVariance Decomposition (Communalities and Uniquenesses):")
for i, var_name in enumerate(variable_names):
    h2 = communalities[i]
    u2 = uniquenesses[i]
    logger.info(f"  {var_name}: h² = {h2:.3f}, u² = {u2:.3f}")

# Analyze overall variance structure
factor_variance = np.sum(communalities)
total_variance = len(variable_names)  # For standardized data
variance_explained_fa = factor_variance / total_variance

logger.info("\nOverall Variance Analysis:")
logger.info(f"Total standardized variance: {total_variance:.1f}")
logger.info(f"Common variance (Σh²): {factor_variance:.3f}")
logger.info(f"Proportion of variance explained by factors: {variance_explained_fa:.1%}")
logger.info(f"Average communality: {np.mean(communalities):.3f}")

**Interpreting Communalities:**

- **High communality (h² > 0.6)**: Variable strongly related to common factors
- **Moderate communality (0.3 < h² < 0.6)**: Moderate factor relationship
- **Low communality (h² < 0.3)**: Mostly unique variance, weak factor loading

All nine assessment variables show high communalities, indicating they are well-explained
by the three underlying educational constructs (Quantitative, Verbal, Interpersonal).
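
As a small worked example of the decomposition, a communality is the row sum of squared loadings and the uniqueness is its complement. The loading matrix below is invented for illustration (`loadings_demo` is not output from this analysis).

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns)
loadings_demo = np.array([
    [0.80, 0.10],
    [0.75, 0.20],
    [0.15, 0.85],
    [0.10, 0.70],
])

# Communality h² = sum of squared loadings across factors for each variable
communalities_demo = (loadings_demo**2).sum(axis=1)
# Uniqueness u² = 1 - h² for standardized variables
uniquenesses_demo = 1 - communalities_demo

# Share of total standardized variance accounted for by the common factors
proportion_explained = communalities_demo.sum() / loadings_demo.shape[0]

for h2, u2 in zip(communalities_demo, uniquenesses_demo):
    print(f"h² = {h2:.4f}, u² = {u2:.4f}")
print(f"Proportion of total variance: {proportion_explained:.4f}")
```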

## Inspect and Interpret Unrotated Factors

**Task:** Before applying rotation, examine and interpret the unrotated factor solution. Display the unrotated loadings matrix as a DataFrame with proper variable and factor labels. Analyze the loading patterns to identify: (1) what each factor represents, (2) problems with interpretability (general factors, bipolar contrasts, lack of simple structure), and (3) why rotation is needed to achieve theory-aligned, interpretable factors.

Understanding the unrotated solution reveals why rotation is essential for achieving simple structure and practical interpretability in factor analysis.

In [None]:
# Extract the unrotated loadings from the fitted model
loadings_unrotated = fa_unrotated.loadings_

# Create a detailed DataFrame of unrotated loadings
unrotated_loadings_df = pd.DataFrame(
    loadings_unrotated,
    columns=[f"Factor {i+1}" for i in range(n_factors)],
    index=variable_names
)

logger.info("\n" + "="*70)
logger.info("UNROTATED FACTOR LOADINGS ANALYSIS")
logger.info("="*70)

print("\nUnrotated Factor Loadings Matrix:")
print(unrotated_loadings_df.round(3))
print()

### Interpretation of Unrotated Factor Loadings

Looking at the unrotated loadings table above, we can identify three distinct patterns:

#### **Factor 1: General Educational Ability (The "g-factor")**
- **ALL variables** load positively and moderately high (0.577 to 0.737)
- This is the classic "general factor" pattern commonly seen in educational and psychological research
- **Interpretation:** Students who score high on Factor 1 tend to score high on ALL assessments across domains
- **Problem:** This doesn't help us understand specific abilities - it represents overall academic competence rather than distinct constructs
- **Not actionable:** We can't say "this student is strong in Factor 1" and know what specific skills they have

#### **Factor 2: Interpersonal vs. Cognitive Skills Contrast**
- **Positive loadings:** Collaboration (0.706), Leadership (0.710), Communication (0.650)
- **Negative loadings:** Math (-0.478), Algebra (-0.438), Geometry (-0.464), Reading (-0.175), Vocabulary (-0.153), Writing (-0.097)
- **Interpretation:** This is a **bipolar contrast dimension**
  - High Factor 2 score = Strong interpersonal skills relative to cognitive skills
  - Low Factor 2 score = Strong cognitive skills relative to interpersonal skills
- **Problem:** The bipolar nature creates confusion. Does a high score mean:
  - Good at interpersonal skills?
  - Bad at math/reading?
  - Both?

#### **Factor 3: Quantitative vs. Verbal Contrast**
- **Positive loadings:** Math (0.422), Algebra (0.555), Geometry (0.338)
- **Negative loadings:** Reading (-0.527), Vocabulary (-0.588), Writing (-0.558)
- **Weak/Mixed:** Interpersonal skills (0.107 to 0.212)
- **Interpretation:** Another **bipolar contrast dimension**
  - High Factor 3 = Stronger quantitative skills relative to verbal skills
  - Low Factor 3 = Stronger verbal skills relative to quantitative skills
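- **Problem:** Read in isolation, this seems to suggest students can't be good at BOTH math and reading. In fact the factor only describes a contrast that remains after Factor 1 has absorbed general ability, which makes it easy to misinterpret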

### **Key Problems with Unrotated Solution:**

**Problem 1: Lack of Simple Structure**
   - Most variables have substantial loadings on multiple factors
   - Example: MathScore loads 0.687 on F1, -0.478 on F2, and 0.422 on F3
   - Makes it unclear which factor "owns" each variable

**Problem 2: Difficult Substantive Interpretation**
   - What does "high on Factor 1" really mean beyond "generally capable"?
   - Bipolar factors create ambiguous interpretations

**Problem 3: Bipolar Factors Create Confusion**
   - Negative loadings suggest inverse relationships that may not be theoretically meaningful
   - "Good at X but bad at Y" is harder to interpret than "good at X"

**Problem 4: Not Theory-Aligned**
   - We theoretically expect 3 **independent, positive** constructs:
     - Quantitative Reasoning (Math, Algebra, Geometry)
     - Verbal Ability (Reading, Vocabulary, Writing)
     - Interpersonal Skills (Collaboration, Leadership, Communication)
   - Unrotated solution gives us contrasts and a general factor instead

**Problem 5: Low Practical Utility**
   - Cannot easily create subscales or interpret student profiles
   - Difficult to communicate results to educators or stakeholders

### **Why These Patterns Emerge:**

The unrotated solution is mathematically optimal for **variance extraction**:
- Factor 1 extracts maximum variance (eigenvalue = 4.041)
- Factor 2 extracts maximum remaining variance (eigenvalue = 2.124)
- Factor 3 extracts maximum remaining variance (eigenvalue = 1.635)

However, **maximum variance does not equal maximum interpretability**. The unrotated factors are orthogonal (uncorrelated) but arranged to maximize variance extraction, not psychological meaning.

### **What We Need: Rotation**

Rotation will transform these factors to achieve:

**Benefit 1: Simple Structure** - Each variable loads primarily on ONE factor

**Benefit 2: Clear Interpretation** - Factors represent distinct, interpretable constructs

**Benefit 3: Elimination of Bipolar Factors** - All loadings become predominantly positive

**Benefit 4: Theory Alignment** - Factors match expected Quantitative, Verbal, and Interpersonal dimensions

**Benefit 5: Practical Utility** - Easy to create subscales and interpret student profiles

**Important:** Rotation does NOT change:
- The communalities (h²) for each variable
- The total variance explained
- The fundamental fit of the model
- The orthogonality of factors (for Varimax rotation)

An orthogonal rotation such as Varimax is simply a **rigid transformation** of the factor space: it preserves the communalities and the reproduced correlation matrix, and only redistributes the explained variance among the individual factors.
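
The invariance properties listed above can be checked numerically. This sketch applies an arbitrary orthogonal rotation (35 degrees, chosen only for illustration) to a made-up two-factor loading matrix; `loadings_demo` and `rotation` are hypothetical and unrelated to the fitted model.

```python
import numpy as np

# Hypothetical unrotated loadings: a general factor plus a bipolar contrast
loadings_demo = np.array([
    [0.70, 0.40],
    [0.65, 0.45],
    [0.60, -0.50],
    [0.55, -0.45],
])

# Any orthogonal matrix will do; here a plane rotation by 35 degrees
theta = np.deg2rad(35)
rotation = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta), np.cos(theta)],
])
rotated_demo = loadings_demo @ rotation

# Communalities (row sums of squares) before and after
h2_before = (loadings_demo**2).sum(axis=1)
h2_after = (rotated_demo**2).sum(axis=1)

# Variance attributed to each factor (column sums of squares)
var_before = (loadings_demo**2).sum(axis=0)
var_after = (rotated_demo**2).sum(axis=0)

print("Communalities unchanged:", np.allclose(h2_before, h2_after))
print("Per-factor variance before:", np.round(var_before, 3))
print("Per-factor variance after: ", np.round(var_after, 3))
```

The total explained variance is the same in both solutions, while the split between the two factors changes.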

## Factor Rotation for Simple Structure

**Task:** Extract 3 factors again using Principal Axis Factoring but this time with Varimax rotation. Store both the unrotated and rotated loadings matrices. Include a safety check in case rotation fails (use unrotated solution as fallback). Create a comparison DataFrame showing the unrotated and rotated loadings side-by-side for all three factors, and display it rounded to 3 decimal places.

Factor rotation improves interpretability without changing the fundamental solution.
Varimax rotation seeks "simple structure" where each variable loads primarily on one factor.

In [None]:
fa_rotated = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
fa_rotated.fit(X_standardized)

loadings_unrotated = fa_unrotated.loadings_
loadings_rotated = fa_rotated.loadings_

# Safety check for rotation success
if loadings_rotated is None:
    logger.warning("Varimax rotation failed, using unrotated solution")
    loadings_rotated = loadings_unrotated
    fa_rotated = fa_unrotated

logger.info("\nFactor Loadings Comparison (Unrotated vs. Varimax Rotated):")

# Create comparison table
comparison_df = pd.DataFrame(
    {
        "Variable": variable_names,
        "Unrot_F1": loadings_unrotated[:, 0],
        "Unrot_F2": loadings_unrotated[:, 1],
        "Unrot_F3": loadings_unrotated[:, 2],
        "Rotated_F1": loadings_rotated[:, 0],
        "Rotated_F2": loadings_rotated[:, 1],
        "Rotated_F3": loadings_rotated[:, 2],
    }
)

print(comparison_df.round(3))

### Factor Loading Interpretation

**Rotation benefits:**
- **Simple structure**: Variables load primarily on one factor
- **Clearer interpretation**: Easier to identify what each factor represents
- **Practical meaning**: Factors align better with theoretical constructs

**Loading interpretation guidelines:**
- **|loading| > 0.6**: Strong factor relationship
- **0.3 < |loading| < 0.6**: Moderate relationship
- **|loading| < 0.3**: Weak/negligible relationship

**Expected factor structure after rotation:**
- **Factor 1**: Likely Interpersonal Skills (Collaboration, Leadership, Communication)
- **Factor 2**: Likely Verbal Ability (ReadingComp, Vocabulary, Writing)
- **Factor 3**: Likely Quantitative Reasoning (MathScore, AlgebraScore, GeometryScore)
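
To show what Varimax is doing under the hood, here is a minimal sketch of the classic pairwise (Kaiser) algorithm in NumPy. It omits Kaiser row normalization and is not the `factor_analyzer` implementation; `varimax_pairwise`, `simple` and `mixed` are names invented for this example. The demo deliberately hides a known simple structure behind a 30-degree rotation and then recovers it.

```python
import numpy as np

def varimax_pairwise(loadings, max_sweeps=50, tol=1e-8):
    """Raw varimax via successive plane rotations of factor pairs."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    T = np.eye(k)  # accumulated rotation, so that loadings @ T == L
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i], L[:, j]
                u = x**2 - y**2
                v = 2 * x * y
                A, B = u.sum(), v.sum()
                C = (u**2 - v**2).sum()
                D = 2 * (u * v).sum()
                # Closed-form angle that maximizes the criterion in this plane
                phi = 0.25 * np.arctan2(D - 2 * A * B / p, C - (A**2 - B**2) / p)
                c, s = np.cos(phi), np.sin(phi)
                G = np.array([[c, -s], [s, c]])
                L[:, [i, j]] = L[:, [i, j]] @ G
                T[:, [i, j]] = T[:, [i, j]] @ G
                largest_angle = max(largest_angle, abs(phi))
        if largest_angle < tol:
            break
    return L, T

# Hypothetical simple structure: three variables per factor, no cross-loadings
simple = np.array([
    [0.80, 0.00], [0.75, 0.00], [0.70, 0.00],
    [0.00, 0.85], [0.00, 0.70], [0.00, 0.65],
])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
mixed = simple @ mix  # every variable now loads on both factors

recovered, T = varimax_pairwise(mixed)
print(np.round(mixed, 3))
print(np.round(recovered, 3))
```

With real data the recovered columns may come back in a different order or with flipped signs, which is why factor numbering after rotation should always be checked against the loadings rather than assumed.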

## Principal Component Analysis

**Task:** Apply PCA to the standardized data and transform it to obtain principal component scores. Extract the eigenvalues, explained variance ratios, and cumulative variance. Log these results rounded to 3 decimal places to understand how much variance each component captures.

PCA identifies linear combinations of variables that capture maximum variance.
Unlike Factor Analysis, PCA does not distinguish between common and unique variance.

In [None]:
pca = PCA()
Z = pca.fit_transform(X_standardized)

eigenvalues_pca = pca.explained_variance_
explained_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_ratio)

logger.info("\nPCA Results:")
logger.info(f"Eigenvalues: {np.round(eigenvalues_pca, 3)}")
logger.info(f"Explained variance ratio: {np.round(explained_ratio, 3)}")
logger.info(f"Cumulative variance: {np.round(cumulative_variance, 3)}")
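
One detail worth knowing when reading these eigenvalues: `StandardScaler` divides by the population standard deviation (ddof=0), while scikit-learn's PCA estimates variance with an n-1 denominator. The eigenvalues therefore sum to slightly more than the number of variables. The NumPy sketch below reproduces that effect on random data (`raw` is hypothetical).

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical correlated data: 50 observations, 4 variables
raw = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
n, p = raw.shape

# Standardize with ddof=0, as StandardScaler does
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigenvalues with an n-1 denominator (np.cov default) vs. the correlation matrix
eig_n_minus_1 = np.linalg.eigvalsh(np.cov(z, rowvar=False))
eig_corr = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))

print(f"Sum with n-1 denominator: {eig_n_minus_1.sum():.4f}")
print(f"Sum for correlation matrix: {eig_corr.sum():.4f}")
print(f"Ratio n / (n - 1): {n / (n - 1):.4f}")
```

The explained variance ratios are unaffected because the factor n/(n-1) cancels, but it matters when comparing eigenvalues against a fixed cutoff of 1.0 in small samples.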

### Interpreting the Variance Structure

The eigenvalues and explained variance ratios reveal the underlying dimensionality:

**Key insights:**
- **PC1, PC2, PC3**: Each captures a major dimension of educational assessment
- **First 3 components**: Should explain >80% cumulative variance based on the three underlying factors
- **Later components**: Capture measurement noise or minor variations

**Component retention strategy:**
- Kaiser criterion: Retain components with eigenvalues > 1.0 (expect 3 components)
- Scree plot: Look for the "elbow" where eigenvalues level off
- Practical rule: Retain components explaining ≥80% cumulative variance

**Expected interpretation:**
- **PC1**: Likely represents general cognitive ability or interpersonal skills
- **PC2**: Likely distinguishes verbal vs. quantitative reasoning
- **PC3**: Likely captures a specific educational dimension or mixed skills
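
The retention rules above are easy to apply programmatically. The sketch below does so on synthetic data generated from three latent factors with nine indicators; the generating loadings (0.8), sample size and all variable names are assumptions for illustration, not properties of the assessment file.

```python
import numpy as np

rng = np.random.default_rng(42)
n_students = 500

# Hypothetical generating model: 3 factors, 3 indicators each, loading 0.8
latent = rng.normal(size=(n_students, 3))
pattern = np.kron(np.eye(3), np.full((3, 1), 0.8))  # shape (9, 3)
data = latent @ pattern.T + 0.6 * rng.normal(size=(n_students, 9))

# Eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(data, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
cumulative = np.cumsum(eigvals / eigvals.sum())

# Kaiser criterion: count eigenvalues above 1.0
n_kaiser = int((eigvals > 1.0).sum())
# Cumulative-variance rule: smallest number of components reaching 80%
n_80 = int(np.searchsorted(cumulative, 0.80) + 1)

print("Eigenvalues:", np.round(eigvals, 3))
print("Cumulative variance:", np.round(cumulative, 3))
print(f"Kaiser criterion retains {n_kaiser}; 80% rule retains {n_80}")
```

The two rules do not have to agree. When unique variance is sizeable, the components matching the true factors can fall short of 80%, so the cumulative rule may ask for more components than the Kaiser criterion does.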

## Eigenvalue Comparison: FA vs PCA

**Task:** Create two side-by-side scree plots comparing PCA eigenvalues (left, in steelblue) with FA eigenvalues (right, in darkgreen). For each plot, show eigenvalues as connected points, add a horizontal line at eigenvalue = 1.0 (Kaiser criterion), include proper axis labels and titles, and add a grid. Save the combined figure as 'fa_scree.png'.

Scree plots reveal how eigenvalues differ between methods due to their
different approaches to variance decomposition.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
components = np.arange(1, len(eigenvalues_pca) + 1)

# PCA scree plot
ax1.plot(
    components,
    eigenvalues_pca,
    "o-",
    linewidth=2,
    color="steelblue",
    markersize=8,
    label="PCA Eigenvalues",
)
ax1.axhline(y=1.0, color="red", linestyle="--", alpha=0.7, label="Kaiser criterion")
ax1.set_xlabel("Component Number")
ax1.set_ylabel("Eigenvalue")
ax1.set_title("PCA Eigenvalues")
ax1.set_xticks(components)
ax1.grid(True, linestyle=":", alpha=0.7)
ax1.legend()

# FA scree plot
ax2.plot(
    components,
    eigenvalues_fa,
    "o-",
    linewidth=2,
    color="darkgreen",
    markersize=8,
    label="FA Eigenvalues",
)
ax2.axhline(y=1.0, color="red", linestyle="--", alpha=0.7, label="Kaiser criterion")
ax2.set_xlabel("Factor Number")
ax2.set_ylabel("Eigenvalue")
ax2.set_title("Factor Analysis Eigenvalues")
ax2.set_xticks(components)
ax2.grid(True, linestyle=":", alpha=0.7)
ax2.legend()

plt.tight_layout()
scree_path = script_dir / "fa_scree.png"
plt.savefig(scree_path, dpi=150, bbox_inches="tight")
logger.info(f"Eigenvalue comparison saved: {scree_path}")
plt.show()

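**Eigenvalue pattern differences:**

- **FA eigenvalues as plotted**: `eigenvalues_fa` is the first array returned by `get_eigenvalues()`, i.e. the eigenvalues of the original correlation matrix, so the right panel should closely match the PCA panel (scikit-learn's values are larger by a factor of n/(n-1) because it uses an n-1 denominator)
- **Common-factor eigenvalues**: The second array returned by `get_eigenvalues()` reflects only common variance and is generally lower; plot it instead to see the contrast with PCA
- **PCA eigenvalues**: Higher than common-factor eigenvalues because they include both common and unique variance
- **Kaiser criterion**: Both panels should show 3 eigenvalues > 1.0, which is consistent with the 3-factor structure
- **Elbow location**: A clear drop after the 3rd component indicates that retaining 3 factors is reasonable
- **Theoretical focus**: FA prioritizes meaningful factors over variance maximization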
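
To see why eigenvalues based on common variance are lower, one textbook convention replaces the diagonal of the correlation matrix with squared multiple correlations (SMCs) before the eigendecomposition. The sketch below uses that convention on synthetic data; it is an illustration only and not necessarily how `factor_analyzer` derives its common-factor eigenvalues. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 2 factors, 3 indicators each
latent = rng.normal(size=(400, 2))
pattern = np.kron(np.eye(2), np.full((3, 1), 0.75))  # shape (6, 2)
data = latent @ pattern.T + 0.66 * rng.normal(size=(400, 6))

corr = np.corrcoef(data, rowvar=False)

# SMC of each variable with all others: an initial communality estimate
smc = 1 - 1 / np.diag(np.linalg.inv(corr))

# Reduced correlation matrix: communality estimates replace the unit diagonal
reduced = corr.copy()
np.fill_diagonal(reduced, smc)

eig_full = np.sort(np.linalg.eigvalsh(corr))[::-1]
eig_reduced = np.sort(np.linalg.eigvalsh(reduced))[::-1]

print("Full correlation matrix:   ", np.round(eig_full, 3))
print("Reduced correlation matrix:", np.round(eig_reduced, 3))
```

The full eigenvalues sum to the number of variables, whereas the reduced ones sum to the total estimated common variance. Small or negative trailing values in the reduced solution are normal.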

## Factor Loading Visualization

**Task:** Create a side-by-side comparison of two heatmaps showing unrotated and Varimax rotated factor loadings. Use matplotlib's imshow with the 'RdBu_r' colormap (range -1 to 1). Label the x-axis with variable names (rotated 45 degrees), y-axis with factor numbers. Add text annotations showing the exact loading values (rounded to 3 decimals) in each cell. Include a shared colorbar on the right side of the figure. Save the figure as 'fa_loadings.png'.

Heatmaps provide visual comparison of loading patterns before and after rotation.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Unrotated loadings heatmap using matplotlib imshow
im1 = ax1.imshow(
    loadings_unrotated.T,
    cmap="RdBu_r",
    aspect="auto",
    vmin=-1,
    vmax=1,
)
ax1.set_title("Unrotated Factor Loadings")
ax1.set_xlabel("Variables")
ax1.set_ylabel("Factors")
ax1.set_xticks(range(len(variable_names)))
ax1.set_xticklabels(variable_names, rotation=45, ha="right")
ax1.set_yticks(range(n_factors))
ax1.set_yticklabels([f"Factor {i + 1}" for i in range(n_factors)])

# Add annotations
for i in range(n_factors):
    for j in range(len(variable_names)):
        text = ax1.text(
            j,
            i,
            f"{loadings_unrotated[j, i]:.3f}",
            ha="center",
            va="center",
            color="black",
            fontsize=7,
        )

# Rotated loadings heatmap using matplotlib imshow
im2 = ax2.imshow(
    loadings_rotated.T,
    cmap="RdBu_r",
    aspect="auto",
    vmin=-1,
    vmax=1,
)
ax2.set_title("Varimax Rotated Factor Loadings")
ax2.set_xlabel("Variables")
ax2.set_ylabel("Factors")
ax2.set_xticks(range(len(variable_names)))
ax2.set_xticklabels(variable_names, rotation=45, ha="right")
ax2.set_yticks(range(n_factors))
ax2.set_yticklabels([f"Factor {i + 1}" for i in range(n_factors)])

# Add annotations
for i in range(n_factors):
    for j in range(len(variable_names)):
        text = ax2.text(
            j,
            i,
            f"{loadings_rotated[j, i]:.3f}",
            ha="center",
            va="center",
            color="black",
            fontsize=7,
        )

# Adjust layout and add colorbar to the right
plt.tight_layout()
fig.subplots_adjust(right=0.85)  # Make room for colorbar
cbar_ax = fig.add_axes([0.87, 0.15, 0.02, 0.7])  # Position for colorbar
cbar = fig.colorbar(im1, cax=cbar_ax)
cbar.set_label("Loading")

loadings_path = script_dir / "fa_loadings.png"
plt.savefig(loadings_path, dpi=150, bbox_inches="tight")
logger.info(f"Factor loadings heatmap saved: {loadings_path}")
plt.show()

## PCA Biplot Visualization

**Task:** Create a PCA biplot combining student scores on PC1 and PC2 (as colored scatter points) with variable loadings (as red arrows emanating from the origin). Scale the loading arrows appropriately so they're visible on the same plot. Color the points by their PC1 score using a colormap. Include axis labels showing the variance explained by each component. Save the plot as 'pca_biplot.png'.

The biplot combines:
- **Points**: Individual student scores on PC1 and PC2
- **Arrows**: Variable loadings showing how each assessment contributes to the components

**Interpretation guide:**
- Arrow direction indicates which component the variable loads on
- Arrow length indicates loading strength
- Similar arrows suggest variables measure related constructs
- Opposite arrows indicate negative correlation
- Clusters of related variables (Quantitative, Verbal, Interpersonal) should be evident

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))

# Plot student scores
pc1_scores = Z[:, 0]
pc2_scores = Z[:, 1]

scatter = ax.scatter(
    pc1_scores,
    pc2_scores,
    c=pc1_scores,
    cmap="viridis",
    alpha=0.6,
    s=40,
    edgecolors="black",
    linewidth=0.5,
)
colorbar = plt.colorbar(scatter, label="PC1 Score")

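# Plot variable loadings as arrows
# Loadings scaled by sqrt(eigenvalue); kept because the FA vs PCA table reuses them
pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# The arrows use the unit-length eigenvector coefficients (loadings with the
# sqrt(eigenvalue) scaling removed), stretched to the scale of the scores
eigenvectors = pca.components_.T
scale_factor = max(pc1_scores.std(), pc2_scores.std()) * 3.5

for i, var_name in enumerate(variable_names):
    loading_x = eigenvectors[i, 0] * scale_factor
    loading_y = eigenvectors[i, 1] * scale_factor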

    ax.arrow(
        0,
        0,
        loading_x,
        loading_y,
        color="red",
        head_width=0.15,
        alpha=0.8,
        linewidth=2.5,
        head_length=0.15,
    )
    ax.text(
        loading_x * 1.15,
        loading_y * 1.15,
        var_name,
        color="red",
        fontweight="bold",
        fontsize=11,
        ha="center",
    )

ax.set_xlabel(f"PC1 ({explained_ratio[0]:.1%} of variance)")
ax.set_ylabel(f"PC2 ({explained_ratio[1]:.1%} of variance)")
ax.set_title("PCA Biplot: Student Scores and Variable Loadings")
ax.grid(True, linestyle=":", alpha=0.3)
ax.axhline(y=0, color="black", linewidth=0.8)
ax.axvline(x=0, color="black", linewidth=0.8)

plt.tight_layout()
biplot_path = script_dir / "pca_biplot.png"
plt.savefig(biplot_path, dpi=150, bbox_inches="tight")
logger.info(f"Biplot saved: {biplot_path}")
plt.show()

## Factor Analysis vs PCA Comparison

**Task:** Compare the variance explanation: calculate the proportion of total variance explained by the first 3 PCA components vs. the proportion of total variance captured by the communalities of the 3 FA factors (Σh² divided by the number of variables). Get factor scores from the rotated FA model. Create a comparison DataFrame showing PCA loadings (PC1, PC2, PC3) and FA loadings (F1, F2, F3) side-by-side for all variables. Display rounded to 3 decimals.

Direct comparison reveals fundamental differences between these two approaches
to multivariate data analysis.

In [None]:
# Get factor scores
fa_scores = fa_rotated.transform(X_standardized)

logger.info("\nFactor Analysis vs PCA Comparison:")

# Variance explanation comparison
pca_variance_3comp = explained_ratio[:3].sum()
logger.info("\nVariance Explanation:")
logger.info(f"  PCA (first 3 components): {pca_variance_3comp:.1%} of total variance")
logger.info(f"  FA (3 factors): {variance_explained_fa:.1%} of total variance (Σh² / p)")
logger.info(
    "  Key difference: PCA maximizes total variance, FA focuses on shared variance"
)

# Loading comparison
logger.info("\nLoading Pattern Comparison:")
comparison_detailed = pd.DataFrame(
    {
        "Variable": variable_names,
        "PCA_PC1": pca_loadings[:, 0],
        "PCA_PC2": pca_loadings[:, 1],
        "PCA_PC3": pca_loadings[:, 2],
        "FA_F1": loadings_rotated[:, 0],
        "FA_F2": loadings_rotated[:, 1],
        "FA_F3": loadings_rotated[:, 2],
    }
)

print(comparison_detailed.round(3))

## Factor Structure Interpretation and Validation

**Task:** Analyze the rotated factor structure by identifying salient loadings (absolute value > 0.4) for each factor. Log which variables load strongly on each factor with their loading values. Interpret the factor structure in terms of the three educational constructs.

Analyze the extracted factors to understand what psychological constructs
they represent and how well they align with theoretical expectations.

In [None]:
# Identify salient loadings (commonly |loading| > 0.4)
loading_threshold = 0.4

logger.info(f"\nFactor Structure Analysis (threshold = {loading_threshold}):")
for factor_idx in range(n_factors):
    factor_name = f"Factor {factor_idx + 1}"
    salient_vars = []

    for var_idx, var_name in enumerate(variable_names):
        loading = loadings_rotated[var_idx, factor_idx]
        if abs(loading) > loading_threshold:
            salient_vars.append(f"{var_name} ({loading:+.3f})")

    logger.info(
        f"  {factor_name}: {', '.join(salient_vars) if salient_vars else 'No salient loadings'}"
    )

logger.info("\nEducational Construct Interpretation:")
logger.info("  Factor 1: Likely represents Interpersonal Skills")
logger.info("    - Expected high loadings: Collaboration, Leadership, Communication")
logger.info("  Factor 2: Likely represents Verbal Ability")
logger.info("    - Expected high loadings: ReadingComp, Vocabulary, Writing")
logger.info("  Factor 3: Likely represents Quantitative Reasoning")
logger.info("    - Expected high loadings: MathScore, AlgebraScore, GeometryScore")
logger.info("\nAll variables should show high communalities (h² > 0.6),")
logger.info("indicating they are well-explained by the three underlying factors.")
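
A vectorized alternative to the loop above can also flag cross-loadings, i.e. variables that exceed the threshold on more than one factor. The loading matrix and variable names in this sketch are invented for illustration.

```python
import numpy as np

# Hypothetical rotated loadings: 4 variables (rows) on 3 factors (columns)
loadings_demo = np.array([
    [0.82, 0.10, 0.05],
    [0.78, 0.15, 0.12],
    [0.08, 0.85, 0.10],
    [0.45, 0.52, 0.05],  # loads on two factors
])
names_demo = ["VarA", "VarB", "VarC", "VarD"]
threshold = 0.4

abs_loadings = np.abs(loadings_demo)

# Factor with the largest absolute loading for each variable
dominant = abs_loadings.argmax(axis=1)
# Number of factors on which each variable is salient
n_salient = (abs_loadings > threshold).sum(axis=1)

cross_loaded = [name for name, count in zip(names_demo, n_salient) if count > 1]

for name, factor_idx in zip(names_demo, dominant):
    print(f"{name}: dominant factor = Factor {factor_idx + 1}")
print("Cross-loading variables:", cross_loaded)
```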

## PCA Component Loadings Analysis

**Task:** Create a DataFrame showing the PCA loadings (component coefficients) for the first 3 principal components. Use the variable names as row labels and 'PC1', 'PC2', 'PC3' as column names. Display the loadings table rounded to 3 decimal places.

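`pca.components_` holds the unit-length eigenvector coefficients (weights) that define each principal component.
They differ from the correlation-scaled `pca_loadings` computed for the biplot by a factor of the square root of each eigenvalue.
High absolute values indicate variables that contribute strongly to a component.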

In [None]:
# Create loadings table for first 3 components
pca_loadings_df = pd.DataFrame(
    pca.components_[:3].T, columns=["PC1", "PC2", "PC3"], index=variable_names
)

logger.info("\nPCA Component Loadings Matrix:")
print(pca_loadings_df.round(3))
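
The relationship between eigenvector coefficients and loadings can be verified directly: for standardized data, an eigenvector scaled by the square root of its eigenvalue equals the correlation between each variable and the component scores. This NumPy sketch checks that on random data (`raw` and the other names are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical correlated data: 200 observations, 4 variables
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
p = z.shape[1]

# Eigendecomposition of the correlation matrix, largest eigenvalue first
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores and the two kinds of coefficients
scores = z @ eigvecs
loadings = eigvecs * np.sqrt(eigvals)

# Correlation of every variable (rows) with every component score (columns)
corr_var_score = np.corrcoef(z, scores, rowvar=False)[:p, p:]

print("Eigenvector column norms:", np.round((eigvecs**2).sum(axis=0), 3))
print("Loadings equal correlations:", np.allclose(loadings, corr_var_score))
```

Eigenvector columns always have unit length, while the squared loadings in each column sum to that component's eigenvalue. Only the loadings are directly comparable with factor loadings.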

## Student Score Analysis

**Task:** Create a DataFrame containing student IDs (using the actual STUD_xxx format from the data) with their PC1 and PC2 scores. Identify and log the top 5 students on PC1 and the top 5 on PC2, showing both scores rounded to 3 decimal places. Finish by logging the range of each score and the PC1-PC2 correlation, which should be approximately zero.

Examine how students rank on the principal components to understand
the practical meaning of each dimension.

In [None]:
# Create student ranking analysis using actual Student IDs from data
student_scores = pd.DataFrame(
    {"Student_ID": df["Student"].values, "PC1_Score": Z[:, 0], "PC2_Score": Z[:, 1]}
)

# Top performers on PC1
top_pc1 = student_scores.nlargest(5, "PC1_Score")
logger.info("\nTop 5 students on PC1:")
for _, row in top_pc1.iterrows():
    logger.info(
        f"  {row['Student_ID']}: PC1={row['PC1_Score']:.3f}, PC2={row['PC2_Score']:.3f}"
    )

# Top performers on PC2
top_pc2 = student_scores.nlargest(5, "PC2_Score")
logger.info("\nTop 5 students on PC2:")
for _, row in top_pc2.iterrows():
    logger.info(
        f"  {row['Student_ID']}: PC1={row['PC1_Score']:.3f}, PC2={row['PC2_Score']:.3f}"
    )

# Score distribution summary
logger.info("\nScore Distribution Summary:")
logger.info(f"  PC1 range: [{Z[:, 0].min():.3f}, {Z[:, 0].max():.3f}]")
logger.info(f"  PC2 range: [{Z[:, 1].min():.3f}, {Z[:, 1].max():.3f}]")
logger.info(
    f"  PC1-PC2 correlation: {np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]:.3f} (should be ≈ 0)"
)

## Summary and Method Selection Guidelines

This comparative analysis demonstrates key concepts for dimensionality reduction and latent variable modeling:

**Factor Analysis advantages:**
- **Theoretical grounding**: Models specific latent constructs (quantitative, verbal, interpersonal)
- **Measurement model**: Separates common variance from unique variance (specific variance plus measurement error)
- **Simple structure**: Rotation achieves cleaner variable-factor relationships
- **Communality estimates**: Reveals how much variance is shared vs. unique
- **Three-factor structure**: Clearly identifies Quantitative, Verbal, and Interpersonal dimensions

**PCA advantages:**
- **Maximum variance**: Captures the most information in each component
- **Data reduction**: Efficient compression for visualization and further analysis
- **Computational simplicity**: Faster and more stable than iterative FA
- **Total variance**: Includes all sources of variation (common + unique)

**When to choose Factor Analysis:**
- Testing specific theories about latent psychological constructs
- Developing or validating measurement instruments
- Modeling common variance while acknowledging measurement error
- Interpreting results in terms of theoretical constructs
- Educational assessment scenarios where underlying abilities are of interest

**When to choose PCA instead:**
- Primary goal is data reduction or compression
- Maximizing explained variance is the priority
- No theoretical model for underlying structure
- Computational efficiency is critical

**Key methodological insights:**
- **Assumption testing**: KMO and Bartlett's tests confirm FA appropriateness
- **Rotation benefits**: Varimax rotation dramatically improves interpretability in FA
- **Communality interpretation**: High communalities (h² > 0.6) for all variables are consistent with a 3-factor model, although they do not by themselves test model fit
- **Variance focus**: PCA explains more total variance, FA focuses on shared variance
- **Construct validation**: Loading patterns should align with theoretical expectations
- **Three-factor solution**: Kaiser criterion and scree plot both support 3 factors

**Educational assessment insights:**
- **Quantitative Reasoning**: Math, Algebra, and Geometry form a coherent dimension
- **Verbal Ability**: Reading, Vocabulary, and Writing cluster together
- **Interpersonal Skills**: Collaboration, Leadership, and Communication represent a distinct dimension
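- **Factor correlations**: Varimax constrains the factors to be orthogonal, so this solution cannot show the moderate intercorrelations (0.2-0.3) expected in educational data; an oblique rotation is needed to estimate them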
- **Assessment design**: Results validate that these nine measures capture three distinct educational constructs

**Next applications:**
- Explore oblique rotation when factors may be correlated
- Use confirmatory factor analysis to test specific theoretical models
- Apply discriminant analysis for classification tasks
- Combine with cluster analysis to identify student groups
- Use canonical correlation to relate assessment domains
- Develop composite scores for each factor for student reporting