# Complete Factor Analysis Workflow - MA2003B Multivariate Statistics Course

This notebook walks through an end-to-end factor analysis pipeline, from data preparation through interpretation and comparison with PCA. The same sequence of steps applies to most exploratory factor analyses.

## Learning Objectives:
- Implement the full factor analysis workflow
- Apply statistical tests for factorability
- Make decisions about factor retention and rotation
- Interpret factor loadings, communalities, and variance explained
- Compare Factor Analysis results with Principal Component Analysis

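**Data**: 100 simulated observations × 5 variables, drawn as independent standard normals (replace with real data). Because the variables are independent by construction, the factorability checks in Step 3 are expected to fail; the results illustrate the mechanics of each step, not a meaningful factor structure.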

**Workflow Steps**:
1. Data loading and preparation
2. Data standardization
3. Factorability assessment (KMO, Bartlett's test)
4. Factor retention decision (Kaiser criterion)
5. Factor extraction and rotation
6. Results interpretation
7. Comparison with PCA

In [2]:
# Import Required Libraries
import numpy as np
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [3]:
# Complete Factor Analysis Workflow
print("Complete Factor Analysis Workflow")
print("=" * 50)
print("Following systematic steps for multivariate analysis")

Complete Factor Analysis Workflow
==================================================
Following systematic steps for multivariate analysis


## Step 1: Data Loading and Preparation

In [4]:
# Load your actual data here (replace the simulated data)
# import pandas as pd
# X = pd.read_csv('your_data.csv').to_numpy()  # Real data loading
# For demonstration, we use independent standard normal draws,
# so the variables have no built-in factor structure
np.random.seed(42)  # For reproducible results
X = np.random.randn(100, 5)  # 100 observations, 5 variables

print("Step 1: Data Preparation")
print("-" * 25)
print(f"Dataset dimensions: {X.shape[0]} observations × {X.shape[1]} variables")
print("Data type: Simulated multivariate normal (replace with real data)")

Step 1: Data Preparation
-------------------------
Dataset dimensions: 100 observations × 5 variables
Data type: Simulated multivariate normal (replace with real data)


## Step 2: Data Standardization

In [5]:
# Standardize variables to ensure equal contribution to analysis
# This is crucial when variables have different scales/units
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Step 2: Data Standardization")
print("-" * 25)
print("Variables standardized: mean ≈ 0, std ≈ 1 for all variables")
print("Ensures equal contribution regardless of original scales")

Step 2: Data Standardization
-------------------------
Variables standardized: mean ≈ 0, std ≈ 1 for all variables
Ensures equal contribution regardless of original scales
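
As a rough illustration of what the standardization step does, the sketch below reproduces the z-score transform with plain NumPy. It assumes the population standard deviation (`ddof=0`), which is the `StandardScaler` default. The helper name `standardize` and the demo data are made up for this example.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=0)
    return (X - mean) / std

rng = np.random.default_rng(0)
# Hypothetical data whose columns live on very different scales
X_demo = rng.normal(loc=[10.0, -3.0, 250.0], scale=[1.0, 0.1, 40.0], size=(100, 3))
Z = standardize(X_demo)

print(np.round(Z.mean(axis=0), 6))  # column means, approximately 0
print(np.round(Z.std(axis=0), 6))   # column standard deviations, approximately 1
```

After this transform every column contributes one unit of variance, so the covariance structure that remains is the correlation structure.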


## Step 3: Assess Suitability for Factor Analysis

In [6]:
# Test statistical assumptions before proceeding
kmo_all, kmo_model = calculate_kmo(X_scaled)
chi_square_value, p_value = calculate_bartlett_sphericity(X_scaled)

print("Step 3: Factorability Assessment")
print("-" * 25)
print("Testing whether data is suitable for factor analysis:")
print()

print("Kaiser-Meyer-Olkin (KMO) Test:")
print("  Measures sampling adequacy for each variable and overall")
print(f"  Overall KMO: {kmo_model:.3f}")
if kmo_model > 0.8:
    kmo_interpretation = "Excellent"
elif kmo_model > 0.7:
    kmo_interpretation = "Good"
elif kmo_model > 0.6:
    kmo_interpretation = "Acceptable"
else:
    kmo_interpretation = "Unacceptable"
print(f"  Interpretation: {kmo_interpretation} sampling adequacy")
print()

print("Bartlett's Test of Sphericity:")
print("  Tests null hypothesis: correlation matrix is identity matrix")
print(f"  Chi-square: {chi_square_value:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("  Result: Significant - variables are correlated, FA is appropriate")
else:
    print("  Result: Not significant - variables may be uncorrelated, reconsider FA")

Step 3: Factorability Assessment
-------------------------
Testing whether data is suitable for factor analysis:

Kaiser-Meyer-Olkin (KMO) Test:
  Measures sampling adequacy for each variable and overall
  Overall KMO: 0.504
  Interpretation: Unacceptable sampling adequacy

Bartlett's Test of Sphericity:
  Tests null hypothesis: correlation matrix is identity matrix
  Chi-square: 7.442, p-value: 0.683
  Result: Not significant - variables may be uncorrelated, reconsider FA
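
To make the two tests less of a black box, here is a minimal sketch of the textbook formulas behind them, written with NumPy and SciPy. The function names `bartlett_sphericity` and `kmo_overall` are hypothetical, and the sketch is not the library's code; if the library follows the same formulas, the printed values should be close to the output above.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test: H0 is that the correlation matrix equals the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) // 2
    return statistic, dof, chi2.sf(statistic, dof)

def kmo_overall(X):
    """Overall KMO: squared correlations relative to squared correlations
    plus squared partial correlations (off-diagonal entries only)."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partial = -R_inv / scale                  # partial correlations
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)
    a2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + a2)

np.random.seed(42)
X_demo = np.random.randn(100, 5)  # same simulated data as in Step 1
stat, dof, p_val = bartlett_sphericity(X_demo)
print(f"Bartlett chi-square: {stat:.3f} (df={dof}), p-value: {p_val:.3f}")
print(f"Overall KMO: {kmo_overall(X_demo):.3f}")
```

A KMO near 0.5 means the partial correlations are about as large as the raw correlations, which is what independent variables produce.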


## Step 4: Determine Number of Factors to Extract

In [7]:
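# Use PCA eigenvalues as initial guide for factor retention
pca = PCA()
pca.fit(X_scaled)
# Note: explained_variance_ divides by n - 1, while StandardScaler divides by n,
# so these values sum to p * n / (n - 1) ≈ 5.05 rather than exactly 5.
# Multiply by (n - 1) / n to obtain the eigenvalues of the correlation matrix.
eigenvalues = pca.explained_variance_

# Kaiser criterion: retain factors with eigenvalues > 1.0
n_factors = int(np.sum(eigenvalues > 1))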

print("Step 4: Factor Retention Decision")
print("-" * 25)
print("Using Kaiser criterion (eigenvalues > 1.0):")
print(f"Eigenvalues: {np.round(eigenvalues, 3)}")
print(f"Suggested number of factors: {n_factors}")
print("Rationale: Factors should explain more variance than individual variables")

Step 4: Factor Retention Decision
-------------------------
Using Kaiser criterion (eigenvalues > 1.0):
Eigenvalues: [1.298 1.074 0.987 0.932 0.758]
Suggested number of factors: 2
Rationale: Factors should explain more variance than individual variables
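
The Kaiser criterion is defined on the eigenvalues of the correlation matrix, which sum to the number of variables. The sketch below, using only NumPy, computes them directly and shows how they relate to eigenvalues of a sample covariance matrix (divisor n - 1) taken on data standardized with divisor n. The variable names are illustrative.

```python
import numpy as np

np.random.seed(42)
X_demo = np.random.randn(100, 5)  # same simulated data as in Step 1
n, p = X_demo.shape

# Eigenvalues of the correlation matrix, largest first
R = np.corrcoef(X_demo, rowvar=False)
corr_eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Z-score with ddof=0, then take the sample covariance (ddof=1 by default)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]

print("Correlation-matrix eigenvalues:", np.round(corr_eigenvalues, 3))
print("Sum of eigenvalues:", round(float(corr_eigenvalues.sum()), 3))
print("Covariance eigenvalues x (n-1)/n:", np.round(cov_eigenvalues * (n - 1) / n, 3))
print("Kaiser count:", int(np.sum(corr_eigenvalues > 1)))
```

With n = 100 the rescaling factor is 0.99, so it only matters when an eigenvalue sits very close to 1.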


## Step 5: Perform Factor Analysis with Rotation

In [8]:
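# Extract factors with factor_analyzer's method="principal" and Varimax rotation.
# Note: this option appears to take the loadings from the principal components of
# the correlation matrix (no iterated communality estimates), which is consistent
# with the identical 47.0% figures in Step 7. It is not principal axis factoring;
# use method="minres" or method="ml" for a common factor model.
fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
fa.fit(X_scaled)

print("Step 5: Factor Extraction and Rotation")
print("-" * 25)
print(f"Method: Principal extraction with {n_factors} factors")
print("Rotation: Varimax (orthogonal) for simple structure")

Step 5: Factor Extraction and Rotation
-------------------------
Method: Principal extraction with 2 factors
Rotation: Varimax (orthogonal) for simple structure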
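
To show what an orthogonal varimax rotation does to a loading matrix, here is a minimal NumPy sketch of the common SVD-based iteration. It is not the library's implementation: it omits Kaiser normalization and other refinements, so library output can differ. The function names and the demo loadings are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation without Kaiser normalization.
    Returns the rotated loadings and the rotation matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Target matrix whose polar factor gives the next rotation
        target = rotated**3 - rotated * (rotated**2).sum(axis=0) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_objective = s.sum()
        if objective != 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return np.sum(np.var(loadings**2, axis=0))

# Hypothetical unrotated loadings: a clean two-factor pattern turned by 30 degrees
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.1], [0.0, 0.8], [0.1, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 3))
print("Criterion before:", round(float(varimax_criterion(unrotated)), 4))
print("Criterion after: ", round(float(varimax_criterion(rotated)), 4))
```

Because the rotation matrix is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the split of that variance across factors moves.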


## Step 6: Extract and Interpret Results

In [9]:
# Get key factor analysis outputs
loadings = fa.loadings_  # Variable-factor correlations
communalities = fa.get_communalities()  # Common variance proportions
variance_explained = fa.get_factor_variance()  # Variance decomposition

print("Step 6: Results Interpretation")
print("-" * 25)
print()

print("Factor Loadings (Variable-Factor Correlations):")
print("  |loading| > 0.6 indicates a strong factor relationship")
print("  |loading| > 0.3 indicates a moderate relationship")
if loadings is not None:
    print(loadings.round(3))
else:
    print("  Loadings not available")
print()

print("Communalities (h² - Common Variance):")
print("  Proportion of each variable's variance explained by the extracted factors")
print("  Higher values mean the variable is better represented by the factors")
print(communalities.round(3))
print()

print("Factor Variance Explained:")
print("  [0]: Sum of squared loadings per factor (after rotation)")
print("  [1]: Proportional variance explained")
print("  [2]: Cumulative variance explained")
print(np.round(variance_explained, 3))

Step 6: Results Interpretation
-------------------------

Factor Loadings (Variable-Factor Correlations):
  |loading| > 0.6 indicates a strong factor relationship
  |loading| > 0.3 indicates a moderate relationship
[[-0.778 -0.023]
 [ 0.366  0.613]
 [-0.229  0.832]
 [ 0.492 -0.098]
 [ 0.433  0.222]]

Communalities (h² - Common Variance):
  Proportion of each variable's variance explained by the extracted factors
  Higher values mean the variable is better represented by the factors
[0.606 0.51  0.744 0.251 0.237]

Factor Variance Explained:
  [0]: Sum of squared loadings per factor (after rotation)
  [1]: Proportional variance explained
  [2]: Cumulative variance explained
[[1.222 1.127]
 [0.244 0.225]
 [0.244 0.47 ]]
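
The communalities and the variance table can both be recovered from the loading matrix alone. The sketch below does this with NumPy for the rounded loadings printed above; because the inputs are rounded to three decimals, the last digit may differ from the library output. The `_demo` names are illustrative.

```python
import numpy as np

# Rotated loadings copied from the Step 6 output (rounded to 3 decimals)
loadings_demo = np.array([
    [-0.778, -0.023],
    [ 0.366,  0.613],
    [-0.229,  0.832],
    [ 0.492, -0.098],
    [ 0.433,  0.222],
])
n_vars = loadings_demo.shape[0]

communalities_demo = (loadings_demo**2).sum(axis=1)  # row sums of squared loadings
uniquenesses_demo = 1 - communalities_demo           # variance not shared with the factors
ss_loadings = (loadings_demo**2).sum(axis=0)         # column sums of squared loadings
proportion = ss_loadings / n_vars                    # share of total standardized variance
cumulative = np.cumsum(proportion)

print("Communalities:", np.round(communalities_demo, 3))
print("Uniquenesses: ", np.round(uniquenesses_demo, 3))
print("SS loadings:  ", np.round(ss_loadings, 3))
print("Proportion:   ", np.round(proportion, 3))
print("Cumulative:   ", np.round(cumulative, 3))
```

The division by the number of variables works because each standardized variable contributes exactly one unit of variance.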


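In [18]:
# Scratch check: factor scores for the first observation, shown as a column vector.
# Note: this passes the unscaled X rather than X_scaled (used in Step 7);
# use X_scaled for consistency with the fitted model.
F = fa.transform(X)
F[0].reshape(1, -1).T

array([[0.03361469],
       [0.21166407]])

In [20]:
# Rebuild the first observation from its two factor scores (common part only)
loadings @ F[0]

array([-0.03097677,  0.14211092,  0.16837473, -0.00413045,  0.06156508])

In [15]:
# Original (unscaled) first observation, for comparison with the rebuild above
X[0]

array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337])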
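
The cells above multiply the loadings by the factor scores of the first observation and compare the result with the observation itself. As a hedged sketch of why two factors rebuild a row only approximately, the NumPy example below uses principal-component loadings and scores computed by hand. This is an illustration of the idea, not how `fa.transform` computes its scores, and all names are hypothetical.

```python
import numpy as np

np.random.seed(42)
X_demo = np.random.randn(100, 5)  # same simulated data as in Step 1
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Principal-component loadings and unit-variance scores from the correlation matrix
R = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
loadings_all = eigenvectors * np.sqrt(eigenvalues)      # variables x components
scores_all = (Z @ eigenvectors) / np.sqrt(eigenvalues)  # observations x components

def reconstruct(row, k):
    """Rebuild one standardized observation from its first k component scores."""
    return loadings_all[:, :k] @ scores_all[row, :k]

print("Observed z-scores:", np.round(Z[0], 3))
print("Rank-2 rebuild:   ", np.round(reconstruct(0, 2), 3))
print("Rank-5 rebuild:   ", np.round(reconstruct(0, 5), 3))
```

Only the full set of five components reproduces the row exactly; with two components the remainder is the part of the observation that the retained factors do not capture, which is large when communalities are low.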

## Step 7: Compare with Principal Component Analysis

In [11]:
# Transform data using both methods for comparison
pca_scores = pca.transform(X_scaled)[:, :n_factors]  # Keep same number of components
fa_scores = fa.transform(X_scaled)

print("Step 7: PCA vs Factor Analysis Comparison")
print("-" * 25)
print()

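# Variance comparison
# Note: the two cumulative figures coincide here because method="principal" derives
# the loadings from the same principal components; varimax rotation redistributes
# variance between factors without changing the total.
pca_variance = pca.explained_variance_ratio_[:n_factors].sum()
fa_variance = variance_explained[2][-1]  # Cumulative variance from FA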

print("Variance Explained Comparison:")
print(f"  PCA: {pca_variance*100:.1f}% cumulative variance")
print(f"  FA: {fa_variance*100:.1f}% cumulative variance")
print()

print("Key Differences:")
print("- PCA: Maximizes total variance, includes unique + common variance")
print("- FA: Focuses on common variance, models unique variance separately")
print("- PCA: Components are linear combinations for dimensionality reduction")
print("- FA: Factors represent latent constructs for theory testing")
print()

print("Factor Scores Shape:")
print(f"  PCA scores: {pca_scores.shape}")
print(f"  FA scores: {fa_scores.shape}")
print("  Both provide component/factor scores for further analysis")

Step 7: PCA vs Factor Analysis Comparison
-------------------------

Variance Explained Comparison:
  PCA: 47.0% cumulative variance
  FA: 47.0% cumulative variance

Key Differences:
- PCA: Maximizes total variance, includes unique + common variance
- FA: Focuses on common variance, models unique variance separately
- PCA: Components are linear combinations for dimensionality reduction
- FA: Factors represent latent constructs for theory testing

Factor Scores Shape:
  PCA scores: (100, 2)
  FA scores: (100, 2)
  Both provide component/factor scores for further analysis
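
The distinction between total and common variance becomes visible only when the extraction method estimates communalities. As a minimal sketch, the NumPy code below implements iterated principal axis factoring and compares its cumulative variance share with that of the principal components. The data are hypothetical, simulated with a genuine two-factor structure (not the notebook's data), and the function name `principal_axis_factoring` is made up for this example.

```python
import numpy as np

def principal_axis_factoring(R, n_factors, max_iter=200, tol=1e-6):
    """Iterated principal axis factoring on a correlation matrix R."""
    # Initial communalities: squared multiple correlations
    h2 = 1 - 1 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)  # replace the 1s with communality estimates
        vals, vecs = np.linalg.eigh(reduced)
        top = np.argsort(vals)[::-1][:n_factors]
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))
        h2_new = (loadings**2).sum(axis=1)
        converged = np.max(np.abs(h2_new - h2)) < tol
        h2 = h2_new
        if converged:
            break
    return loadings, h2

# Hypothetical data: six variables driven by two uncorrelated factors plus noise
rng = np.random.default_rng(0)
true_loadings = np.array([[0.8, 0], [0.7, 0], [0.6, 0], [0, 0.8], [0, 0.7], [0, 0.6]])
factors = rng.standard_normal((500, 2))
noise = rng.standard_normal((500, 6)) * np.sqrt(1 - (true_loadings**2).sum(axis=1))
R = np.corrcoef(factors @ true_loadings.T + noise, rowvar=False)

paf_loadings, paf_h2 = principal_axis_factoring(R, n_factors=2)
pc_share = np.sort(np.linalg.eigvalsh(R))[::-1][:2].sum() / R.shape[0]
paf_share = (paf_loadings**2).sum() / R.shape[0]

print(f"Principal components, cumulative share: {pc_share:.3f}")
print(f"Principal axis factoring, cumulative share: {paf_share:.3f}")
print("PAF communalities:", np.round(paf_h2, 3))
```

The principal axis share is smaller because the diagonal of the analyzed matrix holds communalities instead of ones, so unique variance is excluded from what the factors are asked to explain.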




## Summary and Recommendations

In [12]:
print("Analysis Complete!")
print("=" * 50)
print("Summary:")
print(f"- Extracted {n_factors} factors from {X.shape[1]} variables")
print(f"- Cumulative variance explained: {fa_variance*100:.1f}%")
print(f"- KMO measure: {kmo_model:.3f}")

if kmo_model > 0.6 and p_value < 0.05:
    print("- Data meets factorability requirements")
else:
    print("- Consider data quality issues or alternative methods")

print("\nNext Steps:")
print("- Examine factor loadings for theoretical interpretation")
print("- Consider oblique rotation if factors are expected to correlate")
print("- Validate factor structure with confirmatory methods")
print("- Use factor scores in subsequent analyses")

Analysis Complete!
==================================================
Summary:
- Extracted 2 factors from 5 variables
- Cumulative variance explained: 47.0%
- KMO measure: 0.504
- Consider data quality issues or alternative methods

Next Steps:
- Examine factor loadings for theoretical interpretation
- Consider oblique rotation if factors are expected to correlate
- Validate factor structure with confirmatory methods
- Use factor scores in subsequent analyses
