# PCA Analysis

This notebook displays results from the automated PCA pipeline. The pipeline handles all computation; the notebook itself focuses on narrative and visualization.

In [None]:
from core.config import initialize_notebook
import pickle
import matplotlib.pyplot as plt
from pathlib import Path

# Initialize environment
env = initialize_notebook(regenerate_run_id=False)

# Display active research question
research_question = env.configs.run['run_name']
research_group_col = env.configs.data["columns"]["mapping"]["research_group"]

print(f"\n{'='*50}")
print(f"Research Question: {research_question.upper()}")
print(f"Group Column: {research_group_col}")
print(f"{'='*50}\n")

run_cfg = env.configs.run
data_dir = env.repo_root / "outputs" / run_cfg["run_name"] / run_cfg["run_id"] / f"seed_{run_cfg['seed']}"
pca_dir = data_dir / "pca"
plots_dir = pca_dir / "plots"

print(f"Loading PCA results from: {pca_dir}")

In [None]:
# Run PCA analysis if not already computed
if not pca_dir.exists() or not (plots_dir / "pca_scree.png").exists():
    print("PCA results not found. Running PCA analysis pipeline...")
    from core.pca.pipeline import run_pca_analysis
    results = run_pca_analysis(env)
    print("PCA analysis complete!")
else:
    print("PCA results found, loading existing analysis")

## Variance Explained

The scree plot shows how much variance each principal component captures. This helps determine the intrinsic dimensionality of the imaging data.
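
As a rough illustration of how a scree plot translates into a component count, the sketch below finds the smallest number of leading components whose cumulative explained-variance ratio reaches a chosen threshold. The helper name `components_for_variance`, the 0.8 threshold, and the example ratios are assumptions made for illustration; they are not part of the pipeline.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold=0.9):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`, or None if it never does."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return None


# Illustrative ratios only (hypothetical values, not pipeline results).
example_ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for_variance(example_ratios, threshold=0.8))
```

Once the fitted model has been loaded, the same helper could be called on `pca_model.explained_variance_ratio_`.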

In [None]:
# Display scree plot
scree_plot = plots_dir / "pca_scree.png"

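# Import unconditionally: the later plot cells also call display(Image(...)),
# and would raise NameError if this import only ran when the scree plot exists.
from IPython.display import Image, display

if scree_plot.exists():
    display(Image(str(scree_plot)))
else:
    print("Scree plot not found. Run the full pipeline first.")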

## Group Separation in PCA Space

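These plots show how well the research groups separate in the reduced PCA space. Clear separation along the leading components suggests the groups have distinct neuroimaging signatures; overlap in these views does not rule out differences along later components.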
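
Visual separation can be complemented by a simple number. The sketch below computes a Fisher-style ratio for one component: the squared difference of the two group means divided by the sum of the group sample variances. The function name `fisher_ratio` and the toy scores are hypothetical, and the pipeline may quantify separation differently or not at all.

```python
from statistics import mean, variance


def fisher_ratio(scores_a, scores_b):
    """Squared mean difference over summed sample variances for one component.

    Larger values mean the two groups overlap less along that component.
    Each group needs at least two scores so a sample variance exists.
    """
    spread = variance(scores_a) + variance(scores_b)
    gap = (mean(scores_a) - mean(scores_b)) ** 2
    if spread == 0:
        # Degenerate case: no within-group spread at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


# Toy PC1 scores for two hypothetical groups (not real results).
group_a = [0.0, 2.0]
group_b = [4.0, 6.0]
print(fisher_ratio(group_a, group_b))
```

A ratio near zero indicates heavily overlapping groups along that component, while larger values indicate that the group means are far apart relative to the within-group spread.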

In [None]:
# Display PC1 vs PC2 plot
pc1_pc2_plot = plots_dir / f"pca_{research_question}_pc1_pc2.png"

if pc1_pc2_plot.exists():
    display(Image(str(pc1_pc2_plot)))
else:
    print(f"{research_question.title()} PC1 vs PC2 plot not found. Run the full pipeline first.")

In [None]:
# Display PC2 vs PC3 plot
pc2_pc3_plot = plots_dir / f"pca_{research_question}_pc2_pc3.png"

if pc2_pc3_plot.exists():
    display(Image(str(pc2_pc3_plot)))
else:
    print(f"{research_question.title()} PC2 vs PC3 plot not found. Run the full pipeline first.")

## Summary Statistics

Load and display key statistics from the PCA analysis.

In [None]:
# Load PCA model and metadata
pca_model_path = pca_dir / "pca_model.pkl"
metadata_path = pca_dir / "metadata.pkl"

if pca_model_path.exists() and metadata_path.exists():
    with open(pca_model_path, 'rb') as f:
        pca_model = pickle.load(f)
    
    with open(metadata_path, 'rb') as f:
        metadata = pickle.load(f)
    
    print("=== PCA Analysis Summary ===")
    print(f"Training samples: {len(metadata['train'][research_question]):,}")
    print(f"Validation samples: {len(metadata['val'][research_question]):,}")
    print(f"Test samples: {len(metadata['test'][research_question]):,}")
    print(f"Total components: {pca_model.n_components_}")
    print(f"Variance explained: {pca_model.explained_variance_ratio_.sum():.1%}")
    print(f"Top 3 components variance: {pca_model.explained_variance_ratio_[:3].sum():.1%}")
    
    # Show research group distribution
    import pandas as pd
    train_counts = pd.Series(metadata['train'][research_question]).value_counts()
    val_counts = pd.Series(metadata['val'][research_question]).value_counts()
    test_counts = pd.Series(metadata['test'][research_question]).value_counts()
    
    print(f"\n{research_question.title()} group distribution:")
    print(f"  Train: {train_counts.to_dict()}")
    print(f"  Val:   {val_counts.to_dict()}")
    print(f"  Test:  {test_counts.to_dict()}")
    
else:
    print("PCA model or metadata not found. Run the full pipeline first.")

## Conclusion

This PCA analysis demonstrates:

1. **Dimensionality Reduction**: High-dimensional imaging data (~1100 features) is reduced to a manageable number of components while preserving variance
2. **Group Patterns**: Visualization of research group separation in principal component space
3. **Feature Engineering**: PCA-transformed features ready for downstream classification models (SVM, MLP)

The PCA models and transformed data are saved for use in classification pipelines.
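
When a saved model is reused downstream, new samples should be centered with the training mean before projection, so that no statistics from validation or test data enter the features. The following is a minimal NumPy sketch of that projection, assuming unwhitened PCA; `fit_pca` and `project` are hypothetical helpers and the random data is a stand-in, not pipeline output.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on training data via SVD; return the training mean and components."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def project(X, mean, components):
    """Center with the *training* mean, then project onto the components."""
    return (X - mean) @ components.T


# Stand-in data (assumption): 20 training and 4 test samples with 5 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
X_test = rng.normal(size=(4, 5))

mean, components = fit_pca(X_train, n_components=2)
Z_test = project(X_test, mean, components)
print(Z_test.shape)
```

The resulting array has one row per sample and one column per retained component, which is the shape a downstream classifier would consume.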