# PCA Analysis
## Dimensionality Reduction for Aortic Features

This notebook demonstrates the use of Principal Component Analysis (PCA) for dimensionality reduction on aortic profile data. The goal is to reduce the complexity of the data while retaining the most important features for classification and visualization.

### Objectives:
1. Load and preprocess aortic profile data.
2. Perform PCA to reduce dimensionality.
3. Visualize the PCA-transformed data.
4. Evaluate classification performance using PCA-transformed features.
5. Provide an interactive interface for PCA analysis.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from ipywidgets import interact, Dropdown, IntSlider
from utils import (
    categorize_diameter, evaluate_classifier, evaluate_regressor, 
    plot_confusion_matrix, plot_roc_curves, plot_feature_importance, 
    ConfusionMatrixDisplay
)

## 1. Load and Preprocess Profile Data

This section defines a function to load and preprocess aortic profile data for PCA analysis. The data is standardized, and clinical labels are matched for further analysis.

### Steps:
1. Load the selected profile data from CSV files.
2. Load clinical labels and categorize aortic diameters into groups:
   - `<40mm`
   - `40-45mm`
   - `45-50mm`
   - `≥50mm`
3. Standardize the features to ensure they are on the same scale.
4. Return the standardized features and corresponding labels.
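
The binning in step 2 is done by `categorize_diameter`, which comes from the local `utils` module and is not shown in this notebook. The sketch below illustrates what such a helper might look like. The name `categorize_diameter_sketch`, the integer codes 0 to 3, and the choice to place the boundary values (40, 45, 50) in the upper group are all assumptions, not the actual implementation.

```python
import numpy as np

# Hypothetical stand-in for utils.categorize_diameter (not the notebook's actual helper).
GROUP_EDGES_MM = [40.0, 45.0, 50.0]
GROUP_LABELS = ["<40mm", "40-45mm", "45-50mm", "≥50mm"]


def categorize_diameter_sketch(max_diameter_mm):
    """Return an integer group code (0-3) for a maximum diameter in mm."""
    # np.digitize counts how many edges are <= the value,
    # so a boundary value such as 40.0 lands in the upper group (assumption).
    return int(np.digitize(max_diameter_mm, GROUP_EDGES_MM))


codes = [categorize_diameter_sketch(d) for d in (38.2, 40.0, 47.5, 52.1)]
print([GROUP_LABELS[c] for c in codes])
```

If the real helper treats boundary values differently, only the edge handling in `np.digitize` (its `right` argument) would need to change.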

In [2]:
FILE_PATHS = {
    "CenterlineCurvature": "../../Data/measures/AscendingProfile_CenterlineCurvature.csv",
    "Diameter": "../../Data/measures/AscendingProfile_Diameter.csv",
    "Eccentricity": "../../Data/measures/AscendingProfile_Eccentricity.csv",
    "ScaledDiameter": "../../Data/measures/AscendingProfile_ScaledDiameter.csv"
}

@interact
def load_profile_data(profile_type=Dropdown(options=list(FILE_PATHS.keys()))):
    try:
        profile_df = pd.read_csv(FILE_PATHS[profile_type])
    except FileNotFoundError:
        print(f"Error: File not found for profile type '{profile_type}'.")
        return None, None, profile_type
    except pd.errors.EmptyDataError:
        print(f"Error: File for profile type '{profile_type}' is empty or invalid.")
        return None, None, profile_type
    
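    # Load clinical labels and bin the maximum diameter into groups
    ascending_df = pd.read_csv("../../Data/measures/Ascending.csv")
    ascending_df['group'] = ascending_df['max_diameter'].apply(categorize_diameter)
    # Key the lookup by string IDs so it matches the str() cast used when matching below
    id_to_group = dict(zip(ascending_df.iloc[:, 0].astype(str), ascending_df['group']))
    
    # Match labels (IDs without a clinical record map to None)
    groups = np.array([id_to_group.get(str(id_value)) for id_value in profile_df.iloc[:, 0]])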
    
    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(profile_df.iloc[:, 1:].values)
    
    return X_scaled, groups, profile_type
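
Because labels are matched with `dict.get`, any profile ID without a clinical record ends up as `None` in `groups`, which the plotting and classification steps cannot handle. One possible way to deal with this is to drop those rows before further analysis. The helper name `drop_unlabeled` and the assumption that group codes are integers are illustrative, not part of the notebook.

```python
import numpy as np


def drop_unlabeled(X, groups):
    """Keep only the rows whose label is not None (hypothetical helper)."""
    groups = np.asarray(groups, dtype=object)
    keep = np.array([g is not None for g in groups], dtype=bool)
    # Assumes the remaining labels are integer group codes
    return X[keep], groups[keep].astype(int)


# Toy data: four samples, the second one has no matching label
X_demo = np.arange(8, dtype=float).reshape(4, 2)
groups_demo = np.array([0, None, 2, 3], dtype=object)

X_kept, groups_kept = drop_unlabeled(X_demo, groups_demo)
print(X_kept.shape, groups_kept)
```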


## 2. Run PCA Analysis

This section performs Principal Component Analysis (PCA) on the standardized data to reduce its dimensionality. PCA identifies the directions (principal components) that capture the most variance in the data.

### Outputs:
1. A plot showing the cumulative explained variance as a function of the number of components.
2. A bar chart showing the variance explained by each principal component.

### Parameters:
- `n_components`: The number of principal components to retain.

In [3]:
def run_pca_analysis(X_scaled, n_components=10):
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    
    # Plot explained variance
    plt.figure(figsize=(10,4))
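    # 1-based component indices so both x-axes read as component counts/numbers
    component_ids = np.arange(1, pca.n_components_ + 1)
    
    plt.subplot(121)
    plt.plot(component_ids, np.cumsum(pca.explained_variance_ratio_), 'o-')
    plt.xlabel("Number of Components")
    plt.ylabel("Cumulative Explained Variance")
    
    plt.subplot(122)
    plt.bar(component_ids, pca.explained_variance_ratio_)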
    plt.xlabel("Principal Component")
    plt.ylabel("Variance Explained")
    plt.tight_layout()
    plt.show()
    
    return pca, X_pca
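
A common way to read the cumulative explained-variance curve is to take the smallest number of components that reaches a target fraction of the variance. The sketch below does this on synthetic data. The function name `components_for_variance`, the 0.95 threshold, and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X_scaled, threshold=0.95):
    """Smallest number of components whose cumulative variance ratio >= threshold."""
    pca = PCA().fit(X_scaled)  # keep all components to get the full curve
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point sums that stop just short of the threshold
    return min(k, len(cumulative)), cumulative


# Synthetic data: 12 correlated features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 12))
X_scaled = StandardScaler().fit_transform(X)

k, cumulative = components_for_variance(X_scaled, threshold=0.95)
print(f"{k} components reach {cumulative[k - 1]:.1%} of the variance")
```

scikit-learn can also make this choice directly: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) keeps just enough components to exceed that fraction of the variance.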

## 3. Visualize PCA Space

This section visualizes the PCA-transformed data in 2D using a chosen pair of principal components (by default, the first two). Each point represents a sample, color-coded by clinical group.

### Features:
- Scatter plot of PCA-transformed data.
- Color-coded groups:
  - Blue: `<40mm`
  - Green: `40-45mm`
  - Orange: `45-50mm`
  - Red: `≥50mm`

### Parameters:
- `pc_x`: Zero-based index of the principal component on the x-axis (default `0`, i.e. PC1).
- `pc_y`: Zero-based index of the principal component on the y-axis (default `1`, i.e. PC2).

In [4]:
def plot_pca_space(X_pca, groups, pc_x=0, pc_y=1):
    group_colors = {0: "blue", 1: "green", 2: "orange", 3: "red"}
    group_labels = {0: "<40mm", 1: "40-45mm", 2: "45-50mm", 3: "≥50mm"}
    
    plt.figure(figsize=(8,6))
    for group in np.unique(groups):
        mask = groups == group
        plt.scatter(X_pca[mask, pc_x], X_pca[mask, pc_y],
                   c=group_colors[group], label=group_labels[group],
                   alpha=0.7)
    
    plt.xlabel(f"PC{pc_x+1}")
    plt.ylabel(f"PC{pc_y+1}")
    plt.title("PCA Space")
    plt.legend()
    plt.grid()
    plt.show()

## 4. Evaluate PCA Classification

This section evaluates the classification performance of a logistic regression model trained on PCA-transformed features. The model predicts the clinical group of each sample.

### Outputs:
1. Classification report showing precision, recall, and F1-score for each group.
2. Confusion matrix visualizing the model's performance.

### Steps:
1. Split the PCA-transformed data into training and testing sets.
2. Train a logistic regression model on the training set.
3. Evaluate the model on the testing set.
4. Visualize the confusion matrix.
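
In this notebook the scaler and the PCA are fitted on all samples before the train-test split, so the test rows influence the transform. An alternative, sketched below on synthetic data, is to wrap scaling, PCA, and the classifier in a scikit-learn pipeline that is fitted on the training rows only. This is an illustrative variant, not the notebook's method; the data shapes, `n_components=5`, and the stratified split are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for profile data: 120 samples, 30 features, 4 groups
rng = np.random.default_rng(42)
y = rng.integers(0, 4, size=120)
X = rng.normal(size=(120, 30)) + 0.8 * y[:, None]

# Split the raw features first, keeping group proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaler and PCA are fitted inside the pipeline, on the training rows only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

The same pipeline object can be passed to cross-validation utilities, which refit every step per fold.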

In [5]:
def evaluate_pca_classification(X_pca, groups):
    """
    Evaluate classification performance using PCA-transformed data.
    
    Args:
        X_pca: PCA-transformed features
        groups: True labels
    
    Returns:
        Confusion matrix and classification report
    """
    # Train-test split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_pca, groups, test_size=0.3, random_state=42)
    
    # Train a classifier
    clf = LogisticRegression(max_iter=1000, random_state=42)
    clf.fit(X_train, y_train)
    
    # Predictions
    y_pred = clf.predict(X_test)
    
    # Metrics
    print("Classification Report:")
    report = classification_report(y_test, y_pred)
    print(report)
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, title="Confusion Matrix")
    
    return cm, report

def plot_pca_feature_loadings(pca, feature_names):
    loadings = pca.components_.T
    num_components = pca.n_components_
    
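    plt.figure(figsize=(10, 6))
    positions = np.arange(len(feature_names))
    bar_width = 0.8 / num_components  # share each feature's slot among the components
    for i in range(num_components):
        # Offset each component so its bars sit beside, not on top of, the others
        plt.bar(positions + i * bar_width, loadings[:, i], width=bar_width, label=f"PC{i+1}")
    
    # Center each tick under its group of bars
    plt.xticks(positions + bar_width * (num_components - 1) / 2, feature_names, rotation=90)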
    plt.xlabel("Features")
    plt.ylabel("Loadings")
    plt.title("PCA Feature Loadings")
    plt.legend()
    plt.tight_layout()
    plt.show()

def display_variation_modes(pca):
    for i, variance in enumerate(pca.explained_variance_ratio_):
        print(f"PC{i+1}: {variance:.2%} of variance explained")
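
With many features, a loading bar chart can be hard to read. A complementary view is to list, for one component, the features with the largest absolute weights in `pca.components_`. The helper name `top_loadings` and the feature names `pos_0` to `pos_4` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA


def top_loadings(pca, feature_names, component=0, k=3):
    """Return the k features with the largest absolute weight on one component."""
    weights = pca.components_[component]
    order = np.argsort(np.abs(weights))[::-1][:k]  # largest magnitude first
    return [(feature_names[i], float(weights[i])) for i in order]


# Synthetic data in which one feature carries far more variance than the rest
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 2] *= 10.0
names = [f"pos_{i}" for i in range(5)]  # hypothetical feature names

pca = PCA(n_components=2).fit(X)
top = top_loadings(pca, names, component=0, k=2)
for name, weight in top:
    print(f"{name}: {weight:+.3f}")
```

The sign of a weight is arbitrary up to flipping the whole component, so the magnitudes are what matter when ranking features.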

## 5. Interactive PCA Analysis

This section provides an interactive interface for PCA analysis using `ipywidgets`. Users can:
1. Select the profile type to analyze.
2. Adjust the number of principal components (`n_components`).
3. Choose which principal components to visualize (`pc_x` and `pc_y`).

### Features:
- Interactive dropdown to select the profile type.
- Sliders to adjust PCA parameters.
- Dynamic visualization of PCA space and feature loadings.
- Evaluation of classification performance using the selected PCA configuration.

In [6]:
@interact
def interactive_pca_combined(profile_type=Dropdown(options=list(FILE_PATHS.keys())),
                             n_components=(2, 20, 1), pc_x=(0, 9, 1), pc_y=(1, 10, 1)):
    X_scaled, groups, _ = load_profile_data(profile_type)
    if X_scaled is None or groups is None:
        print("Error: Failed to load profile data.")
        return
    
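    pca, X_pca = run_pca_analysis(X_scaled, n_components)
    # The axis sliders can point past the retained components, so check before plotting
    if max(pc_x, pc_y) < X_pca.shape[1]:
        plot_pca_space(X_pca, groups, pc_x, pc_y)
    else:
        print(f"pc_x and pc_y must be smaller than n_components ({X_pca.shape[1]}).")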
    
    # Display variation modes
    display_variation_modes(pca)
    
    # Plot PCA feature loadings
    feature_df = pd.read_csv(FILE_PATHS[profile_type])
    if feature_df.shape[1] > 1:  # Ensure there are feature columns
        feature_names = feature_df.columns[1:]
        plot_pca_feature_loadings(pca, feature_names)
    else:
        print("No feature columns found in the dataset.")
    
    # Evaluate classification performance
    cm, report = evaluate_pca_classification(X_pca, groups)
    
    return pca


## Conclusion

In this notebook, we demonstrated the use of PCA for dimensionality reduction on aortic profile data. Key takeaways include:
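1. PCA reduces the dimensionality of the profile data; the cumulative explained-variance plot shows how much variance a given `n_components` retains.
2. Plotting samples in PCA space helps reveal patterns and how well the diameter groups separate.
3. PCA-transformed features can be fed to a logistic regression classifier, whose performance is summarized by the classification report and confusion matrix.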