## Problem Statement

PCA is one of the most commonly used unsupervised learning algorithms for dimensionality reduction, but it is currently missing from aprender.

**Impact:** Without PCA, aprender cannot support feature extraction, data visualization, or noise reduction, all of which are fundamental ML tasks.
## Proposed Solution

Implement PCA following the scikit-learn API, developed with the EXTREME TDD methodology.
## Algorithm

**Mathematical Foundation:**

1. Center the data (subtract the per-feature mean)
2. Compute the covariance matrix: Σ = (X^T X) / (n - 1)
3. Eigendecomposition: Σ = V Λ V^T
4. Sort the eigenvectors by eigenvalue (descending)
5. Project the data: X_pca = X V_k (keep the top k components)
**Key Features:**

- `n_components` parameter (an integer count or a variance threshold)
- `explained_variance_` - variance explained by each component
- `explained_variance_ratio_` - fraction of total variance explained by each component
- `transform()` - project data into the lower-dimensional space
- `inverse_transform()` - reconstruct the original space (lossy)
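One way the variance-threshold form of `n_components` could be resolved, sketched with a hypothetical helper (the function name and signature below are illustrative, not aprender's actual API): keep the smallest k whose cumulative explained-variance ratio reaches the threshold.

```rust
/// Given explained-variance ratios sorted in descending order, return
/// the smallest number of components whose cumulative ratio meets
/// `threshold` (e.g. 0.95 to retain 95% of the variance).
fn components_for_threshold(ratios: &[f32], threshold: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, r) in ratios.iter().enumerate() {
        cumulative += r;
        if cumulative >= threshold {
            return i + 1;
        }
    }
    ratios.len() // threshold not reachable: keep all components
}

fn main() {
    let ratios = [0.72, 0.18, 0.06, 0.04];
    // 0.72 + 0.18 + 0.06 = 0.96 >= 0.95, so three components suffice
    println!("{}", components_for_threshold(&ratios, 0.95));
}
```

This mirrors scikit-learn's behavior where a float `n_components` in (0, 1) selects the number of components needed to explain that fraction of the variance.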
## Implementation

**Trait:** `Transformer` (`fit`/`transform`/`fit_transform`)

**API Design** (signatures only; bodies elided):

```rust
pub struct PCA {
    n_components: Option<usize>,
    components: Option<Matrix<f32>>,          // Principal components (eigenvectors)
    mean: Option<Vector<f32>>,                // Per-feature mean of the training data
    explained_variance: Option<Vector<f32>>,
    explained_variance_ratio: Option<Vector<f32>>,
}

impl Transformer for PCA {
    fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
    fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>;
}
```
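To show how the trait pattern fits together, here is a minimal self-contained sketch using plain `Vec<Vec<f32>>` in place of aprender's `Matrix`/`Vector` types. Only the mean-centering part of `fit`/`transform` is implemented; the eigendecomposition and component projection are elided with comments.

```rust
trait Transformer {
    fn fit(&mut self, x: &[Vec<f32>]) -> Result<(), &'static str>;
    fn transform(&self, x: &[Vec<f32>]) -> Result<Vec<Vec<f32>>, &'static str>;
    // fit_transform gets a default implementation in terms of the other two
    fn fit_transform(&mut self, x: &[Vec<f32>]) -> Result<Vec<Vec<f32>>, &'static str> {
        self.fit(x)?;
        self.transform(x)
    }
}

struct Pca {
    mean: Option<Vec<f32>>,
}

impl Transformer for Pca {
    fn fit(&mut self, x: &[Vec<f32>]) -> Result<(), &'static str> {
        if x.is_empty() {
            return Err("empty input");
        }
        let d = x[0].len();
        let mut mean = vec![0.0f32; d];
        for row in x {
            for j in 0..d {
                mean[j] += row[j];
            }
        }
        for m in &mut mean {
            *m /= x.len() as f32;
        }
        self.mean = Some(mean);
        // A real fit would continue: covariance matrix, eigendecomposition,
        // sorting components, and explained-variance bookkeeping.
        Ok(())
    }

    fn transform(&self, x: &[Vec<f32>]) -> Result<Vec<Vec<f32>>, &'static str> {
        let mean = self.mean.as_ref().ok_or("PCA not fitted")?;
        // A real transform would also multiply by the top-k components.
        Ok(x.iter()
            .map(|row| row.iter().zip(mean).map(|(v, m)| v - m).collect())
            .collect())
    }
}

fn main() {
    let x = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut pca = Pca { mean: None };
    let centered = pca.fit_transform(&x).unwrap();
    println!("{centered:?}"); // [[-1.0, -1.0], [1.0, 1.0]]
}
```

Returning `Err("PCA not fitted")` from `transform` before `fit` matches the proposed `Result<_, &'static str>` error style above.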
## Success Criteria

- ✅ PCA struct with Transformer trait
- ✅ fit/transform/inverse_transform methods
- ✅ Explained variance computation
- ✅ 15+ tests passing
- ✅ Zero clippy warnings
- ✅ Example: `examples/pca_iris.rs`
- ✅ Book chapter: `book/src/ml-fundamentals/pca.md`
## Estimated Effort

**Timeline:** 1-2 days

**Complexity:** Medium (requires eigendecomposition)