Labels
enhancement (New feature or request)
Description
Overview
This issue tracks the implementation of comprehensive integration tests for the Clustering module. This is a P0 (Critical Priority) module for unsupervised learning.
Parent Issue: #615
Current Status
- Coverage: 0% for algorithms (only metrics tested)
- Source Files: 90+ files
- Test Files: Only ClusteringMetricsIntegrationTests.cs (tests metrics, not algorithms)
Module Location
src/Clustering/
Classes to Test
Core Clustering Algorithms
Partitional Clustering:
- KMeans
- MiniBatchKMeans
- OnlineKMeans
- KMedoids
- FuzzyCMeans
- GMeans
- XMeans
- BisectingKMeans
- SeededKMeans
- COPKMeans (constrained)
Density-Based Clustering:
- DBSCAN
- HDBSCAN
- OPTICS
- Denclue
- MeanShift
Hierarchical Clustering:
- AgglomerativeClustering
- BIRCH
- CURE
- CLARANS
Spectral Clustering:
- SpectralClustering
Subspace Clustering:
- CLIQUE
- SUBCLU
Probabilistic and Message-Passing Clustering:
- GaussianMixtureModel
- AffinityPropagation
Other:
- SelfOrganizingMap
- ConsensusClustering
Distance Metrics
- EuclideanDistance
- ManhattanDistance
- CosineDistance
- MahalanobisDistance
- ChebyshevDistance
- MinkowskiDistance
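Several of these metrics are related, which gives cheap cross-checks for the tests. Below is a minimal reference sketch in plain Python (not the module's C# API). The function names are hypothetical, and defining cosine distance as 1 minus cosine similarity is an assumption about how CosineDistance is implemented. MahalanobisDistance is omitted because it needs a covariance matrix.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p >= 1 between equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumption: distance = 1 - cosine similarity; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = [1.0, 2.0], [4.0, 6.0]  # coordinate differences are 3 and 4
```

Useful invariants to assert in the integration tests: Minkowski with p = 1 equals Manhattan, p = 2 equals Euclidean, and large p approaches Chebyshev.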
Spatial Data Structures
- KDTree
- BallTree
Evaluation Metrics (already partially tested)
- SilhouetteScore
- DaviesBouldinIndex
- CalinskiHarabaszIndex
- DunnIndex
- AdjustedRandIndex
- NormalizedMutualInformation
- VMeasure
- FMeasure
- FowlkesMallowsIndex
- JaccardIndex
- Purity
- VariationOfInformation
- ClusteringEntropy
- ConnectivityIndex
- WCSS
Validation Methods
- ElbowMethod
- GapStatistic
- StabilityValidation
- BootstrapValidation
AutoML
- ClusteringAutoML
- ClusteringGridSearch
- ClusteringEvaluator
Test Categories Required
1. Basic Clustering Tests
- Verify clusters are assigned for all data points
- Test with known cluster structures (blobs, moons, circles)
- Verify number of clusters matches expectation
- Test reproducibility with the same random seed
2. Algorithm-Specific Tests
KMeans Family:
- Test convergence behavior
- Verify centroid computation
- Test with different initialization methods (random, k-means++)
- Test with different distance metrics
- Compare against scikit-learn KMeans
DBSCAN:
- Test eps and min_samples parameters
- Verify noise point detection
- Test with varying density clusters
- Compare against scikit-learn DBSCAN
Hierarchical:
- Test different linkage methods (single, complete, average, ward)
- Verify dendrogram structure
- Test distance thresholds
Spectral:
- Test different affinity matrices
- Verify eigenvalue computation
- Test with different numbers of components
3. Edge Cases
- Test with single data point
- Test with all identical points
- Test with high-dimensional data
- Test with outliers
- Test with very small/large values
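- Test with NaN values (should be handled gracefully, for example rejected with a clear exception rather than silently producing invalid cluster assignments)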
4. Performance Tests
- Test with varying dataset sizes
- Verify memory usage is reasonable
- Test convergence speed
5. Serialization Tests
- Save and load trained clustering models
- Verify predictions match after reload
- Test with various configurations
6. Clone Tests
- Clone trained models
- Verify independence
- Verify predictions match
Mathematical Correctness Verification
KMeans
- Verify WCSS (Within-Cluster Sum of Squares) is non-increasing across iterations
- Compare centroids against manually computed values
- Verify cluster assignments minimize distance to centroid
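The KMeans checks above can be prototyped against a small reference implementation. This is a sketch of plain Lloyd's algorithm in Python, not the module's KMeans; the helper names, the random initialization, and the sample points are all assumptions chosen for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    # Each point goes to the index of its nearest centroid.
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def wcss(points, centroids, labels):
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

def lloyd_kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    history = []
    for _ in range(iterations):
        labels = assign(points, centroids)
        history.append(wcss(points, centroids, labels))
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    labels = assign(points, centroids)  # final assignment for updated centroids
    history.append(wcss(points, centroids, labels))
    return centroids, labels, history

points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9),
          (10.0, 10.0), (10.3, 9.8), (9.7, 10.4)]
centroids, labels, history = lloyd_kmeans(points, k=2)
```

The C# tests can assert the same two properties: every consecutive pair in the WCSS history is non-increasing (within a small floating-point tolerance), and no point is closer to another centroid than to its assigned one.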
DBSCAN
- Verify core points, border points, noise classification
- Test reachability relationships
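For the core/border/noise check, expected labels for a small dataset can be derived directly from the definitions. The sketch below is a brute-force Python reference, not the module's DBSCAN. The function name is hypothetical, and it assumes the scikit-learn convention that a point counts itself toward min_samples and that the neighbourhood test is distance <= eps.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise'."""
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core: at least min_samples points (itself included) within eps.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_samples}
    kinds = []
    for i, nb in enumerate(neighbours):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (2.0, 1.0), (10.0, 10.0)]
kinds = classify_points(points, eps=1.0, min_samples=3)
```

With these inputs the four corner points are core, (2, 1) is a border point attached through (1, 1), and (10, 10) is noise, which makes a compact fixture for the noise-detection test.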
GMM
- Verify log-likelihood is non-decreasing across EM iterations
- Test that each point's membership probabilities (responsibilities) sum to 1
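Both GMM properties can be seen on a tiny one-dimensional mixture. This is an illustrative EM sketch in Python under simplifying assumptions (1-D data, two components, a small variance floor); the function names, initial parameters, and data are hypothetical and do not reflect GaussianMixtureModel's API.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, iterations=20):
    means, variances, weights = list(means), list(variances), list(weights)
    log_likelihoods, resp = [], []
    for _ in range(iterations):
        # E-step: responsibilities and log-likelihood under current parameters.
        resp, ll = [], 0.0
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        log_likelihoods.append(ll)
        # M-step: re-estimate each component from its weighted points.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
            weights[k] = nk / len(xs)
    return means, variances, weights, resp, log_likelihoods

xs = [-2.1, -1.9, -2.0, -1.7, -2.3, 3.0, 3.2, 2.8, 3.1, 2.9]
means, variances, weights, resp, lls = em_gmm_1d(
    xs, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

The corresponding assertions are that each entry of the log-likelihood history is at least the previous one (allowing a small floating-point tolerance) and that every row of responsibilities sums to 1.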
Metrics
- Verify Silhouette scores lie in the range [-1, 1]
- Verify the Davies-Bouldin index is non-negative and lower for better-separated clusterings
- Compare against scikit-learn metrics
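As a hand-checkable baseline for the Silhouette range test, per-sample scores can be computed from the definition s = (b - a) / max(a, b). The Python sketch below is a brute-force reference, not the SilhouetteScore class. It assumes Euclidean distance, at least two clusters, and a score of 0 for singleton clusters (the convention scikit-learn uses); the function name is hypothetical.

```python
import math

def silhouette_samples(points, labels):
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_samples(points, [0, 0, 1, 1])  # labels follow the structure
bad = silhouette_samples(points, [0, 1, 0, 1])   # labels cut across it
```

A sensible labelling should score close to 1 on this data, while the deliberately wrong labelling should score below 0; both must stay inside [-1, 1].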
Test Data
Create test datasets:
- Well-separated blobs - 3 clusters, clearly separable
- Overlapping clusters - 2 clusters with overlap
- Non-spherical clusters - Moons, circles
- High-dimensional - 50+ dimensions
- Imbalanced clusters - Different sizes
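One possible way to produce these datasets is with seeded generators. The sketch below is a stdlib-only Python illustration with hypothetical function names and parameters; the moon and circle shapes follow the usual construction (two interleaved half circles, two concentric circles). Python's random stream will not match .NET's, so the C# tests would need either an equivalent generator or data exported to fixture files.

```python
import math
import random

def make_blobs(centers, sizes, std, seed):
    """Gaussian blobs; unequal sizes give imbalanced clusters."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (center, size) in enumerate(zip(centers, sizes)):
        for _ in range(size):
            points.append(tuple(rng.gauss(c, std) for c in center))
            labels.append(label)
    return points, labels

def make_moons(n_per_moon, noise, seed):
    """Two interleaved half circles; n_per_moon must be at least 2."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_moon):
        t = math.pi * i / (n_per_moon - 1)
        points.append((math.cos(t) + rng.gauss(0.0, noise),
                       math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(0)
        points.append((1.0 - math.cos(t) + rng.gauss(0.0, noise),
                       0.5 - math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(1)
    return points, labels

def make_circles(n_per_circle, inner_radius, noise, seed):
    """Outer circle of radius 1 (label 0) and an inner circle (label 1)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in enumerate((1.0, inner_radius)):
        for i in range(n_per_circle):
            t = 2.0 * math.pi * i / n_per_circle
            points.append((radius * math.cos(t) + rng.gauss(0.0, noise),
                           radius * math.sin(t) + rng.gauss(0.0, noise)))
            labels.append(label)
    return points, labels

# Three well-separated blobs, and a 50-dimensional imbalanced pair.
blobs, blob_labels = make_blobs(
    [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)], [30, 30, 30], std=0.5, seed=42)
high_dim, high_dim_labels = make_blobs(
    [(0.0,) * 50, (5.0,) * 50], [100, 10], std=1.0, seed=42)
```

Fixing the seed keeps each dataset identical across runs, which the reproducibility tests depend on.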
Priority Order
Critical (Test First):
- KMeans
- DBSCAN
- GaussianMixtureModel
- AgglomerativeClustering
High:
- MiniBatchKMeans
- HDBSCAN
- SpectralClustering
- KMedoids
- MeanShift
Medium:
- All other algorithms
- Validation methods
- AutoML components
Acceptance Criteria
- All major clustering algorithms have tests
- Tests cover basic functionality and edge cases
- Mathematical correctness verified against reference implementations
- Serialization works correctly
- At least 80% code coverage
- All tests pass on both net8.0 and net471
References
- scikit-learn clustering: https://scikit-learn.org/stable/modules/clustering.html
- HDBSCAN docs: https://hdbscan.readthedocs.io/