test: Add comprehensive integration tests for Clustering module [P0] #617

@ooples

Overview

This issue tracks the implementation of comprehensive integration tests for the Clustering module. This is a P0 (Critical Priority) module for unsupervised learning.

Parent Issue: #615

Current Status

  • Coverage: 0% for algorithms (only metrics tested)
  • Source Files: 90+ files
  • Test Files: Only ClusteringMetricsIntegrationTests.cs (tests metrics, not algorithms)

Module Location

src/Clustering/

Classes to Test

Core Clustering Algorithms

Partitional Clustering:

  • KMeans
  • MiniBatchKMeans
  • OnlineKMeans
  • KMedoids
  • FuzzyCMeans
  • GMeans
  • XMeans
  • BisectingKMeans
  • SeededKMeans
  • COPKMeans (constrained)

Density-Based Clustering:

  • DBSCAN
  • HDBSCAN
  • OPTICS
  • Denclue
  • MeanShift

Hierarchical Clustering:

  • AgglomerativeClustering
  • BIRCH
  • CURE
  • CLARANS (medoid-based randomized search; usually classed as partitional)

Spectral Clustering:

  • SpectralClustering

Subspace Clustering:

  • CLIQUE
  • SUBCLU

Probabilistic Clustering:

  • GaussianMixtureModel
  • AffinityPropagation

Other:

  • SelfOrganizingMap
  • ConsensusClustering

Distance Metrics

  • EuclideanDistance
  • ManhattanDistance
  • CosineDistance
  • MahalanobisDistance
  • ChebyshevDistance
  • MinkowskiDistance
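
The relationships between these metrics give cheap property checks for the distance tests. The following is a minimal Python sketch of the textbook definitions, not the module's API; all function names are hypothetical, and MahalanobisDistance is left out because it needs a covariance estimate.

```python
import math

def minkowski(p, q, order):
    """Minkowski distance of the given order (order >= 1)."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """1 - cosine similarity; undefined (division by zero) for a zero vector."""
    dot = sum(a * b for a, b in zip(p, q))
    return 1 - dot / (math.hypot(*p) * math.hypot(*q))

origin, corner = (0.0, 0.0), (3.0, 4.0)
# Minkowski with order 1 and 2 reduces to Manhattan and Euclidean.
assert math.isclose(minkowski(origin, corner, 1), manhattan(origin, corner))
assert math.isclose(minkowski(origin, corner, 2), euclidean(origin, corner))
# A large order approaches the Chebyshev distance.
assert abs(minkowski(origin, corner, 50) - chebyshev(origin, corner)) < 1e-6
```

The same identities (plus symmetry and zero self-distance) can be asserted against the real distance classes.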

Spatial Data Structures

  • KDTree
  • BallTree

Evaluation Metrics (already partially tested)

  • SilhouetteScore
  • DaviesBouldinIndex
  • CalinskiHarabaszIndex
  • DunnIndex
  • AdjustedRandIndex
  • NormalizedMutualInformation
  • VMeasure
  • FMeasure
  • FowlkesMallowsIndex
  • JaccardIndex
  • Purity
  • VariationOfInformation
  • ClusteringEntropy
  • ConnectivityIndex
  • WCSS

Validation Methods

  • ElbowMethod
  • GapStatistic
  • StabilityValidation
  • BootstrapValidation

AutoML

  • ClusteringAutoML
  • ClusteringGridSearch
  • ClusteringEvaluator

Test Categories Required

1. Basic Clustering Tests

  • Verify clusters are assigned for all data points
  • Test with known cluster structures (blobs, moons, circles)
  • Verify number of clusters matches expectation
  • Test reproducibility with same random seed
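
The four checks above can be sketched as follows. This is an illustrative Python sketch, not the module's API: `make_blobs` and `nearest_center_labels` are hypothetical helpers, and the nearest-center rule stands in for whichever algorithm is under test.

```python
import random

def make_blobs(centers, points_per_cluster, spread, seed):
    """Sample 2-D Gaussian blobs; returns (points, true_labels)."""
    rng = random.Random(seed)  # local RNG, so the seed fully determines the data
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(points_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels

def nearest_center_labels(points, centers):
    """Stand-in clusterer: assign each point to its nearest given center."""
    def sq(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return [min(range(len(centers)), key=lambda k: sq(p, centers[k])) for p in points]

CENTERS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, truth = make_blobs(CENTERS, 50, 0.5, seed=42)
labels = nearest_center_labels(points, CENTERS)

assert len(labels) == len(points)                          # every point is assigned
assert len(set(labels)) == len(CENTERS)                    # cluster count matches
assert make_blobs(CENTERS, 50, 0.5, seed=42)[0] == points  # same seed, same data
```

In the real tests the stand-in would be replaced by the fitted model, with the same three assertions applied to its output.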

2. Algorithm-Specific Tests

KMeans Family:

  • Test convergence behavior
  • Verify centroid computation
  • Test with different initialization methods (random, k-means++)
  • Test with different distance metrics
  • Compare against scikit-learn KMeans
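
For the initialization tests, a reference for k-means++ seeding may help. The sketch below follows the standard description (first center uniform at random, each later center drawn with probability proportional to its squared distance from the nearest chosen center); `kmeans_plus_plus` is a hypothetical name, not the module's API.

```python
import random

def kmeans_plus_plus(points, k, seed):
    """Pick k initial centers using D^2 weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Already-chosen points have weight 0 and cannot be drawn again.
        # If all weights are 0 (all points identical), rng.choices raises ValueError.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
seeds = kmeans_plus_plus(sample, 3, seed=7)
assert len(set(seeds)) == 3                        # distinct points give distinct centers
assert kmeans_plus_plus(sample, 3, seed=7) == seeds  # reproducible for a fixed seed
```

The all-identical-points failure mode noted in the comment is worth covering explicitly in the edge-case tests.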

DBSCAN:

  • Test eps and min_samples parameters
  • Verify noise point detection
  • Test with varying density clusters
  • Compare against scikit-learn DBSCAN

Hierarchical:

  • Test different linkage methods (single, complete, average, ward)
  • Verify dendrogram structure
  • Test distance thresholds
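
For the linkage tests, small hand-checkable inputs are useful. The sketch below computes the inter-cluster distance under single, complete, and average linkage from their standard definitions; `linkage_distance` is a hypothetical helper, and Ward linkage is omitted because it is defined through variance increase rather than pairwise distances.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the named linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)   # closest pair
    if method == "complete":
        return max(pairwise)   # farthest pair
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported linkage: {method}")

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
# Pairwise distances are 3, 5, 2, 4.
assert linkage_distance(left, right, "single") == 2.0
assert linkage_distance(left, right, "complete") == 5.0
assert linkage_distance(left, right, "average") == 3.5
```

The ordering single ≤ average ≤ complete holds for any pair of clusters and makes a convenient invariant.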

Spectral:

  • Test different affinity matrices
  • Verify eigenvalue computation
  • Test with different number of components

3. Edge Cases

  • Test with single data point
  • Test with all identical points
  • Test with high-dimensional data
  • Test with outliers
  • Test with very small/large values
  • Test with NaN values (should be handled gracefully, e.g. a clear exception rather than silently propagated NaN results)

4. Performance Tests

  • Test with varying dataset sizes
  • Verify memory usage is reasonable
  • Test convergence speed

5. Serialization Tests

  • Save and load trained clustering models
  • Verify predictions match after reload
  • Test with various configurations

6. Clone Tests

  • Clone trained models
  • Verify independence
  • Verify predictions match
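
The serialization and clone tests share one shape: take predictions from the original, build a copy, and compare. The sketch below shows that shape in Python with a hypothetical `TinyCentroidModel` and JSON round-trip; the real tests would use the module's own save/load and clone methods, whose names are not given in this issue.

```python
import copy
import json

class TinyCentroidModel:
    """Hypothetical stand-in for a trained clustering model."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]

    def predict(self, point):
        # Index of the nearest centroid by squared Euclidean distance.
        return min(
            range(len(self.centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(point, self.centroids[k])),
        )

    def to_json(self):
        return json.dumps({"centroids": self.centroids})

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text)["centroids"])

model = TinyCentroidModel([(0.0, 0.0), (10.0, 10.0)])
probes = [(1.0, 1.0), (9.0, 8.0), (4.0, 4.0), (6.0, 6.0)]
expected = [model.predict(p) for p in probes]

# Serialization: predictions match after a save/load round trip.
reloaded = TinyCentroidModel.from_json(model.to_json())
assert [reloaded.predict(p) for p in probes] == expected

# Clone: predictions match, and mutating the clone leaves the original intact.
clone = copy.deepcopy(model)
assert [clone.predict(p) for p in probes] == expected
clone.centroids[0][0] = 100.0
assert model.centroids[0][0] == 0.0
```

The independence check matters most for models that hold arrays or trees internally, where a shallow copy would share state.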

Mathematical Correctness Verification

KMeans

  • Verify WCSS (Within-Cluster Sum of Squares) is non-increasing across iterations
  • Compare centroids against manually computed values
  • Verify cluster assignments minimize distance to centroid
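
A small reference for these three checks: the sketch below is plain Lloyd's algorithm in Python, written from the textbook description, with WCSS recorded after each assignment step. The function names are hypothetical and the input is chosen so the expected centroids can be computed by hand.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k])) for p in points]

def wcss(points, centroids, labels):
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

def update(points, labels, centroids):
    """Move each centroid to the mean of its members."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == k]
        if not members:
            new.append(old)  # an empty cluster keeps its previous centroid
        else:
            new.append(tuple(sum(m[d] for m in members) / len(members) for d in range(len(old))))
    return new

def lloyd(points, centroids, max_iter=50):
    history = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        history.append(wcss(points, centroids, labels))
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break  # converged: labels already match the final centroids
        centroids = new_centroids
    return centroids, labels, history

square = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, labels, history = lloyd(square, [(0.0, 0.0), (10.0, 0.0)])

assert centroids == [(0.0, 1.0), (10.0, 1.0)]  # hand-computed means
assert history == [8.0, 4.0]                   # WCSS per iteration
assert all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
assert labels == assign(square, centroids)     # each point sits with its nearest centroid
```

The small tolerance in the monotonicity check allows for floating-point noise on larger inputs.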

DBSCAN

  • Verify core points, border points, noise classification
  • Test reachability relationships
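
The core/border/noise split can be checked against a brute-force reference. The sketch below assumes the scikit-learn convention that a point counts itself toward `min_samples` and that neighbourhoods use distance ≤ `eps`; `classify_points` is a hypothetical helper, and it does not assign cluster ids.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' by brute force."""
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumption: the neighbourhood includes the point itself.
    is_core = [len(n) >= min_samples for n in neighbours]
    kinds = []
    for i, n in enumerate(neighbours):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds

grid = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
kinds = classify_points(grid, eps=1.0, min_samples=3)
assert kinds == ["core", "core", "core", "core", "border", "noise"]
```

If the module counts neighbours differently (for example, excluding the point itself), the expected labels shift, so the convention is worth pinning down in a test of its own.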

GMM

  • Verify log-likelihood is non-decreasing across EM iterations
  • Test that each point's membership probabilities (responsibilities) sum to 1
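
Both GMM properties can be demonstrated on a one-dimensional mixture. The sketch below is a textbook EM step in Python, not the module's implementation; `em_step` and the variance floor of 1e-6 are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture.

    Returns the updated parameters, the responsibilities, and the
    log-likelihood of the parameters that were passed in.
    """
    k = len(weights)
    resp, log_lik = [], 0.0
    for x in data:  # E-step
        joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        log_lik += math.log(total)
        resp.append([v / total for v in joint])
    new_w, new_m, new_v = [], [], []
    for j in range(k):  # M-step
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # assumed floor to avoid variance collapse
    return new_w, new_m, new_v, resp, log_lik

data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.0, 2.1, 2.2, 1.8]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
log_liks = []
for _ in range(10):
    *params, resp, ll = em_step(data, *params)
    log_liks.append(ll)

assert all(abs(sum(r) - 1.0) < 1e-12 for r in resp)              # responsibilities sum to 1
assert all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:]))  # non-decreasing
```

The tolerance absorbs floating-point noise once EM has converged; a strict `>` comparison would fail at the plateau.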

Metrics

  • Verify Silhouette scores in range [-1, 1]
  • Verify the Davies-Bouldin index is non-negative and that lower values correspond to better-separated clusters
  • Compare against scikit-learn metrics
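
For the silhouette range check, a brute-force reference on hand-computable data is enough. The sketch below follows the standard definition s = (b − a) / max(a, b); `silhouette_samples` here is a hypothetical pure-Python helper, and scoring singleton clusters as 0 is an assumed convention (it matches scikit-learn's). It needs at least two clusters and is undefined when all points coincide.

```python
import math

def silhouette_samples(points, labels):
    """Per-point silhouette values using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # Mean distance to the other members (the self-distance contributes 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return scores

line = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_samples(line, [0, 0, 1, 1])
bad = silhouette_samples(line, [0, 1, 0, 1])

assert all(-1.0 <= s <= 1.0 for s in good + bad)
assert abs(good[0] - 9.5 / 10.5) < 1e-12  # a = 1, b = 10.5
assert abs(bad[0] - (-0.4)) < 1e-12       # a = 10, b = 6
```

A deliberately bad labelling, as in `bad`, is a simple way to exercise the negative half of the range.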

Test Data

Create test datasets:

  1. Well-separated blobs - 3 clusters, clearly separable
  2. Overlapping clusters - 2 clusters with overlap
  3. Non-spherical clusters - Moons, circles
  4. High-dimensional - 50+ dimensions
  5. Imbalanced clusters - Different sizes
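
For the non-spherical datasets, seeded generators keep the fixtures reproducible. The sketch below is one possible Python version of the moons and circles shapes, using the same geometry as scikit-learn's `make_moons` and `make_circles` but written independently; the function names and parameters are hypothetical.

```python
import math
import random

def make_circles(n_per_ring, inner_radius, noise, seed):
    """Two concentric rings: label 0 at radius 1, label 1 at inner_radius."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in enumerate((1.0, inner_radius)):
        for i in range(n_per_ring):
            angle = 2 * math.pi * i / n_per_ring
            points.append((radius * math.cos(angle) + rng.gauss(0, noise),
                           radius * math.sin(angle) + rng.gauss(0, noise)))
            labels.append(label)
    return points, labels

def make_moons(n_per_moon, noise, seed):
    """Two interleaving half circles (requires n_per_moon >= 2)."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_moon):
        t = math.pi * i / (n_per_moon - 1)
        # Upper moon.
        points.append((math.cos(t) + rng.gauss(0, noise),
                       math.sin(t) + rng.gauss(0, noise)))
        labels.append(0)
        # Lower moon, shifted right and down so the two interleave.
        points.append((1 - math.cos(t) + rng.gauss(0, noise),
                       0.5 - math.sin(t) + rng.gauss(0, noise)))
        labels.append(1)
    return points, labels

rings, ring_labels = make_circles(100, inner_radius=0.5, noise=0.0, seed=0)
moons, moon_labels = make_moons(100, noise=0.05, seed=0)

# With zero noise, every point lies exactly on its ring.
for (x, y), label in zip(rings, ring_labels):
    assert abs(math.hypot(x, y) - (1.0, 0.5)[label]) < 1e-12
assert make_moons(100, noise=0.05, seed=0)[0] == moons
```

Centroid-based algorithms are expected to split these shapes incorrectly, which makes them useful for contrasting KMeans with DBSCAN or SpectralClustering.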

Priority Order

  1. Critical (Test First):

    • KMeans
    • DBSCAN
    • GaussianMixtureModel
    • AgglomerativeClustering
  2. High:

    • MiniBatchKMeans
    • HDBSCAN
    • SpectralClustering
    • KMedoids
    • MeanShift
  3. Medium:

    • All other algorithms
    • Validation methods
    • AutoML components

Acceptance Criteria

  • All major clustering algorithms have tests
  • Tests cover basic functionality and edge cases
  • Mathematical correctness verified against reference implementations
  • Serialization works correctly
  • At least 80% code coverage
  • All tests pass on both net8.0 and net471

Labels: enhancement (New feature or request)