Problem Statement
DBSCAN is a powerful density-based clustering algorithm that can find arbitrarily-shaped clusters and identify outliers. Currently missing from aprender.
Advantages over K-Means:
- No need to specify number of clusters
- Finds arbitrarily-shaped clusters
- Identifies outliers as noise points
- Robust to different cluster densities
Proposed Solution
Implement DBSCAN following sklearn API with EXTREME TDD methodology.
Algorithm
Core Concepts:
- eps: Maximum distance between two samples to be neighbors
- min_samples: Minimum points to form a dense region
- Core point: Has ≥ min_samples within eps distance
- Border point: Within eps of core point, but not core itself
- Noise point: Neither core nor border
Steps:
- For each unvisited point p:
- Find all points within eps distance (neighbors)
- If neighbors < min_samples: mark as noise
- Else: create new cluster, expand from p
- Expand cluster: recursively add reachable core points
Implementation
Trait: UnsupervisedEstimator
API Design:
pub struct DBSCAN {
eps: f32,
min_samples: usize,
labels: Option<Vec<i32>>, // -1 for noise, 0+ for cluster ID
}
impl UnsupervisedEstimator for DBSCAN {
type Labels = Vec<i32>;
fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
fn predict(&self, x: &Matrix<f32>) -> Vec<i32>;
}
Success Criteria
- ✅ DBSCAN struct with UnsupervisedEstimator trait
- ✅ eps and min_samples parameters
- ✅ Noise detection (label = -1)
- ✅ 12+ tests passing
- ✅ Zero clippy warnings
- ✅ Example: examples/dbscan_clustering.rs
Estimated Effort
Timeline: 2-3 days
Complexity: Medium-High (spatial indexing for performance)
Problem Statement
DBSCAN is a powerful density-based clustering algorithm that can find arbitrarily-shaped clusters and identify outliers. Currently missing from aprender.
Advantages over K-Means:
Proposed Solution
Implement DBSCAN following sklearn API with EXTREME TDD methodology.
Algorithm
Core Concepts:
Steps:
Implementation
Trait:
UnsupervisedEstimatorAPI Design:
Success Criteria
Estimated Effort
Timeline: 2-3 days
Complexity: Medium-High (spatial indexing for performance)