These metrics were first described in Characterizing the impacts of dataset imbalance on single-cell data integration. If you use these metrics, please consider citing our work.
Class imbalance is prevalent across real-world datasets, including images, natural language, and biological data. In unsupervised learning, clustering performance is often assessed with respect to a ground-truth set of labels using metrics such as the Adjusted Rand Index (ARI). Akin to the issue in classification when using overall accuracy, clustering metrics fail to capture information about class imbalance. imbalanced-clustering presents balanced clustering metrics, that take into account class imbalance and reweigh the results accordingly. Combined with vanilla clustering metrics (https://scikit-learn.org/stable/modules/clustering.html), imbalanced-clustering offers a more complete perspective on clustering and related tasks.
pip install balanced-clustering
Currently, there are five balanced metrics that can be used - Balanced Adjusted Rand Index, Balanced Adjusted Mutual Information, Balanced Homogeneity, Balanced Completeness, Balanced V-measure.
import numpy as np
from balanced_clustering import balanced_adjusted_rand_index
labels_sim = np.random.choice(["A","B","C"], 1000, replace=True)
cluster_sim = np.random.choice([1,2,3,4,5], 1000, replace=True)
balanced_adjusted_rand_index(labels_sim, cluster_sim)
Consider the following example where we have 3 classes simulated from isotropic Gaussian distributions, and the clustering algorithm mis-clusters the smallest class:
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, \
homogeneity_score, completeness_score, v_measure_score
from sklearn.cluster import KMeans
from balanced_clustering import balanced_adjusted_rand_index, \
balanced_adjusted_mutual_info, balanced_completeness, \
balanced_homogeneity, balanced_v_measure, return_metrics
# Set a seed for reproducibility and create generator
np.random.seed(42)
# Sample three classes from separated gaussian distributions with varying
# standard deviations and class size
c_1 = np.random.default_rng(seed = 0).normal(loc = 0, scale = 0.5, size = (500, 2))
c_2 = np.random.default_rng(seed = 1).normal(loc = -2, scale = 0.1, size = (20, 2))
c_3 = np.random.default_rng(seed = 2).normal(loc = 3, scale = 1, size = (500, 2))
classes = np.concatenate(
[np.repeat("A", len(c_1)), np.repeat("B", len(c_2)), np.repeat("C", len(c_3))]
)
# Perform k-means clustering with k = 2 - this misclusters the smallest class
cluster_arr = np.concatenate([c_1, c_2, c_3])
kmeans_res = KMeans(n_clusters = 2, random_state = 42).fit_predict(X = cluster_arr)
# Return and print balanced and imbalanced comparisons
return_metrics(
class_arr = classes, cluster_arr = kmeans_res,
)
ARI imbalanced: 0.915 ARI balanced: 0.5434
AMI imbalanced: 0.8671 AMI balanced: 0.686
Homogeneity imbalanced: 0.8204 Homogeneity balanced: 0.5402
Completeness imbalanced: 0.9198 Completeness balanced : 0.941
V-measure imbalanced: 0.8673 V-measure balanced: 0.6864
For more details on the implementation of the balanced clustering metrics, mathematical formalism, and specific use-cases, please see Walkthrough notebook #1. Extended experiments on interpolation behavior between the balanced and vanilla imbalanced metrics can be found in Walkthrough notebook #2.
Please note that Extended Documentation is currently a work in progress
If any issues occur in either installation or usage, please open them and include a reproducible example.
Maan, H. et al. (2024) ‘Characterizing the impacts of dataset imbalance on single-cell data integration’, Nature biotechnology. Available at: https://doi.org/10.1038/s41587-023-02097-9.