balanced-clustering

Assessing clustering performance in imbalanced data contexts

These metrics were first described in Characterizing the impacts of dataset imbalance on single-cell data integration. If you use these metrics, please consider citing our work.

Class imbalance is prevalent across real-world datasets, including images, natural language, and biological data. In unsupervised learning, clustering performance is often assessed with respect to a ground-truth set of labels using metrics such as the Adjusted Rand Index (ARI). Akin to the issue in classification when using overall accuracy, clustering metrics fail to capture information about class imbalance. imbalanced-clustering presents balanced clustering metrics, that take into account class imbalance and reweigh the results accordingly. Combined with vanilla clustering metrics (https://scikit-learn.org/stable/modules/clustering.html), imbalanced-clustering offers a more complete perspective on clustering and related tasks.

Installation via pip

pip install balanced-clustering

Usage

Currently, there are five balanced metrics that can be used - Balanced Adjusted Rand Index, Balanced Adjusted Mutual Information, Balanced Homogeneity, Balanced Completeness, Balanced V-measure.

import numpy as np
from balanced_clustering import balanced_adjusted_rand_index

labels_sim = np.random.choice(["A","B","C"], 1000, replace=True)
cluster_sim = np.random.choice([1,2,3,4,5], 1000, replace=True)
balanced_adjusted_rand_index(labels_sim, cluster_sim)

Detailed example

Consider the following example where we have 3 classes simulated from isotropic Gaussian distributions, and the clustering algorithm mis-clusters the smallest class:

import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, \
    homogeneity_score, completeness_score, v_measure_score
from sklearn.cluster import KMeans

from balanced_clustering import balanced_adjusted_rand_index, \
    balanced_adjusted_mutual_info, balanced_completeness, \
    balanced_homogeneity, balanced_v_measure, return_metrics

# Set a seed for reproducibility and create generator
np.random.seed(42)

# Sample three classes from separated gaussian distributions with varying
# standard deviations and class size 
c_1 = np.random.default_rng(seed = 0).normal(loc = 0, scale = 0.5, size = (500, 2))
c_2 = np.random.default_rng(seed = 1).normal(loc = -2, scale = 0.1, size = (20, 2))
c_3 = np.random.default_rng(seed = 2).normal(loc = 3, scale = 1, size = (500, 2))
classes = np.concatenate(
    [np.repeat("A", len(c_1)), np.repeat("B", len(c_2)), np.repeat("C", len(c_3))]
)

# Perform k-means clustering with k = 2 - this misclusters the smallest class 
cluster_arr = np.concatenate([c_1, c_2, c_3])
kmeans_res = KMeans(n_clusters = 2, random_state = 42).fit_predict(X = cluster_arr)

# Return and print balanced and imbalanced comparisons 
return_metrics(
    class_arr = classes, cluster_arr = kmeans_res,
)

ARI imbalanced: 0.915 ARI balanced: 0.5434
AMI imbalanced: 0.8671 AMI balanced: 0.686
Homogeneity imbalanced: 0.8204 Homogeneity balanced: 0.5402
Completeness imbalanced: 0.9198 Completeness balanced : 0.941
V-measure imbalanced: 0.8673 V-measure balanced: 0.6864

Notebooks

For more details on the implementation of the balanced clustering metrics, mathematical formalism, and specific use-cases, please see Walkthrough notebook #1. Extended experiments on interpolation behavior between the balanced and vanilla imbalanced metrics can be found in Walkthrough notebook #2.

Please note that Extended Documentation is currently a work in progress

Issues/bugs

If any issues occur in either installation or usage, please open them and include a reproducible example.

Citation information

Maan, H. et al. (2024) ‘Characterizing the impacts of dataset imbalance on single-cell data integration’, Nature biotechnology. Available at: https://doi.org/10.1038/s41587-023-02097-9.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
balanced_clustering		balanced_clustering
dist		dist
experiments		experiments
notebooks		notebooks
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
README.rst		README.rst
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

balanced-clustering

Assessing clustering performance in imbalanced data contexts

Table of contents

Installation via pip

Usage

Detailed example

Notebooks

Issues/bugs

Citation information

About

Releases 1

Packages

Languages

License

hsmaan/balanced-clustering

Folders and files

Latest commit

History

Repository files navigation

balanced-clustering

Assessing clustering performance in imbalanced data contexts

Table of contents

Installation via pip

Usage

Detailed example

Notebooks

Issues/bugs

Citation information

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages