### **Imbalanced clustering metrics for benchmarking, analysis, and comparison studies** 

`imbalanced-clustering` is a Python library that repurposes popular clustering indices, such as the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI), to account for class imbalance. These metrics were originally intended for comparing clustering results from different techniques, but they are now frequently applied in cases where ground-truth labels exist and the result of a clustering technique is compared against them. One example is single-cell sequencing benchmarking studies, where ground-truth cell-type labels are compared to clustering results after a meaningful transformation of the latent space, such as the integration of two or more datasets.

Although these metrics are meaningful in these settings, they often overemphasize the importance of larger classes in the data. In single-cell datasets, some cell types commonly exist in greater proportions than others in a given tissue sample, and they are therefore overrepresented in the sequencing data (e.g. alpha and beta cells from pancreatic islet samples). Because of this overrepresentation, the larger classes have a greater influence on the results of these common clustering metrics. This can hide important information about smaller classes/cell types, including the erasure of their heterogeneity in a clustering setting. The problem is akin to the imbalanced learning problem in the classical machine learning literature, where metrics such as accuracy and precision/recall can fail to capture the effects of a classifier on smaller/minority classes.
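To make the effect concrete, here is a minimal sketch that uses only scikit-learn's baseline `adjusted_rand_score`. The helper `merged_ari` and the class sizes are hypothetical and chosen purely for illustration: a clustering recovers class B exactly but merges class C into class A's cluster, and we compare the score when C is small against the score when C is as large as the other classes.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def merged_ari(n_a, n_b, n_c):
    """Baseline ARI when class C is absorbed into class A's cluster.

    Hypothetical helper for illustration: class B is recovered exactly,
    while classes A and C end up in the same predicted cluster.
    """
    labels_true = np.repeat(["A", "B", "C"], [n_a, n_b, n_c])
    labels_pred = np.repeat([0, 1, 0], [n_a, n_b, n_c])
    return adjusted_rand_score(labels_true, labels_pred)


# Same clustering error (C is lost entirely), different class sizes
ari_small_c = merged_ari(900, 900, 50)   # C is a minority class
ari_large_c = merged_ari(900, 900, 900)  # C is as large as A and B

print(f"ARI with small class C merged: {ari_small_c:.3f}")
print(f"ARI with large class C merged: {ari_large_c:.3f}")
```

In both cases class C is lost completely, yet the baseline ARI penalizes the error far less when C is small, which is the behavior the balanced metrics are meant to address.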

Given this limitation, the `imbalanced-clustering` Python library balances popular clustering metrics so that each class in the ground-truth data is weighted equally, regardless of how many samples it contributes.

The following demo notebook is divided up into two parts:

**A)**: We begin by examining how this weighting is done to account for ground-truth class imbalance, specifically considering the Adjusted Rand Index (ARI), as the other metrics undergo the exact same changes.

**B)**: The feasibility of these metrics, as well as their concordance with the base imbalanced metrics, is assessed. To begin, sanity checks and control experiments are done to determine whether the same results hold for both the base and the modified balanced metrics. Then, different class/clustering scenarios are explored that emphasize the utility of the balanced metrics.

### Part A) Analysis of reweighting of the Adjusted Rand Index  

### Part B) Computational analysis of balanced metrics 

In this part of the demo, we'll look at some of the properties of these balanced metrics, perform some sanity checks to ensure they are working as intended, and demonstrate important use cases and results. 

For **Part B-1)**, we'll start by looking at some baselines to ensure that the balanced metrics return results similar to the baseline metrics in edge cases and on randomized data. There are two properties we want to hold: expectation, where a random clustering results in a similar value for both, and normalization, where the balanced and baseline metrics are bounded in the same way. We've ensured that these properties hold theoretically, and we'll demonstrate that they do computationally.
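As a reference point for these two properties, the sketch below checks them for the baseline ARI only, using scikit-learn's `adjusted_rand_score`. The label vectors, the number of repetitions, and the variable names are illustrative assumptions; the balanced metrics are not called here.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Three balanced ground-truth classes of 500 points each
labels_true = np.repeat([0, 1, 2], 500)

# Normalization: a perfect clustering scores 1, and cluster ids are arbitrary,
# so renaming the clusters does not change the score
ari_identical = adjusted_rand_score(labels_true, labels_true)
ari_relabelled = adjusted_rand_score(labels_true, (labels_true + 1) % 3)

# Expectation: labels assigned at random score close to 0 on average
random_aris = np.array([
    adjusted_rand_score(labels_true, rng.integers(0, 3, size=labels_true.size))
    for _ in range(100)
])
mean_random_ari = random_aris.mean()

print(f"identical: {ari_identical:.3f}, relabelled: {ari_relabelled:.3f}")
print(f"mean ARI over 100 random labelings: {mean_random_ari:.4f}")
```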

Let's begin by loading the appropriate functions and libraries.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, \
    homogeneity_score, completeness_score, v_measure_score
from sklearn.cluster import KMeans

from imbalanced_clustering import balanced_adjusted_rand_index, \
    balanced_adjusted_mutual_info, balanced_completeness, \
    balanced_homogeneity, balanced_v_measure 


We can begin testing random clustering by sampling data from a uniform distribution for three classes and then performing k-means clustering on the pooled points. Let's see how the metrics behave when the classes are balanced (the same number of points in each).

In [None]:
# Set a seed for reproducibility 
np.random.seed(42)

# Sample three classes from a uniform distribution
c_1 = np.random.uniform(0, 100, (500, 2))
c_2 = np.random.uniform(0, 100, (500, 2))
c_3 = np.random.uniform(0, 100, (500, 2))

# Plot the given results for each class 
cluster_df = pd.DataFrame({
    "x" : np.concatenate((c_1[:, 0], c_2[:, 0], c_3[:, 0])),
    "y" : np.concatenate((c_1[:, 1], c_2[:, 1], c_3[:, 1])),
    "cluster": np.concatenate(
        (
            np.repeat("A", len(c_1)),
            np.repeat("B", len(c_2)),
            np.repeat("C", len(c_3))
        )
    )
})
sns.scatterplot(
    x = "x",
    y = "y",
    hue = "cluster",
    data = cluster_df
)

# Perform k-means clustering (one cluster per class) and plot the results
cluster_arr = np.array(cluster_df.iloc[:, 0:2])
kmeans_res = KMeans(n_clusters = 3).fit_predict(X = cluster_arr)
cluster_df["kmeans"] = kmeans_res
sns.scatterplot(
    x = "x",
    y = "y",
    hue = "kmeans",
    data = cluster_df
)

# Determine the values for 
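One way to score a random clustering like this against the ground-truth labels is sketched below, using only the scikit-learn baseline metrics imported earlier. The data are regenerated so the block runs on its own, and the names `points`, `labels_true`, `labels_pred`, and `baseline_scores` are illustrative. The balanced counterparts are left out because their call signatures are not shown in this section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    adjusted_mutual_info_score,
    homogeneity_score,
    completeness_score,
    v_measure_score,
)

rng = np.random.default_rng(42)

# Three balanced classes drawn from the same uniform distribution,
# so the class labels carry no spatial information
points = rng.uniform(0, 100, size=(1500, 2))
labels_true = np.repeat(["A", "B", "C"], 500)

# k-means partitions the square spatially, independently of the labels
labels_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(points)

# Baseline (imbalanced) metrics; all should be close to 0 for a random clustering
baseline_scores = {
    "ARI": adjusted_rand_score(labels_true, labels_pred),
    "AMI": adjusted_mutual_info_score(labels_true, labels_pred),
    "Homogeneity": homogeneity_score(labels_true, labels_pred),
    "Completeness": completeness_score(labels_true, labels_pred),
    "V-measure": v_measure_score(labels_true, labels_pred),
}

for name, value in baseline_scores.items():
    print(f"{name}: {value:.4f}")
```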