# Metrics and Methods for Explainable Clustering

## Explainability in Clustering (Metrics)

A growing number of papers employ decision trees to construct explainable partitions that aim to minimize the $k$-means cost function. [Laber et al. (2022)](https://arxiv.org/pdf/2112.14718) use metrics based on the depths of the leaves in the resulting tree. Leaf depth matters for explainability: explaining a leaf that is far from the root involves many tests, which makes the model's logic harder to grasp.

### Weighted Average Depth (WAD)

- This metric weights the depth of each leaf by the number of points in its associated cluster. To minimize it, large clusters should be associated with shallow leaves (shorter explanations);
- For a partition $\mathcal{P}=(C_{1}, C_{2},\dots, C_{k})$ induced by a binary decision tree $\mathcal{D}$ with $k$ leaves, where the cluster $C_{i}$ is associated with the leaf $i$,
$$
\mathrm{WAD}(\mathcal{D}) = \frac{\sum\limits_{i=1}^{k} \lvert C_{i} \rvert \, l_{i}}{n}
$$
where $l_{i}$ is the number of conditions in the path from the root to leaf $i$ and $n$ is the total number of points.
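
As a minimal sketch of the definition above, the following computes WAD from a list of `(cluster_size, depth)` pairs, one per leaf. The function name and the input layout are illustrative assumptions and do not belong to any library.

```python
from typing import Iterable, Tuple


def weighted_average_depth_from_leaves(leaves: Iterable[Tuple[int, int]]) -> float:
    """WAD = sum(|C_i| * l_i) / n for leaves given as (cluster_size, depth)."""
    leaves = list(leaves)
    n = sum(size for size, _ in leaves)  # total number of points
    if n == 0:
        raise ValueError("at least one non-empty cluster is required")
    return sum(size * depth for size, depth in leaves) / n


# Three clusters: the largest sits at depth 1, the two smaller ones at depth 2.
wad = weighted_average_depth_from_leaves([(60, 1), (30, 2), (10, 2)])
print(wad)  # (60*1 + 30*2 + 10*2) / 100
```

Moving the 60-point cluster to a depth-2 leaf and a smaller one to depth 1 would raise the value, which is why the metric favors placing large clusters near the root.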


### Weighted Average Explanation Size (WAES)

- This metric replaces the depth of a leaf in WAD by the number of non-redundant tests in the path from root to the leaf;
- For a partition $\mathcal{P}=(C_{1}, C_{2},\dots, C_{k})$ induced by a binary decision tree $\mathcal{D}$ with $k$ leaves, where the cluster $C_{i}$ is associated with the leaf $i$,
$$
\mathrm{WAES}(\mathcal{D}) = \frac{\sum\limits_{i=1}^{k} \lvert C_{i} \rvert \, l_{i}^{nr}}{n}
$$
where $l_{i}^{nr}$ is the number of *non-redundant conditions* in the path from the root to leaf $i$ and $n$ is the total number of points.
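
The sketch below illustrates the definition above on a small hand-built tree. It assumes axis-aligned threshold tests and treats a test as redundant when a tighter test on the same feature and in the same direction appears on the path, so the explanation size is the number of distinct (feature, direction) pairs. The names and the path encoding are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# A test on the path is (feature_index, threshold, goes_left), where
# goes_left=True means "x[feature] <= threshold" and False means ">".
Test = Tuple[int, float, bool]


def explanation_size(path: List[Test]) -> int:
    """Number of non-redundant tests on a root-to-leaf path.

    Assumption: per feature, only the tightest upper bound and the tightest
    lower bound are needed, so we count distinct (feature, direction) pairs.
    """
    return len({(feature, goes_left) for feature, _, goes_left in path})


def weighted_average_explanation_size(leaves: List[Tuple[int, List[Test]]]) -> float:
    """WAES = sum(|C_i| * l_i^nr) / n for leaves given as (cluster_size, path)."""
    n = sum(size for size, _ in leaves)
    if n == 0:
        raise ValueError("at least one non-empty cluster is required")
    return sum(size * explanation_size(path) for size, path in leaves) / n


# "x0 <= 50" then "x0 <= 30": the first test is implied by the second.
path_a = [(0, 50.0, True), (0, 30.0, True)]
# "x0 <= 50" then "x0 > 30": both bounds are needed.
path_b = [(0, 50.0, True), (0, 30.0, False)]
# "x0 > 50": a single test.
path_c = [(0, 50.0, False)]

waes = weighted_average_explanation_size([(40, path_a), (40, path_b), (20, path_c)])
print(waes)  # (40*1 + 40*2 + 20*1) / 100
```

For the same tree, WAD counts every test on the path and gives (40·2 + 40·2 + 20·1) / 100 = 1.8, so WAES is never larger than WAD.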


## Explainability in Clustering (Methods)

### Iterative Mistake Minimization (IMM)

The method was proposed by [Dasgupta et al. (2020)](http://proceedings.mlr.press/v119/moshkovitz20a/moshkovitz20a.pdf).


### ExKMC

The method was proposed by [Frost et al. (2020)](https://arxiv.org/pdf/2006.02399).


### ExShallow

The method was proposed by [Laber et al. (2022)](https://arxiv.org/pdf/2112.14718).


### ExGreedy

The method was proposed by [Laber and Murtinho (2021)](https://proceedings.mlr.press/v139/laber21a/laber21a.pdf).

## Applications and Comparison

In [2]:
import warnings

import pandas as pd
from holisticai.datasets import load_dataset
from holisticai.pipeline import Pipeline

from holisticai.explainability.metrics import (
    weighted_average_depth,
    weighted_average_explainability_score,
    weighted_tree_gini,
    tree_depth_variance
)

warnings.filterwarnings("ignore")

In [3]:
dataset = load_dataset('clinical_records', protected_attribute='sex', preprocessed=True)
train_test = dataset.train_test_split(test_size=0.2, random_state=42)

train = train_test['train']
test = train_test['test']

dataset

In [4]:
!pip install -q ShallowTree


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [6]:
from ShallowTree.ShallowTree import ShallowTree

data = train['X']
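# number of clusters: one per distinct target value (assumed here);
# len(train['y']) would be the number of samples, not the number of clusters
k = len(set(train['y']))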

# create a tree that will partition the data into k clusters
tree = ShallowTree(k)

# fit the method
tree.fit(data)

# compute metrics
f_tree = tree.tree.left
waes = weighted_average_explainability_score(f_tree)
wad = weighted_average_depth(f_tree)
wgni = weighted_tree_gini(f_tree)
tdv = tree_depth_variance(f_tree)

print("WAES:", waes)
print("WAD:", wad)
print("WGNI:", wgni)
print("TDV:", tdv)

AttributeError: 'ShallowTree' object has no attribute 'children_left'
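
The `children_left` attribute named in the error is part of the array layout that scikit-learn exposes on a fitted tree (`tree_.children_left`, `tree_.children_right`, `tree_.n_node_samples`), which suggests the metric functions expect a tree in that layout rather than a `ShallowTree` object. As a hedged sketch, the following computes WAD directly from such arrays; the function name is hypothetical and the arrays are hand-built here, so nothing below depends on either library.

```python
from typing import Sequence

TREE_LEAF = -1  # scikit-learn marks a missing child with -1


def wad_from_arrays(
    children_left: Sequence[int],
    children_right: Sequence[int],
    n_node_samples: Sequence[int],
) -> float:
    """Weighted average leaf depth of a binary tree stored as parallel arrays."""
    total = 0
    stack = [(0, 0)]  # (node_id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        left, right = children_left[node], children_right[node]
        if left == TREE_LEAF and right == TREE_LEAF:
            # leaf: weight its depth by the number of points it holds
            total += n_node_samples[node] * depth
        else:
            stack.append((left, depth + 1))
            stack.append((right, depth + 1))
    return total / n_node_samples[0]  # the root holds all n points


# Root (100 points) -> leaf 1 (60) and internal node 2 (40) -> leaves 3 (30), 4 (10).
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
n_node_samples = [100, 60, 40, 30, 10]

wad_arrays = wad_from_arrays(children_left, children_right, n_node_samples)
print(wad_arrays)  # (60*1 + 30*2 + 10*2) / 100
```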

In [None]:
tree

<ShallowTree.ShallowTree.ShallowTree at 0x72085a7511e0>