<a href="https://colab.research.google.com/github/maciejskorski/huber_clustering/blob/main/HuberClustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Soft Huber Clustering

This notebook demonstrates a soft-clustering technique based on a likelihood inspired by the Huber function. On the benchmark datasets below, it can outperform popular clustering techniques built on Gaussian assumptions (KMeans and Gaussian mixtures).
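
For intuition, here is a minimal sketch of the standard Huber function, which is quadratic near zero and linear in the tails, so large residuals are penalized less than under a Gaussian (squared-error) model. The exact likelihood used by `HuberMixture` is defined in the repository and is not reproduced here; the name `delta` is an illustrative assumption and need not correspond to the `huber_scale` argument used later.

```python
import numpy as np

def huber(r, delta=1.0):
    """Standard Huber function: 0.5*r^2 for |r| <= delta, linear beyond."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([0.0, 0.5, 1.0, 3.0])
print(huber(residuals))      # Huber penalty: [0.    0.125 0.5   2.5  ]
print(0.5 * residuals**2)    # squared-error penalty: [0.    0.125 0.5   4.5  ]
```

The two penalties agree for small residuals and diverge for the outlier at 3.0, which is why a Huber-style likelihood tends to be more robust to points far from a cluster center.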

In [1]:
!git clone https://github.com/maciejskorski/huber_clustering.git
%cd huber_clustering

Cloning into 'huber_clustering'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 42 (delta 10), reused 16 (delta 2), pack-reused 0
Unpacking objects: 100% (42/42), done.
/content/huber_clustering


# Huber Clustering and Benchmarks

We fit the model on a few datasets and score the resulting clusterings against the true labels using the adjusted Rand index (ARI).
The results are compared with KMeans and Gaussian mixtures.
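
ARI compares two partitions of the same samples and ignores how the clusters are named, which is why cluster ids can be scored directly against class labels. A small sketch with made-up toy labels (the label lists are purely illustrative):

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]      # same partition, different cluster ids
one_moved = [0, 0, 1, 1, 2, 1]    # one sample assigned to another cluster

score_renamed = adjusted_rand_score(y_true, renamed)
score_moved = adjusted_rand_score(y_true, one_moved)

print(score_renamed)  # 1.0, since the partitions are identical
print(score_moved)    # below 1.0, since the partitions differ
```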

Note: In the current implementation the Huber model is spherical (one scale for all features), so the fair comparison is with spherical Gaussian mixtures. Even so, it occasionally performs better than diagonal Gaussian mixtures!
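
To make the spherical/diagonal distinction concrete, the sketch below fits scikit-learn's `GaussianMixture` with both covariance types on random data (the data and the names `X_demo` and `shapes` are illustrative assumptions) and prints the shape of the fitted covariance parameters: one variance per component for `'spherical'`, one variance per component and feature for `'diag'`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))   # 200 samples, 4 features

shapes = {}
for cov_type in ("spherical", "diag"):
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=0).fit(X_demo)
    shapes[cov_type] = gm.covariances_.shape
    print(cov_type, shapes[cov_type])
# spherical (3,)
# diag (3, 4)
```

A diagonal model therefore has strictly more freedom than a spherical one, which is what makes the comparison with the spherical Huber model uneven.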

In [2]:
from sklearn import datasets
import numpy as np
import pandas as pd

from HuberMixtures import HuberMixture
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

from sklearn.metrics.cluster import adjusted_rand_score

## Dataset: Iris

In [3]:
data = datasets.load_iris()
X,y = data.data.astype('float32'), data.target.astype('int32')
# standardize each feature to zero mean and unit variance
X = (X-X.mean(0))/X.std(0)
n_classes = len(np.unique(y))

# the Huber model's output is reduced to hard labels with argmax(1) before scoring
print( adjusted_rand_score(y,HuberMixture(n_classes,huber_scale=0.25).fit_predict(X).argmax(1)) )
print( adjusted_rand_score(y,KMeans(n_classes).fit_predict(X)) )
print( adjusted_rand_score(y,GaussianMixture(n_classes,covariance_type='spherical').fit_predict(X)) )

0.6638956080488512
0.6201351808870379
0.6217034719190815


## Dataset: Breast Cancer

In [4]:
data = datasets.load_breast_cancer()
X,y = data.data.astype('float32'), data.target.astype('int32')
X = (X-X.mean(0))/X.std(0)
n_classes = len(np.unique(y))

print( adjusted_rand_score(y,HuberMixture(n_classes,huber_scale=0.5).fit_predict(X,n_iter=50).argmax(1)) )
print( adjusted_rand_score(y,KMeans(n_classes).fit_predict(X)) )
print( adjusted_rand_score(y,GaussianMixture(n_classes,covariance_type='diag').fit_predict(X)) )

0.7302553422550654
0.6707206476880808
0.6779411384513467


## Dataset: Digits (scikit-learn's 8x8 handwritten digits, not MNIST)

In [5]:
data = datasets.load_digits()
X,y = data.data.astype('float32'), data.target.astype('int32')
# the small constant avoids division by zero for features with zero variance
X = (X-X.mean(0))/(1e-6+X.std(0))
n_classes = len(np.unique(y))

print( adjusted_rand_score(y,HuberMixture(n_classes,huber_scale=0.25).fit_predict(X).argmax(1)) )
print( adjusted_rand_score(y,KMeans(n_classes).fit_predict(X)) )
print( adjusted_rand_score(y,GaussianMixture(n_classes,covariance_type='diag').fit_predict(X)) )

0.564923306448209
0.46688023827202596
0.24563721647313141
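
The `1e-6` added to the standard deviation in the Digits cell presumably guards against features that never vary, such as image pixels that are always blank. A minimal NumPy sketch on a toy array (`X_demo` is made up for illustration) shows what happens with and without the guard:

```python
import numpy as np

# toy data: the first column is constant, so its standard deviation is 0
X_demo = np.array([[0.0, 1.0],
                   [0.0, 2.0],
                   [0.0, 3.0]], dtype="float32")

# naive standardization divides 0 by 0 in the constant column
with np.errstate(invalid="ignore", divide="ignore"):
    naive = (X_demo - X_demo.mean(0)) / X_demo.std(0)

# guarded standardization maps the constant column to zeros instead
guarded = (X_demo - X_demo.mean(0)) / (1e-6 + X_demo.std(0))

print(np.isnan(naive).any())     # True
print(np.isnan(guarded).any())   # False
```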
