# Analysing linear separability of clusters
This file takes you through an example that compares two metrics before and after applying a radial basis function (RBF) transformation: the standard deviation of cluster sizes (cluster size unevenness) and the mean distance between a point in a cluster and its centroid (spread of cluster).

## First, some imports and variable settings

In [1]:
import json
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler

In [2]:
# This is an example on the glass formation ability dataset, using magpie featurisation
# Feel free to select other files/folders to compare
data_file = 'data/linear_seperability/datasets/gfa/magpie_CBFV.csv'
# How many clusters do we want to investigate
n_clusters = 7

## Read in data and remove anything unneeded

In [3]:
df = pd.read_csv(data_file)
if 'target' in df.columns:
    df = df.drop('target', axis=1)
if 'formula' in df.columns:
    df = df.drop('formula', axis=1)

## Cluster and desired metrics

In [4]:
km = KMeans(n_clusters=n_clusters)
predictions = km.fit_predict(df)
# Put it into a pandas Series for access to creature comforts
predictions = pd.Series(predictions)

In [5]:
# Occupation of each cluster:
predictions.value_counts()

0    2013
6    1974
4    1197
1     585
3     454
2      52
5      39
dtype: int64

In [6]:
cluster_size_uneveness = predictions.value_counts().std()
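
As a sanity check, the unevenness figure can be reproduced directly from the cluster occupations printed in cell 5. One detail worth knowing is that pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population form (`ddof=0`), so the two give different numbers for the same counts. The sketch below hard-codes the counts from the output above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Cluster occupations copied from the output of cell 5
counts = pd.Series([2013, 1974, 1197, 585, 454, 52, 39])

# pandas default: sample standard deviation (ddof=1), as used in this notebook
sample_std = counts.std()

# numpy default: population standard deviation (ddof=0), shown for contrast
population_std = np.std(counts.to_numpy())

print(f"sample std (ddof=1):     {sample_std:.2f}")
print(f"population std (ddof=0): {population_std:.2f}")
```

The sample value is the one reported in the comparison at the end of this file.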

In [7]:
# Spread of cluster: mean distance from each point to the centroid of its assigned cluster
total = 0
for i, centroid in enumerate(km.cluster_centers_):
    data_in_cluster = df[predictions==i]
    total += cdist(data_in_cluster, centroid[None,:]).sum()
spread_of_cluster = total/len(predictions)
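
The loop above can also be written without explicit iteration by indexing the centroid array with the cluster labels. The helper below is a minimal sketch of that idea, not part of the original analysis: the function name `mean_distance_to_centroid` and the toy data are made up for illustration.

```python
import numpy as np

def mean_distance_to_centroid(X, labels, centroids):
    """Mean Euclidean distance between each point and its assigned centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    # centroids[labels] has one row per point: the centroid that point belongs to
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# Toy data (hypothetical): two clusters in 2D with known centroids
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 2.0]])

spread = mean_distance_to_centroid(X, labels, centroids)
print(spread)  # distances are 1, 1, 2, 2, so the mean is 1.5
```

With the variables in this notebook, the call would presumably be `mean_distance_to_centroid(df, predictions, km.cluster_centers_)`. Note that `km.inertia_` is not a substitute here, since it is the sum of squared distances rather than the mean of unsquared ones.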

## Apply kernel function and repeat the process

In [8]:
rbf = RBFSampler()
transformed_data = rbf.fit_transform(df)
km = KMeans(n_clusters=n_clusters)
predictions = km.fit_predict(transformed_data)
# Put it into a pandas Series for access to creature comforts
predictions = pd.Series(predictions)

In [9]:
# Standard deviation of cluster occupation (cluster size unevenness):
predictions.value_counts().std()

47.34976240700686

In [10]:
kernelised_cluster_size_uneveness = predictions.value_counts().std()

In [11]:
# Spread of cluster, now measured in the RBF-transformed feature space
total = 0
for i, centroid in enumerate(km.cluster_centers_):
    data_in_cluster = transformed_data[predictions==i]
    total += cdist(data_in_cluster, centroid[None,:]).sum()
kernelised_spread_of_cluster = total/len(predictions)

## Now let's compare the two

In [12]:
print(f'Applying RBF changed the standard deviation in cluster sizes from {round(cluster_size_uneveness,2)}\
 to {round(kernelised_cluster_size_uneveness, 2)}, when clustering with k={n_clusters} on this dataset')
print(f'Applying RBF changed the mean distance from a point in a cluster to its centroid from {round(spread_of_cluster,2)}\
 to {round(kernelised_spread_of_cluster, 2)}, when clustering with k={n_clusters} on this dataset')

Applying RBF changed the standard deviation in cluster sizes from 840.53 to 47.35, when clustering with k=7 on this dataset
Applying RBF changed the mean distance from a point in a cluster to its centroid from 42250.87 to 0.98, when clustering with k=7 on this dataset
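
Part of the drop in spread is a matter of scale rather than cluster structure. To our understanding, `RBFSampler` builds random Fourier features of the form sqrt(2/D) * cos(w.x + b), where D is `n_components`, so every transformed row has a Euclidean norm of at most sqrt(2) regardless of how large the input values are. The sketch below is a hand-rolled numpy version of that construction (an assumption standing in for the library code, applied to synthetic data rather than the dataset used here) that checks the bound.

```python
import numpy as np

def random_fourier_features(X, gamma=1.0, n_components=100, seed=0):
    """Hand-rolled random Fourier features for the kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Random frequencies (Gaussian) and random phase offsets (uniform on [0, 2*pi))
    W = np.sqrt(2.0 * gamma) * rng.normal(size=(n_features, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    # Each entry lies in [-sqrt(2/D), sqrt(2/D)], so each row norm is at most sqrt(2)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
# Synthetic stand-in for unscaled features with large values
X = rng.normal(scale=1000.0, size=(200, 5))
Z = random_fourier_features(X)

input_norms = np.linalg.norm(X, axis=1)
output_norms = np.linalg.norm(Z, axis=1)

print(f"largest input row norm:  {input_norms.max():.1f}")
print(f"largest output row norm: {output_norms.max():.3f}")
print(f"upper bound sqrt(2):     {np.sqrt(2):.3f}")
```

Because all transformed points sit inside a ball of radius sqrt(2), distances to centroids in that space cannot be compared directly with distances in the original feature space.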


As noted in the main text of the paper:

* More even cluster sizes are not necessarily indicative of ***better*** (or worse) clusterings. More even clusters are helpful in LOCO-CV because highly uneven clusters increase the effect that the size of the holdout set has on the resulting measurement.
* More tightly packed clusters are not necessarily indicative of ***better*** (or worse) clusterings. They are an interesting side effect of the radial basis function transformation on these data.
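
To make the first point concrete, the sketch below computes the fraction of the dataset that would be held out if each cluster from the unkernelised run served in turn as the test set, which is how we understand LOCO-CV (leave-one-cluster-out cross-validation) to form its folds. The counts are copied from the output of cell 5; nothing else here comes from the original analysis.

```python
import numpy as np

# Cluster occupations before the RBF transformation, copied from cell 5
counts = np.array([2013, 1974, 1197, 585, 454, 52, 39])

# Fraction of all points held out when each cluster is used as the test fold
holdout_fraction = counts / counts.sum()

for size, frac in zip(counts, holdout_fraction):
    print(f"cluster of {size:>4} points -> {frac:.1%} of the data held out")
```

The folds range from under 1% to roughly 32% of the data, so scores measured on different folds rest on very different amounts of test data.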