# Using Pyrea with Nutrimouse Data Utilising Hierarchical and Spectral Clustering

In this notebook we demonstrate Pyrea's usage by performing hierarchical and spectral clustering on the Nutrimouse[<sup>1</sup>](#fn1) dataset.

We will do this using the Parea_1 structure, a structure that is included as a helper function in the Pyrea software package.

## Imports
This notebook requires Pyrea, mvlearn, and NumPy. Let's import the relevant items here:

In [1]:
import pyrea
import numpy as np
from mvlearn.datasets import load_nutrimouse

## Load Data

Load the Nutrimouse data from mvlearn:[<sup>2</sup>](#fn2)

In [3]:
nutrimouse_dataset = load_nutrimouse()
data = [nutrimouse_dataset['gene'], nutrimouse_dataset['lipid']]

y_all = np.vstack((nutrimouse_dataset['genotype'], nutrimouse_dataset['diet'])).T
y = y_all[:,0]

Note: the variable `y` contains the ground truths (the genotype labels, taken from the first column of `y_all`) that we will use for evaluation later in the notebook. The ground truths are not used during training.
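To make the label layout concrete, here is a small sketch of the same stacking step. The genotype and diet values below are made up for illustration and are not the Nutrimouse labels; the point is only that `np.vstack(...).T` yields one row per sample and one column per label type, so column 0 holds the genotype.

```python
import numpy as np

# Hypothetical labels for four samples (illustrative values only).
genotype = np.array([0, 0, 1, 1])
diet = np.array([2, 0, 1, 3])

# Stack the two label vectors as rows, then transpose:
# the result has one row per sample and one column per label type.
y_all_toy = np.vstack((genotype, diet)).T

# The first column is the genotype, the second the diet.
y_toy = y_all_toy[:, 0]

print(y_all_toy.shape)
print(y_toy)
```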

Preview the shape of the data:

In [4]:
print(f'Number of views: {len(data)}')
print(f'Shape of view 1: {np.shape(data[0])[0]} x {np.shape(data[0])[1]}')
print(f'Shape of view 2: {np.shape(data[1])[0]} x {np.shape(data[1])[1]}')

Number of views: 2
Shape of view 1: 40 x 120
Shape of view 2: 40 x 21


As can be seen, there are 2 views. View 1 has 120 features for each of the 40 mice, while view 2 has 21 features for each of the 40 mice. As this is a multi-view dataset, the 40 samples refer to the same 40 mice in both views.
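Because multi-view methods rely on the views describing the same samples, a quick sanity check on the sample counts can be useful before clustering. The helper below, `check_views`, is a hypothetical convenience function and not part of Pyrea or mvlearn. It only confirms that every view has the same number of rows; it cannot confirm that the rows are in the same order. The toy views use random values with the shapes reported above, not the real data.

```python
import numpy as np


def check_views(views):
    """Return the shared sample count, or raise if the views disagree."""
    sample_counts = {np.shape(view)[0] for view in views}
    if len(sample_counts) != 1:
        raise ValueError(
            f"Views have differing sample counts: {sorted(sample_counts)}"
        )
    return sample_counts.pop()


# Toy views with the same shapes as the Nutrimouse views (random values).
rng = np.random.default_rng(0)
toy_views = [rng.normal(size=(40, 120)), rng.normal(size=(40, 21))]

n_samples = check_views(toy_views)
print(f"All views share {n_samples} samples")
```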

We will use Parea_1 to perform hierarchical clustering and spectral clustering on this dataset, and use Pyrea's built-in genetic algorithm functionality to find the best hyperparameters to use for this data.

# Parea_1

## Hierarchical Clustering

Run the genetic algorithm as follows to search for the best parameters to use for the clustering:

In [5]:
params_hierarchical = pyrea.parea_1_genetic(data, k_min=2, k_max=5, k_final=2, n_generations=10, n_population=100)

Silhouette score: 0.803125
Silhouette score: 0.6416554659498207
Silhouette score: 0.5083840729274167
Silhouette score: 0.5052819915591655
Silhouette score: 0.4916666666666667
Silhouette score: 0.8235294117647058
Silhouette score: 0.4916699372056515
Silhouette score: 0.46596595655806183
Silhouette score: 0.5414722007223942
Silhouette score: 0.46374999999999994
Silhouette score: 0.6692448680351906
Silhouette score: 0.5733333333333334
Silhouette score: 0.7943548387096774
Silhouette score: 0.7858333333333334
Silhouette score: 0.4863095238095238
Silhouette score: 0.805
Silhouette score: 0.5765277777777779
Silhouette score: 0.5891369047619047
Silhouette score: 0.4916666666666667
Silhouette score: 0.81875
Silhouette score: 0.4916666666666667
Silhouette score: 0.7514976958525346
Silhouette score: 0.626736111111111
Silhouette score: 0.5430925505653766
Silhouette score: 0.5052819915591655
Silhouette score: 0.64375
Silhouette score: 0.5617943548387097
Silhouette score: 0.6840909090909092
[output truncated]
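The log above reports a silhouette score for each candidate that the genetic algorithm evaluates. How Pyrea computes this score internally is not shown in this notebook, so the following is only a general illustration of the metric, using scikit-learn's `silhouette_score` on synthetic data. It shows that a labelling matching two well-separated groups scores close to 1, while a labelling that mixes the groups scores far lower.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two tight, well-separated synthetic blobs of 20 points each.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# A labelling that matches the blobs, and one that alternates between them.
labels_matching = np.array([0] * 20 + [1] * 20)
labels_mixed = np.tile([0, 1], 20)

score_matching = silhouette_score(X, labels_matching)
score_mixed = silhouette_score(X, labels_mixed)

print(f"Matching labels: {score_matching:.3f}")
print(f"Mixed labels:    {score_mixed:.3f}")
```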

Once this is complete, `params_hierarchical` contains the best parameters found for this data, which we can then pass to the `parea_1` function:

In [6]:
labels_hierarchical = pyrea.parea_1(data, *params_hierarchical)

print(labels_hierarchical)

[1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


We can also print the parameters to see which were selected by the genetic algorithm:

In [7]:
params_hierarchical

['hierarchical',
 'centroid',
 3,
 'hierarchical',
 'average',
 3,
 'hierarchical',
 'centroid',
 2,
 'hierarchical',
 'single',
 2,
 'disagreement']
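The returned list is flat. Judging from the output above, it appears to consist of four (algorithm, linkage method, number of clusters) triples followed by a fusion method. This grouping is inferred from the printed values rather than taken from Pyrea's documentation. Under that assumption, a small hypothetical helper (`describe_params`, not part of Pyrea) makes the list easier to read:

```python
def describe_params(params, group_size=3):
    """Split a flat parameter list into fixed-size groups plus a trailing fusion method.

    Assumes the last element is the fusion method and everything before it
    comes in groups of `group_size` (an inference from the printed output).
    """
    *flat, fusion = params
    if len(flat) % group_size != 0:
        raise ValueError("Unexpected parameter list length")
    groups = [
        tuple(flat[i:i + group_size]) for i in range(0, len(flat), group_size)
    ]
    return groups, fusion


# The parameter list printed above, copied verbatim.
params = ['hierarchical', 'centroid', 3,
          'hierarchical', 'average', 3,
          'hierarchical', 'centroid', 2,
          'hierarchical', 'single', 2,
          'disagreement']

groups, fusion = describe_params(params)
for index, (algorithm, method, k) in enumerate(groups, start=1):
    print(f"Clusterer {index}: {algorithm}, method={method}, k={k}")
print(f"Fusion: {fusion}")
```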

## Spectral Clustering

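Spectral clustering is performed in almost the same way, using `parea_1_genetic_spectral`, which additionally takes a range for the number of neighbours (`n_neighbors_min` and `n_neighbors_max`):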

In [8]:
params_spectral = pyrea.parea_1_genetic_spectral(data, k_min=2, k_max=8, k_final=2, n_neighbors_min=10, n_neighbors_max=15, n_population=10, n_generations=10)

Silhouette score: 0.6155405405405405
Silhouette score: 0.6432432432432431
Silhouette score: 0.7527027027027027
Silhouette score: 0.46597222222222234
Silhouette score: 0.7128724216959512
Silhouette score: 0.48513513513513506
Silhouette score: 0.6965277777777777
Silhouette score: 0.7319444444444445
Silhouette score: 0.40428571428571425
Silhouette score: 0.7810540738034344
gen	nevals	avg     	std     	min     	max     
0  	10    	0.628928	0.126004	0.404286	0.781054
Silhouette score: 0.6042482858272331
Silhouette score: 0.8101351351351351
Silhouette score: 0.45499999999999996
Silhouette score: 0.6972222222222222
Silhouette score: 0.6833333333333333
Silhouette score: 0.8078571428571429
Silhouette score: 0.7532731092436975
Silhouette score: 0.5837837837837838
1  	8     	0.684408	0.104855	0.455   	0.810135
Silhouette score: 0.6972222222222222
Silhouette score: 0.6347222222222222
Silhouette score: 0.6613615216201423
Silhouette score: 0.5429476435799966
Silhouette score: 0.9132675438596491
[output truncated]

Again, this returns the best parameters found, which we then pass to the `parea_1_spectral` function to perform the clustering:

In [9]:
labels_spectral = pyrea.parea_1_spectral(data, *params_spectral, k_final=2)

print(labels_spectral)

[0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


We can also print the parameters to see which were selected by the genetic algorithm:

In [10]:
params_spectral

['spectral',
 10,
 7,
 'spectral',
 14,
 7,
 'spectral',
 15,
 2,
 'spectral',
 13,
 3,
 'agreement']

# Evaluation

## Parea_1

We compare our results to several algorithms designed specifically for multi-view data, as implemented in mvlearn. We use the Normalized Mutual Information (NMI) score for this evaluation.
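As a brief illustration of why NMI suits this comparison, the sketch below uses toy labels (not the Nutrimouse results) with scikit-learn's `normalized_mutual_info_score`. NMI compares partitions rather than label values, so swapping cluster names does not change the score, and with the default settings the score does not depend on the order of the two arguments.

```python
from sklearn.metrics import normalized_mutual_info_score as nmi_score

# Toy ground truth and three candidate labellings (illustrative values only).
truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]       # same partition, cluster names swapped
partial = [0, 0, 1, 1, 1, 1]       # one sample assigned to the wrong group
unrelated = [0, 1, 0, 1, 0, 1]     # carries little information about truth

score_renamed = nmi_score(truth, renamed)
score_partial = nmi_score(truth, partial)
score_unrelated = nmi_score(truth, unrelated)

print(f"Renamed clusters: {score_renamed:.3f}")
print(f"One sample off:   {score_partial:.3f}")
print(f"Unrelated labels: {score_unrelated:.3f}")

# The argument order does not matter with the default averaging.
print(abs(nmi_score(partial, truth) - score_partial) < 1e-12)
```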

First some imports:

In [11]:
from sklearn.metrics import normalized_mutual_info_score as nmi_score
from mvlearn.cluster import MultiviewCoRegSpectralClustering, MultiviewSphericalKMeans, MultiviewKMeans

We begin by computing the NMI score for hierarchical clustering (using the ground truths stored in `y` from above):

In [12]:
s_nmi_hierarchical = nmi_score(labels_hierarchical, y)
s_nmi_hierarchical

0.5615896365639194

And now for spectral clustering:

In [13]:
s_nmi_spectral = nmi_score(labels_spectral, y)
s_nmi_spectral

0.11470162494150578

## mvlearn

Here we use mvlearn's specialised multi-view clustering algorithms on the same data and also compute the NMI scores. 

First, Multiview Coregularized Spectral Clustering:

In [14]:
mv_coreg_spectral = MultiviewCoRegSpectralClustering(
    n_clusters=2,
    random_state=42,
    n_init=100,
)

labels_mv_coreg_spectral = mv_coreg_spectral.fit_predict(data)

s_nmi_mv_coreg_spectral = nmi_score(labels_mv_coreg_spectral, y)
s_nmi_mv_coreg_spectral

0.4591921489796446

Now with Multiview Spherical KMeans:

In [15]:
mv_spherical_kmeans = MultiviewSphericalKMeans(n_clusters=2, random_state=42)

labels_spherical_kmeans = mv_spherical_kmeans.fit_predict(data)

s_nmi_spherical_kmeans = nmi_score(labels_spherical_kmeans, y)
s_nmi_spherical_kmeans

0.6843628400956293

And finally we use mvlearn's Multiview KMeans:

In [17]:
mv_kmeans = MultiviewKMeans(n_clusters=2, random_state=42)

labels_kmeans = mv_kmeans.fit_predict(data)

s_nmi_kmeans = nmi_score(labels_kmeans, y)
s_nmi_kmeans

0.5615896365639194
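To compare the methods side by side, the NMI scores reported above can be collected and ranked. The sketch below hard-codes the values printed in this notebook so that it runs on its own; in a live session the variables `s_nmi_hierarchical`, `s_nmi_spectral`, and so on could be used instead.

```python
# NMI scores copied from the outputs above.
scores = {
    "Parea_1 (hierarchical)": 0.5615896365639194,
    "Parea_1 (spectral)": 0.11470162494150578,
    "mvlearn Multiview CoReg Spectral": 0.4591921489796446,
    "mvlearn Multiview Spherical KMeans": 0.6843628400956293,
    "mvlearn Multiview KMeans": 0.5615896365639194,
}

# Rank from highest to lowest NMI.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:<36} {score:.3f}")
```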

# Conclusions

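With genotype as the ground truth, Parea_1 with hierarchical clustering achieved an NMI of 0.562 on this dataset, matching mvlearn's Multiview KMeans (0.562) and exceeding Multiview Coregularized Spectral Clustering (0.459). Multiview Spherical KMeans scored highest (0.684), while Parea_1 with spectral clustering scored lowest (0.115). Note that the spectral search used a much smaller population than the hierarchical search (10 versus 100), so the two Parea_1 runs did not receive the same search effort.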

# References

<span id="fn1"><sup>1</sup> Nutrimouse data: https://aasldpubs.onlinelibrary.wiley.com/doi/10.1002/hep.21510</span>

<span id="fn2"><sup>2</sup> mvlearn Nutrimouse example: https://mvlearn.github.io/auto_examples/datasets/plot_nutrimouse.html</span>