# Using Pyrea with Nutrimouse Data Utilising Hierarchical and Spectral Clustering

In this notebook we demonstrate Pyrea's usage by performing hierarchical and spectral clustering on the Nutrimouse[<sup>1</sup>](#fn1) dataset.

We do this with the Parea_1 structure, which the Pyrea software package provides as a helper function.

## Imports
This notebook requires Pyrea, mvlearn, and NumPy. Let's import the relevant items here:

In [1]:
import pyrea
import numpy as np
from mvlearn.datasets import load_nutrimouse

## Load Data

Load the Nutrimouse data from mvlearn:[<sup>2</sup>](#fn2)

In [62]:
nutrimouse_dataset = load_nutrimouse()
data = [nutrimouse_dataset['gene'], nutrimouse_dataset['lipid']]

y_all = np.vstack((nutrimouse_dataset['genotype'], nutrimouse_dataset['diet'])).T
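y = y_all[:, 0]  # column 0 of y_all holds the genotype labels

Note: the variable `y` contains the ground-truth labels (the genotype, taken from column 0 of `y_all`) that we will use for evaluation later in the notebook. The ground truth is not used during training.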
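To make the label layout concrete, here is a small sketch with made-up labels (the values are illustrative only, not the Nutrimouse labels, and the variable names are hypothetical). Stacking two length-n arrays with `np.vstack` gives a 2 x n array; transposing it yields one row per sample, so column 0 holds the first array and column 1 the second.

```python
import numpy as np

# Toy stand-ins for two label arrays (hypothetical values, not real data).
genotype = np.array([0, 0, 1, 1])
diet = np.array([2, 3, 2, 3])

# vstack gives shape (2, 4); transposing gives one row per sample: (4, 2).
labels_all = np.vstack((genotype, diet)).T

# Column 0 recovers the first stacked array, column 1 the second.
first_column = labels_all[:, 0]
second_column = labels_all[:, 1]

print(labels_all.shape)
print(first_column)
```

Selecting a single column in this way is what turns the two-column label matrix into the one-dimensional ground-truth vector that the NMI score expects.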

Preview the shape of the data:

In [63]:
print(f'Number of views: {len(data)}')
print(f'Shape of view 1: {np.shape(data[0])[0]} x {np.shape(data[0])[1]}')
print(f'Shape of view 2: {np.shape(data[1])[0]} x {np.shape(data[1])[1]}')

Number of views: 2
Shape of view 1: 40 x 120
Shape of view 2: 40 x 21


The dataset has two views. View 1 (`gene`) has 120 features for each of the 40 mice, while view 2 (`lipid`) has 21 features for each of the 40 mice. As this is a multi-view dataset, the 40 samples refer to the same 40 mice in both views.

We will use Parea_1 to perform hierarchical clustering and spectral clustering on this dataset, and use Pyrea's built-in genetic algorithm functionality to find the best hyperparameters to use for this data.
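For readers unfamiliar with genetic algorithms, the following is a minimal, generic sketch of the idea, and not Pyrea's implementation. The search space, the `toy_fitness` function, and all helper names are hypothetical; only the parameter names `n_population` and `n_generations` mirror the ones used in this notebook. A real search would replace `toy_fitness` with a clustering quality score computed on the data.

```python
import random

# Hypothetical search space: one linkage choice and one cluster count.
LINKAGES = ["single", "complete", "average", "ward"]
K_MIN, K_MAX = 2, 5


def random_candidate(rng):
    """Draw one candidate hyperparameter setting at random."""
    return (rng.choice(LINKAGES), rng.randint(K_MIN, K_MAX))


def toy_fitness(candidate):
    """Stand-in for a clustering quality score (higher is better).

    This toy version simply prefers ("average", 3) so that the example
    runs without any data.
    """
    linkage, k = candidate
    return (1.0 if linkage == "average" else 0.0) - abs(k - 3)


def mutate(candidate, rng, rate=0.3):
    """Randomly resample each gene with probability `rate`."""
    linkage, k = candidate
    if rng.random() < rate:
        linkage = rng.choice(LINKAGES)
    if rng.random() < rate:
        k = rng.randint(K_MIN, K_MAX)
    return (linkage, k)


def crossover(a, b, rng):
    """Build a child by taking each gene from one parent at random."""
    return (rng.choice([a[0], b[0]]), rng.choice([a[1], b[1]]))


def genetic_search(n_population=20, n_generations=10, seed=0):
    """Evolve a population and return the fittest candidate found."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(n_population)]
    for _ in range(n_generations):
        # Keep the better half unchanged (elitism), then refill with children.
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: n_population // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(n_population - len(parents))
        ]
        population = parents + children
    return max(population, key=toy_fitness)


best = genetic_search()
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.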

# Parea_1

## Hierarchical Clustering

Perform the genetic algorithm as follows, which will learn the best parameters to use for the clustering:

In [16]:
params_hierarchical = pyrea.parea_1_genetic(data, k_min=2, k_max=5, k_final=2, n_generations=10, n_population=100)

Silhouette score: 0.765625
Silhouette score: 0.4848440138466734
Silhouette score: 0.6873233197961458
Silhouette score: 0.7136363636363636
Silhouette score: 0.5850405546834118
Silhouette score: 0.5617943548387097
Silhouette score: 0.5725405546834118
Silhouette score: 0.7020833333333334
Silhouette score: 0.6508099393603595
Silhouette score: 0.7331417624521073
Silhouette score: 0.6416554659498207
Silhouette score: 0.6083333333333334
Silhouette score: 0.6104166666666667
Silhouette score: 0.45572991822991826
Silhouette score: 0.6416554659498207
Silhouette score: 0.790625
Silhouette score: 0.49511494252873567
Silhouette score: 0.6354166666666667
Silhouette score: 0.5305925505653766
Silhouette score: 0.5520833333333333
Silhouette score: 0.5891369047619047
Silhouette score: 0.5675287356321839
Silhouette score: 0.7136363636363636
Silhouette score: 0.47002591706539076
Silhouette score: 0.5617943548387097
Silhouette score: 0.6645833333333334
Silhouette score: 0.7760416666666666
Silhouette score: 
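The values printed during the search are silhouette scores, which appear to serve as the fitness being maximised. As a reminder of what this score measures, here is a small pure-Python sketch using Euclidean distance. It illustrates the standard definition only; how Pyrea builds the distances behind its own scores is not shown in this notebook, and the function below is not part of Pyrea.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    Requires at least two distinct cluster labels.
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from p to the members of each cluster (excluding p).
        mean_dist = {}
        for c in clusters:
            dists = [
                math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == c and j != i
            ]
            mean_dist[c] = sum(dists) / len(dists) if dists else None

        a = mean_dist[own]  # cohesion: mean distance within own cluster
        if a is None:
            # Singleton cluster: the silhouette is defined as 0.
            scores.append(0.0)
            continue
        # Separation: mean distance to the nearest other cluster.
        b = min(d for c, d in mean_dist.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes the two groups
print(good > bad)
```

Scores lie between -1 and 1, and a labelling that groups nearby points together scores higher than one that mixes distant points.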

Once the genetic algorithm has finished, `params_hierarchical` holds the best parameters it found for this data. We pass them to the `parea_1` function to perform the clustering:

In [64]:
labels_hierarchical = pyrea.parea_1(data, *params_hierarchical)

print(labels_hierarchical)

[1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


We can also print the parameters to see which were selected by the genetic algorithm:

In [65]:
params_hierarchical

['hierarchical',
 'complete',
 3,
 'hierarchical',
 'average',
 3,
 'hierarchical',
 'average',
 2,
 'hierarchical',
 'ward2',
 2,
 'disagreement']
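The returned list is flat, which makes it hard to read. The sketch below regroups it under an assumption inferred only from the printed output, and not checked against Pyrea's documentation: that the list consists of repeated (method, linkage, number of clusters) triples followed by a single fusion method name. `describe_params` is a hypothetical helper, not part of Pyrea.

```python
# The parameter list exactly as printed above.
params = ['hierarchical', 'complete', 3,
          'hierarchical', 'average', 3,
          'hierarchical', 'average', 2,
          'hierarchical', 'ward2', 2,
          'disagreement']


def describe_params(flat, group_size=3):
    """Split a flat list into fixed-size groups plus a trailing item.

    ASSUMPTION: the layout is N groups of `group_size` values followed by
    one final value (here interpreted as the fusion method).
    """
    *body, last = flat
    if len(body) % group_size != 0:
        raise ValueError("list length does not match the assumed layout")
    groups = [
        tuple(body[i:i + group_size])
        for i in range(0, len(body), group_size)
    ]
    return groups, last


groups, fusion = describe_params(params)
for method, linkage, k in groups:
    print(f"{method}: linkage={linkage}, k={k}")
print(f"fusion: {fusion}")
```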

## Spectral Clustering

Spectral clustering is performed in almost the same way, with additional bounds on the number of neighbours:

In [50]:
params_spectral = pyrea.parea_1_genetic_spectral(data, k_min=2, k_max=8, k_final=2, n_neighbors_min=10, n_neighbors_max=15, n_population=10, n_generations=10)

Silhouette score: 0.3278604963112005
Silhouette score: 0.7688375350140056
Silhouette score: 0.572972972972973
Silhouette score: 0.7014285714285714
Silhouette score: 0.8033088235294118
Silhouette score: 0.41357142857142853
Silhouette score: 0.5330732292917167
Silhouette score: 0.5944444444444444
Silhouette score: 0.7345238095238095
Silhouette score: 0.4038915566226491
gen	nevals	avg     	std     	min    	max     
0  	10    	0.585391	0.157672	0.32786	0.803309
Silhouette score: 0.5906820031820031
Silhouette score: 0.721218487394958
Silhouette score: 0.5371621621621622
Silhouette score: 0.7856617647058823
Silhouette score: 0.37875482625482626
Silhouette score: 0.6369047619047619
1  	6     	0.629411	0.125333	0.378755	0.785662
Silhouette score: 0.5258593625019997
Silhouette score: 0.6692857142857143
Silhouette score: 0.7978430353430352
Silhouette score: 0.5438375350140057
Silhouette score: 0.4992857142857143
Silhouette score: 0.669047619047619
Silhouette score: 0.811904761904762
Silhouette s

Again this returns our optimal parameters, which we then use to perform the clustering using the `parea_1_spectral` function:

In [66]:
labels_spectral = pyrea.parea_1_spectral(data, *params_spectral, k_final=2)

print(labels_spectral)

[1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0
 0 0 0]




We can also print the parameters to see which were selected by the genetic algorithm:

In [52]:
params_spectral

['spectral',
 12,
 2,
 'spectral',
 10,
 2,
 'spectral',
 10,
 2,
 'spectral',
 15,
 3,
 'agreement']

# Evaluation

## Parea_1

We compare our results to several algorithms designed specifically for multi-view data, using mvlearn. We use the Normalized Mutual Information (NMI) score for this evaluation.
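NMI is a convenient choice here because it does not depend on how the clusters are numbered: a clustering that swaps the ids 0 and 1 relative to the ground truth still scores perfectly. The following pure-Python sketch illustrates this. It uses arithmetic-mean normalisation, which we believe matches the default of scikit-learn's `normalized_mutual_info_score`, but it is an illustration rather than a replacement for that function.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def nmi(a, b):
    """NMI normalised by the arithmetic mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings consist of a single cluster
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


truth = [0, 0, 0, 0, 1, 1, 1, 1]
swapped = [1, 1, 1, 1, 0, 0, 0, 0]      # same partition, ids exchanged
independent = [0, 0, 1, 1, 0, 0, 1, 1]  # carries no information about truth

print(nmi(truth, swapped))
print(nmi(truth, independent))
```

The relabelled copy scores (up to floating-point rounding) 1, while the labelling that is statistically independent of the ground truth scores 0.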

First some imports:

In [58]:
from sklearn.metrics import normalized_mutual_info_score as nmi_score
from mvlearn.cluster import MultiviewCoRegSpectralClustering, MultiviewSphericalKMeans, MultiviewKMeans

First we compute the NMI score for hierarchical clustering (using the ground truths stored in `y` from above):

In [67]:
s_nmi_hierarchical = nmi_score(labels_hierarchical, y)
s_nmi_hierarchical

0.5615896365639194

And now for spectral clustering:

In [68]:
s_nmi_spectral = nmi_score(labels_spectral, y)
s_nmi_spectral

0.5466306756208169

## mvlearn

Here we use mvlearn's specialised multi-view clustering algorithms on the same data and also compute the NMI scores. 

First, Multiview Coregularized Spectral Clustering:

In [71]:
mv_coreg_spectral = MultiviewCoRegSpectralClustering(n_clusters=2,
                                               random_state=42,
                                               n_init=100)

labels_mv_coreg_spectral = mv_coreg_spectral.fit_predict(data)

s_nmi_mv_coreg_spectral = nmi_score(labels_mv_coreg_spectral, y)
s_nmi_mv_coreg_spectral

0.4591921489796446

Now with Multiview Spherical KMeans:

In [74]:
mv_spherical_kmeans = MultiviewSphericalKMeans(n_clusters=2, random_state=42)

labels_spherical_kmeans = mv_spherical_kmeans.fit_predict(data)

s_nmi_spherical_kmeans = nmi_score(labels_spherical_kmeans, y)
s_nmi_spherical_kmeans

0.6843628400956293

And finally we use mvlearn's Multiview KMeans:

In [75]:
mv_kmeans = MultiviewKMeans(n_clusters=2, random_state=42)

labels_kmeans = mv_kmeans.fit_predict(data)

s_nmi_kmeans = nmi_score(labels_kmeans, y)
s_nmi_kmeans

0.5615896365639194

# Conclusions

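Measured against the ground-truth labels in `y`, the outputs above give Parea_1 an NMI of 0.562 with hierarchical clustering and 0.547 with spectral clustering. Among the mvlearn baselines, Multiview Spherical KMeans scores highest (0.684), Multiview KMeans is reported at 0.562, and Multiview Coregularized Spectral Clustering scores lowest (0.459). In this run, both Parea_1 configurations therefore outperform co-regularized spectral clustering, while Multiview Spherical KMeans agrees best with the ground truth.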

# References

<span id="fn1"><sup>1</sup> Nutrimouse data: https://aasldpubs.onlinelibrary.wiley.com/doi/10.1002/hep.21510</span>

<span id="fn2"><sup>2</sup> mvlearn Nutrimouse example: https://mvlearn.github.io/auto_examples/datasets/plot_nutrimouse.html</span>