# Using Pyrea with Nutrimouse Data Utilising Hierarchical and Spectral Clustering

In this notebook we demonstrate Pyrea's usage by performing hierarchical and spectral clustering on the Nutrimouse[<sup>1</sup>](#fn1) dataset.

We will do this using the Parea_1 structure, which the Pyrea software package provides as a helper function.

## Imports
This notebook requires Pyrea, mvlearn, and NumPy. Let's import the relevant items here:

In [1]:
import pyrea
import numpy as np
from operator import itemgetter
from mvlearn.datasets import load_nutrimouse

## Load Data

Load the Nutrimouse data from mvlearn:[<sup>2</sup>](#fn2)

In [2]:
nutrimouse_dataset = load_nutrimouse()
data = [nutrimouse_dataset['gene'], nutrimouse_dataset['lipid']]

y_all = np.vstack((nutrimouse_dataset['genotype'], nutrimouse_dataset['diet'])).T
y = y_all[:,0]

Note: the variable `y` contains the ground truths that we will use for evaluation later in the notebook. It is the first column of `y_all`, i.e. the genotype labels. The ground truths are not used during training.
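To make the label handling concrete, here is a minimal sketch using made-up toy labels (not the Nutrimouse values). It shows how `np.vstack(...).T` arranges two label vectors as columns, so that `[:, 0]` selects the first one.

```python
import numpy as np

# Toy stand-ins for the two label vectors (illustrative values only).
genotype = np.array([0, 0, 1, 1])
diet = np.array([2, 3, 2, 3])

# vstack gives shape (2, n_samples); transposing gives one row per sample.
y_all = np.vstack((genotype, diet)).T
y = y_all[:, 0]  # first column: the genotype labels

print(y_all.shape)
print(y)
```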

Preview the shape of the data:

In [3]:
print(f'Number of views: {len(data)}')
print(f'Shape of view 1: {np.shape(data[0])[0]} x {np.shape(data[0])[1]}')
print(f'Shape of view 2: {np.shape(data[1])[0]} x {np.shape(data[1])[1]}')

Number of views: 2
Shape of view 1: 40 x 120
Shape of view 2: 40 x 21


The dataset has 2 views. View 1 has 120 features for each of the 40 mice, while view 2 has 21 features for each of the 40 mice. As this is a multi-view dataset, the 40 samples refer to the same 40 mice in both views.
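Because every view must describe the same samples in the same order, a quick consistency check can be useful before clustering. The sketch below uses random placeholder arrays with the same shapes as the two views; `check_views` is a hypothetical helper written for this illustration, not part of Pyrea or mvlearn.

```python
import numpy as np

def check_views(views):
    """Return the shared sample count, or raise if the views disagree."""
    n_samples = {np.shape(v)[0] for v in views}
    if len(n_samples) != 1:
        raise ValueError(f"Views have different sample counts: {sorted(n_samples)}")
    return n_samples.pop()

# Random placeholders with the same shapes as the two Nutrimouse views.
rng = np.random.default_rng(0)
toy_views = [rng.normal(size=(40, 120)), rng.normal(size=(40, 21))]

n = check_views(toy_views)
print(f"All views share {n} samples")
```

Note that this only checks the sample counts. It cannot verify that row `i` refers to the same mouse in every view, which has to be guaranteed by how the data was assembled.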

We will use Parea_1 to perform hierarchical clustering and spectral clustering on this dataset, and use Pyrea's built-in genetic algorithm functionality to find the best hyperparameters to use for this data.
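For readers unfamiliar with genetic algorithms, the following is a toy sketch of the general idea: keep a population of candidate parameter sets, score each one, and build the next generation from the best candidates by crossover and mutation. It is not Pyrea's implementation. The fitness function, the two-gene encoding, and the name `toy_genetic_search` are all assumptions made for illustration, with argument names chosen to resemble those used in the calls in this notebook.

```python
import random

def toy_fitness(individual):
    """Stand-in for a clustering quality score; peaks at k1=3, k2=4."""
    k1, k2 = individual
    return 1.0 / (1.0 + (k1 - 3) ** 2 + (k2 - 4) ** 2)

def toy_genetic_search(k_min, k_max, n_population, n_generations, seed=0):
    rng = random.Random(seed)
    population = [(rng.randint(k_min, k_max), rng.randint(k_min, k_max))
                  for _ in range(n_population)]
    best_per_generation = []
    for _ in range(n_generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        best_per_generation.append(toy_fitness(ranked[0]))
        parents = ranked[: max(2, n_population // 2)]
        children = [ranked[0]]  # elitism: keep the best individual unchanged
        while len(children) < n_population:
            a, b = rng.sample(parents, 2)
            child = [a[0], b[1]]      # crossover: one gene from each parent
            if rng.random() < 0.3:    # mutation: resample one gene
                child[rng.randrange(2)] = rng.randint(k_min, k_max)
            children.append(tuple(child))
        population = children
    best = max(population, key=toy_fitness)
    best_per_generation.append(toy_fitness(best))
    return best, best_per_generation

best, history = toy_genetic_search(k_min=2, k_max=5, n_population=10, n_generations=3)
print(best, history)
```

Because the best individual is always carried over, the best fitness in this sketch can never decrease from one generation to the next.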

# Parea_1

## Hierarchical Clustering

Run the genetic algorithm as follows to search for the best clustering parameters:

In [4]:
params_hierarchical = pyrea.parea_1_genetic(data, k_min=2, k_max=5, k_final=2, n_generations=3, n_population=10)

Silhouette score: 0.6995833333333333
Silhouette score: 0.6083333333333334
Silhouette score: 0.5625
Silhouette score: 0.8166666666666668
Silhouette score: 0.46607142857142847
Silhouette score: 0.5071428571428571
Silhouette score: 0.41482142857142856
Silhouette score: 0.5032142857142856
Silhouette score: 0.5641369047619047
Silhouette score: 0.4791699372056515
gen	nevals	avg     	std     	min     	max     
0  	10    	0.562164	0.114072	0.414821	0.816667
Silhouette score: 0.5071428571428571
Silhouette score: 0.5625
Silhouette score: 0.44812303709362533
Silhouette score: 0.7458333333333333
Silhouette score: 0.5641369047619047
Silhouette score: 0.81875
Silhouette score: 0.5071428571428571
Silhouette score: 0.6083333333333334
Silhouette score: 0.6995833333333333
1  	9     	0.616113	0.113767	0.448123	0.81875 
Silhouette score: 0.8625
Silhouette score: 0.7217867146858744
Silhouette score: 0.81875
Silhouette score: 0.81875
Silhouette score: 0.7458333333333333
Silhouette score: 0.7458333333333333
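The `avg`, `std`, `min`, and `max` columns in the log summarise the silhouette scores of each generation. As a sanity check, the sketch below recomputes the generation 0 row from the ten scores logged above. It assumes `std` is the population standard deviation (dividing by the number of scores), which is the variant that agrees with the logged value.

```python
from statistics import fmean, pstdev

# The ten silhouette scores logged for generation 0 above.
gen0_scores = [
    0.6995833333333333, 0.6083333333333334, 0.5625, 0.8166666666666668,
    0.46607142857142847, 0.5071428571428571, 0.41482142857142856,
    0.5032142857142856, 0.5641369047619047, 0.4791699372056515,
]

avg = fmean(gen0_scores)
std = pstdev(gen0_scores)  # population standard deviation (divides by n)

print(f"nevals={len(gen0_scores)} avg={avg:.6f} std={std:.6f} "
      f"min={min(gen0_scores):.6f} max={max(gen0_scores):.6f}")
```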


Once this is complete, `params_hierarchical` contains the best parameters that the genetic algorithm found for this data. We pass them to the `parea_1` function to perform the clustering:

In [5]:
labels_hierarchical = pyrea.parea_1(data, *params_hierarchical)

print(labels_hierarchical)

[1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0
 1 0 0]
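The genetic search above ranks candidates by silhouette score, which is high when samples are close to their own cluster and far from the nearest other cluster. The sketch below computes it from scratch on a toy one-dimensional example. It is an illustration of the metric only, not Pyrea's internal code, and `silhouette_score_1d` is a hypothetical helper that expects at least two clusters.

```python
def silhouette_score_1d(points, labels):
    """Mean silhouette over all samples, using absolute difference as distance."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [abs(p - q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(abs(p - q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score_1d(points, [0, 0, 1, 1])  # well-separated clusters
bad = silhouette_score_1d(points, [0, 1, 0, 1])   # clusters that mix both groups
print(good, bad)
```

The well-separated labelling scores close to 1, while the mixed labelling scores below 0.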


We can also print the parameters to see which were selected by the genetic algorithm:

In [6]:
params_hierarchical

['hierarchical',
 'single',
 5,
 'hierarchical',
 'median',
 5,
 'hierarchical',
 'ward',
 2,
 'hierarchical',
 'single',
 3,
 'disagreement']
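The flat list is easier to read when grouped. Judging only from the printed values, it appears to consist of four (algorithm, linkage, number of clusters) triples followed by a fusion method. That reading is an assumption; the Pyrea documentation is the authority on the exact order and meaning. `describe_parea_1_params` is a hypothetical helper for this illustration.

```python
def describe_parea_1_params(params):
    """Split a flat parameter list into triples plus the trailing fusion method."""
    *flat, fusion = params
    if len(flat) % 3 != 0:
        raise ValueError("expected triples followed by a fusion method")
    triples = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return triples, fusion

# The list printed above, copied verbatim.
example = ['hierarchical', 'single', 5,
           'hierarchical', 'median', 5,
           'hierarchical', 'ward', 2,
           'hierarchical', 'single', 3,
           'disagreement']

triples, fusion = describe_parea_1_params(example)
for algorithm, linkage, k in triples:
    print(f"{algorithm}: linkage={linkage}, k={k}")
print(f"fusion: {fusion}")
```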

## Spectral Clustering

Spectral clustering is performed in almost the same way, using the `parea_1_genetic_spectral` function:

In [7]:
params_spectral = pyrea.parea_1_genetic_spectral(data, k_min=2, k_max=8, k_final=2, n_neighbors_min=10, n_neighbors_max=15, n_population=10, n_generations=3)

Silhouette score: 0.6962612612612613
Silhouette score: 0.38844537815126046
Silhouette score: 0.3454545454545455
Silhouette score: 0.5086440601610625
Silhouette score: 0.33914065898216617
Silhouette score: 0.41237716399547997
Silhouette score: 0.3448436192070988
Silhouette score: 0.5154130380421191
Silhouette score: 0.5096638655462185
Silhouette score: 0.539705292702486
gen	nevals	avg     	std     	min     	max     
0  	10    	0.459995	0.108836	0.339141	0.696261
Silhouette score: 0.6031428303303302
Silhouette score: 0.6333302890940127
Silhouette score: 0.3684953597661961
Silhouette score: 0.4137893093283824
Silhouette score: 0.628125
Silhouette score: 0.5338791562717029
Silhouette score: 0.7021491224363496
Silhouette score: 0.7149128396689373
Silhouette score: 0.5728280172821337
Silhouette score: 0.3535983717778365
1  	10    	0.552425	0.125304	0.353598	0.714913
Silhouette score: 0.7004166666666667
Silhouette score: 0.7143055555555555
2  	2     	0.661777	0.0676587	0.533879	0.714913
Silho

Again, this returns the best parameters found, which we then use to perform the clustering with the `parea_1_spectral` function:

In [8]:
labels_spectral = pyrea.parea_1_spectral(data, *params_spectral, k_final=2)

print(labels_spectral)

[1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0]


We can also print the parameters to see which were selected by the genetic algorithm:

In [9]:
params_spectral

['spectral',
 15,
 6,
 'spectral',
 11,
 5,
 'spectral',
 11,
 5,
 'spectral',
 13,
 2,
 'agreement']
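The last entry of both parameter lists names a fusion method (`'agreement'` here, `'disagreement'` for the hierarchical run). The sketch below shows one plausible reading of these names: a matrix counting, for every pair of samples, how many clusterings put them together or apart. This interpretation is an assumption based on the names and is not taken from Pyrea's source; the helper functions and toy clusterings are hypothetical.

```python
def agreement_matrix(clusterings):
    """Entry [i][j] counts the clusterings that put samples i and j together."""
    n = len(clusterings[0])
    return [[sum(labels[i] == labels[j] for labels in clusterings)
             for j in range(n)] for i in range(n)]

def disagreement_matrix(clusterings):
    """Entry [i][j] counts the clusterings that separate samples i and j."""
    m = len(clusterings)
    return [[m - v for v in row] for row in agreement_matrix(clusterings)]

clusterings = [
    [0, 0, 1, 1],  # clustering from a hypothetical view 1
    [0, 0, 0, 1],  # clustering from a hypothetical view 2
]

agree = agreement_matrix(clusterings)
disagree = disagreement_matrix(clusterings)
for row in agree:
    print(row)
```

A matrix like this can then be clustered again to obtain a single consensus labelling, with the disagreement counts acting as distances between samples.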

# Evaluation

## Parea 1

We compare our results to several algorithms specially designed for multi-view data, using mvlearn. We use the Normalized Mutual Information (NMI) score for this evaluation.

First some imports:

In [10]:
from sklearn.metrics import normalized_mutual_info_score as nmi_score
from mvlearn.cluster import MultiviewCoRegSpectralClustering, MultiviewSphericalKMeans, MultiviewKMeans

Next we compute the NMI score for hierarchical clustering (using the ground truths stored in `y` from above):

In [11]:
s_nmi_hierarchical = nmi_score(labels_hierarchical, y)
s_nmi_hierarchical

0.23180186867415759

And now for spectral clustering:

In [12]:
s_nmi_spectral = nmi_score(labels_spectral, y)
s_nmi_spectral

0.48348413489732744
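NMI compares two partitions rather than two label vectors, so it does not matter that the cluster ids `0` and `1` are arbitrary. The sketch below implements NMI from scratch on toy labels to show this. It normalises by the arithmetic mean of the two entropies, which is scikit-learn's documented default; the function names are hypothetical and the code is not scikit-learn's implementation.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    denom = (entropy(labels_a) + entropy(labels_b)) / 2
    return mi / denom if denom > 0 else 1.0

truth = [0, 0, 0, 1, 1, 1]
same_partition = nmi([1, 1, 1, 0, 0, 0], truth)  # swapped ids, same grouping
poor_partition = nmi([0, 1, 0, 1, 0, 1], truth)  # grouping unrelated to truth
print(same_partition, poor_partition)
```

The first score is 1 (up to floating-point error) even though every label differs from the ground truth, while the second is close to 0.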

## mvlearn

Here we use mvlearn's specialised multi-view clustering algorithms on the same data and also compute the NMI scores. 

First, Multiview Coregularized Spectral Clustering:

In [13]:
mv_coreg_spectral = MultiviewCoRegSpectralClustering(n_clusters=2,
                                               random_state=42,
                                               n_init=100)

labels_mv_coreg_spectral = mv_coreg_spectral.fit_predict(data)

s_nmi_mv_coreg_spectral = nmi_score(labels_mv_coreg_spectral, y)
s_nmi_mv_coreg_spectral

0.4591921489796446

Now with Multiview Spherical KMeans:

In [14]:
mv_spherical_kmeans = MultiviewSphericalKMeans(n_clusters=2, random_state=42)

labels_spherical_kmeans = mv_spherical_kmeans.fit_predict(data)

s_nmi_spherical_kmeans = nmi_score(labels_spherical_kmeans, y)
s_nmi_spherical_kmeans

0.6843628400956293

And finally we use mvlearn's Multiview KMeans:

In [15]:
mv_kmeans = MultiviewKMeans(n_clusters=2, random_state=42)

labels_kmeans = mv_kmeans.fit_predict(data)

s_nmi_kmeans = nmi_score(labels_kmeans, y)
s_nmi_kmeans

0.5615896365639194

# Parea 2

In [16]:
params_parea2 = pyrea.parea_2_mv_genetic(data, k_min=2, k_max=5, n_population=10, n_generations=3)

Starting parea 2 genetic...
['c_0_type', 'c_1_type', 'c_0_method', 'c_1_method', 'c_0_k', 'c_1_k', 'c_0_pre_type', 'c_1_pre_type', 'c_0_pre_method', 'c_1_pre_method', 'c_0_pre_k', 'c_1_pre_k', 'fusion_method']
Silhouette score: 0.25206614153538387
Silhouette score: 0.529158739073293
Silhouette score: 0.30154001252838464
Silhouette score: 0.4859592299809691
Silhouette score: 0.529158739073293
Silhouette score: 0.5406250000000001
Silhouette score: 0.3668353174603175
Silhouette score: 0.2138679567221625
Silhouette score: 0.4594696969696971
Silhouette score: 0.529158739073293
gen	nevals	avg     	std     	min     	max     
0  	10    	0.420784	0.119896	0.213868	0.540625
Silhouette score: 0.25206614153538387
Silhouette score: 0.5406250000000001
Silhouette score: 0.5406250000000001
Silhouette score: 0.5406250000000001
Silhouette score: 0.529158739073293
Silhouette score: 0.25206614153538387
Silhouette score: 0.5406250000000001
Silhouette score: 0.4859592299809691
Silhouette score: 0.4637623893

In [17]:
params_parea2_converted = pyrea.convert_to_parameters(data, params_parea2)
labels_parea_2 = pyrea.parea_2_mv(data, **params_parea2_converted)

print(f"Labels: {labels_parea_2}")

s_nmi_parea_2 = nmi_score(labels_parea_2, y)
print(f"NMI score: {s_nmi_parea_2}")

Labels: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0
 0 0 0]
NMI score: 0.6843628400956293


# Conclusions

Parea 2 achieves an NMI score of 0.6843628400956293 with the following configuration:

```python
params_parea2 = pyrea.parea_2_mv_genetic(data, k_min=2, k_max=5, n_population=10, n_generations=3)
```

which matches mvlearn's best result, the identical NMI score obtained with Multiview Spherical KMeans.
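For an overview, the NMI scores reported in this notebook can be collected and ranked. The sketch below copies the values from the cell outputs above; the method names are shorthand labels chosen for this summary.

```python
from operator import itemgetter

# NMI scores copied from the cell outputs above.
results = [
    ("Parea 1 (hierarchical)", 0.23180186867415759),
    ("Parea 1 (spectral)", 0.48348413489732744),
    ("mvlearn co-regularized spectral", 0.4591921489796446),
    ("mvlearn spherical k-means", 0.6843628400956293),
    ("mvlearn k-means", 0.5615896365639194),
    ("Parea 2", 0.6843628400956293),
]

# Highest NMI first; sorted() is stable, so tied entries keep their input order.
ranked = sorted(results, key=itemgetter(1), reverse=True)
for name, score in ranked:
    print(f"{name:<34}{score:.4f}")
```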

# References

<span id="fn1"><sup>1</sup> Nutrimouse data: https://aasldpubs.onlinelibrary.wiley.com/doi/10.1002/hep.21510</span>

<span id="fn2"><sup>2</sup> mvlearn Nutrimouse example: https://mvlearn.github.io/auto_examples/datasets/plot_nutrimouse.html</span>