# Virgo pipeline up to the labeled subset

In this notebook, we present the application of Virgo. Snap ids 750 and 790 correspond to ClustHD_1 and ClustHD_2, respectively, from the paper. All parameters are tuned to run on the hardware stated in the paper.

Optionally, the notebook can be used interactively, and GIFs of the results can be created. To create GIFs, set the `store_gif` parameter to True in the plotting functions.

In [None]:
from virgo.data.cluster import VirgoCluster
from virgo.data.cleaner import AutoDensityCleaner
from virgo.models.kernel import VirgoKernel
from virgo.models.mixture import VirgoMixture
import os

%load_ext autoreload
%autoreload 2

# %matplotlib notebook
%matplotlib inline

### Import raw data set

Available snap ids in supplementary material: 750 and 790

In [None]:
snap_id = 750

cdir = os.getcwd()
filebase = cdir + f"/data/snap_{snap_id}"

virgo_cluster = VirgoCluster(
    file_name=filebase, io_mode=1, cut_mach_dim=-2, n_max_data=800000, 
)

virgo_cluster.scale_data()
virgo_cluster.print_datastats()
virgo_cluster.plot_raw_hists(bins=100)

### Denoise and center raw data set

Use the Nystroem approximation, a stationary RBF kernel, PCA and a GMM. The densest GMM component is kept as the result.
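
The chain of steps above can be illustrated outside of Virgo with scikit-learn. This is a minimal sketch on toy data, not Virgo's implementation: the toy data, `gamma`, the number of components, and the definition of "densest" (largest mixture weight per covariance volume) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data (assumption): a compact clump embedded in a broad uniform background.
clump = rng.normal(loc=0.0, scale=0.05, size=(500, 3))
background = rng.uniform(-1.0, 1.0, size=(500, 3))
data = np.vstack([clump, background])

# Nystroem approximation of the feature map of a stationary RBF kernel.
feature_map = Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0)
features = feature_map.fit_transform(data)

# Reduce the approximate kernel features with PCA.
reduced = PCA(n_components=3, random_state=0).fit_transform(features)

# Fit a two-component Gaussian mixture in the reduced kernel space.
gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
labels = gmm.predict(reduced)

# Assumed density proxy: log(weight) - 0.5 * log(det(covariance)).
log_density = [
    np.log(w) - 0.5 * np.linalg.slogdet(c)[1]
    for w, c in zip(gmm.weights_, gmm.covariances_)
]
densest = int(np.argmax(log_density))

# Keep only the particles assigned to the densest component.
denoised = data[labels == densest]
print(f"kept {denoised.shape[0]} of {data.shape[0]} particles")
```

In the notebook itself, `VirgoKernel`, `VirgoMixture` and `AutoDensityCleaner` take care of these steps.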

In [None]:
virgo_kernel = VirgoKernel(virgo_cluster, k_nystroem=100, pca_comp=5)
virgo_kernel()
virgo_cluster.print_datastats()

In [None]:
virgo_mixture = VirgoMixture(virgo_cluster, n_comp=2)
elbo = virgo_mixture.fit()

virgo_mixture.predict(remove_uncertain_labels=False)
labels_removed = virgo_cluster.get_labels(return_counts=True)
print("Classes and number of particles:\t", labels_removed)

virgo_cluster.plot_cluster(
    cmap_vmax=2,
    n_step=25,
    plot_kernel_space=True,
    store_gif=False,
    gif_title=f"virgo_denoise{snap_id}_kernelspace",
)
virgo_cluster.plot_cluster(
    cmap_vmax=2,
    n_step=25,
    store_gif=False,
    gif_title=f"virgo_denoise{snap_id}",
)

In [None]:
d_cleaner = AutoDensityCleaner(virgo_cluster)
d_cleaner.clean()
print(virgo_cluster.get_labels(return_counts=True))
virgo_cluster.plot_cluster(n_step=10)

### Create labeled subset of denoised data

Using a physically motivated kernel function, PCA, and a friends-of-friends (FoF) group finder with an automatic linking-length estimator, we create a labeled subset of the original data set. This step reduces the data set size significantly, but it turns the problem into a supervised classification task.
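
For readers unfamiliar with FoF, the idea can be sketched with SciPy: particles closer than a linking length are "friends", and groups are the connected components of the resulting graph. This is a sketch under stated assumptions, not the code behind `run_fof`. In particular, `estimate_linking_length` is a hypothetical heuristic (a multiple of the median nearest-neighbour distance) and is not claimed to match Virgo's estimator.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def estimate_linking_length(points, factor=2.0):
    """Hypothetical heuristic: factor times the median nearest-neighbour distance."""
    tree = cKDTree(points)
    # k=2 because the closest match of every point is the point itself.
    dists, _ = tree.query(points, k=2)
    return factor * float(np.median(dists[:, 1]))


def friends_of_friends(points, linking_length, min_group_size=1):
    """Label points by FoF group; groups below min_group_size get label -1."""
    n = points.shape[0]
    tree = cKDTree(points)
    # All index pairs (i, j) with distance <= linking_length.
    pairs = tree.query_pairs(r=linking_length, output_type="ndarray")
    graph = coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n)
    )
    # Groups are the connected components of the friendship graph.
    _, raw = connected_components(graph, directed=False)
    ids, counts = np.unique(raw, return_counts=True)
    labels = np.full(n, -1, dtype=int)
    for new_id, old_id in enumerate(ids[counts >= min_group_size]):
        labels[raw == old_id] = new_id
    return labels


# Toy example: two small groups and one isolated point.
toy = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]]
)
toy_labels = friends_of_friends(toy, linking_length=0.15, min_group_size=2)
print(toy_labels)
```

Groups smaller than `min_group_size` are marked with `-1` here, mirroring the convention of negative labels for noise used later in the notebook.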

In [None]:
vc_2 = VirgoCluster(file_name=None)
vc_2.data = virgo_cluster.cluster[virgo_cluster.cluster_labels >=0][::10]
vc_2.scale_data()
vc_2.print_datastats()

In [None]:
virgo_kernel = VirgoKernel(
    vc_2, k_nystroem=500, pca_comp=6, spatial_dim=[0, 1, 2, 3, 4, 5]
)

virgo_kernel(virgo_kernel.custom_kernel)
vc_2.print_datastats()

In [None]:
vc_2.run_fof(
    min_group_size=600,
    use_scaled_data=True,
)

labels, counts = vc_2.get_labels(return_counts=True)
# Print the FoF result of this step, not the stale labels from the denoising step.
print("Classes and number of particles:\t", (labels, counts))

vc_2.plot_cluster(
    n_step=1,
    plot_kernel_space=True,
    store_gif=False,
    gif_title=f"snap{snap_id}_fit_kspace",
)
vc_2.plot_cluster(
    n_step=1,
    maker_size=3.0,
    store_gif=False,
    gif_title=f"snap{snap_id}_fit_sub",
)

In [None]:
labels, counts = vc_2.get_labels(return_counts=True)
# Assign noise particles (negative labels) to a separate class index.
vc_2.cluster_labels[vc_2.cluster_labels < 0] = labels.shape[0] - 1
vc_2.plot_cluster(
    n_step=1,
    store_gif=False,
    gif_title=f"snap{snap_id}_fit_sub_wnoise",
)
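
The relabeling in the cell above can be checked on a small array. The sketch below assumes that `get_labels(return_counts=True)` behaves like `np.unique(..., return_counts=True)`, that group labels are contiguous starting at 0, and that `-1` is the only negative label. None of these is confirmed here for Virgo.

```python
import numpy as np

# Toy labels (assumption): -1 marks noise, 0..2 are FoF groups.
cluster_labels = np.array([-1, 0, 0, 1, -1, 2, 1])
labels, counts = np.unique(cluster_labels, return_counts=True)

# labels is [-1, 0, 1, 2], so labels.shape[0] - 1 is the first unused index.
noise_class = labels.shape[0] - 1

relabeled = cluster_labels.copy()
relabeled[relabeled < 0] = noise_class
print(relabeled)
```

With these assumptions, noise becomes an ordinary class next to the FoF groups, which is what a supervised classifier needs.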

### Export result for SV-DKL scalability 

In [None]:
# vc_2.export_cluster(f"vc_fitted_{snap_id}", remove_uncertain=False, remove_evno=True)