# Virgo Demo 1 - Base pipeline

In [10]:
from virgo.cluster import VirgoCluster
from virgo.kernel import VirgoSimpleKernel
from virgo.mixture import VirgoMixture
from virgo.cleaner import LowDensityCleaner

%load_ext autoreload
%autoreload 2

%matplotlib notebook

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Data class

VirgoCluster is meant to be the base class for data handling. It stores the raw data, the rescaled data set, and the final cluster and cluster_label arrays separately. The rescaled data set is created when the scale_data() method is called. print_datastats() prints a few summary statistics for the stored data sets.

Virgo can process txt files as well as simulation snapshots. The io_mode has to be set accordingly (default io_mode=0 for txt files). Optionally, the Mach number dimension can be filtered. This is off by default; when it is enabled, the default ceiling and floor values are 15 and 1.

In [11]:
file_name = "/home/max/Software/virgo/data/data.txt"
virgo_cluster = VirgoCluster(file_name=file_name, io_mode=0, cut_mach_dim=-1)
virgo_cluster.scale_data()
virgo_cluster.print_datastats()

Data set 0 - Shape: (671556, 8)
Mean / Std: 43920.488 / 135103.105
Min / Max: -7516.963 / 694763.000
Data set 1 - Shape: (671556, 7)
Mean / Std: 0.000 / 1.000
Min / Max: -4.529 / 8.530
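
The printed statistics for data set 1 (mean 0, standard deviation 1, one column fewer than the raw data) are consistent with a per-column standardization applied after dropping the event-number column. The source does not show how scale_data() is implemented, so the following is only a minimal NumPy sketch under that assumption; the `standardize` helper and the synthetic `raw` array are hypothetical and not part of Virgo.

```python
import numpy as np


def standardize(data: np.ndarray) -> np.ndarray:
    """Rescale each column to zero mean and unit standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (data - mean) / std


rng = np.random.default_rng(0)
# Hypothetical raw data: column 0 plays the role of the event number.
raw = np.column_stack([np.arange(1000), rng.normal(50.0, 20.0, size=(1000, 3))])

# Assumption: the event-number column is dropped before scaling.
scaled = standardize(raw[:, 1:])
print(scaled.shape, round(float(scaled.mean()), 3), round(float(scaled.std()), 3))
```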


### Kernel

Virgo uses a covariance function to create additional feature space dimensions by leveraging correlations in the data set itself. For the time being, this is a very simple LinearKernel. VirgoKernel needs to be instantiated with the corresponding VirgoCluster object and then simply called. For VirgoSimpleKernel, the new feature dimensions are added to the rescaled data set automatically, as can be seen from the stats output.

Currently, only the spatial dimensions are used for the kernel. The dimensions to use can be passed as a list.

In [12]:
virgo_kernel = VirgoSimpleKernel(virgo_cluster)
virgo_kernel()
virgo_cluster.print_datastats()

Data set 0 - Shape: (671556, 8)
Mean / Std: 43920.488 / 135103.105
Min / Max: -7516.963 / 694763.000
Data set 1 - Shape: (671556, 8)
Mean / Std: 2.599 / 11.813
Min / Max: -4.529 / 302.402
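
The stats output shows that the kernel step adds one column to the rescaled data set. How VirgoSimpleKernel computes that column is not shown here, so the sketch below is an illustration of the general idea only, not Virgo's implementation: it evaluates a linear kernel k(x, y) = x · y of each point with itself over a chosen subset of dimensions and appends the result as a new feature. The function name `add_self_kernel_feature` and the choice of columns 0 to 2 as the "spatial" dimensions are assumptions.

```python
import numpy as np


def add_self_kernel_feature(scaled: np.ndarray, dims=(0, 1, 2)) -> np.ndarray:
    """Append k(x, x) = x . x, computed over `dims`, as an extra column.

    Illustrative only: this is not how VirgoSimpleKernel is known to work.
    """
    sub = scaled[:, list(dims)]
    # Row-wise dot product of each point with itself.
    feature = np.einsum("ij,ij->i", sub, sub)
    return np.column_stack([scaled, feature])


x = np.array([[1.0, 2.0, 2.0, 5.0],
              [0.0, 0.0, 3.0, 1.0]])
extended = add_self_kernel_feature(x)
print(extended.shape)
```

Because the new column is not rescaled afterwards, its range can differ strongly from the standardized columns, which would also change the overall mean and standard deviation of the data set.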


### Gaussian mixture fit model

We use a Gaussian mixture model to classify the data. The VirgoMixture class currently offers a GaussianMixture model with a fixed number of components and a BayesianGaussianMixture model with a Dirichlet process prior that downweights unneeded components. The former is currently the default.

The evidence lower bound is returned as a goodness-of-fit measure, and the component weights can be accessed as an attribute of the model.

Calling the predict() method without any input data automatically sets the labels for the entire data set in the VirgoCluster. Labels with a probability below 95% can optionally be removed; this is not done by default. The threshold can be changed via an input parameter as well.

In [13]:
virgo_mixture = VirgoMixture(virgo_cluster, n_comp=12)
# virgo_mixture = VirgoMixture(virgo_cluster, n_comp=25, mixture_type="bayesian_gaussian")
elbo = virgo_mixture.fit()

print(f"ELBO: {elbo}")
print(f"Mixture weights {virgo_mixture.model.weights_}")

virgo_mixture.predict(remove_uncertain_labels=True)
labels_removed = virgo_cluster.get_labels(return_counts=True)
print(labels_removed)

ELBO: -8.59226327164546
Mixture weights [0.04846238 0.18421426 0.02491986 0.18465893 0.00440741 0.05632831
 0.02226649 0.05656123 0.10078733 0.03427779 0.15040955 0.13270647]
Removed 113503
(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), array([113503, 116769, 111717,  96458,  72526,  41240,  29006,  28496,
        23732,  14805,  10637,  10461,   2206]))
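
The label -1 in the output above collects the points whose labels were removed as uncertain. A minimal sketch of such a probability cut is shown below, assuming a matrix of per-component membership probabilities such as the one returned by scikit-learn's `GaussianMixture.predict_proba`. The helper `labels_with_threshold` is hypothetical and only illustrates the idea; whether VirgoMixture applies the cut in exactly this way is not shown in the notebook.

```python
import numpy as np


def labels_with_threshold(proba: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Assign each row to its most probable component, or -1 if too uncertain."""
    labels = proba.argmax(axis=1)
    # A point is uncertain if even its best component is below the threshold.
    uncertain = proba.max(axis=1) < threshold
    labels[uncertain] = -1
    return labels


# Hypothetical membership probabilities for four points and three components.
proba = np.array([[0.99, 0.01, 0.00],
                  [0.60, 0.30, 0.10],
                  [0.02, 0.97, 0.01],
                  [0.05, 0.00, 0.95]])
labels = labels_with_threshold(proba)
print(np.unique(labels, return_counts=True))
```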


### Visualization 

VirgoCluster has a general plotting method, plot_cluster(), to visualize the fitted data. Specific labels can be selected via a list input. "Removed" uncertain labels are hidden by default, but can be switched on again. Marker size is also an input parameter. The 3D plots can be exported as GIFs as well.

In [18]:
virgo_cluster.plot_cluster(n_step=50, plot_kernel_space=True, store_gif=False, gif_title="gmm_simple_kspace")
virgo_cluster.plot_cluster(n_step=50, store_gif=False, gif_title="gmm_simple")


In [19]:
virgo_cluster.plot_cluster(n_step=25, cluster_label=[0, 1, 2, 3, 4, 5], store_gif=False, gif_title="gmm_simple_zoom")


In [7]:
virgo_cluster.plot_cluster(n_step=10, remove_uncertain=False, cluster_label=[-1])


### Cleaning

We can further clean the resulting clusters in two ways: by splitting a cluster further, based on a check with a two-component GaussianMixture fit, or by removing low-density clusters, which are of little interest to our problem. Both rely on an empirical parameter, but the latter is more stable for the time being, because the density cut is physically motivated and easier to verify.

Relabeling by cluster size is done by default, but can be set to False.
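
The density cut and the relabeling by size can be sketched as follows. This is a minimal illustration under stated assumptions, not the LowDensityCleaner implementation: the density is taken here as the number of points divided by the volume of the cluster's axis-aligned bounding box, which may differ from the definition Virgo uses, and the function `clean_low_density` and the synthetic data are hypothetical.

```python
import numpy as np


def clean_low_density(points: np.ndarray, labels: np.ndarray, threshold: float) -> np.ndarray:
    """Move clusters below the density threshold to label -1, then relabel by size."""
    labels = labels.copy()
    for lab in np.unique(labels):
        if lab < 0:
            continue  # already-removed points are left untouched
        members = points[labels == lab]
        # Assumed density: point count per bounding-box volume.
        volume = np.prod(members.max(axis=0) - members.min(axis=0))
        density = len(members) / volume if volume > 0 else np.inf
        if density < threshold:
            labels[labels == lab] = -1

    # Relabel the surviving clusters so that 0 is the largest.
    kept, counts = np.unique(labels[labels >= 0], return_counts=True)
    order = kept[np.argsort(-counts, kind="stable")]
    relabeled = labels.copy()
    for new, old in enumerate(order):
        relabeled[labels == old] = new
    return relabeled


rng = np.random.default_rng(1)
points = np.vstack([
    rng.uniform(0, 100, size=(50, 3)),   # sparse cluster, original label 0
    rng.uniform(0, 1, size=(200, 3)),    # dense cluster, original label 1
    rng.uniform(5, 6, size=(100, 3)),    # dense cluster, original label 2
])
labels = np.concatenate([np.full(50, 0), np.full(200, 1), np.full(100, 2)])
cleaned = clean_low_density(points, labels, threshold=1.0)
print(np.unique(cleaned, return_counts=True))
```

In this toy example the sparse cluster is merged into label -1 and the two dense clusters are renumbered by size, so that label 0 is the largest surviving cluster.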

In [8]:
virgo_cluster.plot_cluster(n_step=50)
virgo_cluster.get_labels(return_counts=True)


(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 array([116722, 115754,  86054,  70529,  63182,  52610,  33615,  30383,
         19962,  16173,  14027,  12681,  11762,  10369,   9306,   8427]))

In [9]:
d_cleaner = LowDensityCleaner(virgo_cluster, 1e-8)
d_cleaner.clean()
print(virgo_cluster.get_labels(return_counts=True))
virgo_cluster.plot_cluster(n_step=50)

Cluster -1
Cluster 0
Density: 6.2928579482239655e-06
Cluster 1
Density: 1.359766342282378e-07
Cluster 2
Density: 3.6270525255246954e-07
Cluster 3
Density: 9.522632084968439e-07
Cluster 4
Density: 1.198008034227201e-07
Cluster 5
Density: 1.7030550177691195e-08
Cluster 6
Density: 1.8342465459572028e-08
Cluster 7
Density: 1.5746073914906625e-08
Cluster 8
Density: 2.930190194677519e-09
Cluster 9
Density: 1.0277853740838906e-09
Cluster 10
Density: 8.906439438842585e-09
Cluster 11
Density: 1.2332640636492295e-09
Cluster 12
Density: 2.0307429206149054e-10
Cluster 13
Density: 1.4002999777777942e-10
Cluster 14
Density: 4.698086737164749e-07
(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8]), array([191040, 115754,  86054,  70529,  63182,  52610,  33615,  30383,
        19962,   8427]))





In [10]:
virgo_cluster.plot_cluster(
    n_step=10, cluster_label=[0, 1, 2, 3, 4, 5, 6], store_gif=False, gif_title="gmm_simple_cleaned"
)
virgo_cluster.plot_cluster(
    n_step=10, cluster_label=[0, 1, 2, 3, 4], store_gif=False, gif_title="gmm_simple_cleaned_big5"
)


In [18]:
virgo_cluster.plot_cluster(
    n_step=25, cluster_label=[1, 3, 6, 7]
)


### Export results

Cluster results, in the original data format, and their labels can be exported with VirgoCluster.export_cluster(). Event numbers (the added 0th dimension) can be removed again, and the output can be restricted to positive labels (both False by default):

In [12]:
# virgo_cluster.export_cluster("vc_simple_cleaned_15", remove_uncertain=True, remove_evno=True)
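
The effect of the two export options can be sketched with plain NumPy. The helper `prepare_export` below is hypothetical and only mirrors the option names used above; it assumes that "positive labels" means every label except the uncertain label -1, and that the event number is stored in column 0. It does not write any file and is not the export_cluster() implementation.

```python
import numpy as np


def prepare_export(raw: np.ndarray, labels: np.ndarray,
                   remove_uncertain: bool = False,
                   remove_evno: bool = False):
    """Select the rows and columns that would be written out."""
    if remove_uncertain:
        keep = labels >= 0  # drop points labeled -1
    else:
        keep = np.ones(len(labels), dtype=bool)
    data = raw[keep]
    if remove_evno:
        data = data[:, 1:]  # assumed: event number sits in column 0
    return data, labels[keep]


# Hypothetical raw data with the event number in column 0.
raw = np.arange(12.0).reshape(4, 3)
labels = np.array([0, -1, 1, 0])
data, out_labels = prepare_export(raw, labels, remove_uncertain=True, remove_evno=True)
print(data.shape, out_labels)
```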