# Virgo Demo 1 - Base pipeline

In [1]:
from virgo.cluster import VirgoCluster
from virgo.kernel import VirgoSimpleKernel
from virgo.mixture import VirgoMixture
from virgo.cleaner import LowDensityCleaner

%load_ext autoreload
%autoreload 2

%matplotlib notebook

### Data class

VirgoCluster is the base class for data handling. It stores the raw data, the rescaled data set, and the final cluster and cluster_label arrays separately. The rescaled data set is created when the scale_data() method is called. print_datastats() prints a few summary statistics for the stored data sets.

In [2]:
file_name = "/home/max/Software/virgo/data/data.txt"
virgo_cluster = VirgoCluster(file_name=file_name)
virgo_cluster.scale_data()
virgo_cluster.print_datastats()

Data set 0 - Shape: (694764, 8)
Mean / Std: 43809.885 / 134895.773
Min / Max: -7516.963 / 694763.000
Data set 1 - Shape: (694764, 7)
Mean / Std: 0.000 / 1.000
Min / Max: -4.197 / 217.386
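
The stats for data set 1 (mean 0, standard deviation 1) are what a column-wise standardization produces. As a rough illustration, here is a minimal NumPy sketch. It assumes that scale_data() applies a z-score per column and skips the event-number column (column 0 of the raw data); the actual implementation may differ. The helper name `standardize_columns` and the synthetic data are hypothetical.

```python
import numpy as np


def standardize_columns(raw):
    """Z-score each feature column (hypothetical helper, not part of virgo).

    Assumes column 0 holds event numbers and is left out of the result.
    """
    features = raw[:, 1:].astype(float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant columns
    return (features - mean) / std


# Synthetic stand-in for the raw data: event number plus 7 feature columns.
rng = np.random.default_rng(0)
n = 1000
raw = np.column_stack([np.arange(n), rng.normal(50.0, 10.0, size=(n, 7))])

scaled = standardize_columns(raw)
print(f"Raw shape: {raw.shape}, scaled shape: {scaled.shape}")
print(f"Mean / Std: {scaled.mean():.3f} / {scaled.std():.3f}")
```

Because every column ends up with zero mean and unit variance, the statistics over the whole array are also 0 and 1, which matches the pattern in the output above.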


### Kernel

Virgo uses a covariance function to create additional feature space dimensions by leveraging correlations in the data set itself. For the time being this is a very simple linear kernel. VirgoSimpleKernel needs to be instantiated with the corresponding VirgoCluster object and then simply called. The new feature dimensions are added to the rescaled data set automatically, as can be seen from the stats output, where data set 1 grows from 7 to 8 columns.

Currently, only the spatial dimensions are used for the kernel. The dimensions to use can be passed as a list.

In [3]:
virgo_kernel = VirgoSimpleKernel(virgo_cluster)
virgo_kernel()
virgo_cluster.print_datastats()

Data set 0 - Shape: (694764, 8)
Mean / Std: 43809.885 / 134895.773
Min / Max: -7516.963 / 694763.000
Data set 1 - Shape: (694764, 8)
Mean / Std: 0.260 / 1.525
Min / Max: -4.197 / 217.386
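
To make the idea of a kernel-derived feature dimension concrete, the following sketch appends one extra column to a rescaled array. It is only an illustration under stated assumptions: it evaluates the linear kernel k(x, y) = x · y for each point with itself on a chosen subset of columns, and it treats columns 0 to 2 as the spatial dimensions. What VirgoSimpleKernel actually computes is not shown in this notebook and may be different; `add_linear_kernel_feature` is a hypothetical name.

```python
import numpy as np


def add_linear_kernel_feature(scaled, dims=(0, 1, 2)):
    """Append k(x_i, x_i) = x_i . x_i over the selected columns as a new column.

    Hypothetical helper; the choice of `dims` as spatial columns is an assumption.
    """
    sub = scaled[:, list(dims)]
    # Row-wise dot product, i.e. the diagonal of the linear Gram matrix.
    k_diag = np.einsum("ij,ij->i", sub, sub)
    return np.column_stack([scaled, k_diag])


rng = np.random.default_rng(1)
scaled = rng.normal(size=(1000, 7))

extended = add_linear_kernel_feature(scaled)
print(f"Before: {scaled.shape}, after: {extended.shape}")
```

Computing only the diagonal avoids building the full n × n Gram matrix, which would not fit in memory for a data set of several hundred thousand points.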


### Gaussian mixture fit model

We use a Gaussian mixture model to classify the data. The VirgoMixture class currently offers a GaussianMixture model with a fixed number of components and a BayesianGaussianMixture model with a Dirichlet process prior that downweights unneeded components. The former is the default for the time being.

The evidence lower bound is returned by fit() as a goodness-of-fit measure, and the component weights are available as the weights_ attribute of the model.

Calling the predict() method without any input data automatically sets the labels for the entire data set in the VirgoCluster. Labels with a probability below 95% can optionally be removed (remove_uncertain_labels=True); this is not done by default. Removed points receive the label -1. The threshold can be changed via an input parameter as well.

In [4]:
# virgo_mixture = VirgoMixture(virgo_cluster, n_comp=25, mixture_type="bayesian_gaussian")
virgo_mixture = VirgoMixture(virgo_cluster, n_comp=12)
elbo = virgo_mixture.fit()

print(f"ELBO: {elbo}")
print(f"Mixture weights {virgo_mixture.model.weights_}")

virgo_mixture.predict(remove_uncertain_labels=True)
labels_removed = virgo_cluster.get_labels(return_counts=True)
print(labels_removed)

ELBO: -1.0923124195406866
Mixture weights [0.07111255 0.18422858 0.07965982 0.06529861 0.05766045 0.10871365
 0.09996873 0.01758581 0.04501501 0.05782285 0.08071358 0.13222036]
Removed 70529
(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), array([ 70529, 120915,  89623,  67949,  65826,  53098,  45502,  41663,
        36677,  32585,  30242,  29133,  11022]))
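
The same steps can be sketched directly with scikit-learn, whose GaussianMixture and BayesianGaussianMixture classes share the names used above. This is a minimal sketch on synthetic data, assuming that VirgoMixture wraps these estimators and that uncertain points are those whose largest posterior probability falls below the threshold; the internals of VirgoMixture are not shown in this notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic blobs as a stand-in for the rescaled data.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(500, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(500, 2)),
])

# Fixed number of components. BayesianGaussianMixture with
# weight_concentration_prior_type="dirichlet_process" would be the
# alternative that downweights unneeded components.
model = GaussianMixture(n_components=2, random_state=0).fit(data)
print(f"Lower bound: {model.lower_bound_}")
print(f"Mixture weights {model.weights_}")

# Assign each point to its most probable component, then mark points
# whose best posterior probability is below the threshold with -1.
threshold = 0.95
proba = model.predict_proba(data)
labels = proba.argmax(axis=1)
labels[proba.max(axis=1) < threshold] = -1
print(np.unique(labels, return_counts=True))
```

Thresholding on the posterior probability removes points that sit between components, which is why the removed set tends to trace the boundaries between clusters.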


### Visualization 

VirgoCluster has a general plotting method, plot_cluster(), to visualize the fitted data. Specific labels can be selected via a list input. Removed (uncertain) labels are hidden by default, but can be switched on again with remove_uncertain=False. Marker size is also an input parameter.

In [5]:
virgo_cluster.plot_cluster(n_step=50)

<IPython.core.display.Javascript object>

In [6]:
virgo_cluster.plot_cluster(n_step=25, cluster_label=[0, 1, 2, 3])

<IPython.core.display.Javascript object>

In [7]:
virgo_cluster.plot_cluster(n_step=10, remove_uncertain=False, cluster_label=[-1])

<IPython.core.display.Javascript object>

### Cleaning

We can further clean the resulting clusters, either by splitting a cluster further based on a two-component GaussianMixture fit, or by removing low-density clusters, which are of little interest to our problem. Both approaches rely on an empirical parameter, but the latter is more stable for the time being because the density cut is physically motivated and easier to verify.

Relabeling by cluster size is performed by default, but can be set to False.

In [8]:
virgo_cluster.plot_cluster(n_step=50)
virgo_cluster.get_labels(return_counts=True)

<IPython.core.display.Javascript object>

(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]),
 array([ 70529, 120915,  89623,  67949,  65826,  53098,  45502,  41663,
         36677,  32585,  30242,  29133,  11022]))

In [9]:
d_cleaner = LowDensityCleaner(virgo_cluster, 1e-10)
d_cleaner.clean()
print(virgo_cluster.get_labels(return_counts=True))
virgo_cluster.plot_cluster(n_step=50)

Cluster -1
Cluster 0
Density: 6.088370660789577e-06
Cluster 1
Density: 1.3352118618005608e-07
Cluster 2
Density: 6.430267141153393e-07
Cluster 3
Density: 2.30618632062407e-07
Cluster 4
Density: 1.2804480186978986e-07
Cluster 5
Density: 2.2813047938858117e-08
Cluster 6
Density: 4.349780444690317e-09
Cluster 7
Density: 2.1747818233955915e-08
Cluster 8
Density: 7.785967929222691e-11
Cluster 9
Density: 1.8959965144895944e-09
Cluster 10
Density: 1.1519686290898638e-06
Cluster 11
Density: 3.602103775739396e-14
(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9]), array([114136, 120915,  89623,  67949,  65826,  53098,  45502,  41663,
        36677,  30242,  29133]))




<IPython.core.display.Javascript object>

In [10]:
virgo_cluster.plot_cluster(n_step=25, cluster_label=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

<IPython.core.display.Javascript object>
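
In the output of cell 9, clusters 8 and 11 fall below the threshold of 1e-10; their 32585 + 11022 points are moved to label -1 (70529 becomes 114136), and the remaining clusters are renumbered 0 to 9. The sketch below reproduces this kind of density cut on synthetic data. It assumes density is the number of points divided by the volume of the axis-aligned bounding box, which is only one possible volume estimate and not necessarily what LowDensityCleaner uses; `clean_low_density` is a hypothetical name.

```python
import numpy as np


def clean_low_density(points, labels, threshold):
    """Move clusters below a density threshold to -1, then relabel by size.

    Hypothetical helper. Density is estimated as the point count divided
    by the bounding-box volume, which is an assumption of this sketch.
    """
    labels = labels.copy()
    for lab in np.unique(labels):
        if lab < 0:
            continue  # already marked as uncertain
        members = points[labels == lab]
        extent = members.max(axis=0) - members.min(axis=0)
        volume = np.prod(extent)
        density = len(members) / volume if volume > 0 else np.inf
        if density < threshold:
            labels[labels == lab] = -1

    # Relabel surviving clusters 0, 1, 2, ... in order of decreasing size.
    kept, counts = np.unique(labels[labels >= 0], return_counts=True)
    order = kept[np.argsort(-counts, kind="stable")]
    relabeled = np.full_like(labels, -1)
    for new, old in enumerate(order):
        relabeled[labels == old] = new
    return relabeled


rng = np.random.default_rng(0)
sparse = rng.uniform(0.0, 100.0, size=(50, 3))      # very diffuse cluster
medium = rng.uniform(0.0, 1.0, size=(200, 3)) + 10  # compact cluster
dense = rng.uniform(0.0, 1.0, size=(500, 3))        # compact, largest cluster
points = np.vstack([sparse, medium, dense])
labels = np.repeat([0, 1, 2], [50, 200, 500])

cleaned = clean_low_density(points, labels, threshold=1.0)
print(np.unique(cleaned, return_counts=True))
```

Here the diffuse cluster is removed, and the two compact clusters are renumbered so that the largest one gets label 0.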

### Export results

Cluster results, in the original data format, and their labels can be exported with VirgoCluster.export_cluster(). Event numbers (the added 0th dimension) can be removed again, and the export can be restricted to labeled clusters by dropping the uncertain label -1 (both options are False by default):

In [11]:
virgo_cluster.export_cluster(remove_uncertain=True, remove_evno=True)
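
The effect of the two flags can be illustrated with plain NumPy arrays. This is a sketch under assumptions: `export_clusters` is a hypothetical stand-in that returns the filtered arrays instead of writing them to disk, and it treats every negative label as uncertain. The real export_cluster() may handle output files and formats differently.

```python
import numpy as np


def export_clusters(raw, labels, remove_uncertain=False, remove_evno=False):
    """Return the rows and labels that would be exported (hypothetical helper).

    remove_uncertain drops rows with a negative label;
    remove_evno drops column 0, assumed to hold the event number.
    """
    if remove_uncertain:
        keep = labels >= 0
    else:
        keep = np.ones(len(labels), dtype=bool)
    data = raw[keep, 1:] if remove_evno else raw[keep]
    return data, labels[keep]


# Five events with an event-number column and two feature columns.
raw = np.column_stack([np.arange(5), np.arange(10).reshape(5, 2)])
labels = np.array([0, -1, 1, 0, -1])

data, kept_labels = export_clusters(
    raw, labels, remove_uncertain=True, remove_evno=True
)
print(data.shape, kept_labels)
```

The returned arrays could then be written to a text file, for example with np.savetxt.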