# Virgo Demo 1 - Base pipeline

In [1]:
from virgo.cluster import VirgoCluster
from virgo.kernel import VirgoKernel
from virgo.mixture import VirgoMixture
from virgo.cleaner import LowDensityCleaner

%load_ext autoreload
%autoreload 2

%matplotlib notebook

### Data class

VirgoCluster is the base class for data handling. It stores the raw data, the rescaled data set, and the final cluster and cluster_label arrays separately. The rescaled data set is created when the scale_data() method is called. print_datastats() prints summary statistics for the stored data sets.

In [2]:
file_name = "/home/max/Software/virgo/data/data.txt"
virgo_cluster = VirgoCluster(file_name=file_name)
virgo_cluster.scale_data()
virgo_cluster.print_datastats()

Data set 0 - Shape: (694764, 7)
Mean / Std: 442.512 / 2048.529
Min / Max: -7516.963 / 38340.406
Data set 1 - Shape: (694764, 7)
Mean / Std: -0.000 / 1.000
Min / Max: -4.197 / 217.386
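
The statistics for data set 1 (mean 0, standard deviation 1) are consistent with a per-feature standardization. The following is a minimal NumPy sketch of that idea, assuming this is roughly what scale_data() does; the function name `standardize` and the synthetic input are illustrative and not part of Virgo.

```python
import numpy as np

def standardize(data):
    """Rescale each column to zero mean and unit standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (data - mean) / std

# Synthetic stand-in for the raw data set (assumption: 7 feature columns).
rng = np.random.default_rng(0)
raw = rng.normal(loc=400.0, scale=2000.0, size=(1000, 7))

scaled = standardize(raw)
print(f"Mean / Std: {scaled.mean():.3f} / {scaled.std():.3f}")
```

Because every column is shifted and scaled independently, features with very different ranges contribute comparably to the later mixture fit.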


### Kernel

Virgo uses a covariance function to create additional feature-space dimensions by exploiting correlations in the data set itself. For now this is a very simple LinearKernel. VirgoKernel is instantiated with the corresponding VirgoCluster object and then called. The new feature dimensions are added to the rescaled data set automatically, as the stats output shows: the shape of data set 1 grows from 7 to 8 columns.

Currently, only the spatial dimensions are used for the kernel. The dimensions to use can be passed as a list.

In [3]:
virgo_kernel = VirgoKernel(virgo_cluster)
virgo_kernel()
virgo_cluster.print_datastats()

Data set 0 - Shape: (694764, 7)
Mean / Std: 442.512 / 2048.529
Min / Max: -7516.963 / 38340.406
Data set 1 - Shape: (694764, 8)
Mean / Std: 0.260 / 1.525
Min / Max: -4.197 / 217.386
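
To illustrate the mechanics only, here is a hedged sketch of how a linear covariance function can add one column to a data set. How VirgoKernel actually derives its feature is not shown in this notebook, so `add_kernel_feature`, the choice of evaluating the kernel of each point with itself, and the assumption that the spatial dimensions are columns 0 to 2 are all hypothetical.

```python
import numpy as np

def linear_kernel(x, y, c=0.0):
    """Linear covariance function k(x, y) = x . y + c, evaluated row-wise."""
    return np.sum(x * y, axis=1) + c

def add_kernel_feature(data, dims):
    """Append one kernel-derived column computed from the selected dimensions."""
    selected = data[:, dims]
    # Hypothetical choice: evaluate the kernel of each point with itself.
    feature = linear_kernel(selected, selected)
    return np.column_stack([data, feature])

# Synthetic stand-in for the rescaled data set.
rng = np.random.default_rng(1)
scaled = rng.standard_normal((500, 7))

# Assumption: the spatial dimensions are the first three columns.
extended = add_kernel_feature(scaled, dims=[0, 1, 2])
print(scaled.shape, "->", extended.shape)
```

The original columns are left untouched and the new feature is appended at the end, which matches the change in shape reported by print_datastats().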


### Gaussian mixture fit model

We use a Gaussian mixture model to classify the data. The VirgoMixture class currently offers a GaussianMixture model with a fixed number of components and a BayesianGaussianMixture model with a Dirichlet process prior that downweights unneeded components. The former is the current default.

The evidence lower bound (ELBO) is returned as a goodness-of-fit measure, and the component weights are available as an attribute of the model.

Calling the predict() method without any input data automatically sets the labels for the entire data set in the VirgoCluster. There is also an option to remove labels with a probability below 95%, but it is off by default. The threshold can be changed via an input parameter. Removed points receive the label -1.

In [4]:
virgo_mixture = VirgoMixture(virgo_cluster, n_comp=12)
elbo = virgo_mixture.fit()

print(f"ELBO: {elbo}")
print(f"Mixture weights {virgo_mixture.model.weights_}")

virgo_mixture.predict(remove_uncertain_labels=True)
labels_removed = virgo_cluster.get_labels(return_counts=True)
print(labels_removed)

ELBO: -1.1829541249839983
Mixture weights [0.13193212 0.05985904 0.18544966 0.07876352 0.075908   0.06910326
 0.05041656 0.13277296 0.01933785 0.0954495  0.05000753 0.05100001]
Removed 70419
(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), array([ 70419,  89206,  36187, 124064,  52305,  41416,  39621,  31666,
        76500,  12209,  63642,  25761,  31768]))
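
The fit and the removal of uncertain labels can be sketched directly in scikit-learn. This assumes that VirgoMixture wraps `sklearn.mixture.GaussianMixture`, which the `model.weights_` attribute suggests but does not confirm, and that removal means assigning the label -1 to points whose most probable component falls below the threshold. The synthetic data and variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two well-separated synthetic blobs stand in for the rescaled data set.
data = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(300, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(300, 2)),
])

model = GaussianMixture(n_components=2, random_state=0).fit(data)
# lower_bound_ is the lower bound on the log-likelihood of the best EM fit.
print(f"Lower bound: {model.lower_bound_}")
print(f"Mixture weights {model.weights_}")

# Probability of each component per point, and the most probable component.
proba = model.predict_proba(data)
labels = proba.argmax(axis=1)

# Points whose best component is below the threshold are marked as -1.
threshold = 0.95
labels[proba.max(axis=1) < threshold] = -1

print(np.unique(labels, return_counts=True))
```

Raising the threshold moves more points into the -1 group, so the value trades cluster purity against the number of classified points.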


### Visualization 

VirgoCluster has a general plotting method, plot_cluster(), to visualize the fitted data. Specific labels can be selected by passing a list. Removed uncertain labels are hidden by default, but can be switched back on.
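
The selection logic behind these options can be sketched without any plotting code. The helper `select_for_plot` below is hypothetical; it reuses the parameter names from the plot_cluster() calls in this notebook and assumes that n_step means drawing only every n_step-th point.

```python
import numpy as np

def select_for_plot(data, labels, n_step=1, cluster_label=None,
                    remove_uncertain=True):
    """Return the subsampled points and labels that would be drawn."""
    mask = np.ones(len(labels), dtype=bool)
    if cluster_label is not None:
        # Keep only the explicitly requested labels.
        mask &= np.isin(labels, cluster_label)
    if remove_uncertain:
        # Hide points that were flagged as uncertain (label -1).
        mask &= labels != -1
    # Keep every n_step-th remaining point to thin out dense clusters.
    return data[mask][::n_step], labels[mask][::n_step]

# Synthetic points with labels from -1 to 3.
rng = np.random.default_rng(3)
data = rng.standard_normal((1000, 3))
labels = rng.integers(-1, 4, size=1000)

points, shown = select_for_plot(data, labels, n_step=10, cluster_label=[0, 1])
print(points.shape, np.unique(shown))
```

To display only the removed points, both switches are needed: `remove_uncertain=False` together with `cluster_label=[-1]`.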

In [5]:
virgo_cluster.plot_cluster(n_step=50)


In [30]:
virgo_cluster.plot_cluster(n_step=25, cluster_label=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


In [20]:
virgo_cluster.plot_cluster(n_step=10, remove_uncertain=False, cluster_label=[-1])


### Cleaning

We can clean the resulting clusters further, either by splitting a cluster when a two-component GaussianMixture fit suggests it, or by removing low-density clusters that are of little interest to our problem. Both rely on an empirical parameter, but the latter is more stable for now, because the density cut is physically motivated and easier to verify.
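
The density cut can be sketched as follows. How LowDensityCleaner estimates a cluster's density is not shown here, so the bounding-box estimate and the function name `clean_low_density` are assumptions. Relabelling rejected clusters as -1 is consistent with the label counts printed by the cleaning cell.

```python
import numpy as np

def clean_low_density(points, labels, threshold):
    """Relabel clusters whose number density is below threshold as -1."""
    cleaned = labels.copy()
    for label in np.unique(labels):
        if label == -1:
            continue  # already flagged points are not tested again
        members = points[labels == label]
        # Hypothetical density estimate: member count per bounding-box volume.
        extent = members.max(axis=0) - members.min(axis=0)
        volume = np.prod(extent)
        density = len(members) / volume if volume > 0 else np.inf
        if density < threshold:
            cleaned[labels == label] = -1
    return cleaned

rng = np.random.default_rng(4)
dense = rng.uniform(0.0, 1.0, size=(500, 3))      # compact cluster
sparse = rng.uniform(0.0, 100.0, size=(500, 3))   # diffuse cluster
points = np.vstack([dense, sparse])
labels = np.repeat([0, 1], 500)

cleaned = clean_low_density(points, labels, threshold=1.0)
print(np.unique(cleaned, return_counts=True))
```

The threshold is the empirical parameter mentioned above. It has the units of points per volume, which makes it straightforward to compare against the densities printed for each cluster.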

In [21]:
virgo_cluster.plot_cluster(n_step=50)
virgo_cluster.get_labels(return_counts=True)


(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]),
 array([ 70419,  89206,  36187, 124064,  52305,  41416,  39621,  31666,
         76500,  12209,  63642,  25761,  31768]))

In [22]:
d_cleaner = LowDensityCleaner(virgo_cluster, 1e-8)
d_cleaner.clean()
print(virgo_cluster.get_labels(return_counts=True))
virgo_cluster.plot_cluster(n_step=20)

Cluster -1
Cluster 0
Density: 1.3451631151897936e-07
Cluster 1
Density: 7.48451751716096e-09
Cluster 2
Density: 3.971683990895726e-06
Cluster 3
Density: 1.2519703104586068e-07
Cluster 4
Density: 3.0765873518591344e-08
Cluster 5
Density: 1.1977005075974648e-10
Cluster 6
Density: 1.6060726161337386e-08
Cluster 7
Density: 4.1967103289511994e-08
Cluster 8
Density: 3.988444033700147e-14
Cluster 9
Density: 9.753028224857917e-07
Cluster 10
Density: 1.8503713566145555e-09
Cluster 11
Density: 9.495552639319072e-07
(array([-1,  0,  2,  3,  4,  6,  7,  9, 11]), array([184197,  89206, 124064,  52305,  41416,  31666,  76500,  63642,
        31768]))



