## Testing and validation of ERiC implementation

In this notebook, we validate our implementation of ERiC against the ELKI implementation of ERiC by running both on the two artificial sample datasets available at https://elki-project.github.io/datasets/ and comparing their outputs.

In [1]:
import numpy as np
import pandas as pd

from lib import *

from elki_eric import elki_eric
import elki_parser
import validation

In [2]:
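def load_dataset(file):
    """Load a 2-D labelled dataset with columns x1, x2, label.

    Fields may be separated by spaces or commas; lines containing '#' are skipped.
    """
    with open(file) as f:
        # Normalise spaces to commas so both separators are handled the same way.
        rows = [line.rstrip().replace(" ", ",").split(",") for line in f if "#" not in line]

    df = pd.DataFrame(rows, columns=["x1", "x2", "label"])
    X = df[["x1", "x2"]].astype(float)

    y = df["label"]
    return X, y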
    
    
def run_ERiC(df, k=10, min_samples=2, delta_affine=0.5, delta_dist=0.5):
    D = df.to_numpy(dtype=np.float64)
    point_info, partitions = make_partitions(D, k)

    models, clusters = cluster_partitions(D, partitions, point_info, delta_affine, delta_dist, min_samples)

    cluster_info = compute_cluster_list(clusters, D)
    cluster_info = build_hierarchy(cluster_info, delta_affine, delta_dist)
    
    return cluster_info    
    
    
def run_ELKI_ERiC(df, k=10, min_samples=2, delta_affine=0.5, delta_dist=0.5, output_file=None):
    # Use the function argument, not the notebook-global X.
    elki_eric(
        df,
        k=k, 
        dbscan_minpts=min_samples, 
        alpha=0.85, 
        delta_dist=delta_dist, 
        delta_affine=delta_affine, 
        output_file_name=output_file)
    
    df1_output = elki_parser.read_file(output_file)
    cluster_info = elki_parser.parse_file(df1_output)
    
    return cluster_info
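
For comparison, the parsing done by hand in `load_dataset` can also be sketched with `pandas.read_csv`. This is a minimal sketch, not the loader used in this notebook: the name `load_dataset_csv` and the two demo rows are made up for illustration, and it assumes a space-separated file whose comment lines start with `#`. Unlike `load_dataset`, it does not accept comma-separated input.

```python
import io

import pandas as pd


def load_dataset_csv(source):
    """Read a space-separated file with two coordinates followed by a label.

    Hypothetical alternative to load_dataset; assumes '#' starts a comment.
    """
    df = pd.read_csv(
        source,
        sep=" ",
        comment="#",              # ignore comment lines (and trailing comments)
        header=None,              # the file has no header row
        names=["x1", "x2", "label"],
        dtype={"label": str},     # keep labels as strings, as load_dataset does
    )
    X = df[["x1", "x2"]].astype(float)
    y = df["label"]
    return X, y


# Small in-memory example; the rows and labels are invented for the demo.
sample = io.StringIO("# a comment line\n0.1 0.2 a\n0.3 0.4 b\n")
X_demo, y_demo = load_dataset_csv(sample)
print(X_demo.shape, list(y_demo))
```

One difference worth noting: `load_dataset` drops any line that contains `#` anywhere, while `comment="#"` only discards the text from `#` to the end of the line.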
    

### Mouse dataset

In [3]:
X, y = load_dataset("sample_dataset/mouse.csv")


In [5]:
cluster_info_eric = run_ERiC(X)

partition:  1
---cluster:  1  size: 24
partition:  2
---cluster:  1  size: 476


In [4]:
elki_output_df1 = "elki_df1_output.txt"
cluster_info_elki_eric = run_ELKI_ERiC(X, output_file=elki_output_df1)

Run elki
Saving ELKI results in elki_df1_output.txt
Writing completed.


In [6]:
validation.validate(cluster_info_eric, cluster_info_elki_eric)

The implementations return the same number of clusters.
No. of clusters (our ERiC): 2
No. of clusters (ELKI ERiC): 2

The implementations return the same number of lambdas.
No. of lambdas (our ERiC): {1: 1, 2: 1}
No. of lambdas (ELKI ERiC): {1: 1, 2: 1}

Cluster sizes were identical for lambda=1
Cluster sizes were identical for lambda=2

Validation result: The outputs of the algorithms are identical.
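
The checks reported above (number of clusters, clusters per lambda, cluster sizes per lambda) can be sketched as follows. This is an illustration only and not the code in `validation.validate`: the helper names `summarize` and `same_structure` are hypothetical, and the sketch assumes both implementations return a dictionary of clusters with `lambda` and `points` entries, as in the parsed ELKI output displayed in this notebook.

```python
def summarize(cluster_info):
    """Map each lambda to the sorted list of cluster sizes found at that lambda."""
    sizes = {}
    for cluster in cluster_info.values():
        sizes.setdefault(cluster["lambda"], []).append(len(cluster["points"]))
    return {lam: sorted(s) for lam, s in sizes.items()}


def same_structure(info_a, info_b):
    """True if both results have the same cluster sizes at every lambda.

    Cluster ids and the order of points are ignored; only sizes are compared.
    """
    return summarize(info_a) == summarize(info_b)


# Toy inputs with invented point indices; ids and point order differ on purpose.
a = {1: {"lambda": 1, "index": 0, "points": [0, 1, 2], "parents": [2]},
     2: {"lambda": 2, "index": 0, "points": [3, 4], "parents": []}}
b = {1: {"lambda": 2, "index": 0, "points": [4, 3], "parents": []},
     2: {"lambda": 1, "index": 0, "points": [2, 0, 1], "parents": [1]}}

print(summarize(a))
print(same_structure(a, b))
```

Comparing sizes alone does not prove that the same points were assigned to each cluster; a stricter check would compare the sets of point indices, provided both implementations use the same indexing.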


In [7]:
cluster_info_elki_eric

{1: {'lambda': 1,
  'index': 0,
  'points': array([ 11,  14,  27,  28,  46,  53, 129, 136, 154, 172, 181, 190, 222,
         223, 232, 242, 266, 293, 314, 399, 471, 486, 488, 500], dtype=int32),
  'parents': [2]},
 2: {'lambda': 2,
  'index': 0,
  'points': array([ 66, 132, 264, 396, 330, 263,  67,  65, 197, 265, 395, 397, 200,
         134, 196, 464, 394, 130, 201, 199, 133, 135, 465, 131, 463, 462,
         458, 459, 461, 460, 268, 400,  70, 392, 260, 137, 325, 193, 195,
          62, 398, 402,  64, 270, 258,  68, 323, 327,  72,   1, 262, 335,
         337,   3, 333,   7, 329, 331,   5, 338,  58, 140, 321, 387, 322,
          74,  75, 403, 139,  59, 189, 188, 274, 273, 468, 192, 456, 455,
         204, 386, 452, 390,  55, 472, 451, 205, 467, 127, 124, 123, 209,
         208, 334, 128, 342, 383, 119, 391, 326, 269, 407,  71, 277, 184,
         318, 144,  63,   2,   6, 411, 261,  10,  79, 257, 253, 249, 214,
         247, 280, 116, 182, 281, 346, 279, 480, 150, 148, 149, 312, 313,
    