## Testing and validation of ERiC implementation

In this notebook, we validate our implementation of ERiC against ELKI's ERiC by running both on two artificial sample datasets from https://elki-project.github.io/datasets/ and comparing their outputs.

In [1]:
import numpy as np
import pandas as pd

from lib import *

from elki_eric import elki_eric
import elki_parser
import validation

Here we define functions that load the data and run both algorithms. The comparison itself lives in *validation.py*, which checks the sizes and contents of the clusters. The cluster structure is also printed so that it can be checked manually.

In [2]:
def load_dataset(file):
    # Skip comment lines (any line containing "#"); fields may be separated
    # by spaces or commas, so normalise spaces to commas before splitting.
    with open(file) as f:
        rows = [line.rstrip().replace(" ", ",").split(",") for line in f if "#" not in line]

    df = pd.DataFrame(rows, columns=["x1", "x2", "label"])
    X = df[["x1", "x2"]].astype(float)  # coordinates are read as strings
    y = df["label"]
    return X, y
    
    
def run_ERiC(df, k=10, min_samples=2, delta_affine=0.5, delta_dist=0.5):
    D = df.to_numpy(dtype=np.float64)
    point_info, partitions = make_partitions(D, k)

    models, clusters = cluster_partitions(D, partitions, point_info, delta_affine, delta_dist, min_samples)

    cluster_info = compute_cluster_list(clusters, D)
    cluster_info = build_hierarchy(cluster_info, delta_affine, delta_dist, D.shape[1])
    
    return cluster_info    
    
    
def run_ELKI_ERiC(df, k=10, min_samples=2, delta_affine=0.5, delta_dist=0.5, output_file=None):
    # Pass the function argument, not the notebook-level X.
    elki_eric(
        df,
        k=k, 
        dbscan_minpts=min_samples, 
        alpha=0.85, 
        delta_dist=delta_dist, 
        delta_affine=delta_affine, 
        output_file_name=output_file)
    
    df1_output = elki_parser.read_file(output_file)
    cluster_info = elki_parser.parse_file(df1_output)
    
    return cluster_info   
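
The contents of *validation.py* are not shown in this notebook. As a rough illustration of the kind of check described above (same number of clusters, same number of clusters per lambda, same points in each cluster), here is a minimal sketch. It assumes a hypothetical `cluster_info` layout, a dict mapping each lambda to a list of clusters with a `"points"` list, which may differ from the structure actually returned by `run_ERiC` and `run_ELKI_ERiC`. The names `summarize` and `same_clustering` are likewise illustrative and not part of the project.

```python
from collections import Counter


def summarize(cluster_info):
    """Reduce a (hypothetical) cluster_info dict to comparable summaries.

    Returns a dict of cluster counts per lambda and a set of
    (lambda, frozenset_of_point_indices) pairs, so that neither cluster
    order nor point order affects the comparison.
    """
    lambdas = Counter()
    clusters = set()
    for lam, cluster_list in cluster_info.items():
        lambdas[lam] += len(cluster_list)
        for cluster in cluster_list:
            clusters.add((lam, frozenset(cluster["points"])))
    return dict(lambdas), clusters


def same_clustering(a, b):
    """True if both results have the same clusters in the same partitions."""
    return summarize(a) == summarize(b)


# Toy data in the assumed layout (not real algorithm output).
ours = {1: [{"points": [65, 133, 143]}], 2: [{"points": [0, 1, 2]}]}
elki = {2: [{"points": [2, 1, 0]}], 1: [{"points": [143, 65, 133]}]}

print(summarize(ours)[0])
print(same_clustering(ours, elki))
```

Comparing frozensets rather than lists means the check does not depend on the order in which either implementation emits points or clusters.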

### Mouse dataset

In [3]:
X, y = load_dataset("sample_dataset/mouse.csv")

# Our implementation
cluster_info_eric = run_ERiC(X)
# ELKI implementation
elki_output_df1 = "elki_df1_output.txt"
cluster_info_elki_eric = run_ELKI_ERiC(X, output_file=elki_output_df1)

# Run validation
validation.validate(cluster_info_eric, cluster_info_elki_eric)

Run elki
Saving ELKI results in elki_df1_output.txt
Writing completed.
The implementations return the same number of clusters.
No. of clusters (our ERiC): 2
No. of clusters (ELKI ERiC): 2

The implementations return the same number of lambdas.
No. of lambdas (our ERiC): {1: 1, 2: 1}
No. of lambdas (ELKI ERiC): {1: 1, 2: 1}

Our ERiC structure:
Partition  1
--- cluster 0  size: 24
------ points
[10, 13, 26, 27, 45, 52, 128, 135, 153, 171, 180, 189, 221, 222, 231, 241, 265, 292, 313, 398, 470, 485, 487, 499]
------ parents
[2]
Partition  2
--- cluster 0  size: 476
------ points
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 

### Vary density dataset

In [4]:
X, y = load_dataset("sample_dataset/vary_density.csv")

# Our implementation
cluster_info_eric = run_ERiC(X)
# ELKI implementation
elki_output_df2 = "elki_df2_output.txt"
cluster_info_elki_eric = run_ELKI_ERiC(X, output_file=elki_output_df2)

# Run validation
validation.validate(cluster_info_eric, cluster_info_elki_eric)

Run elki
Saving ELKI results in elki_df2_output.txt
Writing completed.
The implementations return the same number of clusters.
No. of clusters (our ERiC): 2
No. of clusters (ELKI ERiC): 2

The implementations return the same number of lambdas.
No. of lambdas (our ERiC): {1: 1, 2: 1}
No. of lambdas (ELKI ERiC): {1: 1, 2: 1}

Our ERiC structure:
Partition  1
--- cluster 0  size: 3
------ points
[65, 133, 143]
------ parents
[2]
Partition  2
--- cluster 0  size: 147
------ points
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 1

As the output shows, both implementations produce identical results on the two sample datasets.