# Benchmark h-NNE version 2 vs version 1
We compare the k-NN accuracy and trustworthiness scores of h-NNE v2 and v1, as well as their running times, on some small datasets. The same benchmarking script can be used to evaluate h-NNE on other datasets.

We observe that h-NNE v2 produces a visualization that spreads the data out better and significantly reduces the collapse of large clusters into small areas.
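As an illustration of what a k-NN score on an embedding measures, here is a minimal sketch that computes leave-one-out k-NN accuracy on a toy 2-D embedding. This is an assumption about the general idea only: the validation procedure used by the benchmarking script is not shown in this notebook and may differ (for example, it may use cross-validation). The function name `knn_accuracy` and the toy data are made up for the example.

```python
from collections import Counter
from math import dist


def knn_accuracy(embedding, labels, k):
    """Leave-one-out k-NN accuracy of labelled points in an embedding."""
    correct = 0
    for i, point in enumerate(embedding):
        # Rank all other points by Euclidean distance to the query point.
        neighbours = sorted(
            (j for j in range(len(embedding)) if j != i),
            key=lambda j: dist(point, embedding[j]),
        )[:k]
        # Predict the majority label among the k nearest neighbours.
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(embedding)


# Toy 2-D embedding with two well-separated classes (made-up data).
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_score = knn_accuracy(toy_embedding, toy_labels, k=1)
```

On this toy data every point's nearest neighbour shares its label, so the score is 1.0. An embedding that collapses different classes onto each other would score lower.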

## Setup

To run the benchmark, in addition to installing the `hnne` package, you need to:

- Install the `hnne_benchmarking` package. To do this:
    - Clone the h-NNE project: `git clone git@github.com:koulakis/h-nne.git`
    - Navigate to the `benchmarking` directory and inside it run `pip install .`
- Download the small h-NNE benchmarking datasets and place them in a directory `<path to data>` of your choice.
    - MNIST and FMNIST are downloaded there automatically the first time you load them, so nothing needs to be done for those two.
    - For COIL-20, download the processed dataset from https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php, extract it and copy it to the data path.
    - For shuttle, download the data from https://archive.ics.uci.edu/dataset/148/statlog+shuttle, extract it and copy its contents to the data path.
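Before running the notebook, it can help to confirm that both packages from the steps above are importable. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of `hnne` or `hnne_benchmarking`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Package names taken from the setup steps above.
missing = missing_packages(["hnne", "hnne_benchmarking"])
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All benchmarking packages are importable.")
```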

In [1]:
import time
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from hnne_benchmarking.data import dataset_loaders, DatasetGroup
from hnne_benchmarking.run_benchmarks import run_eval
from hnne_benchmarking.utils import combine_method_overviews_from_csv

In [2]:
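# Replace the placeholders with your own paths.
data_path = Path("<path to datasets>")
outputs_path = Path("<path to outputs>")

# Values used for the run recorded below; `~` is expanded explicitly.
data_path = Path("/home/marios/datasets/hnne_datasets")
outputs_path = Path("~/Desktop/tmp_rm").expanduser()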

## Load small, medium or large datasets, see ./hnne/benchmarking/data.py
loaders = dataset_loaders(dataset_group=DatasetGroup.small)
loaders.keys()

dict_keys(['coil_20', 'shuttle', 'mnist', 'fmnist'])

## Run benchmark

In [3]:
dataset_group = DatasetGroup.small
start_cluster_view = "auto"
output_directory = outputs_path / "benchmark_results" / str(dataset_group)
v2_size_threshold = None

for hnne_version in ["v1", "v2"]:
    print(f"Running hnne {hnne_version}")
    experiment_name = f"hnne_v2_packlevel_{start_cluster_view}" if hnne_version == "v2" else "hnne_v1"

    # Bind the second return value to `plots` so it does not shadow matplotlib's `plt`.
    scores, plots = run_eval(
        data_path=data_path,
        dataset_group=dataset_group,
        n_components=2,
        distance="cosine",
        radius=0.4,
        ann_threshold=20_000,
        preliminary_embedding="pca",
        validate_only_1nn=False,
        compute_trustworthiness=True,
        random_state=42,
        verbose=False,

        # hnne v2 params
        prefered_num_clust=None,  ## set_ncluster_view ...
        hnne_version=hnne_version,
        start_cluster_view=start_cluster_view,
        v2_size_threshold=v2_size_threshold,

        # Save params
        save_experiment=True,
        plot_projection=True,
        experiment_name=experiment_name,
        output_directory=output_directory,
        points_plot_limit=5_000_000,
        figsize=(5, 5),
        skip_done=False,
        scale_data=False,
    )

Running hnne v1
Loading coil_20...
Finch time: 0.9686776300004567, projection time: 1.6933651069994085
Validating coil_20 on [1, 3, 5, 10] nearest neighbors...


100%|███████████████████████████████████████████| 40/40 [00:05<00:00,  7.75it/s]


Failed to run experiment on coil_20.
Error: Cannot save file into a non-existent directory: '/home/marios/Desktop/tmp_rm/benchmark_results/DatasetGroup.small/hnne_v1/scores'
Loading shuttle...




Finch time: 21.65576375799992, projection time: 1.3302487970004222
Validating shuttle on [1, 3, 5, 10] nearest neighbors...


100%|██████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:25<00:00,  1.58it/s]


Failed to run experiment on shuttle.
Error: Cannot save file into a non-existent directory: '/home/marios/Desktop/tmp_rm/benchmark_results/DatasetGroup.small/hnne_v1/scores'
Loading mnist...
Finch time: 2.99827311899935, projection time: 2.577566238000145
Validating mnist on [1, 3, 5, 10] nearest neighbors...


100%|██████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:24<00:00,  1.61it/s]


Failed to run experiment on mnist.
Error: Cannot save file into a non-existent directory: '/home/marios/Desktop/tmp_rm/benchmark_results/DatasetGroup.small/hnne_v1/scores'
Loading fmnist...
Finch time: 2.2238812380001036, projection time: 1.2392120650001743
Validating fmnist on [1, 3, 5, 10] nearest neighbors...


100%|██████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:23<00:00,  1.73it/s]


Failed to run experiment on fmnist.
Error: Cannot save file into a non-existent directory: '/home/marios/Desktop/tmp_rm/benchmark_results/DatasetGroup.small/hnne_v1/scores'


OSError: Cannot save file into a non-existent directory: '/home/marios/Desktop/tmp_rm/benchmark_results/DatasetGroup.small/hnne_v1/scores'
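The log above shows every dataset failing at the save step because the `scores` directory does not exist. One possible workaround, assuming `run_eval` does not create that directory itself (which the error message suggests but the notebook does not confirm), is to create it before the run. The helper `ensure_scores_dir` below is hypothetical; the layout `<output_directory>/<experiment_name>/scores` is taken from the path in the error message.

```python
import tempfile
from pathlib import Path


def ensure_scores_dir(output_directory, experiment_name):
    """Create <output_directory>/<experiment_name>/scores if it is missing."""
    scores_dir = Path(output_directory).expanduser() / experiment_name / "scores"
    # parents=True builds the intermediate directories; exist_ok=True makes reruns safe.
    scores_dir.mkdir(parents=True, exist_ok=True)
    return scores_dir


# Demonstrated on a temporary directory so the snippet is safe to run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_scores_dir(Path(tmp) / "benchmark_results", "hnne_v1")
    created_exists = created.is_dir()
```

In the benchmark cell, the equivalent call would be `ensure_scores_dir(output_directory, experiment_name)` inside the loop, just before `run_eval`.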

## Display the scores of the two versions on different datasets

In [None]:
method_csvs = [
    ("hnne_v1", Path(output_directory, "hnne_v1", "scores", "all_datasets_scores.csv")),
    ("hnne_v2", Path(output_directory, "hnne_v2_packlevel_auto", "scores", "all_datasets_scores.csv")),
]
output_csv_path = Path(output_directory, "combined_overview.csv")
combined_overview = combine_method_overviews_from_csv(
    method_csvs,
    output_csv_path,
    knn_cols=("1-nn", "3-nn", "5-nn", "10-nn"),
    trust_col="trustworthiness",
    proj_time_col="proj_time",
    knn_group_name="KNN accuracy",
)

In [None]:
combined_overview
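The combined overview depends on the per-method score files, so it may be worth checking that they exist before combining them, especially after a run that failed to save. A minimal sketch, with a hypothetical helper `split_by_existence` that takes the same `(name, path)` pairs as `method_csvs`:

```python
import tempfile
from pathlib import Path


def split_by_existence(method_csvs):
    """Split (method name, csv path) pairs into existing and missing files."""
    found, missing = [], []
    for name, csv_path in method_csvs:
        (found if Path(csv_path).is_file() else missing).append((name, csv_path))
    return found, missing


# Demonstration with one real file and one missing file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp, "present.csv")
    present.write_text("dataset,1-nn\n")
    demo_found, demo_missing = split_by_existence(
        [("method_a", present), ("method_b", Path(tmp, "absent.csv"))]
    )
    found_names = [name for name, _ in demo_found]
    missing_names = [name for name, _ in demo_missing]
```

If `missing` is non-empty for the real `method_csvs`, rerunning the benchmark for those methods is needed before the overview can be built.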