# 12-HDBSCAN-Generated Cluster Metric Distributions

In this notebook we derive a metric for HDBSCAN-generated clusters. The metric stands in for DBSCAN's ε parameter, which HDBSCAN lacks and which sets the maximum reachability distance for points in the cluster. Such a metric is useful when choosing H3 hexagon sizes that cover the cluster seamlessly, the inflation size for a concave hull-generated shape, or the maximum radius for bubble shaping.

**Requirements:**

- Please run the `05-clustering-hdbscan.ipynb` notebook and its dependencies first.
- Recommended install: [ipywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html). Enable using `jupyter nbextension enable --py widgetsnbextension --sys-prefix` for Jupyter Notebook and `jupyter labextension install @jupyter-widgets/jupyterlab-manager` for Jupyter Lab.

In [None]:
import numpy as np
import pandas as pd
import networkx as nx

from db.api import VedDb
from itertools import groupby
from tqdm import trange, tqdm
from fitter import Fitter, get_common_distributions
from geo.math import vec_haversine, square_haversine, num_haversine

The function `get_graph_mst_minimums` takes the location array and computes the list of minimum distances using a graph-theoretic approach. It starts by calculating the pairwise distance matrix. Next, it builds the complete undirected graph with the distances as edge weights. Finally, it determines the minimum spanning tree and returns the lengths of its edges as a list.

In [None]:
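def get_graph_mst_minimums(locations):
    """Return the edge lengths of the minimum spanning tree over the locations."""
    n = locations.shape[0]
    g = nx.Graph()

    # Pairwise haversine distance matrix between all locations
    dist = square_haversine(locations[:, 0], locations[:, 1])

    # Complete undirected graph: one node per location, one weighted edge per pair
    g.add_nodes_from(range(n))
    g.add_edges_from([(i, j, {'weight': dist[i, j]}) for i in range(n) for j in range(i + 1, n)])

    # The MST keeps the n - 1 shortest edges that still connect every location
    mst = nx.minimum_spanning_tree(g, algorithm='prim', weight="weight")
    min_dist = [dist[e[0], e[1]] for e in mst.edges()]
    return min_dist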
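To make the three steps concrete without the project's dependencies, here is a minimal standard-library sketch of the same idea. It is not the notebook's implementation: it swaps NetworkX and `square_haversine` for a hand-written Prim's algorithm and a scalar haversine function. The names `haversine_m` and `mst_edge_lengths`, the Earth radius constant, and the `(latitude, longitude)` ordering in degrees are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius, in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mst_edge_lengths(points):
    """Return the n - 1 MST edge lengths for (lat, lon) points, via Prim's algorithm."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known link from each point to the growing tree
    best[0] = 0.0          # start the tree at the first point
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])  # the edge that attached u to the tree
        # Relax the links from the new tree point to every remaining point
        for v in range(n):
            if not in_tree[v]:
                d = haversine_m(*points[u], *points[v])
                if d < best[v]:
                    best[v] = d
    return lengths


# Hypothetical points on the equator, at 0, 1, 2 and 10 degrees of longitude
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 10.0)]
lengths = mst_edge_lengths(points)
print(sorted(lengths))
```

For these four points the tree links each point to its nearest neighbour along the equator, so the result has two short edges and one long edge. The long edge is the kind of value an ε-like metric has to account for.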

Create the database object.

In [None]:
db = VedDb()

Get all cluster identifiers.

In [None]:
sql = "select cluster_id from cluster"
cluster_ids = [c[0] for c in db.query(sql)]

The code below iterates through all clusters and computes the distribution of minimum spanning tree edge weights for each one. It then uses the `fitter` package to determine which distribution type best fits those distances, and appends the best fit to a list.

In [None]:
dists = []
for c in tqdm(cluster_ids):
    c_locs = db.get_cluster_locations(c)
    min_ws = get_graph_mst_minimums(c_locs)
    fitter = Fitter(min_ws, distributions=get_common_distributions())
    fitter.fit()
    dists.append(fitter.get_best())

Now, we get the sorted list of the distribution names.

In [None]:
names = sorted([list(d.keys())[0] for d in dists])

Finally, we count the distribution names and present the results as a Pandas DataFrame.

In [None]:
dist_data = [[key, len(list(group))] for key, group in groupby(names)]
dist_df = pd.DataFrame(data=dist_data, columns=["Distribution", "Count"])
dist_df["Percent"] = dist_df["Count"] / dist_df["Count"].sum() * 100
dist_df.sort_values(["Count"], ascending=False)
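The counting step above relies on `itertools.groupby`, which only groups adjacent equal items and is the reason the names are sorted first. As a hedged alternative sketch, `collections.Counter` gives the same tallies without sorting. The `sample_dists` list below is made-up illustrative data shaped like the `dists` list (one single-key dictionary per cluster, keyed by distribution name); the empty parameter dictionaries are placeholders, not real fit results.

```python
from collections import Counter

# Hypothetical stand-in for `dists`: one {distribution_name: params} dict per cluster
sample_dists = [
    {"lognorm": {}},
    {"gamma": {}},
    {"lognorm": {}},
    {"expon": {}},
]

# The single key of each dictionary is the name of the best-fitting distribution
names = [next(iter(d)) for d in sample_dists]

# Tally the names; no sorting is needed beforehand
counts = Counter(names)
total = sum(counts.values())

# (name, count, percent) rows, most frequent first
rows = [(name, count, 100.0 * count / total) for name, count in counts.most_common()]
for row in rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=["Distribution", "Count", "Percent"])` to obtain a table equivalent to the one above.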