# The Density Peak Advanced clustering algorithm

----------------
Load the package:

In [9]:
import io
import sys
import pandas as pd
import numpy as np

sys.path.append('../../')
from DPApipeline.DPA import PAk, twoNN, DPA

%load_ext autoreload
%autoreload 2



Read input csv file:

In [10]:
data_F1 = pd.read_csv("../DPA/tests/benchmarks/Fig1.dat", sep=" ", header=None)

How to run Density Peak Advanced clustering:

    The default pipeline makes use of the PAk density estimator and of the TWO-NN intrinsic dimension estimator.
    The densities and the corresponding errors can also be provided as precomputed arrays.

    Parameters
    ----------

    Z : float, default = 1
        The number of standard deviations, which fixes the level of statistical confidence at which
        one decides to consider a cluster meaningful.

    metric : string, or callable
        The distance metric to use.
        If metric is a string, it must be one of the options allowed by
        scipy.spatial.distance.pdist for its metric parameter, or a metric listed in
        pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is "precomputed", X is assumed to
        be a distance matrix. Alternatively, if metric is a callable function, it is
        called on each pair of instances (rows) and the resulting value recorded. The
        callable should take two arrays from X as input and return a value indicating
        the distance between them. Default is 'euclidean'.
        
    densities : array [n_samples], default = None
        The logarithm of the density at each point. If provided, the following parameters are ignored:
        density_algo, k_max, D_thr.

    err_densities : array [n_samples], default = None
        The uncertainty in the density estimation, obtained by computing
        the inverse of the Fisher information matrix.
    
    k_hat : array [n_samples], default = None
        The optimal number of neighbors for which the condition of constant density holds.
        
    n_jobs : int or None, optional (default=None)
        The number of jobs to use for the computation. This works by computing
        each of the n_init runs in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    Parameters specific to the PAk estimator:
    -----------------------------------------

    density_algo : string, default = "PAk"
        Define the algorithm to use as density estimator. It must be one of the options allowed by
        VALID_DENSITY.

    k_max : int, default=1000
        This parameter is considered if density_algo is "PAk" or "kNN", it is ignored otherwise.
        k_max sets the maximum number of nearest neighbors considered by the density estimator.
        If density_algo="PAk", k_max is used by the algorithm in the search for the
        largest number of neighbors ``\hat{k}`` for which the condition of constant density
        holds, within a given level of confidence.
        If density_algo="kNN", k_max sets the number of neighbors to be used by the standard
        k-Nearest Neighbor algorithm.
        If the number of points in the sample N is
        less than the default value, k_max will be set automatically to the value ``N/2``.
        
    D_thr : float, default=23.92812698
        This parameter is considered if density_algo is "PAk", it is ignored otherwise.
        Set the level of confidence in the PAk density estimator. The default value corresponds to a p-value of
        ``10^{-6}`` for a ``\chi^2`` distribution with one degree of freedom.

    dim : int, default = None
        Intrinsic dimensionality of the sample. If dim is provided, the following parameters are ignored:
        dim_algo, blockAn, block_ratio, frac.

    dim_algo : string, or callable, default="twoNN"
        Method for intrinsic dimensionality calculation. If dim_algo is "auto", dim is assumed to be
        equal to n_features. If dim_algo is a string, it must be one of the options allowed by VALID_DIM.

    Parameters specific to the TWO-NN estimator:
    --------------------------------------------

    blockAn : bool, default=True
        This parameter is considered if dim_algo is "twoNN", it is ignored otherwise.
        If blockAn is True, the algorithm performs a block analysis that discriminates the relevant
        dimensions as a function of the block size. This makes it possible to study the stability of the
        estimate with respect to changes in the neighborhood size, which is crucial for ID estimation when
        the data lie on a manifold perturbed by high-dimensional noise.

    block_ratio : int, default=20
        This parameter is considered if dim_algo is "twoNN", it is ignored otherwise.
        Set the minimum size of the blocks as n_samples/block_ratio. If blockAn=False, block_ratio is ignored.
        
    frac : float, default=1
        This parameter is considered if dim_algo is "twoNN", it is ignored otherwise.
        Define the fraction of points in the data set used for ID calculation. By default the full 
        data set is used.



    Attributes
    ----------
    labels_ : array [n_samples]
        The clustering labels assigned to each point in the data set.

    topography_ : array [Nclus, Nclus]
        With Nclus the number of clusters, the topography is a Nclus × Nclus symmetric matrix
        in which the diagonal entries are the heights of the peaks and the off-diagonal entries are the
        heights of the saddle points.

    distances_ : array [n_samples, k_max+1]
        Distances to the k_max neighbors of each point. The point itself is included in the array.

    indices_ : array [n_samples, k_max+1]
        Indices of the k_max neighbors of each point. The point itself is included in the array.

    k_hat_ : array [n_samples], default = None
        The optimal number of neighbors for which the condition of constant density holds.

    centers_ : array [Nclus]
        The cluster centers (density peaks), one per cluster.
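
The default `D_thr` listed among the parameters can be checked by hand. A chi-squared variable with one degree of freedom is the square of a standard normal variable, so its tail probability has a closed form in terms of the complementary error function. The snippet below is a small standalone check that uses only the Python standard library; the helper name `chi2_sf_1dof` is made up for this example and is not part of the DPA package.

```python
import math


def chi2_sf_1dof(x):
    """Tail probability P(X > x) for a chi-squared variable with 1 degree of freedom."""
    # If X ~ chi2(1), then X = Z**2 with Z standard normal, hence
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Default value quoted in the D_thr entry of the parameter list.
D_THR_DEFAULT = 23.92812698

p_value = chi2_sf_1dof(D_THR_DEFAULT)
print(f"D_thr = {D_THR_DEFAULT} -> p-value = {p_value:.3e}")
```

The printed p-value should be very close to `1e-6`, in line with the description of `D_thr`. A smaller `D_thr` corresponds to a larger p-value.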

In [11]:
est = DPA.DensityPeakAdvanced(Z=1.5)

In [12]:
est.fit(data_F1)



DensityPeakAdvanced(D_thr=23.92812698, Z=1.5, blockAn=True, block_ratio=20,
                    densities=[8.951118219205181, 4.437268289686326,
                               3.4299282849553783, 6.871420680485775,
                               7.11485462391007, 9.57656137123776,
                               9.470094550357937, 7.31040958472572,
                               9.413436563943083, 7.677202641203396,
                               9.529954317873974, 7.436337210296516,
                               5.763625906717161, 9.71869601411903,
                               9.520998710622525,...
                                   0.1346859764710466, 0.14019867026796917,
                                   0.14528504114305565, 0.24523554819823287,
                                   0.12790510004512853, 0.07430370873870562,
                                   0.10838521440452242, 0.1323072983192249,
                                   0.16356945585107635, 0.12896434681821414, ...],
  

In [13]:
est.topography_

[[0, 1, 0.0, 0.0],
 [0, 2, 6.602688718447325, 0.26604585934392433],
 [0, 3, 3.4737493893916898, 0.3507059343758202],
 [0, 4, 3.9686924952437796, 0.4917622040778217],
 [0, 5, 0.0, 0.0],
 [0, 6, 2.9788214773915116, 0.4917622040778217],
 [0, 7, 3.235695723880518, 0.3958473906635696],
 [0, 8, 0.0, 0.0],
 [0, 9, 6.6573450426236125, 0.2490073533144393],
 [0, 10, 0.0, 0.0],
 [0, 11, 0.0, 0.0],
 [1, 2, 0.0, 0.0],
 [1, 3, 3.2145070663288036, 0.3265159805584621],
 [1, 4, 0.0, 0.0],
 [1, 5, 2.144230212419588, 0.5244044240850758],
 [1, 6, 0.0, 0.0],
 [1, 7, 0.0, 0.0],
 [1, 8, 4.2608239777780845, 0.34543846209192886],
 [1, 9, 0.0, 0.0],
 [1, 10, 1.8657494127677032, 0.6831300510639732],
 [1, 11, 2.195471185443899, 0.40382783841251346],
 [2, 3, 0.0, 0.0],
 [2, 4, 2.414484309933649, 0.2550061995371002],
 [2, 5, 0.0, 0.0],
 [2, 6, 3.6031877275373536, 0.3142012345545232],
 [2, 7, 0.0, 0.0],
 [2, 8, 0.0, 0.0],
 [2, 9, 3.945230871669852, 0.4413674147523748],
 [2, 10, 0.0, 0.0],
 [2, 11, 0.0, 0.0],
 [3, 4,
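
The printed topography is a list of rows rather than a square matrix. The sketch below shows one way to arrange such rows into a symmetric table of saddle heights. It rests on assumptions that the output alone does not confirm: that each row reads `[cluster_i, cluster_j, saddle_height, saddle_error]`, and that 12 clusters are present because the labels 0 to 11 appear in the listing. The function name `saddle_matrix` is hypothetical, and only a few rows from the output are copied for brevity.

```python
# A few rows copied from the est.topography_ output.
# ASSUMPTION: each row is [cluster_i, cluster_j, saddle_height, saddle_error].
rows = [
    [0, 1, 0.0, 0.0],
    [0, 2, 6.602688718447325, 0.26604585934392433],
    [0, 3, 3.4737493893916898, 0.3507059343758202],
    [1, 3, 3.2145070663288036, 0.3265159805584621],
    [2, 4, 2.414484309933649, 0.2550061995371002],
]


def saddle_matrix(rows, n_clusters):
    """Arrange (i, j, height, error) rows into a symmetric n_clusters x n_clusters table."""
    matrix = [[0.0] * n_clusters for _ in range(n_clusters)]
    for i, j, height, _error in rows:
        # Fill both triangles so that matrix[i][j] == matrix[j][i].
        matrix[i][j] = height
        matrix[j][i] = height
    return matrix


# ASSUMPTION: 12 clusters, since labels 0..11 appear in the printed output.
saddles = saddle_matrix(rows, n_clusters=12)
print(saddles[0][2], saddles[2][0])
```

The diagonal is left at zero here because the peak heights are not part of the rows shown in the output.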

The PAk and twoNN estimators can be used independently of the DPA clustering method.

In [14]:
rho_est = PAk.PointAdaptive_kNN()
d_est = twoNN.twoNearestNeighbors()

In [15]:
results = rho_est.fit(data_F1)
print(results.densities_[:10])

dim = d_est.fit(data_F1).dim_
print(dim)

[8.951118219205181, 4.437268289686326, 3.4299282849553783, 6.871420680485775, 7.11485462391007, 9.57656137123776, 9.470094550357937, 7.31040958472572, 9.413436563943083, 7.677202641203396]
2
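
To give a feel for where an intrinsic dimension such as the one printed above can come from, here is a minimal brute-force sketch of the TWO-NN idea as described in the literature. For each point, take the ratio `mu = r2 / r1` of the distances to its second and first nearest neighbors. Under the TWO-NN model the cumulative distribution of `mu` is `F(mu) = 1 - mu**(-d)`, so `-log(1 - F(mu))` plotted against `log(mu)` is a straight line through the origin with slope `d`. This is not the code in `twoNN.twoNearestNeighbors`; the function name `two_nn_dimension`, the `discard_fraction` argument, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def two_nn_dimension(points, discard_fraction=0.1):
    """Estimate the intrinsic dimension from ratios of 2nd to 1st neighbor distances.

    discard_fraction must be strictly between 0 and 1: the largest ratios are
    dropped, which also avoids log(0) at the top of the empirical CDF.
    """
    n = len(points)
    ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1
            elif dist < r2:
                r2 = dist
        ratios.append(r2 / r1)

    ratios.sort()
    n_keep = int(n * (1.0 - discard_fraction))
    x = [math.log(mu) for mu in ratios[:n_keep]]
    # Empirical CDF value of the (i+1)-th smallest ratio is (i+1)/n.
    y = [-math.log(1.0 - (i + 1) / n) for i in range(n_keep)]
    # Least-squares slope of a line constrained to pass through the origin.
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Synthetic test data: 400 points drawn uniformly in the unit square.
rng = random.Random(0)
square = [(rng.random(), rng.random()) for _ in range(400)]
estimate = two_nn_dimension(square)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

For points spread over a two-dimensional region, the estimate should land close to 2. The double loop costs O(N^2) distance evaluations, so this sketch is only practical for small samples.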


