This notebook explores estimators of intrinsic dimension (ID) and local intrinsic dimension (LID) for high-dimensional data, using synthetic data generated from a mixture of factor analyzers (MFA) model.

### References
1. Ma, Xingjun, et al. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).
2. Amsaleg, Laurent, et al. "Estimating local intrinsic dimensionality." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
3. Ansuini, Alessio, et al. "Intrinsic dimension of data representations in deep neural networks." arXiv preprint arXiv:1905.12784 (2019).
4. Carter, Kevin M., Raviv Raich, and Alfred O. Hero III. "On local intrinsic dimension estimation and its applications." IEEE Transactions on Signal Processing 58.2 (2009): 650-663.
5. Levina, Elizaveta, and Peter J. Bickel. "Maximum likelihood estimation of intrinsic dimension." Advances in neural information processing systems. 2005.
6. Facco, Elena, et al. "Estimating the intrinsic dimension of datasets by a minimal neighborhood information." Scientific reports 7.1 (2017): 12140.

### LID estimation method from [1] and [2]

In [1]:
import numpy as np
from pynndescent import NNDescent
from sklearn.neighbors import NearestNeighbors
from multiprocessing import cpu_count
from metrics_custom import (
    distance_SNN, 
    neighborhood_membership_vectors
)
from generate_data import MFA_model
from lid_estimators import (
    lid_mle_amsaleg, 
    id_two_nearest_neighbors
)

In [2]:
# Suppress numba's pending-deprecation warnings
import warnings
from numba import NumbaPendingDeprecationWarning
warnings.filterwarnings('ignore', '', NumbaPendingDeprecationWarning)

In [3]:
# Define some constants
num_proc = max(cpu_count() - 2, 1)
seed_rng = np.random.randint(1, high=10000)
K = 20
n_neighbors = max(K + 2, 20)
rho = 0.5
metric_primary = 'euclidean'

In [4]:
# Generate data according to a mixture of factor analyzers (MFA) model
np.random.seed(seed_rng)

# number of mixture components
n_components = 10
# dimension of the observed space
dim = 500

# dimension of the latent space. This determines the local intrinsic dimension
# dim_latent = 10
# model = MFA_model(n_components, dim, dim_latent=dim_latent, seed_rng=seed_rng)

# Can specify a range for the latent dimension instead of a single value
dim_latent_range = (10, 20)
model = MFA_model(n_components, dim, dim_latent_range=dim_latent_range, seed_rng=seed_rng)

# Generate data from the model
N = 1000
N_test = 100
data, labels = model.generate_data(N)
data_test, labels_test = model.generate_data(N_test)
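
The `MFA_model` class comes from the local `generate_data` module, whose code is not shown here. As a rough illustration of why the latent dimension controls the local intrinsic dimension, the following is a minimal sketch of sampling from a toy mixture of factor analyzers. The function `sample_mfa` and all of its parameters are hypothetical and are not the API of `MFA_model`.

```python
import numpy as np

def sample_mfa(n_samples, n_components=3, dim=50, dim_latent=5,
               noise_std=0.01, seed=0):
    """Draw samples from a toy mixture of factor analyzers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One mean vector and one (dim x dim_latent) loading matrix per component
    means = rng.normal(scale=5.0, size=(n_components, dim))
    loadings = rng.normal(size=(n_components, dim, dim_latent))
    # Assign each sample to a component uniformly at random
    labels = rng.integers(n_components, size=n_samples)
    # Latent factors live in a dim_latent-dimensional space
    z = rng.normal(size=(n_samples, dim_latent))
    # Small isotropic noise in the observed space
    noise = noise_std * rng.normal(size=(n_samples, dim))
    # x = mean_c + W_c z + noise, with c the component of each sample
    x = means[labels] + np.einsum('nij,nj->ni', loadings[labels], z) + noise
    return x, labels

x_toy, labels_toy = sample_mfa(200)
```

Within a single component, the samples lie close to a `dim_latent`-dimensional affine subspace of the observed space, which is what the LID estimators below try to recover from neighbor distances.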

In [5]:
# Construct an approximate nearest neighbor (ANN) index to query nearest neighbors
params = {
    'metric': metric_primary, 
    'n_neighbors': n_neighbors,
    'rho': rho,
    'random_state': seed_rng,
    'n_jobs': num_proc, 
    'verbose': True
}
index = NNDescent(data, **params)

Fri Nov 22 02:33:29 2019 Building RP forest with 7 trees
Fri Nov 22 02:33:29 2019 parallel NN descent for 10 iterations
	 0  /  10
	 1  /  10
	 2  /  10


In [6]:
# Query the K nearest neighbors of each point.
# A point in the index is normally returned as its own nearest neighbor, so we
# query for `K + 1` neighbors and drop the self-neighbor afterwards.
nn_indices_, nn_distances_ = index.query(data, k=(K + 1))
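
For small datasets, the approximate neighbors can be cross-checked against an exact search. The following is a minimal sketch using scikit-learn's `NearestNeighbors` on toy data; the names `data_toy` and `K_toy` are illustrative and not part of this notebook. Note that `kneighbors` returns `(distances, indices)`, the reverse of the `(indices, distances)` order returned by `index.query` above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data_toy = rng.normal(size=(200, 10))  # toy data, for illustration only
K_toy = 5

# Exact (brute-force or tree-based) K + 1 nearest neighbors of each point
nn = NearestNeighbors(n_neighbors=K_toy + 1, metric='euclidean').fit(data_toy)
dist_exact, ind_exact = nn.kneighbors(data_toy)

# Assuming there are no duplicate points, the first column is the query point
# itself (at distance 0), so it is dropped.
dist_exact = dist_exact[:, 1:]
ind_exact = ind_exact[:, 1:]
```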

In [7]:
# Remove each point from its own neighborhood set
nn_indices = np.array(
    [nn_indices_[i, nn_indices_[i, :] != i] for i in range(N)], 
    dtype=nn_indices_.dtype
)
nn_distances = np.array(
    [nn_distances_[i, nn_indices_[i, :] != i] for i in range(N)], 
    dtype=nn_distances_.dtype
)

In [8]:
# Calculate the local intrinsic dimension in the neighborhood of each point
lid = lid_mle_amsaleg(nn_distances)
print("Mean LID = {:.4f}".format(np.mean(lid)))

Mean LID = 8.8093
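The source of `lid_mle_amsaleg` is not shown in this notebook. As a point of reference, the maximum likelihood LID estimator described in [1] and [2] is commonly written as the negative inverse of the mean of `log(r_i / r_k)`, where `r_1 <= ... <= r_k` are the distances from a point to its `k` nearest neighbors. The following is a minimal sketch under that assumption; `lid_mle_sketch` and `knn_distances_bruteforce` are hypothetical helpers and may differ from the implementation in `lid_estimators`.

```python
import numpy as np

def knn_distances_bruteforce(x, k):
    """Sorted distances from each row of x to its k nearest other rows."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)   # exclude each point from its own neighbors
    d2 = np.maximum(d2, 0.0)       # guard against tiny negative round-off
    return np.sqrt(np.sort(d2, axis=1)[:, :k])

def lid_mle_sketch(knn_distances, eps=1e-12):
    """Per-point LID estimate from row-wise sorted k-NN distances."""
    d = np.maximum(np.asarray(knn_distances, dtype=float), eps)
    # log(r_i / r_k) for i = 1..k; the last term is log(1) = 0
    log_ratio = np.log(d / d[:, -1:])
    # Clamp the mean away from zero to avoid division by zero
    return -1.0 / np.minimum(log_ratio.mean(axis=1), -eps)

# Toy check: 5-dimensional Gaussian data embedded isometrically in 20 dimensions
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(20, 5)))   # orthonormal columns
x_embedded = rng.normal(size=(1000, 5)) @ basis.T
lid_toy = lid_mle_sketch(knn_distances_bruteforce(x_embedded, k=20))
print("Mean LID (toy) = {:.4f}".format(np.mean(lid_toy)))
```

On this toy data the mean estimate should land in the neighborhood of the true dimension 5, although the estimator is biased for small `k`.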


In [9]:
# Summarize the distribution of the per-point LID estimates with a few percentiles
p = [0, 2.5, 25, 50, 75, 97.5, 100]
out = np.percentile(lid, p)
print("Percentiles of the LID distribution:")
for a, b in zip(p, out):
    print("{:.1f}\t{:.4f}".format(a, b))

Percentiles of the LID distribution:
0.0	2.5948
2.5	3.3769
25.0	6.6347
50.0	8.6242
75.0	10.6621
97.5	15.5388
100.0	21.2013


### Intrinsic dimension estimation using the Two-NN method [6, 3]

In [10]:
# Use `id_est` rather than `id` to avoid shadowing the Python builtin
id_est = id_two_nearest_neighbors(nn_distances)
print("Intrinsic dimension estimated using the two-NN method = {:.4f}".format(id_est))

Intrinsic dimension estimated using the two-NN method = 13.1859
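
The source of `id_two_nearest_neighbors` is not shown here. The Two-NN method of [6] uses only the ratio `mu = r_2 / r_1` of the second to the first nearest-neighbor distance of each point: under its assumptions, `-log(1 - F(mu))` is linear in `log(mu)` with slope equal to the intrinsic dimension, where `F` is the empirical CDF of `mu`. The following is a minimal sketch of that fit; `two_nn_sketch`, `knn_distances_bruteforce`, and the `discard_fraction` parameter are hypothetical and may differ from the implementation in `lid_estimators`.

```python
import numpy as np

def knn_distances_bruteforce(x, k):
    """Sorted distances from each row of x to its k nearest other rows."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)   # exclude each point from its own neighbors
    d2 = np.maximum(d2, 0.0)       # guard against tiny negative round-off
    return np.sqrt(np.sort(d2, axis=1)[:, :k])

def two_nn_sketch(knn_distances, discard_fraction=0.1):
    """Global ID estimate from the first two nearest-neighbor distances.

    Assumes the distances are sorted per row and that r_1 > 0
    (no duplicate points).
    """
    d = np.asarray(knn_distances, dtype=float)
    mu = np.sort(d[:, 1] / d[:, 0])
    n = mu.shape[0]
    # Empirical CDF evaluated at the sorted ratios
    F = np.arange(1, n + 1) / n
    # Drop the largest ratios; always drop the last one, where 1 - F = 0
    n_keep = min(int(np.floor((1.0 - discard_fraction) * n)), n - 1)
    x = np.log(mu[:n_keep])
    y = -np.log(1.0 - F[:n_keep])
    # Least-squares slope of a line through the origin
    return float(np.dot(x, y) / np.dot(x, x))

# Toy check: 5-dimensional Gaussian data embedded isometrically in 20 dimensions
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(20, 5)))   # orthonormal columns
x_embedded = rng.normal(size=(1000, 5)) @ basis.T
id_toy = two_nn_sketch(knn_distances_bruteforce(x_embedded, k=2))
print("Two-NN ID (toy) = {:.4f}".format(id_toy))
```

Unlike the per-point LID estimates above, this yields a single global estimate for the whole dataset, which is one reason the two numbers reported in this notebook need not agree.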
