In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 240 ms (started: 2022-09-12 17:49:21 -07:00)


A 3D S-curve data set with a hole cut out of it, used to validate the [PaCMAP method](https://arxiv.org/abs/2012.04456). The function below is taken from the [`data_prep` function in the PaCMAP GitHub repo](https://github.com/YingfanWang/PaCMAP/blob/d34bfdd644c1dd68e8181c926ff34e98b53b0453/experiments/run_experiments.py):

In [3]:
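from sklearn.datasets import make_s_curve


def make_scurvehole():
    """Return an S-curve with a spherical hole cut out around (0, 1, 0)."""
    X, labels = make_s_curve(n_samples=10000, random_state=20200202)
    # Center of the hole: the midpoint of the S.
    anchor = np.array([0, 1, 0])
    # Keep only points whose squared distance from the anchor exceeds 0.3.
    indices = np.sum(np.square(X - anchor), axis=1) > 0.3
    X, labels = X[indices], labels[indices]
    return X, labels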

time: 388 ms (started: 2022-09-12 17:52:09 -07:00)
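
As a quick sanity check on the hole, the sketch below regenerates the unfiltered S-curve and counts how many points the distance filter drops. It assumes only NumPy and scikit-learn; the names `full`, `sq_dist`, and `kept` are introduced here for illustration and are not part of the PaCMAP code.

```python
import numpy as np
from sklearn.datasets import make_s_curve

# Regenerate the unfiltered curve with the same settings as make_scurvehole.
full, t = make_s_curve(n_samples=10000, random_state=20200202)

# Squared Euclidean distance of every point from the hole center.
anchor = np.array([0, 1, 0])
sq_dist = np.sum(np.square(full - anchor), axis=1)

# Points inside the sphere of squared radius 0.3 are dropped.
kept = sq_dist > 0.3
print(f"kept {kept.sum()} of {len(full)} points, removed {(~kept).sum()}")
print(f"hole radius: {np.sqrt(0.3):.3f}")
```

The filter compares squared distances, so the threshold of 0.3 corresponds to a radius of `sqrt(0.3)`, not 0.3.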


In [6]:
data, labels = make_scurvehole()
data, data.shape, labels, labels.shape

(array([[ 0.29630642,  0.69668633,  1.95509293],
        [ 0.42420331,  0.51052323,  1.90556698],
        [ 0.99639638,  0.86050764,  1.08481893],
        ...,
        [ 0.78023631,  0.19581041, -1.62548485],
        [-0.13674143,  1.02601064, -1.99060677],
        [ 0.58515088,  0.67243971, -1.81092444]]),
 (9505, 3),
 array([-3.44241573, -3.57967458, -4.62746802, ...,  2.24654913,
         3.27876385,  2.51652655]),
 (9505,))

time: 7.29 ms (started: 2022-09-12 17:52:48 -07:00)


The `labels` give each point's position along the S-shaped curve (they are the second value returned by `make_s_curve`), so they can be used to color the points.
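
To make the link between the labels and the geometry concrete, the sketch below checks that two of the three coordinates can be rebuilt from the label alone. It relies on how scikit-learn's `make_s_curve` builds the curve when no noise is added (`x = sin(t)`, `z = sign(t) * (cos(t) - 1)`); treat that formula as an assumption about the installed scikit-learn version rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import make_s_curve

X, t = make_s_curve(n_samples=1000, random_state=0)

# With noise=0, x and z are deterministic functions of the label t,
# while y is an independent uniform draw across the width of the sheet.
x_from_t = np.sin(t)
z_from_t = np.sign(t) * (np.cos(t) - 1)

print(np.allclose(X[:, 0], x_from_t), np.allclose(X[:, 2], z_from_t))
print(f"label range: [{t.min():.2f}, {t.max():.2f}]")  # bounded by +/- 3*pi/2
```

Because the label increases steadily from one end of the S to the other, mapping it to a continuous colormap gives a smooth gradient along the curve, which makes tears or folds in an embedding easy to spot.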

## Data Pipeline

In [8]:
target = pd.DataFrame(dict(label=labels))
target

| | label |
|---|---|
| 0 | -3.442416 |
| 1 | -3.579675 |
| 2 | -4.627468 |
| 3 | 2.432928 |
| 4 | -0.921451 |
| ... | ... |
| 9500 | -1.864398 |
| 9501 | 0.688607 |
| 9502 | 2.246549 |
| 9503 | 3.278764 |


time: 8.78 ms (started: 2022-09-12 17:54:09 -07:00)


In [9]:
from drnb.dataset import create_data_pipeline

data_pipe = create_data_pipeline(
    convert=dict(dtype="float32", layout="c"),
    data_export=["csv", "npy"],
    target_export=["csv", "pkl"],
    neighbors=dict(
        n_neighbors=[15, 50, 150],
        method="exact",
        metric=["euclidean"],
        file_types=["csv", "npy"],
    ),
    triplets=dict(
        n_triplets_per_point=5,
        seed=1337,
        file_types=["csv", "npy"],
    ),
    verbose=True,
)

INFO:rich:Requesting one extra neighbor to account for self-neighbor


time: 4.03 s (started: 2022-09-12 17:54:18 -07:00)
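
The log message about requesting one extra neighbor reflects a common convention: when a data set is queried against itself, each point's nearest neighbor is the point itself at distance zero. The sketch below illustrates this with scikit-learn's `NearestNeighbors` as a stand-in. It is not the code path `drnb` uses (the pipeline log further down reports faiss), and the idea that the smaller neighborhoods are sliced from one query at the largest `k` is an assumption.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.neighbors import NearestNeighbors

X, _ = make_s_curve(n_samples=1000, random_state=0)

n_neighbors = [15, 50, 150]
# One query for the largest k, plus one extra slot for the self-neighbor.
k_max = max(n_neighbors) + 1
nn = NearestNeighbors(n_neighbors=k_max, algorithm="kd_tree", metric="euclidean")
dist, idx = nn.fit(X).kneighbors(X)

# Column 0 is each point itself, at distance zero.
self_is_first = np.array_equal(idx[:, 0], np.arange(len(X)))
print(idx.shape, self_is_first)

# Smaller neighborhoods are prefixes of the sorted result (self column dropped).
neighbors = {k: idx[:, 1 : k + 1] for k in n_neighbors}
print({k: v.shape for k, v in neighbors.items()})
```

With `n_neighbors=[15, 50, 150]`, this gives a single search for 151 neighbors, which matches the count reported when the pipeline runs.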


In [11]:
data_result = data_pipe.run("scurvehole", data=data, target=target, verbose=True)

INFO:rich:initial data shape: (9505, 3)
INFO:rich:Removing rows with NAs
INFO:rich:data shape after filtering NAs: (9505, 3)
INFO:rich:Keeping all columns
INFO:rich:data shape after filtering columns: (9505, 3)
INFO:rich:No scaling
INFO:rich:Converting to numpy with {'dtype': 'float32', 'layout': 'c'}
INFO:rich:Writing data for scurvehole
INFO:rich:Processing target with initial shape (9505, 1)
INFO:rich:Keeping all columns
INFO:rich:Writing target for scurvehole
INFO:rich:Calculating nearest neighbors
INFO:rich:Finding 151 neighbors using faiss with euclidean metric and params: {}
INFO:rich:Calculating triplets
INFO:rich:Writing csv format to triplets/scurvehole.5.1337.idx.csv
INFO:rich:Writing csv format to triplets/scurvehole.5.1337.l2.csv
INFO:rich:Writing numpy format to triplets/scurvehole.5.1337.idx.npy
INFO:rich:Writing numpy format to triplets/scurvehole.5.1337.l2.npy
INFO:rich:Writing pipeline result for scurvehole


time: 6.96 s (started: 2022-09-12 17:54:48 -07:00)
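
The triplets step is not described in this notebook, but the output file names (`idx` and `l2`) suggest that indices and Euclidean distances are stored for randomly sampled points. The sketch below shows one plausible reading: for each point, draw 5 random pairs of points and record the distance from the anchor point to each member of the pair. The function name `random_triplets` and the array layout are hypothetical and may differ from what `drnb` actually writes.

```python
import numpy as np


def random_triplets(X, n_triplets_per_point=5, seed=1337):
    """Hypothetical helper: sample random (i, j, k) triplets and their distances."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # idx[i, m] holds the pair (j, k) for the m-th triplet anchored at point i.
    # This simple sampling does not exclude i itself or repeated indices.
    idx = rng.integers(0, n, size=(n, n_triplets_per_point, 2))
    # Euclidean (L2) distance from each anchor i to its sampled j and k.
    diff = X[idx] - X[:, None, None, :]
    l2 = np.sqrt(np.sum(diff * diff, axis=-1))
    return idx, l2


X = np.random.default_rng(0).normal(size=(100, 3)).astype(np.float32)
idx, l2 = random_triplets(X)
print(idx.shape, l2.shape)
```

Triplets like these are typically used to score an embedding: recompute the two distances in the low-dimensional space and count how often the ordering of `d(i, j)` versus `d(i, k)` is preserved. Fixing the seed keeps the same triplets across methods so the scores are comparable.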
