In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 406 ms (started: 2022-09-12 14:49:09 -07:00)


The Swiss Roll dataset from the Isomap paper. This is basically a 2D sheet of points that has been rolled up in 3D. Unfolding it is a goal beloved of all manifold learning methods. The MIT website has been less kind: I can no longer find the data at its original home, <http://web.mit.edu/cocosci/isomap/isomap.html>. Fortunately, the Internet Archive still has a copy.
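To make "a 2D sheet that has been rolled up" concrete, here is a minimal sketch of how such data could be generated. It is not the generator used by the Isomap authors: the function name `make_swiss_roll`, the angle range and the height range are all assumptions chosen for illustration.

```python
import numpy as np


def make_swiss_roll(n_points, seed=0):
    """Roll a flat 2D sheet into 3D (illustrative only, not the Isomap generator)."""
    rng = np.random.default_rng(seed)
    # Intrinsic 2D coordinates of the sheet: angle along the spiral and height.
    # Both ranges are arbitrary choices for this sketch.
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n_points)
    height = rng.uniform(0.0, 30.0, size=n_points)
    # Embed in 3D: the angle traces a spiral in the first two axes,
    # while the height passes through unchanged as the third axis.
    points = np.column_stack([t * np.cos(t), t * np.sin(t), height])
    sheet = np.column_stack([t, height])
    return points, sheet


points, sheet = make_swiss_roll(1000)
print(points.shape, sheet.shape)
```

A method that "unfolds" the roll would ideally recover something like `sheet` from `points` alone.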

In [3]:
from io import BytesIO

import requests
import scipy.io

req = requests.get(
    "https://web.archive.org/web/20160101000000/http://web.mit.edu/cocosci/isomap/swiss_roll_data.mat",
    timeout=10,
)
# Fail early on an HTTP error rather than passing an error page to loadmat
req.raise_for_status()
data = scipy.io.loadmat(BytesIO(req.content), squeeze_me=True, struct_as_record=False)

time: 350 ms (started: 2022-09-12 14:50:01 -07:00)


In [4]:
data

{'__header__': b'MATLAB 5.0 MAT-file, Platform: LNX86, Created on: Thu May 18 03:25:00 2000',
 '__version__': '1.0',
 '__globals__': [],
 'X_data': array([[-7.81669039, 11.63249768,  4.98949669, ..., -0.15761409,
         -6.13858617, 10.16509492],
        [-6.41497431, -3.83386251, -2.87793725, ...,  7.87242236,
         13.21593522,  8.52082985],
        [15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
         32.82565478,  1.06143158]]),
 'Y_data': array([[39.40748137, 63.46282985,  4.29079433, ..., 19.01280751,
         94.79912446, 76.50555546],
        [15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
         32.82565478,  1.06143158]])}

time: 7.49 ms (started: 2022-09-12 14:50:05 -07:00)


In [6]:
x = data["X_data"]
x.shape

(3, 20000)

time: 3.01 ms (started: 2022-09-12 15:37:19 -07:00)


Note that this data is stored by column: each of the 20,000 columns is one 3D point.

In [7]:
data["Y_data"][1, :]

array([15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
       32.82565478,  1.06143158])

time: 19.3 ms (started: 2022-09-12 15:46:11 -07:00)
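The values printed here look the same as the third row of `X_data` shown earlier, which suggests that the second row of `Y_data` is simply the height of the roll. The printed arrays are truncated, so this is worth checking rather than assuming. A minimal sketch of the check, using small stand-in arrays rather than the downloaded data:

```python
import numpy as np

# Stand-in arrays mimicking the layout of X_data (3 x n) and Y_data (2 x n).
# The numbers are made up for illustration.
X_data = np.array([[-7.8, 11.6, 5.0], [-6.4, -3.8, -2.9], [15.5, 13.0, 27.8]])
Y_data = np.array([[39.4, 63.5, 4.3], [15.5, 13.0, 27.8]])

# True only when the second row of Y_data repeats the third row of X_data exactly.
shares_height = np.array_equal(Y_data[1, :], X_data[2, :])
print(shares_height)
```

The same comparison could be run in the notebook on `data["Y_data"][1, :]` and `data["X_data"][2, :]`.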


## Data pipeline

First, transpose the data so that each row is one point:

In [9]:
x = x.T

time: 1.17 ms (started: 2022-09-12 15:46:59 -07:00)
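Transposing a NumPy array returns a view rather than a copy, which is why this step is so fast, but it can also change the memory layout. The sketch below uses a small stand-in array that happens to be C-ordered. Whether the real `x` ends up C-contiguous after the transpose depends on how it was laid out when loaded, which I have not checked here.

```python
import numpy as np

# Stand-in for a (3, n) array with one point per column (C-ordered by construction).
x_by_column = np.arange(12, dtype=np.float64).reshape(3, 4)

# .T is a view onto the same memory: nothing is copied.
x_by_row = x_by_column.T
print(x_by_row.shape)
print(np.shares_memory(x_by_row, x_by_column))
print(x_by_row.flags["C_CONTIGUOUS"])

# An explicit conversion gives a row-major float32 copy.
x_c = np.ascontiguousarray(x_by_row, dtype=np.float32)
print(x_c.flags["C_CONTIGUOUS"], x_c.dtype)
```

In this notebook the conversion is not needed by hand, because the pipeline log further down reports converting the data to `float32` with a C layout.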


In [13]:
target = pd.DataFrame(dict(X=data["Y_data"][0, :], Y=data["Y_data"][1, :]))

time: 3.92 ms (started: 2022-09-12 15:49:59 -07:00)


In [14]:
target

               X          Y
0      39.407481  15.517498
1      63.462830  12.962274
2       4.290794  27.819218
3      22.285673  19.880264
4      57.325563  12.826453
...          ...        ...
19995  30.505732   7.540049
19996  54.118022  18.001537
19997  19.012808  18.359442
19998  94.799124  32.825655


time: 9.83 ms (started: 2022-09-12 15:50:00 -07:00)


In [15]:
from drnb.dataset import create_data_pipeline

data_pipe = create_data_pipeline(
    data_export=["csv", "npy"],
    target_export=["csv", "pkl"],
    neighbors=dict(
        n_neighbors=[15, 50, 150],
        method="exact",
        metric=["euclidean"],
        file_types=["csv", "npy"],
    ),
    triplets=dict(
        n_triplets_per_point=5,
        seed=1337,
        file_types=["csv", "npy"],
    ),
    verbose=True,
)

INFO:rich:Requesting one extra neighbor to account for self-neighbor


time: 7.94 s (started: 2022-09-12 15:50:25 -07:00)
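The log message about "one extra neighbor" reflects a common convention: when a dataset is queried against itself, each point's nearest neighbor is itself at distance zero, so asking for `k + 1` neighbors and dropping the first leaves `k` real ones. Below is a brute-force sketch of that idea. It is not how `drnb` does it (the run log further down mentions faiss), and the function name `exact_knn` is made up for illustration.

```python
import numpy as np


def exact_knn(data, n_neighbors):
    """Brute-force Euclidean k-nearest neighbors, excluding each point itself.

    Assumes no duplicate points, so the self-neighbor always sorts first.
    """
    # Pairwise distances via broadcasting: shape (n, n). Fine for small n only.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Request one extra neighbor because column 0 will be the point itself.
    order = np.argsort(dist, axis=1, kind="stable")[:, : n_neighbors + 1]
    idx = order[:, 1:]  # drop the self-neighbor
    return idx, np.take_along_axis(dist, idx, axis=1)


data = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
idx, dist = exact_knn(data, n_neighbors=2)
print(idx)
print(dist)
```

With `n_neighbors=[15, 50, 150]`, the same convention would account for the 151 neighbors reported when the pipeline runs.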


In [16]:
data_result = data_pipe.run("isoswiss", data=x, target=target, verbose=True)

INFO:rich:initial data shape: (20000, 3)
INFO:rich:Removing rows with NAs
INFO:rich:data shape after filtering NAs: (20000, 3)
INFO:rich:Keeping all columns
INFO:rich:data shape after filtering columns: (20000, 3)
INFO:rich:No scaling
INFO:rich:Converting to numpy with {'dtype': 'float32', 'layout': 'c'}
INFO:rich:Writing data for isoswiss
INFO:rich:Processing target with initial shape (20000, 2)
INFO:rich:Keeping all columns
INFO:rich:Writing target for isoswiss
INFO:rich:Calculating nearest neighbors
INFO:rich:Finding 151 neighbors using faiss with euclidean metric and params: {}
INFO:rich:Calculating triplets
INFO:rich:Writing csv format to triplets/isoswiss.5.1337.idx.csv
INFO:rich:Writing csv format to triplets/isoswiss.5.1337.l2.csv
INFO:rich:Writing numpy format to triplets/isoswiss.5.1337.idx.npy
INFO:rich:Writing numpy format to triplets/isoswiss.5.1337.l2.npy
INFO:rich:Writing pipeline result for isoswiss


time: 35.5 s (started: 2022-09-12 15:50:46 -07:00)
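The log shows two triplet files per format, `idx` and `l2`. My reading, which is an assumption rather than something confirmed from the `drnb` source, is that for each point a fixed number of random pairs of other points is sampled (`idx`) and the Euclidean distances from the point to each member of the pair are stored (`l2`). A minimal sketch under that assumption, with a hypothetical function name and array layout:

```python
import numpy as np


def random_triplets(data, n_triplets_per_point, seed):
    """For each anchor point, sample random pairs of points and their distances.

    Assumed layout (hypothetical, not taken from drnb): idx and dist both have
    shape (n_points, n_triplets_per_point, 2). This simple sketch does not
    prevent a sampled point from being the anchor itself.
    """
    rng = np.random.default_rng(seed)
    n_points = data.shape[0]
    idx = rng.integers(0, n_points, size=(n_points, n_triplets_per_point, 2))
    # Euclidean distance from each anchor to both sampled points.
    anchors = data[:, None, None, :]
    dist = np.linalg.norm(data[idx] - anchors, axis=-1)
    return idx, dist


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
idx, dist = random_triplets(data, n_triplets_per_point=5, seed=1337)
print(idx.shape, dist.shape)
```

Fixing the seed makes the sampled triplets reproducible, which is presumably why the seed appears in the output file names.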
