In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 241 ms (started: 2022-09-15 22:54:22 -07:00)


This is the Swiss Roll dataset from the Isomap paper: basically a 2D sheet of points that has been rolled up in 3D. Unfolding it is a goal beloved of all manifold learning methods. The MIT website has been less kind, as I can no longer find the data via the original page on its domain: <http://web.mit.edu/cocosci/isomap/isomap.html>. Fortunately, the Internet Archive can help us here.

In [2]:
from io import BytesIO

import requests
import scipy.io

req = requests.get(
    "https://web.archive.org/web/20160101000000/http://web.mit.edu/cocosci/isomap/swiss_roll_data.mat",
    timeout=10,
)
# Fail early on an HTTP error rather than passing an error page to loadmat
req.raise_for_status()
data = scipy.io.loadmat(BytesIO(req.content), squeeze_me=True, struct_as_record=False)

time: 1.3 s (started: 2022-09-15 22:54:22 -07:00)
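
To make the "rolled-up 2D sheet" description concrete, here is a minimal sketch of how such a dataset can be built. This is not the generator used for the Isomap file: the function name `roll_up_sheet` and the constants (`1.5 * pi`, `21.0`) are illustrative assumptions only.

```python
import numpy as np


def roll_up_sheet(n_points, seed=0):
    """Sample a flat 2D sheet and roll it into a 3D spiral (illustrative only)."""
    rng = np.random.default_rng(seed)
    # t is the position along the sheet; it becomes both angle and radius
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n_points))
    # height is the position across the sheet; rolling leaves it unchanged
    height = 21.0 * rng.random(n_points)
    sheet = np.column_stack([t, height])  # the flat 2D coordinates
    roll = np.column_stack([t * np.cos(t), height, t * np.sin(t)])  # the 3D roll
    return sheet, roll


sheet, roll = roll_up_sheet(1000)
print(sheet.shape, roll.shape)
```

A manifold learning method is given only `roll` and is judged on how well its 2D output resembles `sheet`.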


In [3]:
data

{'__header__': b'MATLAB 5.0 MAT-file, Platform: LNX86, Created on: Thu May 18 03:25:00 2000',
 '__version__': '1.0',
 '__globals__': [],
 'X_data': array([[-7.81669039, 11.63249768,  4.98949669, ..., -0.15761409,
         -6.13858617, 10.16509492],
        [-6.41497431, -3.83386251, -2.87793725, ...,  7.87242236,
         13.21593522,  8.52082985],
        [15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
         32.82565478,  1.06143158]]),
 'Y_data': array([[39.40748137, 63.46282985,  4.29079433, ..., 19.01280751,
         94.79912446, 76.50555546],
        [15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
         32.82565478,  1.06143158]])}

time: 8.24 ms (started: 2022-09-15 22:54:23 -07:00)


In [4]:
x = data["X_data"]
x.shape

(3, 20000)

time: 5.92 ms (started: 2022-09-15 22:54:23 -07:00)


Note that this data is stored by column: each of the 20000 columns is one 3D point.

In [5]:
data["Y_data"][1, :]

array([15.51749805, 12.96227365, 27.81921752, ..., 18.35944185,
       32.82565478,  1.06143158])

time: 4.02 ms (started: 2022-09-15 22:54:23 -07:00)
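
In the printed arrays, the second row of `Y_data` looks identical to the third row of `X_data`. The sketch below checks this using only the six columns visible in the output above, so it is a spot check rather than proof for all 20000 points. On the full data the equivalent comparison would be between `data["X_data"][2]` and `data["Y_data"][1]`. The helper name `rows_match` is made up for this example.

```python
import numpy as np

# First three and last three columns, copied from the printed output above
x_data = np.array(
    [
        [-7.81669039, 11.63249768, 4.98949669, -0.15761409, -6.13858617, 10.16509492],
        [-6.41497431, -3.83386251, -2.87793725, 7.87242236, 13.21593522, 8.52082985],
        [15.51749805, 12.96227365, 27.81921752, 18.35944185, 32.82565478, 1.06143158],
    ]
)
y_data = np.array(
    [
        [39.40748137, 63.46282985, 4.29079433, 19.01280751, 94.79912446, 76.50555546],
        [15.51749805, 12.96227365, 27.81921752, 18.35944185, 32.82565478, 1.06143158],
    ]
)


def rows_match(a, b):
    """Return True if two 1D arrays agree to floating point tolerance."""
    return bool(np.allclose(a, b))


third_axis_shared = rows_match(x_data[2], y_data[1])
print(third_axis_shared)
```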


## Data pipeline

First, transpose the data to get it by row:

In [6]:
x = x.T

time: 1.96 ms (started: 2022-09-15 22:54:23 -07:00)
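
As a small illustration of what the transpose does, here is a toy array standing in for `X_data` (the names `by_column`, `by_row` and `contiguous` are just for this example). Note that `.T` returns a view on the same memory rather than a copy; if some downstream code needs a C-ordered array, `np.ascontiguousarray` makes one.

```python
import numpy as np

# Toy stand-in for X_data: 3 coordinates (rows) by 4 points (columns)
by_column = np.arange(12, dtype=float).reshape(3, 4)

# After transposing, each row is one point with 3 coordinates
by_row = by_column.T
print(by_row.shape)

# The transpose shares memory with the original array
print(np.shares_memory(by_row, by_column))

# Optional: make a C-contiguous copy with the same values
contiguous = np.ascontiguousarray(by_row)
print(contiguous.flags["C_CONTIGUOUS"])
```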


In [7]:
target = pd.DataFrame(dict(X=data["Y_data"][0, :], Y=data["Y_data"][1, :]))

time: 3.33 ms (started: 2022-09-15 22:54:23 -07:00)


In [8]:
target

| | X | Y |
|---|---|---|
| 0 | 39.407481 | 15.517498 |
| 1 | 63.462830 | 12.962274 |
| 2 | 4.290794 | 27.819218 |
| 3 | 22.285673 | 19.880264 |
| 4 | 57.325563 | 12.826453 |
| ... | ... | ... |
| 19995 | 30.505732 | 7.540049 |
| 19996 | 54.118022 | 18.001537 |
| 19997 | 19.012808 | 18.359442 |
| 19998 | 94.799124 | 32.825655 |


time: 9.7 ms (started: 2022-09-15 22:54:23 -07:00)


In [11]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(check_for_duplicates=True).run(
    "isoswiss",
    data=x,
    target=target,
    tags=["synthetic", "lowdim", "isomap"],
    verbose=True,
)

time: 15.2 s (started: 2022-09-15 23:01:19 -07:00)
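
The `check_for_duplicates=True` argument presumably asks the pipeline to look for repeated rows in the data; how `drnb` does this internally is not shown here. As a stand-alone illustration of the idea (not the library's implementation), duplicate rows can be counted with NumPy. The function name `count_duplicate_rows` is an assumption for this sketch.

```python
import numpy as np


def count_duplicate_rows(points):
    """Count rows that exactly repeat an earlier row."""
    unique_rows = np.unique(points, axis=0)
    return points.shape[0] - unique_rows.shape[0]


# Toy data: the third point repeats the first
points = np.array(
    [
        [0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [0.0, 1.0, 2.0],
    ]
)
n_duplicates = count_duplicate_rows(points)
print(n_duplicates)
```

This only detects exact matches; near-duplicates that differ by floating point noise would need a tolerance-based comparison instead.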
