In [1]:
%load_ext lab_black
%load_ext autotime
import numpy as np
import pandas as pd

time: 229 ms (started: 2022-09-20 11:03:48 -07:00)


This dataset is part of the collection used in [T-SNE Is Not Optimized to Reveal Clusters in Data](https://arxiv.org/abs/2110.02573) and [Stochastic Cluster Embedding](https://arxiv.org/abs/2108.08003). Both papers suggest that obvious clusters should be easy to get in the output for this dataset, but that t-SNE fails to produce them. The other datasets in the collection are `cytometry`, `higgs`, `icjnn` and `shuttle`.

Unlike most of the datasets, this one requires you to download the files yourself. They are hosted on Box, and unless the owner decides to provide a permanent link, there is no URL that a Python script can download from. So to make this notebook work:

* Go to the [sce data](https://ntnu.app.box.com/s/ar1j9iijjw266xs45jm32w24yadvo21b) link.
* Download [tomoradar_vectorial_data.mat](https://ntnu.app.box.com/s/ar1j9iijjw266xs45jm32w24yadvo21b/file/847073339939). **Warning**: this is nearly 6GB in size.
* Also download the [python/TOMORADAR_labels.npy](https://ntnu.app.box.com/s/ar1j9iijjw266xs45jm32w24yadvo21b/file/861962236512) file.
* Install the `h5py` package, e.g. `pip install h5py`.
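
Because the downloads are manual, it can be worth confirming that both files are in place before starting the slow read. The following is a minimal sketch: the helper `missing_downloads` is hypothetical (it is not part of this notebook or of `drnb`), and it assumes both files were saved into the same directory.

```python
from pathlib import Path

# File names as given in the download steps above.
REQUIRED_FILES = ("tomoradar_vectorial_data.mat", "TOMORADAR_labels.npy")


def missing_downloads(data_dir):
    """Return the required file names that are not present in data_dir.

    Hypothetical helper: assumes both files live in the same directory.
    """
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

Calling `missing_downloads` with your download directory returns an empty list once both files are present.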

## Extract the data

Replace `<PATH-WHERE-YOU-DOWNLOADED-THE-DATA>` with the actual place you downloaded the data.

In [1]:
import h5py


def read_tomoradar():
    filepath = "<PATH-WHERE-YOU-DOWNLOADED-THE-DATA>/tomoradar_vectorial_data.mat"
    # h5py can open this .mat file; open it read-only and let the context
    # manager close it once the "X" dataset has been copied into memory.
    with h5py.File(filepath, "r") as f:
        return np.array(f["X"])

**Warning**: This will take up nearly 8GB of RAM when read in.
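
The memory figure follows from the size of the matrix: 8192 × 120024 values stored as 64-bit floats, which are the shape and dtype this notebook reports for the loaded array. As a quick check of the arithmetic:

```python
import numpy as np

# Shape and dtype as reported by this notebook for the loaded matrix.
n_rows, n_cols = 8192, 120024
itemsize = np.dtype("float64").itemsize  # 8 bytes per value

n_bytes = n_rows * n_cols * itemsize
print(f"{n_bytes / 1e9:.2f} GB, {n_bytes / 2**30:.2f} GiB")
# prints "7.87 GB, 7.33 GiB"
```

This counts only the array itself; any copy made later (for example by scaling or dimensionality reduction) needs additional memory on top.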

In [3]:
data = read_tomoradar()
data, data.shape

(array([[-73.61126364, -73.18554208, -73.35228294, ..., -72.80931773,
         -70.98203336, -89.55940631],
        [-93.82978868, -93.83472585, -92.36178304, ..., -88.30937777,
         -88.69410652, -87.81034337],
        [-91.08240965, -92.11775507, -93.7669276 , ..., -99.03227389,
         -82.81084882, -85.15919977],
        ...,
        [-95.76371832, -89.02337378, -94.03413996, ..., -88.69418292,
         -83.86915127, -84.46926235],
        [-91.08240965, -92.11775507, -93.7669276 , ..., -99.03227389,
         -82.81084882, -85.15919977],
        [-93.82978868, -93.83472585, -92.36178304, ..., -88.30937777,
         -88.69410652, -87.81034337]]),
 (8192, 120024))

time: 1min 9s (started: 2022-09-20 11:03:49 -07:00)


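MATLAB stores arrays in column-major order, so `h5py` returns the matrix with its dimensions reversed: features along the rows and observations along the columns. Transpose it so that each row is one observation: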

In [4]:
data = data.T
data.shape, data.dtype, data.flags

((120024, 8192),
 dtype('<f8'),
   C_CONTIGUOUS : False
   F_CONTIGUOUS : True
   OWNDATA : False
   WRITEABLE : True
   ALIGNED : True
   WRITEBACKIFCOPY : False
   UPDATEIFCOPY : False)

time: 5.05 ms (started: 2022-09-20 11:04:59 -07:00)
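
The flags above explain why the transpose takes only milliseconds: `.T` returns a view onto the same memory (`OWNDATA : False`) whose layout is now column-major (`F_CONTIGUOUS : True`). The sketch below shows the same behavior on a small stand-in array. Whether a row-major copy is worth making for the real data is not something this notebook settles; the last lines only illustrate what such a copy would involve.

```python
import numpy as np

# A small stand-in for the (8192, 120024) matrix.
x = np.arange(6, dtype=np.float64).reshape(2, 3)

xt = x.T  # a view: no data is copied
assert np.shares_memory(x, xt)
assert xt.flags["F_CONTIGUOUS"] and not xt.flags["C_CONTIGUOUS"]
assert not xt.flags["OWNDATA"]

# An explicit row-major copy needs as much memory again as the original array.
xt_c = np.ascontiguousarray(xt)
assert xt_c.flags["C_CONTIGUOUS"] and not np.shares_memory(x, xt_c)
```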


## Read the labels

Replace `<PATH-WHERE-YOU-DOWNLOADED-THE-LABELS>` with the actual place you downloaded the labels file.

In [5]:
labels = np.load("<PATH-WHERE-YOU-DOWNLOADED-THE-LABELS>/TOMORADAR_labels.npy")
labels, np.unique(labels)

(array([2, 0, 2, ..., 1, 1, 1]), array([0, 1, 2]))

time: 14.6 ms (started: 2022-09-20 11:04:59 -07:00)
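
Before building the target, it may be useful to confirm that there is one label per observation and to see how the three classes are distributed. A minimal sketch on a hypothetical stand-in vector (the values below are illustrative, not the real labels):

```python
import numpy as np

# Hypothetical stand-in for the real label vector (class ids 0, 1 and 2).
labels = np.array([2, 0, 2, 2, 2, 1, 1, 1])
n_observations = 8  # for the real data this would be data.shape[0]

# One label per observation is needed for the target to line up with the data.
assert labels.shape == (n_observations,)

# Count how many observations fall into each class.
classes, counts = np.unique(labels, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
# prints {0: 1, 1: 3, 2: 4}
```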


## Data Pipeline

In [6]:
target = pd.DataFrame(dict(labels=labels))
target

| | labels |
|---|---|
| 0 | 2 |
| 1 | 0 |
| 2 | 2 |
| 3 | 2 |
| 4 | 2 |
| ... | ... |
| 120019 | 1 |
| 120020 | 1 |
| 120021 | 1 |
| 120022 | 1 |


time: 12.8 ms (started: 2022-09-20 11:04:59 -07:00)


In [7]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(
    check_for_duplicates=False, csv=False, reduce=1000
).run(
    "tomoradar-pca1000",
    data=data,
    target=target,
    verbose=True,
    url="https://github.com/rozyangno/sce",
)

time: 2min 49s (started: 2022-09-20 11:05:27 -07:00)


In [8]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(
    check_for_duplicates=False, csv=False, reduce=1000, scale="z"
).run(
    "tomoradar-z-pca1000",
    data=data,
    target=target,
    verbose=True,
    url="https://github.com/rozyangno/sce",
)

time: 3min 30s (started: 2022-09-20 11:09:53 -07:00)
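
The dataset names `tomoradar-pca1000` and `tomoradar-z-pca1000` suggest that `reduce=1000` keeps the first 1000 principal components and that `scale="z"` standardizes each column first. That reading is an assumption based on the names, not on the `drnb` source. Under that assumption, the two steps can be sketched with plain NumPy on toy data; `z_scale` and `pca_reduce` are hypothetical helpers, not `drnb` functions, and a full SVD like this would be far too slow and memory-hungry for the real matrix.

```python
import numpy as np


def z_scale(x):
    """Center each column and scale it to unit standard deviation."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (x - x.mean(axis=0)) / std


def pca_reduce(x, n_components):
    """Project x onto its first n_components principal components via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(42)
toy = rng.normal(size=(50, 8)) * np.arange(1, 9)  # columns on different scales
reduced = pca_reduce(z_scale(toy), n_components=3)
print(reduced.shape)
```

Without the scaling step, the columns with the largest variance dominate the leading components, which is one reason to store both variants.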


In [9]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(check_for_duplicates=False, csv=True).run(
    "tomoradar",
    data=data,
    target=target,
    verbose=True,
    tags=["highdim"],
    url="https://github.com/rozyangno/sce",
)

time: 13min 1s (started: 2022-09-20 11:41:14 -07:00)
