In [1]:
import numpy as np
import pandas as pd

time: 271 ms (started: 2022-09-20 21:36:41 -07:00)


This dataset is part of the collection used in [T-SNE Is Not Optimized to Reveal Clusters in Data](https://arxiv.org/abs/2110.02573) and [Stochastic Cluster Embedding](https://arxiv.org/abs/2108.08003). Those papers suggest that it should be easy to get obvious clusters in the output for this dataset, but that t-SNE fails to do so. The other datasets in the collection are `higgs`, `icjnn`, `shuttle` and `tomoradar`.

Unlike most of the datasets, this one requires you to download the file yourself. The files are hosted on Box, and unless the owner provides a permanent link, there is no URL that a Python script can download from. To make this notebook work:

* Go to the [sce data](https://ntnu.app.box.com/s/ar1j9iijjw266xs45jm32w24yadvo21b) link.
* Download [flow_cytometry_pmbc_lucs.mat](https://ntnu.app.box.com/s/ar1j9iijjw266xs45jm32w24yadvo21b/file/847057240971).
* There are no labels for this dataset.
* Make sure you have installed the `h5py` package, e.g. `pip install h5py` (also needed for `tomoradar`).

## Extract the data

Replace `<PATH-WHERE-YOU-DOWNLOADED-THE-DATA>` with the directory where you saved the downloaded file.

In [2]:
import h5py


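def read_cytometry():
    filepath = "<PATH-WHERE-YOU-DOWNLOADED-THE-DATA>/flow_cytometry_pmbc_lucs.mat"
    # Open read-only; the file is closed once the data has been copied into memory
    with h5py.File(filepath, "r") as f:
        # "X" holds the data matrix; np.array copies it out of the file
        return np.array(f["X"])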

time: 22.1 ms (started: 2022-09-20 21:37:11 -07:00)
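
A missing or mistyped path is the most likely way for `read_cytometry` to fail. As a minimal sketch, a check like the following could be run first. The helper name `resolve_mat_path` and its arguments are hypothetical and not part of this notebook or of `drnb`:

```python
from pathlib import Path


def resolve_mat_path(data_dir, filename="flow_cytometry_pmbc_lucs.mat"):
    """Return the full path to the downloaded file, or raise if it is missing.

    Hypothetical helper for illustration only.
    """
    path = Path(data_dir).expanduser() / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected the downloaded file at {path}")
    return path
```

The returned path could then be passed to `h5py.File` in place of the hard-coded string.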


In [3]:
data = read_cytometry()
data, data.shape

(array([[ 1.56596406e+05,  3.81096016e+04,  3.00204004e+04, ...,
          1.55344328e+05,  2.04530406e+05,  1.01437203e+05],
        [ 1.30489000e+05,  3.76060000e+04,  2.93060000e+04, ...,
          1.30363000e+05,  1.73296000e+05,  8.60070000e+04],
        [ 7.86480312e+04,  6.64136250e+04,  6.71335938e+04, ...,
          7.80945938e+04,  7.73480391e+04,  7.72935781e+04],
        ...,
        [ 3.51150000e+03, -2.76750000e+02, -6.54750000e+02, ...,
          2.56725000e+03,  2.65725000e+03,  3.45000000e+01],
        [ 7.68750000e+02,  4.65000000e+01,  1.41000000e+02, ...,
          1.80150000e+03,  1.93650000e+03, -3.52500000e+01],
        [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          6.96689990e+03,  6.96689990e+03,  6.96689990e+03]]),
 (21, 1000000))

time: 1.64 s (started: 2022-09-20 21:37:12 -07:00)


The array is read with features as rows and observations as columns (21 x 1,000,000), so we need to transpose it to get one row per observation:

In [4]:
data = data.T
data.shape, data.dtype, data.flags

((1000000, 21),
 dtype('<f8'),
   C_CONTIGUOUS : False
   F_CONTIGUOUS : True
   OWNDATA : False
   WRITEABLE : True
   ALIGNED : True
   WRITEBACKIFCOPY : False
   UPDATEIFCOPY : False)

time: 3.63 ms (started: 2022-09-20 21:37:43 -07:00)
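
The flags above show that the transpose is a view over the original buffer (`OWNDATA : False`) in Fortran order rather than a row-major copy. If downstream code prefers C-contiguous data, one option is an explicit copy with `np.ascontiguousarray`. A minimal sketch on a small stand-in array (the names `small`, `small_t` and `small_c` are illustrative and not from the notebook):

```python
import numpy as np

# Stand-in for the (features, samples) array read from the file.
small = np.arange(12, dtype=np.float64).reshape(3, 4)

# Transposing returns a view: no data is copied, only the strides change.
small_t = small.T
print(small_t.flags["C_CONTIGUOUS"], small_t.flags["F_CONTIGUOUS"])

# An explicit copy laid out in row-major (C) order.
small_c = np.ascontiguousarray(small_t)
print(small_c.flags["C_CONTIGUOUS"], small_c.flags["OWNDATA"])
```

For the full dataset, such a copy would need about 168 MB (1,000,000 x 21 values at 8 bytes each), so whether it is worth making depends on what consumes the array afterwards.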


## Data Pipeline

In [5]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(check_for_duplicates=True, csv=True).run(
    "cytometry",
    data=data,
    verbose=True,
    url="https://github.com/rozyangno/sce",
)

time: 12min 30s (started: 2022-09-20 21:38:27 -07:00)
