In [1]:
%load_ext lab_black
%load_ext autotime
import numpy as np
import pandas as pd

import drnb as nb

time: 591 ms (started: 2023-07-22 11:41:15 -07:00)


In [2]:
from drnb.io.dataset import get_available_dataset_info, list_available_datasets

time: 666 ms (started: 2023-07-22 11:41:16 -07:00)


`get_available_dataset_info` provides an overview of datasets which have gone through the data pipeline. Some columns of note:

* `n_items`: number of rows.
* `n_dim`: number of columns.
* `n_target_cols`: number of columns in the `target` metadata.
* `n_na_rows`: how many rows of the original data were removed because they contained missing entries.
* `scale`: the scaling that was applied, if any, e.g. `z` means Z-scaling/standard scaling.
* `dim_red`: the initial dimensionality reduction that was carried out, if any, for example PCA. In that case, the amount of variance explained is also recorded.
* `n_duplicates`: how many rows are duplicated in the dataset (counted after removing `na` rows and selecting columns).
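
The `dim_red` entries in the output in this notebook have the form `PCA 50 (62.64%)`. As a minimal sketch, assuming every non-missing entry follows that pattern, such a string could be split into its parts as below. `parse_dim_red` is a hypothetical helper for illustration and is not part of `drnb`.

```python
import re

# Hypothetical helper (not part of drnb). Assumes dim_red strings look like
# "PCA 50 (62.64%)": method name, number of components, percent variance explained.
_DIM_RED_RE = re.compile(
    r"^(?P<method>\S+)\s+(?P<n_components>\d+)\s+\((?P<pct>[\d.]+)%\)$"
)


def parse_dim_red(value):
    """Return (method, n_components, variance_explained_pct), or None if no match."""
    if not isinstance(value, str):
        return None  # missing entries are not strings (e.g. NaN or None)
    match = _DIM_RED_RE.match(value.strip())
    if match is None:
        return None
    return match["method"], int(match["n_components"]), float(match["pct"])


print(parse_dim_red("PCA 50 (62.64%)"))  # ('PCA', 50, 62.64)
```

Returning `None` for anything that does not match keeps the helper safe to apply across a whole column, where most entries are empty.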


If all the notebooks in this folder have been run, the output should look something like the following:

In [4]:
df = get_available_dataset_info()

time: 2.74 s (started: 2023-07-22 11:42:00 -07:00)


In [6]:
df.loc[
    [
        "isoswiss",
        "scurvehole",
        "mammoth",
        "s1k",
        "spheres",
        "mnist",
        "tasic2018-pca50",
        "macosko2015-pca50",
    ],
    :,
]

name,n_items,n_dim,n_target_cols,n_na_rows,scale,dim_red,n_duplicates,tags,created_on,updated_on,url
isoswiss,20000,3,2.0,0,,,0.0,synthetic lowdim isomap,2022-09-16 06:01:34,2022-09-16 06:01:34,
scurvehole,9505,3,1.0,0,,,0.0,lowdim synthetic,2022-09-16 06:01:58,2022-09-16 06:01:58,https://github.com/YingfanWang/PaCMAP
mammoth,50000,3,,0,,,0.0,lowdim synthetic,2022-09-16 06:13:48,2022-09-16 06:13:48,https://github.com/PAIR-code/understanding-umap
s1k,1000,9,1.0,0,,,0.0,small lowdim,2022-12-28 00:32:16,2022-12-28 00:32:16,
spheres,10000,101,1.0,0,,,,synthetic,2023-04-06 05:48:04,2023-04-06 05:48:04,https://github.com/BorgwardtLab/topological-au...
mnist,70000,784,1.0,0,,,0.0,image,2023-05-03 04:53:27,2023-05-03 04:53:27,http://yann.lecun.com/exdb/mnist/
tasic2018-pca50,23822,50,3.0,0,,PCA 50 (62.64%),,scRNAseq,2022-09-18 00:36:27,2022-09-18 00:36:27,https://doi.org/10.1038/s41586-018-0654-5
macosko2015-pca50,44808,50,2.0,0,,PCA 50 (30.02%),,scRNAseq,2023-06-11 02:53:31,2023-06-11 02:53:31,https://doi.org/10.1016/j.cell.2015.05.002


time: 21.1 ms (started: 2023-07-22 11:43:14 -07:00)


In [3]:
get_available_dataset_info()

name,n_items,n_dim,n_target_cols,n_na_rows,scale,dim_red,n_duplicates,tags,created_on,updated_on,url
1kgp,3450,868,3,0,,,0,,2022-12-18 07:39:55,2022-12-18 07:39:55,https://doi.org/10.1371/journal.pgen.1008432
1kgp-pca50,3450,50,3,0,,PCA 50 (27.53%),,,2022-12-18 07:40:06,2022-12-18 07:40:06,https://doi.org/10.1371/journal.pgen.1008432
airfoil,1503,5,1,0,z,,0,lowdim,2022-12-25 21:56:05,2022-12-25 21:56:05,https://archive.ics.uci.edu/ml/datasets/airfoi...
avonet,11009,11,3,0,z,,297,small lowdim,2022-09-16 05:34:19,2022-09-16 05:34:19,https://doi.org/10.1111/ele.13898
cifar10,60000,3072,2,0,,,0,image,2022-09-19 05:46:39,2022-09-19 05:46:39,https://www.cs.toronto.edu/~kriz/cifar.html
...,...,...,...,...,...,...,...,...,...,...,...
tomoradar-z-pca1000,120024,1000,1,0,z,PCA 1000 (77.21%),,,2022-09-20 18:13:23,2022-09-20 18:13:23,https://github.com/rozyangno/sce
two_gaussians100d,10000,100,1,0,,,,synthetic,2023-02-13 01:09:29,2023-02-13 01:09:29,
two_gaussians2d,10000,2,1,0,,,,synthetic,2023-02-13 01:07:22,2023-02-13 01:07:22,
two_gaussians5d,10000,5,1,0,,,,synthetic,2023-02-13 01:08:25,2023-02-13 01:08:25,


time: 3.54 s (started: 2023-07-22 11:41:16 -07:00)


The idea behind the tags:
* `small`: a small dataset, < 4096 items. This is the cutoff at which UMAP switches from exact to approximate nearest neighbors. How onerous it is to get exact nearest neighbors also depends on the dimensionality of the data, hence the following classifications:
* `lowdim`: a dataset with < 100 features.
* `highdim`: a dataset with 1000-9999 features. Many nearest neighbor routines start having trouble here.
* `vhighdim`: a dataset with >= 10000 features. The temptation to run PCA on such large datasets is almost overwhelming. Nearest neighbor methods tend to be quite slow on these datasets, and the results may not be what you want (hubs tend to form).
    * I don't have a tag for 100-999 features. That's pretty "normal" in terms of how suitable versus how challenging the data is for dimensionality reduction.
* `image`: the dataset consists of images.
* `scRNAseq`: single-cell RNA sequencing data.
* `synthetic`: artificial data to test some aspect of a dimensionality reduction or manifold learning method, or real data used to a similar end (e.g. `mammoth`).
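
The size and dimensionality cutoffs above can be written down as a small function. This is only a sketch of the rules as described, not how `drnb` assigns tags. `size_tags` is a hypothetical name, and the tags stored with a dataset need not match its output.

```python
def size_tags(n_items, n_dim):
    """Suggest size/dimensionality tags from the cutoffs described above.

    Hypothetical helper for illustration only (not part of drnb).
    """
    tags = []
    if n_items < 4096:
        tags.append("small")
    if n_dim < 100:
        tags.append("lowdim")
    elif 1000 <= n_dim < 10000:
        tags.append("highdim")
    elif n_dim >= 10000:
        tags.append("vhighdim")
    # 100-999 features deliberately gets no dimensionality tag
    return tags


print(size_tags(1000, 9))  # ['small', 'lowdim']
print(size_tags(70000, 784))  # []
print(size_tags(60000, 3072))  # ['highdim']
```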

A simpler, but related function:

In [4]:
list_available_datasets()

['avonet',
 'cifar10',
 'cifar10act',
 'coil100',
 'coil20',
 'fashion',
 'frey',
 'iris',
 'isofaces',
 'isoswiss',
 'kuzushiji',
 'lamanno2020',
 'macosko2015',
 'macosko2015-pca50',
 'macosko2015z',
 'macosko2015z-pca50',
 'mammoth',
 'mnist',
 'ng20',
 'norb',
 'olivetti',
 'olivetti92x112',
 'penguins',
 's1k',
 'scurvehole',
 'tasic2018',
 'tasic2018-pca50']

time: 8.79 ms (started: 2022-09-18 16:37:34 -07:00)


In [5]:
list_available_datasets(with_target=True)

['avonet',
 'cifar10',
 'cifar10act',
 'coil100',
 'coil20',
 'fashion',
 'iris',
 'isofaces',
 'isoswiss',
 'kuzushiji',
 'lamanno2020',
 'macosko2015',
 'macosko2015-pca50',
 'macosko2015z',
 'macosko2015z-pca50',
 'mnist',
 'ng20',
 'norb',
 'olivetti',
 'olivetti92x112',
 'penguins',
 's1k',
 'scurvehole',
 'tasic2018',
 'tasic2018-pca50']

time: 16.8 ms (started: 2022-09-18 16:37:34 -07:00)
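
Comparing the two outputs above, `with_target=True` appears to drop the datasets that have no target metadata (here `frey` and `mammoth`). That reading is inferred from the output rather than from the `drnb` documentation. A minimal sketch of finding those names from the two lists, using a hypothetical helper `datasets_without_target`:

```python
def datasets_without_target(all_names, with_target_names):
    """Names in all_names that are absent from with_target_names, order preserved.

    Hypothetical helper for illustration only (not part of drnb).
    """
    have_target = set(with_target_names)
    return [name for name in all_names if name not in have_target]


# Abbreviated copies of the two outputs above
all_names = ["fashion", "frey", "iris", "mammoth", "mnist"]
with_target = ["fashion", "iris", "mnist"]
print(datasets_without_target(all_names, with_target))  # ['frey', 'mammoth']
```

In the notebook itself, the two arguments would be `list_available_datasets()` and `list_available_datasets(with_target=True)`.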
