# Data Loading Tutorial

In [1]:
cd ../..

/home/jeff/git/scVI


In [2]:
from scvi.dataset import LoomDataset, CsvDataset, Dataset10X, AnnDataset
from scvi.dataset import BrainLargeDataset, CortexDataset, PbmcDataset, RetinaDataset, HematoDataset, CbmcDataset, BrainSmallDataset

## Generic Datasets
`scvi v0.1.3` supports dataset loading for the following four generic sources:
* `.loom` files
* `.csv` files
* `.h5ad` files
* datasets from the `10x` website

### Loading a `.loom` file
Any `.loom` file can be loaded by initializing `LoomDataset` with `filename`.

Optional parameters: 
* `save_path`: directory where the file is stored (defaults to `data/`)
* `url`: URL of the dataset, if the file needs to be downloaded from the web
* `new_n_genes`: number of genes to keep after subsampling; set it to `False` to turn off subsampling
* `subset_genes`: a list of gene names to keep
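
To make the two gene-selection options concrete, here is a minimal pure-Python sketch on a toy count matrix. It assumes that subsampling with `new_n_genes` keeps the most variable genes; the tutorial does not state the selection criterion, so treat that as an assumption. The helpers `top_variable_genes` and `select_genes` are hypothetical illustrations and are not part of `scvi`.

```python
from statistics import pvariance


def top_variable_genes(counts, gene_names, new_n_genes):
    """Keep the new_n_genes columns with the highest variance (assumed criterion).

    counts is a list of rows, one row per cell and one column per gene.
    """
    variances = [pvariance(column) for column in zip(*counts)]
    ranked = sorted(range(len(gene_names)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:new_n_genes])  # restore the original column order
    return [gene_names[j] for j in keep], [[row[j] for j in keep] for row in counts]


def select_genes(counts, gene_names, subset_genes):
    """Keep only the columns whose gene name appears in subset_genes."""
    index = {name: j for j, name in enumerate(gene_names)}
    keep = [index[name] for name in subset_genes]
    return [gene_names[j] for j in keep], [[row[j] for j in keep] for row in counts]


# Toy data: 3 cells by 3 genes.
counts = [[0, 5, 1], [0, 1, 1], [1, 9, 1]]
genes = ["Gad1", "Actb", "Sox2"]

top_names, top_counts = top_variable_genes(counts, genes, 2)
named_names, named_counts = select_genes(counts, genes, ["Sox2"])
print(top_names, named_names)
```

In the toy matrix, `Sox2` is constant across cells, so it is the gene dropped when only two genes are kept by variance.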

In [3]:
# Loading a remote dataset 
remote_loom_dataset = LoomDataset("osmFISH_SScortex_mouse_all_cell.loom", 
                                  save_path='data/', 
                                  url='http://linnarssonlab.org/osmFISH/osmFISH_SScortex_mouse_all_cells.loom')

Downloading file at data/osmFISH_SScortex_mouse_all_cell.loom
Preprocessing dataset
Finished preprocessing dataset


In [4]:
# Loading a local dataset 
local_loom_dataset = LoomDataset("osmFISH_SScortex_mouse_all_cell.loom", 
                                 save_path='data/')

File data/osmFISH_SScortex_mouse_all_cell.loom already downloaded
Preprocessing dataset
Finished preprocessing dataset
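
The two log outputs above ("Downloading file at ..." versus "File ... already downloaded") suggest a download-once-then-reuse pattern keyed on `save_path` and `filename`. The following is a minimal sketch of that pattern, not the `scvi` implementation: `ensure_file` and its `fetch` parameter are hypothetical names, and the URL and file name in the demo are placeholders.

```python
import os
import tempfile
import urllib.request


def ensure_file(filename, save_path="data/", url=None, fetch=None):
    """Return the local path of filename, downloading it only if it is missing."""
    path = os.path.join(save_path, filename)
    if os.path.exists(path):
        print("File %s already downloaded" % path)
        return path
    if url is None:
        raise FileNotFoundError("%s is missing and no url was given" % path)
    if fetch is None:
        fetch = urllib.request.urlretrieve  # real download by default
    os.makedirs(save_path, exist_ok=True)
    print("Downloading file at %s" % path)
    fetch(url, path)
    return path


# Demo with a stand-in downloader so that no network access is needed.
calls = []


def fake_fetch(url, path):
    calls.append(url)
    with open(path, "wb") as handle:
        handle.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_file("example.loom", save_path=tmp,
                        url="http://example.org/example.loom", fetch=fake_fetch)
    second = ensure_file("example.loom", save_path=tmp, fetch=fake_fetch)
```

The second call finds the file on disk and never invokes the downloader, which mirrors why the local example above needs no `url`.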


### Loading a `.csv` file 
Any `.csv` file can be loaded by initializing `CsvDataset` with `filename`.

Optional parameters: 
* `save_path`: directory where the file is stored (defaults to `data/`)
* `url`: URL of the dataset, if the file needs to be downloaded from the web
* `compression`: set `compression` to `gzip`, `bz2`, `zip`, or `xz` to load a compressed `.csv` file
* `new_n_genes`: number of genes to keep after subsampling; set it to `False` to turn off subsampling
* `subset_genes`: a list of gene names to keep

Note: `CsvDataset` currently only supports `.csv` files that are genes by cells (one row per gene, one column per cell).
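
If your counts are stored cells by genes, they need to be transposed before loading. The sketch below uses only the standard library to write a toy table as a gzip-compressed, genes-by-cells `.csv` and read it back. It assumes a layout with a header row of cell names and a first column of gene names, which the tutorial does not spell out; `write_genes_by_cells` and `read_rows` are hypothetical helpers, not `scvi` functions.

```python
import csv
import gzip
import os
import tempfile


def write_genes_by_cells(path, gene_names, cell_names, cells_by_genes):
    """Transpose a cells-by-genes table and write it as a gzipped genes-by-cells CSV."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + list(cell_names))  # header: one column per cell
        for gene, column in zip(gene_names, zip(*cells_by_genes)):
            writer.writerow([gene] + list(column))  # one row per gene


def read_rows(path):
    """Read every row of a gzipped CSV back as lists of strings."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.reader(handle))


cells = ["cell_1", "cell_2"]
genes = ["Gad1", "Actb", "Sox2"]
counts = [[0, 5, 1], [2, 7, 0]]  # cells by genes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_counts.csv.gz")
    write_genes_by_cells(path, genes, cells, counts)
    rows = read_rows(path)

for row in rows:
    print(row)
```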

In [5]:
# Loading a remote dataset 
remote_csv_dataset = CsvDataset("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz",
                                save_path='data/', 
                                compression='gzip', 
                                url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100866&format=file&file=GSE100866%5FCBMC%5F8K%5F13AB%5F10X%2DRNA%5Fumi%2Ecsv%2Egz")

Downloading file at data/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 36280 to 600 genes


In [6]:
# Loading a local dataset 
local_csv_dataset = CsvDataset("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz", 
                               save_path='data/', 
                               compression='gzip') 

File data/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz already downloaded
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 36280 to 600 genes


### Loading a `.h5ad` file
[AnnData](http://anndata.readthedocs.io/en/latest/) objects can be stored in `.h5ad` format. Any `.h5ad` file can be loaded by initializing `AnnDataset` with `filename`.

Optional parameters: 
* `save_path`: directory where the file is stored (defaults to `data/`)
* `url`: URL of the dataset, if the file needs to be downloaded from the web
* `new_n_genes`: number of genes to keep after subsampling; set it to `False` to turn off subsampling
* `subset_genes`: a list of gene names to keep

In [3]:
# Loading a local dataset 
local_ann_dataset = AnnDataset("TM_droplet_mat.h5ad", 
                               save_path = 'data/') 

File data/TM_droplet_mat.h5ad already downloaded
Preprocessing dataset
Finished preprocessing dataset


### Loading a file from the `10x` website

`10x` has published several datasets on their [website](https://www.10xgenomics.com). 
Initialize `Dataset10X` by passing in the dataset name of one of the following datasets that `scvi` currently supports: `frozen_pbmc_donor_a`, `frozen_pbmc_donor_b`, `frozen_pbmc_donor_c`, `pbmc8k`, `pbmc4k`, `t_3k`, `t_4k`, and `neuron_9k`. 

Optional parameters: 
* `save_path`: directory where the file is stored (defaults to `data/`)
* `type`: either `filtered` (the default) or `raw`, to choose between the two versions of each dataset that are available from `10x`
* `new_n_genes`: number of genes to keep after subsampling; set it to `False` to turn off subsampling

In [4]:
tenX_dataset = Dataset10X("neuron_9k")

Downloading file at data/10X/neuron_9k/filtered_gene_bc_matrices.tar.gz
Preprocessing dataset
Extracting tar file
Finished preprocessing dataset
Downsampling from 27998 to 3000 genes
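
The download log above shows where the archive lands on disk. As a rough sketch, the helper below validates a dataset name against the list given in this section and rebuilds that path. `local_archive_path` and its `kind` argument are hypothetical, and the `raw_gene_bc_matrices.tar.gz` file name for the `raw` variant is an assumption by analogy with the `filtered` name seen in the log.

```python
import posixpath

# Dataset names listed in this tutorial as supported by Dataset10X.
SUPPORTED_10X = (
    "frozen_pbmc_donor_a", "frozen_pbmc_donor_b", "frozen_pbmc_donor_c",
    "pbmc8k", "pbmc4k", "t_3k", "t_4k", "neuron_9k",
)


def local_archive_path(name, save_path="data/", kind="filtered"):
    """Rebuild the archive path seen in the log (file name for 'raw' is assumed)."""
    if name not in SUPPORTED_10X:
        raise ValueError("Unsupported 10x dataset: %r" % name)
    if kind not in ("filtered", "raw"):
        raise ValueError("kind must be 'filtered' or 'raw', got %r" % kind)
    # posixpath keeps forward slashes so the result matches the log on any OS.
    return posixpath.join(save_path, "10X", name, kind + "_gene_bc_matrices.tar.gz")


path = local_archive_path("neuron_9k")
print(path)
```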


## Built-In Datasets 

We've also implemented seven built-in datasets to make it easier to reproduce results from the scVI paper. 

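* **BRAIN LARGE**: 1.3 million mouse brain cells, spanning the cortex, hippocampus and subventricular zone, profiled with 10x Chromium;
* **CORTEX**: 3,005 mouse cortex cells profiled with the Smart-seq2 protocol, with the addition of UMI;
* **PBMC**: 12,039 human peripheral blood mononuclear cells profiled with 10x;
* **RETINA**: 27,499 mouse retinal bipolar neurons, profiled in two batches using the Drop-Seq technology;
* **HEMATO**: 4,016 cells from two batches that were profiled using in-drop;
* **CBMC**: 8,617 cord blood mononuclear cells profiled using 10x along with, for each cell, 13 well-characterized mononuclear antibodies;
* **BRAIN SMALL**: 9,128 mouse brain cells profiled using 10x.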

### Loading `BRAIN-LARGE` dataset

<font color='red'>Loading BRAIN-LARGE requires at least 32 GB memory!</font>

`BrainLargeDataset` consists of 1.3 million mouse brain cells, spanning the cortex, hippocampus and subventricular zone, and profiled with 10x Chromium. We use this dataset to demonstrate the scalability of scVI.

Reference: 10x genomics (2017). URL https://support.10xgenomics.com/single-cell-gene-expression/datasets. 

In [5]:
brain_large_dataset = BrainLargeDataset() 

Downloading file at data/genomics.h5
Preprocessing Brain Large data
720 genes subsampled
1306127 cells subsampled
Finished preprocessing data


### Loading `CORTEX` dataset
`CortexDataset` consists of 3,005 mouse cortex cells profiled with the Smart-seq2 protocol, with the addition of UMI. To facilitate comparison with other methods, we use a filtered set of 558 highly variable genes. The `CortexDataset` exhibits a clear high-level subpopulation structure, which has been inferred by the authors of the original publication using computational tools and annotated by inspection of specific genes or transcriptional programs. Similar levels of annotation are provided with the `PbmcDataset` and `RetinaDataset`.

Reference: Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science 347, 1138–1142 (2015). 

In [6]:
cortex_dataset = CortexDataset() 

Downloading file at data/expression.bin
Preprocessing Cortex data
Finished preprocessing Cortex data


### Loading `PBMC` dataset
`PbmcDataset` consists of 12,039 human peripheral blood mononuclear cells profiled with 10x.

Reference: Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049 (2017). 

In [2]:
pbmc_dataset = PbmcDataset() 

Downloading file at data/10X/pbmc8k/filtered_gene_bc_matrices.tar.gz
Preprocessing dataset
Extracting tar file
Finished preprocessing dataset
Downsampling from 33694 to 3000 genes
Downloading file at data/10X/pbmc4k/filtered_gene_bc_matrices.tar.gz
Preprocessing dataset
Extracting tar file
Finished preprocessing dataset
Downsampling from 33694 to 3000 genes
Keeping 2903 genes
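
The log above shows two 10x datasets (`pbmc8k` and `pbmc4k`) each being downsampled to 3000 genes, followed by "Keeping 2903 genes". One plausible reading, which the tutorial does not confirm, is that only genes present in both subsampled datasets are retained when the two are combined. A minimal sketch of such an intersection, with a hypothetical `shared_genes` helper and made-up gene lists:

```python
def shared_genes(*gene_lists):
    """Return the genes present in every list, in the order of the first list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [gene for gene in gene_lists[0] if gene in common]


# Toy gene lists standing in for the two subsampled datasets.
pbmc8k_genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
pbmc4k_genes = ["LYZ", "CD3E", "GNLY", "NKG7"]

kept = shared_genes(pbmc8k_genes, pbmc4k_genes)
print("Keeping %d genes" % len(kept))
```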


### Loading `RETINA` dataset 
`RetinaDataset` includes 27,499 mouse retinal bipolar neurons, profiled in two batches using the Drop-Seq technology.

Reference: Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323.e30 (2016).

In [8]:
retina_dataset = RetinaDataset()

Downloading file at data/retina.loom
Preprocessing dataset
Finished preprocessing dataset


### Loading `HEMATO` dataset 
`HematoDataset` includes 4,016 cells from two batches that were profiled using in-drop. This data provides a snapshot of hematopoietic progenitor cells differentiating into various lineages. We use this dataset as an example for cases where gene expression varies in a continuous fashion (along pseudo-temporal axes) rather than forming discrete subpopulations. 

Reference: Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

In [10]:
hemato_dataset = HematoDataset() 

Downloading data.zip
Downloading file at data/HEMATO/bBM.raw_umifm_counts.csv.gz
Preprocessing Hemato data
Finished preprocessing Hemato data


### Loading `CBMC` dataset
`CbmcDataset` includes 8,617 cord blood mononuclear cells profiled using 10x along with, for each cell, 13 well-characterized mononuclear antibodies. We used this dataset to analyze how the latent spaces inferred by dimensionality-reduction algorithms summarize protein marker abundance.

Reference: Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14, 865–868 (2017).

In [9]:
cbmc_dataset = CbmcDataset()

Downloading file at data/citeSeq/cbmc/cbmc_rna.csv.gz
Downloading file at data/citeSeq/cbmc/cbmc_adt.csv.gz
Downloading file at data/citeSeq/cbmc/cbmc_adt_centered.csv.gz
Preprocessing data
Selecting only HUMAN genes (20400 / 36280)
Finish preprocessing data
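
The "Selecting only HUMAN genes (20400 / 36280)" line indicates that the raw count table mixes human genes with genes from another species and that only the human ones are kept. A minimal sketch of such a filter follows, assuming that gene names carry a species prefix such as `HUMAN_`; the prefix, the example gene names, and the `keep_human_genes` helper are illustrative assumptions rather than `scvi` code.

```python
def keep_human_genes(gene_names, prefix="HUMAN_"):
    """Return the indices of genes whose name starts with the (assumed) species prefix."""
    kept = [i for i, name in enumerate(gene_names) if name.startswith(prefix)]
    print("Selecting only HUMAN genes (%d / %d)" % (len(kept), len(gene_names)))
    return kept


# Made-up gene names for illustration.
gene_names = ["HUMAN_CD3E", "MOUSE_Cd3e", "HUMAN_LYZ"]
human_index = keep_human_genes(gene_names)
human_names = [gene_names[i] for i in human_index]
```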


### Loading `BRAIN-SMALL` dataset
`BrainSmallDataset` consists of 9,128 mouse brain cells profiled using 10x. This dataset complements PBMC in our study of how zero abundance and quality control metrics correlate with our generative posterior parameters.

Reference: 

In [10]:
brain_small_dataset = BrainSmallDataset()

File data/10X/neuron_9k/filtered_gene_bc_matrices.tar.gz already downloaded
Preprocessing dataset
Finished preprocessing dataset
Downsampling from 27998 to 3000 genes
