## Data import and export in Scarf

In [1]:
%load_ext autotime

import scarf
scarf.__version__

'0.11.0'

time: 1.78 s (started: 2021-07-31 12:03:23 +02:00)


---
### 1) Fetch datasets from cloud repository

Scarf hosts many single-cell datasets online on [OSF](https://osf.io/zeupv/). The datasets are stored in several formats, including MTX, 10x HDF5 and H5ad (anndata). These files can be downloaded with Scarf's `fetch_dataset` function.

To check which datasets are available to download, use the `show_available_datasets` function:

In [2]:
scarf.show_available_datasets()

baron_8K_pancreas_rnaseq
bastidas-ponce_4K_pancreas-d15_rnaseq
cao_2.1M_moca_rnaseq
cao_4.9M_fetal_rnaseq
cusanovich_81K_mouse_atacseq
domcke_721K_fetal_atacseq
hca_783K_blood_rnaseq
kang_14K_ifnb-pbmc_rnaseq
kang_15K_pbmc_rnaseq
muraro_2K_pancreas_rnaseq
saunders_110K_brain_rnaseq
segerstolpe_2K_pancreas_rnaseq
tenx_1.3M_brain_rnaseq
tenx_10K_pbmc_atacseq
tenx_5K_pbmc_rnaseq
tenx_8K_pbmc_citeseq
xin_1K_pancreas_rnaseq
zeisel_161K_nervous_rnaseq
zheng_69K_pbmc_rnaseq
time: 4.61 s (started: 2021-07-31 12:03:25 +02:00)


**Naming format**: Datasets are named using this rule: \<author\>\_\<number of cells\>\_\<cell/tissue type or species\>\_\<single-cell method\>
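
The naming rule can be made concrete with a small parser. This is only a sketch: the helper `parse_dataset_name` and its field names are illustrative and not part of Scarf, and it assumes that the `K` and `M` suffixes stand for thousands and millions of cells.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    author: str
    n_cells: int  # approximate count, expanded from the K/M suffix (assumption)
    sample: str   # cell/tissue type or species
    method: str   # single-cell method, e.g. 'rnaseq'


def parse_dataset_name(name: str) -> DatasetName:
    """Split '<author>_<cells>_<sample>_<method>' into its four fields."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields in {name!r}")
    author, cells, sample, method = parts
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = cells[-1].upper()
    if suffix not in multipliers:
        raise ValueError(f"unknown cell-count suffix in {cells!r}")
    n_cells = round(float(cells[:-1]) * multipliers[suffix])
    return DatasetName(author, n_cells, sample, method)
```

Hyphens stay inside a field, so `parse_dataset_name('bastidas-ponce_4K_pancreas-d15_rnaseq')` keeps `bastidas-ponce` as the author and `pancreas-d15` as the sample.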

Using any of these dataset names, we can download the dataset of our choice:

In [3]:
# This dataset is in Cellranger (10x) HDF5 format.
scarf.fetch_dataset('tenx_10K_pbmc_atacseq', save_path='./scarf_datasets')

INFO: Download finished! File saved here: C:\Users\parashar\Desktop\scarf_vignettes\scarf_datasets\tenx_10K_pbmc_atacseq\data.h5
time: 6.45 s (started: 2021-07-31 12:03:30 +02:00)


The dataset above is saved under the directory `scarf_datasets` in the current working directory. Change the `save_path` parameter to save the data in a location of your choice. This dataset was downloaded in 10x's HDF5 format. Let's download a few more datasets that are in different file formats.

In [4]:
# This dataset is in MTX format along with barcodes and features TSV files.
scarf.fetch_dataset('xin_1K_pancreas_rnaseq', save_path='./scarf_datasets')

INFO: Download finished! File saved here: C:\Users\parashar\Desktop\scarf_vignettes\scarf_datasets\xin_1K_pancreas_rnaseq\barcodes.tsv.gz

INFO: Download finished! File saved here: C:\Users\parashar\Desktop\scarf_vignettes\scarf_datasets\xin_1K_pancreas_rnaseq\features.tsv.gz

INFO: Download finished! File saved here: C:\Users\parashar\Desktop\scarf_vignettes\scarf_datasets\xin_1K_pancreas_rnaseq\matrix.mtx.gz
time: 9.17 s (started: 2021-07-31 12:03:36 +02:00)


In [5]:
# This dataset is in H5ad (anndata) format.
scarf.fetch_dataset('bastidas-ponce_4K_pancreas-d15_rnaseq', save_path='./scarf_datasets')

INFO: Download finished! File saved here: C:\Users\parashar\Desktop\scarf_vignettes\scarf_datasets\bastidas-ponce_4K_pancreas-d15_rnaseq\data.h5ad
time: 4.81 s (started: 2021-07-31 12:03:45 +02:00)
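
The log messages above suggest that each dataset lands in its own folder, `<save_path>/<dataset name>/`. Assuming that layout holds, a small wrapper can skip the download when the folder already contains files. The helper name `fetch_if_missing` is hypothetical; the only Scarf call it makes is `scarf.fetch_dataset`, used as in the cells above.

```python
from pathlib import Path


def fetch_if_missing(dataset_name: str, save_path: str = "./scarf_datasets") -> Path:
    """Download a dataset only when its folder is absent or empty."""
    # Assumed layout, inferred from the log output: <save_path>/<dataset_name>/
    target = Path(save_path) / dataset_name
    if target.is_dir() and any(target.iterdir()):
        return target  # files already on disk, skip the download
    import scarf  # imported lazily so the check above works without Scarf
    scarf.fetch_dataset(dataset_name, save_path=save_path)
    return target
```

Note that this check only looks for a non-empty folder, so an interrupted download would be treated as complete.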


---
### 2) Conversion to Scarf's Zarr format file

Scarf stores data as dense, compressed chunks in the Zarr file format. The `scarf.readers` and `scarf.writers` modules contain classes that read many different file formats and convert them to Zarr. Reader and writer classes often come in complementary pairs. Let's explore them below.

#### From 10x's HDF5 file format

In [6]:
# Change file_type to 'rna' in case of sc-RNA-seq or CITE-Seq
reader = scarf.CrH5Reader('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', file_type='atac')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/pbmc_atac.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()


time: 26.2 s (started: 2021-07-31 12:03:50 +02:00)


#### From 10x's (Cellranger) MTX file format

The `scarf.CrDirReader` class reads MTX files generated by the Cellranger pipeline. `CrDirReader` stands for 'Cellranger directory reader'. Once read in, the data can be dumped into Zarr format using the `scarf.CrToZarr` class. The following example shows this conversion:

In [7]:
# Note: we pass only the directory that contains the MTX file (along with the barcodes and features files)
reader = scarf.CrDirReader('scarf_datasets/xin_1K_pancreas_rnaseq', file_type='rna')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/xin_1K.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()




time: 9.22 s (started: 2021-07-31 12:04:16 +02:00)
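
The two Cellranger conversions above differ only in the reader class, so they can be wrapped in one helper. This is a minimal sketch: `pick_cellranger_reader` and `cellranger_to_zarr` are hypothetical names, and the Scarf calls are limited to the ones already shown in this vignette (`CrH5Reader`, `CrDirReader`, `CrToZarr` and `dump`).

```python
from pathlib import Path


def pick_cellranger_reader(path: str) -> str:
    """Return the name of the Scarf reader class that matches `path`."""
    p = Path(path)
    if p.is_dir():
        return "CrDirReader"  # directory with MTX, barcodes and features files
    if p.suffix == ".h5":
        return "CrH5Reader"   # single Cellranger HDF5 file
    raise ValueError(f"not a Cellranger directory or .h5 file: {path!r}")


def cellranger_to_zarr(path: str, zarr_fn: str, file_type: str = "rna") -> None:
    """Convert Cellranger output to Zarr using the calls from this vignette."""
    import scarf  # lazy import: only needed when a conversion actually runs
    reader_cls = getattr(scarf, pick_cellranger_reader(path))
    reader = reader_cls(path, file_type=file_type)
    scarf.CrToZarr(reader, zarr_fn=zarr_fn).dump()
```

For example, `cellranger_to_zarr('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', 'scarf_datasets/pbmc_atac.zarr', file_type='atac')` would repeat the HDF5 conversion shown earlier.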


#### From Anndata H5ad file format

In [8]:
# Here we pass the path to the H5ad file itself
reader = scarf.H5adReader('scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad',
                          cell_ids_key='index',                 # where cell/barcode ids are saved under the 'obs' slot
                          feature_ids_key='index',              # where gene ids are saved under the 'var' slot
                          feature_name_key='gene_short_name')   # where gene names are saved under the 'var' slot

writer = scarf.H5adToZarr(reader, zarr_fn='scarf_datasets/differentiating_pancreatic_cells.zarr') # change value of `zarr_fn` to your choice of filename and path
writer.dump()

INFO: No value provided for assay names. Will use default value: 'RNA'

Reading attributes from group obs

Reading attributes from group var

time: 4.25 s (started: 2021-07-31 12:04:26 +02:00)


Conversion from the [Loom](https://loompy.org/) file format is also supported through `scarf.LoomReader` and `scarf.LoomToZarr`, which are used in the same way as the other readers and writers.

---
### 3) Exporting data from Zarr file format

#### To Cellranger (10x) MTX format

In [9]:
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')

(RNA) Computing nCells and dropOuts

(RNA) Computing nCounts

(RNA) Computing nFeatures

(RNA) Computing RNA_percentMito

(RNA) Computing RNA_percentRibo

time: 4.02 s (started: 2021-07-31 12:04:30 +02:00)


In [10]:
scarf.writers.to_mtx(ds.RNA, mtx_directory='scarf_datasets/diff_pancreas')


time: 29.4 s (started: 2021-07-31 12:04:34 +02:00)


#### To H5ad format

Exporting to H5ad is preferred over MTX: it runs much faster (5.45 s versus 29.4 s for the same dataset in this vignette) and produces smaller files. Work is underway to also export the remaining data in the Zarr file, such as UMAP, PCA and the graph, into anndata.

In [11]:
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')

time: 62 ms (started: 2021-07-31 12:05:03 +02:00)


In [12]:
scarf.writers.to_h5ad(ds.RNA, h5ad_filename='scarf_datasets/diff_pancreas.h5ad')


time: 5.45 s (started: 2021-07-31 12:05:04 +02:00)
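
To check the file-size claim on your own data, the on-disk footprint of the two exports can be compared with the standard library alone. The helper `footprint_bytes` below is a hypothetical utility, not part of Scarf; it treats a single file and a directory of files the same way.

```python
from pathlib import Path


def footprint_bytes(path: str) -> int:
    """Total size in bytes of a file, or of all files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    # Directory (e.g. an MTX export): add up every file found recursively
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
```

Calling it on `'scarf_datasets/diff_pancreas'` and on `'scarf_datasets/diff_pancreas.h5ad'` gives the two sizes to compare.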


---
That is all for this vignette.