## Data import and export in Scarf

In [1]:
%load_ext autotime

import scarf
scarf.__version__

'0.7.2'

time: 814 ms


---
### 1) Fetch datasets from cloud repository

Scarf stores many single-cell datasets online on [OSF](https://osf.io/zeupv/). These datasets are stored in several formats, including MTX, 10x HDF5 and H5ad (anndata). The files can be downloaded using Scarf's `fetch_dataset` function.

To check which datasets are available for download, use the `show_available_datasets` function:

In [2]:
scarf.show_available_datasets()

baron_8K_pancreas_rnaseq
muraro_2K_pancreas_rnaseq
segerstolpe_2K_pancreas_rnaseq
xin_1K_pancreas_rnaseq
zheng_69K_pbmc_rnaseq
cao_2.1M_moca_rnaseq
zeisel_161K_nervous_rnaseq
hca_783K_blood_rnaseq
tenx_8K_pbmc_citeseq
tenx_1.3M_brain_rnaseq
kang_15K_pbmc_rnaseq
kang_14K_ifnb-pbmc_rnaseq
bastidas-ponce_4K_pancreas-d15_rnaseq
tenx_10K_pbmc_atacseq
saunders_110K_brain_rnaseq
cusanovich_81K_mouse_atacseq
tenx_5K_pbmc_rnaseq
time: 819 ms


**Naming format**: Datasets are named using this rule: \<author\>_\<number of cells\>_\<cell/tissue type or species\>_\<single-cell method>
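
The naming rule can be made concrete with a small parser. This is only an illustrative sketch: `parse_dataset_name` and `DatasetName` are hypothetical helpers, not part of Scarf, and the code assumes that the size field always ends in `K` or `M`, as it does for the names listed above.

```python
from typing import NamedTuple

# Multipliers for the size suffixes seen in the dataset names above.
_SUFFIX = {"K": 1_000, "M": 1_000_000}


class DatasetName(NamedTuple):
    author: str
    n_cells: int  # approximate cell count implied by the name
    sample: str   # cell/tissue type or species
    method: str   # single-cell method, e.g. 'rnaseq'


def parse_dataset_name(name: str) -> DatasetName:
    """Split '<author>_<cells>_<sample>_<method>' into its four parts."""
    author, size, sample, method = name.split("_")
    n_cells = round(float(size[:-1]) * _SUFFIX[size[-1].upper()])
    return DatasetName(author, n_cells, sample, method)


info = parse_dataset_name("cao_2.1M_moca_rnaseq")
print(info.author, info.n_cells, info.sample, info.method)
```

Hyphens are kept inside a field, so `bastidas-ponce_4K_pancreas-d15_rnaseq` still splits into exactly four parts.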

Now using any of these dataset names we can download the dataset of our choice:

In [3]:
# This dataset is in Cellranger (10x) HDF5 format.
scarf.fetch_dataset('tenx_10K_pbmc_atacseq', save_path='./scarf_datasets')

INFO: Download started...
INFO: Download finished! File saved here: ./scarf_datasets/tenx_10K_pbmc_atacseq/data.h5
time: 2.52 s


The dataset above is saved under the directory `scarf_datasets` in the current working directory. You can change the `save_path` parameter to save the data in a location of your choice. This dataset was downloaded in 10x HDF5 format. Let's download a few more datasets that are in different file formats.
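
To see what has been saved under `save_path` so far, the directory can be inspected with the standard library. The helper below, `list_downloads`, is a hypothetical convenience function and not part of Scarf. It assumes one sub-directory per dataset, as in the download log above.

```python
from pathlib import Path


def list_downloads(save_path: str) -> dict:
    """Map each directory under `save_path` to its sorted file names."""
    root = Path(save_path)
    if not root.is_dir():
        return {}
    return {
        d.name: sorted(f.name for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


for dataset, files in list_downloads("./scarf_datasets").items():
    print(dataset, files)
```

Any directory found under `save_path` is listed, so the result is empty if nothing has been downloaded yet.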

In [4]:
# This dataset is in MTX format along with barcodes and features TSV files.
scarf.fetch_dataset('xin_1K_pancreas_rnaseq', save_path='./scarf_datasets')

INFO: Download started...
INFO: Download finished! File saved here: ./scarf_datasets/xin_1K_pancreas_rnaseq/barcodes.tsv.gz
INFO: Download started...
INFO: Download finished! File saved here: ./scarf_datasets/xin_1K_pancreas_rnaseq/features.tsv.gz
INFO: Download started...
INFO: Download finished! File saved here: ./scarf_datasets/xin_1K_pancreas_rnaseq/matrix.mtx.gz
time: 5.1 s


In [5]:
# This dataset is in H5ad (anndata) format.
scarf.fetch_dataset('bastidas-ponce_4K_pancreas-d15_rnaseq', save_path='./scarf_datasets')

INFO: Download started...
INFO: Download finished! File saved here: ./scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad
time: 2.15 s


---
### 2) Conversion to Scarf's Zarr format file

Scarf stores data as dense, compressed chunks in the Zarr file format. The `scarf.readers` and `scarf.writers` modules contain classes that read many different file formats and convert them to Zarr. Reader and writer classes often come in complementary pairs. Let's explore them below.
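
The idea of dense, compressed chunks can be illustrated with a toy example. This is a conceptual sketch only, using `zlib` and `array` from the standard library. It is not how Scarf or Zarr are implemented, and the matrix, chunk size and helper names are made up for illustration.

```python
import zlib
from array import array

N_ROWS, N_COLS, CHUNK_ROWS = 6, 4, 2  # toy sizes chosen for illustration

# A small dense, row-major count matrix that is mostly zeros.
matrix = [
    [(r * c) % 3 if (r + c) % 2 else 0 for c in range(N_COLS)]
    for r in range(N_ROWS)
]


def to_chunks(rows, chunk_rows):
    """Compress consecutive blocks of rows independently of each other."""
    chunks = []
    for start in range(0, len(rows), chunk_rows):
        block = rows[start:start + chunk_rows]
        flat = array("I", [value for row in block for value in row])
        chunks.append(zlib.compress(flat.tobytes()))
    return chunks


def read_row(chunks, row_index, chunk_rows, n_cols):
    """Decompress only the chunk holding `row_index` and slice the row out."""
    flat = array("I")
    flat.frombytes(zlib.decompress(chunks[row_index // chunk_rows]))
    offset = (row_index % chunk_rows) * n_cols
    return list(flat[offset:offset + n_cols])


chunks = to_chunks(matrix, CHUNK_ROWS)
print(len(chunks), read_row(chunks, 3, CHUNK_ROWS, N_COLS) == matrix[3])
```

Because each chunk is compressed on its own, reading one row only requires decompressing the chunk that contains it rather than the whole matrix.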

#### From 10x's HDF5 file format


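The `scarf.CrH5Reader` class reads HDF5 files generated by the Cellranger pipeline. Once read in, the data can be dumped into Zarr format using the `scarf.CrToZarr` class: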

In [6]:
# Change file_type to 'rna' in case of sc-RNA-seq or CITE-Seq
reader = scarf.CrH5Reader('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', file_type='atac')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/pbmc_atac.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()

100%|██████████| 10/10 [00:11<00:00,  1.18s/it]

time: 12.1 s





#### From 10x's (Cellranger) MTX file format

The `scarf.CrDirReader` class reads MTX files generated by the Cellranger pipeline. `CrDirReader` stands for 'Cellranger directory reader'. Once read in, the data can be dumped into Zarr format using the `scarf.CrToZarr` class. The following example shows how to do this conversion:

In [7]:
# Note that here we only give the name of the directory containing the MTX file (along with the barcodes and features files)
reader = scarf.CrDirReader('scarf_datasets/xin_1K_pancreas_rnaseq', file_type='rna')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/xin_1K.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()



100%|██████████| 2/2 [00:02<00:00,  1.37s/it]

time: 2.92 s
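
Before pointing a reader at an MTX directory, it can help to confirm that the three files are present and to peek at the matrix dimensions. The sketch below uses only the standard library. `missing_files` and `mtx_dimensions` are hypothetical helpers, not Scarf functions, and the file names are taken from the download log for `xin_1K_pancreas_rnaseq` (other pipelines may name them differently).

```python
import gzip
from pathlib import Path

# File names as they appear in the download log for 'xin_1K_pancreas_rnaseq'.
EXPECTED_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_files(directory: str) -> list:
    """Return the expected files that are absent from `directory`."""
    return [fn for fn in EXPECTED_FILES if not (Path(directory) / fn).is_file()]


def mtx_dimensions(directory: str) -> tuple:
    """Read (rows, columns, non-zero entries) from the Matrix Market size line."""
    with gzip.open(Path(directory) / "matrix.mtx.gz", "rt") as handle:
        for line in handle:
            if not line.startswith("%"):  # skip the banner and comment lines
                n_rows, n_cols, n_entries = (int(x) for x in line.split())
                return n_rows, n_cols, n_entries
    raise ValueError("no size line found in matrix.mtx.gz")


data_dir = "scarf_datasets/xin_1K_pancreas_rnaseq"
if missing_files(data_dir):
    print("Missing:", missing_files(data_dir))
else:
    print("rows, columns, non-zero entries:", mtx_dimensions(data_dir))
```

In the Matrix Market coordinate format, the first line that is not a comment holds the number of rows, columns and stored entries, so the dimensions can be read without loading the matrix itself.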





#### From Anndata H5ad file format

The `scarf.H5adReader` class reads anndata H5ad files. Once read in, the data can be dumped into Zarr format using the `scarf.H5adToZarr` class. The following example shows how to do this conversion:

In [8]:
# Note that here we give the path to the H5ad file itself
reader = scarf.H5adReader('scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad', 
                          cell_ids_key = 'index',               # Where Cell/barcode ids are saved under 'obs' slot
                          feature_ids_key = 'index',            # Where gene ids are saved under 'var' slot
                          feature_name_key = 'gene_short_name')  # Where gene names are saved under 'var' slot

writer = scarf.H5adToZarr(reader, zarr_fn='scarf_datasets/differentiating_pancreatic_cells.zarr') # change value of `zarr_fn` to your choice of filename and path
writer.dump()

INFO: `X` slot in H5ad file has unequal sized child groups
INFO: No value provided for assay names. Will use default value: 'RNA'


Reading attributes from group obs: 100%|██████████| 5/5 [00:00<00:00, 197.63it/s]




Reading attributes from group var: 100%|██████████| 2/2 [00:00<00:00, 74.68it/s]
100%|██████████| 4/4 [00:01<00:00,  3.33it/s]

time: 1.34 s





Conversion from the **Loom** file format is also supported through `scarf.LoomReader` and `scarf.LoomToZarr`, which can be used in a similar fashion to the other readers and writers.

---
### 3) Exporting data from Zarr file format

#### To Cellranger (10x) MTX format

In [9]:
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')

INFO: Setting assay RNA to assay type: RNAassay
INFO: (RNA) Computing nCells and dropOuts
[########################################] | 100% Completed |  0.4s
INFO: (RNA) Computing nCounts
[########################################] | 100% Completed |  0.4s
INFO: (RNA) Computing nFeatures
[########################################] | 100% Completed |  0.4s
INFO: Computing percentage of RNA_percentMito
[########################################] | 100% Completed |  0.3s
INFO: Computing percentage of RNA_percentRibo
[########################################] | 100% Completed |  0.3s
time: 2.13 s


In [10]:
scarf.writers.to_mtx(ds.RNA, mtx_directory='scarf_datasets/diff_pancreas')

100%|██████████| 4/4 [00:09<00:00,  2.41s/it]

time: 9.78 s





#### To H5ad format

Conversion to H5ad is the preferred mode as it runs much faster and produces smaller files. Updates are underway to include all the data from the Zarr file, such as UMAP, PCA and graph, in the anndata output.

In [11]:
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')

time: 12.4 ms


In [12]:
scarf.writers.to_h5ad(ds.RNA, h5ad_filename='scarf_datasets/diff_pancreas.h5ad')

100%|██████████| 4/4 [00:02<00:00,  1.65it/s]

time: 2.49 s





Finally, let's have a look at all the files that now exist in our previously empty `scarf_datasets` directory:

In [13]:
ls -d scarf_datasets/*

scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/
scarf_datasets/differentiating_pancreatic_cells.zarr/
scarf_datasets/diff_pancreas/
scarf_datasets/diff_pancreas.h5ad
scarf_datasets/pbmc_atac.zarr/
scarf_datasets/tenx_10K_pbmc_atacseq/
scarf_datasets/xin_1K_pancreas_rnaseq/
scarf_datasets/xin_1K.zarr/
time: 136 ms


---
That is all for this vignette.