# Scanpy on Dask with Jupyter - Zheng preprocessing recipe

This notebook runs Scanpy's `recipe_zheng17` function using Dask on the 1.3M neurons [dataset](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1M_neurons) from 10x Genomics. The data is stored in Zarr format on Google Cloud Storage (GCS).

Imports. Note that we import the regular Scanpy preprocessing API (`scanpy.api.pp`), but the version of Scanpy installed here is a fork with some [minor adjustments](https://github.com/tomwhite/scanpy/tree/dask-1.2.2) that make it work with Dask.

In [1]:
import anndata as ad
import dask.array as da
import gcsfs.mapping

from dask.distributed import Client
from scanpy.api.pp import recipe_zheng17

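Create a Dask client. Called with no arguments, `Client()` connects to the distributed scheduler configured for this environment (if none is configured, it starts a local cluster instead).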

In [2]:
client = Client()
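
To try this step without a cluster, a client can also be backed by an in-process local cluster. The following is a minimal sketch, assuming `dask.distributed` is installed; the worker and thread counts are arbitrary illustration values, not the configuration used for the run in this notebook.

```python
import dask.array as da
from dask.distributed import Client

# In-process local cluster: one worker with two threads.
# These numbers are illustrative assumptions, not the notebook's real setup.
client = Client(processes=False, n_workers=1, threads_per_worker=2)

# Once a client exists, Dask collections are computed on its scheduler.
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum().compute()

client.close()
```

Closing the client at the end releases the local workers; on a shared cluster the client would normally stay open for the rest of the notebook.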

Load the data from GCS in Zarr format as an AnnData object. We then replace the main matrix `X` with a Dask array loaded lazily from the same store (`da.from_zarr`), so that it can be processed in a distributed fashion.

In [3]:
gcs = gcsfs.GCSFileSystem('hca-scale', token='cloud')
store = gcsfs.mapping.GCSMap('ll-sc-data-bkup/10x/anndata_zarr_2000/10x.zarr', gcs=gcs)
adata = ad.read_zarr(store)
adata.X = da.from_zarr(store, component='X')

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


Run the Zheng recipe, then write out the `X` matrix to GCS in Zarr format.

In [4]:
%%time
recipe_zheng17(adata)
store_out = gcsfs.mapping.GCSMap('ll-sc-data-bkup/10x/anndata_zarr/10x-recipe-dask.zarr', gcs=gcs)
adata.X.to_zarr(store_out, overwrite=True)

CPU times: user 34.4 s, sys: 2.84 s, total: 37.2 s
Wall time: 13min 31s
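
The kind of array work the recipe performs on `X` can be sketched with plain Dask operations. This is a simplified illustration, not Scanpy's implementation: it covers only per-cell normalization, a log transform, and per-gene scaling, it skips the gene filtering steps, and the fixed `target` total is a hypothetical value chosen for the example.

```python
import dask.array as da
import numpy as np

# Toy count matrix: 4 cells (rows) x 3 genes (columns), in two row chunks.
counts = da.from_array(
    np.array(
        [
            [1.0, 0.0, 3.0],
            [2.0, 2.0, 0.0],
            [0.0, 5.0, 5.0],
            [4.0, 0.0, 0.0],
        ]
    ),
    chunks=(2, 3),
)

target = 10.0  # hypothetical per-cell total, an assumption for this sketch

# Normalize every cell so that its counts sum to the same total.
per_cell = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell * target

# Log-transform, then scale each gene to zero mean and unit variance.
logged = da.log1p(normalized)
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0)

# Nothing has been computed so far; compute() runs the whole graph.
result = scaled.compute()
```

On a toy array everything fits in one process, but the same expressions apply unchanged to a chunked array spread across many workers.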
