# One-off preparation and download of the database (optional)

Working from a local lamin database is recommended in several situations:

1. you want to do some dataset-level preprocessing (PCA, neighbors, clustering, re-annotation, harmonization of labels, etc.)
2. you want much faster loading times during large training runs
3. you work on compute nodes that might not have internet access

In [1]:
# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader

💡 found cached instance metadata: /home/ml4ig1/.lamin/instance--jkobject--scdataloader.env
💡 loaded instance: jkobject/scdataloader
💡 loaded instance: jkobject/scdataloader


In [1]:
import bionty as bt
import lamindb as ln

from scdataloader import utils

%load_ext autoreload
%autoreload 2

→ connected lamindb: jkobject/scprint


## Load some known ontology names

First, if you use a local instance, you will need to populate your ontologies. This means loading all the elements from the ontological references and building the hierarchical tree.

One can add everything by keeping the default `None` value for the `ontology` argument, but this will take a long time.

Instead, we load only the ontologies we need, using the ontology names already used in cellxgene.

In [None]:
utils.populate_my_ontology()

❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
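
To make the "recursing through parents" step more concrete, here is a minimal, library-free sketch of walking up a term hierarchy. The term IDs and the `PARENTS` mapping are made up for illustration, and this is not how `populate_my_ontology` or bionty are implemented; it only shows the idea of collecting every ancestor of a term.

```python
# Toy child -> parents mapping; the IDs are hypothetical, not real ontology terms.
PARENTS = {
    "TOY:0003": ["TOY:0002"],
    "TOY:0002": ["TOY:0001"],
    "TOY:0004": ["TOY:0001"],
    "TOY:0001": [],
}


def ancestors(term, parents=PARENTS):
    """Return every ancestor of `term` by recursing through its parents."""
    found = set()
    for parent in parents.get(term, []):
        found.add(parent)
        found |= ancestors(parent, parents)
    return found


print(sorted(ancestors("TOY:0003")))
```

Doing this walk term by term is what makes the first population slow compared with a bulk save, which is consistent with the warning printed above.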


## Directly download a lamin database

One can either directly download a lamin database as is (here the cellxgene database, as an example), or preprocess each file while downloading it.

In [5]:
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())

[Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDAQ', version='2023-07-25', is_latest=False, name='cellxgene-census', hash='pEJ9uvIeTLvHkZW2TBT5', visibility=1, created_by_id=1, transform_id=18, run_id=23, updated_at='2024-01-30 09:06:05 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=True, name='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC')]
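
The listing above shows three versions of the same collection, only one of which is flagged `is_latest`. As a plain-Python sketch of the selection logic (the helper `pick_version` is hypothetical and only mimics the "exactly one match" behaviour one would expect from a call such as `.one()`; it does not query lamindb), using the `uid`, `version` and `is_latest` values copied from that output:

```python
# (uid, version, is_latest) copied from the listing above.
records = [
    {"uid": "dMyEX3NTfKOEYXyMu591", "version": "2023-12-15", "is_latest": False},
    {"uid": "dMyEX3NTfKOEYXyMKDAQ", "version": "2023-07-25", "is_latest": False},
    {"uid": "dMyEX3NTfKOEYXyMKDD7", "version": "2024-07-01", "is_latest": True},
]


def pick_version(records, version=None):
    """Return the record with the given version, or the one flagged latest."""
    if version is None:
        matches = [r for r in records if r["is_latest"]]
    else:
        matches = [r for r in records if r["version"] == version]
    # Fail loudly instead of silently picking one of several candidates.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {len(matches)}")
    return matches[0]


print(pick_version(records)["version"])
print(pick_version(records, version="2023-12-15")["uid"])
```

Pinning an explicit version, as the next cell does, keeps a training run reproducible even after a newer release is added.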

In [7]:
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())

! no run & transform get linked, consider calling ln.context.track()


(Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 1113)

In [None]:
# you can use it as is

In [None]:
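# or load it locally like so
# NOTE: `lb` is not defined in this notebook (only `bionty as bt` is imported); check the signature of `load_dataset_local` in your scdataloader version
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0, 2))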
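
The `only=(0, 2)` argument restricts the download to a subset of the files. Assuming it is interpreted as a `(start, stop)` slice over the collection's artifacts (an assumption worth checking against the `load_dataset_local` source of your installed version), the selection would behave like this plain-Python sketch. `select_subset` and the file names are hypothetical.

```python
def select_subset(items, only=None):
    """Return all items, or the half-open slice items[only[0]:only[1]]."""
    if only is None:
        return list(items)
    start, stop = only
    return list(items[start:stop])


# Hypothetical file names standing in for the artifacts of a collection.
files = [f"file_{i}.h5ad" for i in range(5)]
print(select_subset(files, only=(0, 2)))
```

Under that assumption, a small slice is a cheap way to test the whole setup before committing to a full download.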

## Or Preprocessing + Download

In this case we use a custom function that applies preprocessing to each file in the database after downloading it.

In [4]:
from scdataloader.preprocess import (
    LaminPreprocessor,
    additional_postprocess,
    additional_preprocess,
)

In [6]:
DESCRIPTION='preprocessed by scDataLoader'

In [7]:
# Here we also add some additional preprocessing (run at the beginning of the preprocessing function)
# and postprocessing (run at the end of the preprocessing function).
# This serves as an example of the flexibility of the function.
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
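
The ordering described in the comment (an optional hook first, the main preprocessing next, an optional hook last) can be sketched without any dependency. This is only an illustration of the hook pattern under that assumption, not the actual `LaminPreprocessor` code; `make_preprocessor`, `core`, `pre` and `post` are hypothetical names.

```python
def make_preprocessor(core, additional_preprocess=None, additional_postprocess=None):
    """Wrap `core` so that the optional hooks run before and after it."""

    def run(data):
        if additional_preprocess is not None:
            data = additional_preprocess(data)
        data = core(data)
        if additional_postprocess is not None:
            data = additional_postprocess(data)
        return data

    return run


# Each step appends its name, so the result records the execution order.
def pre(log):
    return log + ["additional_preprocess"]


def core(log):
    return log + ["core"]


def post(log):
    return log + ["additional_postprocess"]


run = make_preprocessor(core, additional_preprocess=pre, additional_postprocess=post)
print(run([]))
```

Keeping the hooks optional means the default pipeline is unchanged when no hook is passed.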


In [16]:
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")

0
Artifact(uid='Mgilie8RUip2slElQoDx', key='cell-census/2023-12-15/h5ads/77044335-0ac5-4406-9b3f-8cdd3656d67b.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Pons (Pn) - afferent nuclei of cranial nerves in pons - PnAN', version='2023-12-15', size=131533988, hash='QCaNEZY9apeRmazmwGXAWg-16', hash_type='md5-n', n_observations=23349, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
AnnData object with n_obs × n_vars = 23349 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_



❗ no run & transform get linked, consider passing a `run` or calling ln.track()


... storing 'dpt_group' as categorical
... storing 'symbol' as categorical
... storing 'ncbi_gene_ids' as categorical
... storing 'biotype' as categorical
... storing 'description' as categorical
... storing 'synonyms' as categorical
... storing 'organism' as categorical


1
Artifact(uid='iAZPSOBKLpaK7lqyYp9O', key='cell-census/2023-12-15/h5ads/2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Amygdaloid complex (AMY) - Corticomedial nuclear group (CMN) - anterior cortical nucleus - CoA', version='2023-12-15', size=169529860, hash='WuShpnxfduKWKn23oYUCTA-21', hash_type='md5-n', n_observations=10778, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
... downloading 2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad: 100.0%
AnnData object with n_obs × n_vars = 10778 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'sup



KeyboardInterrupt: 
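
Because a long run like this one can be interrupted or fail on individual files, it helps to be able to resume from a given position, which is what the `start_at` argument is used for above. The sketch below shows one way such a resumable loop could look, assuming `start_at` is the index of the first file to process; `process_all` and `flaky` are hypothetical and do not reproduce the scDataLoader implementation.

```python
def process_all(items, process, start_at=0):
    """Process items from index `start_at`, returning (done, failed) index lists."""
    done, failed = [], []
    for i, item in enumerate(items):
        if i < start_at:
            continue  # already handled in a previous run
        try:
            process(item)
        except Exception as err:  # KeyboardInterrupt is not caught and still stops the loop
            print(f"item {i} failed: {err}")
            failed.append(i)
        else:
            done.append(i)
    return done, failed


def flaky(name):
    """Stand-in for a preprocessing step that fails on one file."""
    if name == "b":
        raise ValueError("bad file")


done, failed = process_all(["a", "b", "c", "d"], flaky, start_at=1)
print(done, failed)
```

Recording the failed indices separately makes it easy to retry only those files later.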

In [17]:
# number of files processed so far
len(ln.Artifact.filter(version='2', description=DESCRIPTION))

1

In [18]:
# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs

| | roi | organism_ontology_term_id | disease_ontology_term_id | self_reported_ethnicity_ontology_term_id | assay_ontology_term_id | sex_ontology_term_id | development_stage_ontology_term_id | donor_id | suspension_type | dissection | ... | total_counts_hb | log1p_total_counts_hb | pct_counts_hb | outlier | mt_outlier | leiden_3 | leiden_2 | leiden_1 | dpt_group | heat_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 73b14343-9071-4da1-884d-52f04c781a44 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 3.0 | 1.386294 | 0.023204 | True | False | 13 | 11 | 10 | | |
| 91e3096c-9a19-4827-8050-9e08fe22f9e0 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 5 | 4 | 3 | | |
| 7bd7c1f8-5080-4ea4-87ff-4c02bf641059 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 2.0 | 1.098612 | 0.016004 | True | False | 19 | 17 | 6 | | |
| 47d8e4b5-daf3-4b3f-bdaa-c877fce88f08 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 3.0 | 1.386294 | 0.008974 | True | True | 24 | 20 | 16 | | |
| 5bc7626d-4e45-4bf1-8818-922502760364 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | True | False | 34 | 27 | 22 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| f266c95f-e602-416b-9f3e-c8f081352e79 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 2.0 | 1.098612 | 0.032889 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.048092 |
| 02530a0a-bd8b-40ff-b3ab-18bbbac42eda | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 1.0 | 0.693147 | 0.011792 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.049665 |
| a1840018-61f1-4345-b665-90ce5ad7936d | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.049922 |
| 02c6c139-4346-443b-8209-9ab51d80595c | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.056045 |


In [20]:
# I need to rebuild the collection because preprocessing failed for some files and had to be restarted partway through (see the `start_at` argument in the `do_preprocess` call above)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()

1