# Tutorial: Accessing single-cell datasets in `laminlabs/hubmap`

Here, we show how the HubMAP instance is structured and how datasets can be queried and accessed.

HubMAP groups several 'data products', which are the individual raw datasets, into higher-level 'datasets'.
For example, the dataset [HBM983.LKMP.544](https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e) has the following data products:

1. [raw_expr.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad)
2. [expr.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad)
3. [secondary_analysis.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/secondary_analysis.h5ad)
4. [scvelo_annotated.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/scvelo_annotated.h5ad)

The [laminlabs/hubmap](https://lamin.ai/laminlabs/hubmap) instance registers these data products as `ln.Artifact` records that jointly form a `ln.Collection`.

In [2]:
import lamindb as ln

assert ln.setup.settings.instance.slug == "laminlabs/hubmap"

ln.track()

→ connected lamindb: laminlabs/hubmap
→ created Transform('TMDTNYfmBzK10000'), started new Run('l8TeOLsI...') at 2025-02-25 12:57:28 UTC
→ notebook imports: bionty==1.1.0 lamindb==1.1.0


## Getting HubMAP datasets and data products

The `key` attribute of an `ln.Artifact` or `ln.Collection` corresponds to the identifying part of the respective HubMAP URL.
For example, the ID at the end of the URL https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e is the `key` of the corresponding collection:

In [3]:
small_intenstine_collection = ln.Collection.get(key="20ee458e5ee361717b68ca72caf6044e")
small_intenstine_collection

Collection(uid='QjQSiso1qPlnX6iX0000', is_latest=True, key='20ee458e5ee361717b68ca72caf6044e', description='RNAseq data from the small intestine of a 67.0-year-old white female', hash='jF6aG3Nd4qQHBvY8v8Q8dg', created_by_id=3, space_id=1, run_id=11, created_at=2025-01-28 14:17:01 UTC)
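If you start from a portal URL rather than a bare ID, the key can be extracted with the standard library. The following is a minimal sketch; the helper name `collection_key_from_portal_url` is hypothetical and not part of lamindb, and it assumes that the ID is always the last path segment of the portal URL, as in the example above.

```python
from urllib.parse import urlparse


def collection_key_from_portal_url(url: str) -> str:
    """Return the last path segment of a HubMAP portal dataset URL.

    Hypothetical helper: assumes the dataset ID is the final path segment.
    """
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


portal_url = "https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e"
key = collection_key_from_portal_url(portal_url)
print(key)
```

The resulting string can then be passed as `key` to `ln.Collection.get`, as in the cell above.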

We can list all associated data products as a data frame:

In [4]:
small_intenstine_collection.artifacts.all().df()

| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | AzqCWQAKLMV3iTMA0000 | f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad | RNAseq data from the small intestine of a 67.0... | .h5ad | | AnnData | 67867992 | of_TeLP6cet2JBj3o_kZmQ | | 6000 | md5-etag | False | False | 1 | 2 | | | True | 11 | 2025-01-28 14:16:35.355582+00:00 | 3 | | 1 |
| 29 | fWN781TxuZibkBOR0000 | f6eb890063d13698feb11d39fa61e45a/secondary_ana... | RNAseq data from the small intestine of a 67.0... | .h5ad | | AnnData | 888111371 | ian3P5CN68AAvoDMC6sZLw | | 5956 | md5-etag | False | False | 1 | 2 | | | True | 11 | 2025-01-28 14:16:39.348589+00:00 | 3 | | 1 |
| 30 | enXVzwjw4voS8UCb0000 | f6eb890063d13698feb11d39fa61e45a/expr.h5ad | RNAseq data from the small intestine of a 67.0... | .h5ad | | AnnData | 139737320 | kR476u81gwXI6rEbXzNBvQ | | 6000 | md5-etag | False | False | 1 | 2 | | | True | 11 | 2025-01-28 14:16:43.385980+00:00 | 3 | | 1 |


Note that the `key` of each of these three artifacts corresponds to the path of its assets URL.
For example, https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad is the direct URL to the `expr.h5ad` data product.
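The mapping between an artifact `key` and its assets URL can be sketched in a few lines. The helper names below are hypothetical, and the base URL is taken from the links in this tutorial; this is an illustration of the naming scheme, not a lamindb API.

```python
# Base URL as it appears in the data product links of this tutorial.
ASSETS_BASE_URL = "https://assets.hubmapconsortium.org"


def asset_url_from_key(key: str) -> str:
    """Build the direct assets URL for an artifact key (hypothetical helper)."""
    return f"{ASSETS_BASE_URL}/{key.lstrip('/')}"


def key_from_asset_url(url: str) -> str:
    """Recover the artifact key from an assets URL (hypothetical helper)."""
    prefix = ASSETS_BASE_URL + "/"
    if not url.startswith(prefix):
        raise ValueError(f"not a HubMAP assets URL: {url}")
    return url[len(prefix):]


expr_key = "f6eb890063d13698feb11d39fa61e45a/expr.h5ad"
expr_url = asset_url_from_key(expr_key)
print(expr_url)
```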

Artifacts can be directly loaded:

In [5]:
small_intenstine_af = (
    small_intenstine_collection.artifacts.filter(key__icontains="raw_expr.h5ad")
    .distinct()
    .one()
)
adata = small_intenstine_af.load()

In [6]:
adata

AnnData object with n_obs × n_vars = 6000 × 98000
    var: 'hugo_symbol'
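One caveat when adapting the filter above to other data products: a substring lookup such as `key__icontains` matches any key that contains the search string, and the query above expects exactly one match. The following plain-Python sketch mimics the lookup on two keys from the table above to show the difference between substring matching and matching the full file name. The helper names are hypothetical.

```python
keys = [
    "f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad",
    "f6eb890063d13698feb11d39fa61e45a/expr.h5ad",
]


def match_icontains(keys, needle):
    """Mimic a case-insensitive substring lookup like key__icontains."""
    return [k for k in keys if needle.lower() in k.lower()]


def match_filename(keys, filename):
    """Match only keys whose final path segment equals the file name."""
    return [k for k in keys if k.rsplit("/", 1)[-1] == filename]


# "raw_expr.h5ad" is contained in exactly one key ...
print(match_icontains(keys, "raw_expr.h5ad"))
# ... but "expr.h5ad" is a substring of both keys.
print(match_icontains(keys, "expr.h5ad"))
# Comparing the full file name selects a single key.
print(match_filename(keys, "expr.h5ad"))
```

With lamindb, an analogous query would presumably use a Django-style suffix lookup such as `key__endswith="/expr.h5ad"`; this is an assumption based on the lookup syntax used above and is not verified against the instance here.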

## Querying single-cell datasets

Currently, only the `Artifacts` of the `raw_expr.h5ad` data products are labeled with metadata.
The available metadata includes `ln.Reference`, `bt.Tissue`, `bt.Disease`, `bt.ExperimentalFactor`, and many more.
Please have a look at [the instance](https://lamin.ai/laminlabs/hubmap) for more details.

In [9]:
# Get one dataset with a specific type of heart failure
heart_failure_adata = (
    ln.Artifact.filter(diseases__name="heart failure with reduced ejection fraction")
    .first()
    .load()
)
heart_failure_adata

... synchronizing expr.h5ad: 100.0%


AnnData object with n_obs × n_vars = 52534 × 60286
    obs: 'cell_id'
    var: 'hugo_symbol'
    layers: 'spliced', 'spliced_unspliced_sum', 'unspliced'
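Because `.first()` returns `None` when no record matches, chaining `.load()` directly fails with an `AttributeError` if the disease name is misspelled or absent. A slightly more defensive variant is sketched below. The function name is hypothetical, and the registry is passed in as an argument so the sketch does not depend on a connected instance; it only assumes the `filter(...).first()` interface used in the cell above.

```python
def first_artifact_for_disease(artifact_registry, disease_name):
    """Return the first artifact labeled with `disease_name`.

    Hypothetical helper: `artifact_registry` is expected to behave like
    `ln.Artifact`, i.e. offer `filter(...)` returning an object with `first()`.
    Raises LookupError instead of returning None when nothing matches.
    """
    artifact = artifact_registry.filter(diseases__name=disease_name).first()
    if artifact is None:
        raise LookupError(f"no artifact labeled with disease {disease_name!r}")
    return artifact
```

In the notebook, this would be called as `first_artifact_for_disease(ln.Artifact, "heart failure with reduced ejection fraction").load()`.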

In [10]:
ln.finish()

• please hit CTRL + s to save the notebook in your editor .... still waiting ...... ✓
! cells [(0, 2), (6, 9)] were not run consecutively
→ finished Run('l8TeOLsI') after 2m at 2025-02-25 12:59:54 UTC
→ go to: https://lamin.ai/laminlabs/hubmap/transform/TMDTNYfmBzK10000
→ to update your notebook from the CLI, run: lamin save /home/lukas/code/hubmap_registration/access_query_tutorial.ipynb
