# Tutorial: Accessing datasets in `laminlabs/hubmap`

Here, we show how the HubMAP instance is structured and how datasets can be queried and accessed.

HubMAP groups several 'data products', which are individual raw datasets, into higher-level 'datasets'.
For example, the single-cell dataset [HBM983.LKMP.544](https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e) has four data products:

1. [expr.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad)
2. [raw_expr.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad)
3. [secondary_analysis.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/secondary_analysis.h5ad)
4. [scvelo_annotated.h5ad](https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/scvelo_annotated.h5ad)

The [laminlabs/hubmap](https://lamin.ai/laminlabs/hubmap) instance registers these data products as `ln.Artifact` records that jointly form an `ln.Collection`.

In [1]:
import lamindb as ln
import h5py
import pandas as pd
import anndata as ad

assert ln.setup.settings.instance.slug == "laminlabs/hubmap"

ln.track("TMDTNYfmBzK1")

[92m→[0m connected lamindb: laminlabs/hubmap
[92m→[0m found notebook access_query_tutorial.ipynb, making new version
[92m→[0m created Transform('TMDTNYfmBzK10002'), started new Run('lOYN3nI7...') at 2025-05-26 10:42:17 UTC
[92m→[0m notebook imports: anndata==0.10.9 h5py==3.13.0 lamindb==1.5.3 pandas==2.2.3


## Getting HubMAP datasets and data products

The `key` attribute of `ln.Artifact` and `ln.Collection` corresponds to the ID in the respective HubMAP URL.
For example, the ID in the URL https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e is the `key` of the corresponding collection:

In [2]:
small_intenstine_collection = ln.Collection.get(key="20ee458e5ee361717b68ca72caf6044e")
small_intenstine_collection

Collection(uid='xvmP4QeSH584JUbg0000', is_latest=True, key='20ee458e5ee361717b68ca72caf6044e', description='RNAseq data from the small intestine of a 67-year-old white female', hash='bxpInd96BItVhxWNhgQStw', space_id=1, created_by_id=5, run_id=35, created_at=2025-05-21 11:15:36 UTC)
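
Because the collection `key` is the last path segment of the portal URL, it can be extracted programmatically before querying. The following is a minimal sketch using only the Python standard library. The helper name `dataset_key_from_url` is hypothetical and not part of lamindb.

```python
from urllib.parse import urlparse


def dataset_key_from_url(url: str) -> str:
    """Return the last path segment of a HubMAP portal dataset URL.

    Hypothetical helper for illustration; not part of lamindb.
    """
    # Drop a trailing slash, if any, then keep only the final path segment
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


portal_url = (
    "https://portal.hubmapconsortium.org/browse/dataset/"
    "20ee458e5ee361717b68ca72caf6044e"
)
collection_key = dataset_key_from_url(portal_url)
print(collection_key)
```

The resulting string can then be passed as `key=` to `ln.Collection.get`, as in the cell above.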

All associated data products can be retrieved as follows:

In [3]:
small_intenstine_collection.artifacts.all().df()

| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 28 | AzqCWQAKLMV3iTMA0000 | f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad | RNAseq data from the small intestine of a 67-y... | .h5ad |  | AnnData | 67867992 | of_TeLP6cet2JBj3o_kZmQ |  | 6000.0 | md5-etag | False | False | 1 | 2 |  |  | True | 11 | 2025-01-28 14:16:35.355582+00:00 | 3 |  | 1 |
| 29 | fWN781TxuZibkBOR0000 | f6eb890063d13698feb11d39fa61e45a/secondary_ana... | RNAseq data from the small intestine of a 67-y... | .h5ad |  | AnnData | 888111371 | ian3P5CN68AAvoDMC6sZLw |  | 5956.0 | md5-etag | False | False | 1 | 2 |  |  | True | 11 | 2025-01-28 14:16:39.348589+00:00 | 3 |  | 1 |
| 876 | dYhDR2fx8dccLWer0000 | f6eb890063d13698feb11d39fa61e45a/scvelo_annota... | RNAseq data from the small intestine of a 67-y... | .h5ad |  | AnnData | 641007602 | HxvPzL_Pkx6ncEJJcS_GWw |  |  | md5-etag | False | False | 1 | 2 |  |  | True | 35 | 2025-05-21 11:15:19.475249+00:00 | 5 |  | 1 |
| 30 | enXVzwjw4voS8UCb0000 | f6eb890063d13698feb11d39fa61e45a/expr.h5ad | RNAseq data from the small intestine of a 67-y... | .h5ad |  | AnnData | 139737320 | kR476u81gwXI6rEbXzNBvQ |  | 6000.0 | md5-etag | False | False | 1 | 2 |  |  | True | 11 | 2025-01-28 14:16:43.385980+00:00 | 3 |  | 1 |


Note that the `key` of each of these four artifacts corresponds to the path of its assets URL.
For example, https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad is the direct URL to the `expr.h5ad` data product.
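
The mapping between artifact keys and asset URLs can be sketched as a pair of small helpers. This assumes, based only on the examples in this tutorial, that every asset URL is the base `https://assets.hubmapconsortium.org/` followed by the artifact `key`. The function names are hypothetical and not part of lamindb.

```python
ASSETS_BASE_URL = "https://assets.hubmapconsortium.org/"


def asset_url_from_key(key: str) -> str:
    """Build the direct assets URL for an artifact key (assumed URL scheme)."""
    return ASSETS_BASE_URL + key.lstrip("/")


def key_from_asset_url(url: str) -> str:
    """Recover the artifact key from a direct assets URL (assumed URL scheme)."""
    if not url.startswith(ASSETS_BASE_URL):
        raise ValueError(f"not a HubMAP assets URL: {url}")
    return url[len(ASSETS_BASE_URL):]


expr_key = "f6eb890063d13698feb11d39fa61e45a/expr.h5ad"
expr_url = asset_url_from_key(expr_key)
print(expr_url)
print(key_from_asset_url(expr_url))
```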

Artifacts can be directly loaded:

In [4]:
small_intenstine_af = (
    small_intenstine_collection.artifacts.filter(key__icontains="raw_expr.h5ad")
    .distinct()
    .one()
)
adata = small_intenstine_af.load()

In [5]:
adata

AnnData object with n_obs × n_vars = 6000 × 98000
    var: 'hugo_symbol'

## Querying single-cell RNA sequencing datasets

Currently, only the `Artifacts` of the `raw_expr.h5ad` data products are labeled with metadata.
The available metadata includes `ln.Reference`, `bt.Tissue`, `bt.Disease`, `bt.ExperimentalFactor`, and many more.
Please have a look at [the instance](https://lamin.ai/laminlabs/hubmap) for more details.

In [6]:
# Get one dataset with a specific type of heart failure
heart_failure_adata = (
    ln.Artifact.filter(diseases__name="heart failure with reduced ejection fraction")
    .first()
    .load()
)

heart_failure_adata

AnnData object with n_obs × n_vars = 52534 × 60286
    obs: 'cell_id'
    var: 'hugo_symbol'
    layers: 'spliced', 'spliced_unspliced_sum', 'unspliced'

## Querying bulk RNA sequencing datasets

Bulk datasets contain a single file, `expression_matrices.h5`, which is an HDF5 file containing transcript-by-sample matrices of TPM values and read counts.
These files are labeled with metadata, including `ln.Reference`, `bt.Tissue`, `bt.Disease`, `bt.ExperimentalFactor`, and many more.
To make the expression data usable with standard analysis workflows, we first read the TPM and raw count matrices from the file and then convert them into a single AnnData object.
In this object, raw read counts are stored in `.X`, and TPM values are added as a separate layer under `.layers["tpm"]`.

In [7]:
# Get one placenta tissue dataset:
placenta_data = ln.Artifact.filter(tissues__name="placenta").first().cache()

In [8]:
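def load_matrix(group):
    """Read one matrix group of the HDF5 file into a labeled DataFrame."""
    # 2D array of values
    values = group["block0_values"][:]
    # Labels are stored as bytes; convert them to str
    columns = group["block0_items"][:].astype(str)
    index = group["axis1"][:].astype(str)

    return pd.DataFrame(values, index=index, columns=columns)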


with h5py.File(placenta_data, "r") as f:
    tpm_df = load_matrix(f["tpm"])
    reads_df = load_matrix(f["num_reads"])
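
To illustrate the layout that `load_matrix` expects, here is a minimal sketch that runs without the HubMAP file. Since `load_matrix` only uses key lookup and `[:]` slicing, a plain dict of NumPy arrays stands in for an `h5py` group. The dataset names `block0_values`, `block0_items`, and `axis1` are taken from the cell above; the sample and transcript labels are invented for illustration.

```python
import numpy as np
import pandas as pd


def load_matrix(group):
    # Same logic as in the cell above
    values = group["block0_values"][:]
    columns = group["block0_items"][:].astype(str)
    index = group["axis1"][:].astype(str)
    return pd.DataFrame(values, index=index, columns=columns)


# Stand-in for an h5py group: one sample (row) by three transcripts (columns).
# Labels are byte strings, as they would be when read from an HDF5 file.
toy_group = {
    "block0_values": np.array([[1.0, 2.0, 3.0]]),
    "block0_items": np.array([b"tx_a", b"tx_b", b"tx_c"]),  # invented labels
    "axis1": np.array([b"sample_1"]),  # invented label
}

toy_df = load_matrix(toy_group)
print(toy_df)
```

The values array is indexed as rows by columns, with `axis1` supplying the row labels and `block0_items` the column labels.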

In [9]:
# Use raw read counts as the main matrix
placenta_adata = ad.AnnData(X=reads_df.values)
placenta_adata.obs_names = reads_df.index
placenta_adata.var_names = reads_df.columns

# Store TPM normalized values in a layer
placenta_adata.layers["tpm"] = tpm_df.values

# Add identifiers
placenta_adata.obs["sample_id"] = placenta_adata.obs_names
placenta_adata.var["gene_id"] = placenta_adata.var_names

In [10]:
placenta_adata

AnnData object with n_obs × n_vars = 1 × 612302
    obs: 'sample_id'
    var: 'gene_id'
    layers: 'tpm'
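
Assigning `tpm_df.values` as a layer discards the row and column labels, so the TPM matrix must share the labels of the read-count matrix in the same order. Whether the HubMAP files ever differ in ordering is not known here, so the following is a precautionary sketch rather than a required step. The helper `align_to` and the toy labels are hypothetical.

```python
import pandas as pd

# Toy matrices with the same labels but a different column order
reads_df = pd.DataFrame(
    [[10, 0, 5]], index=["sample_1"], columns=["tx_a", "tx_b", "tx_c"]
)
tpm_df = pd.DataFrame(
    [[250000.0, 750000.0, 0.0]], index=["sample_1"], columns=["tx_c", "tx_a", "tx_b"]
)


def align_to(reference: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Reorder `other` to match the row and column order of `reference`."""
    same_rows = set(reference.index) == set(other.index)
    same_cols = set(reference.columns) == set(other.columns)
    if not (same_rows and same_cols):
        raise ValueError("matrices do not share the same labels")
    return other.reindex(index=reference.index, columns=reference.columns)


tpm_aligned = align_to(reads_df, tpm_df)
print(tpm_aligned)
```

After such a step, `tpm_aligned.values` could be stored in `.layers["tpm"]` in place of `tpm_df.values`.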

In [11]:
ln.finish()

[94m•[0m please hit CMD + s to save the notebook in your editor .. [92m✓[0m
[92m→[0m finished Run('lOYN3nI7') after 22s at 2025-05-26 10:42:39 UTC
[92m→[0m go to: https://lamin.ai/laminlabs/hubmap/transform/TMDTNYfmBzK10002
[92m→[0m to update your notebook from the CLI, run: lamin save /Users/altananamsaraeva/Desktop/Lamin/hubmap-registration/access_query_tutorial.ipynb
