[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/bio-registries.ipynb)

# Manage biological registries

Registries anchor dry-lab and wet-lab work in a common framework, which makes data easier to access and model.

What's special about LaminDB is that it lets you back these registries with public ontologies through the plug-in {mod}`lnschema_bionty`.

## Setup

Let us create an instance that has {mod}`lnschema_bionty` mounted:

In [None]:
!lamin init --storage ./test-registries --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

Let's pre-populate the species and cell type registries with a few records:

In [None]:
lb.Species.from_bionty(name="human").save()
lb.CellType.from_bionty(name="T cell").save()
lb.CellType(name="my T cell subtype").save()

## Search or look up terms from a public ontology

Let's start with a public ontology for cell types.

[Bionty](https://lamin.ai/docs/bionty/bionty.bionty) - short for "biological entity" - is a class for accessing ontologies and similar resources:

In [None]:
bionty = lb.CellType.bionty()

In [None]:
bionty

We can use it to search cell types:

In [None]:
bionty.search("gamma delta T cell").head(3)

And we can also use it to look up cell types with auto-complete:

In [None]:
lookup = bionty.lookup()
lookup.gamma_delta_t_cell

## Create a record for an in-house registry

You can create a registry record directly by passing the result of a Bionty lookup:

In [None]:
lb.CellType(lookup.gamma_delta_t_cell)

Or create the record explicitly from the public Bionty source:

In [None]:
gdt_cell = lb.CellType.from_bionty(ontology_id=lookup.gamma_delta_t_cell.ontology_id)

In [None]:
gdt_cell

Record creation is synonym-aware:

In [None]:
lb.CellType.from_bionty(name="B lymphocyte")

When we save this record to the registry, logging informs us that we're also saving parent ontology terms.


```{dropdown} Will I always see a ton of parents being saved?

No, this only happens a single time.

- If we accidentally save the same record again, lamindb will recognize that the record and all parents are already in the registry.
- If we save another record that has overlapping parents, only new parents will be saved.

```
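The bookkeeping described in the dropdown can be pictured as a set difference. The sketch below is a toy model, not lamindb's implementation: the `PARENTS` dict is a simplified, invented hierarchy, a plain `set` stands in for the registry, and the `ancestors` and `save` helpers are hypothetical names.

```python
# Toy model only: a dict of term -> direct parents stands in for the ontology
# (simplified, invented hierarchy), and a set stands in for the registry.
PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": [],
}


def ancestors(term):
    """Collect all ancestors of `term` by walking the parent links."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


def save(term, registry):
    """Add `term` and its ancestors to `registry`; return only the newly added names."""
    new = ({term} | ancestors(term)) - registry
    registry.update(new)
    return new


registry = set()
first = save("gamma-delta T cell", registry)  # the term plus all of its ancestors
again = save("gamma-delta T cell", registry)  # nothing new the second time
other = save("alpha-beta T cell", registry)   # shared parents are already present
```

In this toy model, `again` is empty and `other` contains only the new term, which mirrors the two bullet points in the dropdown.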

In [None]:
gdt_cell.save()

View the ontological hierarchy:

In [None]:
gdt_cell.view_parents()

Or access the parents directly:

In [None]:
gdt_cell.parents.df()

You can construct hierarchies of terms by specifying parents:

In [None]:
my_celltype = lb.CellType.filter(name="my T cell subtype").one()
my_celltype.parents.add(gdt_cell)

In [None]:
gdt_cell.view_parents(distance=2, with_children=True)

This cell type and all its parents can now be queried & searched in the registry using `lb.CellType.filter` and `lb.CellType.search`.

Further down the guide, we'll see how this will help us to annotate and validate files & datasets!

## Bulk create records by parsing data

Consider a DataFrame-based example:

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In [None]:
adata.obs.head()

In [None]:
adata.obs.cell_type.value_counts()

You need to specify the field that corresponds to the values you are passing, for instance "CellType.name" or "CellType.ontology_id" in this case.

The key design principle of `Registry.from_values()` is to never create non-validated records.

`Registry.from_values()` creates records in the following steps:

1. If existing DB records match the input field values, return them without creating new records.
2. If input values match synonyms of existing DB records, return those records without creating new ones.
3. (`lnschema_bionty` only) For values without a DB record, create records from Bionty entries whose corresponding field matches.
4. (`lnschema_bionty` only) Create records from Bionty entries whose synonyms match.

No records are created for input values that cannot be mapped in any of these ways.
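The resolution order above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's code: `DB` and `PUBLIC` are hypothetical name-to-synonyms tables standing in for the in-house registry and the Bionty source, and `match` and `resolve_values` are made-up helper names.

```python
# Hypothetical stand-ins: canonical name -> synonyms.
DB = {"gamma-delta T cell": {"gd T cell"}, "T cell": {"T lymphocyte"}}
PUBLIC = {"hepatocyte": set(), "hematopoietic stem cell": {"HSC"}}


def match(value, table):
    """Return the canonical name for `value`: by name first, then by synonym, else None."""
    if value in table:
        return value
    for name, synonyms in table.items():
        if value in synonyms:
            return name
    return None


def resolve_values(values):
    """Resolve each value in the documented order; unmappable values yield no record."""
    records = []
    for value in values:
        name, source = match(value, DB), "db"  # steps 1 and 2
        if name is None:
            name, source = match(value, PUBLIC), "public"  # steps 3 and 4
        if name is not None:
            records.append((name, source))
    return records


resolved = resolve_values(
    ["gamma-delta T cell", "T lymphocyte", "hepatocyte", "HSC", "my new cell type"]
)
```

Here `resolved` holds four entries: two found in the toy DB, two taken from the toy public source, and nothing for "my new cell type".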

In [None]:
# Input has 4 unique values of cell type names
adata.obs.cell_type.unique().tolist()

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)

cell_types

What if the input contains synonyms? They are recognized:

In [None]:
celltype_names = [
    "gamma-delta T cell",  # existing record with the same name
    "T lymphocyte",  # existing record with synonym
    "hepatocyte",  # Bionty record with the same name
    "HSC",  # Bionty record with synonym
    "my new cell type",  # exists neither in the DB nor in Bionty
]

In [None]:
lb.CellType.from_values(celltype_names, lb.CellType.name)

Note that no record is created from "my new cell type". If you are sure you want to register it, use the default constructor:

In [None]:
my_celltype = lb.CellType(name="my new cell type")
my_celltype.save()

Similarly, we can create records from cell type ontology ids, which avoids any ambiguity from synonyms:

In [None]:
# Input has 3 unique values and 1 empty string (empty values don't result in a record)
adata.obs.cell_type_id.unique().tolist()

In [None]:
lb.CellType.from_values(adata.obs.cell_type_id, lb.CellType.ontology_id)

If we're happy with `cell_types` records (in particular, are sure that "my new cell type" needs to be added to the DB), we save them to the DB in one transaction:

In [None]:
ln.save(cell_types)

Now let's check out our in-house registry:

In [None]:
lb.CellType.filter().df()

## Search or look up terms in the DB

In [None]:
lb.CellType.search("gamma delta T cell").head(2)

In [None]:
celltype_db_lookup = lb.CellType.lookup()

In [None]:
hsc_record = celltype_db_lookup.hematopoietic_stem_cell

In [None]:
hsc_record

## Map or add synonyms to terms in the DB

```{important}

Although record creation via `Registry.from_values()` is synonym-aware, `.validate()` is not.

To pass validation, first run `.standardize()` so that only validated terms are associated with your data.
```
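The difference between the two calls can be pictured with a small stand-alone model. It is only a sketch: `SYNONYMS` is hypothetical registry content, and the toy `validate` and `standardize` functions assume exact-name matching and pass-through of unknown values, which may differ from what the library does in detail.

```python
# Hypothetical registry content: canonical name -> known synonyms.
SYNONYMS = {
    "hematopoietic stem cell": {"HSC", "blood forming stem cell"},
    "T cell": {"T lymphocyte"},
}


def validate(values):
    """True only where a value is exactly a canonical name (synonyms do not count)."""
    return [value in SYNONYMS for value in values]


def standardize(values):
    """Map synonyms to canonical names; leave every other value unchanged."""
    by_synonym = {syn: name for name, syns in SYNONYMS.items() for syn in syns}
    return [by_synonym.get(value, value) for value in values]


raw = ["HSC", "blood forming stem cell", "T cell"]
before = validate(raw)               # synonyms fail validation
after = validate(standardize(raw))   # standardized names pass
```

In this model only "T cell" validates before standardization, while all three values validate afterwards.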

In [None]:
# from_values is synonym-aware
lb.CellType.from_values(["HSC", "blood forming stem cell"], "name")

In [None]:
# validate does not recognize synonyms
lb.CellType.validate(["HSC", "blood forming stem cell"]);

Convert synonyms to standardized names:

In [None]:
lb.CellType.standardize(["HSC", "blood forming stem cell"])

Add a new synonym to a record:

In [None]:
hsc_record.add_synonym("HSCs")

Now this new synonym can also be mapped:

In [None]:
lb.CellType.standardize(["HSCs"])

A special synonym is "abbr" (abbreviation), which has its own field and can be assigned via:

In [None]:
hsc_record.set_abbr("HSC")

Similarly, you can create a lookup object from the `abbr` field:

In [None]:
celltype_db_lookup = lb.CellType.lookup("abbr")
hsc_record = celltype_db_lookup.hsc
hsc_record

The same workflow works for all of `lnschema_bionty`'s ORMs.
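The lookup objects used in this guide expose records as attributes such as `gamma_delta_t_cell` or `hsc`. One way to picture how a field value becomes an auto-completable attribute is sketched below. This is an assumption about the general idea, not the library's code: `to_key` and `build_lookup` are hypothetical helpers, and the records are plain dicts.

```python
import re
from types import SimpleNamespace


def to_key(value):
    """Lower-case `value` and replace runs of non-alphanumerics with underscores."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", value).strip("_").lower()
    # Attribute names cannot start with a digit, so prefix those with an underscore.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(records, field):
    """Build an attribute-style lookup keyed by `field`, skipping records without a value."""
    return SimpleNamespace(**{to_key(r[field]): r for r in records if r.get(field)})


records = [
    {"name": "gamma-delta T cell", "abbr": None},
    {"name": "hematopoietic stem cell", "abbr": "HSC"},
]
by_name = build_lookup(records, "name")
by_abbr = build_lookup(records, "abbr")
```

With these toy records, `by_name.gamma_delta_t_cell` and `by_abbr.hsc` each return the matching dict.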

## Multi-species registries

ORMs that span multiple species are species-aware, for instance `Gene`:

In [None]:
lb.Gene.from_bionty(
    symbol="TCF7", species="human"
)  # raises an error if no species is passed or configured

You can also omit the `species` argument, if you configure it globally:

In [None]:
lb.settings.species = "mouse"

In [None]:
lb.Gene.from_bionty(symbol="Ap5b1")

## Track underlying ontology sources

Under the hood, ontology sources are tracked:

In [None]:
lb.BiontySource.filter(currently_used=True).df()

Each record is linked to a versioned Bionty source (if it was created from Bionty):

In [None]:
cell_type_record = lb.CellType.filter(name="hepatocyte").one()
cell_type_record.bionty_source

In [None]:
!lamin delete --force test-registries
!rm -r test-registries