[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/bio-registries.ipynb)

# Manage biological registries

Registries manage the formalized knowledge & experimental design that anchor dry & wetlab work.

In LaminDB, registries are standard SQL tables, equipped with [mechanisms that avoid typos & duplicated data](/faq/idempotency).

In addition, LaminDB makes it easy to import records from public ontologies through the plug-in {mod}`lnschema_bionty`.

In this notebook, you'll see how to manage an in-house ontology anchored in public knowledge.

(If you also manage experimental design through registries, you can access all metadata through one API and store it in one simple SQL database.)

## Setup

Let us create an instance that has {mod}`lnschema_bionty` mounted:

In [None]:
!lamin init --storage ./test-registries --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

In [None]:
ln.settings.verbosity = "info"

Let's pre-populate the cell type registry with a few records:

In [None]:
lb.Species.from_bionty(name="human").save()
lb.CellType.from_bionty(name="T cell").save()
lb.CellType(name="my T cell subtype").save()

## Access records in public ontologies

We start with a public ontology for cell types.

[Bionty](https://lamin.ai/docs/bionty/bionty.bionty) - short for "biological entity" - is a class for accessing public ontologies.

Bionty provides simple access to curated public ontologies that Lamin hosts for reliable and performant access. For most Bionty objects, you can access the underlying ontology through [Pronto](https://github.com/althonos/pronto).

(If you don't need to manage in-house registries, you can also use the [bionty](https://lamin.ai/docs/bionty) package standalone.)

Let's create a `Bionty` object:

In [None]:
bionty = lb.CellType.bionty()

In [None]:
bionty

We can use it to search the public ontology for cell types:

In [None]:
bionty.search("gamma delta T cell").head(3)

And we can also use it to look up cell types with auto-complete:

In [None]:
lookup = bionty.lookup()
lookup.gamma_delta_t_cell

## Create records in in-house ontologies

We can now create a record for our in-house SQL registry by passing the result of the lookup in the public ontology to the `CellType` constructor:

In [None]:
gdt_cell = lb.CellType(lookup.gamma_delta_t_cell)

(Alternatively, we could construct the gamma delta T cell via {meth}`~lnschema_bionty.dev.BioRegistry.from_bionty`, which is synonyms-aware.)

In [None]:
gdt_cell

When we save this record to the registry, logging informs us that we're also saving parent ontological terms:

In [None]:
gdt_cell.save()

```{dropdown} Will I always see parents being saved?

No, this only happens a single time.

- If we accidentally save the same record again, it will be recognized that the record and all parents are already in the registry.
- If we save another record that has overlapping parents, only new parents will be saved.

```
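The save-once behavior described above can be pictured with a small toy model. This is only a sketch in plain Python, not how LaminDB stores records: `PARENTS`, `registry` and `save_with_parents` are hypothetical names, and the hierarchy is a hand-written, simplified slice.

```python
# Toy model only: this is NOT how lamindb stores records.
# PARENTS is a hypothetical, hand-written slice of a cell type hierarchy.
PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": [],
}

registry = set()  # stands in for the in-house SQL table


def save_with_parents(name):
    """Save `name` plus any ancestors not yet in the registry; return what was newly saved."""
    newly_saved = []
    stack = [name]
    while stack:
        term = stack.pop()
        if term in registry:
            continue  # in this toy, a saved term always has its ancestors saved too
        registry.add(term)
        newly_saved.append(term)
        stack.extend(PARENTS.get(term, []))
    return newly_saved


first = save_with_parents("gamma-delta T cell")   # saves the term and all ancestors
again = save_with_parents("gamma-delta T cell")   # nothing new to save
sibling = save_with_parents("alpha-beta T cell")  # only the new term is saved
print(first, again, sibling)
```

Saving the same term twice returns an empty list, and saving a sibling only adds the sibling itself because its parents are already present.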

View the ontological hierarchy:

In [None]:
gdt_cell.view_parents()

Or access the parents directly:

In [None]:
gdt_cell.parents.df()

You can construct custom hierarchies of terms by specifying parents:

In [None]:
my_celltype = lb.CellType.filter(name="my T cell subtype").one()
my_celltype.parents.add(gdt_cell)

In [None]:
gdt_cell.view_parents(distance=2, with_children=True)

This cell type and all its parents can now be queried & searched in the registry using `lb.CellType.filter` and `lb.CellType.search`.

## Load records for values in data sources

When accessing data sources, one often encounters bulk references to entities that might be corrupted or curated using different standardization schemes.

Let's consider an example based on an `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In the `cell_type` annotations of this `AnnData` object, we find 4 references to cell types:

In [None]:
adata.obs.cell_type.value_counts()

We'd like to load the corresponding records in our in-house ontology to annotate the batch of data.

To this end, you'll typically use {meth}`~lamindb.dev.Registry.from_values`, which both validates the values and loads the records that match them.

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type)

cell_types

Logging informed us that all 4 cell types are validated.

And because we loaded these records at the same time, we could readily use them to annotate a batch of data.

:::{dropdown} What happened under-the-hood?

`.from_values()` performs the following lookups:

1. If registry records match the values, load these records
2. If values match synonyms of registry records, load these records
3. (`lnschema_bionty`-only) If no record in the registry matches, attempt to load records from a public reference through Bionty
4. (`lnschema_bionty`-only) Same as 3. but based on synonyms

No record is returned for an input value that can't be mapped.

Example:

```
celltype_names = [
    "gamma-delta T cell",  # existing registry record with the same name
    "T lymphocyte",  # existing registry record, matched by synonym
    "hepatocyte",  # Bionty record with the same name
    "HSC",  # Bionty record, matched by synonym
    "my new cell type",  # exists neither in the registry nor in Bionty
]
lb.CellType.from_values(celltype_names)
```

This returns records for all names except for "my new cell type".

If you'd like to add this new value to the registry, do it like so:

```
my_celltype = lb.CellType(name="my new cell type")
my_celltype.save()
```

:::
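The four-step lookup order from the dropdown above can be sketched in plain Python. This is a toy model under stated assumptions, not LaminDB's implementation: the four collections and `from_values_sketch` are hypothetical stand-ins, and the sketch checks each value in turn, whereas the real method may well work on the whole batch at once.

```python
# Toy model only: these collections are hypothetical stand-ins for the
# in-house registry and the public ontology, not lamindb's data structures.
registry_names = {"gamma-delta T cell", "T cell"}
registry_synonyms = {"T lymphocyte": "T cell"}
public_names = {"hepatocyte", "hematopoietic stem cell"}
public_synonyms = {"HSC": "hematopoietic stem cell"}


def from_values_sketch(values):
    """Return (value, matched name, matching step) for every mappable value."""
    matches = []
    for value in values:
        if value in registry_names:  # step 1: name in the registry
            matches.append((value, value, 1))
        elif value in registry_synonyms:  # step 2: synonym in the registry
            matches.append((value, registry_synonyms[value], 2))
        elif value in public_names:  # step 3: name in the public ontology
            matches.append((value, value, 3))
        elif value in public_synonyms:  # step 4: synonym in the public ontology
            matches.append((value, public_synonyms[value], 4))
        # values that match nowhere are silently dropped
    return matches


result = from_values_sketch(
    ["gamma-delta T cell", "T lymphocyte", "hepatocyte", "HSC", "my new cell type"]
)
for row in result:
    print(row)
```

Each of the first four values is resolved at a different step, and the unmappable fifth value produces no entry.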


Alternatively, we can create entries based on ontology ids:

In [None]:
adata.obs.cell_type_id.unique().tolist()

In [None]:
lb.CellType.from_values(adata.obs.cell_type_id, field=lb.CellType.ontology_id)

If we're happy with `cell_types` records, we save them to the registry:

In [None]:
ln.save(cell_types)

Now, let's inspect our in-house registry:

In [None]:
lb.CellType.filter().df()

## Access records in in-house ontologies

Search:

In [None]:
lb.CellType.search("gamma delta T cell").head(2)

Or look up with auto-complete:

In [None]:
cell_types = lb.CellType.lookup()
hsc_record = cell_types.hematopoietic_stem_cell

hsc_record

## Validate & standardize

Simple validation of an iterable of values works like so:

In [None]:
lb.CellType.validate(["HSC", "blood forming stem cell"])

Because neither value matches a standardized name in the registry, neither is validated!

You can easily convert these values to validated standardized names based on synonyms like so:

In [None]:
lb.CellType.standardize(["HSC", "blood forming stem cell"])

Alternatively, you can use `.from_values()`, which will only ever create validated records and automatically standardize under-the-hood:

In [None]:
lb.CellType.from_values(["HSC", "blood forming stem cell"])

We can also add new synonyms to a record like so:

In [None]:
hsc_record.add_synonym("HSCs")

And when we encounter this synonym as a value, it will now be standardized using synonyms-lookup and mapped to the correct registry record:

In [None]:
lb.CellType.standardize(["HSCs"])

A special synonym is `.abbr` (short for abbreviation), which has its own field and can be assigned via:

In [None]:
hsc_record.set_abbr("HSC")

You can create a lookup object from the `.abbr` field:

In [None]:
cell_types = lb.CellType.lookup("abbr")
hsc = cell_types.hsc
hsc

The same workflow works for all of `lnschema_bionty`'s registries.
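To make the difference between validating and standardizing concrete, here is a toy model in plain Python. It is a sketch only: `names`, `synonyms`, `validate` and `standardize` are hypothetical stand-ins, and the choice to pass unknown values through unchanged is an assumption of this sketch.

```python
# Toy model only: hypothetical names and synonyms, not lamindb's implementation.
names = {"hematopoietic stem cell", "T cell"}
synonyms = {
    "HSC": "hematopoietic stem cell",
    "blood forming stem cell": "hematopoietic stem cell",
}


def validate(values):
    """True where a value is exactly a standardized name."""
    return [value in names for value in values]


def standardize(values):
    """Map known synonyms to standardized names; pass other values through unchanged."""
    return [synonyms.get(value, value) for value in values]


raw = ["HSC", "blood forming stem cell"]
before = validate(raw)              # synonyms are not standardized names
standardized = standardize(raw)     # both map to the same standardized name
after = validate(standardized)      # standardized names validate

synonyms["HSCs"] = "hematopoietic stem cell"  # analogous to adding a synonym
plural = standardize(["HSCs"])
print(before, standardized, after, plural)
```

Validation is a strict membership check on names, while standardization is the step that consults synonyms, which is why a newly added synonym changes the result of standardizing but not of validating the raw value.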

## Manage registries across species

Most registries are species-aware, for instance, `Gene`:

In [None]:
lb.Gene.from_bionty(symbol="TCF7", species="human")

Similarly, API calls that interact with multi-species registries accept a `species` argument, e.g.:

In [None]:
lb.Gene.validate(["TCF7", "ABC1"], species="human")

You can also pass species for validating features upon registering data, e.g., in `ln.File.from_anndata(..., field=lb.Gene.ensembl_gene_id, species=...)`.

And when working with the same species throughout your analysis/workflow, you can omit the `species` argument by configuring it globally:

In [None]:
lb.settings.species = "mouse"

In [None]:
lb.Gene.from_bionty(symbol="Ap5b1")
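A global default combined with a per-call argument is a common pattern, and a toy version looks like the sketch below. Everything here is hypothetical: `Settings` and `resolve_species` are made-up names, and both the rule that an explicit argument takes precedence over the global setting and the error raised when neither is set are assumptions of this sketch rather than something stated above.

```python
# Toy model only: `Settings` and `resolve_species` are hypothetical names,
# and the precedence rule below is an assumption, not documented behavior.
class Settings:
    species = None  # global default, unset until configured


settings = Settings()


def resolve_species(species=None):
    """Return the explicit argument if given, else the global default."""
    if species is not None:
        return species
    if settings.species is None:
        raise ValueError("pass `species` or set a global default first")
    return settings.species


settings.species = "mouse"
default_choice = resolve_species()          # falls back to the global default
explicit_choice = resolve_species("human")  # explicit argument is used as-is
print(default_choice, explicit_choice)
```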

## Track underlying ontology versions

Under-the-hood, source ontology versions are automatically tracked:

In [None]:
lb.BiontySource.filter(currently_used=True).df()

Each record that was created from Bionty is linked to a versioned Bionty source:

In [None]:
hepatocyte = lb.CellType.filter(name="hepatocyte").one()
hepatocyte.bionty_source

## Create records from specific public ontologies

By default, records are created from the `"currently_used"` Bionty sources, which are configured during instance initialization, e.g.:

In [None]:
lb.Phenotype.bionty()

Sometimes, the default source doesn't contain the ontology term you are looking for.

In that case, you can create the record from a non-default source:

In [None]:
bionty_source = lb.BiontySource.filter(entity="Phenotype", source="pato").one()
age = lb.Phenotype.from_bionty(name="age", bionty_source=bionty_source)
age

In [None]:
age.bionty_source

Analogously, you can pass `bionty_source` to bulk-create records from a non-default source:

In [None]:
records = lb.Phenotype.from_values(["age", "life span"], bionty_source=bionty_source)
records

In [None]:
!lamin delete --force test-registries
!rm -r test-registries