[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/bio-registries.ipynb)

# Manage biological registries 

Registries can anchor dry & wetlab work by providing reference values for basic entities.

In LaminDB, registries are standard SQL tables, equipped with [mechanisms that avoid typos & duplicated data](/faq/idempotency).

In addition, LaminDB makes it easy to import records from public ontologies via the plug-in {mod}`lnschema_bionty`.

With this, you can manage an in-house ontology anchored in public knowledge & experimental design, all through the same API and stored in a simple SQL database.

## Setup

Let us create an instance that has {mod}`lnschema_bionty` mounted:

In [None]:
!lamin init --storage ./test-registries --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

In [None]:
ln.settings.verbosity = "info"

Let's pre-populate the cell type registry with a few records:

In [None]:
lb.Species.from_bionty(name="human").save()
lb.CellType.from_bionty(name="T cell").save()
lb.CellType(name="my T cell subtype").save()

## Access records in public ontologies

We start with a public ontology for cell types.

[Bionty](https://lamin.ai/docs/bionty/bionty.bionty) - short for "biological entity" - is a class for accessing public ontologies.

Bionty wraps low-level packages like [pronto](https://github.com/althonos/pronto) to provide simple access to curated public knowledge assets that Lamin hosts for reliable and performant access.

If you don't need to manage in-house registries, you can also use the [bionty](https://lamin.ai/docs/bionty) Python package standalone.

Let's create a `Bionty` object:

In [None]:
bionty = lb.CellType.bionty()

In [None]:
bionty

We can use it to search the public ontology for cell types:

In [None]:
bionty.search("gamma delta T cell").head(3)

And we can also use it to look up cell types with auto-complete:

In [None]:
lookup = bionty.lookup()
lookup.gamma_delta_t_cell

## Create records in in-house ontologies

We can now create a record for our in-house SQL registry by passing the result of the lookup in the public ontology to the `CellType` constructor:

In [None]:
gdt_cell = lb.CellType(lookup.gamma_delta_t_cell)

(Alternatively, we could construct the gamma delta T cell via {meth}`~lnschema_bionty.dev.BioRegistry.from_bionty`, which is synonym-aware.)

In [None]:
gdt_cell

When we save this record to the registry, logging informs us that we're also saving parent ontological terms:

In [None]:
gdt_cell.save()

```{dropdown} Will I always see parents being saved?

No, this only happens a single time.

- If we accidentally save the same record again, it will be recognized that the record and all parents are already in the registry.
- If we save another record that has overlapping parents, only new parents will be saved.

```
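
The behavior described in the dropdown can be illustrated with a toy model. This is a minimal sketch in plain Python and not LaminDB's implementation: the `Term` class, the `save_with_parents` helper, and the dictionary standing in for the registry are all hypothetical, and the sketch assumes that a term already in the registry also has all of its ancestors registered.

```python
# Toy model of idempotent saving; NOT LaminDB's implementation.
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    parents: list = field(default_factory=list)


def save_with_parents(term, registry):
    """Save `term` plus any ancestors missing from `registry`.

    Returns the names that were newly saved by this call.
    Assumes a registered term already has all its ancestors registered.
    """
    newly_saved = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current.name in registry:
            continue  # already registered, nothing to do
        registry[current.name] = current
        newly_saved.append(current.name)
        stack.extend(current.parents)
    return newly_saved


# Illustrative hierarchy (hypothetical data)
registry = {}
lymphocyte = Term("lymphocyte")
t_cell = Term("T cell", parents=[lymphocyte])
gdt = Term("gamma-delta T cell", parents=[t_cell])
mature_t = Term("mature T cell", parents=[t_cell])

first = save_with_parents(gdt, registry)        # saves the term and both ancestors
second = save_with_parents(gdt, registry)       # nothing new to save
third = save_with_parents(mature_t, registry)   # only the new term is saved
```

In this sketch, the second call saves nothing and the third call saves only the term whose parents are already present, which mirrors the two bullet points above.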

View the ontological hierarchy:

In [None]:
gdt_cell.view_parents()

Or access the parents directly:

In [None]:
gdt_cell.parents.df()

You can construct custom hierarchies of terms by specifying parents:

In [None]:
my_celltype = lb.CellType.filter(name="my T cell subtype").one()
my_celltype.parents.add(gdt_cell)

In [None]:
gdt_cell.view_parents(distance=2, with_children=True)

This cell type and all its parents can now be queried & searched in the registry using `lb.CellType.filter` and `lb.CellType.search`.
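
To make the idea of a custom hierarchy concrete, here is a small sketch in plain Python that collects the ancestors of a term up to a given number of parent steps. It is not how `view_parents` is implemented: the `ancestors_within` helper and the parent map are hypothetical, and it assumes that `distance` counts parent steps.

```python
# Toy model of walking a term hierarchy; names and data are hypothetical.
def ancestors_within(term, parents, distance):
    """Return ancestors of `term` reachable in at most `distance` parent steps."""
    frontier, seen = {term}, set()
    for _ in range(distance):
        # Move one step up from every term in the current frontier
        frontier = {p for t in frontier for p in parents.get(t, ())} - seen
        seen |= frontier
    return seen


# Illustrative parent map: child name -> list of parent names
parents = {
    "my T cell subtype": ["gamma-delta T cell"],
    "gamma-delta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
}

near = ancestors_within("my T cell subtype", parents, distance=2)
everything = ancestors_within("my T cell subtype", parents, distance=10)
```

With `distance=2`, the sketch stops at "T cell"; a larger distance also reaches "lymphocyte".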

## Validate records from external data sources

External data sources often contain bulk references to entities that are corrupted or follow a different standardization scheme.

Let's consider an example based on an `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In the `cell_type` annotations of this `AnnData` object, we find 4 references to cell types:

In [None]:
adata.obs.cell_type.value_counts()

We'd like to bulk-validate them to ensure that they match our in-house ontology.

In LaminDB, you'll typically use {meth}`~lamindb.dev.Registry.from_values` for this, which both validates the values and loads the matching records from the in-house registry or the default public reference.

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type)

cell_types

Logging informed us that all 4 cell types are validated. Because we created these records at the same time, we can use them to annotate a batch of data.

:::{dropdown} What happened under the hood?

`.from_values()` performs the following lookups, in order:

1. If registry records match the values, load these records.
2. If values match synonyms of registry records, load these records.
3. (`lnschema_bionty`-only) If no record in the registry matches, attempt to load records from a public reference through Bionty.
4. (`lnschema_bionty`-only) Same as 3., but based on synonyms.

No records will be returned if input field values aren't mappable.

Example:

```
celltype_names = [
    "gamma-delta T cell",  # existing record with the same name
    "T lymphocyte",  # existing record with synonym
    "hepatocyte",  # Bionty record with the same name
    "HSC",  # Bionty record with synonym
    "my new cell type",  # exists neither in the DB nor in Bionty
]
lb.CellType.from_values(celltype_names)
```

This returns records for all names except for "my new cell type".

If you'd like to add this new value to the registry, do it like so:

```
my_celltype = lb.CellType(name="my new cell type")
my_celltype.save()
```

:::
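
The four lookup steps listed in the dropdown can be sketched as a per-value cascade. This is a toy model in plain Python under stated assumptions, not the code behind `.from_values()`: the `resolve` and `from_values_sketch` helpers are hypothetical, and the two dictionaries are made-up stand-ins for the in-house registry and the public reference.

```python
# Toy model of the lookup order; all names and data here are hypothetical.
def find_by_synonym(source, value):
    """Return the standard name whose synonym set contains `value`, else None."""
    for name, synonyms in source.items():
        if value in synonyms:
            return name
    return None


def resolve(value, registry, public):
    """Apply the four lookup steps to a single value."""
    if value in registry:                      # 1. registry name
        return value, "registry name"
    name = find_by_synonym(registry, value)    # 2. registry synonym
    if name is not None:
        return name, "registry synonym"
    if value in public:                        # 3. public reference name
        return value, "public name"
    name = find_by_synonym(public, value)      # 4. public reference synonym
    if name is not None:
        return name, "public synonym"
    return None                                # not mappable


def from_values_sketch(values, registry, public):
    """Resolve all values and drop those that cannot be mapped."""
    resolved = (resolve(v, registry, public) for v in values)
    return [r for r in resolved if r is not None]


# Illustrative stand-ins: standard name -> set of synonyms
registry = {"gamma-delta T cell": set(), "T cell": {"T lymphocyte"}}
public = {"hepatocyte": set(), "hematopoietic stem cell": {"HSC"}}

celltype_names = [
    "gamma-delta T cell",
    "T lymphocyte",
    "hepatocyte",
    "HSC",
    "my new cell type",
]
results = from_values_sketch(celltype_names, registry, public)
```

Here `results` holds four entries, one per lookup step, and the unmappable "my new cell type" is dropped.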


Alternatively, we can create entries based on ontology ids:

In [None]:
adata.obs.cell_type_id.unique().tolist()

In [None]:
lb.CellType.from_values(adata.obs.cell_type_id, field=lb.CellType.ontology_id)

If we're happy with the `cell_types` records (in particular, if we're sure that "my new cell type" needs to be added to the DB), we can save them to the DB in one transaction:

In [None]:
ln.save(cell_types)

Now let's check out our in-house registry:

In [None]:
lb.CellType.filter().df()

## Access records in in-house ontologies

In [None]:
lb.CellType.search("gamma delta T cell").head(2)

In [None]:
celltype_db_lookup = lb.CellType.lookup()

In [None]:
hsc_record = celltype_db_lookup.hematopoietic_stem_cell

In [None]:
hsc_record

## Standardize names & add synonyms

```{important}

While record creation via `Registry.from_values()` is synonym-aware, `.validate()` is not.

To pass validation, first run `.standardize()` so that only validated terms are associated with your data.
```
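
The difference between the two operations in the note above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the `lnschema_bionty` implementation: the synonym table is made up, and the sketch assumes that validation yields one boolean per value and that standardization leaves unknown values unchanged.

```python
# Toy model; the synonym table is illustrative, not read from any ontology.
SYNONYMS = {"hematopoietic stem cell": {"HSC", "blood forming stem cell"}}


def validate(values):
    """Return one boolean per value: True only for exact standard names."""
    return [value in SYNONYMS for value in values]


def standardize(values):
    """Replace known synonyms with their standard name; keep other values as-is."""
    synonym_to_name = {
        synonym: name for name, synonyms in SYNONYMS.items() for synonym in synonyms
    }
    return [synonym_to_name.get(value, value) for value in values]


raw = ["HSC", "blood forming stem cell"]
before = validate(raw)              # synonyms do not pass validation
standardized = standardize(raw)     # both map to the same standard name
after = validate(standardized)      # standard names pass validation
```

In this sketch, the raw synonyms fail validation, while the standardized names pass it.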

In [None]:
# synonyms aware
lb.CellType.from_values(["HSC", "blood forming stem cell"])

In [None]:
# synonyms are not validated
lb.CellType.validate(["HSC", "blood forming stem cell"]);

Convert synonyms to standardized names:

In [None]:
lb.CellType.standardize(["HSC", "blood forming stem cell"])

Add a new synonym to a record:

In [None]:
hsc_record.add_synonym("HSCs")

Now this new synonym can also be mapped:

In [None]:
lb.CellType.standardize(["HSCs"])

A special synonym is "abbr" (abbreviation), which has its own field and can be assigned via:

In [None]:
hsc_record.set_abbr("HSC")

Similarly, users can create a lookup object from the `abbr` field:

In [None]:
celltype_db_lookup = lb.CellType.lookup("abbr")
hsc_record = celltype_db_lookup.hsc
hsc_record

The same workflow works for all of `lnschema_bionty`'s ORMs.

## Manage registries across species

Multi-species ORMs are species-aware; one example is `Gene`:

In [None]:
lb.Gene.from_bionty(
    symbol="TCF7", species="human"  # error is raised without passing species
)

Similarly, API calls that interact with multi-species registries accept a `species` argument, e.g.:

In [None]:
lb.Gene.validate(["TCF7", "ABC1"], species="human");

Or specify the species for validating features when registering data: `ln.File.from_anndata(..., field=lb.Gene.ensembl_gene_id, species=...)`

When working with the same species throughout your analysis/workflow, you can omit the `species` argument by configuring it globally:

In [None]:
lb.settings.species = "mouse"

In [None]:
lb.Gene.from_bionty(symbol="Ap5b1")
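
The interplay between the `species` argument and the global setting can be sketched as a simple fallback. This is a toy model in plain Python: the `Settings` class and the `resolve_species` helper are hypothetical, and the precedence of an explicit argument over the global default is an assumption, not something the text above confirms.

```python
# Toy model of species resolution; names are hypothetical, not the lnschema_bionty API.
class Settings:
    species = None  # global default, unset initially


settings = Settings()


def resolve_species(species=None):
    """Prefer the explicit argument, then the global default, else fail."""
    if species is not None:
        return species
    if settings.species is not None:
        return settings.species
    raise ValueError("species is required for multi-species registries")


# Without an argument or a global default, resolution fails
try:
    resolve_species()
    raised = False
except ValueError:
    raised = True

explicit = resolve_species("human")     # explicit argument is used
settings.species = "mouse"              # configure the global default
from_default = resolve_species()        # falls back to the global default
overridden = resolve_species("human")   # explicit argument wins (assumed)
```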

## Track underlying ontology versions

Under the hood, ontology sources are automatically tracked:

In [None]:
lb.BiontySource.filter(currently_used=True).df()

Each record is linked to a versioned bionty source (if it was created from bionty):

In [None]:
cell_type_record = lb.CellType.filter(name="hepatocyte").one()
cell_type_record.bionty_source

## Create records from specific public ontologies

By default, records are created from the "currently_used" bionty sources, which were configured during instance initialization.

In [None]:
lb.Phenotype.bionty()

Sometimes the default source doesn't contain the ontology term you are looking for. In that case, you can create a record from a non-default source by passing it explicitly:

In [None]:
bionty_source = lb.BiontySource.filter(entity="Phenotype", source="pato").one()
record = lb.Phenotype.from_bionty(name="age", bionty_source=bionty_source)
record

In [None]:
record.bionty_source

Similarly, pass `bionty_source` to bulk-create records from a non-default source:

In [None]:
records = lb.Phenotype.from_values(["age", "life span"], bionty_source=bionty_source)
records

In [None]:
!lamin delete --force test-registries
!rm -r test-registries