[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/bio-registries.ipynb)

# Manage biological registries 

This guide shows how to manage metadata for basic biological entities using the {mod}`bionty` plugin.

In [None]:
# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-registries --schema bionty

In [None]:
import lamindb as ln
import bionty as bt

## Seed registries with public ontologies

Let's first populate our {class}`~bionty.CellType` registry with the configured public ontology (Cell Ontology):

In [None]:
# check configured public ontology
bt.Source.filter(entity="bionty.CellType", currently_used=True).one()

In [None]:
# populate the database with the public ontology
bt.CellType.import_source()

This is now your in-house CellType registry:

In [None]:
# all public cell types are now available in LaminDB
bt.CellType.df()

In [None]:
# similarly, let's populate the Gene registry with human and mouse genes
bt.Gene.import_source(organism="human")
bt.Gene.import_source(organism="mouse")

## Access records in in-house registries

Search by keyword:

In [None]:
bt.CellType.search("gamma-delta T").df().head(2)

Or look up with auto-complete:

In [None]:
cell_types = bt.CellType.lookup()
hsc_record = cell_types.hematopoietic_stem_cell
hsc_record

Filter by fields and relationships:

In [None]:
gdt_cell = bt.CellType.get(ontology_id="CL:0000798", created_by__handle="testuser1")
gdt_cell

View the ontological hierarchy:

In [None]:
gdt_cell.view_parents()  # pass with_children=True to also view children

Or access the parents and children directly:

In [None]:
gdt_cell.parents.df()

In [None]:
gdt_cell.children.df()

It is also possible to recursively query parents or children, getting direct parents (children), their parents, and so forth.

In [None]:
gdt_cell.query_parents().df()

In [None]:
gdt_cell.query_children().df()
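
To make the difference between direct and recursive queries concrete, here is a minimal pure-Python sketch of a recursive parent traversal. The `PARENTS` mapping is simplified toy data and the `query_parents` helper is a hypothetical illustration. It is not how `bionty` implements `.query_parents()`, which runs against the database.

```python
# Toy hierarchy (simplified, hypothetical data): term -> direct parents.
PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": [],
}


def query_parents(term, parents=PARENTS):
    """Return all ancestors of `term`: direct parents, their parents, and so on."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # skip terms already reached via another path
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


direct = PARENTS["gamma-delta T cell"]  # direct parents only
ancestors = query_parents("gamma-delta T cell")  # full recursive closure
print(direct)
print(ancestors)
```

The direct lookup returns only the immediate parents, while the traversal keeps following parent links until it reaches terms without parents.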

You can construct custom hierarchies of records:

In [None]:
# register a new cell type
my_celltype = bt.CellType(name="my new T-cell subtype").save()
# specify "gamma-delta T cell" as a parent
my_celltype.parents.add(gdt_cell)

# visualize hierarchy
gdt_cell.view_parents(distance=2, with_children=True)

## Create records from values

When accessing datasets, one often encounters bulk references to entities that might be corrupted or standardized using different standardization schemes.

Let's consider an example based on an `AnnData` object. In its `cell_type` annotations, we find 4 references to cell types:

In [None]:
adata = ln.core.datasets.anndata_with_obs()
adata.obs.cell_type.value_counts()

We'd like to load the corresponding records in our in-house registry to annotate a dataset.

To this end, you'll typically use {class}`~lamindb.core.CanCurate.from_values`, which will both validate & retrieve records that match the values.

In [None]:
cell_types = bt.CellType.from_values(adata.obs.cell_type)
cell_types

Logging informed us that 3 cell types were validated. Since we loaded these records at the same time, we could readily use them to annotate a dataset.

:::{dropdown} What happened under-the-hood?

`.from_values()` performs the following lookups, in order:

1. If registry records match the values, load these records
2. If values match synonyms of registry records, load these records
3. If no record in the registry matches, attempt to load records from a public ontology
4. Same as 3. but based on synonyms

No records will be returned if all 4 lookups are unsuccessful.

Sometimes, it's useful to treat validated records differently from non-validated records. Here is a way:

```
original_values = ["gut", "gut2"]
inspector = bt.Tissue.inspect(original_values)
records_from_validated_values = bt.Tissue.from_values(inspector.validated)
```

:::
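
The four lookups described in the dropdown can be sketched in plain Python. This is a toy model under stated assumptions: `REGISTRY`, `PUBLIC`, `find`, and the local `from_values` function are hypothetical stand-ins with made-up contents, not `bionty`'s implementation.

```python
# Hypothetical stand-ins for the in-house registry and a public ontology.
# Each maps a standardized name to its set of synonyms (made-up toy data).
REGISTRY = {"T cell": {"T lymphocyte"}}
PUBLIC = {"hepatocyte": {"liver cell"}, "T cell": {"T lymphocyte"}}


def find(value, table):
    """Return the standardized name for `value`: exact name first, then synonyms."""
    if value in table:
        return value
    for name, synonyms in table.items():
        if value in synonyms:
            return name
    return None


def from_values(values):
    """Resolve values in the order of the four lookups; drop unresolved values."""
    records = []
    for value in values:
        # lookups 1 and 2: registry names, then registry synonyms
        # lookups 3 and 4: public names, then public synonyms
        name = find(value, REGISTRY) or find(value, PUBLIC)
        if name is not None and name not in records:
            records.append(name)
    return records


result = from_values(["T lymphocyte", "liver cell", "my favorite cell"])
print(result)
```

In this sketch the first value resolves through a registry synonym, the second through a public-ontology synonym, and the third is dropped because none of the lookups succeed.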


Alternatively, we can retrieve records based on ontology ids:

In [None]:
adata.obs.cell_type_id.unique().tolist()

In [None]:
bt.CellType.from_values(adata.obs.cell_type_id, field=bt.CellType.ontology_id)

## Validate & standardize

Simple validation of an iterable of values works like so:

In [None]:
bt.CellType.validate(["fat cell", "blood forming stem cell"])

Because these values don't comply with the registry, they're not validated!

You can easily convert these values to validated standardized names based on synonyms like so:

In [None]:
bt.CellType.standardize(["fat cell", "blood forming stem cell"])

Alternatively, you can use `.from_values()`, which only ever returns validated records and automatically standardizes under the hood:

In [None]:
bt.CellType.from_values(["fat cell", "blood forming stem cell"])

If you are not sure what to do, use `.inspect()` to get instructions:

In [None]:
bt.CellType.inspect(["fat cell", "blood forming stem cell"]);

We can also add new synonyms to a record like so:

In [None]:
hsc_record.add_synonym("HSC")

When we now encounter this synonym as a value, it is standardized via synonym lookup and mapped to the correct registry record:

In [None]:
bt.CellType.standardize(["HSC"])
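
How validation, standardization, and synonyms relate can be illustrated with a small in-memory model. The `registry` dictionary and the three helper functions below are hypothetical and use toy data. They only mimic the behavior described in this section and are not `bionty`'s API or internals.

```python
# Hypothetical in-memory registry: standardized name -> set of synonyms.
registry = {
    "adipocyte": {"fat cell"},
    "hematopoietic stem cell": {"blood forming stem cell"},
}


def validate(values):
    """A value is valid only if it equals a standardized name."""
    return [value in registry for value in values]


def standardize(values):
    """Map synonyms to standardized names; leave unknown values unchanged."""
    synonym_to_name = {
        synonym: name for name, synonyms in registry.items() for synonym in synonyms
    }
    return [synonym_to_name.get(value, value) for value in values]


def add_synonym(name, synonym):
    """Register an additional synonym for an existing standardized name."""
    registry[name].add(synonym)


values = ["fat cell", "blood forming stem cell"]
validated = validate(values)        # synonyms are not valid names
standardized = standardize(values)  # but they map to valid names

before = standardize(["HSC"])       # unknown synonym stays as-is
add_synonym("hematopoietic stem cell", "HSC")
after = standardize(["HSC"])        # now resolved via the new synonym
print(validated, standardized, before, after)
```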

A special synonym is `.abbr` (short for abbreviation), which has its own field and can be assigned via:

In [None]:
hsc_record.set_abbr("HSC")

You can create a lookup object from the `.abbr` field:

In [None]:
cell_types = bt.CellType.lookup("abbr")
hsc = cell_types.hsc
hsc

The same workflow works for all of `bionty`'s registries.

## Manage registries across organisms

Several registries are organism-aware (they have an `.organism` field), for instance, {class}`~bionty.Gene`.

In this case, API calls that interact with multi-organism registries require an `organism` argument when there's ambiguity.

For instance, when validating gene symbols:

In [None]:
bt.Gene.validate(["TCF7", "ABC1"], organism="human")

In contrast, working with Ensembl Gene IDs doesn't require passing `organism`, as there's no ambiguity:

In [None]:
bt.Gene.validate(["ENSG00000000419", "ENSMUSG00002076988"], field=bt.Gene.ensembl_gene_id)
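
Why symbols need an `organism` while unique identifiers do not can be shown with a toy table. Everything below is hypothetical: the `GENES` rows use placeholder symbols and made-up IDs, and the helper functions do not mirror `bionty`'s internals or error messages.

```python
# Hypothetical gene table: (organism, symbol, gene_id). All values are placeholders.
GENES = [
    ("human", "SYMBOL1", "ID_HUMAN_1"),
    ("mouse", "SYMBOL1", "ID_MOUSE_1"),
    ("mouse", "SYMBOL2", "ID_MOUSE_2"),
]


def validate_symbols(symbols, organism=None):
    """Validate symbols; require `organism` when a symbol exists in several organisms."""
    results = []
    for symbol in symbols:
        organisms = {org for org, sym, _ in GENES if sym == symbol}
        if organism is None and len(organisms) > 1:
            raise ValueError(f"{symbol!r} is ambiguous; pass organism=...")
        results.append(organism in organisms if organism else bool(organisms))
    return results


def validate_ids(gene_ids):
    """IDs are unique across organisms, so no organism is needed."""
    known = {gene_id for _, _, gene_id in GENES}
    return [gene_id in known for gene_id in gene_ids]


human_result = validate_symbols(["SYMBOL1", "SYMBOL2"], organism="human")
id_result = validate_ids(["ID_HUMAN_1", "ID_MOUSE_2"])

try:
    validate_symbols(["SYMBOL1"])  # same symbol in human and mouse
    ambiguous = False
except ValueError:
    ambiguous = True
print(human_result, id_result, ambiguous)
```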

When working with the same organism throughout your analysis/workflow, you can omit the `organism` argument by configuring it globally:

In [None]:
bt.settings.organism = "mouse"
bt.Gene.from_source(symbol="Ap5b1")

## Track underlying ontology source versions

Under the hood, source ontology versions are automatically tracked for each registry:

In [None]:
bt.Source.filter(currently_used=True).df()

Each record is linked to a versioned public source (if it was created from a public source):

In [None]:
hepatocyte = bt.CellType.get(name="hepatocyte")
hepatocyte.source

## Create records from specific source

By default, new records are imported or created from the `"currently_used"` public sources which are configured during the instance initialization, e.g.:

In [None]:
bt.Source.filter(entity="bionty.Phenotype", currently_used=True).df()

Sometimes, the default source doesn't contain the ontology term you are looking for.

In that case, you can create a record from a non-default source. For instance, we can use the `ncbitaxon` ontology:

In [None]:
source = bt.Source.get(entity="bionty.Organism", name="ncbitaxon")
source

In [None]:
# validate against the NCBI Taxonomy
bt.Organism.validate(["iris setosa", "iris versicolor", "iris virginica"], source=source)

In [None]:
records = bt.Organism.from_values(
    ["iris setosa", "iris versicolor", "iris virginica"], source=source
)

# since we didn't seed the Organism registry with the NCBITaxon public ontology
# we need to save the records to the database
ln.save(records)

# now we can query an iris organism and view its parents and children
iris = bt.Organism.get(name="iris")
iris.view_parents(with_children=True)

In [None]:
# clean up test instance
!lamin delete --force test-registries