[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/bio-registries.ipynb)

# Manage biological registries 

With the {mod}`bionty` plug-in, it's easy to import records from public biological ontologies.

In [None]:
#! pip install 'lamindb[bionty]'
!lamin init --storage ./test-registries --schema bionty

Let's pre-populate the {class}`bionty.Organism` and {class}`bionty.CellType` registries with a few records:

In [None]:
import lamindb as ln
import bionty as bt

bt.Organism.from_public(name="human").save()
bt.CellType.from_public(name="T cell").save()
bt.CellType(name="my T cell subtype").save()

## Access records in public ontologies

Consider a public ontology for cell types: `.public()` returns a {class}`bionty.core.PublicOntology` object for accessing a public ontology.

In [None]:
public = bt.CellType.public()
public

We can use it to search the public ontology for cell types:

In [None]:
public.search("gamma delta T cell").head(3)

Or to look up cell types with auto-complete:

In [None]:
lookup = public.lookup()
lookup.gamma_delta_t_cell
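
To make the auto-complete idea concrete, here is a minimal, library-independent sketch of how free-text names could be turned into attribute names such as `gamma_delta_t_cell`. The helper names `to_attribute_name` and `build_lookup` and the normalization rule are assumptions for illustration; they are not `bionty`'s implementation.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name: str) -> str:
    """Turn a free-text name into a lowercase identifier (assumed rule)."""
    key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(names):
    """Build an object whose attributes are the normalized names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


toy_lookup = build_lookup(["gamma-delta T cell", "hematopoietic stem cell"])
print(toy_lookup.gamma_delta_t_cell)
```

Because the keys are plain attributes, an interactive shell can tab-complete them, which is the convenience the lookup object provides.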

## Create records in in-house ontologies

We can now create a record for our in-house SQL registry by passing the result of the lookup in the public ontology to the `CellType` constructor:

In [None]:
gdt_cell = bt.CellType(lookup.gamma_delta_t_cell)
gdt_cell

Alternatively, we can construct the gamma delta T cell via {meth}`~bionty.core.BioRecord.from_public`, which is synonyms-aware:

In [None]:
bt.CellType.from_public(ontology_id="CL:0000798")

When we save this record to the registry, logging informs us that we're also saving parent records:

In [None]:
gdt_cell.save()

```{dropdown} Will I always see parents being saved?

No, this only happens a single time.

- If we accidentally save the same record again, it will be recognized that the record and all parents are already in the registry.
- If we save another record that has overlapping parents, only new parents will be saved.

```
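
The behavior described in the dropdown can be sketched with a toy registry. This is an illustrative model only: `TOY_PARENTS` is a simplified, made-up hierarchy and `save_with_parents` is a hypothetical helper, not part of `bionty`.

```python
# Simplified, made-up hierarchy: each term maps to its direct parents.
TOY_PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": [],
}


def save_with_parents(name, registry):
    """Save `name` and any ancestors missing from `registry`.

    Returns the list of terms that were newly saved.
    """
    newly_saved = []
    stack = [name]
    while stack:
        term = stack.pop()
        if term in registry:
            # Already saved; in this toy model its ancestors were saved with it.
            continue
        registry.add(term)
        newly_saved.append(term)
        stack.extend(TOY_PARENTS[term])
    return newly_saved


registry = set()
first = save_with_parents("gamma-delta T cell", registry)
again = save_with_parents("gamma-delta T cell", registry)
sibling = save_with_parents("alpha-beta T cell", registry)
print(first, again, sibling)
```

The second call saves nothing, and the sibling term only adds itself because its parents are already present.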

View the ontological hierarchy:

In [None]:
gdt_cell.view_parents()

Or access the parents directly:

In [None]:
gdt_cell.parents.df()

You can construct custom hierarchies of records:

In [None]:
my_celltype = bt.CellType.filter(name="my T cell subtype").one()
my_celltype.parents.add(gdt_cell)
gdt_cell.view_parents(distance=2, with_children=True)

This cell type and all its parents can now be queried & searched in the registry via `bt.CellType.filter()` and `bt.CellType.search()`.

## Load records for values in data sources

When accessing data sources, one often encounters bulk references to entities that might be corrupted or standardized under different schemes.

Let's consider an example based on an `AnnData` object. In its `cell_type` annotations, we find 4 references to cell types:

In [None]:
adata = ln.core.datasets.anndata_with_obs()
adata.obs.cell_type.value_counts()

We'd like to load the corresponding records in our in-house ontology to annotate a dataset.

To this end, you'll typically use {meth}`~lamindb.core.Record.from_values`, which both validates the values and loads the records that match them.

In [None]:
cell_types = bt.CellType.from_values(adata.obs.cell_type)
cell_types

Logging informed us that 3 cell types were validated. Since we loaded these records at the same time, we could readily use them to annotate a dataset.

:::{dropdown} What happened under-the-hood?

`.from_values()` performs the following lookups:

1. If registry records match the values, load these records
2. If values match synonyms of registry records, load these records
3. If no record in the registry matches, attempt to load records from a public ontology
4. Same as 3. but based on synonyms

No records will be returned if all 4 lookups are unsuccessful.

Example:

```
celltype_names = [
    "gamma-delta T cell",  # existing record with the same name
    "T lymphocyte",  # existing record with synonym
    "hepatocyte",  # public record with the same name
    "HSC",  # public record with synonym
    "my new cell type",  # exists neither in the in-house registry nor in the public reference
]
bt.CellType.from_values(celltype_names)
```

This returns records for all names except for "my new cell type".

If you'd like to add this new value to the registry, do it like so:

```
my_celltype = bt.CellType(name="my new cell type")
my_celltype.save()
```

Sometimes, it's useful to treat validated records differently from non-validated records. Here is one way:

```
original_values = ["gut", "gut2"]
validated_status = bt.Tissue.validate(original_values)
validated_values = [value for value, validated in zip(original_values, validated_status) if validated]
records_from_validated_values = bt.Tissue.from_values(validated_values)
ln.save(records_from_validated_values)
```

:::
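
The four lookups listed in the dropdown can be sketched as a small cascade over plain dictionaries. This is a toy model under stated assumptions: `from_values_sketch`, `toy_registry`, and `toy_public` are hypothetical names with made-up contents, and the real `.from_values()` works on registry records rather than dictionaries.

```python
def from_values_sketch(values, registry, public):
    """Resolve values by name, then synonym, first in-house, then public.

    `registry` and `public` map a canonical name to a set of synonyms.
    Values that match nowhere are dropped from the result.
    """

    def match(value, source):
        if value in source:  # lookups 1 and 3: exact name match
            return value
        for name, synonyms in source.items():  # lookups 2 and 4: synonym match
            if value in synonyms:
                return name
        return None

    resolved = {}
    for value in values:
        # The in-house registry is always consulted before the public ontology.
        for origin, source in (("registry", registry), ("public", public)):
            name = match(value, source)
            if name is not None:
                resolved[value] = (name, origin)
                break
    return resolved


# Made-up contents for illustration only.
toy_registry = {"gamma-delta T cell": set(), "T cell": {"T lymphocyte"}}
toy_public = {"hepatocyte": set(), "hematopoietic stem cell": {"HSC"}}

resolved = from_values_sketch(
    ["gamma-delta T cell", "T lymphocyte", "hepatocyte", "HSC", "my new cell type"],
    toy_registry,
    toy_public,
)
print(resolved)
```

As in the dropdown's example, every value resolves except "my new cell type", which would have to be created explicitly.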


Alternatively, we can create entries based on ontology ids:

In [None]:
adata.obs.cell_type_id.unique().tolist()

In [None]:
bt.CellType.from_values(adata.obs.cell_type_id, field=bt.CellType.ontology_id)

If we're happy with the cell type records, we save them to the registry:

In [None]:
ln.save(cell_types)

Now, let's look at our in-house registry:

In [None]:
bt.CellType.df()

## Access records in in-house ontologies

Search:

In [None]:
bt.CellType.search("gamma delta T cell").df().head(2)

Or look up with auto-complete:

In [None]:
cell_types = bt.CellType.lookup()
hsc_record = cell_types.hematopoietic_stem_cell
hsc_record

## Validate & standardize

Simple validation of an iterable of values works like so:

In [None]:
bt.CellType.validate(["HSC", "blood forming stem cell"])

Because these values don't comply with the registry, they're not validated!

You can easily convert these values to validated standardized names based on synonyms like so:

In [None]:
bt.CellType.standardize(["HSC", "blood forming stem cell"])

Alternatively, you can use `.from_values()`, which will only ever create validated records and automatically standardize under-the-hood:

In [None]:
bt.CellType.from_values(["HSC", "blood forming stem cell"])

We can also add new synonyms to a record like so:

In [None]:
hsc_record.add_synonym("HSCs")

When we encounter this synonym as a value, it will now be standardized via synonym lookup and mapped onto the correct registry record:

In [None]:
bt.CellType.standardize(["HSCs"])
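
The difference between validating and standardizing can be illustrated with a toy registry. `ToyRegistry` is a hypothetical class written for this sketch; it assumes that validation checks values against canonical names only, while standardization maps known synonyms to those names.

```python
class ToyRegistry:
    """Toy stand-in for a registry: canonical names with sets of synonyms."""

    def __init__(self):
        self.synonyms = {}  # canonical name -> set of synonyms

    def add(self, name, synonyms=()):
        self.synonyms[name] = set(synonyms)

    def add_synonym(self, name, synonym):
        self.synonyms[name].add(synonym)

    def validate(self, values):
        # A value is valid only if it equals a canonical name.
        return [value in self.synonyms for value in values]

    def standardize(self, values):
        # Map known synonyms to canonical names; leave other values unchanged.
        reverse = {s: name for name, syns in self.synonyms.items() for s in syns}
        return [reverse.get(value, value) for value in values]


toy = ToyRegistry()
toy.add("hematopoietic stem cell", {"HSC", "blood forming stem cell"})

print(toy.validate(["HSC", "blood forming stem cell"]))
print(toy.standardize(["HSC", "blood forming stem cell"]))

before = toy.standardize(["HSCs"])  # unknown synonym, returned unchanged
toy.add_synonym("hematopoietic stem cell", "HSCs")
after = toy.standardize(["HSCs"])  # now mapped to the canonical name
print(before, after)
```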

A special synonym is `.abbr` (short for abbreviation), which has its own field and can be assigned via:

In [None]:
hsc_record.set_abbr("HSC")

You can create a lookup object from the `.abbr` field:

In [None]:
cell_types = bt.CellType.lookup("abbr")
hsc = cell_types.hsc
hsc

The same workflow works for all of `bionty`'s registries.

## Manage registries across organisms

Most registries are organism-aware, for instance, `Gene`:

In [None]:
bt.Gene.from_public(symbol="TCF7", organism="human")

Similarly, API calls that interact with multi-organism registries accept an `organism` argument, e.g.:

In [None]:
bt.Gene.validate(["TCF7", "ABC1"], organism="human")

And when working with the same organism throughout your analysis/workflow, you can omit the `organism` argument by configuring it globally:

In [None]:
bt.settings.organism = "mouse"
bt.Gene.from_public(symbol="Ap5b1")
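
One way to picture the global setting is as a fallback that applies whenever no explicit argument is passed. The sketch below uses hypothetical names (`ToySettings`, `resolve_organism`) and assumes that an explicit argument takes precedence over the global default; it does not reproduce `bionty`'s internals.

```python
class ToySettings:
    """Toy stand-in for a global settings object."""

    organism = None


settings = ToySettings()


def resolve_organism(organism=None):
    """Return the explicit organism if given, else the global default."""
    if organism is not None:
        return organism
    if settings.organism is None:
        raise ValueError("pass `organism` or set a global default first")
    return settings.organism


settings.organism = "mouse"
print(resolve_organism())  # falls back to the global default
print(resolve_organism("human"))  # explicit argument is used as given
```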

## Track underlying ontology versions

Under-the-hood, source ontology versions are automatically tracked:

In [None]:
bt.Source.filter(currently_used=True).df()

Each record is linked to a versioned public source (if it was created from a public ontology):

In [None]:
hepatocyte = bt.CellType.filter(name="hepatocyte").one()
hepatocyte.source

## Create records from specific public ontologies

By default, records are created from the `"currently_used"` public sources, which are configured during instance initialization, e.g.:

In [None]:
bt.Phenotype.public()

In [None]:
bt.Phenotype.sources(currently_used=True).df()
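
Conceptually, choosing a source amounts to filtering a table of sources: by the `currently_used` flag by default, or by name when one is given. The sketch below is a toy model. `ToySource`, `pick_source`, the `default_ontology` name, and the version strings are placeholders, not actual `bionty` sources or releases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySource:
    entity: str
    name: str
    version: str
    currently_used: bool


# Placeholder rows; names and versions are made up for illustration.
SOURCES = [
    ToySource("Organism", "default_ontology", "release-a", True),
    ToySource("Organism", "ncbitaxon", "release-b", False),
]


def pick_source(entity, name=None):
    """Pick the named source for an entity, or the currently used one."""
    candidates = [s for s in SOURCES if s.entity == entity]
    if name is None:
        candidates = [s for s in candidates if s.currently_used]
    else:
        candidates = [s for s in candidates if s.name == name]
    if len(candidates) != 1:
        raise LookupError(f"expected exactly one source, found {len(candidates)}")
    return candidates[0]


default_source = pick_source("Organism")
explicit_source = pick_source("Organism", name="ncbitaxon")
print(default_source.name, explicit_source.name)
```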

Sometimes, the default source doesn't contain the ontology term you are looking for.

You can then create a record from a non-default source. For instance, instead of using untyped labels for iris organisms as in {doc}`/tutorial2`, we can use the `ncbitaxon` ontology:

```python
source = bt.Source.filter(entity="Organism", source="ncbitaxon").one()
iris_setosa = bt.Organism.from_public(name="iris setosa", source=source)
iris_setosa.save()
```

Analogously, you can pass `source` to bulk-create records from a non-default source:

```python
records = bt.Organism.from_values(
    ["iris setosa", "iris versicolor", "iris virginica"], source=source
)
ln.save(records)
iris_setosa.parents.get(name="iris").view_parents(with_children=True)
```

In [None]:
# clean up test instance
!lamin delete --force test-registries