# Access public ontologies

Here we show how to access public knowledge for organisms, genes, proteins and cell markers.

In the following guide, you'll see how to manage in-house knowledge: {doc}`bio-registries`.

In [None]:
!lamin init --storage ./test-ontologies --schema bionty

In [None]:
import lnschema_bionty as lb
import pandas as pd

# currently necessary to add an entry "human" into an empty instance
lb.settings.organism = "human"

Let us create a public knowledge accessor with {meth}`lnschema_bionty.dev.BioRegistry.bionty`, which chooses a default public knowledge source from {class}`lnschema_bionty.BiontySource`.

You'll get a [Bionty](https://lamin.ai/docs/bionty/bionty.bionty) object, which you can think of as a less-capable registry:

In [None]:
gene_bt = lb.Gene.bionty(organism="human")
gene_bt

As for registries, you can get a `DataFrame` for any `Bionty` object:

In [None]:
df = gene_bt.df()
df.head()

## Look-up terms

As for registries, terms can be searched with auto-complete using a lookup object:

In [None]:
lookup = gene_bt.lookup()

The `.` accessor provides normalized terms (lower case, only contains alphanumeric characters and underscores):

In [None]:
lookup.tcf7
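
As a rough illustration of the normalization described above, here is a minimal sketch of how such keys could be derived. The helper `to_lookup_key` is hypothetical and not part of Bionty; the `bt_` prefix for keys that start with a digit is inferred from keys such as `bt_100126572`, and the actual rules may differ in detail.

```python
import re


def to_lookup_key(term: str, prefix: str = "bt_") -> str:
    """Turn an arbitrary term into a key usable as a Python attribute name.

    Hypothetical helper for illustration only; not Bionty's implementation.
    """
    # Replace every run of non-alphanumeric characters with one underscore,
    # drop leading/trailing underscores, and lower-case the result.
    key = re.sub(r"[^0-9a-zA-Z]+", "_", term).strip("_").lower()
    # Attribute names cannot start with a digit, so prepend a prefix
    # (assumed here to be "bt_").
    if key and key[0].isdigit():
        key = prefix + key
    return key


keys = [to_lookup_key(t) for t in ["TCF7", "HLA-DRA", "CD8 T cell", "100126572"]]
print(keys)
```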

To look up the exact original strings, convert the lookup object to a dict and use the `[]` accessor:

In [None]:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]

By default, the `name` field is used to generate lookup keys. You can specify another field to look up:

In [None]:
lookup = gene_bt.lookup(gene_bt.ncbi_gene_id)

If multiple entries are matched, they are returned as a list:

In [None]:
lookup.bt_100126572

In [None]:
lookup_dict = lookup.dict()
lookup_dict["100126572"]
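
The list behavior for multiple matches can be pictured with a small grouping sketch. The function `build_lookup` and the records below are hypothetical placeholders (not real genes or Bionty internals); the sketch only assumes that a key with one match yields the record itself and a key with several matches yields a list.

```python
from collections import defaultdict


def build_lookup(records, field):
    """Group records by `field`; hypothetical helper for illustration."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[field]].append(record)
    # A single match is returned as the record itself,
    # several matches are returned together as a list.
    return {
        key: matches[0] if len(matches) == 1 else matches
        for key, matches in grouped.items()
    }


# Placeholder records: two of them share the same identifier.
records = [
    {"symbol": "GENE_A", "ncbi_gene_id": "111"},
    {"symbol": "GENE_B", "ncbi_gene_id": "222"},
    {"symbol": "GENE_C", "ncbi_gene_id": "222"},
]

by_ncbi = build_lookup(records, "ncbi_gene_id")
print(by_ncbi["111"])  # one record
print(by_ncbi["222"])  # list of two records
```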

## Search terms

Search also behaves the same way as it does for registries:

In [None]:
celltype_bt = lb.CellType.bionty()
celltype_bt.search("cytotoxic T cells").head(3)

By default, search also covers synonyms:

In [None]:
celltype_bt.search("P cell").head(3)

You can turn off synonym matching with `synonyms_field=None`:

In [None]:
celltype_bt.search("P cell", synonyms_field=None).head(3)
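
To see why synonym matching changes the ranking, consider the following minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is an assumption and not necessarily what Bionty uses; the function `search`, the `use_synonyms` flag, the pipe-separated synonym format, and the toy records are all illustrative.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, records, use_synonyms=True):
    """Rank record names by their best match against name (and synonyms)."""
    scored = []
    for rec in records:
        candidates = [rec["name"]]
        if use_synonyms and rec["synonyms"]:
            # Assumption: synonyms are stored as a pipe-separated string.
            candidates += rec["synonyms"].split("|")
        best = max(similarity(query, c) for c in candidates)
        scored.append((best, rec["name"]))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored]


# Toy records for illustration; not copied from the Cell Ontology.
records = [
    {"name": "nodal myocyte", "synonyms": "P cell|cardiac pacemaker cell"},
    {"name": "PP cell", "synonyms": ""},
    {"name": "T cell", "synonyms": "T lymphocyte"},
]

with_syn = search("P cell", records)
without_syn = search("P cell", records, use_synonyms=False)
print(with_syn)
print(without_syn)
```

With synonyms enabled, the record whose synonym equals the query ranks first; with synonyms disabled, only names are compared and the closest name wins.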

Search another field (default is `.name`):

In [None]:
celltype_bt.search("CD8 positive alpha beta T cells", field=celltype_bt.definition).head(
    3
)

## Inspect and map synonyms of gene identifiers

Let us generate a `DataFrame` that stores a number of gene identifiers, some of which are corrupted:

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "ncbi id": ["29974", "1", "5133", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig

First we can check whether any of our values are validated against the ontology reference:

In [None]:
validated = gene_bt.validate(df_orig.index, gene_bt.ensembl_gene_id)
validated

Show what hasn't validated:

In [None]:
df_orig.index[~validated]
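
Conceptually, validation is a membership test that yields one boolean per input value, and negating that mask selects the values that failed. A minimal standard-library sketch, assuming a toy reference set (the variable names are illustrative and the real reference is the full gene table):

```python
# Toy reference: the identifiers we pretend the ontology contains.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}

values = [
    "ENSG00000148584",
    "ENSG00000121410",
    "ENSG00000188389",
    "ENSGcorrupted",
]

# One boolean per input value, in input order.
validated = [value in reference_ids for value in values]

# Equivalent of indexing with the negated mask: keep what did not validate.
not_validated = [value for value, ok in zip(values, validated) if not ok]

print(validated)
print(not_validated)
```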

The same procedure is available for `ncbi_gene_id` or `symbol`.

First, we validate which NCBI gene IDs and symbols are mappable against the ontology:

In [None]:
gene_bt.validate(df_orig["ncbi id"], gene_bt.ncbi_gene_id)

In [None]:
validated_symbols = gene_bt.validate(df_orig["gene symbol"], gene_bt.symbol)
df_orig["gene symbol"][~validated_symbols]

Here, 2 of the gene symbols are not validated. Let's inspect why:

In [None]:
gene_bt.inspect(df_orig["gene symbol"], gene_bt.symbol);

The logging suggests using `.standardize()`:

In [None]:
mapped_symbol_synonyms = gene_bt.standardize(df_orig["gene symbol"])
mapped_symbol_synonyms

Optionally, you can return a mapper in the form of `{synonym1: standardized_name1, ...}`:

In [None]:
gene_bt.standardize(df_orig["gene symbol"], return_mapper=True)
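
The idea behind synonym standardization can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Bionty's implementation: the helpers `build_synonym_mapper` and `standardize` are hypothetical, the reference is a toy table with abbreviated, illustrative synonym lists, and synonyms are assumed to be pipe-separated.

```python
def build_synonym_mapper(values, reference):
    """Return {synonym: standardized_symbol} for the values that need mapping."""
    symbols = {rec["symbol"] for rec in reference}
    synonym_to_symbol = {}
    for rec in reference:
        # Skip empty strings produced by splitting an empty synonym field.
        for synonym in filter(None, rec["synonyms"].split("|")):
            synonym_to_symbol[synonym] = rec["symbol"]
    return {
        value: synonym_to_symbol[value]
        for value in values
        if value not in symbols and value in synonym_to_symbol
    }


def standardize(values, reference):
    """Replace known synonyms; leave everything else untouched."""
    mapper = build_synonym_mapper(values, reference)
    return [mapper.get(value, value) for value in values]


# Toy reference table for illustration.
reference = [
    {"symbol": "A1CF", "synonyms": ""},
    {"symbol": "A1BG", "synonyms": ""},
    {"symbol": "BRCA2", "synonyms": "FANCD1|FANCD"},
]

values = ["A1CF", "A1BG", "FANCD1", "corrupted"]
mapper = build_synonym_mapper(values, reference)
standardized = standardize(values, reference)
print(mapper)
print(standardized)
```

Values that are neither a standard symbol nor a known synonym, such as `"corrupted"`, pass through unchanged in this sketch and would still fail validation afterwards.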

We can use the standardized symbols as the new standardized index:

In [None]:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated

## Convert gene identifiers

You can convert identifiers by passing `return_field` to {meth}`~lamindb.dev.CanValidate.standardize`:

In [None]:
gene_bt.standardize(
    df_curated.index, field=gene_bt.symbol, return_field=gene_bt.ensembl_gene_id
)

And return mappable identifiers as a dict:

In [None]:
gene_bt.standardize(
    df_curated.index,
    field=gene_bt.symbol,
    return_field=gene_bt.ensembl_gene_id,
    return_mapper=True,
)
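
Converting between identifier types amounts to a lookup from one field of the reference to another. The sketch below is a hedged illustration: the function `convert` is hypothetical, the reference is a two-row toy table, and leaving unmappable values unchanged is an assumption about the behavior rather than a documented guarantee.

```python
def convert(values, reference, field, return_field):
    """Map values found in `field` to the matching `return_field` entries."""
    index = {rec[field]: rec[return_field] for rec in reference}
    # Only values present in the reference end up in the mapper.
    mapper = {value: index[value] for value in values if value in index}
    # Assumption: values without a match are returned unchanged.
    converted = [mapper.get(value, value) for value in values]
    return converted, mapper


# Toy reference table for illustration.
reference = [
    {"symbol": "A1CF", "ensembl_gene_id": "ENSG00000148584"},
    {"symbol": "A1BG", "ensembl_gene_id": "ENSG00000121410"},
]

converted, mapper = convert(
    ["A1CF", "A1BG", "corrupted"],
    reference,
    field="symbol",
    return_field="ensembl_gene_id",
)
print(converted)
print(mapper)
```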

## Standardize CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values.
Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up non-validated cell marker terms and curate them using `CellMarker`.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (time) from a flow cytometry dataset.

In [None]:
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let's instantiate the CellMarker ontology with the default database and version.

In [None]:
cellmarker_bt = lb.CellMarker.bionty()
cellmarker_bt

Now let’s check which cell markers from the DataFrame can be found in the reference:

In [None]:
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);

Logging suggests we map synonyms:

In [None]:
synonyms_mapper = cellmarker_bt.standardize(markers.index, return_mapper=True)

We have now mapped 4 additional terms:

In [None]:
synonyms_mapper

Let's replace the synonyms with standardized names in the markers DataFrame:

In [None]:
markers.rename(index=synonyms_mapper, inplace=True)

The logging shows that 4 terms were not found in the reference.

Among them, `Time`, `Invalid-1` and `Invalid-2` are non-marker channels, which won’t be curated as cell markers.

In [None]:
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);

`CD127a` is not found in the reference, so let's check the lookup with auto-completion:

In [None]:
lookup = cellmarker_bt.lookup()

In [None]:
lookup.cd127

Indeed, the marker should be `CD127`; `CD127a` was a typo.

Now let’s fix the markers so all of them can be linked:

```{tip}
Using the `.lookup()` object instead of passing a string helps eliminate possible typos!
```

In [None]:
curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
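
The benefit of going through the lookup object can be shown with a small stand-in. Here `SimpleNamespace` mimics a lookup object with a single entry, and `rename` is a hypothetical list-based helper; neither is part of Bionty. A mistyped string is accepted silently, whereas a mistyped attribute fails immediately.

```python
from types import SimpleNamespace

# Stand-in for a lookup object with one known marker (illustrative only).
lookup = SimpleNamespace(cd127=SimpleNamespace(name="CD127"))

markers = ["CD127a", "CD4"]


def rename(values, mapper):
    """Replace values found in the mapper; keep the rest unchanged."""
    return [mapper.get(value, value) for value in values]


# A typo in a hand-typed target string goes unnoticed and produces
# another invalid name.
silently_wrong = rename(markers, {"CD127a": "CD172"})

# The same typo as an attribute access raises right away.
try:
    lookup.cd172
except AttributeError:
    typo_caught = True
else:
    typo_caught = False

# Taking the name from the lookup entry guarantees a reference spelling.
curated = rename(markers, {"CD127a": lookup.cd127.name})
print(silently_wrong, typo_caught, curated)
```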

Optionally, search:

In [None]:
cellmarker_bt.search("CD127a").head()

Now all cell markers validate; only the non-marker channels remain unvalidated:

In [None]:
cellmarker_bt.validate(curated_df.index, cellmarker_bt.name)

## Version ontology sources

For any given entity, we can choose from a number of versions:

In [None]:
lb.BiontySource.filter(entity="CellType").df()

When instantiating a Bionty object, we can choose a source or version:

In [None]:
bionty_source = lb.BiontySource.filter(source="cl", version="2022-08-16").one()
celltype_bt = lb.CellType.bionty(bionty_source=bionty_source)
celltype_bt
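
Pinning a source version can be thought of as filtering a table of source records and insisting on exactly one match. The sketch below is a simplified model under stated assumptions: the `Source` dataclass, `filter_sources`, and `one` are hypothetical stand-ins for the registry query, the second version string is a labeled placeholder, and raising on a non-unique match is assumed behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    entity: str
    source: str
    version: str
    currently_used: bool


def filter_sources(sources, **criteria):
    """Keep the records whose attributes equal all given criteria."""
    return [
        s for s in sources
        if all(getattr(s, key) == value for key, value in criteria.items())
    ]


def one(matches):
    """Return the single match, or fail if the filter was ambiguous or empty."""
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {len(matches)}")
    return matches[0]


# Toy source table; the second version string is a placeholder.
sources = [
    Source("CellType", "cl", "2022-08-16", False),
    Source("CellType", "cl", "hypothetical-newer-version", True),
]

pinned = one(filter_sources(sources, source="cl", version="2022-08-16"))
current = filter_sources(sources, currently_used=True)
print(pinned)
print(current)
```

Filtering by `source` alone would match two records here, so `one` would raise; adding the version makes the selection unambiguous.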

The currently used ontologies can be displayed using:

In [None]:
lb.BiontySource.filter(currently_used=True).df()

In [None]:
!lamin delete --force test-ontologies
!rm -r test-ontologies