# Gene

lamindb provides access to the following public gene ontologies through [bionty](https://lamin.ai/docs/bionty):

1. [Ensembl](https://ensembl.org)
2. [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene)

Here we show how to access and search gene ontologies to standardize new data.

## Setup

In [None]:
!lamin init --storage ./test-public-ontologies --schema bionty

In [None]:
import bionty as bt
import pandas as pd

## PublicOntology objects

Let us create a public ontology accessor with {meth}`~bionty.core.BioRegistry.public`, which chooses a default public ontology source from {class}`~docs:bionty.PublicSource`. The result is a [PublicOntology](https://lamin.ai/docs/bionty.core.publicontology) object, which you can think of as a public registry:

In [None]:
public = bt.Gene.public(organism="human")
public

As with registries, you can export the ontology as a `DataFrame`:

In [None]:
df = public.df()
df.head()

Unlike registries, you can also export it as a Pronto object via `public.ontology`.

## Look up terms

As with registries, terms can be looked up with auto-complete:

In [None]:
lookup = public.lookup()

The `.` accessor provides normalized terms (lower case, only contains alphanumeric characters and underscores):

In [None]:
lookup.tcf7
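
The normalization rule described above can be pictured with a small, self-contained sketch. The helper `to_lookup_key` is hypothetical and not part of bionty; collapsing runs of special characters into a single underscore is an assumption made for illustration only.

```python
import re


def to_lookup_key(name: str) -> str:
    """Hypothetical helper: turn a term name into an attribute-style key."""
    # Replace every run of non-alphanumeric characters with one underscore
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    # Lower-case the result so that keys are case-insensitive
    return key.lower()


print(to_lookup_key("TCF7"))
print(to_lookup_key("HLA-DRB1"))
```

Because several original strings can collapse to the same normalized key, the dict-based access shown next is the safer choice when the exact spelling matters.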

To look up the exact original strings, convert the lookup object to a dict and use the `[]` accessor:

In [None]:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]

By default, the `name` field is used to generate lookup keys. You can specify another field to look up:

In [None]:
lookup = public.lookup(public.ncbi_gene_id)

If multiple entries are matched, they are returned as a list:

In [None]:
lookup.bt_100126572
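
How one key can point at several entries is easy to see in a toy model. The records and the `build_lookup` helper below are invented for illustration and do not reflect bionty's internals.

```python
from collections import defaultdict


def build_lookup(records, field):
    """Hypothetical helper: group records by the value of `field`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[field]].append(record)
    # Unwrap single matches; keep a list when several records share a key
    return {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}


# Made-up records: two entries share the same identifier
records = [
    {"symbol": "GENE_A", "gene_id": "1001"},
    {"symbol": "GENE_B", "gene_id": "1002"},
    {"symbol": "GENE_C", "gene_id": "1002"},
]
lookup_by_id = build_lookup(records, "gene_id")
print(lookup_by_id["1001"])  # a single record
print(lookup_by_id["1002"])  # a list of two records
```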

## Search terms

Search behaves in the same way as it does for registries:

In [None]:
public.search("TP53").head(3)

By default, search also covers synonyms:

In [None]:
public.search("PDL1").head(3)

You can turn off synonym search by passing `synonyms_field=None`:

In [None]:
public.search("PDL1", synonyms_field=None).head(3)

Search another field (default is `.name`):

In [None]:
public.search("tumor protein p53", field=public.description).head()

## Standardize gene identifiers

Let us generate a `DataFrame` that stores a number of gene identifiers, some of which are corrupted:

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "ncbi id": ["29974", "1", "5133", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig

First, we check which of the index values validate against the ontology reference:

In [None]:
validated = public.validate(df_orig.index, public.ensembl_gene_id)
df_orig.index[~validated]
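
Conceptually, validation is an exact membership test that yields one boolean per input value, which is why the result can be negated and used as a mask. The sketch below mimics this in plain Python; `reference_ids` is a small stand-in for the ontology's Ensembl ID column, not real bionty output.

```python
# Stand-in reference: assume the ontology contains these Ensembl gene IDs
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}
ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"]

# One boolean per input value, in input order
validated = [gene_id in reference_ids for gene_id in ids]
# Keep only the values that failed validation
not_validated = [g for g, ok in zip(ids, validated) if not ok]

print(validated)
print(not_validated)
```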

Next, we check which of the remaining identifiers can be validated against the ontology:

In [None]:
# based on NCBI gene ID
public.validate(df_orig["ncbi id"], public.ncbi_gene_id)

In [None]:
# based on gene symbols
validated_symbols = public.validate(df_orig["gene symbol"], public.symbol)
df_orig["gene symbol"][~validated_symbols]

Here, two of the gene symbols are not validated. Let us inspect why:

In [None]:
public.inspect(df_orig["gene symbol"], public.symbol);

The log suggests using `.standardize()`:

In [None]:
mapped_symbol_synonyms = public.standardize(df_orig["gene symbol"])
mapped_symbol_synonyms

Optionally, you can return a mapper in the form of `{synonym1: standardized_name1, ...}`:

In [None]:
public.standardize(df_orig["gene symbol"], return_mapper=True)
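
The idea behind synonym-based standardization can be sketched without bionty. The `synonym_to_symbol` table below is a hand-written assumption (FANCD1 is a commonly used alias of BRCA2), and passing unknown values through unchanged is a choice made for this sketch.

```python
# Hand-written assumption: a one-entry synonym table
synonym_to_symbol = {"FANCD1": "BRCA2"}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

# Mapper containing only the values that actually change
mapper = {s: synonym_to_symbol[s] for s in symbols if s in synonym_to_symbol}
# Standardized list: mapped where possible, otherwise left as-is
standardized = [mapper.get(s, s) for s in symbols]

print(mapper)
print(standardized)
```

Note that a value such as `corrupted` survives standardization untouched, so a validation pass is still needed afterwards.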

We can use the standardized symbols as the new index:

In [None]:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated

You can convert identifiers by passing `return_field` to {meth}`~lamindb.core.CanValidate.standardize`:

In [None]:
public.standardize(
    df_curated.index,
    field=public.symbol,
    return_field=public.ensembl_gene_id,
)

Passing `return_mapper=True` returns the mappable identifiers as a dict:

In [None]:
public.standardize(
    df_curated.index,
    field=public.symbol,
    return_field=public.ensembl_gene_id,
    return_mapper=True,
)
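
Converting between identifier types amounts to a keyed lookup from one field to another. The sketch below uses only symbol/Ensembl pairs taken from the example table defined earlier; `symbol_to_ensembl` is an illustrative name, and dropping unmappable symbols from the mapper is an assumption of this sketch.

```python
# Pairs taken from the example DataFrame defined earlier
symbol_to_ensembl = {
    "A1CF": "ENSG00000148584",
    "A1BG": "ENSG00000121410",
}
symbols = ["A1CF", "A1BG", "corrupted"]

# Keep only symbols that can be mapped to an Ensembl gene ID
mapper = {s: symbol_to_ensembl[s] for s in symbols if s in symbol_to_ensembl}
print(mapper)
```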

## Ontology source versions

For any given entity, we can choose from a number of versions:

In [None]:
bt.PublicSource.filter(entity="Gene").df()

When instantiating a Bionty object, we can choose a source or version:

In [None]:
public_source = bt.PublicSource.filter(
    source="ensembl", version="release-110", organism="human"
).one()
public = bt.Gene.public(public_source=public_source)
public

The currently used ontologies can be displayed using:

In [None]:
bt.PublicSource.filter(currently_used=True).df()