# Validate, inspect & standardize identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.

Bionty enables this by mapping metadata against versioned ontologies using {meth}`~bionty.Bionty.validate` and {meth}`~bionty.Bionty.inspect`.

For terms that are not directly mappable, Bionty offers the following methods (see also {doc}`./search`):
- {meth}`~bionty.Bionty.standardize`
- {meth}`~bionty.Bionty.lookup`
- {meth}`~bionty.Bionty.search`

In [None]:
import bionty as bt
import pandas as pd

## Inspect and map synonyms of gene identifiers

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "ncbi id": ["29974", "1", "5133", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

In [None]:
df_orig

First, we check which of our values validate against the ontology reference.

Tip: available fields are accessible via `gene_bt.fields`

In [None]:
gene_bt = bt.Gene()

gene_bt

In [None]:
validated = gene_bt.validate(df_orig.index, gene_bt.ensembl_gene_id)
validated

In [None]:
# show not validated terms
df_orig.index[~validated]
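
Conceptually, validation is a membership test of each value against the reference for the chosen field. The following is a minimal sketch of that idea in plain Python, not Bionty's implementation. `reference_ids` is a hypothetical stand-in for the ontology's Ensembl gene IDs and holds only the three IDs that this example treats as valid.

```python
# Hypothetical stand-in for the reference; a real ontology holds many more IDs.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}

values = [
    "ENSG00000148584",
    "ENSG00000121410",
    "ENSG00000188389",
    "ENSGcorrupted",
]

# One boolean per input value, in input order.
validated_sketch = [value in reference_ids for value in values]

# Inverting the mask selects the values that failed validation.
not_validated = [v for v, ok in zip(values, validated_sketch) if not ok]
print(not_validated)
```

The boolean mask plays the same role as `validated` above: it can be inverted to pull out the entries that need attention.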

The same procedure works for `ncbi_gene_id` and gene symbols. We validate the NCBI IDs first, then check which symbols are mappable against the ontology.

In [None]:
gene_bt.validate(df_orig["ncbi id"], gene_bt.ncbi_gene_id)

In [None]:
validated_symbols = gene_bt.validate(df_orig["gene symbol"], gene_bt.symbol)

In [None]:
df_orig["gene symbol"][~validated_symbols]

Here, 2 of the gene symbols are not validated. What should we do? Let's run a full inspection of these symbols:

In [None]:
gene_bt.inspect(df_orig["gene symbol"], gene_bt.symbol);

`inspect` detects synonyms and suggests using `.standardize()`:

In [None]:
# mapping synonyms returns a list of standardized terms:
mapped_symbol_synonyms = gene_bt.standardize(df_orig["gene symbol"])

mapped_symbol_synonyms
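
To picture what an inspection reports, the sketch below sorts each value into one of three groups: already standardized, a known synonym, or unknown. It is an illustration under assumptions, not Bionty's implementation. `reference` is a hypothetical toy table, and `classify` is a helper name made up for this example.

```python
# Toy reference: standardized symbol -> known synonyms (illustrative subset).
reference = {
    "A1CF": [],
    "A1BG": [],
    "BRCA2": ["FANCD1"],
}


def classify(values, reference):
    """Sort values into validated, synonym-mappable, and unknown."""
    synonym_to_name = {
        synonym: name
        for name, synonyms in reference.items()
        for synonym in synonyms
    }
    result = {"validated": [], "synonyms": [], "unknown": []}
    for value in values:
        if value in reference:
            result["validated"].append(value)
        elif value in synonym_to_name:
            result["synonyms"].append(value)
        else:
            result["unknown"].append(value)
    return result


report = classify(["A1CF", "A1BG", "FANCD1", "corrupted"], reference)
print(report)
```

Values in the `synonyms` group can be fixed automatically, while values in the `unknown` group need manual curation.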

Optionally, return only a mapper of {synonym: standardized name}:

In [None]:
gene_bt.standardize(df_orig["gene symbol"], return_mapper=True)
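
A mapper of this kind is simply a dictionary that can be applied to any sequence of values. The sketch below shows one way to build and apply such a dictionary in plain Python. The synonym table and the helpers `build_mapper` and `apply_mapper` are hypothetical. It assumes that only values that need renaming appear in the mapper and that unmapped values are left unchanged.

```python
def build_mapper(values, synonym_to_name):
    """Return {synonym: standardized name} for the values that need renaming."""
    return {v: synonym_to_name[v] for v in values if v in synonym_to_name}


def apply_mapper(values, mapper):
    """Replace mapped values and leave everything else unchanged."""
    return [mapper.get(v, v) for v in values]


# Toy synonym table (assumption for this sketch).
synonym_to_name = {"FANCD1": "BRCA2"}

symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
mapper = build_mapper(symbols, synonym_to_name)
standardized = apply_mapper(symbols, mapper)
print(mapper)
print(standardized)
```

Keeping the mapper separate from the data makes it possible to review the proposed renames before applying them.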

We can use the standardized symbols as the new standardized index:

In [None]:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms

In [None]:
df_curated

## Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values.
Some values might have used a different standard or be corrupted.

This section demonstrates how to look up unmatched terms and curate them using `CellMarker`.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (`Time`) from a flow cytometry dataset.

In [None]:
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let's instantiate the CellMarker ontology with the default database and version.

In [None]:
cellmarker_bt = bt.CellMarker()

cellmarker_bt

Now let’s check which cell markers from the DataFrame can be found in the reference:

In [None]:
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);

The logging suggests that we map synonyms:

In [None]:
synonyms_mapper = cellmarker_bt.standardize(markers.index, return_mapper=True)

This maps 4 additional terms:

In [None]:
synonyms_mapper

Let's replace the synonyms with standardized names in the markers DataFrame:

In [None]:
markers.rename(index=synonyms_mapper, inplace=True)

After mapping synonyms, the logging shows that 4 terms are still not found in the reference.

Among them, `Time`, `Invalid-1` and `Invalid-2` are non-marker channels, which won’t be curated by `CellMarker`.

In [None]:
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);
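
One way to keep track of what still needs attention is to set the known non-marker channels aside explicitly. The sketch below does this in plain Python for the four terms named in this section. The variable names are made up for illustration.

```python
# The four terms this section reports as not found in the reference.
not_found = ["CD127a", "Invalid-1", "Invalid-2", "Time"]

# Channels that are not cell markers and therefore need no curation.
non_marker_channels = {"Time", "Invalid-1", "Invalid-2"}

# What remains is the list of genuine candidates for manual curation.
needs_curation = [t for t in not_found if t not in non_marker_channels]
print(needs_curation)
```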

`CD127a` is not found in the reference. Let's check the lookup with auto-completion:

In [None]:
lookup = cellmarker_bt.lookup()

In [None]:
lookup.cd127

Indeed, the correct name is `CD127`; `CD127a` was a typo.

Now let’s fix the markers so all of them can be linked:

```{tip}
Using `.lookup()` instead of passing a string helps eliminate possible typos!
```

In [None]:
curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
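
The reason attribute access guards against typos can be shown with a small stand-in object. The sketch below builds a lookup with one attribute per reference name, so a misspelled attribute fails immediately instead of silently passing through. It is a hypothetical illustration in plain Python, not how Bionty builds its lookup. `make_lookup` and the short name list are assumptions.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name as a lower-cased, identifier-safe attribute."""
    entries = {}
    for name in names:
        key = re.sub(r"\W+", "_", name).lower()
        if not key.isidentifier():
            # Prefix keys that would otherwise be invalid attribute names.
            key = f"_{key}"
        entries[key] = SimpleNamespace(name=name)
    return SimpleNamespace(**entries)


lookup_sketch = make_lookup(["CD127", "CD14", "CD8"])

# The correct name comes from the reference, not from a hand-typed string.
fixed = {"CD127a": lookup_sketch.cd127.name}
print(fixed)
```

Accessing `lookup_sketch.cd127a` would raise an `AttributeError`, whereas a hand-typed string with the same typo would go unnoticed.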

Optionally, run a fuzzy match:

In [None]:
cellmarker_bt.search("CD127a").head()
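
To illustrate what a fuzzy match does, the sketch below ranks candidate names by string similarity using `difflib` from the Python standard library. This is a stand-in chosen for the example, and Bionty's `search` may score and rank matches differently. The list `reference_names` is a hypothetical subset of marker names.

```python
import difflib

# Hypothetical subset of reference names.
reference_names = ["CD127", "CD14", "CD4", "CD8", "CCR7", "CD45RA"]

# Keep at most 3 candidates whose similarity ratio is at least 0.8.
matches = difflib.get_close_matches("CD127a", reference_names, n=3, cutoff=0.8)
print(matches)
```

A high cutoff keeps only near-identical names, which suits typo correction. Lowering it returns looser candidates that need manual review.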

OK, now we can run `inspect` again and all cell markers are linked!

In [None]:
cellmarker_bt.inspect(curated_df.index, cellmarker_bt.name);