# Inspect & map identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.

Bionty enables this by mapping metadata against versioned ontologies using {meth}`~bionty.Bionty.inspect`.

For terms that are not directly mappable, we offer:
- {meth}`~bionty.Bionty.map_synonyms`.
- {meth}`~bionty.Bionty.lookup`.
- {meth}`~bionty.Bionty.fuzzy_match`.

In [None]:
from bionty import Gene, CellMarker, CellType
import pandas as pd

## Inspect and map synonyms of gene identifiers

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

In [None]:
df_orig

First we can check whether any of our values are mappable against the ontology reference.

Tip: available fields are accessible via auto-completion: `gene_bionty.`

In [None]:
gene_bionty = Gene()

In [None]:
gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

In [None]:
gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)

The output shows that 2 of the gene symbols are directly mappable. Bionty further warns us that some of our symbols can be mapped to standardized symbols.
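
Conceptually, inspection sorts each value into one of three groups: present in the reference, a known synonym, or unknown. The following is a minimal stdlib-only sketch of that idea, not Bionty's implementation; the `reference_symbols` set, the `synonyms` dict, and the `classify` helper are hypothetical, hand-written stand-ins for the ontology table.

```python
# Hypothetical stand-ins for an ontology table (not loaded from Bionty).
reference_symbols = {"A1CF", "A1BG", "BRCA2"}
synonyms = {"FANCD1": "BRCA2"}  # synonym -> standardized symbol


def classify(values):
    """Split values into mapped, synonym-mappable, and unmapped lists."""
    result = {"mapped": [], "synonym": [], "unmapped": []}
    for value in values:
        if value in reference_symbols:
            result["mapped"].append(value)
        elif value in synonyms:
            result["synonym"].append(value)
        else:
            result["unmapped"].append(value)
    return result


report = classify(["A1CF", "A1BG", "FANCD1", "corrupted"])
print(report)
```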

Mapping synonyms returns a list of standardized terms:

In [None]:
mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)

mapped_symbol_synonyms

Optionally, return only a mapper of {synonym: standardized name}:

In [None]:
gene_bionty.map_synonyms(df_orig["gene symbol"], gene_bionty.symbol, return_mapper=True)
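
A mapper of this shape can be applied with a dictionary lookup that leaves unknown values untouched. The sketch below assumes the mapper behaves like a plain `dict`; the `mapper` contents and the `standardize` helper are hypothetical and written by hand for illustration.

```python
# Hypothetical mapper, written by hand to mimic {synonym: standardized name}.
mapper = {"FANCD1": "BRCA2"}


def standardize(values, mapper):
    """Replace known synonyms and keep every other value unchanged."""
    return [mapper.get(value, value) for value in values]


symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
standardized = standardize(symbols, mapper)
print(standardized)
```

Keeping unknown values unchanged means corrupted entries stay visible for a later inspection step instead of being dropped silently.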

We can use the standardized symbols as the new index:

In [None]:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms

In [None]:
df_curated

You can also return a DataFrame with a boolean column indicating whether each identifier is mappable:

In [None]:
gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)

## Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values.
Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using `CellMarker`.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (time) from a flow cytometry dataset.

In [None]:
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let's instantiate the CellMarker ontology with the default database and version.

In [None]:
cell_marker_bionty = CellMarker()

cell_marker_bionty

First, we can have a look at the cell marker table that we just loaded.

In [None]:
df = cell_marker_bionty.df()

In [None]:
df.head()

Now let’s check which cell markers from the DataFrame can be found in the reference:

In [None]:
cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)

Logging suggests we map synonyms:

In [None]:
synonyms_mapper = cell_marker_bionty.map_synonyms(
    markers.index, cell_marker_bionty.name, return_mapper=True
)

We have now mapped 3 additional terms:

In [None]:
synonyms_mapper

Let's replace the synonyms with standardized names in the markers DataFrame:

In [None]:
markers.rename(index=synonyms_mapper, inplace=True)

The logging shows that 4 terms were not found in the reference.

Among them, `Time`, `Invalid-1`, and `Invalid-2` are non-marker channels, which the cell marker ontology does not cover and therefore cannot curate.

In [None]:
cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)

`CCR7x` is not found in the reference, so let's check the lookup with auto-completion:

In [None]:
cell_marker_bionty_lookup = cell_marker_bionty.lookup()

```{figure} ./images/lookup_ccr7.png
---
width: 70%
align: left
class: with-shadow
---
```

In [None]:
cell_marker_bionty_lookup.CCR7

Indeed, the marker should be `CCR7`; `CCR7x` was a typo.
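
The idea behind such a lookup can be sketched with the standard library: build an object whose attributes are the reference names, so that an editor can auto-complete them and a misspelled attribute raises an error instead of passing silently. This is an illustration under assumptions, not how Bionty builds its lookup; `reference_names` and `make_lookup` are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical reference names; a real ontology table holds many more.
reference_names = ["CCR7", "CD14", "CD4"]


def make_lookup(names):
    """Expose each name that is a valid Python identifier as an attribute."""
    return SimpleNamespace(**{n: n for n in names if n.isidentifier()})


lookup = make_lookup(reference_names)
print(lookup.CCR7)

try:
    lookup.CCR7x  # a typo fails loudly instead of being accepted
except AttributeError as error:
    print("not in reference:", error)
```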

Now let’s fix the markers so all of them can be linked:

```{tip}
Using `.lookup()` instead of typing a string by hand helps eliminate typos!
```

In [None]:
curated_df = markers.rename(index={"CCR7x": cell_marker_bionty_lookup.CCR7.name})

OK, now we can run `inspect` again, and all cell markers are linked!

In [None]:
cell_marker_bionty.inspect(curated_df.index, cell_marker_bionty.name)

## Map CellType names via fuzzy string matching

In [None]:
cell_type_bionty = CellType()

In [None]:
cell_type_bionty.fuzzy_match("T cells", cell_type_bionty.name)

By default, `fuzzy_match` also matches against synonyms:

In [None]:
cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name)

You can turn off synonym matching with `synonyms_field=None`:

In [None]:
cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name, synonyms_field=None)

Return all results ranked by matching ratios:

In [None]:
cell_type_bionty.fuzzy_match(
    "P cell", cell_type_bionty.name, return_ranked_results=True
).head()
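
Ranking by matching ratio can be sketched with `difflib` from the standard library. This stands in for whatever string-similarity scorer Bionty uses internally, so its scores may differ from Bionty's; `cell_type_names` and `rank_matches` are hypothetical names introduced for this example.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-picked names; not read from the ontology.
cell_type_names = ["T cell", "B cell", "PP cell", "neuron"]


def rank_matches(query, candidates):
    """Return (candidate, ratio) pairs sorted from best to worst match."""
    scored = [
        (name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_matches("P cell", cell_type_names)
for name, ratio in ranked[:3]:
    print(f"{name}: {ratio:.2f}")
```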

Tied results will all be returned:

In [None]:
cell_type_bionty.fuzzy_match("A cell", cell_type_bionty.name, synonyms_field=None)
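
Returning every tied result amounts to keeping all candidates that share the best score rather than only the first one. A minimal sketch, using `difflib` as a stand-in scorer (an assumption, not Bionty's implementation) with the hypothetical names `names` and `best_matches`:

```python
from difflib import SequenceMatcher

# Hypothetical candidate names chosen so that two of them tie.
names = ["B cell", "T cell", "neuron"]


def best_matches(query, candidates):
    """Return all candidates that share the highest similarity ratio."""
    scores = {
        name: SequenceMatcher(None, query.lower(), name.lower()).ratio()
        for name in candidates
    }
    top = max(scores.values())
    return [name for name, score in scores.items() if score == top]


tied = best_matches("A cell", names)
print(tied)
```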