# Curate entity identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.
Bionty enables this by curating data against versioned ontologies using {meth}`~bionty.Entity.curate`.

We'll demonstrate this by first curating genes and then cell markers, where not all values can be mapped immediately.

Let's start by importing the required modules from Bionty and Pandas.

In [None]:
from bionty import Gene, CellMarker, lookup
import pandas as pd

## Curating genes

To illustrate this, we generate a DataFrame that stores a number of gene identifiers, some of them corrupted.

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

In [None]:
df_orig

Curation requires a reference identifier, specified as the `id` parameter when instantiating the `Gene` class. The available identifiers can be looked up via {meth}`~bionty.lookup`.

In [None]:
lookup.gene_id

To curate the DataFrame into a queryable form, we create an index that corresponds to a default identifier (by default, `ensembl_gene_id`).

If no column name is provided, the index is curated.

In [None]:
Gene().curate(df_orig)

The curated DataFrame has now been reindexed by the curated gene identifiers.
A new column `orig_index` containing the original index has been added.
Furthermore, a new boolean column `__curated__` has been added, indicating whether each value could be curated successfully.
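To make the shape of that result concrete, here is a minimal pandas-only sketch of the idea. It is not Bionty's implementation: the helper `curate_index_sketch` and the small `reference_ids` set are hypothetical, and real curation maps against a full versioned ontology table.

```python
import pandas as pd

# Hypothetical stand-in for the reference identifiers of an ontology table.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}


def curate_index_sketch(df: pd.DataFrame, reference: set) -> pd.DataFrame:
    """Flag which index values are found in `reference` (illustration only)."""
    out = df.copy()
    # Keep the original index values so nothing is lost.
    out["orig_index"] = out.index
    # True where the identifier is present in the reference.
    out["__curated__"] = out.index.isin(reference)
    return out


example = pd.DataFrame(
    {"gene symbol": ["A1CF", "corrupted"]},
    index=pd.Index(["ENSG00000148584", "corrupted"], name="ensembl_gene_id"),
)
result = curate_index_sketch(example, reference_ids)
print(result)
```

Rows flagged `False` in `__curated__` are the ones that need manual attention.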

You may provide a column name to curate a specific column against a reference identifier.

In [None]:
Gene(id=lookup.gene_id.hgnc_id).curate(df_orig, column="hgnc id")

When mapping symbols, the function automatically converts aliases into standardized symbols. In this example, the alias `FANCD1` is converted into the approved symbol `BRCA2`.

In [None]:
Gene(id=lookup.gene_id.symbol).curate(df_orig, column="gene symbol")
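The alias handling can be pictured as a lookup from every known alias to its approved symbol. The sketch below is an assumption about the general idea, not Bionty's code; `alias_to_symbol`, `approved_symbols` and `standardize_symbol` are hypothetical names, and both tables are hand-written excerpts.

```python
# Hypothetical excerpt of an alias table: alias -> approved symbol.
alias_to_symbol = {"FANCD1": "BRCA2", "PD-1": "PDCD1"}
# Hypothetical excerpt of the approved symbols in the reference.
approved_symbols = {"A1CF", "A1BG", "BRCA2", "PDCD1"}


def standardize_symbol(symbol):
    """Return the approved symbol, or None if the value cannot be mapped."""
    if symbol in approved_symbols:
        return symbol
    # Fall back to the alias table; unknown values map to None.
    return alias_to_symbol.get(symbol)


symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
standardized = [standardize_symbol(s) for s in symbols]
print(standardized)
```

Values that come back as `None` correspond to entries that cannot be curated, such as the `corrupted` row.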

## Match (unmappable) cell markers to the reference

Depending on how the data was collected and which terminology was used, it is not always possible to curate the values.
Some values might follow a different standard or are simply corrupted.

This section demonstrates how to look up unmatched terms and curate them using the `CellMarker` entity.
First, we create an example Pandas DataFrame containing a few valid and invalid cell markers (antibody targets) and features (`Time`) from a flow cytometry dataset.

In [None]:
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let's instantiate the CellMarker ontology with the default database and version.

In [None]:
cell_marker = CellMarker()

First, we can have a look at the cell marker table that we just loaded.

In [None]:
df = cell_marker.df

In [None]:
df.tail()

Now let’s check which cell markers from the file can be found in the reference.
We do this using the `.curate` function:

In [None]:
cell_marker.curate(markers)

From the logging, it can be seen that 7 terms were not found in the reference!

Among them, `Time`, `Invalid-1` and `Invalid-2` are non-marker channels, which the cell marker reference cannot curate.

However, some markers such as `CD66b` and `Siglec8` are valid but not purely upper-case.

The markers in the reference table are case sensitive by default, so let’s try turning off case sensitivity:

In [None]:
cell_marker.curate(markers, case_sensitive=False)

OK, great, we are down to 4 unmatched terms (3 non-markers)!
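The effect of the `case_sensitive` switch can be sketched in plain Python. This is an illustration under stated assumptions, not Bionty's implementation: `match_names` is a hypothetical helper, and `reference_names` is an invented excerpt whose spelling may differ from the real reference table.

```python
# Hypothetical excerpt of reference marker names (illustration only).
reference_names = ["KI67", "CD66B", "SIGLEC8", "CD14"]


def match_names(values, reference, case_sensitive=True):
    """Map each value to its reference name, or None if unmatched."""
    if case_sensitive:
        table = {name: name for name in reference}
        return {v: table.get(v) for v in values}
    # Compare on a case-folded key but return the reference spelling.
    table = {name.casefold(): name for name in reference}
    return {v: table.get(v.casefold()) for v in values}


queries = ["CD66b", "Siglec8", "CD14", "Time"]
strict = match_names(queries, reference_names)
relaxed = match_names(queries, reference_names, case_sensitive=False)
print(strict)
print(relaxed)
```

Turning off case sensitivity rescues values that differ only in capitalization, while genuine non-markers such as `Time` stay unmatched.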

Now let’s manually look up the term `PD1` by typing `lookup.` and using auto-completion.

In [None]:
lookup = cell_marker.lookup

```{figure} ../img/lookup_pd1.png
---
width: 70%
align: left
class: with-shadow
---
```

In [None]:
lookup.PD_1

Indeed, we find `PD-1`, which means that a "-" is missing in our DataFrame entry.
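One way to picture why `PD-1` appears as `PD_1` on the lookup object: attribute names must be valid Python identifiers, so characters such as "-" have to be replaced. The sketch below is an assumption about how such a lookup could be built, not Bionty's actual code; `to_attribute`, `Record` and `marker_lookup` are hypothetical names.

```python
import re
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical excerpt of reference marker names (illustration only).
names = ["PD-1", "CD3", "CD45RA"]


def to_attribute(name):
    """Turn a marker name into a valid Python attribute name."""
    attr = re.sub(r"\W", "_", name)
    # Identifiers cannot start with a digit, so prefix those with "_".
    return f"_{attr}" if attr[0].isdigit() else attr


# Each record keeps the original spelling under `.name`.
Record = namedtuple("Record", ["name"])
marker_lookup = SimpleNamespace(**{to_attribute(n): Record(n) for n in names})

print(marker_lookup.PD_1.name)
```

Because the record stores the original spelling, `marker_lookup.PD_1.name` can be passed directly to a rename without retyping the string.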

Now let’s fix the markers so all of them can be linked:

```{tip}
Using `.lookup` instead of passing a string helps eliminate possible typos!
```

In [None]:
curated_df = markers.rename(index={"PD1": lookup.PD_1.name})

OK, now we can try to run curate again and all cell markers are linked!

In [None]:
cell_marker.curate(curated_df, case_sensitive=False)