# Curate entity identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.

Bionty has {meth}`~bionty.EntityTable.curate` for this.

In [None]:
from bionty import Gene, lookup
import pandas as pd

To illustrate, take a DataFrame that stores a number of gene identifiers, some of them corrupted.

In [None]:
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

In [None]:
df_orig

Curation requires a reference identifier, which is specified as the `id` parameter when instantiating the `Gene` class. The available identifiers can be looked up via {meth}`~bionty.lookup`.

In [None]:
lookup.gene_id

To curate the DataFrame into queryable form, we create an index that corresponds to the default identifier (`ensembl_gene_id`).

If no column name is provided, the default behavior is to curate the index.

In [None]:
Gene().curate(df_orig)
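
To make the idea of index curation concrete, here is a minimal pandas-only sketch of a membership check against a reference. It is not Bionty's implementation: the `reference_ids` set and the `in_reference` column name are assumptions made up for this illustration, and a real reference would contain every valid Ensembl gene ID.

```python
import pandas as pd

# Hypothetical reference set (assumption); a real one would hold all valid IDs.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}

df = pd.DataFrame(
    {"gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"]},
    index=pd.Index(
        ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "corrupted"],
        name="ensembl_gene_id",
    ),
)

# Flag each row by whether its index value appears in the reference set.
# "in_reference" is a hypothetical column name chosen for this sketch.
df["in_reference"] = df.index.isin(reference_ids)

# Count the index values that could not be matched.
n_unmatched = int((~df["in_reference"]).sum())
print(f"{n_unmatched} of {len(df)} index values are not in the reference")
```

Rows flagged `False` are the ones that would need to be fixed or dropped before the DataFrame can be queried reliably by identifier.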

You may provide a column name to curate a specific column against a reference identifier.

In [None]:
Gene(id=lookup.gene_id.hgnc_id).curate(df_orig, column="hgnc id")
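
Before comparing a column against a reference, a cheap format check can already catch obviously corrupted values. The sketch below is a different, simpler technique than reference-based curation and is not part of Bionty: the `ID_PATTERNS` table and `has_valid_format` helper are hypothetical, and the patterns assume that HGNC IDs look like `HGNC:` followed by digits and that human Ensembl gene IDs look like `ENSG` followed by 11 digits.

```python
import re

# Assumed identifier shapes (hypothetical table for this sketch).
ID_PATTERNS = {
    "hgnc_id": re.compile(r"HGNC:\d+"),
    "ensembl_gene_id": re.compile(r"ENSG\d{11}"),
}


def has_valid_format(value: str, id_type: str) -> bool:
    """Return True if the whole string matches the assumed pattern for id_type."""
    return ID_PATTERNS[id_type].fullmatch(value) is not None


hgnc_ids = ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"]

# One boolean per value: True if it is well-formed, False otherwise.
format_ok = [has_valid_format(v, "hgnc_id") for v in hgnc_ids]
print(format_ok)
```

A well-formed identifier is not necessarily a valid one, so a check like this complements a lookup against the reference rather than replacing it.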

When mapping symbols, the function automatically converts aliases into standardized symbols. In this example, the alias `FANCD1` is converted into `BRCA2`.

In [None]:
Gene(id=lookup.gene_id.symbol).curate(df_orig, column="gene symbol")
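
The alias conversion can be pictured as a dictionary lookup. The following is a minimal sketch and not how Bionty resolves aliases internally: `ALIAS_TO_SYMBOL`, `STANDARD_SYMBOLS`, and `standardize_symbol` are hypothetical names, and the two tiny tables stand in for a full gene nomenclature resource.

```python
# Hypothetical alias table (assumption); a real one would come from a
# gene nomenclature resource.
ALIAS_TO_SYMBOL = {"FANCD1": "BRCA2", "PD-1": "PDCD1"}

# Hypothetical set of standardized symbols used in this sketch.
STANDARD_SYMBOLS = {"A1CF", "A1BG", "BRCA2", "PDCD1"}


def standardize_symbol(symbol):
    """Return the standardized symbol, or None if the input is unknown."""
    if symbol in STANDARD_SYMBOLS:
        return symbol  # already standardized
    return ALIAS_TO_SYMBOL.get(symbol)  # alias lookup; None if not found


symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
standardized = [standardize_symbol(s) for s in symbols]
print(standardized)
```

Returning `None` for unknown inputs keeps corrupted values visible instead of silently passing them through.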