# Curate entity identifiers

To make data queryable by an entity identifier, its identifiers must comply with a chosen standard.

Bionty provides {meth}`~bionty.EntityTable.curate` for this.

In [None]:
from bionty import Gene
import pandas as pd

To illustrate, take a DataFrame that stores several gene identifiers, some of them corrupted.

In [None]:
data = {
    "hgnc_symbol": ["A1CF", "A1BG", "corrupted1", "corrupted2"],
    "hgnc_id": ["HGNC:24086", "HGNC:5", "corrupted1", "corrupted2"],
    "ensembl.gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "corrupted1",
        "corrupted2",
    ],
}
df_orig = pd.DataFrame(data).set_index("hgnc_id")

In [None]:
df_orig
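Before curating, it can help to see which values even have the shape of a valid identifier. The following is a minimal sketch using only the standard library, not part of Bionty. It assumes that human Ensembl gene IDs consist of `ENSG` followed by 11 digits and that HGNC IDs consist of `HGNC:` followed by a number; `PATTERNS` and `looks_valid` are hypothetical names introduced for illustration.

```python
import re

# Assumed identifier shapes (illustrative, not taken from Bionty):
# human Ensembl gene IDs are "ENSG" plus 11 digits, HGNC IDs are "HGNC:" plus a number.
PATTERNS = {
    "ensembl.gene_id": re.compile(r"ENSG\d{11}"),
    "hgnc_id": re.compile(r"HGNC:\d+"),
}


def looks_valid(column, value):
    """Return True if `value` fully matches the assumed pattern for `column`."""
    return PATTERNS[column].fullmatch(value) is not None


# The same Ensembl IDs as in the DataFrame above.
ensembl_ids = ["ENSG00000148584", "ENSG00000121410", "corrupted1", "corrupted2"]
flags = [looks_valid("ensembl.gene_id", v) for v in ensembl_ids]
print(flags)
```

A pattern check like this only catches malformed strings. A well-formed identifier that does not exist in the reference database can only be detected by looking it up against a reference table.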

To curate the DataFrame into queryable form, we create an index that corresponds to a default identifier. For human genes, this is the HGNC gene symbol.

To do this, we need a reference identifier, which must be one of `hgnc_id`, `name`, `entrez.gene_id`, `ensembl.gene_id`, `vega_id`, `ucsc_id`, `pubmed_id`, `horde_id`. For instance:

In [None]:
Gene().curate(df_orig, column="ensembl.gene_id")
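Conceptually, curating by a reference identifier amounts to looking each value up in a reference table and recording whether the lookup succeeded. The sketch below illustrates that idea with plain Python and is not Bionty's implementation: `REFERENCE` and `curate_ids` are hypothetical names, the two-entry reference table is built from the example rows above, and the shape of the result is an assumption made for illustration.

```python
# Hypothetical reference table: Ensembl gene ID -> HGNC symbol.
# The two entries are taken from the example data above.
REFERENCE = {
    "ENSG00000148584": "A1CF",
    "ENSG00000121410": "A1BG",
}


def curate_ids(ids, reference):
    """Map each identifier to its standard symbol.

    Identifiers missing from `reference` are kept unchanged and flagged
    as not curated, so that they can be inspected afterwards.
    """
    rows = []
    for identifier in ids:
        symbol = reference.get(identifier)
        rows.append(
            {
                "index": symbol if symbol is not None else identifier,
                "curated": symbol is not None,
            }
        )
    return rows


rows = curate_ids(
    ["ENSG00000148584", "ENSG00000121410", "corrupted1", "corrupted2"],
    REFERENCE,
)
for row in rows:
    print(row)
```

Keeping the unmapped values alongside a flag, rather than dropping them, makes it straightforward to count and review the corrupted entries before deciding how to handle them.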