# Curate entity identifiers

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.

Bionty provides {meth}`~bionty.EntityTable.curate` for this.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from bionty import Gene
import pandas as pd

To illustrate, take a DataFrame that stores a number of gene identifiers, some of them corrupted.

In [None]:
data = {
    "hgnc_symbol": ["A1CF", "A1BG", "103AS", "corrupted"],
    "hgnc_id": ["HGNC:24086", "HGNC:5", "HGNC:6332", "corrupted"],
    "ensembl.gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000189013",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("hgnc_id")

In [None]:
df_orig

To curate the DataFrame into queryable form, we create an index that corresponds to a default identifier. For human genes, this is the HGNC gene symbol.

To do this, we need a column that holds a reference identifier, which must be one of `hgnc_id`, `name`, `entrez.gene_id`, `ensembl.gene_id`, `vega_id`, `ucsc_id`, `pubmed_id`, or `horde_id`. For instance:

In [None]:
Gene().curate(df_orig, column="ensembl.gene_id")
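
Conceptually, curating by a reference identifier is a lookup against a reference table. The sketch below is not Bionty's implementation. It uses a hypothetical three-entry reference dictionary built from the identifiers in this example and a hypothetical helper, `curate_by_reference`, to show how each row could receive a symbol plus a flag recording whether the lookup succeeded. Keeping unmapped identifiers unchanged is an assumption made for illustration.

```python
# Hypothetical reference table: Ensembl gene ID -> HGNC symbol.
# It holds only the three genes from this example; a real reference is far larger.
ENSEMBL_TO_SYMBOL = {
    "ENSG00000148584": "A1CF",
    "ENSG00000121410": "A1BG",
    "ENSG00000189013": "KIR2DL4",
}


def curate_by_reference(identifiers, reference):
    """Return one (index_value, curated_flag) pair per input identifier."""
    result = []
    for identifier in identifiers:
        if identifier in reference:
            # Known reference identifier: use the mapped symbol.
            result.append((reference[identifier], True))
        else:
            # Assumption: unmapped identifiers are kept as-is and flagged.
            result.append((identifier, False))
    return result


rows = curate_by_reference(
    ["ENSG00000148584", "ENSG00000121410", "ENSG00000189013", "corrupted"],
    ENSEMBL_TO_SYMBOL,
)
for value, curated in rows:
    print(value, curated)
```

The corrupted entry is the only one flagged `False`, which is what makes problematic rows easy to filter afterwards.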

When mapping symbols, the function automatically converts aliases into standardized symbols. In this example, `103AS` is converted into `KIR2DL4`.

In [None]:
Gene().curate(df_orig, column="hgnc_symbol")
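
The alias handling described above can be pictured as a two-step check: accept a symbol that is already standard, otherwise try to resolve it as an alias. The following is a minimal sketch under stated assumptions, not the library's code. The names `STANDARD_SYMBOLS`, `ALIAS_TO_SYMBOL`, and `standardize_symbol` are hypothetical, and the tables contain only the values that appear in this example.

```python
# Hypothetical lookup tables, limited to the genes used in this example.
STANDARD_SYMBOLS = {"A1CF", "A1BG", "KIR2DL4"}
ALIAS_TO_SYMBOL = {"103AS": "KIR2DL4"}


def standardize_symbol(symbol):
    """Return (standardized_symbol, curated_flag) for a single input symbol."""
    if symbol in STANDARD_SYMBOLS:
        # Already a standard symbol: nothing to change.
        return symbol, True
    if symbol in ALIAS_TO_SYMBOL:
        # Known alias: replace it with the standard symbol.
        return ALIAS_TO_SYMBOL[symbol], True
    # Neither a standard symbol nor a known alias: keep it and flag it.
    return symbol, False


standardized = [
    standardize_symbol(s) for s in ["A1CF", "A1BG", "103AS", "corrupted"]
]
```

Checking standard symbols before aliases keeps a valid symbol from being rewritten if it also happens to be listed as an alias of another gene.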

In [None]:
data = {
    "hgnc_symbol": ["A1CF", "A1BG", "corrupted1", "corrupted2"],
    "hgnc_id": ["HGNC:24086", "HGNC:5", "corrupted1", "corrupted2"],
    "ensembl.gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "corrupted1",
        "corrupted2",
    ],
}
df = pd.DataFrame(data).set_index("hgnc_id")


def test_curate():
    df_curated = Gene().curate(df.set_index("hgnc_symbol"))
    assert df_curated.index.to_list() == ["A1CF", "A1BG", "corrupted1", "corrupted2"]
    assert df_curated.__curated__.to_list() == [True, True, False, False]

In [None]:
test_curate()

In [None]:
Gene().curate(df.set_index("hgnc_symbol"))

In [None]:
pd.Series(df.index, index=df.index)