# Ingest the Ensembl gene table as the gene table

The HGNC table contains only 20k+ validated human entries. Bioinformatic analyses often rely on Ensembl IDs for transcripts, so we are switching to the Ensembl table in `bionty.Gene().df`.

In [1]:
import pandas as pd
from lnschema_bionty import id

## Ensembl download

These tables were downloaded from the [BioMart](https://www.ensembl.org/biomart/martview/) database (`Ensembl Genes 107`) and contain the following ID columns for every species:
- `Gene stable ID`
- `Transcript stable ID`
- `Protein stable ID`
- `Gene name`
- `Gene Synonym`
- `Gene type`
- `NCBI gene (formerly Entrezgene) ID`

Additional species-specific columns are also present for:
- human: `HGNC ID`, `MIM gene accession`
- mouse: `MGI ID`

In [2]:
dfs = {
    "human": "~/Downloads/mart_export-human.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.external_synonym|hsapiens_gene_ensembl.default.feature_page.gene_biotype|hsapiens_gene_ensembl.default.feature_page.entrezgene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_id|hsapiens_gene_ensembl.default.feature_page.mim_gene_accession&FILTERS=&VISIBLEPANEL=resultspanel
    "mouse": "~/Downloads/mart_export-mouse.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_peptide_id|mmusculus_gene_ensembl.default.feature_page.external_gene_name|mmusculus_gene_ensembl.default.feature_page.external_synonym|mmusculus_gene_ensembl.default.feature_page.gene_biotype|mmusculus_gene_ensembl.default.feature_page.entrezgene_id|mmusculus_gene_ensembl.default.feature_page.mgi_id&FILTERS=&VISIBLEPANEL=resultspanel
}

## Curate the tables

In [3]:
for species, path in dfs.items():
    print(f"----------{species}----------")
    df = pd.read_csv(path, dtype=str)
    print(f"Initial shape: {df.shape}")

    # Aggregate the `Gene Synonym` column
    df_alias = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    df_alias = df_alias.groupby("Gene name").agg("|".join)
    del df["Gene Synonym"]
    df = df.drop_duplicates()
    df = pd.merge(df, df_alias, on="Gene name", how="left")

    # Assign a freshly generated id to each row
    df.index = [id.gene() for _ in range(len(df))]

    display(df.head())
    print(f"Final shape: {df.shape}")

    # save to a parquet file
    df.to_parquet(f"ensembl-ids-{species}.parquet")
    print(f"Saved as ensembl-ids-{species}.parquet.")

----------human----------
Initial shape: (620902, 9)


|  | Gene stable ID | Transcript stable ID | Protein stable ID | Gene name | Gene type | NCBI gene (formerly Entrezgene) ID | HGNC ID | MIM gene accession | Gene Synonym |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o4eTd6 | ENSG00000210049 | ENST00000387314 |  | MT-TF | Mt_tRNA |  | HGNC:7481 |  | MTTF\|trnF |
| J6g60l | ENSG00000211459 | ENST00000389680 |  | MT-RNR1 | Mt_rRNA |  | HGNC:7470 |  | 12S\|MOTS-c\|MTRNR1 |
| dfmRFQ | ENSG00000210077 | ENST00000387342 |  | MT-TV | Mt_tRNA |  | HGNC:7500 |  | MTTV\|trnV |
| kepv0X | ENSG00000210082 | ENST00000387347 |  | MT-RNR2 | Mt_rRNA |  | HGNC:7471 |  | 16S\|HN\|MTRNR2 |
| rXIl5w | ENSG00000209082 | ENST00000386347 |  | MT-TL1 | Mt_tRNA |  | HGNC:7490 |  | MTTL1\|TRNL1 |


Final shape: (276652, 9)
Saved as ensembl-ids-human.parquet.
----------mouse----------
Initial shape: (296054, 8)


|  | Gene stable ID | Transcript stable ID | Protein stable ID | Gene name | Gene type | NCBI gene (formerly Entrezgene) ID | MGI ID | Gene Synonym |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| docczI | ENSMUSG00000064336 | ENSMUST00000082387 |  | mt-Tf | Mt_tRNA |  | MGI:102487 | tRNA\|tRNA-Phe\|TrnF tRNA |
| qtJtiK | ENSMUSG00000064337 | ENSMUST00000082388 |  | mt-Rnr1 | Mt_rRNA |  | MGI:102493 | 12S ribosomal RNA\|12S rRNA\|12SrRNA\|Rnr1 s-rRNA |
| r8waPO | ENSMUSG00000064338 | ENSMUST00000082389 |  | mt-Tv | Mt_tRNA |  | MGI:102472 | tRNA\|tRNA-Val\|TrnaV tRNA |
| gn8os3 | ENSMUSG00000064339 | ENSMUST00000082390 |  | mt-Rnr2 | Mt_rRNA |  | MGI:102492 | 16S ribosomal RNA\|16S rRNA\|16SrRNA\|Rnr2 16S ri... |
| FIDClx | ENSMUSG00000064340 | ENSMUST00000082391 |  | mt-Tl1 | Mt_tRNA |  | MGI:102482 | tRNA\|tRNA Leu\|tRNA Leu_1\|TrnrL1 tRNA |


Final shape: (150702, 8)
Saved as ensembl-ids-mouse.parquet.
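
The drop in row count between the initial and final shapes comes from the synonym aggregation step. The sketch below replays that step on a small made-up table so the effect is easy to see; the gene, transcript, and synonym values are invented for illustration and do not come from the Ensembl export.

```python
import pandas as pd

# Made-up rows: the export repeats each transcript once per synonym.
toy = pd.DataFrame(
    {
        "Gene stable ID": ["G1", "G1", "G1", "G1", "G2"],
        "Transcript stable ID": ["T1", "T1", "T2", "T2", "T3"],
        "Gene name": ["A", "A", "A", "A", "B"],
        "Gene Synonym": ["a1", "a2", "a1", "a2", None],
    }
)

# Collect the unique (name, synonym) pairs and join the synonyms per gene name.
alias = toy[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
alias = alias.groupby("Gene name").agg("|".join)

# Drop the per-row synonym, deduplicate, then attach the joined synonyms.
out = toy.drop(columns="Gene Synonym").drop_duplicates()
out = pd.merge(out, alias, on="Gene name", how="left")

print(out)
```

Five input rows collapse to three, one per transcript: both transcripts of gene `A` carry the joined synonym string `a1|a2`, and gene `B`, which has no synonym, keeps a missing value because of the left merge.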

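
The ids shown in the outputs above are six alphanumeric characters long. The notebook does not say how `id.gene()` generates them; if they are drawn uniformly at random from a 62-character alphabet (both are assumptions here), a collision somewhere in a table of this size is plausible, so checking the index for duplicates before saving is cheap insurance. The sketch below uses two hypothetical helpers, `collision_probability` and `duplicated_ids`, to illustrate the estimate and the check.

```python
import math
from collections import Counter


def collision_probability(n_ids, alphabet_size=62, length=6):
    """Birthday-bound estimate of at least one collision among n_ids random ids.

    Assumes ids are uniform random strings; this is not confirmed by the source.
    """
    space = alphabet_size**length
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))


def duplicated_ids(ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i, c in counts.items() if c > 1]


# 276652 is the row count of the final human table reported above.
print(f"{collision_probability(276652):.2f}")

# Toy check on a list with one repeated id.
print(duplicated_ids(["o4eTd6", "J6g60l", "o4eTd6"]))  # ['o4eTd6']
```

On the real table the equivalent check would be `df.index.is_unique`, run just before `df.to_parquet(...)`.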