# Ensembl gene -> `bionty.Gene().df`

In [1]:
!lndb load bionty-assets

migrate-unnecessary


In [2]:
!lndb login sunnyosun

In [3]:
import lamindb as ln
import pandas as pd
from lnschema_bionty import id

ln.nb.header()

2022-10-26 11:38:03,833:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-26 11:38:03,834:INFO - NumExpr defaulting to 8 threads.


0,1
author,Sunny Sun (sunnyosun)
id,z2WNjjvFuzwf
version,1
time_init,2022-09-27 11:04
time_run,2022-10-26 09:40
consecutive_cells,True
pypackage,lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0


## Ensembl download

Each curated table has a `version` column with the value `Ens107`.

The tables were downloaded from the [BioMart](https://www.ensembl.org/biomart/martview/) database (`Ensembl Genes 107`) and contain the following columns for every species:
- `Gene stable ID`
- `Transcript stable ID`
- `Protein stable ID`
- `Gene name`
- `Gene Synonym`
- `Gene type`
- `Gene description` (new column added in v2)
- `NCBI gene (formerly Entrezgene) ID`

Additional species-specific columns are also present for:
- human: `HGNC ID`, `MIM gene accession`
- mouse: `MGI ID`
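
As a minimal sketch, the column lists above could be checked against a downloaded table before curation. The helper name `missing_columns` and the toy DataFrame are hypothetical and not part of the original notebook; the column names are taken from the lists above.

```python
import pandas as pd

# Columns listed above that are shared by every species.
COMMON_COLUMNS = [
    "Gene stable ID",
    "Transcript stable ID",
    "Protein stable ID",
    "Gene name",
    "Gene Synonym",
    "Gene type",
    "Gene description",
    "NCBI gene (formerly Entrezgene) ID",
]
# Species-specific columns listed above.
SPECIES_COLUMNS = {
    "human": ["HGNC ID", "MIM gene accession"],
    "mouse": ["MGI ID"],
}


def missing_columns(df: pd.DataFrame, species: str) -> list:
    """Return the expected columns that are absent from `df` (hypothetical helper)."""
    expected = COMMON_COLUMNS + SPECIES_COLUMNS[species]
    return [col for col in expected if col not in df.columns]


# Toy stand-in for a mouse export that lacks the `MGI ID` column.
toy = pd.DataFrame(columns=COMMON_COLUMNS)
print(missing_columns(toy, "mouse"))  # prints ['MGI ID']
```

The expected column counts (10 for human, 9 for mouse) agree with the initial shapes reported by the curation cell.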

In [4]:
# Downloaded on 2022-09-27

dfs = {
    "human": "https://bionty-assets.s3.amazonaws.com/mart_export-human.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.external_synonym|hsapiens_gene_ensembl.default.feature_page.gene_biotype|hsapiens_gene_ensembl.default.feature_page.entrezgene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_id|hsapiens_gene_ensembl.default.feature_page.mim_gene_accession&FILTERS=&VISIBLEPANEL=resultspanel
    "mouse": "https://bionty-assets.s3.amazonaws.com/mart_export-mouse.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_peptide_id|mmusculus_gene_ensembl.default.feature_page.external_gene_name|mmusculus_gene_ensembl.default.feature_page.external_synonym|mmusculus_gene_ensembl.default.feature_page.gene_biotype|mmusculus_gene_ensembl.default.feature_page.entrezgene_id|mmusculus_gene_ensembl.default.feature_page.mgi_id&FILTERS=&VISIBLEPANEL=resultspanel
}

## Curate the tables

In [5]:
allids = []

for species, path in dfs.items():
    print(f"----------{species}----------")
    df = pd.read_csv(path, dtype=str)
    print(f"Initial shape: {df.shape}")

    # Aggregate the `Gene Synonym` column
    df_alias = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    df_alias = df_alias.groupby("Gene name").agg("|".join)
    del df["Gene Synonym"]
    df = df.drop_duplicates()
    df = pd.merge(df, df_alias, on="Gene name", how="left")

    # add the version column
    df["version"] = "Ens107"

    display(df.head())
    print(f"All ids shape: {df.shape}")

    # save all ids to a parquet file
    df.to_parquet(f"ensembl-ids-{species}.parquet")
    print(f"Saved as ensembl-ids-{species}.parquet.")

    # subset to genes only
    df = df.loc[:, ~df.columns.isin(["Transcript stable ID", "Protein stable ID"])]
    df = df.drop_duplicates()

    # assign a newly generated id to each entry and use it as the index
    ids = [id.gene() for _ in range(len(df))]
    df.index = ids
    df.index.name = "id"

    display(df.head())
    print(f"Final shape: {df.shape}")

    # save the gene-level table to a parquet file
    df.to_parquet(f"gene-{species}.parquet")
    print(f"Saved as gene-{species}.parquet.")

    # all ids across species
    allids += ids

# make sure ids are unique
assert len(set(allids)) == len(allids)

----------human----------
Initial shape: (620902, 10)


Unnamed: 0,Gene stable ID,Transcript stable ID,Protein stable ID,Gene name,Gene type,Gene description,NCBI gene (formerly Entrezgene) ID,HGNC ID,MIM gene accession,Gene Synonym,version
0,ENSG00000210049,ENST00000387314,,MT-TF,Mt_tRNA,mitochondrially encoded tRNA-Phe (UUU/C) [Sour...,,HGNC:7481,,MTTF|trnF,Ens107
1,ENSG00000211459,ENST00000389680,,MT-RNR1,Mt_rRNA,mitochondrially encoded 12S rRNA [Source:HGNC ...,,HGNC:7470,,12S|MOTS-c|MTRNR1,Ens107
2,ENSG00000210077,ENST00000387342,,MT-TV,Mt_tRNA,mitochondrially encoded tRNA-Val (GUN) [Source...,,HGNC:7500,,MTTV|trnV,Ens107
3,ENSG00000210082,ENST00000387347,,MT-RNR2,Mt_rRNA,mitochondrially encoded 16S rRNA [Source:HGNC ...,,HGNC:7471,,16S|HN|MTRNR2,Ens107
4,ENSG00000209082,ENST00000386347,,MT-TL1,Mt_tRNA,mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...,,HGNC:7490,,MTTL1|TRNL1,Ens107


All ids shape: (276652, 11)
Saved as ensembl-ids-human.parquet.


id,Gene stable ID,Gene name,Gene type,Gene description,NCBI gene (formerly Entrezgene) ID,HGNC ID,MIM gene accession,Gene Synonym,version
Lzl9xt,ENSG00000210049,MT-TF,Mt_tRNA,mitochondrially encoded tRNA-Phe (UUU/C) [Sour...,,HGNC:7481,,MTTF|trnF,Ens107
ILAWa7,ENSG00000211459,MT-RNR1,Mt_rRNA,mitochondrially encoded 12S rRNA [Source:HGNC ...,,HGNC:7470,,12S|MOTS-c|MTRNR1,Ens107
XkyeQz,ENSG00000210077,MT-TV,Mt_tRNA,mitochondrially encoded tRNA-Val (GUN) [Source...,,HGNC:7500,,MTTV|trnV,Ens107
jDD2jW,ENSG00000210082,MT-RNR2,Mt_rRNA,mitochondrially encoded 16S rRNA [Source:HGNC ...,,HGNC:7471,,16S|HN|MTRNR2,Ens107
J58H9b,ENSG00000209082,MT-TL1,Mt_tRNA,mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...,,HGNC:7490,,MTTL1|TRNL1,Ens107


Final shape: (68856, 9)
Saved as gene-human.parquet.
----------mouse----------
Initial shape: (296054, 9)


Unnamed: 0,Gene stable ID,Transcript stable ID,Protein stable ID,Gene name,Gene type,Gene description,NCBI gene (formerly Entrezgene) ID,MGI ID,Gene Synonym,version
0,ENSMUSG00000064336,ENSMUST00000082387,,mt-Tf,Mt_tRNA,mitochondrially encoded tRNA phenylalanine [So...,,MGI:102487,tRNA|tRNA-Phe|TrnF tRNA,Ens107
1,ENSMUSG00000064337,ENSMUST00000082388,,mt-Rnr1,Mt_rRNA,mitochondrially encoded 12S rRNA [Source:MGI S...,,MGI:102493,12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA,Ens107
2,ENSMUSG00000064338,ENSMUST00000082389,,mt-Tv,Mt_tRNA,mitochondrially encoded tRNA valine [Source:MG...,,MGI:102472,tRNA|tRNA-Val|TrnaV tRNA,Ens107
3,ENSMUSG00000064339,ENSMUST00000082390,,mt-Rnr2,Mt_rRNA,mitochondrially encoded 16S rRNA [Source:MGI S...,,MGI:102492,16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri...,Ens107
4,ENSMUSG00000064340,ENSMUST00000082391,,mt-Tl1,Mt_tRNA,mitochondrially encoded tRNA leucine 1 [Source...,,MGI:102482,tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA,Ens107


All ids shape: (150702, 10)
Saved as ensembl-ids-mouse.parquet.


id,Gene stable ID,Gene name,Gene type,Gene description,NCBI gene (formerly Entrezgene) ID,MGI ID,Gene Synonym,version
Epd98t,ENSMUSG00000064336,mt-Tf,Mt_tRNA,mitochondrially encoded tRNA phenylalanine [So...,,MGI:102487,tRNA|tRNA-Phe|TrnF tRNA,Ens107
RiOxA6,ENSMUSG00000064337,mt-Rnr1,Mt_rRNA,mitochondrially encoded 12S rRNA [Source:MGI S...,,MGI:102493,12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA,Ens107
cMIElg,ENSMUSG00000064338,mt-Tv,Mt_tRNA,mitochondrially encoded tRNA valine [Source:MG...,,MGI:102472,tRNA|tRNA-Val|TrnaV tRNA,Ens107
DbiNNA,ENSMUSG00000064339,mt-Rnr2,Mt_rRNA,mitochondrially encoded 16S rRNA [Source:MGI S...,,MGI:102492,16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri...,Ens107
NO6NBF,ENSMUSG00000064340,mt-Tl1,Mt_tRNA,mitochondrially encoded tRNA leucine 1 [Source...,,MGI:102482,tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA,Ens107


Final shape: (57110, 8)
Saved as gene-mouse.parquet.
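
To make the drop in row count between the initial and aggregated tables easier to follow, the synonym aggregation from the loop above can be reproduced on a toy table. This is only an illustrative sketch: the function name `aggregate_synonyms` and all values in the toy table are made up, and only the pandas calls mirror the curation cell.

```python
import pandas as pd

# Made-up toy export: one row per (gene, synonym) pair, as in the BioMart table.
toy = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG_A", "ENSG_A", "ENSG_B"],
        "Gene name": ["A", "A", "B"],
        "Gene Synonym": ["a1", "a2", None],
    }
)


def aggregate_synonyms(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the synonyms of each gene name into one `|`-separated string."""
    # One row per gene name, with all non-missing synonyms joined by "|".
    df_alias = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    df_alias = df_alias.groupby("Gene name").agg("|".join)
    # Without the per-row synonym column, rows that differed only by synonym collapse.
    df = df.drop(columns="Gene Synonym").drop_duplicates()
    # A left merge keeps genes that have no synonym (they get a missing value).
    return pd.merge(df, df_alias, on="Gene name", how="left")


result = aggregate_synonyms(toy)
print(result)
```

In the toy table, the two rows of gene `A` collapse into one row with the synonym string `a1|a2`, while gene `B` is kept with a missing synonym.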


## Push to bionty-assets.lndb

In [6]:
ingest = ln.db.Ingest()

In [7]:
ingest.add("ensembl-ids-human.parquet")
ingest.add("ensembl-ids-mouse.parquet")

ingest.add("gene-human.parquet")
ingest.add("gene-mouse.parquet");

In [8]:
ingest.commit()

✅ Cell numbers increase consecutively: Awesome!


2022-10-26 11:38:51,646:INFO - Found credentials in shared credentials file: ~/.aws/credentials


Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-mouse.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-mouse.parquet: 1.00
ℹ️ Added notebook 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
|   | dobject                                           | jupynb                                                 | user                 |
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
| 0 | ensembl-ids-human.parquet (eS3P7zGVRniwrYQoAlIO4) |

Now on S3:
- human genes: https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet
- mouse genes: https://bionty-assets.s3.amazonaws.com/xaBDkhBYLXWHq6gJYnedD.parquet
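
A downstream user might want to sanity-check one of these gene tables after loading it. The sketch below is an assumption-laden example rather than part of the original workflow: the helper `gene_table_problems` is hypothetical, the toy table reuses two rows from the human output above, and the commented-out `pd.read_parquet` call assumes the file is still reachable and a parquet engine is installed.

```python
import pandas as pd


def gene_table_problems(df: pd.DataFrame, version: str = "Ens107") -> list:
    """Return a list of human-readable problems found in a curated gene table."""
    problems = []
    if df.index.name != "id":
        problems.append("index is not named 'id'")
    if not df.index.is_unique:
        problems.append("ids are not unique")
    if "version" not in df.columns or not (df["version"] == version).all():
        problems.append(f"version column is not uniformly {version}")
    # Transcript and protein ids are dropped when subsetting to genes.
    for col in ("Transcript stable ID", "Protein stable ID"):
        if col in df.columns:
            problems.append(f"unexpected column: {col}")
    return problems


# Assumption: reading directly from S3 would look like this.
# df = pd.read_parquet("https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet")

# Toy table built from two rows of the human output shown above.
toy = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG00000210049", "ENSG00000211459"],
        "Gene name": ["MT-TF", "MT-RNR1"],
        "version": ["Ens107", "Ens107"],
    },
    index=pd.Index(["Lzl9xt", "ILAWa7"], name="id"),
)
print(gene_table_problems(toy))  # prints []
```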