## Map Ensembl ID to Gene Name

Written By: Qingyang Xu

Last Modified: 04/18/2021

MMRF genomic data identifies each gene by its `Ensembl ID`, whereas DepMap uses gene names. This script uses the `mygene` module to map each `Ensembl ID` in MMRF to its corresponding gene name, and falls back to `pybiomart` for IDs that `mygene` does not map to a gene in CCLE.

References

- Python `mygene` module documentation

https://pypi.org/project/mygene/

- Python `pybiomart` module documentation

https://pypi.org/project/pybiomart/

- Download patient genomic data (e.g. `MMRF_CoMMpass_IA15a_CNA_Exome_PerGene_LargestSegment.txt`)

https://research.themmrf.org/

- Download DepMap cell line data (e.g. `CCLE_expression.csv`)

https://depmap.org/portal/download/

In [3]:
import os
import glob
import numpy as np
import pandas as pd
import mygene
from pybiomart import Server

In [4]:
# install the following modules
#!pip install mygene
#!pip install pybiomart

In [5]:
# MMRF patient genomic data
fn = './data/MMRF_CoMMpass_IA15a_E74GTF_Salmon_V7.2_Filtered_Gene_TPM.txt'
gene = pd.read_csv(fn, delimiter='\t')

In [12]:
print('Number of genes in MMRF: %d'%gene.shape[0])

Number of genes in MMRF: 56430


In [6]:
gene.head()

Unnamed: 0,GENE_ID,MMRF_2602_1_BM,MMRF_1677_1_BM,MMRF_2699_1_BM,MMRF_2401_2_BM,MMRF_2539_1_BM,MMRF_2465_1_BM,MMRF_1338_1_BM,MMRF_1360_1_BM,MMRF_1645_1_BM,...,MMRF_1889_3_BM,MMRF_1831_1_BM,MMRF_2851_1_BM,MMRF_2291_1_BM,MMRF_1461_1_BM,MMRF_1447_1_BM,MMRF_1824_1_BM,MMRF_2694_1_BM,MMRF_1386_1_BM,MMRF_1432_1_BM
0,ENSG00000000003,9.25769,0.789786,0.350717,0.997529,4.36627,0.267504,9.83924,2.99393,1.36739,...,0.175736,2.27245,0.279247,3.66427,0.236085,4.22505,11.3975,11.171,1.40462,0.101453
1,ENSG00000000005,0.01899,0.0,0.0,0.0,0.0,0.220804,0.0,0.0,0.066857,...,0.0,0.0,0.0,0.103328,0.0,0.0,0.237989,0.0,0.0,0.0
2,ENSG00000000419,58.0081,54.1275,42.5283,67.6337,42.6435,53.5764,76.5282,63.8543,66.5916,...,101.759,47.8512,88.4244,43.8287,63.7617,33.6481,27.02,39.5575,92.7905,67.901
3,ENSG00000000457,8.53577,8.77477,3.05102,3.18055,2.71304,5.68574,10.0199,5.10278,11.6436,...,17.1512,7.94448,5.31074,4.58591,8.58568,8.71818,6.89403,6.32898,12.9549,13.3323
4,ENSG00000000460,4.30059,3.83481,1.86926,4.64755,1.54243,2.67956,5.23438,1.75356,6.42892,...,19.4365,6.73468,1.05347,1.82054,3.7375,5.60105,3.77852,3.89309,5.3295,14.4449


In [7]:
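# DepMap cell line data; gene columns are labeled 'SYMBOL (Entrez ID)', e.g. 'TSPAN6 (7105)'
rnaseq = pd.read_csv('./data/CCLE_expression.csv')
rnaseq_gene_names = rnaseq.columns[1:]  # full labels; column 0 holds the cell line ID
rnaseq_genes = [col.split(' ')[0] for col in rnaseq_gene_names]  # gene symbols only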

In [10]:
print('Number of genes in CCLE: %d'%len(rnaseq_genes))

Number of genes in CCLE: 19177
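
The gene symbols above are obtained by splitting each CCLE column label on the first space. As a minimal sketch of a stricter alternative, the snippet below parses a label into its symbol and numeric ID with a regular expression and returns `None` for anything that does not fit the pattern. The names `LABEL_RE` and `split_label` are hypothetical, and the pattern assumes every gene column looks like `SYMBOL (digits)`, as in the header shown further down.

```python
import re

# Matches labels of the form "SYMBOL (NUMERIC_ID)", e.g. "TSPAN6 (7105)".
LABEL_RE = re.compile(r"^(?P<symbol>\S+) \((?P<num_id>\d+)\)$")

def split_label(label):
    """Return (symbol, numeric_id) for a matching label, or None otherwise."""
    m = LABEL_RE.match(label)
    if m is None:
        return None
    return m.group("symbol"), int(m.group("num_id"))

# Two labels taken from the CCLE header, plus a non-gene column name.
labels = ["TSPAN6 (7105)", "ABCF2-H2BE1 (114483834)", "Unnamed: 0"]
parsed = [split_label(x) for x in labels]
print(parsed)
```

Returning `None` instead of raising makes it easy to spot columns that are not gene columns before building the symbol list.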


In [8]:
rnaseq.head(50)

Unnamed: 0.1,Unnamed: 0,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147),C1orf112 (55732),FGR (2268),CFH (3075),FUCA2 (2519),GCLC (2729),...,ARHGAP11B (89839),AC004593.2 (1124),AC090517.4 (54816),AL160269.1 (11046),ABCF2-H2BE1 (114483834),POLR2J3 (548644),H2BE1 (114483833),AL445238.1 (647264),GET1-SH3BGR (106865373),AC113348.1 (102724657)
0,ACH-001113,4.990501,0.0,7.273702,2.765535,4.480265,0.028569,1.269033,3.058316,6.483171,...,1.214125,0.0,0.111031,0.15056,1.427606,5.781884,0.0,0.0,0.799087,0.0
1,ACH-001289,5.209843,0.545968,7.070604,2.538538,3.510962,0.0,0.176323,3.836934,4.20085,...,1.835924,0.0,0.31034,0.0,0.807355,4.704319,0.0,0.0,0.464668,0.070389
2,ACH-001339,3.77926,0.0,7.346425,2.339137,4.254745,0.056584,1.339137,6.724241,3.671293,...,1.823749,0.084064,0.176323,0.042644,1.38405,4.931683,0.0,0.028569,0.263034,0.0
3,ACH-001538,5.726831,0.0,7.086189,2.543496,3.102658,0.0,5.914565,6.099716,4.475733,...,0.871844,0.137504,0.263034,2.485427,0.713696,3.858976,0.0,0.0,0.0,0.0
4,ACH-000242,7.465648,0.0,6.435462,2.414136,3.864929,0.831877,7.198003,5.45253,7.112492,...,2.324811,0.163499,0.163499,0.0,1.117695,4.990501,0.0,0.0,0.0,0.0
5,ACH-000708,4.914086,0.176323,6.946848,2.577731,3.853996,0.0,0.084064,4.855491,4.934045,...,2.31034,0.124328,0.056584,0.084064,2.498251,5.303781,0.0,0.0,0.263034,0.0
6,ACH-000327,4.032982,0.0,5.806582,1.948601,2.684819,0.014355,3.117695,5.977509,3.65306,...,0.799087,0.669027,0.070389,0.0,1.090853,4.996841,0.0,0.042644,0.286881,0.028569
7,ACH-000233,0.097611,0.0,5.919102,3.983678,3.733354,0.028569,6.11124,2.963474,3.415488,...,1.883621,0.0,0.056584,0.014355,3.356144,6.83996,0.0,0.0,2.280956,0.0
8,ACH-000461,4.712596,0.0,6.406333,2.247928,3.032101,0.028569,0.097611,5.528571,6.383704,...,1.459432,0.189034,0.042644,0.124328,3.367371,5.529196,0.0,0.0,0.275007,0.0
9,ACH-000705,5.101398,0.0,6.309976,2.361768,4.280214,0.028569,0.201634,2.543496,6.126601,...,1.570463,0.0,0.097611,0.176323,1.981853,5.860963,0.594549,0.0,0.790772,0.0


In [9]:
# query gene names
ens = list(gene['GENE_ID'])
mg = mygene.MyGeneInfo()
gene_syms = mg.querymany(ens, scopes='ensembl.gene', fields='symbol', species='human')

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-29000...done.
querying 29001-30000...done.
querying 30001-31000...done.
querying 31001-32000...done.
querying 32001-33000...done.
querying 33001-34000...done.
querying 34001-35000...done.
queryin

In [14]:
# inspect the first record returned by mygene
print(gene_syms[0])

{'query': 'ENSG00000000003', '_id': '7105', '_score': 23.181335, 'symbol': 'TSPAN6'}
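
Each record is a dictionary keyed by `query`, and a single query may come back with no hit, a hit without a `symbol`, or more than one hit. The sketch below shows one possible way to collapse such records into a single Ensembl ID to symbol dictionary by keeping the highest-scoring hit per query. It runs on toy records, and `best_symbol_per_query` and the `ENSG_FAKE_*` IDs are made up for illustration; only the first record is copied from the output above.

```python
def best_symbol_per_query(records):
    """Map each query ID to the symbol of its highest-scoring hit.

    Records flagged 'notfound' or lacking a 'symbol' field are skipped.
    """
    best = {}  # query -> (score, symbol)
    for rec in records:
        if rec.get("notfound") or "symbol" not in rec:
            continue
        query = rec["query"]
        score = rec.get("_score", 0.0)
        if query not in best or score > best[query][0]:
            best[query] = (score, rec["symbol"])
    return {query: sym for query, (score, sym) in best.items()}

# Toy records: the first is real, the rest are hypothetical.
records = [
    {"query": "ENSG00000000003", "_id": "7105", "_score": 23.181335, "symbol": "TSPAN6"},
    {"query": "ENSG_FAKE_1", "notfound": True},
    {"query": "ENSG_FAKE_2", "_id": "1", "_score": 5.0, "symbol": "LOW"},
    {"query": "ENSG_FAKE_2", "_id": "2", "_score": 9.0, "symbol": "HIGH"},
]
mapping = best_symbol_per_query(records)
print(mapping)
```

Whether the highest score is the right tie-breaker depends on the data; it is only one reasonable choice.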


In [61]:
# match each queried Ensembl ID to a CCLE gene via its symbol;
# IDs with no hit, no symbol, or a symbol absent from CCLE go to genes_not_found
genes_not_found = []

ensembl_found = []
ccle_found = [] # gene symbol only
ccle_name_found = [] # full CCLE column label, 'SYMBOL (Entrez ID)'

for g in gene_syms:

    ensembl = g['query']
    sym = g.get('symbol')  # None when there is no hit or the hit has no symbol

    if 'notfound' in g or sym not in rnaseq_genes:
        genes_not_found.append(ensembl)
        continue

    ind = rnaseq_genes.index(sym)
    ensembl_found.append(ensembl)
    ccle_found.append(sym)
    ccle_name_found.append(rnaseq_gene_names[ind])
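
The loop above scans the CCLE symbol list once per gene, both for the membership test and for `list.index`. With tens of thousands of queries this can be slow, so a dictionary from symbol to full column label is a common alternative. The following is a minimal sketch on toy data; `column_labels`, `label_by_symbol` and `lookup` are hypothetical names standing in for the notebook's variables.

```python
# Toy stand-in for the CCLE gene column labels.
column_labels = ["TSPAN6 (7105)", "TNMD (64102)", "DPM1 (8813)"]

# symbol -> full column label. setdefault keeps the first label if a symbol
# repeats, which matches what list.index would return.
label_by_symbol = {}
for label in column_labels:
    label_by_symbol.setdefault(label.split(" ")[0], label)

def lookup(symbol):
    """Return the full column label for a symbol, or None if it is absent."""
    return label_by_symbol.get(symbol)

print(lookup("DPM1"))
print(lookup("NOT_A_GENE"))
```

Each lookup is then a constant-time dictionary access instead of a scan over the whole list.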

In [71]:
print(len(genes_not_found))

37761
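
The single count above mixes several cases: IDs with no `mygene` hit, hits that carry no symbol, and symbols that are simply absent from CCLE. A rough way to separate them is to label each record and tally the labels, as sketched below on toy records. The `classify` helper, the label strings and the `ENSG_FAKE_*` entries are all hypothetical.

```python
from collections import Counter

def classify(rec, ccle_symbols):
    """Label a mygene-style record by why it was or was not mapped."""
    if "notfound" in rec:
        return "no_hit"
    if "symbol" not in rec:
        return "no_symbol"
    if rec["symbol"] not in ccle_symbols:
        return "not_in_ccle"
    return "mapped"

ccle_symbols = {"TSPAN6", "TNMD"}  # toy subset of CCLE gene symbols
records = [
    {"query": "ENSG00000000003", "_id": "7105", "_score": 23.181335, "symbol": "TSPAN6"},
    {"query": "ENSG_FAKE_A", "notfound": True},
    {"query": "ENSG_FAKE_B", "_id": "x", "_score": 1.0},
    {"query": "ENSG_FAKE_C", "_id": "y", "_score": 1.0, "symbol": "FAKE1"},
]
counts = Counter(classify(r, ccle_symbols) for r in records)
print(counts)
```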


In [65]:
df = pd.DataFrame({'Ensembl_ID':ensembl_found, 'HGNC':ccle_found, 'HGNC_ID':ccle_name_found})

In [70]:
df['HGNC_ID'].drop_duplicates()

0              TSPAN6 (7105)
1               TNMD (64102)
2                DPM1 (8813)
3              SCYL3 (57147)
4           C1orf112 (55732)
                ...         
18666          GRIN2B (2904)
18667           SNURF (8926)
18668          H3-2 (440686)
18669    TMEM271 (112441426)
18670        ZBTB8B (728116)
Name: HGNC_ID, Length: 18658, dtype: object

In [66]:
df.to_csv('./Ensembl_HGNC_map_042421.csv',index=False)

In [68]:
print(len(ccle_found))

18671


In [45]:
# fall back to BioMart (Ensembl archive from Feb 2014) for the IDs left in genes_not_found
server = Server(host='http://feb2014.archive.ensembl.org/')

dataset = (server.marts['ENSEMBL_MART_ENSEMBL']
                 .datasets['hsapiens_gene_ensembl'])

biomart = dataset.query(attributes=['ensembl_gene_id','hgnc_symbol'])

In [72]:
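# Ensembl ID -> HGNC symbol from BioMart (first occurrence wins, as list.index would)
biomart_map = {}
for ens_id, hgnc in zip(biomart['Ensembl Gene ID'], biomart['HGNC symbol']):
    biomart_map.setdefault(ens_id, hgnc)

genes_not_found_biomart = []  # IDs still unmapped after the BioMart lookup

for g in genes_not_found:
    if 'ENSG' not in g: continue

    sym = biomart_map.get(g)  # None if the ID is absent from this BioMart release

    if sym in rnaseq_genes:
        ind = rnaseq_genes.index(sym)
        ensembl_found.append(g)
        ccle_found.append(sym)
        ccle_name_found.append(rnaseq_gene_names[ind])  # full CCLE column label
    else:
        genes_not_found_biomart.append(g)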

In [73]:
df = pd.DataFrame({'Ensembl_ID':ensembl_found, 'HGNC':ccle_found, 'HGNC_ID':ccle_name_found})
df.head()

Unnamed: 0,Ensembl_ID,HGNC,HGNC_ID
0,ENSG00000000003,TSPAN6,TSPAN6 (7105)
1,ENSG00000000005,TNMD,TNMD (64102)
2,ENSG00000000419,DPM1,DPM1 (8813)
3,ENSG00000000457,SCYL3,SCYL3 (57147)
4,ENSG00000000460,C1orf112,C1orf112 (55732)


In [74]:
df.shape[0]

18971

In [75]:
df.to_csv('./Ensembl_HGNC_map_042421.csv',index=False)

In [76]:
df.drop_duplicates()

Unnamed: 0,Ensembl_ID,HGNC,HGNC_ID
0,ENSG00000000003,TSPAN6,TSPAN6 (7105)
1,ENSG00000000005,TNMD,TNMD (64102)
2,ENSG00000000419,DPM1,DPM1 (8813)
3,ENSG00000000457,SCYL3,SCYL3 (57147)
4,ENSG00000000460,C1orf112,C1orf112 (55732)
...,...,...,...
18966,ENSG00000267596,CCL15,CCL15 (6359)
18967,ENSG00000267645,POLR2J2,POLR2J2 (246721)
18968,ENSG00000269028,MTRNR2L12,MTRNR2L12 (100462981)
18969,ENSG00000270386,UGT2A1,UGT2A1 (10941)


In [77]:
df['HGNC_ID'].drop_duplicates()

0                TSPAN6 (7105)
1                 TNMD (64102)
2                  DPM1 (8813)
3                SCYL3 (57147)
4             C1orf112 (55732)
                 ...          
18964         DUX4 (100288687)
18965           LYPD8 (646627)
18966             CCL15 (6359)
18968    MTRNR2L12 (100462981)
18970              ZNF8 (7554)
Name: HGNC_ID, Length: 18927, dtype: object
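
The table has 18971 rows but only 18927 distinct `HGNC_ID` values, so some CCLE genes are reached from more than one row. One way to list those cases is to group Ensembl IDs by label and keep the groups with more than one member. The sketch below does this on toy rows; `rows`, `ids_by_label`, `ambiguous` and the `ENSG_FAKE_*` / `GENE1` entries are hypothetical.

```python
from collections import defaultdict

# Toy (Ensembl_ID, HGNC_ID) pairs; only the first pair is real.
rows = [
    ("ENSG00000000003", "TSPAN6 (7105)"),
    ("ENSG_FAKE_A", "GENE1 (1)"),
    ("ENSG_FAKE_B", "GENE1 (1)"),
]

# Group Ensembl IDs by the CCLE label they map to.
ids_by_label = defaultdict(list)
for ensembl_id, label in rows:
    ids_by_label[label].append(ensembl_id)

# Keep only labels reached from more than one Ensembl ID.
ambiguous = {label: ids for label, ids in ids_by_label.items() if len(ids) > 1}
print(ambiguous)
```

How to resolve such groups (keep all, keep one, or aggregate expression) is a separate decision that this sketch does not make.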

In [78]:
# fraction of MMRF gene rows mapped to a CCLE gene
df.shape[0] / gene.shape[0]

0.33618642566010987