## Map Ensembl ID to Gene Name

Written By: Qingyang Xu

Last Modified: 04/18/2021

MMRF genomic data identifies each gene by its `Ensembl ID`, while DepMap uses gene names. This notebook uses the `mygene` module to map each `Ensembl ID` in the MMRF data to its corresponding gene name.

References

- Python `mygene` module documentation

https://pypi.org/project/mygene/

- Download patient genomic data (e.g. `MMRF_CoMMpass_IA15a_CNA_Exome_PerGene_LargestSegment.txt`)

https://research.themmrf.org/

- Download DepMap cell line data (e.g. `CCLE_expression.csv`)

https://depmap.org/portal/download/

In [2]:
import os
import glob
import numpy as np
import pandas as pd
import mygene

In [49]:
# MMRF patient genomic data
fn = './data/MMRF_CoMMpass_IA15a_CNA_Exome_PerGene_LargestSegment.txt'
gene = pd.read_csv(fn, delimiter='\t')

In [50]:
gene.head()

Unnamed: 0,Gene,MMRF_1016_1_BM,MMRF_1020_3_BM,MMRF_1021_1_BM,MMRF_1029_1_BM,MMRF_1030_1_BM,MMRF_1030_3_BM,MMRF_1030_4_BM,MMRF_1031_1_BM,MMRF_1032_1_BM,...,MMRF_2912_1_BM,MMRF_2913_1_BM,MMRF_2914_1_BM,MMRF_2915_1_BM,MMRF_2917_1_BM,MMRF_2918_1_BM,MMRF_2921_1_BM,MMRF_2923_1_BM,MMRF_2924_1_BM,MMRF_2926_1_BM
0,ENSG00000000003,-1.0317,-1.0402,-0.7883,-1.0642,0.0125,-0.0673,-0.759,-1.0306,-0.9907,...,-0.2112,-0.7789,-0.9605,-1.018,-0.9999,-0.983,-0.9669,-0.9869,0.0114,-0.9903
1,ENSG00000000005,-1.0317,-1.0402,-0.7883,-1.0642,0.0125,-0.0673,-0.759,-1.0306,-0.9907,...,-0.2112,-0.7789,-0.9605,-1.018,-0.9999,-0.983,-1.2127,-0.9869,0.0114,-0.9903
2,ENSG00000000419,-0.197,0.3719,0.0424,-0.0265,-0.0377,0.1059,0.0626,-0.0251,-0.042,...,-0.0291,-0.0713,0.0626,-0.025,0.0117,0.0352,-0.0276,0.0181,-0.0181,-0.0035
3,ENSG00000000457,0.3021,0.7277,0.6047,-0.0363,-0.0111,-0.0328,-0.0701,0.5172,0.015,...,-0.0149,1.1662,-0.0545,-0.0075,0.0211,-0.3082,-0.0127,0.5641,0.0119,-0.0001
4,ENSG00000000460,0.3021,0.7277,0.6047,-0.0363,-0.0111,-0.0328,-0.0701,0.5172,0.015,...,-0.0149,1.1662,-0.0545,-0.0075,0.0211,-0.3082,-0.0127,0.5641,0.0119,-0.0001


In [51]:
# DepMap cell line data
rnaseq = pd.read_csv('./data/CCLE_expression.csv')

In [52]:
rnaseq.head()

Unnamed: 0.1,Unnamed: 0,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147),C1orf112 (55732),FGR (2268),CFH (3075),FUCA2 (2519),GCLC (2729),...,ARHGAP11B (89839),AC004593.2 (1124),AC090517.4 (54816),AL160269.1 (11046),ABCF2-H2BE1 (114483834),POLR2J3 (548644),H2BE1 (114483833),AL445238.1 (647264),GET1-SH3BGR (106865373),AC113348.1 (102724657)
0,ACH-001113,4.990501,0.0,7.273702,2.765535,4.480265,0.028569,1.269033,3.058316,6.483171,...,1.214125,0.0,0.111031,0.15056,1.427606,5.781884,0.0,0.0,0.799087,0.0
1,ACH-001289,5.209843,0.545968,7.070604,2.538538,3.510962,0.0,0.176323,3.836934,4.20085,...,1.835924,0.0,0.31034,0.0,0.807355,4.704319,0.0,0.0,0.464668,0.070389
2,ACH-001339,3.77926,0.0,7.346425,2.339137,4.254745,0.056584,1.339137,6.724241,3.671293,...,1.823749,0.084064,0.176323,0.042644,1.38405,4.931683,0.0,0.028569,0.263034,0.0
3,ACH-001538,5.726831,0.0,7.086189,2.543496,3.102658,0.0,5.914565,6.099716,4.475733,...,0.871844,0.137504,0.263034,2.485427,0.713696,3.858976,0.0,0.0,0.0,0.0
4,ACH-000242,7.465648,0.0,6.435462,2.414136,3.864929,0.831877,7.198003,5.45253,7.112492,...,2.324811,0.163499,0.163499,0.0,1.117695,4.990501,0.0,0.0,0.0,0.0
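
The gene columns in the table above are labelled in a `SYMBOL (ID)` pattern, e.g. `TSPAN6 (7105)`. As a minimal sketch of how such labels could be split back into their two parts, the helper below uses a regular expression. The names `LABEL_RE` and `split_label` are hypothetical and are not part of the original notebook; the pattern is only an assumption based on the columns shown here.

```python
import re

# Assumed label pattern: a symbol, one space, then an ID in parentheses,
# e.g. 'TSPAN6 (7105)'. Derived from the column names printed above.
LABEL_RE = re.compile(r'^(?P<symbol>.+) \((?P<gene_id>[^()]+)\)$')


def split_label(label):
    """Return (symbol, gene_id) for a 'SYMBOL (ID)' label, or None if it does not match."""
    m = LABEL_RE.match(label)
    if m is None:
        return None
    return m.group('symbol'), m.group('gene_id')


# A few column names copied from the rnaseq.head() output
columns = ['Unnamed: 0', 'TSPAN6 (7105)', 'TNMD (64102)', 'ABCF2-H2BE1 (114483834)']
parsed = {c: split_label(c) for c in columns}

for col, parts in parsed.items():
    print(col, '->', parts)
```

Non-gene columns such as `Unnamed: 0` do not match the pattern and come back as `None`, so they can be skipped before comparing against the MMRF genes.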


In [53]:
# query gene names
ens = list(gene['Gene'])
mg = mygene.MyGeneInfo()
gene_syms = mg.querymany(ens, scopes='ensembl.gene', fields='symbol', species='human')

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-29000...done.
querying 29001-30000...done.
querying 30001-31000...done.
querying 31001-32000...done.
querying 32001-33000...done.
querying 33001-34000...done.
querying 34001-35000...done.
queryin
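
The query above returns one dictionary per hit, and an unmatched ID comes back with a `notfound` flag instead of a `symbol`. The sketch below shows one way such a result could be turned into a lookup from `Ensembl ID` to a `SYMBOL (ID)` label. It runs on a small hand-written list, `sample_hits`, that only imitates the shape of the query result, so no network access is needed. `build_label_map` is a hypothetical helper, and keeping only the first hit per query is an assumption made here, not something the notebook specifies.

```python
# Hand-written records imitating the structure of the querymany result.
# The first two genes appear in the tables above; the third ID is one of
# those the notebook reports as not found.
sample_hits = [
    {'query': 'ENSG00000000003', '_id': '7105', 'symbol': 'TSPAN6'},
    {'query': 'ENSG00000000005', '_id': '64102', 'symbol': 'TNMD'},
    {'query': 'ENSG00000005955', 'notfound': True},
]


def build_label_map(hits):
    """Map each query ID to a 'SYMBOL (ID)' label, skipping unmatched queries."""
    mapping = {}
    for hit in hits:
        # Skip IDs with no hit, and hits that carry no symbol field
        if hit.get('notfound') or 'symbol' not in hit:
            continue
        # Assumption: if a query has several hits, keep the first one seen
        mapping.setdefault(hit['query'], '%s (%s)' % (hit['symbol'], hit['_id']))
    return mapping


label_map = build_label_map(sample_hits)
print(label_map)
```

With the real data, `build_label_map(gene_syms)` would give a dictionary that can be used to relabel the `Gene` column of the MMRF table.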

In [54]:
# see which Ensembl IDs are not mapped to gene names
genes_not_found = []      # Ensembl IDs with no mygene hit
genes_not_in_depmap = []  # mapped genes with no matching DepMap column
depmap_cols = set(rnaseq.columns)

for g in gene_syms:

    if 'notfound' in g:
        genes_not_found.append(g['query'])
        continue

    # DepMap columns are named 'SYMBOL (ID)', e.g. 'TSPAN6 (7105)'
    gene_name = '%s (%s)' % (g.get('symbol'), g['_id'])
    if gene_name not in depmap_cols:
        genes_not_in_depmap.append(gene_name)

In [55]:
print(len(genes_not_found))

5639


In [56]:
print(gene.shape[0])

57663
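
The two counts printed above can be combined into a share of unmapped IDs. This is a small arithmetic sketch, assuming that 57663 is the total number of Ensembl IDs queried and that 5639 counts the IDs with no `mygene` hit. The variable names are illustrative only.

```python
n_not_found = 5639  # count printed above for genes_not_found
n_total = 57663     # row count printed above (assumed to be the number of IDs queried)

frac_not_found = n_not_found / n_total
print('%.2f%% of Ensembl IDs had no mygene hit' % (100 * frac_not_found))
```

Under those assumptions, a little under 10% of the MMRF Ensembl IDs have no gene name.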


In [57]:
for g in genes_not_found: print(g)

ENSG00000005955
ENSG00000006074
ENSG00000006075
ENSG00000006114
ENSG00000017373
ENSG00000017621
ENSG00000031544
ENSG00000034063
ENSG00000049319
ENSG00000056661
ENSG00000068793
ENSG00000069712
ENSG00000072444
ENSG00000073009
ENSG00000077809
ENSG00000083842
ENSG00000087916
ENSG00000090920
ENSG00000102069
ENSG00000102080
ENSG00000104725
ENSG00000105663
ENSG00000107618
ENSG00000107623
ENSG00000108264
ENSG00000108270
ENSG00000108272
ENSG00000108278
ENSG00000108292
ENSG00000108294
ENSG00000108296
ENSG00000108753
ENSG00000110347
ENSG00000116883
ENSG00000116957
ENSG00000117289
ENSG00000120586
ENSG00000121848
ENSG00000122497
ENSG00000122718
ENSG00000124529
ENSG00000124578
ENSG00000124693
ENSG00000126542
ENSG00000128802
ENSG00000129270
ENSG00000129277
ENSG00000129282
ENSG00000130201
ENSG00000130489
ENSG00000130723
ENSG00000131795
ENSG00000132130
ENSG00000132139
ENSG00000132142
ENSG00000132498
ENSG00000133149
ENSG00000133808
ENSG00000135213
ENSG00000135847
ENSG00000136653
ENSG00000137259
ENSG0000

ENSG00000221020
ENSG00000221021
ENSG00000221026
ENSG00000221027
ENSG00000221029
ENSG00000221030
ENSG00000221032
ENSG00000221035
ENSG00000221037
ENSG00000221041
ENSG00000221048
ENSG00000221050
ENSG00000221051
ENSG00000221053
ENSG00000221058
ENSG00000221062
ENSG00000221067
ENSG00000221069
ENSG00000221070
ENSG00000221072
ENSG00000221073
ENSG00000221074
ENSG00000221075
ENSG00000221077
ENSG00000221079
ENSG00000221082
ENSG00000221084
ENSG00000221085
ENSG00000221087
ENSG00000221094
ENSG00000221095
ENSG00000221096
ENSG00000221100
ENSG00000221101
ENSG00000221104
ENSG00000221105
ENSG00000221108
ENSG00000221110
ENSG00000221111
ENSG00000221112
ENSG00000221113
ENSG00000221115
ENSG00000221118
ENSG00000221121
ENSG00000221122
ENSG00000221123
ENSG00000221126
ENSG00000221128
ENSG00000221130
ENSG00000221131
ENSG00000221132
ENSG00000221134
ENSG00000221135
ENSG00000221136
ENSG00000221137
ENSG00000221141
ENSG00000221142
ENSG00000221145
ENSG00000221147
ENSG00000221156
ENSG00000221157
ENSG00000221159
ENSG0000

ENSG00000251846
ENSG00000251847
ENSG00000251848
ENSG00000251849
ENSG00000251860
ENSG00000251863
ENSG00000251871
ENSG00000251872
ENSG00000251876
ENSG00000251879
ENSG00000251881
ENSG00000251885
ENSG00000251894
ENSG00000251899
ENSG00000251901
ENSG00000251902
ENSG00000251909
ENSG00000251911
ENSG00000251912
ENSG00000251918
ENSG00000251926
ENSG00000251927
ENSG00000251928
ENSG00000251930
ENSG00000251933
ENSG00000251938
ENSG00000251944
ENSG00000251949
ENSG00000251950
ENSG00000251959
ENSG00000251962
ENSG00000251963
ENSG00000251966
ENSG00000251968
ENSG00000251969
ENSG00000251979
ENSG00000251989
ENSG00000251995
ENSG00000252000
ENSG00000252004
ENSG00000252009
ENSG00000252024
ENSG00000252038
ENSG00000252048
ENSG00000252054
ENSG00000252055
ENSG00000252056
ENSG00000252058
ENSG00000252059
ENSG00000252071
ENSG00000252075
ENSG00000252077
ENSG00000252078
ENSG00000252085
ENSG00000252088
ENSG00000252092
ENSG00000252093
ENSG00000252100
ENSG00000252102
ENSG00000252109
ENSG00000252110
ENSG00000252111
ENSG0000

ENSG00000265580
ENSG00000265582
ENSG00000265583
ENSG00000265585
ENSG00000265587
ENSG00000265589
ENSG00000265591
ENSG00000265592
ENSG00000265597
ENSG00000265600
ENSG00000265601
ENSG00000265603
ENSG00000265605
ENSG00000265610
ENSG00000265611
ENSG00000265613
ENSG00000265615
ENSG00000265616
ENSG00000265619
ENSG00000265621
ENSG00000265624
ENSG00000265627
ENSG00000265628
ENSG00000265629
ENSG00000265632
ENSG00000265636
ENSG00000265637
ENSG00000265638
ENSG00000265642
ENSG00000265645
ENSG00000265647
ENSG00000265649
ENSG00000265650
ENSG00000265654
ENSG00000265655
ENSG00000265661
ENSG00000265662
ENSG00000265665
ENSG00000265667
ENSG00000265672
ENSG00000265676
ENSG00000265677
ENSG00000265679
ENSG00000265680
ENSG00000265686
ENSG00000265687
ENSG00000265695
ENSG00000265696
ENSG00000265700
ENSG00000265704
ENSG00000265705
ENSG00000265708
ENSG00000265709
ENSG00000265710
ENSG00000265711
ENSG00000265714
ENSG00000265715
ENSG00000265720
ENSG00000265722
ENSG00000265730
ENSG00000265731
ENSG00000265736
ENSG0000