# Galaxy Erasin-treated Leydig cells (mouse)

This notebook annotates genes from an analysis of Leydig cells treated with Erasin. The analysed data contains the genes that were downregulated and upregulated by the treatment.

In [1]:
import requests
import json
import pandas as pd

## Read data from the original file

In [8]:
upreg = pd.read_excel("./input/Galaxy96-(upregulated_LOG2FC1).xlsx", usecols=[0])
downreg = pd.read_excel("./input/Galaxy97-(downregulated_LOG2FC-1).xlsx", usecols=[0])
print(f"There are {upreg.shape[0]} genes upregulated and {downreg.shape[0]} downregulated" )

There are 1849 genes upregulated and 525 downregulated


Let's take a look at the first 10 upregulated genes and the first 10 downregulated genes.

In [9]:
#upregulated genes
upreg[:10]

Unnamed: 0,1.GeneID
0,ENSMUST00000039926.9
1,ENSMUST00000026475.14
2,ENSMUST00000100857.9
3,ENSMUST00000146857.1
4,ENSMUST00000080511.2
5,ENSMUST00000021413.8
6,ENSMUST00000024988.14
7,ENSMUST00000027012.13
8,ENSMUST00000105105.3
9,ENSMUST00000059619.2


In [10]:
#downregulated genes
downreg[:10]

Unnamed: 0,1.GeneID
0,ENSMUST00000053020.7
1,ENSMUST00000055688.9
2,ENSMUST00000018506.12
3,ENSMUST00000089011.5
4,ENSMUST00000053063.6
5,ENSMUST00000023686.14
6,ENSMUST00000019333.9
7,ENSMUST00000056406.6
8,ENSMUST00000131422.7
9,ENSMUST00000236905.1


As we can see, the entries in each list are stored as versioned Ensembl transcript IDs (ENSMUST). Our job is to translate these IDs into gene names, either from the Ensembl database or through its API tools.

My approach is to first translate each transcript ID into a transcript name and an Ensembl gene ID, using the Ensembl REST API.

Next, gene names can be retrieved either directly from the transcript name or through the biotools.fr API. The latter is used because its API is much faster than Ensembl's, and it also serves as a cross-validation tool (in case our annotation goes wrong).

However, as we will see later, its database is not as up to date as Ensembl's. Therefore, the final gene names are extracted directly from the transcript names.
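
As a minimal sketch of the first step, the helpers below strip the version suffix from a transcript ID and pull the two fields of interest out of a decoded `/lookup/id` response. The function names and the hard-coded `sample_response` are illustrative assumptions: the values are copied from the first upregulated transcript shown later in this notebook, not fetched live. The `"0"` placeholder mirrors the convention used in the cells below.

```python
def strip_version(transcript_id: str) -> str:
    """Return the stable ID without its version suffix (text after the first '.')."""
    return transcript_id.split(".", 1)[0]


def parse_lookup(decoded, missing="0"):
    """Extract (transcript_name, gene_id) from a decoded lookup response.

    `decoded` is expected to be a dict with "display_name" and "Parent" keys;
    anything else (None, or a dict lacking a key) yields the `missing` placeholder.
    """
    if not isinstance(decoded, dict):
        return missing, missing
    return decoded.get("display_name", missing), decoded.get("Parent", missing)


# Hypothetical decoded response, hand-written for illustration only.
sample_response = {"display_name": "Dusp8-201", "Parent": "ENSMUSG00000037887"}

stable_id = strip_version("ENSMUST00000039926.9")
transcript_name, gene_id = parse_lookup(sample_response)
```

Keeping the parsing separate from the HTTP call makes it possible to check the field handling without contacting the server.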


## Upregulated genes

### Transform transcript ID (ENSMUST) to transcript name and gene ID (ENSMUSG) using the Ensembl REST API

In [110]:
server = "https://rest.ensembl.org"
ref_trx_name = {}
ref_parent = {}
for transcript_id in upreg.values[:]:
    # Drop the version suffix (e.g. ".9"); split() is safe even when no "." is present
    id_ = transcript_id[0].split(".")[0]
    ext = f"/lookup/id/{id_}?"
    r = requests.get(server+ext, headers={ "Content-Type" : "application/json"})
 
    if not r.ok:
        ref_trx_name[transcript_id[0]] = "0"
        ref_parent[transcript_id[0]] = "0"
        continue
    
    decoded = r.json()
    ref_trx_name[transcript_id[0]] = decoded["display_name"]
    ref_parent[transcript_id[0]] = decoded["Parent"]
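
The loop above records `"0"` for any response that is not OK, so a temporary failure (rate limiting or a server error) is indistinguishable from an ID that really is unknown. The sketch below separates the two cases. It assumes that the server signals an unknown ID with a 400 or 404 status and rate limiting with 429 plus a `Retry-After` header; the function name and the one-second fallback are illustrative choices, not part of the original analysis.

```python
def next_action(status_code, retry_after=None):
    """Decide what to do with a response, returning (action, wait_seconds).

    action is one of "store", "mark_missing", "retry".
    Assumption: 400/404 mean the ID is unknown, while 429 and 5xx are temporary.
    """
    if 200 <= status_code < 300:
        return "store", 0.0
    if status_code in (400, 404):
        return "mark_missing", 0.0
    if status_code == 429 or status_code >= 500:
        # Honour the Retry-After header when it is numeric; otherwise wait one second.
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            wait = 1.0
        return "retry", wait
    # Any other status is treated as a permanent failure for this ID.
    return "mark_missing", 0.0
```

In the loop, `next_action(r.status_code, r.headers.get("Retry-After"))` could replace the `if not r.ok` check, with a `time.sleep(wait)` before repeating the request when the action is `"retry"`.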

#### This took around 30 to 60 minutes to complete!
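
Most of that time comes from issuing one GET request per transcript. The Ensembl REST documentation also describes a POST form of `/lookup/id` that accepts a list of IDs in a single call, which should cut the number of round trips considerably. The sketch below is untested against the live server: the batch size of 1000 and the assumption that unknown IDs come back as `null` should be checked against the endpoint documentation. It uses `urllib` from the standard library to stay dependency-free; `requests.post` would work equally well.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"
BATCH_SIZE = 1000  # assumed maximum number of IDs per POST; check the endpoint docs


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_batch(stable_ids):
    """POST one batch of version-less IDs to /lookup/id and return the decoded dict."""
    payload = json.dumps({"ids": list(stable_ids)}).encode("utf-8")
    request = urllib.request.Request(
        SERVER + "/lookup/id",
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def lookup_all(stable_ids):
    """Look up every ID in batches; unknown IDs are expected to map to None."""
    results = {}
    for batch in chunked(list(stable_ids), BATCH_SIZE):
        results.update(lookup_batch(batch))
    return results
```

Calling `lookup_all` on the version-stripped IDs would return one dictionary keyed by stable ID, from which `display_name` and `Parent` can be read as in the loop above.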

In [None]:
trx_name = list(ref_trx_name.values())
gene_ids = list(ref_parent.values())
upreg["transcript_name"] = trx_name
upreg["gene_ids"] = gene_ids
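
The assignment above relies on the dictionaries holding exactly one entry per row, in row order. That holds when every transcript ID is unique, but a repeated ID would collapse into a single key and the list lengths would no longer match the table. A sketch of a lookup keyed on each row's ID, which avoids that assumption (the small list and dictionary are stand-ins for the real column and lookup results):

```python
# Stand-in for upreg["1.GeneID"]; the repeated ID is deliberate.
transcript_ids = [
    "ENSMUST00000039926.9",
    "ENSMUST00000026475.14",
    "ENSMUST00000039926.9",
]
# Stand-in for ref_trx_name as filled by the lookup loop.
ref_trx_name = {
    "ENSMUST00000039926.9": "Dusp8-201",
    "ENSMUST00000026475.14": "Ddit3-201",
}

# Positional approach: one value per distinct key, so the lengths can differ.
by_position = list(ref_trx_name.values())

# Keyed approach: one value per row, whatever the order or repetition.
by_key = [ref_trx_name.get(tid, "0") for tid in transcript_ids]
```

With the DataFrame itself, the equivalent would be `upreg["transcript_name"] = upreg["1.GeneID"].map(ref_trx_name)`.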

### Transform gene ID (ENSMUSG) to gene name using the biotools.fr API

In [123]:
url = "https://biotools.fr/mouse/ensembl_symbol_converter/"
# Skip the "0" placeholders left by failed Ensembl lookups
ids = [gene_id for gene_id in gene_ids if gene_id != "0"]

ids_json = json.dumps(ids)

body ={"api":1, "ids":ids_json}
r = requests.post(url,data=body)
ref_gene_name = json.loads(r.text)
genes_names = []
for id_ in gene_ids:
    if id_ == "0":
        genes_names.append("0")
    else:
        genes_names.append(ref_gene_name[id_])
upreg["genes_names"] = genes_names

Let's take a look at our new data

In [39]:
# upreg.to_csv("./input/upregulated_genes.csv", header=True, index=False)
upreg[:10]

Unnamed: 0,1.GeneID,transcript_name,gene_ids,genes_names
0,ENSMUST00000039926.9,Dusp8-201,ENSMUSG00000037887,Dusp8
1,ENSMUST00000026475.14,Ddit3-201,ENSMUSG00000025408,Ddit3
2,ENSMUST00000100857.9,Dusp16-201,ENSMUSG00000030203,Dusp16
3,ENSMUST00000146857.1,Gm11274-201,ENSMUSG00000085331,Gm11274
4,ENSMUST00000080511.2,H1f5-201,ENSMUSG00000058773,Hist1h1b
5,ENSMUST00000021413.8,Nfkbia-201,ENSMUSG00000021025,Nfkbia
6,ENSMUST00000024988.14,C3-201,ENSMUSG00000024164,C3
7,ENSMUST00000027012.13,Casp4-201,ENSMUSG00000033538,Casp4
8,ENSMUST00000105105.3,H3c4-201,ENSMUSG00000099583,Hist1h3d
9,ENSMUST00000059619.2,Cdc42ep1-201,ENSMUSG00000049521,Cdc42ep1


In this table, transcript_name and gene_ids come from Ensembl, and genes_names comes from biotools.fr. As shown here, the gene name can already be read off the transcript name.

Next, we are going to do the same for the downregulated genes.

## Downregulated genes

### Transform transcript ID (ENSMUST) to transcript name and gene ID (ENSMUSG) using the Ensembl REST API

In [132]:
server = "https://rest.ensembl.org"
ref_trx_name = {}
ref_parent = {}
for transcript_id in downreg.values[:]:
    # Drop the version suffix (e.g. ".7"); split() is safe even when no "." is present
    id_ = transcript_id[0].split(".")[0]
    ext = f"/lookup/id/{id_}?"
    r = requests.get(server+ext, headers={ "Content-Type" : "application/json"})
 
    if not r.ok:
        ref_trx_name[transcript_id[0]] = "0"
        ref_parent[transcript_id[0]] = "0"
        continue
    
    decoded = r.json()
    ref_trx_name[transcript_id[0]] = decoded["display_name"]
    ref_parent[transcript_id[0]] = decoded["Parent"]

In [135]:
trx_name = list(ref_trx_name.values())
gene_ids = list(ref_parent.values())
downreg["transcript_name"] = trx_name
downreg["gene_ids"] = gene_ids

### Transform gene ID (ENSMUSG) to gene name using the biotools.fr API


In [136]:
url = "https://biotools.fr/mouse/ensembl_symbol_converter/"
# Skip the "0" placeholders left by failed Ensembl lookups
ids = [gene_id for gene_id in gene_ids if gene_id != "0"]

ids_json = json.dumps(ids)

body ={"api":1, "ids":ids_json}
r = requests.post(url,data=body)
ref_gene_name = json.loads(r.text)
genes_names = []
for id_ in gene_ids:
    if id_ == "0":
        genes_names.append("0")
    else:
        genes_names.append(ref_gene_name[id_])
downreg["genes_names"] = genes_names

In [40]:
# downreg.to_csv("./input/downregulated_genes.csv", header=True, index=False)
downreg[:10]

Unnamed: 0,1.GeneID,transcript_name,gene_ids,genes_names
0,ENSMUST00000053020.7,Neurl1b-201,ENSMUSG00000034413,Neurl1b
1,ENSMUST00000055688.9,Phf13-201,ENSMUSG00000047777,Phf13
2,ENSMUST00000018506.12,Kpna2-201,ENSMUSG00000018362,Kpna2
3,ENSMUST00000089011.5,Snn-201,ENSMUSG00000037972,Snn
4,ENSMUST00000053063.6,Hexim1-201,ENSMUSG00000048878,Hexim1
5,ENSMUST00000023686.14,Tmem50b-201,ENSMUSG00000022964,Tmem50b
6,ENSMUST00000019333.9,Rnf145-201,ENSMUSG00000019189,Rnf145
7,ENSMUST00000056406.6,Fam78a-201,ENSMUSG00000050592,Fam78a
8,ENSMUST00000131422.7,Dna2-203,ENSMUSG00000036875,Dna2
9,ENSMUST00000236905.1,Gm50321-201,ENSMUSG00000118383,


In [88]:
downreg = pd.read_csv("./input/downregulated_genes.csv")
upreg = pd.read_csv("./input/upregulated_genes.csv")

# Cleaning data

In this section, we clean up the data. Some transcripts can't be found in the Ensembl database; the value stored for these transcripts is "0". For transcripts that can't be found in the biotools.fr database, the resulting values are NaN. Our job is to handle both cases, cross-check the transcript names against the gene names, and resolve any differences between the two.
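
The cases described above can be sketched as a single classification function. This is an illustrative summary rather than code from the original analysis: the function names and the four labels are assumptions, and the transcript name is assumed to have the form `<gene symbol>-<number>`.

```python
import math


def is_nan(value):
    """True for a float NaN, which is how pandas represents an empty CSV cell."""
    return isinstance(value, float) and math.isnan(value)


def classify_row(transcript_name, gene_name):
    """Label a row according to the cleaning steps described above."""
    if transcript_name == "0":
        return "not_in_ensembl"
    if is_nan(gene_name) or gene_name in ("", None):
        return "not_in_biotools"
    # Transcript names look like "<gene symbol>-<number>", e.g. "Dusp8-201".
    symbol = transcript_name.rsplit("-", 1)[0]
    return "match" if symbol == gene_name else "mismatch"
```

For example, `classify_row("H1f5-201", "Hist1h1b")` is labelled a mismatch, which is the kind of row where the two sources use different names for the same gene.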



#### Renaming the columns

In [89]:
upreg.rename(columns={"1.GeneID":"transcript_ids", "transcript_name":"transcript_names"}, inplace=True)
upreg[:10]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
0,ENSMUST00000039926.9,Dusp8-201,ENSMUSG00000037887,Dusp8
1,ENSMUST00000026475.14,Ddit3-201,ENSMUSG00000025408,Ddit3
2,ENSMUST00000100857.9,Dusp16-201,ENSMUSG00000030203,Dusp16
3,ENSMUST00000146857.1,Gm11274-201,ENSMUSG00000085331,Gm11274
4,ENSMUST00000080511.2,H1f5-201,ENSMUSG00000058773,Hist1h1b
5,ENSMUST00000021413.8,Nfkbia-201,ENSMUSG00000021025,Nfkbia
6,ENSMUST00000024988.14,C3-201,ENSMUSG00000024164,C3
7,ENSMUST00000027012.13,Casp4-201,ENSMUSG00000033538,Casp4
8,ENSMUST00000105105.3,H3c4-201,ENSMUSG00000099583,Hist1h3d
9,ENSMUST00000059619.2,Cdc42ep1-201,ENSMUSG00000049521,Cdc42ep1


In [90]:
downreg.rename(columns={"1.GeneID":"transcript_ids", "transcript_name":"transcript_names"}, inplace=True)
downreg[:10]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
0,ENSMUST00000053020.7,Neurl1b-201,ENSMUSG00000034413,Neurl1b
1,ENSMUST00000055688.9,Phf13-201,ENSMUSG00000047777,Phf13
2,ENSMUST00000018506.12,Kpna2-201,ENSMUSG00000018362,Kpna2
3,ENSMUST00000089011.5,Snn-201,ENSMUSG00000037972,Snn
4,ENSMUST00000053063.6,Hexim1-201,ENSMUSG00000048878,Hexim1
5,ENSMUST00000023686.14,Tmem50b-201,ENSMUSG00000022964,Tmem50b
6,ENSMUST00000019333.9,Rnf145-201,ENSMUSG00000019189,Rnf145
7,ENSMUST00000056406.6,Fam78a-201,ENSMUSG00000050592,Fam78a
8,ENSMUST00000131422.7,Dna2-203,ENSMUSG00000036875,Dna2
9,ENSMUST00000236905.1,Gm50321-201,ENSMUSG00000118383,


## Treat NaN and "0" values

### A "0" value means the ID can't be found in the Ensembl database

#### Number of genes that can't be found in the Ensembl database

In [91]:
print(f"""In upregulated genes {sum(upreg["transcript_names"] == "0")} genes can't be found in the Ensembl database""" )
print(f"""In downregulated genes {sum(downreg["transcript_names"] == "0")} genes can't be found in the Ensembl database""" )

In upregulated genes 20 genes can't be found in the Ensembl database
In downregulated genes 4 genes can't be found in the Ensembl database


#### Take a look at these transcripts

In [92]:
idx = upreg["transcript_names"] == "0"
upreg[idx]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
178,ENSMUST00000198018.1,0,0,0
274,ENSMUST00000183460.1,0,0,0
325,ENSMUST00000184934.1,0,0,0
339,ENSMUST00000099326.9,0,0,0
421,ENSMUST00000200671.1,0,0,0
441,ENSMUST00000184987.1,0,0,0
579,ENSMUST00000185085.1,0,0,0
954,ENSMUST00000238490.1,0,0,0
1239,ENSMUST00000053308.9,0,0,0
1291,ENSMUST00000199558.1,0,0,0


In [93]:
idx = downreg["transcript_names"] == "0"
downreg[idx]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
232,ENSMUST00000104097.1,0,0,0
370,ENSMUST00000122759.1,0,0,0
420,ENSMUST00000157366.1,0,0,0
519,ENSMUST00000157984.1,0,0,0


#### These transcripts probably aren't annotated yet, or they result from errors in earlier analyses. It's better to separate them and save them in separate files for future analysis.

In [94]:
idx = upreg["transcript_names"] == "0"
upreg_unannotated = upreg[idx]
upreg_unannotated.to_csv("./input/upregulated_genes_unannotated.csv", header=True, index=False)

In [95]:
idx = downreg["transcript_names"] == "0"
downreg_unannotated = downreg[idx]
downreg_unannotated.to_csv("./input/downregulated_genes_unannotated.csv", header=True, index=False)

In [96]:
idx = upreg["transcript_names"] != "0"
upreg_annotated = upreg[idx]
upreg_annotated.to_csv("./input/upregulated_genes_annotated.csv", header=True, index=False)

In [97]:
idx = downreg["transcript_names"] != "0"
downreg_annotated = downreg[idx]
downreg_annotated.to_csv("./input/downregulated_genes_annotated.csv", header=True, index=False)
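
The four cells above repeat the same mask for each table. As a sketch, one helper can compute the mask once and return both halves. The function name and its defaults are illustrative, and the two-row frame only reuses rows shown in this notebook.

```python
import pandas as pd


def split_by_annotation(df, column="transcript_names", placeholder="0"):
    """Return (annotated, unannotated) copies of `df`, split on the placeholder."""
    missing = df[column] == placeholder
    return df[~missing].copy(), df[missing].copy()


# Tiny example frame reusing two rows that appear in this notebook.
example = pd.DataFrame(
    {
        "transcript_ids": ["ENSMUST00000053020.7", "ENSMUST00000104097.1"],
        "transcript_names": ["Neurl1b-201", "0"],
    }
)
annotated, unannotated = split_by_annotation(example)
```

Each returned frame can then be written with `to_csv` exactly as in the cells above.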

In [98]:
downreg = pd.read_csv("./input/downregulated_genes_annotated.csv")
upreg = pd.read_csv("./input/upregulated_genes_annotated.csv")

## NaN values

#### A NaN value occurs when the gene ID can't be found in the biotools.fr reference database. It can be fixed by extracting the gene name directly from the transcript name.

In [99]:
# NaN marks gene IDs that biotools.fr could not resolve
idx = downreg["genes_names"].isna()
downreg.loc[idx, "genes_names"]

9      NaN
27     NaN
87     NaN
190    NaN
210    NaN
313    NaN
445    NaN
448    NaN
480    NaN
483    NaN
502    NaN
508    NaN
510    NaN
Name: genes_names, dtype: object

In [101]:
# Drop the 4-character transcript suffix (e.g. "-201") to recover the gene name
downreg.loc[idx, "genes_names"] = downreg.loc[idx, "transcript_names"].apply(lambda x: x[:-4])
downreg.loc[idx, "genes_names"]

9         Gm50321
27        Gm49708
87        Gm49709
190       Gm49894
210       Gm50322
313       Gm50367
445        Gm4356
448       Gm31641
480    AC118698.1
483       Gm52965
502       Gm49980
508       Gm50232
510       Gm50013
Name: genes_names, dtype: object

A quick manual search in the Ensembl database shows that most of these genes encode long non-coding RNAs.
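
The `x[:-4]` slice used above assumes that every transcript name ends in a hyphen followed by exactly three digits. That holds for the names shown in this notebook. As a hedge against other suffix lengths, here is a sketch that splits on the last hyphen instead; the function name is illustrative, and the four-digit suffix in the last example is hypothetical.

```python
def gene_symbol(transcript_name):
    """Return the part of a transcript name before its final '-<suffix>'."""
    symbol, sep, _suffix = transcript_name.rpartition("-")
    # rpartition returns ("", "", name) when there is no hyphen; keep the name then.
    return symbol if sep else transcript_name


names = ["Gm50321-201", "AC118698.1-201", "Dna2-203", "Example-1201"]
symbols = [gene_symbol(name) for name in names]
```

In the cells of this section, `apply(gene_symbol)` could stand in for `apply(lambda x: x[:-4])`.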

#### Let's check for discrepancies between the transcript names from Ensembl and the gene names from biotools.fr

In [102]:
downreg[downreg["genes_names"] != downreg["transcript_names"].apply(lambda x: x[:-4])]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
16,ENSMUST00000054920.4,Myorg-201,ENSMUSG00000046312,AI464131
90,ENSMUST00000230913.1,Lncppara-202,ENSMUSG00000116305,AC162302.2
97,ENSMUST00000105801.8,Slc66a1-202,ENSMUSG00000028744,Pqlc2
99,ENSMUST00000061455.8,Tent5c-201,ENSMUSG00000044468,Fam46c
112,ENSMUST00000227212.1,Gm49083-201,ENSMUSG00000115354,AC154806.2
124,ENSMUST00000040307.5,Marchf9-201,ENSMUSG00000040502,March9
231,ENSMUST00000225111.1,Gm49336-205,ENSMUSG00000114797,Nupl1
251,ENSMUST00000198607.4,Tent4a-202,ENSMUSG00000034575,Papd7
314,ENSMUST00000117994.7,Cip2a-202,ENSMUSG00000033031,C330027C09Rik
342,ENSMUST00000083226.1,Gm50452-201,ENSMUSG00000065160,Mir5117


#### Most of these discrepancies are just different names for the same gene. It's better to go with the Ensembl result because it appears to be more up to date.

In [105]:
idx = downreg["genes_names"] != downreg["transcript_names"].apply(lambda x: x[:-4])
downreg.loc[idx,"genes_names"] = downreg["transcript_names"].apply(lambda x: x[:-4])
downreg[idx]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
16,ENSMUST00000054920.4,Myorg-201,ENSMUSG00000046312,Myorg
90,ENSMUST00000230913.1,Lncppara-202,ENSMUSG00000116305,Lncppara
97,ENSMUST00000105801.8,Slc66a1-202,ENSMUSG00000028744,Slc66a1
99,ENSMUST00000061455.8,Tent5c-201,ENSMUSG00000044468,Tent5c
112,ENSMUST00000227212.1,Gm49083-201,ENSMUSG00000115354,Gm49083
124,ENSMUST00000040307.5,Marchf9-201,ENSMUSG00000040502,Marchf9
231,ENSMUST00000225111.1,Gm49336-205,ENSMUSG00000114797,Gm49336
251,ENSMUST00000198607.4,Tent4a-202,ENSMUSG00000034575,Tent4a
314,ENSMUST00000117994.7,Cip2a-202,ENSMUSG00000033031,Cip2a
342,ENSMUST00000083226.1,Gm50452-201,ENSMUSG00000065160,Gm50452


#### Do the same for upregulated genes

In [106]:
idx = upreg["genes_names"] != upreg["transcript_names"].apply(lambda x: x[:-4])
upreg.loc[idx,"genes_names"] = upreg["transcript_names"].apply(lambda x: x[:-4])
upreg[idx]

Unnamed: 0,transcript_ids,transcript_names,gene_ids,genes_names
4,ENSMUST00000080511.2,H1f5-201,ENSMUSG00000058773,H1f5
8,ENSMUST00000105105.3,H3c4-201,ENSMUSG00000099583,H3c4
14,ENSMUST00000040914.2,H1f2-201,ENSMUSG00000036181,H1f2
18,ENSMUST00000045301.8,H1f3-201,ENSMUSG00000052565,H1f3
22,ENSMUST00000110452.1,H2bc11-201,ENSMUSG00000069300,H2bc11
...,...,...,...,...
1777,ENSMUST00000237842.1,Gm50361-201,ENSMUSG00000117881,Gm50361
1798,ENSMUST00000235456.1,Gm52981-201,ENSMUSG00000118086,Gm52981
1813,ENSMUST00000225011.1,Gm7644-201,ENSMUSG00000114553,Gm7644
1819,ENSMUST00000097395.4,Gm3435-201,ENSMUSG00000116895,Gm3435


# Preprocessing complete, let's save the data

In [107]:
upreg.to_csv("./input/upregulated_genes_annotated.csv", header=True, index=False)
downreg.to_csv("./input/downregulated_genes_annotated.csv", header=True, index=False)