In [1]:
import pandas as pd

# ChIP-Atlas gene targets of SOX17 (hg38)

The first dataset was retrieved from ChIP-Atlas. Using the "Target genes" function with the human genome assembly hg38, we searched within ± 10 kb of the TSS of every gene to find target genes of SOX17.

Let's take a look at the dataset

In [2]:
sox17 = pd.read_csv("./input/SOX17.10.hg38.tsv", delimiter="\t")
print(f"There are {sox17.shape[0]} genes retrieved from ChIP-Atlas ")

There are 7563 genes retrieved from ChIP-Atlas 


In [3]:
sox17[:10]

Unnamed: 0,Target_genes,SOX17|Average,SRX702102|hESC_derived_mesendodermal_cells,SRX702103|hESC_derived_mesendodermal_cells,SRX702104|hESC_derived_mesendodermal_cells,SRX702101|hESC_HUES64,SRX5771670|Tcam-2,SRX5771672|Tcam-2,SRX5771674|Tcam-2,STRING
0,FAM222B,482.0,0,0,0,0,1234,1015,1125,0
1,NANOS3,470.428571,0,0,0,0,1216,942,1135,0
2,TUBB,469.857143,0,0,0,0,1117,980,1192,0
3,MDC1,469.857143,0,0,0,0,1117,980,1192,0
4,OLFM2,455.571429,0,0,0,0,1135,987,1067,0
5,CPVL,440.285714,0,0,0,0,1151,972,959,0
6,CHN2,440.285714,0,0,0,0,1151,972,959,0
7,OCM,439.714286,0,0,0,0,1015,1057,1006,0
8,CYP4A22,437.714286,0,0,0,0,1124,1010,930,0
9,COL17A1,433.857143,0,0,0,0,1045,912,1080,0


We only need the first two columns: the target gene name and the average SOX17 score.

In [4]:
sox17.rename(columns={"SOX17|Average":"Score"}, inplace=True)

In [5]:
# Keep only the gene name and score columns
sox17[["Target_genes", "Score"]].to_csv("./input/sox17_targets.csv", header=True, index=False)
sox17_targets = pd.read_csv("./input/sox17_targets.csv")
sox17_targets[:10]

Unnamed: 0,Target_genes,Score
0,FAM222B,482.0
1,NANOS3,470.428571
2,TUBB,469.857143
3,MDC1,469.857143
4,OLFM2,455.571429
5,CPVL,440.285714
6,CHN2,440.285714
7,OCM,439.714286
8,CYP4A22,437.714286
9,COL17A1,433.857143
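
The cell above writes the two columns to CSV and reads them back. As a minimal sketch of an alternative, the same selection and renaming can be done while loading, assuming the TSV has the column names shown earlier. The toy table and its sample column `SRX000001|example_cells` are placeholders, not real ChIP-Atlas data.

```python
import io

import pandas as pd

# Toy stand-in for the ChIP-Atlas TSV; the sample column name is a placeholder.
toy_tsv = (
    "Target_genes\tSOX17|Average\tSRX000001|example_cells\tSTRING\n"
    "FAM222B\t482.0\t1234\t0\n"
    "NANOS3\t470.428571\t1216\t0\n"
)

# Read only the two columns of interest and rename the score column on load.
targets = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    usecols=["Target_genes", "SOX17|Average"],
).rename(columns={"SOX17|Average": "Score"})

print(targets)
```

With the real file, `io.StringIO(toy_tsv)` would be replaced by the path to the TSV.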


Let's put the gene names in upper case to make later lookups easier.

In [6]:
# Assign to the column directly; chained assignment on a slice may not modify the DataFrame
sox17_targets["Target_genes"] = sox17_targets["Target_genes"].str.upper()
sox17_targets[:10]

Unnamed: 0,Target_genes,Score
0,FAM222B,482.0
1,NANOS3,470.428571
2,TUBB,469.857143
3,MDC1,469.857143
4,OLFM2,455.571429
5,CPVL,440.285714
6,CHN2,440.285714
7,OCM,439.714286
8,CYP4A22,437.714286
9,COL17A1,433.857143
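
Upper-casing is the only normalization used in this notebook. As a hedged sketch, the pandas string accessor can also trim stray whitespace and tolerate missing values in the same step. The `strip` call and the toy symbols below are our additions for illustration.

```python
import pandas as pd

# Toy symbols with mixed case, surrounding spaces, and one missing value.
symbols = pd.Series([" Sox17", "c1orf115 ", "HLA-A", None])

# The .str methods propagate missing values instead of raising,
# whereas apply(lambda x: x.upper()) would fail on a missing entry.
normalized = symbols.str.strip().str.upper()

print(normalized.tolist())
```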


# HPA - Genes expressed in favorable and unfavorable ovarian cancer prognostics 

Here, we want a list of genes that are expressed in human ovarian cancer from the Human Protein Atlas (HPA) database. We did this by querying HPA with "prognostic:Ovarian cancer;Unfavorable" or "prognostic:Ovarian cancer;Favorable".

This gives us two lists of genes that are expressed in human ovarian cancer, with evidence from IHC: one associated with a favorable prognosis and one with an unfavorable prognosis.

#### Favorable prognostic

In [7]:
ovc_fav = pd.read_csv("./input/prognostic_Ovarian_favorable.tsv", delimiter="\t", usecols=[0, 1, 2, 3])
print(f"There are {ovc_fav.shape[0]} genes expressed in favorable prognostic of ovarian cancer")

There are 357 genes expressed in favorable prognostic of ovarian cancer


Make sure all gene names are in upper case. Let's take a look at the dataset.

In [8]:
# Assign to the column directly; chained assignment on a slice may not modify the DataFrame
ovc_fav["Gene"] = ovc_fav["Gene"].str.upper()
ovc_fav[:10]

Unnamed: 0,Gene,Gene synonym,Ensembl,Gene description
0,AADAC,"CES5A1, DAC",ENSG00000114771,Arylacetamide deacetylase
1,ABT1,Esf2,ENSG00000146109,Activator of basal transcription 1
2,ACOT13,"HT012, THEM2",ENSG00000112304,Acyl-CoA thioesterase 13
3,ACSM3,"SA, SAH",ENSG00000005187,Acyl-CoA synthetase medium chain family member 3
4,ADA2,"ADGF, CECR1, IDGFL",ENSG00000093072,Adenosine deaminase 2
5,ADIPOR1,"ACDCR1, PAQR1",ENSG00000159346,Adiponectin receptor 1
6,ADRA1B,,ENSG00000170214,Adrenoceptor alpha 1B
7,AKT1,"AKT, PKB, PRKBA, RAC",ENSG00000142208,AKT serine/threonine kinase 1
8,ALDH5A1,"SSADH, SSDH",ENSG00000112294,Aldehyde dehydrogenase 5 family member A1
9,ALG1L2,,ENSG00000251287,"ALG1, chitobiosyldiphosphodolichol beta-mannos..."


#### Unfavorable prognostic

In [9]:
ovc_unfav = pd.read_csv("./input/prognostic_Ovarian_unfavorable.tsv", delimiter="\t", usecols=[0, 1, 2, 3])
print(f"There are {ovc_unfav.shape[0]} genes expressed in unfavorable prognostic of ovarian cancer")

There are 152 genes expressed in unfavorable prognostic of ovarian cancer


Make sure all gene names are in upper case. Let's take a look at the dataset.

In [10]:
# Assign to the column directly; chained assignment on a slice may not modify the DataFrame
ovc_unfav["Gene"] = ovc_unfav["Gene"].str.upper()
ovc_unfav[:10]

Unnamed: 0,Gene,Gene synonym,Ensembl,Gene description
0,AAK1,"DKFZp686K16132, KIAA1048",ENSG00000115977,AP2 associated kinase 1
1,AGAP1,"CENTG2, GGAP1, KIAA1099",ENSG00000157985,"ArfGAP with GTPase domain, ankyrin repeat and ..."
2,AGFG1,"HRB, RAB, RIP",ENSG00000173744,ArfGAP with FG repeats 1
3,AHDC1,"DJ159A19.3, RP1-159A19.1",ENSG00000126705,AT-hook DNA binding motif containing 1
4,AL136454.1,,ENSG00000231767,
5,ALOX5AP,FLAP,ENSG00000132965,Arachidonate 5-lipoxygenase activating protein
6,ANKRD13A,"ANKRD13, NY-REN-25",ENSG00000076513,Ankyrin repeat domain 13A
7,ANXA4,ANX4,ENSG00000196975,Annexin A4
8,AP2A1,"ADTAA, CLAPA1",ENSG00000196961,Adaptor related protein complex 2 alpha 1 subunit
9,ARID1B,"6A3-5, BAF250b, DAN15, ELD/OSA1, KIAA1235, p250R",ENSG00000049618,AT-rich interaction domain 1B


Luckily, the HPA dataset also contains gene synonyms. This will help us in the next step.

# Common genes

Our final goal is to find SOX17 target genes that are also expressed in ovarian cancer with either a favorable or an unfavorable prognosis.

One gene can have several names. To make sure we don't miss common genes just because the two datasets use different synonyms, we constructed a synonym reference from the gene synonyms in the HPA dataset.

We also carry over the score obtained from the SOX17 ChIP-Atlas data, to help evaluate how likely each gene is to be a SOX17 target.

#### Constructing the HPA synonym reference

In [11]:
genes_hpa_fav = list(ovc_fav.Gene)
#original to synonyms reference
genes_synonyms_hpa = {}

#synonyms to original reference
synonyms_genes_hpa = {}

for gene,syns in zip(genes_hpa_fav,ovc_fav["Gene synonym"]):
    genes_synonyms_hpa[gene] = [syns]
    if type(syns) == float:
        continue
    else:
        for s in syns.split(", "):
            synonyms_genes_hpa[s.upper()] = gene            
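
The same lookup is built again later for the unfavorable list, so the loop could be wrapped in a small helper. The sketch below is one possible way to do that, not the code used for the results in this notebook. The function name `build_synonym_lookup` and the toy table are hypothetical. Unlike the cell above, it stores the synonyms as a split list and uses `pd.isna` to detect empty synonym cells.

```python
import pandas as pd


def build_synonym_lookup(hpa_df, gene_col="Gene", syn_col="Gene synonym"):
    """Return (gene -> list of synonyms, synonym -> gene), all upper-cased."""
    gene_to_synonyms = {}
    synonym_to_gene = {}
    for gene, syns in zip(hpa_df[gene_col], hpa_df[syn_col]):
        gene = gene.upper()
        # Empty synonym cells are read by pandas as missing values.
        if pd.isna(syns):
            names = []
        else:
            names = [s.strip().upper() for s in syns.split(",")]
        gene_to_synonyms[gene] = names
        for name in names:
            synonym_to_gene[name] = gene
    return gene_to_synonyms, synonym_to_gene


# Toy table with the same two columns as the HPA export.
toy = pd.DataFrame({
    "Gene": ["ADA2", "ADRA1B"],
    "Gene synonym": ["ADGF, CECR1, IDGFL", None],
})
gene_to_synonyms, synonym_to_gene = build_synonym_lookup(toy)
print(gene_to_synonyms)
print(synonym_to_gene)
```

Calling the helper once per HPA table would avoid reusing the same two dictionary names for the favorable and unfavorable lists.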

#### SOX17 targets vs HPA favorable genes

In [12]:
sox17_targets_list = list(sox17_targets.Target_genes)
sox17_scores_list = list(sox17_targets.Score)

In [13]:
sox17_ovc_fav = []
sox17_ovc_fav_score = {}
for g,s in zip(sox17_targets_list,sox17_scores_list):
    if g in genes_synonyms_hpa:
        sox17_ovc_fav.append(g)
        sox17_ovc_fav_score[g] = s
    if g in synonyms_genes_hpa:
        sox17_ovc_fav.append(synonyms_genes_hpa[g])
        sox17_ovc_fav_score[synonyms_genes_hpa[g]] = s

In [14]:
print(f"We have {len(sox17_ovc_fav)} hits")

We have 149 hits


Let's remove any duplicates.

In [15]:
sox17_ovc_fav = set(sox17_ovc_fav)
print(f"After duplicate elimination, we have {len(sox17_ovc_fav)} hits")

After duplicate elimination, we have 147 hits
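
A duplicate appears when the same HPA gene is reached more than once, for example through its own symbol and through one of its synonyms. In that case the score dictionary built in the matching loop keeps only the last score it saw. The sketch below shows one way to trace which target symbols produced each hit. All symbols and scores in it are made up for illustration.

```python
from collections import defaultdict

# Toy inputs; symbols and scores are illustrative, not taken from the real data.
targets = {"GENE1": 10.0, "ALIAS1": 7.5, "GENE2": 3.0}
hpa_genes = {"GENE1", "GENE2"}
synonym_to_gene = {"ALIAS1": "GENE1"}

# Map each HPA gene to every (target symbol, score) pair that produced a hit.
hits = defaultdict(list)
for symbol, score in targets.items():
    if symbol in hpa_genes:
        hits[symbol].append((symbol, score))
    if symbol in synonym_to_gene:
        hits[synonym_to_gene[symbol]].append((symbol, score))

# Genes reached through more than one target symbol are the duplicates.
duplicates = {gene: found for gene, found in hits.items() if len(found) > 1}
print(duplicates)
```

Inspecting such a table would make it possible to decide explicitly which score to keep, for instance the maximum, rather than relying on iteration order.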


In [16]:
print(sox17_ovc_fav)

{'OSGEPL1', 'ZNF83', 'FEN1', 'SOSTDC1', 'CCAR1', 'ORAI1', 'DHRS4L2', 'UTP4', 'CAAP1', 'SMIM27', 'RALBP1', 'CAMTA1', 'LMO4', 'TAF13', 'GCH1', 'HLA-A', 'CASP8', 'FLOT1', 'HOMEZ', 'AMMECR1', 'SERPINB6', 'SMU1', 'ZNF121', 'UBR2', 'WDR83OS', 'HPDL', 'GALNT6', 'VPS29', 'TCEAL3', 'TMEM258', 'PLAA', 'FZD3', 'ZNF354A', 'C1ORF115', 'CASP2', 'HEXIM2', 'ZNF641', 'PCBP2', 'HDGF', 'DNAJA1', 'TOPORS', 'FOXA2', 'EXOSC4', 'ANAPC15', 'GNAS', 'C6ORF62', 'SNRPE', 'HLA-DOB', 'MRS2', 'C18ORF21', 'ZNF184', 'OCIAD2', 'PLPP5', 'ZNF429', 'SMUG1', 'SLC7A11', 'CHRAC1', 'CENPL', 'ZNF391', 'MAN1A2', 'MAGOHB', 'PARP9', 'SLAMF7', 'PPIL3', 'UBE2L3', 'TRIM38', 'MRPS11', 'IDO1', 'FBXO9', 'FBXO16', 'NDUFS5', 'JPT1', 'PACSIN3', 'ZNF85', 'PTPN2', 'UBE2J1', 'DYDC2', 'CCDC34', 'ERGIC2', 'SERPING1', 'EMC9', 'PSMB5', 'IFRD1', 'EIF4E3', 'CALR', 'GSTZ1', 'ADRA1B', 'CXCR4', 'PRIM2', 'UBAP2', 'ZNF714', 'TMEM53', 'SEM1', 'NFX1', 'WLS', 'ZNF93', 'SSBP1', 'SEC22B', 'GNL2', 'ODR4', 'NOL7', 'KCTD14', 'DCUN1D5', 'CENPH', 'HSP90AB1', 'FA

#### SOX17 targets vs HPA unfavorable genes

In [17]:
genes_hpa_unfav = list(ovc_unfav.Gene)
#original to synonyms reference
genes_synonyms_hpa = {}

#synonyms to original reference
synonyms_genes_hpa = {}

for gene,syns in zip(genes_hpa_unfav,ovc_unfav["Gene synonym"]):
    genes_synonyms_hpa[gene] = [syns]
    if type(syns) == float:
        continue
    else:
        for s in syns.split(", "):
            synonyms_genes_hpa[s.upper()] = gene            

In [18]:
sox17_ovc_unfav = []
sox17_ovc_unfav_score = {}
for g,s in zip(sox17_targets_list,sox17_scores_list):
    if g in genes_synonyms_hpa:
        sox17_ovc_unfav.append(g)
        sox17_ovc_unfav_score[g] = s
    if g in synonyms_genes_hpa:
        sox17_ovc_unfav.append(synonyms_genes_hpa[g])
        sox17_ovc_unfav_score[synonyms_genes_hpa[g]] = s

In [19]:
print(f"We have {len(sox17_ovc_unfav)} hits")

We have 70 hits


In [20]:
sox17_ovc_unfav = set(sox17_ovc_unfav)
print(f"After duplicate elimination, we have {len(sox17_ovc_unfav)} hits")

After duplicate elimination, we have 68 hits


In [21]:
print(sox17_ovc_unfav)

{'SLC9A1', 'PC', 'EHD2', 'HDAC4', 'TCF15', 'TMEM181', 'CAVIN1', 'SLC12A9', 'CLDN4', 'CERCAM', 'SPIDR', 'CNNM4', 'MYO9B', 'NUMBL', 'SCAF8', 'KRT7', 'NOTCH2', 'SH3PXD2A', 'GRB7', 'SPOCK2', 'CD163', 'RECQL', 'EMP1', 'TRIL', 'ARRDC2', 'DAGLB', 'ASAP3', 'AP2A1', 'KIF26B', 'NRROS', 'SPOCK1', 'CCNE1', 'NDST1', 'UNC5B', 'SOGA1', 'PLAC9', 'MYC', 'CST4', 'TPCN2', 'RAI1', 'DSE', 'C5AR1', 'WTAP', 'SNX33', 'EPB41L2', 'TBC1D22A', 'MYO18A', 'ANKRD13A', 'WASF2', 'LRCH1', 'EPHA4', 'JADE2', 'MYPOP', 'POLR1A', 'TSC22D1', 'FSTL3', 'RIPK4', 'RCBTB1', 'RPS6KA2', 'MMP14', 'TTC7A', 'SCNN1A', 'GAL3ST4', 'AGFG1', 'PDP1', 'SRGAP1', 'PCDHGC3', 'ALOX5AP'}


# Let's save the results

#### SOX17 targets that are favorable in ovarian cancer

In [22]:
sox17_ovc_fav_df = pd.DataFrame(data={"Genes":list(sox17_ovc_fav),"Scores":[sox17_ovc_fav_score[g] for g in sox17_ovc_fav]})
sox17_ovc_fav_df[:10]

Unnamed: 0,Genes,Scores
0,OSGEPL1,136.428571
1,ZNF83,252.857143
2,FEN1,51.0
3,SOSTDC1,15.857143
4,CCAR1,94.571429
5,ORAI1,153.571429
6,DHRS4L2,23.428571
7,UTP4,58.428571
8,CAAP1,213.142857
9,SMIM27,77.0


In [23]:
sox17_ovc_fav_df.to_excel("./final_results/sox17_ovc_fav.xlsx", header=True, index=False)

#### SOX17 targets that are unfavorable in ovarian cancer

In [24]:
sox17_ovc_unfav_df = pd.DataFrame(data={"Genes":list(sox17_ovc_unfav),"Scores":[sox17_ovc_unfav_score[g] for g in sox17_ovc_unfav]})
sox17_ovc_unfav_df[:10]

Unnamed: 0,Genes,Scores
0,SLC9A1,191.0
1,PC,56.285714
2,EHD2,40.142857
3,HDAC4,19.857143
4,TCF15,29.714286
5,TMEM181,23.571429
6,CAVIN1,114.857143
7,SLC12A9,174.428571
8,CLDN4,164.714286
9,CERCAM,15.714286


In [25]:
sox17_ovc_unfav_df.to_excel("./final_results/sox17_ovc_unfav.xlsx", header=True, index=False)
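
Because the hits are stored in a Python set, the rows of the saved tables come out in arbitrary order. If a ranked table is preferred, a minimal sketch is to sort by score before saving. The gene names and scores below are placeholders, and the saving step is left out.

```python
import pandas as pd

# Placeholder hits and scores standing in for the real sets and dictionaries.
toy_hits = {"GENE_A", "GENE_B", "GENE_C"}
toy_scores = {"GENE_A": 15.9, "GENE_B": 252.9, "GENE_C": 51.0}

# Build the table, then rank genes from highest to lowest score.
ranked = (
    pd.DataFrame({
        "Genes": list(toy_hits),
        "Scores": [toy_scores[g] for g in toy_hits],
    })
    .sort_values("Scores", ascending=False)
    .reset_index(drop=True)
)

print(ranked)
```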