## A) Input datasets

pyPARAGON requires two inputs: 

    i) A reference network composed of already known interactions,
    ii) Seed nodes, which can come from omics datasets or be frequent mutations in diseases.


### Preparing a reference network

A reference network is passed to pyPARAGON as a networkx graph. Reference networks can contain nan-nan interactions, which cause problems in calculations, so we exclude them. pyPARAGON also does not consider self-interactions in graphlets, so we remove self-interactions as well to avoid computational problems. Apart from nan-nan and self-interactions, no further filtering or preprocessing is required. Because the flux calculation lets pyPARAGON eliminate interactions with low confidence scores on its own, we suggest not applying a confidence threshold. The user may nevertheless remove low-confidence edges; as an example, the cells below keep only edges with a score above 0.2.

In [1]:
import pandas as pd, networkx as nx

In [2]:
HIPPIE_df=pd.read_csv("../Source/Interactomes/HIPPIE_v2_3.tab", keep_default_na=False, sep="\t") # keep_default_na=False: no strings (e.g. "NA", "nan") are parsed as NaN, so such gene symbols stay as text
HIPPIE_df

Unnamed: 0,Gene_1,Gene_2,Score
0,ALDH1A1,ALDH1A1,0.76
1,ITGA7,CHRNA1,0.73
2,PPP1R9A,ACTG1,0.65
3,SRGN,CD44,0.63
4,GRB7,ERBB2,0.90
...,...,...,...
783177,STIL,CDK5RAP2,0.63
783178,SIRT7,SNORA10,0.63
783179,BMP2,SNX18,0.63
783180,C1QBP,MRPL50,0.63


In [3]:
HIPPIE_df=HIPPIE_df[HIPPIE_df["Score"]>0.2]
HIPPIE_df

Unnamed: 0,Gene_1,Gene_2,Score
0,ALDH1A1,ALDH1A1,0.76
1,ITGA7,CHRNA1,0.73
2,PPP1R9A,ACTG1,0.65
3,SRGN,CD44,0.63
4,GRB7,ERBB2,0.90
...,...,...,...
783177,STIL,CDK5RAP2,0.63
783178,SIRT7,SNORA10,0.63
783179,BMP2,SNX18,0.63
783180,C1QBP,MRPL50,0.63


In [4]:

# Build the networkx graph, keeping the confidence score as an edge attribute
HIPPIE_nx=nx.from_pandas_edgelist(HIPPIE_df,"Gene_1","Gene_2",edge_attr="Score")
self_interactions=list(nx.selfloop_edges(HIPPIE_nx))

# Remove self-interactions; a float node label (e.g. NaN) is printed instead of removed
for u, v in self_interactions:
    if isinstance(u, float):
        print(u)
    else:
        HIPPIE_nx.remove_edge(u,v)

# Export the cleaned graph back to an edge list and report its size
HIPPIE_df=nx.to_pandas_edgelist(HIPPIE_nx)
print("The number of edges in HIPPIE network",HIPPIE_nx.number_of_edges())
print("The number of nodes in HIPPIE network",HIPPIE_nx.number_of_nodes())

# Sanity check: every edge of the exported edge list must exist in the graph
edges=tuple(zip(HIPPIE_df.source,HIPPIE_df.target))
for edge in edges:
    if edge not in HIPPIE_nx.edges:
        print("check the edge",edge)

The number of edges in HIPPIE network 769054
The number of nodes in HIPPIE network 19433
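
The filtering rules above can also be written as a small, dependency-free function, which is handy for checking the rules on a toy edge list before loading the full interactome. This is a minimal sketch, not part of pyPARAGON: `clean_edges` and its `min_score` parameter are hypothetical names, and the `GENE_A`/`GENE_B` row and its score are made up for illustration.

```python
import math


def clean_edges(edges, min_score=0.2):
    """Filter (gene_1, gene_2, score) tuples with the rules used above.

    Hypothetical helper for illustration; not part of pyPARAGON.
    """

    def is_missing(label):
        # Treat None and float NaN as missing node labels.
        return label is None or (isinstance(label, float) and math.isnan(label))

    kept = []
    for gene_1, gene_2, score in edges:
        if is_missing(gene_1) or is_missing(gene_2):
            continue  # nan-nan (or half-missing) interaction
        if gene_1 == gene_2:
            continue  # self-interaction
        if score <= min_score:
            continue  # mirrors HIPPIE_df["Score"] > 0.2
        kept.append((gene_1, gene_2, score))
    return kept


toy_edges = [
    ("ALDH1A1", "ALDH1A1", 0.76),        # self-interaction, dropped
    ("GRB7", "ERBB2", 0.90),             # kept
    (float("nan"), float("nan"), 0.50),  # nan-nan interaction, dropped
    ("GENE_A", "GENE_B", 0.10),          # made-up low-score edge, dropped
]
cleaned = clean_edges(toy_edges)
print(cleaned)
```

Setting `min_score` to a negative value disables the confidence filter, which matches the suggestion of leaving low-confidence edges to the flux calculation.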


###  Preparing seed nodes

Seed nodes carry the known or detected context-specific knowledge. They can come from omics datasets, multi-omics datasets, or well-known, frequent mutations in diseases. If you do not derive initial node weights from raw data, give every seed a weight of 1.

For multi-omics data integration, the user can follow two strategies with pyPARAGON. In the first strategy, the hits in the omics datasets are transformed into nodes that are represented in the reference network. For example, identifying transcription factors that target the differential hits in transcriptomics or bind to the genome can carry transcriptomic and genomic knowledge into the interactome. In the second strategy, the reference network itself is extended by merging various types of networks, such as regulatory networks, interactomes, and metabolic networks. In this case, multi-omics hits that are represented in the merged reference network can be used directly as seeds.

In [4]:
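# Load the seed nodes (tab-separated file with a "name" column)
initial_nodes_df=pd.read_csv(f'../Source/Netpath/Sampling_0_5/AndrogenReceptor_05A/AndrogenReceptor_05A_var_0.nodes',sep="\t")

initial_nodes=initial_nodes_df.name.to_list()
# No weights are derived from raw data here, so every seed gets a weight of 1
initial_weights=[1 for _ in initial_nodes]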
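
When raw data are available, seed weights can be derived from them instead of being set to 1. The sketch below shows one simple option, assuming the raw values are per-gene scores such as log2 fold changes: absolute values are divided by the largest one, so weights fall in (0, 1], and seeds missing from the reference network are dropped. The function name `scale_seed_weights`, the scores, and the `NOT_IN_NETWORK` label are hypothetical; how weights are handed to pyPARAGON is covered in the PageRank Flux and Network Inference script and is not shown here.

```python
def scale_seed_weights(raw_scores, network_nodes):
    """Map raw per-gene scores to weights in (0, 1] for seeds found in the network.

    Hypothetical helper for illustration; not part of pyPARAGON.
    """
    present = {
        gene: abs(score)
        for gene, score in raw_scores.items()
        if gene in network_nodes and score != 0
    }
    if not present:
        return {}
    largest = max(present.values())
    return {gene: value / largest for gene, value in present.items()}


# Made-up log2 fold changes; one gene is absent from the toy network.
raw = {"AR": 3.2, "SRC": -1.6, "FOXO1": 0.8, "NOT_IN_NETWORK": 5.0}
toy_network_nodes = {"AR", "SRC", "FOXO1", "GSN"}

weights = scale_seed_weights(raw, toy_network_nodes)
seed_nodes = list(weights)
seed_weights = [weights[gene] for gene in seed_nodes]
print(dict(zip(seed_nodes, seed_weights)))
```

Dividing by the maximum rather than min-max scaling keeps the weakest seed above zero, so no seed is silently switched off.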

## B) Constructing a graphlet-guided network

While constructing a graphlet-guided network (GGN), pyPARAGON does not consider the confidence scores of edges when parsing the graphlets. The advanced usage of GGN construction, covering reference network permutations, graphlet frequencies, and graphlet motif selection, is detailed in the Graphlet_Guided_Network_Construction script. In this script, we use the four-node graphlets Graphlets5, Graphlets6, Graphlets7, and Graphlets8. During graphlet generation, four-node graphlets are derived from three-node graphlets. You can also select four-node graphlets directly in pyPARAGON to reduce the computational cost.

In [5]:
from Paragon import GraphletGuidance

In [6]:
Graphlet_list=['Graphlets5', 'Graphlets6', 'Graphlets7', 'Graphlets8']

In [7]:

GRF_lite=GraphletGuidance(HIPPIE_nx)

In [8]:
GGN_nx=GRF_lite.construct_GGN(initial_nodes,Graphlets=Graphlet_list)

82 of 82 input nodes have been found in the given network



In [9]:
GGN_nx.number_of_nodes()

4137

In [10]:
GGN_df=nx.to_pandas_edgelist(GGN_nx)
GGN_df

Unnamed: 0,source,target
0,GSN,
1,GSN,SRC
2,GSN,PCNA
3,GSN,PLEKHA7
4,GSN,PRPF8
...,...,...
20805,IL6ST,RHOJ
20806,IL6ST,RHOQ
20807,IL6ST,RHOC
20808,IL6ST,SOCS3


In [11]:
GRF_lite.write_guided_graphlet_network(f'../Outputs/AndrogenReceptor_GGN')

True
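
To make the idea of four-node graphlets concrete, the sketch below enumerates every connected four-node induced subgraph of a toy graph and names its shape from the sorted degree sequence, which is unique for each of the six connected four-node graphs. It is a brute-force illustration for tiny graphs only and does not reproduce pyPARAGON's graphlet search. The shape names are descriptive; which shape corresponds to Graphlets5 through Graphlets8 in pyPARAGON is not stated here and is not assumed.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sorted degree sequence of each connected four-node graph (descriptive names only).
FOUR_NODE_SHAPES = {
    (1, 1, 2, 2): "path",
    (1, 1, 1, 3): "star",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed triangle",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def count_four_node_graphlets(edges):
    """Count induced connected four-node subgraphs by shape (toy graphs only)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-interactions are ignored
            adjacency[u].add(v)
            adjacency[v].add(u)

    counts = Counter()
    for quad in combinations(sorted(adjacency), 4):
        members = set(quad)
        # Degrees inside the induced subgraph on these four nodes.
        degrees = tuple(sorted(len(adjacency[n] & members) for n in quad))
        shape = FOUR_NODE_SHAPES.get(degrees)  # None for disconnected subgraphs
        if shape is not None:
            counts[shape] += 1
    return counts


# Toy graph: a four-node clique (A, B, C, D) with one extra node E attached to A.
toy_edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"),
]
graphlet_counts = count_four_node_graphlets(toy_edges)
print(graphlet_counts)
```

The loop over all node quadruples grows with the fourth power of the node count, so this approach is not suitable for an interactome-sized network.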

## C) PageRank flux and context-specific network inference

Before inferring a context-specific network, pyPARAGON applies the personalized PageRank algorithm and then calculates the flux scores of edges. The union of the highly scored edges in the GGN forms the context-specific network. Network propagation by personalized PageRank strongly depends on the edge confidence scores and seed node weights. If the reference interactome is an unweighted graph, pyPARAGON assigns a default score of 1.0 to all edges. Similarly, if seed nodes do not have weights, pyPARAGON assigns them a default value of 1.0.

Here, we use the GGN, a trimmed form of the reference network, as the guide network, so that inference focuses on the relevant region of the reference network. Reference networks, such as interactomes or regulatomes, are mostly integrated resources that combine a variety of databases regardless of context specificity. It is therefore critical that the GGN selects the associated region of the reference network. Alternatively, the user may use a tissue-specific, cell lineage-specific, or other specific network as the guide, independent of GGN construction. In other cases, a reference network, such as a yeast interactome, may not be as highly connected as human reference networks; the user can then use the reference network directly without a guide network. The PageRank Flux and Network Inference script details the use of guide networks.
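
The propagation and flux steps described above can be illustrated on a toy graph. The sketch below is a plain power-iteration personalized PageRank followed by an edge flux, under two assumptions that this notebook does not confirm for pyPARAGON: `alpha` is taken as the probability of following an edge (so `1 - alpha` is the restart probability), and the flux of an undirected edge is taken as the smaller of the two directed fluxes `rank[u] * w / strength[u]`. The function names and the toy graph are hypothetical.

```python
def personalized_pagerank(adjacency, seed_weights, alpha=0.8, tol=1e-12, max_iter=1000):
    """Personalized PageRank by power iteration on a weighted undirected graph.

    adjacency: {node: {neighbor: edge_weight}}, symmetric, every node has an edge.
    seed_weights: {seed_node: weight}; seeds must be nodes of the graph.
    """
    total_seed = sum(seed_weights.values())
    restart = {n: seed_weights.get(n, 0.0) / total_seed for n in adjacency}
    strength = {n: sum(nbrs.values()) for n, nbrs in adjacency.items()}
    rank = dict(restart)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds, the rest flows along edges.
        new_rank = {n: (1.0 - alpha) * restart[n] for n in adjacency}
        for u, nbrs in adjacency.items():
            for v, w in nbrs.items():
                new_rank[v] += alpha * rank[u] * w / strength[u]
        change = sum(abs(new_rank[n] - rank[n]) for n in adjacency)
        rank = new_rank
        if change < tol:
            break
    return rank


def edge_flux(adjacency, rank):
    """Assumed flux definition: minimum of the two directed fluxes of each edge."""
    strength = {n: sum(nbrs.values()) for n, nbrs in adjacency.items()}
    flux = {}
    for u, nbrs in adjacency.items():
        for v, w in nbrs.items():
            key = tuple(sorted((u, v)))
            directed = rank[u] * w / strength[u]
            flux[key] = min(flux.get(key, directed), directed)
    return flux


# Toy path graph a - b - c with unit edge scores and a single seed.
toy_graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
rank = personalized_pagerank(toy_graph, {"a": 1.0}, alpha=0.8)
flux = edge_flux(toy_graph, rank)
print(rank)
print(flux)
```

In this toy case the edge next to the seed receives the larger flux, which shows how both the seed weights and the edge scores shape the ranking of edges.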



In [12]:
from Paragon import NetworkInference 

In [13]:
pgrf=NetworkInference(network=HIPPIE_nx,guide_network=GGN_nx,edge_attribute="Score")

The network inference step in pyPARAGON requires a reference network. If no guide network is given, pyPARAGON uses the whole reference network as the guide network. If the user does not provide an edge attribute holding a confidence score or another edge score, pyPARAGON assigns a default value of 1 to all edges.

In [14]:
pgrf.load_initial_nodes(list(initial_nodes))

Seed nodes in this script lack weights, so pyPARAGON gives all of them a uniform value of 1. The user may instead supply seed weights ranging from 0 to 1, as explained in the PageRank Flux and Network Inference script.

In [15]:
Infered_nx=pgrf.reconstruct_subnetwork( max_edge_count=2000,alpha=0.8,threshold=0.8)


theshold of 0.560000 limits predictions to 2000 edges
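
The message above suggests that the requested `threshold` could not be reached within `max_edge_count` edges. One plausible reading, which is an assumption and not confirmed by this notebook, is that edges are ranked by flux and added until their cumulative share of the total flux reaches `threshold` or the edge cap is hit, whichever comes first. The sketch below implements only that reading on made-up flux values; `select_top_edges` is a hypothetical name, not a pyPARAGON function.

```python
def select_top_edges(flux, threshold=0.8, max_edge_count=2000):
    """Pick highest-flux edges until a cumulative flux share or an edge cap is reached.

    Assumed selection rule for illustration; not taken from pyPARAGON.
    Returns the selected edges and the cumulative flux share they cover.
    """
    total = sum(flux.values())
    if total == 0:
        return [], 0.0
    ranked = sorted(flux.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for edge, score in ranked:
        if len(selected) >= max_edge_count or cumulative / total >= threshold:
            break
        selected.append(edge)
        cumulative += score
    return selected, cumulative / total


# Made-up flux values for four edges.
toy_flux = {("a", "b"): 4, ("b", "c"): 3, ("c", "d"): 2, ("d", "e"): 1}

capped_edges, capped_share = select_top_edges(toy_flux, threshold=0.8, max_edge_count=2)
free_edges, free_share = select_top_edges(toy_flux, threshold=0.8, max_edge_count=10)
print(len(capped_edges), capped_share)
print(len(free_edges), free_share)
```

With the cap of two edges the selection stops below the requested share, which is the situation the reported lower threshold would correspond to under this reading.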


In [16]:
pgrf.write_created_network(f'../Outputs/AndrogenReceptor_inferred_context_specific_network')

True

## D) Interpreting a context-specific network

For biological analysis, pyPARAGON partitions a context-specific network into separate communities, known as network modules, and integrates these communities with biological annotations. This script demonstrates overrepresentation analysis (ORA) using Gene Ontology annotations. The user may extend ORA to additional annotation sources such as KEGG and Reactome. The Interpretable_Communities_in_Subnetworks script gives a full explanation of the interpretation module and community analysis.

In [17]:
from Paragon import CommunityAnalysis 

In [18]:
CA=CommunityAnalysis(Infered_nx)

### Community detection

In [19]:
module_df, node_df=CA.get_communities_in_DataFrames()
node_df

Unnamed: 0,Genes,Community
0,RHOG,Module_1
1,IL6R,Module_1
2,CMYA5,Module_1
3,SMARCA4,Module_1
4,AR,Module_1
...,...,...
345,MAPK1,Module_32
346,CASP8,Module_32
347,RB1,Module_32
348,PARP1,Module_32


In [20]:
node_df.to_csv(f'../Outputs/AndrogenReceptor_inferred_csn_nodes_in_communities.tab',sep="\t", index=False)

In [21]:
module_df

Unnamed: 0,Community_name,Community
0,Module_1,RHOG;IL6R;CMYA5;SMARCA4;AR;ZMIZ1;NR2C1;SRY;TLE...
1,Module_2,RHOB;DNM2;APPL1;PRNP;BSG;VCP;RUVBL2;PRKCZ;TRAF...
2,Module_3,S100A2;PSMC3IP;CDH1;ERBB2;PIK3R2;PIK3CA;CCR5
3,Module_4,SIRT1;CSNK2A1;HDAC2;SNW1;HIC1
4,Module_5,UBE2E3;CHD3;STK40;SEZ6L2;ETV5;COP1;DET1
5,Module_6,WWP2;KDM4C;VIRMA;ACTN4;RANGAP1;NELFCD;NXF1;UBA...
6,Module_7,HDAC7;YWHAZ;YWHAB;PPP2CA;YWHAH;FOXO1;RAC3;UBE2...
7,Module_8,GTF2B;NBR1;USP43;NCOA4;MAPK6
8,Module_9,RHOD;ARRB2;PML;SENP1;ELAVL1;HNRNPK;HDAC4;SUMO3...
9,Module_10,CDKN1A;TRIM63;HGS;PPP1CC;DYRK1A;POLR2A;H2AFB1;...


In [22]:
module_df.to_csv(f'../Outputs/AndrogenReceptor_inferred_csn_communities.tab',sep="\t", index=False)

### Overrepresentation analysis (ORA) with gene ontology annotations

Reference knowledge for ORA is provided as a two-column table: the first column holds the ID or name of each annotation, and the second holds a component (for example, a gene) that carries the annotation and is represented in the network.

In [23]:
GOA_bio_proc_df=pd.read_csv(f'../Source/Annotations/GOA_proteins_isoforms_prepared.tab',sep="\t")
GOA_bio_proc_df

Unnamed: 0,GO ID,DB Object Symbol
0,GO:0002250,IGKV3-7
1,GO:0002250,IGKV1D-42
2,GO:0002250,IGLV4-69
3,GO:0002250,IGLV8-61
4,GO:0002250,IGLV4-60
...,...,...
210592,GO:0006958,C1QA
210593,GO:0001682,RPP40
210594,GO:0061640,DCTN3
210595,GO:0045892,NELFCD


In [24]:
returned_all=CA.hypergeometric_test_for_all_communities(reference_network=HIPPIE_nx,
                                           prior_knowledge_df=GOA_bio_proc_df,
                                           prior_knowledge_on="GO ID", 
                                           name_on="DB Object Symbol")

In [25]:
returned_all 

Unnamed: 0,Community_name,GO ID,p-value,Erichment_Score,Genes in Module,Intersecting Genes,The number of intersecting genes,Process_Gene,The number of components of prior_knowledge
0,Module_1,GO:0000122,1.522214e-03,3.150793,"[RHOG, IL6R, CMYA5, SMARCA4, AR, ZMIZ1, NR2C1,...","[AR, NR2C1, SMARCA4, SRY]",4,"[LINC-PINT, E2F8, FEZF1, CNOT1, NUPR2, HELT, N...",968
1,Module_1,GO:0006355,1.307438e-04,3.636005,"[RHOG, IL6R, CMYA5, SMARCA4, AR, ZMIZ1, NR2C1,...","[SRY, NR2C1, AR, SMARCA4, TLE1]",5,"[ZNF722, A0A2R8YD15, FOXO3B, NFILZ, A0A7P0TAN4...",1006
2,Module_1,GO:0006357,1.158394e-02,2.189669,"[RHOG, IL6R, CMYA5, SMARCA4, AR, ZMIZ1, NR2C1,...","[SMARCA4, AR, ZMIZ1, NR2C1]",4,"[MT-RNR1, EPOP, UNCX, ZNF98, ZNF716, ZNF724, Z...",1697
3,Module_1,GO:0008284,4.400661e-06,4.737049,"[RHOG, IL6R, CMYA5, SMARCA4, AR, ZMIZ1, NR2C1,...","[IL6R, AR, IL6ST, SMARCA4, RHOG]",5,"[PRAMEF33, A0A0G2JP20, PRAMEF22, PRAMEF27, NKX...",498
4,Module_1,GO:0010628,1.885954e-03,3.717017,"[RHOG, IL6R, CMYA5, SMARCA4, AR, ZMIZ1, NR2C1,...","[AR, TLE1, SRY]",3,"[ODAM, KPNA7, LGALS9, TLR4, BMPR1B, ETV2, PIK3...",460
...,...,...,...,...,...,...,...,...,...
248,Module_30,GO:0140467,1.775290e-07,8.434220,"[FOS, PHB, CEBPA, CEBPG, UBR5, SRA1, HSP90AB1,...","[FOS, CEBPA, CEBPG]",3,"[CREB3, MAF, FOS, JUN, FOSL1, JUNB, CEBPB, ATF...",26
249,Module_31,GO:0006357,5.805930e-03,3.318940,"[FBXW7, RUNX2, SMURF2, MSX2, TCF15]","[RUNX2, TCF15, MSX2]",3,"[MT-RNR1, EPOP, UNCX, ZNF98, ZNF716, ZNF724, Z...",1697
250,Module_31,GO:0045892,2.618700e-04,4.987517,"[FBXW7, RUNX2, SMURF2, MSX2, TCF15]","[MSX2, RUNX2, SMURF2]",3,"[PRAMEF33, A0A0G2JP20, TLE7, PRAMEF22, PRAMEF2...",587
251,Module_32,GO:0006915,2.658497e-04,4.979789,"[MAPK1, CASP8, RB1, PARP1, GMEB1]","[PARP1, MAPK1, CASP8]",3,"[SYCE3, USP17L3, USP17L5, LGALS16, KLLN, PANO1...",590


In [26]:
returned_all.to_csv(f'../Outputs/AndrogenReceptor_inferred_csn_overrepresentation_results.tab',sep="\t", index=False)
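
For readers who want to check a single p-value by hand, the sketch below computes a hypergeometric upper-tail probability with the standard library only. It assumes the usual ORA setup: the background is the set of reference network nodes, and the p-value is the probability of seeing at least the observed overlap between a module and an annotation. Whether `hypergeometric_test_for_all_communities` uses exactly this tail and background is an assumption, and the enrichment score column is not reproduced. The function names and toy gene labels are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement."""
    upper = min(annotated, module_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, module_size)


def ora_p_value(module_genes, annotated_genes, background_genes):
    """Restrict both gene sets to the background, then test their overlap."""
    module = set(module_genes) & set(background_genes)
    annotated = set(annotated_genes) & set(background_genes)
    overlap = len(module & annotated)
    return hypergeom_upper_tail(len(background_genes), len(annotated), len(module), overlap)


# Toy example: 10 background genes, 4 annotated, a 3-gene module sharing 2 of them.
background = {f"g{i}" for i in range(1, 11)}
annotation = {"g1", "g2", "g3", "g4"}
module = {"g1", "g2", "g5"}

p_value = ora_p_value(module, annotation, background)
print(p_value)
```

Because many annotations are tested per module, the resulting p-values would normally be corrected for multiple testing before interpretation.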