# Introduction to data analysis for natural and social sciences
This notebook constitutes the first part of the exam.

It retraces the steps of the article "Patient-specific Boolean models of signalling networks guide personalised treatments" and reproduces its results.

## Imports and global settings

In [32]:
import biolqm
import ginsim
import numpy as np
import pandas as pd

In [33]:
PATH_FILES = "report"

EXT_EXCEL = "xlsx"
EXT_TAB = "tsv"

# Prostate Boolean model construction
The Boolean model is built from information available in the literature. Further pathways are then identified with the software tools ROMA and pypath and added to the existing network.

## Boolean model construction
The authors collected all the data on the network (nodes, their roles, logical rules) in the following two Excel files:

In [34]:
fname_nodes_pathways = "Montagud2022_nodes in pathways.xlsx"
fname_nodes_network = "Montagud2022_interactions_sources.xlsx"

The data are loaded into pandas DataFrames to ease their manipulation.

In [35]:
df_nodes_pathways = pd.read_excel(
    io=f"{PATH_FILES}/{fname_nodes_pathways}",
    header=None,
    names=["node", "pathway"]
)
sheet_interactions = "Nodes"
df_nodes_interactions = pd.read_excel(
    io=f"{PATH_FILES}/{fname_nodes_network}",
    sheet_name=sheet_interactions,
    header=1,
    converters={"Reference: PMID": lambda c: np.str_(c).strip()}  # Remove a useless line break in a cell.
)
sheet_unique="Nodes_unique"
df_nodes_unique = pd.read_excel(
    io=f"{PATH_FILES}/{fname_nodes_network}",
    sheet_name=sheet_unique
)

In [36]:
# Exactly one logical rule is associated with each node: grouping by node and rule yields exactly 133 rows.
df_count = df_nodes_interactions.groupby(["Target node", "Logical rule"]).count()
display(df_count)
del df_count

Target node,Logical rule,HUGO names,Interaction type,Source,Description,Reference: PMID
AKT,((HSPs | (PDK1 & PIP3) | PIP3 | (SHH & PIP3)) & !PTCH1),5,5,5,5,5
AMPK,(ATR | HIF1 | AMP_ATP | ATM) & !FGFR3,6,6,6,6,6
AMP_ATP,(!Nutrients),1,1,1,1,1
APAF1,((Caspase8 | BAX | p53 | Bak | HSPs) & !Bcl_XL & !BCL2 & !AKT),8,8,8,8,8
AR,((GLI | EP300 | HSPs | NKX3_1 | EZH2 | NCOA3 | PKC | SMAD | Androgen) & !PTEN & !NCOR1 & !NCOR2 & !MDM2),13,13,13,13,13
...,...,...,...,...,...,...
p21,((p53 | SMAD | HIF1 | ZBTB17) & !TERT & !MYC_MAX & !MDM2 & !AKT & !ERK),9,9,9,9,9
p38,(MAP3K1_3 & !ERK & !GADD45),3,3,3,3,3
p53,((Acidosis | CHK1_2 | p38 | HIF1) & !BCL2 & !MDM2 & !HSPs & !Snail),9,9,9,9,9
p70S6kab,(mTORC2 | PDK1),2,2,2,2,2
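
The logical rules listed above combine node names with `!` (NOT), `&` (AND) and `|` (OR). As a minimal sketch of how such a rule determines the value of its target node, the hypothetical helper `evaluate_rule` below translates the operators into Python and evaluates the rule on a given Boolean state. It assumes that every node name is a valid Python identifier, and the state used here is made up for illustration.

```python
def evaluate_rule(rule, state):
    """Evaluate a logical rule on a Boolean state.

    `rule` uses `!` (NOT), `&` (AND) and `|` (OR); `state` maps every node
    appearing in the rule to True or False.
    """
    # Translate the operators of the rule into Python's Boolean operators.
    expression = rule.replace("!", " not ").replace("&", " and ").replace("|", " or ")
    # Node names act as variables; builtins are disabled to keep eval contained.
    return bool(eval(expression, {"__builtins__": {}}, dict(state)))


# Hypothetical state of the regulators of p38 (illustration only).
state = {"MAP3K1_3": True, "ERK": False, "GADD45": False}
p38_rule = "(MAP3K1_3 & !ERK & !GADD45)"

p38_active = evaluate_rule(p38_rule, state)                      # activator on, inhibitors off
p38_inhibited = evaluate_rule(p38_rule, {**state, "ERK": True})  # one inhibitor switched on
print(p38_active, p38_inhibited)
```

GINsim and bioLQM parse the rules themselves, so this helper is only meant to make the semantics of the "Logical rule" column explicit.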


In [37]:
# df_nodes_unique contains only 121 nodes, whereas df_nodes_pathways contains 133.
# In other words, df_nodes_unique["Node"] should be a subset of df_nodes_pathways["node"]. Why?
df = df_nodes_pathways.set_index("node")
df_subset = df_nodes_unique.set_index("Node")

# All the input nodes should have been removed, since by the authors' choice they are not regulated
# and hence are not part of any pathway. But this is not what happens.
display(df.loc[df["pathway"] == "Input"])
display(df.drop(labels=df_subset.index, errors="ignore"))

# Moreover, in df_nodes_unique there is a node called MAX which is not part of the nodes considered for the final network.
# Surprisingly, it is not present among the nodes in df_nodes_interactions.
try:
    df.drop(labels=df_subset.index)
except Exception as e:
    print(e)
finally:
    del df
    del df_subset

# In conclusion, the choice of nodes cannot be deduced directly
# from the content of the Excel files alone.
# In particular, the data in df_nodes_unique should be used with caution,
# since their relation to the other data is not straightforward.

node,pathway
Acidosis,Input
Androgen,Input
Carcinogen,Input
EGF,Input
FGF,Input
fused_event,Input
Hypoxia,Input
Nutrients,Input
SPOP,Input
TGFb,Input


node,pathway
Acidosis,Input
Androgen,Input
Apoptosis,Output
Carcinogen,Input
DNA_Damage,DNA repair pathw
DNA_Repair,Output
EMT,Invasion pathw
Hypoxia,Input
Invasion,Output
Metastasis,Output


"['MAX'] not found in axis"
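
The subset check performed above with `DataFrame.drop` can also be expressed with plain Python sets, which report both directions of the mismatch at once. The following is a minimal sketch: the helper `compare_node_sets` and the two short node lists are hypothetical and only mimic the situation found in the Excel files.

```python
def compare_node_sets(reference, candidate):
    """Return the nodes missing from `candidate` and those not in `reference`."""
    reference, candidate = set(reference), set(candidate)
    return {
        "missing": sorted(reference - candidate),     # in reference only
        "unexpected": sorted(candidate - reference),  # in candidate only
    }


# Hypothetical excerpts of the two node lists (illustration only).
pathway_nodes = ["AKT", "Acidosis", "MYC_MAX", "p53"]
unique_nodes = ["AKT", "MAX", "p53"]

report = compare_node_sets(pathway_nodes, unique_nodes)
print(report)
```

With the real data, the two arguments would be `df_nodes_pathways["node"]` and `df_nodes_unique["Node"]`; a non-empty "unexpected" list corresponds to the error raised by `drop`.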


The node data are then exported to tab-separated values (TSV) files, so that they can be imported into Cytoscape later.

In [38]:
name_nodes_pathways = fname_nodes_pathways.removesuffix(f".{EXT_EXCEL}")
name_nodes_network = fname_nodes_network.removesuffix(f".{EXT_EXCEL}")

df_nodes_pathways.to_csv(
    path_or_buf=f"{name_nodes_pathways}.{EXT_TAB}",
    sep='\t',
    index=False
)
df_nodes_interactions.to_csv(
    path_or_buf=f"{name_nodes_network}_{sheet_interactions}.{EXT_TAB}",
    sep='\t',
    index=False
)
df_nodes_unique.to_csv(
    path_or_buf=f"{name_nodes_network}_{sheet_unique}.{EXT_TAB}",
    sep='\t',
    index=False
)

To create the network, a single data file containing both interactions and pathways can be used.

In [39]:
df_cytoscape = df_nodes_interactions.join(
    other=df_nodes_pathways.set_index("node"),
    on="Target node"
)
df_cytoscape.to_csv(
    path_or_buf=f"cytoscape_data.{EXT_TAB}",
    sep='\t',
    index=False
)
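
The join used above to build `df_cytoscape` attaches to every interaction the pathway of its target node. The sketch below reproduces the same call on toy data; the rows and the pathway labels `pathway_1` and `pathway_2` are made up for illustration and do not come from the Excel files.

```python
import pandas as pd

# Toy interaction table: one row per (target, source) pair.
interactions = pd.DataFrame({
    "Target node": ["AKT", "AKT", "p38"],
    "Source": ["PIP3", "PTCH1", "ERK"],
})
# Toy pathway table: one row per node (hypothetical pathway labels).
pathways = pd.DataFrame({
    "node": ["AKT", "p38"],
    "pathway": ["pathway_1", "pathway_2"],
})

# Left join: the "Target node" column is matched against the index of `other`.
joined = interactions.join(other=pathways.set_index("node"), on="Target node")
print(joined)
```

Since `join` performs a left join by default, the number of interaction rows is preserved, and a target node without an entry in the pathway table would get a missing value in the `pathway` column.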

After importing the file into Cytoscape, the node "0/1" is hidden, because it is generated by the software as the source node of the input nodes.

The authors use GINsim to build the regulatory network. The resulting network is exported as a ZGINML file, available in Supplementary file 1 and named

In [40]:
fname_model = "Montagud2022_Prostate_Cancer.zginml"

The file is then imported into Cytoscape for visual improvement.

In [41]:
# Only nine nodes are referred to as proper inputs in the article (cf. appendix 1.2.3). The remaining two are "fused_event" and "SPOP".

# Node "fused_event" accounts for the fusion with gene ERG
# and is added manually based on existing literature (cf. appendix 1.1.5).
display(df_nodes_interactions.loc[df_nodes_interactions["Target node"] == "fused_event"])
display(df_nodes_interactions.loc[df_nodes_interactions["Source"] == "fused_event"])

# Node "SPOP" accounts for mutations of gene SPOP, which are frequent in prostate cancer.
display(df_nodes_interactions.loc[df_nodes_interactions["Target node"] == "SPOP"])
display(df_nodes_interactions.loc[df_nodes_interactions["Source"] == "SPOP"])

Unnamed: 0,Target node,HUGO names,Interaction type,Source,Description,Reference: PMID,Logical rule
197,fused_event,TMPRSS2,input,0/1,TMPRSS2-Ets gene fusions were identified in pr...,"23264855, 20118910",(fused_event)
198,fused_event,SLC45A3,input,0/1,TMPRSS2 and SLC45A3 were the only 5' partner i...,20118910,(fused_event)
199,fused_event,NDRG1,input,0/1,ERG gene rearrangements and mechanism of rearr...,20118910,(fused_event)


Unnamed: 0,Target node,HUGO names,Interaction type,Source,Description,Reference: PMID,Logical rule
39,AR_ERG,ERG fused,+,fused_event,ERG can fuse with TMPRSS2 protein to form an o...,23264855,((AR & fused_event) | (AR & fused_event & !NKX...


Unnamed: 0,Target node,HUGO names,Interaction type,Source,Description,Reference: PMID,Logical rule
420,SPOP,SPOP,input,0/1,Input of the model,,(SPOP)


Unnamed: 0,Target node,HUGO names,Interaction type,Source,Description,Reference: PMID,Logical rule
125,DAXX,DAXX,-,SPOP,Phosphorylation of Daxx by ATM upon DNA damage...,23405218,(!ATM & !ATR & !SPOP)
128,DNA_Damage,,+,SPOP,"From Fumia et al, 2013",23922675,((Carcinogen | (Carcinogen & ROS)) & !SPOP)
203,GLI,"GLI1, GLI2",-,SPOP,Stabilization of speckle-type POZ protein (Spo...,24072710,((WNT | SMO) & !SPOP)
312,NCOA3,NCOA3,-,SPOP,Mutations in SPOP represent the most common po...,24239470,(!SPOP & p38)


### Use cases for the GINsim model
The GINsim model can be used directly for some tasks, and all the required information is contained in the ZGINML file.

In [42]:
ginsim_model = ginsim.load(f"{PATH_FILES}/{fname_model}")

The network can be displayed with

In [43]:
ginsim.show(ginsim_model)

and the stable states of the model can be computed. This task is performed by bioLQM:

biolqm_model = ginsim.to_biolqm(ginsim_model)
biolqm_fixpoints = biolqm.fixpoints(biolqm_model)
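
A stable state (fixed point) is a state in which the value of every node already agrees with its logical rule, so no update can change it. To make this concrete, the sketch below enumerates all states of a hypothetical three-node network and keeps those satisfying this condition. This brute-force search only illustrates the definition: it is not the method used by bioLQM and would not scale to the 133-node prostate model.

```python
from itertools import product

# Hypothetical three-node network: A is an input, B and C are regulated.
rules = {
    "A": lambda s: s["A"],                 # input: keeps its own value
    "B": lambda s: s["A"] and not s["C"],  # activated by A, inhibited by C
    "C": lambda s: not s["A"],             # inhibited by A
}


def find_fixpoints(rules):
    """Return every state in which all nodes agree with their rule."""
    nodes = list(rules)
    fixpoints = []
    for values in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, values))
        if all(rules[node](state) == state[node] for node in nodes):
            fixpoints.append(state)
    return fixpoints


toy_fixpoints = find_fixpoints(rules)
for fixpoint in toy_fixpoints:
    print(fixpoint)
```

The toy network has one stable state per value of its input A, which mirrors how the stable states of the real model depend on the values chosen for the input nodes.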

# Prostate Boolean model simulation
MaBoSS is used to perform simulations with the Boolean model. First, configurations and information about the model are extracted from the GINsim model:

In [47]:
maboss_model = ginsim.to_maboss(ginsim_model)

Then the number of trajectories and other configurations are set:

In [48]:
maboss_model.update_parameters(
    sample_count=5000
)
# MC: other parameters probably need to be changed, because the provided CFG file differs from the one generated here.
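
To give an intuition of what the number of trajectories controls, the sketch below runs a simple continuous-time stochastic simulation of a hypothetical three-node network and estimates the probability of each final state from `sample_count` independent trajectories. It is an illustration under stated assumptions (every possible flip has rate 1, the toy rules and the initial state are made up) and not MaBoSS's actual implementation.

```python
import random

# Hypothetical three-node network (illustration only).
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["A"],
}


def simulate(rules, initial_state, max_time, rng):
    """Run one asynchronous trajectory and return its final state."""
    state = dict(initial_state)
    time = 0.0
    while True:
        # Nodes whose value disagrees with their rule are allowed to flip.
        unstable = [n for n, rule in rules.items() if rule(state) != state[n]]
        if not unstable:
            return state  # fixed point reached
        # Assumption: each flip has rate 1, so the waiting time is exponential.
        time += rng.expovariate(len(unstable))
        if time > max_time:
            return state
        node = rng.choice(unstable)
        state[node] = not state[node]


def final_state_probabilities(rules, initial_state, sample_count, max_time, seed=0):
    """Estimate final-state probabilities as frequencies over trajectories."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(sample_count):
        final = simulate(rules, initial_state, max_time, rng)
        key = tuple(sorted(final.items()))
        counts[key] = counts.get(key, 0) + 1
    return {key: count / sample_count for key, count in counts.items()}


probabilities = final_state_probabilities(
    rules, {"A": True, "B": False, "C": True}, sample_count=200, max_time=50.0
)
print(probabilities)
```

Each probability is a frequency over the sampled trajectories, so a larger `sample_count` reduces the statistical noise of the estimates at the cost of a longer run time.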

## Wild type simulation

In [None]:
# MC: here I reproduce the plots of figure 3.