# Up and running with PANDA and netZooPy
Author:
Daniel Morgan<sup>1</sup>

<sup>1</sup>Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA.

## Introduction
Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA<sup>1</sup> (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data.

PANDA starts with a prior network of putative regulatory interactions (center network in the image below), a prior network of protein-protein interactions between transcription factors, and target gene expression data, which is converted into a co-expression network.

<img src="img/panda.png" style="width: 200px;">  

A message passing framework is used to find agreement between the three input networks. First, the responsibility (R) is calculated: 

<img src="img/responsibility.png" style="width: 200px;">  

Then, the availability (A): 

<img src="img/availability.png" style="width: 200px;">  

The prior gene regulatory network W is then updated using the responsibility and availability:  

<img src="img/combine.png" style="width: 300px;">  
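As a rough illustration of this update step, the sketch below assumes that the responsibility and availability are computed with a Tanimoto-style similarity, as described in the PANDA paper<sup>1</sup>, and that W is updated as a weighted average controlled by an update rate `alpha`. The function names, the toy matrices, and the value `alpha=0.1` are illustrative assumptions, not the netZooPy implementation.

```python
import numpy as np

def tanimoto(x, y):
    """Tanimoto-style similarity between the rows of x and the columns of y."""
    dot = x @ y
    row_norm = np.square(x).sum(axis=1).reshape(-1, 1)  # squared norm of each row of x
    col_norm = np.square(y).sum(axis=0)                 # squared norm of each column of y
    return dot / np.sqrt(row_norm + col_norm - np.abs(dot))

def update_regulatory_network(W, P, C, alpha=0.1):
    """One update of the TF-by-gene network W (assumed form of the update)."""
    R = tanimoto(P, W)  # responsibility: agreement between PPI (TF-by-TF) and W
    A = tanimoto(W, C)  # availability: agreement between W and co-expression (gene-by-gene)
    return (1 - alpha) * W + alpha * 0.5 * (R + A)

# Toy inputs: 3 TFs, 5 genes, identity matrices for P and C
rng = np.random.default_rng(0)
W_toy = rng.normal(size=(3, 5))
P_toy = np.eye(3)
C_toy = np.eye(5)

W_toy_updated = update_regulatory_network(W_toy, P_toy, C_toy)
print(W_toy_updated.shape)
```

With `alpha=0` the network is left unchanged, while larger values give more weight to the agreement between the input networks.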

Next, the protein cooperativity and gene co-regulatory networks are updated:

<img src="img/cooperativity.png" style="width: 300px;">  
<img src="img/co-regulatory.png" style="width: 300px;"> 

Self-interactions in P and C are also updated to satisfy convergence:  

<img src="img/p.png" style="width: 300px;">  
<img src="img/c.png" style="width: 300px;">  

Convergence is evaluated using a Hamming distance:

<img src="img/hamming.png" style="width: 300px;">  
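A minimal sketch of such a convergence check, assuming the distance is taken as the mean absolute difference between two successive estimates of W. The function name and the tolerance value are hypothetical and only serve the illustration.

```python
import numpy as np

def hamming_distance(W_old, W_new):
    """Mean absolute element-wise difference between two network estimates."""
    return np.abs(W_new - W_old).mean()

W_prev = np.zeros((2, 3))
W_next = np.full((2, 3), 0.002)

tolerance = 0.001  # hypothetical stopping threshold
converged = hamming_distance(W_prev, W_next) < tolerance
print(converged)
```

The iteration would stop once the distance between successive estimates falls below the chosen tolerance.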

An overview of the algorithm is detailed in this figure.

In [None]:
from IPython.display import Image
Image(url="https://journals.plos.org/plosone/article/figure/image?size=large&id=info:doi/10.1371/journal.pone.0064832.g001", width=500, height=500)

## 1. Installation and Setup
This vignette can be run on the server or locally by setting the `runserver` parameter: `1` for the server, `0` for a local run.

In [None]:
runserver=1

PANDA is distributed through the netZooPy package, which can be installed as follows:

In [None]:
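if runserver==0:
    # Each `!` command runs in its own subshell, so a `!cd` would not persist;
    # clone into the current directory and install from that path instead.
    !git clone https://github.com/netZoo/netZooPy.git
    !pip3 install -e netZooPy
    ppath='netZooPy/tests/ToyData/'
elif runserver==1:
    ppath='/opt/data/'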

The previous cell sets the data folder for the analysis (`/opt/data/` on the server). Next, we load the libraries needed to run the analysis.

In [None]:
import os
from netZooPy.panda.panda import Panda # To load PANDA
import pandas as pd                    # To read data frames
import matplotlib.pyplot as plt        # To plot networks
import sys                             # To compute size of variables

## 2. Parameter Setting & Exploring the Data

First, we set the paths to 1) the motif prior network, 2) the gene expression data, and 3) the PPI network data.
The motif prior network is typically a TF-by-gene binary matrix, where 1 indicates the presence of a TF's sequence motif in the gene's regulatory region and 0 its absence.
The gene expression data is typically a gene-by-sample matrix of expression values.
The PPI network is a TF-by-TF binary matrix, where 1 indicates a physical interaction between two TFs and 0 otherwise.
If two TFs are likely to bind each other, they are likely to form regulatory complexes for the same genes, and this information is updated during network inference.

In [None]:
expression_data=ppath+'ToyExpressionData.txt'
motif_data     =ppath+'ToyMotifData.txt'
ppi_data       =ppath+'ToyPPIData.txt'
panda_output   ='../data/output_panda.txt'

First, we can read the gene expression data.

In [None]:
expression=pd.read_csv(expression_data,sep="\t",header=None, index_col=0)
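PANDA converts the gene-by-sample expression matrix into a gene-by-gene co-expression network. The sketch below illustrates that conversion on random toy data, assuming Pearson correlation between genes is used; the variable names are hypothetical and the toy matrix stands in for the real expression data.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_expression = rng.normal(size=(4, 10))  # 4 genes (rows) x 10 samples (columns)

# np.corrcoef treats each row as one variable, so the result is gene-by-gene
toy_coexpression = np.corrcoef(toy_expression)
print(toy_coexpression.shape)  # (4, 4)
```

The resulting matrix is symmetric with ones on the diagonal, since each gene is perfectly correlated with itself.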

There are 1000 genes and 50 samples in our toy gene expression data. Of the three inputs, the gene expression data is the one that makes the inferred network context-specific; the motif and PPI files are generic lists of known interactions.

In [None]:
motif_data=pd.read_csv(motif_data,sep="\t",header=None)
motif_data

The motif network has three columns: 1) a source node (TF), 2) target node (Genes), and 3) an edge weight of either 1 or 0.
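To see how such a three-column edge list relates to the TF-by-gene matrix described earlier, here is a small sketch using a made-up edge list. The TF and gene names and the column labels are hypothetical, and pairs that are absent from the list are assumed to have weight 0.

```python
import io
import pandas as pd

# Made-up edge list in the same tab-separated layout: TF, gene, weight
toy_edges = io.StringIO("TF1\tG1\t1\nTF1\tG2\t1\nTF2\tG2\t1\n")
toy_motif = pd.read_csv(toy_edges, sep="\t", header=None, names=["tf", "gene", "weight"])

# Pivot to a TF-by-gene matrix; pairs not listed get weight 0
toy_motif_matrix = toy_motif.pivot_table(index="tf", columns="gene", values="weight", fill_value=0)
print(toy_motif_matrix)
```

Here `TF2` has no listed edge to `G1`, so that entry of the matrix is 0.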

The number of TFs:

In [None]:
motif_data[0].unique().size

The number of genes:

In [None]:
motif_data[1].unique().size

The first column contains 87 unique TFs and the second column contains 913 unique genes; the third column (`motif_data[2]`) holds their interaction weights. Now let's look at the PPI data, another three-column interaction list, with 238 interactions between TFs.

In [None]:
ppi_data=pd.read_csv(ppi_data,sep="\t",header=None)
ppi_data

The TF PPI network is typically built from the STRING database<sup>2</sup>. It has the source TF in the first column, the target TF in the second column, and an edge weight between 0 and 1 in the third column.

## 3. Calling PANDA

One can choose to run PANDA from the terminal by pointing this netZooPy script to the input files: <br>
`python run_panda.py -e ToyExpressionData.txt -m ToyMotifData.txt -p ToyPPIData.txt -f True -o test_panda.txt`

Alternatively one can continue running in Jupyter, using all data sources:

In [None]:
expression_data=ppath+'ToyExpressionData.txt'
motif_data     =ppath+'ToyMotifData.txt'
ppi_data       =ppath+'ToyPPIData.txt'

Then, we call PANDA as follows.

In [None]:
panda_obj = Panda(expression_data, motif_data, ppi_data, save_tmp=True, save_memory=False, remove_missing=False, keep_expression_matrix=False)

The network can be saved using the function `save_panda_results`.

In [None]:
panda_obj.save_panda_results(panda_output)

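We can check the size of the object in memory, which is useful when computing large-scale networks. Note that `sys.getsizeof` reports only the size of the object itself, not of the arrays it references.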

In [None]:
sys.getsizeof(panda_obj)
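Because `sys.getsizeof` does not follow references, it can greatly underestimate the footprint of an object that stores large arrays. The sketch below illustrates this with a hypothetical `ToyResult` class standing in for a result object; which attributes a real PANDA object stores as NumPy arrays is not assumed here.

```python
import sys
import numpy as np

class ToyResult:
    """Hypothetical stand-in for an object holding a large network matrix."""
    def __init__(self):
        self.network = np.zeros((100, 1000))  # 100 x 1000 float64 values
        self.label = "toy"

def array_bytes(obj):
    """Sum the memory of all NumPy arrays stored as attributes of obj."""
    return sum(v.nbytes for v in vars(obj).values() if isinstance(v, np.ndarray))

toy = ToyResult()
print(sys.getsizeof(toy))  # size of the container only
print(array_bytes(toy))    # 800000 bytes held by the array attribute
```

Summing `nbytes` over the array attributes gives a more realistic lower bound on memory use.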

Finally, we can plot the top network edges and save them as a file.

In [None]:
panda_obj.top_network_plot(top=10, file='../data/panda_top_10.png')

## 4. Running PANDA with missing input

We can run PANDA without expression and PPI data. In this case, these networks are replaced by the identity matrix and the inference uses only the motif network.

In [None]:
expression_data=None
motif_data=ppath+'ToyMotifData.txt'
ppi_data=None

PANDA can be called as follows:

In [None]:
panda_obj = Panda(expression_data, motif_data, ppi_data, remove_missing=True, save_memory=False)

The resulting network can be saved using the function `save_panda_results`.

In [None]:
panda_obj.save_panda_results(panda_output)

As we saw earlier, the resulting network can be plotted and saved using the function `top_network_plot`.

In [None]:
panda_obj.top_network_plot(top=10, file='../data/panda_top_10.png')

Another possibility is to run PANDA without the expression matrix, using only the motif and PPI data. In this case, the co-expression network is replaced by the identity matrix.

In [None]:
expression_data=None
motif_data=ppath+'ToyMotifData.txt'
ppi_data=ppath+'ToyPPIData.txt'

The function call is the following:

In [None]:
panda_obj = Panda(expression_data, motif_data, ppi_data, remove_missing=True, save_memory=False)

The results can be saved using `save_panda_results`.

In [None]:
panda_obj.save_panda_results(panda_output)

The network can be plotted as follows:

In [None]:
panda_obj.top_network_plot(top=10, file='../data/panda_top_10.png')

PANDA can also be run without a motif network, which will be replaced by the identity matrix.

In [None]:
expression_data=ppath+'ToyExpressionData.txt'
motif_data=None
ppi_data=ppath+'ToyPPIData.txt'

In [None]:
panda_obj = Panda(expression_data, motif_data, ppi_data, save_memory=False)

Here as well, the result can be saved using `save_panda_results`.

In [None]:
panda_obj.save_panda_results(panda_output)

You can reduce memory usage for large-scale networks by setting `save_memory=True`, which deletes intermediate variables. However, downstream analyses such as gene indegree computation need those variables, so we keep them in the object by setting `save_memory=False`. In this case, the function call is:

In [None]:
expression_data=ppath+'ToyExpressionData.txt'
motif_data     =ppath+'ToyMotifData.txt'
ppi_data       =ppath+'ToyPPIData.txt'
panda_obj = Panda(expression_data, motif_data, ppi_data, save_memory=False)
panda_obj.save_panda_results(panda_output)

Basic follow-up analyses are also possible, such as computing the degree of each gene. This per-gene indegree, called the gene targeting score<sup>3</sup>, is a summary score that can be used to find associations between network features and clinical variables or phenotypes.

In [None]:
panda_obj.return_panda_indegree()
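To make the idea concrete, the sketch below computes degrees on a small made-up TF-by-gene weight matrix, assuming the indegree of a gene is the sum of the weights of all edges pointing to it. The matrix and variable names are hypothetical and independent of the PANDA object above.

```python
import numpy as np

# Made-up network: 2 TFs (rows) x 3 genes (columns) of edge weights
W_example = np.array([[1.0, -0.5, 2.0],
                      [0.5, 1.5, -1.0]])

gene_indegree = W_example.sum(axis=0)  # one score per gene (column sums)
tf_outdegree = W_example.sum(axis=1)   # one score per TF (row sums)

print(gene_indegree)  # [1.5 1.  1. ]
print(tf_outdegree)   # [2.5 1. ]
```

Because edge weights can be negative, a gene's score reflects the net regulatory evidence across all TFs rather than a simple edge count.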

These results can be saved using `save_panda_results`.

In [None]:
panda_obj.save_panda_results(panda_output)

# References
1- Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PLoS One 8.5 (2013): e64832.

2- Mering, Christian von, et al. "STRING: a database of predicted functional associations between proteins." Nucleic acids research 31.1 (2003): 258-261.

3- Weighill, Deborah, et al. "Gene targeting in disease networks." Frontiers in Genetics 12 (2021): 501.