# Building PANDA and LIONESS Regulatory Networks from GTEx Gene Expression Data in R
Author: Deborah Weighill<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

# 1. Introduction

This case study demonstrates the use of PANDA<sup>1</sup> and LIONESS<sup>2</sup> through the netZooR package. It follows the same steps as a [previous tutorial](../netZooR/panda_gtex_tutorial_server.ipynb) that builds and compares gene regulatory networks between cell lines (LCL) and their tissues of origin (whole blood)<sup>3</sup> using the GTEx gene expression collection of "normal", non-diseased human tissues. We first build a gene regulatory network for LCL cell lines, then one for whole blood, and visualize the largest edges in the network. Then, we build a gene regulatory network for each sample in the LCL data set using LIONESS<sup>2</sup>. Unlike the previous tutorial, we use the faster Python implementation of PANDA and LIONESS, which we can bind to R through `reticulate`, and demonstrate how these functions can be called.

This analysis can be run on the server or locally by setting the following parameter (`1` for the server, `0` for a local run).

In [None]:
runserver=1

We then set the path to the data, which depends on whether we are running on the server.

In [None]:
if (runserver==1){
    ppath='/opt/data/'
}else if(runserver==0){
    ppath=''
}

In addition, some sections of the case study take some time to run, so we can set the following parameter to load precomputed data instead.

In [None]:
precomputed=1

# 1. PANDA

## 1.1. PANDA Overview
PANDA<sup>1</sup> (Passing Attributes between Networks for Data Assimilation) is a method for constructing gene regulatory networks. It uses message passing to find congruence between 3 different data layers: protein-protein interaction (PPI), gene expression, and transcription factor (TF) motif data.

More details can be found in the published [paper](https://doi.org/10.1371/journal.pone.0064832)<sup>1</sup>.
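
To give a feel for what "message passing" means here, the following is a heavily simplified sketch of a PANDA-style update loop on small toy matrices. It assumes the Tanimoto-like similarity and the update rule described in the PANDA paper, and it leaves out the input normalisation and the special handling of diagonals used by the real implementations. The names `tanimoto` and `panda_toy`, the toy inputs, and the values `alpha = 0.1` and `n_iter = 10` are chosen for illustration and are not part of netZooR.

```r
# Tanimoto-like similarity between every row of X and every column of Y.
tanimoto <- function(X, Y) {
  num <- X %*% Y
  denom_sq <- outer(rowSums(X^2), colSums(Y^2), "+") - abs(num)
  denom <- sqrt(pmax(denom_sq, 0))
  denom[denom == 0] <- 1   # guard: a zero denominator implies a zero numerator
  num / denom
}

# Toy message passing (illustrative only).
# W: TF x gene motif prior, P: TF x TF PPI, C: gene x gene co-expression.
panda_toy <- function(W, P, C, alpha = 0.1, n_iter = 10) {
  for (i in seq_len(n_iter)) {
    R <- tanimoto(P, W)   # "responsibility": do a TF's PPI partners target the gene?
    A <- tanimoto(W, C)   # "availability": is the gene co-expressed with the TF's targets?
    W <- (1 - alpha) * W + alpha * (R + A) / 2
    P <- (1 - alpha) * P + alpha * tanimoto(W, t(W))
    C <- (1 - alpha) * C + alpha * tanimoto(t(W), W)
  }
  W
}

# Made-up inputs: 3 TFs, 5 genes, 10 samples.
W0 <- matrix(c(1, 0, 1,  0, 1, 0,  1, 1, 0,  0, 0, 1,  1, 0, 1), nrow = 3)
P0 <- diag(3)
set.seed(1)
C0 <- cor(matrix(rnorm(10 * 5), nrow = 10, ncol = 5))

W_final <- panda_toy(W0, P0, C0)
dim(W_final)
```

The returned matrix has the same shape as the motif prior, but its entries are now continuous edge weights informed by all three data layers.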

## 1.2. Running a single PANDA analysis



If this notebook is run locally, we need to install some packages first.

In [None]:
if(runserver==0){
    install.packages("visNetwork",repos = "http://cran.us.r-project.org",dependencies=TRUE)
}

Then, we start by loading the following libraries. We use the `data.table` library for reading in large datasets as it is more efficient.

In [None]:
library(netZooR)    # To load PANDA and LIONESS
library(data.table) # To load data 
library(visNetwork) # To visualize the networks

Next, we bind R to Python, because we call PANDA and LIONESS through netZooPy, which has an optimized implementation of PANDA. If you prefer a pure R version, [this tutorial](../netZooR/panda_gtex_tutorial_server.ipynb) has an example of a PANDA analysis. Configuring the binding explicitly is only necessary when working locally. On this Jupyter notebook server, we just need to tell R where to find Python using the `py_config()` command, which also serves to check the installation.

In [None]:
py_config()

To set up the Python binding locally, you need to point R to your Python 3 installation.

```
use_python("/usr/bin/python3")
```

Make sure that this is the installation that has all the required Python libraries (numpy, scipy, and pandas) installed.

In [None]:
py_require(c('pandas','scipy','joblib'))

Now we locate our PPI and motif priors. The PPI represents physical interactions between transcription factor proteins, and is an undirected network. The motif prior represents putative regulation events where a transcription factor binds in the promoter of a gene to regulate its expression, as predicted by the presence of transcription factor binding motifs in the promoter region of the gene. The motif prior is thus a directed network linking transcription factors to their predicted gene targets. These are small example priors for the purposes of demonstrating this method.

The PPI and motif priors are available in our AWS public bucket, and can be downloaded into the current working directory.

Let's download and take a look at the priors:

In [None]:
if(runserver==0){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/motif_GTEx.txt")
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/ppi_GTEx.txt")
}

The motif data has three columns: the source TF, the target gene, and the presence or absence of the TF motif in the promoter region of the target gene.

In [None]:
motif <- read.table(paste0(ppath,"motif_GTEx.txt")) 
motif[1:5,]
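
As a rough illustration of how this three-column edge list relates to a TF-by-gene matrix, the sketch below builds a tiny made-up prior and pivots it. The TF and gene names are invented, and the column names `tf`, `gene`, and `present` are assumptions for the example (`read.table` names the columns of the real file `V1` to `V3`).

```r
# Hypothetical motif prior in the same layout: TF, target gene, 0/1 motif flag.
motif_toy <- data.frame(
  tf      = c("TF_A", "TF_A", "TF_B", "TF_C"),
  gene    = c("gene1", "gene2", "gene2", "gene3"),
  present = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Pivot the edge list into a TF x gene matrix; pairs not listed stay at 0.
tfs   <- sort(unique(motif_toy$tf))
genes <- sort(unique(motif_toy$gene))
motif_mat <- matrix(0, nrow = length(tfs), ncol = length(genes),
                    dimnames = list(tfs, genes))
motif_mat[cbind(motif_toy$tf, motif_toy$gene)] <- motif_toy$present
motif_mat
```

Rows are TFs and columns are genes, which reflects the direction of the prior: edges point from a TF to its predicted targets.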

The PPI data lists the two interacting TFs in the first two columns and an edge weight between 0 and 1 in the third column, which represents the strength of the interaction between the TFs.

In [None]:
ppi <- read.table(paste0(ppath,"ppi_GTEx.txt"))
ppi[1:5,]
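
Because the PPI network is undirected, each listed pair stands for an interaction in both directions. The sketch below, on an invented edge list with assumed column names `tf1`, `tf2`, and `weight`, shows one way to expand such a list into a symmetric TF-by-TF matrix.

```r
# Hypothetical PPI edge list: two TFs and an interaction weight in [0, 1].
ppi_toy <- data.frame(
  tf1    = c("TF_A", "TF_A", "TF_B"),
  tf2    = c("TF_B", "TF_C", "TF_C"),
  weight = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# Fill both (tf1, tf2) and (tf2, tf1) so the matrix is symmetric.
tf_names <- sort(unique(c(ppi_toy$tf1, ppi_toy$tf2)))
ppi_mat <- matrix(0, nrow = length(tf_names), ncol = length(tf_names),
                  dimnames = list(tf_names, tf_names))
ppi_mat[cbind(ppi_toy$tf1, ppi_toy$tf2)] <- ppi_toy$weight
ppi_mat[cbind(ppi_toy$tf2, ppi_toy$tf1)] <- ppi_toy$weight
isSymmetric(ppi_mat)
```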

Now we locate gene expression data. 

As an example, we will use a portion of the GTEx (Genotype-Tissue Expression)<sup>4</sup> version 7 RNA-Seq data. We read in the expression data and the list of LCL samples, then parse the expression data.

The file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct can be downloaded either from https://gtexportal.org/home/datasets or from our AWS bucket and placed in the folder "expressionData". We will initially use the LCL RNA-seq data to create a regulatory network for this cell line. Later, we will also generate a regulatory network for whole blood for comparison.

Here, we use the copies of the expression data and sample ID files from our AWS bucket. First, we download the gene expression data for local runs.

In [None]:
if(runserver==0){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct")
}

Since this gene expression file has LCL and whole blood data combined, we need to build separate data matrices for each type by filtering sample IDs. Sample IDs can be downloaded as follows:

In [None]:
if(runserver==0){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/LCL_samples.txt")
}

Next, the gene expression data is processed by removing the version suffix (the part after the dot) from the gene IDs so that they match the gene IDs in the motif data, filtering for LCL samples, and finally keeping the genes with at least 20 valid (non-NA, non-zero) expression values.

In [None]:
if(precomputed==0){
    # load the GTEx expression matrix
    expr <- fread(paste0(ppath,"GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct"), header = TRUE, skip = 2, data.table = TRUE)
    # remove the version suffix (the dot and all digits that follow it) so that the genes match the gene ids in the tf-motif prior
    expr$Name<-sub("\\.[0-9]+","", expr$Name)

    #load the sample ids of LCL samples
    lcl_samples <-fread(paste0(ppath,"LCL_samples.txt"), header = FALSE, data.table=FALSE)

    #select the columns of the expression matrix corresponding to the LCL samples.
    lcl_expr <- expr[,union("Name",intersect(c(lcl_samples[1:149,]),colnames(expr))), with=FALSE]

    #count, for each gene, the expression values that are neither NA nor zero. This ensures that PANDA has enough values to calculate Pearson correlations between gene expression profiles when constructing the gene co-expression network. The "Name" column is excluded so that only numeric values are tested.
    lcl_vals <- as.matrix(lcl_expr[, setdiff(colnames(lcl_expr), "Name"), with=FALSE])
    zero_na_counts <- rowSums(!is.na(lcl_vals) & lcl_vals != 0)

    #keep only genes with at least 20 valid gene expression entries
    clean_data <- lcl_expr[zero_na_counts >= 20,]

    #write the cleaned expression data to a file, ready to be passed as an argument to the PANDA algorithm.
    write.table(clean_data, file = "../data/pandaExprLCL.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
}
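
The filtering logic is easier to check on a small made-up table. The sketch below uses a base R data frame rather than `data.table` to stay self-contained; the gene IDs, sample names, and the threshold of 2 (instead of 20) are invented for the example.

```r
# Hypothetical expression table: versioned gene IDs plus three samples.
expr_toy <- data.frame(
  Name = c("ENSG00000000001.5", "ENSG00000000002.12", "ENSG00000000003.1"),
  s1   = c(1.2, 0,   NA),
  s2   = c(3.4, 0,   2.0),
  s3   = c(0.5, 7.1, 4.0),
  stringsAsFactors = FALSE
)

# Strip the version suffix: the dot and all digits that follow it.
expr_toy$Name <- sub("\\.[0-9]+", "", expr_toy$Name)

# Count, per gene, the entries that are neither NA nor zero.
vals <- as.matrix(expr_toy[, setdiff(colnames(expr_toy), "Name")])
valid_counts <- rowSums(!is.na(vals) & vals != 0)

# Keep genes with at least `min_valid` usable values.
min_valid <- 2
clean_toy <- expr_toy[valid_counts >= min_valid, ]
clean_toy$Name
```

Note that the two conditions are combined with `&`: an entry counts only if it is both non-missing and non-zero. The second gene has a single usable value and is dropped.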

Alternatively, the pre-processed data `pandaExprLCL.txt` can be downloaded from the netZoo AWS S3 Bucket.

In [None]:
if(precomputed==1){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/pandaExprLCL.txt")
}

Now we run PANDA, pointing it to the parsed expression data, motif prior, and PPI prior. `modeProcess` determines the way that TF and gene names are handled across the three input priors. If set to `legacy`, gene names will be taken from gene expression data and TF names will be taken from motif data. `remove_missing` removes any TF and gene that is not present across all three data types.

In [None]:
panda_results_LCL <- pandaPy(expr_file = paste0(ppath,"pandaExprLCL.txt") , motif_file = paste0(ppath,"motif_GTEx.txt"), ppi_file = paste0(ppath,"ppi_GTEx.txt"), modeProcess="legacy", remove_missing = TRUE)

Let's take a look at the results. The output contains a list of three data frames:

First, a data frame containing the regulatory network (bipartite graph) with edge weights representing the "likelihood" that a transcription factor binds the promoter of and regulates the expression of a gene.

In [None]:
regNetLCL <- panda_results_LCL$panda
regNetLCL[1:5,]

Second, a data frame of the in-degrees of genes (sum of the weights of inbound edges around a gene). These are also called gene targeting scores<sup>5</sup>.

In [None]:
inDegreeLCL <- panda_results_LCL$indegree
head(inDegreeLCL)

Third, a data frame of the out-degrees of TFs (sum of the weights of outbound edges around a TF) or TF targeting scores.

In [None]:
outDegreeLCL <- panda_results_LCL$outdegree
head(outDegreeLCL)
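
To make the two degree definitions concrete, the sketch below recomputes them from a small invented edge list. The `Score` column name matches the one used for the PANDA output later in this tutorial; the `TF` and `Gene` column names and all values are assumptions for the example.

```r
# Hypothetical PANDA-style edge list.
net_toy <- data.frame(
  TF    = c("TF_A", "TF_A", "TF_B", "TF_B"),
  Gene  = c("gene1", "gene2", "gene1", "gene2"),
  Score = c(2.0, -0.5, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# Gene in-degree: sum of the weights of all edges pointing to each gene.
in_degree <- tapply(net_toy$Score, net_toy$Gene, sum)

# TF out-degree: sum of the weights of all edges leaving each TF.
out_degree <- tapply(net_toy$Score, net_toy$TF, sum)

in_degree
out_degree
```

Because edge weights can be negative, a degree is a net score rather than a count of edges.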

## 1.3. Run another PANDA analysis on Whole Blood Samples

In this section, we will build a PANDA gene regulatory network on whole blood samples. Like the LCL expression data in the previous section, we can either download the raw data and process it or load a pre-processed data set. Since we downloaded the combined LCL and whole blood gene expression data in the previous section, here we only need to download the sample names for the whole blood samples.

In [None]:
if(runserver==0){
    system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/WholeBlood_samples.txt")
}

Then, we follow the same steps to remove the version suffix from the gene IDs, filter for whole blood samples, and select genes that have at least 20 valid expression measurements.

In [None]:
if(precomputed==0){
    #load the sample ids of Whole Blood samples
    wblood_samples <-fread(paste0(ppath,"WholeBlood_samples.txt"), header = FALSE, data.table=FALSE)

    #select the columns of the expression matrix corresponding to the whole blood samples.
    wblood_expr <- expr[,union("Name",intersect(c(wblood_samples[1:149,]),colnames(expr))), with=FALSE]

    #count, for each gene, the expression values that are neither NA nor zero. This ensures that PANDA has enough values to calculate Pearson correlations between gene expression profiles when constructing the gene co-expression network. The "Name" column is excluded so that only numeric values are tested.
    wblood_vals <- as.matrix(wblood_expr[, setdiff(colnames(wblood_expr), "Name"), with=FALSE])
    zero_na_counts_wblood <- rowSums(!is.na(wblood_vals) & wblood_vals != 0)

    #keep only genes with at least 20 valid gene expression entries
    clean_data_wb <- wblood_expr[zero_na_counts_wblood >= 20,]

    #write the cleaned expression data to a file, ready to be passed as an argument to the PANDA algorithm.
    write.table(clean_data_wb, file = "../data/pandaExprWholeBlood.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
}

Alternatively, we can download the processed whole blood expression data directly from the netZoo AWS Bucket.

In [None]:
if(precomputed==1){
    system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/pandaExprWholeBlood.txt")
}

Using this gene expression matrix, and the generic PPI and motif networks, we can now call PANDA using the same parameters as for LCL cell lines.

In [None]:
panda_results_wblood <- pandaPy(expr_file = paste0(ppath,"pandaExprWholeBlood.txt") , motif_file = paste0(ppath,"motif_GTEx.txt"), ppi_file = paste0(ppath,"ppi_GTEx.txt"), modeProcess="legacy", remove_missing = TRUE)
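
With both networks available, one simple way to compare them is to match edges on TF and gene and take the difference in edge weight. The sketch below does this on two invented edge lists; the column names `TF`, `Gene`, and `Score` and all values are assumptions for the example, and this is only one of many possible comparisons.

```r
# Hypothetical edge lists for the two conditions.
lcl_toy <- data.frame(
  TF    = c("TF_A", "TF_A", "TF_B"),
  Gene  = c("gene1", "gene2", "gene1"),
  Score = c(2.0, 0.5, -1.0),
  stringsAsFactors = FALSE
)
wb_toy <- data.frame(
  TF    = c("TF_A", "TF_B", "TF_A"),
  Gene  = c("gene1", "gene1", "gene2"),
  Score = c(0.5, 1.0, 0.5),
  stringsAsFactors = FALSE
)

# Match edges on (TF, Gene), regardless of row order in either table.
both <- merge(lcl_toy, wb_toy, by = c("TF", "Gene"), suffixes = c("_lcl", "_wb"))

# Positive values mean a stronger edge in LCL, negative a stronger edge in whole blood.
both$diff <- both$Score_lcl - both$Score_wb
both[order(abs(both$diff), decreasing = TRUE), ]
```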

We can visualize the whole blood gene regulatory network using the `visNetwork` library, which requires defining an `edges` data frame that contains the edges in the network. Here we keep the 500 edges with the largest weights.

In [None]:
edges <- head(panda_results_wblood$panda[order(panda_results_wblood$panda$Score,decreasing = TRUE),], 500)
edges$arrows = "to" 
colnames(edges) <- c("from","to","motif","force","arrows")

And a `nodes` data frame that has information about the nodes in the network.

In [None]:
nodes <- data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))) , 
                    label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group <- ifelse(nodes$id %in% edges$from, "TF", "gene")

Finally, we plot the network, with TFs as teal squares and genes as gold circles.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "square",
                     color = list(background = "teal", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                     color = list(background = "gold", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

# 2. Single-sample inference using LIONESS

LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) is a method for creating sample-specific networks<sup>2</sup>, using an aggregate network inference method like PANDA or coexpression networks. When applied to a PANDA regulatory network, the result is a set of gene regulatory networks, one for each sample in the gene expression dataset. More information on LIONESS<sup>2</sup> can be found in the published [paper](https://doi.org/10.1016/j.isci.2019.03.021).

Running LIONESS with netZoo is simple, and very similar to running PANDA:

In [None]:
lionessLCL <- lionessPy(expr_file = paste0(ppath,"pandaExprLCL.txt") , motif_file = paste0(ppath,"motif_GTEx.txt"), ppi_file = paste0(ppath,"ppi_GTEx.txt"), modeProcess="legacy", remove_missing = TRUE)

LIONESS first runs PANDA to build an aggregate network across all samples, then derives a PANDA network for each sample by linear interpolation.
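
The interpolation itself is short: for a data set with N samples, the network of sample q is estimated as N times the difference between the network built on all samples and the network built without sample q, plus the network built without sample q. The sketch below applies this to Pearson correlation as the aggregate method instead of PANDA, purely to stay self-contained; `lioness_cor` and the random toy data are hypothetical and not part of netZooR.

```r
# Toy LIONESS using Pearson correlation as the aggregate network method.
# expr: numeric matrix with genes in rows and samples in columns.
# Returns a list with one gene x gene network per sample.
lioness_cor <- function(expr) {
  n <- ncol(expr)
  agg_all <- cor(t(expr))                                # network from all samples
  lapply(seq_len(n), function(q) {
    agg_without_q <- cor(t(expr[, -q, drop = FALSE]))    # network without sample q
    n * (agg_all - agg_without_q) + agg_without_q        # linear interpolation
  })
}

set.seed(42)
expr_toy <- matrix(rnorm(4 * 6), nrow = 4,
                   dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6)))
nets <- lioness_cor(expr_toy)
length(nets)
```

Each element of `nets` has the same shape as the aggregate network, which is why the cost of LIONESS grows with the number of samples: the aggregate method is rerun once per sample.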

In [None]:
lionessLCL[1:5,1:10]

The result is a data frame in which the first column contains TFs, the second column contains genes, and each subsequent column contains the edge weight for that particular TF-gene pair in a particular sample.
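
Given this wide layout, the network of a single sample is just the two identifier columns plus that sample's column, and per-sample gene targeting scores can be obtained by summing each sample column over TFs. The sketch below shows both on an invented table; the column names `tf`, `gene`, `sample1`, and `sample2` are assumptions for the example.

```r
# Hypothetical LIONESS-style output: TF, gene, then one weight column per sample.
lioness_toy <- data.frame(
  tf      = c("TF_A", "TF_A", "TF_B", "TF_B"),
  gene    = c("gene1", "gene2", "gene1", "gene2"),
  sample1 = c(1.0, 0.2, 0.5, 2.0),
  sample2 = c(0.8, 0.1, 1.5, 1.0),
  stringsAsFactors = FALSE
)
sample_cols <- setdiff(colnames(lioness_toy), c("tf", "gene"))

# Network of one sample as a three-column edge list.
net_sample1 <- lioness_toy[, c("tf", "gene", "sample1")]

# Per-sample gene in-degree: sum the edge weights over TFs within each sample.
in_degree_by_sample <- rowsum(as.matrix(lioness_toy[, sample_cols]),
                              group = lioness_toy$gene)
in_degree_by_sample
```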

# References
1 - Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PLoS ONE 8.5 (2013): e64832.

2 - Kuijjer, Marieke Lydia, et al. "Estimating sample-specific regulatory networks." iScience 14 (2019): 226-240.

3 - Lopes-Ramos, Camila M., et al. "Regulatory network changes between cell lines and their tissues of origin." BMC Genomics 18.1 (2017): 1-13.

4 - GTEx Consortium, et al. "The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans." Science 348.6235 (2015): 648-660.

5 - Weighill, Deborah, et al. "Gene targeting in disease networks." Frontiers in Genetics 12 (2021): 501.