# Building PANDA Regulatory Networks from GTEx Cell Line and Tissue Gene Expression Data in R
Author: Camila Lopes-Ramos<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

# 1. Introduction
In this vignette, we will build one regulatory network for lymphoblastoid cell line (LCL) samples and one for whole blood samples from the GTEx gene expression data<sup>1</sup> using the netZooR package. Next, we will compare the two networks and find the pathways enriched for genes differentially targeted between the LCL cell line and whole blood.
  
Cell lines are an essential tool in biomedical research and are often used as surrogates for tissues. LCLs (obtained by transforming B cells present in whole blood) are among the most widely used continuous cell lines and can proliferate indefinitely. By comparing the regulatory network of LCLs with that of their tissue of origin (whole blood), we find that LCLs exhibit large changes in their patterns of transcription factor regulation, specifically a loss of repressive transcription factor targeting of cell cycle genes<sup>2</sup>.

## 1.1. Install netZooR package

This notebook can be run on the server (`runserver=1`) or locally (`runserver=0`) by setting the following parameter.

In [None]:
runserver=1

When the notebook is run locally (`runserver=0`), the following cell installs the packages required for this analysis.

In [None]:
if(runserver==0){
    install.packages("devtools")
    library(devtools)
    devtools::install_github("netZoo/netZooR", build_vignettes = FALSE)
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager",repos = "http://cran.us.r-project.org")
    BiocManager::install("fgsea")
    install.packages("ggplot2")
    install.packages("reshape2")
    install.packages("visNetwork")  # loaded below for network visualization
}

The same parameter also sets the path to the input data files, which are stored under `/opt/data/` on the server.

In [None]:
if(runserver==1){
    ppath='/opt/data/'
}else if(runserver==0){
    ppath=''   
}

## 1.2. Install and load packages
Then, we load these required packages.

In [None]:
library('netZooR')      # To load PANDA and LIONESS
library('ggplot2')      # For plotting
library('reshape2')     # For data processing
library('visNetwork') # For network visualization

We also need `fgsea` for gene set enrichment analysis.

In [None]:
library('fgsea')        # For gene enrichment analysis

An example of enrichment on toy data provided with the package can be run using the following code:

In [None]:
data(examplePathways)
data(exampleRanks)
fgseaRes <- fgsea(pathways = examplePathways, 
                  stats    = exampleRanks,
                  minSize  = 15,
                  maxSize  = 500)

# 2. PANDA

## 2.1. PANDA Overview
PANDA (Passing Attributes between Networks for Data Assimilation)<sup>2</sup> is a method for constructing gene regulatory networks. It uses message passing to find congruence between 3 different data layers: protein-protein interaction (PPI), gene expression, and transcription factor (TF) motif data.

More details can be found in the published paper https://doi.org/10.1371/journal.pone.0064832.

## 2.2. Building a PANDA regulatory network

Now we locate our PPI and motif priors. The PPI prior represents physical interactions between transcription factor proteins and is an undirected network. The transcription factor motif prior represents putative regulation events in which a transcription factor binds in the promoter of a gene to regulate its expression, as predicted by the presence of transcription factor binding motifs in the promoter region of the gene. The motif prior is thus a directed network linking transcription factors to their predicted gene targets. These are small example priors for the purposes of demonstrating this method. A complete set of motif priors by species can be downloaded from: https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resources  
The function `source.PPI` can be used to retrieve protein-protein interactions from the STRING database<sup>3</sup>.
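
To make the two prior formats concrete, here is a minimal sketch using toy edge lists. The TF and gene names, the column names, and the weights are hypothetical and are not taken from the example files. It builds a directed TF-by-gene matrix for the motif prior and a symmetric TF-by-TF matrix for the undirected PPI prior.

```r
# Toy motif prior: directed edges from TFs to target genes (hypothetical names)
toy_motif <- data.frame(tf     = c("TF1", "TF1", "TF2"),
                        gene   = c("GENE_A", "GENE_B", "GENE_A"),
                        weight = c(1, 1, 1),
                        stringsAsFactors = FALSE)

# Toy PPI prior: undirected edges between TF proteins
toy_ppi <- data.frame(tf_a   = "TF1",
                      tf_b   = "TF2",
                      weight = 1,
                      stringsAsFactors = FALSE)

# Motif prior as a TF-by-gene adjacency matrix (rows regulate columns)
tfs   <- sort(unique(toy_motif$tf))
genes <- sort(unique(toy_motif$gene))
motif_mat <- matrix(0, length(tfs), length(genes), dimnames = list(tfs, genes))
motif_mat[cbind(toy_motif$tf, toy_motif$gene)] <- toy_motif$weight

# PPI prior as a symmetric TF-by-TF matrix, because the network is undirected
ppi_mat <- matrix(0, length(tfs), length(tfs), dimnames = list(tfs, tfs))
ppi_mat[cbind(toy_ppi$tf_a, toy_ppi$tf_b)] <- toy_ppi$weight
ppi_mat[cbind(toy_ppi$tf_b, toy_ppi$tf_a)] <- toy_ppi$weight

motif_mat
ppi_mat
```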

Let's take a look at the priors. First the motif network.

In [None]:
motif <- read.delim(paste0(ppath,"motif_subset.txt"), stringsAsFactors=F, header=F)
motif[1:5,]

Then, the PPI network.

In [None]:
ppi <- read.delim(paste0(ppath,"ppi_subset.txt"), stringsAsFactors=F, header=F)
ppi[1:5,]

Now we locate our expression data. As an example, we will use a subset of the GTEx<sup>1</sup> version 7 RNA-Seq data, downloaded from https://gtexportal.org/home/datasets. We start with a subset of TPM-normalized RNA-Seq data for 1,000 genes from 130 LCL cell line samples and 407 whole blood samples, which we load from the server.

In [None]:
exp <- read.delim(paste0(ppath,"expression_tpm_lcl_blood_subset.txt"), stringsAsFactors = F, check.names = F)

Then, we log2-transform the TPM-normalized expression after adding a pseudocount of 1.

In [None]:
exp <- log2(exp+1)
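
As a small illustration of this transform, the sketch below applies it to a few toy TPM values (made up for the example). The pseudocount keeps zero expression at zero and avoids taking the logarithm of zero.

```r
# Toy TPM values spanning several orders of magnitude (hypothetical)
tpm <- c(0, 1, 3, 1023)

# log2(x + 1) maps zero to zero and compresses the dynamic range
log_tpm <- log2(tpm + 1)
log_tpm
```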

Then, we count, for each gene (row), the number of samples with a non-NA, non-zero expression value. This ensures that PANDA has enough values in each vector to calculate Pearson correlations between gene expression profiles when constructing the gene co-expression network.

In [None]:
zero_na_counts <- apply(exp, MARGIN = 1, FUN = function(x) length(x[(!is.na(x) & x!=0) ]))

Then, we keep only genes with more than 20 valid gene expression entries.

In [None]:
exp <- exp[zero_na_counts > 20,]
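
The counting and filtering steps can be checked on a toy table. This is a minimal sketch with hypothetical gene and sample names and a deliberately small threshold (`min_valid` is an illustrative name, not a netZooR parameter). It also shows that `rowSums` gives the same counts as the `apply` call used above.

```r
# Toy expression table: 3 genes x 3 samples (hypothetical values)
toy_exp <- data.frame(s1 = c(5, 0, NA), s2 = c(2, 0, 1), s3 = c(0, 0, 4),
                      row.names = c("GENE_A", "GENE_B", "GENE_C"))

# Number of non-NA, non-zero entries per gene
valid_counts <- rowSums(!is.na(toy_exp) & toy_exp != 0)

# Same counts with the apply-based form used in this vignette
valid_counts_apply <- apply(toy_exp, MARGIN = 1,
                            FUN = function(x) length(x[!is.na(x) & x != 0]))

# Keep genes with more than min_valid valid entries (toy threshold)
min_valid <- 1
toy_filtered <- toy_exp[valid_counts > min_valid, ]
rownames(toy_filtered)
```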

Next, we load the sample IDs of the LCL samples.

In [None]:
lcl_samples <-read.delim(paste0(ppath,"LCL_samples.txt"), header=FALSE, stringsAsFactors=FALSE)

This allows us to select the columns of the expression matrix corresponding to the LCL samples.

In [None]:
lcl_exp <- exp[,colnames(exp) %in% lcl_samples[,1]]

Next, we load the sample ids of whole blood samples.

In [None]:
wblood_samples <-read.delim(paste0(ppath,"WholeBlood_samples.txt"), header=FALSE, stringsAsFactors=FALSE)

Then, we select the columns of the expression matrix corresponding to the whole blood samples.

In [None]:
wb_exp <- exp[,colnames(exp) %in% wblood_samples[,1]]
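
Because `%in%` silently ignores sample IDs that have no matching column, it can be useful to check for them. The sketch below uses a toy table and hypothetical sample IDs to show the selection and a simple check with `setdiff`.

```r
# Toy expression table with three samples (hypothetical IDs)
toy_exp <- data.frame(S1 = c(1, 2), S2 = c(3, 4), S3 = c(5, 6),
                      row.names = c("GENE_A", "GENE_B"))

# Sample IDs for one group; "S9" has no matching column
group_ids <- c("S1", "S3", "S9")

# Select matching columns; drop = FALSE keeps a data frame even for one column
selected <- toy_exp[, colnames(toy_exp) %in% group_ids, drop = FALSE]

# IDs that were requested but are absent from the expression table
missing_ids <- setdiff(group_ids, colnames(toy_exp))

colnames(selected)
missing_ids
```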

Now we run PANDA, pointing it to the parsed expression data, the motif prior, and the PPI prior. We use the same motif and PPI priors for each PANDA run, as they represent the initial putative regulatory information. We first point to the expression matrix corresponding to the LCL samples to generate the LCL regulatory network.

In [None]:
pandaLCL <- panda(motif, lcl_exp, ppi, mode="intersection")
pandaLCL

Then, we point to the expression matrix corresponding to the whole blood samples to generate the whole blood regulatory network.

In [None]:
pandaWB <- panda(motif, wb_exp, ppi, mode="intersection")
pandaWB

The regulatory network is a bipartite graph whose edge weights represent the "likelihood" that a transcription factor binds the promoter of a gene and regulates its expression.

In [None]:
regNetLCL <- pandaLCL@regNet
regNetWB <- pandaWB@regNet
regNetLCL[1:5,1:5]

This dataframe represents the bipartite regulatory network (transcription factors as rows and target genes as columns).
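
For downstream tools that expect an edge list rather than a TF-by-gene table, the network can be reshaped into long format. The following is a minimal base R sketch on a toy matrix with hypothetical names and weights; the column names `tf`, `gene`, and `weight` are illustrative choices.

```r
# Toy regulatory network: TFs as rows, genes as columns (hypothetical weights)
toy_net <- matrix(c(0.5, -1.2, 2.0, 0.1), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("GENE_A", "GENE_B")))

# Reshape to one row per TF-gene edge
edge_list <- as.data.frame(as.table(toy_net), stringsAsFactors = FALSE)
colnames(edge_list) <- c("tf", "gene", "weight")

# Sort edges from the highest to the lowest weight
edge_list <- edge_list[order(edge_list$weight, decreasing = TRUE), ]
edge_list
```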

# 3. Visualizing the networks
In this section we will visualize parts of the network using the `visNetwork` package.

## 3.1. Plot the 200 highest edge weights

Because the network is at the scale of the genome, we select only the top 200 edges by edge weight for visualization.

In [None]:
nDiffs= 200 # top edges to plot (top edges with largest absolute value)
diffNet = pandaLCL@regNet
nTFs  = dim(diffNet)[1]

`visNetwork` requires an edges dataframe describing the edges in the network and a nodes dataframe describing the nodes in the network. The edges dataframe is constructed as follows.

In [None]:
edges           = matrix(0L, nDiffs, 3)
colnames(edges) = c("from","to","value")
edges = as.data.frame(edges)
# Linear (column-major) indices of all edges, ranked by absolute weight
aa    = order(as.matrix(abs(diffNet)), decreasing = TRUE)
edges$value  = as.matrix(diffNet)[aa[1:nDiffs]]
# Convert linear indices to gene (column) and TF (row) indices;
# subtracting 1 first keeps edges in the last TF row in the correct column
geneIdsTop   = ((aa[1:nDiffs] - 1) %/% nTFs) + 1
tfIdsTop     = ((aa[1:nDiffs] - 1) %% nTFs) + 1
edges$to     = colnames(diffNet)[geneIdsTop]
edges$from   = rownames(diffNet)[tfIdsTop]
edges$arrows = "to"
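
The index arithmetic above converts linear (column-major) positions into TF rows and gene columns. As a hedged alternative, base R's `arrayInd` performs the same conversion and handles entries in the last row without special-casing. The sketch below uses a toy matrix with hypothetical names in which both top edges fall in the last TF row.

```r
# Toy network: 3 TFs x 2 genes (hypothetical weights)
toy_net <- matrix(c(0.1, 0.2, 9.0, 0.3, 0.4, 0.5), nrow = 3,
                  dimnames = list(c("TF1", "TF2", "TF3"), c("GENE_A", "GENE_B")))

# Linear indices of the two edges with the largest absolute weight
top <- order(abs(toy_net), decreasing = TRUE)[1:2]

# Convert linear indices to (row, column) pairs
idx <- arrayInd(top, dim(toy_net))

top_edges <- data.frame(from  = rownames(toy_net)[idx[, 1]],
                        to    = colnames(toy_net)[idx[, 2]],
                        value = toy_net[top],
                        stringsAsFactors = FALSE)
top_edges
```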

The nodes dataframe describes TF and gene nodes.

In [None]:
nodes       = data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))), 
                    label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")

Finally, we plot the network.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "purple", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "teal", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

## 3.2. Plot the top differential edges between LCL and WB
In this case study, we are interested in comparing LCL cell lines with their tissue of origin, whole blood. Therefore, we can also plot the differential network between them. We define the differential network as the edge-wise difference between the two networks (LCL minus whole blood).

In [None]:
nDiffs= 200 # top edges to plot (top edges with largest absolute value)
diffNet = pandaLCL@regNet - pandaWB@regNet
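
Element-wise subtraction only gives a meaningful differential network when both matrices list the same TFs and genes in the same order. The sketch below illustrates this check and the sign convention on two toy networks; all names and weights are hypothetical.

```r
# Two toy networks with identical TF (row) and gene (column) names
net_a <- matrix(c(1.0, 0.2, -0.5, 0.8), nrow = 2,
                dimnames = list(c("TF1", "TF2"), c("GENE_A", "GENE_B")))
net_b <- matrix(c(0.4, 0.2, 0.5, 1.0), nrow = 2,
                dimnames = list(c("TF1", "TF2"), c("GENE_A", "GENE_B")))

# Guard against mismatched or reordered TFs and genes before subtracting
stopifnot(identical(dimnames(net_a), dimnames(net_b)))

# Positive entries are edges that are stronger in network A
diff_net  <- net_a - net_b
direction <- ifelse(diff_net > 0, "higher in A",
                    ifelse(diff_net < 0, "higher in B", "equal"))
direction
```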

Then, we define the edges dataframe.

In [None]:
edges           = matrix(0L, nDiffs, 3)
colnames(edges) = c("from","to","value")
edges = as.data.frame(edges)
# Linear (column-major) indices of all edges, ranked by absolute weight difference
aa    = order(as.matrix(abs(diffNet)), decreasing = TRUE)
edges$value  = as.matrix(diffNet)[aa[1:nDiffs]]
# Convert linear indices to gene (column) and TF (row) indices
geneIdsTop   = ((aa[1:nDiffs] - 1) %/% nrow(diffNet)) + 1
tfIdsTop     = ((aa[1:nDiffs] - 1) %% nrow(diffNet)) + 1
edges$to     = colnames(diffNet)[geneIdsTop]
edges$from   = rownames(diffNet)[tfIdsTop]
edges$arrows = "to"
edges$color  = ifelse(edges$value > 0, "green", "red") # green: higher in LCL, red: higher in whole blood
edges$value  = abs(edges$value)

Then, the nodes dataframe.

In [None]:
nodes       = data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))), 
                    label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")

Finally, we plot the network.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "purple", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "teal", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

# 4. Calculating degree  

Next, we compute gene targeting scores<sup>4</sup> as summary statistics for all genes and TFs, defined as follows:
* out-degrees of TFs: sum of the weights of outbound edges around a TF


In [None]:
lcl_outdegree <- calcDegree(pandaLCL, type="tf")
wb_outdegree <- calcDegree(pandaWB, type="tf")

* in-degrees of genes: sum of the weights of inbound edges around a gene

In [None]:
lcl_indegree <- calcDegree(pandaLCL, type="gene")
wb_indegree <- calcDegree(pandaWB, type="gene")

Then, we calculate the gene in-degree difference between the two PANDA regulatory networks (LCL minus whole blood).

In [None]:
degreeDiff <- calcDegreeDifference(pandaLCL, pandaWB, type="gene")
head(degreeDiff)
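
The definitions above can be illustrated with plain row and column sums on toy matrices. This is only a sketch of the definitions, with hypothetical names and weights; it is not the netZooR implementation, and `calcDegree` may apply additional options.

```r
# Toy networks: TFs as rows, genes as columns (hypothetical weights)
toy_lcl <- matrix(c(1, 2, 3, 4), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("GENE_A", "GENE_B")))
toy_wb  <- matrix(c(0, 1, 5, 1), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("GENE_A", "GENE_B")))

# TF out-degree: sum of the weights of edges leaving each TF
out_degree_lcl <- rowSums(toy_lcl)

# Gene in-degree: sum of the weights of edges arriving at each gene
in_degree_lcl <- colSums(toy_lcl)
in_degree_wb  <- colSums(toy_wb)

# In-degree difference (first network minus second network)
in_degree_diff <- in_degree_lcl - in_degree_wb
in_degree_diff
```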

# 5. Gene Set Enrichment Analysis
We will use the `fgsea` package to perform gene set enrichment analysis. We need to provide a ranked gene list (for example, the gene in-degree difference between LCL and whole blood) and a list of gene sets (or signatures) in GMT format to test for enrichment. The gene sets can be downloaded from MSigDB (http://software.broadinstitute.org/gsea/msigdb). The same gene annotation should be used in the ranked gene list and in the gene sets. In our example, we will use the KEGG pathways downloaded from MSigDB.

## 5.1. Run fgsea
We start by loading the KEGG pathway gene sets.

In [None]:
pathways <- gmtPathways(paste0(ppath,"c2.cp.kegg.v7.0.symbols.gmt"))

Then, to retrieve biologically relevant processes, we load and use the complete ranked gene list (27,175 genes) calculated from the complete network, rather than the 1,000-gene subset that we used in this tutorial to build PANDA networks within a very short run time.

In [None]:
degreeDiff_all <- read.delim(paste0(ppath,"lclWB_indegreeDifference.rnk"),stringsAsFactors = F,header=F)
degreeDiff_all <- setNames(degreeDiff_all[,2], degreeDiff_all[,1])
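
The ranked list is a numeric vector of scores named by gene. The sketch below builds one from a toy two-column table (hypothetical genes and scores), sorts it, and checks for duplicated gene names, which are worth resolving before running an enrichment analysis.

```r
# Toy two-column ranking table: gene name, score (hypothetical values)
toy_rnk <- data.frame(V1 = c("GENE_A", "GENE_B", "GENE_C"),
                      V2 = c(-1.5, 2.0, 0.3),
                      stringsAsFactors = FALSE)

# Named numeric vector: names are genes, values are scores
toy_ranks <- setNames(toy_rnk[, 2], toy_rnk[, 1])

# Sort from the highest to the lowest score
toy_ranks <- sort(toy_ranks, decreasing = TRUE)

# TRUE when every gene name appears only once
no_duplicates <- anyDuplicated(names(toy_ranks)) == 0
toy_ranks
```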

Then, we run fgsea on the ranked gene scores.

In [None]:
fgseaRes <- fgsea(pathways, degreeDiff_all, minSize=15, maxSize=500, nperm=1000)
head(fgseaRes)

The results include the pathway name, the statistical significance of its enrichment in the ranked gene list (p-value and adjusted p-value), and other metrics such as the enrichment score (ES) and the normalized enrichment score (NES). We set the significance threshold to an adjusted p-value of 0.05.

In [None]:
sig <- fgseaRes[fgseaRes$padj < 0.05,]

Since the in-degree difference was computed as LCL minus whole blood, pathways with a negative NES are enriched for genes with lower targeting in LCLs. We therefore extract the 10 significant pathways with the lowest NES as follows.

In [None]:
head(sig[order(sig$NES),], 10) # head() avoids NA rows when fewer than 10 pathways are significant
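
The filtering and ordering logic can be checked on a toy results table. The sketch below uses a plain data frame with hypothetical pathways and values (not real `fgsea` output) and shows that `head` returns only the available rows when fewer than 10 pathways pass the threshold.

```r
# Toy enrichment results (hypothetical pathways and values)
toy_res <- data.frame(pathway = c("P1", "P2", "P3", "P4"),
                      padj    = c(0.01, 0.20, 0.03, 0.04),
                      NES     = c(-2.1, -3.0, 1.8, -1.2),
                      stringsAsFactors = FALSE)

# Keep pathways below the adjusted p-value threshold
toy_sig <- toy_res[toy_res$padj < 0.05, ]

# Order by NES (most negative first) and keep at most 10 rows
toy_top <- head(toy_sig[order(toy_sig$NES), ], 10)

# Restrict to pathways with negative NES only
toy_top_neg <- toy_top[toy_top$NES < 0, ]
toy_top_neg
```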

## 5.2. Bubble plot of top differentially targeted pathways 
We can also visualize this result as a bubble plot that we construct as follows. First, we set some general settings for the plot.

In [None]:
dat <- data.frame(fgseaRes)
# Settings
fdrcut <- 0.05 # FDR cut-off to use as output for significant signatures
dencol_neg <- "blue" # bubble plot color for negative ES
dencol_pos <- "red" # bubble plot color for positive ES
signnamelength <- 4 # set to remove prefix from signature names (2 for "GO", 4 for "KEGG", 8 for "REACTOME")
asp <- 3 # aspect ratio of bubble plot
charcut <- 100 # cut signature name in heatmap to this nr of characters

Then, we modify the signature names to make them more readable.

In [None]:
a <- as.character(dat$pathway) # pathway names, to be replaced with more readable labels
for (j in 1:length(a)){
  a[j] <- substr(a[j], signnamelength+2, nchar(a[j]))
}
a <- tolower(a) # convert to lower case (you may want to comment this out, it really depends on what signatures you are looking at, c6 signatures contain gene names, and converting those to lower case may be confusing)
for (j in 1:length(a)){
  if(nchar(a[j])>charcut) { a[j] <- paste(substr(a[j], 1, charcut), "...", sep=" ")}
} # cut signature names that have more characters than charcut, and add "..."
a <- gsub("_", " ", a)
dat$NAME <- a
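
The two loops above can also be written with vectorized string functions. The sketch below wraps the same steps in a helper; `clean_names` and its arguments are hypothetical names introduced for illustration and are not part of any package.

```r
# Hypothetical helper reproducing the name clean-up steps in vectorized form
clean_names <- function(x, prefix_len = 4, max_chars = 100) {
  # Drop the collection prefix and the underscore that follows it
  out <- substr(x, prefix_len + 2, nchar(x))
  out <- tolower(out)
  # Truncate long names and mark the cut with "..."
  too_long <- nchar(out) > max_chars
  out[too_long] <- paste(substr(out[too_long], 1, max_chars), "...", sep = " ")
  # Replace underscores with spaces
  gsub("_", " ", out)
}

readable <- clean_names(c("KEGG_CELL_CYCLE", "KEGG_DNA_REPLICATION"))
readable

# With a short cut-off, long names are truncated
truncated <- clean_names("KEGG_CELL_CYCLE", max_chars = 6)
truncated
```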

Then, we determine which signatures to plot (based on the FDR cut-off).

In [None]:
dat2 <- dat[dat[,"padj"]<fdrcut,]
dat2 <- dat2[order(dat2[,"padj"]),] 
dat2$signature <- factor(dat2$NAME, rev(as.character(dat2$NAME)))

Next, we determine which labels to color based on the sign of their NES values.

In [None]:
sign_neg <- which(dat2[,"NES"]<0)
sign_pos <- which(dat2[,"NES"]>0)

Then, we assign colors to them.

In [None]:
signcol <- rep(NA, length(dat2$signature))
signcol[sign_neg] <- dencol_neg # text color of negative signatures
signcol[sign_pos] <- dencol_pos # text color of positive signatures
signcol <- rev(signcol) # need to reverse the vector of colors, because ggplot starts plotting these from below

Finally, we draw the bubble plot.

In [None]:
g<-ggplot(dat2, aes(x=padj,y=signature,size=size))
g+geom_point(aes(fill=NES), shape=21, colour="white")+
  theme_bw()+ # white background, needs to be placed before the "signcol" line
  xlim(0,fdrcut)+
  scale_size_area(max_size=10,guide="none")+
  scale_fill_gradient2(low=dencol_neg, high=dencol_pos)+
  theme(axis.text.y = element_text(colour=signcol))+
  theme(aspect.ratio=asp, axis.title.y=element_blank()) # test aspect.ratio

Bubble plot of gene sets (KEGG pathways) on the y-axis and adjusted p-value (padj) on the x-axis. Bubble size indicates the number of genes in each gene set, and bubble color indicates the normalized enrichment score (NES). Blue indicates negative NES (enrichment for genes more highly targeted in whole blood), and red indicates positive NES (enrichment for genes more highly targeted in LCL).

# 6. References

1- GTEx Consortium. "The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans." Science 348.6235 (2015): 648-660.

2- Lopes-Ramos, Camila M., et al. "Regulatory network changes between cell lines and their tissues of origin." BMC Genomics 18.1 (2017): 1-13.

3- Mering, Christian von, et al. "STRING: a database of predicted functional associations between proteins." Nucleic Acids Research 31.1 (2003): 258-261.

4- Weighill, Deborah, et al. "Gene targeting in disease networks." Frontiers in Genetics 12 (2021): 501.