# Inferring Gene Regulatory Networks from GTEx Gene Expression Data in R with OTTER
Author: Rebekka Burkholz<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

# 1. Introduction
In this tutorial, we will learn how to infer OTTER gene regulatory networks from gene expression data with netZooR. We will focus on two examples: (1) the LCL cell line<sup>1</sup> and (2) whole blood (WB) samples from the GTEx gene expression data<sup>4</sup>. OTTER<sup>3</sup> has been developed from a machine learning perspective in close analogy to PANDA<sup>2</sup>.
For this reason, we will follow the same steps as in [the respective tutorial for PANDA](../netZooR/panda_gtex_tutorial_server.ipynb).

First, we will build one regulatory network for LCL cell line samples and one for whole blood samples. Next, we will compare the two networks by a pathway enrichment analysis for differentially targeted genes.

Cell lines are an essential tool in biomedical research and are often used as surrogates for tissues. LCLs (obtained from the transformation of B cells present in whole blood) are among the most widely used continuous cell lines with the ability to proliferate indefinitely. By comparing the regulatory networks of LCL cell lines with those of their tissue of origin (whole blood), we find that LCLs exhibit large changes in their patterns of transcription factor regulation, specifically a loss of repressive transcription factor targeting of cell cycle genes.

## Package installation

This tutorial can be run either on the server or locally by setting the following parameter (`1` for the server, `0` for a local run).

In [None]:
runserver=1

Next, we need to set the file paths on the server.

In [None]:
if(runserver==1){
    ppath='/opt/data/'
}else if (runserver==0){
    ppath=''
}

You might need to install these packages on your computer if you are running the tutorial locally.

In [None]:
if(runserver==0){
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager",repos = "http://cran.us.r-project.org")
    BiocManager::install("fgsea")
    install.packages("reshape2",repos = "http://cran.us.r-project.org")
    install.packages("ggplot2",repos = "http://cran.us.r-project.org")
    install.packages("remotes")
    library(remotes)
    remotes::install_github("netZoo/netZooR", build_vignettes = FALSE)
}

## Loading packages
In this step, we are going to load the libraries for the analysis.

In [None]:
library(netZooR)       # To load OTTER from netZooR
library(fgsea)         # For gene enrichment analysis
library(ggplot2)       # For plotting
library(reshape2)      # For data loading
library(data.table)    # For data processing
library(visNetwork)    # For network visualization

# 2. OTTER

## 2.1. Background of OTTER
OTTER (Optimize to Estimate Regulation) is a general method to infer a bipartite network $W$ from noisy observations of its projections $WW^T$ and $W^TW$. It is explained in detail in the accompanying publication<sup>3</sup>.
In this tutorial, we are particularly interested in constructing a gene regulatory network $W$ between transcription factors (TFs) and genes. Higher link weights are associated with a higher probability of TFs binding to the promoter region of a gene. 
![Inference of a bipartite gene regulatory network between transcription factors (TFs) and genes.](https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/genereg.png)

OTTER requires the following inputs: (1) a correlation matrix $C$, which is based on gene expression data, (2) protein-protein interactions represented by the matrix $P$, and (3) an initial guess $W_0$ of $W$, which we base on TF binding motifs.
Feel free to play with other choices of $C$, $P$, and $W_0$ as well.

OTTER solves the following optimization problem with ADAM gradient descent:
$$\min_W \frac{(1-\lambda)}{4}\Vert WW^T - \tilde{P} \Vert^2 + \frac{\lambda}{4} \Vert W^TW - C \Vert^2 + \frac{\gamma}{2} \Vert W \Vert^2,$$
which links transformed protein-protein interactions $\tilde{P} = P+2.2$ and the gene expression correlation matrix $C$ with the projections of the unknown gene regulatory network $W$. 
![OTTER infers the gene regulatory network W assuming that P and C are its (noise corrupted) projections.](https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/BipartiteProjections.png)
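
To make the objective concrete, the following minimal sketch evaluates it for a small made-up network. It assumes that the norms are Frobenius norms; the function name `otter_objective` and the toy matrices are illustrative only and are not part of netZooR, whose `otter` function also applies its own internal normalization.

```r
# Sketch of the objective as written above, assuming Frobenius norms.
# `otter_objective` is a hypothetical helper, not a netZooR function.
otter_objective <- function(W, P_tilde, C, lambda = 0.0035, gamma = 0.335) {
  term_P   <- (1 - lambda) / 4 * sum((W %*% t(W) - P_tilde)^2)  # TF-TF projection
  term_C   <- lambda / 4 * sum((t(W) %*% W - C)^2)              # gene-gene projection
  term_reg <- gamma / 2 * sum(W^2)                              # regularization
  term_P + term_C + term_reg
}

# Toy bipartite network with 2 TFs (rows) and 3 genes (columns)
W_toy <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2)

# With exact (noise-free) projections and no regularization the objective is zero
obj_exact <- otter_objective(W_toy, W_toy %*% t(W_toy), t(W_toy) %*% W_toy, gamma = 0)
print(obj_exact)
```

In this noise-free setting only the regularization term can contribute, which is why setting `gamma = 0` gives an objective of zero.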

Gradient descent is an iterative optimization method, which needs to be initialized. OTTER starts from $\tilde{P}W_0$, as this choice defines the current state-of-the-art in gene regulatory network inference based on gene expression data (see the original paper).
The transformation $\tilde{P}W_0$ assumes that binding events to genes with a high number of TF binding sites are more likely. More details can be found in the original paper. 

## 2.2. OTTER parameters
The success of OTTER depends greatly on the right choice of parameters. Some parameters are related to the OTTER objective, while others refer to the ADAM gradient descent approach. If we want to call OTTER with the original parameters that have been tuned to infer gene regulatory networks for breast cancer and cervix cancer tissues (see original paper), we can simply call the otter function as $W \leftarrow otter(W_0,P,C)$.

This sets the original parameters to $\lambda = 0.0035, \gamma = 0.335, Iter = 32, \eta = 0.00001, bexp = 1$, where $\lambda \in [0,1]$ is a tuning parameter in the OTTER objective that decides how much importance we give to matching $C$ or $\tilde{P}$. 

$\gamma \geq 0$ is a regularization parameter in the OTTER objective, which corrects for high noise in $\tilde{P}$ and $C$.

The parameters $Iter = 32, \eta = 0.00001, bexp = 1$ refer to the gradient descent procedure, where $Iter$ controls the number of gradient steps, $\eta$ the stepsize, and $bexp$ the exponential decay of the stepsize.
In the examples that we study next, we will only use the default parameters for simplicity.
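
To illustrate the roles of the number of iterations and the step size, here is a toy sketch that minimizes the objective from Section 2.1 on made-up data. It deliberately uses plain gradient descent with a fixed step size instead of ADAM, omits the step-size decay controlled by $bexp$, and uses toy parameter values rather than the defaults. It is not the netZooR implementation, and all names in it are hypothetical.

```r
# Toy sketch: plain gradient descent (NOT ADAM, no step-size decay).
# All names and parameter values are illustrative assumptions.
set.seed(1)
n_tf <- 3; n_gene <- 5
W_true  <- matrix(runif(n_tf * n_gene), n_tf, n_gene)
P_tilde <- W_true %*% t(W_true)   # noise-free TF-TF projection
C       <- t(W_true) %*% W_true   # noise-free gene-gene projection

objective <- function(W, lambda, gamma) {
  (1 - lambda) / 4 * sum((W %*% t(W) - P_tilde)^2) +
    lambda / 4 * sum((t(W) %*% W - C)^2) +
    gamma / 2 * sum(W^2)
}

# Gradient of the objective, valid for symmetric P_tilde and C
gradient <- function(W, lambda, gamma) {
  (1 - lambda) * (W %*% t(W) - P_tilde) %*% W +
    lambda * W %*% (t(W) %*% W - C) +
    gamma * W
}

lambda <- 0.5; gamma <- 0; eta <- 0.01; n_iter <- 200   # toy values only

# Start from a perturbed version of the true network
W <- W_true + matrix(rnorm(n_tf * n_gene, sd = 0.1), n_tf, n_gene)
loss_start <- objective(W, lambda, gamma)
for (i in seq_len(n_iter)) {
  W <- W - eta * gradient(W, lambda, gamma)   # one gradient step of size eta
}
loss_end <- objective(W, lambda, gamma)
print(c(start = loss_start, end = loss_end))
```

More iterations or a larger (but still stable) step size move `W` further from its initialization, which is why these two parameters need to be tuned together.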

## 2.3. OTTER Network Inference

First, we have to define the input matrices $P$, $C$, and $W_0$.
Let's start with $P$ and $W_0$ and locate our PPI data (to construct $P$) and motif priors (for $W_0$). The PPI represents physical interactions between transcription factor proteins and is an undirected network. The transcription factor motif prior represents putative regulation events where a transcription factor binds in the promoter of a gene to regulate its expression, as predicted by the presence of transcription factor binding motifs in the promoter region of the gene. The motif prior is thus a directed bipartite network linking transcription factors to their predicted gene targets. These are small example priors for the purposes of demonstrating this method. A complete set of priors by species can be downloaded from: https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resources

If you are running the tutorial locally, please download the files to your local directory using these commands and replace the file argument of `read.delim` with the relevant file paths.

In [None]:
if(runserver==0){
    # download motif and ppi file from AWS Bucket
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/motif_subset.txt")
    system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/ppi_subset.txt")
}

Let's take a look at the motif prior.

In [None]:
motif <- read.delim(paste0(ppath,"motif_subset.txt"), stringsAsFactors=F, header=F)

It contains edges between TFs and the target genes whose promoter region contains their DNA binding motif. The third column is binary: it is 1 if the binding is confirmed and 0 otherwise.

In [None]:
motif[1:5,]

There are 49950 edges in total.

In [None]:
print(dim(motif))

The PPI network has weighted edges between pairs of TFs. The third column corresponds to the edge weights, which take values between 0 and 1.

In [None]:
ppi <- read.delim(paste0(ppath,"ppi_subset.txt"), stringsAsFactors=F, header=F)
ppi[1:5,]

The total number of edges is 430.

In [None]:
print(dim(ppi))

Next, we locate our expression data and filter out genes that are not expressed in enough samples. As an example, we will use a subset of the GTEx version 7 RNA-Seq data, downloaded from https://gtexportal.org/home/datasets. We start with a subset of RNA-Seq data (tpm normalized) for 1,000 genes from 130 LCL cell line samples and 407 whole blood samples. 

If you are running the tutorial locally, use this command to download the files.

In [None]:
if(runserver==0){
    # download the GTEx expression matrix (tpm normalized expression)
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/expression_tpm_lcl_blood_subset.txt")
}

Now, we can load the gene expression file, which contains both the LCL and the whole blood samples.

In [None]:
exp <- read.delim(paste0(ppath,"expression_tpm_lcl_blood_subset.txt"), stringsAsFactors = F, check.names = F)

Then, we normalize the data by log transforming the expression values.

In [None]:
exp <- log2(exp+1)

Next, we count, for each gene (row), the number of samples with a non-NA, non-zero expression value. This ensures that we have enough values in the vectors to calculate Pearson correlations between gene expression profiles in the construction of the gene co-expression prior.

In [None]:
zero_na_counts <- apply(exp, MARGIN = 1, FUN = function(x) length(x[(!is.na(x) & x!=0) ]))

We keep only genes with more than 20 valid gene expression entries.

In [None]:
exp <- exp[zero_na_counts > 20,]
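
The counting and filtering steps above can be checked on a tiny made-up example. The sketch below uses a hypothetical 3-gene, 3-sample table and a threshold of more than 1 valid entry instead of more than 20; `rowSums` is used here as a vectorized alternative to `apply`.

```r
# Toy expression table (made-up values): rows are genes, columns are samples
toy <- data.frame(s1 = c(0, 1.2, NA),
                  s2 = c(0, 0.5, 3.1),
                  s3 = c(2.0, 0.7, 0),
                  row.names = c("geneA", "geneB", "geneC"))

# Number of non-NA, non-zero entries per gene
valid_counts <- rowSums(!is.na(toy) & toy != 0)
print(valid_counts)

# Keep genes with more than 1 valid entry (toy threshold)
toy_filtered <- toy[valid_counts > 1, , drop = FALSE]
print(rownames(toy_filtered))
```

Here only `geneB` has more than one usable value, so it is the only gene that survives the filter.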

Finally, we intersect the motif, PPI, and expression data so that the set of genes is the same.

In [None]:
exp <- exp[rownames(exp) %in% motif$V2,]
motif_subset <- motif[(motif$V1 %in% rownames(exp)) & (motif$V2 %in% rownames(exp)),]
ppi_subset <- ppi[(ppi$V1 %in% motif_subset$V1) & (ppi$V2 %in% motif_subset$V1),]

Now that we have downloaded the gene expression data for both LCL and whole blood, we need to download the annotation for each sample to identify which samples belong to LCL and which to whole blood.

In [None]:
if(runserver==0){
    #Load the sample ids of LCL samples
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/LCL_samples.txt")
}

Then, we load the file.

In [None]:
lcl_samples <-fread(paste0(ppath,"LCL_samples.txt"), header = FALSE, data.table=FALSE)

And we select the columns of the expression matrix corresponding to the LCL samples.

In [None]:
lcl_exp <- exp[,colnames(exp) %in% lcl_samples[,1]]

Now, we do the same for whole blood. We download the sample names.

In [None]:
if(runserver==0){
    # Load the sample ids of whole blood samples
    system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/WholeBlood_samples.txt")
}

And load the annotation file.

In [None]:
wblood_samples <-fread(paste0(ppath,"WholeBlood_samples.txt"), header = FALSE, data.table=FALSE)

Using the annotation file, we can separate the whole blood gene expression columns from our original matrix.

In [None]:
wb_exp <- exp[,colnames(exp) %in% wblood_samples[,1]]

For use in OTTER, we have to transform the networks from an edge list format into adjacency matrices. First, we need to define gene and TF names from our data.

In [None]:
geneNames <- unique(motif_subset[,2])
tfNames <- unique(motif_subset[,1])

As well as their sizes.

In [None]:
ng <- length(geneNames)
ntf <- length(tfNames)

We then start by transforming the motif network into an adjacency matrix.

In [None]:
W0 <- matrix(data=0, nrow = ntf, ncol=ng, dimnames = list(tfNames, geneNames))
W0[cbind(motif_subset[,1], motif_subset[,2])] <- motif_subset[,3]

Then, the protein-protein interaction network.

In [None]:
P <- matrix(data=0, nrow = ntf, ncol=ntf, dimnames = list(tfNames, tfNames))
P[cbind(ppi_subset[,1],ppi_subset[,2])] <- ppi_subset[,3]
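
The conversion above relies on indexing a matrix with a two-column character matrix, which matches each (row name, column name) pair against the `dimnames`. A minimal sketch with made-up TF and gene names follows. The symmetrization step at the end is an optional assumption for undirected edge lists that store each pair only once; whether it is needed for the PPI file used here depends on how that file lists its pairs.

```r
# Made-up bipartite edge list: TF, gene, weight
edge_list <- data.frame(tf = c("TF1", "TF1", "TF2"),
                        gene = c("g1", "g3", "g2"),
                        weight = c(1, 1, 1),
                        stringsAsFactors = FALSE)
tfs   <- c("TF1", "TF2")
genes <- c("g1", "g2", "g3")

# Empty TF x gene matrix, then fill the listed edges by name
A <- matrix(0, nrow = length(tfs), ncol = length(genes),
            dimnames = list(tfs, genes))
A[cbind(edge_list$tf, edge_list$gene)] <- edge_list$weight
print(A)

# Made-up undirected TF-TF edge stored in one direction only
P_toy <- matrix(0, 2, 2, dimnames = list(tfs, tfs))
P_toy[cbind("TF1", "TF2")] <- 0.8
P_sym <- pmax(P_toy, t(P_toy))   # optional: mirror the edge to both directions
print(P_sym)
```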

For the gene expression data, we need to compute the gene co-expression matrix, which will be used as an input to OTTER.

In [None]:
C_lcl <- cor(t(lcl_exp))
C_wb <- cor(t(wb_exp))
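
Because `cor` computes correlations between columns, the expression matrix (genes in rows, samples in columns) is transposed first so that the result is a gene-by-gene matrix. The following sketch with randomly generated toy data shows the resulting shape and basic properties.

```r
# Toy expression matrix: 4 genes (rows) x 10 samples (columns), random values
set.seed(42)
toy_exp <- matrix(rnorm(4 * 10), nrow = 4,
                  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:10)))

# Pearson correlation between gene profiles
toy_C <- cor(t(toy_exp))
print(dim(toy_C))          # one row and one column per gene
print(round(toy_C, 2))
```

The matrix is symmetric with ones on the diagonal, since every gene is perfectly correlated with itself.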

Now we can run OTTER. We want to generate two gene regulatory networks for comparison, one based on the LCL:

In [None]:
otterLCL <- otter(W0, P, C_lcl)

and one based on the whole blood data:

In [None]:
otterWB <- otter(W0, P, C_wb)

Hence, we also have to run OTTER twice. Note that the matrices $P$ and $W_0$ are identical in each run. The inputs to OTTER differ only in the correlation matrix, $C_{lcl}$ or $C_{wb}$, respectively.

OTTER networks have relatively small weights because of the internal normalization of edges. However, the scaling does not matter for the prediction of regulatory links. The higher the edge weight, the higher the probability that a transcription factor binds to the promoter region of a gene and regulates its expression. For convenience, we multiply all weights by a factor that sets the maximum weight to one.

In [None]:
otterLCL <- otterLCL/max(otterLCL)
otterWB  <- otterWB/max(otterWB)

# 3. Visualizing the networks
In this section we will visualize parts of the network using the `visNetwork` package.

## 3.1. Plotting the LCL network

As a visualization example, we will plot the LCL cell line OTTER network by taking the 200 largest edge weights in the network by absolute value.

In [None]:
nDiffs= 200

Then, we build the input data for `visNetwork` which requires an edges dataframe.

In [None]:
diffNet = otterLCL
nTFs  = dim(diffNet)[1]

# Edges data frame
edges           = matrix(0L, nDiffs, 3)
colnames(edges) = c("from","to","value")
edges = as.data.frame(edges)
aa    = order(as.matrix(abs(diffNet)), decreasing = TRUE)
bb    = sort(as.matrix(abs(diffNet)), decreasing = TRUE)
edges$value  = as.matrix(diffNet)[aa[1:nDiffs]]
# Convert column-major linear indices into (TF row, gene column) indices
geneIdsTop   = ((aa[1:nDiffs] - 1) %/% nTFs) + 1
tfIdsTop     = ((aa[1:nDiffs] - 1) %% nTFs) + 1
edges$to     = colnames(diffNet)[geneIdsTop]
edges$from   = rownames(diffNet)[tfIdsTop]                                  
edges$arrows = "to"   
edges$value  = exp(edges$value)
edges$color  = "grey"
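
The linear indices returned by `order` run down the columns of the matrix, so they have to be mapped back to (TF, gene) positions. As a minimal sketch on a made-up 2 x 3 matrix, base R's `arrayInd` can do this mapping instead of manual integer division. The matrix and names below are illustrative only.

```r
# Made-up TF x gene weight matrix (filled column by column)
m <- matrix(c(0.1, -0.9, 0.3, 0.8, 0.2, -0.5), nrow = 2,
            dimnames = list(c("TF1", "TF2"), c("g1", "g2", "g3")))

top_k   <- 3
lin_idx <- order(abs(m), decreasing = TRUE)[1:top_k]  # linear indices of top edges
rc      <- arrayInd(lin_idx, dim(m))                  # column 1 = row, column 2 = column

top_edges <- data.frame(from  = rownames(m)[rc[, 1]],
                        to    = colnames(m)[rc[, 2]],
                        value = m[lin_idx],
                        stringsAsFactors = FALSE)
print(top_edges)
```

In this example two of the selected linear indices (4 and 6) are exact multiples of the number of rows, which is the case where hand-written index arithmetic most easily goes wrong.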

Then a nodes dataframe.

In [None]:
# Nodes data frame
nodes       = data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))), 
                    label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")

Finally, we can plot the network, showing TFs as yellow triangles and genes as black circles.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")%>% 
  visHierarchicalLayout()
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "yellow", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "black", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

## 3.2. Plot the top differential edges between OTTER LCL and WB networks

We can use the LCL and whole blood networks to compare regulatory differences between LCL cell lines and their tissue of origin (whole blood). We can compute a differential network by taking the difference of edge weights using the LCL cell line as a reference group.

In [None]:
diffNet = otterLCL - otterWB

To plot the network, we can take the 200 top edges with largest absolute value.

In [None]:
nDiffs= 200 

Then, we define the parameters for `visNetwork`.

In [None]:
# Edges data frame
edges           = matrix(0L, nDiffs, 3)
colnames(edges) = c("from","to","value")
edges = as.data.frame(edges)
aa    = order(as.matrix(abs(diffNet)), decreasing = TRUE)
bb    = sort(as.matrix(abs(diffNet)), decreasing = TRUE)
edges$value  = as.matrix(diffNet)[aa[1:nDiffs]]
# Convert column-major linear indices into (TF row, gene column) indices
geneIdsTop   = ((aa[1:nDiffs] - 1) %/% dim(diffNet)[1]) + 1
tfIdsTop     = ((aa[1:nDiffs] - 1) %% dim(diffNet)[1]) + 1
edges$to     = colnames(diffNet)[geneIdsTop]
edges$from   = rownames(diffNet)[tfIdsTop]                                  
edges$arrows = "to"   
edges$color  = ifelse(edges$value > 0, "green", "red")
edges$value  = abs(edges$value)

# Nodes data frame
nodes       = data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))), 
                    label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")

Finally, we plot the network as we did previously. Green edges have larger weights in our reference group (LCL), and red edges have larger weights in whole blood. 

In [None]:
net <- visNetwork(nodes, edges, width = "100%")%>% 
  visHierarchicalLayout()
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "black", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "yellow", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

# 4. Calculating node degree  

We can calculate node degree in the network as follows:
* out-degrees of TFs: sum of the weights of edges pointing away from a TF
* in-degrees of genes: sum of the weights of edges pointing to a gene

This is particularly useful to reduce the networks to a single summary vector that we can use for Gene Set Enrichment Analysis, which will allow us to understand the enriched pathways in our differential network. First, we compute the outdegree.

In [None]:
lcl_outdegree <- apply(otterLCL, 1, sum) 
wb_outdegree  <- apply(otterWB, 1, sum) 

Then, the node indegree.

In [None]:
lcl_indegree  <- apply(otterLCL, 2, sum) 
wb_indegree   <- apply(otterWB, 2, sum) 
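
Since TFs are the rows and genes are the columns of the network matrix, the out-degree is a row sum and the in-degree is a column sum. The sketch below checks this on a made-up 2 x 3 network, using `rowSums` and `colSums` as vectorized equivalents of `apply(..., sum)`.

```r
# Made-up TF x gene network (filled column by column)
toy_net <- matrix(c(0.5, 0.1, 0.2, 0.9, 0.0, 0.3), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("g1", "g2", "g3")))

tf_outdegree  <- rowSums(toy_net)   # total weight leaving each TF
gene_indegree <- colSums(toy_net)   # total weight arriving at each gene
print(tf_outdegree)
print(gene_indegree)
```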

Finally, we compute the difference between the node degrees. In this case, since we are interested in gene enrichment analysis, we take the node in-degree because OTTER networks are directed networks that link TFs to their target genes.

In [None]:
degreeDiff    <- lcl_indegree-wb_indegree 
head(degreeDiff)

# 5. Gene Set Enrichment Analysis
We will use the `fgsea` package to perform a gene set enrichment analysis. Inputs are a ranked gene list (for example, the gene in-degree difference between LCL and whole blood) and a list of gene sets (or signatures) in `gmt` format to test for enrichment. The gene sets can be downloaded from MSigDB: http://software.broadinstitute.org/gsea/msigdb. The same gene annotation should be used in the ranked gene list and the gene sets. In our example, we will use the KEGG pathways downloaded from MSigDB.

## 5.1. Run fgsea

First, if you are working locally, please download the pathway annotation file; otherwise, you can skip this step.

In [None]:
if(runserver==0){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/c2.cp.kegg.v7.0.symbols.gmt")
}

Then we read the file that contains the pathways and the genes associated with them, as determined by the KEGG database<sup>5</sup>.

In [None]:
pathways <- gmtPathways(paste0(ppath,"c2.cp.kegg.v7.0.symbols.gmt"))

To retrieve biologically relevant processes, we will load and use the complete ranked gene list (consisting of 26,077 out of 27,174 genes). The in-degree difference has been calculated based on the complete networks instead of the subnetworks, which we constructed in this tutorial as small examples (with reduced run time). First, let's download the precomputed results if we are working locally.

In [None]:
if(runserver==0){
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/lclWB_indegreeDifference_otter.txt")
}

Then, let's test the enrichment of our list of genes for pathways in KEGG. First, we need to load the precomputed differential targeting profile.

In [None]:
degreeDiff_all <- read.delim(paste0(ppath,"lclWB_indegreeDifference_otter.txt"),stringsAsFactors = F,header=F)
degreeDiff_all <- setNames(degreeDiff_all[,2], degreeDiff_all[,1])

Then, we run fgsea as follows:

In [None]:
fgseaRes <- fgsea(pathways, degreeDiff_all, minSize=15, maxSize=500, nperm=1000)
head(fgseaRes)

We obtain a set of pathways linked to our input genes. The significance of each association can be assessed through the p-value and the multiple-testing corrected p-value. Here, we take an adjusted p-value of 0.05 as our threshold:

In [None]:
sig <- fgseaRes[fgseaRes$padj < 0.05,]

Then, we can get the top 10 significant pathways enriched for genes having lower targeting in LCLs by sorting on the Normalized Enrichment Score (NES). This score is positive if a pathway is enriched for genes with higher targeting in our reference group (LCL) and negative if it is enriched for genes with higher targeting in whole blood. 

In [None]:
sig[order(sig$NES)[1:10],]
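
The selection logic can be illustrated on a small made-up results table with the same column names (`pathway`, `padj`, `NES`). The values below are invented for illustration and are not fgsea output. Using `head` instead of indexing with `1:10` is a design choice in this sketch: it avoids rows of `NA` when fewer than 10 pathways pass the threshold.

```r
# Made-up enrichment results (NOT real fgsea output)
toy_res <- data.frame(pathway = c("A", "B", "C", "D"),
                      padj    = c(0.01, 0.20, 0.03, 0.04),
                      NES     = c(-2.1, 1.5, 1.8, -1.2),
                      stringsAsFactors = FALSE)

# Keep significant pathways
toy_sig <- toy_res[toy_res$padj < 0.05, ]

# Most negative NES first (lower targeting in the reference group)
most_negative <- head(toy_sig[order(toy_sig$NES), ], 10)

# Most positive NES first (higher targeting in the reference group)
most_positive <- head(toy_sig[order(toy_sig$NES, decreasing = TRUE), ], 10)

print(most_negative$pathway)
print(most_positive$pathway)
```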

## 5.2. Bubble plot of top differentially targeted pathways
Bubble plot of gene sets (KEGG pathways) on the y-axis and adjusted p-value (padj) on the x-axis. Bubble size indicates the number of genes in each gene set, and bubble color indicates the normalized enrichment score (NES). Blue is for negative NES (enrichment of higher targeted genes in whole blood), and red is for positive NES (enrichment of higher targeted genes in LCL). First, we need to convert our fgsea results to a dataframe as input to the bubble plot function.

In [None]:
dat <- data.frame(fgseaRes)

We set our significance threshold to 0.05.

In [None]:
fdrcut <- 0.05 # FDR cut-off to use as output for significant signatures

Then, we specify a set of parameters for plotting, such as the color of the bubbles for enriched terms.

In [None]:
dencol_neg <- "blue" # bubble plot color for negative ES
dencol_pos <- "red" # bubble plot color for positive ES
signnamelength <- 4 # set to remove prefix from signature names (2 for "GO", 4 for "KEGG", 8 for "REACTOME")
asp <- 3 # aspect ratio of bubble plot
charcut <- 100 # cut signature name in heatmap to this nr of characters

We clean pathway names to make the plot readable.

In [None]:
a <- as.character(dat$pathway) # 'a' holds the pathway names, which we replace with something more readable
for (j in 1:length(a)){
  a[j] <- substr(a[j], signnamelength+2, nchar(a[j]))
}
a <- tolower(a) # convert to lower case (you may want to comment this out, it really depends on what signatures you are looking at, c6 signatures contain gene names, and converting those to lower case may be confusing)
for (j in 1:length(a)){
  if(nchar(a[j])>charcut) { a[j] <- paste(substr(a[j], 1, charcut), "...", sep=" ")}
} # cut signature names that have more characters than charcut, and add "..."
a <- gsub("_", " ", a)
dat$NAME <- a
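
The effect of the cleaning steps can be seen on two example KEGG-style names. The sketch below is a vectorized version of the same steps (prefix removal, lower case, underscores to spaces) and leaves out the truncation to `charcut` characters; the variable names are illustrative.

```r
# Two example signature names in MSigDB KEGG style
raw_names  <- c("KEGG_CELL_CYCLE", "KEGG_PATHWAYS_IN_CANCER")
prefix_len <- 4   # length of the "KEGG" prefix, as with signnamelength above

# Drop the prefix and the underscore that follows it
clean_names <- substr(raw_names, prefix_len + 2, nchar(raw_names))

# Lower case and replace underscores with spaces
clean_names <- gsub("_", " ", tolower(clean_names))
print(clean_names)
```

Because `substr`, `tolower`, and `gsub` are vectorized, no explicit loop over the names is needed.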

Next, we determine which signatures to plot (based on the FDR cut-off).

In [None]:
dat2 <- dat[dat[,"padj"]<fdrcut,]
dat2 <- dat2[order(dat2[,"padj"]),] 
dat2$signature <- factor(dat2$NAME, rev(as.character(dat2$NAME)))

Then, we define the colors based on the enrichment for each group as defined earlier.

In [None]:
# Determine what labels to color
sign_neg <- which(dat2[,"NES"]<0)
sign_pos <- which(dat2[,"NES"]>0)
# Color labels
signcol <- rep(NA, length(dat2$signature))
signcol[sign_neg] <- dencol_neg # text color of negative signatures
signcol[sign_pos] <- dencol_pos # text color of positive signatures
signcol <- rev(signcol) # need to reverse the vector of colors, because ggplot starts plotting these from below

Finally, we draw the bubble plot.

In [None]:
g<-ggplot(dat2, aes(x=padj,y=signature,size=size))
g+geom_point(aes(fill=NES), shape=21, colour="white")+
  theme_bw()+ # white background, needs to be placed before the "signcol" line
  xlim(0,fdrcut)+
  scale_size_area(max_size=10,guide="none")+
  scale_fill_gradient2(low=dencol_neg, high=dencol_pos)+
  theme(axis.text.y = element_text(colour=signcol))+
  theme(aspect.ratio=asp, axis.title.y=element_blank()) # test aspect.ratio

The plot summarizes the findings of our gene set enrichment analysis. There are notable differences between LCL and whole blood, which reflect differences in regulatory processes between cell lines and their tissues of origin. In particular, we see the term `pathways in cancer` enriched in the LCL cell line. This is not surprising, since LCLs are immortalized by EBV infection, which is itself linked to increased cases of cancer<sup>6</sup>.

# References

1- Lopes-Ramos, Camila M., et al. "Regulatory network changes between cell lines and their tissues of origin." BMC genomics 18.1 (2017): 1-13.

2- Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PloS one 8.5 (2013): e64832.

3- Weighill, Deborah, et al. "Gene regulatory network inference as relaxed graph matching." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 11. 2021.

4- GTEx Consortium, et al. "The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans." Science 348.6235 (2015): 648-660.

5- Kanehisa, Minoru, and Susumu Goto. "KEGG: kyoto encyclopedia of genes and genomes." Nucleic acids research 28.1 (2000): 27-30.

6- Thompson, Matthew P., and Razelle Kurzrock. "Epstein-Barr virus and cancer." Clinical cancer research 10.3 (2004): 803-821.