# Uncovering gene regulation in Mycobacterium tuberculosis using netZooR
Author: Tian Wang<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

# 1. Introduction

netZooR is an R package which consists of several methods to construct, analyze and plot gene regulatory networks, including the following tools:

* **PANDA** (Passing Attributes between Networks for Data Assimilation)<sup>1</sup> is a message-passing approach to gene regulatory network reconstruction. It integrates multiple sources of biological data, including protein-protein interactions, gene expression, and transcription factor binding motifs, to reconstruct genome-wide, condition-specific regulatory networks. [[Glass et al. 2013]](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0064832)

* **LIONESS** (Linear Interpolation to Obtain Network Estimates for Single Samples)<sup>2</sup> is a method to estimate sample-specific regulatory networks by applying linear interpolation to the predictions made by existing aggregate network inference approaches. [[Kuijjer et al. 2019]](https://www.sciencedirect.com/science/article/pii/S2589004219300872)

* **CONDOR** (COmplex Network Description Of Regulators)<sup>3</sup> implements methods to cluster bipartite networks and to estimate the contribution of each node to its community's modularity. [[Platig et al. 2016]](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005033)

* **ALPACA** (ALtered Partitions Across Community Architectures)<sup>4</sup> is a method to compare two genome-scale networks derived from different phenotypic states to identify condition-specific modules. [[Padi and Quackenbush 2018]](https://www.nature.com/articles/s41540-018-0052-5)

In this vignette, we will run these tools on Mycobacterium tuberculosis data to model gene regulatory processes.

The help pages for the core functions can be accessed as follows:

In [None]:
?pandaPy
?createCondorObject
?pandaToCondorObject
?lionessPy
?alpaca
?pandaToAlpaca
?sambar

## 1.1. Getting Started

### Prerequisites

If you're running this netbook on the server, set the parameter `runserver` to 1 and skip to the [loading packages](#section_1) section.

In [None]:
runserver=1

To run locally, set `runserver` to 0. This package requires [**Python**](https://www.python.org/downloads/) (3.X) with some Python libraries, and [**R**](https://cran.r-project.org/) (>= 4.0).

Some plotting functions also require [**Cytoscape**](https://cytoscape.org/) to be installed.

### Required Python libraries

The installation of Python libraries varies across platforms. More instructions can be found [here](https://packaging.python.org/tutorials/installing-packages/).

The following Python libraries (or packages) are required to run the PANDA and LIONESS algorithms: [pandas](https://pandas.pydata.org/), [numpy](http://www.numpy.org/), [networkx](https://networkx.github.io/), [matplotlib.pyplot](https://matplotlib.org/api/pyplot_api.html).

### Installing
The netZooR package can be installed with the `install_github()` function from the `remotes` package. To install netZooR without vignettes, set the `build_vignettes = FALSE` argument.

In [None]:
if (runserver==0){
    is_netZooR_available <- require("netZooR")
    if (is_netZooR_available==0){
        install.packages("remotes") 
        library(remotes)
        remotes::install_github("netZoo/netZooR", build_vignettes = TRUE)
    }
    ppath=''
}else{
    ppath='/opt/data/'
}

<a id='section_1'></a> 
### Loading packages

In [None]:
library(netZooR)    # For network inference
library(viridisLite)# To visualize communities
library(visNetwork) # For network visualization 

### Configuring Python

We will run PANDA through the R-Python interface provided by the reticulate package, which netZooR uses to invoke Python from within the R environment.
Configure which version of Python to use if necessary; netZooR requires Python 3.X.
More details can be found [here](https://cran.r-project.org/web/packages/reticulate/vignettes/versions.html).

In [None]:
if (runserver==0){
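    # point reticulate to Python 3.X if necessary (adjust the path to your system);
    # do this first, because py_config() initializes Python and the choice cannot be changed afterwards
    use_python("/usr/local/bin/python3")
    # check your Python configuration and the specific version of Python currently in use
    py_config()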
}

The previous command binds R to Python. This is needed because we call PANDA through netZooPy, which has an optimized implementation of PANDA. Check [this tutorial](http://netbooks.networkmedicine.org/user/marouenbg/notebooks/netZooR/panda_gtex_tutorial_server.ipynb) for an example using a pure R implementation of PANDA. This step is only necessary when working locally. On this Jupyter notebook server, we just need to tell R where to find Python using this command:

```
Sys.setenv(RETICULATE_PYTHON = "/opt/anaconda3/py38/bin/python")
```

## 1.2. Data Sources

PANDA<sup>1</sup> builds a gene regulatory network by integrating three sources of data: 1) TF motif data, 2) a TF PPI network, and 3) gene expression data.

### Motif data
An example of species-specific, PANDA-ready transcription factor binding motif data is included in the netZooR package. It is derived from the motif scan and motif info files located at https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resourcesby. The motif data is a data frame with three columns: 1) TF (source node), 2) Gene (target node), and 3) weight, a binary (0/1) value that indicates the presence of a TF motif in the promoter region of the target gene.
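
As a minimal sketch of this three-column format, the toy data frame below uses made-up TF and gene names (they are not taken from the package data) and converts the edge list into a TF-by-gene binary matrix, which can be a convenient way to inspect a motif prior.

```r
# Toy motif prior in the three-column format described above.
# TF and gene names are hypothetical and only serve as an illustration.
toy_motif <- data.frame(
  TF     = c("tfA", "tfA", "tfB"),
  Gene   = c("gene1", "gene2", "gene2"),
  Weight = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Convert the edge list to a TF x gene binary matrix;
# TF-gene pairs that are absent from the edge list stay at 0.
tfs   <- unique(toy_motif$TF)
genes <- unique(toy_motif$Gene)
motif_matrix <- matrix(0, nrow = length(tfs), ncol = length(genes),
                       dimnames = list(tfs, genes))
motif_matrix[cbind(toy_motif$TF, toy_motif$Gene)] <- toy_motif$Weight
print(motif_matrix)
```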

### PPI
This package includes a function `sourcePPI` to build a protein-protein interaction (PPI) network from the STRING database given a list of proteins of interest. The [STRINGdb](http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) package is already loaded when netZooR is loaded.

In [None]:
# TF is a data frame with single column filled with TFs of Mycobacterium tuberculosis H37Rv.
motif_file_path <- system.file("extdata", "chip_matched.txt", package = "netZooR", mustWork = TRUE)
motif <- read.table(motif_file_path, sep="\t")
# create a data frame with the TF column
TF  <- data.frame(motif[,1])
PPI <- sourcePPI(TF, STRING.version="11", species.index=83332, score_threshold=0)
PPI

The PPI data has three columns: 1) source node (TF), 2) target node (TF), and 3) weight, a value between 0 and 1 that indicates the strength of the connection between the two TFs.

### TB gene expression data


We will use the TB example datasets that are included in the netZooR package.
In this application, we will build a case and a control network using two gene expression datasets, one transcription factor binding motif dataset, and one protein-protein interaction dataset from the netZooR package. These data can also be fetched from AWS.

To use the data in the package, we specify the paths of these files as follows:

In [None]:
# retrieve the file path of these files
treated_expression_file_path <- system.file("extdata", "expr4_matched.txt", package = "netZooR", mustWork = TRUE)
control_expression_file_path <- system.file("extdata", "expr10_matched.txt", package = "netZooR", mustWork = TRUE)
motif_file_path <- system.file("extdata", "chip_matched.txt", package = "netZooR", mustWork = TRUE)
ppi_file_path   <- system.file("extdata", "ppi_matched.txt", package = "netZooR", mustWork = TRUE)

Alternatively, they can be downloaded from AWS to the working directory.

In [None]:
if (runserver==0){
    # case gene expression
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/expr4_matched.txt")
    # control gene expression
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/expr10_matched.txt")
    # motif data
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/chip_matched.txt")
    # PPI data
    system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/ppi_matched.txt")
}

# 3. PANDA algorithm

Next, we pass the file paths defined previously to the `expr_file`, `motif_file`, and `ppi_file` arguments of the PANDA call. We also set the option `remove_missing` to `TRUE` to remove TFs and genes that are not present in all three inputs.

We do this for both the case and control networks, starting with the case network:

In [None]:
treated_all_panda_result <- pandaPy(expr_file = treated_expression_file_path, motif_file = motif_file_path, ppi_file = ppi_file_path, modeProcess = "legacy", remove_missing = TRUE)

Then, the control network:

In [None]:
control_all_panda_result <- pandaPy(expr_file = control_expression_file_path, motif_file = motif_file_path, ppi_file = ppi_file_path, modeProcess = "legacy", remove_missing = TRUE)

The results `treated_all_panda_result` and `control_all_panda_result` are large lists with three elements: the entire PANDA network, the gene targeting scores (node indegree), and the TF targeting scores (node outdegree). Use `$panda`, `$indegree`, and `$outdegree` to access each list item respectively.

We can use `$panda` to access the entire PANDA network.

In [None]:
treated_net <- treated_all_panda_result$panda
control_net <- control_all_panda_result$panda
treated_net

The PANDA network is a data frame with four columns: a source column (TFs), a target column (genes), a binary motif column that is identical to the input motif network, and a score column (`Score`) that holds the edge weight in the PANDA network.
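
To make the relation between the edge list and the targeting scores concrete, here is a small sketch. It assumes that a gene's indegree is the sum of the weights of its incoming edges and that a TF's outdegree is the sum of the weights of its outgoing edges. The toy edge list, its names, and the column names `TF` and `Gene` are assumptions made for illustration; only `Score` is used elsewhere in this notebook.

```r
# Toy PANDA-like edge list (hypothetical names and weights).
toy_net <- data.frame(
  TF    = c("tfA", "tfA", "tfB", "tfB"),
  Gene  = c("gene1", "gene2", "gene1", "gene2"),
  Score = c(2.0, -0.5, 1.0, 0.5)
)

# Gene targeting score (assumed): sum of incoming edge weights per gene.
indegree  <- tapply(toy_net$Score, toy_net$Gene, sum)
# TF targeting score (assumed): sum of outgoing edge weights per TF.
outdegree <- tapply(toy_net$Score, toy_net$TF, sum)

print(indegree)
print(outdegree)
```

On real results, these values could be compared against the `$indegree` and `$outdegree` elements to check whether the assumption holds.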

## PANDA Cytoscape Plotting
Cytoscape is an interactive network visualization tool that is highly recommended for exploring the PANDA network. Before using the function `visPandaInCytoscape`, please install and launch Cytoscape (3.6.1 or greater) and keep it running whenever you use this function.

Before calling this function, we need to reduce the network size by selecting the top 1000 edges in the PANDA network by edge weight.

In [None]:
# order the treated network by its own edge weights and keep the top 1000 edges
panda.net <- head(treated_net[order(treated_net$Score, decreasing = TRUE),], 1000)

Run this function to create a network in Cytoscape (requires a desktop installation of Cytoscape).

In [None]:
if (runserver==0){
    visPandaInCytoscape(panda.net, network.name="PANDA")
}

On the netbooks server, we can use the visNetwork library to plot the 100 largest edges of the graph. We first need to prepare the data in the required format.

In [None]:
panda.net=panda.net[-3] # remove the unnecessary motif column
num_edges <- 100 # number of edges to plot
edges = panda.net
colnames(edges) = c("from","to","value")
edges = edges[order(-edges$value),]
edges = edges[1:num_edges,]

edges$arrows = "to"
edges$color = ifelse(edges$value > 0, "green", "red")
edges$value = abs(edges$value)

nodes = data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))), label=unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")

Then, we can call visNetwork on the newly constructed data frames. TFs are orange triangles, genes are dark blue dots, positive edges are colored in green, and negative edges in red.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "orange", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "darkblue", border="black"))
visLegend(net, main="Legend", position="right", ncol=1)

# 4. LIONESS Algorithm 
LIONESS reconstructs single-sample networks for each gene expression sample from an aggregate network such as PANDA. LIONESS uses the same arguments as PANDA. In this example, we will run the LIONESS algorithm for the first two samples. If the `start_sample` and `end_sample` arguments are not specified, LIONESS will generate networks for all samples.

In [None]:
control_lioness_result <- lionessPy(expr_file = control_expression_file_path, motif_file = motif_file_path, ppi_file = ppi_file_path, modeProcess = "legacy", remove_missing = TRUE, start_sample = 1, end_sample = 2)
control_lioness_result

The output of `lionessPy()` is a data frame in which the first two columns contain the TFs (regulators) and genes (targets), and each remaining column corresponds to one sample. Each cell holds the edge weight estimated by LIONESS.
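
The linear interpolation behind LIONESS can be illustrated on toy data. For sample q out of N samples, the sample-specific network is estimated as N * (network from all samples - network without q) + network without q (Kuijjer et al. 2019). The sketch below is not what `lionessPy()` runs: it swaps in Pearson correlation as the aggregate method instead of PANDA, uses simulated expression values, and the helper `lioness_cor` is a hypothetical name.

```r
# LIONESS linear interpolation:
#   e_q = N * (e_all - e_without_q) + e_without_q
# Pearson correlation stands in for PANDA as the aggregate method (an assumption
# made to keep the example small and self-contained).
set.seed(1)
n_samples <- 6
expr <- matrix(rnorm(4 * n_samples), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:n_samples)))

# Hypothetical helper: single-sample network for sample q.
lioness_cor <- function(expr, q) {
  n <- ncol(expr)
  net_all       <- cor(t(expr))         # aggregate network from all samples
  net_without_q <- cor(t(expr[, -q]))   # aggregate network without sample q
  n * (net_all - net_without_q) + net_without_q
}

# One gene-by-gene network per sample.
sample_nets <- lapply(seq_len(n_samples), function(q) lioness_cor(expr, q))
names(sample_nets) <- colnames(expr)
print(round(sample_nets[["s1"]], 2))
```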

# 5. CONDOR Algorithm and plotting
CONDOR detects communities in gene regulatory networks, like those built by PANDA. However, a few processing steps are needed to make the network compliant with the CONDOR format.
PANDA networks can be converted into a condor object with `pandaToCondorObject(panda.net, threshold)`.
By default, `threshold` is the average of [median weight of non-prior edges] and [median weight of prior edges]. Before the medians and their average are calculated, all weights are transformed with the formula `w'=ln(e^w+1)`, which makes all edge weights positive for CONDOR. The selected edges retain the original weights calculated by PANDA.
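
The default threshold described above can be worked through on a toy example. This is a sketch under stated assumptions, not the package's implementation: the edge values are made up, the `Motif` column marks prior edges with 1, and edges are assumed to be kept when their transformed weight is at least the threshold.

```r
# Transform w' = ln(e^w + 1): maps any real weight to a positive value.
softplus <- function(w) log(exp(w) + 1)

# Toy PANDA-like edges (hypothetical values); Motif = 1 marks prior edges.
toy_net <- data.frame(
  Motif = c(1, 1, 1, 0, 0, 0),
  Score = c(3.0, 2.0, 1.0, 0.0, -1.0, -2.0)
)
w <- softplus(toy_net$Score)

# Default threshold as described: average of the median transformed weight
# of prior edges and the median transformed weight of non-prior edges.
threshold <- mean(c(median(w[toy_net$Motif == 1]),
                    median(w[toy_net$Motif == 0])))

# Assumed selection rule: keep edges whose transformed weight reaches the
# threshold; the kept rows still carry their original Score.
kept <- toy_net[w >= threshold, ]
print(threshold)
print(kept)
```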

In [None]:
treated_condor_object <- pandaToCondorObject(treated_net, threshold = 0)

Then, CONDOR can be called on the resulting CONDOR object:

In [None]:
treated_condor_object <- condorCluster(treated_condor_object, project = FALSE)

The result of CONDOR is a community assignment for each node of the network. The community structure can be plotted using igraph.

In [None]:
treated_color_num <- max(treated_condor_object$red.memb$com)
treated_color     <- viridis(treated_color_num, alpha = 1, begin = 0, end = 1, direction = 1, option = "D")
condorPlotCommunities(treated_condor_object, color_list=treated_color, point.size=0.04, xlab="Genes", ylab="TFs")

This plot shows that CONDOR estimates the TB network to have three distinct communities.

# 6. ALPACA Algorithm

ALPACA compares two networks by detecting differences in their community structure. For example, ALPACA can be called on two PANDA networks. The function `pandaToAlpaca` links both methods.

In [None]:
alpaca <- pandaToAlpaca(treated_net, control_net, NULL, verbose=FALSE)

The result list `alpaca` contains two slots. The first is a community assignment for each node, and the second is a modularity score for each node, which indicates the contribution of that node to the modularity of the community it belongs to.

# More tutorials

Browse with `browseVignettes("netZooR")` locally or check [this link for cloud notebooks](http://netbooks.networkmedicine.org/).

## Note
If an error like `Error in fetch(key) : lazy-load database.rdb' is corrupt` appears when accessing the help pages of functions in this package after it has been loaded, this is [a limitation of base R](https://github.com/r-lib/devtools/issues/1660) that has not been solved yet. Restarting the R session and reloading the package will help.


# References

1- Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PLoS ONE 8.5 (2013): e64832.

2- Kuijjer, Marieke Lydia, et al. "Estimating sample-specific regulatory networks." iScience 14 (2019): 226-240.

3- Platig, John, et al. "Bipartite community structure of eQTLs." PLoS Computational Biology 12.9 (2016): e1005033.

4- Padi, Megha, and John Quackenbush. "Detecting phenotype-driven transitions in regulatory network structure." npj Systems Biology and Applications 4.1 (2018): 1-12.

5- Kuijjer, Marieke Lydia, et al. "Cancer subtype identification using somatic mutation data." British Journal of Cancer 118.11 (2018): 1492-1501.