# Building single-sample regulatory networks using LIONESS and netZooPy
Author: 
Qi (Alex) Song<sup>1</sup>.

<sup>1</sup>Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA. (qi.song@channing.harvard.edu)

## 1. Introduction
In this tutorial, we will briefly walk through the steps to perform an analysis with the LIONESS<sup>1</sup> algorithm using the netZooPy package. LIONESS is an algorithm for estimating a sample-specific gene regulatory network for each individual in a population. It infers individual sample networks by applying linear interpolation to the predictions made by an aggregate network inference approach<sup>1</sup>. Here, we will use PANDA<sup>3</sup> as the aggregate network inference approach to build sample-specific networks.
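For reference, the linear interpolation at the core of LIONESS<sup>1</sup> can be written as follows. Here $e_{ij}^{(\alpha)}$ is the weight of the edge between nodes $i$ and $j$ in the aggregate network built from all $N$ samples, $e_{ij}^{(\alpha - q)}$ is the weight of the same edge in the network built from all samples except sample $q$, and $e_{ij}^{(q)}$ is the estimated weight in the network of sample $q$:

$$
e_{ij}^{(q)} = N \left( e_{ij}^{(\alpha)} - e_{ij}^{(\alpha - q)} \right) + e_{ij}^{(\alpha - q)}
$$

In words, the contribution of sample $q$ is estimated from the difference between the network built with and without that sample, scaled by the number of samples.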

## 2. Installation of netZooPy.
netZooPy comes with full support for the LIONESS algorithm and can be installed with `pip`. For more details, please refer to the installation guide on the netZooPy documentation site [here](https://netzoopy.readthedocs.io/en/latest/install/index.html).

This analysis can be run on the netbooks server or locally by specifying the following parameter:

In [None]:
runserver=1

This will allow us to set server-specific parameters for this netbook.

In [None]:
if runserver==1:
    ppath='/opt/data/'
elif runserver==0:
    ppath=''

## 3. Load required modules
We will need the `Panda` and `Lioness` Python classes from the netZooPy package. We will also need the `read_csv()` function from the `pandas` package to inspect the input data sets.

In [None]:
import os
from netZooPy.panda import Panda
from netZooPy.lioness import Lioness
from netZooPy.lioness.analyze_lioness import AnalyzeLioness
import pandas as pd

## 4. Load input data

Now let's look at the three data sets to get a sense about what the inputs look like.

In [None]:
exp_data = pd.read_csv(ppath+'ToyExpressionData.txt',header=None, index_col = 0, sep = "\t")
motif_data = pd.read_csv(ppath+'ToyMotifData.txt',header=None, sep = "\t")
ppi_data = pd.read_csv(ppath+'ToyPPIData.txt',header=None, sep = "\t")

Expression data is a matrix where rows are genes and columns are samples. There are 1000 genes and 50 samples in this gene expression dataset. This data set will be used to construct the first input network to PANDA, which is a gene coexpression network.

In [None]:
exp_data
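As an illustration of how an expression matrix in this layout leads to a co-expression network, the following minimal sketch computes gene-by-gene Pearson correlations on a small random matrix. The toy data, gene names, and sample names are hypothetical, and the use of Pearson correlation is an assumption of this sketch rather than a statement about what `Panda` does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 genes (rows) x 5 samples (columns),
# mirroring the genes-by-samples layout of the expression file.
rng = np.random.default_rng(0)
toy_exp = pd.DataFrame(
    rng.normal(size=(4, 5)),
    index=["geneA", "geneB", "geneC", "geneD"],
    columns=[f"sample{i}" for i in range(1, 6)],
)

# np.corrcoef treats each row as one variable, so with genes in rows
# the result is a gene-by-gene Pearson correlation matrix.
coexp = pd.DataFrame(
    np.corrcoef(toy_exp.values),
    index=toy_exp.index,
    columns=toy_exp.index,
)
print(coexp.round(2))
```

The resulting matrix is square and symmetric, with one row and one column per gene.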

Motif data should be formatted as a three-column list, where the first column contains the TF IDs, the second column the target gene IDs, and the third column the interaction scores. This data set will be the second input network to PANDA, which is a "seed" network that PANDA uses as an initial estimate for the inference. Edges in this binary network have a value of 1 if the TF has a motif in the promoter region of the target gene and 0 otherwise.

In [None]:
motif_data

There are 87 unique TFs and 913 unique target genes in this motif dataset.

In [None]:
motif_data[0].unique().shape[0]

In [None]:
motif_data[1].unique().shape[0]
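To make the three-column format more concrete, the sketch below converts a small motif edge list into a TF-by-gene matrix. The TF names, gene names, and column labels are hypothetical and chosen only for illustration; this is not a reproduction of how netZooPy parses the file.

```python
import pandas as pd

# Hypothetical three-column motif edge list: TF, target gene, score.
toy_motif = pd.DataFrame(
    [
        ["TF1", "geneA", 1.0],
        ["TF1", "geneB", 1.0],
        ["TF2", "geneB", 1.0],
        ["TF2", "geneC", 1.0],
    ],
    columns=["tf", "gene", "score"],
)

# Pivot the edge list into a TF-by-gene matrix.
# TF-gene pairs that are absent from the list are filled with 0.
motif_matrix = toy_motif.pivot(index="tf", columns="gene", values="score").fillna(0.0)
print(motif_matrix)
```

Each row of the matrix is a TF and each column a target gene, so a 1 marks a TF with a motif in the promoter region of that gene.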

Finally, TF protein-protein interaction (PPI) data should be formatted as a three-column list, where the first two columns contain protein IDs and the third column contains a score for each interaction. This will be the third input network to PANDA.

In [None]:
pd.concat([ppi_data[0],ppi_data[1]]).unique().size

This PPI dataset has 238 interactions among 87 TFs. Typically, TF PPI networks are built by extracting TF nodes from the STRING database<sup>2</sup>. Edge weights vary between 0 and 1 to indicate the strength of the connection between TFs. For example, if two TFs physically bind to each other, the edge weight would be close to 1.
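The PPI edge list can be viewed as a symmetric TF-by-TF matrix. The sketch below builds such a matrix from a small hypothetical edge list. The TF names and column labels are illustrative, and setting the diagonal to 1 (each TF interacting with itself) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column PPI edge list: TF, TF, score in [0, 1].
toy_ppi = pd.DataFrame(
    [["TF1", "TF2", 0.9], ["TF2", "TF3", 0.4]],
    columns=["tf_a", "tf_b", "score"],
)

# Collect every TF that appears in either of the first two columns.
tfs = sorted(set(toy_ppi["tf_a"]) | set(toy_ppi["tf_b"]))

# Start from an identity matrix (assumption: self-interaction weight of 1).
ppi_matrix = pd.DataFrame(np.eye(len(tfs)), index=tfs, columns=tfs)

# PPI edges are undirected, so fill both (a, b) and (b, a).
for tf_a, tf_b, score in toy_ppi.itertuples(index=False):
    ppi_matrix.loc[tf_a, tf_b] = score
    ppi_matrix.loc[tf_b, tf_a] = score

print(ppi_matrix)
```

TF pairs that do not appear in the edge list keep a weight of 0.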

To sum up, PANDA builds a TF-gene regulatory network starting from an initial seed network, the TF-to-gene motif network. This seed network is projected onto the gene co-expression network and onto the TF PPI network, and the average of both projections is computed<sup>3</sup>.

## 5. Run Panda
Before running LIONESS, we first need to generate a `Panda` network, which is the aggregate network for all samples in the gene expression data. It will be used later to run `Lioness` and estimate a network for each sample in the gene expression data. Note that the argument `keep_expression_matrix` must be set to `True` in the PANDA step, because LIONESS needs to call `Panda` again to build networks from the gene expression matrix with each sample left out.

In [None]:
panda_obj = Panda(ppath+'ToyExpressionData.txt',
                  ppath+'ToyMotifData.txt',
                  ppath+'ToyPPIData.txt',
                  remove_missing=False, 
                  keep_expression_matrix=True, save_memory=False, modeProcess='legacy')

The `modeProcess` argument defines how `Panda` behaves when the TFs and genes are not the same across the three input networks. `intersection` takes the intersection across all input networks, `union` takes the union of the three network nodes, and `legacy` uses the gene annotation from the gene expression data and the TF annotation from the motif network.
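The three modes can be illustrated with plain Python sets. This is only a sketch that follows the description above; the node names and the helper `select_nodes` are hypothetical, and the code does not reproduce netZooPy internals.

```python
# Hypothetical node sets taken from the three inputs.
exp_genes = {"g1", "g2", "g3"}    # genes in the expression data
motif_tfs = {"TF1", "TF2"}        # TFs in the motif network
motif_genes = {"g2", "g3", "g4"}  # target genes in the motif network
ppi_tfs = {"TF2", "TF3"}          # TFs in the PPI network


def select_nodes(mode):
    """Return (TFs, genes) kept under the given mode, as sorted lists."""
    if mode == "intersection":
        # Keep only nodes present in every input that mentions them.
        tfs = motif_tfs & ppi_tfs
        genes = exp_genes & motif_genes
    elif mode == "union":
        # Keep every node seen in any input.
        tfs = motif_tfs | ppi_tfs
        genes = exp_genes | motif_genes
    elif mode == "legacy":
        # Genes from the expression data, TFs from the motif network.
        tfs = set(motif_tfs)
        genes = set(exp_genes)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(tfs), sorted(genes)


for mode in ("intersection", "union", "legacy"):
    print(mode, select_nodes(mode))
```

With these toy sets, `intersection` keeps the fewest nodes and `union` the most, which is the main trade-off to consider when the inputs do not fully overlap.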

## 6. Run Lioness to estimate sample-specific networks
We will first use the `Panda` object as input for the `Lioness` object. `Lioness` will then run the PANDA algorithm in its iterations to estimate a sample-specific network for each sample.

In [None]:
lioness_obj = Lioness(panda_obj, save_dir='../data')

Using linear interpolation, LIONESS computes a PANDA network for each of the 50 samples in the gene expression data and saves them in the directory specified by `save_dir`.

## 7. Run Lioness with co-expression matrix
LIONESS can work with any method that generates a network for an aggregate set of samples, such as correlation networks. To compute LIONESS with a co-expression matrix instead of PANDA, we can set the motif data to `None`:

In [None]:
motif = None

First, we need to compute the gene co-expression network. To keep the same workflow as in the previous case, we will use the `Panda` class to do this. We also need to make sure to keep the expression matrix for the next step.

In [None]:
panda_obj = Panda(ppath+'ToyExpressionData.txt',
                  None,
                  ppath+'ToyPPIData.txt',
                  save_tmp=True,
                  remove_missing=False,
                  keep_expression_matrix=True, modeProcess='legacy')

Then, we call LIONESS on each sample of the expression data.

In [None]:
lioness_obj = Lioness(panda_obj, save_dir='../data')

This produces a gene co-expression network for each sample in our data using linear interpolation. This is particularly interesting when a specific group has too few samples to compute a co-expression network on its own. By integrating these samples into a larger population, LIONESS makes it possible to estimate a network for each sample.
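The idea behind this co-expression variant can be sketched with NumPy alone. The code below applies the LIONESS linear interpolation<sup>1</sup> with Pearson correlation as the aggregate method: for each sample `q`, it compares the network built from all samples with the network built without `q`. The function name `lioness_correlation` and the random toy data are hypothetical, and this is an illustration of the principle rather than the netZooPy implementation.

```python
import numpy as np


def lioness_correlation(expr):
    """Estimate one gene-by-gene network per sample.

    expr: array of shape (genes, samples).
    Returns an array of shape (samples, genes, genes).
    """
    n_samples = expr.shape[1]
    # Aggregate network built from all samples.
    aggregate = np.corrcoef(expr)
    networks = []
    for q in range(n_samples):
        # Network built from all samples except sample q.
        without_q = np.corrcoef(np.delete(expr, q, axis=1))
        # LIONESS linear interpolation for sample q.
        networks.append(n_samples * (aggregate - without_q) + without_q)
    return np.stack(networks)


# Hypothetical toy data: 4 genes x 10 samples.
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(4, 10))

sample_networks = lioness_correlation(toy_expr)
print(sample_networks.shape)  # (10, 4, 4)
```

Unlike correlation coefficients, the resulting single-sample edge weights are not bounded between -1 and 1, because they are extrapolated from the difference between two networks.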

## 8. Visualize Lioness results
The `AnalyzeLioness` class can be used to visualize a LIONESS network. You may select only the `top` genes to be visualized in the graph. In the current version of Lioness, only the network of the first sample is visualized by the `.top_network_plot()` method.

In [None]:
analyze_lioness_obj = AnalyzeLioness(lioness_obj)

This object now contains processed network data that we can use to plot the top edges of the network with `top_network_plot`. The `file` argument is the file path where the figure is saved.

In [None]:
analyze_lioness_obj.top_network_plot(top = 10, file = "../data/lioness_top_10.png")
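To give an intuition for what "top edges" means, the sketch below selects the highest-weighted edges from a small edge list with pandas. The edge list and column names are hypothetical, and ranking by largest weight is an assumption of this sketch; it does not describe the internal logic of `top_network_plot`.

```python
import pandas as pd

# Hypothetical TF-gene edge list with edge weights.
toy_edges = pd.DataFrame(
    {
        "tf": ["TF1", "TF1", "TF2", "TF2", "TF3"],
        "gene": ["g1", "g2", "g1", "g3", "g2"],
        "weight": [2.5, -0.3, 1.1, 4.0, 0.7],
    }
)

# Keep the 3 edges with the largest weights, sorted from high to low.
top_edges = toy_edges.nlargest(3, "weight")
print(top_edges)
```

Restricting a plot to a handful of edges keeps the graph readable, since a full TF-gene network is far too dense to inspect visually.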

## 9. Save Lioness results
We can save the LIONESS results with the `save_lioness_results()` method of the `Lioness` object. The edge weights of each LIONESS network will be saved into an output file.

First, we define the output format as text.

In [None]:
lioness_obj.save_fmt='txt'

Then we save the file.

In [None]:
lioness_obj.save_lioness_results()

The file contains the networks in a 2+n column format. The first two columns are identical to the TF and target IDs from the `.export_panda_results` property of the `Panda` object.

In [None]:
panda_obj.export_panda_results

The `n` other columns correspond to each sample of the population, and each column has the edge weights between TFs and genes for each specific network.
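A table in this 2+n column layout can be reshaped into one TF-by-gene matrix per sample. The sketch below does this on a small hand-made table. The column names, values, and the helper `sample_network` are hypothetical, since the exact header of the saved file is not shown here.

```python
import pandas as pd

# Hypothetical result table in a 2+n column layout:
# TF, target gene, then one column of edge weights per sample.
toy_results = pd.DataFrame(
    {
        "tf": ["TF1", "TF1", "TF2", "TF2"],
        "gene": ["g1", "g2", "g1", "g2"],
        "sample1": [0.5, -1.2, 2.0, 0.1],
        "sample2": [0.7, -0.9, 1.5, 0.3],
    }
)


def sample_network(results, sample):
    """Return the TF-by-gene weight matrix for one sample column."""
    return results.pivot(index="tf", columns="gene", values=sample)


net1 = sample_network(toy_results, "sample1")
print(net1)
```

Keeping all samples in one long table is compact for storage, while the per-sample matrix form is more convenient for comparing networks between samples.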

# References
1- Kuijjer ML, Tung MG, Yuan GC, Quackenbush J, Glass K: Estimating Sample-Specific Regulatory Networks. iScience 2019.

2- Mering, Christian von, et al. "STRING: a database of predicted functional associations between proteins." Nucleic acids research 31.1 (2003): 258-261.

3- Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PloS one 8.5 (2013): e64832.