## Analyzing the transcriptional regulatory network of *E. coli*

Content here is licensed under a CC 4.0 License. The code in this notebook is released under the MIT license. 


By Manu Flores. 

In [None]:
# Uncomment the next two lines if you're in Google Colab
#! pip install -r https://raw.githubusercontent.com/manuflores/grnlearn_tutorial/master/requirements.txt
#! wget https://raw.githubusercontent.com/manuflores/grnlearn_tutorial/master/notebooks/grn.py

In [None]:
import grn as g
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import matplotlib as mpl
 
import hvplot
import hvplot.pandas
import holoviews as hv
from holoviews import dim, opts
import bokeh_catplot
import bokeh 
import bokeh.io
from bokeh.io import output_file, save, output_notebook
output_notebook()
hv.extension('bokeh')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

g.set_plotting_style()
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

seed = 8 
np.random.seed(seed)

### Load transcriptional regulatory network (TRN)

To start this exploration, let's load the transcriptional regulatory network of *E. coli*. We'll use pandas to load the data into the notebook and then use NetworkX to analyze it. 

This network was downloaded from [RegulonDB](http://regulondb.ccg.unam.mx/menu/download/datasets/). RegulonDB is the largest knowledge base / database of *E. coli* and is maintained by the group of [Julio Collado](http://www.ccg.unam.mx/pedro-julio-collado-vides/) at UNAM. There are similar databases for other model organisms, such as [SubtiWiki](http://subtiwiki.uni-goettingen.de/) for *B. subtilis* (curated by the group of Jörg Stülke at the University of Göttingen) and [WormBase](https://www.wormbase.org/) from the [Paul Sternberg](http://wormlab.caltech.edu/) lab at Caltech. These databases are great resources for learning about model organisms through biological data analysis. 

If you're running the notebook in Google Colab, the local data file won't be available, so you'll have to fetch it from the tutorial's repository instead. The commented-out cell below does this by passing the file's URL to `pd.read_csv`. 

In [None]:
# Uncomment the following lines if you're in Google Colab
# url = 'https://raw.githubusercontent.com/manuflores/grnlearn_tutorial/master/data/trn_ecoli.txt'
# df_trn = pd.read_csv(url,  comment= '#',
#                     delimiter = '\t', index_col = False)

In [None]:
# Read in the transcriptional regulatory network file 
df_trn = pd.read_csv('../data/trn_ecoli.txt',  comment= '#',
                     delimiter = '\t', index_col = False)
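
To make the `read_csv` arguments concrete, here is a minimal sketch on an in-memory string instead of the real file. The file contents and gene names below are made up for illustration; only the `tf` and `tg` column names come from the dataset.

```python
import io
import pandas as pd

# Hypothetical file contents: two comment lines, a header row,
# and three tab-separated interactions (made-up names)
toy_text = (
    "# Toy regulatory network (made-up entries)\n"
    "# Columns: tf, tg\n"
    "tf\ttg\n"
    "tfA\tgene1\n"
    "tfA\tgene2\n"
    "tfB\tgene1\n"
)

# comment='#' skips the comment lines, delimiter='\t' splits on tabs,
# and index_col=False keeps the first column as data rather than an index
toy_trn = pd.read_csv(io.StringIO(toy_text), comment='#',
                      delimiter='\t', index_col=False)
```

Each remaining row is one TF-target interaction, which is the shape the rest of the notebook relies on.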

Let's look at the dataframe.

In [None]:
df_trn.head()

The dataset is arranged so that each row is an interaction: the `tf` column holds the **transcription factor** (TF), and the `tg` column holds the **target gene** regulated by that TF.

We can now turn it into a graph object using NetworkX. 

In [None]:
# Pandas DataFrame to a NetworkX graph object
trn = nx.from_pandas_edgelist(df= df_trn,
                              source= 'tf',
                              target='tg')

In reality, this network is a directed graph, but for simplicity let's treat it as an undirected network. 
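
For readers who want to keep the direction of regulation, a directed version can be sketched by passing `create_using=nx.DiGraph` to the same function. The toy edge list below is hypothetical and only illustrates how in-degree and out-degree separate regulators from targets.

```python
import networkx as nx
import pandas as pd

# Toy edge list (hypothetical interactions, not taken from RegulonDB)
toy_edges = pd.DataFrame({
    'tf': ['tfA', 'tfA', 'tfB', 'tfB'],
    'tg': ['gene1', 'tfB', 'gene1', 'gene2'],
})

# create_using=nx.DiGraph keeps the TF -> target direction of each edge
toy_directed = nx.from_pandas_edgelist(toy_edges, source='tf', target='tg',
                                       create_using=nx.DiGraph)

out_degree = dict(toy_directed.out_degree())  # number of targets per node
in_degree = dict(toy_directed.in_degree())    # number of regulators per node
```

In the undirected graph used in this notebook, these two quantities are merged into a single degree per node.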

We can now go ahead and get the global regulators of the TRN, a.k.a. the hubs. There are several [centrality measures](https://networkx.github.io/documentation/stable/reference/algorithms/centrality.html) in NetworkX; we'll use eigenvector centrality. 

In [None]:
# Calculating eigenvector centrality to get the hubs
eigen_cen= nx.eigenvector_centrality(trn)

The output of the function is a dictionary that maps the gene name of each node in the network to its eigenvector centrality. Let's make a list of its (gene, centrality) pairs and print the first 10 elements. 

In [None]:
list(eigen_cen.items())[:10]

We can see that these are not sorted by centrality. We can sort them by centrality and then return the 10 most central nodes, i.e. the hubs. 

In [None]:
# sort the dictionary to get the hubs
hubs= sorted(eigen_cen.items(), key= lambda cc: cc[1], reverse= True)[:10]

hubs

Nice! We actually get the most important global transcription factors in the network. We can corroborate this by getting the 10 TFs with the most interactions from the dataframe. 

In [None]:
# Get the 10 TFs with the most interactions
df_trn.tf.value_counts().head(10)
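
To give some intuition for why eigenvector centrality and raw interaction counts tend to agree on the hubs, here is a rough power-iteration sketch in plain Python. It is not NetworkX's implementation: the function name, the toy graph, and the choice to add each node's own score at every step (which helps the iteration settle on tree-like graphs) are assumptions made for illustration.

```python
def eigenvector_centrality_sketch(adjacency, n_iter=200, tol=1e-12):
    """Power iteration on an undirected graph given as {node: set_of_neighbors}."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(n_iter):
        # New score = own score + sum of the neighbors' scores
        new_scores = {
            node: scores[node] + sum(scores[nb] for nb in neighbors)
            for node, neighbors in adjacency.items()
        }
        # Rescale so the vector of scores has unit Euclidean length
        norm = sum(value ** 2 for value in new_scores.values()) ** 0.5
        new_scores = {node: value / norm for node, value in new_scores.items()}
        # Stop once the scores no longer change appreciably
        change = max(abs(new_scores[n] - scores[n]) for n in adjacency)
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy graph (hypothetical names): 'hub' touches three nodes, 'c' touches one more
toy_adjacency = {
    'hub': {'a', 'b', 'c'},
    'a': {'hub'},
    'b': {'hub'},
    'c': {'hub', 'd'},
    'd': {'c'},
}

toy_scores = eigenvector_centrality_sketch(toy_adjacency)
top_node = max(toy_scores, key=toy_scores.get)
```

A node scores highly when its neighbors score highly, so a TF with many targets accumulates a large score, while a node such as `c` ranks above `a` because it has an extra neighbor.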

### Visualizing the transcriptional regulatory network

Let's proceed with our analysis by plotting the network itself. Specifically, we want to tease out the structure of the net by looking at it. 

Before we go and plot the whole network, let's extract its [largest connected component](https://en.wikipedia.org/wiki/Component_(graph_theory)). 

In [None]:
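# Extract the network's largest connected component
# (nx.connected_component_subgraphs was removed in NetworkX 2.4)
largest_cc = max(nx.connected_components(trn), key=len)
trn_lcc = trn.subgraph(largest_cc).copy()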
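
As a small illustration of what extracting the largest connected component means, the sketch below builds a toy graph with two disconnected pieces and keeps the bigger one. The node names are hypothetical.

```python
import networkx as nx

# Toy graph with two separate pieces (hypothetical node names)
toy = nx.Graph()
toy.add_edges_from([('a', 'b'), ('b', 'c'), ('x', 'y')])

# connected_components yields one set of nodes per component
components = sorted(nx.connected_components(toy), key=len, reverse=True)

# Induce a subgraph on the biggest node set;
# .copy() makes it independent of the original graph
toy_lcc = toy.subgraph(components[0]).copy()
```

Working on the largest component matters for steps such as layout and eigenvector-style measures, which behave best on a single connected piece.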

The NetworkX library has an off-the-shelf function to plot networks using a Matplotlib backend. It also has different layouts to explore. 

In [None]:
nx.draw(trn_lcc,
        node_color = 'lightgreen',
        with_labels = False,
        alpha = 0.6,
        node_size = 30)
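
To explore layouts without redrawing blindly, one option is to compute the node positions first and inspect them. The sketch below uses a small star graph as a stand-in for the TRN, with an arbitrary seed.

```python
import networkx as nx
import numpy as np

toy = nx.star_graph(4)  # node 0 connected to nodes 1..4

# Force-directed layout; fixing the seed makes the positions reproducible
pos_spring = nx.spring_layout(toy, seed=8)
pos_spring_again = nx.spring_layout(toy, seed=8)

# Deterministic layout that places the nodes on a circle
pos_circular = nx.circular_layout(toy)

# Check that the two seeded runs produced the same coordinates
same = all(np.allclose(pos_spring[n], pos_spring_again[n]) for n in toy.nodes)
```

Either dictionary can then be passed to `nx.draw` through its `pos` argument to draw the same graph with a different arrangement.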

Nice! From the plot we can (kind of) tease out that this fuzzball has some nice structure to it: the inside is a very convoluted hairball, but at the edges we see a clear pattern, where **few nodes regulate many nodes**. Moreover, the net **organizes into clusters**; we'll get back to that in the next tutorial. This is typical of real-world networks, like social networks, power grids, and really most biological networks. These types of networks are generally known as **scale-free networks**. 

We can see this pattern more easily if we plot the distribution of the number of connections of each node in the network, i.e. the degree distribution. Let's extract that information, put it inside a dataframe, and plot it using the `bokeh_catplot` package from the great Justin Bois.


In [None]:
# Get all the gene names for the nodes in the net
nodes = [node for node in trn.nodes()]

# Get the degree of each node
degree_distro = [degree for  node, degree in trn.degree()]

# Make a list corresponding to the annotation in the network
tf_annot = [ 'tf' if node in df_trn.tf.values else 'tg' for node in nodes ]

# Save this information in a tidy dataframe
net_stats = pd.DataFrame(
    {'gene_name': nodes,
     'degree_distribution': degree_distro, 
     'tf_annot': tf_annot}
)

net_stats.head()

Nice, we can now go ahead and plot the distribution using `bokeh_catplot`. This is an awesome library to visualize distributions using interactive plots in Bokeh. [The library was made by Justin Bois](https://github.com/justinbois/bokeh-catplot).

In [None]:
# Plot degree distribution
p = bokeh_catplot.strip(
    data = net_stats,
    val = 'degree_distribution',
    cats = 'tf_annot',
    jitter = True, 
    horizontal = True,
    tooltips = [('gene_name', '@gene_name'),
                ('tf_annot', '@tf_annot')],
    marker_kwargs={'alpha': 0.5},
    palette= ['#fc8d62', '#8da0cb'], 
)


bokeh.io.show(p)

Nice! We can clearly see that a lot of TFs have more than 10 interactions in the network. We can also see that the most heavily regulated genes have more than 10 regulating TFs too!

What if we wanted to zoom in to a certain region of the network? Well, we can do it effectively using high-level commands of the [`hvplot.networkx`](https://hvplot.pyviz.org/user_guide/NetworkX.html) module. Let's also activate the hover tool in order to see the gene name when we hover the cursor over a given node. 
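
A complementary way to look at a heavy-tailed degree distribution is its complementary cumulative distribution (CCDF): for each degree k, the fraction of nodes with degree at least k. The helper below is a hypothetical sketch on made-up degrees, not part of the original analysis.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Return (k, fraction of nodes with degree >= k) for each observed degree k."""
    counts = Counter(degrees)
    n_nodes = len(degrees)
    ccdf = []
    remaining = n_nodes  # nodes whose degree is >= the current k
    for k in sorted(counts):
        ccdf.append((k, remaining / n_nodes))
        remaining -= counts[k]
    return ccdf


# Made-up degrees: most nodes have few connections, one node has many
toy_degrees = [1, 1, 1, 2, 5]
toy_ccdf = degree_ccdf(toy_degrees)
```

Applied to `net_stats['degree_distribution']` and plotted on log-log axes, a roughly straight line would be consistent with a scale-free network, although it does not prove one.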

In [None]:
import hvplot.networkx as hvnx

spring = hvnx.draw(trn_lcc,
                   node_color = 'lightgreen',
                   node_size = 30,
                   alpha = 0.6, 
                   with_labels=False)

spring.opts(tools = ['hover'],height = 600, width = 600)

Although this is not the best visualization possible with the power of HoloViews, we can now start exploring the network in detail. We can clearly see how the network ramifies at the edges, and if we zoom in on them, we will see some operons like the tryptophan biosynthesis one in the upper left. 

### Clustering the transcriptional regulatory network

So far we've seen that the TRN has a scale-free structure, and that it organizes into well-defined clusters. 

What if we wanted to actually get the clusters directly from the network data? Well, we can do this with a community detection algorithm. One of the most well-established network community algorithms is called the [Louvain algorithm](https://arxiv.org/pdf/0803.0476). The name of the algorithm comes from the fact that the first author works (or worked) at the [University of Louvain in Belgium](https://uclouvain.be/fr/index.html). This algorithm is implemented in the [`python-louvain`](https://python-louvain.readthedocs.io/en/latest/api.html) library in Python. 

In [None]:
import community

We can call the algorithm directly on the largest connected component to extract the network clusters. 

In [None]:
communities_trn = community.best_partition(trn_lcc)

Let's see how many clusters we get. 

In [None]:
# How many clusters do we get with the TRN's LCC?
# Labels start at 0, so count the distinct labels instead of taking the max
n_clusters = len(set(communities_trn.values()))
n_clusters
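
The Louvain algorithm searches for a partition with high modularity, a score that compares the number of edges inside each community with what would be expected given the nodes' degrees. The function below is a plain-Python sketch of that score for an undirected, unweighted graph. Its name and the toy graph (two triangles joined by one edge) are assumptions made for illustration.

```python
def modularity_sketch(edges, partition):
    """Modularity of a partition: sum over communities of
    (internal edges / m) - (total degree / 2m)^2, with m the number of edges."""
    m = len(edges)
    internal = {}    # community -> number of edges with both ends inside it
    degree_sum = {}  # community -> summed degree of its nodes
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single bridging edge (hypothetical node names)
toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
             ('d', 'e'), ('e', 'f'), ('d', 'f'),
             ('c', 'd')]

two_groups = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity_sketch(toy_edges, two_groups)
q_one = modularity_sketch(toy_edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of partition a modularity-based method favors.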

We can now proceed to set the cluster values as node attributes in the `nx.Graph` object. 

In [None]:
nx.set_node_attributes(trn_lcc,
                       values= communities_trn,
                       name = 'modularity')

With this in place, we can now iterate over the nodes in the network to collect the members of each cluster. 

In [None]:
cluster_list = []

In [None]:
for i in range(n_clusters):

    cluster_lcc = [n for n in trn_lcc.nodes()
                   if trn_lcc.nodes[n]['modularity'] == i]

    cluster_list.append(cluster_lcc)
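
The same grouping can be sketched in a single pass over the partition dictionary, which avoids rescanning every node once per cluster. The partition below is a hypothetical stand-in for `communities_trn`.

```python
from collections import defaultdict

# Hypothetical node -> community mapping, shaped like the best_partition output
toy_partition = {'geneA': 0, 'geneB': 1, 'geneC': 0, 'geneD': 2}

# Invert the mapping: community label -> list of member nodes
clusters = defaultdict(list)
for node, label in toy_partition.items():
    clusters[label].append(node)

# Order the clusters by label so that index i holds community i
cluster_list_sketch = [clusters[label] for label in sorted(clusters)]
```

For a network of this size either approach is fast; the single pass mainly pays off on much larger graphs.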

Let's look at the first 10 members of the third cluster (index 2). 

In [None]:
cluster_list[2][:10]

Nice! If we wanted to get a more high-level view of the core clusters of the regulatory net, it would be better to do the clustering on the TF-TF network, that is, to consider only the interactions between transcription factors. You can download the TF-TF network from [RegulonDB](http://regulondb.ccg.unam.mx/menu/download/datasets/) in the TF-TF interactions section. I'll leave this as an exercise for anyone who wants to go further with this analysis. You can then interpret your clusters using information from [EcoCyc](http://ecocyc.org/).

In [None]:
#write your code here

### Extracting the PurR regulon from the regulatory network

Before we end, let's extract the data for the PurR regulon. A regulon is a group of genes co-regulated by a TF or set of TFs. We'll use this annotation to label our data and train an ML model.

In [None]:
df_trn.head()

In [None]:
# Selecting the data corresponding to the PurR regulon
pur_regulon = df_trn[df_trn['tf'] == 'purr']

In [None]:
pur_regulon.shape

In [None]:
#pur_regulon.to_csv('../../data/purr_regulon_rdb.csv', index = False)

The PurR regulon will serve as our training set. But how can we really predict the possible targets of the PurR transcription factor? And furthermore, how can we know if our predictions are right? Well, we'll need another dataset to compare against. 

### The Palsson lab hiTRN

The [Palsson Lab](http://systemsbiology.ucsd.edu/Researchers/Palsson) at UCSD has expanded the TRN of *E. coli* in RegulonDB using a technique called ChIP-seq. They named this expanded network of high-confidence interactions **the hiTRN**. A couple of years ago they published a [study](https://www.pnas.org/content/114/38/10286.long) in which they compiled the knowledge generated over more than a decade of work and made a really neat analysis of the core modules of the network.

Let's load the hiTRN and extract the PurR regulon. This will serve as our test dataset for the ML model.

In [None]:
# Run this cell if you're in Google Colab
# url_hi_trn = 'https://raw.githubusercontent.com/manuflores/grnlearn_tutorial/master/data/hiTRN_palsson_lab.csv'
# hiTRN = pd.read_csv(url_hi_trn)

In [None]:
# Load the hiTRN 
hiTRN = pd.read_csv('../data/hiTRN_palsson_lab.csv')
hiTRN.head()

In [None]:
# Extract the PurR regulon of the hiTRN 
pur_regulon_hi = hiTRN[hiTRN['TF'] == 'PurR']

Because some nodes have more than one interaction, let's just keep the TF and gene columns and drop duplicated interactions. 

In [None]:
purr_hi = pur_regulon_hi[['TF', 'gene']]

In [None]:
# Reassign instead of using inplace=True to avoid modifying a slice in place
purr_hi = purr_hi.drop_duplicates()
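
The filtering, column selection, and deduplication steps can also be chained, and the result compared against the training regulon with set operations. The sketch below runs on made-up rows. The `condition` column and all gene names are hypothetical, and lower-casing before comparing is an assumption prompted by the two tables spelling the TF as `purr` and `PurR`.

```python
import pandas as pd

# Hypothetical rows standing in for the hiTRN table (not real data)
toy_hi = pd.DataFrame({
    'TF': ['PurR', 'PurR', 'PurR', 'OtherTF'],
    'gene': ['geneA', 'geneA', 'geneB', 'geneC'],
    'condition': ['c1', 'c2', 'c1', 'c1'],
})

# Filter, keep two columns, and deduplicate in one chained expression;
# the result is a new DataFrame, so nothing is modified in place
toy_purr = (toy_hi.loc[toy_hi['TF'] == 'PurR', ['TF', 'gene']]
                  .drop_duplicates()
                  .reset_index(drop=True))

# Hypothetical training-set targets, written in lower case
toy_train_targets = {'genea', 'genez'}

# Lower-case the test-set names before comparing the two sets
toy_test_targets = set(toy_purr['gene'].str.lower())
shared = toy_train_targets & toy_test_targets
new_in_test = toy_test_targets - toy_train_targets
```

Targets that appear only in the test set are the interesting ones: they are the interactions a model trained on the first regulon would have to predict.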

Now we can finally save our dataset. 

In [None]:
#purr_hi.to_csv('../../data/purr_regulon_hitrn.csv', index = False)


All right, we're good to go! In this tutorial we have explored a bit of the genetic network of *E. coli* and gotten a general feel for its structure. In the next tutorial we'll use the regulons extracted in this notebook to train our ML model. 
