## Functional prediction of hypothetical TFs in bacteria using supervised machine learning in *E. coli* K-12 

Created by Emanuel Flores-Bautista in 2018.  All code contained in this notebook is licensed under the [Creative Commons License 4.0](https://creativecommons.org/licenses/by/4.0/).

In [None]:
##Import modules, community is the module for clustering networks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import random 
import community
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler as st
import sort_seq as ss
from sort_seq import * 
from sklearn.model_selection import train_test_split

ss.set_plotting_style_2()
##Setting the pyplot figures inside the notebook
%matplotlib inline
#Render the inline figures as PNG
%config InlineBackend.figure_format = 'png'
np.random.seed(42)

In [None]:
path =  '../../../Documents/uni/bioinfo/data/coli/'

### 1. TF-TF network


Now that we have seen that our approach works, let's move on to the TF-TF network. Working with this subset allows us to analyze the layer of the TRN that coordinates the TRN's dynamics.

In [None]:
#Loading the TF-TF TRN, available at RegulonDB

tf_trn = pd.read_csv(path + "tf-tf-l.txt", delimiter= '\t', comment= '#', index_col= False)
tf_trn.head()

In [None]:
del(tf_trn['Unnamed: 5'])

We can use the `.describe()` method to have an overview of our data. 

In [None]:
tf_trn.describe()

We see that there are 456 interactions in the TF-TF network, that CRP has the most outgoing arrows, and that GadX is the node with the most incoming arrows. Close to 70% of the interactions in the TF-TF network are supported by strong evidence.
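The "most outgoing" and "most incoming" nodes reported by `.describe()` can also be read directly off the edge list. The following is a minimal sketch, assuming the `TF` and `TG` column names used when the graph is built; the toy edge list is made up for illustration and is not RegulonDB data.

```python
import pandas as pd

# Toy edge list (hypothetical data) with the same column names as tf_trn
toy_trn = pd.DataFrame({
    "TF": ["crp", "crp", "crp", "fnr", "gade"],
    "TG": ["gadx", "fis", "mara", "gadx", "gadx"],
})

# Out-degree: number of interactions in which each gene acts as regulator
out_degree = toy_trn["TF"].value_counts()
# In-degree: number of interactions in which each gene is the target
in_degree = toy_trn["TG"].value_counts()

top_regulator = out_degree.idxmax()
top_target = in_degree.idxmax()
print(top_regulator, top_target)
```

Running the same two `value_counts()` calls on `tf_trn` should give the full degree distribution rather than only the top entry.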

In [None]:
#Let's turn the TF TRN dataframe into a graph object
net = nx.from_pandas_edgelist(df= tf_trn, source= 'TF', target='TG',
                             edge_attr='regType')

Let's get the global regulators (hubs) of the TF-TF network of *E. coli*.

In [None]:
#Computing the eigencentrality metric on the TF-TF net to get the hubs

tf_hubs = get_network_hubs(net)

tf_hubs

We see that the hubs of the TF-TF network are very similar to those of the complete TRN. This shows that the network contains both global and local regulators, and that hubs regulate both local TFs and TGs.
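The helper `get_network_hubs` comes from the `sort_seq` module and its implementation is not shown here. As a rough illustration of ranking nodes by eigenvector centrality, here is a sketch on a toy graph; the graph, the cutoff of two hubs, and the variable names are assumptions made for this example only.

```python
import networkx as nx

# Toy undirected graph (hypothetical): "crp" is connected to most nodes
toy_net = nx.Graph()
toy_net.add_edges_from([
    ("crp", "fis"), ("crp", "fnr"), ("crp", "mara"), ("crp", "gadx"),
    ("fis", "fnr"), ("gadx", "gade"), ("gade", "ydeo"),
])

# Eigenvector centrality scores a node highly when its neighbours score highly
centrality = nx.eigenvector_centrality(toy_net, max_iter=1000)

# Rank nodes from most to least central and keep the top ones as "hubs"
ranked = sorted(centrality, key=centrality.get, reverse=True)
top_hubs = ranked[:2]
print(top_hubs)
```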

Some nodes in the network are outliers that only regulate themselves (PAR, NAR) or regulate a single gene (toxin-antitoxin systems), so let's extract the TF-TF network's largest connected component (LCC).

In [None]:
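##Computing the LCC
#nx.connected_component_subgraphs was removed in NetworkX 2.4, so we take
#the largest set of connected nodes and build its subgraph instead
net = net.subgraph(max(nx.connected_components(net), key=len)).copy()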

Let's visualize it using the `draw()` function of NetworkX.
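A minimal sketch of such a drawing, using a toy graph instead of the real TF-TF network. The node names, layout seed, and styling values are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph (hypothetical) with one large component and one isolated pair
toy_net = nx.Graph([("crp", "fis"), ("crp", "fnr"), ("fnr", "narl"),
                    ("hipa", "hipb")])

# Keep only the largest connected component
lcc_nodes = max(nx.connected_components(toy_net), key=len)
toy_lcc = toy_net.subgraph(lcc_nodes).copy()

# A seeded spring layout gives the same node positions on every run
pos = nx.spring_layout(toy_lcc, seed=42)

fig, ax = plt.subplots(figsize=(4, 4))
nx.draw(toy_lcc, pos=pos, ax=ax, with_labels=True,
        node_size=600, node_color="lightsteelblue", font_size=8)
```

Replacing `toy_lcc` with `net` would draw the real LCC; for a network of that size, smaller nodes and no labels are likely easier to read.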

### 2. TF-TF Network Clustering using the Louvain algorithm

Now let's cluster the TF-TF network's LCC using the Louvain algorithm. 

In [None]:
##Cluster the TF-TF network LCC

communities = community.best_partition(net)

In [None]:
n_clusters_tf = max(communities.values())

n_clusters_tf

We have 11 clusters.
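`community.best_partition` returns a dictionary that maps each node to an integer community label, with labels starting at 0. The sketch below uses a hand-written partition in that format (the assignments are hypothetical) to show how the number of communities and their members can be recovered. With contiguous labels the largest label is one less than the number of communities, so it is worth checking how `get_network_clusters` interprets the value passed to it.

```python
from collections import defaultdict

# Hypothetical partition in the node -> community label format
toy_partition = {"oxyr": 0, "fur": 0, "soxs": 1, "mara": 1, "rob": 1, "ada": 2}

# Number of distinct communities
n_communities = len(set(toy_partition.values()))

# Group the nodes by community label
members = defaultdict(list)
for node, label in toy_partition.items():
    members[label].append(node)

print(n_communities, dict(members))
```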

In [None]:
##Let's look at the cluster assignment for each TF in the TF-TF network. 
#To do this remove the hash(#) before communities.

#communities

OxyR in Cluster 11, SoxRS/ MarAR in cluster 1...

Now let's add these cluster labels as a node attribute in the network.

In [None]:
nx.set_node_attributes(net, values= communities, name='modularity')

### 3. Expression data pre-processing. 

Before pre-processing the expression data, let's extract the data corresponding to the hypothetical TFs. We'll use it later to make our predictions, i.e. to test what the most likely functional role of each putative TF is inside the cell.

In [None]:
path

In [None]:
tf = pd.read_csv(path + 'tf_list_gene_name.csv', comment = '#')

In [None]:
hyp_tfs = pd.read_csv(path + 'hypTF_list_genes.csv')

In [None]:
hyp_tfs.head(2)

In [None]:
hypTFs = hyp_tfs.hyptfs.values

#tf_list = tf.TF.values

In [None]:
'yjjj' in hypTFs

In [None]:
len(hypTFs)

In [None]:
hypTFs = list(hypTFs)

##### Raw expression data from Colombos

In [None]:
df_x = pd.read_csv(path + "ecoli_exp_data_COLOMBOS.txt", delimiter= '\t', comment= '#')
df_x.head()

In [None]:
df_x.shape

In [None]:
annot, denoised_df = exp_data_preprocessing(df_x)

In [None]:
denoised_df.shape

In [None]:
full_denoised_data = pd.concat([annot, denoised_df], axis = 1)

In [None]:
full_denoised_data.head()

In [None]:
full_denoised_data.shape

### Separating the experimental and hypothetical TFs. 

Before we proceed with our workflow, we have to separate the experimental and hypothetical TFs. This step has two advantages. First, we can train our network with only experimental TFs, and second, we can then use this HypTF dataset to make the functional predictions in one step. 

In [None]:
hypTFs.extend(['dgor', 'ykfn', 'frlr', 'bdcr', 'mqsa', 'fimz', 'sgcr', 'dlmr'])

In [None]:
#Flag each gene as a hypothetical TF (1) or not (0)
full_denoised_data['hyp'] = full_denoised_data['gene name'].isin(hypTFs).astype(int)

In [None]:
##Selecting the HypTFs (copied so that later edits don't touch full_denoised_data)

hyp_tfs_test = full_denoised_data[full_denoised_data['hyp'] == 1].copy()
hyp_tfs_test.head()

In [None]:
hyp_tfs_test.shape

In [None]:
del(hyp_tfs_test['hyp'])

In [None]:
hyp_tfs_test.head(2)

In [None]:
hyp_tfs_test.to_csv('../data/ml_dfs/hyp_tfs_coli_X_test.csv')

### Pre-processing the expression data

In [None]:
##Keeping all of the genes that aren't HypTFs
## We'll use this dataframe for downstream analysis

df_xx = full_denoised_data[full_denoised_data['hyp'] == 0].copy()

In [None]:
df_xx.head(2)

In [None]:
df_xx.shape

In [None]:
del(df_xx['hyp'])

In [None]:
df_xx.head(2)

In [None]:
df_xx.to_csv('../data/ml_dfs/coli_denoised_data.csv')

### Feature selection

In [None]:
from sklearn.feature_selection import (SelectPercentile, mutual_info_classif,
                                       chi2, f_classif)

### Multi-class classification data preparation: Assign the cluster labels to the expression dataframe.

Let's proceed to make the classification. Sidenote: we'll try to not include the global regulators.

First off, let's extract the clusters.

In [None]:
n_clusters_tf

In [None]:
#tf_cluster_list = []
tf_clusters = get_network_clusters(net, n_clusters_tf)

In [None]:
cluster1, cluster2, cluster3, cluster4,\
cluster5, cluster6, cluster7, cluster8, \
cluster9, cluster10, cluster11 = tf_clusters

In the next step, we'll select the regulons for each cluster using the TRN data from RegulonDB, stored in the `trn_df` object. We'll then make a list of the TGs in each regulon, which we'll later use to set the labels of each cluster before proceeding to the classification.

Note: the clusters vary from run to run. We will extract the regulons using a high-confidence run that yielded functionally robust TF clusters.
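The regulon extraction amounts to selecting the rows of the TRN whose regulator belongs to a cluster and collecting the distinct targets. A minimal sketch, assuming the `tf` and `tg` column names of `trn_df`; the helper `regulon_targets` and the toy rows are hypothetical and not part of the original workflow.

```python
import pandas as pd

# Toy TRN (hypothetical rows) with the same column names as trn_df
toy_trn = pd.DataFrame({
    "tf": ["ada", "ada", "aidb", "crp", "ada"],
    "tg": ["alka", "aidb", "aidb", "lacz", "alka"],
})

def regulon_targets(trn, tfs):
    """Return the sorted, de-duplicated target genes of any TF in `tfs`."""
    mask = trn["tf"].isin(tfs)
    return sorted(set(trn.loc[mask, "tg"]))

toy_cluster_tgs = regulon_targets(toy_trn, ["ada", "aidb"])
print(toy_cluster_tgs)
```

Sorting the set gives a reproducible order, which plain `list(set(...))` does not guarantee between sessions.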

Cluster 1 : DNA repair 

In [None]:
trn_df = pd.read_csv(path + 'trn-l.txt', sep = '\t',
                     comment = '#', index_col = False)

In [None]:
trn_df.head(2)

In [None]:
#Let's check the TFs of cluster 1
#print('Some members of cluster 1', cluster1[:5])

#Now let's filter the regulons of each TF from the TRN

cluster_1 = trn_df[trn_df['tf'].isin(['dnaa', 'yedw', 'ydfh', 'phob',
                                      'cusr', 'argp', 'ascg', 'prpr'])]

#Unique target genes (TGs) of cluster 1; the set removes repeated TGs
cluster_1_tgs = list(cluster_1['tg'])
cluster1_tgs = list(set(cluster_1_tgs))

#print('Cluster 1 has {} nodes'.format(len(cluster1_tgs)))

#Let's look at the TFs of cluster 2
#print('Some members of cluster 2', cluster2[:5])

#Filter the regulons of each TF in cluster 2 from the TRN
cluster_2 = trn_df[trn_df['tf'].isin([
    'phop', 'ydeo', 'rutr', 'gade', 'lrp', 'stpa', 'rscb', 'gadw',
    'h-ns', 'leuo', 'adiy', 'evga', 'nemr', 'rcsb-bglj', 'trer', 'cspa',
    'gadx', 'torr', 'hns', 'nhar', 'bglj', 'sdia', 'rcsa'])]

#Unique target genes (TGs) of cluster 2; the set removes repeated TGs
cluster_2_tgs = list(cluster_2['tg'])
cluster2_tgs = list(set(cluster_2_tgs))

#Let's see how many TGs does cluster 2 have 

#print('Cluster 2 has {} nodes'.format(len(cluster2_tgs)))

cluster_3 = trn_df[trn_df['tf'].isin([
    'rob', 'soxr', 'acrr', 'soxs', 'pdel', 'hupa', 'mtlr', 'hupb',
    'baer', 'marr', 'cra', 'decr', 'cpxr', 'mara'])]

#Unique target genes (TGs) of cluster 3; the set removes repeated TGs
cluster_3_tgs = list(cluster_3['tg'])
cluster3_tgs = list(set(cluster_3_tgs))

#print('Cluster 3 has {} nodes'.format(len(cluster3_tgs)))


#Filter the regulons of each TF in cluster 4 from the TRN...

cluster_4 = trn_df[trn_df['tf'].isin(['metj', 'fur', 'oxyr', 'purr', 'metr'])]

#Unique target genes (TGs) of cluster 4; the set removes repeated TGs
cluster_4_tgs = list(cluster_4['tg'])
cluster4_tgs = list(set(cluster_4_tgs))

print('Cluster 4 has {} nodes'.format(len(cluster4_tgs)))


cluster_5 = trn_df[trn_df['tf'].isin([
    'mata', 'csgd', 'mlra', 'puta', 'rsta', 'flhdc', 'ecpr', 'mqsa',
    'flhc', 'flhd', 'sutr', 'basr', 'mqsr', 'ompr', 'bola', 'rcda',
    'fliz', 'hdfr', 'cadc', 'lrha', 'yjjq', 'qseb'])]

cluster_5_tgs = list(cluster_5['tg'])
cluster5_tgs = list(set(cluster_5_tgs))

print('Cluster 5 has {} nodes'.format(len(cluster5_tgs)))


cluster_6 = trn_df[trn_df['tf'].isin([
    'srlr', 'rbsr', 'zrar', 'mhpr', 'malt', 'mali', 'gntr', 'fucr',
    'uxur', 'gutm', 'mlc', 'nagc', 'exur', 'melr', 'lsrr', 'cytr',
    'rhar', 'idnr', 'gutr', 'comr', 'glpr', 'chbr', 'creb', 'laci',
    'rhas'])]

cluster_6_tgs = list(cluster_6['tg'])
cluster6_tgs = list(set(cluster_6_tgs))

print('Cluster 6 has {} nodes'.format(len(cluster6_tgs)))

cluster_7 = trn_df[trn_df['tf'].isin([
    'maze-mazf', 'tdcr', 'yeil', 'hipab', 'ihfb', 'tdca', 'hipb', 'hipa',
    'yiaj', 'maze'])]

cluster_7_tgs = list(cluster_7['tg'])
cluster7_tgs = list(set(cluster_7_tgs))

print('Cluster 7 has {} nodes'.format(len(cluster7_tgs)))


cluster_8 = trn_df[trn_df['tf'].isin(['puur', 'xylr', 'beti', 'lldr', 'arac'])]

cluster_8_tgs = list(cluster_8['tg'])
cluster8_tgs = list(set(cluster_8_tgs))

print('Cluster 8 has {} nodes'.format(len(cluster8_tgs)))


cluster_9 = trn_df[trn_df['tf'].isin([
    'narl', 'pdhr', 'hyfr', 'fhla', 'dcur', 'mode', 'caif', 'nikr',
    'mraz', 'dpia', 'yqji', 'appy', 'sigma54'])]

cluster_9_tgs = list(cluster_9['tg'])
cluster9_tgs = list(set(cluster_9_tgs))

print('Cluster 9 has {} nodes'.format(len(cluster9_tgs)))


cluster_10 = trn_df[trn_df['tf'].isin([
    'norr', 'cbl', 'nsrr', 'fear', 'ntrc', 'cysb', 'glng', 'asnc',
    'dsdc', 'ihfa', 'nac'])]

cluster_10_tgs = list(cluster_10['tg'])
cluster10_tgs = list(set(cluster_10_tgs))

print('Cluster 10 has {} nodes'.format(len(cluster10_tgs)))


cluster_11 = trn_df[trn_df['tf'].isin(['ada', 'aidb'])]

cluster_11_tgs = list(cluster_11['tg'])
cluster11_tgs = list(set(cluster_11_tgs))

print('Cluster 11 has {} nodes'.format(len(cluster11_tgs)))

Cluster 2: Glutamate-dependent acid response (GDAR)

Cluster 3: Multi-stress response, and global regulators

Note: Let's not consider Fis protein.

Cluster 4: Iron, purines, and ROS response

Cluster 5 : Biofilm and motility

Cluster 6 : Central carbon metabolism. Let's not consider CRP. 

#| (trn_df['tf'] =='crp' )
#| (trn_df['tf'] =='ihf' )
#| (trn_df['tf'] =='arca' )
#| (trn_df['tf'] =='fnr' ) 

Cluster 7: Toxin-antitoxin systems (TAS). Let's not consider IHF.

Note-to-self: One can use the cluster dataframes to make subnetwork visualizations

Cluster 8: Carbohydrate metabolism and respiration

Let's not consider ArcA

Cluster 9 : Nitrogen metabolism. Note: Let's not consider Fnr. 

Cluster 10 : aminoacid and

Because cluster 11 is so small compared to the other clusters that it might unbalance the training, we will not consider it in the classification procedure.

In [None]:
#Let's re-check our df_xx dataframe, that corresponds to the annot + exp data

df_xx.head(2)

Let's proceed with the classification procedure.

In [None]:
#Initializing the labels' lists

labels1 = []
labels2 = []
labels3 = []
labels4 = []
labels5 = []
labels6 = []
labels7 = []
labels8 = []
labels9 = []
labels10 = []

In [None]:
##Setting up the labels for each cluster:
##1 if the gene is a TG of that cluster, 0 otherwise

genes = df_xx['gene name']

labels1 = [1 if gene in cluster1_tgs else 0 for gene in genes]    #C1
labels2 = [1 if gene in cluster2_tgs else 0 for gene in genes]    #C2
labels3 = [1 if gene in cluster3_tgs else 0 for gene in genes]    #C3
labels4 = [1 if gene in cluster4_tgs else 0 for gene in genes]    #C4
labels5 = [1 if gene in cluster5_tgs else 0 for gene in genes]    #C5
labels6 = [1 if gene in cluster6_tgs else 0 for gene in genes]    #C6
labels7 = [1 if gene in cluster7_tgs else 0 for gene in genes]    #C7
labels8 = [1 if gene in cluster8_tgs else 0 for gene in genes]    #C8
labels9 = [1 if gene in cluster9_tgs else 0 for gene in genes]    #C9
labels10 = [1 if gene in cluster10_tgs else 0 for gene in genes]  #C10
        
#C11
#for row in df_xx['gene name']:
#    if row in cluster11_tgs:
#        labels11.append(1)
#    else:
#        labels11.append(0)

In [None]:
## Checking if we have correct classification annotation
1 in labels3

Now let's append these lists as columns in the dataframe, one for each cluster's labels.
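The same label columns can be built in one loop over a mapping from column name to target-gene list. This is only a sketch of that idea on toy data; the gene names, the `cond_1` column, and the `toy_cluster_tgs` mapping are assumptions for illustration.

```python
import pandas as pd

# Toy expression table (hypothetical values) keyed by gene name
toy_expr = pd.DataFrame({
    "gene name": ["alka", "lacz", "soda", "fumc"],
    "cond_1": [0.1, -0.3, 1.2, 0.8],
})

# Mapping from label column name to the target genes of that cluster
toy_cluster_tgs = {"cluster 1": ["alka"], "cluster 2": ["soda", "fumc"]}

# One binary column per cluster: 1 if the gene is a target, 0 otherwise
for name, tgs in toy_cluster_tgs.items():
    toy_expr[name] = toy_expr["gene name"].isin(tgs).astype(int)

print(toy_expr)
```

A gene regulated by TFs from several clusters gets a 1 in several columns, which is why the problem is treated as multi-label.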

In [None]:
df_xx['cluster 1'] = labels1
df_xx['cluster 2'] = labels2
df_xx['cluster 3'] = labels3
df_xx['cluster 4'] = labels4
df_xx['cluster 5'] = labels5
df_xx['cluster 6'] = labels6
df_xx['cluster 7'] = labels7
df_xx['cluster 8'] = labels8
df_xx['cluster 9'] = labels9
df_xx['cluster 10'] = labels10

In [None]:
df_xx.head()

Let's check whether we labeled our data set correctly, taking SoxS as an example. Note that the clusters change with each run (because the Louvain algorithm is stochastic), so this step has to be adapted each time.

In [None]:
annot[annot['gene name'] == ('soxs')]  

All the multiple stress response TFs are in cluster 3.

In [None]:
cluster3

Now that we have its location in the network and its cluster label, let's check it in the `df_xx` dataframe.

In [None]:
df_xx.loc[3784, 'cluster 3']

*Voilà.*

Now let's check whether a random TF is correctly labeled as a non-member of cluster 3.

In [None]:
np.random.choice(list(communities.keys()))

In [None]:
'rcsb-bglj' in cluster3

In [None]:
annot[annot['gene name'] == ('rcsb')]  

In [None]:
#The cluster labels live in df_xx, not in denoised_df
df_xx.loc[2083, 'cluster 3']

We're good to go. 

### Dividing the training and test dataset.

To extract the current knowledge we have from the TRN, we will train a neural network with the known TFs and test the network to predict the label association of the hypothetical TFs. 


We'll build the training data from the regulons plus random noise (to avoid overfitting). The random noise will be the expression data for genes that do not appear to be regulated by any TF, according to the RegulonDB TRN.
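The construction described above can be sketched on a toy table: split the genes by a regulated/unregulated flag, draw a fixed-size random sample of the unregulated genes, and stack the two parts. The column names other than `gene name` and `TGs`, and the sample size of 3, are assumptions for this example.

```python
import pandas as pd

# Toy table (hypothetical): 3 regulated genes and 7 unregulated ones
toy = pd.DataFrame({
    "gene name": [f"g{i}" for i in range(10)],
    "cond_1": range(10),
    "TGs": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

regulated = toy[toy["TGs"] == 1]
unregulated = toy[toy["TGs"] == 0]

# Sample unregulated genes without replacement to act as "noise"
noise = unregulated.sample(n=3, replace=False, random_state=42)

# Regulated genes plus noise form the pool used for training and testing
train_pool = pd.concat([regulated, noise]).drop(columns="TGs")
print(train_pool.shape)
```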

In [None]:
tgs_set = set(cluster1_tgs+cluster2_tgs +cluster3_tgs+
             cluster4_tgs + cluster5_tgs + cluster6_tgs + 
             cluster7_tgs + cluster8_tgs + cluster9_tgs+ 
             cluster10_tgs)

TGs_list = [1 if row in tgs_set else 0 for row in df_xx['gene name'] ]

In [None]:
len(TGs_list)

In [None]:
df_xx.head()

In [None]:
##Adding the TG list as a column to the expression data

df_xx['TGs'] = TGs_list

In [None]:
##Let's keep only the genes that are regulated by TFs

regulons_df = df_xx[df_xx['TGs'] == 1].copy()

In [None]:
regulons_df.head(3)

In [None]:
#Let's delete the TGs column

del(regulons_df['TGs'])

In [None]:
regulons_df.shape

In [None]:
##Let's keep only the genes that are not regulated by TFs

non_reg_df = df_xx[df_xx['TGs'] == 0].copy()

In [None]:
non_reg_df.shape

In [None]:
del(non_reg_df['TGs'])

In [None]:
##Making a dataframe called noise, by randomly picking 
##genes that are NOT REGULATED by TFs without replacement

noise = non_reg_df.sample(n = 50, replace = False, axis = 0, random_state = 42)

In [None]:
regulons_with_noise = pd.concat([regulons_df, noise]) ## unbiased train/test dataset 

In [None]:
regulons_with_noise.shape ##Let's look at the nrows and ncols

In [None]:
regulons_with_noise.head(2)

In [None]:
regulons_with_noise.to_csv('../data/ml_dfs/ecoli_ml.csv')

Now we can divide our X and y data. X_data will be pure expression data, and y_data corresponds to the cluster labels for classification.

In [None]:
X_data = regulons_with_noise.iloc[:,:-10]
y_data = regulons_with_noise.iloc[:,-10:]

In [None]:
clus2 = y_data['cluster 2'].values

In [None]:
X_new = SelectPercentile(f_classif, percentile=80).fit_transform(X_data, clus2)
X_new.shape

We'll make a random partition from the `regulons_with_noise` data. 

In [None]:
#The test subset will correspond to 20% of the data at random

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42) 

In [None]:
#Split the data using the top 80% of features selected with the F-test

X_train, X_test, y_train, y_test = train_test_split(X_new, y_data, test_size=0.2, random_state=42) 
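The cells above select features on the full data set and then split it. A common variation, shown here only as a sketch on synthetic data and not as what this notebook does, is to fit the selector on the training portion and reuse it on the test portion, so that the test genes do not influence which features are kept. The array sizes and the rule that generates `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 20))        # 60 genes x 20 conditions (synthetic)
y_toy = (X_toy[:, 0] > 0).astype(int)    # label driven by the first condition

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Fit the F-test selector on the training genes only
selector = SelectPercentile(f_classif, percentile=80).fit(X_tr, y_tr)

# Apply the same column selection to both partitions
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)
print(X_tr_sel.shape, X_te_sel.shape)
```

The fitted `selector` could be applied in the same way to the hypothetical-TF expression table before prediction, so that its columns match those seen during training.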


### Multi-class Neural Network using Keras 

Now we're going to train a neural network using Keras. 

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from keras.metrics import categorical_accuracy
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder


The output layer has 10 units, one per cluster label.

In [None]:
#Hidden layer with softmax activation; the input size follows the training data
model = Sequential()
model.add(Dense(units=1000, activation='softmax', input_dim=X_train.shape[1]))
model.add(Dense(units=10)) #10 outputs, one per cluster label
model.compile(loss='mse', optimizer='RMSprop', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=100, batch_size=200)

In [None]:
import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook() #display Bokeh plots inline

x = bokeh.plotting.figure(height=400,
                          width=650,
                          x_axis_label='epoch', 
                          y_axis_label='accuracy',
                         y_range=(0, 1), title= 'Model Training Accuracy')

x.circle(x = np.arange(1,101,1), y =history.history['acc'], fill_alpha = 0.5)
#x.line(t, p[:,1])
#x.line(t, p[:,2])
bokeh.io.show(x)

In [None]:
sns.set_style('whitegrid')

In [None]:
# Keras simulations using Matplotlib 

n_simulations = 30

train_acc = []
test_acc = []

for i in range(n_simulations):
    
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data
                                                        , test_size=0.3, random_state=42) 
    
    #Hidden layer with softmax activation; input size taken from the training data
    model = Sequential()
    model.add(Dense(units=1000, activation='softmax', input_dim=X_train.shape[1]))
    model.add(Dense(units=10)) #10 outputs, one per cluster label
    model.compile(loss= 'mse', optimizer='RMSprop', metrics= ['accuracy'])
    history = model.fit(X_train, y_train, epochs=80, batch_size= 200)

    accuracy = history.history['acc']
    loss = history.history['loss']
    train_acc.append(accuracy[79])
    
    score = model.evaluate(X_test, y_test,verbose=0)
    test_acc.append(score[1])

    # summarize history for accuracy/loss
    plt.plot(accuracy, 'o', color = 'royalblue', alpha = 0.3, markersize= 5)
    plt.plot(loss, 'o', color = 'orangered', alpha = 0.3, markersize= 5)
    plt.title('Keras Model Training $E. coli$ ', fontsize = 16)
    plt.ylabel('Acc / Loss ')
    plt.xlabel('epoch')
    plt.ylim(0,1.05)
    plt.legend(['Acc.','Loss' ], loc='best')
    
plt.savefig('keras-model-train-ecoli.tiff', dpi = 350)    

In [None]:
organism = ['$E. coli$'] * len(train_acc)
train = ['train'] * len(train_acc)
test = ['test'] * len(train_acc)

x = list(zip(train_acc, organism,train))
y = list(zip(test_acc, organism, test))


entries= x + y 

ecoli_df = pd.DataFrame(entries, columns=['accuracy', 'organism', 'type'])

ecoli_df.to_csv('ecoli-model.csv')

sns.violinplot(x = 'organism', y = 'accuracy', hue = 'type', data = ecoli_df,
               inner = 'quartile',palette = 'pastel')


plt.ylim(.3, 1.01)

In [None]:
y_pred = model.predict(X_test, batch_size =100)
y_pred_flat = np.round(y_pred.flatten())
y_test_flat = y_test.values.flatten()

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
conf_mat = confusion_matrix(y_test_flat, y_pred_flat)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=['not in cluster', 'inside cluster'], yticklabels=['not in cluster', 'inside cluster'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Keras Classifier confusion matrix $E. coli$')
#plt.savefig('conf-mat-ecoli-keras.tiff', dpi = 350)

Let's try other parameters. 

In [None]:
#Relu activation

model = Sequential()
model.add(Dense(units=2000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(units=10)) #10 outputs, one per cluster label
model.compile(loss='mse', optimizer='SGD', metrics= ['accuracy'])
history = model.fit(X_train, y_train, epochs=80, batch_size= 100)

In [None]:
x = bokeh.plotting.figure(height=400,
                          width=650,
                          x_axis_label='epoch', 
                          y_axis_label='accuracy',
                         y_range=(0, 1), title= 'Model Training Accuracy')

x.circle(x = np.arange(1,81,1), y =history.history['acc'], fill_alpha = 0.5)
#x.line(t, p[:,1])
#x.line(t, p[:,2])
bokeh.io.show(x)

In [None]:
#Softmax activation with 200 epochs

model = Sequential()
model.add(Dense(units=500, activation='softmax', input_dim=X_train.shape[1]))
model.add(Dense(units=10))# 10 output
model.compile(loss='mse', optimizer= 'RMSprop', metrics= ['accuracy'])
history = model.fit(X_train, y_train, epochs=200, batch_size= 100)

In [None]:
x = bokeh.plotting.figure(height=400,
                          width=650,
                          x_axis_label='epoch', 
                          y_axis_label='accuracy',
                         y_range=(0, 1), title= 'Model Training Accuracy')

x.circle(x = np.arange(1,201,1), y =history.history['acc'], fill_alpha = 0.5)

bokeh.io.show(x)

### Comparison with other ML algorithms 

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neural_network import MLPClassifier

In [None]:
perceptron = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(1200, 20), random_state=42)


In [None]:
clf = MultiOutputClassifier(perceptron)
clf.fit(X_train, y_train)

y_pred= clf.predict(X_test)

In [None]:
y_pred

In [None]:
y_pred_df = pd.DataFrame(y_pred)

In [None]:
df_pred = y_pred_df.apply(lambda x: x.idxmax(), axis = 1)
df_pred.tail()

In [None]:
#y_test is the label dataframe produced by train_test_split
df_test = y_test.apply(lambda x: x.idxmax(), axis = 1)

In [None]:
df_test.head()

In [None]:
df_test_list = list(df_test)

In [None]:
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(df_test_list)


In [None]:
integer_encoded[:5]

In [None]:
df_test.head(3)

In [None]:
cnf_matrix = confusion_matrix(integer_encoded, df_pred)
np.set_printoptions(precision=2)


In [None]:
class_names = list(y_test.columns)

In [None]:
plt.figure(figsize = (10,10))
plot_confusion_matrix(cnf_matrix, classes=class_names,normalize = True,
                      title='Confusion Matrix')

In [None]:
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

In [None]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

In [None]:
label_encoder.inverse_transform([argmax(onehot_encoded[2, :])])

In [None]:
accuracy_score(y_test, y_pred)
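For multi-label targets, `accuracy_score` reports subset accuracy: a gene counts as correct only if all of its cluster labels match. The flattened confusion matrices used in this notebook correspond instead to accuracy over individual labels. The sketch below contrasts the two on small hand-made arrays; the values are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rows are genes, columns are cluster labels (hypothetical values)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0]])

# Fraction of genes whose whole label row is predicted exactly
subset_acc = accuracy_score(y_true, y_hat)

# Fraction of correct entries per cluster column, and over all entries
per_label_acc = (y_true == y_hat).mean(axis=0)
flat_acc = (y_true == y_hat).mean()

print(subset_acc, per_label_acc, flat_acc)
```

Subset accuracy is the stricter of the two, so it can look low even when most individual labels are right.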

In [None]:
y_test_flat = y_test.values.flatten()
y_pred_flat = y_pred.flatten()

In [None]:
conf_mat = confusion_matrix(y_test_flat, y_pred_flat)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=['not in cluster', 'inside cluster'], yticklabels=['not in cluster', 'inside cluster'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('MLP Confusion Matrix $E. coli$')

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
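The normalization step inside `plot_confusion_matrix` divides each row by its total, so every row of the normalized matrix sums to 1 and each diagonal entry becomes the recall of that class. A small check on a made-up 2 x 2 matrix:

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm = np.array([[8, 2],
               [1, 9]])

# Divide each row by its row total (same expression as in the function above)
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm)
```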

In [None]:
from sklearn import svm, datasets

In [None]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)

In [None]:
y_test

In [None]:
perceptron.fit(X_train, y_train) 

y_pred= perceptron.predict(X_test)
y_test_flat = y_test.values.flatten()
y_pred_flat = y_pred.flatten()

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
conf_mat = confusion_matrix(y_test_flat, y_pred_flat)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=['not in cluster', 'inside cluster'], yticklabels=['not in cluster', 'inside cluster'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('MLP Confusion Matrix $E. coli$')
plt.savefig('conf-mat-ecoli-mlp.tiff', dpi = 350)

In [None]:
from sklearn.metrics import classification_report,accuracy_score
print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=30, max_depth=30, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_flat = y_pred.flatten()

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
conf_mat = confusion_matrix(y_test_flat, y_pred_flat)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=['not in cluster', 'inside cluster'], yticklabels=['not in cluster', 'inside cluster'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Random Forest Confusion Matrix $E. coli $')
plt.savefig('conf-mat-ecoli-rf.tiff', dpi = 350)

In [None]:
print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

### Conclusion. 

We can see that the Keras model can make sense of highly noisy expression data that might not represent the best dataset for training (i.e. it contains conditions that might not be related to the TRN wiring per se, or to the evolutionary history of *E. coli*). Still, we get ~80% accuracy in training. The next step would be to make the predictions for each hypothetical TF, to find out which functional module it belongs to. However, we will not do that until we have exhaustively confirmed that this is the best classification accuracy we can get.


It's important to emphasize that *E. coli* might express some of these hypothetical TFs at low levels. They might therefore exert barely any significant regulation inside the cell while still not being an energetically expensive genomic accessory (i.e. not wasting energy in transcription/translation). One possible explanation for keeping these TFs is that they are a genomic arsenal for coordinating transcriptional programs in response to future events, which might confer an evolutionary advantage on the bacterium. In other words, this extended transcriptional repertoire might be an arsenal for future adverse conditions. However, this suggestion has to be tested experimentally.

Lastly, I'll print out the versions of the most important Python modules used in the workflow for reproducibility purposes.

In [None]:
import sklearn

In [None]:
import matplotlib

In [None]:
print(keras.__version__)
print(sklearn.__version__)
print(np.__version__)##numpy version
print(nx.__version__)##NetworkX
print(matplotlib.__version__)
print(sns.__version__)#Seaborn
print(pd.__version__)#Pandas