## Supervised learning of a simple genetic network in *E. coli*

Content here is licensed under a CC 4.0 License. The code in this notebook is released under the MIT license. 


By Manu Flores. 

In [None]:
import grn_learn as g
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import matplotlib as mpl
from scipy.stats import pearsonr

import bebi103 #jbois' library 
import hvplot
import hvplot.pandas
import holoviews as hv
from holoviews import dim, opts
import bokeh_catplot
import bokeh 
import bokeh.io
from bokeh.io import output_file, save, output_notebook


output_notebook()
hv.extension('bokeh')
np.random.seed(42)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

g.set_plotting_style()
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

%load_ext autoreload
%autoreload 

### Load in the RNA-seq dataset

Story of the data. Citation : y-ome. 

In [None]:
df = pd.read_csv('palsson_rna_seq.csv')

In [None]:
df.head()

In [None]:
data_ = df.copy()

In [None]:
annot = data_.iloc[:, :2]

In [None]:
annot.head()

In [None]:
data = data_.iloc[:, 2:]

### Data preprocessing. 

Let's start our data analysis pipeline by normalizing the data and looking for null values.

In [None]:
from sklearn.preprocessing import StandardScaler as scaler 

In [None]:
ss = scaler()
norm_data = ss.fit_transform(data)

Let's check if the data has any null entries.

In [None]:
norm_data= pd.DataFrame(norm_data, columns = data.columns)
norm_data.describe()

It looks like there are none. We can quickly verify this using the `pd.notnull` function from pandas.

In [None]:
np.all(pd.notnull(norm_data))

All right, we're good to go!

### Load in PurR regulon datasets

Now we can go ahead and load the PurR regulon datasets. 

In [None]:
purr_regulondb = pd.read_csv('../../data/purr_regulon_db.csv')

In [None]:
purr_hi = pd.read_csv('../../data/purr_regulon_hitrn.csv')

In [None]:
print('The RegulonDB has %d nodes and the hiTRN has %d nodes \
for the PurR regulon genetic network respectively.'%(purr_regulondb.shape[0], purr_hi.shape[0]))

Let's extract the target genes (TGs) as a `np.array` and get the genes that were discovered by the Palsson Lab.

In [None]:
purr_rdb_tgs = np.unique(purr_regulondb.tg.values)

In [None]:
len(purr_rdb_tgs)

In [None]:
purr_hi_tgs = np.unique(purr_hi.gene.values)

purr_hi_tgs = [gene.lower() for gene in purr_hi_tgs]

In [None]:
new_purr_tgs = set(purr_hi_tgs) - set(purr_rdb_tgs)

new_purr_tgs

We can see that the hiTRN indeed has 5 more interactions. Let's see if we can accurately predict these interactions directly from the RNA-seq data.

### Visualize correlation

Before applying an ML model to our data, let's do a simple exploratory data analysis (EDA). As I said in the presentation, the notion that makes this approach biologically plausible is that **genes that are coexpressed are probably coregulated**. A simple proxy for coexpression is correlation across expression conditions.
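
To make the correlation proxy concrete, here is a minimal sketch on synthetic data. The three expression profiles below are made up for illustration and are not taken from the RNA-seq dataset: a hypothetical regulator, a target that tracks it, and a gene drawn independently of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression profiles across 100 conditions (synthetic values)
n_conditions = 100
regulator = rng.normal(loc=5.0, scale=1.0, size=n_conditions)

# A coexpressed target: follows the regulator plus a little noise
coexpressed = 2.0 * regulator + rng.normal(scale=0.5, size=n_conditions)

# An unrelated gene: drawn independently of the regulator
unrelated = rng.normal(loc=5.0, scale=1.0, size=n_conditions)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D arrays."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c @ y_c) / np.sqrt((x_c @ x_c) * (y_c @ y_c))


r_coexpressed = pearson_r(regulator, coexpressed)
r_unrelated = pearson_r(regulator, unrelated)

print('r(regulator, coexpressed) = %.2f' % r_coexpressed)
print('r(regulator, unrelated)   = %.2f' % r_unrelated)
```

The coexpressed pair should give a coefficient close to 1, while the independent pair should stay near 0. Real expression data will usually fall somewhere in between.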

Let's make a couple of plots to check that the test genes we're looking for are indeed correlated with purr, and whether this relationship looks linear. We'll use the Seaborn library in this case because it has a nice feature that lets us embed a statistic in the plot.

In [None]:
def corr_plot(data, gene_1, gene_2):
    """
    Scatter plot to assess the correlation between two genes. 
    
    Parameters
    -----------
    * data (pd.DataFrame): Input dataframe from which to pull out the expression data.
      The first two columns are assumed to be annotations. 
    
    * gene_1, gene_2 (str): gene names of the genes to visualize.
    
    Returns 
    ---------
    * fig (sns.JointGrid) : sns.jointplot hardcoded to be a scatterplot of the genes. 
    
    """
    gene_1_data = data[data['gene_name'] == gene_1]
    
    assert gene_1_data.shape[0] == 1, 'Gene 1 not in dataset'
    
    # Expression values start at the third column (the first two are annotations)
    gene_1_vals = gene_1_data.iloc[:, 2:].values.T
    
    gene_2_data = data[data['gene_name'] == gene_2]
    
    assert gene_2_data.shape[0] == 1, 'Gene 2 not in dataset'
    
    gene_2_vals = gene_2_data.iloc[:, 2:].values.T
    
    df_plot = pd.DataFrame({gene_1: gene_1_vals.flatten(),
                            gene_2: gene_2_vals.flatten()})
    
    # jointplot creates its own figure, so no plt.figure() call is needed
    fig = sns.jointplot(data = df_plot, 
                  x = gene_1,
                  y = gene_2,
                  alpha = 0.5,
                  color = 'dodgerblue');
    
    # stat_func was removed from sns.jointplot in seaborn 0.11,
    # so we annotate Pearson's r by hand
    r, _ = pearsonr(df_plot[gene_1], df_plot[gene_2])
    fig.ax_joint.annotate('pearsonr = %.2f'%r, xy = (0.05, 0.9),
                          xycoords = 'axes fraction')
    
    return fig

We can now iterate over the putative TGs and plot them against PurR. In the following plots, each dot represents the expression level (in [FPKM](https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/), a proxy for the number of mRNA counts for a given gene) of both genes in a specific expression condition. 

In [None]:
for new_tg in new_purr_tgs: 
    
    corr_plot(df, 'purr', new_tg);

We can see that some, but not all, of the genes are strongly correlated with PurR. This is expected: the TRN has a lot of feedback, so even when PurR regulates a given gene, other TFs may also be controlling that target gene.

### Filter noise using PCA. 

Principal component analysis (PCA) is a widely used technique in unsupervised learning to perform dimensionality reduction. One can also use PCA as a "noise reduction" technique: projecting the data into a smaller latent space and reconstructing the dataset from that space forces the reconstruction to keep only the dominant features of the data. Specifically, the principal components that span the latent space are the orthogonal directions along which the variance of the dataset is largest.
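
The noise-reduction idea can be sketched on synthetic data. The example below is an illustration under stated assumptions, not part of the analysis: it builds made-up rank-2 data, adds Gaussian noise, and performs PCA through an SVD of the centered matrix, which is the same project-then-reconstruct operation as `transform` followed by `inverse_transform`. All names ending in `_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic low-rank "clean" data: 200 samples in 20 dimensions, rank 2
n_samples, n_features, rank = 200, 20, 2
scores = rng.normal(size=(n_samples, rank))
loadings = rng.normal(size=(rank, n_features))
clean_demo = scores @ loadings

# Add isotropic Gaussian noise
noisy_demo = clean_demo + rng.normal(scale=0.5, size=clean_demo.shape)

# PCA via SVD of the centered data; rows of Vt are the principal axes
mean_demo = noisy_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(noisy_demo - mean_demo, full_matrices=False)

# Project onto the first `rank` principal components, then reconstruct
components = Vt[:rank]
latent_demo = (noisy_demo - mean_demo) @ components.T
denoised_demo = latent_demo @ components + mean_demo

# Mean squared error against the clean data, before and after
mse_noisy = np.mean((noisy_demo - clean_demo) ** 2)
mse_denoised = np.mean((denoised_demo - clean_demo) ** 2)
print('MSE noisy: %.3f, MSE denoised: %.3f' % (mse_noisy, mse_denoised))
```

Because the noise is spread over all 20 dimensions while the signal lives in 2, dropping the other directions removes most of the noise. With real data the true rank is unknown, so the number of components kept is a judgment call.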

First, let's explore the dimensionality of our RNA-seq dataset. 

In [None]:
from sklearn.decomposition import PCA 

In [None]:
pca = PCA()
pca = pca.fit(norm_data)

In [None]:
cum_exp_var = np.cumsum(pca.explained_variance_ratio_)

# look at it
plt.figure(figsize = (6,4))
plt.plot(cum_exp_var*100, color = 'dodgerblue') #because LA
plt.xlabel('Number of dimensions', fontsize= 16)
plt.ylabel('Cumulative variance percentage', fontsize = 16)
plt.title('PCA Explained Variance');

In [None]:
print('The first five principal components explain %.2f of the variance in the dataset.'%cum_exp_var[4])

We can see that the dataset has a low effective dimensionality. We can now project into the subspace that retains 95% of the variance and reconstruct the dataset.

In [None]:
pca = PCA(0.95).fit(norm_data)
latent = pca.transform(norm_data)
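
Passing a float such as `0.95` to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance passes that fraction. A minimal sketch of that selection rule, using made-up explained-variance ratios rather than the ones from this dataset:

```python
import numpy as np

# Hypothetical explained-variance ratios (made up; they sum to 1)
explained_variance_ratio = np.array([0.60, 0.20, 0.10, 0.06, 0.03, 0.01])
cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.95

# First index where the cumulative variance reaches the threshold, plus one
# to turn the zero-based index into a component count
n_components_demo = int(np.argmax(cumulative >= threshold)) + 1
print(n_components_demo)
```

For the fitted model in this notebook, the number actually kept can be read from `pca.n_components_`.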

In [None]:
reconstructed = pca.inverse_transform(latent)

In [None]:
recon_df= pd.DataFrame(reconstructed, columns = data.columns)

In [None]:
df.iloc[:, :2].shape, recon_df.shape

In [None]:
recon_df_ = pd.concat([df.iloc[:, :2], recon_df], axis = 1)

In [None]:
recon_df_.head()

### Visualize correlation again. 

Let's visualize the dataset again. 

In [None]:
for new_tg in new_purr_tgs: 
    
    corr_plot(recon_df_, 'purr', new_tg);

We can see that in the reconstructed space the relationship between the genes is tighter: by discarding the low-variance components, we've constrained the data to have a larger covariance.

### Visualize in PCA space

Given that we already have the projection of our dataset into a smaller dimension, we can also visualize all of the genes in the first two principal components. 

In [None]:
hv.Points((latent[: , 0], latent[: , 1])).opts(xlabel = 'principal component 1',
                                               ylabel = 'principal component 2',
                                               color = '#1E90FF', 
                                               size = 5, 
                                               alpha = 0.15, 
                                               padding = 0.1, 
                                               width = 400)

We cannot really see a specific structure in the first two components. Maybe a non-linear dimensionality reduction technique such as UMAP could do a better job to get the clusters in higher dimensions. We'll come back to that in the next tutorial. 

### Annotate datasets

Now that we have preprocessed our data we can proceed to annotate it. Specifically, we want to label each gene according to whether it is inside the PurR regulon or not. 

First off, let's generate our test set. We'll use a helper function that lets us filter the dataframe.

In [None]:
def get_gene_data(data, gene_name_column, test_gene_list):
    
    """
    Extract data from specific genes given a larger dataframe.
    
    Parameters
    ------------
    
    * data (pd.DataFrame): large dataframe from where to filter.
    * gene_name_column (str): column to filter from in the dataset.
    * test_gene_list (array-like) : a list of genes you want to get. 
    
    Returns
    ---------
    * gene_profiles (pd.DataFrame) : dataframe with the genes you want
    
    """
    
    # Boolean mask of the rows whose gene name is in the list
    mask = data[gene_name_column].isin(test_gene_list)
    
    gene_profiles = data[mask].drop_duplicates()
    
    return gene_profiles 

Let's make a binary label vector that encodes whether each gene is an element of the PurR regulon (1) or not (0).

In [None]:
one_hot = [1 if row in purr_hi_tgs else 0 for row in recon_df_['gene_name'].values]

In [None]:
recon_df_['output'] = one_hot

In [None]:
recon_df_.head()

In [None]:
test_purr_tgs  = list(new_purr_tgs)

In [None]:
test = get_gene_data(recon_df_, 'gene_name', test_purr_tgs)

In [None]:
test.head()

In [None]:
type(test)

Let's drop these test genes from the reconstructed dataset. 

In [None]:
recon_df_non_regulon = recon_df_.copy().drop(test.index.to_list())

Nice! Now we can go ahead and add some "noise" to our test dataset, in the sense that we need to test if our algorithm can point out negative examples. 

In [None]:
noise = recon_df_non_regulon.sample(n = 30, replace = False,
                         axis = 0, random_state = 42)

Let's merge both of these dataframes to get an unbiased test set.

In [None]:
df_test_unb = pd.concat([test, noise]) ## unbiased test 

In [None]:
df_test_unb.shape

In [None]:
df_test_unbiased = df_test_unb.copy().reset_index(drop= True)

In [None]:
df_test_unbiased.head()

In [None]:
df_test_unbiased.shape

In [None]:
# Drop the sampled "noise" genes too, so no test gene is also seen during training
df_train = recon_df_non_regulon.drop(noise.index)
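
When a test set is assembled from held-out positives plus sampled negatives, it is worth checking that no gene ends up in both the training and the test tables. The sketch below shows one way to do that on a toy table. The gene names, values, and the choice to sample negatives only from genes labeled 0 are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the labeled gene table (gene names and values are made up)
toy = pd.DataFrame({
    'gene_name': ['g%d' % i for i in range(10)],
    'expr': range(10),
    'output': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Hold out one positive gene for testing
held_out_pos = toy[toy['gene_name'].isin(['g2'])]
remaining = toy.drop(held_out_pos.index)

# Sample test negatives only from the genes labeled 0
negatives = remaining[remaining['output'] == 0].sample(n=3, random_state=42)

toy_test = pd.concat([held_out_pos, negatives])

# Remove the sampled negatives from the training pool as well
toy_train = remaining.drop(negatives.index)

# The two tables should share no genes
overlap = set(toy_train['gene_name']) & set(toy_test['gene_name'])
print(len(toy_train), len(toy_test), overlap)
```

An empty `overlap` set means the evaluation is not inflated by genes the model has already seen.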

### Train-test split

In [None]:
df_train.head()
df_test_unbiased.head()

In [None]:
df_train.shape
df_test_unbiased.shape

In [None]:
X_train = df_train.iloc[:, 2: -1].values
y_train = df_train.iloc[:,  -1].values

In [None]:
X_train[:5, :5]
y_train[:5]

In [None]:
X_test = df_test_unbiased.iloc[:, 2:-1].values

y_test = df_test_unbiased.iloc[:, -1].values

In [None]:
X_test[:5, :5]
y_test[:5]

### Balance dataset using SMOTE

In [None]:
pd.Series(y_train).value_counts()

In [None]:
pd.Series(y_test).value_counts()

In [None]:
from imblearn.over_sampling import SMOTE

#resampling is done on training dataset only
# fit_sample was renamed to fit_resample in recent imbalanced-learn versions
X_train_res, y_train_res = SMOTE(random_state = 42).fit_resample(X_train, y_train)
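
For intuition, SMOTE balances the classes by creating synthetic minority samples that lie on the line segment between a minority sample and one of its nearest minority neighbors. The function below is a simplified sketch of that interpolation idea, not imbalanced-learn's implementation. The name `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every minority sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # pick a minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its neighbors
        gap = rng.random()                     # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic


# Toy minority class: 5 points in 2D (made-up values)
X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
X_new = smote_like_oversample(X_min, n_new=20)
print(X_new.shape)
```

Since every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already covered by the minority class.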

### Linear SVM 

In [None]:
from sklearn.svm import LinearSVC

In [None]:
linear_svm_clf = LinearSVC()

In [None]:
linear_svm_clf.fit(X_train_res, y_train_res)

In [None]:
predictions = linear_svm_clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, predictions)
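
Accuracy alone can be misleading on an imbalanced test set. As a sketch, assume the test set contains 5 positives and 30 negatives (this holds only if all five held-out genes are present in the dataset and none of the 30 sampled genes is a regulon member). A trivial classifier that never predicts a regulon member then scores well on accuracy while finding nothing:

```python
import numpy as np

# Assumed test composition: 5 positives and 30 negatives
y_true_demo = np.array([1] * 5 + [0] * 30)

# A trivial classifier that never predicts a regulon member
y_pred_demo = np.zeros_like(y_true_demo)

accuracy_demo = np.mean(y_true_demo == y_pred_demo)

# Recall: fraction of the true positives that were recovered
true_pos = np.sum((y_true_demo == 1) & (y_pred_demo == 1))
false_neg = np.sum((y_true_demo == 1) & (y_pred_demo == 0))
recall_demo = true_pos / (true_pos + false_neg)

print('accuracy = %.3f, recall = %.3f' % (accuracy_demo, recall_demo))
```

This is why the per-class precision and recall are more informative here than the overall accuracy.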

In [None]:
from sklearn.metrics import classification_report

In [None]:
# output_dict=True returns the report as a dict, which pandas can tabulate
pd.DataFrame(classification_report(y_test, predictions, output_dict = True)).T

In [None]:
predictions == y_test

We ca

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
ada = AdaBoostClassifier()

In [None]:
ada.fit(X_train, y_train)

In [None]:
ada_pred = ada.predict(X_test)

In [None]:
print(classification_report(y_test, ada_pred))

Probably overfit. 

### Keras neural net. 

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.metrics import categorical_accuracy


In [None]:
X_test.shape[1]

In [None]:
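model = Sequential()
# ReLU in the hidden layer; the sigmoid output gives a probability for the binary label
model.add(Dense(units=64, activation='relu', input_dim= X_test.shape[1]))
model.add(Dense(units=1, activation='sigmoid')) # one output
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics= ['accuracy'])

history = model.fit(X_train_res, y_train_res, epochs=10, batch_size=32)
# The history key is 'accuracy' in recent Keras versions and 'acc' in older ones
accuracy = history.history.get('accuracy', history.history.get('acc'))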

### Cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score?

In [None]:
cross_val_score(linear_svm_clf, X_train, y_train, 
               cv = 5)

### Make pipeline

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.compose import ColumnTransformer, make_column_transformer

In [None]:
df_train.head()
df_test_unbiased.head()

In [None]:
df_master = pd.concat([df_train, df_test_unbiased])

In [None]:
df_master.tail()

In [None]:
make_pipeline?

In [None]:
pipe = make_pipeline(scaler(), LinearSVC())

In [None]:
pipe

In [None]:
pipe.fit(X_train, y_train)

In [None]:
preds = pipe.predict(X_test)

In [None]:
preds == y_test

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
sns.heatmap(confusion_matrix(y_test, preds) / confusion_matrix(y_test, preds).sum(axis = 0),
                             cmap = 'viridis_r')
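
The heatmap above divides the confusion matrix by `sum(axis = 0)`, that is, by its column sums. In scikit-learn's convention rows are true classes and columns are predicted classes, so this normalizes per predicted class. A short sketch with a made-up matrix shows the difference from normalizing per true class:

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes
cm_demo = np.array([[25, 5],
                    [1, 4]])

# Dividing by the column sums normalizes per predicted class:
# the diagonal then holds the precision of each class
by_predicted = cm_demo / cm_demo.sum(axis=0)

# Dividing by the row sums (kept as a column vector) normalizes per true class:
# the diagonal then holds the recall of each class
by_true = cm_demo / cm_demo.sum(axis=1, keepdims=True)

print(by_predicted)
print(by_true)
```

Which normalization to plot depends on the question: precision answers "how many predicted regulon members are real", recall answers "how many real regulon members were found".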