In [2]:
import math
import pandas as pd
%matplotlib inline



# ADAGE-Based Integration of Publicly Available _Pseudomonas aeruginosa_ Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions

Tan J, Hammond JH, Hogan DA, Greene CS. 2016, 
mSystems: Volume 1 Issue 1 e00025-15

http://msystems.asm.org/content/1/1/e00025-15

Presentation by Lisa Cohen,
ECE 221,
October 13, 2016

# Background, importance
* Gene expression = mRNAs transcribed from genes prior to translation into protein
* Contains information about organism's functioning state and capacity to respond
* Less-well-studied organisms (nonmodel) are challenging: how to assign gene annotations?
* Yet: **"One of the great unifying principles of modern biology is that organisms show marked similarity in their major pathways of metabolism."** --Garrett & Grisham. Biochemistry
* Evolution is giving us a glimpse of what is important
* We are in an exciting time! Growing databases of sequences and gene expression data: NCBI-SRA, GEO
* Why not use these data to learn?

# Neural Network - Autoencoder

* **ADAGE**: **A**nalysis using **D**enoising **A**utoencoders of **G**ene **E**xpression
* Type of unsupervised learning
* Input: unlabeled sample _x_ is a vector, no associated metadata
* Purpose: extract meaningful features from hidden nodes
* Training data
* Goal: minimize distance between output and input
* Videos: https://youtu.be/FzS3tMl4Nsc, https://www.youtube.com/watch?v=t2NQ_c5BFOc


# This paper - Detail of method
* All Affymetrix GeneChips microarray data for _Pseudomonas aeruginosa_ were downloaded from [ArrayExpress database](https://www.ebi.ac.uk/arrayexpress/)
* 950 arrays and 109 experiments

In [13]:
# original data dimensions
compendium = pd.read_table("data/inline-supplementary-material-1.txt", sep="\t")
compendium.head()
# rows = genes, columns = gene identifier + one column per array
print(compendium.shape)

(5549, 951)


* Added random noise to corrupt the compendium by setting a random subset of gene values to 0
* Trained a neural network with hidden nodes on the corrupted data
* The network learns to remove the added noise and reconstruct the original data
* Why add noise: learning the identity mapping (X = Y) is not enough; the point is to discover new features in the hidden nodes
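
The corruption step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `corrupt`, the `corruption_level` values, and the toy sample are assumptions made for the example.

```python
import numpy as np

def corrupt(x, corruption_level=0.1, rng=None):
    """Return a copy of x with a random fraction of entries set to 0 (masking noise)."""
    # assumption: a fixed seed is used by default so the example is repeatable
    rng = rng if rng is not None else np.random.RandomState(0)
    # each entry is kept with probability 1 - corruption_level, otherwise zeroed
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)
    return x * mask

# hypothetical sample: 10 genes with expression already scaled to the 0-1 range
x = np.linspace(0.1, 1.0, 10)
x_tilde = corrupt(x, corruption_level=0.3)
print(x_tilde)
```

Every entry of `x_tilde` is either the original value or 0; the network is then asked to recover `x` from `x_tilde`.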

# Denoising Autoencoder - How it works
<img src="Vincent_2008_Fig1.png" width="800">
[Vincent et al. 2008](https://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf)
<img src="Vincent_2008_Fig2.png" width="800">

![](Tan_etal_2016_Fig1.png)

# ADAGE - Algorithm
<img src="Tan_etal_2016_equations.png" width="1000">
* input = _x_, corrupted input with Weights (W) = _A_, b = bias
* apply the sigmoid function, s, to get the hidden representation _y_
* reconstruct the input as _z_ by applying s again, using W' (the transpose of W)
* L_H is the cross-entropy between input _x_ and output _z_; training minimizes it
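
A rough NumPy sketch of one forward pass through a denoising autoencoder follows. It assumes tied weights (W' is the transpose of W) and a cross-entropy reconstruction loss, as in the Vincent et al. 2008 formulation; the sizes, random initial values, and names such as `forward` and `b_prime` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(a):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-a))

def forward(x_tilde, W, b, b_prime):
    y = sigmoid(W.dot(x_tilde) + b)      # hidden node activities
    z = sigmoid(W.T.dot(y) + b_prime)    # reconstruction, assuming W' = W transposed
    return y, z

def cross_entropy(x, z, eps=1e-12):
    # reconstruction loss between the clean input x and the output z
    z = np.clip(z, eps, 1.0 - eps)       # avoid log(0)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

rng = np.random.RandomState(0)
n_genes, n_nodes = 8, 3                  # toy sizes, far smaller than the real compendium
W = rng.normal(scale=0.1, size=(n_nodes, n_genes))
b = np.zeros(n_nodes)
b_prime = np.zeros(n_genes)

x = rng.uniform(size=n_genes)                      # clean sample, scaled to 0-1
x_tilde = x * rng.binomial(1, 0.9, size=n_genes)   # corrupted copy of the sample
y, z = forward(x_tilde, W, b, b_prime)
loss = cross_entropy(x, z)
print(loss)
```

Training would repeat this pass over many samples and adjust `W`, `b`, and `b_prime` by gradient descent to lower the loss; that step is omitted here.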

In [14]:
# sigmoid (logistic) function: squashes any real number into the range (0, 1)
# approaches 0 for very negative x, approaches 1 for very positive x
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# example: apply the sigmoid to each gene value in one sample vector
sample1 = [0, 0, 1, 0.5, 5, 3]
for gene in sample1:
    print(sigmoid(gene))

0.5
0.5
0.73105857863
0.622459331202
0.993307149076
0.952574126822


# Figure 2 - ADAGE Weights
* Weights: one learned vector per gene, fit by gradient descent; each weight reflects the contribution of that gene to the activity of a node
* After training, node activities can be computed for each new sample
* HW = high-weight genes: weight >= 2 standard deviations from the mean
* e.g. operonic co-membership
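
The high-weight rule can be sketched with pandas as below. The weight values are randomly generated stand-ins, not the supplementary data, and `high_weight_genes` is a hypothetical helper; the assumption is that the mean and standard deviation are taken per node.

```python
import numpy as np
import pandas as pd

# hypothetical weight matrix: 200 genes x 3 nodes of random values
rng = np.random.RandomState(0)
genes = ["gene%d" % i for i in range(200)]
nodes = ["node%d" % j for j in range(1, 4)]
weights = pd.DataFrame(rng.normal(size=(200, 3)), index=genes, columns=nodes)

def high_weight_genes(weights, n_sd=2.0):
    """Map each node to the genes whose weight lies >= n_sd standard deviations from that node's mean."""
    # standardize each column (node) separately
    z = (weights - weights.mean()) / weights.std()
    return {node: list(z.index[z[node].abs() >= n_sd]) for node in z.columns}

hw = high_weight_genes(weights)
for node in nodes:
    print(node + ": " + str(len(hw[node])) + " HW genes")
```

Both strongly positive and strongly negative weights count as high weight here, since the rule is stated in terms of distance from the mean.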

In [10]:
# dimensions of Weight matrix, for each node
weight_matrix = pd.read_csv("data/inline-supplementary-material-2.csv")
weight_matrix.head()
print(weight_matrix.shape)

(5549, 51)


"Cooperonic" - genes cooperating on same operon, positions in **B** from [Trunk et al. 2010](https://www.ncbi.nlm.nih.gov/pubmed/20553552)
![](Tan_etal_2016_Fig2_A_B_C_D.png)

Capturing functional features: computed the Euclidean distance between the weight vectors connecting each gene to the 50 nodes, then assigned the function of the closest neighbor genes to the target gene
![](Tan_etal_2016_Fig2_E.png)
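
The nearest-neighbor idea can be illustrated with a toy example. The weight vectors and function labels below are made up, only 4 nodes are used instead of 50, and a single nearest neighbor is taken for simplicity; the paper's exact procedure may differ.

```python
import numpy as np

# hypothetical weight vectors: 5 genes x 4 nodes
weights = np.array([
    [0.90, 0.10, 0.00, 0.20],
    [0.80, 0.20, 0.10, 0.20],
    [0.00, 0.90, 0.80, 0.10],
    [0.10, 0.80, 0.90, 0.00],
    [0.88, 0.12, 0.02, 0.20],
])
# made-up annotations; the last gene has no known function
functions = ["transport", "transport", "motility", "motility", None]

def nearest_annotated(target, weights, functions):
    """Index of the annotated gene closest to `target` in Euclidean distance."""
    dists = np.linalg.norm(weights - weights[target], axis=1)
    dists[target] = np.inf               # a gene is not its own neighbor
    for i, f in enumerate(functions):
        if f is None:
            dists[i] = np.inf            # skip genes without an annotation
    return int(np.argmin(dists))

nn = nearest_annotated(4, weights, functions)
predicted = functions[nn]
print(predicted)
```

Genes with similar weight vectors contribute to the same nodes in similar ways, which is why borrowing a neighbor's annotation is plausible.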

In [15]:
# high weight (HW) nodes
HW = pd.read_csv("data/inline-supplementary-material-3.csv")
HW.head()
print(HW.shape)

(330, 100)


In [9]:
# get genes in each node
operon_node = pd.read_csv("data/inline-supplementary-material-4.csv")
operon_node.head(200)

Unnamed: 0,node,operon,q_value
0,node1,PA3327;PA3328;PA3329;PA3330;PA3331;PA3332;PA33...,0.000000
1,node1,PA4250;PA4251;PA4252;PA4253;PA4254;PA4255;PA42...,0.000000
2,node1,PA1714;PA1715;PA1716;PA1717;PA1718;PA1719;PA17...,0.000000
3,node1,PA1806;PA1807;PA1808;PA1809;PA1810;PA1811,0.000000
4,node1,PA4863;PA4864;PA4865;PA4866;PA4867;PA4868,0.000000
5,node1,PA4242;PA4243;PA4244;PA4245;PA4246;PA4247;PA42...,0.004228
6,node1,PA2637;PA2638;PA2639;PA2640;PA2641;PA2642;PA26...,0.000000
7,node1,PA3799;PA3800;PA3801;PA3802;PA3803;PA3804;PA38...,0.002304
8,node1,PA0996;PA0997;PA0998;PA0999;PA1000,0.000000
9,node1,PA2066;PA2067;PA2068;PA2069,0.000000
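
One way to work with a table like this is to filter on the q-value and split each operon string into its genes. The sketch below rebuilds three of the rows shown above by hand and adds one clearly hypothetical non-significant row (`nodeX`); the 0.05 cutoff is an assumption, not a threshold taken from the paper.

```python
import pandas as pd

operon_node = pd.DataFrame({
    "node": ["node1", "node1", "node1", "nodeX"],
    "operon": [
        "PA1806;PA1807;PA1808;PA1809;PA1810;PA1811",
        "PA0996;PA0997;PA0998;PA0999;PA1000",
        "PA2066;PA2067;PA2068;PA2069",
        "GENE_A;GENE_B",                 # hypothetical row, for illustration only
    ],
    "q_value": [0.0, 0.0, 0.0, 0.2],
})

# keep operons below the assumed significance cutoff
significant = operon_node[operon_node["q_value"] < 0.05].copy()
# split "PA1806;PA1807;..." into a list of gene identifiers
significant["genes"] = significant["operon"].str.split(";")
significant["n_genes"] = significant["genes"].apply(len)

# collect the unique genes linked to each node
genes_by_node = {
    node: sorted(set(g for genes in grp["genes"] for g in genes))
    for node, grp in significant.groupby("node")
}
print(significant[["node", "n_genes", "q_value"]])
```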


Extracted features represented sequence differences between strains
![](Tan_etal_2016_Fig3.png)

Node 42 reflected _Anr_ activity in both existing and new experiments
![](Tan_etal_2016_Fig4.png)

Reanalysis of previous study,
used [KEGG pathway database](http://www.genome.jp/kegg/pathway.html?sess=2764b8338258d6286de91bbebe6faf46) to confirm hidden features extracted
![](Tan_etal_2016_Fig5.png)

In [16]:
# ADAGE analysis of ALL Pseudomonas aeruginosa data from ArrayExpress for all 50 nodes
new_activities = pd.read_csv("data/inline-supplementary-material-5.csv")
new_activities.head()
new_activities.shape

(950, 51)

# Comparison with PCA/ICA
* PCA/ICA show similar patterns, but all principal components (PCs) must be examined, whereas a few important nodes capture combinations of features
<img src="S1A.png" width="600">

# Comparison with PCA/ICA (continued)
* across all PC and all nodes
<img src="S1B.png" width="600">

<img src="S2.png" width="600">

# Marine Microbial Eukaryotic Transcriptome Sequencing Project
* 678 marine microbes, [Keeling et al. 2014](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889) 
* Unicellular eukaryotes: heterokonts (e.g. diatoms), dinoflagellates, ciliates, etc.
* 40 phyla
<img src="10.1371-journal.pbio.1001889.g002.png" height="100">

In [79]:
mmetsp = pd.read_csv("https://raw.githubusercontent.com/glympsed/glympsed/master/mmetsp/batch1.mmetsp.OGcounts.filtered.csv")
mmetsp.shape
mmetsp.head()

Unnamed: 0.1,Unnamed: 0,OG_091326,OG_091320,OG_291897,OG_334539,OG_019415,OG_019412,OG_019418,OG_293803,OG_324372,...,OG_150566,OG_334536,OG_334537,OG_334534,OG_293804,OG_334532,OG_143642,OG_321249,OG_092677,OG_092673
0,SRR1300371,0,2270,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,SRR1328074,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,SRR1300355,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,SRR1300497,0,983,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,SRR1300495,0,0,0,0,0,0,0,0,0,...,1166,7259,172,2337,0,372,0,0,0,0


# Conclusions
* ADAGE, a denoising autoencoder approach, provides the opportunity to identify biologically important patterns
* It could be very useful for discovering pathways of importance in nonmodel species