BioBombe: Sequentially compressed gene expression features enhance biological signatures

Sequential Compression of Gene Expression Data Across Latent Space Dimensions

Gregory Way and Casey Greene 2018

University of Pennsylvania

This repository stores data and data processing modules to sequentially compress gene expression data.

Named after the mechanical device developed by Alan Turing and other cryptologists in World War II to decipher secret messages sent by Enigma machines, BioBombe represents an approach used to decipher hidden messages embedded in gene expression data. We use the BioBombe approach to study different biological representations learned across compression algorithms and various latent dimensionalities.

In this repository, we compress three different gene expression data sets (TCGA, GTEx, and TARGET) across 28 different latent dimensions (k) using five different algorithms (PCA, ICA, NMF, DAE, and VAE). We evaluate each algorithm and dimension using a variety of metrics. Our goal is to construct reproducible gene expression signatures with unsupervised learning.
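
The idea of fitting several algorithms at several latent dimensionalities can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: it uses scikit-learn for PCA, ICA, and NMF only (the DAE and VAE models are omitted), a small synthetic non-negative matrix instead of TCGA, GTEx, or TARGET data, and three arbitrary values of k rather than the 28 used in the analysis. The function name `compress_across_k` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF


def compress_across_k(x, k_values, seed=0):
    """Fit each algorithm at each k; return {(name, k): samples-by-k matrix}.

    Hypothetical helper for illustration only.
    """
    results = {}
    for k in k_values:
        # A fresh model per (algorithm, k) pair, so fits are independent
        models = {
            "pca": PCA(n_components=k, random_state=seed),
            "ica": FastICA(n_components=k, random_state=seed, max_iter=1000),
            "nmf": NMF(n_components=k, init="nndsvda", random_state=seed,
                       max_iter=500),
        }
        for name, model in models.items():
            results[(name, k)] = model.fit_transform(x)
    return results


# Synthetic stand-in for a samples-by-genes matrix (non-negative, as NMF requires)
rng = np.random.default_rng(0)
expression = rng.random((60, 40))

compressed = compress_across_k(expression, k_values=[2, 4, 8])
for (name, k), z in sorted(compressed.items()):
    print(name, k, z.shape)
```

Each entry of `compressed` holds one compressed representation of the same samples, so downstream metrics can be compared across algorithms and across k.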

Links to access data and archived results can be found here: https://greenelab.github.io/BioBombe/

Citation

Way, G.P., Zietz, M., Himmelstein, D.S., Greene, C.S. Sequential compression across latent space dimensions enhances gene expression signatures. bioRxiv preprint (2019). doi:10.1101/573782

Approach

Our approach is outlined below:

[Figure: overview of the BioBombe approach]

BioBombe Training Implementation

Our model implementation is described below.

[Figure: BioBombe training implementation]

Analysis Modules

To reproduce the results and figures of the analysis, the modules should be run in order.

| Name | Description |
| :--- | :---------- |
| 0.expression-download | Download and process gene expression data to run through pipeline |
| 1.initial-k-sweep | Determine a set of optimal hyperparameters for Tybalt and ADAGE models across a representative range of k dimensions |
| 2.sequential-compression | Train various algorithms to compress gene expression data across a large range of k dimensions |
| 3.build-hetnets | Download, process, and integrate various curated gene sets into a single heterogeneous network |
| 4.analyze-components | Visualize the reconstruction and sample correlation results of the sequential compression analysis |
| 5.analyze-stability | Determine how stable compression solutions are between and across algorithms, and across dimensions |
| 6.biobombe-projection | Apply BioBombe matrix interpretation analysis and overrepresentation analyses to assign biological knowledge to compression features |
| 7.analyze-coverage | Determine the coverage, or proportion, of enriched gene sets in compressed latent space features for all models and ensembles of models |
| 8.gtex-interpret | Interpret compressed features in the GTEx data |
| 9.tcga-classify | Input compressed features from TCGA data into supervised machine learning classifiers to detect pathway aberration |
| 10.gene-expression-signatures | Identify gene expression signatures for sample sex in GTEx and TCGA data, and MYCN amplification in TARGET data |
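
Because the module names carry numeric prefixes, plain alphabetical sorting would place `10.gene-expression-signatures` before `2.sequential-compression`. The sketch below shows one way to recover the intended run order by sorting on the integer prefix. The helper `module_order` is hypothetical and is not part of the repository.

```python
import re


def module_order(names):
    """Sort module directory names by their leading integer prefix.

    Hypothetical helper for illustration only.
    """
    def prefix(name):
        match = re.match(r"(\d+)\.", name)
        if match is None:
            raise ValueError(f"no numeric prefix in module name: {name!r}")
        return int(match.group(1))

    return sorted(names, key=prefix)


# A subset of the module names, listed here in alphabetical order
modules = [
    "0.expression-download",
    "1.initial-k-sweep",
    "10.gene-expression-signatures",
    "2.sequential-compression",
    "9.tcga-classify",
]

ordered = module_order(modules)
for name in ordered:
    print(name)
```

Sorting on the parsed integer rather than the raw string keeps module 10 last, matching the order given in the module list.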

Algorithms

See 2.sequential-compression for more details.

Computational Environment

All processing and analysis scripts were run using the conda environment specified in environment.yml. To build and activate this environment, run:

# conda version 4.5.0
conda env create --force --file environment.yml

conda activate biobombe