Summary
BioBombe examines many low-dimensional representations of gene expression data. Named after the large mechanical devices built by Alan Turing and other cryptologists during World War II to decode secret messages, BioBombe is an approach to deciphering hidden messages embedded in gene expression data. In the manuscript, we use this approach to compare the biological representations learned by various compression algorithms across latent space dimensionalities ranging from k = 2 to k = 200.
This website provides convenient links to all the resources produced for the manuscript, including software; processed gene expression input data; compressed features for all algorithms, latent dimensionalities, and initializations across all datasets; and performance results and models for the large-scale cancer type and mutation classification analysis.
Primary Findings
No single algorithm or latent dimensionality is optimal for learning biological representations. Different compression algorithms learn different biological signatures at different latent dimensionalities. A practitioner aiming to optimize feature discovery by compressing gene expression data should use multiple algorithms across a large range of latent dimensionalities.
Citation
Sequential compression across latent space dimensions enhances gene expression signatures
Way, G.P., Zietz, M., Himmelstein, D.S., Greene, C.S.
bioRxiv preprint (2019) · doi:10.1101/573782
Approach
We train five compression algorithms: principal components analysis (PCA), independent components analysis (ICA), non-negative matrix factorization (NMF), denoising autoencoders (DAE), and variational autoencoders (VAE). Each is trained on three benchmark gene expression datasets, The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) project, and the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) project, across a wide range of latent dimensionalities (k). We assess model performance by measuring reconstruction cost, correlation between input and reconstructed output, model stability, and gene set coverage.
![](approach.png)
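As a rough illustration of the sweep described above, the following sketch compresses a small synthetic matrix with PCA at several latent dimensionalities and reports two of the listed metrics: reconstruction cost and the correlation between input and reconstructed output. It is not the manuscript's pipeline. The function name `pca_sweep`, the synthetic data, the use of mean squared error as the cost, and the chosen values of k are all assumptions made for illustration, and only PCA (computed through a NumPy SVD) is shown.

```python
import numpy as np


def pca_sweep(X, ks):
    """Compress X (samples x genes) with PCA at each k and score the reconstruction.

    Hypothetical helper for illustration; not part of the BioBombe software.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # One SVD serves every k: the top-k right singular vectors are the PCA loadings.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    results = {}
    for k in ks:
        W = Vt[:k]            # loadings, shape (k, genes)
        Z = Xc @ W.T          # compressed features, shape (samples, k)
        X_hat = Z @ W + mean  # reconstruction in gene space

        # Reconstruction cost, taken here to be mean squared error (an assumption).
        cost = float(np.mean((X - X_hat) ** 2))
        # Pearson correlation between each sample and its reconstruction.
        corr = [np.corrcoef(X[i], X_hat[i])[0, 1] for i in range(X.shape[0])]
        results[k] = {
            "reconstruction_cost": cost,
            "mean_sample_correlation": float(np.mean(corr)),
        }
    return results


# Synthetic stand-in for expression data: 100 samples, 50 genes,
# generated from 5 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 5))
loadings = rng.normal(size=(5, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 50))

results = pca_sweep(X, ks=[2, 5, 10])
for k, metrics in results.items():
    print(k, metrics)
```

For PCA the reconstruction cost can only fall as k grows, so a single metric at a single k says little about which biological signatures are captured. That is one reason to inspect several algorithms and dimensionalities side by side.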
Resources
Software
Includes code, data, documentation, results, figures, and a computational environment for the full analysis. Each numbered module represents specific data processing steps or analysis results.
Input Data
Data
Includes processed training and testing datasets for TCGA, GTEx, and TARGET as Git LFS files.
Heterogeneous Networks
Includes real and permuted hetnets [1] for MSigDB [2] and xCell [3] gene sets, as well as more details about heterogeneous networks.
[1] Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Himmelstein DS, Lizee A, Hessler C, Brueggeman L, Chen SL, Hadley D, Green A, Khankhanian P, Baranzini SE.
eLife (2017) · doi:10.7554/eLife.26726
[2] The Molecular Signatures Database Hallmark Gene Set Collection
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P.
Cell Systems (2015) · 1:417-25
[3] xCell: digitally portraying the tissue cellular heterogeneity landscape
Aran D, Hu Z, Butte AJ.
Genome Biology (2017) · doi:10.1186/s13059-017-1349-1
Compressed Features
Results:
Randomly Permuted Data:
TCGA Classification Results
Results
Includes BioBombe feature coefficients, sample activation scores, and classifier metrics for all supervised learning models (elastic net logistic regression) trained to predict cancer type and mutation status.
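To make the terms above concrete, here is a minimal NumPy-only sketch of elastic net logistic regression fit by proximal gradient descent on synthetic compressed features. It is an illustration under stated assumptions, not the implementation used for the manuscript: the function names, the hyperparameters `alpha` and `l1_ratio`, the optimizer, and the synthetic labels are all hypothetical. The fitted weights play the role of feature coefficients and the input matrix plays the role of sample activation scores.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink each weight toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_elastic_net_logistic(X, y, alpha=0.05, l1_ratio=0.5, lr=0.5, n_iter=300):
    """Fit logistic regression with an elastic net penalty (hypothetical helper).

    Penalty: alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2).
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the mean logistic loss plus the smooth (L2) part of the penalty.
        grad_w = X.T @ (prob - y) / n_samples + alpha * (1.0 - l1_ratio) * w
        grad_b = float(np.mean(prob - y))
        # Gradient step, then soft-thresholding for the non-smooth (L1) part.
        w = soft_threshold(w - lr * grad_w, lr * alpha * l1_ratio)
        b -= lr * grad_b
    return w, b


# Synthetic stand-in: 500 samples with 20 compressed features (activation scores).
# Only features 0 and 1 carry signal about the binary label (e.g. mutation status).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 20))
y = (Z[:, 0] - Z[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(float)

w, b = fit_elastic_net_logistic(Z, y)
pred = (Z @ w + b > 0).astype(float)
accuracy = float(np.mean(pred == y))
top_features = np.argsort(-np.abs(w))[:2]
print("training accuracy:", accuracy)
print("largest coefficients at features:", sorted(int(i) for i in top_features))
```

The L1 part of the penalty pushes uninformative coefficients toward zero, which is what makes the surviving coefficients interpretable as the compressed features most associated with the label.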
Acknowledgements
This work was funded in part by The Gordon and Betty Moore Foundation under GBMF 4552 (CSG) and the National Institutes of Health's National Human Genome Research Institute under R01 HG010067 (CSG) and the National Institutes of Health under T32 HG000046 (GPW).
We would like to thank Jaclyn Taroni, Yoson Park, and Alexandra Lee for insightful discussions and code review. We also thank Jo Lynne Rokita and John Maris for insightful discussions regarding the neuroblastoma analysis.