An unsupervised transfer learning approach for rare disease transcriptomics
Type Name Latest commit message Commit time
data Include additional MB dataset Dec 20, 2018
diagrams Update README (#32) Aug 17, 2018
docker Update: add more packages (#64) Jan 3, 2019
figure_notebooks Add MB plots for two cohorts (#66) Jan 4, 2019
plots
results Add differential expression notebook and results (#65) Jan 3, 2019
scripts Update: allow for repeats when using sample lists Dec 11, 2018
util Add sample size 'sweep' evaluations (#48) Dec 5, 2018
.gitattributes Update: account for biological context repeats in evals Dec 16, 2018
.gitignore Add Robinson, et al. MB dataset Dec 24, 2018
.nojekyll
00-data_download.sh Add Robinson, et al. MB dataset Dec 24, 2018
01-PLIER_util_proof-of-concept_notebook.Rmd Add custom functions for working with PLIER models and initial explor… Apr 8, 2018
01-PLIER_util_proof-of-concept_notebook.nb.html Add custom functions for working with PLIER models and initial explor… Apr 8, 2018
02-recount2_PLIER_exploration.Rmd Isolated immune cell reconstruction evaluation (#5) Apr 12, 2018
02-recount2_PLIER_exploration.nb.html Isolated immune cell reconstruction evaluation (#5) Apr 12, 2018
03-isolated_cell_type_populations.Rmd Tweaking heatmap figures (#24) Aug 14, 2018
03-isolated_cell_type_populations.nb.html Tweaking heatmap figures (#24) Aug 14, 2018
04-isolated_immune_cell_reconstruction.Rmd Isolated immune cell reconstruction evaluation (#5) Apr 12, 2018
04-isolated_immune_cell_reconstruction.nb.html Isolated immune cell reconstruction evaluation (#5) Apr 12, 2018
05-sle-wb_PLIER.Rmd Systemic lupus erythematosus whole blood PLIER (#6) Apr 20, 2018
05-sle-wb_PLIER.nb.html Systemic lupus erythematosus whole blood PLIER (#6) Apr 20, 2018
06-sle-wb_cell_type.Rmd Systemic lupus erythematosus whole blood PLIER (#6) Apr 20, 2018
06-sle-wb_cell_type.nb.html Systemic lupus erythematosus whole blood PLIER (#6) Apr 20, 2018
07-sle_cell_type_recount2_model.Rmd Adding figure notebook + figure: plasma cell box plots (#27) Aug 14, 2018
07-sle_cell_type_recount2_model.nb.html Adding figure notebook + figure: plasma cell box plots (#27) Aug 14, 2018
08-identify_ifn_LVs.Rmd Add SLE IFN trials notebooks (#7) May 4, 2018
08-identify_ifn_LVs.nb.html Add SLE IFN trials notebooks (#7) May 4, 2018
09-sle_ifn_data_prep.Rmd Add SLE IFN trials notebooks (#7) May 4, 2018
09-sle_ifn_data_prep.nb.html Add SLE IFN trials notebooks (#7) May 4, 2018
10-sle_ifn_analysis.Rmd Add initial figure notebooks (#22) Aug 12, 2018
10-sle_ifn_analysis.nb.html Add initial figure notebooks (#22) Aug 12, 2018
11-subsample_recount_PLIER.R Add recount subsampling script and results (#8) May 7, 2018
12-train_NARES_PLIER.Rmd Add differential expression analyses in ANCA-associated vasculitis (#14) May 31, 2018
12-train_NARES_PLIER.nb.html Add NARES PLIER training and comparison to recount2 model (#9) May 8, 2018
13-compare_NARES_B.Rmd Tweaking heatmap figures (#24) Aug 14, 2018
13-compare_NARES_B.nb.html Tweaking heatmap figures (#24) Aug 14, 2018
14-NARES_MCPcounter.Rmd Add figure notebook + figure: neutrophil scatterplots (#25) Aug 14, 2018
14-NARES_MCPcounter.nb.html Add figure notebook + figure: neutrophil scatterplots (#25) Aug 14, 2018
15-evaluate_subsampling.Rmd
15-evaluate_subsampling.nb.html Add recount2 subsampling evaluations (#13) May 31, 2018
16-repeat_sle_wb_PLIER.R Add recount2 subsampling evaluations (#13) May 31, 2018
17-plotting_repeat_evals.Rmd Add initial figure notebooks (#22) Aug 12, 2018
17-plotting_repeat_evals.nb.html Add initial figure notebooks (#22) Aug 12, 2018
18-NARES_differential_expression.Rmd Add differential expression analyses in ANCA-associated vasculitis (#14) May 31, 2018
18-NARES_differential_expression.nb.html Add differential expression analyses in ANCA-associated vasculitis (#14) May 31, 2018
19-GPA_blood_differential_expression.Rmd Add differential expression analyses in ANCA-associated vasculitis (#14) May 31, 2018
19-GPA_blood_differential_expression.nb.html Add differential expression analyses in ANCA-associated vasculitis (#14) May 31, 2018
20-kidney_differential_expression.Rmd Add glomeruli data download from greenelab/rheum-plier-data (#31) Aug 17, 2018
20-kidney_differential_expression.nb.html Add glomeruli data download from greenelab/rheum-plier-data (#31) Aug 17, 2018
21-AAV_DLVE.Rmd Add glomeruli data download from greenelab/rheum-plier-data (#31) Aug 17, 2018
21-AAV_DLVE.nb.html Add glomeruli data download from greenelab/rheum-plier-data (#31) Aug 17, 2018
22-GPA_blood_top_LVs.Rmd
22-GPA_blood_top_LVs.nb.html Add further exploration of GPA blood dataset (#15) Jun 4, 2018
23-explore_AAV_recount_LVs.Rmd Explore gene set associations and loadings of additional differential… Aug 12, 2018
23-explore_AAV_recount_LVs.nb.html Explore gene set associations and loadings of additional differential… Aug 12, 2018
24-explore_rtx.Rmd Add rituximab exploratory data analyses (#17) Jun 22, 2018
24-explore_rtx.nb.html Add rituximab exploratory data analyses (#17) Jun 22, 2018
25-predict_response.Rmd Add RTX prediction notebook (#18) Jun 24, 2018
25-predict_response.nb.html Add RTX prediction notebook (#18) Jun 24, 2018
26-describe_recount2.Rmd Fix: divide by sample count, not run count Dec 23, 2018
26-describe_recount2.nb.html Fix: divide by sample count, not run count Dec 23, 2018
27-oncogenic_pathway_recount2_model.Rmd Are MSigDB oncogenic pathways captured by the recount2 model? (#42) Nov 16, 2018
27-oncogenic_pathway_recount2_model.nb.html Are MSigDB oncogenic pathways captured by the recount2 model? (#42) Nov 16, 2018
28-train_different_biological_contexts.sh Update: model -> models Dec 11, 2018
29-train_models_different_sample_size.sh Newline Dec 3, 2018
30-evaluate_sample_size_and_biological_context.Rmd Fix typo Dec 17, 2018
30-evaluate_sample_size_and_biological_context.nb.html Fix typo Dec 17, 2018
31-plotting_sample_size_biological_context_coverage.Rmd Update in response to PR comments Dec 18, 2018
31-plotting_sample_size_biological_context_coverage.nb.html Update in response to PR comments Dec 18, 2018
32-explore_pathway_separation.Rmd Refactor in response to PR comments Dec 18, 2018
32-explore_pathway_separation.nb.html Refactor in response to PR comments Dec 18, 2018
33-pathway_overlap_biological_contexts.Rmd Pathway overlap between biological contexts (#53) Dec 19, 2018
33-pathway_overlap_biological_contexts.nb.html Pathway overlap between biological contexts (#53) Dec 19, 2018
34-DIPG_data_cleaning.Rmd Update: function for obtaining sample attributes from series matrix Dec 26, 2018
34-DIPG_data_cleaning.nb.html Update: function for obtaining sample attributes from series matrix Dec 26, 2018
35-DIPG_recount2_model.Rmd Add recount2 B for DIPG Dec 13, 2018
35-DIPG_recount2_model.nb.html Add recount2 B for DIPG Dec 13, 2018
36-DIPG_analysis.Rmd Reorder notebooks Dec 23, 2018
36-DIPG_analysis.nb.html Reorder notebooks Dec 23, 2018
37-medulloblastoma_recount2_model.Rmd Reorder notebooks Dec 23, 2018
37-medulloblastoma_recount2_model.nb.html Reorder notebooks Dec 23, 2018
38-medulloblastoma_DELV.Rmd Add differential expression notebook and results (#65) Jan 3, 2019
38-medulloblastoma_DELV.nb.html Add differential expression notebook and results (#65) Jan 3, 2019
39-L2_penalty.Rmd Add short experiment re: L2 parameter (#63) Jan 3, 2019
39-L2_penalty.nb.html
40-SLE_MCPcounter.Rmd Reorder notebook; save results as TSV Dec 25, 2018
40-SLE_MCPcounter.nb.html Reorder notebook; save results as TSV Dec 25, 2018
LICENSE_BSD-3.md Add License (#1) Apr 3, 2018
LICENSE_CC0.md Add License (#1) Apr 3, 2018
README.md Add updated figshare to README (#68) Jan 15, 2019


MultiPLIER

An unsupervised transfer learning approach for rare disease transcriptomics

Taroni JN, Grayson PC, Hu Q, Eddy S, Kretzler M, Merkel PA, and Greene CS+. MultiPLIER: a transfer learning framework reveals systemic features of rare autoimmune disease. bioRxiv. 2018.

+Correspondence via issues or to greenescientist@gmail.com

Data

Data used in this analysis repo were processed in greenelab/rheum-plier-data. Please see that repository for relevant citations.

Data and code, including items that are too large to be stored with Git LFS (e.g., some models), associated with v0.2.0 are available at the following DOI: 10.6084/m9.figshare.6982919.v2

Dependencies

We have prepared a Docker image that contains all the dependencies required to reproduce these analyses. See docker/Dockerfile for more information about dependencies.

After installing Docker (Docker documentation), the image can be obtained:

docker pull jtaroni/multi-plier:0.2.0

We use R notebooks for analysis, which can be run and modified using RStudio. RStudio is included on our Docker image. This guide from Andrew Heiss, specifically the Run locally with a GUI section, is a great starting point for working with RStudio and Docker.

Overview

Unsupervised machine learning methods provide a promising means to analyze and interpret large datasets. However, most datasets generated by individual researchers remain too small to fully benefit from these methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. We sought to determine whether machine learning models could be constructed from large public data compendia and then transferred to small datasets for subsequent analysis. We trained models using Pathway Level Information ExtractoR (PLIER) (GitHub) over datasets of different types and scales. Models constructed from large public datasets i) were more detailed than those constructed from individual datasets; ii) included features that aligned well to important biological factors; and iii) were transferable to rare disease datasets, where they describe biological processes related to disease severity more effectively than models trained within those datasets.

We call this approach MultiPLIER because we train on multiple datasets, tissues, and biological conditions.
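The transfer step can be written as a single projection. The equation below is a sketch under stated assumptions, not a statement of the repository's exact implementation: it assumes the projection is a ridge-regularized least-squares fit, where $Z$ is the gene-by-LV loading matrix learned on the large compendium, $Y_{\text{new}}$ is the gene-by-sample expression matrix of the small dataset (restricted to the genes in $Z$ and scaled the same way as the training data), $\lambda_2$ is the L2 penalty on the latent variables from training, and $I$ is the identity matrix. The symbols $Y_{\text{new}}$, $B_{\text{new}}$, and $\lambda_2$ are introduced here for illustration.

```latex
B_{\text{new}} = \left( Z^{\top} Z + \lambda_2 I \right)^{-1} Z^{\top} Y_{\text{new}}
```

Under these assumptions, $B_{\text{new}}$ has one row per latent variable and one column per sample in the small dataset, so every dataset projected this way shares the same latent space.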

We focus on two sets of systemic autoimmune conditions in this project: one is a family of rare diseases and the other is a disease that is not rare. First, we establish that PLIER is appropriate for use in a single-tissue, multi-dataset compendium (greenelab/rheum-plier-data/sle-wb) constructed from publicly available systemic lupus erythematosus (SLE) whole blood (WB) microarray data. We demonstrate that MultiPLIER, trained on the recount2 RNA-seq compendium, performs similarly to an SLE WB model in capturing certain cell type-specific signals and captures additional pathway signals relative to that model. We also use MultiPLIER to analyze expression data from 3 tissues in anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV), a family of rare diseases.

Overview figure of dataset-specific PLIER and MultiPLIER. Boxes with solid colored fills represent inputs to the model. White boxes with colored outlines represent model output. (A) PLIER (Mao et al., 2017) automatically extracts latent variables (LVs), shown as the matrix B, and their loadings (Z). We can train a PLIER model for each of three datasets from different tissues, which results in three dataset-specific latent spaces. (B) PLIER takes as input a prior information/knowledge matrix C and applies a constraint such that some of the loadings (Z), and therefore some of the LVs, capture biological signal in the form of curated pathways or cell type-specific gene sets. (C) Ideally, an LV will map to a single gene set or a group of highly related gene sets to allow for easy interpretation of the model. PLIER applies a penalty on U, the matrix that relates the gene sets in C to the LVs, to facilitate this. Purple fill in a cell indicates a non-zero value and a darker purple indicates a higher value. We show an undesirable U matrix in the top toy example (Ci) and a favorable U matrix in the bottom toy example (Cii). (D) If models have been trained on individual datasets, we may be required to find “matching” LVs in different dataset- or tissue-specific models using the loadings (Z) from each model. Using a metric like the Pearson correlation between loadings, we may or may not be able to find a well-correlated match between datasets. (E) The MultiPLIER approach: train a PLIER model on a large collection of uniformly processed data from many different biological contexts and conditions (recount2; Collado-Torres et al., 2017), which we call a MultiPLIER model, and then project the individual datasets into the MultiPLIER latent space. The hatched fill indicates the sample dataset of origin. (F) Latent variables from the MultiPLIER model can be tested for differential expression between disease and controls in multiple tissues.
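The matching problem in panel (D) can be illustrated with a small, standard-library Python sketch. It is not code from this repository (the analyses here are written in R): the function names `pearson` and `best_matches`, the 0.5 cutoff, and the toy loadings are all hypothetical, and it assumes the two models' loadings are defined over the same genes in the same order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant vector; treat it as no match.
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_matches(z_a, z_b, min_r=0.5):
    """For each LV in model A, find the best-correlated LV in model B.

    z_a, z_b: dicts mapping an LV name to its list of gene loadings
    (one column of Z), with genes in the same order in both models.
    Returns a dict mapping each LV in A to (name in B, r), or to None
    when the best correlation falls below min_r (a hypothetical cutoff).
    """
    matches = {}
    for name_a, load_a in z_a.items():
        best_name, best_r = max(
            ((name_b, pearson(load_a, load_b)) for name_b, load_b in z_b.items()),
            key=lambda pair: pair[1],
        )
        matches[name_a] = (best_name, best_r) if best_r >= min_r else None
    return matches


# Toy loadings over four genes for two hypothetical tissue-specific models.
z_blood = {
    "blood_LV1": [0.9, 0.8, 0.0, 0.0],
    "blood_LV2": [0.0, 0.1, 0.7, 0.9],
}
z_kidney = {
    "kidney_LV1": [0.0, 0.0, 0.8, 1.0],
    "kidney_LV2": [0.5, 0.0, 0.5, 0.0],
}

matches = best_matches(z_blood, z_kidney)
for lv, match in matches.items():
    print(lv, "->", match)
```

In this toy example, `blood_LV2` finds a well-correlated partner in the kidney model while `blood_LV1` does not, which is the situation that motivates projecting all datasets into one shared latent space instead.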

For more information about the training set, please see this notebook.

Notebooks

Analysis notebooks are numbered and located in the top-level directory. We've enabled GitHub Pages for easy viewing of the notebooks. Some steps in the pipeline are R scripts rather than notebooks because they are computationally intensive; we exclude these from the TOC below.

Note that not all analyses present in this repository are included in the preprint.

The figure_notebooks directory contains notebooks that were used specifically to generate figures suitable for publication (figure_notebooks/figures).

License

This repository is dual licensed as BSD 3-Clause (source code) and CC0 1.0 (figures, documentation, and our arrangement of the facts contained in the underlying data), with the following exceptions:

  • recount2 data is licensed CC-BY.