This repository contains the data and code needed to generate the results and reproduce the figures and tables in A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes, published in Nucleic Acids Research. The paper introduces SampleLASSO, a gene expression imputation method that uses the LASSO machine learning algorithm in a way that captures context-specific, biologically relevant information to guide imputation.
This repo provides:
- The data, results, and figures presented in the manuscript.
- Code to regenerate the results and figures.
- A function that takes a user-supplied dataset, uses SampleLASSO to impute the unmeasured genes, and reports which expression samples in the training data were the most helpful for imputation.
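As a rough illustration of the idea, and not the repository's implementation, the sketch below treats each training sample as a feature. A LASSO model is fit across the measured genes to express a new sample as a sparse combination of training samples, and the fitted coefficients are then applied to the training samples' values for the unmeasured genes. The function name `sample_lasso_impute`, the `alpha` value, and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso


def sample_lasso_impute(train, measured_idx, target_idx, new_measured, alpha=0.01):
    """Impute target genes for one new sample (illustrative sketch only).

    train        : (n_samples, n_genes) training expression matrix
    measured_idx : column indices of the genes measured in the new sample
    target_idx   : column indices of the genes to impute
    new_measured : (len(measured_idx),) expression values of the new sample
    """
    # Rows are measured genes; columns (features) are training samples.
    X_fit = train[:, measured_idx].T
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X_fit, new_measured)

    # Apply the per-sample coefficients to the unmeasured genes.
    X_pred = train[:, target_idx].T
    return model.predict(X_pred), model.coef_


rng = np.random.RandomState(0)
train = rng.normal(size=(20, 200))  # 20 training samples, 200 genes

# Synthetic new sample: a mix of training samples 3 and 7 plus a little noise.
new_sample = 0.6 * train[3] + 0.4 * train[7] + rng.normal(scale=0.01, size=200)

measured_idx = np.arange(150)     # pretend the first 150 genes were measured
target_idx = np.arange(150, 200)  # impute the remaining 50 genes

imputed, betas = sample_lasso_impute(
    train, measured_idx, target_idx, new_sample[measured_idx]
)

# Training samples with the largest coefficients, i.e. the most helpful ones.
top_samples = np.argsort(-np.abs(betas))[:2]
```

Because every coefficient belongs to one training sample, ranking the coefficients indicates which samples drove the imputation.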
Section 1: Pre-computed Data, Results, and Figures/Tables
The data used in this study (networks, embeddings, and genesets) is available on Zenodo. To get the data run
PDF versions of the figures can be found in figures/. The notebook that generates the figures can be found at
Section 2: Regenerating the Results and Figures/Tables
This code was tested on an Anaconda distribution of Python. The major packages used are:
- python 3.7
- numpy 1.16.4
- scipy 1.3.0
- pandas 0.24.2
- scikit-learn 0.20.3
- matplotlib 3.0.3
- seaborn 0.9.0
- statsmodels 0.9.0
- tensorflow-gpu 1.14.0 (this was run with python 3.6)
- keras-gpu 2.2.4 (this was run with python 3.6)
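To see how a local environment compares with the versions listed above, a small check like the following may help. It is a convenience sketch rather than part of the repository; the helper name `compare_versions` is made up for this example, and newer package versions may still work.

```python
import importlib

# Versions listed above. "sklearn" is the import name of scikit-learn.
# tensorflow and keras are left out because they were run in a separate
# Python 3.6 environment.
TESTED_VERSIONS = {
    "numpy": "1.16.4",
    "scipy": "1.3.0",
    "pandas": "0.24.2",
    "sklearn": "0.20.3",
    "matplotlib": "3.0.3",
    "seaborn": "0.9.0",
    "statsmodels": "0.9.0",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {module: (tested_version, installed_version or None)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", None)
        except Exception:
            # Not installed, or the import failed for another reason.
            found = None
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in compare_versions().items():
    status = "same" if found == wanted else "differs"
    print(f"{name:12s} tested={wanted:8s} installed={found} [{status}]")
```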
The parallelization of the code was tested with Slurm on the high performance computing cluster at Michigan State University.
Running LASSO and KNN code
main.py: Main script that generates imputed values
main_utls.py: Helper function for main.py
main_slurm.py: A python script that will submit numerous jobs through slurm
run_test_jobs.sh: Shell script that starts running the relevant jobs
main_knitting.py: Combines all predictions for one hyperparameter set into one file
main_evalautions.py: Makes a file that has evaluations for different metrics
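The general pattern of submitting many jobs through Slurm from Python can be sketched as below. This is not the content of main_slurm.py: the helper names, the resource requests, the hyperparameter values, and the placeholder command are all assumptions, and the real command-line arguments of main.py are not shown here. The sketch defaults to a dry run that only prints the generated sbatch script.

```python
import subprocess
import tempfile


def build_sbatch_script(job_name, command, time="01:00:00", mem="8G"):
    """Return the text of a minimal sbatch script that runs one command."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --output={job_name}.%j.out",  # %j expands to the job ID
        "",
        command,
        "",
    ])


def submit(script_text, dry_run=True):
    """Print the script (dry run) or write it to a file and call sbatch."""
    if dry_run:
        print(script_text)
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".sb", delete=False) as handle:
        handle.write(script_text)
    return subprocess.run(["sbatch", handle.name], check=True)


# One job per (hypothetical) hyperparameter setting.
for i, alpha in enumerate([0.01, 0.1, 1.0]):
    # Placeholder command: replace with the real main.py invocation.
    command = f"echo 'would run main.py for hyperparameter value {alpha}'"
    submit(build_sbatch_script(f"impute_{i}", command), dry_run=True)
```

Passing `dry_run=False` would hand each script to `sbatch`, which only works on a machine where Slurm is installed.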
Running DNN code
DNN_main.py: Main script that generates imputed values, and makes the evaluation file
DNN_slurm.py: A python script that submits all relevant DNN jobs
Running GAN code
GGAN_main.py: Main script that generates imputed values, and makes the evaluation file
GGAN_slurm.py: A python script that submits all relevant GGAN jobs
weightnorm.py: This is a utility file for
Running SEEK data code
seek_*.py: These files generate the results, where the * is replaced with an identifier for a given imputation method
seek_slurm.py: A python script that submits all relevant SEEK jobs
Running Normalization code
Normalization_Analysis.py: Main script that generates normalization analysis results
Normalization_Analysis.sb: An sbatch file that allocates a slurm job for normalization script
Running Beta Analysis code
beta_main.py: Main script that generates imputed values
betas_slurm.py: A python script that submits the jobs through slurm
betas_knitting_evals_move.py: Combines all predictions for one hyperparameter set into one file and makes a file with evaluations for different metrics
Section 3: User function for imputing any data
To impute new data, use the function found at src/user_function.py, which has the following arguments:
-mgf, --measured_genes_file: The path to a tab separated file where the rows are the different genes, the first column contains the gene IDs and the rest of the columns contain the expression data to be imputed.
-t, --targets: The path to a text file containing the gene IDs of unmeasured genes that need to be imputed. If this path is not given, then all the genes in the training set that are not in the measured_genes_file will be imputed.
-td, --training_data: The path to the data to be used for training (currently this must be a numpy array with samples along the rows and genes along the columns)
-id, --gene_ids: The path to the file that maps the columns in the training data to gene IDs
-tk, --training_key: The path that maps the GSE and GSM IDs to the samples in the training set
-upd, --use_all_paper_data: If this argument is set to either Microarray or RNAseq, the function will ignore the -td, -id, and -tk arguments and use the pre-supplied data from this work.
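For orientation, the arguments above could be declared with `argparse` roughly as follows. This is a sketch based only on the descriptions in this README, not the parser in src/user_function.py. Treating `-mgf` as required and rejecting a call that gives neither `-upd` nor all of `-td`, `-id`, and `-tk` are assumptions.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Impute unmeasured genes (sketch).")
    parser.add_argument("-mgf", "--measured_genes_file", required=True,
                        help="tab-separated file: gene IDs, then expression columns")
    parser.add_argument("-t", "--targets", default=None,
                        help="text file of gene IDs to impute (optional)")
    parser.add_argument("-td", "--training_data", default=None,
                        help="numpy array, samples along rows and genes along columns")
    parser.add_argument("-id", "--gene_ids", default=None,
                        help="file mapping training-data columns to gene IDs")
    parser.add_argument("-tk", "--training_key", default=None,
                        help="file mapping GSE and GSM IDs to training samples")
    parser.add_argument("-upd", "--use_all_paper_data", default=None,
                        choices=["Microarray", "RNAseq"],
                        help="use the pre-supplied data instead of -td, -id, -tk")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.use_all_paper_data is None:
        # Assumption: without -upd, all three training inputs are needed.
        pairs = [("-td", args.training_data),
                 ("-id", args.gene_ids),
                 ("-tk", args.training_key)]
        missing = [flag for flag, value in pairs if value is None]
        if missing:
            parser.error("provide " + ", ".join(missing) + " or set -upd")
    return args


args = parse_args([
    "-mgf", "../data/example_data.tsv",
    "-t", "../data/example_targets.tsv",
    "-td", "../data/Microarray_Trn_Exp.npy",
    "-id", "../data/GeneIDs.txt",
    "-tk", "../data/Microarray_Trn_Key.tsv",
])
print(args.training_data)
```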
An example run is:
cd src
python user_function.py -mgf ../data/example_data.tsv -t ../data/example_targets.tsv -td ../data/Microarray_Trn_Exp.npy -id ../data/GeneIDs.txt -tk ../data/Microarray_Trn_Key.tsv
This function outputs four files into the user_results directory, in a subdirectory labeled with the timestamp YYYY-MM-DD-HH-SS:
predictions.tsv: A tab separated file with the first column being the Gene IDs and the rest of the columns being the imputed expression values
top_betas.tsv: A tab separated file that, for each imputed GSM, lists the 100 training samples with the highest model coefficients
unusable_measured_genes.txt: A text file containing gene IDs in the uploaded measured_genes_file that were not in the training set
unusable_targets.txt: A text file that lists gene IDs of target genes not imputed because they were also in the measured_genes_file
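The gene bookkeeping behind these files can be pictured with plain set operations. The sketch below follows the descriptions above and is not the code in src/user_function.py. The function name `partition_genes` and the toy gene IDs are made up, and silently dropping requested targets that are absent from the training set is an assumption, since this README does not say how they are handled.

```python
def partition_genes(measured_genes, training_genes, targets=None):
    """Split gene IDs into usable inputs, imputation targets, and rejects."""
    training = set(training_genes)
    measured = set(measured_genes)

    # Measured genes can only serve as model inputs if the training set has them.
    usable_measured = measured & training
    unusable_measured = measured - training

    if targets is None:
        # Default: impute every training-set gene that was not measured.
        targets_to_impute = training - measured
        unusable_targets = set()
    else:
        requested = set(targets)
        # Targets that were already measured are not imputed.
        unusable_targets = requested & measured
        # Assumption: requested targets missing from the training set are dropped.
        targets_to_impute = (requested - measured) & training

    return {
        "usable_measured": sorted(usable_measured),
        "unusable_measured": sorted(unusable_measured),
        "targets_to_impute": sorted(targets_to_impute),
        "unusable_targets": sorted(unusable_targets),
    }


training_genes = ["1", "2", "3", "4", "5"]
measured_genes = ["1", "2", "3", "99"]  # "99" is absent from the training set

with_targets = partition_genes(measured_genes, training_genes, targets=["3", "4"])
default_targets = partition_genes(measured_genes, training_genes)
print(with_targets)
print(default_targets)
```

In this toy case, gene "99" would be reported in unusable_measured_genes.txt and gene "3" in unusable_targets.txt.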
Section 4: Additional Information
See LICENSE.md for license information for all data used in this project.
If you use this work, please cite:
Mancuso CA, Canfield JL, Singla D, Krishnan A (2020) A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Research, 48:e125 https://doi.org/10.1093/nar/gkaa881.
Christopher A Mancuso#, Jake Canfield#, Deepak Singla, Arjun Krishnan*
# These authors are joint first authors.
* General correspondence should be addressed to AK at firstname.lastname@example.org.
This work was primarily supported by US National Institutes of Health (NIH) grant R35 GM128765 to AK, and in part by MSU start-up funds to AK and NIH F32 Fellowship F32GM134595 to CM.
We are grateful for the support from the members of the Krishnan Lab.
- Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6
- Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository Nucleic Acids Res. 2002 Jan 1;30(1):207-10
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5.
- Lee YS, Krishnan A, Oughtred R, Rust J, Chang CS, Ryu J, Kristensen VN, Dolinski K, Theesfeld CL, Troyanskaya OG. (2019) A Computational Framework for Genome-wide Characterization of the Human Disease Landscape Cell Systems 8(2):P152-162 DOI: 10.1016/j.cels.2018.12.010
- Lee YS, Krishnan A, Zhu Q, Troyanskaya OG. (2013) Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies. Bioinformatics 29(23):3036-44 DOI https://doi.org/10.1093/bioinformatics/btt529
- Zhu A, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN, Charikar M, Li K & Troyanskaya OG (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia Nature Methods 12(3):211-4 DOI: 10.1038/nmeth.3249