Metagenomic-metabolomic associations workflow

Short description

This workflow is designed to study large arrays of multiomics data by analyzing patterns of cooccurrence of pairs of metagenomics and metabolomics features in a large amount of samples in order to find statistically significant functional associations, e.g. a bacterium produces an antibiotic. The workflow is also meant to assess various metabolomics feature extraction tools and/or correlation methods by generating a decoy database and calculating False Discovery Rate (FDR). The workflow is built and run with Snakemake, which allows to easily change and expand the pipeline when necessary.

Directories structure

input: preprocessed initial data
log: workflow steps logs
output: results
preprocessing: subworkflow for converting the raw data to the unified format
scripts: workflow scripts
tmp: temporary workflow files

Main steps

Data preprocessing

Features data comes in a variety of format, depending on the source and methods used to obtain it, therefore a preprocessing step is needed to convert this data to a unified format (described below). There are no restrictions on the way this step is performed as long as it follows the output requirements, however, a Snakemake-based subworkflow located in the preprocessing directory can be used (needs to be implemented explicitly).

As the result of this step the following files should be created in the input directory:

mbx.feat.<type>: metabolomics features
mgx.feat.<type>: metagenomics features
samples.tsv: samples' names and additional information

Decoy database creation

To be able to evaluate the quality of the feature extraction or association methods that are used, we generate a decoy database of metabolomic features by applying a random permutation to the sample set for every feature. This decoy database is later use to estimate the FDR.

As the result of this step the following files are created in the input directory:

mbx_decoy.feat.<type>: decoy metabolomics features

Pairwise associations search

For every pair of a metagenomic and a metabolomic features we compare their abundance profiles to find pairs of associated features. For now, Fisher's exact test (scipy.stats.fisher_exact) is applied to binary abundance data (present or not present) and the p-value is evaluated to measure the statistical significance of the association of these features, however, other association evaluation methods, e.g. Pearson or Spearman correlation, may be added in the future.

Fisher's exact test can use one of the three alternative hypotheses. Depending on which fisher_alternative is used different kinds of associations would be searched for:

greater: positive associations (presence of feature A is associated with presence of feature 'B') would be found
less: negative associations (presence of feature A is associated with absence of feature 'B') would be found
two-sided: both kinds of associations would be found

Only the pairs with a p-value under the given threshold are reported. If associations_parts equals 1 then all input pairs are processed as one job, otherwise the input is split into given amount of parts of almost equal size and each part is processed as a separate job, all results being merged afterwards. As the result of this step the following files are created in the output directory:

associations.tsv: the reported pairs of associated features

Decoy associations search

The same procedure described above is applied to the decoy database of metabolomic features to determine known false positive associations.

The same fisher_alternative, threshold and associations_parts are used. As the result of this step the following files are created in the output directory:

associations_decoy.tsv: the reported pairs of associated features (using decoy metabolomics feature list)

False Discovery Rate estimation

After the associations search we assess the quality of results obtained for a particular p-value threshold by estimating the FDR using two methods:

Target-Decoy Approach: FDR = associations_decoy / associations_real, where associations_decoy and associations_real are the amounts of associations of metagenomic features with decoy and real metabolomics features respectively.
Benjamini-Hochberg method: the upper-bound estimate on the FDR is calculated as the lowest level for which all the reported associations are considered significant when applying the BH step-up procedure.

FDR is estimated for a list of fdr_thresholds. Every value in fdr_thresholds should be greater than or equal to the previously used threshold. As the result of this step the following files are created in the output directory:

fdr.tsv: a table of estimated FDRs for the given p-value thresholds

Data formats

Samples

samples.tsv file is a TAB-separated table of the samples data with the first line being the header and every consecutive line describing a sample. The table has the following columns:

name: name of the sample

Other columns are allowed and will be ignored by the workflow.

Features

Every feature file has the extension .feat.<type>, where <type> denotes the data representation type in the file and can be one of the following:

bin_dense: dense representation of which samples every the features is present in
bin_sparse: sparse representation of which samples the features is present in
real: real-valued presence data of the features in the samples, e.g. abundance or intensity

*.feat.bin_sparse files have the following structure:

first line consists of the strings name, annotation and samples separated by TAB characters
every consecutive line describes one feature and consists of the name and the annotation of the feature followed by the list of the samples this feature is present in, all fields being separated by TAB characters
it is required that each feature is present in at least one sample

*.feat.bin_dense files have the following structure:

first line consists of the strings name and annotation followed by the list of all the samples, all fields being separated by TAB characters ('\t')
every consecutive line describes one feature and consists of the name and the annotation of the feature followed by the list of '0' and '1' characters, meaning the absence and presence of the feature in the corresponding sample respectively, all fields being separated by TAB characters ('\t')

*.feat.real files have the same structure as the *.feat.bin_dense, but it contains real-valued data instead of binary '0'/'1'.

annotation fields are either <type> or <type>/<custom_annotation>, where <custom_annotation>, if available, is the any relevant annotation of the feature, e.g. the OTU name or the result of molecular database dereplication, and type is one of the following:

MBX_DECOY: for metabolomic decoy features
MBX: for metabolomic features
MGX: for metagenomic features

Associations

associations.tsv and other associations files are TAB-separated tables with the first line being the header and every consecutive line describing a reported association pair. The table has the following columns:

feature1: name of the first feature of the pair
feature2: name of the second feature of the pair
annotation1: annotation of the first feature of the pair
annotation2: annotation of the second feature of the pair
p-value: p-value of the association
samples1: number of the samples the first feature is present in
samples2: number of the samples the second feature is present in
samples_shared: number of the samples both features are present in

Other columns are allowed and will be ignored by the workflow.

FDR list

fdr.tsv file is a TAB-separated table with the first line being the header and every consecutive line describing an FDR calculated for a p-value threshold from fdr_thresholds. The table has the following columns:

FDR: the FDR calculated using Target-Decoy Approach
FDR_BH: the FDR calculated using Benjamini-Hochberg method
threshold: the p-value threshold for which the FDR is calculated
n_decoy_assn: the number of associations with decoy metabolomics features having the p-value under the threshold
n_real_assn: the number of associations with real metabolomics features having the p-value under the threshold

Other columns are allowed and will be ignored by the workflow.

Configuration

config.yaml is the configuration file written in the YAML format. The following entries should be specified:

fisher_alternative: the alternative hypothesis used in Fisher's exact test
threshold: the p-value threshold for the association pair to be reported
fdr_thresholds: the list of p-value thresholds for which the FDR values should be calculated
associations_parts: the number of jobs the associations step is split into allowing for parallel execution

Running the workflow

To run the whole workflow perform the following steps:

Prepare and do the data preprocessing step using any method you choose.
Run snakemake.

Individual steps can also be executed. For more information visit Snakemake website.

Authors

Egor Shcherbin, Liu Cao, Hosein Mohimani

Mohimani Lab, CMU, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Metagenomic-metabolomic associations workflow

Short description

Directories structure

Main steps

Data preprocessing

Decoy database creation

Pairwise associations search

Decoy associations search

False Discovery Rate estimation

Data formats

Samples

Features

Associations

FDR list

Configuration

Running the workflow

Authors

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
dataset		dataset
input		input
log		log
output		output
preprocessing		preprocessing
scripts		scripts
tmp		tmp
README.md		README.md
Research and Academic Use software CODE license file AssociationNetworks.docx		Research and Academic Use software CODE license file AssociationNetworks.docx
Snakefile		Snakefile
config.yaml		config.yaml

mohimanilab/AssociationNetworks

Folders and files

Latest commit

History

Repository files navigation

Metagenomic-metabolomic associations workflow

Short description

Directories structure

Main steps

Data preprocessing

Decoy database creation

Pairwise associations search

Decoy associations search

False Discovery Rate estimation

Data formats

Samples

Features

Associations

FDR list

Configuration

Running the workflow

Authors

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages