Author: Menghan Liu
@ twitter @
email
Date: December 12, 2019
This computational pipeline for exhaustive search of taxonomic contribution to a specific function, within a defined microbiome community.
Status: Source code of the pipeline can be found on this repository. Downstream analyses scripts are available per request.
- Installation
- Getting started
- Tutorial
- Determine mapping identity cutoff
- Test run with sample data
- Citation
git clone https://github.com/ml3958/FindTaxaCtrbt
This command will download all relevent scripts to your local computer.
In order to use these scripts, you either need to be inside this
downloaded folder via cd <Downloaded_FindTaxaCtrbt_folder>
, or add the
full path when calling the script.
Preparation involve two steps, where step 1 is to set up a conda environment with all required dependency; step 2 is to gather data.
Step 1. Environment setup
This step needs to be completed for anyone who wants to use this pipeline.
The environment required for FindTaxaCtrbt can be handled via creating a conda environment based on the FindTaxaCtrbt_environment.yml file provided, with the following code
conda env create -n FindTaxaCtrbt --file FindTaxaCtrbt_environment.yml # Create a new conda environment called FindTaxaCtrbt
conda activate FindTaxaCtrbt # initiate the environment
you can customize your environment name by replacing FindTaxaCtrbt with <name_of_your_choice>
You can also manually install the core dependencies (Diamond, Parallel v0.9.14, R v3.3.2, r packages dplyr and tidyr)
Step 2. Download data
Depending upon experimental design and research purpose, this step needs to be customized performed.
- First, Reference protein: a list of protein homologs for the
microbiome function/enzyme of interest
- a .fasta file containging their amino acid sequences
- a .csv table containing their taxonomic origin
- Second, microbiome meta’omics samples from publicaly avaiable
samples, or freshly generated
- metagenome or metatranscriptome samples
1. Initiate conda environment
conda activate FindTaxaCtrbt
2. Build diamond index for protein reference
For each protein fasta file, you only need to build index for once upon first usage.
diamond makedb \
--in <ref_protein_homolog.fa> \ # path to input fasta file for reference protein homologs
--db <path_of_database_to_write_to> # output directory of database
3. [CORE] Run diamond
You can run the program on one sample
# Run the program
bash scripts/FindTaxaCtrbt.sh <path_to_diamond_index> <path_to_MTG/MTS_sample> <directory_to_write_output>
Inputs:
- <path_to_diamond_index>
- <path_to_MTG/MTS_sample> currently accept .fastq.gz file
- <directory_to_write_output> provided as positional manner, thus must follow the order
Outputs:
- diamond output xx.m8
- diamong log file xx.log
or, run the program on all samples in one directory
<directory_of_input_sample>
using paralllel
parallel \
bash scripts/FindTaxaCtrbt.sh <path_to_database> {.} <directory_to_write_output> \ # command to loop for via parallel
::: `ls <directory_of_input_sample>/*` # provide value for {.}
4. [CORE] Parse diamond output
Rscript --vanilla ../scripts/parse_blastx.R <path_to_diamond_m8_output> <output_file> <identity_cutoff>
or parallel
dir=<.m8_file_directory>
parallel \
Rscript --vanilla ../scripts/parse_blastx.R {1} {1}_protein_summary.txt <identity_cutoff> \
::: \
`ls ${dir}/* |grep m8$`
5. Merge batch output into one sample
Download sample data via this link
Within the downloaded folder, there is a frc_oxc_oxdd_uniref100.faa that contains the amino acid sequences of 6067 protein homologs of oxalate degrading enzymes OXDD, FRC and OXC.
Also, there are two folders MTG/ and MTS/ that contains 3 metagenomic and 3 metatranscriptomic fastq.gz samples, respectively.
For example, I am interested in the oxalyl-coA decarboxylase (OXC), which is involved in the oxalate degradation function. Therefore, I found a homologous protein family for OXC at Uniprot.
UniprotID | Description | Phylum | Family | Genus | Species | Strain |
---|---|---|---|---|---|---|
A0A063X7E6 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | aceti | Acetobacter aceti 1023 |
A9X6P8 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | aceti | Acetobacter aceti |
A0A1Y0V990 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | ascendens | Acetobacter ascendens |
A0A149USA4 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | malorum | Acetobacter malorum |
A0A1A0DAZ7 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | pasteurianus | Acetobacter pasteurianus |
A0A368A7F9 | Oxalyl-CoA decarboxylase | Proteobacteria | Acetobacteraceae | Acetobacter | pasteurianus | Acetobacter pasteurianus |
Now, let’s start a test run on those sample data
# Initiate conda environment
conda activate FindTaxaCtrbt
# Switch to your directory
cd /Users/menghanliu/Documents/FindTaxaCtrbt_sample_data
# Build diamond index for protein reference
diamond makedb \
--in frc_oxc_oxdd_uniref100.faa \ # frc_oxc_oxdd_uniref100.faa as input
--db frc_oxc_oxdd_uniref100.faa # prefix for databae, output: frc_oxc_oxdd_uniref100.faa.dmnd
# Check whether the database is sucesfully built
ls frc_oxc_oxdd_uniref100.faa.dmnd
# Make a new folder to store the results
mkdir -p output
Next we can run the program on either one sample,
# Run the program on one sample MTG/CSM5FZ42.fastq.gz
bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh \
frc_oxc_oxdd_uniref100.faa \
MTG/CSM5FZ42.fastq.gz \
test_result test_result/
Or parallel on all samples within one folder
parallel \
bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh \
frc_oxc_oxdd_uniref100.faa \
{.} \
test_result test_result \
::: ls MTG/*
Microbial contributions to oxalate metabolism in health and disease Menghan Liu, Joseph C. Devlin, Jiyuan Hu, Angelina Volkova, Thomas W Battaglia, Allyson Byrd, Png Loke, Huilin Li, Kelly V Ruggles, Aristotelis Tsirigos, Martin J Blaser, Lama Nazzal medRxiv 2020.01.27.20018770; doi: https://doi.org/10.1101/2020.01.27.20018770