FindTaxaCtrbt

Author: Menghan Liu
@ twitter @ email

Date: December 12, 2019

This computational pipeline for exhaustive search of taxonomic contribution to a specific function, within a defined microbiome community.

Status: Source code of the pipeline can be found on this repository. Downstream analyses scripts are available per request.

Installation

git clone https://github.com/ml3958/FindTaxaCtrbt

This command will download all relevent scripts to your local computer.

In order to use these scripts, you either need to be inside this downloaded folder via cd <Downloaded_FindTaxaCtrbt_folder>, or add the full path when calling the script.

Getting started

Preparation involve two steps, where step 1 is to set up a conda environment with all required dependency; step 2 is to gather data.

Step 1. Environment setup

This step needs to be completed for anyone who wants to use this pipeline.

The environment required for FindTaxaCtrbt can be handled via creating a conda environment based on the FindTaxaCtrbt_environment.yml file provided, with the following code

conda env create -n FindTaxaCtrbt --file FindTaxaCtrbt_environment.yml # Create a new conda environment called FindTaxaCtrbt

conda activate FindTaxaCtrbt # initiate the environment

you can customize your environment name by replacing FindTaxaCtrbt with <name_of_your_choice>

You can also manually install the core dependencies (Diamond, Parallel v0.9.14, R v3.3.2, r packages dplyr and tidyr)

Step 2. Download data

Depending upon experimental design and research purpose, this step needs to be customized performed.

First, Reference protein: a list of protein homologs for the microbiome function/enzyme of interest
- a .fasta file containging their amino acid sequences
- a .csv table containing their taxonomic origin
Second, microbiome meta’omics samples from publicaly avaiable samples, or freshly generated
- metagenome or metatranscriptome samples

Tutorial

1. Initiate conda environment

conda activate FindTaxaCtrbt

2. Build diamond index for protein reference

For each protein fasta file, you only need to build index for once upon first usage.

diamond makedb \
   --in <ref_protein_homolog.fa> \ # path to input fasta file for reference protein homologs
   --db <path_of_database_to_write_to>  # output directory of database

3. [CORE] Run diamond

You can run the program on one sample

# Run the program
bash scripts/FindTaxaCtrbt.sh  <path_to_diamond_index> <path_to_MTG/MTS_sample> <directory_to_write_output>

Inputs:

<path_to_diamond_index>
<path_to_MTG/MTS_sample> currently accept .fastq.gz file
<directory_to_write_output> provided as positional manner, thus must follow the order

Outputs:

diamond output xx.m8
diamong log file xx.log

or, run the program on all samples in one directory <directory_of_input_sample> using paralllel

parallel \ 
     bash scripts/FindTaxaCtrbt.sh  <path_to_database> {.} <directory_to_write_output> \ # command to loop for via parallel 
     ::: `ls <directory_of_input_sample>/*` # provide value for {.}

4. [CORE] Parse diamond output

Rscript --vanilla ../scripts/parse_blastx.R <path_to_diamond_m8_output> <output_file> <identity_cutoff>

or parallel

dir=<.m8_file_directory>
parallel \
   Rscript --vanilla ../scripts/parse_blastx.R {1} {1}_protein_summary.txt <identity_cutoff> \
   ::: \
      `ls ${dir}/* |grep m8$`

5. Merge batch output into one sample

Determine the mapping identity cutoff

Test run with sample data

Download sample data via this link

Within the downloaded folder, there is a frc_oxc_oxdd_uniref100.faa that contains the amino acid sequences of 6067 protein homologs of oxalate degrading enzymes OXDD, FRC and OXC.

Also, there are two folders MTG/ and MTS/ that contains 3 metagenomic and 3 metatranscriptomic fastq.gz samples, respectively.

For example, I am interested in the oxalyl-coA decarboxylase (OXC), which is involved in the oxalate degradation function. Therefore, I found a homologous protein family for OXC at Uniprot.

UniprotID	Description	Phylum	Family	Genus	Species	Strain
A0A063X7E6	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	aceti	Acetobacter aceti 1023
A9X6P8	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	aceti	Acetobacter aceti
A0A1Y0V990	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	ascendens	Acetobacter ascendens
A0A149USA4	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	malorum	Acetobacter malorum
A0A1A0DAZ7	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	pasteurianus	Acetobacter pasteurianus
A0A368A7F9	Oxalyl-CoA decarboxylase	Proteobacteria	Acetobacteraceae	Acetobacter	pasteurianus	Acetobacter pasteurianus

Now, let’s start a test run on those sample data

# Initiate conda environment
conda activate FindTaxaCtrbt 

# Switch to your directory
cd /Users/menghanliu/Documents/FindTaxaCtrbt_sample_data

# Build diamond index for protein reference 
diamond makedb \
   --in frc_oxc_oxdd_uniref100.faa \ # frc_oxc_oxdd_uniref100.faa as input
   --db frc_oxc_oxdd_uniref100.faa  # prefix for databae, output: frc_oxc_oxdd_uniref100.faa.dmnd
   
# Check whether the database is sucesfully built 
ls frc_oxc_oxdd_uniref100.faa.dmnd


# Make a new folder to store the results 
mkdir -p output

Next we can run the program on either one sample,

# Run the program on one sample MTG/CSM5FZ42.fastq.gz
bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh  \
     frc_oxc_oxdd_uniref100.faa \
     MTG/CSM5FZ42.fastq.gz \
     test_result test_result/

Or parallel on all samples within one folder

parallel \
     bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh  \
         frc_oxc_oxdd_uniref100.faa \
         {.} \
         test_result test_result \
    ::: ls MTG/*

Citation

Microbial contributions to oxalate metabolism in health and disease Menghan Liu, Joseph C. Devlin, Jiyuan Hu, Angelina Volkova, Thomas W Battaglia, Allyson Byrd, Png Loke, Huilin Li, Kelly V Ruggles, Aristotelis Tsirigos, Martin J Blaser, Lama Nazzal medRxiv 2020.01.27.20018770; doi: https://doi.org/10.1101/2020.01.27.20018770

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
scripts		scripts
FindTaxaCtrbt_environment.yml		FindTaxaCtrbt_environment.yml
README.md		README.md
README.rmd		README.rmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FindTaxaCtrbt

TOC

Installation

Getting started

Tutorial

Determine the mapping identity cutoff

Test run with sample data

Citation

About

Releases

Packages

Languages

ml3958/FindTaxaCtrbt

Folders and files

Latest commit

History

Repository files navigation

FindTaxaCtrbt

TOC

Installation

Getting started

Tutorial

Determine the mapping identity cutoff

Test run with sample data

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages