Skip to content

ml3958/FindTaxaCtrbt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

73 Commits
 
 
 
 
 
 
 
 

Repository files navigation

FindTaxaCtrbt

Author: Menghan Liu
@ twitter @ email

Date: December 12, 2019

This computational pipeline for exhaustive search of taxonomic contribution to a specific function, within a defined microbiome community.

Status: Source code of the pipeline can be found on this repository. Downstream analyses scripts are available per request.

TOC

  1. Installation
  2. Getting started
  3. Tutorial
  4. Determine mapping identity cutoff
  5. Test run with sample data
  6. Citation

Installation

git clone https://github.com/ml3958/FindTaxaCtrbt

This command will download all relevent scripts to your local computer.

In order to use these scripts, you either need to be inside this downloaded folder via cd <Downloaded_FindTaxaCtrbt_folder>, or add the full path when calling the script.

Getting started

Preparation involve two steps, where step 1 is to set up a conda environment with all required dependency; step 2 is to gather data.

Step 1. Environment setup

This step needs to be completed for anyone who wants to use this pipeline.

The environment required for FindTaxaCtrbt can be handled via creating a conda environment based on the FindTaxaCtrbt_environment.yml file provided, with the following code

conda env create -n FindTaxaCtrbt --file FindTaxaCtrbt_environment.yml # Create a new conda environment called FindTaxaCtrbt

conda activate FindTaxaCtrbt # initiate the environment

you can customize your environment name by replacing FindTaxaCtrbt with <name_of_your_choice>

You can also manually install the core dependencies (Diamond, Parallel v0.9.14, R v3.3.2, r packages dplyr and tidyr)

Step 2. Download data

Depending upon experimental design and research purpose, this step needs to be customized performed.

  • First, Reference protein: a list of protein homologs for the microbiome function/enzyme of interest
    • a .fasta file containging their amino acid sequences
    • a .csv table containing their taxonomic origin
  • Second, microbiome meta’omics samples from publicaly avaiable samples, or freshly generated
    • metagenome or metatranscriptome samples

Tutorial

1. Initiate conda environment

conda activate FindTaxaCtrbt 

2. Build diamond index for protein reference

For each protein fasta file, you only need to build index for once upon first usage.

diamond makedb \
   --in <ref_protein_homolog.fa> \ # path to input fasta file for reference protein homologs
   --db <path_of_database_to_write_to>  # output directory of database

3. [CORE] Run diamond

You can run the program on one sample

# Run the program
bash scripts/FindTaxaCtrbt.sh  <path_to_diamond_index> <path_to_MTG/MTS_sample> <directory_to_write_output>

Inputs:

  1. <path_to_diamond_index>
  2. <path_to_MTG/MTS_sample> currently accept .fastq.gz file
  3. <directory_to_write_output> provided as positional manner, thus must follow the order

Outputs:

  1. diamond output xx.m8
  2. diamong log file xx.log

or, run the program on all samples in one directory <directory_of_input_sample> using paralllel

parallel \ 
     bash scripts/FindTaxaCtrbt.sh  <path_to_database> {.} <directory_to_write_output> \ # command to loop for via parallel 
     ::: `ls <directory_of_input_sample>/*` # provide value for {.}

4. [CORE] Parse diamond output

Rscript --vanilla ../scripts/parse_blastx.R <path_to_diamond_m8_output> <output_file> <identity_cutoff>

or parallel

dir=<.m8_file_directory>
parallel \
   Rscript --vanilla ../scripts/parse_blastx.R {1} {1}_protein_summary.txt <identity_cutoff> \
   ::: \
      `ls ${dir}/* |grep m8$`

5. Merge batch output into one sample

Determine the mapping identity cutoff

Test run with sample data

Download sample data via this link

Within the downloaded folder, there is a frc_oxc_oxdd_uniref100.faa that contains the amino acid sequences of 6067 protein homologs of oxalate degrading enzymes OXDD, FRC and OXC.

Also, there are two folders MTG/ and MTS/ that contains 3 metagenomic and 3 metatranscriptomic fastq.gz samples, respectively.

For example, I am interested in the oxalyl-coA decarboxylase (OXC), which is involved in the oxalate degradation function. Therefore, I found a homologous protein family for OXC at Uniprot.

UniprotID Description Phylum Family Genus Species Strain
A0A063X7E6 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter aceti Acetobacter aceti 1023
A9X6P8 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter aceti Acetobacter aceti
A0A1Y0V990 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter ascendens Acetobacter ascendens
A0A149USA4 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter malorum Acetobacter malorum
A0A1A0DAZ7 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter pasteurianus Acetobacter pasteurianus
A0A368A7F9 Oxalyl-CoA decarboxylase Proteobacteria Acetobacteraceae Acetobacter pasteurianus Acetobacter pasteurianus

Now, let’s start a test run on those sample data

# Initiate conda environment
conda activate FindTaxaCtrbt 

# Switch to your directory
cd /Users/menghanliu/Documents/FindTaxaCtrbt_sample_data

# Build diamond index for protein reference 
diamond makedb \
   --in frc_oxc_oxdd_uniref100.faa \ # frc_oxc_oxdd_uniref100.faa as input
   --db frc_oxc_oxdd_uniref100.faa  # prefix for databae, output: frc_oxc_oxdd_uniref100.faa.dmnd
   
# Check whether the database is sucesfully built 
ls frc_oxc_oxdd_uniref100.faa.dmnd


# Make a new folder to store the results 
mkdir -p output

Next we can run the program on either one sample,

# Run the program on one sample MTG/CSM5FZ42.fastq.gz
bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh  \
     frc_oxc_oxdd_uniref100.faa \
     MTG/CSM5FZ42.fastq.gz \
     test_result test_result/

Or parallel on all samples within one folder

parallel \
     bash <path_to_your_FindTaxaCtrbt_folder>/scripts/FindTaxaCtrbt.sh  \
         frc_oxc_oxdd_uniref100.faa \
         {.} \
         test_result test_result \
    ::: ls MTG/*

Citation

Microbial contributions to oxalate metabolism in health and disease Menghan Liu, Joseph C. Devlin, Jiyuan Hu, Angelina Volkova, Thomas W Battaglia, Allyson Byrd, Png Loke, Huilin Li, Kelly V Ruggles, Aristotelis Tsirigos, Martin J Blaser, Lama Nazzal medRxiv 2020.01.27.20018770; doi: https://doi.org/10.1101/2020.01.27.20018770

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published