Skip to content

oschakoory/RiboTaxa

Repository files navigation

RiboTaxa logo

RiboTaxa v1.5

by Oshma Chakoory, Sophie Marre, and Pierre Peyret.

GitHub GitHub release (latest by date) GitHub GitHub all releases

RiboTaxa is a complete pipeline to rapidly filter and reconstruct the full length SSU rRNA gene from Illumina (meta)genomic dataset and perform taxonomic classification on the reconstructed sequences.

RiboTaxa takes as input singled-end or paired-end files which can be in compressed format (fastq.gz) or uncompressed format (.fastq).

Tools used in RiboTaxa pipeline:

RiboTaxa Pipeline

If you use RiboTaxa in your work, please cite the RiboTaxa paper:

Oshma Chakoory, Sophie Comtet-Marre, Pierre Peyret, RiboTaxa: combined approaches for rRNA genes taxonomic resolution down to the species level from metagenomics data revealing novelties, NAR Genomics and Bioinformatics, Volume 4, Issue 3, September 2022, lqac070, https://doi.org/10.1093/nargab/lqac070

Contents

Quick-start

Install Miniconda

Miniconda provides the conda environment and package manager, and is the recommended way to install RiboTaxa. Follow the Miniconda instructions for downloading and installing Miniconda. It is important to follow all of the directions provided in the Miniconda instructions, particularly ensuring that you run the following commands at the end of the installation process, to ensure that your Miniconda installation is fully installed and configured correctly:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Updating Miniconda

After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:

conda update conda

Install RiboTaxa within a conda environment

Once you have Miniconda installed, run the following command to create the required conda environments for RiboTaxa. Creating a virtual environment is recommended to avoid conflicts between required dependencies and those in existing environment.

git clone https://github.com/oschakoory/RiboTaxa.git
cd RiboTaxa
bash conda_virt_env.sh

The first virtual environment RiboTaxa_py27 uses python 2.7 and contains all the necessary tools for quality control, to filter 16S18S reads and reconstruct full length rRNAs sequences.

The second virtual environment RiboTaxa_py36 uses python 3.6 and runs sklearn classifier of QIIME2 to taxonomically classify full length reconstruted rRNAs sequences.

Activate the conda environment

Activation of virtual environment is done automatically in scripts. No need to activate environment manually, unless you want to scrutinize each tool used in RiboTaxa.

conda activate RiboTaxa_py27
conda list

emirge_amplicon.py --help
conda activate RiboTaxa_py36
conda list

Install usearch (if you do not have usearch in your $path)

USEARCH is a unique sequence analysis tool which offers search and clustering algorithms that are often orders of magnitude faster than BLAST. To reconstruct full length rRNA sequences, EMIRGE uses USEARCH.

Please ensure that you have USEARCH installed in your $PATH using the following command:

usearch --version
echo $PATH

If not, please follow the usearch-installation instructions to download USEARCH and install it in your $PATH.

To make sure that usearch is successfully installed, run

usearch --version

Now, you are ready to use RiboTaxa !!!

Let the power of RiboTaxa begin...

Indexing databases

RiboTaxa pipeline includes tools like sortmerna and emirge, both of which need indexed databases of their own. The latest database SILVA SSU 138.1 can be downloaded here and the indexed database SILVA SSU 138.1 for RiboTaxa are available here.

To index your own database, you will need to fill the config file indexDB_arguments.conf.If you are not sure of certains parameters, leave as defined except for directories and input files.

The configuration file is very important and each parameter needs to be filled to avoid errors.

script used: indexDB_RiboTaxa.sh

[Setting up directories...]

####set up RiboTaxa directory path **
RiboTaxa_DIR = /home/user/Documents/RiboTaxa

[Setting up database path...]

####set up database path+database file in fasta format **
DB_DIR = /home/user/Documents/Databases/SILVA_138_SSURef_Nr99_tax_silva.fasta

#### Set up output directory **
OUTPUT = /home/user/Documents/Databases

#### set up the number of threads/CPUs to be used for indexing
THREAD = 8

#### set up the clustering id threshold (between 0.7-1)
CLUSTER_ID = 0.97

Once it is filled with all the necessary information, you can run the following command and index your database.

bash -i RiboTaxa_DIR/indexDB_RiboTaxa.sh PATH_TO/indexDB_arguments.conf

Indexing database takes a while. Using the maximum number of available threads/CPUs will save time. This step will produce two directories in your OUTPUT path:

  • sortmerna_indexed_DB : containing indexed files for sortmeRNA
  • bowtie_indexed_DB : containing indexed files by Bowtie to be used for EMIRGE and MetaRib

For the taxonomic classification by sklearn classifier, the database used is the trained classifier SILVA 138 reference sequence downloaded from the Data resources of Qiime2.

Please ensure that you are using the Silva138 database from qiime2.2022.11. You can also download it here.

To use other databases such as Greengenes or UNITE, you can download already trained classifer or train your own database by following the Data resources instructions.

Running RiboTaxa pipeline

RiboTaxa pipeline will

RiboTaxa can be used for one singled-end/paired-end dataset or multiple singled-end/paired-end datasets in the same folder.

To run RiboTaxa, you will need to fill the config file RiboTaxa_arguments.conf. If you are not sure of certains parameters, leave as defined except for fields denoted by **.

##The configuration file is very important and each parameter needs to be filled to avoid errors.
##Mandatory parameters are denoted by **. Most of them depend on the configuration of your computer capacities/configuration and of your sequencing data.
Other tool parameters have been optimized and can be left at default. 

#-----------------------
#Setting up directories
#-----------------------

[BASE]
####set up RiboTaxa directory **
RiboTaxa_DIR = /home/user/RiboTaxa

####set up data directory containing only raw reads in fastq/fastq.gz format **
###paired-end files should be metagenome_R1.fastq/fastq.gz and metagenome_R2.fastq/fastq.gz
###singled-end file should be metagenome.fastq/fastq.gz
DATA_DIR = /home/user/Documents/raw_reads

####format of your paired end files **
## fastq : if files are not compressed
## fastq.gz: if files are compressed in gz format
FORMAT = fastq

####set up output directory **
OUTPUT = /home/user/Documents/RiboTaxa_results

####number of threads/CPUS to be used through the pipeline **
THREAD = 8

[BBMAP]
####RAM limit to be used by BBTOOLS/BBMAP during quality control and mapping **
##depends of the computer RAM. Use approx 80% of available RAM
##example: Available RAM = 16GB, therefore RAM = 80/100*16 = 12GB
RAM = 12

#-----------------------------
#Quality control using BBTOOLS
#-----------------------------

####Trim reads to remove bases matching adapter sequences (Default value = r)
## f (don't trim)
## r (trim to the right)
## l (trim to the left)
##In ktrim=r mode, once a reference kmer is matched in a read, 
##that kmer and all the bases to the right will be trimmed, leaving 
##only the bases to the left; this is the normal mode for adapter trimming.

ktrim = r

####Kmer length used for finding adapters (Default value = 21)
kmer = 21

####Reads shorter than this length (bases) after trimming will be discarded (Default value = 60)
minlength = 60

#Regions with average quality BELOW this will be trimmed (Default value = 20)
trimq = 20

####Trim read ends to remove bases with quality below trimq (Default value = rl)
# rl (trim both ends), 
# f (neither end), 
# r (right end only), 
# l (left end only),
# w (sliding window)

qtrim = rl

####reads with more Ns than this (after trimming) will be discarded (Default value = 1)
maxns = 1

#------------------------------------
#Filter 16S/18S reads using SortmeRNA
#------------------------------------

####indexed database directory for sortmerna **
##This directory should contain one .clustered.fasta and several .clustered. files
SORTMERNA_DB = /home/user/Documents/Databases/sortmerna_indexed_DB


#----------------------------------------------------------
#Reconstructing 16S/18S sequences using EMIRGE and MetaRIB
#----------------------------------------------------------

[EMIRGE]

####set up database directory containing indexed files for emirge **
##This directory should contain one fasta file and several .ebwt files
EMIRGE_DB = /home/user/Documents/Databases/bowtie_indexed_DB

####length of longest reads **
MAX_LENGTH = 300

####identity threshold (Default value = 1)
##This the JOIN_TRESHOLD parameter of EMIRGE. If two candidate sequences 
##share >= this fractional identity over their bases with mapped reads, then
##merge the two sequences into one for the next iteration. Fixed to 1, sequence
##reconstruction gave the best results on controlled samples (mock or synthetic communities).
IDENTITY = 1 

####number of iterations (Default value = 40)
##Number of iterations to perform by EMIRGE for sequence reconstruction.
##The default value can fit to most of metagenomic data. EMIRGE authors recommended
to increase this value for complex communities. 
NUM_ITERATION = 40

####mean insert size **
##Insert size distribution mean
MEAN_INSERT_SIZE = 300 

####standard deviation **
##Insert size distribution standard deviation
STD_DEV = 100 

####minimum fraction of the length of a candidate
##reference sequence that must be covered by mapped
##reads (Default=0.3) Range [0.0,1.0]
MIN_COV = 0.3

[METARIB]
####Subsampling reads number in each iteration (Default=1000000)
SAMPLING_NUM = 1000000

#-----------------------------------------------------------
#Taxonomic classfication using sklearn_classifier of qiime2
#-----------------------------------------------------------

####Set up path+database name for sklearn classifier **
SKLEARN_DB = /home/user/Documents/Databases/qiime2020.8_silva138/silva-138-99-nb-classifier.qza

####Confidence threshold for limiting taxonomic depth (default = 0.7)
##This threshold ensures qualititative affiliation of reconstructed sequences. 
##We do not recommend to lower this value under 0.7. 
CONFIDENCE = 0.7

#Number of reads to process in each batch (default = auto: 20 000 sequences)
#use BATCH = 1 if you have less than 16GB RAM to avoid errors
BATCH = auto

Once it is filled with all the necessary information, you can use the following command to run the RiboTaxa pipeline for

Singled-end dataset(s)
bash -i RiboTaxa_DIR/Pipeline_RiboTaxa_SE.sh PATH_TO/RiboTaxa_arguments.conf
Paired-end dataset(s)
bash -i RiboTaxa_DIR/Pipeline_RiboTaxa_PE.sh PATH_TO/RiboTaxa_arguments.conf

For each singled-end/paired-end sample, RiboTaxa will create one directory using the sample name in your OUTPUT path of your RiboTaxa_arguments.conf file. Each sample directory will contain the 4 following sub-directories:

  • quality_control : This directory contains your (meta)genomics files after adpaters removal and trimming. It also has two sub_directories before_fastqc and after_fastqc containing quality reports of your sequence files before and after trimming. You may look at the .html files in each sub-directory to have an overview of each (meta)genomics file or look into multiqc folder to have an overview of all the (meta)genomics files given to this pipeline.

  • output_sortmerna : This folder contains filtered 16S/18S sequences from your trimmed (meta)genomics sequence files: metagenome_R1_16S18Sreads.fastq,metagenome_R2_16S18Sreads.fastq(for paired-end) or metagenome_16S18Sreads.fastq (for singled-end) and a metagenome.log indicating the % of 16S/18S reads filtered from your (meta)genomics dataset.

  • SSU_sequences : This folder contains two subfolders: output_emirge and output_metarib. These subfolders contain iterations perfomed by EMIRGE and MetaRib to reconstruct full-length/nearly full-length rRNA 16S/18S gene sequences.

  • The file metagenome_SSU_sequences.fasta contains the final full-length/nearly full-length rRNA 16S/18S gene sequences recontructed by EMIRGE and MetaRib.

  • Taxonomy : This folder contains the taxonomic classification of the full-length SSU rRNA sequences reconstructed by EMIRGE. It takes metagenome_SSU_sequences.fasta as input, converts it to metagenome_SSU_sequences_qiime2.qza and returns the classification of each sequence as metagenome_renamed_16S18S_recons_qiime2_taxonomy.qza. To calculate relative abundace of each reconstructed SSU sequence, BBmap is used to map short reads onto SSU sequence and the number of assigned reads per sequence is divided by the sum of all the assigned reads (and multiplied by 100). Relative abundance is thus expressed in %. To view the classification and relative abundance of each sequence, open metagenome_SSU_taxonomy_abundance.tsv.

. The metagenome_SSU_taxonomy_abundance.tsv contains the following column names:

Sequence_ID		Domain		Phylum			Class			Order		Family		Genus			Species		Confidence 	Length(bp)	Assigned reads		Relative_Abundance(%)
3|EU334524.1.1558	Bacteria	Desulfobacterota	Desulfuromonadia	Geobacterales	Geobacteraceae	Trichlorobacter	Geobacter_lovleyi	0.999874566	1425			68			4.6687

Running RiboTaxa pipeline on test data

To run RiboTaxa pipeline on test data, you need to download the indexed database of SILVA SSU 138.1 available here

For the taxonomic classification by sklearn classifier, the database used is the trained classifier SILVA 138 reference sequence downloaded from the Data resources of Qiime2.

Once all the indexed databases are successfully downloaded into ~/database, you need to update the directory path of each in your RiboTaxa_arguments.conf. The remaining parameters can be left as default.

Then run:

bash -i RiboTaxa_DIR/Pipeline_RiboTaxa_PE.sh RiboTaxa_DIR/test_data/RiboTaxa_arguments.conf

The test data directory also contains the results of RiboTaxa of the test sample.

Group multiple samples taxonomy files into one

To group taxonomic files of multiple samples into one, you will need to copy all taxonomic files (*_SSU_taxonomy_abundance.tsv) into the same folder along with a sample.csv (see in RiboTaxa_DIR/test_data/multiple_samples_Taxonomy):

Sample01,Control
Sample02,Control
Sample03,Control
Sample04,Treated
Sample05,Treated
Sample06,Treated

Then run:

cd RiboTaxa_DIR/scripts

./RiboTaxa_group_taxonomy.sh $INPUT_PATH $OUTPUT_PATH

whereby: $INPUT_PATH is the input path containing the input files

$OUTPUT_PATH is the desired output path

An exemple test has been conducted with the following parameters:

$INPUT_PATH = RiboTaxa_DIR/test_data/multiple_samples_Taxonomy

$OUTPUT_PATH = RiboTaxa_DIR/test_data/multiple_samples_Taxonomy/Output

The results are in RiboTaxa_DIR/test_data/multiple_samples_Taxonomy/Output.

In the output folder, there are 4 files:

  1. Complete_taxonomy_abundance.csv: containing the abundance of the complete taxonomy in all the samples
  2. Family_abundance.csv: containing the abundance of the families in all the samples
  3. Genus_abundance.csv: containing the abundance of the genera in all the samples
  4. Species_abundance.csv: containing the abundance of the species in all the samples

Example of the Species_abundance.csv:

Study Control Control Control Treated Treated Treated
Sample Sample01 Sample02 Sample03 Sample04 Sample05 Sample06
Bifidobacterium_bifidum 0 0.2050580 0.4260770 12.6970194 7.3577775 0.5264176
Bifidobacterium_sp. 0 0.008041001 0.098325006 0.007881000 0.045245990 0.642682495
Cosenzaea_myxofaciens 0 0.04422800 0 0 0 0.03552537
Enterobacter_sp. 0.5 1.881710 0.732972 0.165496 0.010442 0
Lactobacillus_acidophilus 0 2.850710 7.460818 0.058543 4.284497 12.540370
Lactobacillus_kitasatonis 0 0.0241240 0.5095060 0 0.8405428 1.8279293
Proteus_mirabilis 0 0.05629001 18.44943811 6.40029719 14.55544994 42.99186000
Proteus_sp. 0 0.036187 0 0 0 0
Streptomyces_cinnamoneus 0 0.06433201 0 0 0.01044200 0
uncultured_Syntrophomonas 15.85 0.016083 0 0 0 0