MetaMeta: Integrating metagenome analysis tools to improve taxonomic profiling

Piro, V. C., Matschkowski, M., & Renard, B. Y. (2017). MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling. Microbiome, 5(1), 101. http://doi.org/10.1186/s40168-017-0318-y

Install:

Miniconda:

# Download conda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh 	

# Execute. Make sure to answer "yes" when asked to add conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh 		

# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

MetaMeta:

conda install metameta=1.2.0

All other tools and dependencies are installed in their own environments automatically on the first run (with the --use-conda parameter active).

Alternatively, install MetaMeta in a separate environment (named "metametaenv") with the command:

conda create -n metametaenv metameta=1.2.0	
source activate metametaenv # Command to activate the environment. To deactivate use "source deactivate"

Run:

Create a configuration file (yourconfig.yaml) with the required fields (workdir, dbdir and samples):

workdir: "/home/user/folder/results/"
dbdir: "/home/user/folder/databases/"
samples:
  sample_name_1:
     fq1: "/home/user/folder/reads/file.1.fq"
     fq2: "/home/user/folder/reads/file.2.fq"

All paths set in this file are interpreted relative to the workdir unless they are absolute.
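As a minimal sketch of the path rule above, the snippet below checks the three required fields and anchors relative read paths at workdir. It assumes the configuration has already been loaded into a dictionary (for example with a YAML parser); the function name resolve_sample_paths and the mixed relative/absolute sample values are hypothetical and not part of MetaMeta.

```python
import posixpath

REQUIRED_FIELDS = ("workdir", "dbdir", "samples")


def resolve_sample_paths(config):
    """Return {sample: {key: path}} with relative paths anchored at workdir.

    Illustrative helper only; MetaMeta's own handling may differ.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    workdir = config["workdir"]
    resolved = {}
    for sample, reads in config["samples"].items():
        # posixpath.join returns the second argument unchanged when it is absolute.
        resolved[sample] = {
            key: posixpath.join(workdir, path) for key, path in reads.items()
        }
    return resolved


# Hypothetical configuration: fq1 is relative, fq2 is absolute.
config = {
    "workdir": "/home/user/folder/results/",
    "dbdir": "/home/user/folder/databases/",
    "samples": {
        "sample_name_1": {
            "fq1": "reads/file.1.fq",
            "fq2": "/home/user/folder/reads/file.2.fq",
        }
    },
}
paths = resolve_sample_paths(config)
print(paths["sample_name_1"])
```

Here fq1 ends up under the workdir while fq2 is left untouched because it is already absolute.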

Check rules and output files:

metameta --configfile yourconfig.yaml -np

Run MetaMeta:

metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24

Alternatively, make a copy of the complete example configuration file to see the full set of parameters:

cp ~/miniconda3/opt/metameta/config/example_complete.yaml yourconfig.yaml

The number of --cores is the total amount available to the pipeline. The number of threads for each tool should be set in the configuration file (yourconfig.yaml) with the parameter threads.

On the first run, MetaMeta will download and install the configured tools as well as the database files (archaea_bacteria_201503 by default, see below) necessary for each tool.

Pre-configured databases:

Available databases:

Info	Date	metameta database name
Archaea + Bacteria - RefSeq Complete Genomes	2015-03	`archaea_bacteria_201503`
Fungal + Viral - RefSeq Complete Genomes	2017-09	`fungi_viral_201709`

Database availability per tool:

database	clark	dudes	gottcha	kaiju	kraken	motus
`archaea_bacteria_201503`	Yes	Yes	Yes	Yes	Yes	Yes
`fungi_viral_201709`	Yes	Yes	No	Yes	Yes	No
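The availability table can also be written as a small lookup, which may help when deciding which tools to enable for a given database. This is only a sketch: the values are copied from the table above, and the names TOOLS, AVAILABILITY and tools_for are hypothetical.

```python
# Availability copied from the table above (True = pre-configured database exists).
TOOLS = ("clark", "dudes", "gottcha", "kaiju", "kraken", "motus")
AVAILABILITY = {
    "archaea_bacteria_201503": dict(zip(TOOLS, (True, True, True, True, True, True))),
    "fungi_viral_201709": dict(zip(TOOLS, (True, True, False, True, True, False))),
}


def tools_for(database):
    """List the tools that have a pre-configured copy of the given database."""
    return [tool for tool in TOOLS if AVAILABILITY[database][tool]]


print(tools_for("fungi_viral_201709"))
```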

Running sample data:

cd ~/miniconda3/opt/metameta/

Pre-configured Archaea and Bacteria database:

./metameta --configfile sampledata/sample_data_archaea_bacteria.yaml --use-conda --keep-going --cores 6

Custom database (some viral reference genomes):

./metameta --configfile sampledata/sample_data_custom_viral.yaml --use-conda --keep-going --cores 6

Results:

cd sampledata/results/

Running MetaMeta on a cluster environment:

Make a copy of cluster configuration file:

cp ~/miniconda3/opt/metameta/config/cluster.json yourcluster.json

Edit the file with your cluster specifications (threads, partitions, cpu/memory, etc) for each rule.

Run MetaMeta (slurm example):

metameta --configfile yourconfig.yaml --keep-going --use-conda -j 999 --cluster-config yourcluster.json --cluster "sbatch --job-name {cluster.job-name} --output {cluster.output} --partition {cluster.partition} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem} --time {cluster.time}"

You can change the cluster command (sbatch) and its options to match your cluster system.
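To illustrate how the {cluster.*} placeholders relate to the cluster configuration file, the sketch below fills a shortened command template from a JSON document. It is not Snakemake's implementation. It assumes the file follows the usual Snakemake cluster-config layout, with a "__default__" section that rule-specific sections override; the rule name example_rule and all values are hypothetical.

```python
import json
import re


def render_cluster_command(template, cluster_config, rule):
    """Fill {cluster.<key>} placeholders for one rule (illustrative only)."""
    # Rule-specific values override the shared "__default__" section.
    values = dict(cluster_config.get("__default__", {}))
    values.update(cluster_config.get(rule, {}))
    return re.sub(
        r"\{cluster\.([\w-]+)\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


# Hypothetical cluster configuration with one rule-specific override.
cluster_config = json.loads("""
{
  "__default__": {"partition": "main", "time": "01:00:00"},
  "example_rule": {"time": "12:00:00"}
}
""")
template = "sbatch --partition {cluster.partition} --time {cluster.time}"
command = render_cluster_command(template, cluster_config, "example_rule")
print(command)
```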

Custom databases:

By default, MetaMeta uses Archaea and Bacteria sequences as its reference database (archaea_bacteria_201503, see above). Additionally, MetaMeta allows the creation of custom databases.

First, select which databases should be used in the configuration file:

databases:
  - archaea_bacteria_201503
  - custom_db

All samples will run against the "archaea_bacteria_201503" and the new "custom_db" databases.

Second, create an entry with the path to the sequences that should be added to the custom database:

custom_db:
    clark: "sampledata/database/"
    dudes: "sampledata/database/"
    kaiju: "sampledata/database/"
    kraken: "sampledata/database/"

clark and dudes require one or more fasta files (extension .fna) with the accession.version identifier after the header ">" (e.g. ">NC_001998.1 Guinea pig Chlamydia phage, complete genome")
kaiju requires one or more GenBank flat files (extension .gbff)
kraken requires one or more fasta files (extension .fna) with the gi identifier on the header (e.g. ">gi|9632287|ref|NC_001998.1| Guinea pig Chlamydia phage, complete genome")
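Before building a custom database it can be useful to check which header style a fasta file uses. The sketch below classifies a header line with two regular expressions derived only from the two examples above; real data may need looser patterns, and the function name header_style is hypothetical.

```python
import re

# Patterns derived from the two header examples above; real data may need looser rules.
ACCESSION_HEADER = re.compile(r"^>[A-Za-z_]+[0-9]+\.[0-9]+(\s|$)")
GI_HEADER = re.compile(r"^>gi\|[0-9]+\|[a-z]+\|[A-Za-z_]+[0-9]+\.[0-9]+\|")


def header_style(line):
    """Return 'gi', 'accession.version' or 'unknown' for a fasta header line."""
    if GI_HEADER.match(line):
        return "gi"
    if ACCESSION_HEADER.match(line):
        return "accession.version"
    return "unknown"


print(header_style(">NC_001998.1 Guinea pig Chlamydia phage, complete genome"))
print(header_style(">gi|9632287|ref|NC_001998.1| Guinea pig Chlamydia phage, complete genome"))
```

Under these assumptions, the first style is what clark and dudes expect and the second is what kraken expects.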

MetaMeta will compile the "custom_db" on the first run and use it as a database. Once this has finished, the database definition can be deleted from the configuration file for the following runs.

Creating a custom database based on NCBI genomes:

It is possible to create a custom database based on a set of genomes from NCBI.

Download the genome_updater script:

git clone https://github.com/pirovc/genome_updater

Download the desired database. Example: all fungal genomes available on RefSeq, in fasta and GenBank formats, using 6 threads:

genome_updater/genome_updater.sh -d "refseq" -g "fungi" -f "genomic.fna.gz,genomic.gbff.gz" -t 6 -o fungi_genomes/
mkdir -p custom_fungi_db/clark_dudes/ custom_fungi_db/kaiju/ custom_fungi_db/kraken/

Extract files:

clark and dudes:

zcat fungi_genomes/files/*.fna.gz > custom_fungi_db/clark_dudes/fungi_genomes.fna

kaiju:

zcat fungi_genomes/files/*.gbff.gz > custom_fungi_db/kaiju/fungi_genomes.gbff

kraken (with header conversion to GI, old NCBI style):

zcat fungi_genomes/files/*.fna.gz | awk '{if(substr($0, 1, 1)==">"){sep=index($0," ");acc=substr($0,2,sep-2);header=substr($0,sep+1); cmd="wget -qO - \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id="acc"&rettype=gi\""; cmd | getline gi; close(cmd); print ">gi|" gi "|ref|" acc "| " header }else{ print $0 }}' > custom_fungi_db/kraken/fungi_genomes.fna
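The header rewrite performed by the awk command can be sketched in Python as follows. This is an illustration of the string transformation only: the GI lookup is passed in as a function, and the example uses a plain dictionary instead of querying NCBI. The names to_gi_header and convert_fasta are hypothetical.

```python
def to_gi_header(header, gi):
    """Rewrite '>ACC.VER description' as '>gi|GI|ref|ACC.VER| description'."""
    accession, _, description = header[1:].partition(" ")
    return ">gi|{}|ref|{}| {}".format(gi, accession, description)


def convert_fasta(lines, lookup_gi):
    """Yield fasta lines with headers converted; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            accession = line[1:].split(" ", 1)[0]
            yield to_gi_header(line, lookup_gi(accession))
        else:
            yield line


# Stand-in lookup table; the awk command fetches this value from NCBI instead.
known_gis = {"NC_001998.1": "9632287"}
fasta = [">NC_001998.1 Guinea pig Chlamydia phage, complete genome\n", "ACGT\n"]
converted = list(convert_fasta(fasta, known_gis.get))
print(converted[0])
```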

Add an entry to the configuration file:

databases:
  - new_custom_fungi_db

Finally, add the path for each set of reference sequences to the configuration file:

new_custom_fungi_db:
    clark: "custom_fungi_db/clark_dudes/"
    dudes: "custom_fungi_db/clark_dudes/"
    kaiju: "custom_fungi_db/kaiju/"
    kraken: "custom_fungi_db/kraken/"

On the first run, MetaMeta will compile the "new_custom_fungi_db" database for each configured tool. Once this has finished, the database definition can be deleted from the configuration file for the following runs.

Pre-install a complete environment:

wget https://raw.githubusercontent.com/pirovc/metameta/master/envs/metameta_complete.yaml
conda env create -f metameta_complete.yaml
source activate metametaenv_complete

Merging final results:

To merge final results from many samples into one final tabular file:

~/miniconda3/opt/metameta/scripts/merge_final_profiles.sh workdir/samples_*/metametamerge/database/final.metametamerge.profile.out

Folder structure:

MetaMeta can run several tools with several samples against several databases. The files on the working directory and database directory are organized in the structure below:

WORKDIR:
	SAMPLE_1/
		TOOL_1/ (*)
			DB_1/
			DB_2/
			...
		TOOL_2/ (*)
			...
		PROFILES/
			DB_1/
				TOOL_1.profile.out
				TOOL_2.profile.out
				...
			DB_2/
				...
		METAMETAMERGE/
			DB_1/
				FINAL_PROFILE.out
				FINAL_PROFILE_KRONA.html
			DB_2/
				...
		LOG/
			DB_1/
			DB_2/
			...
		READS/ (*)
			TOOL_1.1.fq
			TOOL_1.2.fq
			TOOL_2.1.fq
			TOOL_2.2.fq
			...
	SAMPLE_2/
		...
	CLUSTERLOG/ (**)

DBDIR:
	DB_1/
		TOOL_1_DB/
		TOOL_2_DB/
		...
		TOOL_1.dbprofile.out
		TOOL_2.dbprofile.out
		...
		LOG/
	DB_2/
		...
	TAXONOMY/
		LOG/

(*) removed when keepfiles=0 (**) only when running on cluster mode
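Given this layout, the location of a final profile can be derived from the working directory, sample name and database name. The sketch below assumes the lowercase directory and file names used in the merge command above (metametamerge, final.metametamerge.profile.out); the function name final_profile_path is hypothetical.

```python
from pathlib import PurePosixPath


def final_profile_path(workdir, sample, database):
    """Build the expected path of the merged profile for one sample and database."""
    return (
        PurePosixPath(workdir)
        / sample
        / "metametamerge"
        / database
        / "final.metametamerge.profile.out"
    )


path = final_profile_path(
    "/home/user/folder/results/", "sample_name_1", "archaea_bacteria_201503"
)
print(path)
```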

Adding a new tool:

MetaMeta integrates profiling and binning tools and has 6 pre-configured tools (clark, dudes, gottcha, kaiju, kraken and motus). To be added to the pipeline, a new tool must use the NCBI Taxonomy structure and nomenclature/identifiers. MetaMeta accepts the BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv file in the following format:

Profiling: rank, taxon name or taxid, abundance

Example:

genus   Methanospirillum        0.0029
genus   Thermus 0.0029
genus   568394      0.0029
species Arthrobacter sp. FB24   0.0835
species 195      0.0582
species Mycoplasma gallisepticum        0.0536
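A line in this profiling format can be parsed as sketched below. Because taxon names may contain spaces, the sketch takes the first field as the rank, the last field as the abundance, and everything in between as the taxon; purely numeric taxa are treated as taxids. The function name parse_profile_line is hypothetical and this is not MetaMeta's own parser.

```python
def parse_profile_line(line):
    """Split a profiling line into (rank, taxon name or taxid, abundance)."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("expected rank, taxon and abundance: " + repr(line))
    rank = parts[0]
    abundance = float(parts[-1])
    # Taxon names may contain spaces, so rejoin everything between rank and abundance.
    taxon = " ".join(parts[1:-1])
    if taxon.isdigit():
        taxon = int(taxon)  # numeric entries are taken to be NCBI taxids
    return rank, taxon, abundance


print(parse_profile_line("species Arthrobacter sp. FB24   0.0835"))
print(parse_profile_line("genus   568394      0.0029"))
```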

Binning: readid, taxon name or taxid, length of sequence assigned

Example:

M2|S1|R140      354     201
M2|S1|R142      195     201
M2|S1|R145      457425  201
M2|S1|R146      562     201
M2|S1|R147      1245471 201
M2|S1|R150      354     201
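As a rough illustration of what a binning file contains, the sketch below sums the assigned sequence length per taxon for the example lines above. It only demonstrates the file format and makes no claim about how MetaMeta itself processes binning output; the function name assigned_length_per_taxon is hypothetical.

```python
from collections import defaultdict


def assigned_length_per_taxon(lines):
    """Sum the assigned sequence length for each taxon in a binning file."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        # First field is the read id, last field the length, the rest the taxon.
        taxon = " ".join(parts[1:-1])
        totals[taxon] += int(parts[-1])
    return dict(totals)


example = [
    "M2|S1|R140      354     201",
    "M2|S1|R142      195     201",
    "M2|S1|R145      457425  201",
    "M2|S1|R146      562     201",
    "M2|S1|R147      1245471 201",
    "M2|S1|R150      354     201",
]
totals = assigned_length_per_taxon(example)
print(totals)
```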

The MetaMeta pipeline uses Snakemake. To add a new tool to the pipeline, it is necessary to create the two main files described below. Replace 'newtool' with the tool identifier (lower case, no spaces, no special characters):

tools/newtool.sm -> specifies how to execute the tool
	Rules:
	- newtool_run_1[..n] -> one or more rules necessary to run the tool
	- newtool_rpt -> final rule that should output a file newtool.profile.out in an accepted output format (described above)
	
tools/newtool_db_custom.sm -> specifies how to download/compile the database/references
	Rules:
	- newtool_db_custom_1[..n] -> one or more rules necessary to compile the database.
	- newtool_db_custom_profile -> this rule automatically generates the database profile. Its output should be a file (newtool.dbaccession.out) with the accession.version identifier for all sequences used in the database.
	- newtool_db_custom_check -> rule to check the required database files. Its input should list all mandatory files that must be present for the database to work properly.

Template files can be found inside the folder tools/template. Once the two files are inside the tools folder, it is necessary to add the tool identifier to the YAML configuration file.

Changelog:

v1.2.0)

Updated to Snakemake 4.3.0 (from 3.9.1)
Bug fixes on custom database creation and database profile generation
Centralized taxonomy download (once for all tools, kept on dbdir:taxonomy/)
Updated tools: kaiju 1.0 -> 1.4.5, dudes 0.07 -> 0.08, spades 3.9.0 -> 3.11.1
Addition of a new pre-configured database: fungi_viral_201709
Multiple pre-configured databases support
Several fixes on custom database creation

v1.1.1) Bug fixes parsing output files for kraken and kaiju

v1.1) Support single and paired-end reads, multiple and custom databases, krona integration
