The Metagenomic Sequence Simulator (MeSS) is a Snakemake pipeline, implemented using Snaketool, for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.
MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS optionally generates BAM alignment files and taxonomic and sequence abundances in CAMI format.
%%{init: {'theme':'forest'}}%%
flowchart LR
input["samples.tsv
or
samples/*.tsv"] --> taxons
subgraph genome_download["genome download"]
dlchoice{download ?}
taxons["taxa or
accessions"] --> dlchoice
dlchoice -->|yes| assembly_finder
dlchoice -->|no| fasta
assembly_finder --> fasta
end
style genome_download color:#15161a
input --> distchoice
subgraph community_design["`**community design**`"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
(lognormal, even)"]
dist --> abundances
distchoice -->|no| reads
distchoice -->|no| bases
distchoice -->|no| abundances
depth["coverage depth"]
reads --> depth
bases --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
end
style community_design color:#15161a
fasta --> simulator
depth --> simulator
simulator["read simulator
(art_illumina, pbsim3...)"]
simulator --> bam
simulator --> fastq
simulator --> CAMI-profile
%% subgraph color fills
classDef red fill:#faeaea,color:#fff,stroke:#333;
classDef blue fill:#eaecfa,color:#fff,stroke:#333;
class genome_download blue
class community_design red
More details can be found in the documentation
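The "draw distribution" branch of the flowchart can be pictured with a minimal sketch like the one below. It is not MeSS's implementation: the function names, the `mu`, `sigma` and `seed` parameters and their default values are all hypothetical, chosen only to illustrate how lognormal or even relative abundances could be drawn for a list of taxa and normalised to sum to 1.

```python
import random


def lognormal_abundances(taxa, mu=1.0, sigma=2.0, seed=42):
    """Draw one lognormal value per taxon and normalise to relative abundances.

    mu, sigma and seed are illustrative defaults, not MeSS settings.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    draws = [rng.lognormvariate(mu, sigma) for _ in taxa]
    total = sum(draws)
    return {taxon: draw / total for taxon, draw in zip(taxa, draws)}


def even_abundances(taxa):
    """Give every taxon the same relative abundance."""
    return {taxon: 1 / len(taxa) for taxon in taxa}


if __name__ == "__main__":
    taxa = [487, 727, 729]
    print(lognormal_abundances(taxa))
    print(even_abundances(taxa))
```

In the flowchart, abundances like these are then converted to a coverage depth per genome before being passed to the read simulator.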
- Conda (Miniforge)
conda create -n mess mess
- Docker
docker pull ghcr.io/metagenlab/mess:latest
- From source
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS
Let's simulate two metagenomic samples with the following taxa and read counts in samples.tsv:
sample | taxon | reads |
---|---|---|
sample1 | 487 | 174840 |
sample1 | 727 | 90679 |
sample1 | 729 | 13129 |
sample2 | 28132 | 147863 |
sample2 | 199 | 147545 |
sample2 | 729 | 131300 |
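If you prefer to generate the input file programmatically, the table above can be written out with a short script. This is a sketch that assumes samples.tsv is a plain tab-separated file whose header matches the column names shown (sample, taxon, reads); the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the table above: (sample, taxon, reads).
ROWS = [
    ("sample1", 487, 174840),
    ("sample1", 727, 90679),
    ("sample1", 729, 13129),
    ("sample2", 28132, 147863),
    ("sample2", 199, 147545),
    ("sample2", 729, 131300),
]


def format_samples_tsv(rows):
    """Return the table as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "taxon", "reads"])
    writer.writerows(rows)
    return buffer.getvalue()


def reads_per_sample(rows):
    """Sum the requested read counts for each sample."""
    totals = defaultdict(int)
    for sample, _taxon, reads in rows:
        totals[sample] += reads
    return dict(totals)


if __name__ == "__main__":
    with open("samples.tsv", "w", newline="") as handle:
        handle.write(format_samples_tsv(ROWS))
    print(reads_per_sample(ROWS))
```

The per-sample totals are a quick sanity check on the requested sequencing effort before launching the pipeline.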
mess run -i samples.tsv
Important
Apptainer is the default and recommended dependency deployment method for maximum reproducibility.
To use conda instead, specify --sdm conda.
- Downloaded genomes in mess_out/assembly_finder/download
┣ 📂GCF_000144405.1
┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┣ 📂GCF_001298465.1
┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┣ 📂GCF_016127215.1
┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┣ 📂GCF_020736045.1
┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┗ 📂GCF_022869645.1
  ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
- Simulated reads in mess_out/fastq
┣ 📜sample1_R1.fq.gz
┣ 📜sample1_R2.fq.gz
┣ 📜sample2_R1.fq.gz
┗ 📜sample2_R2.fq.gz
Tip
By default, mess outputs paired Illumina reads with the Hiseq25k error profile. Other outputs and error profiles are described here and here
Using samples.tsv, mess runs in under 2 min while using around 1.8 GB of physical RAM:
task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | fe/03c2bc | 62286 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:41:15.820 | 1m 50s | 1m 50s | 111.5% | 1.8 GB | 9 GB | 3.5 GB | 2.4 GB |
1 | ff/0d03b1 | 73355 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:55:12.903 | 1m 52s | 1m 52s | 112.6% | 1.7 GB | 8.8 GB | 3.5 GB | 2.4 GB |
1 | 07/d352bf | 83576 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:57:30.600 | 1m 50s | 1m 50s | 113.2% | 1.7 GB | 8.9 GB | 3.5 GB | 2.4 GB |
Note
Resource usage was measured over three runs with one CPU (using Nextflow, excluding dependency deployment time).
More details in the resource usage documentation
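For reference, the three runs in the trace table can be averaged with a few lines of Python. The values below are copied by hand from the duration and peak_rss columns; the `RUNS` structure and the `summarize` helper are illustrative names, not part of MeSS.

```python
from statistics import mean

# duration and peak_rss copied from the three trace rows above.
RUNS = [
    {"duration_s": 1 * 60 + 50, "peak_rss_gb": 1.8},
    {"duration_s": 1 * 60 + 52, "peak_rss_gb": 1.7},
    {"duration_s": 1 * 60 + 50, "peak_rss_gb": 1.7},
]


def summarize(runs):
    """Return the mean wall-clock duration (s) and mean peak RSS (GB)."""
    return (
        mean(run["duration_s"] for run in runs),
        mean(run["peak_rss_gb"] for run in runs),
    )


if __name__ == "__main__":
    duration, rss = summarize(RUNS)
    print(f"mean duration: {duration:.1f} s, mean peak RSS: {rss:.2f} GB")
```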
Using phage.tsv:
sample | taxon | cov_sim |
---|---|---|
phage | 347329 | 200 |
- Illumina
mess run -i phage.tsv --tech illumina -o mess_out/illumina
seqkit stats --all -T -b mess_out/illumina/fastq/*
file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
---|---|---|---|---|---|---|---|
phage_R1.fq.gz | 44000 | 6600000 | 150.0 | 150 | 98.01 | 91.67 | 27.81 |
phage_R2.fq.gz | 44000 | 6600000 | 150.0 | 150 | 97.31 | 89.65 | 26.52 |
- Nanopore
mess run -i phage.tsv --tech nanopore -o mess_out/nanopore
seqkit stats --all -T -b mess_out/nanopore/fastq/*
file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
---|---|---|---|---|---|---|---|
phage.fq.gz | 1486 | 13203006 | 8884.9 | 12329 | 73.99 | 62.65 | 13.60 |
- PacBio HiFi
mess run -i phage.tsv -o mess_out/pacbio --tech pacbio --error hifi
seqkit stats --all -T -b mess_out/pacbio/fastq/*
file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
---|---|---|---|---|---|---|---|
phage.fq.gz | 1430 | 12588621 | 8803.2 | 12666 | 99.92 | 99.78 | 40.51 |
Note
We use pbsim3 to simulate multi-pass CLR reads, which are then converted to HiFi reads with ccs.
PacBio HiFi read simulations usually take longer than simulations with other error profiles.
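The seqkit tables above can be cross-checked against the requested coverage with some simple arithmetic. The sketch below assumes that cov_sim is the mean depth of coverage, i.e. total simulated bases divided by genome length; the function names are hypothetical and the genome length is inferred from the tables rather than looked up.

```python
COV_SIM = 200  # cov_sim value from phage.tsv

# sum_len values (total bases) copied from the seqkit tables above.
TOTAL_BASES = {
    "illumina": 6_600_000 + 6_600_000,  # R1 + R2
    "nanopore": 13_203_006,
    "pacbio_hifi": 12_588_621,
}


def implied_genome_length(total_bases, coverage):
    """Genome length implied by coverage = total_bases / genome_length."""
    return total_bases / coverage


def read_pairs_for_coverage(genome_length, coverage, read_length):
    """Read pairs needed when each pair contributes 2 * read_length bases."""
    return round(coverage * genome_length / (2 * read_length))


if __name__ == "__main__":
    for tech, bases in TOTAL_BASES.items():
        print(tech, round(implied_genome_length(bases, COV_SIM)))
    print(read_pairs_for_coverage(66_000, COV_SIM, 150))
```

Under this assumption, the Illumina and Nanopore outputs both imply a genome of about 66 kb, and 200x coverage of a 66 kb genome with 2 x 150 bp reads corresponds to the 44,000 read pairs reported for Illumina. The PacBio HiFi total implies a slightly lower figure (about 63 kb).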
Inspired by readSimulator's approach, mess can shuffle genome start points to get circular genome assemblies.
Warning
All contigs in the FASTA file will be circularised.
- Linear (default, --rotate 1)
mess run -i phage.tsv -o mess_out/linear
- Circular (--rotate 3)
mess run -i phage.tsv --rotate 3 -o mess_out/circular
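The general idea of shuffling start points can be illustrated with a small sketch. How MeSS applies --rotate internally is not described here, so the code below is only an assumption-laden illustration: `rotate` and `random_rotations` are hypothetical helpers that re-linearise a circular sequence at a different position, so that reads simulated from the rotated copy can span the original start/end junction.

```python
import random


def rotate(sequence, start):
    """Return the circular sequence re-linearised at position `start`."""
    start %= len(sequence)
    return sequence[start:] + sequence[:start]


def random_rotations(sequence, n, seed=0):
    """Yield `n` copies of the sequence, each cut at a random start point."""
    rng = random.Random(seed)  # seeded so the rotations are reproducible
    for _ in range(n):
        yield rotate(sequence, rng.randrange(len(sequence)))


if __name__ == "__main__":
    genome = "ACGTTGCA"
    print(rotate(genome, 3))
    print(list(random_rotations(genome, 3)))
```

In the rotated copy, the bases that were at the end and the start of the linear record become adjacent, which a purely linear simulation never produces.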
All command-line options are described here