idfarbanecha edited this page May 3, 2020 · 7 revisions

Welcome to the MeSS wiki!

Workflow rules overview:

The Metagenomic Sequence Simulator (MeSS) is a pipeline designed to generate metagenomic mock communities from a set of genomes with user-defined read proportions. MeSS can be broken down into two main steps:

1) Assemblies Download

The first step of the workflow relies on a set of rules from Assembly_finder. Using user-supplied taxonomy identifiers or scientific names, Assembly_finder searches all available NCBI assemblies according to multiple criteria such as RefSeq category, assembly status, contig count and GenBank release date. The selected assemblies are then downloaded.

This part of the workflow is carried out by the rules described in detail below.

a) Generate assembly tables

Based on names from the input table, the rule's script searches for the corresponding Taxonomy Identifiers (TaxIDs) from NCBI's assembly database. TaxIDs are used to search for assemblies without the ambiguity of scientific names, which can match several different entries. For example, inputting "rhinovirus" matches several species of the picornavirus family. It is therefore preferable for the user to input a TaxID for the desired taxonomic rank.

For each TaxID, the script then searches for all existing assemblies and stores the assembly information in a table, as shown below.

b) Filter assembly tables

After finding all possible assemblies, the next step is to select the best hits according to criteria specified in the config file. The user can choose to select genomes from GenBank or RefSeq, to keep only complete, reference or representative assemblies, and to exclude assemblies derived from metagenomes.

In addition, a filtering feature was added to select one representative genome per taxonomic rank. For example, if the goal is to find all assemblies for the genus "Pseudomonas" and select one assembly per species, the user can add 'species' to the Rank_to_filter_by variable in the config file.
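The "one representative per rank" selection can be sketched as follows. This is a minimal illustration, not Assembly_finder's actual code: the column names (`species`, `refseq_category`, `assembly_level`, `contig_count`), the ranking orders and the example rows are all assumptions made for the example.

```python
# Hypothetical ranking orders (lower is better); not Assembly_finder's actual scoring.
CATEGORY_ORDER = {"reference genome": 0, "representative genome": 1, "na": 2}
LEVEL_ORDER = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def sort_key(row):
    """Rank an assembly by RefSeq category, then assembly level, then contig count."""
    return (
        CATEGORY_ORDER.get(row["refseq_category"], len(CATEGORY_ORDER)),
        LEVEL_ORDER.get(row["assembly_level"], len(LEVEL_ORDER)),
        row["contig_count"],
    )


def best_per_rank(rows, rank="species"):
    """Keep the best-ranked assembly for each distinct value of the rank column."""
    best = {}
    for row in rows:
        key = row[rank]
        if key not in best or sort_key(row) < sort_key(best[key]):
            best[key] = row
    return list(best.values())


# Made-up rows, for illustration only.
example_rows = [
    {"assembly": "A1", "species": "Pseudomonas aeruginosa",
     "refseq_category": "na", "assembly_level": "Contig", "contig_count": 120},
    {"assembly": "A2", "species": "Pseudomonas aeruginosa",
     "refseq_category": "reference genome", "assembly_level": "Complete Genome",
     "contig_count": 1},
    {"assembly": "A3", "species": "Pseudomonas putida",
     "refseq_category": "na", "assembly_level": "Scaffold", "contig_count": 40},
]
selected = best_per_rank(example_rows, rank="species")
```

Here `selected` keeps one row per species, preferring the reference, complete assembly when several candidates exist.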

c) Combine assembly tables and download

The previous rules are executed for each line of the input table, and all resulting tables are then merged into a single table listing the assemblies to download via their GenBank FTP links.

2) Read simulation

MeSS makes use of art_illumina to generate sequencing reads with error profiles corresponding to a sequencing technology of choice. art_illumina generates a set of reads for each fasta header in the assembly file. To avoid generating a separate set of reads for each contig in a fragmented assembly, all contigs are merged and separated by 1000 N nucleotides.

By default, MeSS distributes reads evenly within each superkingdom. For example, if 9 bacterial and 2 viral species are queried, each bacterial species represents 1% and each viral species 0.5% of the total number of reads. The pipeline also makes it possible to modify relative read abundances by setting read percentages for human, virus, bacteria and non-human eukaryotes. To generate the final metagenome, scripts concatenate all reads into one fastq file while shuffling the read order to avoid structure in the data.
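The concatenate-and-shuffle step can be illustrated with a short sketch. It is not the MeSS script itself: the function names are hypothetical, and it assumes single-end fastq text with exactly four lines per record. Paired-end files would need the two mates of each pair to be shuffled together.

```python
import random


def read_fastq_records(text):
    """Split fastq text into 4-line records (header, sequence, '+', qualities)."""
    lines = text.splitlines()
    usable = len(lines) - len(lines) % 4  # ignore a trailing incomplete record
    return ["\n".join(lines[i:i + 4]) + "\n" for i in range(0, usable, 4)]


def shuffle_fastq(texts, seed=None):
    """Concatenate the records of several fastq texts and shuffle their order."""
    records = [rec for text in texts for rec in read_fastq_records(text)]
    random.Random(seed).shuffle(records)  # a fixed seed makes the order reproducible
    return "".join(records)


# Two tiny per-genome fastq texts, for illustration only.
genome_a = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
genome_b = "@r3\nGGGG\n+\nIIII\n"
metagenome = shuffle_fastq([genome_a, genome_b], seed=1)
```

Shuffling whole records rather than lines keeps each read's header, sequence and qualities together.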

Finally, to visualize the metagenome's contents, Krona charts representing read proportions can be generated.

Each rule is described in more detail below.

a) Get read proportions

The user can specify read percentages directly in the input table. If no read percentage is given for a TaxID, the rule's script generates even read proportions within each superkingdom, as explained above. If the user requests more than one genome for a TaxID, the script divides that TaxID's read percentage by the number of requested assemblies. At the end of the rule, a table is generated containing the assembly information and the corresponding read proportion to simulate.
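The even split can be checked with a small sketch. The function names are hypothetical, and the superkingdom totals (9% bacteria, 1% virus) are assumptions chosen so that the result matches the 9 bacterial and 2 viral species example given earlier; they are not confirmed MeSS defaults.

```python
from collections import Counter


def even_proportions(genomes, superkingdom_percent):
    """Split each superkingdom's read percentage evenly among its genomes.

    genomes: list of (name, superkingdom) tuples.
    superkingdom_percent: dict mapping superkingdom to its total read percentage.
    """
    counts = Counter(superkingdom for _, superkingdom in genomes)
    return {
        name: superkingdom_percent[superkingdom] / counts[superkingdom]
        for name, superkingdom in genomes
    }


def split_across_assemblies(taxid_percent, n_assemblies):
    """Divide one TaxID's read percentage by the number of assemblies requested."""
    return [taxid_percent / n_assemblies] * n_assemblies


genomes = [(f"bacterium_{i}", "bacteria") for i in range(1, 10)]
genomes += [(f"virus_{i}", "virus") for i in range(1, 3)]
# Assumed superkingdom totals, for illustration only.
proportions = even_proportions(genomes, {"bacteria": 9.0, "virus": 1.0})
```

With these assumed totals, each bacterial genome receives 1% and each viral genome 0.5% of the reads.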

b) Create read counts table

After assigning each genome a read percentage, the next step is to calculate the number of reads to simulate with art_illumina. For this, a script multiplies each percentage by the total number of reads to simulate, as set in the config file. In addition, the workflow lets the user generate multiple replicates of the same simulation run with slight variation in read numbers. The read number for each replicate is drawn from a normal distribution whose mean is the read number calculated previously and whose standard deviation is specified in the config file. At the end of this rule, a table is generated containing the assembly information and the corresponding read numbers.
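The calculation described above might look like the following sketch. The function and parameter names are hypothetical, and rounding to whole reads and clipping at zero are assumptions of this example rather than documented MeSS behaviour.

```python
import random


def replicate_read_counts(percent, total_reads, n_replicates, sd, seed=None):
    """Draw one read count per replicate from a normal distribution.

    The mean is percent/100 * total_reads; sd is the standard deviation in reads.
    """
    rng = random.Random(seed)  # a fixed seed makes the replicates reproducible
    mean = percent / 100 * total_reads
    # Round to whole reads and never return a negative count.
    return [max(0, round(rng.gauss(mean, sd))) for _ in range(n_replicates)]


# A genome at 1% of 1,000,000 reads, three replicates, sd of 100 reads.
counts = replicate_read_counts(1.0, 1_000_000, n_replicates=3, sd=100, seed=42)
```

Each value in `counts` scatters around 10,000 reads, the mean implied by the 1% proportion.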

c) Decompress assemblies and merge contigs

Gzipped assemblies are downloaded and then decompressed. Depending on the assembly quality, the genome can be fragmented, which affects the read simulation process: art_illumina simulates reads individually for each header, and thus for each contig, present in the fasta file. To avoid this, a script merges all contigs, separated by 1000 N nucleotides, under a single header.
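A minimal sketch of such a merge is given below. It is not the MeSS script: the function name and the `header` and `spacer_len` parameters are hypothetical, and the input is assumed to be already decompressed fasta text.

```python
def merge_contigs(fasta_text, header="merged", spacer_len=1000):
    """Join all contigs of a fasta text into one record, separated by runs of N."""
    contigs = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the contig collected so far.
            if current:
                contigs.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        contigs.append("".join(current))
    spacer = "N" * spacer_len
    return f">{header}\n{spacer.join(contigs)}\n"


# Two short contigs joined with a 3 N spacer, to keep the example readable.
merged = merge_contigs(">contig_1\nAC\nGT\n>contig_2\nTT\n", header="genome", spacer_len=3)
```

With the default `spacer_len=1000`, the output has a single header, so art_illumina treats the whole assembly as one sequence.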

d) Generate reads

The rule generates either paired-end or single-end reads, depending on the parameter specified by the user in the config file. It takes as input the merged-contig fasta files and the read counts table generated previously, and outputs fastq files. The art_illumina command line requires specific parameters, such as the sequencing technology error profile, the read length, and the mean and standard deviation of the fragment length, all of which can be modified in the config file.
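As an illustration, the art_illumina call could be assembled as in the sketch below. The function name and default values are hypothetical and do not come from the MeSS config. The flags (`-ss`, `-i`, `-l`, `-c`, `-na`, `-o`, `-p`, `-m`, `-s`) are those listed in art_illumina's usage message as understood here, and should be checked against `art_illumina --help` for the installed version.

```python
def art_illumina_command(fasta, out_prefix, read_count, paired=True,
                         seq_system="HS25", read_len=150,
                         mean_frag=200, sd_frag=10):
    """Build an art_illumina argument list (values here are illustrative defaults)."""
    cmd = [
        "art_illumina",
        "-ss", seq_system,       # sequencing system error profile
        "-i", fasta,             # merged-contig fasta file
        "-l", str(read_len),     # read length
        "-c", str(read_count),   # number of reads (or read pairs) per sequence
        "-na",                   # do not write the alignment file
        "-o", out_prefix,        # output file prefix
    ]
    if paired:
        # Fragment length mean and standard deviation apply to paired-end runs.
        cmd += ["-p", "-m", str(mean_frag), "-s", str(sd_frag)]
    return cmd


paired_cmd = art_illumina_command("genome_merged.fasta", "genome_R", 10000)
single_cmd = art_illumina_command("genome_merged.fasta", "genome", 10000, paired=False)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where art_illumina is installed. Because `-c` counts reads per fasta sequence, the single-header merged fasta keeps the count equal to the value from the read counts table.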