Skip to content
mSWEEP High-resolution sweep metagenomics using fast probabilistic inference
C++ CMake
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.

Files

Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
config add checks that the input files/directories exist, and verify themist… Mar 10, 2020
example
external/cxxio add checks that the input files/directories exist, and verify themist… Mar 10, 2020
include add missing include Mar 10, 2020
src
.gitignore
CMakeLists.txt
CMakeLists.txt.in
LICENSE Initial commit Jun 4, 2018
README.md add sort-output to themisto-pseudoalign flags (might work without, th… Mar 29, 2020

README.md

mSWEEP

Fast and accurate bacterial community composition estimation on strain level by using pseudoalignments and variational inference.

More about mSWEEP in the article High-resolution sweep metagenomics using fast probabilistic inference in Wellcome Open Research (awaiting peer review)

If you use our method, please cite us as Mäklin T, Kallonen T, David S et al. High-resolution sweep metagenomics using fast probabilistic inference [version 1; peer review: awaiting peer review]. Wellcome Open Res 2020, 5:14 (https://doi.org/10.12688/wellcomeopenres.15639.1)

Installation

mSWEEP can be obtained either in the form of a precompiled binary

In addition to mSWEEP, you will need to install either Themisto (recommended) or kallisto for pseudoalignment.

Compiling from source

Requirements

  • C++11 compliant compiler.
  • cmake (v3.0 or newer)

Optional

  • Compiler with OpenMP support.

If your compiler does not support OpenMP, mSWEEP can only be run in single-threaded mode. The prebuilt binaries are compiled with OpenMP support and allow parallellization.

Compilation

Clone the mSWEEP repository

git clone https://github.com/PROBIC/mSWEEP.git

enter the directory and run

> mkdir build
> cd build
> cmake ..
> make
  • This will compile the mSWEEP executable in build/bin/mSWEEP.

Compilation tips for improving performance

  1. If you intend to run mSWEEP on the machine used in compiling the source code, you might want to add the '-march=native -mtune=native' flags if compiling with GCC by running
> cmake -DCMAKE_CXX_FLAGS="-march=native -mtune=native" -DCMAKE_C_FLAGS="-march=native -mtune=native" ..
> make

Using these options significantly reduces the runtime of mSWEEP in some environments (e.g. most HPC setups).

  1. If the Intel C++ compiler is available in your environment, you might want to use that to compile mSWEEP — especially if running on Intel hardware. The compiler can be specified by running
> cmake -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc ..
> make

Usage

Reference data

A reference sequence collection and a grouping into clonal complexes/sequence types is available in FigShare for the following species

  • Campylobacter jejuni
  • Escherichia coli
  • Klebsiella pneumoniae
  • Staphylococcus epidermidis

mSWEEP supports using a custom reference database. A typical workflow for constructing the custom database might proceed as follows

  1. Gather assembled sequences for the species of interest. Use taxonomic profiling tools like MetaFlow or MetaPhlAn2 to identify the species in your sample if you are unsure what sequences to include, or suspect that our samples contain sequencing data from contaminating species.

  2. a) Provide a grouping for the assemblies (e.g. sequence types, clonal complexes, or the output of some clustering algorithm.). If your species have an established multi-locus sequence typing scheme, the mlst software is a good candidate for producing the grouping.

  3. b) (Optional) Filter out reference sequences that cannot be reliably assigned to a group (eg. the sequence type cannot be determined) and perform other appropriate quality control measures.

  4. If an assembly contains multiple contigs, merge them into a single contig. Do this for all assemblies.

  5. Index the database with your pseudoalignment tool of choice and proceed with running the mSWEEP pipeline.

Toy data (Themisto)

(Recommended) Enter the toy data directory (example/) and run the build_index and pseudoalign commands from Themisto. The --mem-megas option controls the maximum amount of RAM used in constructing the index (if the limit is exceeded, Themisto will use disk storage), and --n-threads the number of threads to use.

## Build the Themisto index without clustering
mkdir themisto_index
mkdir tmp

## Index using at most 2048 megabytes of memory and 2 threads.
build_index --k 31 --input-file example.fasta --auto-colors --index-dir themisto_index --temp-dir tmp --mem-megas 2048 --n-threads 2

## Pseudoalign reads using 2 threads
pseudoalign --query-file 215_1.fastq.gz --outfile 215_1_alignment.txt --rc --index-dir themisto_index --temp-dir tmp --n-threads 2 --sort-output
pseudoalign --query-file 215_2.fastq.gz --outfile 215_2_alignment.txt --rc --index-dir themisto_index --temp-dir tmp --n-threads 2 --sort-output

Next, run mSWEEP on the alignment files (the -t option controls the number of threads used in the estimation.)

## Run mSWEEP with 2 threads
mSWEEP --themisto-1 215_1_alignment.txt --themisto-2 215_2_alignment.txt -i clustering.txt -t 2 -o 215 --themisto-index themisto_index

This will write the relateve abundances to the "215_abundances.txt" file in the folder mSWEEP was run in. If the '-o' option is not specified, the abundances will print to cout.

Note that supplying the --themisto-index is optional but highly recommended (running mSWEEP without this option will not validate the input 'clustering.txt' and may cause undefined behaviour).

Experimental usage

Bootstrapping confidence intervals

mSWEEP can be used to produce confidence intervals for the abundance estimates by bootstrapping the pseudoalignment counts and rerunning the abundance estimation a number of times. This can be done automatically by adding the '--iters' option to running mSWEEP:

> mSWEEP --themisto-1 215_1_alignment_1.txt --themisto-2 215_2_alignment_2.txt -i cluster_indicators.txt -t 2 -o 215 --iters 100

The bootstrapped abundance estimates will be appended to the output file '215_abundances.txt' as new columns and can be used to calculate confidence intervals for each of the abundance estimates.

Bootstrapping in mSWEEP by default resamples from all input pseudoalignments. If you wish to resample fewer pseudoalignments, supply the '--bootstrap-count ' option with the number of resamples to perform

> mSWEEP --themisto-1 215_1_alignment_1.txt --themisto-2
215_2_alignment_2.txt -i cluster_indicators.txt -t 2 -o 215 --iters 100 --bootstrap-count 1000

This will result in wider confidence intervals because the bootstrap iterations have less data available to them.

It is also possible to set the initial random seed with the '--seed ' option, which enables replicating the bootstrap results across multiple runs.

Embedding colors in the Themisto index

Alternatively, it is possible to embed the grouping information in Themisto's index, effectively treating any pseudoalignment in the group as if the read aligned to all sequences in the group. This approach will in most cases produce different results than the recommended one, but will reduce the RAM, CPU, and disk space requirements for running Themisto and mSWEEP.

## Build grouped Themisto index
mkdir themisto_grouped_index
mkdir tmp
build_index --k 31 --input-file example.fasta --color-file clustering.txt --index-dir themisto_grouped_index --temp-dir tmp

## Pseudoalign reads
pseudoalign --query-file 215_1.fastq.gz --outfile 215_1_alignment.txt --rc --index-dir themisto_grouped_index --temp-dir tmp --sort-output
pseudoalign --query-file 215_2.fastq.gz --outfile 215_2_alignment.txt --rc --index-dir themisto_grouped_index --temp-dir tmp --sort-output

## Extract unique cluster indicators from the clustering.txt file
awk '!seen[$0]++' clustering.txt > unique_clusters.txt

## Run mSWEEP with 2 threads
mSWEEP --themisto-1 215_1_alignment.txt --themisto-2 215_2_alignment.txt -i unique_clusters.txt -t 2

These example commands will construct the Themisto index with the grouping indicators from 'clustering.txt' embedded in it. This means that if you wish to change the grouping indicators, the index must be reconstructed.

Toy data (kallisto)

kallisto index -i example_index example.fasta

kallisto pseudo -i example_index -o kallisto_out_folder 215_1.fastq.gz 215_2.fastq.gz
mSWEEP -f kallisto_out_folder -i clustering.txt

You should see that roughly 90% of the reads are assigned to group "clust2".

General pipeline

Preprocessing

  • Obtain a set of reference sequences. (See the steps under Usage -> Reference data)
  • Index the reference sequences for pseudoalignment with Themisto:

build_index --k 31 --input-file reference_sequences.fasta --auto-colors --index-dir themisto_index --temp-dir tmp

  • ... or with kallisto

kallisto pseudo -i reference_kmi reference_sequences.fasta

  • Define some grouping for the reference and save the grouping in a text file where each line contains the identifier of the grouping the corresponding reference sequence belongs to. For example with four sequences and two groups:
cluster1
cluster2
cluster2
cluster1

The grouping identifiers must be in the same order as their corresponding sequences appear in the reference file. Alternatively, you can use supply the reference sequences and a table containing the groups to reorder the indicators.

Reordering identifiers

If your grouping identifiers are not in the same order as in the fasta file, you mSWEEP can reorder them based on the fasta file. In this case, mSWEEP requires as input a tab-separated table where the first column contains the name of the sequence as it appears in the fasta file and the second column contains the group:

seq_156	group2
seq_157	group1
seq_197	group4
seq_20	group3
seq_285	group4
seq_43	group2
seq_44	group1
seq_5	group3

and the fasta file. If the above table is contained in the 'groups_list.tsv" file, running

mSWEEP --themisto-1 215_1_alignment.txt --themisto-2 215_2_alignment.txt -t 2 --fasta example.fasta --groups-list groups_list.tsv

will instruct mSWEEP to extract the identifiers from the table in the correct order.

It is also possible to use a table separated by a character different than tab, e. g. ',', by specifying the separator with the '--groups-delimiter' argument:

mSWEEP --themisto-1 215_1_alignment.txt --themisto-2 215_2_alignment.txt -t 2 --fasta example.fasta --groups-list groups_list.tsv --groups-delimiter ,

Should you require the reordered identifiers written to a file (e. g. for running mSWEEP on multiple input files), please refer to the 'matchfasta' utility shipped alongside mSWEEP which performs the reordering.

Analysing reads (with Themisto)

  • Pseudomap paired-end reads:
mkdir tmp
pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile reads_1_out.txt --temp-dir tmp --rc --sort-output
pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile reads_2_out.txt --temp-dir tmp --rc --sort-output
  • Use mSWEEP to estimate cluster abundances from a single file with 2 threads:
mSWEEP --themisto-1 215_1_alignment.txt --themisto-2 215_2_alignment.txt -i clustering.txt -t 2

Themisto can also utilize multiple threads in the mapping phase. You can run Themisto on multiple threads by specifying the number of threads with the '--n-threads 8' flag.

Analysing reads (with kallisto)

  • Pseudomap paired-end reads:

kallisto pseudo -i reference_kmi -o kallisto_out reads_1.fastq reads_2.fastq

kallisto pseudo -i reference_kmi -o kallisto_batch_out -b kallisto_batch.txt

  • Use mSWEEP to estimate cluster abundances from a single file:

mSWEEP -f kallisto_out_folder -i cluster_indicators.txt -o abundances.txt -t 2

  • Or from a batch file and output to abundances/ folder:
  • (NOT RECOMMENDED anymore as of 18 December 2019: kallisto's batch mode can be a bit unstable)

mSWEEP -b kallisto_batch_out -i cluster_indicators.txt -o abundances

kallisto can utilize multiple threads in the mapping phase. You can run kallisto on multiple threads by specifying the number of threads with the '-t' flag. When estimating samples submitted in a kallisto batch, mSWEEP can estimate multiple samples in parallel by specifying the number of threads with the '-t' flag.

Running mSWEEP

mSWEEP accepts the following flags:

	--themisto-1 <themistoPseudoalignment1>
	Pseudoalignment results from Themisto for the 1st strand of paired-end reads.
	--themisto-2 <themistoPseudoalignment2>
	Pseudoalignment results from Themisto for the 2nd strand of paired-end reads.

	-f <pseudomappingFile>
	Pseudoalignment output file location from kallisto. Can't be used when -b is specified.
	-b <pseudomappingBatch>
	The kallisto batch matrix file location. Can't be used when -f is specified.

	-i <clusterIndicators>
	Group identifiers file. Must be supplied.
	-o <outputFile>
	Output file (folder when estimating from a batch) to write results in.
	-t <nrThreads>
	How many threads to use. (default: 1)

	--themisto-mode <PairedEndMergeMode>
	How to merge Themisto pseudoalignments for paired-end reads	(intersection or union, default: intersection).
	--themisto-index <ThemistoIndex>
	Path to the Themisto index the pseudoalignment was performed against (optional).

	--fasta <ReferenceSequences>
	Path to the reference sequences the pseudoalignment index was constructed from (optional)
	--groups-list <groupIndicatorsList>
	Table containing names of the reference sequences (1st column) and their group assignments (2nd column) (optional)
	--groups-delimiter <groupIndicatorsListDelimiter>
	Delimiter character for the --groups option (optional, default: tab)

	--iters <nrIterations>
	Number of times to rerun estimation with bootstrapped alignments (default: 1)
	--bootstrap-count <nrBootstrapCount>
	How many reads to resample when bootstrapping (integer, default: all)
	--seed <BootstrapSeed>
	Seed for the random generator used in bootstrapping (default: random)

	--write-probs
	If specified, write the read equivalence class probabilities in a .csv matrix
	--print-probs
	Print the equivalence class probabilities rather than writing when using --write-probs
	--gzip-probs
	Gzip the .csv matrix output from --write-probs
	--help
	Print this message.

	ELBO optimization and modeling (these seldom need to be changed)
	--tol <tolerance>
	Optimization has converged when the bound changes less than the given tolerance.
	--max-iters
	Maximum number of iterations to run the gradient optimizer.
	-q <meanFraction>
	Fraction of the sequences in a group that the mean is set to. (default: 0.65)
	-e <dispersionTerm>
	Calibration term in the likelihood function. (default: 0.01)

License

The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.

You can’t perform that action at this time.