This pipeline runs SNV calling. It produces an SNV matrix and can optionally use this output to:
- infer phylogenetic trees and ancestral strains, and/or
- track the abundance of therapeutic microbes in paired samples
Important information

The pipeline was used to generate the phylogenetic trees of mother-infant strains from an analysis run in 2019 on the Danish supercomputer Computerome (version 1). It will have to be modified to work on other HPC systems, including Computerome 2. The pipeline has not been updated or changed since 2019 and is in the exact form used to run the analysis at that point. This means it cannot be applied directly to another dataset: you will likely have to modify your input or generate required files such as BED files. I therefore think it is mainly helpful if you want to design your own strain pipeline outside publicly available tools.

In later pipelines I have set up for strain analysis of microbiome samples, I saved the output from BCFtools as a text file and then merged the SNV files in Python. In this script I use BCFtools merge; however, it is not able to merge hundreds of files at a time, which is why I merge two at a time over multiple rounds. This is neither time nor memory efficient, so I would not recommend it as a strategy. I have also included scripts for other types of strain analysis, such as ancestor sequence reconstruction, which may come in handy for some.
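As a sketch of the alternative merge strategy mentioned above (per-sample SNV calls saved as text and merged in Python rather than via repeated rounds of BCFtools merge), the helper below combines per-sample call lists into one site-by-sample matrix. The three-column call format (gene, position, allele) is a hypothetical example, not the pipeline's actual output format.

```python
import collections

def merge_snv_tables(tables):
    """Merge per-sample SNV calls into a site-by-sample matrix.

    tables: {sample_name: [(gene, pos, allele), ...]}
    Returns (sorted sample names, rows), where each row is
    ((gene, pos), [allele or "." per sample]).
    """
    samples = sorted(tables)
    # (gene, pos) -> {sample: allele}
    matrix = collections.defaultdict(dict)
    for sample, calls in tables.items():
        for gene, pos, allele in calls:
            matrix[(gene, pos)][sample] = allele
    # One row per site; "." marks samples with no call at that site
    rows = []
    for site in sorted(matrix):
        rows.append((site, [matrix[site].get(s, ".") for s in samples]))
    return samples, rows
```

Unlike pairwise BCFtools merge rounds, this scales to hundreds of samples in a single pass, at the cost of holding the matrix in memory.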
The input to the pipeline is sorted BAM files. These can be generated by mapping high-quality paired sequencing reads to a collection of sequences in FASTA format. The collection of sequences can be a gene catalog (used in this study) or a set of reference genomes, contigs, etc. The BAM files must be sorted by position.
The pipeline submits multiple jobs to Computerome and will need to be modified in order to run in another environment. Jobs are submitted using qsub and required software is loaded as modules.
Copy the folder with the pipeline to your project directory.
This file should contain space-delimited lines providing the sample name and the locations of the FASTQ files for forward and reverse reads.
The file should be called sampletable and should be placed in the parent directory. If you use another name or location, you can change the path in the snv_config file.
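A sampletable might look like this (sample names and paths are hypothetical placeholders):

```
SAMPLE001 /path/to/SAMPLE001_R1.fastq.gz /path/to/SAMPLE001_R2.fastq.gz
SAMPLE002 /path/to/SAMPLE002_R1.fastq.gz /path/to/SAMPLE002_R2.fastq.gz
```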
This file should contain a list with the names of the MGS for which the analysis should be run.
The file should be called mgslist (unless otherwise specified in the snv_config file). Use mgslist_Bifido.txt as a guide.
Make sure the file is in the parent directory:
mv snvcalling/mgslist_Bifido.txt .
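For reference, the mgslist file is simply one MGS name per line (the names below are hypothetical placeholders; see mgslist_Bifido.txt for the real naming):

```
MGS0001
MGS0002
```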
Copy the snv_config file template (snv_config) into your working directory and edit it
as necessary.
`cp snvcalling/snv_config .`
The config file contains instructions.
If the pipeline is run on computerome, it should be enough to modify the snv_config file in order to run the pipeline.
SNV calling requires that the reference sequence (e.g. the gene catalog) is indexed. This must be done before running the pipeline:
- Index the gene catalogue with samtools:

```
module load samtools/1.6
samtools faidx <GENE_DATABASE>
```

- Create a directory with BED format files, which specify the genes and regions for each metagenomic species (MGS) for which you wish to perform SNV calling and/or generate a phylogenetic tree. Even if you are only interested in tracking one species, it is still recommended to map the reads to the entire reference database.

The BED format is a tab-separated, one-line-per-feature format which specifies the chromosome (gene/contig), POS_START (0) and POS_STOP (length of the gene/contig). Coordinates are 0-based and half-open. The file must end with ".bed" or ".bed.gz". SNV calling is processed in the order of the regions specified in the BED file.
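One way to sketch generating these BED files is from the .fai index produced by samtools faidx, which lists each sequence name and its length. The gene-to-MGS mapping passed in here is a hypothetical assumption; your gene catalog will have its own way of assigning genes to MGS.

```python
import collections

def fai_to_beds(fai_lines, gene2mgs):
    """Build BED lines per MGS from a samtools faidx index (.fai).

    fai_lines: iterable of .fai lines (name<TAB>length<TAB>...)
    gene2mgs:  dict mapping gene name -> MGS name (hypothetical input)
    Returns {mgs: [bed_line, ...]} with 0-based, half-open coordinates.
    """
    beds = collections.defaultdict(list)
    for line in fai_lines:
        # The first two .fai columns are sequence name and length
        name, length = line.split("\t")[:2]
        mgs = gene2mgs.get(name)
        if mgs is not None:
            # BED columns: chromosome (gene), POS_START (0), POS_STOP (length)
            beds[mgs].append("%s\t0\t%s" % (name, length))
    return beds
```

Each value in the returned dict can then be written to bedfiles/<MGS>.bed, one line per gene.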
If you want to generate rooted trees, you will need to create, per MGS, a list containing the gene overlap with every other MGS. This only needs to be done once per gene catalog.
To create the overlap list per MGS, follow the three steps below:
The input is the reference database (gene catalog) translated into proteins.
- First, create a directory with a gene catalogue per MGS (remember to specify the directory in the snv_config file):
```python
#!/usr/bin/env python
from __future__ import print_function
from Bio import SeqIO
import os

# Parse the gene catalog (if the file is compressed you will need to decompress it first)
print("Parsing seq file")
record_dict = SeqIO.index("GENECATALOG.FA", "fasta")

# Read through the BED files and write one gene catalog FASTA per MGS
for filename in os.listdir("bedfiles/"):
    mgs = filename.split(".")[0]
    output_igc = open("genecatalog_per_mgs/" + mgs + ".fasta", 'w')
    with open("bedfiles/" + filename, 'r') as genelist:
        for line in genelist:
            # The first BED column is the gene name
            gene_name = line.split()[0]
            try:
                output_igc.write(record_dict.get_raw(gene_name).decode())
            except KeyError:
                # Gene missing from the catalog; report it and continue
                print(gene_name, mgs)
    output_igc.close()
```
- Second, run BLAST (with tabular output format, -outfmt 6) for all pairs of MGS:
```
# Create the output directory
mkdir -p mgs2mgs_blast

for mgs in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1`; do
    # Build a BLAST database for this MGS if it does not exist yet
    if [ ! -f mgs_genecatalogs/$mgs.fa.nhr ]; then
        makeblastdb -in mgs_genecatalogs/$mgs.fa -dbtype nucl
    fi
    # Collect the MGS that have not yet been blasted against this one
    # (grep -vx matches the whole name, so e.g. MGS1 does not also exclude MGS10)
    mgslist=()
    i=0
    for mgs2 in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1 | grep -vx "$mgs"`; do
        if [ ! -f mgs2mgs_blast/$mgs.$mgs2.out ]; then
            mgslist[i]="$mgs2"
            i=$((i+1))
        fi
    done
    # Blast the remaining MGS against this MGS, 28 jobs in parallel
    if [ $i -gt 0 ]; then
        echo "$mgs"
        parallel -j 28 "blastn -query mgs_genecatalogs/{}.fa -db mgs_genecatalogs/$mgs.fa -outfmt 6 -out mgs2mgs_blast/$mgs.{}.out" ::: ${mgslist[*]}
    fi
done
```
- Last, summarize the results and delete the temporary files. Remember to specify 'mgs2mgs_blast' in the snv_config file:
```
# Count BLAST hits between each pair of MGS
for mgs in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1`; do
    for mgs2 in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1 | grep -vx "$mgs"`; do
        echo -e "$mgs2\t"`wc -l < mgs2mgs_blast/$mgs.$mgs2.out` >> mgs2mgs_blast/$mgs.txt
    done
done
# Remove temporary BLAST files
rm mgs2mgs_blast/*.out
```
Run the script from the working directory (where the result files will go):
bash snvcalling/snv_pipeline.sh