Skip to content

pennyneve/SNVcalling

Repository files navigation

SNVcalling

This pipeline runs SNV calling. It produces a SNV matrix and optionally it uses this output to:

  • infer phyologentic trees and ancestor strains and/or
  • tracking the abundance of therapeutic microbes in paired samples

Important information The pipeline was used to generate the phylogenetic trees of mother-infant strains displayed in - an analysis that was run in 2019 on the Danish supercomputer, computerome (version 1). It will have to be modified to work on other HPC. This also includes computerome2. The pipeline has therefore not been updated or changed since 2019 and is in the format that was used to run the analysis at this point. This means that the pipeline cannot be directly applied on another dataset without likely having to modify your input or generate required files such as bedfiles. I therefore think this is mainly helpful if you want to design your own strain pipeline outside publicly available tools. In later pipelines I have set up for strain analysis of microbiome samples I saved the output from BCFtools as a text file and then merged SNV files using python. In this script I am using BCFtools merge, however it is not able to merge hundreds of files at a time, which is why I do it two at a time and over multiple rounds. This is not very time or memory efficient, so I would not recommend to use it as a strategy. I have also included script for other type of strain analysis such as ancestor sequence reconstruction, which may come in handy for some.

The input to the pipeline is sorted BAM files. These can be generated by mapping high-quality paired sequencing reads to a collections of sequences in fasta format. The collection of sequences can be a gene catalog (used in this study) or a set of reference genomes, contigs, etc. The BAM file must be sorted after position.

The pipeline is structured in a similar way and submits multiple jobs to Computerome and will need to be mofified in order to run in another environment. Jobs are submitted using qsub and required software is loaded as modules.

Instructions:

1. Copy the pipeline scripts to Computerome

Copy the folder with the pipeline to your project directory.

2. Create a sample table

This file should contain space-delimited lines providing samplename and location of FASTQ files for forward and reverse reads.

The file should be called sampletable and should be placed in the parent directory. If you use a nother name or location you can change the path in the snv_config file.

3. Create a mgs list

This file should contain a list with the names of the mgs for which the analysis should be run for.

The file should be called mgslist (unless otherwise specified in the snv_config file). Use mgslist_Bifido.txt as a guide.

Make sure the file is in the parent directory:

mv snvcalling/mgslist_Bifido.txt .

4. Edit the configuration file

Copy the snv_config file template (snv_config) into your working directory and edit it as necessary.

`cp snvcalling/snv_config .`

The config file contains instructions.

If the pipeline is run on computerome, it should be enough to modify the snv_config file in order to run the pipeline.

5. Gene/MGS Catalogue Requirements

SNV calling require that the reference sequence (e.g. gene catalog) is indexed. This must be done before running the pipeline:

  1. Index the gene catalogue with samtools

    module load samtools/1.6 samtools faidx <GENE_DATABASE>

  2. Create a directory with BED format files, which specify the genes and region for each metagenomic speices (MGS) or region for which you wish to perform snv calling and/or generate a phylogenetic tree. Even if you are only interested in tracking one species, it is still recommended that you map the reads to the entire reference database.

The BED format is a tab-seperated one-line per feature format, which specify chromosome (gene/contig) and POS_START (0) and POST_STOP (length of gene/contig). Coordinates are 0-based and half-open. The file must end with ".bed" or ".bed.gz". The SNV calling is prossesed in the order of the regions specified in the bed-file.

6. Root Requirements

If you want to generate trees with roots, you will need to create a list per mgs which contains the gene overlap between two mgs. This should be done once per gene catalog.

To create a list with overlap per mgs, follow the three steps below:

The input is the reference database (gene catalog) translated into proteins.

  1. Create a directory with a gene catalogue per MGS (remember to specify the dir in the snv_config file)
#!/usr/bin/env python

from __future__ import print_function
from Bio import SeqIO
import os
    
# Parse gene catalog (if file compressed you will need to decompress it)
print("Parsing seq file")
record_dict = SeqIO.index("GENECATALOG.FA", "fasta")
    
# Read through bedfiles
for filename os.listdir("bedfiles/"):
    mgs = filename.split(".")[0]
    output_igc = open("genecatalog_per_mgs/" + mgs " + ".fasta", 'w')
    with open("bedfiles/" + mgs + ".fasta", 'r') as genelist:
        for line in genelist:
            gene_name = line.split()[0]
            try:
                output_igc.write(record_dict.get_raw(gene_name).decode())
            except:
                print(gene_name, mgs)
    output_igc.close()
    
  1. Second, blast (with simple output format) for all combinations of MGS':
# create dir
mkdir mgs2mgsblast

for mgs in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1`; do
    if [ ! -f mgs_genecatalogs/$mgs.fa.nhr ]; then
        makeblastdb -in mgs_genecatalogs/$mgs.fa -dbtype nucl
    fi
    mgslist=()
    i=0
    for mgs2 in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1 | grep -v "$mgs"`; do
        if [ ! -f mgs2mgs_blast/$mgs.$mgs2.out ]; then
            mgslist[i]="$mgs2"
            i=$((i+1))
        fi
    done
    if [ $i -gt 0 ]; then
        echo "$mgs"
        parallel -j 28 "blastn -query mgs_genecatalogs/{}.fa -db mgs_genecatalogs/$mgs.fa -outfmt 6 -out mgs2mgs_blast/$mgs.{}.out" ::: ${mgslist[*]}
    fi
done
  1. Last, summerize results and delete temprorary files. Remember to specify 'mgs2mgsblast' in snv_config
for mgs in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1`; do

    for mgs2 in `ls mgs_genecatalogs | grep "fa$" | cut -d"." -f1 | grep -v "$mgs"`; do
        echo -e "$mgs2\t"`wc -l < mgs2mgs_blast/$mgs.$mgs2.out` >> mgs2mgs_blast/$mgs.txt
    done
    
done

# rm temporary blast files
rm mgs2mgs_blast/*.out

7. Run the script

Run the script from the working directory (where the result files will go):

bash snvcalling/snv_pipeline.sh

About

Mapping based SNV calling and reconstruction of phylogenetic trees from metagenomic samples

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors