VSG analysis pipeline


VSG-Seq Analysis Pipeline

A pipeline for analyzing VSG-Seq data. A protocol for preparing VSG-Seq libraries can be found here, and more information about the approach can be found in our paper. Reference genomes from that paper can be found in Science2015_reference_genomes.

Basic Usage

python VSGSeqPipeline.py -i input_file_data.txt 

You can adjust various parameters for your run. To read about all of the options, run python VSGSeqPipeline.py -h. input_file_data.txt is a tab-delimited file describing your input files and the samples they came from; see Input Files for details.

Required Software

You'll need biopython (we use anaconda) and the following software installed and in your PATH:

Input Files

In addition to your sequencing files in FASTQ format (this pipeline uses single-end sequencing reads), you'll need a tab-delimited file containing information about your samples. The first line is a header containing whatever attributes of your samples you'd like to incorporate into downstream analysis. These attributes will be carried through to the final file describing the results. The first column must be the name of each input FASTQ file, and the program expects those files to be in the current working directory. Here's an example.
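To make the layout concrete, here is a minimal sketch of reading such a sample sheet and checking that the listed FASTQ files are present. This is not code from the pipeline: the function names, the column names (file, mouse, day) and the FASTQ file names are all hypothetical, chosen only for illustration.

```python
import csv
import io
import os


def read_sample_sheet(handle):
    """Parse a tab-delimited sample sheet into (fastq_name, attributes) pairs.

    The first line is treated as the header and the first column as the
    FASTQ file name; every other column becomes a sample attribute.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        fastq_name = row[0].strip()
        attributes = dict(zip(header[1:], row[1:]))
        samples.append((fastq_name, attributes))
    return samples


def missing_fastqs(samples, directory="."):
    """Return the FASTQ names from the sheet that are not in `directory`."""
    return [
        name
        for name, _ in samples
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Hypothetical sample sheet: the column and file names are made up.
example = (
    "file\tmouse\tday\n"
    "mouse1_day6.fastq\t1\t6\n"
    "mouse2_day6.fastq\t2\t6\n"
)
samples = read_sample_sheet(io.StringIO(example))
for fastq_name, attributes in samples:
    print(fastq_name, attributes)
```

Running missing_fastqs(samples) before starting a run is a cheap way to catch a misspelled file name or a run launched from the wrong directory.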

BLAST Databases for Identifying VSGs

You can use any reference you want to identify VSGs. We have a few options available in VSG_blastdbs.

There are three different VSG databases:

  • EATRO1125 VSGs (EATRO1125_vsgs)
  • Lister427 VSGs (tb427_vsgs)
  • Combined database of BOTH Lister427 and EATRO1125 (concatAntattb427)

There is one 'NonVSG' database (NOTvsgs). This database has been cobbled together after multiple iterations of assembling expressed VSGs, inspecting them by hand, and identifying common false positives (e.g., certain ESAGs assemble frequently and this will filter those out). If you run the pipeline using this filter (the default), you'll need this database available for BLAST.

The BLAST database files you want to use need to be in your working directory, or your blastn installation needs to be configured so that it can find them wherever they live on your machine.
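As a rough pre-flight check, the sketch below looks for a nucleotide BLAST database by name. It is not part of the pipeline. It assumes the database is a single-volume nucleotide database (files ending in .nhr, .nin and .nsq) and that blastn is pointed at other locations through the BLASTDB environment variable; multi-volume databases or other configuration mechanisms are not handled. The function name find_blast_db is hypothetical.

```python
import os

# File extensions of a single-volume nucleotide BLAST database (assumption:
# the database is not split into multiple volumes).
NUCLEOTIDE_DB_EXTENSIONS = (".nhr", ".nin", ".nsq")


def find_blast_db(db_name, search_dirs=None):
    """Return the first directory containing all files for `db_name`, or None.

    By default the current working directory is searched first, followed by
    any directories listed in the BLASTDB environment variable.
    """
    if search_dirs is None:
        search_dirs = [os.getcwd()]
        blastdb = os.environ.get("BLASTDB", "")
        search_dirs += [d for d in blastdb.split(os.pathsep) if d]
    for directory in search_dirs:
        prefix = os.path.join(directory, db_name)
        if all(os.path.isfile(prefix + ext) for ext in NUCLEOTIDE_DB_EXTENSIONS):
            return directory
    return None


# Database names taken from the list above.
for name in ("tb427_vsgs", "NOTvsgs"):
    location = find_blast_db(name)
    print(name, "found in" if location else "not found", location or "")
```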

The fasta files these were created from are available in VSGdb_fasta.

Output Files

All intermediate files produced in the pipeline are saved in one folder. A summary file shows the expression of each VSG in each sample, both in terms of RPKM (calculated using MULTo) and percentage of the population (RPKM for that VSG/total RPKM). If you assembled VSGs from your reads, it will also contain information on how similar those VSGs are to VSGs in your reference database. See an example here.
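The percentage column follows directly from the definition above (RPKM for that VSG divided by total RPKM in the sample). The following is a small sketch of that arithmetic only; it is not the pipeline's own code, and the VSG names and RPKM values are invented for illustration.

```python
def percent_of_population(rpkm_by_vsg):
    """Convert per-VSG RPKM values for one sample into percentages.

    Each percentage is 100 * (RPKM for that VSG) / (total RPKM in the sample).
    """
    total = sum(rpkm_by_vsg.values())
    if total == 0:
        # No expression detected: avoid dividing by zero.
        return {vsg: 0.0 for vsg in rpkm_by_vsg}
    return {vsg: 100.0 * rpkm / total for vsg, rpkm in rpkm_by_vsg.items()}


# Made-up RPKM values for a single sample.
sample_rpkm = {"VSG-A": 750.0, "VSG-B": 200.0, "VSG-C": 50.0}
percentages = percent_of_population(sample_rpkm)
print(percentages)  # {'VSG-A': 75.0, 'VSG-B': 20.0, 'VSG-C': 5.0}
```

Because the denominator is the total for a single sample, percentages are comparable across samples even when sequencing depth differs.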