Skip to content

padlocke11180/VARV

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

71 Commits
 
 
 
 
 
 
 
 

Repository files navigation

VARV

Authors:

Questions should be sent to all three. Bugs/problems can be posted to the issues section of this repository.

Contents

Synopsis

VARV is a pipeline for automating many of the steps required for performing gene or group based tests with EPACTS. VARV is a command line tool that can be run from a Linux system.

Installation

VARV requires:

  • EPACTS
  • Python 2.7.x (untested on / likely does not work with 3.x)
  • R 3.1 or greater
  • Tabix (available as a part of SAMtools, http://samtools.sourceforge.net/) located somewhere on your $PATH
  • An internet connection during setup
  • An up to date version of Chrome or Firefox (for viewing the output report/visualizations)

To download, grab the tarball from the Releases page.

To install, simply extract the tarball, and then execute the setup script:

tar zxf varv.tar.gz 
cd varv/bin/
python setup.py 

The setup script will attempt to install a number of required python packages into a virtualenv located in varv/env. This requires internet access on port 443 for SSL traffic.

Usage

The main program is located in bin/varv under the extracted directory. To run this without needing to directly specify it, you can create a symbolic link to it from somewhere accessible on your PATH, usually something like:

sudo ln -s -t /usr/local/bin /path/to/varv_dir/bin/varv

Then you can invoke the program on a configuration file like so:

varv your_config.cfg

Basic Configuration File

The config file describes the input parameters and files to be submitted to VARV. The config file is in the METAL format, with each line containing a parameter and value. The config file is whitespace delimited, so the arguments to each option must not contain spaces. The necessary parameters of the config file are:

  • INPUTDIR The input directory containing all the input files.

  • VCFFILE The input vcf file (thie file does not have to be in the INPUTDIR). Only the file name has to be specified if file is in INPUTDIR (see additional options), otherwise the absolute path.

  • PEDFILE File containing phenotype information for samples (tab-delimited). Assumed to be in INPUTDIR.

  • MODEL The model for analysis in the format 'phenotype ~ covariate2 + covariate2' Ex: phenotype ~ sex, or only phenotype (if no covariate is to be used.)

  • TEST Which test is to be used for group-wise test. Any of the options for EPACTS can be specified. If multiple tests are to be run, use multiple TEST lines (Ex: TEST group=skat)

  • OUTPREFIX Output prefix for this run.

  • PROCESS Execute the set of instruction specified above this line (and after the previous instance of PROCESS)

Additional Options

Examples of usage of the options below are in the Examples section below.

  • EPACTSDIR Location of EPACTS if EPACTS is not on system path

  • KINSHIPFILE Kinship file, if one has been generated previously

  • ANNOTFILE Annotation for variants, one variant per line.

  • ANNOTCOLUMNS Which column name from the ANNOTFILE do we want displayed in the output plots? Not specifying this option would not annotate the variants.

  • PVALUETHRESHOLD P-value threshold to determine the significance threshold for group-wise test (default=0.05)

  • GENELIST If the user wants to restrict analysis and visualization only for a group of pre-determined genes (instead of genes significant in the group-wise test), these can be specified using this option. Values are a comma-separated list of gene symbols. (Eg: GENELIST PCSK9,DRD4,CREBBP)

  • GROUPFILE Group file specifying the mapping from group names (genes) to sets of variants. Will b egenerated automatically if not provided

  • SINGLEMARKERTEST Which single marker test is to be used for those variants in genes with significant group-wise P-values. This specifies the name of the test as EPACTS would expect it.

  • SKATO If you specify a SKAT-based test, and you want to use the SKAT-O optimal test, use this option (Eg SKATO ON).

  • FILTERANNOT If you've provided an annotation file, but want to perform a filter on it, use this option. The filter is written with a simple python syntax. For more details, see: http://pandas.pydata.org/pandas-docs/version/0.15.1/indexing.html#indexing-query Eg: FILTERANNOT (vepGENE == 'LIPG') & (CAD > 0.4)

  • FIELD Is there an alternative genotype field that EPACTS should use?

  • SEPCHR If VCF file is split up into chromosomes, specify this option. (Eg SEPCHR ON)

  • GENEMINMAC If you want to restrict genes to those with a minimum MAC. (Eg GENEMINMAC 5)

  • MINVARS Only consider genes with > this number of variants (Eg MINVARS 2)

  • EPACTSCMD Forcefully insert a string of command line arguments into the EPACTS command line when running group/gene tests The following options are to place restrictions on min/max MAF and MAC to be used during both single variant and group based association tests. These are the current defaults:

    • MIN_MAC 1
    • MIN_MAF 0
    • MAX_MAF 1
  • VERBOSE Do you want to see more verbose output? This will spew every EPACTS command line out, along with some other (potentially) useful debug output. (Eg VERBOSE ON)

Examples

Here is an example of a simple config file used to run two separate tests on the same input data. The input data is separated by chromosome.

INPUTDIR  /net/snowwhite/home/xlsim/gene_plots
VCFFILE /net/snowwhite/home/teslo/ExomeChip/Gene-level_analysis/Tanya_dataset/omni_exchip_GoT2D_mean-genotype_10042_FINALchr1.vcf.gz
SEPCHR  ON
MODEL LDL ~ SEX + PC1
TEST  group=skat
TEST  group=VT
OUTPREFIX /net/snowwhite/home/sramdas/output1/
GENEMINMAC  5
PROCESS

Below is an example of a config file, along with comments describing each of the options. You can remove the comments for brevity in your actual config file.

# Location of tabix. Can just be "tabix" if it exists on your path. 
TABIX tabix

# Location of EPACTS. 
EPACTSDIR	/net/fantasia/home/hmkang/bin/epactsRelease/bin

# Directory containing input files (VCFs, or any other file 
# you wish to refer to simply by name and not full path.) 
INPUTDIR	/net/snowwhite/home/xlsim/METSIM/FFA/gene-level/gene_plots

# Either absolute path to a VCF file, or just the name if it 
# is located in the INPUTDIR. 
VCFFILE	/net/snowwhite/home/teslo/ExomeChip/Gene-level_analysis/Tanya_dataset/omni_exchip_GoT2D_mean-genotype_10042_FINAL.vcf.gz

# PED file containing phenotype information. 
PEDFILE	/net/snowwhite/home/welchr/projects/FFA/pheno/excluding_all_t2d/ffa_phenos_studyage-1_BMI-1_smokefix-1_dropstatin-0_adjuststatin-1_dropt2d-1.ped

# Kinship file, if one has already been generated previously. Otherwise it will be created automatically 
# if you select a test that requires a kinship file. 
KINSHIPFILE	/net/snowwhite/home/aujackso/METSIM/OMNI_Express/Geno/metsim_omni.kinf

# Annotations for variants. This can be used for automatically generating a group/mask file, or for showing annotations 
# in resulting output files. 
ANNOTFILE /net/snowwhite/home/welchr/projects/varv/tests/omni_exchip.hg19.vep.annotation.gz

# Which columns from the annotation file do we care about showing in the output? 
# This can be omitted, in which case, no annotations will be shown for variants in the output. 
ANNOTCOLUMNS	vepFUNC

# Specify a model here for analysis. The usual R type of formula should be used. 
# For example: PC ~ AGE + BMI, or just PC if no covariates are needed. 
MODEL	PC

# Gene-based test threshold for significance. 
# Set to 1 to see everything, but this may cause the plotting output to be overwhelming. 
# If not specified, the default is 1e-4. 
PVALUETHRESHOLD	1

# If you wish to restrict analysis to only a few genes, you can list them here. 
# Note that you'll likely want to set PVALUETHRESHOLD to 1 (as above) to be sure
# you get output for all of the genes listed, regardless of their significance. 
GENELIST LIPG,LCAT,FADS1

# Output directory + prefix for output files. 
OUTPREFIX	/net/snowwhite/home/welchr/projects/varv/tests/testcfg/blah

# Group file specifying the mapping from group names (genes) to sets of variants. 
# If this is omitted, it will be generated automatically. 
GROUPFILE PTV+missense.01.grp

# The pipeline will also execute single marker tests, to include them with the results of the
# gene-based tests for interpretation. This specifies the name of the test as EPACTS would expect it. 
# http://genome.sph.umich.edu/wiki/EPACTS#Single_Variant_Tests
SINGLEMARKERTEST	q.emmax

# Specify the group based test that should be run. If you want to execute multiple tests, 
# just use multiple TEST lines. 
# http://genome.sph.umich.edu/wiki/EPACTS#Gene-wise_or_group-wise_tests
TEST group=mmskat
TEST group=VT

# If you specify a SKAT-based test, and you want to use the SKAT-O optimal test, 
# use this option. 
SKATO	ON

# Parallel options for our local MOSIX cluster. This is likely not useful to anyone else!
#NJOBS 100
#MOSIX --mosix-nodes="\`/net/fantasia/home/gjun/bin/pick_mainnode\`"

# Execute this set of instructions. 
PROCESS

# Another run. 
# All of the parameters set are remembered, so if you wish to run another analysis with only one 
# or few parameters changed, just change them and issue PROCESS again (see below for an example.) 
# The exception is that TEST must be re-specified to be explicit about which tests should be run. 
MODEL FFA ~ BMI + AGE + PC1 + PC2
OUTPREFIX /my/new/analysis/prefix
TEST group=mmskat

# If you want to reset an option back to its default: 
PVALUETHRESHOLD RESET

PROCESS

File formats

PEDFILE

PED file for phenotypes and covariates

VCFFILE

VCF format 4.1 or higher

GROUPFILE

Creating a marker group file

ANNOTFILE

Basically just the first few columns of a VCF file, along with additional columns specifying the annotations:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ALLELEA ALLELEB TOP vepFUNC vepGENE vepNS vepPTV CADD
1 752566 rs3094315 G A 100 PASS NS=10042;AC=16000;AN=20084;AF=0.796654 GT A G + intron_variant RP11-206L10.10 0 0 0.189
1 752721 rs3131972 A G 100 PASS NS=10042;AC=15994;AN=20084;AF=0.796355 GT A G + intron_variant RP11-206L10.10 0 0 3.952
1 762320 exm2268640 C T 100 PASS NS=10042;AC=6;AN=20084;AF=0.000299 GT T C - non_coding_exon_variant LINC00115 0 0 3.28
1 776546 rs12124819 A G 100 PASS NS=10042;AC=4295;AN=20084;AF=0.213852 GT A G + intron_variant RP11-206L10.11 0 0 1.62
1 798959 rs11240777 G A 100 PASS NS=10042;AC=5484;AN=20084;AF=0.273053 GT A G + downstream_gene_variant RP11-206L10.11 0 0 0.265
1 800007 rs6681049 T C 100 PASS NS=10042;AC=20052;AN=20084;AF=0.998407 GT T C - downstream_gene_variant FAM41C 0 0 0.601
1 838555 rs4970383 C A 100 PASS NS=10042;AC=6335;AN=20084;AF=0.315425 GT A C + upstream_gene_variant RP11-54O7.16 0 0 1.012
1 846808 rs4475691 C T 100 PASS NS=10042;AC=3658;AN=20084;AF=0.182135 GT T C - intron_variant RP11-54O7.16 0 0 5.381
1 854250 rs7537756 A G 100 PASS NS=10042;AC=3657;AN=20084;AF=0.182085 GT A G + non_coding_exon_variant RP11-54O7.3 0 0 1.108

KINSHIPFILE

This is automatically created by the pipeline if a test requiring a kinship matrix is requested. However, if you wish to make your own kinship file, see Creating the kinship matrix.

Output

The pipeline creates a number of output files that are used during the analysis, and may be potentially useful for interpreting the results.

prefix.singlemarker.epacts.gz

The single marker association results (for those variants going into gene based tests)

prefix.groupfile.filtered.txt

The final group file entering into the gene based testing after filtering has (potentially) been applied. For example, variants may be excluded for failing to meet MAX_MAF.

prefix.genotype_trait_table.gene-?.tab

For each significant gene (according to PVALUETHRESHOLD), a file will be created containing the genotypes for the variants going into that gene's test, along with the trait values for each individual, like so:

IND_ID PC 18:47091686_G/A 18:47095862_C/T 18:47101838_G/A
10741 -3.85229944 0/0 0/0 0/0
MJQQ-901 -3.574392931 0/0 0/0 0/0
MCMP-901 -3.438443305 0/0 0/0 0/0
MFID-901 -3.346259618 0/0 0/0 0/0
8071 -3.275935989 0/0 0/0 0/0
MJNO-901 -3.218829067 0/0 0/0 0/0
MJEG-901 -3.170612436 0/0 0/0 0/0
11618 -3.128801711 0/0 0/0 0/0
11577 -3.09183487 0/0 0/0 0/0

prefix.pheno.ped

The phenotype file that entered into both the single variant and gene based tests, after any potential filtering was applied.

prefix.plots/

Directory containing files for creating some of the QC plots and visualizations. In particular:

  • top_genes.html contains a report showing QQ plots, manhattan plots, and a graphic to help interpret the gene-based tests that shows each individual carrier of a rare allele for a variant contributing to a significant gene test (see Plots for more info.)

  • plot_variants.tsv has the genotype data melted into long format for use in the top_genes.html figure and looks like the following:

VARIANT IND GENOTYPE RARE_COUNT TRAIT
16:67976823_C/T SAMPLE1 0/1 1 1.69
16:67976823_C/T SAMPLE2 0/1 1 -0.138
16:67976823_C/T SAMPLE3 0/1 1 -0.409
16:67976823_C/T SAMPLE4 0/1 1 0.16
  • plot_genes.tsv contains the table exactly as seen in top_genes.html, but in tab-delimited format. For example:
GENE mmskat.P VARIANT SV.P BETA MAF MAC vepFUNC
LIPG 6.20E-11 18:47091686_G/A 2.96E-06 0.76 0.00225 38 splice_acceptor_variant
LIPG 6.20E-11 18:47095862_C/T 0.000184 0.437 0.0045 76 missense_variant
LIPG 6.20E-11 18:47101838_G/A 0.792 -0.263 6.00E-05 1 missense_variant
LIPG 6.20E-11 18:47109939_G/A 0.957 -0.0192 0.00047 8 missense_variant
LIPG 6.20E-11 18:47109955_A/G 4.36E-05 0.37 0.00728 123 missense_variant

prefix.?.epacts

An EPACTS file for the gene-based test results, one per test requested.

Plots

The top_genes.html file contains 3 different visualizations for QC and interpretation of your results.

  • Carriers of rare alleles for single variants contributing to significant gene-based tests

carrier-plot

Each row represents a variant contributing to a significant gene-based test (SKAT P=6.2e-11 for LIPG.) The symbols in the leftmost column of the table correspond to each individual carrier of that variant, and are aligned with the x-axis of the histogram above to show where each variant carrier lies in the phenotype distribution. An interactive tooltip displayed over each carrier shows the phenotype value and the sample identifier.

  • QQ and Manhattan plots

These plots are created for two subsets of variants:

  1. Variants assigned to genes that pass our inclusion filters (such as minimum total MAC or minimum number of variants)
  2. Variants assigned to genes failing filters

The manhattan plots show single variant association statistics.

qqmanhat-plots

License

Copyright (C) 2014 Shweta Ramdas, Ryan Welch, The University of Michigan

VARV is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

VARV is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 74.7%
  • HTML 15.8%
  • R 6.0%
  • JavaScript 3.5%