Introduction to Bacterial Genome Wide Association Studies
Set up to run this on Amazon EC2:
Launch EC2 instance: we will run Ubuntu 14.04 LTS (64-bit) on an m3.2xlarge instance. Before you click 'Review and Launch', increase the root file system size to 100 GB.
Download the data and unpack it into
~/data(this step take a while, so we're going to get it running and go through the background).
wget http://dib-training.ucdavis.edu.s3.amazonaws.com/2016-bodega/jane-data.tar.gz wget http://dib-training.ucdavis.edu.s3.amazonaws.com/2016-bodega/jane-data2.tar.gz tar -xvf jane-data.tar.gz tar -xvf jane-data2.tar.gz rm -f *.gz
- To describe some of the problems with carrying out genome wide association studies (GWAS) in bacteria using whole genome sequencing data.
- To gain familiarity with the steps in a bacterial GWAS software pipeline.
- To use the bugwas R package to identify loci and lineages associated with antibiotic resistance.
Genome wide association studies (GWAS)
Genome wide association studies (GWAS) aim to test genetic variants for association with a phenotype of interest. DNA sequences from individuals with the phenotype of interest (cases) are compared with those without (controls) to test whether any variants are significantly associated with one or other group. Traditionally, GWAS studies were carried out using SNP arrays, but advances in genome sequencing technology have made it possible to use whole genomes instead, especially for organisms with small genomes, such as bacteria.
Problems with doing GWAS in bacteria
Doing GWAS in bacteria poses some unique problems, however. Different species of bacteria vary immensely in how often their genomes recombine and most species have only a single, circular chromosome. This means that even variants that are not in close physical proximity on the chromosome are in linkage disequilibrium. In addition, bacterial populations often exhibit strong signals of structure, due to expansion of ecologically successful clones in free-living species and isolation in different hosts in pathogens. Finally, individuals of the same bacterial species often vary quite dramatically in gene content, so just looking at SNPs called relative to a reference genome risks ignoring much interesting variation. Our group has written a pipeline for bacterial GWAS which deals with this problem by looking at variation in SNPs, gene presence/absence and 31nt kmers. Today we'll carry out a GWAS together identifying kmers from Staphylococcus aureus that are associated with resistance to the antibiotic fusidic acid.
Outline of workflow for bacterial GWAS
- Set up - assemble data files, check dependencies.
- Basic kmer GWAS using chi-square test.
- Annotate top kmers with BLAST.
- Controlling for population structure with a linear mixed model (running Gemma).
- Compare top kmers before and after controlling for population structure.
- Use bugwas R package to detect lineage and locus effects.
Let's start with downloading and installing all of the software we need to run our GWAS.
Our pipeline is written in R and we are going to use R on the command line to run some of the scripts from the pipeline.
Let's start by installing R. Unfortunately our code relies on a newer version of R than the one that comes with the basic install on the EC2 Ubuntu machines, but someone on the internet has gone through the pain of figuring this out, and I've shamelessly pinched it (link here if anyone cares: http://askubuntu.com/questions/614530/how-to-install-latest-version-of-r-on-ubuntu-12-04-lts).
codename=$(lsb_release -c -s) echo "deb http://cran.fhcrc.org/bin/linux/ubuntu $codename/" | sudo tee -a /etc/apt/sources.list > /dev/null sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 sudo add-apt-repository ppa:marutter/rdev sudo apt-get update && sudo apt-get -y upgrade && \ sudo apt-get install -y r-base r-base-dev
Then we'll download the scripts for our GWAS pipeline. Install git,
sudo apt-get install -y git
then clone the repository.
git clone https://github.com/jessiewu/bacterialGWAS.git
For this example we are only doing a kmer GWAS so the relevant scripts are all in the
We also need a few more external dependencies, which are
GEMMA (included in bugwas):
git clone https://github.com/sgearle/bugwas.git sudo apt-get install -y ncbi-blast+
Finally, let's open R and install a couple of packages (select 25 for mirror when prompted and answer yes when it asks if you want to install into a local library directory).
Run the following commands one line at a time:
R install.packages("genoPlotR") install.packages("ape") install.packages("phangorn") #this takes forever on Ubuntu. install.packages("bugwas/build/bugwas_1.0.tar.gz", repos = NULL, type="source") q() #I'd save workspace image just in case something failed to install and you need to re-run
While these R packages are installing, let's set up the shortcut to scp files to your local machine. On your local machine, edit your ~/.ssh/config file (if you don't have one, make one) to include the following:
Host <bugwas> HostName <YOUR PUBLIC IP> IdentityFile <YOUR .pem or ~/.ssh/id_rsa> IdentitiesOnly yes User ubuntu
Then, back on the EC2 instance, create a list of the paths to each of the following dependencies in the following tab-delimited format, calling it
cat > dependencies.txt <<EOF name path GEMMA ~/bugwas/gemma/gemma.0.93b blastn /usr/bin/blastn PrintOutTopXChiSq bacterialGWAS/kmerGWAS/PrintOutTopXChiSq.class sortDsk bacterialGWAS/kmerGWAS/sort_dsk gwasKmerPattern bacterialGWAS/kmerGWAS/gwas_kmer_pattern EOF
- Kmer files - 1 genome per file. These were generated from raw sequencing reads using the kmer counting software
- List of kmer file paths linked to phenotypes.
- Relatedness matrix calculated from genomic SNPs (this is created by
- Maximum-likelihood phylogeny calculated from genomic SNPs (we use
PhyMLfor < 100 genomes and
RAxMLfor > 100 genomes)
- BLAST databases and gene look-up table for annotation
- R scripts: kmerAnalysis.R, kmerAnnotation.R, kmerLMM.R, LMM_kmerAnnotation.R
- bugwas library in R.
Run basic GWAS analysis:
Normally the input to our pipeline starts with raw reads in a bam or fasta file, but kmer-counting is time-consuming so we have generated kmer files for you. Each kmer file contains the reads from a single genome, split up into 31-nucleotide kmers. The files in the
~/data/kmers directory contain the set of unique kmers in each genome.
fus300.kmer.files.txt lists the path to the fasta file for each genome and the phenotype for fusidic acid resistance for each genome.For the phenotypes, 0 and 1 denote sensitivity or resistance to fusidic acid, respectively.
I put the wrong file into the data bundle so we need to fix the filepaths before this will run. Please run the following Perl one-liner to correct this:
perl -pi -e 's/\/dipro\/mmm\/gorm\/v3\/ana\/Saur\/AbxGWAS\/dsk\//data\/kmers\//g' ~/data/fus300.kmerfiles.txt
Now we need to run a piece of c++ code that gets patterns of presence or absence of each kmer in each genome and tests each kmer for a significant association with the phenotype using a Chi-Square test.
We run this using the following R script:
#Bash Shell Rscript bacterialGWAS/kmerGWAS/kmerAnalysis.R -dataFile \ ~/data/fus300.kmerfiles.txt -prefix fus300 \ -removeKmerTxt FALSE -minCov 5 -externalSoftware dependencies.txt
Visualising significant kmers
The script we just ran generates p-values from the chi-square test for each kmer. It also generates a couple of plots - an empirical cumulative distribution function (ECDF) plot and a quantile-quantile (QQ) plot. While we're annotating the kmers, let's take a moment to look at these.
Get them to your local machine with the following (run on your local computer):
scp bugwas:~/*.png ./
The empirical cumulative distribution function plots the p-values for each kmer, ordered by significance. The quantile-quantile plot compares the distribution of -log10 p-values for the kmers in our dataset to a theoretical distribution. Here we can see that we have more significant kmers than would be expected by chance (red line). YAY.
Correcting for population structure
Next we want to re-run our GWAS using a control for population structure. The approach we use for controlling for population structure is a linear mixed model (LMM) implemented in the software
GEMMA (Zhou & Stephens 2012). What this does is assigns all variants a background significance level. Each variant in turn is then tested to see if its individual significance is above this background level. This acts to remove variants that are associated with specific lineages.
We will run the GWAS with linear mixed model and compare the results. This script uses the output of
kmerAnalysis.R and also requires an additional file, relatedness_matrix.txt which contains a relatedness matrix calculated from genomic SNPs in this dataset. This file is located with the other data we downloaded.
#Bash Shell Rscript bacterialGWAS/kmerGWAS/kmerLMM.R -chisqStat fus300.gwaskmer-out.chisqStat.txt -patternKey fus300.gwaskmer-out.patternKey.txt -patternIndex fus300.gwaskmer-out.patternIndex.txt -signif 5000 -relateMatrix ~/data/fus300_gemma_relmatrixout.cXX.txt -phenotype ~/data/fus300.pheno.txt -prefix fus300 -externalSoftware dependencies.txt
The script outputs a plot comparing the p-values of our kmers before and after controlling for population structure - we can see that the most significant kmers are still the same ones, but they drop in significance.
Identifying lineage-associated as well as locus-specific variants with bugwas
In the second part of this tutorial we'll look at applying our method for identifying both lineage and locus-specific effects. This is implemented in the R package bugwas. We will now run bugwas to identify lineages associated with fusidic acid resistance.
In many cases this is important because we observed that for many phenotypes (and for fusidic acid resistance with a larger dataset), the most significant variants drop dramatically in significance after controlling for population structure.
#R R #load the bugwas library, which we will use to test for lineage effects library(bugwas) #First we need to define a few variables.. gem.path="bugwas/gemma/gemma.0.93b" output.dir="./" #Then we call the function “lin_loc” which tests for both lineage and locus effects and generates a bunch of plots. This function needs the genotypes, phenotype, the relatedness matrix we used earlier and the path to the GEMMA software. lin_loc(gen="data/fus300_bugwas_gemma_gen_format.txt",pheno="data/fus300_bugwas.pheno.txt",phylo="data/RAxML_bestTree.fus300",prefix="fus300",gem.path=gem.path,var.matrix="data/fus300_bugwas.var.matrix.txt",relmatrix="data/fus300_gemma_relmatrixout.cXX.txt", output.dir=output.dir)
Let's look at some of the plots produced by bugwas - get them on your local machine.
scp bugwas:bugwas_out/*.png ./
The key plot that we are going to look at today is
fus300_tree_branchescolouredbyPC.png. This shows lineages that are associated with fusidic acid resistance. The lineages are defined by principal components (PCs) (for the maths behind this, please refer to McVean, 2009). Here we can see that PCs 1 and 6 split off groups of resistant and sensitive isolates.
We can also look at which variants are associated with the phenotype. The plot
fus300_genVar1_ManhattanLMMPvalues.png shows that we have lineage-specific variants strongly associated with fusidic acid resistance (in green) and variants that are associated with the phenotype (high -log10 p-value) but which are not significantly associated with any lineage (not shaded). You'll have to take my word for this, but when we looked mapped the kmers corresponding to the locus-specific effects up, they map to fusA, a gene known to be associated with fusidic acid resistance.
Annotating significant kmers
We can annotate our significant kmers by aligning them to a database of reference genomes using BLAST. This R script BLASTs kmers against a database of genomes downloaded from GenBank and then checks their positions against a look-up table of Staph genes. Any kmers that do not have a good hit to a Staph genome are BLASTed for a second time against the whole nucleotide database from NCBI. To save time, we have created the BLAST databases for you.
#Bash Shell Rscript bacterialGWAS/kmerGWAS/kmerAnnotation.R -chisq_results fus300.gwaskmer-out.chisqStat.txt -kmer_results fus300.gwaskmer-out.kmer.txt -present_ctrl fus300.gwaskmer-out.nPresentCtrl.txt -present_case fus300.gwaskmer-out.nPresentCase.txt -blastdb1 ~/data/staphdb3.blast -blastdb2 ~/data/nt_blastdb -ncbi_db ~/data/geneannot.allstaph.txt -signif 5000 -nproc 2 -prefix fus300 -externalSoftware dependencies.txt
Now we annotate the significant kmers as before…
#Bash Shell Rscript LMM_kmerAnnotation.R -chisq_results fus300.gwaskmer-out.chisqStat.txt -kmer_results fus300.gwaskmer-out.kmer.txt -present_ctrl fus300.gwaskmer-out.nPresentCtrl.txt -present_case fus300.gwaskmer-out.nPresentCase.txt ~/data/dbs/staphdb3.blast -blastdb2 ~/data/dbs/nt_blastdb -ncbi_db ~/data/dbs/geneannot.allstaph.txt -signif 5000 -nproc 2 -prefix fus300 -LMM_kmers fus300_lmm_kmers_used.txt -LMM_output fus300_lmm_LMM_allkmers_out.txt -externalSoftware dependencies.txt
We can then look at the top kmers and see if they map to the same gene. Note: you could annotate kmers in other ways than by using BLAST - eg by mapping to a reference.
What we haven't covered today
GWAS is a complex topic and this tutorial is just a snapshot. We haven't covered some other very important topics in GWAS such as correcting for multiple testing and the challenges presented by including low-frequency variants in the analysis. Our group is working on refining these aspects of our pipeline so if you're interested in this, please watch this space.
Gordon NC, Price JR, Cole K, Everitt R, Morgan M, Finney J, Kearns AM, Pichon B, Young B, Wilson DJ, Llewelyn MJ, Paul J, Peto TE, Crook DW, Walker AS, Golubchik T. Prediction of Staphylococcus aureus antimicrobial resistance by whole-genome sequencing. J Clin Microbiol. 2014 Apr;52(4):1182-91. doi: 10.1128/JCM.03117-13.
Earle, Sarah G; Wu, Chieh-Hsi; Charlesworth, Jane et al. Identifying lineage effects when controlling for population structure improves power in bacterial association studies Preprint: arXiv:1510.06863v2
Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012 Jun 17;44(7):821-4. doi: 10.1038/ng.2310.
McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009 Oct;5(10):e1000686. doi: 10.1371/journal.pgen.1000686.
GWAS pipeline: Github: jessiewu/bacterial_GWAS
bugwas R package: Github: sgearle/bugwas