# Assignment 2 - Microbial GWAS
*MBB*

*Developed by Jimmy Liu and William Hsiao*

## Learning Objectives
* Gain familiarity with command line tools to run microbial GWAS
* Understand the key analytical steps of GWAS
* Understand how to interpret GWAS output
* Identify the limitations of the presented methods

## Background
This assignment will focus on *Salmonella enterica*, an enteric pathogen that, in Canada and the United States, spreads primarily through human consumption of contaminated foods. You will examine isolates of *Salmonella* serovar Heidelberg from three epidemiologically distinct foodborne outbreaks that occurred in Quebec, Canada between 2012 and 2014. For more detailed background on the outbreaks, you are encouraged to read the original publication by [Bekal et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/26582830/).

As you have learned in previous lectures, phylogenetic methods can be applied to infer the relationships among these outbreak isolates and identify clonal strains that share the same source of contamination. Outbreak tracing is not the purpose here, however. Instead, you are provided with the results of the epidemiological investigation and will use them to identify the genetic features that distinguish *Salmonella* isolates of different outbreak origins.

For example, Outbreak 1 isolates may carry a unique gene (Gene A) that is absent in the isolates from all other outbreaks. The presence and absence of Gene A would thus be a strongly predictive feature of outbreak origin. 

Throughout the assignment, you will be provided with detailed instructions on how to conduct a genome-wide survey of the bacterial genomes to identify all the genes unique to each outbreak. While this exercise focuses on the genetic association to each outbreak, the same approach can be readily extended to identifying genetic association of other phenotypes such as virulence, antimicrobial susceptibility, and transmissibility.

## Getting Started
* *Salmonella* genomes (*N* = 46) are in the shared directory: `/opt/share/gwas/genomes/`
* Metadata of the outbreak dataset is in the shared directory: `/opt/share/gwas/outbreak_metadata.csv`
* Pre-computed analysis results are in the shared directory: `/opt/share/gwas/analysis/`

Let's begin by copying all the data from the shared directory to our current directory using the command `cp`:

In [1]:
# -r option indicates recursive copy and . refers to the current directory
cp -r /opt/share/gwas/* .

Use the `ls` command to verify that all the folders and files have been copied to our current directory.

In [2]:
ls

analysis  genomes  mGWAS_assignment.ipynb  outbreak_metadata.csv


To have the bioinformatics tools and their dependencies available in our analysis environment, use the `conda` command to activate the environment. All of the tools required to complete this assignment have been packaged in a conda environment called `gwas`.

In [3]:
conda activate gwas


## Pan-Genome Analysis with Prokka and Roary

Prior to testing for genetic association, you first need to know which genetic features are present in each genome. You will use a genome annotation tool called `Prokka` and a pan-genome pipeline called `Roary` to annotate the genomes and compute the pan-genome of the dataset. `Roary` generates a multitude of outputs, including core genome alignments and genome annotations. For our purpose, the key output from `Roary` is the gene presence/absence matrix, in which the rows are the predicted genes and the columns are the individual genomes. In `gene_presence_absence.Rtab`, each cell carries a value of 0 or 1, with 0 = absence and 1 = presence; in `gene_presence_absence.csv`, the file passed to `Scoary` later, a non-empty cell (the gene's locus tag in that genome) indicates presence.
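To make the matrix concrete, here is a minimal sketch that builds a toy 0/1 table in the layout of Roary's `gene_presence_absence.Rtab` (one row per gene, one column per genome) and counts how many genomes carry each gene. The gene and genome names (`geneA`, `genome1`, and so on) are made up for illustration, and labelling a gene "core" only when it is present in every genome is a simplifying assumption rather than Roary's exact definition.

```shell
# Toy presence/absence table; gene and genome names are hypothetical
toy_matrix=$(mktemp)
printf 'Gene\tgenome1\tgenome2\tgenome3\tgenome4\n' > "$toy_matrix"
printf 'geneA\t1\t1\t1\t1\n' >> "$toy_matrix"
printf 'geneB\t1\t1\t0\t0\n' >> "$toy_matrix"
printf 'geneC\t0\t0\t0\t1\n' >> "$toy_matrix"

# For each gene (row), sum the 0/1 values across genomes (columns 2..NF)
gene_counts=$(awk -F'\t' 'NR > 1 {
    n = 0
    for (i = 2; i <= NF; i++) n += $i
    label = (n == NF - 1) ? "core" : "accessory"
    print $1, n "/" (NF - 1), label
}' "$toy_matrix")
echo "$gene_counts"
rm -f "$toy_matrix"
```

A gene such as `geneC`, carried by a single genome, is the kind of accessory feature that could turn out to be specific to one outbreak.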

In the interest of time, the `Prokka` results have been pre-computed and can be found under `analysis/prokka`. The primary `Prokka` outputs of interest are the GFF files. GFF (General Feature Format) is a standard file format for feature annotations of nucleic acid sequences. The information is formatted as a tab-delimited table in which each row corresponds to a genetic feature (e.g. a coding or non-coding sequence) and the columns encode contextual information about the feature, such as its genomic position, strand (+/-), gene name, and encoded protein product.
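As a rough illustration of the column layout, the sketch below writes three GFF records for Sample SH12-001 to a temporary file and uses `awk` to pull out the gene name, feature type, strand, and feature length. The attribute column (column 9) has been shortened here for readability, and the temporary file stands in for a real file under `analysis/prokka`.

```shell
# Three GFF records for SH12-001, with the attribute column shortened
gff_demo=$(mktemp)
printf '##gff-version 3\n' > "$gff_demo"
printf 'DAASFI010000001.1\tProdigal:002006\tCDS\t3342\t5288\t.\t-\t0\tID=OOCGDIGJ_00004;Name=macB;gene=macB\n' >> "$gff_demo"
printf 'DAASFI010000001.1\tProdigal:002006\tCDS\t5285\t6403\t.\t-\t0\tID=OOCGDIGJ_00005;Name=macA;gene=macA\n' >> "$gff_demo"
printf 'DAASFI010000001.1\tProdigal:002006\tCDS\t9356\t10255\t.\t+\t0\tID=OOCGDIGJ_00008;Name=lysO;gene=lysO\n' >> "$gff_demo"

# Skip comment lines, keep 9-column feature rows, and report:
# gene name (from the attributes), feature type, strand, length = end - start + 1
gff_summary=$(awk -F'\t' '!/^#/ && NF == 9 {
    gene = "NA"
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) if (attrs[i] ~ /^gene=/) gene = substr(attrs[i], 6)
    print gene, $3, $7, $5 - $4 + 1
}' "$gff_demo")
echo "$gff_summary"
rm -f "$gff_demo"
```

Replacing `"$gff_demo"` with a path such as `analysis/prokka/SH12-001.gff` should give the same kind of summary for a full annotation, since the `NF == 9` condition ignores the sequence lines at the end of the file.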

In [4]:
# Print the content of the GFF file for Sample SH12-001
# head and tail are used in combination to skip the contig information in the file
head -n 40 analysis/prokka/SH12-001.gff | tail -n 3

DAASFI010000001.1	Prodigal:002006	CDS	3342	5288	.	-	0	ID=OOCGDIGJ_00004;eC_number=3.6.3.-;Name=macB;db_xref=COG:COG0577;gene=macB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q0TJH0;locus_tag=OOCGDIGJ_00004;product=Macrolide export ATP-binding/permease protein MacB
DAASFI010000001.1	Prodigal:002006	CDS	5285	6403	.	-	0	ID=OOCGDIGJ_00005;Name=macA;db_xref=COG:COG0845;gene=macA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P75830;locus_tag=OOCGDIGJ_00005;product=Macrolide export protein MacA
DAASFI010000001.1	Prodigal:002006	CDS	9356	10255	.	+	0	ID=OOCGDIGJ_00008;Name=lysO;db_xref=COG:COG2431;gene=lysO;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P75826;locus_tag=OOCGDIGJ_00008;product=Lysine exporter LysO

With the genome annotations available, you will run `Roary` on the GFF files of the entire dataset. In brief, `Roary` analyzes each GFF file provided, aggregates all the features to construct a bacterial pan-genome, and reports the results as .csv/.tsv files.

A typical `Roary` command looks like: 
```
roary [options] [path to GFF files]
```

`Roary` options explained:
* `-p` option specifies the number of compute cores to use for process parallelization
* `-f` option specifies the path to store the outputs

In [None]:
roary -p 8 -f analysis/roary analysis/prokka/*.gff

## Testing Genome-Wide Association with Scoary

`Scoary` is designed to perform statistical tests (Fisher's exact test) on the features summarized by `Roary` (notice the similarity in the tool names). To conduct GWAS with `Scoary`, the prerequisites are a genotype matrix and a phenotype matrix. By this stage, you will have prepared the genotype matrix using `Roary`; it can be found at `analysis/roary/gene_presence_absence.csv`.
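To give a feel for what Fisher's exact test computes, here is a minimal sketch in `awk` for a single gene and a single outbreak. The counts are hypothetical (they are not taken from the dataset), a two-sided test is assumed, and the code is an illustration of the test itself rather than a reproduction of `Scoary`'s implementation.

```shell
# Hypothetical 2x2 table for one gene and one outbreak (not from the dataset):
#   a = outbreak isolates with the gene,   b = outbreak isolates without it
#   c = other isolates with the gene,      d = other isolates without it
fisher_p=$(LC_ALL=C awk -v a=3 -v b=0 -v c=0 -v d=3 '
    # log(n!) as a sum of logs, adequate for small counts
    function logfact(n,   i, s) { s = 0; for (i = 2; i <= n; i++) s += log(i); return s }
    # Hypergeometric probability of a 2x2 table with fixed row and column totals
    function table_prob(w, x, y, z,   num, den) {
        num = logfact(w + x) + logfact(y + z) + logfact(w + y) + logfact(x + z)
        den = logfact(w) + logfact(x) + logfact(y) + logfact(z) + logfact(w + x + y + z)
        return exp(num - den)
    }
    BEGIN {
        row1 = a + b; row2 = c + d; col1 = a + c
        observed = table_prob(a, b, c, d)
        lo = (col1 - row2 > 0) ? col1 - row2 : 0
        hi = (row1 < col1) ? row1 : col1
        p = 0
        # Two-sided p-value: sum all tables no more probable than the observed one
        for (k = lo; k <= hi; k++) {
            pk = table_prob(k, row1 - k, col1 - k, row2 - col1 + k)
            if (pk <= observed * (1 + 1e-7)) p += pk
        }
        printf "%.4f\n", p
    }')
echo "Two-sided Fisher p-value: $fisher_p"
```

With only six isolates, even a perfect split (gene present in all three outbreak isolates and none of the others) gives a p-value of 0.1, which shows how strongly sample size limits the power of the test.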

The phenotype matrix should be a .csv file with samples as rows and phenotypic attributes as columns. You can include as many phenotypes as you like in the matrix, and `Scoary` will write the results for each phenotype to a separate output file. However, because `Scoary` uses Fisher's exact test to determine association, each column must be a **binary** variable. You therefore cannot place the three *Salmonella* outbreak identifiers in a single column; instead, the phenotypic information needs to be divided into three columns, one per outbreak. Each column carries a value of `0` or `1` to indicate whether a given sample belongs to that outbreak.
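The conversion from a single outbreak column to three binary columns can be sketched with `awk` as follows. The input table and its sample IDs (`sample_A` to `sample_D`) are hypothetical; only the output header mirrors the column names used in `outbreak_metadata.csv`.

```shell
# Hypothetical input: sample ID and outbreak number in a single column
labels=$(mktemp)
printf 'sample_A,1\nsample_B,2\nsample_C,3\nsample_D,2\n' > "$labels"

# One column per outbreak: 1 if the sample belongs to it, 0 otherwise
phenotype_matrix=$(awk -F',' -v OFS=',' '
    BEGIN { print "", "Outbreak_1", "Outbreak_2", "Outbreak_3" }
    { print $1, ($2 == 1), ($2 == 2), ($2 == 3) }
' "$labels")
echo "$phenotype_matrix"
rm -f "$labels"
```

Each row of the result contains exactly one `1`, because every sample belongs to exactly one outbreak.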

The phenotype matrix has already been properly formatted and is available at `/opt/share/gwas/outbreak_metadata.csv`. Use the command `head` to print the first few lines of the matrix.

In [5]:
# use -n option to print a specific number of lines (by default head prints the first 10 lines)
head -n 5 /opt/share/gwas/outbreak_metadata.csv

,Outbreak_1,Outbreak_2,Outbreak_3
SH12-001,1,0,0
SH12-002,1,0,0
SH12-003,1,0,0
SH12-004,1,0,0


To assess the statistical significance of each association, you will run 1000 permutations of the phenotypic labels to construct a null distribution of test statistics and calculate an empirical p-value (the probability of obtaining a test statistic at least as extreme as the observed one under the null hypothesis). Lastly, the Benjamini-Hochberg method will be used to correct the p-values for multiple testing by controlling the false discovery rate.
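The Benjamini-Hochberg adjustment itself is simple enough to sketch by hand. The example below uses four hypothetical raw p-values (illustrative only, not results from this dataset): each sorted p-value is multiplied by the number of tests and divided by its rank, and a running minimum from the largest rank downward keeps the adjusted values monotonic. This is a sketch of the general method and is not taken from `Scoary`'s source code.

```shell
# Hypothetical raw p-values, one per gene (illustrative only)
bh_adjusted=$(printf '0.01\n0.04\n0.03\n0.20\n' | LC_ALL=C sort -n | LC_ALL=C awk '
    { p[NR] = $1 }                      # p-values arrive sorted in ascending order
    END {
        m = NR                          # number of tests
        running_min = 1
        for (i = m; i >= 1; i--) {      # walk from the largest p-value down
            adj = p[i] * m / i          # raw BH adjustment: p * m / rank
            if (adj < running_min) running_min = adj
            q[i] = running_min          # enforce monotonic adjusted values
        }
        for (i = 1; i <= m; i++) printf "%.2f\t%.4f\n", p[i], q[i]
    }')
echo "$bh_adjusted"
```

The first column is the raw p-value and the second is the adjusted value. Here the gene with a raw p-value of 0.03 ends up with an adjusted value of about 0.053, so it would no longer pass a 0.05 cutoff after correction.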

A typical `Scoary` command looks like:
```
scoary [options] -t [path to phenotype matrix] -g [path to genotype matrix]
```

`Scoary` options explained:
* `--threads` specifies the number of compute cores to use for process parallelization
* `--no-time` prevents a timestamp from being appended to the output file names
* `-p` specifies a p-value cutoff to filter the final results
* `-o` specifies the path to store the outputs
* `-c` specifies the method to correct p-values for multiple testing
* `--permute` specifies the number of rounds of permutations to run

In [None]:
scoary --threads 8 --no-time -o analysis/scoary \
       -p 0.05 -c BH --permute 1000 \
       -t outbreak_metadata.csv \
       -g analysis/roary/gene_presence_absence.csv

## Questions

1. Review the `Scoary` outputs under `analysis/scoary` and do some research on the genes found to be significantly associated with each *Salmonella* outbreak. What do the significant genes have in common?

2. What is one major shortcoming of this analysis? What might you do to address the problem?

## Closing Remarks

Congratulations, you have reached the end of the assignment! Note that this assignment has presented only one of the many methods for conducting microbial GWAS. Numerous alternatives exist, such as statistical modeling and deep learning, each with its own advantages and disadvantages. For those interested, we leave you with the publication by [Lees et al. (2020)](https://journals.asm.org/doi/full/10.1128/mBio.01344-20), which compares and contrasts different strategies for identifying genetic associations in microbial organisms.

## References

1. Bekal S, Berry C, Reimer AR, Van Domselaar G, Beaudry G, Fournier E, et al. Usefulness of High-Quality Core Genome Single-Nucleotide Variant Analysis for Subtyping the Highly Clonal and the Most Prevalent Salmonella enterica Serovar Heidelberg Clone in the Context of Outbreak Investigations. J Clin Microbiol. 2016 Feb;54(2):289–95.
2. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015 Nov 15;31(22):3691–3.
3. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068–9.
4. Brynildsrud O, Bohlin J, Scheffer L, Eldholm V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 2016 Nov 25;17(1):238.
5. Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, et al. Improved prediction of bacterial genotype-phenotype associations using interpretable pangenome-spanning regressions. mBio [Internet]. 2020 Jul 7;11(4). Available from: https://journals.asm.org/doi/10.1128/mBio.01344-20