- Develop method for applying Harmonic mean p-value (paper) to a multiple sequence alignment (MSA)
- Choose* set of genomes from GWAS to study. Select files for:
- each of the twelve genomes (
.fa
files) - list of k-mers from the analysis of all 992 genomes
- list of associated p-values for these k-mers
- each of the twelve genomes (
- Reorder the contigs on a genome-by-genome basis with Mauve with respect to the MSSA476 reference genome.
- Use Mauve to align the twelve genomes into an MSA.
- K-merize (into 31-mers) the MSA and identify the p-value corresponding to each kmer in the MSA.
- Compute HMPs for overlapping sliding windows at different scales, e.g. 10bp, 100bp, 1kb, 10kb, 100kb, 1Mb. Suggestion: stagger each sliding window by about 50% of its length.
- HMP generated from weighted Harmonic Mean of the p-values of the k-mers present in each window.
* Genomes chosen by script set_cover.py
which selects the smallest subset of genomes containing > 80% of distinct k-mers present in the complete set (992 genomes).
Two studies considered:
- twelve S. aureus genomes from previous analysis (main example of paper) conducted into fusidic acid resistance (FUC).
- 200 E. coli genomes from previous study conducted into cefepime resistance (FEP).
Below are the files (and expected relative file locations) required by the python scripts:
genomes/reference_genome/Record_49484912.fasta
reference genome (downloaded from NCBI Entrez)genomes/draft_genomes/C00000814_contigs.fa
genomes/draft_genomes/C00000830_contigs.fa
etc
draft genomes to create MSA
static_files/fusidic_acid_kmers.txt
list of k-mers in draft genomesstatic_files/fusidic_acid_pvals.txt
list of p-values associated with each k-merstatic_files/fusidic_acid_patternIndex.txt
1-many map of k-mer row numbers to 'pattern' of presence/absence row number in genomesstatic_files/fusidic_acid_patternKey.txt
list of patterns of presence/absence of k-mer pattern row in each genome column
- Reorder contigs for multiple draft files against a single reference file using Mauve on Windows.
- Align reordered sequences with reference file using Mauve.
- Create dictionary of
{k-mer: p-value}
from static files
- Create a DataFrame storing the k-mer and associated p-value info for each position of each sequence in the alignment.
- Calculate the Harmonic Mean p-value for each sliding window across the length of the sequences (for multiple window sizes)
- Plot the k-mer and HMP p-values 'Manhattan-plot' style
Each k-mer in the sequences has an associated p-value from the GWAS performed on the sequences (testing for correlation with phenotype).
This program creates a window of defined length, and travelling laterally across the aligned sequences, takes the Harmonic Mean of the p-values of the k-mers in said window.
Example:
- 2 alignments (first contianing 3 sequences, second containing 2)
- window size = 4
- k = 3
The 3-mers included in each window in the animation are:
- ACG, ACG, ACG, CGT, CGA, CGG
- CGT, CGA, CGG, GTG, GAG, GGC
- GTG, GAG, GGC, TGC, AGC, GCC
- TGC, AGC, GCC, GCA, GCA, CCA
- GCA, GCA, CCA
- TGT, TGT
- TGT, TGT, GTA, GTA
Thus, in e.g. window 1 the HMP of the 6 p-values corresponding to the 6 k-mers it includes will be calculated and assigned position 1.
If there are gap characters in a sequence, e.g.:
ACA-CACGTG
the program will ignore the gap characters, thus the k-mers returned for this window (of size 10) for this sequence would be:
ACACA
CACAC
ACACG
CACGT
ACGTG
Note that no value is associated to position 4.
If there are multiple alignments ordered sequentially, some sliding windows will overlap with the discontinuity between alignments.
[... ...]
[ACACGTGT][CGAT]
In these instances k-mers are taken from each alignment intersection only.
i.e. for k=2
, window_size=6
in the above example:
TG CG
GT GA
K-mers 'across' neighbouring alignments are not considered, as these may not correspond to real k-mers present in the genome (since the ordering of the alignment may not be reflected in reality). E.g. in this instance, the (possibly fictional) inter-alignment k-mer TC
is not recognised.
Some neighbouring alignments may have different numbers of aligned sequences in them. e.g. 3 and 2:
[... ...]
[ACTG][GTGTA]
[CCTG][GTGGA]
[TCTG]
In these instances, if a window overlaps an alignment boundary, k-mers are taken from all aligned sequences. e.g. k=3, window_size=6:
CTG GTG
CTG GTG
CTG
In situations where it is necessary to order contigs in a large number of draft genomes it is often more desirable to automate the process using command-line interfaces and scripts. Mauve Contig Mover supports command-line operation through the Mauve Java JAR file.
Given a reference genome file called
reference.gbk
and a draft genome calleddraft.fasta
, one would invoke the reorder program with the following syntax:java -Xmx500m -cp Mauve.jar org.gel.mauve.contigs.ContigOrderer -output results_dir -ref reference.gbk -draft draft.fasta
The file Mauve.jar is part of the Mauve distribution. On windows systems it can usually be found in
C:\Program Files\Mauve X\Mauve.jar
whereX
is the version of Mauve. On Mac OS X it is located inside the Mauve application. For example, if Mauve has been placed in the OS X applications folder, Mauve.jar can be found at/Applications/Mauve.app/Contents/Resources/Java/Mauve.jar
. On Linux,Mauve.jar
is simply at the top level of thetar.gz
archive. In the above example command, it will be necessary to specify the full path to theMauve.jar
file.
-
Install Mauve Multiple Genome Alignment
-
Edit script
main.py
changing the values:if __name__ == '__main__': mauve_dir = Path('C:/Program Files (x86)/Mauve 20150226') # Path where Mauve is installed base_path = Path('C:/Users/User/project') # Path where your data is saved reference = base_path / 'genomes/reference_genome/reference_genome.fasta' # Change to `None` if you wish to automatically download genome from NCBI drafts_dir = base_path / 'genomes/draft_genomes' # Path where draft genomes are saved kmers = base_path / 'static_files/list_of_kmers.txt' # List of k-mers present in genomes pvals = base_path / 'static_files/kmers_pvals.txt' # List of p-values associated with above k-mers
-
If plotting Manhattan Plot, must ensure plotly lbrary initially configured with account.
-
Run
main.py