kmer-hmp

Overview of project

Goal:

Develop method for applying Harmonic mean p-value (paper) to a multiple sequence alignment (MSA)

Steps:

Choose* set of genomes from GWAS to study. Select files for:
- each of the twelve genomes (.fa files)
- list of k-mers from the analysis of all 992 genomes
- list of associated p-values for these k-mers
Reorder the contigs on a genome-by-genome basis with Mauve with respect to the MSSA476 reference genome.
Use Mauve to align the twelve genomes into an MSA.
K-merize (into 31-mers) the MSA and identify the p-value corresponding to each kmer in the MSA.
Compute HMPs for overlapping sliding windows at different scales, e.g. 10bp, 100bp, 1kb, 10kb, 100kb, 1Mb. Suggestion: stagger each sliding window by about 50% of its length.
- HMP generated from weighted Harmonic Mean of the p-values of the k-mers present in each window.

^{* Genomes chosen by script set_cover.py which selects the smallest subset of genomes containing > 80% of distinct k-mers present in the complete set (992 genomes).}

Data

Two studies considered:

twelve S. aureus genomes from previous analysis (main example of paper) conducted into fusidic acid resistance (FUC).
200 E. coli genomes from previous study conducted into cefepime resistance (FEP).

Below are the files (and expected relative file locations) required by the python scripts:

Genomes

genomes/reference_genome/Record_49484912.fasta
reference genome (downloaded from NCBI Entrez)
genomes/draft_genomes/C00000814_contigs.fa
genomes/draft_genomes/C00000830_contigs.fa etc
draft genomes to create MSA

Static Files (k-mer, p-value data)

static_files/fusidic_acid_kmers.txt
list of k-mers in draft genomes
static_files/fusidic_acid_pvals.txt
list of p-values associated with each k-mer
static_files/fusidic_acid_patternIndex.txt
1-many map of k-mer row numbers to 'pattern' of presence/absence row number in genomes
static_files/fusidic_acid_patternKey.txt
list of patterns of presence/absence of k-mer pattern row in each genome column

Scripts

`build_msa_mauve.py`

Reorder contigs for multiple draft files against a single reference file using Mauve on Windows.
Align reordered sequences with reference file using Mauve.

`build_kmer_pval_dict.py`

Create dictionary of {k-mer: p-value} from static files

`main.py`

Create a DataFrame storing the k-mer and associated p-value info for each position of each sequence in the alignment.
Calculate the Harmonic Mean p-value for each sliding window across the length of the sequences (for multiple window sizes)
Plot the k-mer and HMP p-values 'Manhattan-plot' style

Overview of "sliding window" logic

Each k-mer in the sequences has an associated p-value from the GWAS performed on the sequences (testing for correlation with phenotype).

This program creates a window of defined length, and travelling laterally across the aligned sequences, takes the Harmonic Mean of the p-values of the k-mers in said window.

Example:

2 alignments (first contianing 3 sequences, second containing 2)
window size = 4
k = 3

The 3-mers included in each window in the animation are:

ACG, ACG, ACG, CGT, CGA, CGG
CGT, CGA, CGG, GTG, GAG, GGC
GTG, GAG, GGC, TGC, AGC, GCC
TGC, AGC, GCC, GCA, GCA, CCA
GCA, GCA, CCA
TGT, TGT
TGT, TGT, GTA, GTA

Thus, in e.g. window 1 the HMP of the 6 p-values corresponding to the 6 k-mers it includes will be calculated and assigned position 1.

Handling of gap characters

If there are gap characters in a sequence, e.g.:

ACA-CACGTG

the program will ignore the gap characters, thus the k-mers returned for this window (of size 10) for this sequence would be:

ACACA
 CACAC
  ACACG
  
   CACGT
    ACGTG

Note that no value is associated to position 4.

What happens when a window overlaps multiple alignments?

If there are multiple alignments ordered sequentially, some sliding windows will overlap with the discontinuity between alignments.

     [...  ...]
[ACACGTGT][CGAT]

In these instances k-mers are taken from each alignment intersection only.

i.e. for k=2, window_size=6 in the above example:

      TG   CG
       GT   GA

K-mers 'across' neighbouring alignments are not considered, as these may not correspond to real k-mers present in the genome (since the ordering of the alignment may not be reflected in reality). E.g. in this instance, the (possibly fictional) inter-alignment k-mer TC is not recognised.

What happens when the overlapped alignments have different depths (numbers of sequences in them)?

Some neighbouring alignments may have different numbers of aligned sequences in them. e.g. 3 and 2:

 [...  ...]
[ACTG][GTGTA]
[CCTG][GTGGA]
[TCTG]

In these instances, if a window overlaps an alignment boundary, k-mers are taken from all aligned sequences. e.g. k=3, window_size=6:

 CTG  GTG
 CTG  GTG
 CTG

Mauve instructions

Reordering contigs from the command-line (batch mode)

In situations where it is necessary to order contigs in a large number of draft genomes it is often more desirable to automate the process using command-line interfaces and scripts. Mauve Contig Mover supports command-line operation through the Mauve Java JAR file.

Given a reference genome file called reference.gbk and a draft genome called draft.fasta, one would invoke the reorder program with the following syntax:
 java -Xmx500m -cp Mauve.jar org.gel.mauve.contigs.ContigOrderer -output results_dir -ref reference.gbk -draft draft.fasta
The file Mauve.jar is part of the Mauve distribution. On windows systems it can usually be found in C:\Program Files\Mauve X\Mauve.jar where X is the version of Mauve. On Mac OS X it is located inside the Mauve application. For example, if Mauve has been placed in the OS X applications folder, Mauve.jar can be found at /Applications/Mauve.app/Contents/Resources/Java/Mauve.jar. On Linux, Mauve.jar is simply at the top level of the tar.gz archive. In the above example command, it will be necessary to specify the full path to the Mauve.jar file.

Tutorial

Install Mauve Multiple Genome Alignment

Edit script main.py changing the values:

 if __name__ == '__main__':
     mauve_dir = Path('C:/Program Files (x86)/Mauve 20150226') # Path where Mauve is installed
     base_path = Path('C:/Users/User/project') # Path where your data is saved
     reference = base_path / 'genomes/reference_genome/reference_genome.fasta' # Change to `None` if you wish to automatically download genome from NCBI
     drafts_dir = base_path / 'genomes/draft_genomes' # Path where draft genomes are saved
     kmers = base_path / 'static_files/list_of_kmers.txt' # List of k-mers present in genomes
     pvals = base_path / 'static_files/kmers_pvals.txt' # List of p-values associated with above k-mers

If plotting Manhattan Plot, must ensure plotly lbrary initially configured with account.
Run main.py

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
data		data
examples/fep		examples/fep
images		images
set-cover		set-cover
snp		snp
README.md		README.md
build_kmer_pval_dict.py		build_kmer_pval_dict.py
build_msa_mauve.py		build_msa_mauve.py
main.py		main.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kmer-hmp

Overview of project

Goal:

Steps:

Data

Genomes

Static Files (k-mer, p-value data)

Scripts

`build_msa_mauve.py`

`build_kmer_pval_dict.py`

`main.py`

Overview of "sliding window" logic

Handling of gap characters

What happens when a window overlaps multiple alignments?

What happens when the overlapped alignments have different depths (numbers of sequences in them)?

Mauve instructions

Reordering contigs from the command-line (batch mode)

Tutorial

About

Releases

Packages

Languages

iacobo/kmer-hmp

Folders and files

Latest commit

History

Repository files navigation

kmer-hmp

Overview of project

Goal:

Steps:

Data

Genomes

Static Files (k-mer, p-value data)

Scripts

build_msa_mauve.py

build_kmer_pval_dict.py

main.py

Overview of "sliding window" logic

Handling of gap characters

What happens when a window overlaps multiple alignments?

What happens when the overlapped alignments have different depths (numbers of sequences in them)?

Mauve instructions

Reordering contigs from the command-line (batch mode)

Tutorial

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`build_msa_mauve.py`

`build_kmer_pval_dict.py`

`main.py`

Packages