GitHub - ukhsa-collaboration/kmerid

Table of contents
-----------------
  * Prerequisites
  * Installation
  * Set up reference genome groups
  * Running KmerID
  * KmerID output
  * Examples

Prerequisites
-------------
Python version >= 2.6.6 (not Python 3)
The following Python libraries are required:
sys, argparse, subprocess, os, operator, glob, ConfigParser, tempfile, scipy
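
The list above can be checked before installation with a few lines of
Python. This is only a minimal sketch and not part of KmerID; the function
name missing_modules is made up for illustration. Because ConfigParser is
the Python 2 module name, the check is expected to report it as missing
when run under Python 3.

```python
# Modules listed in the Prerequisites section. ConfigParser is the Python 2
# name of that module, so run this check with the Python 2 interpreter
# that will also run KmerID.
REQUIRED_MODULES = ["sys", "argparse", "subprocess", "os", "operator",
                    "glob", "ConfigParser", "tempfile", "scipy"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules found.")
```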

Installation
------------

1.) Unpack the tar.gz archive that you downloaded into a folder of your
    choosing. This folder will be called $KMERROOT in the following.
    
2.) Navigate to $KMERROOT. You should see, among others, subfolders
    src, config and bin and two Python scripts.

3.) Compile the auxiliary executables by typing:

        make all
    
    This should create the files intersect_kmer_lists_filelist,
    kmer_jaccard_index, kmer_reads_process_stdin, and kmer_refset_process
    in the bin folder.
    
You still need to prepare your reference genome sets before you can
run the software.
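
To confirm that the compile step produced all four executables named in
step 3, a small check like the following can be used. It is a sketch only;
the helper name missing_binaries is hypothetical and not part of KmerID.

```python
import os

# Executable names taken from the installation step above.
EXPECTED_BINARIES = [
    "intersect_kmer_lists_filelist",
    "kmer_jaccard_index",
    "kmer_reads_process_stdin",
    "kmer_refset_process",
]


def missing_binaries(bin_dir):
    """Return expected executables that are absent or not executable in bin_dir."""
    missing = []
    for name in EXPECTED_BINARIES:
        path = os.path.join(bin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

Calling missing_binaries("bin") from within $KMERROOT should return an empty
list after a successful build.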

Set up reference genome groups
------------------------------

KmerID will compare your fastq reads against one or more sets of reference
genomes. We have been using it in such a way that each set of reference genomes
contains only genomes of species from a single bacterial genus.

KmerID will first compare your fastq reads against 3 representatives of each
group. On the basis of this it will determine against which group(s) to
further compare your reads. There is no limit on the number of groups or on the
number of genomes in each group, but the run time increases with both.

Each group of reference genomes needs to be set up using the setup_refs.py utility:

    usage: setup_refs.py [-h] -f FILE -n FILE -c FILE

    version 0.1, date 12Feb2014, author ulf.schaefer@phe.gov.uk

    optional arguments:
      -h, --help            show this help message and exit
      -f FILE, --folder FILE
                            REQUIRED: Folder that contains a set of fasta files
                            with one reference genome each.
      -n FILE, --name FILE  REQUIRED: Unique name for this set of references. Case
                            insensitive. [e.g. salmonella]
      -c FILE, --config FILE
                            REQUIRED: Configuration file. Usually
                            config/config.cnf.
    
    e.g. 
        
        setup_refs.py -n salmonella -f ref/genus01 -c config/config.cnf

Please note: The files in the reference folder need to be in fasta format and have one
of these file endings: .fa, .fna, .fas, .fasta. They can optionally be gzipped (adding .gz
at the end of the file name).
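
The naming rule above can be checked before running setup_refs.py. The
sketch below is not taken from KmerID; the function name is hypothetical
and case-sensitive matching is an assumption.

```python
# File endings accepted for reference genomes, as listed in the note above.
FASTA_ENDINGS = (".fa", ".fna", ".fas", ".fasta")


def is_reference_fasta(filename):
    """True if filename has an accepted fasta ending, optionally followed by .gz."""
    name = filename
    if name.endswith(".gz"):
        name = name[:-3]  # drop the optional gzip suffix
    return name.endswith(FASTA_ENDINGS)
```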

This step includes a hierarchical clustering of all genomes in the group, which
requires an all-by-all similarity matrix (stored under $KMERROOT/config/) for the
genomes in that group. It will therefore take a significant amount of time for
larger groups.

Note: It is recommended to keep the number of genomes per group under 15. If larger
      groups are required, the all-by-all similarity matrix for this group should
      be computed in a parallel manner on an HPC infrastructure.
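
To see why larger groups are slow: an all-by-all comparison of n genomes
needs n*(n-1)/2 pairwise comparisons, i.e. 105 for a group of 15. The
sketch below illustrates this with a Jaccard index over k-mer sets. This is
an assumption suggested by the name of the kmer_jaccard_index executable;
it is not KmerID's actual implementation, and the function names and the
choice of k are purely illustrative.

```python
from itertools import combinations


def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return set(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return float(len(a & b)) / len(union)


def all_by_all(genomes, k):
    """Return {(name1, name2): similarity} for every unordered pair of genomes.

    genomes: dict mapping a genome name to its sequence string.
    """
    sets = dict((name, kmers(seq, k)) for name, seq in genomes.items())
    return dict(((n1, n2), jaccard(sets[n1], sets[n2]))
                for n1, n2 in combinations(sorted(sets), 2))
```

Each pair is independent of the others, which is what makes the matrix
straightforward to distribute across several processes or cluster nodes.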
      
Example groups of genomes for a variety of genera of pathogenic bacteria are part of this
download.

After each group has been set up, a config file in the config subfolder is updated. This file is
a required input to the main programme.
      
Running KmerID
--------------

After setting up your reference groups, run KmerID like this:

    usage: kmerid.py [-h] -f FILE -c FILE [-n]

    version 0.1, date 12Feb2014, author ulf.schaefer@phe.gov.uk

    optional arguments:
      -h, --help            show this help message and exit
      -f FILE, --fastq FILE
                            REQUIRED: Investigate this fastq file.
      -c FILE, --config FILE
                            REQUIRED: Configuration file. Usually
                            config/config.cnf.
      -n, --nomix           Do not investigate sample for mixing. [default:
                            Investigate. (Takes about 2 minutes.)]
      
    e.g.
    
        python kmerid.py -f reads.fastq --config=config/config.cnf        
      
KmerID output
-------------

KmerID writes a table to stdout containing three columns: the similarity of the reads
to the reference genome, the groups the genome is in, and the file name of the genome
(without fasta file ending). The table contains one line per genome in the groups that
are candidates. If a candidate group contains more than 40 genomes, 40 representatives
from the group are chosen and shown. Candidate groups are selected by comparing 3
representatives of each group against the read set and selecting all groups occurring
within the top 5 hits.
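
The candidate selection described above can be sketched as follows. This is
an illustration of the rule as stated in the text, not KmerID's own code;
the function name and the similarity values are made up.

```python
def select_candidate_groups(representative_hits, top_n=5):
    """Return the groups that occur among the top_n most similar representatives.

    representative_hits: list of (similarity, group) tuples, one per
    representative genome (3 per group according to the text above).
    """
    ranked = sorted(representative_hits, key=lambda hit: hit[0], reverse=True)
    candidates = []
    for _similarity, group in ranked[:top_n]:
        if group not in candidates:
            candidates.append(group)
    return candidates


# Hypothetical similarities for three groups with three representatives each.
hits = [
    (90.1, "pseudomonas"), (5.2, "pseudomonas"), (4.0, "pseudomonas"),
    (0.9, "mycobacterium"), (0.8, "mycobacterium"), (0.7, "mycobacterium"),
    (0.1, "legionella"), (0.05, "legionella"), (0.01, "legionella"),
]
print(select_candidate_groups(hits))  # prints ['pseudomonas', 'mycobacterium']
```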

Optional (skipped when -n/--nomix is given):

The KmerID tool will write a second table to stdout. This table indicates
whether the reads in the sample may stem from two different organisms.
It contains the following four columns:

1.) absolute difference between two similarities:
    a) similarity between the reads and respective genome
    b) similarity between the top hit genome and the respective genome
2.) the signed (non-absolute) difference between the similarities (a-b)
3.) the groups the genome is in
4.) the file name of the genome (without fasta file ending)
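
The first two columns can be derived from the two similarities a) and b) as
in the sketch below. The function name is hypothetical, and sorting by
descending absolute difference is an assumption based on the example output
rather than a documented behaviour.

```python
def mixing_table(rows):
    """Build mixing-table rows from (a, b, group, file) tuples.

    a: similarity between the reads and the genome
    b: similarity between the top hit genome and the genome
    Returns (abs(a - b), a - b, group, file) tuples, largest absolute
    difference first.
    """
    table = []
    for a, b, group, filename in rows:
        diff = a - b
        table.append((abs(diff), diff, group, filename))
    table.sort(key=lambda row: row[0], reverse=True)
    return table
```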

Examples
--------

a) Finding the closest genome in a single group:

    * Make sure the reference genomes from the genus Legionella are present in the folder
      $KMERROOT/ref/Legionella
    * Prepare these reference genomes for KmerID by calling:
      python setup_refs.py -n legionella -f $KMERROOT/ref/Legionella/ -c config/config.cnf
    * Process the sample data by calling:
      python kmerid.py -f samples/Legionella_pneumophila_example_reads.fastq.gz -c config/config.cnf -n
      
      Output:
      -------
      #Kmer based similarities
      #similarity	groups	file
      98.377968	legionella	Legionella_pneumophila_str_Paris.fa
      71.318146	legionella	Legionella_pneumophila_subsp_pneumophila_NC_018140.fa
      66.087273	legionella	Legionella_pneumophila_subsp_pneumophila_str_Hextuple_3a.fa
      66.087234	legionella	Legionella_pneumophila_subsp_pneumophila_str_Hextuple_2q.fa
      64.594894	legionella	Legionella_pneumophila_subsp_pneumophila_NC_018139.fa
      63.234699	legionella	Legionella_pneumophila_str_Corby.fa
      62.958763	legionella	Legionella_pneumophila_2300_99_Alcoy.fa
      62.147953	legionella	Legionella_pneumophila_subsp_pneumophila_str_Philadelphia_1.fa
      61.927429	legionella	Legionella_pneumophila_subsp_pneumophila_ATCC_43290.fa
      59.857174	legionella	Legionella_pneumophila_subsp_pneumophila_str_Thunder_Bay.fa
      56.115902	legionella	Legionella_pneumophila_str_Lens.fa
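
Output like the above is tab-separated and can be read back into Python for
further processing. The parser below is a minimal sketch, not part of
KmerID; it skips comment lines and ignores any line that does not have
exactly three fields.

```python
def parse_similarity_table(text):
    """Parse KmerID's similarity table into (similarity, group, file) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header and comment lines
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # e.g. rows of the four-column mixing table
        rows.append((float(fields[0]), fields[1], fields[2]))
    return rows


example = (
    "#Kmer based similarities\n"
    "#similarity\tgroups\tfile\n"
    "98.377968\tlegionella\tLegionella_pneumophila_str_Paris.fa\n"
    "71.318146\tlegionella\tLegionella_pneumophila_subsp_pneumophila_NC_018140.fa\n"
)
rows = parse_similarity_table(example)
best = max(rows, key=lambda row: row[0])
print(best[2])  # prints Legionella_pneumophila_str_Paris.fa
```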


b) Identifying an unknown sample against multiple groups:

    * Make sure all reference genomes are present in subfolders of $KMERROOT/ref
    * Prepare all reference groups for KmerID by calling:
      for s in `ls $KMERROOT/ref`;do python setup_refs.py -f $KMERROOT/ref/$s -c config/config.cnf -n $s; done
    * Process the sample data by calling:
      python kmerid.py -f $KMERROOT/samples/unknown_pathogen_example_reads.fastq.gz -c $KMERROOT/config/config.cnf -n
      
      Output:
      -------
      #Kmer based similarities
      #similarity	groups	file
      90.137718	pseudomonas	Pseudomonas_aeruginosa_PAO1_uid57945.fa
      5.215832	pseudomonas	Pseudomonas_resinovorans_NBRC_106553_uid208671.fa
      4.093119	pseudomonas	Pseudomonas_fulva_12_X_uid67351.fa
      3.923861	pseudomonas	Pseudomonas_stutzeri_DSM_10701_uid170940.fa
      3.875630	pseudomonas	Pseudomonas_stutzeri_DSM_4166_uid162113.fa
      3.717735	pseudomonas	Pseudomonas_mendocina_NK_01_uid66299.fa
      3.383106	pseudomonas	Pseudomonas_fluorescens_CHA0_uid203393.fa
      2.662693	pseudomonas	Pseudomonas_putida_F1_uid58355.fa
      2.155928	pseudomonas	Pseudomonas_fluorescens_SBW25_uid158693.fa
      1.383858	pseudomonas	Pseudomonas_syringae_phaseolicola_1448A_uid58099.fa
      0.938764	mycobacterium	Mycobacterium_JDM601_uid67369.fa
      0.874562	mycobacterium	Mycobacterium_intracellulare_ATCC_13950_uid167994.fa
      0.844299	mycobacterium	Mycobacterium_MCS_uid58465.fa
      0.749032	mycobacterium	Mycobacterium_smegmatis_MC2_155_uid57701.fa
      0.567610	mycobacterium	Mycobacterium_tuberculosis_H37Rv_uid170532.fa
      0.561620	mycobacterium	Mycobacterium_liflandii_128FXT_uid59005.fa
      0.550582	mycobacterium	Mycobacterium_rhodesiae_NBB3_uid75107.fa
      0.539385	mycobacterium	Mycobacterium_smegmatis_JS623_uid184820.fa
      0.451246	mycobacterium	Mycobacterium_massiliense_GO_06_uid170732.fa
      0.192369	mycobacterium	Mycobacterium_leprae_Br4923_uid59293.fa
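
The shell loop used in this example to set up all groups can also be
written in Python. The sketch below only builds the command lines, using the
-f, -c and -n options of setup_refs.py shown earlier; the helper name is
hypothetical. As in the shell loop, each subfolder name is used as the
group name.

```python
import os


def build_setup_commands(ref_root, config_file):
    """Return one setup_refs.py argument list per subfolder of ref_root."""
    commands = []
    for name in sorted(os.listdir(ref_root)):
        folder = os.path.join(ref_root, name)
        if os.path.isdir(folder):  # ignore stray files next to the group folders
            commands.append(["python", "setup_refs.py",
                             "-f", folder, "-c", config_file, "-n", name])
    return commands
```

Each returned list can be passed to subprocess.check_call() from within
$KMERROOT to run the groups one after another.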
      
      
c) Investigating a contaminated sample:

    * Make sure references for all groups have been set up as in example b)
    * Process the sample by calling:
      python kmerid.py -f $KMERROOT/samples/contaminated_sample_example.fastq.gz -c $KMERROOT/config/config.cnf

      Output:
      -------
      #Kmer based similarities
      #similarity	groups	file
      84.077347	staphylococcus	Staphylococcus_aureus_uid193761.fa
      80.150398	enterococcus	Enterococcus_faecalis_OG1RF_uid54927.fa
      76.079369	enterococcus	Enterococcus_faecalis_Symbioflor_1_uid183342.fa
      72.809959	enterococcus	Enterococcus_7L76_uid197170.fa
      71.491592	enterococcus	Enterococcus_faecalis_D32_uid171261.fa
      70.721642	staphylococcus	Staphylococcus_aureus_71193_uid162141.fa
      67.780098	enterococcus	Enterococcus_faecalis_V583_uid57669.fa
      19.011084	staphylococcus	Staphylococcus_aureus_MSHR1132_uid89393.fa
      2.557667	staphylococcus	Staphylococcus_warneri_SG1_uid187059.fa
      2.151581	staphylococcus	Staphylococcus_haemolyticus_JCSC1435_uid62919.fa
      2.100819	staphylococcus	Staphylococcus_epidermidis_RP62A_uid57663.fa
      1.779690	staphylococcus	Staphylococcus_lugdunensis_N920143_uid162143.fa
      1.615321	staphylococcus	Staphylococcus_saprophyticus_ATCC_15305_uid58411.fa
      1.358824	staphylococcus	Staphylococcus_carnosus_TM300_uid59401.fa
      1.189253	staphylococcus	Staphylococcus_pseudintermedius_ED99_uid162109.fa
      0.951751	enterococcus	Enterococcus_hirae_ATCC_9790_uid70619.fa
      0.859111	enterococcus	Enterococcus_faecium_NRRL_B_2354_uid188477.fa
      0.832898	enterococcus	Enterococcus_faecium_DO_uid55353.fa
      0.792562	enterococcus	Enterococcus_faecium_Aus0085_uid214432.fa
      0.612133	enterococcus	Enterococcus_casseliflavus_EC20_uid55693.fa

      #Mixing analysis:
      #Top hit - Group: staphylococcus	File: Staphylococcus_aureus_uid193761.fa	Similarity: 84.077347%

      #Comparison of results:
      #sim diff absolute	sim(reads,thisfile)-sim(tophit,thisfile)	group	file
      79.904867	79.904867	enterococcus	Enterococcus_faecalis_OG1RF_uid54927.fa
      75.834908	75.834908	enterococcus	Enterococcus_faecalis_Symbioflor_1_uid183342.fa
      72.577836	72.577836	enterococcus	Enterococcus_7L76_uid197170.fa
      71.258053	71.258053	enterococcus	Enterococcus_faecalis_D32_uid171261.fa
      67.514926	67.514926	enterococcus	Enterococcus_faecalis_V583_uid57669.fa
      0.738434	0.738434	enterococcus	Enterococcus_hirae_ATCC_9790_uid70619.fa
      0.643232	0.643232	enterococcus	Enterococcus_faecium_DO_uid55353.fa
      0.62082	0.62082	enterococcus	Enterococcus_faecium_NRRL_B_2354_uid188477.fa
      0.582432	-0.582432	staphylococcus	Staphylococcus_epidermidis_RP62A_uid57663.fa
      0.578488	0.578488	enterococcus	Enterococcus_faecium_Aus0085_uid214432.fa
      0.481429	0.481429	enterococcus	Enterococcus_casseliflavus_EC20_uid55693.fa
      0.409216	-0.409216	staphylococcus	Staphylococcus_aureus_MSHR1132_uid89393.fa
      0.302525	-0.302525	staphylococcus	Staphylococcus_haemolyticus_JCSC1435_uid62919.fa
      0.28907	-0.28907	staphylococcus	Staphylococcus_aureus_71193_uid162141.fa
      0.149772	0.149772	staphylococcus	Staphylococcus_lugdunensis_N920143_uid162143.fa
      0.132699	0.132699	staphylococcus	Staphylococcus_carnosus_TM300_uid59401.fa
      0.089541	0.089541	staphylococcus	Staphylococcus_pseudintermedius_ED99_uid162109.fa
      0.016231	-0.016231	staphylococcus	Staphylococcus_warneri_SG1_uid187059.fa
      0.011532	0.011532	staphylococcus	Staphylococcus_saprophyticus_ATCC_15305_uid58411.fa
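
One possible way to read the mixing table is sketched below. It assumes
(this is an interpretation, not something stated by the authors) that a
genome from a group other than the top hit's group with a large positive
difference points to a second organism in the sample. The function name and
the threshold are hypothetical and chosen only for illustration.

```python
def groups_suggesting_mix(mixing_rows, top_hit_group, threshold):
    """Return groups other than top_hit_group with a signed difference above threshold.

    mixing_rows: (abs_diff, signed_diff, group, file) tuples, i.e. the four
    columns of the mixing table.
    """
    flagged = []
    for _abs_diff, signed_diff, group, _filename in mixing_rows:
        if (group != top_hit_group and signed_diff > threshold
                and group not in flagged):
            flagged.append(group)
    return flagged


# Three rows copied from the output above; the threshold of 50 is an assumption.
rows = [
    (79.904867, 79.904867, "enterococcus", "Enterococcus_faecalis_OG1RF_uid54927.fa"),
    (0.738434, 0.738434, "enterococcus", "Enterococcus_hirae_ATCC_9790_uid70619.fa"),
    (0.582432, -0.582432, "staphylococcus", "Staphylococcus_epidermidis_RP62A_uid57663.fa"),
]
print(groups_suggesting_mix(rows, "staphylococcus", 50.0))  # prints ['enterococcus']
```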