Merge pull request #94 from phac-nml/development

Major upgrade
phac-nml · May 16, 2019 · 4db666b · 4db666b
2 parents c7ff83d + 395c833
commit 4db666b
Show file tree

Hide file tree

Showing 35 changed files with 136,288 additions and 1,439 deletions.
diff --git a/README.rst b/README.rst
@@ -1,6 +1,6 @@
 |logo|
 
-|conda| |nbsp| |pypi| |nbsp| |license| |nbsp| |nbsp| Master:|citest-master| |nbsp| Development:|citest-dev|
+|conda| |nbsp| |pypi| |nbsp| |license| |nbsp| |nbsp| Master:|citest-master| |nbsp| Development:|citest-dev| |rtd|
 
 
 
@@ -18,6 +18,8 @@
     :target: https://bioconda.github.io/recipes/bio_hansel/README.html
 .. |nbsp| unicode:: 0xA0
     :trim:
+.. |rtd| image:: https://readthedocs.org/projects/pip/badge/?version=latest&style=flat
+    :target: https://bio-hansel.readthedocs.io/en/readthedocs/
 
 Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.
 
@@ -34,10 +36,16 @@ If you find the ``biohansel`` tool useful, please cite as:
 .. epigraph::
 
     A robust genotyping scheme for *Salmonella enterica* serovar Heidelberg clones circulating in North America.
-    Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
+    Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
     [Manuscript in preparation]
 
 
+Read_The_Docs
+==============
+
+More in-depth information on running and installing biohansel can be found on the `biohansel readthedocs page <https://bio-hansel.readthedocs.io/en/readthedocs/>`_.
+
+
 Requirements and Dependencies
 =============================
 
@@ -92,7 +100,7 @@ Install into Galaxy_ (version >= 17.01)
 
 Install ``biohansel`` from the main Galaxy_ toolshed:
 
-https://toolshed.g2.bx.psu.edu/repository?repository_id=59b90ef18cc5dbbc&changeset_revision=4654c51dae72
+https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6
 
 
 Usage
@@ -105,20 +113,21 @@ If you run ``hansel -h``, you should see the following usage statement:
     usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME]
                   [-p forward_reads reverse_reads] [-i fasta_path genome_name]
                   [-D INPUT_DIRECTORY] [-o OUTPUT_SUMMARY]
-                  [-O OUTPUT_TILE_RESULTS] [-S OUTPUT_SIMPLE_SUMMARY] [--force]
+                  [-O OUTPUT_KMER_RESULTS] [-S OUTPUT_SIMPLE_SUMMARY] [--force]
                   [--json] [--min-kmer-freq MIN_KMER_FREQ]
                   [--max-kmer-freq MAX_KMER_FREQ]
                   [--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
-                  [--max-missing-tiles MAX_MISSING_TILES]
-                  [--min-ambiguous-tiles MIN_AMBIGUOUS_TILES]
+                  [--max-missing-kmers MAX_MISSING_KMERS]
+                  [--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
                   [--low-cov-warning LOW_COV_WARNING]
-                  [--max-intermediate-tiles MAX_INTERMEDIATE_TILES] [-t THREADS]
+                  [--max-intermediate-kmers MAX_INTERMEDIATE_KMERS] 
+                  [--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS]
                   [-v] [-V]
                   [F [F ...]]
 
     Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
     Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg and Enteritidis subtyping.
-    Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
+    Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
 
     positional arguments:
       F                     Input genome FASTA/FASTQ files (can be Gzipped)
@@ -142,8 +151,8 @@ If you run ``hansel -h``, you should see the following usage statement:
                             paired) (files can be Gzipped)
       -o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY
                             Subtyping summary output path (tab-delimited)
-      -O OUTPUT_TILE_RESULTS, --output-tile-results OUTPUT_TILE_RESULTS
-                            Subtyping tile matching output path (tab-delimited)
+      -O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS
+                            Subtyping kmer matching output path (tab-delimited)
       -S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY
                             Subtyping simple summary output path
       --force               Force existing output files to be overwritten
@@ -155,17 +164,17 @@ If you run ``hansel -h``, you should see the following usage statement:
       --low-cov-depth-freq LOW_COV_DEPTH_FREQ
                             Frequencies below this coverage are considered low
                             coverage
-      --max-missing-tiles MAX_MISSING_TILES
-                            Decimal proportion of maximum allowable missing tiles
+      --max-missing-kmers MAX_MISSING_KMERS
+                            Decimal proportion of maximum allowable missing kmers
                             before being considered an error. (0.0 - 1.0)
-      --min-ambiguous-tiles MIN_AMBIGUOUS_TILES
-                            Minimum number of missing tiles to be considered an
+      --min-ambiguous-kmers MIN_AMBIGUOUS_KMERS
+                            Minimum number of missing kmers to be considered an
                             ambiguous result
       --low-cov-warning LOW_COV_WARNING
-                            Overall tile coverage below this value will trigger a
+                            Overall kmer coverage below this value will trigger a
                             low coverage warning
-      --max-intermediate-tiles MAX_INTERMEDIATE_TILES
-                            Decimal proportion of maximum allowable missing tiles
+      --max-intermediate-kmers MAX_INTERMEDIATE_KMERS
+                            Decimal proportion of maximum allowable missing kmers
                             to be considered an intermediate subtype. (0.0 - 1.0)
       -t THREADS, --threads THREADS
                             Number of parallel threads to run analysis (default=1)
@@ -191,15 +200,15 @@ Contents of ``results.tab``:
 
 .. code-block::
 
-    sample  scheme  subtype all_subtypes    tiles_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_tiles_matching_all    n_tiles_matching_all_total  n_tiles_matching_positive   n_tiles_matching_positive_total n_tiles_matching_subtype    n_tiles_matching_subtype_total  file_path
+    sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
     SRR1002850  heidelberg  2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4  1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4   True        202 202 17  17  3   3   SRR1002850.fasta
 
 
 Contents of ``match_results.tab``:
 
 .. code-block::
 
-    tilename    stitle  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    qlen    slen    seq coverage    is_trunc    refposition subtype is_pos_tile sample  file_path   scheme
+    kmername    stitle  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    qlen    slen    seq coverage    is_trunc    refposition subtype is_pos_kmer sample  file_path   scheme
     775920-2.2.2.2  NODE_2_length_512016_cov_46.4737_ID_3   100.0   33  0   0   1   33  474875  474907  2.0000000000000002e-11  62.1    33  512016  GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG   1.0 False   775920  2.2.2.2 True    SRR1002850  SRR1002850.fasta   heidelberg
     negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  276235  276267  2.0000000000000002e-11  62.1    33  427905  CATCGTGAAGCAGAACAGACGCGCATTCTTGCT   1.0 False   negative3305400 2.1.1.1 False   SRR1002850  SRR1002850.fasta   heidelberg
     negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  170918  170950  2.0000000000000002e-11  62.1    33  427905  ACCCGGTCTACCGCAAAATGGAAAGCGATATGC   1.0 False   negative3200083 2.1 False   SRR1002850  SRR1002850.fasta   heidelberg
@@ -221,15 +230,15 @@ Contents of ``results.tab``:
 
 .. code-block::
 
-    sample  scheme  subtype all_subtypes    tiles_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_tiles_matching_all    n_tiles_matching_all_total  n_tiles_matching_positive   n_tiles_matching_positive_total n_tiles_matching_subtype    n_tiles_matching_subtype_total  file_path
+    sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
     SRR5646583  heidelberg  2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1  1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1    True        202 202 20  20  2   2   SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger
 
 
 Contents of ``match_results.tab``:
 
 .. code-block::
 
-    seq freq    sample  file_path   tilename    is_pos_tile subtype refposition is_kmer_freq_okay   scheme
+    seq freq    sample  file_path   kmername    is_pos_kmer subtype refposition is_kmer_freq_okay   scheme
     ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC   68  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    21097-2.2.1.1.1 True    2.2.1.1.1   21097   True    heidelberg
     AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT   77  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    157792-2.2.1.1.1    True    2.2.1.1.1   157792  True    heidelberg
     CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC   67  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    293728-2.2.1.1  True    2.2.1.1 293728  True    heidelberg

diff --git a/bio_hansel/__init__.py b/bio_hansel/__init__.py
@@ -1,9 +1,9 @@
 # -*- coding: utf-8 -*-
 
-__version__ = '2.1.1'
+__version__ = '2.2.0'
 program_name = 'bio_hansel'
 program_summary = 'Subtype microbial genomes using SNV targeting k-mer subtyping schemes.'
 program_desc = program_summary + '''
-Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg and Enteritidis subtyping.
-Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
+Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, and Typhi subtyping. Also includes a Mycobacterium Tuberculosis scheme. 
+Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
 '''
diff --git a/bio_hansel/aho_corasick/__init__.py b/bio_hansel/aho_corasick/__init__.py
@@ -1,12 +1,61 @@
 # -*- coding: utf-8 -*-
 
 from collections import defaultdict
+from itertools import product
+import logging
+import sys
 
 from ahocorasick import Automaton
 import pandas as pd
 
 from ..parsers import parse_fasta, parse_fastq
 from ..utils import revcomp
+from ..const import bases_dict
+from ..subtyping_params import SubtypingParams
+
+def expand_degenerate_bases(seq):
+    """List all possible kmers for a scheme given a degenerate base
+   
+    Args:
+         Scheme_kmers from SNV scheme fasta file
+
+    Returns:
+         List of all possible kmers given a degenerate base or not
+    """
+
+    return list(map("".join, product(*map(bases_dict.get, seq))))
+
+def check_total_kmers(scheme_fasta, subtyping_params):
+    '''Checks that the number of kmers about to be created is not at too high a computation or time cost
+
+    Args:
+         scheme_fasta: Kmer sequences from the SNV scheme
+         Subtyping parameter max_degenerate_kmers: The max kmers allowed by the scheme
+    
+    Returns:
+         None if created kmers < max degenerate kmers argument
+         Exits code with warning if created kmers > max degenerate kmers argument
+    '''
+    kmer_number = 0
+    for header, sequence in parse_fasta(scheme_fasta):
+        value = 1
+        for char in sequence:
+            length_key = len(bases_dict[char])
+            value = value * length_key
+        kmer_number = kmer_number + value
+    if kmer_number*2 > subtyping_params.max_degenerate_kmers:
+        return logging.error(
+            '''
+    Your current scheme contains "{}" kmers which is over the current max-degenerate-kmers check of "{}" (Maximum recommended k-mers is 100000).
+    It is not advised to run this scheme due to the time and memory usage required to give an output with this many kmers loaded.
+    If you still want to run this scheme, add the command line check of "--max-degenerate-kmers {}" at the end of your previous command.
+            '''.format(
+                kmer_number*2,
+                subtyping_params.max_degenerate_kmers,
+                kmer_number*2+1
+                )), sys.exit()
+    else:
+        return None
 
 
 def init_automaton(scheme_fasta):
@@ -20,8 +69,10 @@ def init_automaton(scheme_fasta):
     """
     A = Automaton()
     for header, sequence in parse_fasta(scheme_fasta):
-        A.add_word(sequence, (header, sequence, False))
-        A.add_word(revcomp(sequence), (header, sequence, True))
+        kmer_list = expand_degenerate_bases(sequence)
+        for seq in kmer_list:
+            A.add_word(seq, (header, seq, False))
+            A.add_word(revcomp(seq), (header, seq, True))
     A.make_automaton()
     return A
 
@@ -38,9 +89,9 @@ def find_in_fasta(A: Automaton, fasta: str) -> pd.DataFrame:
     """
     res = []
     for contig_header, sequence in parse_fasta(fasta):
-        for idx, (tilename, tile_seq, is_revcomp) in A.iter(sequence):
-            res.append((tilename, tile_seq, is_revcomp, contig_header, idx))
-    df = pd.DataFrame(res, columns=['tilename', 'seq', 'is_revcomp', 'contig_id', 'match_index'])
+        for idx, (kmername, kmer_seq, is_revcomp) in A.iter(sequence):
+            res.append((kmername, kmer_seq, is_revcomp, contig_header, idx))
+    df = pd.DataFrame(res, columns=['kmername', 'seq', 'is_revcomp', 'contig_id', 'match_index'])
     return df
 
 
@@ -54,14 +105,14 @@ def find_in_fastqs(A: Automaton, *fastqs):
     Returns:
         Dataframe with any matches found in input fastq files
     """
-    tile_seq_counts = defaultdict(int)
+    kmer_seq_counts = defaultdict(int)
     for fastq in fastqs:
         for _, sequence in parse_fastq(fastq):
-            for idx, (_, tile_seq, _) in A.iter(sequence):
-                tile_seq_counts[tile_seq] += 1
+            for idx, (_, kmer_seq, _) in A.iter(sequence):
+                kmer_seq_counts[kmer_seq] += 1
     res = []
-    for tile_seq, freq in tile_seq_counts.items():
-        tilename, sequence, _ = A.get(tile_seq)
-        res.append((tilename, tile_seq, freq))
-    df = pd.DataFrame(res, columns=['tilename', 'seq', 'freq'])
+    for kmer_seq, freq in kmer_seq_counts.items():
+        kmername, sequence, _ = A.get(kmer_seq)
+        res.append((kmername, kmer_seq, freq))
+    df = pd.DataFrame(res, columns=['kmername', 'seq', 'freq'])
     return df
diff --git a/bio_hansel/const.py b/bio_hansel/const.py
@@ -6,12 +6,40 @@
 from bio_hansel import program_name
 from bio_hansel.subtyping_params import SubtypingParams
 
-SCHEME_FASTAS = {'heidelberg': {'file': resource_filename(program_name, 'data/heidelberg/tiles.fasta'),
+SCHEME_FASTAS = {'heidelberg': {'file': resource_filename(program_name, 'data/heidelberg/kmers.fasta'),
                                 'version': '0.5.0',
                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
-                 'enteritidis': {'file': resource_filename(program_name, 'data/enteritidis/tiles.fasta'),
-                                 'version': '0.8.0',
-                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=50)}}
+                 'enteritidis': {'file': resource_filename(program_name, 'data/enteritidis/kmers.fasta'),
+                                 'version': '1.0.7',
+                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=50)},
+                 'typhi': {'file': resource_filename(program_name, 'data/typhi/kmers.fasta'),
+                                 'version': '1.1.0',
+                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
+                 'tb_speciation': {'file': resource_filename(program_name, 'data/m.tuberculosis/kmers.fasta'),
+                                 'version': '1.0.1;',
+                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
+                 'typhimurium': {'file': resource_filename(program_name, 'data/typhimurium/kmers.fasta'),
+                                 'version': '0.5.5;',
+                                 'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)}}
+
+
+bases_dict = {
+'A': ['A'],
+'C': ['C'],
+'G': ['G'],
+'T': ['T'],
+'R': ['A', 'G'],
+'Y': ['C', 'T'],
+'S': ['G', 'C'],
+'W': ['A', 'T'],
+'K': ['G', 'T'],
+'M': ['A', 'C'],
+'B': ['C', 'G', 'T'],
+'D': ['A', 'G', 'T'],
+'H': ['A', 'C', 'T'],
+'V': ['A', 'C', 'G'],
+'N': ['A', 'C', 'G', 'T'],}
+
 
 COLUMNS_TO_REMOVE = '''
 pident
@@ -38,17 +66,17 @@
 scheme_version
 subtype
 all_subtypes
-tiles_matching_subtype
+kmers_matching_subtype
 are_subtypes_consistent
 inconsistent_subtypes
-n_tiles_matching_all
-n_tiles_matching_all_expected
-n_tiles_matching_positive
-n_tiles_matching_positive_expected
-n_tiles_matching_subtype
-n_tiles_matching_subtype_expected
+n_kmers_matching_all
+n_kmers_matching_all_expected
+n_kmers_matching_positive
+n_kmers_matching_positive_expected
+n_kmers_matching_subtype
+n_kmers_matching_subtype_expected
 file_path
-avg_tile_coverage
+avg_kmer_coverage
 qc_status
 qc_message
 """.strip().split('\n')

diff --git a/bio_hansel/data/enteritidis/kmers.fasta b/bio_hansel/data/enteritidis/kmers.fasta