Skip to content

Commit

Permalink
Merge pull request #94 from phac-nml/development
Browse files Browse the repository at this point in the history
Major upgrade
  • Loading branch information
Takadonet committed May 16, 2019
2 parents c7ff83d + 395c833 commit 4db666b
Show file tree
Hide file tree
Showing 35 changed files with 136,288 additions and 1,439 deletions.
51 changes: 30 additions & 21 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
|logo|

|conda| |nbsp| |pypi| |nbsp| |license| |nbsp| |nbsp| Master:|citest-master| |nbsp| Development:|citest-dev|
|conda| |nbsp| |pypi| |nbsp| |license| |nbsp| |nbsp| Master:|citest-master| |nbsp| Development:|citest-dev| |rtd|



Expand All @@ -18,6 +18,8 @@
:target: https://bioconda.github.io/recipes/bio_hansel/README.html
.. |nbsp| unicode:: 0xA0
:trim:
.. |rtd| image:: https://readthedocs.org/projects/pip/badge/?version=latest&style=flat
:target: https://bio-hansel.readthedocs.io/en/readthedocs/

Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.

Expand All @@ -34,10 +36,16 @@ If you find the ``biohansel`` tool useful, please cite as:
.. epigraph::

A robust genotyping scheme for *Salmonella enterica* serovar Heidelberg clones circulating in North America.
Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
[Manuscript in preparation]


Read_The_Docs
==============

More in-depth information on running and installing biohansel can be found on the `biohansel readthedocs page <https://bio-hansel.readthedocs.io/en/readthedocs/>`_.


Requirements and Dependencies
=============================

Expand Down Expand Up @@ -92,7 +100,7 @@ Install into Galaxy_ (version >= 17.01)

Install ``biohansel`` from the main Galaxy_ toolshed:

https://toolshed.g2.bx.psu.edu/repository?repository_id=59b90ef18cc5dbbc&changeset_revision=4654c51dae72
https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6


Usage
Expand All @@ -105,20 +113,21 @@ If you run ``hansel -h``, you should see the following usage statement:
usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME]
[-p forward_reads reverse_reads] [-i fasta_path genome_name]
[-D INPUT_DIRECTORY] [-o OUTPUT_SUMMARY]
[-O OUTPUT_TILE_RESULTS] [-S OUTPUT_SIMPLE_SUMMARY] [--force]
[-O OUTPUT_KMER_RESULTS] [-S OUTPUT_SIMPLE_SUMMARY] [--force]
[--json] [--min-kmer-freq MIN_KMER_FREQ]
[--max-kmer-freq MAX_KMER_FREQ]
[--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
[--max-missing-tiles MAX_MISSING_TILES]
[--min-ambiguous-tiles MIN_AMBIGUOUS_TILES]
[--max-missing-kmers MAX_MISSING_KMERS]
[--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
[--low-cov-warning LOW_COV_WARNING]
[--max-intermediate-tiles MAX_INTERMEDIATE_TILES] [-t THREADS]
[--max-intermediate-kmers MAX_INTERMEDIATE_KMERS]
[--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS]
[-v] [-V]
[F [F ...]]
Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg and Enteritidis subtyping.
Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
positional arguments:
F Input genome FASTA/FASTQ files (can be Gzipped)
Expand All @@ -142,8 +151,8 @@ If you run ``hansel -h``, you should see the following usage statement:
paired) (files can be Gzipped)
-o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY
Subtyping summary output path (tab-delimited)
-O OUTPUT_TILE_RESULTS, --output-tile-results OUTPUT_TILE_RESULTS
Subtyping tile matching output path (tab-delimited)
-O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS
Subtyping kmer matching output path (tab-delimited)
-S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY
Subtyping simple summary output path
--force Force existing output files to be overwritten
Expand All @@ -155,17 +164,17 @@ If you run ``hansel -h``, you should see the following usage statement:
--low-cov-depth-freq LOW_COV_DEPTH_FREQ
Frequencies below this coverage are considered low
coverage
--max-missing-tiles MAX_MISSING_TILES
Decimal proportion of maximum allowable missing tiles
--max-missing-kmers MAX_MISSING_KMERS
Decimal proportion of maximum allowable missing kmers
before being considered an error. (0.0 - 1.0)
--min-ambiguous-tiles MIN_AMBIGUOUS_TILES
Minimum number of missing tiles to be considered an
--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS
Minimum number of missing kmers to be considered an
ambiguous result
--low-cov-warning LOW_COV_WARNING
Overall tile coverage below this value will trigger a
Overall kmer coverage below this value will trigger a
low coverage warning
--max-intermediate-tiles MAX_INTERMEDIATE_TILES
Decimal proportion of maximum allowable missing tiles
--max-intermediate-kmers MAX_INTERMEDIATE_KMERS
Decimal proportion of maximum allowable missing kmers
to be considered an intermediate subtype. (0.0 - 1.0)
-t THREADS, --threads THREADS
Number of parallel threads to run analysis (default=1)
Expand All @@ -191,15 +200,15 @@ Contents of ``results.tab``:

.. code-block::
sample scheme subtype all_subtypes tiles_matching_subtype are_subtypes_consistent inconsistent_subtypes n_tiles_matching_all n_tiles_matching_all_total n_tiles_matching_positive n_tiles_matching_positive_total n_tiles_matching_subtype n_tiles_matching_subtype_total file_path
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path
SRR1002850 heidelberg 2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4 1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4 True 202 202 17 17 3 3 SRR1002850.fasta
Contents of ``match_results.tab``:

.. code-block::
tilename stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen seq coverage is_trunc refposition subtype is_pos_tile sample file_path scheme
kmername stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen seq coverage is_trunc refposition subtype is_pos_kmer sample file_path scheme
775920-2.2.2.2 NODE_2_length_512016_cov_46.4737_ID_3 100.0 33 0 0 1 33 474875 474907 2.0000000000000002e-11 62.1 33 512016 GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG 1.0 False 775920 2.2.2.2 True SRR1002850 SRR1002850.fasta heidelberg
negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 276235 276267 2.0000000000000002e-11 62.1 33 427905 CATCGTGAAGCAGAACAGACGCGCATTCTTGCT 1.0 False negative3305400 2.1.1.1 False SRR1002850 SRR1002850.fasta heidelberg
negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 170918 170950 2.0000000000000002e-11 62.1 33 427905 ACCCGGTCTACCGCAAAATGGAAAGCGATATGC 1.0 False negative3200083 2.1 False SRR1002850 SRR1002850.fasta heidelberg
Expand All @@ -221,15 +230,15 @@ Contents of ``results.tab``:

.. code-block::
sample scheme subtype all_subtypes tiles_matching_subtype are_subtypes_consistent inconsistent_subtypes n_tiles_matching_all n_tiles_matching_all_total n_tiles_matching_positive n_tiles_matching_positive_total n_tiles_matching_subtype n_tiles_matching_subtype_total file_path
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path
SRR5646583 heidelberg 2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1 1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1 True 202 202 20 20 2 2 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger
Contents of ``match_results.tab``:

.. code-block::
seq freq sample file_path tilename is_pos_tile subtype refposition is_kmer_freq_okay scheme
seq freq sample file_path kmername is_pos_kmer subtype refposition is_kmer_freq_okay scheme
ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC 68 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 21097-2.2.1.1.1 True 2.2.1.1.1 21097 True heidelberg
AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT 77 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 157792-2.2.1.1.1 True 2.2.1.1.1 157792 True heidelberg
CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC 67 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 293728-2.2.1.1 True 2.2.1.1 293728 True heidelberg
Expand Down
6 changes: 3 additions & 3 deletions bio_hansel/__init__.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# -*- coding: utf-8 -*-

__version__ = '2.1.1'
__version__ = '2.2.0'
program_name = 'bio_hansel'
program_summary = 'Subtype microbial genomes using SNV targeting k-mer subtyping schemes.'
program_desc = program_summary + '''
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg and Enteritidis subtyping.
Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, and Typhi subtyping. Also includes a Mycobacterium Tuberculosis scheme.
Developed by Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, David Son, Darian Hole, Philip Mabon, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Marisa Rankin, Joanne MacKinnon, Roger Johnson, John H.E. Nash.
'''
75 changes: 63 additions & 12 deletions bio_hansel/aho_corasick/__init__.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,61 @@
# -*- coding: utf-8 -*-

from collections import defaultdict
from itertools import product
import logging
import sys

from ahocorasick import Automaton
import pandas as pd

from ..parsers import parse_fasta, parse_fastq
from ..utils import revcomp
from ..const import bases_dict
from ..subtyping_params import SubtypingParams

def expand_degenerate_bases(seq):
"""List all possible kmers for a scheme given a degenerate base
Args:
Scheme_kmers from SNV scheme fasta file
Returns:
List of all possible kmers given a degenerate base or not
"""

return list(map("".join, product(*map(bases_dict.get, seq))))

def check_total_kmers(scheme_fasta, subtyping_params):
'''Checks that the number of kmers about to be created is not at too high a computation or time cost
Args:
scheme_fasta: Kmer sequences from the SNV scheme
Subtyping parameter max_degenerate_kmers: The max kmers allowed by the scheme
Returns:
None if created kmers < max degenerate kmers argument
Exits code with warning if created kmers > max degenerate kmers argument
'''
kmer_number = 0
for header, sequence in parse_fasta(scheme_fasta):
value = 1
for char in sequence:
length_key = len(bases_dict[char])
value = value * length_key
kmer_number = kmer_number + value
if kmer_number*2 > subtyping_params.max_degenerate_kmers:
return logging.error(
'''
Your current scheme contains "{}" kmers which is over the current max-degenerate-kmers check of "{}" (Maximum recommended k-mers is 100000).
It is not advised to run this scheme due to the time and memory usage required to give an output with this many kmers loaded.
If you still want to run this scheme, add the command line check of "--max-degenerate-kmers {}" at the end of your previous command.
'''.format(
kmer_number*2,
subtyping_params.max_degenerate_kmers,
kmer_number*2+1
)), sys.exit()
else:
return None


def init_automaton(scheme_fasta):
Expand All @@ -20,8 +69,10 @@ def init_automaton(scheme_fasta):
"""
A = Automaton()
for header, sequence in parse_fasta(scheme_fasta):
A.add_word(sequence, (header, sequence, False))
A.add_word(revcomp(sequence), (header, sequence, True))
kmer_list = expand_degenerate_bases(sequence)
for seq in kmer_list:
A.add_word(seq, (header, seq, False))
A.add_word(revcomp(seq), (header, seq, True))
A.make_automaton()
return A

Expand All @@ -38,9 +89,9 @@ def find_in_fasta(A: Automaton, fasta: str) -> pd.DataFrame:
"""
res = []
for contig_header, sequence in parse_fasta(fasta):
for idx, (tilename, tile_seq, is_revcomp) in A.iter(sequence):
res.append((tilename, tile_seq, is_revcomp, contig_header, idx))
df = pd.DataFrame(res, columns=['tilename', 'seq', 'is_revcomp', 'contig_id', 'match_index'])
for idx, (kmername, kmer_seq, is_revcomp) in A.iter(sequence):
res.append((kmername, kmer_seq, is_revcomp, contig_header, idx))
df = pd.DataFrame(res, columns=['kmername', 'seq', 'is_revcomp', 'contig_id', 'match_index'])
return df


Expand All @@ -54,14 +105,14 @@ def find_in_fastqs(A: Automaton, *fastqs):
Returns:
Dataframe with any matches found in input fastq files
"""
tile_seq_counts = defaultdict(int)
kmer_seq_counts = defaultdict(int)
for fastq in fastqs:
for _, sequence in parse_fastq(fastq):
for idx, (_, tile_seq, _) in A.iter(sequence):
tile_seq_counts[tile_seq] += 1
for idx, (_, kmer_seq, _) in A.iter(sequence):
kmer_seq_counts[kmer_seq] += 1
res = []
for tile_seq, freq in tile_seq_counts.items():
tilename, sequence, _ = A.get(tile_seq)
res.append((tilename, tile_seq, freq))
df = pd.DataFrame(res, columns=['tilename', 'seq', 'freq'])
for kmer_seq, freq in kmer_seq_counts.items():
kmername, sequence, _ = A.get(kmer_seq)
res.append((kmername, kmer_seq, freq))
df = pd.DataFrame(res, columns=['kmername', 'seq', 'freq'])
return df
52 changes: 40 additions & 12 deletions bio_hansel/const.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,40 @@
from bio_hansel import program_name
from bio_hansel.subtyping_params import SubtypingParams

SCHEME_FASTAS = {'heidelberg': {'file': resource_filename(program_name, 'data/heidelberg/tiles.fasta'),
SCHEME_FASTAS = {'heidelberg': {'file': resource_filename(program_name, 'data/heidelberg/kmers.fasta'),
'version': '0.5.0',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'enteritidis': {'file': resource_filename(program_name, 'data/enteritidis/tiles.fasta'),
'version': '0.8.0',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=50)}}
'enteritidis': {'file': resource_filename(program_name, 'data/enteritidis/kmers.fasta'),
'version': '1.0.7',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=50)},
'typhi': {'file': resource_filename(program_name, 'data/typhi/kmers.fasta'),
'version': '1.1.0',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'tb_speciation': {'file': resource_filename(program_name, 'data/m.tuberculosis/kmers.fasta'),
'version': '1.0.1;',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'typhimurium': {'file': resource_filename(program_name, 'data/typhimurium/kmers.fasta'),
'version': '0.5.5;',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)}}


bases_dict = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}


COLUMNS_TO_REMOVE = '''
pident
Expand All @@ -38,17 +66,17 @@
scheme_version
subtype
all_subtypes
tiles_matching_subtype
kmers_matching_subtype
are_subtypes_consistent
inconsistent_subtypes
n_tiles_matching_all
n_tiles_matching_all_expected
n_tiles_matching_positive
n_tiles_matching_positive_expected
n_tiles_matching_subtype
n_tiles_matching_subtype_expected
n_kmers_matching_all
n_kmers_matching_all_expected
n_kmers_matching_positive
n_kmers_matching_positive_expected
n_kmers_matching_subtype
n_kmers_matching_subtype_expected
file_path
avg_tile_coverage
avg_kmer_coverage
qc_status
qc_message
""".strip().split('\n')
Expand Down
1 change: 1 addition & 0 deletions bio_hansel/data/enteritidis/kmers.fasta

Large diffs are not rendered by default.

Loading

0 comments on commit 4db666b

Please sign in to comment.