#13, #14: Quality Checking Module & Technician's Output #17

mgopez · 2017-10-30T14:23:52Z

Created new simple output file for technicians.
Added quality checking for samples (minimum tiles, whether samples are mixed).

…ypes are mixed or not. 2) Added new method to determine if data contains mixed subtypes or not. 3) Added new column for REACHED_MIN_TILES to determine if there is a sufficient number of SNV targets found. 4) Added new method to determine if data contains minimum SNV targets. Signed-off-by: Matt <gopezm@myumanitoba.ca>

mgopez · 2017-11-17T17:11:51Z

@peterk87, now Genevieve's QC module is implemented. Please let me know if you find anything I can refactor / change.

peterk87 · 2017-11-06T14:48:03Z

record.txt

@@ -0,0 +1,31 @@
+/home/mgopez/anaconda3/lib/python3.6/site-packages/bio_hansel/subtype.py


This file doesn't look like it should be committed. Is it necessary?

The record.txt file that is.

Correct, I'll remove it from the git repo

peterk87 · 2017-11-08T17:33:39Z

bio_hansel/quality_check/quality_check_functions.py

+
+
+''' 
+[does_subtype_result_exist]


Let's keep the documentation style consistent with how it's done in other parts of the project, e.g. https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/blast_wrapper/__init__.py#L189

See this style guide:

https://google.github.io/styleguide/pyguide.html#Comments

Here's an example of the style guide:

https://pythonhosted.org/an_example_pypi_project/sphinx.html#full-code-example

peterk87 · 2017-11-08T17:41:24Z

bio_hansel/quality_check/quality_check_functions.py

+    if st.are_subtypes_consistent is False and (st.inconsistent_subtypes is not None
+                                                and len(st.inconsistent_subtypes) > 0):
+        mixed_subtypes = '; '.join(st.inconsistent_subtypes)
+        error_messages = MIXED_SUBTYPE_ERROR + ": {" + mixed_subtypes + "} detected in sample " \


str.format is a cleaner solution than string concatenation. See https://docs.python.org/3/library/string.html#format-string-syntax

Example:

'{status}: {msg} val1={val1} val2={val2}'.format(status=some_status, msg=some_message, val1=some_value, val2=other_value)

Lazy method:

'{}: {} val1={} val2={}'.format(some_status, some_message, some_value, other_value)

peterk87 · 2017-11-08T17:42:27Z

bio_hansel/quality_check/quality_check_functions.py

+
+        # Then we verify that the subtype has the correct number of tiles.
+        if st.n_tiles_matching_all <= expected_tiles_matching - (expected_tiles_matching * MIN_TILES_THRESHOLD):
+            error_messages = INSUFFICIENT_NUM_TILES + " : Observed: {"+str(st.n_tiles_matching_all)+"} " \


I recommend using str.format since you won't need to wrap every variable in str().
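As a minimal sketch of the two string-formatting suggestions above, the messages could be built along these lines. The constant values, the helper names, and the exact message wording are assumptions for illustration; only the constant names MIXED_SUBTYPE_ERROR and INSUFFICIENT_NUM_TILES appear in the diff.

```python
# Assumed values; the real constants are defined in the project's const module.
MIXED_SUBTYPE_ERROR = 'Mixed Subtype Error'
INSUFFICIENT_NUM_TILES = 'Insufficient Number of Tiles'


def mixed_subtype_message(inconsistent_subtypes, sample):
    """Hypothetical helper: format the mixed subtype message."""
    # Doubled braces ({{ and }}) produce literal braces in the output.
    return '{error}: {{{subtypes}}} detected in sample {sample}'.format(
        error=MIXED_SUBTYPE_ERROR,
        subtypes='; '.join(inconsistent_subtypes),
        sample=sample,
    )


def insufficient_tiles_message(n_observed, n_expected):
    """Hypothetical helper: integers are formatted without calling str()."""
    return '{error}: Observed: {{{observed}}} Expected: {{{expected}}}'.format(
        error=INSUFFICIENT_NUM_TILES,
        observed=n_observed,
        expected=n_expected,
    )


print(mixed_subtype_message(['1.1', '2.2'], 'sample_A'))
print(insufficient_tiles_message(180, 202))
```

Named fields keep the template readable when a message has more than two or three values, which is the case for most of the QC messages in this PR.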

peterk87 · 2017-11-20T20:14:15Z

bio_hansel/kmer_count/__init__.py

@@ -7,6 +7,7 @@
 import logging
 import pandas as pd

+from bio_hansel.quality_check import perform_quality_check


Use from ..quality_check import perform_quality_check to be consistent with the other import statements and to allow the base package name to change (not that it will).

peterk87 · 2017-11-20T20:15:36Z

bio_hansel/kmer_count/__init__.py

@@ -231,6 +232,7 @@ def summary(self):
            return st, None
        dfgood = df[df.is_kmer_freq_okay]
        dfpos = dfgood[dfgood.is_pos_tile]
+        dfneg = dfgood[dfgood.is_pos_tile == False]


Could write as dfneg = dfgood[~dfgood.is_pos_tile], where ~ negates a boolean Pandas Series.
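A small sketch of the negation suggested above, using a toy DataFrame. The data is made up, and the sketch assumes the is_pos_tile column has bool dtype, which is what ~ needs to act as a logical NOT.

```python
import pandas as pd

# Toy data; the column names follow the diff, the values are invented.
dfgood = pd.DataFrame({
    'refposition': [1000, 2000, 3000],
    'is_pos_tile': [True, False, True],
})

# Boolean mask selects the positive tiles.
dfpos = dfgood[dfgood.is_pos_tile]
# ~ inverts the boolean mask, selecting the negative tiles.
dfneg = dfgood[~dfgood.is_pos_tile]

print(dfneg)
```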

peterk87 · 2017-11-20T20:30:31Z

bio_hansel/main.py

@@ -59,6 +60,8 @@ def init_parser():
                        help='Subtyping summary output path (tab-delimited)')
    parser.add_argument('-O', '--output-tile-results',
                        help='Subtyping tile matching output path (tab-delimited)')
+    parser.add_argument('-OS', '--output-simple-summary',


Let's change this to parser.add_argument('-S', '--output-simple-summary', unless you can think of a better single-char short opt (-e?). We want to make the command-line options as intuitive and consistent as possible, where short opts are a dash followed by a char (e.g. -x) and longer opts are 2 dashes and some words delimited by dashes (e.g. --show-x).

I'll change it to -S. Hopefully won't be confusing as -s corresponds to scheme.

peterk87 · 2017-11-20T20:32:47Z

bio_hansel/main.py

@@ -97,6 +100,7 @@ def main():
    init_console_logger(args.verbose)
    output_summary_path = args.output_summary
    output_tile_results = args.output_tile_results
+    output_simple_summary_path = args.output_simple_summary


Right about here we should check whether files already exist at any of the output paths the user has supplied. If one does, we should alert the user and exit with a non-zero status code, since we don't want users overwriting their important files. We could add a --force opt to allow users to overwrite any files if they know what they are doing.

Ah, I didn't think about that. I'll implement a warning.

Okay, I've added a check to see if output files already exist, and an optional argument --force to overwrite previous output files.
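For reference, the check discussed in this thread could look roughly like the sketch below. It is not the code that was merged: the helper names are hypothetical, and the -o short option for --output-summary is an assumption, since only -O, -S and --force are named in this conversation.

```python
import argparse
import os
import sys


def init_parser():
    parser = argparse.ArgumentParser(description='Sketch of output path checking')
    # '-o' is assumed here; '-O', '-S' and '--force' are the options from the PR.
    parser.add_argument('-o', '--output-summary')
    parser.add_argument('-O', '--output-tile-results')
    parser.add_argument('-S', '--output-simple-summary')
    parser.add_argument('--force', action='store_true',
                        help='Overwrite existing output files')
    return parser


def existing_output_paths(paths):
    """Return the supplied paths that already exist on disk."""
    return [p for p in paths if p is not None and os.path.exists(p)]


def check_outputs(args):
    """Exit with a non-zero status if an output file exists and --force is unset."""
    existing = existing_output_paths([
        args.output_summary,
        args.output_tile_results,
        args.output_simple_summary,
    ])
    if existing and not args.force:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit('Output file(s) already exist: {}. Use --force to overwrite.'.format(
            ', '.join(existing)))
```

Calling check_outputs(init_parser().parse_args()) before any subtyping work starts means the user finds out about the conflict immediately rather than after a long run.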

peterk87 · 2017-11-20T20:43:49Z

bio_hansel/quality_check/__init__.py

+
+    for func in QC_FUNCS:
+        # Calls run_method to check that the qc function takes a Subtype, returns Tuple[Optional[str], Optional[str]]
+        status, message = run_method(func, st, df)


Why not just call func(st, df) here?

I think it was due to an earlier review comment where it suggested I check the type of the function to make sure it was a Callable[[Subtype, DataFrame], Tuple[str, str]]. Is there a better way to do this rather than having an extra method to check?

I think you already have the checking going with specifying the type of the QC_FUNCS list:

QC_FUNCS: List[Callable[[Subtype, DataFrame], Tuple[str, str]]] = [
    check_missing_tiles,
    check_mixed_subtype,
    check_inconsistent_results,
    check_intermediate_subtype,
]

Oh man, I put both in. I'll remove the run_method method.
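A sketch of what calling the functions directly might look like once run_method is gone. Everything here is a stand-in: Subtype and DataFrame are replaced by plain Python types so the example runs on its own, the two check functions are simplified placeholders, and the way statuses are combined is an assumption rather than the module's actual behaviour.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for bio_hansel's Subtype and pandas.DataFrame.
Subtype = Dict[str, object]
DataFrame = List[dict]
QCResult = Tuple[Optional[str], Optional[str]]


def check_missing_tiles(st: Subtype, df: DataFrame) -> QCResult:
    # Placeholder logic: fail when no tile rows were found at all.
    if len(df) == 0:
        return 'FAIL', 'No tiles found'
    return None, None


def check_mixed_subtype(st: Subtype, df: DataFrame) -> QCResult:
    # Placeholder logic: fail when the subtype is flagged as inconsistent.
    if st.get('are_subtypes_consistent') is False:
        return 'FAIL', 'Mixed subtypes detected'
    return None, None


QC_FUNCS: List[Callable[[Subtype, DataFrame], QCResult]] = [
    check_missing_tiles,
    check_mixed_subtype,
]


def perform_quality_check(st: Subtype, df: DataFrame) -> Tuple[str, str]:
    statuses, messages = [], []
    for func in QC_FUNCS:
        # Direct call; no wrapper is needed to validate the signature.
        status, message = func(st, df)
        if status is None:
            continue
        statuses.append(status)
        messages.append('{}: {}'.format(status, message))
    if 'FAIL' in statuses:
        overall = 'FAIL'
    elif statuses:
        overall = 'WARNING'
    else:
        overall = 'PASS'
    return overall, ' | '.join(messages)
```

Note that the annotation on QC_FUNCS is not enforced at runtime; it documents the expected signature and lets a static type checker flag a function that does not match.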

peterk87 · 2017-11-20T20:51:36Z

bio_hansel/quality_check/const.py

+INCONSISTENT_RESULTS_ERROR_3B = "Inconsistent Results Error 3B"
+INTERMEDIATE_SUBTYPE_WARNING = "Intermediate Subtype Warning"
+# Thresholds for tile checking.
+MAX_TILES_THRESHOLD = 0.01


I think it'd be a good idea to make these thresholds and other currently hardcoded values command-line opts with these values as defaults. We can then have the values passed around in some object/dict.

e.g.

parser.add_argument('--max-tiles-threshold', type=float, default=0.01, help='Max tiles threshold percent (0.01 -> 1%%)')

Agreed, I'll change the hardcoded values to command line arguments.

Do you think creating a class would be a good idea in case there are more wanted arguments? Or should I just use a dict to pass the values to the QC module?

I think an attrs class would be a good idea (class name idea: SubtypingParams?). You could bundle a bunch of params for various thresholds into the class like the min/max tiles thresholds as well as the QC specific thresholds. The defaults could be whatever the hardcoded or default cmd line opt values are. The object could be passed to the subtype_* functions and the QC functions rather than adding a new param for each new threshold that we come up with.
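To illustrate the idea of a single params object, here is a minimal sketch. It uses the standard-library dataclasses module as a stand-in for attrs so that it runs without extra dependencies. The field names and the is_low_coverage helper are hypothetical; the defaults 0.01 and 20 are the values shown in the diffs in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtypingParams:
    """Bundle of thresholds passed to the subtyping and QC functions."""
    # Hypothetical field names; defaults mirror values seen in this PR.
    max_tiles_threshold: float = 0.01
    low_coverage_threshold: int = 20

    def __post_init__(self):
        # Reject thresholds that cannot be interpreted as a fraction.
        if not 0.0 <= self.max_tiles_threshold <= 1.0:
            raise ValueError('max_tiles_threshold must be between 0 and 1')


def is_low_coverage(freq: int, params: SubtypingParams) -> bool:
    """Hypothetical QC helper: frequencies below the threshold are low coverage."""
    return freq < params.low_coverage_threshold


defaults = SubtypingParams()
print(is_low_coverage(10, defaults))
print(is_low_coverage(10, SubtypingParams(low_coverage_threshold=5)))
```

With this shape, a new threshold becomes one new field with a default, and the function signatures that receive the object do not change.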

peterk87 · 2017-11-20T22:29:02Z

bio_hansel/quality_check/qc_utils.py

+
+        pos_tiles = dfst[dfst['is_pos_tile']]
+        neg_tiles = dfst[dfst['is_pos_tile'] == False]
+        pos_tile_values = '|'.join(pos_tiles['refposition'].values.tolist())


So it looks like df['refposition'] is a Series of strings, since there are entries like negativeXXXXX. However, that negative info is already extracted into another column, is_pos_tile, so it's redundant, and it would make a lot more sense for the refposition column to contain integers. That way you don't have to do any string/regex matching hackery and can instead just ask if neg_tiles['refposition'].isin(pos_tile_positions), where pos_tile_positions is a list of ints. Hope that makes sense!

Should I change the refposition column to just contain integers then?

I think so. These Pandas functions may be useful

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

e.g. df['refposition'] = pd.to_numeric(df['refposition'], downcast='unsigned', errors='coerce')
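Putting the two suggestions together, a rough sketch of the conversion and the isin lookup on toy data might look like this. The values are invented, and is_pos_tile is derived from the string prefix here only so the toy example is self-contained; in the project that column already exists.

```python
import pandas as pd

# Toy data in the string form described above (negativeXXXXX entries).
df = pd.DataFrame({
    'refposition': ['1000', 'negative1000', '2000', 'negative3000'],
})
df['is_pos_tile'] = ~df['refposition'].str.startswith('negative')

# Strip the prefix, then convert the remaining digits to numbers.
df['refposition'] = pd.to_numeric(
    df['refposition'].str.replace('negative', ''),
    errors='coerce',
)

pos_tiles = df[df['is_pos_tile']]
neg_tiles = df[~df['is_pos_tile']]

# Negative tiles whose position also has a positive tile hit.
pos_tile_positions = pos_tiles['refposition'].tolist()
conflicting = neg_tiles[neg_tiles['refposition'].isin(pos_tile_positions)]

print(conflicting)
```

With errors='coerce', any entry that still fails to parse becomes NaN instead of raising, so it may be worth checking for NaN values after the conversion.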

peterk87 · 2017-11-20T22:31:15Z

bio_hansel/quality_check/quality_check_functions.py

+
+        if total_tiles_hits < (total_tiles - total_tiles * MIN_TILES_THRESHOLD):
+            tiles_with_hits = df[(df['is_kmer_freq_okay'] == True)]
+            average_freq_coverage_depth = tiles_with_hits['freq'].sum() / len(tiles_with_hits)


Could use Series.mean() instead. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html
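A short sketch of the same calculation with Series.mean(), on made-up data. The column names are those from the diff above.

```python
import pandas as pd

# Toy data; only rows with an acceptable k-mer frequency are averaged.
df = pd.DataFrame({
    'freq': [10, 20, 30, 2],
    'is_kmer_freq_okay': [True, True, True, False],
})

tiles_with_hits = df[df['is_kmer_freq_okay']]
average_freq_coverage_depth = tiles_with_hits['freq'].mean()

print(average_freq_coverage_depth)
```

One difference worth keeping in mind: mean() skips NaN values by default, whereas sum() / len() would still count those rows in the denominator.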

peterk87 · 2017-11-20T22:35:14Z

bio_hansel/quality_check/quality_check_functions.py

+    error_status = None
+    error_messages = None
+
+    if st.are_subtypes_consistent:


We could alleviate some nesting, e.g.

    if not st.are_subtypes_consistent:
        logging.debug("QC: Checking for missing tiles not run, inconsistent subtype detected.")
        return FAIL_MESSAGE, "Subtype is inconsistent, quality checking for missing tiles not run."

    total_tiles = int(st.n_tiles_matching_all_expected)
    # rest of function

mgopez · 2017-11-22T19:52:12Z

@peterk87 I added in the new params for the QC module. ~~They probably aren't the best names~~ Words aren't coming to me right now, so if you have any suggestions let me know so I can change them.

peterk87 · 2017-11-23T15:02:24Z

bio_hansel/main.py

-from .subtyper import subtype_fasta, SUBTYPE_SUMMARY_COLS, subtype_reads
-from .subtype_stats import subtype_counts
-from .utils import genome_name_from_fasta_path, get_scheme_fasta
+from bio_hansel import program_name, program_desc, __version__


Please change all imports to from . import ... format.

It seems that when I use .. instead of bio_hansel, PyCharm won't let me run the program. Perhaps this is an error with my setup?

peterk87 · 2017-11-23T15:07:17Z

bio_hansel/main.py

+                        type=int,
+                        default=20,
+                        help='Frequencies below this coverage are considered low coverage')
+    parser.add_argument('--missing-total-tiles-max',


Let's try to have names of things read like regular English statements and sentences, e.g. --max-missing-tiles

Is this the threshold for the number of missing positive or negative tiles to consider the result to be untrustworthy (i.e. qc_status == FAIL)?

This is the maximum percentage of tiles allowed missing before the result is considered an error.

Looks like I should change the type from int to float if it's a percentage.
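A possible shape for the option after the rename and the type change, as a sketch only. It assumes the threshold is given as a fraction between 0 and 1; the default of 0.05 and the fraction helper are hypothetical and not taken from the PR.

```python
import argparse


def fraction(value):
    """argparse type: parse a float and require it to lie in [0, 1]."""
    try:
        number = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError('{!r} is not a number'.format(value))
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError('{!r} is not between 0 and 1'.format(value))
    return number


def init_parser():
    parser = argparse.ArgumentParser(description='Sketch of a fractional threshold option')
    parser.add_argument('--max-missing-tiles',
                        type=fraction,
                        default=0.05,  # assumed default, not taken from the PR
                        # A literal % in argparse help text must be written as %%.
                        help='Maximum fraction of tiles allowed to be missing '
                             'before the result is an error (0.05 -> 5%%)')
    return parser


args = init_parser().parse_args(['--max-missing-tiles', '0.1'])
print(args.max_missing_tiles)
```

Validating the range in the type function means a value such as 5 (meant as 5%) is rejected at the command line instead of silently disabling the check.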

mgopez · 2017-11-27T19:05:56Z

I added some tests to verify that the QC functions are working as expected. Please let me know if there's anything else you want me to change for this pr @peterk87.

mgopez added 7 commits October 16, 2017 15:23

Merge branch 'master' into new_output_feature_requests

dfd873b

Changed the min tiles threshold to be >5% of expected tiles.

a4e13a1

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Changed the way mixed subtypes and min tiles are checked and stored.

8c25ea6

Signed-off-by: Matt <gopezm@myumanitoba.ca>

#13: Created new simple output file. Added new argument for the path …

8e5ccf5

…of the simple output file. misc, moving variables to const file. Signed-off-by: Matt <gopezm@myumanitoba.ca> #14: Adding the quality check module and it's tests.

Merge remote-tracking branch 'origin/new_output_feature_requests' int…

7c4b2e3

…o new_output_feature_requests # Conflicts: # bio_hansel/const.py # bio_hansel/subtype.py # bio_hansel/subtyper.py # bio_hansel/utils.py # tests/test_confidence_false.py # tests/test_confidence_true.py

Adding checking for null input.

6768fb7

Signed-off-by: Matt <gopezm@myumanitoba.ca>

mgopez requested a review from peterk87 October 30, 2017 14:24

mgopez changed the title ~~Fixes for #13 and #14.~~ WIP // Fixes for #13 and #14. Oct 30, 2017

mgopez added 3 commits November 4, 2017 14:21

Fixing test data.

46c5194

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Refactoring the quality check module.

1e00b97

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Changing warning messages, fixed dead code.

21083f2

Signed-off-by: Matt <gopezm@myumanitoba.ca>

mgopez changed the title ~~WIP // Fixes for #13 and #14.~~ Fixes for #13 and #14. Nov 6, 2017

mgopez added 17 commits November 6, 2017 11:54

Refactoring structure of quality checking.

8dc2535

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Re-re-factoring quality check functions, and added new method

f6ff944

for MAX_TILE checking. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding intuitive messages, fixing bug with no inconsistent subtypes b…

d7a601b

…eing shown. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Minor text fixes

acaf53b

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding return type.

d2adee9

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding simplier method for checking if the subtype is there.

2d79c64

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Skeleton for re-factored QC methods.

1aa20ce

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Skeleton for re-factored QC methods.

76937f4

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding genome coverage for check_missing_tiles.

c76bfca

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Created Quality Check module as per Genevieve's requirements.

7482582

Fixed tests. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding git ignore.

bdd55c5

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Corrected quality checking methods.

5d58710

Signed-off-by: Matt <gopezm@myumanitoba.ca>

text fixes

9639236

Signed-off-by: Matt <gopezm@myumanitoba.ca>

editing git file

285f868

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Fixed git ignore.

e6af08d

Changing variable names, adding constants for error types.

2029087

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Adding Inconsistent Results Error 3B

94d616c

Signed-off-by: Matt <gopezm@myumanitoba.ca>

peterk87 suggested changes Nov 20, 2017

peterk87 reviewed Nov 20, 2017

WIP: Changing doc strings

3991c0b

Signed-off-by: Matt <gopezm@myumanitoba.ca>

peterk87 reviewed Nov 20, 2017

mgopez added 6 commits November 21, 2017 08:24

Removing record.txt

0b40987

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Changed argument for simple summary to -S

a80bf9b

Removed run_method Changed negating of a pandas df. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Removing nesting within the quality checking methods.

9ca4353

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Change negating.

43e1914

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Added --force, to overwrite existing output files.

839320f

Added checking to see if out files at paths specified exist. Changing method comments to be standard. Making finding conflicting tiles less hacky. Changed imports to be relative. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Added SubtypingParams.

f7a50db

Signed-off-by: Matt <gopezm@myumanitoba.ca>

Fixing formatting, created new tests.

671e3db

Signed-off-by: Matt <gopezm@myumanitoba.ca>

peterk87 reviewed Nov 23, 2017

mgopez added 2 commits November 24, 2017 15:19

Added new arguments.

746989d

Split QC functions. Cleaned up code. Signed-off-by: Matt <gopezm@myumanitoba.ca>

Readding test files.

36a7542

Signed-off-by: Matt <gopezm@myumanitoba.ca>

mgopez changed the title ~~Fixes for #13 and #14.~~ #13, #14: Quality Checking Module Nov 30, 2017

mgopez changed the title ~~#13, #14: Quality Checking Module~~ #13, #14: Quality Checking Module & Technician's Output Nov 30, 2017

peterk87 approved these changes Nov 30, 2017

peterk87 merged commit 5e3484d into master Nov 30, 2017

peterk87 deleted the new_output_feature_requests branch November 30, 2017 21:34
