Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#13, #14: Quality Checking Module & Technician's Output #17

Merged
merged 38 commits into from
Nov 30, 2017

Conversation

mgopez
Copy link
Member

@mgopez mgopez commented Oct 30, 2017

  • Created new simple output file for technicians.
  • Added quality checking for samples (min tiles, is samples mixed)

…ypes are mixed or not.

2) Added new method to determine if data contains mixed subtypes or not.
3) Added new column for REACHED_MIN_TILES to determine if there is a sufficient number of SNV targets found.
4) Added new method to determine if data contains minimum SNV targets.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
…of the simple output file.

misc, moving variables to const file.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
#14: Adding the quality check module and it's tests.
…o new_output_feature_requests

# Conflicts:
#	bio_hansel/const.py
#	bio_hansel/subtype.py
#	bio_hansel/subtyper.py
#	bio_hansel/utils.py
#	tests/test_confidence_false.py
#	tests/test_confidence_true.py
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@mgopez mgopez requested a review from peterk87 October 30, 2017 14:24
@mgopez mgopez changed the title Fixes for #13 and #14. WIP // Fixes for #13 and #14. Oct 30, 2017
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@mgopez mgopez changed the title WIP // Fixes for #13 and #14. Fixes for #13 and #14. Nov 6, 2017
Signed-off-by: Matt <gopezm@myumanitoba.ca>
for MAX_TILE checking.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
…eing shown.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Fixed tests.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@mgopez
Copy link
Member Author

mgopez commented Nov 17, 2017

@peterk87, now Genevieve's QC module is implemented. Please let me know if you find anything I can refactor / change.

record.txt Outdated
@@ -0,0 +1,31 @@
/home/mgopez/anaconda3/lib/python3.6/site-packages/bio_hansel/subtype.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file doesn't look like it should be committed. Is it necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The record.txt file that is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I'll remove it from the git repo



'''
[does_subtype_result_exist]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the documentation style consistent with how it's done in other parts of the project, e.g. https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/blast_wrapper/__init__.py#L189

See this style guide:

https://google.github.io/styleguide/pyguide.html#Comments

Here's an example of the style guide:

https://pythonhosted.org/an_example_pypi_project/sphinx.html#full-code-example

if st.are_subtypes_consistent is False and (st.inconsistent_subtypes is not None
and len(st.inconsistent_subtypes) > 0):
mixed_subtypes = '; '.join(st.inconsistent_subtypes)
error_messages = MIXED_SUBTYPE_ERROR + ": {" + mixed_subtypes + "} detected in sample " \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string.format is a cleaner solution than string concatenation. See https://docs.python.org/3/library/string.html#format-string-syntax

Example:

'{status}: {msg} val1={val1} val2={val2}'.format(
  status=some_status,
  msg=some_message,
  val1=some_value,
  val2=other_value
)

Lazy method:

'{}: {} val1={} val2={}'.format(
  some_status,
  some_message,
  some_value,
  other_value
)


# Then we verify that the subtype has the correct number of tiles.
if st.n_tiles_matching_all <= expected_tiles_matching - (expected_tiles_matching * MIN_TILES_THRESHOLD):
error_messages = INSUFFICIENT_NUM_TILES + " : Observed: {"+str(st.n_tiles_matching_all)+"} " \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend using string.format since you won't need to str(all_the_variables)

@@ -7,6 +7,7 @@
import logging
import pandas as pd

from bio_hansel.quality_check import perform_quality_check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from ..quality_check import perform_quality_check to be consistent with other import statements and to allow base package name to change (not that it will)

@@ -231,6 +232,7 @@ def summary(self):
return st, None
dfgood = df[df.is_kmer_freq_okay]
dfpos = dfgood[dfgood.is_pos_tile]
dfneg = dfgood[dfgood.is_pos_tile == False]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could write as dfneg = dfgood[~dfgood.is_pos_tile] where ~ negates a Pandas Series.

@@ -59,6 +60,8 @@ def init_parser():
help='Subtyping summary output path (tab-delimited)')
parser.add_argument('-O', '--output-tile-results',
help='Subtyping tile matching output path (tab-delimited)')
parser.add_argument('-OS', '--output-simple-summary',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this to parser.add_argument('-S', '--output-simple-summary', unless you can think of a better single char short opt (-e?). We want to make the commandline options as intuitive and consistent as possible where short opts are dash followed by a char (e.g. -x) and longer opts are 2 dashes and some words delimited by dashes (e.g. --show-x)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to -S. Hopefully won't be confusing as -s corresponds to scheme.

@@ -97,6 +100,7 @@ def main():
init_console_logger(args.verbose)
output_summary_path = args.output_summary
output_tile_results = args.output_tile_results
output_simple_summary_path = args.output_simple_summary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right about here we should add some checking to see if there are already files at each of those paths if the user has supplied those output paths and exit with non-zero status code while alerting user that a file already exists at one of these file paths (we don't want users overwriting their important files). We could add a --force opt to allow users to overwrite any files if they know what they are doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't think about that. I'll implement a warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've added a check to see if output files already exist, and an optional argument --force to overwrite previous output files.


for func in QC_FUNCS:
# Calls run_method to check that the qc function takes a Subtype, returns Tuple[Optional[str], Optional[str]]
status, message = run_method(func, st, df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call func(st, df) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was due to an earlier review comment where it suggested I check the type of the function to make sure it was a Callable[[Subtype, DataFrame], Tuple[str, str]]. Is there a better way to do this rather than having an extra method to check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you already have the checking going with specifying the type of the QC_FUNCS list:

QC_FUNCS: List[Callable[[Subtype, DataFrame], Tuple[str, str]]] = \
[
    check_missing_tiles,
    check_mixed_subtype,
    check_inconsistent_results,
    check_intermediate_subtype
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, I put both in. I'll remove the run_method method.

INCONSISTENT_RESULTS_ERROR_3B = "Inconsistent Results Error 3B"
INTERMEDIATE_SUBTYPE_WARNING = "Intermediate Subtype Warning"
# Thresholds for tile checking.
MAX_TILES_THRESHOLD = 0.01
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a good idea to make these thresholds and other currently hardcoded values command-line opts with these values as defaults. We can then have the values passed around in some object/dict.

e.g.

parser.add_argument('--max-tiles-threshold',
                        type=float,
                        default=0.01,
                        help='Max tiles threshold percent (0.01 -> 1%)')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'll change the hardcoded values to command line arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think creating a class would be a good idea in case there are more wanted arguments? Or should I just use a dict to pass the values to the QC module?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an attrs class would be a good idea (class name idea: SubtypingParams?). You could bundle a bunch of params for various thresholds into the class like the min/max tiles thresholds as well as the QC specific thresholds. The defaults could be whatever the hardcoded or default cmd line opt values are. The object could be passed to the subtype_* functions and the QC functions rather than adding a new param for each new threshold that we could up with.

Signed-off-by: Matt <gopezm@myumanitoba.ca>

pos_tiles = dfst[dfst['is_pos_tile']]
neg_tiles = dfst[dfst['is_pos_tile'] == False]
pos_tile_values = '|'.join(pos_tiles['refposition'].values.tolist())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like df['refposition'] is a series of type string since there's entries with negativeXXXXX, however, that negative info is already extracted into another column is_pos_tile so it's redundant and it would make a lot more sense for the refposition column to contain integers. That way you don't have to do any string/regex matching hackery and instead just ask if neg_tiles['refposition'].isin(pos_tile_positions) where pos_tile_positions is a list of ints. Hope that makes sense!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I change the refposition column to just contain integers then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. These Pandas functions may be useful


if total_tiles_hits < (total_tiles - total_tiles * MIN_TILES_THRESHOLD):
tiles_with_hits = df[(df['is_kmer_freq_okay'] == True)]
average_freq_coverage_depth = tiles_with_hits['freq'].sum() / len(tiles_with_hits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error_status = None
error_messages = None

if st.are_subtypes_consistent:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could alleviate some nesting, e.g.

if not st.are_subtypes_consistent:
    logging.debug("QC: Checking for missing tiles not run, inconsistent subtype detected.")
     return FAIL_MESSAGE, "Subtype is inconsistent, quality checking for missing tiles not run."

total_tiles = int(st.n_tiles_matching_all_expected)
# rest of function

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Removed run_method
Changed negating of a pandas df.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Added checking to see if out files at paths specified exist.
Changing method comments to be standard.
Making finding conflicting tiles less hacky.
Changed imports to be relative.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@mgopez
Copy link
Member Author

mgopez commented Nov 22, 2017

@peterk87 I added in the new params for the QC module. They probably aren't the best names Words aren't coming to me right now, so if you have any suggestions let me know so I can change them.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
from .subtyper import subtype_fasta, SUBTYPE_SUMMARY_COLS, subtype_reads
from .subtype_stats import subtype_counts
from .utils import genome_name_from_fasta_path, get_scheme_fasta
from bio_hansel import program_name, program_desc, __version__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change all imports to from . import ... format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems when I leave it at .. instead of bio_hansel. PyCharm won't let me run the program. Perhaps this is an error with my setup?

type=int,
default=20,
help='Frequencies below this coverage are considered low coverage')
parser.add_argument('--missing-total-tiles-max',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to have names of things read like regular english statements and sentences, e.g. --max-missing-tiles

Is this the threshold for the number of missing positive or negative tiles to consider the result to be untrustworthy (i.e. qc_status == FAIL)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the maximum percentage of tiles allowed missing before the result is considered an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I should change the type from int to float if it's a percentage.

Split QC functions.
Cleaned up code.

Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@mgopez
Copy link
Member Author

mgopez commented Nov 27, 2017

I added some tests to verify that the QC functions are working as expected. Please let me know if there's anything else you want me to change for this pr @peterk87.

@mgopez mgopez changed the title Fixes for #13 and #14. #13, #14: Quality Checking Module Nov 30, 2017
@mgopez mgopez changed the title #13, #14: Quality Checking Module #13, #14: Quality Checking Module & Technician's Output Nov 30, 2017
@peterk87 peterk87 merged commit 5e3484d into master Nov 30, 2017
@peterk87 peterk87 deleted the new_output_feature_requests branch November 30, 2017 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants