-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#13, #14: Quality Checking Module & Technician's Output #17
Conversation
mgopez
commented
Oct 30, 2017
- Created new simple output file for technicians.
- Added quality checking for samples (min tiles, is samples mixed)
…ypes are mixed or not. 2) Added new method to determine if data contains mixed subtypes or not. 3) Added new column for REACHED_MIN_TILES to determine if there is a sufficient number of SNV targets found. 4) Added new method to determine if data contains minimum SNV targets. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
…of the simple output file. misc, moving variables to const file. Signed-off-by: Matt <gopezm@myumanitoba.ca> #14: Adding the quality check module and it's tests.
…o new_output_feature_requests # Conflicts: # bio_hansel/const.py # bio_hansel/subtype.py # bio_hansel/subtyper.py # bio_hansel/utils.py # tests/test_confidence_false.py # tests/test_confidence_true.py
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
for MAX_TILE checking. Signed-off-by: Matt <gopezm@myumanitoba.ca>
…eing shown. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Fixed tests. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@peterk87, now Genevieve's QC module is implemented. Please let me know if you find anything I can refactor / change. |
record.txt
Outdated
@@ -0,0 +1,31 @@ | |||
/home/mgopez/anaconda3/lib/python3.6/site-packages/bio_hansel/subtype.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file doesn't look like it should be committed. Is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The record.txt
file that is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, I'll remove it from the git repo
|
||
|
||
''' | ||
[does_subtype_result_exist] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the documentation style consistent with how it's done in other parts of the project, e.g. https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/blast_wrapper/__init__.py#L189
See this style guide:
https://google.github.io/styleguide/pyguide.html#Comments
Here's an example of the style guide:
https://pythonhosted.org/an_example_pypi_project/sphinx.html#full-code-example
if st.are_subtypes_consistent is False and (st.inconsistent_subtypes is not None | ||
and len(st.inconsistent_subtypes) > 0): | ||
mixed_subtypes = '; '.join(st.inconsistent_subtypes) | ||
error_messages = MIXED_SUBTYPE_ERROR + ": {" + mixed_subtypes + "} detected in sample " \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string.format
is a cleaner solution than string concatenation. See https://docs.python.org/3/library/string.html#format-string-syntax
Example:
'{status}: {msg} val1={val1} val2={val2}'.format(
status=some_status,
msg=some_message,
val1=some_value,
val2=other_value
)
Lazy method:
'{}: {} val1={} val2={}'.format(
some_status,
some_message,
some_value,
other_value
)
|
||
# Then we verify that the subtype has the correct number of tiles. | ||
if st.n_tiles_matching_all <= expected_tiles_matching - (expected_tiles_matching * MIN_TILES_THRESHOLD): | ||
error_messages = INSUFFICIENT_NUM_TILES + " : Observed: {"+str(st.n_tiles_matching_all)+"} " \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend using string.format
since you won't need to str(all_the_variables)
bio_hansel/kmer_count/__init__.py
Outdated
@@ -7,6 +7,7 @@ | |||
import logging | |||
import pandas as pd | |||
|
|||
from bio_hansel.quality_check import perform_quality_check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from ..quality_check import perform_quality_check
to be consistent with other import statements and to allow base package name to change (not that it will)
bio_hansel/kmer_count/__init__.py
Outdated
@@ -231,6 +232,7 @@ def summary(self): | |||
return st, None | |||
dfgood = df[df.is_kmer_freq_okay] | |||
dfpos = dfgood[dfgood.is_pos_tile] | |||
dfneg = dfgood[dfgood.is_pos_tile == False] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could write as dfneg = dfgood[~dfgood.is_pos_tile]
where ~
negates a Pandas Series.
bio_hansel/main.py
Outdated
@@ -59,6 +60,8 @@ def init_parser(): | |||
help='Subtyping summary output path (tab-delimited)') | |||
parser.add_argument('-O', '--output-tile-results', | |||
help='Subtyping tile matching output path (tab-delimited)') | |||
parser.add_argument('-OS', '--output-simple-summary', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change this to parser.add_argument('-S', '--output-simple-summary',
unless you can think of a better single char short opt (-e
?). We want to make the commandline options as intuitive and consistent as possible where short opts are dash followed by a char (e.g. -x
) and longer opts are 2 dashes and some words delimited by dashes (e.g. --show-x
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it to -S
. Hopefully won't be confusing as -s
corresponds to scheme.
bio_hansel/main.py
Outdated
@@ -97,6 +100,7 @@ def main(): | |||
init_console_logger(args.verbose) | |||
output_summary_path = args.output_summary | |||
output_tile_results = args.output_tile_results | |||
output_simple_summary_path = args.output_simple_summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right about here we should add some checking to see if there are already files at each of those paths if the user has supplied those output paths and exit with non-zero status code while alerting user that a file already exists at one of these file paths (we don't want users overwriting their important files). We could add a --force
opt to allow users to overwrite any files if they know what they are doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't think about that. I'll implement a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've added a check to see if output files already exist, and an optional argument --force
to overwrite previous output files.
bio_hansel/quality_check/__init__.py
Outdated
|
||
for func in QC_FUNCS: | ||
# Calls run_method to check that the qc function takes a Subtype, returns Tuple[Optional[str], Optional[str]] | ||
status, message = run_method(func, st, df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just call func(st, df)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was due to an earlier review comment where it suggested I check the type of the function to make sure it was a Callable[[Subtype, DataFrame], Tuple[str, str]]. Is there a better way to do this rather than having an extra method to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you already have the checking going with specifying the type of the QC_FUNCS
list:
QC_FUNCS: List[Callable[[Subtype, DataFrame], Tuple[str, str]]] = \
[
check_missing_tiles,
check_mixed_subtype,
check_inconsistent_results,
check_intermediate_subtype
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man, I put both in. I'll remove the run_method method.
bio_hansel/quality_check/const.py
Outdated
INCONSISTENT_RESULTS_ERROR_3B = "Inconsistent Results Error 3B" | ||
INTERMEDIATE_SUBTYPE_WARNING = "Intermediate Subtype Warning" | ||
# Thresholds for tile checking. | ||
MAX_TILES_THRESHOLD = 0.01 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be a good idea to make these thresholds and other currently hardcoded values command-line opts with these values as defaults. We can then have the values passed around in some object/dict.
e.g.
parser.add_argument('--max-tiles-threshold',
type=float,
default=0.01,
help='Max tiles threshold percent (0.01 -> 1%)')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I'll change the hardcoded values to command line arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think creating a class would be a good idea in case there are more wanted arguments? Or should I just use a dict to pass the values to the QC module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an attrs
class would be a good idea (class name idea: SubtypingParams
?). You could bundle a bunch of params for various thresholds into the class like the min/max tiles thresholds as well as the QC specific thresholds. The defaults could be whatever the hardcoded or default cmd line opt values are. The object could be passed to the subtype_*
functions and the QC functions rather than adding a new param for each new threshold that we could up with.
Signed-off-by: Matt <gopezm@myumanitoba.ca>
bio_hansel/quality_check/qc_utils.py
Outdated
|
||
pos_tiles = dfst[dfst['is_pos_tile']] | ||
neg_tiles = dfst[dfst['is_pos_tile'] == False] | ||
pos_tile_values = '|'.join(pos_tiles['refposition'].values.tolist()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like df['refposition']
is a series of type string since there's entries with negativeXXXXX
, however, that negative
info is already extracted into another column is_pos_tile
so it's redundant and it would make a lot more sense for the refposition
column to contain integers. That way you don't have to do any string/regex matching hackery and instead just ask if neg_tiles['refposition'].isin(pos_tile_positions)
where pos_tile_positions
is a list of ints. Hope that makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I change the refposition column to just contain integers then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. These Pandas functions may be useful
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html
- e.g.
df['refposition'] = pd.to_numeric(df['refposition'], downcast='unsigned', errors='coerce')
- e.g.
|
||
if total_tiles_hits < (total_tiles - total_tiles * MIN_TILES_THRESHOLD): | ||
tiles_with_hits = df[(df['is_kmer_freq_okay'] == True)] | ||
average_freq_coverage_depth = tiles_with_hits['freq'].sum() / len(tiles_with_hits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use Series.mean()
instead. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html
error_status = None | ||
error_messages = None | ||
|
||
if st.are_subtypes_consistent: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could alleviate some nesting, e.g.
if not st.are_subtypes_consistent:
logging.debug("QC: Checking for missing tiles not run, inconsistent subtype detected.")
return FAIL_MESSAGE, "Subtype is inconsistent, quality checking for missing tiles not run."
total_tiles = int(st.n_tiles_matching_all_expected)
# rest of function
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Removed run_method Changed negating of a pandas df. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
Added checking to see if out files at paths specified exist. Changing method comments to be standard. Making finding conflicting tiles less hacky. Changed imports to be relative. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
@peterk87 I added in the new params for the QC module. |
Signed-off-by: Matt <gopezm@myumanitoba.ca>
from .subtyper import subtype_fasta, SUBTYPE_SUMMARY_COLS, subtype_reads | ||
from .subtype_stats import subtype_counts | ||
from .utils import genome_name_from_fasta_path, get_scheme_fasta | ||
from bio_hansel import program_name, program_desc, __version__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change all imports to from . import ...
format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems when I leave it at ..
instead of bio_hansel.
PyCharm won't let me run the program. Perhaps this is an error with my setup?
bio_hansel/main.py
Outdated
type=int, | ||
default=20, | ||
help='Frequencies below this coverage are considered low coverage') | ||
parser.add_argument('--missing-total-tiles-max', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to have names of things read like regular english statements and sentences, e.g. --max-missing-tiles
Is this the threshold for the number of missing positive or negative tiles to consider the result to be untrustworthy (i.e. qc_status == FAIL
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the maximum percentage of tiles allowed missing before the result is considered an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I should change the type from int
to float
if it's a percentage.
Split QC functions. Cleaned up code. Signed-off-by: Matt <gopezm@myumanitoba.ca>
Signed-off-by: Matt <gopezm@myumanitoba.ca>
I added some tests to verify that the QC functions are working as expected. Please let me know if there's anything else you want me to change for this pr @peterk87. |