-
Notifications
You must be signed in to change notification settings - Fork 59
Adds the option for processing ALLELE_NUM. #141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds the option for processing ALLELE_NUM. #141
Conversation
Pull Request Test Coverage Report for Build 483
💛 - Coveralls |
|
BTW, stating the obvious: This PR is on top of PR #131 so the first commit is actually that PR. To see changes for this PR only, please check the second commit. |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding this feature! Looks good. Just one suggestion about whether we even need the new flag.
| def _add_annotations_by_allele_num( | ||
| self, proc_var, annotation_dict, annotation_field_name): | ||
| # type: (ProcessedVariant, Dict[str, str], str) -> None | ||
| if _ALLELE_NUM_ANNOTATION not in annotation_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we always use this check so that we no longer need the use_allele_num flag? (i.e. the logic would be to first look for allele_num, if it doesn't exist, then use minimal or exact match depending on the --minimal flag). Or are you concerned about cases where allele_num is incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I was adding this feature I also preferred not to add an extra flag, but then I thought it is better to be explicit such that if there are problems with ALLELE_NUM we don't force them on the user. Of course we can try to use ALLELE_NUM and if it fails for whatever reason (e.g., bad indices) then fall back on matching (whether exact or minimal).
One problem with the latter approach I described is that in the ALLELE_NUM mode, annotation lists are processed one by one regardless of their alt_bases but in the non ALLELE_NUM mode, we map all annotations lists that have the same alt_bases into a single ALT (check the new unit-test I have added and note both lists have 'T' as their alt_bases but are mapped to two different ALTs). Of course, we can fix this problem too if we strongly prefer not to have the extra flag.
So with all these, I decided that being explicit and either use ALLELE_NUM always or ignore it always is a more clear approach. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Sounds good. Let's keep the flag then.
| index_str = annotation_dict[_ALLELE_NUM_ANNOTATION] | ||
| try: | ||
| alt_index = int(index_str) - 1 | ||
| alt_list = proc_var._alternate_datas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider using the public getter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explicitly have chosen not to use the public attributes when the intention is to mutate (i.e., almost everywhere in this module, you can find other examples too). If you remember, we even wanted to make the attributes to be read-only such that they can be read but not altered but then we decided that it is probably over-designing. So I prefer to treat the attributes as read-only getters.
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL; this commit has the main changes since last review (others are merge related).
| def _add_annotations_by_allele_num( | ||
| self, proc_var, annotation_dict, annotation_field_name): | ||
| # type: (ProcessedVariant, Dict[str, str], str) -> None | ||
| if _ALLELE_NUM_ANNOTATION not in annotation_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I was adding this feature I also preferred not to add an extra flag, but then I thought it is better to be explicit such that if there are problems with ALLELE_NUM we don't force them on the user. Of course we can try to use ALLELE_NUM and if it fails for whatever reason (e.g., bad indices) then fall back on matching (whether exact or minimal).
One problem with the latter approach I described is that in the ALLELE_NUM mode, annotation lists are processed one by one regardless of their alt_bases but in the non ALLELE_NUM mode, we map all annotations lists that have the same alt_bases into a single ALT (check the new unit-test I have added and note both lists have 'T' as their alt_bases but are mapped to two different ALTs). Of course, we can fix this problem too if we strongly prefer not to have the extra flag.
So with all these, I decided that being explicit and either use ALLELE_NUM always or ignore it always is a more clear approach. What do you think?
| index_str = annotation_dict[_ALLELE_NUM_ANNOTATION] | ||
| try: | ||
| alt_index = int(index_str) - 1 | ||
| alt_list = proc_var._alternate_datas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explicitly have chosen not to use the public attributes when the intention is to mutate (i.e., almost everywhere in this module, you can find other examples too). If you remember, we even wanted to make the attributes to be read-only such that they can be read but not altered but then we decided that it is probably over-designing. So I prefer to treat the attributes as read-only getters.
|
BTW, thanks for the quick review. |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
| def _add_annotations_by_allele_num( | ||
| self, proc_var, annotation_dict, annotation_field_name): | ||
| # type: (ProcessedVariant, Dict[str, str], str) -> None | ||
| if _ALLELE_NUM_ANNOTATION not in annotation_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Sounds good. Let's keep the flag then.
Check the documentation of --allele_number option of VEP.
Tested:
Unit tests added; Also ran the pipeline on a sample of gnomAD with --use_allele_num
Issue #81