-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements Targets.parse_alignment
#20
Conversation
This changes how features are specified to `Targets`, in a way that will mesh with `parse_alignments`.
Slightly altered specs for `feature_parse_specs` as described in docs for `Targets`. This allows the `feature_parse_specs` to also specifying filtering criteria. Then updated code to better check for correct targets / features in `feature_parse_specs` in `Targets.__init__`, and removed redundant code from `Targets.parse_alignment_cs`. Finally, added the feature parse specs YAML file for the RecA example and updated the corresponding Jupyter notebook.
Now can pass and get `feature_parse_specs` as a YAML file or a dict.
Edited `regex` matching to handle custom `cs` '<clip#>' ops.
Instead of writing a script to process the '<clipN>' notation, I decided to stick with just using tuples to designate features that have clipping. This makes counting mutations and clipping easier.
Return columns suffixes `_cs`, `_clip5`, and `_clip3` in `Targets.parse_alignment_cs`.
…arse into parse_alignment
Previously `Targets.parse_alignment_cs` parsed **all** features; now it only parses the ones in `feature_parse_specs`.
Previously the `feature_parse_specs` and the returns from `parse_alignment_cs` included `target_clip5` and `target_clip3`. However, this is redundant with the feature- level clipping information, and so has been removed. The query clipping is retained as that is not redundant with feature-level clipping.
Still need to add more rigorous tests.
…arse into parse_alignment
There are new vesions of `pandas` (0.25.1) and `plotnine` (0.6.0). Use those, and also update notebooks to have output from these; in particular the new `pandas` no longer shows the index in bold in data frames displayed in Jupyter notebooks.
Only major change is that mutations from ambiguous are **not** counted as mutations in mutation strings. Otherwise just streamline code.
Likely has bugs, but general outline is there.
Fixed some formatting, reverted making `Targets.parse_alignment_cs` private; it's now public method again. We can re-visit whether to make it public later.
Simplify the initialization of `Targets` by modularizing operations like filling the defaults of `feature_parse_specs` and getting the names of features to parse in their own methods. Also, eliminated some redundant parameter checks that were confusing to read in the code.
Updates to `feature_parse_specs` input to `Targets`, and docs for `parse_alignments`. Specifically: - `parse_alignments` has different return described and can write CSV. - Previously there was a single `clip_count` in `feature_parse_specs`; now it is `clip5` and `clip3` separately. Example notebooks updated to reflect this.
Previously, the `multi_align` option to `Targets.parse_alignment_cs` was ignored and secondary alignments were not filtered.
The new `Targets.parse_alignment` is fully implemented except for the `Targets._parse_single_Alignment` method it utilizes. No testing yet.
The new `Targets.parse_alignments` is fully implemented and tested against `Targets._parse_alignments_csv` and for consistency in writing CSVs versus returning data frames. It still needs more testing for correctness of output and illustrative example. Also, added a parameter to `Targets` explicitly permittting the return of mutations / sequences of features with clipping; otherwise this is disallowed as it can give confusing results.
In addition to minor doc tweaks and slight code cleaning, implemented testing of `Targets.parse_alignment` in `test_Targets_parse_alignment.ipynb`.
@khdusenbury: I just initiated a pull request. In the pull request message, I describe the major changes I made since we last met. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a couple of docs proofreading changes and have one question about the docs regarding to_csv
that I commented in the review.
I also looked through the recA example and this all makes sense. I do think that even though we don't explicitly use it, it might be worth keeping the _parse_alignment_cs
function as I could imagine a scenario where one would want to extract and manually look at the clipping and cs
tags for queries that failed rather than just getting the filtered reason.
Anyway, I think this all looks good. I'm curious how the review commenting works, so I'm not going to explicitly merge right now, but I will approve it. I'm assuming you can look at the comment then finish merging?
Fixes this: #20 (comment)
Targets.parse_alignment
is the preferred way to process alignments, and is now fully implemented and tested.The older
Targets.parse_alignment_cs
has now become a private methodTargets._parse_alignment_cs
that should not actually be used in code and is just there for testing as it provides something to which we can compare the results ofTargets.parse_alignment
.Major changes are that
Targets.parse_alignment
does not first go throughTargets._parse_alignment_cs
. The logic is that doing so was inefficient: it required building two sets of large data frames, and thecs
tag was fully parsed even on alignments that failed a filter. Therefore,Targets.parse_alignment
directly parses from the same.In addition, there are options in
Targets.parse_alignment
to write the created data frames to CSV files line-by-line rather than reading them into memory (this is theto_csv
option). The reason is that for large alignment files it may get very expensive to read everything into memory.Finally, the internal code in
Targets
has been somewhat re-factored to put complicated operations in private methods.