Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements Targets.parse_alignment #20

Merged
merged 30 commits into from
Aug 26, 2019
Merged

Implements Targets.parse_alignment #20

merged 30 commits into from
Aug 26, 2019

Conversation

jbloom
Copy link
Member

@jbloom jbloom commented Aug 26, 2019

Targets.parse_alignment is the preferred way to process alignments, and is now fully implemented and tested.

The older Targets.parse_alignment_cs has now become a private method Targets._parse_alignment_cs that should not actually be used in code and is just there for testing as it provides something to which we can compare the results of Targets.parse_alignment.

Major changes are that Targets.parse_alignment does not first go through Targets._parse_alignment_cs. The logic is that doing so was inefficient: it required building two sets of large data frames, and the cs tag was fully parsed even on alignments that failed a filter. Therefore, Targets.parse_alignment directly parses from the same.

In addition, there are options in Targets.parse_alignment to write the created data frames to CSV files line-by-line rather than reading them into memory (this is the to_csv option). The reason is that for large alignment files it may get very expensive to read everything into memory.

Finally, the internal code in Targets has been somewhat re-factored to put complicated operations in private methods.

jbloom and others added 28 commits August 12, 2019 15:07
This changes how features are specified to `Targets`,
in a way that will mesh with `parse_alignments`.
Slightly altered specs for `feature_parse_specs` as
described in docs for `Targets`. This allows the
`feature_parse_specs` to also specifying filtering criteria.

Then updated code to better check for correct targets / features
in `feature_parse_specs` in `Targets.__init__`, and removed
redundant code from `Targets.parse_alignment_cs`.

Finally, added the feature parse specs YAML file for the RecA
example and updated the corresponding Jupyter notebook.
Now can pass and get `feature_parse_specs` as a YAML
file or a dict.
Edited `regex` matching to handle custom `cs` '<clip#>' ops.
Instead of writing a script to process the '<clipN>' notation, I
decided to stick with just using tuples to designate features
that have clipping. This makes counting mutations and clipping
easier.
Return columns suffixes `_cs`, `_clip5`, and `_clip3`
in `Targets.parse_alignment_cs`.
Previously `Targets.parse_alignment_cs` parsed **all**
features; now it only parses the ones in
`feature_parse_specs`.
Previously the `feature_parse_specs` and the returns
from `parse_alignment_cs` included `target_clip5` and
`target_clip3`. However, this is redundant with the feature-
level clipping information, and so has been removed.

The query clipping is retained as that is not redundant
with feature-level clipping.
Still need to add more rigorous tests.
There are new vesions of `pandas` (0.25.1)
and `plotnine` (0.6.0). Use those, and also
update notebooks to have output from these;
in particular the new `pandas` no longer shows
the index in bold in data frames displayed in
Jupyter notebooks.
Only major change is that mutations from ambiguous
are **not** counted as mutations in mutation strings.

Otherwise just streamline code.
Likely has bugs, but general outline is there.
Fixed some formatting, reverted making
`Targets.parse_alignment_cs` private; it's now public
method again. We can re-visit whether to make it public
later.
Simplify the initialization of `Targets`
by modularizing operations like filling the defaults
of `feature_parse_specs` and getting the names of
features to parse in their own methods.

Also, eliminated some redundant parameter checks that
were confusing to read in the code.
Updates to `feature_parse_specs` input to `Targets`,
and docs for `parse_alignments`.

Specifically:

  - `parse_alignments` has different return described and
    can write CSV.

  - Previously there was a single `clip_count` in
    `feature_parse_specs`; now it is `clip5` and `clip3`
    separately. Example notebooks updated to reflect this.
Previously, the `multi_align` option to
`Targets.parse_alignment_cs` was ignored and
secondary alignments were not filtered.
The new `Targets.parse_alignment` is fully implemented
except for the `Targets._parse_single_Alignment` method
it utilizes.

No testing yet.
The new `Targets.parse_alignments` is fully implemented
and tested against `Targets._parse_alignments_csv` and
for consistency in writing CSVs versus returning data frames.

It still needs more testing for correctness of output and
illustrative example.

Also, added a parameter to `Targets` explicitly permittting
the return of mutations / sequences of features with clipping;
otherwise this is disallowed as it can give confusing results.
In addition to minor doc tweaks and slight code
cleaning, implemented testing of `Targets.parse_alignment`
in `test_Targets_parse_alignment.ipynb`.
@jbloom
Copy link
Member Author

jbloom commented Aug 26, 2019

@khdusenbury: I just initiated a pull request. In the pull request message, I describe the major changes I made since we last met.

Copy link
Collaborator

@khdcrawford khdcrawford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a couple of docs proofreading changes and have one question about the docs regarding to_csv that I commented in the review.

I also looked through the recA example and this all makes sense. I do think that even though we don't explicitly use it, it might be worth keeping the _parse_alignment_cs function as I could imagine a scenario where one would want to extract and manually look at the clipping and cs tags for queries that failed rather than just getting the filtered reason.

Anyway, I think this all looks good. I'm curious how the review commenting works, so I'm not going to explicitly merge right now, but I will approve it. I'm assuming you can look at the comment then finish merging?

alignparse/targets.py Outdated Show resolved Hide resolved
@jbloom jbloom merged commit 7828b6f into master Aug 26, 2019
@jbloom jbloom deleted the parse_alignment branch August 26, 2019 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants