Implements `Targets.parse_alignment` #20

jbloom · 2019-08-26T03:40:40Z

Targets.parse_alignment is the preferred way to process alignments, and is now fully implemented and tested.

The older Targets.parse_alignment_cs is now a private method, Targets._parse_alignment_cs. It should not be used in code; it is kept only for testing, where it provides results against which to compare the output of Targets.parse_alignment.

The major change is that Targets.parse_alignment no longer goes through Targets._parse_alignment_cs first. Doing so was inefficient: it required building two sets of large data frames, and the cs tag was fully parsed even for alignments that failed a filter. Targets.parse_alignment therefore parses the alignments directly.
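
To illustrate the efficiency argument, here is a minimal sketch of checking a cheap filter before doing the expensive cs parsing. The function names, the `clip5` field, and the filter-reason string are all hypothetical and are not taken from the alignparse code.

```python
import re

# Hypothetical tokenizer standing in for the expensive cs-parsing step.
_CS_OP = re.compile(r":\d+|\*[a-z]{2}|\+[a-z]+|-[a-z]+")


def parse_cs(cs):
    """Split a short-form cs string into its operations."""
    return _CS_OP.findall(cs)


def filter_then_parse(aln, max_clip5=0):
    """Return (parsed_ops, None) if kept, else (None, filter_reason)."""
    # Cheap check first: a rejected alignment never reaches parse_cs.
    if aln["clip5"] > max_clip5:
        return None, "query_clip5"
    return parse_cs(aln["cs"]), None


alignments = [
    {"name": "read1", "clip5": 0, "cs": ":10*ag:5"},
    {"name": "read2", "clip5": 7, "cs": ":3+tt:12"},
]
results = {a["name"]: filter_then_parse(a) for a in alignments}
```

In this sketch `read2` is rejected on clipping alone, so its cs string is never tokenized.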

In addition, Targets.parse_alignment has an option (to_csv) to write the created data frames to CSV files line by line rather than holding them in memory. For large alignment files, reading everything into memory can become very expensive.
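
The general idea of writing rows as they are produced can be sketched with the standard `csv` module. This is an illustration only, not how `to_csv` is implemented here; the helper names and the column names are made up for the example.

```python
import csv
import os
import tempfile


def write_rows(rows, path, fieldnames):
    """Write an iterable of dict rows to `path` one at a time; return the row count."""
    n = 0
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:  # rows may be a generator, so nothing accumulates
            writer.writerow(row)
            n += 1
    return n


def fake_parsed_alignments(n):
    """Hypothetical generator standing in for parsed alignment records."""
    for i in range(n):
        yield {"query_name": f"read{i}", "gene_mutations": "" if i % 2 else "A3G"}


out_path = os.path.join(tempfile.mkdtemp(), "parsed.csv")
n_written = write_rows(
    fake_parsed_alignments(3), out_path, ["query_name", "gene_mutations"]
)
```

Because the rows come from a generator, memory use stays flat no matter how many alignments are processed.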

Finally, the internal code in Targets has been refactored somewhat to put complicated operations in private methods.

Slightly altered the specs for `feature_parse_specs`, as described in the docs for `Targets`. This allows `feature_parse_specs` to also specify filtering criteria. Also updated the code to better check for correct targets / features in `feature_parse_specs` in `Targets.__init__`, and removed redundant code from `Targets.parse_alignment_cs`. Finally, added the feature parse specs YAML file for the RecA example and updated the corresponding Jupyter notebook.

Previously the `feature_parse_specs` and the returns from `parse_alignment_cs` included `target_clip5` and `target_clip3`. However, this is redundant with the feature-level clipping information, and so has been removed. The query clipping is retained, as it is not redundant with feature-level clipping.

Simplify the initialization of `Targets` by modularizing operations like filling the defaults of `feature_parse_specs` and getting the names of features to parse in their own methods. Also, eliminated some redundant parameter checks that were confusing to read in the code.

Updates to `feature_parse_specs` input to `Targets`, and docs for `parse_alignments`. Specifically:
- `parse_alignments` has a different documented return value and can write CSV.
- Previously there was a single `clip_count` in `feature_parse_specs`; now it is `clip5` and `clip3` separately.

Example notebooks updated to reflect this.
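
As a rough sketch of what separate `clip5` and `clip3` thresholds allow, the snippet below applies a per-end limit to each feature. Only the `clip5` / `clip3` key names come from the message above; the dict layout, the feature names, and the filter-reason format are assumptions made for illustration.

```python
# Hypothetical per-feature thresholds: maximum clipping allowed at each end.
filters = {
    "gene": {"clip5": 0, "clip3": 0},
    "barcode": {"clip5": 2, "clip3": 0},
}


def filter_reason(observed, filters):
    """Return the first failed filter as 'feature clipN', or None if all pass."""
    for feature, limits in filters.items():
        for end in ("clip5", "clip3"):
            if observed[feature][end] > limits[end]:
                return f"{feature} {end}"
    return None


ok = filter_reason(
    {"gene": {"clip5": 0, "clip3": 0}, "barcode": {"clip5": 1, "clip3": 0}},
    filters,
)
bad = filter_reason(
    {"gene": {"clip5": 0, "clip3": 3}, "barcode": {"clip5": 0, "clip3": 0}},
    filters,
)
```

With a single `clip_count` the barcode could not tolerate 5' clipping while still forbidding 3' clipping; with two keys it can.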

The new `Targets.parse_alignments` is fully implemented and tested against `Targets._parse_alignments_csv` and for consistency in writing CSVs versus returning data frames. It still needs more testing for correctness of output and an illustrative example. Also, added a parameter to `Targets` explicitly permitting the return of mutations / sequences of features with clipping; otherwise this is disallowed as it can give confusing results.

jbloom · 2019-08-26T03:41:25Z

@khdusenbury: I just initiated a pull request. In the pull request message, I describe the major changes I made since we last met.

khdcrawford

I made a couple of docs proofreading changes and have one question about the docs regarding to_csv that I commented in the review.

I also looked through the recA example and this all makes sense. I do think that even though we don't explicitly use it, it might be worth keeping the _parse_alignment_cs function as I could imagine a scenario where one would want to extract and manually look at the clipping and cs tags for queries that failed rather than just getting the filtered reason.

Anyway, I think this all looks good. I'm curious how the review commenting works, so I'm not going to explicitly merge right now, but I will approve it. I'm assuming you can look at the comment then finish merging?

alignparse/targets.py

jbloom and others added 28 commits August 12, 2019 15:07

Set up Targets to have feature_parse_specs

0c368b8

This changes how features are specified to `Targets`, in a way that will mesh with `parse_alignments`.

Targets feature_parse_specs as YAML or dict

f686f57

Now can pass and get `feature_parse_specs` as a YAML file or a dict.

Refactored VEP example notebook to work with feature_parse_specs

3262f1b

Initial docs and cs_to_mutation_count function

74760e2

Edited `regex` matching to handle custom `cs` '<clip#>' ops.

Removed custom '<clipN>' cs notation.

dcdb543

Instead of writing a script to process the '<clipN>' notation, I decided to stick with just using tuples to designate features that have clipping. This makes counting mutations and clipping easier.

Targets.parse_alignment_cs cs, clip separate col

e994bb2

Return columns suffixes `_cs`, `_clip5`, and `_clip3` in `Targets.parse_alignment_cs`.

continued progress on cs_to___ functions

af99af1

Merge branch 'parse_alignment' of https://github.com/jbloomlab/alignp…

7c35ceb

…arse into parse_alignment

parse_alignment_cs only gets features in specs

3c3caf9

Previously `Targets.parse_alignment_cs` parsed **all** features; now it only parses the ones in `feature_parse_specs`.

Finished cs_to____ functions for parsing cs str.

5c54039

Still need to add more rigorous tests.

fixed errors with VEP pilot notebooks

786614f

Merge branch 'parse_alignment' of https://github.com/jbloomlab/alignp…

6c7c2fe

…arse into parse_alignment

update to new pandas and plotnine

44ab13c

There are new versions of `pandas` (0.25.1) and `plotnine` (0.6.0). Use those, and also update notebooks to have output from these; in particular the new `pandas` no longer shows the index in bold in data frames displayed in Jupyter notebooks.

tweak functions to get mutations from cs tags

4ac5e49

The only major change is that mutations from ambiguous nucleotides are **not** counted as mutations in mutation strings. Otherwise just streamline code.
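
A minimal sketch of the idea, assuming the short-form `cs` syntax (`:N` for an identity run, `*xy` for a substitution, `+seq` for an insertion, `-seq` for a deletion). The function name, the choice to treat only `n` as ambiguous, and the rule that each indel counts once are assumptions for the example, not the behavior of the functions in this commit.

```python
import re

# Short-form cs operations: identity run, substitution, insertion, deletion.
_CS_OP = re.compile(r":(\d+)|\*([a-z])([a-z])|\+([a-z]+)|-([a-z]+)")
_AMBIGUOUS = "n"  # assumption: only `n` is treated as ambiguous here


def count_mutations(cs):
    """Count substitutions, insertions, and deletions, skipping ambiguous ones."""
    count = 0
    for m in _CS_OP.finditer(cs):
        _n_ident, ref, query, ins, dele = m.groups()
        if ref is not None:
            # Substitution: ignore it if either base is ambiguous.
            if _AMBIGUOUS not in (ref, query):
                count += 1
        elif ins is not None or dele is not None:
            count += 1  # assumption: each indel counts once, whatever its length
    return count


counts = [count_mutations(cs) for cs in (":8*ag:4", ":8*an:4", ":3+tt:2-c:5*ct")]
```

The second string differs from the first only in that the query base is `n`, so it contributes no mutation.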

update format for feature_parse_specs

434c164

updated VEP_target_feature_parse_specs.yaml

fbf7fd8

partial progress on parse_alignment

aa03b38

Initial filtering on feature clipping.

8a3797d

Likely has bugs, but general outline is there.

tweaks to pass tests

6ed3cf1

Fixed some formatting and reverted making `Targets.parse_alignment_cs` private; it is now a public method again. We can revisit whether to make it private later.

implement multi_align in alignment parsing

fc49dde

Previously, the `multi_align` option to `Targets.parse_alignment_cs` was ignored and secondary alignments were not filtered.

new parse_alignment skeleton

f41b670

The new `Targets.parse_alignment` is fully implemented except for the `Targets._parse_single_Alignment` method it utilizes. No testing yet.

minor code / doc cleanup and update test

682f837

In addition to minor doc tweaks and slight code cleaning, implemented testing of `Targets.parse_alignment` in `test_Targets_parse_alignment.ipynb`.

recA_DMS.ipynb now illustrates parse_alignments

1d2bd92

jbloom requested a review from khdcrawford August 26, 2019 03:40

This was referenced Aug 26, 2019

test Targets.parse_alignment. #21

Closed

flesh out docs and structure for Targets.parse_alignment. #17

Closed

minor docs proofreading

b6a4f87

khdcrawford approved these changes Aug 26, 2019

alignparse/targets.py (outdated, resolved)

fix doc caught by @khdusenbury

cee13bf

Fixes this: #20 (comment)

jbloom merged commit 7828b6f into master Aug 26, 2019

jbloom deleted the parse_alignment branch August 26, 2019 19:10
