Quantify strand exchange #124

Bernadetadad · 2020-12-01T18:58:29Z

this pull requests uses strand_exchange rule to look at strand exchange using viral tags found in CCS. The rule outputs {expt}_plot_strand_exchange.svg plot showing how many mixed tags are found among CCSs.

There's no accuracy metric currently being outputted with the tag identities, I don't know if maybe I should also ask alignparse to output tag accuracy and filter on tags with >0.999 accuracy?

jbloom · 2020-12-02T01:14:17Z

Good job on getting this started, @Bernadetadad. But a few things I noticed:

What happened to all the other variant tags for NA? Right now the notebook seems hardcoded to assume just two variant tags per gene, but don't we want to be considering all possible variant tags and have a pipeline flexible enough to work with an arbitrary number of variant tags per gene? And definitely the NEP thing should not be hardcoded, that should be figured out by the code. We want the pipeline to be robust so that if the input files are changed in reasonable ways it still works, not hardcoded to make silent assumptions about the nature of the NEP, etc tags.
Don't use itterrows. See this: https://ryxcommar.com/2020/01/15/for-the-love-of-god-stop-using-iterrows/ I think you can do everything with a simple merge of the two data frames.

Bernadetadad · 2020-12-11T02:55:33Z

@jbloom I've updated this to call tag variants by melting and merging as per your suggestion.

I'm not not totally sure of is the classify_func : .assign(tag_status=lambda x: x['unique_tags'].map(classify_func)) was having issues when classify_func has a loop in it so I changed it to run without a loop but it now has hardcoded wt and syn
I've added that tag_status column to the end of alignments dataframe and save it a a new csv.gz file but perhaps I can just overwrite the old one? though snakemake might not like that unless I don't call it in the output.

jbloom

@Bernadetadad, looking good! Just a few minor requested changes:

First, you should keep the output and input CSVs separate as you have now, even if that means some files are duplicated. This is much "safer". Having code that modifies files in place won't work with Snakemake and is dangerous in general, as for instance what happens if same rule is run twice (file gets additional columns).
In pacbio_analysis.smk, although not essential for snakemake functioning, I think it makes it a lot easier if you use a consistent naming scheme for files. Right now the CSV file containing the mutations is called mutation_df (seems reasonable) in the output of align_pacbio. But it is called ccs_alignments in the input to strand_exchange (doesn't seem like a good name, as file isn't alignments, it's a CSV of mutations and other information parsed from alignments). Then the new veresion coming out of strand_exchange has a totally new name of tag_identity_df, where it seems like that isn't the main information in there, maybe more like mutations_with_tag_identity_df.
In strand_exchange.py.ipynb notebook Jupyter notebook, last cell, merging is dangerous if you don't check columns are preserved as merging by default is done "inner" so will drop missing queries. So I'd suggest adding something before the merge like assert set(alignments['query_name']) == set(strand_exchange_df['query_name']). If that isn't true, then there are problems with the merge. Also, within the merge you can replace left_on=... and right_on=... with just on='query_name'. You might also add validate='one_to_one' as this is a good check like the assert.
I don't think it's necessary to hard-code the tags. The viral_tags list is already defined in Snakefile, so that can just be passed to the notebook as a parameter. In fact, it already is, so I'm not sure why you have valid_tags defined in the sixth-to-last cell? Then you can just use this with something like the following in classify_func:

unique_tags = set(unique_tags)
for valid_tag in viral_tags:
    if unique_tags = {valid_tag}:
        return valid_tag
    if 'unknown_tag' in unique_tags:
        <the rest of the code there right now>

One last thing, did you run the linting over this?

Bernadetadad · 2020-12-14T21:12:38Z

yes, linting was done and all passes
I've changed variable names
-I extracted tag names from tag_df. But as I said before, (maybe I don't understand how lambda works? ), but it does not like having a loop in the function. with the following function:

def classify_func(unique_tags):
    """Classify tags"""
    unique_tags = set(unique_tags)
    for valid_tag in valid_tags:
        if unique_tags == {valid_tag}:
            return valid_tag
        if 'unknown_tag' in unique_tags:
            return 'unknown tag'
        elif 'missing' in unique_tags:
            return 'missing one or more tags'
        elif unique_tags == set(valid_tags):
            assert len(unique_tags) > 1
            return 'chimeric tags'
        else:
            raise ValueError(f"Cannot process tags {unique_tags}")

When list is valid_tags = ['syn', 'wt'] the error is Cannot process tags {'wt'}
If the list is valid_tags = ['wt', 'syn'] error is Cannot process tags {'syn'}
It looks to me like if it does not see both tags in one iteration it reaches the end of the loop and throughs an error because one of the tags was not found.

jbloom · 2020-12-15T23:28:42Z

@Bernadetadad, the problem is your indentation is wrong. It should look like below. The loop should just check the two known tags, and then if neither of those match (no return in loop) then it goes onto rest of code. Does this make sense?

def classify_func(unique_tags):
    """Classify tags"""
    unique_tags = set(unique_tags)
    for valid_tag in valid_tags:
        if unique_tags == {valid_tag}:
            return valid_tag
    if 'unknown_tag' in unique_tags:
        return 'unknown tag'
    elif 'missing' in unique_tags:
        return 'missing one or more tags'
    elif unique_tags == set(valid_tags):
        assert len(unique_tags) > 1
        return 'chimeric tags'
    else:
        raise ValueError(f"Cannot process tags {unique_tags}")

Bernadetadad · 2020-12-16T01:02:41Z

Sorry, this was very stupid of me. Now all should run.

jbloom

Looks great. I didn't check that it actually runs, but approving and assuming it runs OK for you then I'd say go ahead and merge it!

Bernadetadad added 2 commits December 1, 2020 10:49

rule for strand exchange

52d1ad3

plot strandexchange notebook

26b75ff

Bernadetadad requested a review from jbloom December 1, 2020 18:58

Bernadetadad added 3 commits December 2, 2020 18:51

add all tags

f2e6fc1

add all fluNA tags

8524d6a

remove redundant

4b0ec5b

Bernadetadad mentioned this pull request Dec 3, 2020

Quantify strand exchange #125

Closed

call ags my melting and merging

d78e98f

jbloom requested changes Dec 14, 2020

View reviewed changes

Bernadetadad added 3 commits December 14, 2020 11:56

safe merging

5095032

tag list inside

39c68b4

change snakemake variable names

7704ff2

Bernadetadad added 3 commits December 15, 2020 16:43

fix classify func

d745da1

remove links

3a545fc

change csv name lint too long

b32fd4d

jbloom approved these changes Dec 16, 2020

View reviewed changes

Bernadetadad merged commit c1eb180 into main Dec 16, 2020

jbloom deleted the quantify_strand_exchange branch January 20, 2022 23:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quantify strand exchange #124

Quantify strand exchange #124

Bernadetadad commented Dec 1, 2020

jbloom commented Dec 2, 2020

Bernadetadad commented Dec 11, 2020

jbloom left a comment •

edited

Loading

Bernadetadad commented Dec 14, 2020

jbloom commented Dec 15, 2020

Bernadetadad commented Dec 16, 2020

jbloom left a comment

Quantify strand exchange #124

Quantify strand exchange #124

Conversation

Bernadetadad commented Dec 1, 2020

jbloom commented Dec 2, 2020

Bernadetadad commented Dec 11, 2020

jbloom left a comment • edited Loading

Choose a reason for hiding this comment

Bernadetadad commented Dec 14, 2020

jbloom commented Dec 15, 2020

Bernadetadad commented Dec 16, 2020

jbloom left a comment

Choose a reason for hiding this comment

jbloom left a comment •

edited

Loading