Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quantify strand exchange #124

Merged
merged 12 commits into from
Dec 16, 2020
Merged

Quantify strand exchange #124

merged 12 commits into from
Dec 16, 2020

Conversation

Bernadetadad
Copy link
Collaborator

this pull requests uses strand_exchange rule to look at strand exchange using viral tags found in CCS. The rule outputs {expt}_plot_strand_exchange.svg plot showing how many mixed tags are found among CCSs.

There's no accuracy metric currently being outputted with the tag identities, I don't know if maybe I should also ask alignparse to output tag accuracy and filter on tags with >0.999 accuracy?

@jbloom
Copy link
Member

jbloom commented Dec 2, 2020

Good job on getting this started, @Bernadetadad. But a few things I noticed:

  1. What happened to all the other variant tags for NA? Right now the notebook seems hardcoded to assume just two variant tags per gene, but don't we want to be considering all possible variant tags and have a pipeline flexible enough to work with an arbitrary number of variant tags per gene? And definitely the NEP thing should not be hardcoded, that should be figured out by the code. We want the pipeline to be robust so that if the input files are changed in reasonable ways it still works, not hardcoded to make silent assumptions about the nature of the NEP, etc tags.

  2. Don't use itterrows. See this: https://ryxcommar.com/2020/01/15/for-the-love-of-god-stop-using-iterrows/ I think you can do everything with a simple merge of the two data frames.

@Bernadetadad
Copy link
Collaborator Author

@jbloom I've updated this to call tag variants by melting and merging as per your suggestion.

  • I'm not not totally sure of is the classify_func : .assign(tag_status=lambda x: x['unique_tags'].map(classify_func)) was having issues when classify_func has a loop in it so I changed it to run without a loop but it now has hardcoded wt and syn
  • I've added that tag_status column to the end of alignments dataframe and save it a a new csv.gz file but perhaps I can just overwrite the old one? though snakemake might not like that unless I don't call it in the output.

Copy link
Member

@jbloom jbloom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bernadetadad, looking good! Just a few minor requested changes:

  • First, you should keep the output and input CSVs separate as you have now, even if that means some files are duplicated. This is much "safer". Having code that modifies files in place won't work with Snakemake and is dangerous in general, as for instance what happens if same rule is run twice (file gets additional columns).

  • In pacbio_analysis.smk, although not essential for snakemake functioning, I think it makes it a lot easier if you use a consistent naming scheme for files. Right now the CSV file containing the mutations is called mutation_df (seems reasonable) in the output of align_pacbio. But it is called ccs_alignments in the input to strand_exchange (doesn't seem like a good name, as file isn't alignments, it's a CSV of mutations and other information parsed from alignments). Then the new veresion coming out of strand_exchange has a totally new name of tag_identity_df, where it seems like that isn't the main information in there, maybe more like mutations_with_tag_identity_df.

  • In strand_exchange.py.ipynb notebook Jupyter notebook, last cell, merging is dangerous if you don't check columns are preserved as merging by default is done "inner" so will drop missing queries. So I'd suggest adding something before the merge like assert set(alignments['query_name']) == set(strand_exchange_df['query_name']). If that isn't true, then there are problems with the merge. Also, within the merge you can replace left_on=... and right_on=... with just on='query_name'. You might also add validate='one_to_one' as this is a good check like the assert.

  • I don't think it's necessary to hard-code the tags. The viral_tags list is already defined in Snakefile, so that can just be passed to the notebook as a parameter. In fact, it already is, so I'm not sure why you have valid_tags defined in the sixth-to-last cell? Then you can just use this with something like the following in classify_func:

unique_tags = set(unique_tags)
for valid_tag in viral_tags:
    if unique_tags = {valid_tag}:
        return valid_tag
    if 'unknown_tag' in unique_tags:
        <the rest of the code there right now>
  • One last thing, did you run the linting over this?

@Bernadetadad
Copy link
Collaborator Author

  • yes, linting was done and all passes
  • I've changed variable names
    -I extracted tag names from tag_df. But as I said before, (maybe I don't understand how lambda works? ), but it does not like having a loop in the function. with the following function:
def classify_func(unique_tags):
    """Classify tags"""
    unique_tags = set(unique_tags)
    for valid_tag in valid_tags:
        if unique_tags == {valid_tag}:
            return valid_tag
        if 'unknown_tag' in unique_tags:
            return 'unknown tag'
        elif 'missing' in unique_tags:
            return 'missing one or more tags'
        elif unique_tags == set(valid_tags):
            assert len(unique_tags) > 1
            return 'chimeric tags'
        else:
            raise ValueError(f"Cannot process tags {unique_tags}")

When list is valid_tags = ['syn', 'wt'] the error is Cannot process tags {'wt'}
If the list is valid_tags = ['wt', 'syn'] error is Cannot process tags {'syn'}
It looks to me like if it does not see both tags in one iteration it reaches the end of the loop and throughs an error because one of the tags was not found.

@jbloom
Copy link
Member

jbloom commented Dec 15, 2020

@Bernadetadad, the problem is your indentation is wrong. It should look like below. The loop should just check the two known tags, and then if neither of those match (no return in loop) then it goes onto rest of code. Does this make sense?

def classify_func(unique_tags):
    """Classify tags"""
    unique_tags = set(unique_tags)
    for valid_tag in valid_tags:
        if unique_tags == {valid_tag}:
            return valid_tag
    if 'unknown_tag' in unique_tags:
        return 'unknown tag'
    elif 'missing' in unique_tags:
        return 'missing one or more tags'
    elif unique_tags == set(valid_tags):
        assert len(unique_tags) > 1
        return 'chimeric tags'
    else:
        raise ValueError(f"Cannot process tags {unique_tags}")

@Bernadetadad
Copy link
Collaborator Author

Sorry, this was very stupid of me. Now all should run.

Copy link
Member

@jbloom jbloom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I didn't check that it actually runs, but approving and assuming it runs OK for you then I'd say go ahead and merge it!

@Bernadetadad Bernadetadad merged commit c1eb180 into main Dec 16, 2020
@jbloom jbloom deleted the quantify_strand_exchange branch January 20, 2022 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants