Add process_partis.py option for a specific indel #281

Closed
eharkins opened this issue May 30, 2019 · 6 comments

@eharkins
Contributor

@lauradoepker would like the ability to run ecgtheow on only the subset of sequences in a particular cluster that have a given indel. (I am opening this issue on cft because ecgtheow processes partis output using cft/bin/process_partis.py.)

This option would come with other options to specify the indel of interest, including:

  • indel length
  • insertion vs. deletion
  • sequence position
  • inserted sequence (in nucleotides); this would be None for a deletion

The name is up for debate: something like --only-with-particular-indel, --unique-indel, or --indel-filter. I'm going to call it --only-with-particular-indel for now:

  • if you use --only-with-particular-indel, process_partis.py checks that you have also specified the other options (see above) that define the indel you care about
  • if we have everything we need, we choose our cluster annotation per the normal control flow
  • filter the cluster sequences on whether they contain the indel of interest, using the information from the associated options (see above) and https://github.com/psathyrella/partis/blob/dev/python/utils.py#L634. The check is done on input_seqs, so that we can tell whether the indel of interest is actually there. @psathyrella does this make sense?
  • then find the indel_reversed_seqs entries corresponding to the IDs that remain after filtering (we may just want to use whichever key would normally be used based on the existing --indel-reversed-seqs option, which happens to be used in the ecgtheow context)
  • all sequence IDs output from this cluster should have _indel_rev appended
  • we want to output this subset of the cluster sequences in a file named like cluster_seqs_indel_rev.fa, alongside the unfiltered cluster sequences in cluster_seqs.fa (using indel_reversed)
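
The filtering steps above could be sketched roughly as follows. This is only an illustration: the annotation layout (parallel lists under unique_ids, indelfos, and indel_reversed_seqs, with each indel stored as a dict with type, pos, len, and seqstr keys) is an assumption made for the example and may not match what partis actually emits, and the function names are hypothetical.

```python
# Hypothetical sketch: the annotation layout is an assumption, not a documented partis structure.

def indel_matches(indel, length, kind, pos, inserted_seq=None):
    """Return True if one indel record matches the requested definition."""
    if indel["type"] != kind or indel["len"] != length or indel["pos"] != pos:
        return False
    # A deletion has no inserted sequence, so only compare it for insertions.
    return kind == "deletion" or inserted_seq is None or indel["seqstr"] == inserted_seq


def filter_by_indel(annotation, length, kind, pos, inserted_seq=None, suffix="_indel_rev"):
    """Keep sequences that contain the requested indel.

    Returns {uid + suffix: indel-reversed sequence} for the sequences kept.
    """
    kept = {}
    rows = zip(annotation["unique_ids"], annotation["indelfos"], annotation["indel_reversed_seqs"])
    for uid, indelfo, reversed_seq in rows:
        if any(indel_matches(i, length, kind, pos, inserted_seq) for i in indelfo["indels"]):
            kept[uid + suffix] = reversed_seq
    return kept


# Toy cluster: "a" has the insertion of interest, "b" has no indel, "c" has a deletion.
example_annotation = {
    "unique_ids": ["a", "b", "c"],
    "indelfos": [
        {"indels": [{"type": "insertion", "pos": 12, "len": 3, "seqstr": "ACG"}]},
        {"indels": []},
        {"indels": [{"type": "deletion", "pos": 12, "len": 3, "seqstr": "TTT"}]},
    ],
    "indel_reversed_seqs": ["AAAA", "CCCC", "GGGG"],
}
filtered = filter_by_indel(example_annotation, 3, "insertion", 12, "ACG")
```

The filtered dict would then be written to the indel-specific FASTA file, while the unfiltered sequences go to cluster_seqs.fa as usual.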

Assuming this makes sense to everyone (cc @matsen), I will open separate issues:

  • cft: raise an exception if process_partis.py is run without --only-with-particular-indel and an indel is encountered in the specified seed sequence. The message would tell the user to use --only-with-particular-indel or to specify something to ignore it, like --ignore-seed-indel
  • ecg: add the ability to use this option in ecgtheow and to run revbayes on both the indel-filtered cluster and the unfiltered cluster, as Laura requested
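
A minimal sketch of the proposed seed-indel check is below. The function name, the keyword arguments, and the annotation layout (parallel unique_ids and indelfos lists, each indelfo holding an indels list) are assumptions for illustration, not existing cft or partis code; the two option names are the ones floated above.

```python
# Hypothetical sketch of the proposed check; not existing cft code.

def check_seed_indel(annotation, seed_id, only_with_particular_indel=False, ignore_seed_indel=False):
    """Raise if the seed sequence has an indel and neither option was given."""
    if only_with_particular_indel or ignore_seed_indel:
        return  # the user has already said how to handle indels
    seed_index = annotation["unique_ids"].index(seed_id)
    if annotation["indelfos"][seed_index]["indels"]:
        raise ValueError(
            "seed sequence %s contains an indel; rerun with "
            "--only-with-particular-indel or --ignore-seed-indel" % seed_id
        )
```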

eharkins self-assigned this May 30, 2019

@psathyrella
Contributor

Yeah, except I think I've changed my mind about how to specify the indel parameters. Maybe this is what Laura was suggesting and I was just being dense, but I think it's probably better to just say "match the indels in this sequence", i.e. specify a uid, rather than having to specify the length/pos/type of the indel.
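
One way to read "match the indels in this sequence" is sketched below: build a signature from each sequence's indels and keep the sequences whose signature equals that of the reference uid. Treating "match" as exact equality of (type, pos, len) sets is an assumption, as are the function names and the annotation layout (parallel unique_ids and indelfos lists).

```python
# Hypothetical sketch; "match" is taken to mean identical (type, pos, len) sets.

def indel_signature(indelfo):
    """Summarize a sequence's indels as a hashable set of (type, pos, len)."""
    return frozenset((i["type"], i["pos"], i["len"]) for i in indelfo["indels"])


def uids_matching_indels_of(annotation, ref_uid):
    """Return the uids whose indels match those of ref_uid (ref_uid included)."""
    signatures = {
        uid: indel_signature(indelfo)
        for uid, indelfo in zip(annotation["unique_ids"], annotation["indelfos"])
    }
    reference = signatures[ref_uid]
    return [uid for uid in annotation["unique_ids"] if signatures[uid] == reference]
```

A looser rule, such as requiring only that the reference indels be a subset of each sequence's indels, would change the comparison from == to <=.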

@lauradoepker

@eharkins I'd like the filtered seqs outfile to be named a little more explicitly, something like indel_filtered_cluster_seqs.fa. Since all sequences in EC will be indel_rev, I'm okay with this fact not being reflected in the file name, but if we do add it (to both), it may prevent future forgetfulness on my part about indel reversal.

@metasoarous
Member

A few things here:

  • I'd suggest naming the indel pattern matches with +indel or something (maybe custom? --indel-tag?); indel_rev seems to imply that the indel has been reversed in the sequence, which may or may not be the case, but is beside the point if I understand correctly.
  • Would it be easier to have one flag for filtering in matches of a certain mutation pattern, and one flag for filtering out? This would solve your concern @lauradoepker over what the file is named, as you'd be able to name it whatever you want (I tend to prefer this over pre-determined naming patterns).
  • I'd suggest --filter-indel-pattern-in uid or --filter-indel-pattern-out uid, riffing off @psathyrella's suggestion.
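
The flag names suggested above could be wired up along these lines. This is a sketch under assumptions: making the two filter flags mutually exclusive, defaulting the tag to +indel, and tagging only the filtered-in matches are choices made for the example, and select_uids is a hypothetical helper rather than anything in process_partis.py.

```python
import argparse

# Hypothetical sketch of the suggested flags; not existing process_partis.py options.

def build_parser():
    parser = argparse.ArgumentParser(description="indel pattern filtering (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--filter-indel-pattern-in", metavar="UID",
                       help="keep only sequences whose indels match those of UID")
    group.add_argument("--filter-indel-pattern-out", metavar="UID",
                       help="drop sequences whose indels match those of UID")
    parser.add_argument("--indel-tag", default="+indel",
                        help="suffix appended to the ids of filtered-in matches")
    return parser


def select_uids(all_uids, matching_uids, args):
    """Apply whichever filter flag was given to the list of sequence ids."""
    matching = set(matching_uids)
    if args.filter_indel_pattern_in:
        return [uid + args.indel_tag for uid in all_uids if uid in matching]
    if args.filter_indel_pattern_out:
        return [uid for uid in all_uids if uid not in matching]
    return list(all_uids)
```

With separate in and out flags, the caller chooses the output file name for each run, which sidesteps the naming question.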

@eharkins
Contributor Author

Thanks for the input here. It seems like we are going to spend a little bit more time on thinking about how best to handle the particular indel-ed family Laura is currently dealing with, then we can generalize a solution like this if appropriate. @matsen, @lauradoepker let me know how I can be of help in determining the best way forward with that family.

@lauradoepker

@eharkins it's completely up to you to decide how general to make the code at this point. I want 157.Vk settled as soon as possible, but not at the cost of you having to rewrite all your code later to make it more generalizable. This issue, then, is for you and @matsen to decide.

@eharkins
Contributor Author

e66cf19
