Indel handling in QA255.157-Vk #277
Hi there, I’ve done some work with Duncan’s help to isolate the sequences from QA255.157-Vk’s clonal family that have the 3nt deletion that our seed sequence has. Here is the fasta. Unfortunately, it is an alignment, but I don’t think it’s the best alignment because, even though all seqs are the same length, it has dashes in a few places where I don’t think they should be. Please realign using whatever pipeline you normally use to do so. I’ve renamed the naive sequence “naive with deletion” because I manually deleted the 3nts that are missing in the rest of these sequences (and our seed). I believe you need the naive to be the same length as the other sequences in the list in order to run ecgtheow and linearham, right Amrit? Anyway, can you all help me get this smaller clonal family through our pipelines: CFT, ecgtheow, and linearham? I’m interested to see the updated lineage paths with the deletion in mind. For CFT, you may want to use the original inferred naive sequence that does NOT have the 3nt deletion… but that’s up to you I think. Thank you,
Amrit @dunleavy005: Yes, ecgtheow will need aligned seqs but linearham doesn't b/c it runs partis annotate first.
Duncan @psathyrella: ah, wait, hold on. Sorry, I meant to ask if you wanted partis run with no indels, rather than just the non-indel-reversed sequences from the current run. That's easy to do, and probably makes more sense than having the phylo program guess what the proper alignment is. It should be much more accurate to have the v/d/j-aware code in partis do the alignment.
Laura @lauradoepker: Sounds good to me, Duncan. Can you update us all when you’re finished? And let people know where they can get the new clonal family fasta?
psathyrella [6:56 PM]
the ascii-art annotation logs for the two are here
notes
i then wrote fastas with both the input_seqs and indel_reversed keys for both of those options, they're in each subdir ^ with the appropriate name (edited)
Laura @lauradoepker: I think we should use this fasta to rerun ecgtheow lineage analysis for this VK:
Then ensued lots of chatter about forcing this special case through ecgtheow, including conversations about altering ecgtheow to be able to input a fasta file of clonal family sequences along with their germline info that's usually pulled from partis'
Then we realized that the above fasta is not right - it doesn't even contain the seed (QA255.157-Vk), it's just the largest cluster in the special partition Duncan created. Eli @eharkins then edited ecgtheow to be able to search for the largest cluster of unique sequence ids that DOES contain the seed.
lauranoges [11:10 AM]
psathyrella [11:13 AM]
psathyrella [11:19 AM]
it sounds like the immediate issue is maybe just a matter of ecgtheow grabbing indel_reversed_seq instead of input_seq or vice versa? But honestly I feel like we've gone through so many iterations of confusion on this that we maybe should start over from scratch and define super carefully what we want, it seems like we're just getting the wrong thing over and over and it's kinda scary (edited)
lauranoges [11:20 AM]
psathyrella [11:24 AM]
eharkins [11:39 AM]
This was my interpretation of this. I'm not sure which one of those keys @lauranoges wants for this indel specific study (input seqs since we do want to allow indels?). I'm happy to meet and discuss this, Laura's suggested time of 10am next Wednesday works for me. Laura let me know how I can help in the meantime.
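The cluster selection described above (take the largest cluster of unique sequence ids that does contain the seed) can be sketched as follows. This is only an illustration, not the actual ecgtheow change: the function name and the representation of a partition as a list of lists of ids are assumptions.

```python
def largest_cluster_with_seed(partition, seed_id):
    """Return the largest cluster (list of ids) that contains seed_id.

    `partition` is assumed to be a list of clusters, each a list of sequence ids.
    """
    candidates = [cluster for cluster in partition if seed_id in cluster]
    if not candidates:
        raise ValueError("seed %r not found in any cluster" % seed_id)
    # Compare on the number of distinct ids so repeated ids do not inflate a cluster.
    return max(candidates, key=lambda cluster: len(set(cluster)))


# Toy partition: the biggest cluster overall does not contain the seed.
partition = [
    ["a", "b", "c", "d"],
    ["QA255.157-Vk", "e", "f"],
    ["QA255.157-Vk", "g"],
]
chosen = largest_cluster_with_seed(partition, "QA255.157-Vk")
```

Raising an error when the seed is missing, rather than silently falling back to the largest cluster, is a deliberate choice here: the fasta problem in this thread came from exactly that kind of silent fallback.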
lauranoges [11:43 AM]
At least from my understanding (correct me if I'm remembering wrong) a tldr would be: In one important family, partis calls an shm indel within the v gene. Since this has big consequences for the ancestral antibody, we want to see what the family and lineage would look like if that isn't actually an shm indel. Rerunning partis with shm indels turned off, indeed things align reasonably sensibly to a different v gene. Since the method that determines whether to call shm indels is just match/mismatch and gap open penalties (admittedly fairly optimized), as opposed to a fancy likelihood, it thus seems reasonable to push both with and without through linearham + olmsted to evaluate by eye. But it turns out that in practice "with indels" and "without indels" can mean approximately sixteen different things in the context of this study, so we've spent several weeks confusing each other about what's what and what should be run on (and I had a bug in the original study output). So we called a meeting to reset and make sure we're all on the same page about exactly what the inputs and outputs are for each step.
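To make the "match/mismatch and gap open penalties" point concrete, here is a toy comparison of an ungapped alignment against one that allows a single 3nt deletion. It is not partis code: the scoring values, function names, and sequences are all made up for illustration, and the real alignment is more involved.

```python
def ungapped_score(query, ref, match=1, mismatch=-1):
    """Score position by position; zip stops at the end of the shorter string."""
    return sum(match if q == r else mismatch for q, r in zip(query, ref))


def deletion_score(query, ref, start, length, gap_open, gap_extend=1):
    """Score the query assuming `length` reference bases from `start` are deleted."""
    ref_without_gap = ref[:start] + ref[start + length:]
    aligned = ungapped_score(query, ref_without_gap)
    # Penalties are positive numbers that are subtracted from the score.
    return aligned - gap_open - gap_extend * (length - 1)


def prefers_deletion(query, ref, start, length, gap_open):
    """True if calling the deletion scores higher than forcing an ungapped alignment."""
    return deletion_score(query, ref, start, length, gap_open) > ungapped_score(query, ref)


ref = "ACGTACGGTTCAGT"
query = "ACGTGTTCAGT"  # ref with the 3 bases at positions 4-6 removed

cheap_gap = prefers_deletion(query, ref, start=4, length=3, gap_open=5)
costly_gap = prefers_deletion(query, ref, start=4, length=3, gap_open=20)
```

With a cheap gap the deletion is called, and with an expensive one the mismatches are accepted instead, which is why the same read can end up "with" or "without" an indel depending on fixed penalties rather than a likelihood.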
Another thing we should discuss as regards this study is how to organize things so it's as reproducible as possible, given that (at least for my stuff) it involves adding a bunch of weird steps and options that can't really be incorporated into the usual push-button data workflow. For instance I have the output and extra scripts here
Duncan's @psathyrella summary is only mistaken in that indels cannot be turned off to yield a reasonably-aligned V gene. Indels must be allowed to infer the naive and the cluster, then I insist the logical thing to do thereafter is to subcluster the full length and the 3nt-deleted sequences to run two separate "before deletion" and "after deletion" lineage studies. I fear my NGS data won't have enough information to confidently piece together the order of mutations/events in the lineage evolution, but I figure that this is the right approach nonetheless. We will post a meeting summary next after we sit down in person.
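The subclustering step proposed above could look roughly like this, assuming the family is already aligned and the deletion shows up as gap characters at a known position. The function name, the column index, and the toy alignment are hypothetical; real data would need the actual alignment coordinates of the 3nt deletion.

```python
def split_by_deletion(aligned_seqs, start, length=3, gap="-"):
    """Split an alignment into (full_length, with_deletion) dicts.

    A sequence counts as carrying the deletion if every column in
    [start, start + length) is a gap character.
    """
    full_length, with_deletion = {}, {}
    for name, seq in aligned_seqs.items():
        window = seq[start:start + length]
        target = with_deletion if window == gap * length else full_length
        target[name] = seq
    return full_length, with_deletion


# Toy alignment, not real data.
alignment = {
    "naive": "ACGTACGGTT",
    "QA255.157-Vk": "ACGT---GTT",
    "other": "ACGTACGGTA",
}
before_deletion, after_deletion = split_by_deletion(alignment, start=4)
```

Each of the two groups could then be passed to a separate lineage run.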
Merging into #279
And this #281
lauranoges [Feb 26th at 5:28 PM]
@csmall @matsen I just discovered a pretty big issue today that may be my fault for not being attentive enough but: QA255.157-Vk seed is part of a clonal family that includes many members with a 3bp deletion. The seed also has the deletion. When partis’ clonal family was generated, the *** that represented the deletions was “reversed” in CFT and the clonal family was reported as full length (without any deletions) AND the seed sequence was modified to NOT include that deletion…. without my knowledge. I know we discussed indels at length and decided to “flag” them somewhere in CFT, but I never saw this flag. Where are the flags supposed to reveal themselves to me? On web-CFT? This is my first clonal family with an indel, so I’ll have to pave the way here for this one, but the entire lineage analysis is certainly wrong given that we reversed all the deletions in the clonal family. (Please reply in a thread to this message).
28 replies
csmall [3 months ago]
Do you mean ---? Maybe * gets used somewhere for deletions upstream of cft, but in most programs that's a stop codon. (sorry for evading the main point of your comment but just want to sanity check for starters)
csmall [3 months ago]
CFT does not use indel reversed sequences as input, so I'm not sure how this would have happened.
csmall [3 months ago]
Last time we talked about this, I thought we agreed that the indels would make themselves obvious in the sequence alignments. We could certainly add a column to the table or a symbol option to the scatterplot to make it easier to find these clusters. But for large clusters, there are frequently sequences with indels.
csmall [3 months ago]
They just might not always be the ones included in the minadcl or seed-lineage-pruned trees.
csmall [3 months ago]
Incidentally, are you still using CFTWeb!? We should switch you over to Olmsted!
psathyrella [3 months ago]
@csmall she's referring to the partis ascii-art output where I use blue stars for indels, e.g.
p.png
psathyrella [3 months ago]
also, @lauranoges maybe you missed chris's replies? they weirdly didn't show up at all in my unreads, only saw cause I was looking for this to say something. Arg, which I've now forgotten. sigh
lauranoges [3 months ago]
@csmall, yes just seeing these replies, sorry. No, I do not use CFT, that’s why this is a big issue because nothing was flagged for me. I only used partis output and sent that partis output to ecgtheow (@dunleavy005). The real issue is that the seed sequence itself got indel reversed sometime in this process without my knowledge, so when we sent the clonal family (which I think was indel reversed) to ecgtheow, all sequences appear without the deleted codon. We need to flag this if/when it happens. I cannot have my seed sequence changing without my knowledge. @dunleavy005 @psathyrella do we know where in the process this clonal family got indel reversed? The cluster_seqs.fasta file in my ecgtheow output already has the indel reversed seed sequence for 157.Vk.
lauranoges [3 months ago]
A follow up issue that’s just as problematic but doesn’t have an obvious or easy solution is that we need to figure out how to deal with lineage reconstruction when we have a family with indels. @matsen and I originally agreed that this is way too hard for now… but now it’s a year+ later and the issue has resurfaced.
psathyrella [3 months ago]
partis output has the input and indel-reversed sequence, along with indel info: https://github.com/psathyrella/partis/blob/master/docs/output-formats.md#description-of-keys. Not sure if that helps tracking it down, but that's all i know (edited)
lauranoges [3 months ago]
When I view the partis annotations, the seed has the deletion (blue stars that we mentioned above) so why does the output indel reverse these sequences? I thought partis output would not have the clonal sequences + seed indel reversed?
lauranoges [3 months ago]
This confuses me because @csmall uses partis output for CFT and we don’t reverse the indels… so that means that partis output isn’t reversed? @psathyrella
psathyrella [3 months ago]
sorry, that ^ was supposed to indicate a list: input sequence, indel-reversed sequence, and indel info
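Since the annotation carries both versions of each sequence, the choice between them can be made explicit in whatever code pulls sequences out for downstream tools. The sketch below works on a plain dict and does not call partis itself. The key names `unique_ids`, `input_seqs`, and `indel_reversed_seqs` follow the keys mentioned in this thread and the linked docs, but the exact layout (parallel lists, empty string when nothing was reversed) is an assumption.

```python
def sequences_for_family(annotation, use_indel_reversed):
    """Return {unique_id: sequence}, taking original or indel-reversed sequences."""
    key = "indel_reversed_seqs" if use_indel_reversed else "input_seqs"
    seqs = {}
    for uid, original, chosen in zip(
        annotation["unique_ids"], annotation["input_seqs"], annotation[key]
    ):
        # Assumption: an empty indel-reversed entry means there was no indel,
        # so the original input sequence is used as-is.
        seqs[uid] = chosen or original
    return seqs


# Toy strings, not real partis output.
annotation = {
    "unique_ids": ["QA255.157-Vk", "other"],
    "input_seqs": ["ACGTGTT", "ACGTACGGTT"],
    "indel_reversed_seqs": ["ACGTNNNGTT", ""],
}
original = sequences_for_family(annotation, use_indel_reversed=False)
reversed_ = sequences_for_family(annotation, use_indel_reversed=True)
```

Making `use_indel_reversed` a required argument forces every caller to state which version it wants, instead of inheriting a default.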
lauranoges [3 months ago]
Hmm, if it provides both, then maybe this is an ecgtheow problem where we grabbed the indel reversed sequence file instead of the original. @dunleavy005
dunleavy005 [3 months ago]
we're definitely using indel-reversed seqs for ecgtheow https://github.com/matsengrp/ecgtheow/blob/master/python/parse_partis_data.py#L43
python/parse_partis_data.py:43
" --remove-mutated-invariants --indel-reversed-seqs")
lauranoges [3 months ago]
Okay, I’d forgotten that. What’re your thoughts on lineage reconstruction in families with indels? Possible? Difficult? Large scale fix?
lauranoges [3 months ago]
@dunleavy005 @matsen
dunleavy005 [3 months ago]
definitely not possible using standard techniques and don't know of ways off the top of my head to deal with it, usually people just align seqs and fill in indels, maybe @matsen knows more
csmall [3 months ago]
@lauranoges Do you want Bayesian/Beast style lineage reconstruction accounting for indel reversal? Or are you fine with an ML reconstruction (single tree/reconstruction)? We tried to get better handling of indels in the latter, but none of the (several) tools we looked at ended up working (better than dnaml) for ancestral reconstruction in general, and some were downright buggy, so we stopped tinkering on this front. We could potentially look again at these software and see if any of them have improved, but I'm afraid it's a bit unlikely.
csmall [3 months ago]
lauranoges [3 months ago]
I see. Yes, I wanted Bayesian lineage reconstruction, but you bring up a good point: perhaps I CAN go back to CFTweb or Olmsted for this particular clonal family because it has an indel and use dnaml reconstruction with the indels still in consideration. It won’t be great, but we don’t have any other options for this family since ecgtheow won’t be reliable since it’s using indel-reversed sequences.
lauranoges [3 months ago]
Yes @csmall, it’s upstream: partis to ecgtheow.
psathyrella [3 months ago]
i could be misinterpreting, but it kind of seems like this is less of an issue with including indels in the phylogenetic reconstruction, and more of an issue with how we've set up the interface for viewing the results for each seed. On the face of it, it sounds like it'd be better to flag indels in the phylo output to make sure they're obvious. But I can think of other cases (e.g. that "inferred naive" from an intern over the summer that was just the consensus of two seeds sequences with no hits from the NGS) where it's really dangerous for us to not be looking at the partis output for every family to make sure things look sensible, and i'm not sure that putting some subset of the partis output (like indels) into the phylo output really addresses that.
lauranoges [3 months ago]
right. I think one fix would be to explicitly label the filename (or sequence names?) to include “indel-rev” when there WERE indels to reverse within the clonal family. I understand that all of the families are sent through indel-reversal, but this means nothing for 13/14 of my families. For the 14th family, though, it mattered and I need to be alerted that my seed sequence changed. When I’m working with sequences, I now have QA255.157-Vk with both the original sequence and the reversed sequence… and they have the same name in the partis/ecgtheow output files. <-- that’s the problem.
dunleavy005 [3 months ago]
I'm confused, how is CFT doing DNAML reconstruction using indels? @csmall don't you align the sequences, thus reverting the indels?
csmall [3 months ago]
That was part of my point actually: the dnaml reconstruction itself doesn't really handle indels particularly well, creating weird inferences at indel sites for the internal nodes in the tree. However, the alignment of the tip sequences ends up preserving indels because we align together with the naive.
csmall [3 months ago]
Relevant issues as far as cft is concerned:
• #131
• #149
• #170
matsen [3 months ago]
@lauranoges sorry about the challenges here. It sounds like we'll get a lot of the way there with better labeling.
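The labeling fix suggested earlier in the thread (mark a sequence name with “indel-rev” only when reversal actually changed it) could be sketched like this. The function name and the suffix format are hypothetical, not something partis or ecgtheow currently does.

```python
def label_if_reversed(name, input_seq, reversed_seq, suffix="-indel-rev"):
    """Append a suffix to the name only if indel reversal changed the sequence."""
    changed = bool(reversed_seq) and reversed_seq != input_seq
    return name + suffix if changed else name


# Toy strings, not real data.
seed_label = label_if_reversed("QA255.157-Vk", "ACGTGTT", "ACGTNNNGTT")
plain_label = label_if_reversed("other", "ACGTACGGTT", "ACGTACGGTT")
```

With this, the 13 families without indels keep their names untouched, and only the family where the seed changed gets a visibly different name.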
Phylogenetic inference using indels as informative sequence features is still a very hard problem after decades of work. That's part of what my new grant proposal is about. However, one can do sequence alignment and then the algorithm can treat indel-d areas as having missing data. This is how DNAML works and as we have seen it gives strange results for ASR in the indel-d regions.
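Because reconstructions in indel-d regions are the unreliable part, one cheap check is to list which alignment columns contain gaps before looking at the inferred ancestors. This is a minimal sketch on a toy alignment. The function name is made up and it is not part of DNAML or CFT.

```python
def gap_columns(aligned_seqs, gap="-"):
    """Return sorted column indices where at least one sequence has a gap."""
    columns = set()
    for seq in aligned_seqs.values():
        columns.update(i for i, base in enumerate(seq) if base == gap)
    return sorted(columns)


# Toy alignment, not real data.
alignment = {
    "naive": "ACGTACGGTT",
    "QA255.157-Vk": "ACGT---GTT",
    "other": "ACGTACGGTA",
}
flagged = gap_columns(alignment)
```

Ancestral states at the flagged columns are the ones to treat with extra suspicion.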