-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alignment method selection - Annelida branch #10
Comments
Just reposting my sources from the email: |
I'm wondering if I should email Ulrich Bodenhofer? He's the main author of the msa R package. Maybe I can check if he has any suggestions on alignment parameters for us. |
Hi Matt, Thanks for posting this information. The original paper deals with protein sequences. Although the Clustal link indicates that it now accepts DNA sequences, that doesn't seem to be implements in the R package "msa". The msa manual indicates that for ClustalOmega, only amino acid matrices are available: "BLOSUM30", "BLOSUM40", "BLOSUM50", "BLOSUM65", "BLOSUM80", So, I think our options would be:
I am also currently working with the Polychaeta alignment in MEGA. There are still major gaps, and I am checking into the identities of several quite divergent sequences that are causing the large gaps. Possibly, one more quality filter could be needed, but I am still working on this to try to ascertain if the divergent sequences are likely legitimate biological sequences from polychaetes (e.g. as opposed to pseudogenes or major contaminants). Actually, I have another idea for dealing with these highly divergent sequences.... is it possible to run a preliminary alignment and then a distance matrix. Then, kick out any sequences that don't have any pairwise distances less than 0.15 and then re-run the alignment after kicking out such sequences? We already decided to limit the pairs to more closely related sequences. So, it would be reasonable to kick out highly divergent sequences that are creating the alignment problems. Do you have thoughts on any of these suggestions above? Another option that could make sense would be for me to run a few more phyla and see what issues crops up before we make any final decision. Cheers, |
Hi Matt,
I don't think we need to contact the author of the package. I think the documentation is sufficient for us to understand what the package currently can do and not do. I'd appreciate your thoughts on the feasibility of the above suggestions. Thanks very much.
Cheers,
Sally
…--
Sarah (Sally) J. Adamowicz, Ph.D.
Associate Professor
Biodiversity Institute of Ontario
& Department of Integrative Biology
University of Guelph
50 Stone Road East
Guelph, Ontario
N1G 2W1 Canada
Email: sadamowi@uoguelph.ca
Phone: +1 519 824-4120 ext. 53055
Fax: +1 519 824-5703
Office: Centre for Biodiversity Genomics 113
http://www.dnabarcoding.ca/
http://www.barcodinglife.org/
http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml
________________________________
From: Matthew Orton <notifications@github.com>
Sent: Monday, November 28, 2016 5:06:55 PM
To: m-orton/R-Scripts
Cc: Sarah Adamowicz; Author
Subject: Re: [m-orton/R-Scripts] Alignment method selection - Annelida branch (#10)
I'm wondering if I should email Ulrich Bodenhofer? He's the main author of the msa R package. Maybe I can check if he has any suggestions on alignment parameters for us.
-
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#10 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AV89Ov2NtfBF586VMPFYRE3Jn96vlLlvks5rC1B_gaJpZM4K-LWm>.
|
Hi Sally, I think we should try using Muscle instead since it supports nucleotide alignments. That way we can contain the alignment within the script and it should just be a matter of changing a few commands. I can also look into what parameters to use. I can edit the script to do a preliminary alignment and remove sequences with distances greater than 0.15 before doing a final alignment. I think that is a good way of eliminating the more divergent sequences. I will try and get this done today. Also, thank you for pointing out the issues with ClustalOmega, that was an oversight on my part for not reading through the documentation closely enough. I think this is probably the source of a lot of the alignment problems I was having with the script. Best Regards, |
Hi Matt, OK that sounds great to try Muscle. I think, though, that the algorithm was not the only issue with the Polychaeta. I have tried other alignment approaches in MEGA and still got those very gappy alignments. Some gaps might be expected, as entire amino acid indels can occur in COI, but I"ve never seen a legitimate COI alignment so gappy. Those "divergent" sequences are interesting, but I think those are beyond the scope of the present project. That could actually make an interesting project for a future student... to seek and explore the nature of highly divergent barcodes. (Are these simply poorly sampled components of biodiversity up until now? Are these contaminants? pseudogenes? Is there evidence of highly divergent protein sequence and function in unusual environments?) For us, I think we don't want those sequences, as we are focusing on latitude shifts, not diversification deep in the fossil record. We also want to avoid severe saturation issues for our study (i.e. many substitutions at third codon positions would mask rate differences if we use highly divergent sequences), and hence we had maximum divergence criteria for our sister pairs. I would be more than happy to hear any other ideas for dealing with this. If you think the two-alignment approach, with an intermediate step for ejecting highly divergent sequences, can be readily implemented, then I am on board for trying that approach. We don't need those sequences anyhow for our pairs, and kicking them out may improve the alignment. With such an approach, it may be more feasible to allow the analysis at class level for the marine phyla, allowing unidentified sequences to be included in pairs if they have genetic neighbours. Thank you very much for giving this your attention. Cheers, |
PS. I just completed a realignment of the Polychaeta alignment2a in MEGA using Muscle after deleting 3 highly divergent sequences. I got a much better alignment. There were still some gaps, but this was at about the expected level for an taxon such as Polychaeta. I think this alignment would be fine for further analysis, especially using pairwise deletion for downstream distance analysis. So, I think that our approach discussed above is likely to be successful in R as well. I am very optimistic that we are nearly there with a robust pipeline that will work across multiple taxa. :-) |
Thats good to hear that it seems to be working better with Muscle! Another contributing factor too is that in order to grab the centroid sequences I was using ClustalOmega so the centroid sequences being chosen may not have been accurate either. I am currently working on the script to make the changes you suggested. |
OK that's great! Good thinking to fix the centroid finder too, so everything is set with reasonable choices. |
Dec. 5th version of the code. Matt has switched to a different package to conduct the sequence alignment. Thanks for finding that. I will inspect the settings chosen and the outputted alignments, and then this issue should be close to resolved. |
Hi Sally, just to add to the email I sent about the script changes and the new Muscle package. The package has a few parameters that can be adjusted. Documentation of muscle algorithm with all parameters can be found here: I think there are a few that may become important to our analysis: gapopen - parameter will determine what level of stringency to assign to gaps in the alignment. We may want to consider changing this depending on the taxa being run. For instance really gappy alignments (like Polychaeta for example) we may want to consider increasing the gap open penalty so it penalizes gaps more heavily. maxiters - parameter will determine how many iterations the muscle algorithm goes through. Apparently the default is 16 but they suggest in the documentation of Muscle that for large alignments, it is better to restrict the number of iterations to 2. They say for large alignments: "If you have a large number of sequences (a few thousand), or they are very long, then the default settings of may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. This is done by the option –maxiters 2, as in the following example." |
forgot to add the example they use for maxiters is for a command line not for the muscle package. In the muscle package it would be muscle(dnastringset, maxiters = 2). |
For completeness of the records for this issue, relevant sections of email from Matt (sent Dec. 5 2016):
http://www.bioconductor.org/packages/release/bioc/vignettes/muscle/inst/doc/muscle-vignette.pdf I tried using muscle with the msa package but R kept crashing on me when I tried to run the command so I searched for an alternative package. I have attached the results of the alignment and the final trimmed alignment of Polychaeta looks much better. There are still a few gaps I noticed but overall much improved.
|
Hi Matt, I am really happy with these solutions. Thank you. I will submit a few comments in the response to the topics you raised. This comment will deal with the gap penalty setting. I would not suggest that we change the default gap penalty setting for now. For Polychaeta, the 2-step alignment process (omitting the sequences with no close relatives) largely solved the alignment issues. yay! While a few gaps remain in the final alignment, I think we want these gaps. For example, I notice in one sequence that there are five Ts near the end that probably should have been 4 Ts. Such base-calling errors are more frequent at the ends of the sequences. If this happens in the middle, likely this would cause a stop codon at some point in the sequence, causing it not to get a BIN. However, when this happens near one of the ends, then even if a few amino acids are changed due to the shift in reading frame, there may not be enough sequence left for a stop codon to crop in. So, what we want is a gap in such a position. By using pairwise deletion in downstream analysis, we would get the correct distances. This issue would be more of a problem if we wanted to analyze the translated sequences as well in the future. If we wanted to do so someday, then we would have to reconsider this issue. We also might be able to scrape the amino acid sequences straight from BOLD. I'm not sure about that, but I know that the amino acid sequences are stored in BOLD. So, presumably there would be a solution. That is beyond the scope presently. Cheers, |
Hi Matt, This thread has gotten really long. I thing that we have solved the original issue we had relating to the alignment. hooray. So, I will close this issue and start new issues for the additional specific decisions that need to be made or tasks that need to be performed. I think it will be easier if we confine each "issue" to a narrow issue rather than a complex series of issues. Great work resolving this and implementing the check for highly divergent sequences. Cheers, |
Hi Matt,
I created this issue for us to discuss the alignment algorithm choice, as I remembered I didn't want to bury discussions in email. So, I will try to use the issue tracker more judiciously! I also got a lot of gaps in Polychaeta alignment, as you mentioned. I am reading up on the different alignment options and will try to see if there is an option that would give us a better alignment. ClustalOmega seems to be only for amino acid sequences. Ideally, we will want to consider the amino acid translation when aligning the nucleotides.
If you have any thoughts to share based upon any other alignment options you have tried, please let me know.
Cheers,
Sally
The text was updated successfully, but these errors were encountered: