merging self pacbio assembly with illumina-based one #22
Comments
Hi Pedro, here are the answers to your questions:
a) Your interpretation of the reference and query sequences in the wrapper is correct. The sequences in the merged assembly that carry reference sequence names are those that resulted from merging. The name of a merged sequence comes from the name of its anchor reference sequence, so the length of such a sequence will not match the length of the sequence with the same name in the reference assembly.
b) It is probably OK to provide scaffolds, but we have not tested this.
c) I have not tested it with an Illumina assembly, but a colleague of mine has, and he found improvements in his assembly after using quickmerge. Have you tried using DBG2OLC to create a hybrid assembly from your PacBio and Illumina reads?
Hello Mahul, I'm back from vacation. Thanks for the answer, it covers all my doubts. About c): we considered running DBG2OLC, but given the priority of other tasks on our server and the comments from other users about its running time, we decided to skip it. Best,
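Hi @mahulchak, thanks for the amazing tool you developed. My question is based on your comment: "The reference sequence names in the merged assembly are those that resulted from merging. The name of a merged sequence comes from the sequence name of the anchor reference sequence. So the lengths of such sequences would not match the length of the sequences with the same name in the reference assembly."
In my case, I found that some (not all) of the sequences in the quickmerge output carry a reference sequence name but have exactly the same length as that reference sequence. Could I interpret that as meaning that some of my query sequences are completely contained within my reference sequences? My quickmerge commit is 3be7287, which I think is the most recent version, FYI. Thank you very much!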
I think your interpretation is correct.
(Sent by email in reply to LIU Xingliang's comment of Sun, Feb 24, 2019, 00:42.)
Hello,
I'm trying to merge two assemblies of the same individual that were produced with different approaches: one is a previously generated Illumina draft assembly based on the very high coverage available; the other is a Canu assembly produced from self-corrected reads and polished with Quiver and Pilon. My PacBio coverage is modest, since after error correction I was only able to use 45% of the data (~30X coverage).
I believe my best draft is the Illumina one because I think it captures a broader portion of the genome. Although it is a little more fragmented than the Canu assembly (~23k scaffolds vs ~18k contigs), the N50 of the Illumina one is much higher (~450 kb vs ~91 kb). Therefore, following your recommendations and my own intuition, I understand that using the Illumina draft as the query (the hybrid assembly positional argument in the quickmerge wrapper) should be the best solution, since the PacBio assembly will help close regions that the short-read assembly didn't capture. I also tried the other approach (PacBio self-assembly as query).
The QUAST output shows an improvement in both cases (file attached), with the best metrics achieved when using the Illumina draft as the query (highest N50, fewest scaffolds, best genome size).
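For readers less familiar with the N50 values compared here, the following is a minimal Python sketch of how N50 can be computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative only; this is not QUAST's code, and QUAST applies its own filters (such as a minimum contig length), so its reported numbers may differ.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy example (hypothetical lengths, not taken from the assemblies above).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

A higher N50 alone does not show that an assembly is more complete, so it is usually read together with the sequence count and total size, as in the comparison above.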
As I understand it, quickmerge mostly outputs sequences from the query genome that were joined by the reference genome, as well as the query sequences that remained unaligned. The reference sequences are not included in the output, and if I want them, I should follow the recommendations in issue #11. However, I see headers from both assemblies in the output FASTA. Furthermore, comparing sequence lengths between the merged file and the original assemblies, I see that the Illumina scaffolds (which I think served as the query) have exactly the same length as in the original draft, while the PacBio-based contigs (the reference) are either longer or the same length as before.
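The length check described above can be sketched in Python roughly as follows. This is only an illustration: the helper names `fasta_lengths` and `compare_lengths` and the toy sequences are hypothetical, and names are matched on the first word of each FASTA header. An "unchanged" result only says the length is identical; it cannot tell whether the sequence was left untouched or whether a sequence from the other assembly was fully contained in it.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_name: length} for an open FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def compare_lengths(original, merged):
    """Label each merged sequence relative to the same-named original."""
    report = {}
    for name, merged_len in merged.items():
        if name not in original:
            report[name] = "new"
        elif merged_len == original[name]:
            report[name] = "unchanged"
        elif merged_len > original[name]:
            report[name] = "extended"
        else:
            report[name] = "shorter"
    return report


# Toy data standing in for the original and merged FASTA files.
original = fasta_lengths(io.StringIO(">ctg1 desc\nACGT\nAC\n>ctg2\nGGGG\n"))
merged = fasta_lengths(io.StringIO(">ctg1\nACGTACGTAA\n>ctg2\nGGGG\n>scf9\nTT\n"))
for name, status in compare_lengths(original, merged).items():
    print(name, status)
```

With real files, the `io.StringIO` objects would be replaced by `open(path)` handles for each assembly and for the merged output.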
My questions are:
a) Given the following command, which sequences will serve as queries?
merge_wrapper.py -pre draftAsQuery -l 1000 illumina.fasta pacbio.fasta
In the alignment summary file, the first column (REF) contains sequences from the PacBio assembly, as I expected, but in the merged FASTA I see sequences from both assemblies, in particular contig extensions of the reference sequences.
b) Is it OK to provide quickmerge with scaffolds (with Ns) instead of contigs?
c) Could you comment on the applicability of the tool to my case, given that I have a full Illumina assembly?
Thanks in advance,
Pedro Barbosa
quickmergeResults.txt