merging self pacbio assembly with illumina-based one #22

PedroBarbosa · 2017-08-03T15:56:20Z

Hello,

I'm trying to merge two assemblies of the same individual built with different approaches: one is a previously generated Illumina draft assembly based on the very high coverage available; the other is a Canu assembly produced from self-corrected reads and polished with Quiver and Pilon. My PacBio coverage is modest, as after error correction I was only able to use 45% of the data (~30X coverage).

I believe my best draft is the Illumina one because I think it captures a broader portion of the genome. Although it is a little more fragmented than the Canu assembly (~23k scaffolds vs ~18k contigs), the N50 of the Illumina one is much higher (~450 kb vs ~91 kb). Therefore, following your recommendations and my own judgment, I understand that using the Illumina draft as the query (in the quickmerge wrapper, the hybrid assembly positional argument) should be the best solution (the PacBio assembly will help close regions that the short-read assembly didn't capture), although I also tried the other approach (PacBio self-assembly as query).

The QUAST output shows an improvement in both cases (file attached), with the best metrics achieved when using the Illumina draft as the query (best N50, fewer scaffolds, best genome size).
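For readers less familiar with the metric compared here, N50 is the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the attached results.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the total
        # defines the N50.
        if running >= half_total:
            return length
    return 0


# Toy lengths for illustration only, not from the real assemblies.
print(n50([100, 80, 50, 30, 20]))
```

Because N50 depends only on the length distribution, an assembly with more sequences can still have a higher N50, as in the Illumina vs Canu comparison above.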

As I understand it, quickmerge mostly outputs sequences from the query genome that were joined by the reference genome, as well as the query sequences that remained unaligned. The reference sequences are not included in the output, and if I want them, I should follow the recommendations in issue #11. However, I observe headers from both assemblies in the output FASTA. Furthermore, comparing sequence lengths between the merged file and the original assemblies, I see that the Illumina scaffolds (which I think served as the query) have exactly the same length as in the original draft, while the PacBio-based contigs (the reference) are either longer than or the same length as before.
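The length check described above could be scripted roughly as follows. This is a minimal sketch, not part of quickmerge: the helper names `fasta_lengths` and `compare_lengths` are hypothetical, and the sequences are toy data standing in for the original and merged FASTA files.

```python
import io


def fasta_lengths(handle):
    """Map each FASTA header (first word after '>') to its sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def compare_lengths(original, merged):
    """For headers present in both files, report whether the length changed."""
    report = {}
    for name, merged_len in merged.items():
        if name in original:
            status = "same" if merged_len == original[name] else "changed"
            report[name] = (original[name], merged_len, status)
    return report


# Toy records; with real data, pass open("some_assembly.fasta") instead.
original = fasta_lengths(io.StringIO(">ctg1\nACGT\n>ctg2\nACGTAC\n"))
merged = fasta_lengths(io.StringIO(">ctg1\nACGTACGTAC\n>ctg2\nACGTAC\n"))
for name, row in compare_lengths(original, merged).items():
    print(name, *row)
```

Running the comparison once against each input assembly separates headers whose sequences were extended from those that were carried over unchanged.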

My questions are:
a) Given the following command, which sequences will serve as queries?
merge_wrapper.py -pre draftAsQuery -l 1000 illumina.fasta pacbio.fasta
In the alignment summary file I see sequences from the PacBio assembly in the first column (REF), as I expected, but in the merged FASTA I see sequences from both, particularly contig extensions in the reference sequences.

b) Is it OK to provide quickmerge with scaffolds (with Ns) instead of contigs?

c) Could you comment on the applicability of the tool, given my case of having a full Illumina assembly?

Thanks in advance,
Pedro Barbosa
quickmergeResults.txt


mahulchak · 2017-08-04T22:46:03Z

Hi Pedro,

Here are the answers to your questions:

a) Your interpretation of the reference and query sequences in the wrapper is correct. The reference sequence names in the merged assembly are those that resulted from merging. The name of a merged sequence comes from the name of the anchor reference sequence, so the lengths of such sequences will not match the lengths of the sequences with the same name in the reference assembly.
E.g., if ref1-q5-ref3-q7 is the chain that will be merged, where ref3 is the anchor sequence, the merged sequence will be called ref3 in the merged assembly.
The query sequences that do not participate in merging remain unchanged.
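The naming rule in answer a) can be illustrated with a toy model. This is only a sketch of the naming behaviour as described in this thread, not quickmerge's actual implementation: the function `merged_output_names` and its inputs are hypothetical, and it deliberately says nothing about reference sequences that fall outside any chain.

```python
def merged_output_names(chains, query_names):
    """Return the names expected in the merged output under the anchor rule.

    chains: list of (members, anchor) tuples, where members is the ordered
            list of sequence names in a chain and anchor is one of them.
    query_names: names of all sequences in the query assembly.
    """
    for members, anchor in chains:
        if anchor not in members:
            raise ValueError(f"anchor {anchor!r} is not part of its chain")
    # Every sequence that takes part in a chain is absorbed into it.
    used = {name for members, _ in chains for name in members}
    # Each merged chain is named after its anchor reference sequence.
    names = [anchor for _, anchor in chains]
    # Query sequences that took no part in merging keep their own names.
    names += [q for q in query_names if q not in used]
    return names


chains = [(["ref1", "q5", "ref3", "q7"], "ref3")]
queries = ["q5", "q6", "q7"]
print(merged_output_names(chains, queries))
```

In this toy case the merged record named ref3 spans material from ref1, q5, ref3 and q7, which is why its length differs from that of ref3 in the reference assembly, while q6 is carried over as is.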

b) It is probably OK to provide scaffolds, but we have not tested this.

c) I have not tested it with an Illumina assembly, but a colleague of mine has, and he found improvements in his assembly after using quickmerge. Have you tried using DBG2OLC to create a hybrid assembly with your PacBio and Illumina reads?
Please let me know if I missed anything or if you have any other question.
Mahul

PedroBarbosa · 2017-08-16T10:41:42Z

Hello Mahul,

I'm back from vacation. Thanks for the answer, which covers all my doubts.

About c): we considered running DBG2OLC, but given the priority of other tasks on our server and other users' comments about its running time, we decided to skip it.

Best,
Pedro

liu-xingliang · 2019-02-24T08:42:16Z

Hi @mahulchak ,

Thanks for the amazing tool you developed.

Based on your comments

The reference sequence names in the merged assembly are those that resulted from merging. The name of a merged sequence comes from the sequence name of the anchor reference sequence. So the lengths of such sequences would not match the length of the sequences with the same name in the reference assembly.

In my case, I found that some (not all) of the sequences in the quickmerge output carry a reference sequence name but have exactly the same length as that reference sequence. Could I interpret this as meaning that some of my query sequences are completely contained within my reference sequences?

My quickmerge commit is 3be7287, which I think is the most recent version. FYI.

Thank you very much!

mahulchak · 2019-02-24T17:23:46Z

I think your interpretation is correct.


PedroBarbosa closed this as completed Aug 16, 2017

EarlyEvol mentioned this issue May 9, 2018

Where have all the buscos gone? #28

Open
