merging self pacbio assembly with illumina-based one #22

Closed
PedroBarbosa opened this issue Aug 3, 2017 · 4 comments

Comments

@PedroBarbosa

Hello,

I'm trying to merge two assemblies of the same individual that were produced with different approaches. One is a previously generated Illumina draft assembly based on the very high coverage available; the other is a Canu assembly produced from self-corrected reads and polished with Quiver and Pilon. My PacBio coverage is modest: after error correction I could use only 45% of the data (~30X coverage).

I believe my best draft is the Illumina one because I think it captures a broader portion of the genome. Although it is slightly more fragmented than the Canu assembly (~23k scaffolds vs ~18k contigs), its N50 is much higher (~450 kb vs ~91 kb). Therefore, following your recommendations and my own judgment, I understand that using the Illumina draft as the query (the hybrid assembly positional argument in the quickmerge wrapper) should be the best solution, since the PacBio assembly will help close regions that the short-read assembly did not capture. I also tried the other approach (PacBio self assembly as query).

The QUAST output shows an improvement in both cases (file attached), with the best metrics achieved when using the Illumina draft as the query (highest N50, fewest scaffolds, best genome size).
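For readers unfamiliar with the N50 metric compared above, here is a minimal sketch of how it is usually computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions, not values from this thread or code from QUAST.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy lengths only (hypothetical, not taken from the assemblies discussed here).
print(n50([100, 80, 50, 30, 20]))
```

Because N50 depends only on the length distribution, a more fragmented assembly can still have the higher N50 if a few of its sequences are very long.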

As I understand it, quickmerge mostly outputs sequences from the query genome that were joined by the reference genome, as well as the query sequences that remained unaligned. The reference sequences are not included in the output, and if I want them, I should follow the recommendations in issue #11. However, in the output FASTA I see headers coming from both assemblies. Furthermore, comparing sequence lengths in the merged file with those in the original assemblies, I see that the Illumina scaffolds (which I think served as the query) have exactly the same length as in the original draft, while the PacBio-based contigs (the reference) are either longer than or the same length as before.

My questions are:
a) Given the following command, which sequences will serve as queries?
merge_wrapper.py -pre draftAsQuery -l 1000 illumina.fasta pacbio.fasta
In the alignment summary file, the first column (REF) contains sequences from the PacBio assembly, as I expected, but in the merged FASTA I see sequences from both assemblies, in particular extended contigs that carry reference sequence names.

b) Is it OK to provide quickmerge with scaffolds (with Ns) instead of contigs?

c) Could you comment on the applicability of the tool, given that in my case I have a full Illumina assembly?

Thanks in advance,
Pedro Barbosa
quickmergeResults.txt

@mahulchak
Owner

Hi Pedro,

Here are the answers to your questions:

a) Your interpretation of the reference and query sequences in the wrapper is correct. The reference sequence names in the merged assembly belong to sequences that resulted from merging. A merged sequence takes its name from the anchor reference sequence, so the length of such a sequence will not match the length of the sequence with the same name in the reference assembly.
For example, if ref1-q5-ref3-q7 is the chain that will be merged, and ref3 is the anchor sequence, the merged sequence will be called ref3 in the merged assembly.
The query sequences that do not participate in merging remain unchanged.
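The length check discussed above (same name, different length) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `fasta_lengths` and `classify_merged` are hypothetical, sequence names are assumed not to collide between the two input assemblies, and the labels are descriptive only; they do not reproduce quickmerge's internal logic.

```python
def fasta_lengths(lines):
    """Map each FASTA record name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def classify_merged(original, merged):
    """Label each merged record by comparing it with the input assemblies.

    Both arguments are name -> length dictionaries.
    """
    labels = {}
    for name, length in merged.items():
        if name not in original:
            labels[name] = "name not in inputs"
        elif original[name] == length:
            labels[name] = "same length"
        else:
            labels[name] = "length changed"
    return labels


# Toy records (hypothetical names and sequences, not data from this thread).
inputs = fasta_lengths([">ref3", "ACGT", "ACGT", ">q9 some description", "ACG"])
merged = fasta_lengths([">ref3", "ACGTACGTACGT", ">q9", "ACG"])
print(classify_merged(inputs, merged))
```

With real data, an open file handle can be passed instead of a list, for example `fasta_lengths(open("pacbio.fasta"))`, and the dictionaries for the two input assemblies can be combined before classification.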

b) It is probably OK to provide scaffolds, but we have not tested this.

c) I have not tested it with an Illumina assembly, but a colleague of mine has, and he found improvements in his assembly after using quickmerge. Have you tried using DBG2OLC to create a hybrid assembly from your PacBio and Illumina reads?
Please let me know if I missed anything or if you have any other questions.
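Since scaffold input is described as untested, it may help to know how many gaps a scaffold contains before merging. The sketch below locates runs of N in a sequence string; the function name `gap_runs` and the `min_len` parameter are hypothetical, and nothing here is part of quickmerge.

```python
import re


def gap_runs(sequence, min_len=1):
    """Return (start, end) spans of runs of N/n at least `min_len` long.

    Spans use Python slice conventions: start inclusive, end exclusive.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [match.span() for match in pattern.finditer(sequence)]


# Toy scaffold (hypothetical sequence, not data from this thread).
scaffold = "ACGTNNNNACGTnnAC"
print(gap_runs(scaffold))             # all N runs
print(gap_runs(scaffold, min_len=3))  # only runs of three or more Ns
```

Raising `min_len` is one way to ignore isolated ambiguous bases and count only scaffolding gaps.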
Mahul

@PedroBarbosa
Author

Hello Mahul,

I'm back from vacation. Thanks for the answer; it covers all my doubts.

About c): we considered running DBG2OLC, but given the priority of other tasks on our server and the comments from other users about its running time, we decided to skip it.

Best,
Pedro

@liu-xingliang

Hi @mahulchak ,

Thanks for the amazing tool you developed.

Based on your comments

The reference sequence names in the merged assembly are those that resulted from merging. The name of a merged sequence comes from the sequence name of the anchor reference sequence. So the lengths of such sequences would not match the length of the sequences with the same name in the reference assembly.

In my case, some (not all) of the sequences in the quickmerge output carry a reference sequence name but have exactly the same length as that reference sequence. Could I interpret this as some of my query sequences being completely contained within my reference sequences?

FYI, my quickmerge commit is 3be7287, which I think is the most recent version.

Thank you very much!

@mahulchak
Owner

mahulchak commented Feb 24, 2019 via email
