Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assemble genome with high heterozygosity using canu #88

Closed
tangerzhang opened this issue Apr 8, 2016 · 10 comments
Closed

assemble genome with high heterozygosity using canu #88

tangerzhang opened this issue Apr 8, 2016 · 10 comments

Comments

@tangerzhang
Copy link

Hi,
I am assembling a plant genome with high heterozygosity and I got 70X raw Pacbio reads.
I tried canu with error rate 0.025, but got a very poor N50.
I am curious that is there any suggestion to improve this assembly?
Thanks for any comments and suggestions.

@skoren
Copy link
Member

skoren commented Apr 8, 2016

The first would be to check you average read length and coverage input and after correction.

Assuming you have sufficient coverage of corrected reads (25X+) and the length distribution is similar before and after correction, you can try to optimize the assembly. The bubbles caused by heterozygosity can be confused with repeats causing unnecessary splitting in the assembly. Generally this means you will end up with a lot of unassembled contigs (the split part). We're working on algorithm improvements but in the meantime, you can run a parameter sweep to optimize the assembly. In the 4-unitigger directory there is a unitigger.sh script. You can take that file and turn it into a loop. For example, I have one that says:

$bin/bogart \
 -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore \
 -O /data/korens/test/nanopore/test/unitigging/asm.ovlStore \
 -T /data/korens/test/nanopore/test/unitigging/asm.tigStore.WORKING \
 -o /data/korens/test/nanopore/test/unitigging/4-unitigger/asm \
 -B 172 \
 -eg 0.05 \
 -eb 0.0625 \
 -em 0.0375 \
 -er 0.05 \
 -threads 4 \
 -M 16 \
 -RS -NS -CS \
 -repeatdetect 6 40 15 \
 > /data/korens/test/nanopore/test/unitigging/4-unitigger/unitigger.err 2>&1 \

So what you’d want to do is (where the max ovl is up to 1/2 average corrected read length)

for ovl in 500 1000 1500 2000 2500 3000; do
   for ee in in 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 ; do
      mkdir 4-unitigger-RS-NS-CS-${ovl}-${ee} && cd 4-unitigger-RS-NS-CS-${ovl}-${ee} && echo `pwd` && $bin/bogart -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore -O /data/korens/test/nanopore/test/unitigging/asm.ovlStore -T asm.tigStore -o asm -B 172 -el ${ovl}  -eg ${ee} -eb 0.0625 -em 0.0375 -er ${ee} -threads 4 -M 16 -RS -NS -CS -repeatdetect 6 40 15 && cd ..

     $bin/tgStoreDump -sizes -s <your genome size> -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore -T 4-unitigger-RS-NS-CS-${ovl}-${ee}/asm.tigStore 1
   done
done

@tangerzhang
Copy link
Author

tangerzhang commented Apr 9, 2016

Hi Sergey,
That's wonderful!
Thanks for your script. I will try it :-)
Xingtan

@tangerzhang
Copy link
Author

tangerzhang commented Apr 17, 2016

Hi Sergey,
I tried all the parameters you suggested in last post.
However, the N50 was only got a little improvement from 120 k to 174 k.
I though that might be the problem of our Pacbio reads.
We sequenced 72 X of a plant genome with reads N50 8,964 and average reads
length 6,375.
After canu correction and trimming, I only got 22 X of the genome with
reads N50 10k and average reads length 8k.
Any suggestion to improve the assembly with the relatively short Pacbio
reads?
Thanks!

@JohnUrban
Copy link

Did you try corOutCoverage=80 in the Canu command? This may help get more than 22x after trimming.

@JohnUrban
Copy link

Also, what is the exact Canu command you used? If you set errorRate=0.025 in the Canu command, it might work out better to remove that parameter and let Canu use its defaults.

@tangerzhang
Copy link
Author

tangerzhang commented Apr 17, 2016

Hi John,
Thanks for your quick response.
Here is my exact Canu command:
canu -assemble -s spec.txt -p asm -d erate25 genomeSize=500m
errorRate=0.025 -pacbio-corrected asm.rimmedReads.fastq

In the spec.txt:
useGrid=1
gridOptions=-t 20000000
ovsMemory=16g
oeaMemory=62g
oeaThreads=12

I did use errorRate=0.025 for initially assembly.

Another question, should I use corOutCoverage=80 in the correction step or
the trimming step?

Thanks!

@JohnUrban
Copy link

JohnUrban commented Apr 17, 2016

For now, I would not break Canu up into steps manually by specifying [-correct | -trim | -assemble]. If you leave those off, it will run the entire pipeline. You also do not need a spec file for Canu - you can just include it all in the command (unless you like it better in the spec file). For example, try:

canu -p asm -d asm \
 -pacbio-raw reads.fastq \
 genomeSize=500m \
 gridOptions="-t 20000000" \
 minReadLength=500 \
 ovsMemory=16g \
 oeaMemory=62g \
 oeaThreads=12 \
 corOutCoverage=80

@tangerzhang
Copy link
Author

tangerzhang commented Apr 17, 2016

Thanks, John!
I will have another try :-)

@yfu
Copy link

yfu commented Jun 12, 2016

I see that -eg in bogart is deprecated in the latest version:

 -eg 0.020   no more than 0.020 fraction (2.0%) error   ** DEPRECATED **

Does parameter sweeping still apply now? If so, what parameters can be tweaked? I am also assembling a genome with high heterozygosity. Thanks!

@brianwalenz
Copy link
Member

I've hopefully addressed the original issue and parameter sweeping in the FAQ.
http://canu.readthedocs.io/en/latest/faq.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants