assemble genome with high heterozygosity using canu #88

tangerzhang · 2016-04-08T14:21:23Z

Hi,
I am assembling a plant genome with high heterozygosity and I got 70X raw Pacbio reads.
I tried canu with error rate 0.025, but got a very poor N50.
I am curious that is there any suggestion to improve this assembly?
Thanks for any comments and suggestions.

skoren · 2016-04-08T17:42:05Z

The first would be to check you average read length and coverage input and after correction.

Assuming you have sufficient coverage of corrected reads (25X+) and the length distribution is similar before and after correction, you can try to optimize the assembly. The bubbles caused by heterozygosity can be confused with repeats causing unnecessary splitting in the assembly. Generally this means you will end up with a lot of unassembled contigs (the split part). We're working on algorithm improvements but in the meantime, you can run a parameter sweep to optimize the assembly. In the 4-unitigger directory there is a unitigger.sh script. You can take that file and turn it into a loop. For example, I have one that says:

$bin/bogart \
 -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore \
 -O /data/korens/test/nanopore/test/unitigging/asm.ovlStore \
 -T /data/korens/test/nanopore/test/unitigging/asm.tigStore.WORKING \
 -o /data/korens/test/nanopore/test/unitigging/4-unitigger/asm \
 -B 172 \
 -eg 0.05 \
 -eb 0.0625 \
 -em 0.0375 \
 -er 0.05 \
 -threads 4 \
 -M 16 \
 -RS -NS -CS \
 -repeatdetect 6 40 15 \
 > /data/korens/test/nanopore/test/unitigging/4-unitigger/unitigger.err 2>&1 \

So what you’d want to do is (where the max ovl is up to 1/2 average corrected read length)

for ovl in 500 1000 1500 2000 2500 3000; do
   for ee in in 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 ; do
      mkdir 4-unitigger-RS-NS-CS-${ovl}-${ee} && cd 4-unitigger-RS-NS-CS-${ovl}-${ee} && echo `pwd` && $bin/bogart -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore -O /data/korens/test/nanopore/test/unitigging/asm.ovlStore -T asm.tigStore -o asm -B 172 -el ${ovl}  -eg ${ee} -eb 0.0625 -em 0.0375 -er ${ee} -threads 4 -M 16 -RS -NS -CS -repeatdetect 6 40 15 && cd ..

     $bin/tgStoreDump -sizes -s <your genome size> -G /data/korens/test/nanopore/test/unitigging/asm.gkpStore -T 4-unitigger-RS-NS-CS-${ovl}-${ee}/asm.tigStore 1
   done
done

tangerzhang · 2016-04-09T02:29:15Z

Hi Sergey,
That's wonderful!
Thanks for your script. I will try it :-)
Xingtan

tangerzhang · 2016-04-17T03:13:54Z

Hi Sergey,
I tried all the parameters you suggested in last post.
However, the N50 was only got a little improvement from 120 k to 174 k.
I though that might be the problem of our Pacbio reads.
We sequenced 72 X of a plant genome with reads N50 8,964 and average reads
length 6,375.
After canu correction and trimming, I only got 22 X of the genome with
reads N50 10k and average reads length 8k.
Any suggestion to improve the assembly with the relatively short Pacbio
reads?
Thanks!

JohnUrban · 2016-04-17T03:19:34Z

Did you try corOutCoverage=80 in the Canu command? This may help get more than 22x after trimming.

JohnUrban · 2016-04-17T03:27:58Z

Also, what is the exact Canu command you used? If you set errorRate=0.025 in the Canu command, it might work out better to remove that parameter and let Canu use its defaults.

tangerzhang · 2016-04-17T03:40:01Z

Hi John,
Thanks for your quick response.
Here is my exact Canu command:
canu -assemble -s spec.txt -p asm -d erate25 genomeSize=500m
errorRate=0.025 -pacbio-corrected asm.rimmedReads.fastq

In the spec.txt:
useGrid=1
gridOptions=-t 20000000
ovsMemory=16g
oeaMemory=62g
oeaThreads=12

I did use errorRate=0.025 for initially assembly.

Another question, should I use corOutCoverage=80 in the correction step or
the trimming step?

Thanks!

JohnUrban · 2016-04-17T04:20:19Z

For now, I would not break Canu up into steps manually by specifying [-correct | -trim | -assemble]. If you leave those off, it will run the entire pipeline. You also do not need a spec file for Canu - you can just include it all in the command (unless you like it better in the spec file). For example, try:

canu -p asm -d asm \
 -pacbio-raw reads.fastq \
 genomeSize=500m \
 gridOptions="-t 20000000" \
 minReadLength=500 \
 ovsMemory=16g \
 oeaMemory=62g \
 oeaThreads=12 \
 corOutCoverage=80

tangerzhang · 2016-04-17T04:23:11Z

Thanks, John!
I will have another try :-)

yfu · 2016-06-12T21:43:30Z

I see that -eg in bogart is deprecated in the latest version:

 -eg 0.020   no more than 0.020 fraction (2.0%) error   ** DEPRECATED **

Does parameter sweeping still apply now? If so, what parameters can be tweaked? I am also assembling a genome with high heterozygosity. Thanks!

brianwalenz · 2016-10-19T00:39:57Z

I've hopefully addressed the original issue and parameter sweeping in the FAQ.
http://canu.readthedocs.io/en/latest/faq.html

JohnUrban mentioned this issue Apr 13, 2016

The difference between tgStoreDump size stats and asm size stats from contig sequences #92

Closed

skoren mentioned this issue Apr 13, 2016

High number of uniq-repeat-dove reads #93

Closed

yfu mentioned this issue Jun 13, 2016

Genomes with high heterozygosity and parameter sweeping #162

Closed

kevlim83 mentioned this issue Sep 2, 2016

bogart parameter sweeping #249

Closed

brianwalenz closed this as completed Oct 19, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assemble genome with high heterozygosity using canu #88

assemble genome with high heterozygosity using canu #88

tangerzhang commented Apr 8, 2016

skoren commented Apr 8, 2016

tangerzhang commented Apr 9, 2016 •

edited by brianwalenz

Loading

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading

JohnUrban commented Apr 17, 2016

JohnUrban commented Apr 17, 2016

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading

JohnUrban commented Apr 17, 2016 •

edited

Loading

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading

yfu commented Jun 12, 2016

brianwalenz commented Oct 19, 2016

assemble genome with high heterozygosity using canu #88

assemble genome with high heterozygosity using canu #88

Comments

tangerzhang commented Apr 8, 2016

skoren commented Apr 8, 2016

tangerzhang commented Apr 9, 2016 • edited by brianwalenz Loading

tangerzhang commented Apr 17, 2016 • edited by brianwalenz Loading

JohnUrban commented Apr 17, 2016

JohnUrban commented Apr 17, 2016

tangerzhang commented Apr 17, 2016 • edited by brianwalenz Loading

JohnUrban commented Apr 17, 2016 • edited Loading

tangerzhang commented Apr 17, 2016 • edited by brianwalenz Loading

yfu commented Jun 12, 2016

brianwalenz commented Oct 19, 2016

tangerzhang commented Apr 9, 2016 •

edited by brianwalenz

Loading

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading

JohnUrban commented Apr 17, 2016 •

edited

Loading

tangerzhang commented Apr 17, 2016 •

edited by brianwalenz

Loading