Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thoughts on combining 1d and 1d2 nanopore reads? #659

Closed
lfaller opened this issue Oct 3, 2017 · 13 comments
Closed

Thoughts on combining 1d and 1d2 nanopore reads? #659

lfaller opened this issue Oct 3, 2017 · 13 comments

Comments

@lfaller
Copy link

lfaller commented Oct 3, 2017

Hi all,

I have 1d and 1d2 nanopore reads for a GC-rich microbial genome. Does anyone have experience combining these data? Is it worth it?

Thanks for any thoughts!

Edited to add:

1d read stats:

Read_Count: 317759
Total_bases: 2387382076
Median_read_length: 7404
Max_read_length: 391955

1d2 read stats:

Read_Count: 13427
Total_bases: 121149916
Median_read_length: 8834
Max_read_length: 47993
@brianwalenz
Copy link
Member

This isn't a large assembly, so run all three variations (1d, 1d2, 1d+1d2).

I expect (based on gut feeling instead of actual data) that the 1d reads will dominate the correction process, so any gain from adding 1d2 will be slight - base call accuracy in the reads might be slightly better. Coverage and read length of the 1d reads could make this the best assembly, but bacteria generally assemble well, so the difference could end up just being base call accuracy, in which case 1d2 reads should help.

In other words, fame and glory is yours to be had! Run the experiment and report the results!

Check the FAQ for 'low coverage settings' which might help 1d2 (and add a fourth assembly to the mix).

@gringer
Copy link

gringer commented Oct 3, 2017

Something else you could try is running just the correction step on the 1D reads, then combining those reads with the 1D^2 reads as nanopore-corrected sequences in a standard assembly.

@lfaller
Copy link
Author

lfaller commented Oct 5, 2017

Thanks for helping out a newbie!

I ran the four assemblies that @brianwalenz described (still setting up @gringer 's suggestion).

Unfortunately, this organism does not have a trust-worthy reference. In the absence of a trust-worthy reference, how would you recommend I evaluate the assemblies?

1d only assembly: 4 contigs, 1195 unassembled sequences, 8 unitigs
1d2 only assembly: 13 contigs, 998 unassembled sequences, 20 unitigs
1d + 1d2 assembly: 4 contigs, 1307 unassembled sequences, 9 unitigs
1d2 only assembly with "correctedErrorRate=0.075": 13 contigs, 991 unassembled sequences, 20 unitigs.

Thanks for any suggestions!

@skoren
Copy link
Member

skoren commented Oct 5, 2017

Yes, that's always tough without a good reference. You could try metaQUAST which will try to find a good reference(s) for your assembly. You could also try BUSCO to check gene completeness. It seems, as @brianwalenz predicted the 1d2 are not really helping the assembly continuity (adding them didn't change the 4 contigs you get from the 1d only asm) but it is possible they improved the consensus which the above analysis should show (by increasing BUSCO genes or improving alignment accuracy).

@gringer
Copy link

gringer commented Oct 6, 2017

What do the contig and unitig genome graphs look like? You can load them up in Bandage via /bandage/location/Bandage load <assembly_base>.contigs.gfa

@lfaller
Copy link
Author

lfaller commented Oct 11, 2017

Thanks for the suggestion, @gringer !

I loaded the two graphs into Bandage for one of the assemblies. I am new to analyzing assembly graphs with Bandage -- what should I look for?

Thanks!

Contigs graph

screen shot 2017-10-11 at 10 12 23 am

Unitig graph

screen shot 2017-10-11 at 10 12 04 am

@lfaller
Copy link
Author

lfaller commented Oct 11, 2017

Actually, here is a comparison of the four assemblies. It looks like both the 1d-only and the 1d+1d2 assembly produced the simplest graphs:

2017_10_11_all_assembly_graphs

@gringer
Copy link

gringer commented Oct 12, 2017

@rrwick is probably a better person for interpreting assembly graphs, but I think you're right that 1D-only and 1D+1D^2 are the least complicated (and therefore probably the closest to the truth).

Neither of the graphs are showing a properly circular connection, which seems a little odd to me. My guess from the unitig graph, showing every unitig connected, is that this is likely to be a single chromosome with a bit of population variation. Were the 1D and 1D^2 from the same DNA extraction, or different samples? If different, I'd go with the 1D assembly.

You could also do a genome comparison with LAST to see how different the assemblies are.

@rrwick
Copy link

rrwick commented Oct 12, 2017

I'm not too good at interpreting Canu assembly graphs because I don't have a deep understanding of Canu's inner logic. That being said, I'm inclined to agree with @gringer.

If you're keen to experiment more and want the best possible assembly, here are a few suggestions that jump to mind (biased toward my own tools, because that's what I know best 😄):

  • Use Filtlong to subsample your reads by quality/length (just follow the example commands). Since you have over 2 Gbp of reads for what I presume is a few-Mbp genome, you could take only the best 25% of the reads and still have deep depth of coverage. I've often found that the resulting smaller-but-higher-quality reads assemble better than the original super-high-depth read set. You could try this on either the 1D read set or the pooled set, but since you have so many more 1D reads, I don't know if adding 1D^2 will much difference.
  • Try Unicycler assemblies on your read sets. When run with only long reads, Unicycler does a fairly simple miniasm + Racon pipeline, so it's certainly not as sophisticated as Canu, but I did add a bit of new logic to help create circularised, non-overlapping assemblies. It may do better or worse than Canu, I'm not sure!
  • Try Unicycler on Canu's corrected reads. I have seen cases where that performs best.

Assuming your bug has a circular chromosome, you're hoping to see something like this:
screen shot 2017-10-13 at 9 30 43 am

Or if your bug has plasmids, maybe something like this:
screen shot 2017-10-13 at 9 31 00 am

Linear chromosome/plasmids are possible in bacterial, in which case you may not get loops like that. But I've found that linear contigs are far more often caused by incomplete assemblies than genuine linear replicons.

Best of luck!
Ryan

@skoren
Copy link
Member

skoren commented Oct 13, 2017

Often what prevents circularization is variation in the "clonal" sample or low identity reads on the ends. How big are these contigs? It looks like the 1d2 reads split the largest contig into two. Are any of them chromosome sized? What's the output in the sizes files in unitig unitigging/4-unitig get folder? You could try manually looking for circulization as in issue #663. You could also try Canu tip, especially if you're using a version before 1.6.

@skoren
Copy link
Member

skoren commented Oct 30, 2017

Closing, hopefully enough info in the comments to pick an assembly.

@skoren skoren closed this as completed Oct 30, 2017
@lfaller
Copy link
Author

lfaller commented Nov 2, 2017

Thanks for wrapping this up, @skoren

I've been slow on this, but I did a more or less scientific comparison using different assemblers and filtering strategies, and I figured I'd post the results here for posterity.

For the following results, I used only the 1d reads. Also, I found out that my sequences came from a bacterium that has a linear chromosome (a strep). So getting a circular chromosome is not a high priority anymore.

I used all default command line options, unless specified in parentheses. I'm using Canu 1.6 (with genomeSize=7.5m), Filtlong v0.1.1, and Unicycler v0.4.2.

For Filtlong, I tried a more lenient approach using the following options:

--min_length 1000 --keep_percent 90 --target_bases 500000000

I also tried a more aggressive approach using these options:

--min_length 1000 --keep_percent 25 --target_bases 500000000

As I mentioned, I don't have a good reference for this species, but based on the results below, unicycler together with the lenient filtering strategy seems to have done pretty well (@rrwick !).

I'm planning on adding some more data to this experiment (other bugs, etc) and I will also try to get my hand on some data where I have a better reference.

Thanks for the thoughtful discussions, everyone, and I very much appreciate all the help!

screen shot 2017-11-02 at 1 49 27 pm

@lfaller
Copy link
Author

lfaller commented Nov 2, 2017

Actually, meant to include unicycler, no filtering ... does a little bit better than unicycler with filtering.

screen shot 2017-11-02 at 3 41 53 pm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants