assembling short references #5

ekg · 2015-11-28T10:36:35Z

I'm curious if miniasm works for the assembly of multiple high-quality sequences. For instance, the GRCh38 ALTs that are being used in the graph challenge in the ga4gh.

So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,

wget https://raw.githubusercontent.com/ekg/vg/master/test/GRCh38_alts/FASTA/HLA/A-3105.fa
minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
miniasm -f reads.fq reads.paf.gz > reads.gfa

reads.gfa is empty.

It looks like the mapping works as expected,

➜  x  minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
[M::mm_idx_gen::0.007*0.59] collected minimizers
[M::mm_idx_gen::0.014*1.47] sorted minimizers
[M::main::0.014*1.47] loaded/built the index for 11 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 16
[M::main] Version: r122
[M::main] CMD: minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa
[M::main] Real time: 0.038 sec; CPU: 0.064 sec

but the graph shrinks dramatically during "containment removal":

➜  x  miniasm -f reads.fq reads.paf.gz > reads.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.000*0.00] read 186 hits; stored 211 hits and 11 sequences (147718 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_flt::0.000*0.00] 111 hits remain after filtering; crude coverage after filtering: 7.41
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_contained::0.000*0.00] 2 sequences and 2 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 2 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: r104
[M::main] CMD: miniasm -f reads.fq reads.paf.gz
[M::main] Real time: 0.001 sec; CPU: 0.000 sec

Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.

Have you tried this kind of assembly with miniasm? Is it possible? If so how should miniasm be parameterized to get it to happen?

I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between between reads is analogous to the same problem between homologous but divergent haplotypes.

The text was updated successfully, but these errors were encountered:

lh3 · 2015-11-28T15:38:20Z

Containment has almost no information on the connective of the graph. Dropping it is a standard procedure. What do you expect from the assembly?

ekg · 2015-11-29T21:16:19Z

These are a few haplotypes across a gene in HLA. I would have expected them
to form a single connected graph. This is what happens when I pairwise
align them. I'll share a result in GFA to clarify.
On Nov 28, 2015 4:38 PM, "Heng Li" notifications@github.com wrote:

Containment has almost no information on the connective of the graph.
Dropping it is a standard procedure. What do you expect from the assembly?

—
Reply to this email directly or view it on GitHub
#5 (comment).

lh3 · 2015-11-29T23:08:53Z

Overlap-based assembly looks at head-to-tail overlaps between reads. Contained reads are dropped. Internal matches (i.e. non overlapping matches) are ignored. If you have n haplotypes in the same region, the assembly graph will have n singleton contigs with no edges, because there are no head-to-tail overlaps.

ekg · 2015-11-30T09:23:02Z

A resolution would be to sample long overlapping reads from the input sequences, so as to ensure the head to tail overlap criteria. If I understand correctly, something else might need to be done to ensure there is not "chew back" at the head and tail of the assembly.

ekg · 2016-05-03T08:53:13Z

The assembly includes approximate overlaps and containments. We'd like to find the small variants in these, rather than assume equality. So I think we need to work from the PAF files.

gringer · 2017-06-29T23:16:31Z

If you don't want a random read to be picked for the assembly path, then it's probably not a good idea to use miniasm. Miniasm is great for scaffolding, but not good for finding variants because it makes no attempt to correct base-calling errors.

ekg · 2017-12-11T13:20:38Z

Coming back to this. In principle we can smooth the assembly graph using vg call. I'll be testing this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assembling short references #5

assembling short references #5

ekg commented Nov 28, 2015

lh3 commented Nov 28, 2015

ekg commented Nov 29, 2015

lh3 commented Nov 29, 2015

ekg commented Nov 30, 2015

ekg commented May 3, 2016

gringer commented Jun 29, 2017

ekg commented Dec 11, 2017

assembling short references #5

assembling short references #5

Comments

ekg commented Nov 28, 2015

lh3 commented Nov 28, 2015

ekg commented Nov 29, 2015

lh3 commented Nov 29, 2015

ekg commented Nov 30, 2015

ekg commented May 3, 2016

gringer commented Jun 29, 2017

ekg commented Dec 11, 2017