Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assembling short references #5

Open
ekg opened this issue Nov 28, 2015 · 7 comments
Open

assembling short references #5

ekg opened this issue Nov 28, 2015 · 7 comments

Comments

@ekg
Copy link

ekg commented Nov 28, 2015

I'm curious if miniasm works for the assembly of multiple high-quality sequences. For instance, the GRCh38 ALTs that are being used in the graph challenge in the ga4gh.

So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,

wget https://raw.githubusercontent.com/ekg/vg/master/test/GRCh38_alts/FASTA/HLA/A-3105.fa
minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
miniasm -f reads.fq reads.paf.gz > reads.gfa

reads.gfa is empty.

It looks like the mapping works as expected,

➜  x  minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
[M::mm_idx_gen::0.007*0.59] collected minimizers
[M::mm_idx_gen::0.014*1.47] sorted minimizers
[M::main::0.014*1.47] loaded/built the index for 11 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 16
[M::main] Version: r122
[M::main] CMD: minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa
[M::main] Real time: 0.038 sec; CPU: 0.064 sec

but the graph shrinks dramatically during "containment removal":

➜  x  miniasm -f reads.fq reads.paf.gz > reads.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.000*0.00] read 186 hits; stored 211 hits and 11 sequences (147718 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_flt::0.000*0.00] 111 hits remain after filtering; crude coverage after filtering: 7.41
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_contained::0.000*0.00] 2 sequences and 2 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 2 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: r104
[M::main] CMD: miniasm -f reads.fq reads.paf.gz
[M::main] Real time: 0.001 sec; CPU: 0.000 sec

Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.

Have you tried this kind of assembly with miniasm? Is it possible? If so how should miniasm be parameterized to get it to happen?

I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between between reads is analogous to the same problem between homologous but divergent haplotypes.

@lh3
Copy link
Owner

lh3 commented Nov 28, 2015

Containment has almost no information on the connective of the graph. Dropping it is a standard procedure. What do you expect from the assembly?

@ekg
Copy link
Author

ekg commented Nov 29, 2015

These are a few haplotypes across a gene in HLA. I would have expected them
to form a single connected graph. This is what happens when I pairwise
align them. I'll share a result in GFA to clarify.
On Nov 28, 2015 4:38 PM, "Heng Li" notifications@github.com wrote:

Containment has almost no information on the connective of the graph.
Dropping it is a standard procedure. What do you expect from the assembly?


Reply to this email directly or view it on GitHub
#5 (comment).

@lh3
Copy link
Owner

lh3 commented Nov 29, 2015

Overlap-based assembly looks at head-to-tail overlaps between reads. Contained reads are dropped. Internal matches (i.e. non overlapping matches) are ignored. If you have n haplotypes in the same region, the assembly graph will have n singleton contigs with no edges, because there are no head-to-tail overlaps.

@ekg
Copy link
Author

ekg commented Nov 30, 2015

A resolution would be to sample long overlapping reads from the input sequences, so as to ensure the head to tail overlap criteria. If I understand correctly, something else might need to be done to ensure there is not "chew back" at the head and tail of the assembly.

@ekg
Copy link
Author

ekg commented May 3, 2016

The assembly includes approximate overlaps and containments. We'd like to find the small variants in these, rather than assume equality. So I think we need to work from the PAF files.

@gringer
Copy link

gringer commented Jun 29, 2017

If you don't want a random read to be picked for the assembly path, then it's probably not a good idea to use miniasm. Miniasm is great for scaffolding, but not good for finding variants because it makes no attempt to correct base-calling errors.

@ekg
Copy link
Author

ekg commented Dec 11, 2017

Coming back to this. In principle we can smooth the assembly graph using vg call. I'll be testing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants