assembling short references #5
Containment carries almost no information about the connectivity of the graph. Dropping it is a standard procedure. What do you expect from the assembly?
These are a few haplotypes across a gene in HLA. I would have expected them
Overlap-based assembly looks at head-to-tail overlaps between reads. Contained reads are dropped. Internal matches (i.e. matches that do not reach the ends of the reads) are ignored. If you have n haplotypes in the same region, the assembly graph will have n singleton contigs with no edges, because there are no head-to-tail overlaps.
A resolution would be to sample long overlapping reads from the input sequences, so as to satisfy the head-to-tail overlap criterion. If I understand correctly, something else might need to be done to ensure there is no "chew back" at the head and tail of the assembly.
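The sampling idea could be sketched as below: each input sequence is cut into fixed-length windows that overlap their neighbours, with an extra window anchored at the last base so both ends of the sequence are covered. The function name `tile_reads` and the `read_len` and `step` parameters are hypothetical, and whether this is enough to avoid trimming at the ends of the assembly is not something this sketch establishes.

```python
def tile_reads(seq, read_len, step):
    """Cut seq into overlapping windows of read_len bases, step bases apart.

    Consecutive windows share at least read_len - step bases. Returns a list
    of (start, subsequence) pairs.
    """
    if step <= 0 or step >= read_len:
        raise ValueError("step must satisfy 0 < step < read_len")
    if len(seq) <= read_len:
        return [(0, seq)]
    starts = list(range(0, len(seq) - read_len + 1, step))
    # Add a final window ending exactly at the last base if none does yet.
    if starts[-1] + read_len < len(seq):
        starts.append(len(seq) - read_len)
    return [(s, seq[s:s + read_len]) for s in starts]


# Toy 105-base sequence, 40-base windows every 20 bases.
toy = "ACGT" * 26 + "A"
for start, read in tile_reads(toy, read_len=40, step=20):
    print(start, len(read))
```

Reads produced this way from one haplotype overlap head to tail by construction, while reads from different haplotypes at the same offset would still be mutual containments.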
The assembly includes approximate overlaps and containments. We'd like to find the small variants in these, rather than assume equality. So I think we need to work from the PAF files.
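As a starting point for working from the PAF files, the sketch below reads the twelve mandatory PAF columns and derives a rough identity from the match count and alignment block length. The names `PafRecord`, `parse_paf_line` and `approx_identity` are hypothetical, the example line is made up, and the ratio is only a coarse proxy: locating the actual small variants would need base-level alignments, which this sketch does not produce.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    n_match: int
    aln_len: int
    mapq: int


def parse_paf_line(line):
    """Parse the 12 mandatory tab-separated PAF columns; extra tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("PAF line has fewer than 12 columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def approx_identity(rec):
    """Matches divided by alignment block length (0.0 for an empty block)."""
    return rec.n_match / rec.aln_len if rec.aln_len else 0.0


# Made-up record for two hypothetical haplotypes, hapA and hapB.
example = "hapA\t5000\t0\t5000\t+\thapB\t5010\t0\t5010\t4900\t5010\t255"
rec = parse_paf_line(example)
print(rec.qname, rec.tname, round(approx_identity(rec), 4))
```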
If you don't want a random read to be picked for the assembly path, then it's probably not a good idea to use miniasm. Miniasm is great for scaffolding, but not good for finding variants because it makes no attempt to correct base-calling errors.
Coming back to this. In principle we can smooth the assembly graph using vg call. I'll be testing this.
I'm curious whether miniasm works for the assembly of multiple high-quality sequences, for instance the GRCh38 ALTs that are being used in the GA4GH graph challenge.
So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,
reads.gfa is empty.
It looks like the mapping works as expected,
but the graph shrinks dramatically during "containment removal":
Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.
Have you tried this kind of assembly with miniasm? Is it possible? If so, how should miniasm be parameterized to get it to happen?
I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between reads is analogous to the same problem between homologous but divergent haplotypes.