Example of the graph representation used in Falcon #8
Comments
Hi, Jason. Can you post the |
Hi, Shaun. The edges will look like this in the "sg_edge_list".
forward edges
dual edges
Got it. I think it will be useful to keep the explicit (although redundant) range of the edge sequences coming from, for example, sg_edge_list columns 3, 4, and 5.
Also, while there is a duality relationship, I think using "+" and "-" is probably confusing. I would prefer to make the CIGAR string optional too. In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.
It is optional:
Redundant but useful information can go in optional fields. I would also like an optional field to describe an estimate of the amount of overlap/distance between two reads. I'd suggest the following:
Such an attribute would be useful both for estimating the amount of overlap of two sequences, as well as representing the estimated distance between two reads that is based on paired-end reads linking the two sequences. ABySS uses the |
I've created issue #9 to discuss this proposal.
@pb-jchin the |
DALIGNER outputs both links. ABySS outputs both links in its GraphViz format. It's very useful for searching and indexing the text graph file. I prefer outputting both links. ABySS supports reading graphs in either format.
@pmelsted Yes, they are necessary to indicate the 5' or 3' end. I just think the operation of reverse complement is not totally identical to "-(-1)" = "+1" mathematically, so using the sign symbols can sometimes cause confusion.
@pmelsted For the CIGAR string, if we want to use GFA for storing raw sequence alignments between two raw PacBio or ONT reads, then it won't be just "M". Even a 1% difference over a 20 kb read is about 200 CIGAR operations. For a 10% difference, it will be ~2000 operations. In such cases, we need to think about whether those alignments will be useful for future operations or whether they are cheap enough to regenerate if necessary. For example, Gene Myers does not keep the full alignment string in the DALIGNER output files, just some "trace points"; otherwise the files would be too big.
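To make the end-based notation concrete, here is a minimal Python sketch of how a dual edge could be derived from node labels of the form "read_id:B" or "read_id:E". It assumes the dual of an edge is obtained by reversing its direction and flipping both ends; that definition, the function names, and the read identifiers are illustrative assumptions, not taken from FALCON's code.

```python
from typing import Tuple


def flip_end(node: str) -> str:
    """Swap the B (5') and E (3') end of a 'read_id:end' node label."""
    read_id, end = node.split(":")
    if end not in ("B", "E"):
        raise ValueError(f"unexpected end label in node {node!r}")
    return f"{read_id}:{'E' if end == 'B' else 'B'}"


def dual_edge(src: str, dst: str) -> Tuple[str, str]:
    """Assumed definition: reverse the edge direction and flip both ends."""
    return flip_end(dst), flip_end(src)


# Hypothetical read identifiers, for illustration only.
forward = ("000000001:E", "000000002:E")
dual = dual_edge(*forward)
print(forward, "<->", dual)
```

The labels stay explicit about which physical end of a read is meant, which is the property the B/E notation is meant to preserve.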
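The arithmetic behind those figures can be sketched in a few lines of Python. This assumes roughly one CIGAR operation per difference, which is a simplification (adjacent differences of the same kind would merge into one operation); the function name is hypothetical.

```python
def estimated_cigar_ops(read_length: int, difference_rate: float) -> int:
    """Rough count of CIGAR operations, assuming about one operation per difference."""
    return round(read_length * difference_rate)


for rate in (0.01, 0.10):
    ops = estimated_cigar_ops(20_000, rate)
    print(f"{rate:.0%} difference over a 20 kb read: ~{ops} operations")
```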
Exonerate uses the |
I agree, Jason. The CIGAR string should definitely be optional for this reason, and is, as it can be |
Regarding CIGAR overlap, my thinking is that it can be implementation specific how detailed they are. For long read technologies, I agree that compressing the info into |
I disagree. I think the CIGAR string should only be used if it's accurate. If it's not accurate, the CIGAR string should be set to |
If we want to use a CIGAR-like string, we need to introduce a new CIGAR label tag for "approximate match". However, you will need two numbers to define the alignment boundaries precisely. I prefer a simple number that is easy to parse; otherwise, there will be some ambiguity about the beginning and the end of the edge sequences.
I have also thought that it would be useful to specify the start location of the match on sequence A and on sequence B as integer columns. DALIGNER, for example, sorts and indexes its alignments by the three-tuple (sequence A, sequence B, position A). It would be useful to have position A as an integer column to make that possible. http://dx.doi.org/10.1007/978-3-662-44753-6_5
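As an illustration of why an integer column helps, here is a small Python sketch that sorts overlap records by the three-tuple (sequence A, sequence B, position A). The record layout, field names, and values are hypothetical and are not part of any GFA proposal.

```python
from typing import List, NamedTuple


class Overlap(NamedTuple):
    seq_a: str
    seq_b: str
    pos_a: int  # start of the match on sequence A
    pos_b: int  # start of the match on sequence B


# Made-up records, for illustration only.
overlaps: List[Overlap] = [
    Overlap("read2", "read5", 1000, 0),
    Overlap("read1", "read3", 250, 0),
    Overlap("read1", "read3", 40, 120),
    Overlap("read1", "read2", 900, 0),
]

# Sort by (sequence A, sequence B, position A).
by_position = sorted(overlaps, key=lambda o: (o.seq_a, o.seq_b, o.pos_a))
for o in by_position:
    print(o.seq_a, o.seq_b, o.pos_a, o.pos_b)
```

Because pos_a is an integer, 40 sorts before 250; if the position were only available as text inside another field, a plain string sort would order "1000" before "250".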
The containment |
see https://github.com/PacificBiosciences/FALCON/wiki/Convert-FALCON-assembly-graph-to-GFA-format, |
That's fantastic, Jason! |
Well, Bandage can technically do graphs of that size... but it's slow, takes a lot of RAM, and generally not the most pleasant experience. Usually the best thing to do is reduce the |
@pb-jchin, what do the GFA link overlaps look like in the FALCON graph? In particular, are they simple (exact match overlaps) or more complex CIGARs? I ask because I see the graph you just posted on Twitter has long stretches of simple paths. You could try using If the overlaps are simple, then this should have no real loss of information. But if the overlaps have complex CIGARs, then merging them would keep only one version of each overlapping sequence. And I haven't tested Bandage much with complex CIGAR edges, so tread carefully if you go there!
@rrwick, thanks. I have not tried merging nodes. I have to output a "faked" CIGAR string. The overlaps are long (as they should be in a long-read assembler). I will try it out. Also, I agree that visualizing a whole human assembly graph without proper context is not too useful beyond generating a sometimes inspiring pretty picture. There is a lot of work to do to provide the right context to make assembly graph visualization useful for end users.
@rrwick it certainly looks better with the "node merging". Thanks. |
Very nice! Another thing that's often worth changing for large graphs is the node length. You can go to Side note: I changed the node length logic (for the better, I hope!) in the most recent version, so make sure you're using v0.8.0, or else that setting will be different. |
Just for reference: the basic graph I use in the FALCON assembler (https://github.com/PacificBiosciences/FALCON/) is very straightforward. I decouple the graph information from the sequence information. The sequences used in the graph are just referenced from the text file sg_edges_list. Here is a simple example and a brief description:

The first two columns indicate the in and out nodes of the edge. The node notation contains two fields separated by ":". The first field is the read identifier. The second field is either B or E: B is the 5' end of the read and E is the 3' end of the read. The next three fields indicate the sequence corresponding to the edge. In this example, the edge in the first line contains the sequence from read 000007817, bases [10841, 28901). If the second coordinate is smaller than the first one, the corresponding sequence is reverse complemented. The next two columns are the number of overlapping bases and the overlap identity. The final column is the classification. Currently, there are 4 different types: G, TR, R, and S. An edge with type "G" is used in the final string graph. "TR" means the edge is transitively reducible. "R" means the edge was removed during local repeat resolution, and "S" means the edge is likely a "spur", connected at only one end.
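The column layout described above can be read with a short parser. The following is a minimal Python sketch, not FALCON's own code: it assumes eight whitespace-separated columns in the order given in the description, and the example line is made up for illustration (only read 000007817 and the range [10841, 28901) come from the description; the other read identifier, the overlap length, and the identity value are invented).

```python
from typing import NamedTuple, Tuple

# Edge classifications listed in the description.
EDGE_TYPES = {"G", "TR", "R", "S"}


class SgEdge(NamedTuple):
    src: str          # in node, "read_id:B" or "read_id:E"
    dst: str          # out node, same notation
    seq_read: str     # read that supplies the edge sequence
    seq_start: int    # first coordinate of the sequence range
    seq_end: int      # second coordinate of the sequence range
    overlap_len: int  # number of overlapping bases
    identity: float   # overlap identity
    edge_type: str    # G, TR, R, or S

    @property
    def is_reverse_complemented(self) -> bool:
        # A second coordinate smaller than the first marks a reverse complement.
        return self.seq_end < self.seq_start

    @property
    def seq_length(self) -> int:
        return abs(self.seq_end - self.seq_start)


def parse_node(node: str) -> Tuple[str, str]:
    """Split 'read_id:end' into (read_id, end), where end is B (5') or E (3')."""
    read_id, end = node.split(":")
    if end not in ("B", "E"):
        raise ValueError(f"unexpected end label in node {node!r}")
    return read_id, end


def parse_sg_edge(line: str) -> SgEdge:
    """Parse one line, assuming the eight-column layout described above."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    src, dst, seq_read, start, end, overlap_len, identity, edge_type = fields
    parse_node(src)  # validate the node notation
    parse_node(dst)
    if edge_type not in EDGE_TYPES:
        raise ValueError(f"unknown edge type {edge_type!r}")
    return SgEdge(src, dst, seq_read, int(start), int(end),
                  int(overlap_len), float(identity), edge_type)


# Hypothetical line, for illustration only.
EXAMPLE_LINE = "000017363:B 000007817:E 000007817 10841 28901 10841 99.52 G"
edge = parse_sg_edge(EXAMPLE_LINE)
print(edge.seq_read, edge.seq_length, edge.is_reverse_complemented, edge.edge_type)
```

Keeping the graph columns separate from the sequences, as described, means a parser like this never has to load read data; the three sequence columns are enough to fetch the edge sequence later.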
It won't be too hard to convert such information, which labels sequence ends rather than the segments themselves, to @lh3's original proposal for GFA.
Here is an example comparing the different graphs using three reads:
https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg