Example of the graph representation used in Falcon #8

pb-jchin · 2015-08-01T20:30:00Z

Just for a reference. The basic graph I use in the FALCON assembler (https://github.com/PacificBiosciences/FALCON/) is very straigh forward. I decouple the graph information from the sequences information. The sequences used in the graph is just referenced from the text file sg_edges_list. Here is a simple example and some brief description:

$ head -5 sg_edges_list
000017363:B 000007817:E 000007817 10841 28901 10841 99.52 TR
000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR
000006813:B 000000681:E 000000681 7609 23795 7616 99.72 TR
000002258:E 000002505:B 000002505 5850 0 17215 99.62 TR
000013449:B 000012565:B 000012565 3317 0 20570 99.72 G

The first two columns indicates the in and out node of the edge. The node notation contain two files operated by :. The first field is the read identifier. The second field is either B or E. B is the 5' end of the read and E is the 3' end of the reads. The next three field indicates the corresponding sequences of the edges. In this example, the edge in the first line contains the sequence from read 000007817 base [10841, 28901). If the second coordinate is smaller than the first one, it means the corresponded sequence is reverse complimented. The next two column are the number of overlapped base and the overlap identity. The final column is the classification. Currently, there are 4 different types G, TR, R, and S. An edge with type "G" is used for the final string graph. A "TR" means the edge is transitive reducible. "R" means the edge is removed during the local repeat resolution and "S" means the edge is likely to be a "spur" which only one ends is connected.

It won't be too hard to convert such information @lh3's original proposal for GFA which label's sequnces ends rather than the seqments themselves.

Here is an example to compare the different graphs using three reads as an example:
https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg

The text was updated successfully, but these errors were encountered:

sjackman · 2015-08-01T23:00:43Z

Hi, Jason. Can you post the sg_edges_list for the figure ThreeReads_Summary.svg?

pb-jchin · 2015-08-01T23:14:10Z

forward edges

Hi, Shaun,

The edges will be look this in the "sg_edge_list".

read0:E read1:B read1 2000 0 2000 100 G
read1:B read2:E read2 3000 5500 3000 100 G

dual edges

read2:E read1:B  read1 2500 3500 3000 100 G
read1:B read0:E  read0 2500 0 2000 100 G

sjackman · 2015-08-02T17:35:36Z

Hi, Jason. The GFA overlap graph of these three reads would look like so:

S read0 * LN:i:5500
S read1 * LN:i:4000
S read2 * LN:i:5500
L read0 + read1 - 2000M
L read1 - read2 + 3000M
L read0 + read2 + 1000M

The dual edges:

L read1 + read 0 - 2000M
L read2 - read 1 + 3000M
L read2 - read 0 - 1000M

pb-jchin · 2015-08-02T17:40:36Z

go it, I think it will be useful keep the explictly (although redundant) range of the edge sequences comming from, for example, sg_edge_list col 3,4,5.

pb-jchin · 2015-08-02T17:47:21Z

Also, while there is a duality relationship, I think using "+" and "-" is probably confusing. I would prefer to make CIGAR string optional too. In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.

sjackman · 2015-08-03T01:04:59Z

I would prefer to make CIGAR string optional too.

It is optional:
"When the CIGAR string is *, the nature of the overlap is not specified."
https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#link-line

go it, I think it will be useful keep the explictly (although redundant) range of the edge sequences comming from, for example, sg_edge_list col 3,4,5.
In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.

Redundant but useful information can go in optional fields. I would also like an optional field to describe an estimate of the amount of overlap/distance between two reads. I'd suggest the following:

DI:i is an estimate of the distance between the end of sequence 1 and the start of sequence 2. If DI is negative, it is estimated that the two sequences overlap. If DI is positive, it is estimated that there is a gap between them. DE:f is the estimated error of DI.

Such an attribute would be useful both for estimating the amount of overlap of two sequences, as well as representing the estimated distance between two reads that is based on paired-end reads linking the two sequences. ABySS uses the d and e attributes for this purpose in its GraphViz graph.

sjackman · 2015-08-03T01:11:00Z

I've created an issue #9 to discuss this proposal.

pmelsted · 2015-08-03T17:59:47Z

@pb-jchin the + and - are necessary, we haven't tackled yet whether there should be a canonical link of if it's ok if both versions are specified. For now I think we'll keep the CIGAR format, although most of the time its of the form M. Any other information should go in optional tags, such as percent identity or other values.

sjackman · 2015-08-03T18:02:47Z

we haven't tackled yet whether there should be a canonical link of if it's ok if both versions are specified

DALIGNER ouputs both links. ABySS outputs both links in its GraphViz format. It's very useful for searching and index the text graph file. I prefer outputting both links. ABySS supports reading graphs in either format.

pb-jchin · 2015-08-03T18:04:46Z

@pmelsted yes. they are necessary to indicate the 5' or 3' end. I just think the operation of reverse compliment is not totally identitical to "-(-1)" = "+1" mathematically, so using the sign symbols can cause confusion sometimes.

pb-jchin · 2015-08-03T18:09:28Z

@pmelsted For the cigar string, if we want to use GFA for storing raw sequence alignment between two raw reads with PacBio or ONT, then it won't be just "M". Even 1% difference of 20kb read is about 200 CIGAR operations. For 10% difference, it will be ~2000 operation. In such case, we need to think whether those alignments will be useful for the future operation or they can be cheap enought to regenerate if necessary. For example, Gene Myers does keep the alignment string in the daliger output files, but just some "trace points", otherwise the files will be too big.

sjackman · 2015-08-03T18:34:19Z

Exonerate uses the + and - characters for this purpose. I'm used to it, and I do like it in fact.

sjackman · 2015-08-03T18:36:55Z

I agree, Jason. The CIGAR string should definitely be optional for this reason, and is, as it can be *. I've proposed the attributes DI:i and DE:f in #9 to give an estimate and estimated error of the distance between or overlap of two sequences.

pmelsted · 2015-08-04T00:11:27Z

Regarding CIGAR overlap, my thinking is that it can be implementation specific how detailed they are. For long read technologies, I agree that compressing the info into 1000M and overlap statistics is fine. For short read assembly or scaffolding there is value in keeping a more exact overlap (since there are fewer of them and the errors have been mostly ironed out).

sjackman · 2015-08-04T13:46:12Z

For long read technologies, I agree that compressing the info into 1000M and overlap statistics is fine.

I disagree. I think the CIGAR string should only be used if it's accurate. If it's not accurate, the CIGAR string should be set to * and the proposed distance estimate attributes DI and DE should be used instead. See #9

pb-jchin · 2015-08-04T15:10:57Z

If we want to use CIGAR-like string, we need to introduce a new CIGAR label tag for "approximate match". However, you will need two numbers for precisely defining the alignment boundaries. I prefer a simple number for easy to parse, otherswise, there will be some ambiguity the begining and the ending of the sequence of the string of the edges.

sjackman · 2015-08-04T15:33:09Z

I have also thought that it would be useful to specify the start location of the match on sequence A and on sequence B as integer columns. DALIGNER for example sorts and indexes its alignments by the three-tuple (sequence A, sequence B, position A). It would be useful to have position A as in integer column to make that possible.

http://dx.doi.org/10.1007/978-3-662-44753-6_5

Sort ListM lexicographically on a, b, and i where a is most significant and i
least.

sjackman · 2015-08-04T15:34:37Z

The containment C record has such a column, but I think it would be useful as well for the link L record.

pb-jchin · 2016-07-03T00:50:39Z

see https://github.com/PacificBiosciences/FALCON/wiki/Convert-FALCON-assembly-graph-to-GFA-format,
I can generate a full assembly graph in GFA for a diploid arabidopsis or some human assembly too. Any one likes to visualize a graph that have more than 1M nodes and 1M edges?

sjackman · 2016-07-03T16:36:21Z

That's fantastic, Jason!
@rrwick The gauntlet's been thrown! =)

rrwick · 2016-07-03T23:33:37Z

Well, Bandage can technically do graphs of that size... but it's slow, takes a lot of RAM, and generally not the most pleasant experience.
https://github.com/rrwick/Bandage/wiki/Working-with-very-large-graphs

Usually the best thing to do is reduce the Scope from Entire graph to something more restrictive before clicking the Draw graph button.

rrwick · 2016-07-03T23:56:24Z

@pb-jchin, what do the GFA link overlaps look like in FALCON graph? In particular, are they simple (exact match overlaps) or more complex CIGARs?

I ask because I see the graph you just posted on Twitter has long stretches of simple paths. You could try using Edit > Merge all possible nodes in Bandage before you draw the graph. It might take a little while on a big graph, but it can bring the node/edge count way down.

It turns this sort of thing:

Into this:

If the overlaps are simple, then this should have no real loss of information. But if the overlaps have complex CIGARs, then merging them would keep only one version of each overlapping sequence. And I haven't tested Bandage much with complex CIGAR edges, so tread carefully if you go there!

pb-jchin · 2016-07-04T01:23:29Z

@rrwick, thanks. I have not tried the merging nodes. I have to output "faked" CIGAR string. The overlaps are long (as what they should be in a long read assembler.) I will try it out.

Also, I agree visualize whole human assembly graph without proper context is not too useful beside generating sometime inspiring pretty picture. There are a lot of work to do put put the right context to make assembly graph visualization useful for end user.

pb-jchin · 2016-07-04T01:45:30Z

@rrwick it certainly looks better with the "node merging". Thanks.

rrwick · 2016-07-04T01:54:25Z

Very nice! Another thing that's often worth changing for large graphs is the node length. You can go to Preferences (on Mac) or Settings (on Linux) and adjust the Node length per megabase setting. A smaller value will shorten up those long nodes, which may make the visualisation a bit more compact and manageable.

Side note: I changed the node length logic (for the better, I hope!) in the most recent version, so make sure you're using v0.8.0, or else that setting will be different.

stale · 2018-11-15T22:48:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

sjackman mentioned this issue Aug 3, 2015

Add link attributes DI and DE to specify the estimated distance between two sequences #9

Closed

sjackman added the question label Aug 4, 2015

pb-jchin mentioned this issue Sep 10, 2016

Paths in GFA v2 #36

Closed

stale bot added the stale label Nov 15, 2018

stale bot closed this as completed Nov 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example of the graph representation used in Falcon #8

Example of the graph representation used in Falcon #8

pb-jchin commented Aug 1, 2015

sjackman commented Aug 1, 2015

pb-jchin commented Aug 1, 2015

sjackman commented Aug 2, 2015

pb-jchin commented Aug 2, 2015

pb-jchin commented Aug 2, 2015

sjackman commented Aug 3, 2015

sjackman commented Aug 3, 2015

pmelsted commented Aug 3, 2015

sjackman commented Aug 3, 2015

pb-jchin commented Aug 3, 2015

pb-jchin commented Aug 3, 2015

sjackman commented Aug 3, 2015

sjackman commented Aug 3, 2015

pmelsted commented Aug 4, 2015

sjackman commented Aug 4, 2015

pb-jchin commented Aug 4, 2015

sjackman commented Aug 4, 2015

sjackman commented Aug 4, 2015

pb-jchin commented Jul 3, 2016

sjackman commented Jul 3, 2016 •

edited

rrwick commented Jul 3, 2016

rrwick commented Jul 3, 2016

pb-jchin commented Jul 4, 2016

pb-jchin commented Jul 4, 2016

rrwick commented Jul 4, 2016

stale bot commented Nov 15, 2018

Example of the graph representation used in Falcon #8

Example of the graph representation used in Falcon #8

Comments

pb-jchin commented Aug 1, 2015

sjackman commented Aug 1, 2015

pb-jchin commented Aug 1, 2015

sjackman commented Aug 2, 2015

pb-jchin commented Aug 2, 2015

pb-jchin commented Aug 2, 2015

sjackman commented Aug 3, 2015

sjackman commented Aug 3, 2015

pmelsted commented Aug 3, 2015

sjackman commented Aug 3, 2015

pb-jchin commented Aug 3, 2015

pb-jchin commented Aug 3, 2015

sjackman commented Aug 3, 2015

sjackman commented Aug 3, 2015

pmelsted commented Aug 4, 2015

sjackman commented Aug 4, 2015

pb-jchin commented Aug 4, 2015

sjackman commented Aug 4, 2015

sjackman commented Aug 4, 2015

pb-jchin commented Jul 3, 2016

sjackman commented Jul 3, 2016 • edited

rrwick commented Jul 3, 2016

rrwick commented Jul 3, 2016

pb-jchin commented Jul 4, 2016

pb-jchin commented Jul 4, 2016

rrwick commented Jul 4, 2016

stale bot commented Nov 15, 2018

sjackman commented Jul 3, 2016 •

edited