Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of the graph representation used in Falcon #8

Closed
pb-jchin opened this issue Aug 1, 2015 · 26 comments
Closed

Example of the graph representation used in Falcon #8

pb-jchin opened this issue Aug 1, 2015 · 26 comments

Comments

@pb-jchin
Copy link

pb-jchin commented Aug 1, 2015

Just for a reference. The basic graph I use in the FALCON assembler (https://github.com/PacificBiosciences/FALCON/) is very straigh forward. I decouple the graph information from the sequences information. The sequences used in the graph is just referenced from the text file sg_edges_list. Here is a simple example and some brief description:

$ head -5 sg_edges_list
000017363:B 000007817:E 000007817 10841 28901 10841 99.52 TR
000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR
000006813:B 000000681:E 000000681 7609 23795 7616 99.72 TR
000002258:E 000002505:B 000002505 5850 0 17215 99.62 TR
000013449:B 000012565:B 000012565 3317 0 20570 99.72 G

The first two columns indicates the in and out node of the edge. The node notation contain two files operated by :. The first field is the read identifier. The second field is either B or E. B is the 5' end of the read and E is the 3' end of the reads. The next three field indicates the corresponding sequences of the edges. In this example, the edge in the first line contains the sequence from read 000007817 base [10841, 28901). If the second coordinate is smaller than the first one, it means the corresponded sequence is reverse complimented. The next two column are the number of overlapped base and the overlap identity. The final column is the classification. Currently, there are 4 different types G, TR, R, and S. An edge with type "G" is used for the final string graph. A "TR" means the edge is transitive reducible. "R" means the edge is removed during the local repeat resolution and "S" means the edge is likely to be a "spur" which only one ends is connected.

It won't be too hard to convert such information @lh3's original proposal for GFA which label's sequnces ends rather than the seqments themselves.

Here is an example to compare the different graphs using three reads as an example:
https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg

@sjackman
Copy link
Collaborator

sjackman commented Aug 1, 2015

Hi, Jason. Can you post the sg_edges_list for the figure ThreeReads_Summary.svg?

@pb-jchin
Copy link
Author

pb-jchin commented Aug 1, 2015

forward edges

Hi, Shaun,

The edges will be look this in the "sg_edge_list".

read0:E read1:B read1 2000 0 2000 100 G
read1:B read2:E read2 3000 5500 3000 100 G

dual edges

read2:E read1:B  read1 2500 3500 3000 100 G
read1:B read0:E  read0 2500 0 2000 100 G

@sjackman
Copy link
Collaborator

sjackman commented Aug 2, 2015

Hi, Jason. The GFA overlap graph of these three reads would look like so:

S read0 * LN:i:5500
S read1 * LN:i:4000
S read2 * LN:i:5500
L read0 + read1 - 2000M
L read1 - read2 + 3000M
L read0 + read2 + 1000M

The dual edges:

L read1 + read 0 - 2000M
L read2 - read 1 + 3000M
L read2 - read 0 - 1000M

jchin

@pb-jchin
Copy link
Author

pb-jchin commented Aug 2, 2015

go it, I think it will be useful keep the explictly (although redundant) range of the edge sequences comming from, for example, sg_edge_list col 3,4,5.

@pb-jchin
Copy link
Author

pb-jchin commented Aug 2, 2015

Also, while there is a duality relationship, I think using "+" and "-" is probably confusing. I would prefer to make CIGAR string optional too. In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.

@sjackman
Copy link
Collaborator

sjackman commented Aug 3, 2015

I would prefer to make CIGAR string optional too.

It is optional:
"When the CIGAR string is *, the nature of the overlap is not specified."
https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#link-line

go it, I think it will be useful keep the explictly (although redundant) range of the edge sequences comming from, for example, sg_edge_list col 3,4,5.
In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.

Redundant but useful information can go in optional fields. I would also like an optional field to describe an estimate of the amount of overlap/distance between two reads. I'd suggest the following:

  • DI:i is an estimate of the distance between the end of sequence 1 and the start of sequence 2. If DI is negative, it is estimated that the two sequences overlap. If DI is positive, it is estimated that there is a gap between them. DE:f is the estimated error of DI.

Such an attribute would be useful both for estimating the amount of overlap of two sequences, as well as representing the estimated distance between two reads that is based on paired-end reads linking the two sequences. ABySS uses the d and e attributes for this purpose in its GraphViz graph.

@sjackman
Copy link
Collaborator

sjackman commented Aug 3, 2015

I've created an issue #9 to discuss this proposal.

@pmelsted
Copy link
Collaborator

pmelsted commented Aug 3, 2015

@pb-jchin the + and - are necessary, we haven't tackled yet whether there should be a canonical link of if it's ok if both versions are specified. For now I think we'll keep the CIGAR format, although most of the time its of the form M. Any other information should go in optional tags, such as percent identity or other values.

@sjackman
Copy link
Collaborator

sjackman commented Aug 3, 2015

we haven't tackled yet whether there should be a canonical link of if it's ok if both versions are specified

DALIGNER ouputs both links. ABySS outputs both links in its GraphViz format. It's very useful for searching and index the text graph file. I prefer outputting both links. ABySS supports reading graphs in either format.

@pb-jchin
Copy link
Author

pb-jchin commented Aug 3, 2015

@pmelsted yes. they are necessary to indicate the 5' or 3' end. I just think the operation of reverse compliment is not totally identitical to "-(-1)" = "+1" mathematically, so using the sign symbols can cause confusion sometimes.

@pb-jchin
Copy link
Author

pb-jchin commented Aug 3, 2015

@pmelsted For the cigar string, if we want to use GFA for storing raw sequence alignment between two raw reads with PacBio or ONT, then it won't be just "M". Even 1% difference of 20kb read is about 200 CIGAR operations. For 10% difference, it will be ~2000 operation. In such case, we need to think whether those alignments will be useful for the future operation or they can be cheap enought to regenerate if necessary. For example, Gene Myers does keep the alignment string in the daliger output files, but just some "trace points", otherwise the files will be too big.

@sjackman
Copy link
Collaborator

sjackman commented Aug 3, 2015

Exonerate uses the + and - characters for this purpose. I'm used to it, and I do like it in fact.

@sjackman
Copy link
Collaborator

sjackman commented Aug 3, 2015

I agree, Jason. The CIGAR string should definitely be optional for this reason, and is, as it can be *. I've proposed the attributes DI:i and DE:f in #9 to give an estimate and estimated error of the distance between or overlap of two sequences.

@pmelsted
Copy link
Collaborator

pmelsted commented Aug 4, 2015

Regarding CIGAR overlap, my thinking is that it can be implementation specific how detailed they are. For long read technologies, I agree that compressing the info into 1000M and overlap statistics is fine. For short read assembly or scaffolding there is value in keeping a more exact overlap (since there are fewer of them and the errors have been mostly ironed out).

@sjackman
Copy link
Collaborator

sjackman commented Aug 4, 2015

For long read technologies, I agree that compressing the info into 1000M and overlap statistics is fine.

I disagree. I think the CIGAR string should only be used if it's accurate. If it's not accurate, the CIGAR string should be set to * and the proposed distance estimate attributes DI and DE should be used instead. See #9

@pb-jchin
Copy link
Author

pb-jchin commented Aug 4, 2015

If we want to use CIGAR-like string, we need to introduce a new CIGAR label tag for "approximate match". However, you will need two numbers for precisely defining the alignment boundaries. I prefer a simple number for easy to parse, otherswise, there will be some ambiguity the begining and the ending of the sequence of the string of the edges.

@sjackman
Copy link
Collaborator

sjackman commented Aug 4, 2015

I have also thought that it would be useful to specify the start location of the match on sequence A and on sequence B as integer columns. DALIGNER for example sorts and indexes its alignments by the three-tuple (sequence A, sequence B, position A). It would be useful to have position A as in integer column to make that possible.

http://dx.doi.org/10.1007/978-3-662-44753-6_5

  1. Sort ListM lexicographically on a, b, and i where a is most significant and i
    least.

@sjackman
Copy link
Collaborator

sjackman commented Aug 4, 2015

The containment C record has such a column, but I think it would be useful as well for the link L record.

@pb-jchin
Copy link
Author

pb-jchin commented Jul 3, 2016

see https://github.com/PacificBiosciences/FALCON/wiki/Convert-FALCON-assembly-graph-to-GFA-format,
I can generate a full assembly graph in GFA for a diploid arabidopsis or some human assembly too. Any one likes to visualize a graph that have more than 1M nodes and 1M edges?

@sjackman
Copy link
Collaborator

sjackman commented Jul 3, 2016

That's fantastic, Jason!
@rrwick The gauntlet's been thrown! =)

@rrwick
Copy link

rrwick commented Jul 3, 2016

Well, Bandage can technically do graphs of that size... but it's slow, takes a lot of RAM, and generally not the most pleasant experience.
https://github.com/rrwick/Bandage/wiki/Working-with-very-large-graphs

Usually the best thing to do is reduce the Scope from Entire graph to something more restrictive before clicking the Draw graph button.

@rrwick
Copy link

rrwick commented Jul 3, 2016

@pb-jchin, what do the GFA link overlaps look like in FALCON graph? In particular, are they simple (exact match overlaps) or more complex CIGARs?

I ask because I see the graph you just posted on Twitter has long stretches of simple paths. You could try using Edit > Merge all possible nodes in Bandage before you draw the graph. It might take a little while on a big graph, but it can bring the node/edge count way down.

It turns this sort of thing:
screen shot 2016-07-04 at 9 49 56 am

Into this:
screen shot 2016-07-04 at 9 52 39 am

If the overlaps are simple, then this should have no real loss of information. But if the overlaps have complex CIGARs, then merging them would keep only one version of each overlapping sequence. And I haven't tested Bandage much with complex CIGAR edges, so tread carefully if you go there!

@pb-jchin
Copy link
Author

pb-jchin commented Jul 4, 2016

@rrwick, thanks. I have not tried the merging nodes. I have to output "faked" CIGAR string. The overlaps are long (as what they should be in a long read assembler.) I will try it out.

Also, I agree visualize whole human assembly graph without proper context is not too useful beside generating sometime inspiring pretty picture. There are a lot of work to do put put the right context to make assembly graph visualization useful for end user.

@pb-jchin
Copy link
Author

pb-jchin commented Jul 4, 2016

@rrwick it certainly looks better with the "node merging". Thanks.
scr2016-07-03_06-44-23_pm

@rrwick
Copy link

rrwick commented Jul 4, 2016

Very nice! Another thing that's often worth changing for large graphs is the node length. You can go to Preferences (on Mac) or Settings (on Linux) and adjust the Node length per megabase setting. A smaller value will shorten up those long nodes, which may make the visualisation a bit more compact and manageable.

Side note: I changed the node length logic (for the better, I hope!) in the most recent version, so make sure you're using v0.8.0, or else that setting will be different.

@stale
Copy link

stale bot commented Nov 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Nov 15, 2018
@stale stale bot closed this as completed Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants