--dont_collapse braids a deep nested snarl #15

Open
glennhickey opened this issue Nov 9, 2023 · 3 comments

@glennhickey

This is messing with @xchang1's distance index, because it produces a deeply nested snarl structure.

To reproduce, run gfaffix on this graph with and without --dont_collapse:

wget -q http://public.gi.ucsc.edu/~hickey/debug/gfaffix-snarl69/chunk_133493101_133529958_raw.gfa
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fix.gfa --dont_collapse 'CHM13*' > /dev/null
gfaffix chunk_133493101_133529958_raw.gfa -o chunk_133493101_133529958_fixc.gfa > /dev/null

This is what the original graph looks like:
[image: chunk_133493101_133529958_raw]
After gfaffix, part of the bubble is zipped up:
[image: chunk_133493101_133529958_fix]
But if I zoom into the zipped part, it's a really weird "braid" structure:
[image: chunk_133493101_133529958_fix_zoom]
That same part looks like this in _fixc.gfa, where --dont_collapse wasn't used (but the ref path loops back through the zipped part):
[image: chunk_133493101_133529958_fix_collapse_zoom]

The net result is a very deeply nested distance index, which can be checked as follows:

vg stats -b chunk_133493101_133529958_raw.dist | sort -rnk 4 | head -1
vg stats -b chunk_133493101_133529958_fix.dist | sort -rnk 4 | head -1
vg stats -b chunk_133493101_133529958_fixc.dist | sort -rnk 4 | head -1

These show a snarl depth of 3 for the raw graph, 4 for the collapsed graph (_fixc), and 69 for the --dont_collapse graph (_fix).
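
For anyone comparing several graphs at once, here is a rough sketch of a helper that pulls out the largest depth value directly instead of relying on sort and head. It assumes, as the `sort -rnk 4` above implies, that the depth is the fourth whitespace-separated column of the `vg stats -b` output; `max_col4` is a made-up name, not part of vg.

```shell
# max_col4 (hypothetical helper): read lines on stdin and print the largest
# numeric value found in whitespace-separated column 4.
# Assumption: column 4 of `vg stats -b` output is the snarl depth.
# Non-numeric values (e.g. a header) evaluate to 0 and are ignored.
max_col4() {
    awk 'NF >= 4 && $4 + 0 > max { max = $4 + 0 } END { print max + 0 }'
}

# Usage, with the .dist files from this issue (requires vg on the PATH):
#   vg stats -b chunk_133493101_133529958_fix.dist | max_col4
```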

Do you think there is a way of preventing this type of motif? It's true that the fixed graph has 75 fewer bases and 64 fewer nodes (one more edge, though), but it is much more difficult for vg to work with.
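
The base, node and edge counts can be compared with a few lines of awk. This is only a sketch: it assumes GFA 1 files with tab-separated `S` (segment) and `L` (link) records and sequences stored inline in the third column, and `gfa_stats` is a name chosen for this example rather than an existing tool.

```shell
# gfa_stats FILE (hypothetical helper): print "nodes<TAB>bases<TAB>edges".
# Nodes are S lines, bases are the summed lengths of the S-line sequence
# field (column 3), edges are L lines. A sequence given as "*" (not stored
# inline) contributes 0 bases.
gfa_stats() {
    awk -F'\t' '
        $1 == "S" { nodes++; if ($3 != "*") bases += length($3) }
        $1 == "L" { edges++ }
        END { printf "%d\t%d\t%d\n", nodes, bases, edges }
    ' "$1"
}

# Print the counts for each graph from this issue that exists locally.
for f in chunk_133493101_133529958_raw.gfa \
         chunk_133493101_133529958_fix.gfa \
         chunk_133493101_133529958_fixc.gfa; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(gfa_stats "$f")"
    fi
done
```

Subtracting the rows for the raw and fixed graphs gives the differences quoted above.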

@glennhickey
Author

One thing I forgot to mention is that the original graph is acyclic. gfaffix introduces a cycle when run without --dont_collapse:

for i in *.gfa ; do printf '%s\t' "$i" ; vg paths -x "$i" -C -Q CHM | awk '{print $3}'; done
chunk_133493101_133529958_fixc.gfa	undirected-cyclic
chunk_133493101_133529958_fix.gfa	undirected-acyclic
chunk_133493101_133529958_raw.gfa	undirected-acyclic

@danydoerr
Member

  1. Braiding structure. Yes, this behavior is inherent to the de-collapse algorithm. I was hoping that removing edges that are not covered by any path would resolve this issue, but apparently it doesn't.
  2. Cycles introduced by de-collapse. I'm not sure this can be fixed in the current approach.

Neither issue has an obvious fix; both require refining the de-collapse algorithm. I have to think about it before I can give you a better, more definite answer.

@glennhickey
Author

Sure, thanks!! Any help is appreciated.
