How to interpret the color dump file? #28

Closed
jnalanko opened this issue Jun 11, 2024 · 18 comments

Comments

@jnalanko

Hello team Fulgor!

I'm trying to dump the color sets out of a Fulgor index. The dump looks like this:

num_references 4896
num_lists_in_color_set 3607699
color_list_0 7 24 405 1442 1561 2243 2767 3402
color_list_1 4 24 405 2243 3402
color_list_2 5 24 272 405 2243 3402
color_list_3 2 737 1312
...

This seems to list all the distinct color sets, but this is missing the information about which color set corresponds to which unitig. Is it possible to somehow easily extract that information from the index? This would let me verify my Themisto indexes against Fulgor, and vice versa.

By the way, this dump is missing a newline at the end of the file :).
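
For readers who want to load this dump programmatically, here is a minimal Python sketch. It assumes, based only on the sample above, that header lines are `key value` pairs and that the first integer after `color_list_<i>` is the length of the list; the function name `parse_color_dump` is made up for illustration and is not part of Fulgor.

```python
from typing import Dict, List, Tuple


def parse_color_dump(text: str) -> Tuple[Dict[str, int], List[List[int]]]:
    """Parse 'key value' header lines and 'color_list_<i> <size> <ids...>' lines."""
    header: Dict[str, int] = {}
    color_lists: List[List[int]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "...":
            continue  # skip blank lines and the elision marker
        if fields[0].startswith("color_list_"):
            declared_size = int(fields[1])
            colors = [int(x) for x in fields[2:]]
            # Assumption: the first integer is the list length, as in the sample.
            if declared_size != len(colors):
                raise ValueError(
                    f"{fields[0]}: declared {declared_size}, found {len(colors)}"
                )
            color_lists.append(colors)
        else:
            header[fields[0]] = int(fields[1])
    return header, color_lists


sample = """num_references 4896
num_lists_in_color_set 3607699
color_list_0 7 24 405 1442 1561 2243 2767 3402
color_list_1 4 24 405 2243 3402
color_list_2 5 24 272 405 2243 3402
color_list_3 2 737 1312
"""

header, color_lists = parse_color_dump(sample)
print(header["num_references"], len(color_lists))
```

The size check doubles as a cheap sanity test that the interpretation of the first integer is right for a given file.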

@jermp
Owner

jermp commented Jun 11, 2024

Hi Jarno,
your interpretation is correct. This dumps the distinct color classes only, not the (expanded) map unitig -> color.
If you need it, I can easily implement it.

> By the way, this dump is missing a newline at the end of the file :).

Right! That was intentional, but better to keep style consistent. I'll add it. Thanks!

@jnalanko
Author

Would be nice to have, not urgent though! It might benefit others also for interoperability between tools.

@jermp
Owner

jermp commented Jun 11, 2024

Sure!
So I'm thinking about the following format:

num_references [X]
num_unitigs [Y]
num_color_lists [Z]
[unitig1_string] [color_list]
[unitig2_string] [color_list]
...
[unitigY_string] [color_list]

So, one line for each piece of information, with fields separated by a single space.

Please, provide feedback.

Q. Is it better to call num_references, num_documents instead?

Note that unitigs will be output sorted by color list in this format, as this is the way they are stored in Fulgor.

@jnalanko
Author

That looks good to me!

> Q. Is it better to call num_references, num_documents instead?

In my terminology, that would be num_colors. But either of those two are fine.

For reference, in Themisto I have a colored unitig dump command that produces two files:

  1. A fasta file of unitigs, with integer unitig id in the header. Like this:
>14789358964
AAAAAAAAAAAAAAAAAAAAAAAAAAGAGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>95
CTAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>22377361820
TCGCCAGGATTTTTCCGTTGCCATTTCGGATTTTTGGTATTTGCTATACGGCGCAACGCGAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Those ids are actually colexicographic ranks of some k-mer in the unitig, if I remember correctly.

  2. A file listing (unitig id, color list) pairs, like this:
14789358964 3015
95 3434 3757 3927
22377361820 4161 4175

The unitigs are listed in the same order as they come in the fasta file, so the unitig id here is actually a bit redundant. I don't write the lengths of the lists but that is a minor detail.

Anyway, my format should be quite easily comparable to yours.
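
To make the two-file layout concrete, here is a small Python sketch that pairs the unitigs with their color lists by position. It relies only on what is described above (integer ids in the FASTA headers, the same order in both files, no list lengths); the helper names `read_fasta` and `join_unitigs_with_colors` are hypothetical and not part of Themisto.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def join_unitigs_with_colors(
    fasta_lines: Iterable[str], color_lines: Iterable[str]
) -> List[Tuple[int, str, List[int]]]:
    """Pair unitigs and color lists by position, checking that the ids agree."""
    unitigs = list(read_fasta(fasta_lines))
    rows = [line.split() for line in color_lines if line.strip()]
    if len(unitigs) != len(rows):
        raise ValueError("the two files list a different number of unitigs")
    records = []
    for (header, seq), row in zip(unitigs, rows):
        unitig_id = int(header)
        if int(row[0]) != unitig_id:
            raise ValueError(f"id mismatch: {header} vs {row[0]}")
        records.append((unitig_id, seq, [int(c) for c in row[1:]]))
    return records


fasta = """>14789358964
AAAAAAAAAAAAAAAAAAAAAAAAAAGAGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>95
CTAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>22377361820
TCGCCAGGATTTTTCCGTTGCCATTTCGGATTTTTGGTATTTGCTATACGGCGCAACGCGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
""".splitlines()

colors = """14789358964 3015
95 3434 3757 3927
22377361820 4161 4175
""".splitlines()

records = join_unitigs_with_colors(fasta, colors)
print(records[1][0], records[1][2])
```

Since the id is redundant given the shared order, the sketch uses it only as a consistency check.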

@jermp
Owner

jermp commented Jun 11, 2024

I see. Your format makes sense, although I'd prefer to keep everything in one file.
I guess it's easier for Themisto to split things into two files, since unitigs are not stored in color order but in the colex order of some k-mers, as you said.

@jermp
Owner

jermp commented Jun 11, 2024

(I'm reporting here part of our conversation on X, just to keep track of it.)
On second thought, I think a dump organized in three files is actually better.

  1. dump.metadata.txt contains useful metadata, such as num_references, num_unitigs, and num_color_lists. Others? Maybe the list of the original filenames?

  2. dump.unitigs.fa lists the unitigs in fasta format, where the header of a unitig is > x y, where x is the unitig-id and y is the color-id. Note: unitig-ids will always be increasing and exactly those returned by SSHash. Color-ids will be non-decreasing instead.

  3. dump.colors.txt lists all the distinct colors in the format [color-id] [color-size] [color-list].
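
The ordering note in item 2 can be turned into a quick sanity check. This is only a sketch under the assumption that the headers have already been parsed into (unitig-id, color-id) pairs; `check_header_order` is a hypothetical helper, not a Fulgor function.

```python
from typing import Iterable, Tuple


def check_header_order(pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if unitig ids strictly increase and color ids never decrease."""
    prev_unitig = prev_color = None
    for unitig_id, color_id in pairs:
        if prev_unitig is not None:
            if unitig_id <= prev_unitig or color_id < prev_color:
                return False
        prev_unitig, prev_color = unitig_id, color_id
    return True


print(check_header_order([(0, 0), (1, 0), (2, 0), (3, 1)]))
```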

@jermp
Owner

jermp commented Jun 11, 2024

Ok, done as of 9d5901e. Can you check it?

For the small example with the 10 salmonella files shipped with the repo, we can build the three files above as follows.

./fulgor build -l ../test_data/salmonella_10_filenames.txt -o ../test_data/salmonella_10 -k 31 -m 19 -d tmp_dir -g 1 -t 1 --verbose --check
./fulgor dump -i ../test_data/salmonella_10.fur

salmonella_10.metadata.txt:

num_references=10
num_unitigs=86630
num_color_classes=171

salmonella_10.unitigs.fa:

> unitig_id=0 color_id=0
GATTGAGCACCAACTGCGAGAATCAGGTGTTGAAGAGCAAGGGCGTGTGTTTATCGAAAAAGCTATTGAGCAGCCGCTTGATCCACAA
> unitig_id=1 color_id=0
GAAATTTAACGGCTGTTTTTCCGGCCAGATGTTATGTCTGGCTGGTTTTATTGTTTTGATTTTAAAGGAATTTACAGTGAATAAATGGCGTAACCCCACTGGGTGGTTATGTGCGGTAGCTATGCCTTTTG
> unitig_id=2 color_id=0
GCGCTGAACATCAGCGCCTTTCTGCGACAGCTCAATCATGCATTCGCCAATCACGGCAATC
...

salmonella_10.colors.txt:

color_id=0 size=4 0 3 7 8
color_id=1 size=1 8
color_id=2 size=10 0 1 2 3 4 5 6 7 8 9
color_id=3 size=6 1 2 4 5 6 9
...
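
As a rough illustration of how the files fit together, the following Python sketch rebuilds the unitig -> color list map from the headers of the unitigs file and the lines of the colors file. It assumes exactly the `key=value` layout shown in the samples above; the function names are hypothetical, and the sequences in the sample are short placeholders rather than real unitigs.

```python
from typing import Dict, Iterable, List, Tuple


def parse_key_values(tokens: Iterable[str]) -> Dict[str, str]:
    """Turn tokens such as 'color_id=0' into a {'color_id': '0'} dictionary."""
    return dict(tok.split("=", 1) for tok in tokens if "=" in tok)


def parse_colors(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Parse lines of the form 'color_id=<i> size=<n> <ref ids...>'."""
    colors: Dict[int, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "...":
            continue
        kv = parse_key_values(fields[:2])
        members = [int(x) for x in fields[2:]]
        if int(kv["size"]) != len(members):
            raise ValueError(f"size mismatch for color_id={kv['color_id']}")
        colors[int(kv["color_id"])] = members
    return colors


def parse_unitig_headers(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Extract (unitig_id, color_id) from '> unitig_id=<u> color_id=<c>' headers."""
    pairs = []
    for line in lines:
        if line.startswith(">"):
            kv = parse_key_values(line[1:].split())
            pairs.append((int(kv["unitig_id"]), int(kv["color_id"])))
    return pairs


def unitig_to_colors(
    fasta_lines: Iterable[str], color_lines: Iterable[str]
) -> Dict[int, List[int]]:
    """Expand the compact dump into an explicit unitig -> color list map."""
    colors = parse_colors(color_lines)
    return {u: colors[c] for u, c in parse_unitig_headers(fasta_lines)}


# Placeholder sequences; only the headers matter for this sketch.
fasta_sample = [
    "> unitig_id=0 color_id=0", "ACGT",
    "> unitig_id=1 color_id=0", "ACGT",
    "> unitig_id=2 color_id=0", "ACGT",
]
colors_sample = [
    "color_id=0 size=4 0 3 7 8",
    "color_id=1 size=1 8",
    "color_id=2 size=10 0 1 2 3 4 5 6 7 8 9",
    "color_id=3 size=6 1 2 4 5 6 9",
]

mapping = unitig_to_colors(fasta_sample, colors_sample)
print(mapping[2])
```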

@jermp
Owner

jermp commented Jun 15, 2024

@jnalanko, have you had a chance to try it?

@jnalanko
Author

Not yet! I'll try to verify it against my Themisto index this weekend.

@jnalanko
Author

Update: running my verifier now on a big dataset. I could not compare the dumps directly because Themisto outputs both strands, whereas Fulgor outputs only canonical ones, and cyclic unitigs are tricky too. It's not a very optimized verifier, so it might take a day to run.

@jermp
Owner

jermp commented Jun 16, 2024

But what are you trying to verify? Recall that GGCAT does not necessarily output maximal unitigs, so there might be discrepancies in the unitigs. Kmers and their color sets must instead always be the same.

@jnalanko
Author

I'm verifying the color set of each k-mer, which should be the same in both tools, or otherwise there is a bug somewhere.
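
A k-mer-level comparison of this kind could look roughly like the Python sketch below. This is not the actual verifier used here; it is a small in-memory illustration that assumes each tool's dump has been loaded as (unitig sequence, color list) pairs and that a k-mer and its reverse complement are treated as the same k-mer by taking the lexicographically smaller of the two. Cyclic unitigs are not handled, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_colors(
    unitigs: Iterable[Tuple[str, List[int]]], k: int
) -> Dict[str, FrozenSet[int]]:
    """Map every canonical k-mer of every unitig to its color set."""
    table: Dict[str, FrozenSet[int]] = {}
    for seq, colors in unitigs:
        color_set = frozenset(colors)
        for i in range(len(seq) - k + 1):
            key = canonical(seq[i:i + k])
            # A k-mer seen twice (e.g. once per strand) must agree on its colors.
            if table.setdefault(key, color_set) != color_set:
                raise ValueError(f"conflicting color sets for k-mer {key}")
    return table


def same_colored_kmers(a, b, k: int) -> bool:
    """True if both dumps contain the same k-mers with the same color sets."""
    return kmer_colors(a, k) == kmer_colors(b, k)


# Toy data: one dump lists a single strand, the other lists both strands
# and splits the unitig differently; the k-mer content is the same.
single_strand = [("ACGTT", [0, 1])]
both_strands = [("ACGT", [1, 0]), ("CGTT", [0, 1]), ("AACGT", [0, 1])]
print(same_colored_kmers(single_strand, both_strands, 3))
```

Comparing at the k-mer level sidesteps differences in strand output and in how unitigs are split, at the cost of holding every k-mer in memory.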

@jermp
Owner

jermp commented Jun 16, 2024

Oh I see. When building the indexes with ./fulgor build, the user can specify --check. In this case, we compare the color set of each kmer against the colors as returned by GGCAT. Similarly, now assuming Fulgor is correct, we compare kmers and color sets of meta-colored Fulgor against Fulgor.

@jnalanko
Author

Finished! Matches 100% with Themisto color sets! Love to see it.

@jermp
Owner

jermp commented Jun 16, 2024

Good! Weren't you sure? :)

@jnalanko
Author

99% sure :). I've been hit with some really rare and subtle bugs before.

@rob-p
Collaborator

rob-p commented Jun 16, 2024

We've all been there, but great to know!

@jermp
Owner

jermp commented Jun 17, 2024

Alright! Closing this now.
