Num_contigs must be less than 2^32 Aborted (core dumped) #16
Hello @rickbeeloo,

So this is not an error per se, it's just a design choice: the number of contigs is assumed to fit in 32 bits. Actually, there should be no problem in defaulting to a 64-bit type instead. So you can try to comment out this check here https://github.com/jermp/sshash/blob/master/include/builder/util.hpp#L77 and make the changes above.

Since you're trying to build a SSHash dictionary for a very large collection, it would probably also require you to switch to 128-bit hash codes: https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L51.

Let me know. Feel free to also share (perhaps via Box or Google Drive) the unitigs file to index so that I can better assist you.

Cheers,
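(For concreteness, a minimal sketch of the direction of those two changes; the exact identifiers in util.hpp and hash_util.hpp are assumptions and may differ from the current code.)

```cpp
// include/builder/util.hpp -- relax the 32-bit guard on the number of contigs.
// Before (aborts on very large inputs):
//   if (num_contigs >= (1ULL << 32)) throw std::runtime_error("num_contigs must be less than 2^32");
// After: comment the check out and store contig ids in a 64-bit type,
// i.e. replace uint32_t with uint64_t wherever contig ids are kept.

// include/hash_util.hpp -- switch to 128-bit hash codes for huge key sets,
// assuming PTHash's murmurhash2_64 / murmurhash2_128 hashers:
// typedef pthash::murmurhash2_64 base_hasher_type;  // default: 64-bit hashes
typedef pthash::murmurhash2_128 base_hasher_type;    // fewer collisions at this scale
```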
Hey @jermp,

No problem, hope you had a good holiday :) I will share the unitigs file (or also the input genomes themselves?).

Thanks in advance, would be awesome to make this work :)
Hi @rickbeeloo, thanks for the file. Wow, your input collection is huge ;) I'm very curious about the Fulgor performance in this case.
@jermp, It's heterogeneous, that's also the reason for the large number of unitigs haha. Themisto actually crashes on it without an error message. Perhaps for the same reason, but without handling the exception; I didn't check in detail, as I thought that with this many colors Fulgor would probably be better suited for the task. It's probably much easier for you than for me to fix the code. (Btw, will Fulgor detect the index or will it restart GGCAT?)
Ok, that seems reasonable.
To do so, I would need the files anyway. I do not have much time at the moment but I can surely assist you if you're willing to make the changes to the SSHash code.
Good point, actually no -- I did not implement a check, because the construction is meant to be performed once and those files deleted afterwards. For SSHash it is easy: just do not delete the file. Anyway, what I would do first is to forget about Fulgor for a second and try SSHash alone. Make the changes I indicated above to the SSHash codebase (not Fulgor) and try to index the unitigs file with sshash. Thanks,
@jermp Let's do it :)
Cool! No, whatever is inside the file produced by GGCAT is fine. I would then run SSHash, first on a small prefix of the whole file, and check its correctness with the --check option.
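(For intuition, the correctness check amounts to something like the following sketch; this is not the actual SSHash check code, and the lookup callable merely stands in for the dictionary's query method.)

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of what a --check pass verifies: every k-mer of the
// input maps to a distinct id in [0, n), and no id repeats (the dictionary
// is a minimal perfect hash over the k-mers).
template <typename Lookup>
void check_kmers(std::vector<std::string> const& kmers, uint64_t n, Lookup&& lookup) {
    std::vector<bool> seen(n, false);
    for (auto const& kmer : kmers) {
        uint64_t id = lookup(kmer);  // query the dictionary
        assert(id < n);              // the id is in range
        assert(!seen[id]);           // and has not been returned before
        seen[id] = true;
    }
}
```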
That passes on a 200MB sample, although it might still go wrong once it goes beyond 2^32 tigs, but I suppose the 64-bit changes cover that. Will let that index run and come back here. I also saw this line in Fulgor, so we probably also need that, and perhaps more changes to Fulgor?
Yes, testing the correctness on such data will take a long time.
Ah yeah. Needs careful checking. Thanks for looking into this!
Just got an error when running sshash on the whole dataset (it works fine on a small sample):
So we have around 280 billion k-mers. I don't really understand what file it was unable to open; all 1670 temp minimizer files are present. Unless it's something after that, any ideas?
Mmmh, that's interesting, yes. Wow, 280B k-mers is huge...and very fragmented. Perhaps the issue is here: https://github.com/jermp/sshash/blob/master/include/builder/util.hpp#L287. Are you sure you have enough disk space?
That file is indeed absent. I'm running it again with these changes to show the file paths. Then I can check whether I can manually open those handles or whether something weird is indeed going on. Space cannot be the issue, as after writing all the temp minimizer files there is still 44TB of space left. Will update you, probably tomorrow, as this will take ~24 hours.
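(The change to surface the offending path could look something like this sketch; open_or_throw is a hypothetical helper, not the actual SSHash code.)

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: include the path in the exception message, so a
// failing open can be traced back to a concrete file on disk.
std::ifstream open_or_throw(std::string const& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open()) throw std::runtime_error("cannot open file '" + filename + "'");
    return in;
}
```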
Hey @jermp, I added a specific log for all "cannot open file" errors (here), so it can't be any of those raising the error. The only thing that remains is the file opened elsewhere, outside those logged calls.
Hi @rickbeeloo, oh I see. Good catch. It never happened to me before because I never ran it on such a huge instance.
Hey @jermp, I can confirm the fix works:
But I guess that makes sense with this much data. Maybe we could use more than 7 threads for the hashing? Or will this also affect RAM a lot? (It's at 1.5TB RAM at the moment.) The Mac-dBG is looking cool :) hope I can reuse the sshash index haha
Hi @rickbeeloo, I'm concerned anyway by the percentage of k-mers in the skew index: it's above 92%...that's too much. In practical uses of SSHash, the skew index component should contain a few percent of the k-mers.
I used m=15; I could bring that up to 21. What do you think would make the most sense?
I would use m=21 as a first try. It looks like the dataset is very fragmented. By a "partitioned representation" I mean the MPHFs used for the minimizers and the skew index. This is a technical detail: currently, we are using a single monolithic function, which gives better lookup at the expense of construction time. Sorry, it requires a bit of understanding of the inner data structures, but the changes are not difficult to make. You have my whole support if you're willing to try. (It would be easier for me, but I would need your data. Where did you collect the data from, btw?)
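(A back-of-the-envelope way to see why a larger m helps, using the ~280B k-mers figure from above; the real bucket loads depend on the minimizers actually present in the data, not on all 4^m possibilities, so this is only a rough illustration.)

```cpp
#include <cmath>
#include <iostream>

// Rough lower bound on average k-mers per minimizer bucket for m = 15 vs
// m = 21, assuming ~280 billion k-mers spread over at most 4^m minimizers.
int main() {
    const double num_kmers = 280e9;
    for (int m : {15, 21}) {
        double max_minimizers = std::pow(4.0, m);  // 4^m possible minimizers
        std::cout << "m = " << m << ": avg load >= "
                  << num_kmers / max_minimizers << " k-mers/bucket\n";
    }
    // m = 15 -> ~260 k-mers per bucket on average, so many buckets overflow
    // into the skew index; m = 21 -> far more buckets, much lighter loads.
}
```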
Hey @jermp, I added more threads (I saw the limit was 48 for PTHash) and the use of the partitioned_phf. Is there any "rule" to select the optimal number of partitions? Should we adjust the average partition size? It's public data; I will get it on a Google Drive / our server and email you the link to the unitig fasta (might take a few days to arrange that with the sysadmin).
Hi @rickbeeloo,
Excellent, thank you!
That probably requires some help @jermp :)

Minimizers:

```cpp
num_partitions = std::ceil(size / 3e6); // 3 million keys per partition
num_threads = std::thread::hardware_concurrency() > num_partitions
                  ? num_partitions
                  : std::thread::hardware_concurrency();
```

1) Something like that?

Skew index:

2) Should this number of partitions (or, conditionally, line 66) be the same as those used for the MPHF? I assume yes, looking at this code together with this. These constants will now scale those down to that range, whereas we can probably use more threads, like for the minimizers. If so, how would we calculate the number of partitions?
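(For reference, a minimal sketch of how such a configuration could be wired up, assuming a PTHash build_configuration with fields c, alpha, num_partitions, and num_threads; the helper name and the 3M-keys-per-partition target are just the values discussed above, not settled defaults.)

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include "pthash.hpp"  // PTHash headers; include path is an assumption

// Hypothetical helper: derive partitioning parameters for a partitioned
// MPHF over `num_keys` keys, targeting ~3 million keys per partition.
pthash::build_configuration make_build_config(uint64_t num_keys) {
    pthash::build_configuration config;
    config.c = 6.0;        // space/construction-time trade-off
    config.alpha = 0.94;   // load factor
    config.num_partitions = static_cast<uint64_t>(std::ceil(num_keys / 3e6));
    config.num_threads = std::min<uint64_t>(std::thread::hardware_concurrency(),
                                            config.num_partitions);
    config.minimal_output = true;   // ids in [0, num_keys)
    config.verbose_output = false;
    return config;
}
```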
Hey @jermp, I did some tests. It works fine for the bigger data (also passing the check):
Since it passed the check on the bigger chunk, I also tried it on the actual data, which succeeded:
It took around 3 days; PTHash went MUCH faster now that it could use 112 threads on our machine. So now the steps would be:
Hi @rickbeeloo,
Some quick questions: the results above are just for SSHash, right (not the entire Fulgor pipeline)?
Yes, I think this is entirely doable. Just a matter of passing this option to the cmd interface and refactoring the code a little bit. But, on the other hand, it is also a bit risky, because we have to ensure the client pre-built both GGCAT and SSHash in the manner Fulgor expects. So, at least for the time being, I would perhaps not do this. Instead, let's be sure that SSHash builds correctly on the whole data and launch the whole Fulgor construction. It will take ~3+1 days, but who cares at this point. I think nobody has indexed such a huge collection before, so I'm very curious to see the final result!
Correct
Well not much smaller haha:
The index itself is
Ok, that makes sense (it takes 2 days, more or less). Well, we are using < 12 bits per k-mer, which is rather good for such a large and branching dBG. You can see that it is very branching, since we spend 3.5 bits/k-mer rather than 2. Ah, one thing: Fulgor will build a SSHash index with canonical_parsing ON. So, I would do what I said in the previous comment: re-try the whole pipeline now that we have tuned SSHash to work properly on such a collection. Or do you want to try something different first?
Ok, as you said, indexing went well for 530,777 genomes. I'm reporting the result here (just for our own reference):
Hey!
I was trying to build a graph of 400K bacterial genomes and got the following error:
I'm a bit confused by the error message mentioning num_contigs, as the number of contigs I have is much lower (99,645,800). So I guess this refers to the number of unitigs in the graph instead? If yes, any ideas to still make it work? Could it use 2^64, since we have sufficient memory? Any other ideas? :)
Thanks for another cool tool!