RAM crash when use collect method #18

Open
acul3 opened this issue Jul 14, 2022 · 2 comments
acul3 commented Jul 14, 2022

First of all, thanks for releasing the code.

I have a dataset (mC4) of about 110 GB.

My machine has 96 CPU cores and 350 GB of RAM.

I've successfully created a 524 GB suffix array from that dataset.

I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues; it created about 140 GB of cache files (20B examples).

But when I run the collect method, my RAM usage blows up after a few minutes.

I traced the code and found that the RAM blows up while this step is running:

while let Some(Reverse((data_pointer, index, which_array))) = heap.pop() {

Is this expected?
Do you have a workaround for this issue?

AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of byte offsets. CMIIW.

I'm thinking it might help to write to the text file as soon as each cache file has been processed, instead of waiting for all of them to finish.
(This is just an assumption; I don't know whether it's possible, as I'm not an expert in Rust.)
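
A rough sketch of that idea, for illustration only: it is not taken from the repository, and it assumes each cache file can be read as a stream of (start, end) byte ranges already sorted by start. The function name `merge_ranges_streaming` and the output format ("start end" per line) are hypothetical. The heap holds at most one pending range per source, and a merged range is written out as soon as no later range can extend it, so memory use does not grow with the number of duplicates.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::io::{self, Write};

/// Merge several individually sorted streams of (start, end) byte ranges.
/// Overlapping or touching ranges are coalesced, and each finished range is
/// written to `out` immediately instead of being collected in memory.
/// Returns the number of ranges written.
/// Assumption: every source yields ranges in ascending order of `start`.
pub fn merge_ranges_streaming<I, W>(mut sources: Vec<I>, out: &mut W) -> io::Result<u64>
where
    I: Iterator<Item = (u64, u64)>,
    W: Write,
{
    // Min-heap (via Reverse) holding at most one pending range per source.
    let mut heap = BinaryHeap::new();
    for (which, src) in sources.iter_mut().enumerate() {
        if let Some((start, end)) = src.next() {
            heap.push(Reverse((start, end, which)));
        }
    }

    let mut current: Option<(u64, u64)> = None; // range still being extended
    let mut written = 0u64;

    while let Some(Reverse((start, end, which))) = heap.pop() {
        // Refill the heap from the source that produced the popped range.
        if let Some((next_start, next_end)) = sources[which].next() {
            heap.push(Reverse((next_start, next_end, which)));
        }

        current = match current {
            // The popped range overlaps or touches the current one: extend it.
            Some((cur_start, cur_end)) if start <= cur_end => {
                Some((cur_start, cur_end.max(end)))
            }
            // Gap: the current range is final, so flush it right away.
            Some((cur_start, cur_end)) => {
                writeln!(out, "{} {}", cur_start, cur_end)?;
                written += 1;
                Some((start, end))
            }
            None => Some((start, end)),
        };
    }

    // Flush the last open range, if any.
    if let Some((cur_start, cur_end)) = current {
        writeln!(out, "{} {}", cur_start, cur_end)?;
        written += 1;
    }
    Ok(written)
}
```

With real files, `out` would typically be a `std::io::BufWriter` around the output file, and each source an iterator that decodes one cache file lazily. Whether the existing cache files are sorted in a way that allows this is an open question.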

Thank you.

carlini (Collaborator) commented Aug 2, 2022

Right now, I do think collection can go OOM when there are a lot of duplicates: it loads them all into memory at the same time. I'll need to come up with some kind of algorithmic improvement that fixes this. Let me see what I can think of, but I don't have a timeline for when I'll get to it.

longxudou commented Nov 14, 2023

Thanks for the excellent codebase and instructions!

Eagerly awaiting a fix for the OOM issue.
