Right now, I do think the collect step can go OOM when there are a lot of duplicates: it loads them all into memory at the same time. I'll need to come up with some kind of algorithmic improvement to fix this. Let me see what I can think of, but I don't have a timeline for when I'll get to it.
First of all, thanks for releasing the code.
I have a dataset (mC4) of about 110 GB.
My machine has 96 CPU cores and 350 GB of RAM.
I've successfully created a 524 GB suffix array from that dataset.
I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues; it created about 140 GB of cache files (20B examples).
But when I run the collect method, my RAM blows up after a few minutes.
I traced the code and found that memory runs out while this step is running:
deduplicate-text-datasets/src/main.rs
Line 1188 in ad86c7f
Is this expected?
Do you have a workaround for this issue?
AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of bytes, CMIIW.
I'm thinking it could write to the text file as soon as each cache file has been processed/read, instead of waiting for all of them to finish.
(This is just an assumption; I don't know if it's possible, as I'm not an expert in Rust.)
Thank you.
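
To make the "write as you go" idea concrete, here is a minimal sketch of a streaming merge. It is not the repository's collect implementation. It assumes, without my having verified it, that each cache file can be read as a stream of `(start, end)` byte ranges already sorted by `start`; the `Range` type, the `merge_and_write` function, and the `start end` output format are all hypothetical names chosen for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::io::{self, Write};

/// Hypothetical representation of one duplicate span: a half-open
/// byte range [start, end) in the dataset.
type Range = (u64, u64);

/// Merge several individually sorted streams of ranges and write each
/// merged range as soon as it can no longer grow.
/// ASSUMPTION: every input stream is sorted by `start`.
/// Memory use is one pending range per stream plus one accumulator.
fn merge_and_write<I, W>(mut streams: Vec<I>, out: &mut W) -> io::Result<()>
where
    I: Iterator<Item = Range>,
    W: Write,
{
    // Min-heap holding the next unread range of every stream,
    // ordered by (start, end, stream index).
    let mut heap = BinaryHeap::new();
    for (idx, stream) in streams.iter_mut().enumerate() {
        if let Some((start, end)) = stream.next() {
            heap.push(Reverse((start, end, idx)));
        }
    }

    // The range currently being extended; flushed once a gap appears.
    let mut current: Option<Range> = None;
    while let Some(Reverse((start, end, idx))) = heap.pop() {
        // Refill the heap from the stream we just consumed from.
        if let Some((s, e)) = streams[idx].next() {
            heap.push(Reverse((s, e, idx)));
        }
        current = match current {
            // Overlapping or touching: extend the accumulator.
            Some((cs, ce)) if start <= ce => Some((cs, ce.max(end))),
            // Gap: the accumulated range is final, so write it out now.
            Some((cs, ce)) => {
                writeln!(out, "{} {}", cs, ce)?;
                Some((start, end))
            }
            None => Some((start, end)),
        };
    }
    // Flush the last accumulated range, if any.
    if let Some((cs, ce)) = current {
        writeln!(out, "{} {}", cs, ce)?;
    }
    Ok(())
}
```

In practice each stream would wrap a buffered reader over one cache file and `out` would be a `BufWriter` around the output file, so only a small, fixed amount of data per cache file stays in memory. If the cache files are not sorted, this approach does not apply directly and they would need an external sort first.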