Fix to issue #17 limits cmd_merge to be single-threaded #19
Comments
Oops. Yeah that's bad. Let me fix this.
implements the fix suggested here: google-research#19
I just added the great suggestion by @kleinj as a PR: https://github.com/google-research/deduplicate-text-datasets/pull/22/files. I was running the code on a dataset tens of gigabytes in size, and with this fix it runs in only about 2 hours on a lot of CPUs (it would have taken hundreds of hours without it). Let me know what you think @carlini!
This should have been fixed.
Hi,
it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:
deduplicate-text-datasets/src/main.rs
Lines 1020 to 1023 in ad86c7f
`texts.len()` is equal to `nn` (the number of input parts), I think you want something like … instead.
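To illustrate the concern, here is a minimal sketch. It is not the code from `main.rs`: `capped_threads`, `total_bytes`, and the 1024-unit threshold are hypothetical names and values made up for this example. It only shows how a cap computed from the number of parts collapses to one thread, while the same cap computed from the total input size does not.

```rust
/// Hypothetical helper (not from the repository): cap the requested thread
/// count so each thread gets at least `min_per_thread` units of work,
/// and never return fewer than one thread.
fn capped_threads(requested: usize, work_units: usize, min_per_thread: usize) -> usize {
    // How many threads the workload can keep busy at the given minimum.
    let affordable = work_units / min_per_thread.max(1);
    requested.min(affordable).max(1)
}

/// Hypothetical helper: total size in bytes across all input parts.
fn total_bytes(texts: &[Vec<u8>]) -> usize {
    texts.iter().map(|t| t.len()).sum()
}
```

With 8 parts of 4096 bytes each and 16 requested threads, `capped_threads(16, texts.len(), 1024)` measures the workload as 8 units and falls back to a single thread, whereas `capped_threads(16, total_bytes(&texts), 1024)` measures 32768 bytes and keeps all 16.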