Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to issue #17 limits cmd_merge to be single-threaded #19

Closed
kleinj opened this issue Aug 2, 2022 · 3 comments
Closed

Fix to issue #17 limits cmd_merge to be single-threaded #19

kleinj opened this issue Aug 2, 2022 · 3 comments

Comments

@kleinj
Copy link

kleinj commented Aug 2, 2022

Hi,

it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:

// Make sure we have enough space to take strided offsets for multiple threads
// This should be an over-approximation, and starts allowing new threads at 1k of data
let num_threads = std::cmp::min(num_threads, std::cmp::max((texts.len() as i64 - 1024)/10, 1));
println!("AA {}", num_threads);

texts.len() is equal to nn (the number of input parts), I think you want something like

    let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));

instead.

@carlini
Copy link
Collaborator

carlini commented Aug 2, 2022

Oops. Yeah that's bad. Let me fix this.

TristanThrush added a commit to TristanThrush/deduplicate-text-datasets that referenced this issue Sep 30, 2022
@TristanThrush
Copy link

I just added the great suggestion by @kleinj as a PR: https://github.com/google-research/deduplicate-text-datasets/pull/22/files.

I was running the code on a 10's of gigabyte dataset and this fix makes the code only run in about 2 hours with a lot of CPUs (it would've taken 100's of hours without this fix). Let me know what you think @carlini!

@carlini
Copy link
Collaborator

carlini commented Feb 24, 2024

This should have been fixed.

@carlini carlini closed this as completed Feb 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants