Fix to issue #17 limits cmd_merge to be single-threaded #19

kleinj · 2022-08-02T07:00:39Z

Hi,

it looks like the fix for issue #17, which limits the number of threads in cmd_merge, is a bit too aggressive and results in only a single thread being used even for big workloads:

deduplicate-text-datasets/src/main.rs

Lines 1020 to 1023 in ad86c7f

    
    // Make sure we have enough space to take strided offsets for multiple threads
    // This should be an over-approximation, and starts allowing new threads at 1k of data
    let num_threads = std::cmp::min(num_threads, std::cmp::max((texts.len() as i64 - 1024)/10, 1));
    println!("AA {}", num_threads);

texts.len() is equal to n (the number of input parts), not the amount of data. I think you want something like

    let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));

instead.
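
To make the difference concrete, here is a minimal sketch that evaluates the same clamp expression with both inputs. The helper names `capped_threads` and `compare` are hypothetical and not part of the repository, and the part sizes in the usage note are made up for illustration.

```rust
/// Thread cap using the expression quoted above. `size` is whichever
/// quantity is fed into it (number of parts or total bytes).
fn capped_threads(requested: i64, size: i64) -> i64 {
    std::cmp::min(requested, std::cmp::max((size - 1024) / 10, 1))
}

/// Returns (cap computed from the number of parts, cap computed from total bytes).
/// `texts_len` holds the byte length of each input part.
fn compare(requested: i64, texts_len: &[usize]) -> (i64, i64) {
    // Current behaviour: the number of parts is treated as the data size.
    let by_parts = capped_threads(requested, texts_len.len() as i64);
    // Proposed behaviour: the summed byte length is treated as the data size.
    let by_bytes = capped_threads(requested, texts_len.iter().sum::<usize>() as i64);
    (by_parts, by_bytes)
}
```

With four parts of 1 MiB each and 64 requested threads, `compare(64, &[1 << 20; 4])` gives a cap of 1 from the part count and 64 from the byte total. The part count would have to exceed 1044 before the first variant allows a second thread.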


carlini · 2022-08-02T15:23:17Z

Oops. Yeah that's bad. Let me fix this.


TristanThrush · 2022-09-30T04:15:45Z

I just added the great suggestion by @kleinj as a PR: https://github.com/google-research/deduplicate-text-datasets/pull/22/files.

I was running the code on a dataset of tens of gigabytes, and with this fix it finishes in about 2 hours on a lot of CPUs (it would have taken hundreds of hours without the fix). Let me know what you think @carlini!

carlini · 2024-02-24T03:43:01Z

This should have been fixed.

TristanThrush added a commit to TristanThrush/deduplicate-text-datasets that referenced this issue Sep 30, 2022

make cmd_merge use multiple threads again

3081a12

implements the fix suggested here: google-research#19

TristanThrush mentioned this issue Sep 30, 2022

make cmd_merge use multiple threads again #22

Closed

TristanThrush mentioned this issue Oct 5, 2022

make cmd_merge use multiple threads again TristanThrush/deduplicate-text-datasets#1

Merged

ChenghaoMou mentioned this issue Jun 25, 2023

Suffix Array consumed time ChenghaoMou/text-dedup#22

Closed

carlini closed this as completed Feb 24, 2024
