
Parallelizes URL reads using Ray / Multithreading #2040

Merged
merged 56 commits into from
Jun 3, 2022

Conversation

geoffreyangus
Collaborator

@geoffreyangus geoffreyangus commented May 18, 2022

This PR implements parallelized URL reads for audio features with both Ray and local backends. Compared to master, we see preprocessing times for a benchmark audio dataset of approximately 7k files decrease from >90 minutes (anecdotally, please confirm @connor-mccorm) to ~18 minutes. Additionally, because of the dedicated URL read functionality, we see a significant reduction in read errors originating from torchaudio.load.

Future work could include using ray.data.read_binary_files or ray Tasks directly. We opt for the method introduced in this PR instead due to errors related to Dask indexing. A follow-up PR will implement parallelized URL reads for image features.
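The local-backend half of this idea can be sketched with a thread pool. This is a minimal illustration, not Ludwig's actual API: the names `read_paths_in_parallel` and `safe_read` are hypothetical, and the real implementation handles remote filesystems via fsspec.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

def read_paths_in_parallel(
    paths: List[str],
    read_fn: Callable[[str], bytes],
    max_workers: int = 8,
) -> List[Optional[bytes]]:
    """Read many paths concurrently; a failed read yields None instead of raising."""
    def safe_read(path: str) -> Optional[bytes]:
        try:
            return read_fn(path)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with `paths`.
        return list(pool.map(safe_read, paths))
```

Because URL reads are I/O-bound, threads (rather than processes) are enough to hide network latency, and returning `None` on failure lets a later step drop or impute bad rows instead of aborting the whole preprocessing job.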

@geoffreyangus geoffreyangus marked this pull request as draft May 18, 2022 16:08
@github-actions

github-actions bot commented May 18, 2022

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 30m 42s ⏱️ +3m 4s
2 801 tests +2 · 2 769 ✔️ +2 · 32 💤 ±0 · 0 ±0
8 403 runs +6 · 8 303 ✔️ +6 · 100 💤 ±0 · 0 ±0

Results for commit 7f56b68. ± Comparison against base commit cc1c73f.

♻️ This comment has been updated with latest results.

@geoffreyangus geoffreyangus changed the title Parallelize URL load using Ray / Multithreading Parallelizes URL load using Ray / Multithreading May 18, 2022
@geoffreyangus geoffreyangus changed the title Parallelizes URL load using Ray / Multithreading Parallelizes URL reads using Ray / Multithreading May 18, 2022
@connor-mccorm
Collaborator

Can confirm: without parallelization, reading 7k .wav files takes ~105 minutes.

@geoffreyangus geoffreyangus marked this pull request as ready for review May 19, 2022 22:41
Collaborator

@justinxzhao justinxzhao left a comment


Nice change!

ludwig/utils/fs_utils.py (outdated, resolved)
@geoffreyangus geoffreyangus changed the base branch from master to suppress-assign-warning-pd May 24, 2022 00:12
Base automatically changed from suppress-assign-warning-pd to master May 24, 2022 03:13
Collaborator

@tgaddair tgaddair left a comment


Looks great! Couple small questions.

@@ -214,7 +216,7 @@ def reduce(series):
merged_stats["cropped"],
audio_file_length_limit_in_s,
)
logger.debug(print_statistics)
logging.debug(print_statistics)
Collaborator


Why not use logger?

Collaborator Author


Oh, I think we are cleaning up the use of logger in favor of logging. Open issue: #2045

Collaborator


Ah, gotcha!

ludwig/features/audio_feature.py (outdated, resolved)
df[column.name] = df[column.name].map(fn)
return df

ds = ds.map_batches(partial(map_batches_fn, fn=get_bytes_obj_if_path), batch_format="pandas")
Collaborator


This approach makes sense. One concern I had in the back of my mind is partition size. For example, if the input dataset is ~1MB, it may end up in a single partition as a Dask DF, and that partition could then explode in size if every image read into it is 10MB.

This is likely something we'll want to observe before prematurely optimizing, but it's something we should be aware of.

Collaborator Author


I see, that makes sense. We may want to repartition in that case, but we should be careful to ensure that our processed columns can still be coerced back into a dataframe by the df_like call at the bottom of data.preprocessing.build_dataset: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1137

Happy to follow up on this later if it becomes an issue.
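A size-based repartitioning heuristic like the one discussed here could be sketched as follows. Everything below is an assumption for illustration: the helper name `target_npartitions`, the 128 MB default, and the estimate of total bytes are all hypothetical, not Ludwig code.

```python
import math

def target_npartitions(estimated_total_bytes: int,
                       partition_bytes: int = 128 * 1024 ** 2) -> int:
    """Pick a partition count that keeps each partition under ~partition_bytes
    once raw file contents have been read in.
    """
    return max(1, math.ceil(estimated_total_bytes / partition_bytes))

# With Dask this might be applied as, e.g.:
#   df = df.repartition(npartitions=target_npartitions(n_rows * avg_file_bytes))
```

Repartitioning before the reads (based on an estimated per-row payload) avoids ever materializing an oversized partition, at the cost of needing a reasonable size estimate up front.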

return get_bytes_obj_from_path(path)


@functools.lru_cache(maxsize=32)
Collaborator


Does this get called repeatedly for the same inputs?

Collaborator Author


The lru_cache decorator was present in the original audio_utils.read_audio function, so I just preserved that functionality here. Might be useful if there are duplicate paths in the dataset, but happy to remove it if you think it is unnecessary.

Collaborator


We can leave it in for now and revisit down the road, I suppose.

@geoffreyangus geoffreyangus merged commit f0a486b into master Jun 3, 2022
@geoffreyangus geoffreyangus deleted the speedup-url-load branch June 3, 2022 01:03