Parallelizes URL reads using Ray / Multithreading #2040
Conversation
Can confirm, without parallelization, reading 7k .wav files takes ~105 minutes.
Nice change!
Looks great! Couple small questions.
@@ -214,7 +216,7 @@ def reduce(series):
         merged_stats["cropped"],
         audio_file_length_limit_in_s,
     )
-    logger.debug(print_statistics)
+    logging.debug(print_statistics)
Why not use logger?
Oh, I think we are cleaning up the use of logger in favor of logging. Open issue: #2045
Ah, gotcha!
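(For readers outside the thread, here is a minimal sketch of the two logging patterns being discussed, using only the standard library; the messages are illustrative:)

```python
import logging

# Pattern being phased out (per issue #2045): a per-module logger object.
logger = logging.getLogger(__name__)
logger.debug("per-module logger message")

# Pattern the codebase is moving toward: calling the logging module
# directly, which routes through the root logger.
logging.debug("root logger message")
```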
    df[column.name] = df[column.name].map(fn)
    return df

ds = ds.map_batches(partial(map_batches_fn, fn=get_bytes_obj_if_path), batch_format="pandas")
This approach makes sense. One concern I had in the back of my mind is regarding partition size. For example, if the input dataset is like 1MB, and so it ends up in a single partition as a Dask DF, then it could explode in size if every image is 10MB being shoved into a single partition.
This is likely something we'll want to wait to see before we prematurely optimize it, but it's something we should be aware of.
I see, that makes sense. We may want to repartition if that is the case, but we want to be careful in doing that to make sure that our processed columns can still be coerced into a dataframe again by the df_like call at the bottom of data.preprocessing.build_dataset: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1137

Happy to follow up on this later on if this becomes an issue.
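(A minimal sketch of the map_batches pattern under discussion, with a possible repartition step should partition blow-up become an issue; the input data, mapped fn, and block count below are illustrative stand-ins, not the PR's actual values:)

```python
import functools

import pandas as pd
import ray

# Column-wise mapper mirroring the map_batches_fn shape shown in the diff.
def map_batches_fn(df: pd.DataFrame, fn) -> pd.DataFrame:
    for name in df.columns:
        df[name] = df[name].map(fn)
    return df

ds = ray.data.from_pandas(pd.DataFrame({"audio_path": ["a.wav", "b.wav"]}))

# If a tiny input partition explodes after reads (e.g., 1MB of paths
# becoming GBs of bytes), one option is to split it into more blocks
# before mapping; 8 here is arbitrary.
ds = ds.repartition(8)
ds = ds.map_batches(
    functools.partial(map_batches_fn, fn=lambda p: p.upper()),  # placeholder fn
    batch_format="pandas",
)
```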
    return get_bytes_obj_from_path(path)


@functools.lru_cache(maxsize=32)
Does this get called repeatedly for the same inputs?
The lru_cache decorator was present in the original audio_utils.read_audio function, so I just preserved that functionality here. Might be useful if there are duplicate paths in the dataset, but happy to remove it if you think it is unnecessary.
We can leave it in for now and revisit down the road, I suppose.
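(A minimal sketch of the caching behavior in question; the function body below is a local-file stand-in for illustration, not the PR's actual remote-read implementation:)

```python
import functools

@functools.lru_cache(maxsize=32)
def get_bytes_obj_from_path(path: str) -> bytes:
    # With the cache, repeated calls for the same path return the stored
    # bytes instead of re-reading, which only pays off when the dataset
    # contains duplicate paths.
    with open(path, "rb") as f:
        return f.read()
```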
This PR implements parallelized URL reads for audio features with both Ray and local backends. Compared to master, we see preprocessing times for a benchmark audio dataset of approximately 7k files decrease from >90 minutes (anecdotally, please confirm @connor-mccorm) to ~18 minutes. Additionally, because of the dedicated URL read functionality, we see a significant reduction in read errors originating from torchaudio.load.

Future work could include using ray.data.read_binary_files or Ray Tasks directly. We opt for the method introduced in this PR instead due to errors related to Dask indexing. A follow-up PR will implement parallelized URL reads for image features.
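(As a rough illustration of the multithreaded read path on the local backend, here is a hedged sketch; the function names and worker count are assumptions for the example, not the PR's actual code:)

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def read_url(url: str) -> bytes:
    # Fetch one URL's raw bytes; error handling is left to the caller here,
    # though the read-error motivation above suggests real code would
    # catch and report failures per file.
    with urllib.request.urlopen(url) as response:
        return response.read()

def read_urls_in_parallel(urls, max_workers=16):
    # Threads suit this workload because URL reads are I/O-bound: the GIL
    # is released while each worker waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_url, urls))
```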