Parallelizes URL reads using Ray / Multithreading #2040
Conversation
Can confirm, without parallelization, reading 7k .wav files takes ~105 minutes.
Nice change!
Looks great! Couple small questions.
@@ -214,7 +216,7 @@ def reduce(series):
         merged_stats["cropped"],
         audio_file_length_limit_in_s,
     )
-    logger.debug(print_statistics)
+    logging.debug(print_statistics)
Why not use logger?
Oh, I think we are cleaning up the use of logger in favor of logging. Open issue: #2045
Ah, gotcha!
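(For readers outside the thread, here is a minimal sketch of the two logging patterns being discussed, using only the standard library; the messages are illustrative:)

```python
import logging

# Pattern being phased out (per issue #2045): a per-module logger object.
logger = logging.getLogger(__name__)
logger.debug("per-module logger message")

# Pattern the codebase is moving toward: calling the logging module
# directly, which routes through the root logger.
logging.debug("root logger message")
```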
    df[column.name] = df[column.name].map(fn)
    return df

ds = ds.map_batches(partial(map_batches_fn, fn=get_bytes_obj_if_path), batch_format="pandas")
This approach makes sense. One concern I had in the back of my mind is regarding partition size. For example, if the input dataset is like 1MB, and so it ends up in a single partition as a Dask DF, then it could explode in size if every image is 10MB being shoved into a single partition.
This is likely something we'll want to wait to see before we prematurely optimize it, but it's something we should be aware of.
I see, that makes sense. We may want to repartition if that is the case, but we want to be careful in doing that to make sure that our processed columns can still be coerced into a dataframe again by the df_like call at the bottom of data.preprocessing.build_dataset: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1137

Happy to follow up on this later on if this becomes an issue.
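(A minimal sketch of the map_batches pattern under discussion, with a possible repartition step should partition blow-up become an issue; the input data, mapped fn, and block count below are illustrative stand-ins, not the PR's actual values:)

```python
import functools

import pandas as pd
import ray

# Column-wise mapper mirroring the map_batches_fn shape shown in the diff.
def map_batches_fn(df: pd.DataFrame, fn) -> pd.DataFrame:
    for name in df.columns:
        df[name] = df[name].map(fn)
    return df

ds = ray.data.from_pandas(pd.DataFrame({"audio_path": ["a.wav", "b.wav"]}))

# If a tiny input partition explodes after reads (e.g., 1MB of paths
# becoming GBs of bytes), one option is to split it into more blocks
# before mapping; 8 here is arbitrary.
ds = ds.repartition(8)
ds = ds.map_batches(
    functools.partial(map_batches_fn, fn=lambda p: p.upper()),  # placeholder fn
    batch_format="pandas",
)
```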
    return get_bytes_obj_from_path(path)


@functools.lru_cache(maxsize=32)
Does this get called repeatedly for the same inputs?
The lru_cache decorator was present in the original audio_utils.read_audio function, so I just preserved that functionality here. Might be useful if there are duplicate paths in the dataset, but happy to remove it if you think it is unnecessary.
We can leave it in for now and revisit down the road, I suppose.
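(A minimal sketch of the caching behavior in question; the function body below is a local-file stand-in for illustration, not the PR's actual remote-read implementation:)

```python
import functools

@functools.lru_cache(maxsize=32)
def get_bytes_obj_from_path(path: str) -> bytes:
    # With the cache, repeated calls for the same path return the stored
    # bytes instead of re-reading, which only pays off when the dataset
    # contains duplicate paths.
    with open(path, "rb") as f:
        return f.read()
```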
This PR implements parallelized URL reads for audio features with both Ray and local backends. Compared to master, we see preprocessing times for a benchmark audio dataset of approximately 7k files decrease from >90 minutes (anecdotally, please confirm @connor-mccorm) to ~18 minutes. Additionally, because of the dedicated URL read functionality, we see a significant reduction in read errors originating from torchaudio.load.

Future work could include using ray.data.read_binary_files or Ray Tasks directly. We opt for the method introduced in this PR instead due to errors related to Dask indexing. A follow-up PR will implement parallelized URL reads for image features.
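(As a rough illustration of the multithreaded read path on the local backend, here is a hedged sketch; the function names and worker count are assumptions for the example, not the PR's actual code:)

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def read_url(url: str) -> bytes:
    # Fetch one URL's raw bytes; error handling is left to the caller here,
    # though the read-error motivation above suggests real code would
    # catch and report failures per file.
    with urllib.request.urlopen(url) as response:
        return response.read()

def read_urls_in_parallel(urls, max_workers=16):
    # Threads suit this workload because URL reads are I/O-bound: the GIL
    # is released while each worker waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_url, urls))
```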