Parallelizes URL reads for images using Ray/Multithreading #2048

geoffreyangus · 2022-05-21T00:07:00Z

This PR parallelizes URL reads for images, similar to what is done for audio in #2040. Because there are multiple image reads before the whole column is read (namely, the first image and the sample set of images), this PR also refactors the process of inferring image feature properties to unify the byte reading and loading workflow introduced in #2040.
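As a rough illustration of the multithreaded path described above, the sketch below fans reads out over a thread pool and returns results in input order. It is not Ludwig's implementation: `read_all`, `read_fn`, and `max_workers` are hypothetical names, and mapping failed reads to None is only one possible policy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def read_all(
    paths: Sequence[str],
    read_fn: Callable[[str], bytes],
    max_workers: int = 8,
) -> List[Optional[bytes]]:
    """Read every path concurrently; a failed read yields None (assumed policy)."""

    def safe_read(path: str) -> Optional[bytes]:
        try:
            return read_fn(path)
        except Exception:
            # Hypothetical choice: swallow the error so one bad URL
            # does not abort the whole column.
            return None

    # Threads suit this workload because URL reads are I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `paths` in its results.
        return list(pool.map(safe_read, paths))
```

Here `read_fn` stands in for whatever actually fetches bytes (an HTTP client, an fsspec open, and so on), which keeps the sketch independent of any particular storage backend.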

dantreiman · 2022-06-06T20:20:33Z

ludwig/utils/image_utils.py

+        with BytesIO(bytes_obj) as buffer:
+            buffer_view = buffer.getbuffer()
+            image = decode_image(torch.frombuffer(buffer_view, dtype=torch.uint8), mode=mode)
+            del buffer_view


Why is it necessary to del buffer_view? Seems like it will go out-of-scope on the next line. If I'm missing something, please add a comment here explaining why this is needed.

Hm, not quite sure why it's needed. This was a bit of code that was already in read_image:

ludwig/ludwig/utils/image_utils.py

Line 121 in d3eea13

del buffer_view

AFAICT it would be safe to remove this line.

Hm, it seems like removing this is causing tests to fail with the following error:

Existing exports of data: object cannot be re-sized

Seems related to this open issue: python/cpython#85269

I've put it back for now, but let me know if there's an alternative you think is worth exploring.
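A likely explanation, shown here with only the standard library: Python locals live until the function returns, not until the end of a `with` block, so the memoryview from `getbuffer()` is still alive when the block exits and calls `close()`. The Python docs note that a BytesIO object cannot be resized or closed while such a view exists. The function names below are hypothetical, and this sketch leaves out the `torch.frombuffer` call from the diff.

```python
from io import BytesIO


def close_with_live_view(data: bytes) -> bool:
    """Return True if leaving the `with` block raises BufferError."""
    view = None
    try:
        with BytesIO(data) as buffer:
            view = buffer.getbuffer()
            # No release here: `view` is still alive when the block
            # exits and buffer.close() is called.
    except BufferError:
        return True
    finally:
        if view is not None:
            view.release()  # clean up so the buffer can be freed
    return False


def close_after_release(data: bytes) -> int:
    """Release the view before the block exits, so close() succeeds."""
    with BytesIO(data) as buffer:
        view = buffer.getbuffer()
        first_byte = view[0]
        view.release()  # same effect as `del view` when no other reference exists
    return first_byte
```

Calling `view.release()` explicitly, or wrapping the view in its own `with` statement, may be a slightly more self-documenting alternative to `del buffer_view`.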

dantreiman · 2022-06-06T20:24:49Z

ludwig/utils/image_utils.py

+def read_image_if_path(path: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]:
+    """Gets an image if `path` is a path (e.g. a string).
+
+    If it is not a path, return as-is.


What are the possible data types of path (if not a string)?

This function is intended to be usable by map-like workflows, so path could be any value stored in a column. For image columns, I believe we officially support image features being passed in as either path (str) or torch.Tensor objects (pre-loaded images). So, in this case, path could be a torch.Tensor. In that case, we would want this function to be a no-op.

I'd recommend adding your response:

...path could be a torch.Tensor. In that case, we would want this function to be a no-op.

as a comment, to make the intent clear here.

How about naming the parameter path_or_image (instead of just path)?

Removed read_image_if_path in latest change per Jeff's comment

ludwig/features/image_feature.py

geoffreyangus · 2022-06-06T21:32:11Z

Thanks for the review @dantreiman 😄 One thing I wanted to ask:

There are 4 functions in image_utils.py that are effectively replaced by the new functions: get_image_from_http_bytes, get_image_from_path, read_image, read_image_from_str. Are you aware of anywhere these are being used outside of Ludwig (e.g., in the Predibase app)? If not, I may go ahead and delete them in this PR.

dantreiman · 2022-06-06T21:40:58Z

I'm not aware of any usage outside the ludwig package. Let's double-check with @hungcs first before deleting.

dantreiman · 2022-06-06T23:30:21Z

I checked with @hungcs: Removing unused functions from image_utils is fine.

jeffreyftang · 2022-06-07T18:13:38Z

ludwig/features/feature_utils.py

 from ludwig.utils.strings_utils import tokenizer_registry, UNKNOWN_SYMBOL

 SEQUENCE_TYPES = {SEQUENCE, TEXT, TIMESERIES}
 FEATURE_NAME_SUFFIX = "__ludwig"
 FEATURE_NAME_SUFFIX_LENGTH = len(FEATURE_NAME_SUFFIX)


+def get_abs_path_if_entry_is_str(entry, src_path):


Would be great to have type annotations if possible :)

I actually removed this function because it falls into the same trap of "return as-is" behavior. I simplified the API further by removing it and moving the logic to the caller. Thanks!
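For illustration only, moving the logic to the caller might look roughly like the sketch below. The helper names `resolve_entry` and `resolve_column` are hypothetical and do not come from the PR, and the `"://"` test is a simplified stand-in for a real remote-protocol check.

```python
import os
from typing import Any, List


def resolve_entry(entry: str, src_dir: str) -> str:
    """Make a relative local path absolute; leave URLs and absolute paths alone."""
    # Assumption: anything containing "://" is a remote URL (s3://, https://, ...).
    if "://" in entry or os.path.isabs(entry):
        return entry
    return os.path.join(src_dir, entry)


def resolve_column(entries: List[Any], src_dir: str) -> List[Any]:
    """Caller-side dispatch: only string entries are treated as paths."""
    return [
        resolve_entry(entry, src_dir) if isinstance(entry, str) else entry
        for entry in entries
    ]
```

With this split, the helper accepts only strings, and the decision about non-string entries sits visibly in the caller.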

jeffreyftang · 2022-06-07T18:19:47Z

ludwig/utils/image_utils.py

+    return mode
+
+
+def read_image_if_path(item: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]:


How important is the "return as-is" behavior here and elsewhere? To me it's more intuitive and self-documenting to return None or raise an exception if the input isn't the expected type, then handle it appropriately in the caller.

Not super important, and that's a good point! It was just another layer of abstraction, and perhaps an unnecessary one.

Removed in latest change.
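A minimal sketch of the stricter contract suggested in this thread, assuming a reader that accepts only path strings and raises otherwise. The names `read_image_strict`, `load_entries`, and the injected `reader` callable are hypothetical and not taken from the PR.

```python
from typing import Any, Callable, List


def read_image_strict(path: str, reader: Callable[[str], bytes]) -> bytes:
    """Read bytes for `path`; reject anything that is not a string."""
    if not isinstance(path, str):
        raise TypeError(f"expected a path string, got {type(path).__name__}")
    return reader(path)


def load_entries(entries: List[Any], reader: Callable[[str], bytes]) -> List[Any]:
    """The caller decides what to do with non-path entries (kept as-is here)."""
    loaded = []
    for entry in entries:
        if isinstance(entry, str):
            loaded.append(read_image_strict(entry, reader))
        else:
            # e.g. an image that was already loaded upstream
            loaded.append(entry)
    return loaded
```

The pass-through still exists, but it now lives in one visible place in the caller instead of being hidden inside the reading function.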

geoffreyangus · 2022-06-14T15:45:05Z

Completed benchmarking for this PR. We benchmark the time it takes to load 5000 images from an S3 bucket with 4 c5.9xlarge nodes. Fetching and preprocessing 5000 images from the iSpy2 dataset takes about 6 minutes in total, which translates to roughly 14 images loaded per second.

By comparison, the master branch cannot load these images at all. Overall, these results seem to be a clear win.
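The throughput figure follows directly from the numbers reported above; the helper name below is hypothetical.

```python
def images_per_second(num_images: int, minutes: float) -> float:
    """Average throughput over the whole run."""
    return num_images / (minutes * 60.0)


rate = images_per_second(5000, 6)  # 5000 images / 360 seconds
print(f"{rate:.1f} images/s")  # prints "13.9 images/s"
```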

geoffreyangus and others added 30 commits May 16, 2022 13:09

wip

57964de

debugging nans

764582d

Merge branch 'master' into speedup-url-load

add6f60

failing parity test

34ea789

not passing auc parity test... w logs

f81969a

audio feature works

1755fda

cleanup and revert image changes to prepare for image work

dea61b3

further cleanup

eaf8dcc

added batch size

1e44899

[pre-commit.ci] auto fixes from pre-commit.com hooks

8e78736

for more information, see https://pre-commit.ci

remove batch size

95aecac

Merge branch 'speedup-url-load' of https://github.com/ludwig-ai/ludwig …

15ff78b

…into speedup-url-load

Merge branch 'master' into speedup-url-load

6ade301

address nit

2b06eb0

cleanup

32febdb

adds support for nans and unit test

9b66ea5

[pre-commit.ci] auto fixes from pre-commit.com hooks

a150c91

for more information, see https://pre-commit.ci

adds type hint and fixes abstract class definition

7522ca6

merge

933bd38

fix docstring

b1a904f

[pre-commit.ci] auto fixes from pre-commit.com hooks

4ec7108

for more information, see https://pre-commit.ci

add pandas to test

a2c40aa

Merge branch 'speedup-url-load' of https://github.com/ludwig-ai/ludwig …

8133338

…into speedup-url-load

wip

90e703d

wip

3884240

removed legacy audio fns and renamed bytes fns

c701b66

refactor + read_binary_files

c0f10c8

removed pd indexing warning and precommit errors, added nans to ray a…

6aa79e5

…udio test

Merge branch 'speedup-url-load' into image-url-load

054f54d

[pre-commit.ci] auto fixes from pre-commit.com hooks

0f60863

for more information, see https://pre-commit.ci

fix docstring

65ca19b

dantreiman reviewed Jun 6, 2022

geoffreyangus and others added 3 commits June 6, 2022 14:33

merge

2843e7b

[pre-commit.ci] auto fixes from pre-commit.com hooks

2fb77f8

for more information, see https://pre-commit.ci

implement PR revisions

8770f4f

geoffreyangus and others added 6 commits June 6, 2022 17:06

add del buffer_view back

cfff33b

addressed PR comments and deleted extraneous functions

9b53a08

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8f3371

for more information, see https://pre-commit.ci

cleanup

602914a

Merge branch 'image-url-load' of https://github.com/ludwig-ai/ludwig …

5b2e857

…into image-url-load

Merge branch 'master' into image-url-load

a356894

jeffreyftang reviewed Jun 7, 2022

geoffreyangus and others added 6 commits June 7, 2022 11:51

removed read_*_if_* functions

b92effe

[pre-commit.ci] auto fixes from pre-commit.com hooks

428f19d

for more information, see https://pre-commit.ci

remove map_abs_path_if_entries

55a90cc

Merge branch 'image-url-load' of https://github.com/ludwig-ai/ludwig …

5e3bb99

…into image-url-load

simplified getting abs path

b30c053

merge

bb6fdb8

jeffreyftang approved these changes Jun 10, 2022

geoffreyangus and others added 4 commits June 13, 2022 13:08

Merge branch 'master' into image-url-load

d4c2b09

add check for remote protocol

f20454d

Merge branch 'master' into image-url-load

fc9c9a1

[pre-commit.ci] auto fixes from pre-commit.com hooks

42f301e

for more information, see https://pre-commit.ci

tgaddair approved these changes Jun 14, 2022

tgaddair merged commit 520af82 into master Jun 14, 2022

tgaddair deleted the image-url-load branch June 14, 2022 16:04
