Parallelizes URL reads for images using Ray/Multithreading #2048

Merged
tgaddair merged 77 commits into master from image-url-load
Jun 14, 2022

geoffreyangus
Collaborator

This PR parallelizes URL reads for images, similar to what is done for audio in #2040. Because there are multiple image reads before the whole column is read (namely, the first image and the sample set of images), this PR also refactors the process of inferring image feature properties to unify the byte reading and loading workflow introduced in #2040.
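
For readers unfamiliar with the multithreaded half of this approach, here is a minimal sketch of fanning out URL reads over a thread pool. It is not the code in this PR: `read_all_bytes` and `read_fn` are hypothetical names, and `read_fn` stands in for whatever single-URL reader (HTTP, S3, local file) is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def read_all_bytes(
    urls: Sequence[str],
    read_fn: Callable[[str], bytes],
    max_workers: int = 16,
) -> List[Optional[bytes]]:
    """Read every URL concurrently, preserving input order.

    `read_fn` is a hypothetical single-URL reader supplied by the caller.
    A failed read yields None instead of aborting the whole batch.
    """

    def safe_read(url: str) -> Optional[bytes]:
        try:
            return read_fn(url)
        except Exception:
            # Treat any failure as a missing image; the caller decides what to do.
            return None

    # Threads suit this workload because each read mostly waits on network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `urls`.
        return list(pool.map(safe_read, urls))
```

Keeping the reader injectable means the same helper can be exercised in tests with an in-memory stub instead of real network calls.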

geoffreyangus and others added 30 commits May 16, 2022 13:09
with BytesIO(bytes_obj) as buffer:
    buffer_view = buffer.getbuffer()
    image = decode_image(torch.frombuffer(buffer_view, dtype=torch.uint8), mode=mode)
    del buffer_view
Collaborator

Why is it necessary to del buffer_view? It seems like it will go out of scope on the next line. If I'm missing something, please add a comment here explaining why this is needed.

Collaborator Author

Hm, not quite sure why it's needed. This was a bit of code that was already in read_image:

del buffer_view

Collaborator

AFAICT it would be safe to remove this line.

Collaborator Author

Hm, it seems like removing this is causing tests to fail with the following error:

Existing exports of data: object cannot be re-sized

Seems related to this open issue: python/cpython#85269

I've put it back for now, but let me know if there's an alternative you think is worth exploring.
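
A likely explanation, sketched with the standard library only: the view returned by `BytesIO.getbuffer()` is still alive when the `with` block exits, and `BytesIO` refuses to close while an exported view exists. The function names below are hypothetical and the example leaves out `torch` entirely; it only illustrates the buffer behavior.

```python
from io import BytesIO


def first_byte_leaky(data: bytes) -> int:
    """Leaves the memoryview alive when the `with` block exits."""
    with BytesIO(data) as buffer:
        view = buffer.getbuffer()
        value = view[0]
        # `view` is still referenced here, so the implicit buffer.close()
        # on block exit raises BufferError.
    return value


def first_byte_released(data: bytes) -> int:
    """Drops the memoryview before the `with` block exits."""
    with BytesIO(data) as buffer:
        view = buffer.getbuffer()
        value = view[0]
        # On CPython, deleting the last reference releases the export
        # immediately; view.release() is the explicit equivalent.
        del view
    return value
```

In other words, the local variable does not go out of scope until the function returns, which is after the `with` block has already tried to close the buffer.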

def read_image_if_path(path: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]:
    """Gets an image if `path` is a path (e.g. a string).

    If it is not a path, return as-is.
Collaborator

What are the possible data types of path (if not a string)?

Collaborator Author

This function is intended to be usable by map-like workflows, so path could be any value stored in a column. For image columns, I believe we officially support image features being passed in as either paths (str) or torch.Tensor objects (pre-loaded images). So path could be a torch.Tensor, in which case we would want this function to be a no-op.

Collaborator

I'd recommend adding your response:

...path could be a torch.Tensor. In that case, we would want this function to be a no-op.

as a comment, to make the intent clear here.

Collaborator

How about naming the parameter path_or_image (instead of just path)?

Collaborator Author

Removed read_image_if_path in latest change per Jeff's comment

ludwig/features/image_feature.py
@geoffreyangus
Collaborator Author

Thanks for the review @dantreiman 😄 One thing I wanted to ask:

  1. There are 4 functions in image_utils.py that are effectively replaced by the new functions: get_image_from_http_bytes, get_image_from_path, read_image, read_image_from_str. Are you aware of anywhere these are being used outside of Ludwig (e.g. in the Predibase app)? If not, I may go ahead and delete these in this PR.

@dantreiman
Collaborator

I'm not aware of any usage outside the ludwig package. Let's double-check with @hungcs first before deleting.

@dantreiman
Collaborator

dantreiman commented Jun 6, 2022

I checked with @hungcs: Removing unused functions from image_utils is fine.

from ludwig.utils.strings_utils import tokenizer_registry, UNKNOWN_SYMBOL

SEQUENCE_TYPES = {SEQUENCE, TEXT, TIMESERIES}
FEATURE_NAME_SUFFIX = "__ludwig"
FEATURE_NAME_SUFFIX_LENGTH = len(FEATURE_NAME_SUFFIX)


def get_abs_path_if_entry_is_str(entry, src_path):
Collaborator

Would be great to have type annotations if possible :)

Collaborator Author

I actually removed this function because it falls into the same trap of "return as-is" behavior. Removing it and moving the logic to the caller simplified the API further. Thanks!

return mode


def read_image_if_path(item: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]:
Collaborator

How important is the "return as-is behavior" here and elsewhere? To me it's more intuitive and self-documenting to return None or an exception if the input isn't the type expected, then handle appropriately from the caller.

Collaborator Author

Not super important. That's a good point! It was just another layer of abstraction, but perhaps an unnecessary one.

Collaborator Author

Removed in latest change.
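
To illustrate the alternative discussed in this thread, here is a rough sketch of a strict reader plus caller-side dispatch. It is not the code that landed in the PR: `read_image_bytes`, `load_entries`, and `read_fn` are hypothetical names, and raw bytes stand in for decoded images to keep the example dependency-free.

```python
from typing import Any, Callable, List


def read_image_bytes(path: str, read_fn: Callable[[str], bytes]) -> bytes:
    """Strict reader: accepts only string paths and raises on anything else."""
    if not isinstance(path, str):
        raise TypeError(f"expected a str path, got {type(path).__name__}")
    return read_fn(path)


def load_entries(entries: List[Any], read_fn: Callable[[str], bytes]) -> List[Any]:
    """Caller-side dispatch: read string paths, keep pre-loaded values unchanged."""
    return [
        read_image_bytes(entry, read_fn) if isinstance(entry, str) else entry
        for entry in entries
    ]
```

The type check now lives in one visible place in the caller, and the reader's signature documents exactly what it accepts.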

@geoffreyangus
Collaborator Author

Completed benchmarking for this PR. We benchmarked the time it takes to load 5000 images from an S3 bucket with 4 c5.9xlarge nodes. Fetching and preprocessing 5000 images from the iSpy2 dataset takes about 6 minutes in total, which translates to roughly 14 images loaded per second.
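
For anyone checking the arithmetic, the throughput figure follows directly from the two numbers above. The helper name is hypothetical, and since "about 6 minutes" is approximate, the result is approximate too.

```python
def images_per_second(num_images: int, elapsed_seconds: float) -> float:
    """Average throughput over the whole run."""
    return num_images / elapsed_seconds


# 5000 images in roughly 6 minutes (360 seconds).
rate = images_per_second(5000, 6 * 60)
print(f"{rate:.1f} images/s")  # prints "13.9 images/s"
```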

By comparison, the master branch can't load these images at all. Overall, these results seem to be a clear win.

tgaddair merged commit 520af82 into master Jun 14, 2022
tgaddair deleted the image-url-load branch June 14, 2022 16:04