-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelizes URL reads for images using Ray/Multithreading #2048
Conversation
for more information, see https://pre-commit.ci
…into speedup-url-load
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…into speedup-url-load
for more information, see https://pre-commit.ci
with BytesIO(bytes_obj) as buffer: | ||
buffer_view = buffer.getbuffer() | ||
image = decode_image(torch.frombuffer(buffer_view, dtype=torch.uint8), mode=mode) | ||
del buffer_view |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to del buffer_view
? Seems like it will go out-of-scope on the next line. If I'm missing something, please add a comment here explaining why this is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, not quite sure why it's needed. This was a bit of code that was already in read_image
:
ludwig/ludwig/utils/image_utils.py
Line 121 in d3eea13
del buffer_view |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT it would be safe to remove this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it seems like removing this is causing tests to fail with the following error:
Existing exports of data: object cannot be re-sized
Seems related to this open issue: python/cpython#85269
I've put it back for now, but let me know if you think there's an alternative you think is worth exploring.
ludwig/utils/image_utils.py
Outdated
def read_image_if_path(path: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]: | ||
"""Gets an image if `path` is a path (e.g. a string). | ||
|
||
If it is not a path, return as-is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the possible data types of path (if not a string)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is intended to be usable by map
-like workflows, so path
could be any value stored in a column. For image columns, I believe we officially support image features being passed in as either path (str) or torch.Tensor objects (pre-loaded images). So, in this case, path
could be a torch.Tensor
. In that case, we would want this function to be a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend adding your response:
...path could be a torch.Tensor. In that case, we would want this function to be a no-op.
as a comment, to make the intent clear here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the name the parameter path_or_image
(instead of just path
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed read_image_if_path
in latest change per Jeff's comment
Thanks for the review @dantreiman 😄 One thing I wanted to ask:
|
for more information, see https://pre-commit.ci
I'm not aware of any usage outside the ludwig package. Lets double-check with @hungcs first before deleting. |
I checked with @hungcs: Removing unused functions from image_utils is fine. |
for more information, see https://pre-commit.ci
…into image-url-load
ludwig/features/feature_utils.py
Outdated
from ludwig.utils.strings_utils import tokenizer_registry, UNKNOWN_SYMBOL | ||
|
||
SEQUENCE_TYPES = {SEQUENCE, TEXT, TIMESERIES} | ||
FEATURE_NAME_SUFFIX = "__ludwig" | ||
FEATURE_NAME_SUFFIX_LENGTH = len(FEATURE_NAME_SUFFIX) | ||
|
||
|
||
def get_abs_path_if_entry_is_str(entry, src_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to have type annotations if possible :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually removed this function because it falls into the same trap of doing "return as-is behavior"– simplified the API further by removing this and moving the logic to the caller. Thanks!
ludwig/utils/image_utils.py
Outdated
return mode | ||
|
||
|
||
def read_image_if_path(item: Any, num_channels: Optional[int] = None) -> Union[Any, torch.Tensor]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is the "return as-is behavior" here and elsewhere? To me it's more intuitive and self-documenting to return None
or an exception if the input isn't the type expected, then handle appropriately from the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super important– that's a good point! It was just another layer of abstraction, but perhaps unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in latest change.
for more information, see https://pre-commit.ci
…into image-url-load
Completed benchmarking for this PR. We benchmark the time it takes to load 5000 images from an S3 bucket with 4 This is in comparison to attempting to load these images with the |
This PR parallelizes URL reads for images, similar to what is done for audio in #2040. Because there are multiple image reads before the whole column is read (namely, the first image and the sample set of images), this PR also refactors the process of inferring image feature properties to unify the byte reading and loading workflow introduced in #2040.