Improve image/audio read throughput by 50% for image/audio features using Daft #3249
for more information, see https://pre-commit.ci
Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com>
ludwig/backend/ray.py
Outdated
    df = df.to_dask_dataframe()
    df = self.df_engine.persist(df)
else:
    df = df.to_pandas()
Is this only needed for the Modin df engine? I think Pandas shouldn't use this codepath at all, right?
Actually, it's needed to convert from Daft back to either Dask or Pandas, so even Pandas would hit this code path.
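To illustrate the point about both engines hitting this path, here is a minimal, library-free sketch. The stub classes stand in for a Daft DataFrame and a df engine; only the method names to_dask_dataframe(), to_pandas() and persist() come from the diff above. The is_distributed flag and the convert_back helper are hypothetical, since the actual branch condition is not visible in the hunk.

```python
class StubDaftFrame:
    """Stand-in for a Daft DataFrame; mimics only the two conversion methods in the diff."""

    def __init__(self, rows):
        self.rows = rows

    def to_pandas(self):
        # The real method returns a pandas DataFrame; here we return a tagged tuple.
        return ("pandas", list(self.rows))

    def to_dask_dataframe(self):
        # The real method returns a Dask DataFrame; here we return a tagged tuple.
        return ("dask", list(self.rows))


class StubEngine:
    """Hypothetical df engine. `is_distributed` is an assumed flag, not Ludwig's API."""

    def __init__(self, is_distributed):
        self.is_distributed = is_distributed
        self.persist_calls = 0

    def persist(self, df):
        # Count calls so the usage below can show which branch ran.
        self.persist_calls += 1
        return df


def convert_back(daft_df, df_engine):
    """Convert the Daft result back to whatever frame type the engine expects."""
    if df_engine.is_distributed:  # assumed condition; the real one is outside the hunk
        df = daft_df.to_dask_dataframe()
        df = df_engine.persist(df)
    else:
        df = daft_df.to_pandas()
    return df


dask_engine = StubEngine(is_distributed=True)
pandas_engine = StubEngine(is_distributed=False)
dask_result = convert_back(StubDaftFrame(["a.png", "b.png"]), dask_engine)
pandas_result = convert_back(StubDaftFrame(["a.png", "b.png"]), pandas_engine)
```

In this sketch the Pandas-style engine still goes through convert_back, but it takes the else branch and never calls persist.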
Excited for the speedups!!
There are many advantages to doing this from a code maintainability perspective:
In the local single-partition Ray cluster setup, Daft gives a 50% speedup over the existing Ludwig code for URL downloading and an overall 40% speedup across all of preprocessing.
In the distributed multi-partition Ray cluster setup, Daft gives a 50% speedup over the existing Ludwig code for URL downloading and an overall 40% speedup across all of preprocessing.
See benchmarking doc for more details.
Co-authored-by: @jaychia