Improve image/audio read throughput by 50% for image/audio features using Daft #3249

arnavgarg1 · 2023-03-15T01:10:14Z

There are many advantages to doing this from a code maintainability perspective:

We can get rid of a lot of custom and unnecessary Dask transformations.
We still get to use Ray, but Daft is much more optimized for data querying and aggregations.
We only use Daft for image/audio reads.
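
For readers unfamiliar with the workload: the image/audio reads in question are I/O-bound fetches of one file per row. The sketch below illustrates that general idea with a standard-library thread pool. It is not Daft's or Ludwig's implementation, and the names `read_binary` and `read_binary_files` are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Optional


def read_binary(path: str) -> Optional[bytes]:
    """Return the file's bytes, or None if the file cannot be read."""
    try:
        return Path(path).read_bytes()
    except OSError:
        # A missing or unreadable file becomes None instead of failing the batch.
        return None


def read_binary_files(paths: List[str], max_workers: int = 8) -> List[Optional[bytes]]:
    """Read many files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_binary, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            sample = Path(tmp) / f"sample_{i}.bin"
            sample.write_bytes(bytes([i]) * 3)
            paths.append(str(sample))
        paths.append(str(Path(tmp) / "missing.bin"))  # deliberately absent
        sizes = [None if b is None else len(b) for b in read_binary_files(paths)]
        print(sizes)
```

Threads suit this case because the work is dominated by waiting on storage or the network rather than by CPU time.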

In the local, single-partition Ray cluster setup, Daft gives a 50% speedup over the existing Ludwig code for URL downloading and an overall 40% speedup across all of preprocessing.

In the distributed, multi-partition Ray cluster setup, Daft gives a 50% speedup over the existing Ludwig code for URL downloading and an overall 40% speedup across all of preprocessing.

See benchmarking doc for more details.

Co-authored-by: @jaychia

github-actions · 2023-03-15T02:49:14Z

Unit Test Results

  6 files ±0   6 suites ±0 1h 14m 59s ⏱️ - 3m 35s
33 tests ±0 29 ✔️ ±0   4 💤 ±0 0 ❌ ±0
99 runs ±0 87 ✔️ ±0 12 💤 ±0 0 ❌ ±0

Results for commit 1883bff. ± Comparison against base commit d2f71c5.

♻️ This comment has been updated with latest results.

tgaddair · 2023-06-02T18:26:48Z

ludwig/backend/ray.py

+                    df = df.to_dask_dataframe()
+                    df = self.df_engine.persist(df)
+                else:
+                    df = df.to_pandas()


Is this only needed for the Modin df engine? I think Pandas shouldn't use this codepath at all, right?

arnavgarg1 Jun 15, 2023

Actually, it's needed to convert from Daft back to either Dask or Pandas, so even the Pandas engine hits this code path.
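
A minimal sketch of the conversion step described in this thread, using stub classes so it runs without Daft or Dask installed. `StubDaftFrame`, `StubEngine`, `convert_from_daft`, and the `distributed` flag are all hypothetical. The excerpt does not show the real branch condition, so the one used here is an assumption.

```python
class StubDaftFrame:
    """Hypothetical stand-in exposing only the two conversion calls seen in the diff."""

    def __init__(self, rows):
        self.rows = rows

    def to_dask_dataframe(self):
        return ("dask", self.rows)

    def to_pandas(self):
        return ("pandas", self.rows)


class StubEngine:
    """Hypothetical df_engine; persist() is a no-op in this sketch."""

    def __init__(self, distributed: bool):
        self.distributed = distributed

    def persist(self, df):
        return df


def convert_from_daft(df, df_engine):
    """Hand the result of the Daft read back to the configured dataframe engine."""
    # Assumption: the branch tests whether the engine is distributed.
    # The diff excerpt does not show the actual condition.
    if df_engine.distributed:
        df = df.to_dask_dataframe()
        df = df_engine.persist(df)
    else:
        df = df.to_pandas()
    return df


if __name__ == "__main__":
    frame = StubDaftFrame([b"\x00", b"\x01"])
    print(convert_from_daft(frame, StubEngine(distributed=True))[0])
    print(convert_from_daft(frame, StubEngine(distributed=False))[0])
```

Either way the caller gets back a dataframe in the engine's own type, which is why the Pandas engine also passes through this branch.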

justinxzhao

Excited for the speedups!!

arnavgarg1 added 2 commits March 15, 2023 01:08

remove unused imports

ee7500f

Move comment

d489cfa

arnavgarg1 added 2 commits March 15, 2023 03:40

Use 100mb per worker as a heuristic

690894a

Merge branch 'master' into daft_reads

fdf65b8

arnavgarg1 changed the title ~~Parallelize image/audio reads using Daft on Ray instead of Ray + Dask~~ Test parallelize image/audio reads using Daft on Ray instead of Ray + Dask Mar 15, 2023

arnavgarg1 and others added 8 commits March 15, 2023 20:33

Added comment

9b2f54e

Skip reinitialization of Daft runner and preserve Dask partitions for…

1f6aa80

… outer join

[pre-commit.ci] auto fixes from pre-commit.com hooks

bc9d031

for more information, see https://pre-commit.ci

Merge branch 'master' into daft_reads

49ef63f

Comments

9fac768

[pre-commit.ci] auto fixes from pre-commit.com hooks

917cfa3

for more information, see https://pre-commit.ci

Update to prevent re-init errors and saturate network bandwidth

e54d8a7

Comments

9ee7317

arnavgarg1 changed the title ~~Test parallelize image/audio reads using Daft on Ray instead of Ray + Dask~~ Parallelize image/audio reads using Daft on Ray instead of Ray + Dask May 3, 2023

arnavgarg1 and others added 4 commits May 3, 2023 11:01

Merge branch 'master' into daft_reads

a6c893d

Daft reads PR for 0.1 (#3394)

907dadb

Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com>

resolve merge conflicts

99d3fd3

Fix binary read test for pandas df_engine

1883719

arnavgarg1 requested review from justinxzhao, tgaddair and geoffreyangus May 13, 2023 00:49

arnavgarg1 marked this pull request as ready for review May 13, 2023 00:49

arnavgarg1 added 3 commits May 15, 2023 05:45

Fix strange ValueError in read binary files pandas test

a8dd6de

Remove conditional check

3874f05

Remove conditional check for idx column

69bbb3d

arnavgarg1 requested a review from jppgks May 16, 2023 16:01

jppgks approved these changes May 16, 2023

ludwig/backend/ray.py (outdated, resolved)

arnavgarg1 added 2 commits May 17, 2023 01:03

remove unused imports

33c86e6

Move comment

43afb6a

jppgks added 2 commits May 17, 2023 03:10

instruct user to install distributed extra if missing

5ce8146

fix rebase

a969933

jppgks force-pushed the daft_reads branch from d45ca8e to a969933 Compare May 17, 2023 03:15

pre-commit-ci bot and others added 3 commits May 17, 2023 03:15

[pre-commit.ci] auto fixes from pre-commit.com hooks

f3aee22

for more information, see https://pre-commit.ci

fix rebase

03f03ff

Merge branch 'daft_reads' of github.com:ludwig-ai/ludwig into daft_reads

518406b

jppgks force-pushed the daft_reads branch from 6a66c84 to 518406b Compare May 17, 2023 03:18

pre-commit-ci bot and others added 2 commits May 17, 2023 03:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

92fee1c

for more information, see https://pre-commit.ci

pass fsspec FileSystem to use for downloading data

b706022

arnavgarg1 changed the title ~~Parallelize image/audio reads using Daft on Ray instead of Ray + Dask~~ Parallelize image/audio reads using Daft on Ray instead of Dask on Ray Jun 1, 2023

arnavgarg1 added 2 commits June 1, 2023 19:12

Merge branch 'master' into daft_reads

e120902

resolve merge conflict

77d2c43

tgaddair reviewed Jun 2, 2023

requirements_distributed.txt (outdated, resolved)

tgaddair reviewed Jun 2, 2023

requirements_distributed.txt (resolved)

tgaddair reviewed Jun 2, 2023

ludwig/backend/ray.py (outdated, resolved)

tgaddair reviewed Jun 2, 2023

ludwig/backend/ray.py (resolved)

jeffkinnison reviewed Jun 2, 2023

ludwig/backend/ray.py (resolved)

Address comments

ebb85bd

arnavgarg1 requested review from jeffkinnison, jppgks and tgaddair June 15, 2023 17:38

requirements re-ordering

492f95d

arnavgarg1 changed the title ~~Parallelize image/audio reads using Daft on Ray instead of Dask on Ray~~ Improve read throughput by 50% for image/audio reads using Daft Jun 20, 2023

justinxzhao approved these changes Jun 20, 2023

jppgks approved these changes Jun 21, 2023

Remove unnecessary context manager

1883bff

arnavgarg1 changed the title ~~Improve read throughput by 50% for image/audio reads using Daft~~ Improve image/audio read throughput by 50% for image/audio features using Daft Jun 23, 2023

arnavgarg1 merged commit 91c28f8 into master Jun 23, 2023
16 checks passed

arnavgarg1 deleted the daft_reads branch June 23, 2023 14:42
