Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ray to parallelize read_feather with pyarrow feather API #292

Merged
merged 11 commits into from Dec 10, 2018

Conversation

Projects
None yet
4 participants
@eavidan
Copy link
Collaborator

commented Dec 10, 2018

What do these changes do?

Using ray to parallelize read_feather natively with pyarrow feather API
similar to what was done in read_parquet

in order to get the parallelization (across columns) I used the pyarrow API instead of pandas, similar to what was done in read_parquet.
one API issue, pyarrow supports use_threads while pandas nthreads.
to resolve this I did not remove nthreads from the API, but not using it, and using the default for use_threads in pyarrow which is currentlly True

Related issue number

#278

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • passes black --check modin/

eranavidan and others added some commits Dec 2, 2018

Refactored io.py into a module #203
read_parquet is now redirected from modin.pandas.io though the factories to data_management.io.pandas_on_ray
pandas on ray implementation of read_feather
parallel columns read similar to parquet
using pyarrow feather API
Refactored io.py into a module #203
read_parquet is now redirected from modin.pandas.io though the factories to data_management.io.pandas_on_ray
@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/166/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/104/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/167/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/105/
Test PASSed.

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Dec 10, 2018

Thanks @eavidan! I am going to do some performance benchmarking on this branch to see what our speedup is. I will report back with some numbers.

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Dec 10, 2018

After doing a bit of testing, this PR is significantly faster than pandas, even on reading the feather-format with nthreads set.

1GB data - np.random.randint(0, 100, size=(2**20, 2**8))
pandas.read_feather(fname, nthreads=144) - 3.57 sec
modin.pandas.read_feather(fname, nthreads=144) - 1.1 sec

4GB data - np.random.randint(0, 100, size=(2**22, 2**8))
pandas.read_feather(fname, nthreads=144) - 13.6 sec
modin.pandas.read_feather(fname, nthreads=144) - 2.05 sec

16GB data - np.random.randint(0, 100, size=(2**24, 2**8))
pandas.read_feather(fname, nthreads=144) - 56.5 sec
modin.pandas.read_feather(fname, nthreads=144) - 10.8 sec

Very nice, thanks!

@devin-petersohn
Copy link
Member

left a comment

Let a couple of comments. Great work on this!

@@ -32,6 +32,7 @@
NaT,
PeriodIndex,
Categorical,
options,

This comment has been minimized.

Copy link
@devin-petersohn

devin-petersohn Dec 10, 2018

Member

Let's do this in a new PR. Adding options to the __init__ won't change our internal options. I think it will require a new class.

@@ -152,6 +153,7 @@
"PeriodIndex",
"Categorical",
"__version__",
"options",

This comment has been minimized.

Copy link
@devin-petersohn

devin-petersohn Dec 10, 2018

Member

Same comment as above.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/168/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/106/
Test PASSed.

@devin-petersohn
Copy link
Member

left a comment

Looks great!

@devin-petersohn devin-petersohn merged commit a62a55f into modin-project:master Dec 10, 2018

1 check passed

Travis CI - Pull Request Build Passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.