Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First initiative to distribute the read_sql() method. #436

Merged
merged 20 commits into from Feb 1, 2019

Conversation

@igorborgest
Copy link
Contributor

commented Jan 28, 2019

What do these changes do?

Propose a new approach to distribute read_sql() method.

Related issue number

#418

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • passes black --check modin/
@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 28, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 28, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/317/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 28, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 28, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/378/
Test PASSed.

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

Thanks @igorborgest!

I will play with this a little bit and get you some feedback.

@simon-mo, @osalpekar, @williamma12 please also take a look.

@igorborgest

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

I left some explanations in the issue #418

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Jan 29, 2019

Hi @igorborgest

After some playing around with this PR a bit I have some thoughts. Overall I think it is a good first approach to the problem. If the user is familiar with their data, they can easily add a couple of parameters to improve the performance and memory usage of read_sql. I will try to be brief.

API

Having to specify partition_column, lower_bound, and upper_bound can be a bit challenging if using unfamiliar data. We might want to document how to find their lower_bound or upper_bound in the table itself and which partition_column to choose.

I know this is a first implementation, but before we merge it we should make sure we throw good errors if the user doesn't give all the correct parameters (e.g. partition_column without lower_bound).

We should also specify somewhere that it must be a numeric column because of how the implementation uses it (unless there is a planned change in this area).

Partitioning

Modin uses block-oriented partitioning, so you'll want to call _split_result_for_readers before you return the result from the remote task. This will give the expected shape of partitions.

Final thoughts

Other than minor cleanup and the points mentioned above, I think this approach seems very reasonable for solving the problem. As you said in the issue, it is a difficult problem. Relying on the user's knowledge seems like the best approach at this point. We don't deviate from pandas API, only augment it. I think this is a good direction.

Thanks again for opening this, I will let some of the other committers give it a try and give their feedback.

Add comments. Add instructional errors. Add input checks. Remake colu…
…mn extraction strategy. Using the split_result_for_readers data struct model.
@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 29, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 29, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/322/
Test PASSed.

@igorborgest

This comment has been minimized.

Copy link
Contributor Author

commented Jan 29, 2019

Hi @devin-petersohn,

Thank you for the feedbacks.

I think that I attacked all the mentioned points in my last commit.
But I'm still learning how modin data structure works from the inside. So let me know if I need to update more things.

Updating that code today I realized several ways to improve this feature. But for now, I think that what I already done is enough to start. I believe that incremental improvements will be healthier than a big shot.

What do you think?

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 29, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 29, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/383/
Test PASSed.

@osalpekar
Copy link
Collaborator

left a comment

Hi @igorborgest!
Just wanted to add a few general comments.

  • We must add the SQLAlchemy as a test dependency for Travis/Jenkins. (This is causing the CI builds to fail currently). The dependency would need to be added to the following files:
    • .travis/install-dependencies.sh
    • .jenkins/build-tests/Dockerfile
    • .jenkins/performance-tests/Dockerfile
  • Should we support float-type columns, since the current check is for int only?
  • Perhaps we should have additional test cases for distributed from_sql, where there are columns with non-numeric datatypes but the partition_column is still numeric. Just to ensure this case is handled well.
pandas_df = pandas.read_sql("select * from test", conn)
modin_df = pd.read_sql(
"select * from test",
"sqlite:///" + TEST_SQL_FILENAME,

This comment has been minimized.

Copy link
@osalpekar

osalpekar Jan 30, 2019

Collaborator

Should this be conn?

@@ -0,0 +1,165 @@
from sqlalchemy import MetaData, Table, create_engine

This comment has been minimized.

Copy link
@osalpekar

osalpekar Jan 30, 2019

Collaborator

This file seems to require some linting-related changes per black --diff modin/

return query

@classmethod
def read_sql(

This comment has been minimized.

Copy link
@osalpekar

osalpekar Jan 30, 2019

Collaborator

Would be nice to have comments for this function's args that don't conform to the pandas API (partition_column, lower_bound, upper_bound)

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/324/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/385/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/326/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/387/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 30, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/327/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/397/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/338/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jan 31, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/399/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/339/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/400/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/340/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/401/
Test PASSed.

@igorborgest

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

Great job @devin-petersohn!
Anything else for this PR?

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Feb 1, 2019

@igorborgest I am going to do another quick pass to make sure everything is okay because it is such a large change. It should be merged later today!

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-PRB/343/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Merged build finished. Test PASSed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Feb 1, 2019

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-PRB/404/
Test PASSed.

@devin-petersohn

This comment has been minimized.

Copy link
Member

commented Feb 1, 2019

Thanks @igorborgest for the PR!

Thanks to @osalpekar and @eavidan for your reviews/suggestions. I am happy with the import modin.experimental.pandas as pd idea. It will be great to see how it evolves.

@devin-petersohn devin-petersohn merged commit d4cf159 into modin-project:master Feb 1, 2019

1 check passed

Travis CI - Pull Request Build Passed
Details
@devin-petersohn devin-petersohn referenced this pull request Feb 3, 2019
@modin-bot

This comment has been minimized.

Copy link

commented Feb 8, 2019

This pull request has been mentioned on Modin Discuss. There might be relevant details there:

https://discuss.modin.org/t/experimental-api-initial-discussion/28/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.