BUG: read_parquet can't detect column partitioning in non-local filesystems #4636

Closed
mvashishtha opened this issue Jul 1, 2022 · 6 comments · Fixed by #5192
Labels
bug 🦗 Something isn't working good first issue 🔰 Good for newcomers hacktoberfest Help Wanted 🌐 Issues good for external contributors. P1 Important tasks that we should complete soon pandas.io

Comments

@mvashishtha
Collaborator

mvashishtha commented Jul 1, 2022

System information

  • Modin version (modin.__version__): 86d3610

Parquet datasets are arbitrarily deep directories of data files that can include column partitioning (ref. here). In column partitioning, a directory like col1=3 contains only the rows where col1 equals 3. Modin tries to default to pandas for datasets with column partitioning, but it assumes that column partitioning can occur only when the path to read is a local directory. The directory could instead be on a remote filesystem such as S3.
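To illustrate the idea, here is a minimal sketch of a filesystem-agnostic check. It looks only at path strings, so it works the same for local paths and `s3://` URLs. The helper names `partition_keys` and `is_column_partitioned` are hypothetical and are not Modin's actual implementation; the sketch also assumes the file listing has already been obtained from whatever filesystem layer is in use.

```python
def partition_keys(path):
    """Return {column: value} for each `name=value` directory segment in `path`.

    Hypothetical helper for illustration; not part of Modin or pandas.
    """
    # Drop a leading scheme such as "s3://" so only path segments remain.
    without_scheme = path.split("://", 1)[-1]
    segments = without_scheme.split("/")
    keys = {}
    # The final segment is the data file itself, so only inspect directories.
    for segment in segments[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            keys[name] = value
    return keys


def is_column_partitioned(file_paths):
    """True if any listed data file sits under a `name=value` directory."""
    return any(partition_keys(p) for p in file_paths)
```

For example, `is_column_partitioned(["s3://bucket/df.parquet/col1=3/part-0.parquet"])` is true, while a flat listing such as `["s3://bucket/df.parquet/part-0.parquet"]` is not. Because nothing here touches the local disk, the same check applies wherever the listing came from.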

I think that nothing is functionally wrong when we fail to detect column partitioning. Modin can still read the data correctly. It's just not defaulting to pandas even though we expect it to.

@mvashishtha mvashishtha added bug 🦗 Something isn't working good first issue 🔰 Good for newcomers pandas.io labels Jul 1, 2022
@mvashishtha
Collaborator Author

Maybe we should also consider not defaulting to pandas for cases like this. We can still parallelize the read across columns and might even do better than pandas in some cases.

@pyrito pyrito added the P2 Minor bugs or low-priority feature requests label Aug 31, 2022
@mvashishtha mvashishtha added Help Wanted 🌐 Issues good for external contributors. hacktoberfest labels Sep 19, 2022
@mvashishtha
Collaborator Author

> We can still parallelize the read across columns and might even do better than pandas in some cases.

Note that now that we parallelize across both rows and columns, I don't think we can read column-value-partitioned datasets so easily.
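One plausible reason for this (an assumption, not something stated in this thread) is that writers such as pandas' `to_parquet(partition_cols=...)` store the partition columns only in the directory names, not inside the data files. A reader that hands each worker a subset of columns then has to know which columns can be read from the files and which must be rebuilt from the paths. A rough sketch, with the hypothetical helper `split_requested_columns`:

```python
def split_requested_columns(requested, physical_columns, partition_columns):
    """Split `requested` into columns stored in files and columns encoded in paths.

    Hypothetical helper for illustration; not part of Modin or pandas.
    """
    from_files = [c for c in requested if c in physical_columns]
    from_paths = [c for c in requested if c in partition_columns]
    unknown = [
        c for c in requested
        if c not in physical_columns and c not in partition_columns
    ]
    if unknown:
        raise KeyError(f"columns not found in files or paths: {unknown}")
    return from_files, from_paths
```

With `physical_columns={"col2", "col3"}` and `partition_columns={"col1"}`, a request for `["col1", "col2"]` splits into `["col2"]` to read from the files and `["col1"]` to fill in from the `col1=...` directory names.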

@mvashishtha
Collaborator Author

> I think that nothing is functionally wrong when we fail to detect column partitioning. Modin can still read the data correctly.

Per #5159, we now throw an error.

@mvashishtha mvashishtha added P1 Important tasks that we should complete soon and removed P2 Minor bugs or low-priority feature requests labels Oct 26, 2022
@mvashishtha
Collaborator Author

@gshimansky could you please copy the partitioned directory s3://mahesh-vashishtha/modin_bug_5159_parquet/df.parquet to somewhere in s3://modin-datasets/ and make it publicly accessible so we can use that directory in a test case?

@gshimansky
Collaborator

I created the folder s3://modin-datasets/modin-bugs and copied modin_bug_5159_parquet into it. Please check that you have access; everything should be public.

@mvashishtha
Collaborator Author

@gshimansky thanks, I can read the modin_bug_5159_parquet directory with no credentials, and I can list the contents in s3://modin-datasets/modin-bugs/.

billiam-wang added a commit to billiam-wang/modin that referenced this issue Nov 14, 2022
…ning in non-local filesystems

Signed-off-by: Bill Wang <billiam@ponder.io>
mvashishtha pushed a commit that referenced this issue Nov 21, 2022
…-local filesystems (#5192)

Signed-off-by: Bill Wang <billiam@ponder.io>
dchigarev pushed a commit that referenced this issue Nov 25, 2022
…-local filesystems (#5192)

Signed-off-by: Bill Wang <billiam@ponder.io>
4 participants