BUG: read_parquet can't detect column partitioning in non-local filesystems #4636
Comments
Maybe we should also consider not defaulting to pandas for cases like this. We can still parallelize the read across columns and might even do better than pandas in some cases.
Note that now that we parallelize across both rows and columns, I don't think we can read column-value-partitioned datasets so easily.
Per #5159, we now throw an error. |
@gshimansky could you please copy the partitioned directory |
I created folder |
@gshimansky thanks, I can read the |
…ning in non-local filesystems Signed-off-by: Bill Wang <billiam@ponder.io>
…-local filesystems (#5192) Signed-off-by: Bill Wang <billiam@ponder.io>
System information
Modin version (`modin.__version__`): 86d3610

Parquet datasets are arbitrarily deep directories of data files that can include column partitioning (ref. here). In column partitioning, a directory like `col1=3` only contains data where column `col1` equals 3. Modin tries to default to pandas for datasets with column partitioning, but it assumes that column partitioning can only happen when the path to read is a local directory. The directory could, for example, be on S3.

I think that nothing is functionally wrong when we fail to detect column partitioning: Modin can still read the data correctly. It just does not default to pandas even though we expect it to.
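For illustration, detection of hive-style column partitioning could inspect the dataset's file path strings rather than walking a local directory, so it would work the same for local and remote (e.g. S3) paths. This is a hypothetical sketch, not Modin's actual implementation; the helper name and the regex are assumptions.

```python
import re

# Hypothetical helper (NOT Modin's real code): a path component of the
# form "col=value" indicates hive-style column partitioning.
_PARTITION_COMPONENT = re.compile(r"^[^=/]+=[^=/]*$")

def has_column_partitioning(file_paths):
    """Return True if any directory component of any path looks like 'col=value'.

    Works on plain strings, so local paths and remote URLs (s3://, gs://, ...)
    are treated the same way.
    """
    for path in file_paths:
        # Strip a scheme prefix like "s3://bucket" before splitting.
        _, _, rest = path.partition("://")
        components = (rest or path).split("/")
        # Only directory components count, not the file name itself.
        if any(_PARTITION_COMPONENT.match(c) for c in components[:-1]):
            return True
    return False
```

With this approach, `has_column_partitioning(["s3://bucket/data/col1=3/part-0.parquet"])` returns `True`, while an unpartitioned path such as `"s3://bucket/data/part-0.parquet"` returns `False`, regardless of filesystem.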