BUG: read_parquet can't detect column partitioning in non-local filesystems #4636

mvashishtha · 2022-07-01T21:09:18Z

System information

Modin version (modin.__version__): 86d3610

Parquet datasets are arbitrarily deep directories of data files that can include column partitioning. In column partitioning, a directory like col1=3 contains only the rows where col1 is equal to 3. Modin tries to default to pandas for datasets with column partitioning, but its check assumes that column partitioning can only happen when the path to read is a local directory. The directory could just as well be on a remote filesystem, for example on S3.

I think that nothing is functionally wrong when we fail to detect column partitioning. Modin can still read the data correctly. It's just not defaulting to pandas even though we expect it to.
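To illustrate the directory layout described above, here is a minimal sketch of a partition check that works on path strings only, so it does not care whether the listing came from a local disk or from S3. This is not Modin's implementation: the function name partition_columns and the example bucket are hypothetical, and it assumes the caller has already listed the data files with whatever filesystem layer is in use.

```python
def partition_columns(paths):
    """Return partition column names found in key=value directory segments.

    `paths` is a list of data file paths, e.g. from a directory listing.
    Works on plain strings, so local paths and s3:// URLs are treated alike.
    """
    columns = []
    for path in paths:
        # Drop a URL scheme such as "s3://" if present (assumption: at most one).
        without_scheme = path.split("://", 1)[-1]
        # Ignore the last segment, which is the data file itself.
        directories = without_scheme.split("/")[:-1]
        for segment in directories:
            if "=" not in segment:
                continue
            name = segment.split("=", 1)[0]
            if name and name not in columns:
                columns.append(name)
    return columns


# Hypothetical listing of a column-partitioned dataset on S3.
remote = [
    "s3://example-bucket/df.parquet/col1=3/part-0.parquet",
    "s3://example-bucket/df.parquet/col1=4/part-0.parquet",
]
# Hypothetical listing of an unpartitioned local dataset.
local = ["/data/df.parquet/part-0.parquet"]

print(partition_columns(remote))
print(partition_columns(local))
```

A non-empty result means the dataset uses column partitioning, regardless of where it is stored.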


mvashishtha · 2022-07-01T21:19:34Z

Maybe we should also consider not defaulting to pandas for cases like this. We can still parallelize the read across columns and might even do better than pandas in some cases.

mvashishtha · 2022-10-26T17:30:25Z

> We can still parallelize the read across columns and might even do better than pandas in some cases.

Note that now that we parallelize across both rows and columns, I don't think we can read column-value-partitioned datasets so easily.

mvashishtha · 2022-10-26T17:37:02Z

> I think that nothing is functionally wrong when we fail to detect column partitioning. Modin can still read the data correctly.

Per #5159, we now throw an error.

mvashishtha · 2022-10-26T17:43:02Z

@gshimansky could you please copy the partitioned directory s3://mahesh-vashishtha/modin_bug_5159_parquet/df.parquet to somewhere in s3://modin-datasets/ and make it publicly accessible so we can use that directory in a test case?

gshimansky · 2022-10-26T17:54:41Z

I created the folder s3://modin-datasets/modin-bugs and copied modin_bug_5159_parquet into it. Please check that you have access; everything should be public.

mvashishtha · 2022-10-26T17:59:04Z

@gshimansky thanks, I can read the modin_bug_5159_parquet directory with no credentials, and I can list the contents in s3://modin-datasets/modin-bugs/.


mvashishtha added the bug 🦗 (Something isn't working), good first issue 🔰 (Good for newcomers), and pandas.io labels Jul 1, 2022

pyrito added the P2 (Minor bugs or low-priority feature requests) label Aug 31, 2022

mvashishtha added the Help Wanted 🌐 (Issues good for external contributors) and hacktoberfest labels Sep 19, 2022

mvashishtha mentioned this issue Sep 19, 2022

Docs: Broken link on benchmarks/tutorials page #4661

Closed

mvashishtha mentioned this issue Oct 26, 2022

FEAT: Implement distributed read_parquet from column-partitioned directories. #5160

Open

mvashishtha added the P1 (Important tasks that we should complete soon) label and removed the P2 (Minor bugs or low-priority feature requests) label Oct 26, 2022

mvashishtha mentioned this issue Oct 26, 2022

BUG: unable to read partitioned parquet #5159

Closed

3 tasks

mvashishtha assigned mvashishtha and unassigned mvashishtha Oct 26, 2022

gshimansky self-assigned this Oct 26, 2022

mvashishtha unassigned gshimansky Oct 26, 2022

mvashishtha assigned mvashishtha and billiam-wang and unassigned mvashishtha Oct 27, 2022

anmyachev mentioned this issue Nov 5, 2022

FIX-#4636: allows read_parquet to detect column partitioning in non-local filesystems #5192

Merged

7 tasks

billiam-wang added a commit to billiam-wang/modin that referenced this issue Nov 14, 2022

FIX-modin-project#4636: allows read_parquet to detect column partitio…

054e437

…ning in non-local filesystems Signed-off-by: Bill Wang <billiam@ponder.io>

mvashishtha closed this as completed in #5192 Nov 21, 2022

mvashishtha pushed a commit that referenced this issue Nov 21, 2022

FIX-#4636: allows read_parquet to detect column partitioning in non…

073dffc

…-local filesystems (#5192) Signed-off-by: Bill Wang <billiam@ponder.io>

dchigarev pushed a commit that referenced this issue Nov 25, 2022

FIX-#4636: allows read_parquet to detect column partitioning in non…

f80d69d

…-local filesystems (#5192) Signed-off-by: Bill Wang <billiam@ponder.io>
