Restore ability to read Parquet files in S3 directories #392

@gitosaurus

Bug report

During the resolution of #365, commit 84d3a09 removed the ability of nested_pandas.read_parquet to read Parquet files from S3 directories. This regression came about because:

  1. During development and testing, it became clear that .read_parquet had never been able to read HTTP directories, and this was assumed to hold for all remote network directories;
  2. UPath.is_dir() was observed to be much too slow for testing remote filesystem paths.

This change then caused a regression in LSDB, which was worked around in PR astronomy-commons/hats#576.

nested_pandas.read_parquet should be changed to restore support for S3 directories, and for any other network-based filesystems it could read before, without incurring undue cost from UPath.is_dir(). One possible solution is to treat a trailing slash on the path as a signal of the user's intent to read a directory; however, a trailing slash was not required before, so existing callers that omit it would still break. Another is to accept the cost of UPath.is_dir() (as LSDB does in its workaround), as long as that cost is much less than the cost of reading the Parquet file itself.
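The two options above could be combined, roughly as in the sketch below. This is only an illustration, not the nested_pandas implementation: the helper name `looks_like_directory` and its `probe` parameter are hypothetical, and skipping the probe for HTTP(S) paths is an assumption based on the observation that HTTP directories were never readable.

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def looks_like_directory(
    path: str,
    probe: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Hypothetical helper: decide whether `path` should be read as a directory.

    `probe` is an optional, possibly expensive check, for example
    `lambda p: UPath(p).is_dir()` (needs universal_pathlib plus a
    filesystem backend such as s3fs). It is only called when the cheap
    checks are inconclusive.
    """
    # Cheap check first: a trailing slash is taken as the user's intent.
    if path.endswith("/"):
        return True

    # Assumption: HTTP(S) directory listings are not supported,
    # so there is no point in probing them.
    if urlparse(path).scheme in ("http", "https"):
        return False

    # Fall back to the expensive probe only if the caller supplied one.
    if probe is not None:
        return probe(path)
    return False
```

With this shape, callers who write the trailing slash pay nothing extra, while callers who omit it pay for a single is_dir-style probe, and only on non-HTTP paths.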

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

Labels: bug (Something isn't working)
