Use pyarrow native filesystems for reading from URI string #316

@hombit

fsspec, which we use via UPath, looks suboptimal for most of our users; see astronomy-commons/lsdb#936. pandas doesn't really use fsspec if the given URI string is natively supported by pyarrow, which makes pandas HTTP and S3 reads much faster than ours (especially in the case of column selection).
I propose to:

  1. Change to pyarrow filesystems for supported URI strings
  2. Introduce a small block size (e.g. 32 KiB) for the HTTP filesystem, which is not natively supported by pyarrow, as discussed in "Investigate how column selection affects input byte volume", astronomy-commons/lsdb#936 (comment)
  3. Add S3 and HTTPS ASV benchmarks that measure file-read performance with column selection
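
A rough sketch of what items 1 and 2 could look like, not a proposed final implementation. The names `choose_backend`, `read_parquet_columns`, `PYARROW_SCHEMES` and `HTTP_BLOCK_SIZE` are hypothetical, and the set of schemes that pyarrow resolves natively is an assumption that depends on how pyarrow was built. The idea is to hand the URI to `pyarrow.fs.FileSystem.from_uri` when the scheme is supported and otherwise fall back to fsspec, passing a small `block_size` for HTTP(S).

```python
from urllib.parse import urlsplit

# Assumption: schemes that pyarrow.fs.FileSystem.from_uri can resolve.
# The real set depends on the pyarrow build (S3/GCS/HDFS support is optional).
PYARROW_SCHEMES = {"file", "s3", "gs", "gcs", "hdfs", "viewfs"}
HTTP_SCHEMES = {"http", "https"}
HTTP_BLOCK_SIZE = 32 * 1024  # 32 KiB, the example value from item 2


def choose_backend(uri: str) -> str:
    """Return "pyarrow" if the URI scheme is assumed native to pyarrow, else "fsspec"."""
    scheme = urlsplit(uri).scheme.lower()
    return "pyarrow" if scheme in PYARROW_SCHEMES else "fsspec"


def read_parquet_columns(uri: str, columns=None):
    """Read a Parquet file, preferring a pyarrow native filesystem."""
    # Imported lazily so that choose_backend works without the optional deps.
    import pyarrow.parquet as pq

    if choose_backend(uri) == "pyarrow":
        from pyarrow import fs

        # from_uri returns the filesystem and the path within it.
        filesystem, path = fs.FileSystem.from_uri(uri)
        return pq.read_table(path, columns=columns, filesystem=filesystem)

    import fsspec

    # Extra keyword arguments to fsspec.open go to the filesystem constructor;
    # fsspec's HTTP filesystem accepts block_size.
    scheme = urlsplit(uri).scheme.lower()
    fs_kwargs = {"block_size": HTTP_BLOCK_SIZE} if scheme in HTTP_SCHEMES else {}
    with fsspec.open(uri, "rb", **fs_kwargs) as f:
        return pq.read_table(f, columns=columns)
```

With this, something like `read_parquet_columns("s3://some-bucket/file.parquet", columns=["ra", "dec"])` (bucket and column names made up) would go through pyarrow's S3 filesystem, while an `https://` URL would go through fsspec with 32 KiB blocks.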

Labels: enhancement
