Added support for cloud object storage (S3, GCS, ADLS, etc.) #1164

Merged (40 commits) on May 5, 2021

@tgaddair tgaddair commented Apr 23, 2021

Also consolidated HDF5 and Parquet caching into a single abstraction layer in the backend, called CacheManager. Local backends can now optionally cache in either HDF5 or Parquet, and the new DatasetManager abstraction creates the appropriate dataset based on the cache format.
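
A rough sketch of how a cache layer might pick a dataset implementation from the cache format, as described above. Only the names CacheManager and DatasetManager come from this PR; the constructor argument, the methods, the two subclasses, and the dict return values are assumptions made for illustration and are not this project's actual API.

```python
from abc import ABC, abstractmethod


class DatasetManager(ABC):
    """Builds datasets for one cache format (hypothetical interface)."""

    @abstractmethod
    def cache_path(self, name: str) -> str:
        """Return the cache file name for a dataset called `name`."""

    @abstractmethod
    def create(self, name: str) -> dict:
        """Return a description of the dataset backed by the cache file."""


class HDF5DatasetManager(DatasetManager):
    def cache_path(self, name: str) -> str:
        return f"{name}.hdf5"

    def create(self, name: str) -> dict:
        return {"format": "hdf5", "path": self.cache_path(name)}


class ParquetDatasetManager(DatasetManager):
    def cache_path(self, name: str) -> str:
        return f"{name}.parquet"

    def create(self, name: str) -> dict:
        return {"format": "parquet", "path": self.cache_path(name)}


class CacheManager:
    """Single entry point for caching; delegates format details (hypothetical)."""

    _MANAGERS = {"hdf5": HDF5DatasetManager, "parquet": ParquetDatasetManager}

    def __init__(self, cache_format: str = "hdf5"):
        if cache_format not in self._MANAGERS:
            raise ValueError(f"unsupported cache format: {cache_format!r}")
        # The chosen DatasetManager decides what kind of dataset gets created.
        self.dataset_manager = self._MANAGERS[cache_format]()

    def get_dataset(self, name: str) -> dict:
        return self.dataset_manager.create(name)
```

The point of the split is that callers only talk to the cache layer, so adding another cache format would mean adding one DatasetManager subclass rather than touching each backend.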

Depends on uber/petastorm#665.

@tgaddair tgaddair changed the title from "Added support for remote filesystems (S3, GCS, ADLS, etc.)" to "Added support for cloud object storage (S3, GCS, ADLS, etc.)" Apr 26, 2021


@contextlib.contextmanager
def download_h5(url):

Collaborator

Why download_h5 and not open_h5?

Collaborator

Also, does this work with remote URLs? I guess if it's remote, fsspec.open_local downloads it first?

Collaborator Author

download_h5 first downloads the remote file to local disk, then opens it read-only.

upload_h5 opens a local file for writing, then uploads it to the remote URL.

And yes, open_local just means the file is downloaded locally before it is opened, as opposed to streaming. This is primarily for caching (if you intend to re-read the file repeatedly).
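
To make the two directions concrete, here is a dependency-free sketch of the download-then-open and write-then-upload pattern. It is not the code from this PR: the function names, the `fetch` and `push` parameters, and the `_copy` helper are hypothetical stand-ins for the transfer step (fsspec in the PR), and plain binary files stand in for HDF5 handles.

```python
import contextlib
import os
import shutil
import tempfile


def _copy(src, dst):
    """Hypothetical transfer step; a real version would call a storage client."""
    shutil.copyfile(src, dst)


@contextlib.contextmanager
def download_then_open(url, fetch=_copy):
    """Copy `url` to a local temp file, then yield it opened read-only."""
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, "cached_file")
        fetch(url, local_path)  # the whole file lands locally first, no streaming
        with open(local_path, "rb") as f:
            yield f


@contextlib.contextmanager
def open_then_upload(url, push=_copy):
    """Yield a local temp file opened for writing, then copy it to `url`."""
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, "cached_file")
        with open(local_path, "wb") as f:
            yield f
        # The file is closed and flushed here. This line is only reached
        # if the caller's block finished without raising.
        push(local_path, url)
```

This sketch deletes the local copy when the block exits. A caching variant, which is the use case mentioned above, would keep the local file around so that repeated reads skip the transfer.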

@tgaddair tgaddair merged commit c93474b into master May 5, 2021
@tgaddair tgaddair deleted the remote-fs branch May 5, 2021 22:37