
Zarr/GCS potential optimisation to reduce latency #381

Closed
alimanfoo opened this issue Sep 6, 2018 · 12 comments

@alimanfoo

In discussion of a separate issue (#196) it came up that gcsfs currently makes 2 HTTP requests each time a cloud object is retrieved: the first to obtain the size and download URL (media link), the second to retrieve the actual bytes. Looking at the source code for the Google Cloud Python client library, it appears that the download URL can be built from a template, and the size is not required, so it should be possible to retrieve the data in a single HTTP request. This could eliminate significant latency, especially when used with zarr arrays, where chunks are relatively small.

I thought I would raise this here for initial discussion about the best way forward, as this issue potentially affects both the zarr and gcsfs projects. We could pursue an optimisation within gcsfs. We could also look again at @rabernat's PR with a GCS mapping using the Google Cloud Python client library (zarr-developers/zarr-python#252). A performance comparison on a relevant dataset to confirm the issue would also be useful.

@alimanfoo
Author

Basically it looks like if you add ?alt=media to the object URL and make a GET request, you get the object data in the response.
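For concreteness, here's a minimal sketch of building that single-request download URL from a template (the URL shape matches the GCS JSON API; the helper name `media_download_url` is hypothetical, not part of gcsfs or the Google client library):

```python
from urllib.parse import quote


def media_download_url(bucket, object_name):
    """Build a direct download URL for a GCS object (illustrative helper).

    Appending ?alt=media to the JSON API object URL makes the GET return
    the object bytes directly, instead of the object metadata, so no
    separate metadata/media-link request is needed.
    """
    # Object names must be percent-encoded, including any slashes
    # (safe="" forces "/" to become "%2F").
    encoded = quote(object_name, safe="")
    return (
        "https://storage.googleapis.com/download/storage/v1/"
        f"b/{bucket}/o/{encoded}?alt=media"
    )
```

A single authenticated GET to this URL would then replace the metadata-then-media pair of requests discussed above.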

@martindurant
Contributor

I should point out that the second request is only made if the file information is not already cached. I suppose that this is indeed the case when we are explicitly trying to avoid "contains" calls.

@martindurant
Contributor

OK, so it would be pretty easy to implement this in gcsfs too, and I could imagine various ways to do it: an option on open() or on the file-system in general, which would produce file-likes that don't support seek(); or a modification to cat() (which fetches the whole file) to use the single-call method, to be used in the context of the mapper rather than generally. Would you like me to code it up?

@martindurant
Contributor

Try fsspec/gcsfs#111 ?

@alimanfoo
Author

alimanfoo commented Sep 6, 2018 via email

@alimanfoo
Author

alimanfoo commented Sep 6, 2018 via email

@alimanfoo
Author

alimanfoo commented Sep 6, 2018 via email

@stale

stale bot commented Nov 5, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@martindurant
Contributor

I think we can call this fixed.

@rabernat
Member

rabernat commented Nov 6, 2018

So should we consolidate all our GCS metadata now?!?

@alimanfoo
Author

@rabernat if you are happy with the revised format for the consolidated metadata (zarr-developers/zarr-python#268 (comment)) then I think we can merge that PR, and you could go ahead with consolidating GCS metadata. I think @martindurant also plans to add support for consolidated metadata in intake, so opening a dataset with consolidated metadata would be transparent for pangeo users.
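Zarr provides this via zarr.consolidate_metadata / zarr.open_consolidated; the core idea can be sketched over a plain dict-like store (the helper below is an illustrative reimplementation of the idea, not the actual zarr code, though the `.zmetadata` layout with `zarr_consolidated_format` matches the format discussed in the linked PR):

```python
import json


def consolidate_metadata(store, metadata_key=".zmetadata"):
    """Gather all small zarr metadata objects into one JSON document
    (illustrative sketch of what consolidated metadata does).

    A reader can then issue a single GET for metadata_key instead of
    one request per .zarray/.zgroup/.zattrs file, which matters a lot
    on high-latency object stores like GCS.
    """
    meta = {
        key: json.loads(store[key])
        for key in store
        if key.rsplit("/", 1)[-1] in (".zarray", ".zgroup", ".zattrs")
    }
    store[metadata_key] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": meta}
    ).encode()
    return meta
```

Consolidation is a one-off write step run by the dataset owner; readers just need to know to look for the single consolidated key.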

@martindurant
Contributor

Yes, it will be an argument in Intake, but someone still has to make the effort to run consolidation on datasets. It's particularly worthwhile for datasets that currently have many metadata files.
