Zarr/GCS potential optimisation to reduce latency #381
Comments
Basically it looks like if you add …
I should point out that the second request is only made if the file information is not already cached. I suppose that this is indeed the case when we are explicitly trying to avoid "contains" calls.
OK, so it would be pretty easy to implement this in gcsfs too, and I can imagine various ways to do it: an option on open(), or on the file-system in general, that produces file-likes that don't support seek(); or a modification to cat() (which fetches the whole file) to use the single-call method, to be used in the context of the mapper rather than generally. Would you like me to code it up?
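The second idea above (a mapper that goes through cat()) can be sketched as follows. The class and helper names here are illustrative, not the eventual gcsfs API, and a dict-backed stand-in plays the role of the remote filesystem:

```python
# Hypothetical sketch: a mapper whose __getitem__ fetches each object
# with a single whole-file cat() call instead of open()/read(), so no
# preliminary metadata request and no seekable file-like is needed.
class FakeFS:
    def __init__(self, objects):
        self._objects = objects

    def cat(self, path):
        # One call returns the whole object: no size lookup first.
        return self._objects[path]


class CatMapper:
    def __init__(self, fs, root):
        self.fs = fs
        self.root = root

    def __getitem__(self, key):
        return self.fs.cat(f"{self.root}/{key}")


fs = FakeFS({"bucket/store/.zarray": b'{"shape": [10]}'})
m = CatMapper(fs, "bucket/store")
print(m[".zarray"])  # b'{"shape": [10]}'
```

Since zarr only ever reads whole chunk objects through the mapper interface, nothing in that path actually needs seek().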
Try fsspec/gcsfs#111?
Many thanks Martin, I'll give it a go asap.
Just to confirm I tried out the PR branch and it about halves the time to
retrieve small objects via a gcsfs.GCSMap.
--
If I do not respond to an email within a few days, please feel free to
resend your email and/or contact me by other means.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health
Big Data Institute
Li Ka Shing Centre for Health Information and Discovery
Old Road Campus
Headington
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596 or +44 (0)7866 541624
Email: alimanfoo@googlemail.com
Web: http://alimanfoo.github.io/ <http://purl.org/net/aliman>
Twitter: @alimanfoo <https://twitter.com/alimanfoo>
For interest, on my local machine, retrieving a very small object directly using the Google Cloud Storage client library takes ~300 ms. Using GCSMap from @martindurant's new gcsfs branch also takes ~300 ms, down from >700 ms with the release version.
I also tried GCSStore from @rabernat's zarr PR zarr-developers/zarr-python#252, which currently takes ~450 ms; profiling shows some extra network communication, but with a small amount of hacking this too can be brought down to ~300 ms.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think we can call this fixed.
So should we consolidate all our GCS metadata now?!
@rabernat if you are happy with the revised format for the consolidated metadata (zarr-developers/zarr-python#268 (comment)) then I think we can merge that PR, and you could go ahead with consolidating GCS metadata. I think @martindurant also plans to add support for consolidated metadata in intake, so opening a dataset with consolidated metadata would be transparent for pangeo users.
Yes, it will be an argument in Intake, but someone still has to make the effort to run consolidation on datasets. It's particularly worthwhile for those that currently have many metadata files.
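The effect of consolidation can be sketched with plain dicts. The helper below is illustrative (the real API landed in zarr as consolidate_metadata/open_consolidated); the point is that all the small metadata documents are gathered into one ".zmetadata" key, so opening a dataset costs a single GET instead of one request per metadata file:

```python
import json

def consolidate(store, suffixes=(".zarray", ".zgroup", ".zattrs")):
    """Hypothetical helper: gather all metadata keys into '.zmetadata'."""
    meta = {
        key: json.loads(value)
        for key, value in store.items()
        if key.endswith(suffixes)
    }
    store[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": meta}
    ).encode()

store = {
    ".zgroup": b'{"zarr_format": 2}',
    "temp/.zarray": b'{"zarr_format": 2, "shape": [10], "chunks": [5]}',
}
consolidate(store)
print(sorted(store))  # ['.zgroup', '.zmetadata', 'temp/.zarray']
```

A consolidated-aware reader then answers all metadata lookups from the one cached ".zmetadata" document and only touches the store again for chunk data.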
In discussion of a separate issue (#196) it came up that gcsfs currently makes two HTTP requests each time a cloud object is retrieved: the first to obtain the size and download URL (media link), and the second to retrieve the actual bytes. Looking at the source code of the Google Cloud Python client library, it appears that the download URL can be built from a template and the size is not required, so the data could be retrieved in a single HTTP request. This could eliminate significant latency, especially when used with zarr arrays, where chunks are relatively small.
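To make the single-request idea concrete: the GCS JSON API exposes a fixed media-download endpoint, so the URL can be built from the bucket and object name alone, with no metadata request to discover the mediaLink first. The helper name below is mine; the endpoint template is the public one:

```python
from urllib.parse import quote

def gcs_media_url(bucket, object_name):
    """Build the GCS JSON API media-download URL directly."""
    # Object names must be fully URL-encoded, including any "/".
    return (
        "https://storage.googleapis.com/download/storage/v1/b/"
        f"{quote(bucket, safe='')}/o/{quote(object_name, safe='')}"
        "?alt=media"
    )

url = gcs_media_url("my-bucket", "zarr-store/temp/0.0.0")
print(url)
```

A single authenticated GET on this URL returns the object bytes directly, halving the round trips per chunk read.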
I thought I would raise this here for initial discussion about the best way forward, although it potentially affects both the zarr and gcsfs projects. We could pursue an optimisation within gcsfs. We could also look again at @rabernat's PR with a GCS mapping using the Google Cloud Python client library (zarr-developers/zarr-python#252). A performance comparison on a relevant dataset to confirm the issue would also be useful.