Zarr/GCS potential optimisation to reduce latency #381
Comments
Basically it looks like if you add …
I should point out that the second request is only made if the file information is not already cached. I suppose that this is indeed the case when we are explicitly trying to avoid "contains" calls.
OK, so it would be pretty easy to implement this in gcsfs too, and I can imagine various ways to do it: an option on open(), or on the file-system in general, that produces file-likes that don't support seek(); or a modification to cat() (which fetches the whole file) to use the single-call method, to be used in the context of the mapper rather than generally. Would you like me to code it up?
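The second idea above (a mapper that goes through cat()) can be sketched as follows. The class and helper names here are illustrative, not the eventual gcsfs API, and a dict-backed stand-in plays the role of the remote filesystem:

```python
# Hypothetical sketch: a mapper whose __getitem__ fetches each object
# with a single whole-file cat() call instead of open()/read(), so no
# preliminary metadata request and no seekable file-like is needed.
class FakeFS:
    def __init__(self, objects):
        self._objects = objects

    def cat(self, path):
        # One call returns the whole object: no size lookup first.
        return self._objects[path]


class CatMapper:
    def __init__(self, fs, root):
        self.fs = fs
        self.root = root

    def __getitem__(self, key):
        return self.fs.cat(f"{self.root}/{key}")


fs = FakeFS({"bucket/store/.zarray": b'{"shape": [10]}'})
m = CatMapper(fs, "bucket/store")
print(m[".zarray"])  # b'{"shape": [10]}'
```

Since zarr only ever reads whole chunk objects through the mapper interface, nothing in that path actually needs seek().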
Try fsspec/gcsfs#111?
Many thanks Martin, I'll give it a go asap.
Just to confirm I tried out the PR branch and it about halves the time to
retrieve small objects via a gcsfs.GCSMap.
--
If I do not respond to an email within a few days, please feel free to
resend your email and/or contact me by other means.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health
Big Data Institute
Li Ka Shing Centre for Health Information and Discovery
Old Road Campus
Headington
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596 or +44 (0)7866 541624
Email: alimanfoo@googlemail.com
Web: http://alimanfoo.github.io/ <http://purl.org/net/aliman>
Twitter: @alimanfoo <https://twitter.com/alimanfoo>
For interest, on my local machine, retrieving a very small object directly using the Google Cloud Storage client library takes ~300 ms. Using GCSMap from @martindurant's new gcsfs branch also takes ~300 ms, down from >700 ms with the release version.
I also tried GCSStore from @rabernat's zarr PR zarr-developers/zarr-python#252, which currently takes ~450 ms; profiling shows some extra network communication, but with a small amount of hacking this too can be brought down to ~300 ms.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think we can call this fixed.
So should we consolidate all our GCS metadata now?!
@rabernat if you are happy with the revised format for the consolidated metadata (zarr-developers/zarr-python#268 (comment)) then I think we can merge that PR, and you could go ahead with consolidating GCS metadata. I think @martindurant also plans to add support for consolidated metadata in intake, so opening a dataset with consolidated metadata would be transparent for pangeo users.
Yes, it will be an argument in Intake, but someone still has to make the effort to run consolidation on datasets. It's particularly worthwhile for those that currently have many metadata files.
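The effect of consolidation can be sketched with plain dicts. The helper below is illustrative (the real API landed in zarr as consolidate_metadata/open_consolidated); the point is that all the small metadata documents are gathered into one ".zmetadata" key, so opening a dataset costs a single GET instead of one request per metadata file:

```python
import json

def consolidate(store, suffixes=(".zarray", ".zgroup", ".zattrs")):
    """Hypothetical helper: gather all metadata keys into '.zmetadata'."""
    meta = {
        key: json.loads(value)
        for key, value in store.items()
        if key.endswith(suffixes)
    }
    store[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": meta}
    ).encode()

store = {
    ".zgroup": b'{"zarr_format": 2}',
    "temp/.zarray": b'{"zarr_format": 2, "shape": [10], "chunks": [5]}',
}
consolidate(store)
print(sorted(store))  # ['.zgroup', '.zmetadata', 'temp/.zarray']
```

A consolidated-aware reader then answers all metadata lookups from the one cached ".zmetadata" document and only touches the store again for chunk data.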
In discussion of a separate issue (#196) it came up that gcsfs currently makes two HTTP requests each time a cloud object is retrieved: the first to obtain the size and download URL (media link), and the second to retrieve the actual bytes. Looking at the source code of the Google Cloud Python client library, it appears that the download URL can be built from a template and the size is not required, so the data could be retrieved in a single HTTP request. This could eliminate significant latency, especially when used with zarr arrays, where chunks are relatively small.
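To make the single-request idea concrete: the GCS JSON API exposes a fixed media-download endpoint, so the URL can be built from the bucket and object name alone, with no metadata request to discover the mediaLink first. The helper name below is mine; the endpoint template is the public one:

```python
from urllib.parse import quote

def gcs_media_url(bucket, object_name):
    """Build the GCS JSON API media-download URL directly."""
    # Object names must be fully URL-encoded, including any "/".
    return (
        "https://storage.googleapis.com/download/storage/v1/b/"
        f"{quote(bucket, safe='')}/o/{quote(object_name, safe='')}"
        "?alt=media"
    )

url = gcs_media_url("my-bucket", "zarr-store/temp/0.0.0")
print(url)
```

A single authenticated GET on this URL returns the object bytes directly, halving the round trips per chunk read.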
I thought I would raise this here for initial discussion about the best way forward, although it potentially affects both the zarr and gcsfs projects. We could pursue an optimisation within gcsfs. We could also look again at @rabernat's PR with a GCS mapping using the Google Cloud Python client library (zarr-developers/zarr-python#252). A performance comparison on a relevant dataset to confirm the issue would also be useful.