add gmet zarr dataset to intake cat #341

jhamman · 2018-07-26T17:15:08Z

xref: pangeo-data/pangeo-example-notebooks#5

martindurant · 2018-07-27T16:34:58Z

I am +1 for this.
However, the full dataset contains rather a lot of files, and its metadata is slow to load. The chunk size of 14MB is the right order of magnitude but would perhaps be better at ~100MB. Perhaps this dataset also needs the metadata treatment in #309, and a general agreement on how to cope with the many-files problem in the future (half a million files in this case).
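For a rough sense of the trade-off, here is a minimal sketch of how chunk shape maps to per-chunk size and chunk count. The array shape, item size, and chunk shapes below are hypothetical, not the actual gmet layout, and the sizes are uncompressed.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of a single chunk."""
    return math.prod(chunk_shape) * itemsize


def n_chunks(array_shape, chunk_shape):
    """Number of chunks (one stored object each) needed to cover the array."""
    return math.prod(
        math.ceil(a / c) for a, c in zip(array_shape, chunk_shape)
    )


# Hypothetical float32 array (itemsize 4) with dimensions (time, y, x).
array_shape = (1000, 200, 400)
small = (50, 200, 400)    # 16,000,000 bytes per chunk
large = (300, 200, 400)   # 96,000,000 bytes per chunk

for chunks in (small, large):
    print(chunks, chunk_nbytes(chunks, 4), n_chunks(array_shape, chunks))
```

Larger chunks mean fewer objects in the bucket and fewer requests, at the cost of more memory per task.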

jhamman · 2018-07-27T17:01:12Z

@martindurant - I can update the chunksize in the dataset if that would be useful.

martindurant · 2018-07-27T17:09:02Z

I certainly think that would help, and 100MB or more is a good place to aim, perhaps bigger as the total dataset grows.

jhamman · 2018-07-27T17:28:01Z

Okay, I'll push the dataset again with larger chunk sizes. I'll update here when it's ready.

martindurant · 2018-07-27T17:28:47Z

The catalog doesn't change, right?

jhamman · 2018-07-27T17:29:27Z

Right. I may remove the .zarr suffix though on the bucket. Thoughts?

mrocklin · 2018-07-27T17:29:39Z

> Perhaps this data-set also needs the metadata treatment in #309

How does one do this? Do we need a best practices doc? Should this be upstreamed to XArray in some way?

martindurant · 2018-07-27T17:35:23Z

@mrocklin, right, it's strictly ad hoc; we didn't come to a conclusion on how to consolidate the metadata in a way that causes the least friction. My simplistic solution was to put a file .zmetadata in the top level with the contents of all of the rest of the .z* files in the hierarchy; then zarr.Group looks for this key in the mapping. Following the discussion on metadata over on zarr, it could, for example, have been an additional entry in .zattrs in the top level, and so stay within the spec.
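To illustrate the idea, here is a minimal sketch over a plain dict standing in for the store mapping. It is not zarr's API: the function names `consolidate` and `read_meta` and the layout of the JSON inside `.zmetadata` are assumptions made for illustration.

```python
import json

# Per-node metadata file names used by zarr hierarchies.
META_NAMES = (".zarray", ".zgroup", ".zattrs")
CONSOLIDATED_KEY = ".zmetadata"


def consolidate(store):
    """Copy every .z* metadata document into one top-level key.

    `store` is any mutable mapping of str keys to JSON bytes,
    for example a dict or a mapping over a cloud bucket.
    """
    meta = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in META_NAMES
    }
    store[CONSOLIDATED_KEY] = json.dumps(meta).encode()
    return meta


def read_meta(store, key):
    """Read one metadata document, preferring the consolidated copy."""
    if CONSOLIDATED_KEY in store:
        return json.loads(store[CONSOLIDATED_KEY])[key]
    return json.loads(store[key])


# Hypothetical store with one group, one array, and one chunk.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "a/.zarray": b'{"chunks": [10]}',
    "a/0": b"\x00",
}
consolidate(store)
print(read_meta(store, "a/.zarray"))
```

With this layout a reader fetches one object to learn the whole hierarchy instead of listing and opening every small metadata file.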

martindurant · 2018-07-27T17:36:19Z

(but yes, it could have been a PR to xarray, that's part of the conversation in #309)

martindurant · 2018-07-27T17:37:02Z

> I may remove the .zarr suffix though on the bucket. Thoughts?

I mildly side with keeping it, for people that browse the file hierarchy in the bucket.

mrocklin · 2018-07-27T17:38:32Z

OK. If this is important, I'd like to make sure that it doesn't get lost. It looks like nothing significant has happened on that issue for about a month. @martindurant do you have thoughts on the right way to come to some long-term conclusion on this issue? Should it be elevated to some upstream issue tracker?

martindurant · 2018-07-27T17:54:58Z

We have zarr-developers/zarr-python#268, which is also getting old.

alimanfoo · 2018-07-27T18:35:02Z

Happy to pick that discussion up again on the Zarr side if that would help.

jhamman · 2018-07-31T01:51:21Z

I rechunked the dataset. We're down to 16k chunks and up to ~80MB/chunk.

add gmet zarr dataset to intake cat

eba40d1

jhamman mentioned this pull request Jul 26, 2018

update gmet dataset pangeo-data/pangeo-example-notebooks#5

Merged

jhamman merged commit 453f1d2 into pangeo-data:master Jul 31, 2018

jhamman deleted the gmet_intake branch July 31, 2018 17:17
