add gmet zarr dataset to intake cat #341

Merged
merged 1 commit into from
Jul 31, 2018

Conversation

jhamman
Member

@jhamman jhamman commented Jul 26, 2018

@martindurant
Contributor

I am +1 for this.
However, the full dataset contains a lot of files, so loading the metadata is slow. The 14 MB chunk size is the right order of magnitude, but ~100 MB would probably be better. Perhaps this dataset also needs the metadata treatment in #309, and we need a general agreement on how to cope with the many-files problem in future (half a million files in this case).
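
As a rough illustration of the chunk-size arithmetic, here is a minimal sketch that picks how many time steps to put in one chunk so that its uncompressed size lands near a target. The grid shape, the 4-byte item size, and the function names are assumptions made for the example; they are not taken from the GMET dataset.

```python
import math

MB = 2**20  # bytes per mebibyte


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return math.prod(chunk_shape) * itemsize


def steps_per_chunk(spatial_shape, itemsize, target_bytes=100 * MB):
    """Largest number of time steps per chunk that stays within target_bytes."""
    bytes_per_step = chunk_nbytes(spatial_shape, itemsize)
    return max(1, target_bytes // bytes_per_step)


# Hypothetical (lat, lon) grid of 4-byte floats; not the real dataset's shape.
spatial = (224, 464)
n_steps = steps_per_chunk(spatial, itemsize=4)
size_mb = chunk_nbytes((n_steps, *spatial), 4) / MB
print(f"{n_steps} time steps per chunk, about {size_mb:.1f} MB uncompressed")
```

Chunking only along time is one choice among several; the same helper works for any chunk shape, and compressed object sizes in the bucket will be smaller than the figure computed here.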

@jhamman
Member Author

jhamman commented Jul 27, 2018

@martindurant - I can update the chunksize in the dataset if that would be useful.

@martindurant
Contributor

I certainly think that would help, and ~100 MB or more is a good place to aim, perhaps bigger as the total dataset grows.

@jhamman
Member Author

jhamman commented Jul 27, 2018

Okay, I'll push the dataset again with larger chunk sizes. I'll update here when it's ready.

@martindurant
Contributor

The catalog doesn't change, right?

@jhamman
Member Author

jhamman commented Jul 27, 2018

Right. I may remove the .zarr suffix though on the bucket. Thoughts?

@mrocklin
Member

mrocklin commented Jul 27, 2018 via email

@martindurant
Contributor

@mrocklin, right, it's strictly ad hoc; we didn't come to a conclusion on how to consolidate the metadata in a way that causes the least friction. My simplistic solution was to put a file, .zmetadata, at the top level containing the contents of all the other .z* files in the hierarchy; zarr.Group then looks for this key in the mapping. Following the discussion on metadata over on zarr, it could instead have been, for example, an additional entry in the top-level .zattrs, and so stay within the spec.
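
To make the idea concrete, here is a minimal sketch of that kind of consolidation over a plain dict standing in for the key/value mapping. Only the `.zmetadata` key name comes from the comment above; the helper names `consolidate` and `read_meta`, the JSON layout of the combined entry, and the abbreviated example metadata are all assumptions for illustration and are not the code from #309 or from zarr.

```python
import json

META_KEY = ".zmetadata"  # top-level key name mentioned in the comment above


def consolidate(store):
    """Copy every .zgroup/.zarray/.zattrs entry into one top-level key."""
    combined = {
        key: json.loads(value)
        for key, value in store.items()
        if key != META_KEY and key.rsplit("/", 1)[-1].startswith(".z")
    }
    store[META_KEY] = json.dumps(combined).encode()
    return combined


def read_meta(store, key):
    """Prefer the consolidated copy; fall back to the per-key entry."""
    if META_KEY in store:
        combined = json.loads(store[META_KEY])
        if key in combined:
            return combined[key]
    return json.loads(store[key])


# Hypothetical store contents; the metadata is abbreviated, not spec-complete.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "pcp/.zarray": b'{"shape": [10, 4], "chunks": [5, 4], "dtype": "<f4"}',
    "pcp/.zattrs": b'{"units": "mm"}',
    "pcp/0.0": b"\x00" * 80,  # a data chunk, which is left out of the metadata
}
combined = consolidate(store)
print(sorted(combined))
print(read_meta(store, "pcp/.zattrs"))
```

With a remote bucket behind the mapping, a reader following this pattern would fetch one object for all metadata instead of one per `.z*` key, which is the point of the approach when the hierarchy is large.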

@martindurant
Contributor

(but yes, it could have been a PR to xarray, that's part of the conversation in #309)

@martindurant
Contributor

> I may remove the .zarr suffix though on the bucket. Thoughts?

I mildly side with keeping it, for people that browse the file hierarchy in the bucket.

@mrocklin
Member

mrocklin commented Jul 27, 2018 via email

@martindurant
Contributor

We have zarr-developers/zarr-python#268, which is also getting old.

@alimanfoo

alimanfoo commented Jul 27, 2018 via email

@jhamman
Member Author

jhamman commented Jul 31, 2018

I rechunked the dataset. We're down to 16k chunks and up to ~80 MB/chunk.

@jhamman jhamman merged commit 453f1d2 into pangeo-data:master Jul 31, 2018
@jhamman jhamman deleted the gmet_intake branch July 31, 2018 17:17