Update the content of chunking_introduction.ipynb #48

Closed
1 of 3 tasks
guillaumeeb opened this issue Aug 13, 2022 · 5 comments · Fixed by #75
@guillaumeeb
Member

guillaumeeb commented Aug 13, 2022

The goal of this issue is to discuss what we should put in the chunking introduction notebook.

This comes from comments and discussions in #45.

I believe we should do the following, but this needs discussion (especially with @tinaok):

  • Either remove the word "compression" or cover it somewhere; a short chapter should be enough. Possible topics:
    • Generalities: why compression matters, why it saves processing time despite the extra cost of compressing and decompressing (loading big files into memory is slow), which algorithms are currently used in scientific domains, Python blosc, etc.
    • A section showing the difference in size between a NetCDF or Zarr file with and without compression.
    • Defaults of Xarray and of the various data formats (is NetCDF compression activated by default?).
    • Compression is often applied per chunk for optimized access.
  • Add more context about Xarray and chunking before introducing kerchunk. Begin with open_dataset and the chunks attribute, introduce Dask arrays, and talk about laziness and sequential processing. Also cover native chunks in files here, if possible without kerchunk (maybe with h5py?).
  • Kerchunk part: introduce it by describing the difficulty of reading a dataset made of several files in an optimal way. Then explain how it reads file metadata, detects chunks, and builds Zarr-compatible metadata files that allow optimized access to the whole dataset with Xarray. Should we keep the creation of the main metadata file in the notebook? I'm not sure.
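
To make the last compression point concrete (compression applied per chunk), here is a minimal sketch using only Python's standard `zlib` as a stand-in for the codecs that NetCDF4 or Zarr would use. The data, the chunk length, and the helper names `encode` and `read_value` are assumptions made up for illustration; they are not part of the notebook.

```python
import struct
import zlib

# Hypothetical 1-D "variable": 100_000 float64 values with a repeating pattern
# (repetitive data compresses well; random noise would not).
values = [float(i % 100) for i in range(100_000)]
CHUNK = 10_000  # assumed chunk length, in elements


def encode(vals):
    """Pack floats as little-endian float64 bytes."""
    return struct.pack(f"<{len(vals)}d", *vals)


raw = encode(values)

# Compress each chunk independently, as chunked formats typically do.
chunks = [
    zlib.compress(encode(values[i:i + CHUNK]))
    for i in range(0, len(values), CHUNK)
]


def read_value(index):
    """Read one element by decompressing only the chunk that holds it."""
    chunk_id, offset = divmod(index, CHUNK)
    data = zlib.decompress(chunks[chunk_id])
    return struct.unpack_from("<d", data, offset * 8)[0]


raw_size = len(raw)
compressed_size = sum(len(c) for c in chunks)
print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")
print(read_value(12_345))  # 45.0
```

Because each chunk is compressed on its own, reading one value only costs the decompression of one chunk rather than of the whole array, which is the trade-off the notebook could illustrate with real NetCDF or Zarr files.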
@tinaok
Collaborator

tinaok commented Aug 16, 2022

The compression part will be moved to #53, and I'll update the title of chunking_introduction.ipynb.

@annefou
Collaborator

annefou commented Aug 16, 2022

A picture to show chunks will be added by @tinaok

@guillaumeeb
Member Author

About compression: as I said in my email, I'll add a few things if I find the time, but it is not a priority.

@guillaumeeb
Member Author

About native chunks vs Xarray chunks=auto:

  • Check that the Xarray method does not use native file chunks, but mention that it still optimises data loading to some extent.
  • See if we can obtain native chunks with something simpler than kerchunk.
  • Use these native chunks in Xarray to speed up access on a single file.
  • This leads into kerchunk, especially in a multi-file context.
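
One way to show why aligning with native chunks matters is a toy cost model that counts how many on-disk chunks must be read to load every requested chunk once. This is only a sketch under simplifying assumptions (one dimension, no caching, made-up sizes, hypothetical helper names); it does not describe how Xarray or Dask actually schedule reads.

```python
def native_chunks_touched(start, stop, native):
    """Number of native (on-disk) chunks overlapping the half-open range [start, stop)."""
    first = start // native
    last = (stop - 1) // native
    return last - first + 1


def total_chunk_reads(length, requested, native):
    """Native-chunk reads needed to load every requested chunk once, with no caching."""
    return sum(
        native_chunks_touched(s, min(s + requested, length), native)
        for s in range(0, length, requested)
    )


LENGTH = 1_000  # assumed dimension length
NATIVE = 100    # assumed native chunk length in the file

# Requested chunks that are a multiple of the native size read each native chunk once.
aligned = total_chunk_reads(LENGTH, requested=200, native=NATIVE)
# Requested chunks that straddle native boundaries read some native chunks twice.
misaligned = total_chunk_reads(LENGTH, requested=150, native=NATIVE)
print(aligned, misaligned)  # 10 13
```

With compressed files, every extra read of a native chunk also means decompressing that whole chunk again, so misalignment costs more than the raw counts suggest.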

@tinaok
Collaborator

tinaok commented Aug 16, 2022

@annefou, could you please merge #56? @guillaumeeb, can you join the discussion on Gitter?

3 participants