Update the content of chunking_introduction.ipynb #48

guillaumeeb · 2022-08-13T16:34:38Z

The goal of this issue is to discuss what we should put in the chunking introduction notebook.

This comes from comments and discussions in #45.

I believe we should do the following, but this needs discussion (especially with @tinaok):

Either remove the word compression or talk about it somewhere; a short chapter should be fine. Some subjects to talk about:
- Generalities: why compression matters, how it saves processing time despite the extra cost of compression (loading big files into memory is slow), the different algorithms currently used in scientific domains, Python blosc, etc.
- A section showing the difference between a NetCDF or Zarr file with and without compression, and the resulting size.
- Defaults of Xarray and of the various data formats (is NetCDF compression activated by default?).
- Compression is often applied on chunks for optimized access.
Add more context about Xarray and chunking before introducing kerchunk. Begin with open_dataset and the chunks attribute, introduce Dask arrays, and talk about laziness and sequential processing. Also talk about native chunks in files here, if possible without kerchunk (with h5py maybe?).
Kerchunk part: introduce it by talking about the difficulties of reading a dataset composed of several files in an optimal way. Then explain how it reads file metadata, detects chunks, and constructs Zarr-compatible metadata files that allow optimized access to the whole dataset with Xarray. Should we keep the main metadata file creation in the notebook? I'm not sure about that.
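
To make the "compression is applied on chunks" point concrete, here is a minimal sketch using only the Python standard library. It is not how NetCDF or Zarr implement it. The data, the chunk size and the use of zlib are all assumptions chosen for illustration (blosc or another codec would follow the same pattern).

```python
import struct
import zlib

# Hypothetical toy data: 100_000 float32 values with a repeating pattern,
# packed into raw bytes the way an uncompressed array would be stored.
values = [20.0 + 0.001 * (i % 500) for i in range(100_000)]
raw = struct.pack(f"{len(values)}f", *values)

CHUNK_VALUES = 10_000            # values per chunk (arbitrary choice)
CHUNK_BYTES = CHUNK_VALUES * 4   # a float32 takes 4 bytes

# Compress every chunk independently of the others.
compressed_chunks = [
    zlib.compress(raw[start:start + CHUNK_BYTES])
    for start in range(0, len(raw), CHUNK_BYTES)
]

raw_size = len(raw)
compressed_size = sum(len(chunk) for chunk in compressed_chunks)
print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")

# Reading a slice only requires decompressing the chunk that holds it,
# here chunk number 3 (values 30_000 to 39_999).
chunk3 = struct.unpack(f"{CHUNK_VALUES}f", zlib.decompress(compressed_chunks[3]))
print(chunk3[:3])
```

Because each chunk is compressed on its own, a reader that needs a small region pays the decompression cost for that chunk only, not for the whole array.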

tinaok · 2022-08-16T09:26:22Z

The compression part will be moved to #53, and I'll update the title of chunking_introduction.ipynb.

annefou · 2022-08-16T09:53:39Z

A picture to show chunks will be added by @tinaok

guillaumeeb · 2022-08-16T13:06:09Z

About compression: as I said in my email, I'll add a few things if I find the time, but it's not a priority.

guillaumeeb · 2022-08-16T13:13:36Z

About native chunks vs Xarray chunks=auto:

- Check that the Xarray method does not use native file chunks, but say that it still brings some optimisation of data loading.
- See if we can obtain native chunks with something simpler than kerchunk.
- Use these native chunks in Xarray to speed up access to a single file.
- This introduces Kerchunk, especially in a multiple-file context.
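
As a possible starting point for the "something simpler than kerchunk" item, here is a minimal sketch that reads the native chunk shape of one variable with h5py. The file, the variable name `temperature` and the chunk shape are hypothetical and created on the fly so the example is self-contained; a real NetCDF4 file from the notebook would be opened the same way, assuming it is stored as HDF5.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small, hypothetical HDF5 file with an explicit chunk layout.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset(
        "temperature",
        data=np.zeros((100, 200), dtype="f4"),
        chunks=(10, 50),        # native chunk shape stored in the file
        compression="gzip",
    )

# Read back the native chunking without loading the data itself.
with h5py.File(path, "r") as f:
    ds = f["temperature"]
    native_chunks = ds.chunks
    compression = ds.compression

print(native_chunks, compression)
```

The resulting tuple could then be mapped onto the variable's dimension names and passed as the `chunks` argument of `xarray.open_dataset`, so that the Dask chunks line up with the chunks stored on disk. Whether this is actually faster than `chunks="auto"` on our files is what remains to be checked.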

tinaok · 2022-08-16T14:31:27Z

@annefou could you please merge #56? @guillaumeeb can you join us for the discussion on Gitter?

annefou assigned tinaok Aug 16, 2022

guillaumeeb mentioned this issue Aug 18, 2022

Chunking_introduction notebook improvements #75

Merged

annefou closed this as completed in #75 Aug 18, 2022