
Review Notebooks on Data access, chunking and Dask #45

Merged
merged 11 commits into from
Aug 13, 2022

Conversation

guillaumeeb
Member

Creating a draft pull request after the review of the three notebooks on Data and Dask.

Some high-level first impressions here; after that I'll try to comment more precisely using GitHub review (or ReviewNB).

  • data_discovery is nice. It lacks a bit of computation or exercises, but it's short, so that's probably OK.
  • chunking_introduction is hard. I tried to propose some content modifications, but the part on kerchunk is still complicated. Moreover, I'm not sure Kerchunk is really needed to explain chunks; it's just a good way to consolidate a dataset and make it easily loadable.
  • dask_introduction is fine, but it currently lacks real processing (which is perfectly normal, as it is probably not finished); I only saw one computation over Lombardia. And as there are currently problems with the dask-gateway setup (see Dask gateway configuration problems on pangeo-xxlarge platform pangeo-eosc#3), it is still hard to follow.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@@ -49,13 +49,15 @@
"source": [
Member Author
I'm under the impression that we don't really talk about compression in this notebook currently, do we?



Collaborator

No, I've never touched the topic, but it's definitely something that would be nice to add.

Member Author

We could add some generalities (why compression matters, how it saves processing time despite the extra cost of compressing, the different algorithms currently used in scientific domains, Python Blosc, etc.), and maybe a section showing the difference in size between a NetCDF or Zarr file with and without compression?
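As a hedged, stdlib-only illustration of why compression matters (not tied to the notebook's actual data), predictable data compresses dramatically, and decompression is lossless:

```python
import zlib

# Fake "sensor" data: a repetitive byte pattern, similar in spirit to
# smooth geophysical fields, which compress very well.
raw = bytes(range(256)) * 4096          # 1 MiB of predictable data

compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ~{ratio:.0f}x smaller")

# Lossless: we get the exact bytes back.
assert zlib.decompress(compressed) == raw
```

A real section would of course use the course's NetCDF/Zarr files and a codec like Blosc instead of plain zlib, but the size comparison idea is the same.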

@@ -49,13 +49,15 @@
"source": [
Member Author

I added Dask because it is clearly used by Xarray under the hood. I think we should remove some other unused ones, like pooch?



@@ -49,13 +49,15 @@
"source": [
Member Author

I tried to make things clearer here, but I'm not sure why we are using open_mfdataset right now?

In any case, I think we need to talk about Dask in this notebook. Xarray + chunking almost always means Dask.



@@ -49,13 +49,15 @@
"source": [
Member Author

Okay, I'm not sure about the last part. I don't know how Xarray chooses chunks when you let it do so automatically; these are not the same chunks as with kerchunk. I think Xarray just tries to pick chunks that are big enough.

In any case, analysis and file reading will still benefit from this chunking.
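For what it's worth, the "big enough" trade-off can be sketched without any library: for the same array, bigger chunks mean far fewer chunks (and scheduler tasks) to manage. The shapes and the `chunk_grid` helper below are made up for illustration:

```python
import math

def chunk_grid(shape, chunk_shape):
    """Hypothetical helper: chunks per axis and total chunk count."""
    per_axis = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
    return per_axis, math.prod(per_axis)

shape = (7000, 3600, 7200)              # e.g. (time, lat, lon), invented sizes

# Tiny chunks: a huge number of tasks to schedule.
_, small = chunk_grid(shape, (1, 900, 900))      # 7000 * 4 * 8  = 224000 chunks
# Bigger chunks: far fewer tasks for the same data.
_, big = chunk_grid(shape, (100, 1800, 1800))    # 70 * 2 * 4    = 560 chunks

print(small, big)
```

That per-task overhead is presumably part of why Xarray/Dask lean towards larger automatic chunks.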



@@ -49,13 +49,15 @@
"source": [
Member Author

Okay, what puzzles me most in this notebook is the part where we directly introduce complex Kerchunk manipulation to access only one file.

I don't know how to do it otherwise, but I don't think we need kerchunk to get the NetCDF chunk layout, and I'm quite certain that Xarray knows how to optimize this chunk access (though not completely, I have to admit).

My current belief is that kerchunk starts to be useful when we really want to open several files without having to read the metadata in each of those files (or do it only once).

We should at least explain all that a little more, what do you think?

And again, we should probably talk more about chunking, laziness, and partial or sequential processing before introducing kerchunk. I'm not even sure we should show how to create a consolidated dataset with Kerchunk; maybe we should just provide it and point to the Kerchunk docs or a misc notebook?
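To illustrate what I mean by laziness and sequential processing (a plain-Python sketch, not the notebook's actual code): a generator yields chunks on demand, so only one chunk is in memory at a time, much like Dask iterating over NetCDF chunks:

```python
def read_in_chunks(values, chunk_size):
    """Lazily yield fixed-size chunks; nothing is read until iterated."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

data = list(range(1_000))               # stand-in for a large on-disk array

# Streaming mean: accumulate partial results chunk by chunk.
total = count = 0
for chunk in read_in_chunks(data, 100):
    total += sum(chunk)
    count += len(chunk)

mean = total / count
print(mean)                             # 499.5
```

Something this small could motivate chunked, out-of-core processing before any kerchunk machinery appears.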



@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
Member Author

@guillaumeeb guillaumeeb Aug 11, 2022

Maybe we could do some more interesting computation here to show the graph and some tasks executing on the Dashboard?
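For example (a sketch assuming dask is installed on the platform; the array here is a toy, not the course data), even a simple reduction builds a graph whose tasks show up on the dashboard:

```python
import dask.array as da

# A 2-D array split into four 50x50 chunks; sum() builds a small task
# graph whose execution is visible on the Dask dashboard.
x = da.ones((100, 100), chunks=(50, 50))
result = x.sum()                  # lazy: nothing computed yet
value = float(result.compute())   # triggers the tasks
print(value)                      # 10000.0
```

A heavier computation (e.g. a rolling mean over time) would make the dashboard much more interesting to watch, of course.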


@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
Member Author

Didn't we say we wanted to go global?



Collaborator

One step at a time... I would limit the analysis to a small region and, if we have time, we can ask students to scale it up.

It is also really useful for teaching a good prototyping approach, with a second stage where the model is scaled to a larger area.

Member Author

So it's still a work in progress; do you plan to add a section at a bigger scale?

You're right, it's really important to develop an algorithm and test it at small scale before going bigger.

I have no idea how much time the processing takes over Lombardia; could we at least begin with the whole of Italy?

@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

General question: do we really need the Setup section in every notebook?

About the packages used, should we only talk about those we import, or to what level of detail should we go?

Also, we don't install any packages in this notebook.



@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

I added this to show we could get things quickly, but I feel it still lacks a little example computation on the whole dataset.

And maybe a small conclusion on the fact that Xarray allows concatenating multiple files into a single dataset, but we have to do some operations manually?



Collaborator

The whole dataset will be used once the VCI index is computed. Unfortunately, these data are already an aggregation, and there is not much left to play with.

Don't forget that there is a general storyline and some components have already been covered.

In any case I agree with you: better twice than nothing.

Regarding the concatenation of multiple files, it isn't too clear to me what you mean by 'we have to manually do some operations'.

Member Author

I'm not sure why I put this command there, but anyway, regarding the concatenation, I was referring to the preprocess function and the opening of all the S3 objects.

This is not really complex, but we have to do a few things in order to present this as a single dataset. This is an interesting introduction to the next notebook and the ARCO section.

@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

Not sure if this heading should be kept?



@guillaumeeb
Member Author

Okay, I tried to point out some questions I had about these notebooks.

This is already great work and I know how hard it is to prepare such material, so please don't take my comments badly. I'm only trying to improve this material and make it as easy as possible for trainees to follow the content.

@guillaumeeb guillaumeeb marked this pull request as ready for review August 11, 2022 18:03
@acocac
Member

acocac commented Aug 11, 2022

@guillaumeeb all your comments are great. Thanks for your constructive feedback!

@acocac
Member

acocac commented Aug 11, 2022

@guillaumeeb the data and discovery notebook is still in progress; I hope to add a minimal example of accessing a STAC catalog of Sentinel-2, and maybe another example of an available pangeo-forge STAC dataset. I'll merge the coming changes to the notebook from the main branch into this PR to get your feedback. Thanks

@pl-marasco
Collaborator

@guillaumeeb please don't even mention it; without good criticism and different eyes there would be no growth.

All my material has been prepared from scratch and on the fly, so there is a lot that is missing, and it is really good to have different visions, opinions and hints.

@annefou
Collaborator

annefou commented Aug 12, 2022

There are some conflicts to fix, but apart from that it is OK with me. So once the conflicts are resolved we can merge.

@guillaumeeb
Member Author

@annefou I resolved the conflict.

I think it's a good idea to merge before things diverge again.

The thing is, I also wanted to discuss some points with all of you (and @tinaok hasn't made any comments yet).

So what I propose is:

  • We merge this if everyone agrees on the modifications
  • I open issues on the most important points to discuss and resolve my comments:
    • Add more explanation on chunking before introducing Kerchunk and multiple-file reading. See if we keep the kerchunk part as is. Talk a bit more about Dask arrays.
    • Add some content on compression.
    • Do we keep the Setup section in all notebooks? It's hard to maintain.
    • Talk more about Dask in dask_introduction, including the Dask cluster and dashboard concepts.
    • Remove the part about installing libraries on Dask workers (see Dask gateway configuration problems on pangeo-xxlarge platform pangeo-eosc#3)
    • Scale to the whole world in dask_introduction.
  • We can continue to discuss little things here if needed.

Thoughts?

@annefou
Collaborator

annefou commented Aug 13, 2022

It is easier to merge and then open new issues and PRs.
