
Review Notebooks on Data access, chunking and Dask #45

Merged
merged 11 commits into from
Aug 13, 2022

Conversation

guillaumeeb
Member

Creating a draft pull request after the review of the three notebooks on Data and Dask.

Some high-level first impressions here; after that I'll try to comment more precisely using GitHub review (or ReviewNB).

  • data_discovery is nice. It lacks a bit of computation or exercises, but it's short, so that's probably OK.
  • chunking_introduction is hard. I tried to propose some content modifications, but the part on kerchunk is still complicated. Moreover, I'm not sure Kerchunk is really needed to explain chunks; it's just a good way to consolidate a dataset and make it easily loadable.
  • dask_introduction is fine, but it currently lacks real processing (which is perfectly normal, as it is probably not finished); I only saw one computation over Lombardia. And as there are currently problems with the dask-gateway setup (see Dask gateway configuration problems on pangeo-xxlarge platform pangeo-eosc#3), it is still hard to follow.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@@ -49,13 +49,15 @@
"source": [
Member Author
I'm under the impression that we don't really talk about compression in this notebook currently, do we?



Collaborator

No, I've never touched the topic, but it's definitely something that would be nice to add.

Member Author

We could add some generalities (why compression matters, how it saves processing time despite the extra cost of compressing, the different algorithms currently used in scientific domains, Python Blosc, etc.), and maybe a section showing the difference in size between a NetCDF or Zarr file with and without compression?
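As a hedged, stdlib-only illustration of why compression matters (not tied to the notebook's actual data), predictable data compresses dramatically, and decompression is lossless:

```python
import zlib

# Fake "sensor" data: a repetitive byte pattern, similar in spirit to
# smooth geophysical fields, which compress very well.
raw = bytes(range(256)) * 4096          # 1 MiB of predictable data

compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ~{ratio:.0f}x smaller")

# Lossless: we get the exact bytes back.
assert zlib.decompress(compressed) == raw
```

A real section would of course use the course's NetCDF/Zarr files and a codec like Blosc instead of plain zlib, but the size comparison idea is the same.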

@@ -49,13 +49,15 @@
"source": [
Member Author

I added Dask because it is clearly used by Xarray under the hood. I think we should remove some other unused ones, like pooch?



@@ -49,13 +49,15 @@
"source": [
Member Author

I tried to make things clearer here, but I'm not sure why we are using open_mfdataset right now?

In any case, I think we need to talk about Dask in this notebook. Xarray + chunking almost always means Dask.



@@ -49,13 +49,15 @@
"source": [
Member Author

Okay, I'm not sure about the last part. I don't know how Xarray chooses chunks when you let it do so automatically; these are not the same chunks as with kerchunk. I think Xarray just tries to pick chunks that are big enough.

In any case, analysis and file reading will still benefit from this chunking.
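For what it's worth, the "big enough" trade-off can be sketched without any library: for the same array, bigger chunks mean far fewer chunks (and scheduler tasks) to manage. The shapes and the `chunk_grid` helper below are made up for illustration:

```python
import math

def chunk_grid(shape, chunk_shape):
    """Hypothetical helper: chunks per axis and total chunk count."""
    per_axis = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
    return per_axis, math.prod(per_axis)

shape = (7000, 3600, 7200)              # e.g. (time, lat, lon), invented sizes

# Tiny chunks: a huge number of tasks to schedule.
_, small = chunk_grid(shape, (1, 900, 900))      # 7000 * 4 * 8  = 224000 chunks
# Bigger chunks: far fewer tasks for the same data.
_, big = chunk_grid(shape, (100, 1800, 1800))    # 70 * 2 * 4    = 560 chunks

print(small, big)
```

That per-task overhead is presumably part of why Xarray/Dask lean towards larger automatic chunks.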



@@ -49,13 +49,15 @@
"source": [
Member Author

Okay, what puzzles me most in this notebook is the part where we directly introduce complex Kerchunk manipulation to access only one file.

I don't know how to do it otherwise, but I don't think we need kerchunk to get the NetCDF chunk layout, and I'm quite certain that Xarray knows how to optimize this chunk access (though not completely, I have to admit).

My current belief is that kerchunk starts to be useful when we really want to open several files without having to read the metadata in each of those files (or do it only once).

We should at least explain all that a little more, what do you think?

And again, we should probably talk more about chunking, laziness, and partial or sequential processing before introducing kerchunk. I'm not even sure we should show how to create a consolidated dataset with Kerchunk; maybe we should just provide it and point to the Kerchunk docs or a misc notebook?
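To illustrate what I mean by laziness and sequential processing (a plain-Python sketch, not the notebook's actual code): a generator yields chunks on demand, so only one chunk is in memory at a time, much like Dask iterating over NetCDF chunks:

```python
def read_in_chunks(values, chunk_size):
    """Lazily yield fixed-size chunks; nothing is read until iterated."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

data = list(range(1_000))               # stand-in for a large on-disk array

# Streaming mean: accumulate partial results chunk by chunk.
total = count = 0
for chunk in read_in_chunks(data, 100):
    total += sum(chunk)
    count += len(chunk)

mean = total / count
print(mean)                             # 499.5
```

Something this small could motivate chunked, out-of-core processing before any kerchunk machinery appears.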



@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
Member Author

@guillaumeeb guillaumeeb Aug 11, 2022

Maybe we could do some more interesting computation here to show the graph and some tasks executing on the Dashboard?
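For example (a sketch assuming dask is installed on the platform; the array here is a toy, not the course data), even a simple reduction builds a graph whose tasks show up on the dashboard:

```python
import dask.array as da

# A 2-D array split into four 50x50 chunks; sum() builds a small task
# graph whose execution is visible on the Dask dashboard.
x = da.ones((100, 100), chunks=(50, 50))
result = x.sum()                  # lazy: nothing computed yet
value = float(result.compute())   # triggers the tasks
print(value)                      # 10000.0
```

A heavier computation (e.g. a rolling mean over time) would make the dashboard much more interesting to watch, of course.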


@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
Member Author

Didn't we say we wanted to go global?



Collaborator

One step at a time... I would limit the analysis to a small region and, if we have time, we can ask students to scale it up.

It is also really useful for teaching a good prototyping approach, with a second stage where the model is scaled to a larger area.

Member Author

So it's still a work in progress; do you plan to add a section at a bigger scale?

You're right, it's really important to develop an algorithm and test it at small scale before going bigger.

I have no idea how much time the processing takes over Lombardia; could we at least begin with the whole of Italy?

@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

General question: do we really need the Setup section in every notebook?

About the packages used, should we only talk about those we import, or to what level of detail should we go?

Also, we don't install any packages in this notebook.



@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

I added this to show we could get things quickly, but I feel it still lacks a little example computation on the whole dataset.

And maybe a small conclusion on the fact that Xarray allows concatenating multiple files into a single dataset, but we have to do some operations manually?



Collaborator

The whole dataset will be used once the VCI index is computed. Unfortunately, these data are already an aggregation, and there is not much left to play with.

Don't forget that there is a general storyline and some components have already been covered.

In any case I agree with you: better twice than nothing.

Regarding the concatenation of multiple files, it isn't too clear to me what you mean by 'we have to manually do some operations'.

Member Author

I'm not sure why I put this command there, but anyway, regarding the concatenation, I was referring to the preprocess function and the opening of all the S3 objects.

This is not really complex, but we have to do a few things in order to present this as a single dataset. This is an interesting introduction to the next notebook and the ARCO section.

@@ -37,19 +37,19 @@
" <ul>\n",
Member Author

Not sure if this heading should be kept?



@guillaumeeb
Member Author

Okay, I tried to point out some questions I had about these notebooks.

This is already great work and I know how hard it is to prepare such material, so please don't take my comments badly. I'm only trying to improve this material and make it as easy as possible for trainees to follow the content.

@guillaumeeb guillaumeeb marked this pull request as ready for review August 11, 2022 18:03
@acocac
Member

acocac commented Aug 11, 2022

@guillaumeeb all your comments are great. Thanks for your constructive feedback!

@acocac
Member

acocac commented Aug 11, 2022

@guillaumeeb the data and discovery notebook is still in progress; I hope to add a minimal example of accessing a STAC catalog of Sentinel-2, and maybe another example of an available pangeo-forge STAC dataset. I'll merge the coming changes to the notebook from the main branch into this PR to get your feedback. Thanks

@pl-marasco
Collaborator

@guillaumeeb please don't even mention it; without good criticism and different eyes there would be no growth.

All my material has been prepared from scratch and on the fly, so there is a lot that is missing, and it is really good to have different visions, opinions and hints.

@annefou
Collaborator

annefou commented Aug 12, 2022

There are some conflicts to fix, but apart from that it is OK with me. So once the conflicts are resolved we can merge.

@guillaumeeb
Member Author

@annefou I resolved the conflict.

I think it's a good idea to merge before things diverge again.

The thing is, I also wanted to discuss some points with all of you (and @tinaok hasn't made any comments yet).

So what I propose is:

  • We merge this if everyone agrees on the modifications
  • I open issues on the most important points to discuss and resolve my comments:
    • Add more explanation on chunking before introducing Kerchunk and multiple-file reading. See if we keep the kerchunk part as is. Talk a bit more about Dask arrays.
    • Add some content on compression.
    • Do we keep the Setup section in all notebooks? It's hard to maintain.
    • Talk more about Dask in dask_introduction, including the Dask cluster and dashboard concepts.
    • Remove the part about installing libraries on Dask workers (see Dask gateway configuration problems on pangeo-xxlarge platform pangeo-eosc#3)
    • Scale to the whole world in dask_introduction.
  • We can continue to discuss little things here if needed.

Thoughts?

@annefou
Collaborator

annefou commented Aug 13, 2022

It is easier to merge and then open new issues and PRs.
