Review Notebooks on Data access, chunking and Dask #45
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. Powered by ReviewNB
@@ -49,13 +49,15 @@
"source": [
I'm under the impression that currently, we don't really talk about compression in this notebook, do we?
No, I've never touched the topic, but it's definitely something that would be nice to add.
We could add some generalities (why compression matters, how it saves processing time overall despite the cost of compressing, the different algorithms currently used in scientific domains, Python blosc, etc.), and maybe a section showing the difference between a NetCDF or Zarr file with and without compression, and the resulting size?
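Something along these lines could be a starting point for that section. This is only a minimal sketch: the variable name, store paths and synthetic data are all made up. It writes the same array to Zarr twice, once uncompressed and once with a Blosc (zstd) compressor, and compares the on-disk sizes:

```python
import os
import numpy as np
import xarray as xr
from numcodecs import Blosc

# Smooth synthetic field: real geoscience data compresses well,
# whereas pure random noise would not.
data = np.tile(np.linspace(0.0, 1.0, 256), (100, 256, 1))
ds = xr.Dataset({"tas": (("time", "y", "x"), data)})

# Same data, written with and without compression
ds.to_zarr("no_compression.zarr", mode="w",
           encoding={"tas": {"compressor": None}})
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)
ds.to_zarr("with_compression.zarr", mode="w",
           encoding={"tas": {"compressor": compressor}})

def store_size(path):
    """Total size in bytes of all files under a Zarr store directory."""
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _, files in os.walk(path) for f in files)

print(store_size("no_compression.zarr"), "vs",
      store_size("with_compression.zarr"))
```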
@@ -49,13 +49,15 @@
"source": [
I added Dask because it is clearly used by Xarray under the hood. I think we should remove some other unused ones, like pooch?
@@ -49,13 +49,15 @@
"source": [
I tried to make things clearer here, but I'm not sure why we are using open_mfdataset right now.
In any case, I think we need to talk about Dask in this notebook: Xarray + chunking almost always means Dask.
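To illustrate what I mean, a minimal sketch (the file pattern, variable name and chunk size are made up): passing any chunks= argument to open_mfdataset makes the variables Dask arrays, so everything stays lazy until compute() is called.

```python
import xarray as xr

# Open several files as one lazy, Dask-backed dataset
ds = xr.open_mfdataset("data/*.nc", combine="by_coords",
                       chunks={"time": 100})
print(ds["tas"].chunks)        # Dask chunk layout (hypothetical variable)
mean = ds["tas"].mean("time")  # lazy: only builds a Dask task graph
result = mean.compute()        # triggers the actual computation
```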
@@ -49,13 +49,15 @@
"source": [
Okay, I'm not sure about the last part. I don't know how Xarray chooses chunks when you let it do it automatically; these are not the same chunks as with kerchunk. I think Xarray tries to pick chunks that are big enough.
Anyway, analysis and file reading will still benefit from this chunking.
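A hedged sketch of how to compare the two layouts (file and variable names are made up; as I understand it, ~128 MiB is Dask's default target chunk size, which chunks="auto" aims for):

```python
import xarray as xr

# chunks={} asks for the backend's preferred (on-disk) chunk layout,
# while chunks="auto" lets Dask pick chunk sizes near its default target.
ds_disk = xr.open_dataset("file.nc", chunks={})
ds_auto = xr.open_dataset("file.nc", chunks="auto")

print(ds_disk["tas"].encoding.get("chunksizes"))  # chunks stored in the file
print(ds_disk["tas"].chunks)                      # Dask chunks follow the file
print(ds_auto["tas"].chunks)                      # Dask-chosen chunks
```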
@@ -49,13 +49,15 @@
"source": [
Okay, so what's puzzling me most in this notebook is the part where we directly introduce complex Kerchunk manipulation to access only one file.
I don't know how to do it, but I don't think we need kerchunk to get the NetCDF chunk layout, and I'm quite certain that Xarray knows how to optimize this chunk access (though not completely, I have to admit).
My current belief is that kerchunk starts to be useful when we really want to open several files without having to read all the metadata in each of those files (or do it only once).
We should at least explain all that a little more, what do you think?
And again, we should probably talk more about chunking, laziness, and partial or sequential processing before introducing kerchunk. I'm not even sure we should show how to create a consolidated dataset with Kerchunk; maybe we should just provide it and point to the Kerchunk doc or a misc notebook?
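On the "we don't need kerchunk for the chunk layout" point: since NetCDF-4 files are HDF5 underneath, the layout can apparently be read directly with h5py. A sketch (file and variable names are made up):

```python
import h5py

# Inspect the on-disk chunk layout of a NetCDF-4 (HDF5) file directly,
# without going through kerchunk.
with h5py.File("file.nc", "r") as f:
    dset = f["tas"]
    print(dset.shape, dset.chunks, dset.compression)
```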
@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
Maybe we could do some more interesting computation here to show the graph and some tasks executing on the Dashboard?
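Something like this could do (a sketch; the file pattern and variable name are made up). An anomaly followed by a reduction builds a graph with enough tasks to watch them execute on the dashboard:

```python
from dask.distributed import Client
import xarray as xr

client = Client()  # dashboard URL is available via client.dashboard_link
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# Anomaly then reduction: broadcasting produces a richer task graph
anomaly = ds["tas"] - ds["tas"].mean("time")
print(anomaly.std("time").compute())
```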
@@ -5,7 +5,7 @@
"id": "1bfbae7a-12f1-4787-a520-c3de7529168d",
One step at a time... I would limit the analysis to a small region and, if we have time, we can ask students to scale it up.
It is also really useful for teaching a good prototyping approach, with a second stage where the model is scaled to a larger area.
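To make the idea concrete, a toy sketch. Everything here is made up: the dataset is synthetic, the coordinates are rough bounding boxes, and compute_vci is a hypothetical stand-in for the real VCI computation.

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for the real NDVI data
ds = xr.Dataset(
    {"ndvi": (("time", "latitude", "longitude"),
              np.random.rand(12, 120, 140))},
    coords={"latitude": np.linspace(48.0, 35.0, 120),   # descending
            "longitude": np.linspace(5.0, 20.0, 140)},
)

def compute_vci(da):
    # Hypothetical stand-in for the notebook's VCI computation
    return (da - da.min("time")) / (da.max("time") - da.min("time")) * 100

# Prototype on a small region first (rough Lombardia bounding box)...
lombardia = ds["ndvi"].sel(latitude=slice(46.6, 44.7),
                           longitude=slice(8.5, 11.4))
vci_small = compute_vci(lombardia)

# ...then run the same code on a larger area (roughly the whole of Italy)
italy = ds["ndvi"].sel(latitude=slice(47.0, 36.0),
                       longitude=slice(6.0, 19.0))
vci_italy = compute_vci(italy)
```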
So it's still a work in progress; do you plan to add a section at a bigger scale?
You're right, it's really important to develop an algorithm and test it at small scale before going bigger later.
I have no idea how much time the processing takes over Lombardia; could we begin at least with the whole of Italy?
@@ -37,19 +37,19 @@
" <ul>\n",
General question: do we really need the Setup section in every notebook?
About the packages used, should we only talk about those we import, or to what level of detail should we go?
Also, we don't install any package in this notebook.
@@ -37,19 +37,19 @@
" <ul>\n",
I added this to show we could get things quickly, but I feel it still lacks a small example computation on the whole dataset.
And maybe a short conclusion on the fact that Xarray allows concatenating multiple files to form a single dataset, but we have to do some operations manually?
The whole dataset will be used once the VCI index is computed. Unfortunately, this data is already an aggregation, and there is not much left to play with.
Don't forget that there is a general storyline and some components have already been covered.
In any case, I agree with you: better twice than nothing.
Regarding the concatenation from multiple files, it isn't too clear to me what you mean by 'we have to manually do some operations'?
I'm not sure why I put this command there, but anyway, concerning the concatenation, I was referring to the preprocess function and the opening of all the S3 objects.
This is not really complex, but we have to do a few things in order to present this as a full dataset. It's an interesting introduction to the next notebook and the ARCO section.
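For the record, the kind of "manual operations" I have in mind looks roughly like this (the bucket name, the dropped variable and the preprocess body are all made up):

```python
import fsspec
import xarray as xr

# List the S3 objects and open them as file-like objects
fs = fsspec.filesystem("s3", anon=True)
files = [fs.open(p) for p in fs.glob("s3://some-bucket/ndvi/*.nc")]

def preprocess(ds):
    # Per-file fix-ups so the pieces concatenate cleanly,
    # e.g. dropping a conflicting grid-mapping variable
    return ds.drop_vars("crs", errors="ignore")

# Present everything as one dataset
ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords",
                       preprocess=preprocess)
```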
@@ -37,19 +37,19 @@
" <ul>\n",
Okay, I tried to point out some questions I had about those notebooks. This is already great work and I know how hard it is to prepare such material, so please don't take my comments badly. I'm only trying to improve this material and make it as easy as possible for trainees to follow the content.
@guillaumeeb all your comments are great. Thanks for your constructive feedback!
@guillaumeeb the data and discovery notebook is still in progress; I hope to add a minimal example of accessing a STAC catalog of Sentinel-2, and maybe another example of an available pangeo-forge STAC dataset. I'll merge the coming changes in the notebook from the main branch to this PR to get your feedback. Thanks
@guillaumeeb please don't even mention it; without good criticism and different eyes there would be no growth. All my material has been prepared from scratch and on the fly, so there is a lot that is missing, and it is really good to have different visions, opinions and hints.
There are some conflicts to fix, but apart from that it looks OK to me. So once the conflicts are resolved, we can merge.
@annefou I resolved the conflict. I think it's a good idea to merge before things diverge again. The thing is, I also wanted to discuss some points with all of you (and @tinaok has not made any comment yet). So what I propose is that:
Thoughts?
It is easier to merge and then open new issues and PRs.
Creating a draft pull request after the review of the three notebooks on Data and Dask.
Some high-level first impressions are there, and after that I'll try to comment more precisely using GitHub review (or ReviewNB) later on.