Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ICESat-2 data read-in functionality #222

Merged
merged 79 commits into from Nov 23, 2021
Merged

Add ICESat-2 data read-in functionality #222

merged 79 commits into from Nov 23, 2021

Conversation

JessicaS11
Copy link
Member

@JessicaS11 JessicaS11 commented Sep 10, 2021

A long awaited and exciting addition, this PR introduces basic data read-in functionality to icepyx. The primary workhorse is the read.py module, which implements a Read class object.

Read is initialized with input an input file, directory, or s3 bucket url, your ICESat-2 data product, an intake-style filename pattern to match which files you'd like to read in, and an optional intake catalog if you have one you'd really like to use (and are only trying to read in one group). Similar to NSIDC variable subsetting, the Read object uses icepyx's ICESat-2 variables.py module to determine which variables are available and then create a list of those you'd like to actually read in (this is usually hard-coded into most readers). Then, it will iterate through all of the granules (files) you've provided and create an Xarray dataset with all of the variables you've requested. To do this, under the hood it creates an Intake catalog for each variable group and constructs a list of per-granule Xarray DataSets that contain ALL of the variables you've asked for. Then, it will merge them into a single DataSet. Future efforts will then extend default Xarray functionality to do ICESat-2 specific tasks like "get_strong_beams".

Please see the new ICESat-2_Data_Read-In_Example.ipynb, which explains the functionality in more detail as well as provides a quick start guide for using the new module.

I'm looking for any and all feedback on and contributions to the code, example, explanations, tests, etc. I've no doubt there will be lots of new bugs to fix as we move outside my super-small test case environment, and I'm eager to hear all of your thoughts. Please join the conversation - no detail is too small!

Review on ReviewNB: https://app.reviewnb.com/icesat2py/icepyx/pull/222/

Binder link: Binder

Instructions to test:

cd icepyx/  # change working directory to local clone of icepyx repo
git fetch --all  # fetch all branches from remote icepyx repo
git switch --track upstream/dataobj  # or use git checkout --track origin/dataobj
git switch dataobj  # switch to the dataobj branch
pip install -e.  # install required dependencies of this branch, should pull in h5netcdf5 and so on

pip install jupyterlab  # if you haven't already
jupyter lab --no-browser  # launch jupyter lab to view ICESat-2_Data_Read-In_Example.ipynb

@JessicaS11
Copy link
Member Author

Note: the current Travis build failure should be fixed by #213.

@JessicaS11 JessicaS11 linked an issue Sep 10, 2021 that may be closed by this pull request
@weiji14 weiji14 added enhancement New feature or request longer_contribution Issues that will take more than an afternoon to implement labels Sep 10, 2021
Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JessicaS11 for all the work! Overall, I think the icepyx.Read class and Read.load() method is a great start. Really wish this was implemented 2 years ago for my PhD research, but anyways... there's a lot of suggested changes below, mostly minor stylistic changes for read.py at the moment. I've also added a binder link to your original comment at the top and some instructions for others on how to test things.

To make it easier to review the examples/ICESat-2_Data_Read-in_Example.ipynb though, could you consider using jupytext to turn the .ipynb file to a .py file? It will be easier to comment and make suggested changes on plaintext instead of a JSON. Steps would be:

pip install jupytext
jupytext --set-formats ipynb,py:percent examples/ICESat-2_Data_Read-in_Example.ipynb
# there should now be an ICESat-2_Data_Read-in_Example.py file
git add examples/ICESat-2_Data_Read-in_Example.py
git commit -m "Add jupytext converted ICESat-2_Data_Read-in_Example.py file"
git push

My other main concern is the data type of some of the coordinates of the output xarray.Dataset which are of object or binary dtypes when they could be str/int or datetime64.

<xarray.Dataset>
Dimensions:              (delta_time: 55654, gran_idx: 3, spot: 2)
Coordinates:
  * delta_time           (delta_time) datetime64[ns] 2019-02-22T01:06:07.5054...
  * gran_idx             (gran_idx) object '084902' '090202' '091002'
  * spot                 (spot) int64 2 5
    source_file          (gran_idx) <U95 '/home/username/Documents/github/icepyx...
    gt                   (gran_idx, spot) object 'gt3r' 'gt1l' ... 'gt3r' 'gt1l'
Data variables:
    sc_orient            (gran_idx) int8 0 0 0
    cycle_number         (gran_idx) int8 2 2 2
    rgt                  (gran_idx) int16 849 902 910
    atlas_sdp_gps_epoch  (gran_idx) datetime64[ns] 2018-01-01T00:00:18 ... 20...
    data_start_utc       (gran_idx) |S27 b'2019-02-22T01:03:44.199777Z' ... b...
    data_end_utc         (gran_idx) |S27 b'2019-02-22T01:07:38.112327Z' ... b...
    h_li                 (spot, gran_idx, delta_time) float32 nan nan ... nan
    latitude             (spot, gran_idx, delta_time) float64 nan nan ... nan
    longitude            (spot, gran_idx, delta_time) float64 nan nan ... nan
Attributes:
    data_product:  ATL06
    Description:   The land_ice_height group contains the primary set of deri...
    data_rate:     Data within this group are sparse.  Data values are provid...

Specifically, the gran_idx coordinate should preferrably have a str/int dtype (instead of object), while the data_start_utc and data_end_utc should preferrably have a datetime64 dtype (instead of S27/binary). I'll need to step through the code a bit more to see where this can be fixed, but maybe you have some ideas.

requirements.txt Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
icepyx/core/read.py Show resolved Hide resolved
icepyx/core/read.py Show resolved Hide resolved
@JessicaS11
Copy link
Member Author

Thanks, as always, for the great feedback and edits @weiji14! They are so appreciated.

Making notebook review easier is on my long dev-to-do list, and was something (along with the dormant gallery work) I hoped to pick up again in the coming weeks. Thanks for adding the binder link in the meantime, and I'll keep moving forward with your suggestions for using jupytext.

Great point about fixing the type of the coordinates. There's a Read._build_dataset_template method that will probably be the best spot for the gran_idx. We'll have to see where in the read-in process works best to specify the type for the dates/times - I hadn't yet gotten to thinking about where/when in the read-in process we should think about handling date computations, but I know that we'll ultimately want to make it easy to calculate based on the epoch (would it be too costly to memory to just have this be one of the xarray extension functions?)...

@weiji14 weiji14 self-assigned this Sep 27, 2021
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

JessicaS11 and others added 10 commits October 26, 2021 15:55
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@codecov-commenter
Copy link

codecov-commenter commented Nov 15, 2021

Codecov Report

Merging #222 (b57d247) into development (5b8112d) will decrease coverage by 3.83%.
The diff coverage is 29.93%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #222      +/-   ##
===============================================
- Coverage        54.94%   51.11%   -3.84%     
===============================================
  Files               20       24       +4     
  Lines             1547     1839     +292     
  Branches           321      386      +65     
===============================================
+ Hits               850      940      +90     
- Misses             639      840     +201     
- Partials            58       59       +1     
Impacted Files Coverage Δ
icepyx/core/is2cat.py 8.57% <8.57%> (ø)
icepyx/core/is2ref.py 23.52% <9.52%> (-8.53%) ⬇️
icepyx/core/variables.py 10.10% <16.66%> (+1.72%) ⬆️
icepyx/core/read.py 30.35% <30.35%> (ø)
icepyx/tests/test_read.py 95.45% <95.45%> (ø)
icepyx/__init__.py 100.00% <100.00%> (ø)
icepyx/tests/test_variables.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b8112d...b57d247. Read the comment docs.

@JessicaS11
Copy link
Member Author

@icetianli @weiji14 What do you think? Can we merge?

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JessicaS11, I think we're just about ready 😄 The xarray.Dataset's coordinate data types look much better now:

<xarray.Dataset>
Dimensions:              (gran_idx: 1, spot: 2, delta_time: 25428)
Coordinates:
  * gran_idx             (gran_idx) uint64 83903
  * spot                 (spot) uint8 2 5
  * delta_time           (delta_time) datetime64[ns] 2019-05-23T05:39:00.8367...
    source_file          (gran_idx) <U41 './ATL06_20190523053429_08390310_004...
Data variables:
    sc_orient            (gran_idx) int8 0
    cycle_number         (gran_idx) int8 3
    rgt                  (gran_idx) int16 839
    h_li                 (spot, gran_idx, delta_time) float32 -52.24 ... nan
    latitude             (spot, gran_idx, delta_time) float64 -67.12 ... nan
    longitude            (spot, gran_idx, delta_time) float64 165.9 ... nan
    gt                   (gran_idx, spot) <U4 'gt3r' 'gt1l'
    atlas_sdp_gps_epoch  (gran_idx) datetime64[ns] 2018-01-01T00:00:18
    data_start_utc       (gran_idx) |S27 b'2019-05-23T05:39:00.477122Z'
    data_end_utc         (gran_idx) |S27 b'2019-05-23T05:42:10.569813Z'
Attributes:
    data_product:  ATL06
    Description:   The land_ice_height group contains the primary set of deri...
    data_rate:     Data within this group are sparse.  Data values are provid...

The data_start_utc and data_end_utc still seems to be a string type for some reason, didn't you fix that as mentioned in #222 (comment)? I'd see if that can be fixed quickly, otherwise just do it in a follow-up Pull Request since those two variables not that commonly used compared to delta_time.

Oh, and I did spot a few minor typos and an ImportError, but once those are fixed, I think this PR is in a good enough state to bring to the world 🚀

examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
icepyx/core/is2cat.py Outdated Show resolved Hide resolved
icepyx/core/read.py Outdated Show resolved Hide resolved
examples/ICESat-2_Data_Read-in_Example.ipynb Outdated Show resolved Hide resolved
JessicaS11 and others added 7 commits November 23, 2021 15:41
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@JessicaS11
Copy link
Member Author

The data_start_utc and data_end_utc still seems to be a string type for some reason, didn't you fix that as mentioned in #222 (comment)? I'd see if that can be fixed quickly, otherwise just do it in a follow-up Pull Request since those two variables not that commonly used compared to delta_time.

Good catch. I had, but numpy was issuing a warning, so I'd added a silencer for the warning, which apparently was making it not complete the try that was setting the type.

@JessicaS11 JessicaS11 merged commit 63feb76 into development Nov 23, 2021
@JessicaS11 JessicaS11 deleted the dataobj branch November 23, 2021 22:14
JessicaS11 added a commit that referenced this pull request Dec 8, 2021
* add github action to add binder badge to PRs (#229)
* use the binder badge action directly (instead of a manual implementation of it) (#233)
See the discussion in #230 for more details on this switch.
* preliminary AWS access (#213)
* update links for travis badge (#234)
* Fix failing test_visualization_date_range check for ATL07 (#241)
* By default, no email status updates to users when ordering granules (#240)
* remove extra cell causing errors in example notebook
* Set default page size for orders to 2000 per NSIDC recommendation (#239)
* Add ICESat-2 data read-in functionality (#222)
* update examples from 2020 Hackweek tutorials
* update add and commit GitHub Action version (and UML diagrams) (#244)
* merge traffic (GitHub and PyPI) and bib updates (#245)
* Release 0.5.0 (#246)
* release v0.5.0 CI fixes (#251)
* fix Travis CI label in readme
* update earthdata login fixture for testing
* add required input to pytest fixture (#252)

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
Co-authored-by: trey-stafford <trey.stafford@colorado.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request longer_contribution Issues that will take more than an afternoon to implement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An intake catalog for ICESat-2 ATLAS data
5 participants