Expose Intake-ESM test data for download #437

Closed
kmpaul opened this issue Jan 27, 2022 · 8 comments · Fixed by #454
Labels
enhancement Issues that are found to be a reasonable candidate feature additions help wanted

Comments

@kmpaul
Collaborator

kmpaul commented Jan 27, 2022

Is your feature request related to a problem? Please describe.
Learning intake-esm would be much easier if users had a sample dataset and catalog to reference. The data doesn't have to be "real" in any sense; an example dataset would do, and I think the pared-down datasets in the tests directory of this repository might fit the bill. It would be helpful if such a dataset and its corresponding catalog could be downloaded with intake-esm, in the same way that xarray.tutorial provides an easy mechanism for downloading example datasets.

Describe the solution you'd like
In the tests directory of this repository, the following catalogs and datasets appear to be already available as toy examples:

  • sample-collections/cesm1-lens-netcdf.csv and cesm1-lens-netcdf.json + sample-data/cesm-le/*.nc
  • sample-collections/cmip5-netcdf.csv and cmip5-netcdf.json + sample-data/cmip/cmip5/*
  • sample-collections/cmip6-netcdf-test.csv and cmip6-netcdf.json + sample-data/cmip/CMIP6/*
  • sample-collections/multi-variable-catalog.csv and multi-variable-catalog.json + sample-data/cesm-multi-variables/*.nc

I think an excellent solution might be to create an intake_esm.tutorial module that contains something like a load_collection(name) function (or load_catalog(name)?) that loads the above sample data and corresponding catalogs from GitHub, in the way that xarray.tutorial.load_dataset(name) does. I would imagine that this function might return the path to the newly downloaded catalog (or the path to the existing catalog, if already downloaded?).
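
A minimal sketch of what such a function might look like, assuming a download-once-then-reuse cache. Everything here is hypothetical: the `load_catalog` signature, the `BASE_URL` placeholder, the `cache_dir` argument, and the injectable `fetch` callable are illustration only and not part of intake-esm.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Placeholder location for the sample catalogs (an assumption, not a real URL).
BASE_URL = "https://example.org/intake-esm-samples"


def _download(url: str) -> bytes:
    """Fetch the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def load_catalog(
    name: str,
    cache_dir: Path,
    base_url: str = BASE_URL,
    fetch: Callable[[str], bytes] = _download,
) -> Path:
    """Return the local path to the catalog JSON, downloading it only if missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.json"
    if not target.exists():
        # First request for this name: download and store it in the cache.
        target.write_bytes(fetch(f"{base_url}/{name}.json"))
    # Later requests reuse the cached file without touching the network.
    return target
```

Making `fetch` a parameter keeps the caching logic testable without network access. A fuller version would also need to fetch the CSV and the data files that the catalog JSON refers to.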

Describe alternatives you've considered
The obvious solution is to just tell people to git clone the repo and possibly reorganize the files so they are more easily "discoverable". Perhaps that should be a first (simple) attempt.

@andersy005
Member

I like this idea... A pull request is definitely welcome :)

@andersy005 andersy005 added enhancement Issues that are found to be a reasonable candidate feature additions help wanted labels Jan 28, 2022
@kmpaul
Collaborator Author

kmpaul commented Jan 28, 2022

Absolutely. If I find time, I will. We can keep this here as a record of the idea.

@jukent
Collaborator

jukent commented Feb 1, 2022

I started this in PR #443. Could someone look and let me know if they think I'm going down the right path? @kmpaul

@andersy005
Member

@jukent, here are some of the intake examples packaged as conda packages: https://github.com/intake/intake-examples

@kmpaul
Collaborator Author

kmpaul commented Feb 15, 2022

@andersy005: That intake-examples repo is a good model for this.

@jukent: If you can follow the logic in that repo, that might be a good approach to writing an intake-esm-examples module.

@jukent
Collaborator

jukent commented Feb 16, 2022

@kmpaul @andersy005 It looks to me like that repo just houses the data -- and then in each tutorial notebook they point to the data catalog directly with the url. Am I missing something?

import intake

# put catalog in public place for sharing
cat = intake.open_catalog('https://raw.githubusercontent.com/intake/'
                          'intake-examples/master/tutorial/sea.yaml')

@kmpaul
Collaborator Author

kmpaul commented Feb 16, 2022

@jukent: I think @andersy005 recommended looking at this repository because it demonstrates (or appears to) how to make downloadable data and corresponding catalogs work from an installable package. If you look at the README.md here, it suggests that by constructing the pip-installable package with the appropriate entry_points, you can hook third-party example datasets and catalogs directly into Intake's examples.

@andersy005: Is that why you were suggesting looking at this?

@andersy005
Member

andersy005 commented Feb 16, 2022

The data-us-states package is a much better example. It's serving data hosted in this repository. So, when a user installs the data-us-states package, they end up with the contents of the GitHub repository in their conda environment. I realize that getting this to work in the context of intake-esm may end up being more involved.

In my opinion, the existing netCDF files aren't really very useful for demonstration purposes, and because accessing netCDF over HTTP is complicated, I would recommend using a subset of the realistic CMIP6, CESM1-LENS, and CESM2-LENS Zarr datasets hosted on AWS and Google Cloud. With these datasets, we wouldn't have to worry about the location of the data (provided we don't need any data caching). We would just create static catalogs that point to a subset of these datasets, and these catalogs could be accessed from anywhere (provided the user has internet access).

cat = intake_esm.tutorial.load_catalog('aws_cesm1_lens_sample')
cat = intake_esm.tutorial.load_catalog('aws_cesm2_lens_sample')
cat = intake_esm.tutorial.load_catalog('aws_cmip6_sample')
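
One way the name lookup behind such calls could work is a static registry that maps each sample name to a catalog URL. This is only a sketch: the `SAMPLE_CATALOGS` dictionary, the `get_url` helper, and the example.org URLs are hypothetical placeholders, and only the three sample names come from the lines above.

```python
# Hypothetical registry: sample name -> URL of a static catalog file.
# The URLs are placeholders, not real catalog locations.
SAMPLE_CATALOGS = {
    "aws_cesm1_lens_sample": "https://example.org/catalogs/aws-cesm1-lens-sample.json",
    "aws_cesm2_lens_sample": "https://example.org/catalogs/aws-cesm2-lens-sample.json",
    "aws_cmip6_sample": "https://example.org/catalogs/aws-cmip6-sample.json",
}


def get_url(name: str) -> str:
    """Return the catalog URL registered under a sample name."""
    try:
        return SAMPLE_CATALOGS[name]
    except KeyError:
        known = ", ".join(sorted(SAMPLE_CATALOGS))
        # Listing the valid names makes typos easy to diagnose.
        raise KeyError(f"Unknown sample catalog {name!r}; available: {known}") from None
```

Because the catalogs would point at cloud-hosted Zarr stores, nothing needs to be downloaded up front. The returned URL could simply be handed to `intake.open_esm_datastore(...)`.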

@jukent jukent mentioned this issue Mar 11, 2022