we need a data catalog #39

Closed
rabernat opened this issue Nov 15, 2017 · 38 comments

Comments

@rabernat
Member

At this point we have multiple datasets on multiple different systems.

It would be great to have some kind of catalog interface to avoid having to manually change the path names every time. For example:

import xarray as xr
from pangeo import catalog
ds = xr.open_mfdataset(**catalog.dataset('CM2.6'))
@mrocklin
Member

Perhaps just some yaml file on the network file system that we refer against as well as some tiny library that reads that file?

Also cc'ing @seibert , who thinks about these things within Anaconda Inc. @seibert these tend to be glob-string paths pointing to NetCDF files. All machines have uniform access to them, so there is no need for remote data access. Modeling data types would be difficult due to the complexity of the underlying files.
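
A minimal sketch of the "file plus tiny library" idea, assuming a hypothetical catalog file that maps dataset names to glob paths and loader keyword arguments. It uses JSON from the standard library in place of YAML so that it has no dependencies. The `Catalog` class, the file layout, the example path and the chunk size are all illustrative assumptions, not an existing pangeo API.

```python
import json
import tempfile
from pathlib import Path


class Catalog:
    """Tiny read-only catalog: dataset name -> keyword arguments for a loader."""

    def __init__(self, path):
        # The file is assumed to hold {"name": {"paths": "...", ...}, ...}.
        with open(path) as f:
            self._entries = json.load(f)

    def names(self):
        """Return the known dataset names in sorted order."""
        return sorted(self._entries)

    def dataset(self, name):
        """Return a fresh dict of kwargs, e.g. for xr.open_mfdataset(**kwargs)."""
        try:
            return dict(self._entries[name])
        except KeyError:
            raise KeyError(f"unknown dataset {name!r}; known: {self.names()}") from None


# Demo with a throwaway file; the path and chunk size are made up.
with tempfile.TemporaryDirectory() as tmp:
    catalog_file = Path(tmp) / "catalog.json"
    catalog_file.write_text(json.dumps(
        {"CM2.6": {"paths": "/data/CM2.6/*.nc", "chunks": {"time": 1}}}
    ))
    catalog = Catalog(catalog_file)
    kwargs = catalog.dataset("CM2.6")
    print(kwargs)
```

Because every machine sees the same file system, moving a dataset would then only require editing the shared catalog file rather than each notebook.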

@mmccarty

I had a quick email exchange with @jhamman, who pointed me to this issue. Anaconda has a new open source project, Intake, that provides some functionality for building data catalogs.

This week, I have been investigating Intake for another project and started a plugin for NetCDF. So far, intake-netcdf is just a thin wrapper around xarray, since xarray already provides most of the file I/O and dask functionality that Intake offers. That limits the advantage of the approach to the cataloging aspects. I'm wondering if this community has any alternatives or input on this approach?

@rabernat
Member Author

Sounds very promising @mmccarty! The cataloging features of Intake could definitely solve some of the problems we are contemplating.

One complication for Pangeo is that most of our cloud-based data is actually stored in zarr format, not netCDF. So my recommendation would be to broaden the scope beyond netCDF to include any format that can be read by xarray. The docs you pointed to say:

Intake supports loading data into standard Python containers, depending on the data source:

  • Pandas Dataframes - tabular data
  • NumPy Arrays - tensor data
  • Python lists of dictionaries - semi-structured data
  • (possibly more in the future)

There is a strong analogy between pandas and xarray. pandas.DataFrame is the fundamental Python data structure; the data itself can come from csv, hdf, xls and many other file formats. Likewise, xarray data can come from netCDF, hdf, geotiff, zarr, etc. What would be great is to have an xarray data catalog that is agnostic to the underlying file format.

We have also discussed Quilt in #162, which I guess aims to solve a similar problem.

@countdorkula

countdorkula commented Apr 5, 2018

This probably isn't exactly what you want, but you might want to take a look at datalad / git-annex as well to see how a similar problem of sprawling datasets is being tackled in neuroimaging.

dat is another project that shows some promise in this domain (data versioning / cataloguing) as well.

@mmccarty

mmccarty commented Apr 5, 2018

@rabernat My initial approach was to add an xarray container to Intake, however it felt a bit awkward (like this was a lower level of abstraction than we needed, since xarray already provides much of the dask and file IO functionality on its own). Also, the xarray container needed a slightly different interface to support chunks in N dimensions. So we (@seibert, @martindurant and I) then decided to attempt to implement a plugin "the harder way", which is the approach in the netCDF plugin.

Options we are considering:

  • xarray container with a separate interface
  • generalize the netCDF plugin to support other file types (perhaps a class of xarray Intake plugins for code reuse)
  • consider cataloging alternatives

@jacobtomlinson
Member

It would be nice to separate netCDF from xarray in this work. There are multiple libraries for loading netCDF files into numpy/dask arrays. From our point of view we would like to be able to add support for iris.

@mmccarty

mmccarty commented Apr 5, 2018

Sounds like a case for option 1 (which I favor) plus an iris Intake container in the future (this PR shows the container interface thus far). Then plugins can be built for netCDF using either container.

@martindurant
Contributor

There are already multiple methods by which we would want to load xarray data. I don't know iris, but netCDF/hdf5 is limited to local files (or NFS, FUSE), while zarr works for data anywhere.

Also, there are other data formats like this, such as minc (a particular netCDF flavour), dicom and fits, which all hold heterogeneous ndarrays (and maybe tables) plus copious metadata with coordinate information.

@martindurant
Contributor

pydata/xarray#1750 ?

@rabernat
Member Author

rabernat commented Apr 5, 2018

@mmccarty: let me reiterate that I am very enthusiastic about this. A simple data catalog would really accelerate Pangeo. Right now we have a solid computational platform, but the platform is useless without data. So far, we have mostly been working with datasets that we already know about, loading them manually. It sounds like Intake could really help us grow a general purpose, flexible data catalog.

I repeat my comment above that it is very important for us to be able to load xarray data from multiple different backends. On pangeo.pydata.org, most of the data is stored in zarr format in Google Cloud Storage. Here is what I imagine the catalog.yml entry might look like for this data:

sources:
  sea_surface_altimetry:
    description: GLOBAL OCEAN GRIDDED L4 SEA SURFACE HEIGHTS AND DERIVED VARIABLES REPROCESSED (1993-ONGOING)
    driver: xarray-zarr
    args:
      store: gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt

Does this help you? Please let me know what other input you need to guide the design of xarray / intake integration. Perhaps we should open an issue specifically for that.

p.s. Could you clarify what you mean by "option 1"?

@martindurant
Contributor

@rabernat, that is exactly what we envisage. The decision was whether each plugin would be written "from scratch" and implement all methods, or whether we can have an "xarray" container type that is common to many. In fact, since xarray already has several loader methods, we could also have a combined plugin for all, where the entry would look something like:

sources:
  sea_surface_altimetry:
    description: GLOBAL OCEAN GRIDDED L4 SEA SURFACE HEIGHTS AND DERIVED VARIABLES REPROCESSED (1993-ONGOING)
    driver: xarray
    method: zarr
    url: gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt
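
One way to picture the combined plugin is a lookup table from the `method` field to a loader function. The sketch below is an assumption about how such dispatch could work, not intake's actual implementation. The loaders are stand-ins so that it runs without xarray installed, and the optional `kwargs` key is hypothetical (the entry above has no such field).

```python
# Stand-in loaders so the sketch runs without xarray installed. In real use
# these would wrap the corresponding xarray open functions.
def fake_open_zarr(url, **kwargs):
    return ("zarr", url, kwargs)


def fake_open_netcdf(url, **kwargs):
    return ("netcdf", url, kwargs)


# One driver, many methods: the table is the only place that names a format.
LOADERS = {"zarr": fake_open_zarr, "netcdf": fake_open_netcdf}


def open_entry(entry):
    """Open one catalog entry by dispatching on its 'method' field."""
    method = entry["method"]
    if method not in LOADERS:
        raise ValueError(f"unsupported method {method!r}")
    # Extra loader arguments live under an optional, hypothetical 'kwargs' key.
    return LOADERS[method](entry["url"], **entry.get("kwargs", {}))


entry = {
    "driver": "xarray",
    "method": "zarr",
    "url": "gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt",
}
result = open_entry(entry)
print(result)
```

Supporting a new format would then mean adding one line to `LOADERS` rather than writing a whole plugin.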

@rabernat
Member Author

rabernat commented Apr 5, 2018

Sounds like we are on the same page! Please let us know how we can help.

@mrocklin
Member

mrocklin commented Apr 5, 2018

I think that a useful deliverable here would be a pull request that changes both the Docker image in this repository (see gce/notebook/Dockerfile) and the attached notebooks (see gce/notebook/examples/*.ipynb), replacing the current mechanism for creating xarray datasets from Zarr on GCS, which looks like this:

import gcsfs
import xarray as xr
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
ds = xr.open_zarr(gcsmap)

with something that looks like this

import intake
ds = intake.open('dataset-....')

Presumably this would require intake to grow appropriately. It would also require the addition of an intake config file in the docker image that specified all of the dataset locations pointed to by the various notebooks.

@pbranson
Member

pbranson commented Apr 7, 2018

Hi, my name is Paul Branson, I am a coastal scientist from Perth, Australia and have been lurking around for a while to learn about this awesome project!

From reading your recent discussions about CMIP6 (#179) and this thread I thought up a possible use case as a solution to a current problem my research group has for a distributed information system.

The idea would be to set up a Kubernetes cloud federation (https://medium.com/google-cloud/experimenting-with-cross-cloud-kubernetes-cluster-federation-dfa99f913d54) that joins up multiple clouds, each physically co-located with its data. Then it might be possible to write an xarray driver for Intake that natively starts a dask KubeCluster() with workers+backend on the appropriate pods(?) and a dataset-specific yaml. I think Ingress could be used to manage the dataset namespacing somewhat transparently (https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0).

Do you think that could be a reasonable approach (for co-operating organisations) to provide a simple platform for data analysis across distributed datasets? Possibly a little off topic sorry!

@mmccarty

The latest on plan intake-xarray.

@martindurant
Contributor

The latest on plan intake-xarray.

Specifically, we will provide thin wrappers for all the xarray.open-* functions, so from pangeo's point of view, you would make a catalogue by encoding the arguments currently in the notebooks into each YAML entry for the catalogue file.
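
As a rough illustration of "encoding the arguments currently in the notebooks", the sketch below builds a catalogue entry from the keyword arguments a notebook would pass to its loader. The `sources` / `description` / `driver` / `args` layout is taken from the imagined entry earlier in this thread, not from a confirmed intake-xarray schema, and `make_entry` and `make_catalog` are hypothetical helpers. The output is written as JSON only to avoid a YAML dependency. YAML 1.2 is designed as a superset of JSON, so many YAML loaders should accept it, but that is worth checking against the loader actually in use.

```python
import json


def make_entry(description, driver, **args):
    """Build one catalogue entry from the arguments a notebook passes today."""
    return {"description": description, "driver": driver, "args": args}


def make_catalog(entries):
    """Wrap named entries in the top-level 'sources' mapping."""
    return {"sources": dict(entries)}


catalog = make_catalog({
    "sea_surface_altimetry": make_entry(
        "GLOBAL OCEAN GRIDDED L4 SEA SURFACE HEIGHTS AND DERIVED VARIABLES "
        "REPROCESSED (1993-ONGOING)",
        "xarray-zarr",
        store="gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt",
    ),
})

# Serialise for the catalogue file.
text = json.dumps(catalog, indent=2)
print(text)
```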

(I don't think there is any plan for intake to manage the execution environment)

@rabernat
Member Author

This sounds great! Let us know when you have a minimal working prototype we can try on pangeo.pydata.org!

@martindurant
Contributor

Work coming here: intake/intake-xarray#7 @rabernat

@martindurant
Contributor

I made a quick pre-release, so you can try it out with

conda install -c intake intake-xarray

(but expect some rough edges!)

@mrocklin
Member

mrocklin commented Apr 28, 2018 via email

@mrocklin
Member

mrocklin commented Apr 29, 2018 via email

@martindurant
Contributor

Agreed, @mrocklin . First, intake-xarray should have a minimum of documentation, but that is exactly the plan.

@rabernat
Member Author

rabernat commented May 1, 2018

Just want to reiterate my enthusiasm for these developments! Can't wait to see a working example of intake-xarray in pangeo.

@rabernat
Member Author

I just came across this jupyterlab extension: https://github.com/eWaterCycle/jupyterlab_thredds

JupyterLab dataset browser for THREDDS catalog

Can inject iris/xarray/leaflet code cells into a Python notebook of a selected dataset to further process/visualize the dataset.

This seems really cool! Something similar could probably be developed for intake.

@martindurant
Contributor

We could certainly do something like that. I'm not certain how the initial "catalogue location" would be populated (file-browser, list of recent locations, set of builtin cats?), but listing entries and their brief details, and then injecting load commands sounds very worthwhile. I'll raise the issue with Intake.
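
The "injecting load commands" part could be as small as a function that turns a selected entry into the text of a code cell. This is only a sketch: `load_snippet` is a hypothetical helper, and the generated calls (`intake.open_catalog`, item access on the catalogue, `.to_dask()`) are assumptions based on later intake releases and may not match the pre-release discussed here.

```python
def load_snippet(catalog_url, entry_name, variable="ds"):
    """Return Python source that loads one catalogue entry into `variable`."""
    if not variable.isidentifier():
        raise ValueError(f"{variable!r} is not a valid Python identifier")
    # repr() quotes and escapes the strings so the generated cell is valid code.
    return (
        "import intake\n"
        f"cat = intake.open_catalog({catalog_url!r})\n"
        f"{variable} = cat[{entry_name!r}].to_dask()\n"
    )


snippet = load_snippet("catalog.yml", "sea_surface_altimetry")
compile(snippet, "<cell>", "exec")  # syntax check only; nothing is executed
print(snippet)
```

A browser extension would list the catalogue entries, call something like this for the one the user clicks, and paste the result into a new notebook cell.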

@martindurant
Contributor

In case you missed it, @rabernat: the latest release of dask, now on the pangeo cluster, does not include my fix for HTTP downloading of the catalogue (GitHub's headers give incorrect information about the file size).

@rabernat
Member Author

@martindurant - I'm not sure I understand your comment. In the dockerfile of #231, you specified dask=0.17.5. This is unchanged in the most recent dockerfile of my latest update (#261).

I get the impression from your comment that intake is still not working properly on the pangeo cluster. What went wrong? Was your fix for the HTTP downloading not yet merged when 0.17.5 was released? What do we need to do to fix it at this point?

@martindurant
Contributor

Specifically, my fix for HTTP was merged into master in time, but the 0.17.5 release was not based on master; it was a set of quick fixes for the pandas release that also just happened. So my fix is not in the release, and the (commented) lines for intake in the notebooks don't currently work :|

@mrocklin suggested that he would want to return to installing some packages from master, but that of course comes with its own problems of things changing between image rebuilds.

@rabernat
Member Author

It looks like Matt is working on pointing us to a consistent set of commits for dask-related packages over in pangeo-data/helm-chart#29.

Can you inspect that PR and verify whether it includes the fixes necessary for intake to work?

@martindurant
Contributor

martindurant commented May 20, 2018

@stale

stale bot commented Jul 19, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 19, 2018
@stale

stale bot commented Jul 26, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Jul 26, 2018
@rsignell-usgs
Member

rsignell-usgs commented Jul 27, 2018

A more general and scalable way to create a catalog for Pangeo would be to create ISO metadata records for each Pangeo dataset and harvest them into pycsw, which would then support catalog search using CSW, the Open Geospatial Consortium Catalog Service for the Web.

Users could then search for Pangeo datasets across multiple Pangeo deployments, and it would also provide a mechanism for Pangeo datasets to be discovered on sites like http://data.gov and http://data.gov.uk.

We use this approach in the Integrated Ocean Observing System (IOOS), supporting both user exploration and automated workflows. See this example notebook that finds coastal wave data from multiple providers based on a specified bounding box and time range.

In IOOS, data providers serve data primarily via THREDDS and ERDDAP, both of which automatically generate ISO metadata records from NetCDF datasets with CF Conventions.

In Pangeo, we don't yet have a way to automatically generate ISO metadata records from the CF-compliant NetCDF or Zarr datasets, but we could solve this by developing Python code to generate ISO metadata records directly from Xarray or Iris. The logic of converting NetCDF attributes to ISO is contained in this XSL.

We won't solve this tomorrow, but it might be a nice little project for the roadmap!
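
To make the idea concrete, here is a very rough sketch of pulling discovery metadata out of a dataset's global attributes and searching it by bounding box. It assumes the datasets carry ACDD-style attributes such as `geospatial_lat_min` and `time_coverage_start`. The records are plain dictionaries, not ISO metadata, no CSW service is involved, and all attribute values below are invented for the demo. Time-range filtering and boxes that cross the antimeridian are not handled.

```python
# ACDD-style global attribute names, in (lon_min, lat_min, lon_max, lat_max) order.
BBOX_KEYS = ("geospatial_lon_min", "geospatial_lat_min",
             "geospatial_lon_max", "geospatial_lat_max")


def record_from_attrs(name, attrs):
    """Pull discovery metadata out of a dataset's global attributes."""
    missing = [k for k in BBOX_KEYS if k not in attrs]
    if missing:
        raise KeyError(f"{name}: missing attributes {missing}")
    return {
        "name": name,
        "title": attrs.get("title", name),
        "bbox": tuple(float(attrs[k]) for k in BBOX_KEYS),
        "time": (attrs.get("time_coverage_start"), attrs.get("time_coverage_end")),
    }


def intersects(a, b):
    """True if two (lon_min, lat_min, lon_max, lat_max) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, bbox):
    """Return the names of all records whose bounding box overlaps `bbox`."""
    return [r["name"] for r in records if intersects(r["bbox"], bbox)]


# Invented example attributes; with xarray these would come from ds.attrs.
records = [
    record_from_attrs("global_ssh", {
        "title": "Sea surface height",
        "geospatial_lon_min": -180, "geospatial_lat_min": -90,
        "geospatial_lon_max": 180, "geospatial_lat_max": 90,
        "time_coverage_start": "1993-01-01", "time_coverage_end": "2017-12-31",
    }),
    record_from_attrs("north_sea_waves", {
        "geospatial_lon_min": -4, "geospatial_lat_min": 51,
        "geospatial_lon_max": 9, "geospatial_lat_max": 61,
    }),
]
hits = search(records, (112.0, -36.0, 116.0, -30.0))  # a box off Western Australia
print(hits)
```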

@martindurant
Contributor

@rsignell-usgs , Do you see a role for Intake in this supra-cataloging that you describe? Should it be able to interact with CSW, or should it have a role in extracting appropriate metadata from netCDF/zarr files?

@rsignell-usgs
Member

I don't know much about Intake, but I guess Intake catalogs could be constructed from the responses to CSW queries if that is useful.

@jhamman jhamman reopened this Jul 27, 2018
@stale stale bot removed the stale label Jul 27, 2018
@martindurant
Contributor

Finally announcing intake officially: https://www.anaconda.com/blog/developer-blog/intake-taking-the-pain-out-of-data-access/

@stale

stale bot commented Oct 1, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 1, 2018
@stale

stale bot commented Oct 8, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
