
example #2

Merged: 3 commits into intake:master, Feb 11, 2019

Conversation

martindurant (Member)

Works:

In [1]: import intake

In [2]: cat = intake.open_thredds_cat('http://dap.nci.org.au/thredds/catalog.xml')

In [3]: catty = cat['eMAST TERN']()['eMAST TERN - files']()['ASCAT']()['ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011']()['00000000']()

In [4]: catty['ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011_00000000_201104.nc'].to_dask()

<xarray.Dataset>
Dimensions:                                 (lat: 681, lon: 841, time: 30)
Coordinates:
  * time                                    (time) datetime64[ns] 2011-04-01 ... 2011-04-30
  * lat                                     (lat) float64 -10.0 -10.05 ... -44.0
  * lon                                     (lon) float64 112.0 112.0 ... 154.0
Data variables:
    crs                                     uint8 ...
    lwe_thickness_of_soil_moisture_content  (time, lat, lon) float32 ...
Attributes:
    history:                         Reformatted to NetCDF 2014-08-04. Update...
    license:                         Copyright 2014 CSIRO. Rights owned by th...
    licence_data_access:             These data can be freely downloaded and ...
    spatial_coverage:                Australia
    acknowledgment:                  The conversion of this data into NetCDFs...
    citation:                        Wagner, Wolfgang; Lemoine, Guido; Rott, ...
    TERN_eMAST_contact:              eMAST.data@mq.edu.au
    title:                           ASCAT derived daily soil moisture: 0.05 ...
    licence_copyright:               Copyright 2014 CSIRO. Rights owned by th...
    short_desc:                      ASCAT derived soil moisture, Australian ...
    summary:                         Daily top-layer (~ 0-2 cm) soil moisture...
    long_name:                       ASCAT derived soil moisture
    contact:                         Luigi Renzullo, Senior Research Scientis...
    keywords:                        EARTH SCIENCE > CLIMATE INDICATORS > LAN...
    Conventions:                     CF-1.6
    institution:                     CSIRO
    geospatial_lat_min:              -44.025
    geospatial_lat_max:              -9.974999999999994
    geospatial_lat_units:            degrees_north
    geospatial_lat_resolution:       -0.05
    geospatial_lon_min:              111.975
    geospatial_lon_max:              154.025
    geospatial_lon_units:            degrees_east
    geospatial_lon_resolution:       0.05
    keywords_vocabulary:             Global Change Master Directory (http://g...
    metadata_link:                   http://datamgt.nci.org.au:8080/geonetwork
    standard_name_vocabulary:        Climate and Forecast(CF) convention stan...
    DOI:                             To be added
    cdm_data_type:                   grid
    contributor_name:                Wolfgang Wagner, Guido Lemoine, Helmut Rott
    contributor_role:                principalInvestigator, author, author
    creator_name:                    eMAST data manager
    creator_url:                     http://www.emast.org.au/
    Metadata_Conventions:            Unidata Dataset Discovery v1.0
    publisher_name:                  Ecosystem Modelling and Scaling Infrastr...
    publisher_email:                 eMAST.data@mq.edu.au
    publisher_url:                   http://www.emast.org.au/
    id:                              ASCAT_v1-0_soil-moisture_daily_0-05deg_2...
    source:                          ASCAT_v1-0
    date_created:                    2014/08/04
    creator_email:                   emast.info@gmail.com
    metadata_uuid:                   17236f6f-829e-48e1-98fd-32c7904a5793
    DODS_EXTRA.Unlimited_Dimension:  time

Notes:

  • the ugly [...]() syntax is an artefact of having reference names that aren't valid python identifiers. It is certainly possible to get rid of the parentheses and maybe to include the chain as [.., ..]
  • the output here is a Dasky xarray (via opendap) but with a single chunk; I don't know what one should say to xarray in order to get "automatic" chunking.
  • each nested level is loaded on demand only. In theory it would be possible to walk the whole tree, but that would take many HTTP calls and be slow. Various metadata is available as one descends the levels and none of this is captured
  • this took very few lines of code
  • not sure if this should be called siphon or thredds. Probably few have heard of the former?

Martin Durant added 2 commits January 11, 2019 15:40
@martindurant (Member Author)

cc #1

@rabernat

Wow that was fast! 😲 Amazing work Martin!

  • the ugly [...]() syntax is an artefact of having reference names that aren't valid python identifiers. It is certainly possible to get rid of the parentheses and maybe to include the chain as [.., ..]

I don't see much of a way around this.

  • the output here is a Dasky xarray (via opendap) but with a single chunk; I don't know what one should say to xarray in order to get "automatic" chunking.

Agreed: the to_dask() syntax is starting to feel a bit of a stretch here, since there are actually no dask arrays in this xarray dataset. (Out of curiosity, why not .to_xarray(dask=True) or .to_xarray(chunks=True)?)

There is no automatic chunking of single files in xarray.open_dataset. You need to manually call .chunk(). Automatic chunking only happens with open_mfdataset.
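For concreteness, a minimal sketch of that manual step (the URL is a placeholder, not one of the real endpoints above):

import xarray as xr

# placeholder OPeNDAP URL; open_dataset is lazy, but the variables are not dask arrays
ds = xr.open_dataset('http://dap.nci.org.au/thredds/dodsC/example.nc')

# .chunk() converts the variables to dask arrays with the given chunk sizes
ds = ds.chunk({'time': 1})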

In my original siphon to xarray example, I loaded many individual OpenDAP data files into a single xarray dataset via open_mfdataset. I think a broader goal for intake-siphon would be to allow skipping the final ['00000000'] level of the hierarchy and just getting the whole set of files (different time snapshots of the same field) as a single xarray dataset with dask arrays.
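Something like the following sketch, assuming a catalog level whose entries are the individual files (the catalog URL and path here are illustrative):

from siphon.catalog import TDSCatalog
import xarray as xr

# illustrative catalog URL; access_urls['OPENDAP'] is siphon's standard accessor
cat = TDSCatalog('http://dap.nci.org.au/thredds/catalog/example/00000000/catalog.xml')
urls = [ds.access_urls['OPENDAP'] for ds in cat.datasets.values()]

# combine the per-file time snapshots into one dataset backed by dask arrays
ds = xr.open_mfdataset(urls)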

  • each nested level is loaded on demand only. In theory it would be possible to walk the whole tree, but that would take many HTTP calls and be slow. Various metadata is available as one descends the levels and none of this is captured

I think that's fine.

  • this took very few lines of code

👏

  • not sure if this should be called siphon or thredds. Probably few have heard of the former?

Agreed, thredds is probably more generic and recognizable.

@kmpaul (Collaborator) commented Jan 12, 2019

This is great!

I agree with @rabernat that calling this intake-thredds might be clearer.

Also, I will second @rabernat's question about the to_dask() method. I believe that some of @martindurant's thoughts on this can be found here: intake/intake-xarray#26.

@pbranson commented Jan 14, 2019 via email

@rabernat

However, I'm not sure about use in practice; my experience when hitting opendap servers with open_mfdataset and subsetting with multiple dask workers is that the thredds server dies or starts to time out.

I have had the opposite experience. Well-configured TDS servers seem to be able to handle many simultaneous requests. I have some experiments in this binder:
https://github.com/rabernat/pangeo_esgf_demo

Wondering about any more recent results from https://github.com/pangeo-data/storage-benchmarks

This project was abandoned by the intern who was working on it and is not going anywhere. However, I have done some of my own benchmarking.

Here is throughput from an ESGF THREDDS server running in Google Cloud, accessed in parallel via xarray / dask:
[image: throughput benchmark; "cs" stands for "chunk size"]

Here is the same access pattern using zarr reading directly from cloud storage:
[image: throughput benchmark]

You can see that the direct approach gets orders of magnitude higher throughput.

More here: https://speakerdeck.com/rabernat/cloud-native-climate-data-with-zarr-and-xarray

@martindurant (Member Author)

I think a broader goal for intake-siphon would be to allow skipping the final ['00000000'] level of the hierarchy and just getting the whole set of files

Is there a way that I could have known (attached metadata or something) that this was the final level and that the entries below it formed a coordinate grid?

@martindurant (Member Author)

With intake/intake#229, the following syntax does work:

cat['eMAST TERN', 'eMAST TERN - files', 'ASCAT', 'ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011', '00000000',
    'ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011_00000000_201111.nc'].to_dask()

@dopplershift (Collaborator)

This looks really cool! I'd echo the sentiment about intake-thredds; that would make more sense to me.

In siphon, I think we're looking at adding some syntax for walking through the catalog more simply (Unidata/siphon#263). Not sure if any similar ideas are applicable to the intake world.

@martindurant (Member Author)

@dopplershift , those ideas for walking the catalog will work as-is with the code here, except that "cat" is an Intake catalogue (but every instance has the siphon cat as an attribute too). Since the names are not valid python identifiers, you cannot use tab-completion, but if they were you could.

@dopplershift (Collaborator)

@martindurant IPython has hooks that also enable completion of dictionary keys within []. Are you saying those need to be valid identifiers as well?

@martindurant (Member Author)

@dopplershift, no, I was not saying that. Do you know how IPython fetches the set of potential completions?

@dopplershift (Collaborator)

Looks like you can define _ipython_key_completions_() to provide suggested completions for obj[.
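For reference, a minimal sketch of that hook (the class and attribute names here are illustrative, not the actual intake implementation):

class Catalog:
    def __init__(self, entries):
        self._entries = entries  # mapping of entry name -> source or sub-catalog

    def __getitem__(self, key):
        return self._entries[key]

    def _ipython_key_completions_(self):
        # IPython calls this to populate tab-completion inside cat["<TAB>
        return list(self._entries)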

@martindurant (Member Author)

^ Done in master

@martindurant (Member Author)

@andersy005, can we change the name of this repo to intake-thredds? I'll change the name of the python package in this PR, and I think we should merge this close to what is here, so that people can try it and we can iterate on ideas like @rabernat's about merging several similar sources using xarray.

@andersy005 (Member)

Can we change the name of this repo to intake-thredds?

@martindurant, this is done.

@martindurant (Member Author)

Thanks @andersy005. I have globally renamed things within the repo, including places where it probably doesn't matter. What do you think remains to be done in this PR? I'd be keen to merge sooner rather than later, so that we can get something up and usable for experimentation.

@andersy005 (Member)

This looks good, @martindurant! Are you planning on adding tests to this PR? If not, that's also fine. We can merge this and iterate on tests in future PRs.

@martindurant (Member Author)

I do not know where to go for tests; it doesn't seem like a good idea to test against live thredds servers, which may change or go down without notice.
Perhaps https://github.com/Unidata/thredds-docker would be a way to do it? I would put that in a separate PR, since it may take some work to get right.

@andersy005 (Member)

Sounds good. I am going to merge this. By the way, I added you as an Admin to the repo.

@andersy005 andersy005 merged commit 792fdc0 into intake:master Feb 11, 2019
@martindurant martindurant deleted the first_attempt branch February 11, 2019 19:51
@dopplershift (Collaborator)

For siphon, we’ve used vcrpy as a way to record web requests to e.g. THREDDS servers and play them back for testing purposes.
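Roughly like this (the cassette path and test body are made up for illustration):

import intake
import vcr

@vcr.use_cassette('tests/cassettes/thredds_catalog.yaml')
def test_open_catalog():
    # the first run records the real HTTP exchange into the cassette;
    # subsequent runs replay it, so no live THREDDS server is needed
    cat = intake.open_thredds_cat('http://dap.nci.org.au/thredds/catalog.xml')
    assert list(cat)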

@martindurant (Member Author)

I have used vcrpy for gcsfs, and find it an immense pain to work with!

@rabernat

An alternative may be to use pydap to start up a lightweight opendap server. Pydap also serves THREDDS metadata. Since it is pure python, it can be launched in a range of ways that are compatible with testing environments.

@martindurant (Member Author)

Thanks @rabernat, that sounds like the best option, then.

@dopplershift (Collaborator)

@martindurant It’s interesting that you feel that way. In the scheme of things that cause me pain in testing and maintaining a CI system, vcrpy doesn’t crack my top 20. Would love to know more (but don’t want to belabor the point).

As far as pydap is concerned, you're now introducing a third-party package here to stand in as a mock for TDS, one whose goal isn't to be a TDS, just to serve THREDDS-compatible catalogs. I feel like this has the potential to end up shaking out issues in pydap rather than testing intake-thredds.
(This isn't hypothetical; I've seen some impactful differences in the past.)

Just $0.02 from someone who's not signing up for more work. 😉

@martindurant (Member Author)

from someone who's not signing up for more work.

!!

It may well be that my vcr setup for gcsfs is not as intended, but I first came across it in the context of adlfs (azure-datalake-storage), where the collaborators claimed to know vcr well, and I mostly copied their prescription, perhaps imperfectly.

@dopplershift (Collaborator)

It’s entirely possible our use of vcrpy is too simplistic to encounter the pain points.
