
Change urlpaths to reflect moved datasets #67

Merged · merged 8 commits into pangeo-data:master from charlesbluca:move-datasets on Jan 9, 2020
Conversation

@charlesbluca (Member)

Changes the urlpath arguments to reflect the new locations of the data, as outlined here.

@TomAugspurger (Member)

We'll need to rethink CI a bit. Since these buckets are requester-pays now, we'll need to ensure that the requests are properly authenticated.

The easiest way to do this is probably to set up the right environment / environment variables and use Google's defaults. I can help out with this if needed.
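
A minimal sketch of what "use Google's defaults" means here, assuming the google-auth package and a GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at a service-account key file:

import google.auth

# Application Default Credentials: google-auth checks
# GOOGLE_APPLICATION_CREDENTIALS (among other locations) and loads the
# service-account key it points to.
credentials, project = google.auth.default()
print(project)  # default billing project picked up from the environment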

@TomAugspurger (Member)

For the catalog, hopefully something like

diff --git a/intake-catalogs/hydro.yaml b/intake-catalogs/hydro.yaml
index e11cc9d..5df3fb0 100644
--- a/intake-catalogs/hydro.yaml
+++ b/intake-catalogs/hydro.yaml
@@ -28,6 +28,8 @@ sources:
     args:
       urlpath: gs://pangeo-usgs-hydrosheds/dir
       chunks: {'y': 6000, 'x': 6000}
+      storage_options:
+        requester_pays: True
 
   hydrosheds_acc:
     description: Flow accumulation at 3-second resolution

fixes things. It's not clear to me why the other ones are passing, though. Locally, this failed on your branch:

In [2]: cat = intake.Catalog("intake-catalogs/master.yaml")

In [3]: dset = cat['ocean.cesm_mom6_example'].to_dask()

but adding requester_pays: True to the storage_options arg got things passing again.
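
For context, a rough sketch of what that plumbing does: intake-xarray forwards storage_options to fsspec's get_mapper (visible in the traceback further down this thread), so the by-hand equivalent is approximately the following. The urlpath is the one from the diff above; consolidated=True is an assumption here.

import fsspec
import xarray as xr

# storage_options from the catalog entry become keyword arguments to the
# GCS filesystem; requester_pays makes gcsfs bill requests to the caller's
# default project.
mapper = fsspec.get_mapper("gs://pangeo-usgs-hydrosheds/dir", requester_pays=True)
ds = xr.open_zarr(mapper, consolidated=True)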

@rabernat (Member)

@TomAugspurger - what happens if we add requester_pays: True to a public bucket (one not configured for requester pays)? Does it still work? If so, the easy solution is to just do this for all datasets. Can we make it a default for the catalog?

More generally, I'm worried about the proliferation of different syntax for the same thing. In gcsfs, the keyword is user_project. In intake, it's requester_pays. Where does the translation happen?

This divergence increases the cognitive load for people trying to learn how all this stuff works.

@TomAugspurger (Member) commented Dec 18, 2019 via email

@TomAugspurger (Member)

@charlesbluca, a few more specifics on what (I think) you'll need to do to test this:

  1. We need some GCP credentials on Travis. We'll create a service account following https://cloud.google.com/docs/authentication/production, and create a role dedicated to pangeo-datastore testing, with very limited permissions. Ideally just the ability to make GET requests to GCS.

The credentials are a JSON structure like

{
    "client_id": "xxx",
    "client_secrent": "xxx",
    "refresh_token": "xxx",
    "type": "..."
}

I don't know which parts are sensitive, so we'll assume it all is. Do you have permissions on the GCP pangeo project? If not, I can do this part.

  2. Adding the credentials to Travis: following https://docs.travis-ci.com/user/encrypting-files/ we can add an encrypted file. I've only encrypted variables before, but this looks easiest. I don't have permissions on this repo, so we may need help from @rabernat here, or he can delegate permissions.

  3. Pointing gcsfs to that file: we need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the decrypted file on Travis. This can be done in the regular .travis.yml file, I think; see the sketch below.
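
A hedged sketch of how steps 2 and 3 could look in .travis.yml; the encrypted_*_key / encrypted_*_iv variable names are placeholders that travis encrypt-file generates per repo:

# Sketch only: run `travis encrypt-file gcp-credentials.json` and use the
# openssl line it prints; the key/iv variable names below are placeholders.
before_install:
  - openssl aes-256-cbc -K $encrypted_abc123_key -iv $encrypted_abc123_iv
    -in gcp-credentials.json.enc -out gcp-credentials.json -d
env:
  global:
    # step 3: point google-auth / gcsfs at the decrypted key file
    - GOOGLE_APPLICATION_CREDENTIALS=$TRAVIS_BUILD_DIR/gcp-credentials.json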

LMK if any of that doesn't make sense.

@rabernat (Member)

Thanks for the detailed suggestions, @TomAugspurger.

We are trying to create a service account with the necessary permissions. We just read https://cloud.google.com/storage/docs/requester-pays#requirements and it looks like it recommends either roles/editor, roles/owner, or roles/billing.projectManager. These are all pretty heavy-duty roles with the potential to do big things on the project.

Before going forward, I just wanted to make sure we are on the right track. Can we think of some way to create a service account with more limited permissions?

@TomAugspurger (Member)

IIUC, that page is describing the roles for creating requester-pays buckets. This service account should just be for getting data: it's simulating a third-party user downloading the data (the simulation isn't perfect, since the project will be the same, but it should be good enough).

I think (hope) the storage.objectViewer role from https://cloud.google.com/storage/docs/access-control/iam-roles is sufficient.
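
Once that account exists, a quick hedged sanity check that its key grants read access and nothing more; the key filename is hypothetical, and this assumes a gcsfs version with requester-pays support:

import gcsfs

# storage.objectViewer should allow GET / LIST and nothing else;
# "datastore-ci-key.json" is a placeholder for the downloaded key file.
fs = gcsfs.GCSFileSystem(token="datastore-ci-key.json", requester_pays=True)
print(fs.ls("pangeo-usgs-hydrosheds"))  # read: should succeed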

@rabernat (Member)

@charlesbluca - I created a service account with "Storage Object Viewer" permissions and shared the key with you via Keybase.

@TomAugspurger (Member)

Short term, I think we need to include

storage_options:
  requester_pays: True

on each item in the catalog. Once that's done, the tests will still fail (CI doesn't have credentials yet), but things should be working and I think we can merge. Then we'll have two followups:

  1. Fix CI (we'll need to include encrypted credentials in Travis so that the requests actually go through).
  2. Investigate whether we can make this easier for people getting the data from one of our gcloud pangeos / binder. Google won't actually charge for requests made from the same region, but users would still need to include credentials in the request, and we probably don't want to encourage people to upload their personal project credentials to our pangeos. I'm hopeful that we can have some very limited credentials included in the containers we spawn on Google that can only make get / list requests to our buckets and only from within gcloud. Not sure if that's possible yet.

I'll be able to help more with this tomorrow or Friday.

@charlesbluca (Member, Author)

Added the storage_options back for each moved item; we can merge and start looking into good options for user credentials ASAP.

@rabernat (Member) commented Jan 8, 2020

As an ad-hoc test, I tried this code from ocean.pangeo.io:

import intake
url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()

I got this error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     75         try:
---> 76             result = self.fs.cat(key)
     77         except:  # noqa: E722

</srv/conda/envs/notebook/lib/python3.7/site-packages/decorator.py:decorator-gen-138> in cat(self, path)

/srv/conda/envs/notebook/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     49 
---> 50     return f(self, *args, **kwargs)
     51 

/srv/conda/envs/notebook/lib/python3.7/site-packages/gcsfs/core.py in cat(self, path)
    844         r = self.session.get(u2)
--> 845         r.raise_for_status()
    846         if "X-Goog-Hash" in r.headers:

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
    939         if http_error_msg:
--> 940             raise HTTPError(http_error_msg, response=self)
    941 

HTTPError: 400 Client Error: Bad Request for url: https://www.googleapis.com/download/storage/v1/b/pangeo-ecco-llc4320/o/grid%2F.zmetadata?alt=media

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-592bca809f02> in <module>
      2 url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
      3 cat = intake.Catalog(url)
----> 4 ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()
      5 ds

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    115         """load metadata only if needed"""
    116         if self._schema is None:
--> 117             self._schema = self._get_schema()
    118             self.datashape = self._schema.datashape
    119             self.dtype = self._schema.dtype

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/xzarr.py in _open_dataset(self)
     29 
     30         self._mapper = get_mapper(self.urlpath, **self.storage_options)
---> 31         self._ds = xr.open_zarr(self._mapper, **self.kwargs)
     32 
     33     def close(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, **kwargs)
    594         synchronizer=synchronizer,
    595         group=group,
--> 596         consolidated=consolidated,
    597     )
    598     ds = maybe_decode_store(zarr_store)

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close)
    256         if consolidated:
    257             # TODO: an option to pass the metadata_key keyword
--> 258             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    259         else:
    260             zarr_group = zarr.open_group(store, **open_kwargs)

/srv/conda/envs/notebook/lib/python3.7/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1180 
   1181     # setup metadata sotre
-> 1182     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1183 
   1184     # pass through

/srv/conda/envs/notebook/lib/python3.7/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2455             d = store[metadata_key].decode()  # pragma: no cover
   2456         else:  # pragma: no cover
-> 2457             d = store[metadata_key]
   2458         meta = json_loads(d)
   2459 

/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     78             if default is not None:
     79                 return default
---> 80             raise KeyError(key)
     81         return result
     82 

KeyError: 'pangeo-ecco-llc4320/grid/.zmetadata'

It seems like the requester-pays buckets are not working by default.

I tried turning on urllib3 logging to see what was happening

import logging
logging.basicConfig(level=logging.DEBUG)

and I saw this when calling .to_dask()

DEBUG:google.auth.transport._http_client:Making request: GET http://169.254.169.254
DEBUG:google.auth.transport._http_client:Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:gcsfs.core:cat(args=('pangeo-ecco-llc4320/grid/.zmetadata',), kwargs={})
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal:80
DEBUG:urllib3.connectionpool:http://metadata.google.internal:80 "GET /computeMetadata/v1/instance/service-accounts/default/?recursive=true HTTP/1.1" 200 401
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/464800473488-compute@developer.gserviceaccount.com/token
DEBUG:urllib3.connectionpool:http://metadata.google.internal:80 "GET /computeMetadata/v1/instance/service-accounts/464800473488-compute@developer.gserviceaccount.com/token HTTP/1.1" 200 210
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /download/storage/v1/b/pangeo-ecco-llc4320/o/grid%2F.zmetadata?alt=media HTTP/1.1" 400 61

It looks like the userProject query string is not getting set. That is likely the source of the problem.

(Also, perhaps not a problem, but I noticed grid/.zmetadata has been URL-encoded to grid%2F.zmetadata.)

@TomAugspurger (Member)

@rabernat that one did work for me, likely because my gcloud config is set up to point to a token and has a default project set.

import intake
url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()

## -- End pasted text --
/Users/taugspurger/miniconda3/envs/filesystems/lib/python3.7/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)

In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions:  (face: 13, i: 4320, i_g: 4320, j: 4320, j_g: 4320, k_p1: 2, time: 9030)
Coordinates:
    CS       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    Depth    (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    PHrefC   float32 ...
    PHrefF   (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    SN       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    XC       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    XG       (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    YC       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    YG       (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    Z        float32 ...
    Zl       float32 ...
    Zp1      (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    Zu       float32 ...
    drC      (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    drF      float32 ...
    dxC      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dxG      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dyC      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dyG      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * face     (face) int64 0 1 2 3 4 5 6 7 8 9 10 11 12
    hFacC    (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    hFacS    (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    hFacW    (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * i        (i) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
  * i_g      (i_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
    iter     (time) int64 dask.array<chunksize=(9030,), meta=np.ndarray>
  * j        (j) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
  * j_g      (j_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
    k        int64 ...
    k_l      int64 ...
  * k_p1     (k_p1) int64 0 1
    k_u      int64 ...
    rA       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAs      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAw      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAz      (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * time     (time) datetime64[ns] 2011-09-13 ... 2012-09-23T05:00:00
Data variables:
    *empty*
Attributes:
    Conventions:  CF-1.6
    history:      Created by calling `open_mdsdataset(llc_method='smallchunks...
    source:       MITgcm
    title:        netCDF wrapper of MITgcm MDS binary data

@rabernat (Member) commented Jan 8, 2020

Ah ok, we just have to upgrade gcsfs, then it all works.

@TomAugspurger (Member)

I can look up the commands later, but I think it was some combination of gcloud config set project, gcloud auth login, and perhaps setting an env var.

@rabernat (Member) commented Jan 8, 2020

> I can look up the commands later, but I think it was some combination of gcloud config set project, gcloud auth login, and perhaps setting an env var.

No, we just need gcsfs 0.6.0, so we have to update the ocean image. Our best bet for doing this is via pangeo-data/pangeo-cloud-federation#491. It should pip install gcsfs from GitHub master.
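
For anyone verifying the fix, a minimal hedged check after the upgrade, assuming default Google credentials with a billing project are available:

import gcsfs

print(gcsfs.__version__)  # want 0.6.0+ for requester-pays support

# With a new enough gcsfs, the userProject query parameter is attached
# automatically when requester_pays=True.
fs = gcsfs.GCSFileSystem(requester_pays=True)
print(fs.cat("pangeo-ecco-llc4320/grid/.zmetadata")[:100])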

@rabernat (Member) commented Jan 9, 2020

I'm going to merge this now. The only next step is to update ocean.pangeo.io, which I am working on in pangeo-data/pangeo-cloud-federation#492.

@rabernat rabernat merged commit 6ba2154 into pangeo-data:master Jan 9, 2020
@charlesbluca charlesbluca deleted the move-datasets branch January 9, 2020 19:13
@charlesbluca charlesbluca restored the move-datasets branch January 9, 2020 19:13