
Change urlpaths to reflect moved datasets #67

Merged · merged 8 commits into pangeo-data:master from charlesbluca:move-datasets on Jan 9, 2020
Conversation

@charlesbluca (Member)

Changes the urlpath arguments to reflect the new locations of the data, as outlined here.

@TomAugspurger (Member)

We'll need to rethink CI a bit. Since these buckets are requester-pays now, we'll need to ensure that the requests are properly authenticated.

The easiest way to do this is probably to set up the right environment / environment variables and use Google's defaults. I can help out with this if needed.
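
A minimal sketch of what "use Google's defaults" means here, assuming the google-auth package and a GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at a service-account key file:

import google.auth

# Application Default Credentials: google-auth checks
# GOOGLE_APPLICATION_CREDENTIALS (among other locations) and loads the
# service-account key it points to.
credentials, project = google.auth.default()
print(project)  # default billing project picked up from the environment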

@TomAugspurger (Member)

For the catalog, hopefully something like

diff --git a/intake-catalogs/hydro.yaml b/intake-catalogs/hydro.yaml
index e11cc9d..5df3fb0 100644
--- a/intake-catalogs/hydro.yaml
+++ b/intake-catalogs/hydro.yaml
@@ -28,6 +28,8 @@ sources:
     args:
       urlpath: gs://pangeo-usgs-hydrosheds/dir
       chunks: {'y': 6000, 'x': 6000}
+      storage_options:
+        requester_pays: True
 
   hydrosheds_acc:
     description: Flow accumulation at 3-second resolution

fixes things. It's not clear to me why the other ones are passing, though. Locally, this failed on your branch:

In [2]: cat = intake.Catalog("intake-catalogs/master.yaml")

In [3]: dset = cat['ocean.cesm_mom6_example'].to_dask()

but adding requester_pays: True to the storage_options arg got things passing again.
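
For context, a rough sketch of what that plumbing does: intake-xarray forwards storage_options to fsspec's get_mapper (visible in the traceback further down this thread), so the by-hand equivalent is approximately the following. The urlpath is the one from the diff above; consolidated=True is an assumption here.

import fsspec
import xarray as xr

# storage_options from the catalog entry become keyword arguments to the
# GCS filesystem; requester_pays makes gcsfs bill requests to the caller's
# default project.
mapper = fsspec.get_mapper("gs://pangeo-usgs-hydrosheds/dir", requester_pays=True)
ds = xr.open_zarr(mapper, consolidated=True)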

@rabernat (Member)

@TomAugspurger - what happens if we add requester_pays: True to a public bucket (one not configured for requester pays)? Does it still work? If so, the easy solution is to just do this for all datasets. Can we make it a default for the catalog?

More generally, I'm worried about the proliferation of different syntax for the same thing. In gcsfs, the keyword is user_project. In intake, it's requester_pays. Where does the translation happen?

This divergence increases the cognitive load for people trying to learn how all this stuff works.

@TomAugspurger (Member) commented Dec 18, 2019 via email

@TomAugspurger (Member)

@charlesbluca, a few more specifics on what (I think) you'll need to do to test this:

  1. We need some GCP credentials on Travis. We'll create a service account following https://cloud.google.com/docs/authentication/production, and create a role dedicated to pangeo-datastore testing, with very limited permissions. Ideally just the ability to make GET requests to GCS.

The credentials are a JSON structure like

{
    "client_id": "xxx",
    "client_secrent": "xxx",
    "refresh_token": "xxx",
    "type": "..."
}

I don't know which parts are sensitive, so we'll assume it all is. Do you have permissions on the GCP pangeo project? If not, I can do this part.

  2. Adding the credentials to Travis: following https://docs.travis-ci.com/user/encrypting-files/ we can add an encrypted file. I've only encrypted variables before, but this looks easiest. I don't have permissions on this repo, so we may need help from @rabernat here, or he can delegate permissions.

  3. Pointing gcsfs to that file: we need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the decrypted file on Travis. This can be done in the regular .travis.yml file, I think; see the sketch below.
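
A hedged sketch of how steps 2 and 3 could look in .travis.yml; the encrypted_*_key / encrypted_*_iv variable names are placeholders that travis encrypt-file generates per repo:

# Sketch only: run `travis encrypt-file gcp-credentials.json` and use the
# openssl line it prints; the key/iv variable names below are placeholders.
before_install:
  - openssl aes-256-cbc -K $encrypted_abc123_key -iv $encrypted_abc123_iv
    -in gcp-credentials.json.enc -out gcp-credentials.json -d
env:
  global:
    # step 3: point google-auth / gcsfs at the decrypted key file
    - GOOGLE_APPLICATION_CREDENTIALS=$TRAVIS_BUILD_DIR/gcp-credentials.json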

LMK if any of that doesn't make sense.

@rabernat (Member)

Thanks for the detailed suggestions, @TomAugspurger.

We are trying to create a service account with the necessary permissions. We just read https://cloud.google.com/storage/docs/requester-pays#requirements and it looks like it recommends either roles/editor, roles/owner, or roles/billing.projectManager. These are all pretty heavy-duty roles with the potential to do big things on the project.

Before going forward, I just wanted to make sure we are on the right track. Can we think of some way to create a service account with more limited permissions?

@TomAugspurger (Member)

IIUC, that page is describing the roles for creating requester-pays buckets. This service account should just be for getting data: it's simulating a third-party user downloading the data (the simulation isn't perfect, since the project will be the same, but it should be good enough).

I think (hope) the storage.objectViewer role from https://cloud.google.com/storage/docs/access-control/iam-roles is sufficient.
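
Once that account exists, a quick hedged sanity check that its key grants read access and nothing more; the key filename is hypothetical, and this assumes a gcsfs version with requester-pays support:

import gcsfs

# storage.objectViewer should allow GET / LIST and nothing else;
# "datastore-ci-key.json" is a placeholder for the downloaded key file.
fs = gcsfs.GCSFileSystem(token="datastore-ci-key.json", requester_pays=True)
print(fs.ls("pangeo-usgs-hydrosheds"))  # read: should succeed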

@rabernat (Member)

@charlesbluca - I created a service account with "Storage Object Viewer" permissions and shared the key with you via Keybase.

@TomAugspurger (Member)

Short term, I think we need to include

storage_options:
  requester_pays: True

on each item in the catalog. Once that's done, the tests will still fail (CI doesn't have credentials yet), but things should be working and I think we can merge. Then we'll have two followups:

  1. Fix CI (we'll need to include encrypted credentials in Travis so that the requests actually go through).
  2. Investigate whether we can make this easier for people getting the data from one of our gcloud pangeos / binder. Google won't actually charge for requests made from the same region, but users would still need to include credentials in the request, and we probably don't want to encourage people to upload their personal project credentials to our pangeos. I'm hopeful that we can have some very limited credentials included in the containers we spawn on Google that can only make get / list requests to our buckets and only from within gcloud. Not sure if that's possible yet.

I'll be able to help more with this tomorrow or Friday.

@charlesbluca (Member, Author)

Added the storage_options back for each moved item; we can merge and start looking into good options for user credentials ASAP.

@rabernat (Member) commented Jan 8, 2020

As an ad-hoc test, I tried this code from ocean.pangeo.io:

import intake
url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()

I got this error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     75         try:
---> 76             result = self.fs.cat(key)
     77         except:  # noqa: E722

</srv/conda/envs/notebook/lib/python3.7/site-packages/decorator.py:decorator-gen-138> in cat(self, path)

/srv/conda/envs/notebook/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     49 
---> 50     return f(self, *args, **kwargs)
     51 

/srv/conda/envs/notebook/lib/python3.7/site-packages/gcsfs/core.py in cat(self, path)
    844         r = self.session.get(u2)
--> 845         r.raise_for_status()
    846         if "X-Goog-Hash" in r.headers:

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
    939         if http_error_msg:
--> 940             raise HTTPError(http_error_msg, response=self)
    941 

HTTPError: 400 Client Error: Bad Request for url: https://www.googleapis.com/download/storage/v1/b/pangeo-ecco-llc4320/o/grid%2F.zmetadata?alt=media

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-592bca809f02> in <module>
      2 url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
      3 cat = intake.Catalog(url)
----> 4 ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()
      5 ds

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    115         """load metadata only if needed"""
    116         if self._schema is None:
--> 117             self._schema = self._get_schema()
    118             self.datashape = self._schema.datashape
    119             self.dtype = self._schema.dtype

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/xzarr.py in _open_dataset(self)
     29 
     30         self._mapper = get_mapper(self.urlpath, **self.storage_options)
---> 31         self._ds = xr.open_zarr(self._mapper, **self.kwargs)
     32 
     33     def close(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, **kwargs)
    594         synchronizer=synchronizer,
    595         group=group,
--> 596         consolidated=consolidated,
    597     )
    598     ds = maybe_decode_store(zarr_store)

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close)
    256         if consolidated:
    257             # TODO: an option to pass the metadata_key keyword
--> 258             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    259         else:
    260             zarr_group = zarr.open_group(store, **open_kwargs)

/srv/conda/envs/notebook/lib/python3.7/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1180 
   1181     # setup metadata sotre
-> 1182     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1183 
   1184     # pass through

/srv/conda/envs/notebook/lib/python3.7/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2455             d = store[metadata_key].decode()  # pragma: no cover
   2456         else:  # pragma: no cover
-> 2457             d = store[metadata_key]
   2458         meta = json_loads(d)
   2459 

/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     78             if default is not None:
     79                 return default
---> 80             raise KeyError(key)
     81         return result
     82 

KeyError: 'pangeo-ecco-llc4320/grid/.zmetadata'

It seems like the requester-pays buckets are not working by default.

I tried turning on urllib3 logging to see what was happening

import logging
logging.basicConfig(level=logging.DEBUG)

and I saw this when calling .to_dask()

DEBUG:google.auth.transport._http_client:Making request: GET http://169.254.169.254
DEBUG:google.auth.transport._http_client:Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:gcsfs.core:cat(args=('pangeo-ecco-llc4320/grid/.zmetadata',), kwargs={})
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal:80
DEBUG:urllib3.connectionpool:http://metadata.google.internal:80 "GET /computeMetadata/v1/instance/service-accounts/default/?recursive=true HTTP/1.1" 200 401
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/464800473488-compute@developer.gserviceaccount.com/token
DEBUG:urllib3.connectionpool:http://metadata.google.internal:80 "GET /computeMetadata/v1/instance/service-accounts/464800473488-compute@developer.gserviceaccount.com/token HTTP/1.1" 200 210
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /download/storage/v1/b/pangeo-ecco-llc4320/o/grid%2F.zmetadata?alt=media HTTP/1.1" 400 61

It looks like the userProject query string is not getting set. That is likely the source of the problem.

(Also, perhaps not a problem, but I noticed grid/.zmetadata has been URL-encoded to grid%2F.zmetadata.)

@TomAugspurger (Member)

@rabernat that one did work for me, likely because my gcloud config is set up to point to a token and has a default project set.

import intake
url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()

## -- End pasted text --
/Users/taugspurger/miniconda3/envs/filesystems/lib/python3.7/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)

In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions:  (face: 13, i: 4320, i_g: 4320, j: 4320, j_g: 4320, k_p1: 2, time: 9030)
Coordinates:
    CS       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    Depth    (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    PHrefC   float32 ...
    PHrefF   (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    SN       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    XC       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    XG       (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    YC       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    YG       (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    Z        float32 ...
    Zl       float32 ...
    Zp1      (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    Zu       float32 ...
    drC      (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
    drF      float32 ...
    dxC      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dxG      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dyC      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    dyG      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * face     (face) int64 0 1 2 3 4 5 6 7 8 9 10 11 12
    hFacC    (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    hFacS    (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    hFacW    (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * i        (i) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
  * i_g      (i_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
    iter     (time) int64 dask.array<chunksize=(9030,), meta=np.ndarray>
  * j        (j) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
  * j_g      (j_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
    k        int64 ...
    k_l      int64 ...
  * k_p1     (k_p1) int64 0 1
    k_u      int64 ...
    rA       (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAs      (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAw      (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
    rAz      (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
  * time     (time) datetime64[ns] 2011-09-13 ... 2012-09-23T05:00:00
Data variables:
    *empty*
Attributes:
    Conventions:  CF-1.6
    history:      Created by calling `open_mdsdataset(llc_method='smallchunks...
    source:       MITgcm
    title:        netCDF wrapper of MITgcm MDS binary data

@rabernat (Member) commented Jan 8, 2020

Ah ok, we just have to upgrade gcsfs, then it all works.

@TomAugspurger (Member)

I can look up the commands later, but I think it was some combination of gcloud config set project, gcloud auth login, and perhaps setting an env var.

@rabernat (Member) commented Jan 8, 2020

> I can look up the commands later, but I think it was some combination of gcloud config set project, gcloud auth login, and perhaps setting an env var.

No, we just need gcsfs 0.6.0, so we have to update the ocean image. Our best bet for doing this is via pangeo-data/pangeo-cloud-federation#491. It should pip install gcsfs from GitHub master.
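
For anyone verifying the fix, a minimal hedged check after the upgrade, assuming default Google credentials with a billing project are available:

import gcsfs

print(gcsfs.__version__)  # want 0.6.0+ for requester-pays support

# With a new enough gcsfs, the userProject query parameter is attached
# automatically when requester_pays=True.
fs = gcsfs.GCSFileSystem(requester_pays=True)
print(fs.cat("pangeo-ecco-llc4320/grid/.zmetadata")[:100])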

@rabernat (Member) commented Jan 9, 2020

I'm going to merge this now. The only next step is to update ocean.pangeo.io, which I am working on in pangeo-data/pangeo-cloud-federation#492.

@rabernat rabernat merged commit 6ba2154 into pangeo-data:master Jan 9, 2020
@charlesbluca charlesbluca deleted the move-datasets branch January 9, 2020 19:13
@charlesbluca charlesbluca restored the move-datasets branch January 9, 2020 19:13