Change urlpaths to reflect moved datasets #67
Conversation
We'll need to rethink CI a bit. Since these buckets are requester pays now, we'll need to ensure that the requests are properly authenticated. The easiest way to do this is probably to set up the right environment / environment variables and use Google's defaults. I can help out with this if needed. |
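A minimal sketch of what "use Google's defaults" could look like in a test (assuming `GOOGLE_APPLICATION_CREDENTIALS` is set in the CI environment; the bucket name is taken from the catalog diff in the next comment):

```python
import gcsfs

# With GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key,
# token="google_default" tells gcsfs to use those credentials, and
# requester_pays=True bills the requests to the credentials' project.
fs = gcsfs.GCSFileSystem(token="google_default", requester_pays=True)
print(fs.ls("pangeo-usgs-hydrosheds"))  # authenticated listing of a requester-pays bucket
```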
For the catalog, hopefully something like

```diff
diff --git a/intake-catalogs/hydro.yaml b/intake-catalogs/hydro.yaml
index e11cc9d..5df3fb0 100644
--- a/intake-catalogs/hydro.yaml
+++ b/intake-catalogs/hydro.yaml
@@ -28,6 +28,8 @@ sources:
     args:
       urlpath: gs://pangeo-usgs-hydrosheds/dir
       chunks: {'y': 6000, 'x': 6000}
+      storage_options:
+        requester_pays: True
   hydrosheds_acc:
     description: Flow accumulation at 3-second resolution
```

fixes things. It's not clear to me why the other ones are passing, though. Locally, this failed on your branch:

```python
In [2]: cat = intake.Catalog("intake-catalogs/master.yaml")

In [3]: dset = cat['ocean.cesm_mom6_example'].to_dask()
```

but adding the `storage_options` fixes it. |
@TomAugspurger - what happens if we add `requester_pays: True` to a public bucket (not configured for requester_pays)? Does it still work? If so, the easy solution is to just do this for all datasets. Can we make it a default for the catalog?

More generally, I'm worried about the proliferation of different syntax for the same thing. In gcsfs, the keyword is `user_project`. In intake, it's `requester_pays`. Where does the translation happen?

This divergence increases the cognitive load for people trying to learn how all this stuff works. |
In gcsfs 0.6, `user_project=...` had to be renamed to `requester_pays=True`. So intake is just passing the `storage_options: {requester_pays: True}` through to gcsfs. The `requester_pays` keyword matches the name in s3fs as well.
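(Roughly speaking, and as a sketch rather than intake's actual code, the pass-through amounts to:)

```python
import gcsfs

# intake-xarray forwards the catalog entry's storage_options to gcsfs,
# so this catalog setting...
storage_options = {"requester_pays": True}

# ...ends up here (gcsfs >= 0.6; older gcsfs spelled this
# user_project="<billing-project-id>" instead).
fs = gcsfs.GCSFileSystem(**storage_options)
```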
|
@charlesbluca a few more specifics on what (I think) you'll need to do to test this.
The credentials are a JSON structure like
I don't know which parts are sensitive, so we'll assume it all is. Do you have permissions on the GCP pangeo project? If not, I can do this part.
LMK if any of that doesn't make sense. |
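For concreteness, a rough sketch of the CI bootstrap (the secret name `GCP_SA_KEY` is hypothetical):

```python
import os

# Hypothetical CI step: write the service-account JSON (stored as a CI
# secret, here called GCP_SA_KEY) to disk, then point google-auth's
# default credential discovery at it.
key_path = "/tmp/gcp-service-account.json"
with open(key_path, "w") as f:
    f.write(os.environ["GCP_SA_KEY"])
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
```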
Thanks for the detailed suggestions @TomAugspurger. We are trying to create a service account with the necessary permissions. We just read the GCS documentation on requester pays. Before going forward, I just wanted to make sure we are on the right track. Can we think of some way to create a service account with more limited permissions? |
IIUC, that page is describing the roles for creating requester pays buckets. This service account should just be for getting data: it's simulating a 3rd-party user downloading the data (though it's not perfect, since the project will be the same, but it should be good enough). I think / hope the "Storage Object Viewer" role is sufficient. |
@charlesbluca - I created a service account with "Storage Object Viewer" permissions and shared the key with you via keybase. |
Short term, I think we need to include

```yaml
storage_options:
  requester_pays: True
```

on each item in the catalog. Once that's done, the tests will fail but things should be working, and I think we can merge. Then we'll have two followups.
I'll be able to help more with this tomorrow or Friday. |
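For context, a stand-alone sketch of what that catalog setting amounts to (bucket path taken from the hydro.yaml diff above; gcsfs >= 0.6 assumed):

```python
import gcsfs
import xarray as xr

# Equivalent of the catalog entry plus storage_options: open a Zarr store
# in a requester-pays bucket, billing egress to the active project.
fs = gcsfs.GCSFileSystem(requester_pays=True)
ds = xr.open_zarr(fs.get_mapper("pangeo-usgs-hydrosheds/dir"))
```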
Added the `storage_options` to each catalog item. |
As an ad-hoc test, I tried this code from ocean.pangeo.io:

```python
import intake

url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()
```

I got this error:

It seems like the requester pays buckets are not working by default. I tried turning on urllib3 logging to see what was happening:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

and I saw this when calling

It looks like the

(Also, perhaps not a problem, but I noticed |
@rabernat that one did work for me, likely because my gcloud config points to a token and has a default project set.

```python
import intake

url = 'https://raw.githubusercontent.com/charlesbluca/pangeo-datastore/move-datasets/intake-catalogs/master.yaml'
cat = intake.Catalog(url)
ds = cat.ocean.LLC4320.LLC4320_grid.to_dask()
## -- End pasted text --

/Users/taugspurger/miniconda3/envs/filesystems/lib/python3.7/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions: (face: 13, i: 4320, i_g: 4320, j: 4320, j_g: 4320, k_p1: 2, time: 9030)
Coordinates:
CS (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
Depth (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
PHrefC float32 ...
PHrefF (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
SN (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
XC (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
XG (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
YC (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
YG (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
Z float32 ...
Zl float32 ...
Zp1 (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
Zu float32 ...
drC (k_p1) float32 dask.array<chunksize=(2,), meta=np.ndarray>
drF float32 ...
dxC (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
dxG (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
dyC (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
dyG (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
* face (face) int64 0 1 2 3 4 5 6 7 8 9 10 11 12
hFacC (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
hFacS (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
hFacW (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
* i (i) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
* i_g (i_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
iter (time) int64 dask.array<chunksize=(9030,), meta=np.ndarray>
* j (j) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
* j_g (j_g) int64 0 1 2 3 4 5 6 7 ... 4313 4314 4315 4316 4317 4318 4319
k int64 ...
k_l int64 ...
* k_p1 (k_p1) int64 0 1
k_u int64 ...
rA (face, j, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
rAs (face, j_g, i) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
rAw (face, j, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
rAz (face, j_g, i_g) float32 dask.array<chunksize=(1, 4320, 4320), meta=np.ndarray>
* time (time) datetime64[ns] 2011-09-13 ... 2012-09-23T05:00:00
Data variables:
*empty*
Attributes:
Conventions: CF-1.6
history: Created by calling `open_mdsdataset(llc_method='smallchunks...
source: MITgcm
    title:        netCDF wrapper of MITgcm MDS binary data
```
|
Ah ok, we just have to upgrade gcsfs, then it all works. |
I can look up the commands later, but I think it was some combination of |
No, we just need gcsfs 0.6.0. So we have to update the ocean image. Our best bet for doing this is via pangeo-data/pangeo-cloud-federation#491. It should pip install gcsfs from GitHub master.
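(A trivial check that an image has a new enough gcsfs; the 0.6.0 floor comes from this thread:)

```python
import gcsfs

# requester_pays support requires gcsfs >= 0.6.0 (per this thread).
print(gcsfs.__version__)
```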
I'm going to merge this now. The only next step is to update ocean.pangeo.io, which I am working on in pangeo-data/pangeo-cloud-federation#492. |
Changes the `urlpath` argument to reflect the new locations of the data, which are outlined here.