# Access to data in the cloud (GCS)

*If you run this notebook at colab.research.google.com, you need to install packages with the following command*:

In [None]:
!pip install --upgrade gcsfs intake intake-xarray zarr

In [1]:
import sys
import gcsfs
import xarray as xr
import intake

## Read data from Google Cloud Storage (gcsfs)

### Access and listing

In [13]:
# Define cloud file system access point:
fs = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')

# And list content of a bucket:
fs.ls('opendata_bdo2020')

['opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr',
 'opendata_bdo2020/GLOBAL_ARGO_SDL2000',
 'opendata_bdo2020/GLOB_HOMOGENEOUS_variables.zarr',
 'opendata_bdo2020/Global_Argo_VerticalMean_Temperature.zarr']

However, data access with ``gcsfs`` depends critically on how the GCS bucket is set up. For instance, the following bucket does not allow anonymous users to list its content:

In [12]:
fs2 = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')
try:
    fs2.ls('data_bdo2020')
except Exception:  # avoid a bare except, which would also catch KeyboardInterrupt
    print(sys.exc_info()[0])

_request non-retriable exception: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket., 401
Traceback (most recent call last):
  File "/Users/gmaze/anaconda/envs/ds2/lib/python3.8/site-packages/gcsfs/retry.py", line 115, in retry_request
    return await func(*args, **kwargs)
  File "/Users/gmaze/anaconda/envs/ds2/lib/python3.8/site-packages/gcsfs/core.py", line 339, in _request
    validate_response(status, contents, path, args)
  File "/Users/gmaze/anaconda/envs/ds2/lib/python3.8/site-packages/gcsfs/retry.py", line 102, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket., 401


<class 'gcsfs.retry.HttpError'>
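The same check can be wrapped in a small helper so that a refused listing does not interrupt the notebook. This is only a minimal sketch: `try_ls` is a hypothetical name, not part of `gcsfs`, and it assumes nothing more than an object with an `ls(path)` method.

```python
def try_ls(fs, path):
    """Return the listing of `path`, or None if the file system refuses it.

    `fs` is any object exposing an `ls(path)` method, for example a
    gcsfs.GCSFileSystem instance. `try_ls` is a hypothetical helper.
    """
    try:
        return fs.ls(path)
    except Exception as err:  # gcsfs raised gcsfs.retry.HttpError in the run above
        print(f"Cannot list '{path}': {type(err).__name__}: {err}")
        return None
```

With the file systems defined above, `try_ls(fs, 'opendata_bdo2020')` should return the bucket content, while `try_ls(fs2, 'data_bdo2020')` should print the error and return `None`.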


On the other hand, some datasets are not free and use a requester-pays model.
In this case, you have to manage authentication properly:

In [9]:
fs3 = gcsfs.GCSFileSystem(project='poised-honor-358', token='anon')
try:
    fs3.ls('sonific01')
except ValueError as e:
    print(str(e))

Bucket is requester pays. Set `requester_pays=True` when creating the GCSFileSystem.
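Following the hint in the error message, a requester-pays bucket needs `requester_pays=True` and, most likely, real credentials instead of `token='anon'`, since the requests are billed to your own project. The sketch below only builds the keyword arguments; `requester_pays_options` and the project name `my-billing-project` are hypothetical, and `token='google_default'` assumes that Google default credentials are already configured on your machine.

```python
def requester_pays_options(billing_project, token="google_default"):
    """Build keyword arguments for gcsfs.GCSFileSystem on a requester-pays bucket.

    `billing_project` is the project that will be charged for the requests.
    This helper is hypothetical; it only assembles a dictionary.
    """
    return {
        "project": billing_project,
        "token": token,          # authenticated access, not 'anon'
        "requester_pays": True,  # the option named in the error message above
    }


opts = requester_pays_options("my-billing-project")  # hypothetical project name
print(opts)

# With valid credentials and billing enabled, one would then try:
# fs_paid = gcsfs.GCSFileSystem(**opts)
# fs_paid.ls('sonific01')
```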


### Load data

In [8]:
gcsmap = fs.get_mapper("opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr")
ds = xr.open_zarr(gcsmap)

# ds = xr.open_dataset("gcs://opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr",
#                      backend_kwargs={"storage_options": {"project": "alert-ground-261008", "token": 'anon', 'access':'read_only'}},
#                     engine="zarr")

print("Size of the dataset:", ds.nbytes/1e9, "GB")
print(ds)

Size of the dataset: 52.2317975 GB
<xarray.Dataset>
Dimensions:                          (depth: 42, time: 832, bnds: 2, lat: 173, lon: 360)
Coordinates:
  * depth                            (depth) float32 5.022 15.08 ... 5.35e+03
  * lat                              (lat) float32 -83.0 -82.0 ... 88.0 89.0
  * lon                              (lon) float32 1.0 2.0 3.0 ... 359.0 360.0
  * time                             (time) datetime64[ns] 1950-01-16T12:00:0...
Dimensions without coordinates: bnds
Data variables:
    depth_bnds                       (time, depth, bnds) float32 dask.array<chunksize=(1, 42, 2), meta=np.ndarray>
    salinity                         (time, depth, lat, lon) float32 dask.array<chunksize=(1, 42, 173, 360), meta=np.ndarray>
    salinity_observation_weights     (time, depth, lat, lon) float32 dask.array<chunksize=(1, 42, 173, 360), meta=np.ndarray>
    salinity_uncertainty             (time, depth, lat, lon) float32 dask.array<chunksize=(1, 42, 173, 360), me
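The chunk layout printed above explains why opening the dataset is cheap: nothing is downloaded until a chunk is actually needed. As a rough check, the size of one chunk and of one 4D variable can be worked out from the dimensions and the `float32` type shown in the output (4 bytes per value):

```python
from math import prod

FLOAT32_BYTES = 4  # size of one float32 value

# Dimensions and chunk shape copied from the dataset printout above
dims = {"time": 832, "depth": 42, "lat": 173, "lon": 360}
chunk_shape = (1, 42, 173, 360)

chunk_bytes = prod(chunk_shape) * FLOAT32_BYTES
variable_bytes = prod(dims.values()) * FLOAT32_BYTES
n_chunks = variable_bytes // chunk_bytes

print(f"one chunk:    {chunk_bytes / 1e6:.2f} MB")
print(f"one variable: {variable_bytes / 1e9:.2f} GB in {n_chunks} chunks")
```

Each 4D variable is therefore about 8.7 GB, split into 832 chunks of about 10.5 MB, one per time step.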

In [25]:
# Load another dataset:
gcsmap = fs.get_mapper('opendata_bdo2020/GLOBAL_ARGO_SDL2000')
ds = xr.open_zarr(gcsmap, consolidated=False)
print("Size of the dataset:", ds.nbytes/1e9, "GB")
print(ds)

Size of the dataset: 5.974301444 GB
<xarray.Dataset>
Dimensions:    (depth: 381, samples: 976831)
Coordinates:
  * depth      (depth) float64 0.0 -5.0 -10.0 ... -1.89e+03 -1.895e+03 -1.9e+03
  * samples    (samples) int64 0 1 2 3 4 ... 976826 976827 976828 976829 976830
Data variables:
    julianday  (samples) float32 dask.array<chunksize=(6000,), meta=np.ndarray>
    latitude   (samples) float32 dask.array<chunksize=(6000,), meta=np.ndarray>
    longitude  (samples) float32 dask.array<chunksize=(6000,), meta=np.ndarray>
    so         (samples, depth) float64 dask.array<chunksize=(6000, 381), meta=np.ndarray>
    thetao     (samples, depth) float64 dask.array<chunksize=(6000, 381), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    institution:  Argo-France
    source:       Argo float
    title:        Argo float profiles interpolated onto Standard Depth Levels


## Use intake catalog of data

The catalog also relies on gcsfs to reach the data, but intake handles this transparently for the user.

### Access and listing of the catalog

In [14]:
from intake import open_catalog

In [16]:
catalog_url = 'https://raw.githubusercontent.com/obidam/ds2-2022/main/ds2_data_catalog.yml'
cat = open_catalog(catalog_url)
list(cat)

['argo_global_homogeneous_sdl',
 'en4',
 'argo_global_vertical_mean',
 'argo_global_sdl',
 'isas15_temp_natl',
 'sea_surface_height']

### Load data

In [17]:
ds = cat.en4.read_chunked()
print("Size of the dataset:", ds.nbytes/1e9, "GB")
ds

Size of the dataset: 52.2317975 GB


Dataset preview (dask array summaries, grouped by identical layout):

| Variables | Shape | Chunk shape | Array bytes | Chunk bytes | Count | Type |
|---|---|---|---|---|---|---|
| 1 | (832, 42, 2) | (1, 42, 2) | 273.00 kiB | 336 B | 833 Tasks, 832 Chunks | float32 |
| 6 | (832, 42, 173, 360) | (1, 42, 173, 360) | 8.11 GiB | 9.98 MiB | 833 Tasks, 832 Chunks | float32 |
| 1 | (832, 2) | (832, 2) | 13.00 kiB | 13.00 kiB | 2 Tasks, 1 Chunks | datetime64[ns] |


In [18]:
ds = cat["sea_surface_height"].to_dask()
print("Size of the dataset:", ds.nbytes/1e9, "GB")
ds

OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/ds2data/o/dt_global_allsat_phy_l4_mm.zarr%2F.zmetadata?alt=media
The project to be billed is associated with an absent billing account.
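
Since a catalog entry can be listed without its data being reachable, it can help to probe the entries before relying on them. The helper below is a minimal sketch: `probe_entries` is a hypothetical name, and it assumes only that `catalog[name].to_dask()` either returns a dataset or raises, as in the two cells above.

```python
def probe_entries(catalog, names):
    """Try to open each named catalog entry and report the outcome.

    Returns a dict mapping each name to "ok" or to a short error description.
    `probe_entries` is a hypothetical helper, not part of intake.
    """
    status = {}
    for name in names:
        try:
            catalog[name].to_dask()  # opens the entry lazily, as in the cell above
            status[name] = "ok"
        except Exception as err:
            status[name] = f"{type(err).__name__}: {err}"
    return status
```

For instance, `probe_entries(cat, ['en4', 'sea_surface_height'])` should report `en4` as `"ok"` and `sea_surface_height` with the billing error shown above.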

# Pangeo data

https://github.com/pangeo-data/pangeo-datastore

https://catalog.pangeo.io/

## Explore catalog

In [20]:
from intake import open_catalog

pangeo_cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
list(pangeo_cat)

['ocean', 'atmosphere', 'climate', 'hydro']

In [21]:
list(pangeo_cat.ocean)
# print(list(pangeo_cat.atmosphere))
# print(list(pangeo_cat.hydro))
# pangeo_cat.walk(depth=5)

['sea_surface_height',
 'cesm_mom6_example',
 'ECCOv4r3',
 'SOSE',
 'GODAS',
 'ECCO_layers',
 'altimetry',
 'LLC4320',
 'GFDL_CM2_6',
 'CESM_POP',
 'channel',
 'MEOM_NEMO']