# Access to data in the cloud (GCS)

## Import modules and libraries

*First, let's make sure the Python environment is set up correctly to run this notebook*:

In [1]:
import os, sys, tempfile
import urllib.request  # import the submodule explicitly; `import urllib` alone is not enough
with tempfile.TemporaryDirectory() as tmpdirname:
    sys.path.append(tmpdirname)
    repo = "https://raw.githubusercontent.com/obidam/ds2-2025/main/"
    # Download the course helper module into the temporary folder:
    urllib.request.urlretrieve(repo + "utils.py",
                               os.path.join(tmpdirname, "utils.py"))
    from utils import check_up_env
    check_up_env()

Running on your own environment
Make sure to have all necessary packages installed
See:   https://github.com/obidam/ds2-2025/blob/main/practice/environment/coiled/environment-coiled-pinned-binder.yml
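
If the helper module cannot be downloaded, a rough check of the environment can be sketched with the standard library alone. The function name `missing_packages` is hypothetical, and the package list is simply taken from the imports used in this notebook; it does not check package versions:

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported in this notebook:
required = ["xarray", "intake", "gcsfs", "pandas"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```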


*Then, import the usual suspects*:

In [2]:
import sys

import gcsfs
import intake
import pandas as pd
import xarray as xr
from intake import open_catalog

## Read data from Google Cloud Storage (gcsfs)

### Access and listing

In [3]:
# Define cloud file system access point:
fs = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')

# And list content of a bucket:
fs.ls('opendata_bdo2020')

['opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr',
 'opendata_bdo2020/GLOBAL_ARGO_SDL2000',
 'opendata_bdo2020/GLOB_HOMOGENEOUS_variables.zarr',
 'opendata_bdo2020/Global_Argo_VerticalMean_Temperature.zarr',
 'opendata_bdo2020/dt_global_allsat_phy_l4_mm']
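
`fs.ls` returns paths without a protocol prefix, whereas the `xr.open_dataset` calls used in the "Load data" section take `gcs://` URLs. A small sketch of the conversion, applied to the listing above (`to_gcs_urls` is a hypothetical helper, not part of gcsfs):

```python
def to_gcs_urls(paths, protocol="gcs"):
    """Prefix bucket-relative paths with a protocol, e.g. 'gcs://'."""
    return [f"{protocol}://{p}" for p in paths]

# Listing copied from the output above:
listing = [
    'opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr',
    'opendata_bdo2020/GLOBAL_ARGO_SDL2000',
    'opendata_bdo2020/GLOB_HOMOGENEOUS_variables.zarr',
    'opendata_bdo2020/Global_Argo_VerticalMean_Temperature.zarr',
    'opendata_bdo2020/dt_global_allsat_phy_l4_mm',
]
for url in to_gcs_urls(listing):
    print(url)
```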

However, data access with ``gcsfs`` depends critically on how the bucket is set up on GCS. For instance, the following bucket does not allow anonymous users to list its content:

In [4]:
fs2 = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')
try:
    fs2.ls('data_bdo2020')
except Exception as e:
    # Anonymous listing is refused: report the exception class
    print(type(e))

_request non-retriable exception: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File "/Users/gmaze/miniconda3/envs/ds2-coiled-2025/lib/python3.11/site-packages/gcsfs/retry.py", line 130, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gmaze/miniconda3/envs/ds2-coiled-2025/lib/python3.11/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File "/Users/gmaze/miniconda3/envs/ds2-coiled-2025/lib/python3.11/site-packages/gcsfs/retry.py", line 117, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401


<class 'gcsfs.retry.HttpError'>
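
When looping over several buckets, it can be convenient to turn such a failure into a `None` result instead of an exception. The sketch below is one possible way to do it; `try_ls` is a hypothetical helper, and `FakeFS` is a stand-in used only so the example runs offline. With a real `gcsfs.GCSFileSystem`, the exception caught would be the `gcsfs.retry.HttpError` shown above:

```python
def try_ls(fs, path):
    """List `path` on `fs`; return None instead of raising if the listing fails."""
    try:
        return fs.ls(path)
    except Exception as err:  # the exact exception class depends on the backend
        print(f"Cannot list {path!r}: {type(err).__name__}")
        return None

# Stand-in file system, used only to exercise the helper without network access:
class FakeFS:
    def ls(self, path):
        if path == "open_bucket":
            return ["open_bucket/a.zarr"]
        raise PermissionError("listing denied")

fake = FakeFS()
print(try_ls(fake, "open_bucket"))
print(try_ls(fake, "closed_bucket"))
```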


On the other hand, some datasets are not free to access and use a requester-pays model.
In this case, you have to manage authentication properly:

In [5]:
fs3 = gcsfs.GCSFileSystem(project='poised-honor-358', token='anon')
try:
    fs3.ls('sonific01')
except ValueError as e:
    print(str(e))

Bucket is requester pays. Set `requester_pays=True` when creating the GCSFileSystem.
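
Following the hint in this error message, a requester-pays file system could be set up roughly as sketched below. This assumes that Google credentials are configured on your machine (`token="google_default"` asks gcsfs to use them) and that you own a project that can be billed. Both helper names and `'my-billing-project'` are placeholders, not values from this course:

```python
def requester_pays_options(billing_project, token="google_default"):
    """Build GCSFileSystem keyword arguments for a requester-pays bucket."""
    return {"project": billing_project, "token": token, "requester_pays": True}

def open_requester_pays_fs(billing_project, token="google_default"):
    # Imported here so the options helper above can be used without gcsfs installed
    import gcsfs
    return gcsfs.GCSFileSystem(**requester_pays_options(billing_project, token))

# 'my-billing-project' is a placeholder, not a real project:
print(requester_pays_options("my-billing-project"))
```

Note that anonymous access (`token='anon'`) cannot work here, since the requests have to be billed to an authenticated project.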


### Load data

In [28]:
ds = xr.open_dataset("gcs://opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr",
                     backend_kwargs={"storage_options": {
                         "project": "alert-ground-261008", 
                         "token": 'anon', 
                         'access':'read_only'}},
                     engine="zarr")
print(ds)

<xarray.Dataset> Size: 87GB
Dimensions:                          (depth: 42, time: 832, bnds: 2, lat: 173,
                                      lon: 360)
Coordinates:
  * depth                            (depth) float32 168B 5.022 ... 5.35e+03
  * lat                              (lat) float32 692B -83.0 -82.0 ... 89.0
  * lon                              (lon) float32 1kB 1.0 2.0 ... 359.0 360.0
  * time                             (time) datetime64[ns] 7kB 1950-01-16T12:...
Dimensions without coordinates: bnds
Data variables:
    depth_bnds                       (time, depth, bnds) float32 280kB ...
    salinity                         (time, depth, lat, lon) float64 17GB ...
    salinity_observation_weights     (time, depth, lat, lon) float32 9GB ...
    salinity_uncertainty             (time, depth, lat, lon) float64 17GB ...
    temperature                      (time, depth, lat, lon) float64 17GB ...
    temperature_observation_weights  (time, depth, lat, lon) float32 9GB ...
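
The sizes reported in this summary are uncompressed in-memory sizes derived from the shape and data type of each variable; opening the dataset does not download the values. As a quick sanity check, the arithmetic can be reproduced by hand (`nbytes` is a hypothetical helper name):

```python
import math

def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and item size."""
    return math.prod(shape) * itemsize

# Dimensions from the dataset summary above: (time, depth, lat, lon)
shape = (832, 42, 173, 360)
print(nbytes(shape, 8) / 1e9)  # float64 variable, e.g. temperature
print(nbytes(shape, 4) / 1e9)  # float32 variable, e.g. temperature_observation_weights
```

This gives roughly 17.4 GB and 8.7 GB, consistent with the rounded 17GB and 9GB shown by xarray.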
  

In [29]:
# Load another dataset:
ds = xr.open_dataset("gcs://opendata_bdo2020/GLOBAL_ARGO_SDL2000",
                     backend_kwargs={"storage_options": {
                         "project": "alert-ground-261008", 
                         "token": 'anon', 
                         'access':'read_only'}},
                     consolidated=False,
                     engine="zarr")

# print("Size of the dataset:", ds.nbytes/1e9, "GB")
print(ds)

<xarray.Dataset> Size: 6GB
Dimensions:    (depth: 381, samples: 976831)
Coordinates:
  * depth      (depth) float64 3kB 0.0 -5.0 -10.0 ... -1.895e+03 -1.9e+03
  * samples    (samples) int64 8MB 0 1 2 3 4 ... 976827 976828 976829 976830
Data variables:
    julianday  (samples) float32 4MB ...
    latitude   (samples) float32 4MB ...
    longitude  (samples) float32 4MB ...
    so         (samples, depth) float64 3GB ...
    thetao     (samples, depth) float64 3GB ...
Attributes:
    Conventions:  CF-1.6
    institution:  Argo-France
    source:       Argo float
    title:        Argo float profiles interpolated onto Standard Depth Levels


## Use intake catalog of data

The catalog also relies on gcsfs under the hood, but intake makes this transparent to the user.

### Access and listing of the catalog

In [30]:
from intake import open_catalog

In [31]:
catalog_url = 'https://raw.githubusercontent.com/obidam/ds2-2025/main/ds2_data_catalog.yml'
cat = open_catalog(catalog_url)
list(cat)

['argo_global_sdl',
 'argo_global_sdl_homogeneous',
 'argo_global_vertical_mean',
 'en4',
 'sea_surface_height']

### Load data

In [34]:
ds = cat['en4'].read_chunked()
print(ds)

<xarray.Dataset> Size: 87GB
Dimensions:                          (depth: 42, time: 832, bnds: 2, lat: 173,
                                      lon: 360)
Coordinates:
  * depth                            (depth) float32 168B 5.022 ... 5.35e+03
  * lat                              (lat) float32 692B -83.0 -82.0 ... 89.0
  * lon                              (lon) float32 1kB 1.0 2.0 ... 359.0 360.0
  * time                             (time) datetime64[ns] 7kB 1950-01-16T12:...
Dimensions without coordinates: bnds
Data variables:
    depth_bnds                       (time, depth, bnds) float32 280kB dask.array<chunksize=(1, 42, 2), meta=np.ndarray>
    salinity                         (time, depth, lat, lon) float64 17GB dask.array<chunksize=(1, 42, 173, 360), meta=np.ndarray>
    salinity_observation_weights     (time, depth, lat, lon) float32 9GB dask.array<chunksize=(1, 42, 173, 360), meta=np.ndarray>
    salinity_uncertainty             (time, depth, lat, lon) float64 17GB dask.arra

In [35]:
ds  = cat["sea_surface_height"].to_dask()
print(ds)

<xarray.Dataset> Size: 18GB
Dimensions:    (time: 312, latitude: 720, longitude: 1440, nv: 2)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * nv         (nv) int32 8B 0 1
  * time       (time) datetime64[ns] 2kB 1993-01-01 1993-02-01 ... 2018-12-01
Data variables:
    adt        (time, latitude, longitude) float64 3GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    crs        (time) float64 2kB dask.array<chunksize=(1,), meta=np.ndarray>
    err        (time, latitude, longitude) float64 3GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    lat_bnds   (time, latitude, nv) float32 2MB dask.array<chunksize=(1, 720, 2), meta=np.ndarray>
    lon_bnds   (time, longitude, nv) float32 4MB dask.array<chunksize=(1, 1440, 2), meta=np.ndarray>
    sla        (time, latitude, longitude) float64 3GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
 

# Pangeo data

https://github.com/pangeo-data/pangeo-datastore

https://catalog.pangeo.io/

## Explore catalog

In [36]:
from intake import open_catalog

pangeo_cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
list(pangeo_cat)

['ocean', 'atmosphere', 'climate', 'hydro']

In [37]:
list(pangeo_cat.ocean)
# print(list(pangeo_cat.atmosphere))
# print(list(pangeo_cat.hydro))
# pangeo_cat.walk(depth=5)

['sea_surface_height',
 'cesm_mom6_example',
 'ECCOv4r3',
 'SOSE',
 'GODAS',
 'ECCO_layers',
 'altimetry',
 'LLC4320',
 'GFDL_CM2_6',
 'CESM_POP',
 'channel',
 'MEOM_NEMO']

# CMIP6 data

In [39]:
# Let's open the CMIP catalogue:
df_full = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')
df_full.sample(10)

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 112980 | ScenarioMIP | CCCma | CanESM5 | ssp370 | r13i1p2f1 | fx | areacella | gn | gs://cmip6/CMIP6/ScenarioMIP/CCCma/CanESM5/ssp... | | 20190429 |
| 241338 | ScenarioMIP | DWD | MPI-ESM1-2-HR | ssp126 | r2i1p1f1 | Omon | fric | gn | gs://cmip6/CMIP6/ScenarioMIP/DWD/MPI-ESM1-2-HR... | | 20190710 |
| 88709 | CMIP | CCCma | CanESM5-CanOE | historical | r2i1p2f1 | Oyr | chlmisc | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/hist... | | 20190429 |
| 57921 | DAMIP | CCCma | CanESM5 | hist-stratO3 | r7i1p1f1 | Amon | tas | gn | gs://cmip6/CMIP6/DAMIP/CCCma/CanESM5/hist-stra... | | 20190306 |
| 506305 | ScenarioMIP | MIROC | MIROC-ES2L | ssp585 | r4i1p1f2 | Omon | sos | gr1 | gs://cmip6/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/... | | 20201222 |
| 265742 | DCPP | MIROC | MIROC6 | dcppA-hindcast | r1i1p1f1 | Amon | uas | gn | gs://cmip6/DCPP/MIROC/MIROC6/dcppA-hindcast/s2... | 2002.0 | 20190821 |
| 400345 | AerChemMIP | MRI | MRI-ESM2-0 | piClim-BC | r1i1p1f1 | Amon | ts | gn | gs://cmip6/CMIP6/AerChemMIP/MRI/MRI-ESM2-0/piC... | | 20200114 |
| 271496 | DCPP | MIROC | MIROC6 | dcppA-hindcast | r8i1p1f1 | Amon | huss | gn | gs://cmip6/DCPP/MIROC/MIROC6/dcppA-hindcast/s1... | 1964.0 | 20190821 |
| 166714 | DCPP | CCCma | CanESM5 | dcppA-hindcast | r5i1p2f1 | Amon | tasmax | gn | gs://cmip6/DCPP/CCCma/CanESM5/dcppA-hindcast/s... | 1963.0 | 20190429 |
| 498070 | DAMIP | EC-Earth-Consortium | EC-Earth3 | ssp245-covid | r29i1p1f2 | Ofx | deptho | gn | gs://cmip6/CMIP6/DAMIP/EC-Earth-Consortium/EC-... | | 20201104 |


In [50]:
# And make a simulation selection:

# df = df_full.query("activity_id=='CMIP' & table_id == 'Omon' & variable_id == 'thetao' & experiment_id == 'historical' & member_id == 'r1i1p1f1'")
df = df_full.query("activity_id=='CMIP' & table_id == 'Omon' & institution_id == 'CNRM-CERFACS' & experiment_id == 'historical'")
# df = df_full.query('institution_id == "CNRM-CERFACS" & member_id=="r1i1p1f2" & source_id=="CNRM-CM6-1"')

# df = df_full.query("activity_id=='CMIP' & table_id == 'Omon' & variable_id == 'thetao' & experiment_id == 'abrupt-4xCO2'")

# df = df.query("source_id=='CNRM-CM6-1-HR' & variable_id=='thetao'") # Horizontal resolution up to 1/4 deg
# df = df.query("source_id=='CNRM-ESM2-1' & variable_id=='thetao'") # Horizontal resolution up to 1deg
df = df.query("source_id=='CNRM-ESM2-1' & (variable_id=='thetao' | variable_id=='so')") # Horizontal resolution up to 1deg

# df = df.sort_values('version')
df = df.sort_values('member_id')
df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 406634 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r10i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20200117 |
| 406642 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r10i1p1f2 | Omon | thetao | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20200117 |
| 430447 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r11i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20200408 |
| 44083 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r1i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20181206 |
| 44013 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r1i1p1f2 | Omon | thetao | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20181206 |
| 51505 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r2i1p1f2 | Omon | thetao | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20190125 |
| 51514 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r2i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20190125 |
| 51428 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r3i1p1f2 | Omon | thetao | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20190125 |
| 50556 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r3i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20190125 |
| 51214 | CMIP | CNRM-CERFACS | CNRM-ESM2-1 | historical | r4i1p1f2 | Omon | so | gn | gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1... | | 20190125 |
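
The same kind of selection can also be written with a boolean mask and `isin`, which stays readable when the list of variables grows. The sketch below runs on a few made-up rows that only mimic the catalogue columns; the `zstore` values are placeholders, not real stores:

```python
import pandas as pd

# Made-up rows with some of the catalogue columns; zstore values are placeholders
rows = pd.DataFrame({
    "source_id":   ["CNRM-ESM2-1", "CNRM-ESM2-1", "CNRM-ESM2-1", "CNRM-CM6-1"],
    "member_id":   ["r2i1p1f2", "r1i1p1f2", "r1i1p1f2", "r1i1p1f2"],
    "variable_id": ["thetao", "so", "tos", "thetao"],
    "zstore":      ["gs://example/a", "gs://example/b", "gs://example/c", "gs://example/d"],
})

variables = ["thetao", "so"]
# Keep one model and the requested variables, then order by ensemble member
mask = (rows["source_id"] == "CNRM-ESM2-1") & rows["variable_id"].isin(variables)
subset = rows[mask].sort_values(["member_id", "variable_id"])
print(subset)
```

Keep in mind that `member_id` is a string, so the sort is lexicographic; this is why `r10i1p1f2` and `r11i1p1f2` come before `r1i1p1f2` in the table above.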


In [46]:
# get the path to a specific zarr store (the last one from the dataframe above)
zstore = df.zstore.values[-1]
print(zstore)

# open it using xarray and zarr
ds = xr.open_dataset(zstore, consolidated=True, engine='zarr', 
                     backend_kwargs={"storage_options": { "token": 'anon',  'access':'read_only'}})
print(ds)

gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/historical/r9i1p1f2/Omon/thetao/gn/v20200117/
<xarray.Dataset> Size: 63GB
Dimensions:      (y: 294, x: 362, nvertex: 4, lev: 75, axis_nbounds: 2,
                  time: 1980)
Coordinates:
    bounds_lat   (y, x, nvertex) float64 3MB ...
    bounds_lon   (y, x, nvertex) float64 3MB ...
    lat          (y, x) float64 851kB ...
  * lev          (lev) float64 600B 0.5058 1.556 2.668 ... 5.698e+03 5.902e+03
    lev_bounds   (lev, axis_nbounds) float64 1kB ...
    lon          (y, x) float64 851kB ...
  * time         (time) datetime64[ns] 16kB 1850-01-16T12:00:00 ... 2014-12-1...
    time_bounds  (time, axis_nbounds) datetime64[ns] 32kB ...
Dimensions without coordinates: y, x, nvertex, axis_nbounds
Data variables:
    thetao       (time, lev, y, x) float32 63GB ...
Attributes: (12/55)
    CMIP6_CV_version:       cv=6.2.3.0-7-g2019642
    Conventions:            CF-1.7 CMIP-6.2
    EXPID:                  CNRM-ESM2-1_historical_r9i1p1f2
    a
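
The store path printed above appears to mirror the catalogue columns. The sketch below splits it back into fields; it assumes the layout of this single path, and `parse_zstore` and `FIELDS` are hypothetical names. Paths with another layout, such as the `gs://cmip6/DCPP/...` rows in the catalogue sample, are rejected rather than guessed:

```python
FIELDS = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label", "version"]

def parse_zstore(zstore):
    """Split a 'gs://cmip6/CMIP6/...' store path into catalogue-like fields."""
    prefix = "gs://cmip6/CMIP6/"
    if not zstore.startswith(prefix):
        raise ValueError(f"Unexpected store layout: {zstore}")
    parts = zstore[len(prefix):].strip("/").split("/")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected {len(FIELDS)} path components, got {len(parts)}")
    return dict(zip(FIELDS, parts))

# Path copied from the output above:
zstore = "gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/historical/r9i1p1f2/Omon/thetao/gn/v20200117/"
for key, value in parse_zstore(zstore).items():
    print(f"{key:15s} {value}")
```

Note that the version carries a `v` prefix in the path, while the `version` column of the catalogue holds the bare number.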

In [47]:
sst = ds['thetao'].sel(lev=0, method='nearest')
sst
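
There is no model level at exactly `lev=0`: with `method='nearest'`, the selection falls back to the closest existing coordinate, here the shallowest level. The rule can be illustrated in plain Python; this is only a sketch of the idea, not xarray's implementation, and the level values are the rounded ones displayed in the dataset summary:

```python
def nearest(values, target):
    """Return the element of `values` closest to `target`."""
    return min(values, key=lambda v: abs(v - target))

# First and last levels as displayed (rounded) in the dataset summary above
levels = [0.5058, 1.556, 2.668, 5698.0, 5902.0]
print(nearest(levels, 0))
```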

In [48]:
def open_cmip6(df_row):
    # Get the path to the zarr store of this catalogue row
    zstore = df_row["zstore"]

    # Open it using xarray and zarr
    return xr.open_dataset(zstore, consolidated=True, engine='zarr',
                           backend_kwargs={"storage_options": {"token": 'anon', 'access': 'read_only'}})

ds = open_cmip6(df.iloc[0])
print("Size of the dataset:", ds.nbytes/1e9, "GB")
ds

Size of the dataset: 63.22679556 GB


In [49]:
# Compute size of the df selection:
total_size = 0 # GB
for index, row in df.iterrows():
    ds = open_cmip6(row)
    total_size += ds.nbytes/1e9
print("Size of the selection of datasets:", total_size, "GB")

Size of the selection of datasets: 1327.7627067600004 GB
