### Versions
- Unfortunately, `version` is stored neither in the metadata of each original netCDF file nor in the file name, and it is not part of the `Dataset` name. The original path does carry it, but that path information is lost when, for example, the netCDF files are downloaded via Globus.
- Data handles (persistent identifiers), called `tracking_id`s in the CMIP6 CV, are stored in the metadata of each netCDF file.
- The Cloud zarr stores, each an aggregation of the netCDF files that make up a particular dataset, keep the `tracking_id`s of the constituent files in their metadata.
- The `tracking_id`s are registered with the hdl.handle.net web service. They can be resolved through the [DKRZ web service](https://handle-esgf.dkrz.de/) and through a proxy server [REST API](https://www.handle.net/proxy_servlet.html), which returns the `version` when given a `tracking_id`.
- This notebook demonstrates the use of the REST API and documents various problems with the process.

Our Cloud CMIP6 catalog records a `version` for each dataset. This `version` was obtained by querying the Handle REST API with the constituent `tracking_id`s.
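
As a rough illustration of the lookup described above, the sketch below builds a Handle proxy REST URL for one `tracking_id` and flattens the JSON reply into a plain dictionary. The helper names (`handle_api_url`, `handle_values_to_dict`, `fetch_handle_record`) are invented for this example and are not the functions in `myidentify`. The reply layout (a `values` list whose entries carry `type` and `data.value`) follows the Handle proxy documentation, and the sample payload is abbreviated by hand from the record shown further down in this notebook.

```python
import json
from urllib.request import urlopen

HANDLE_API = "https://hdl.handle.net/api/handles/"


def handle_api_url(tracking_id):
    """Build the proxy REST URL for a tracking_id such as 'hdl:21.14100/...'."""
    handle = tracking_id[4:] if tracking_id.startswith("hdl:") else tracking_id
    return HANDLE_API + handle


def handle_values_to_dict(payload):
    """Flatten the 'values' list of a Handle reply into {type: value}."""
    record = {}
    for entry in payload.get("values", []):
        record[entry["type"]] = entry["data"]["value"]
    return record


def fetch_handle_record(tracking_id, timeout=30):
    """Query the proxy server (needs network access; not called below)."""
    with urlopen(handle_api_url(tracking_id), timeout=timeout) as response:
        return handle_values_to_dict(json.load(response))


# Hand-abbreviated reply for a file-level handle (assumed shape, two fields only).
sample_payload = {
    "responseCode": 1,
    "handle": "21.14100/b8eec705-a216-4868-911e-f4b7d188e11f",
    "values": [
        {"index": 1, "type": "AGGREGATION_LEVEL",
         "data": {"format": "string", "value": "FILE"}},
        {"index": 2, "type": "IS_PART_OF",
         "data": {"format": "string",
                  "value": "hdl:21.14100/e3c5b869-c4d2-3d40-9585-5dd1ba09a9f8;"
                           "hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960"}},
    ],
}

record = handle_values_to_dict(sample_payload)
# A file handle lists the dataset handle(s) it belongs to, separated by ';'.
dataset_handles = record["IS_PART_OF"].split(";")
```

A file-level record does not state the dataset version directly; the dataset handles listed in `IS_PART_OF` would have to be resolved in a second step.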

In [2]:
import pandas as pd
import zarr
import fsspec

In [3]:
from myidentify import gsurl2tracks, tracks2version, tracks2source, jdict2source, id2jdict, tracks2cloudversion
from myutilities import search_df
from mysearch import esgf_search
import myconfig

In [4]:
dfcat = pd.read_csv('https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv', dtype='unicode')

In [5]:
def gsurl2search(gsurl):
    # strip the 'gs://cmip6/' prefix and any trailing slash before splitting
    values = gsurl[len('gs://cmip6/'):].rstrip('/').split('/')
    keys = myconfig.target_keys
    return dict(zip(keys,values))
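
To make the mapping concrete, here is a small sketch of the same idea with the key list spelled out. `TARGET_KEYS` is an assumption standing in for `myconfig.target_keys`, which is not shown in this notebook; the names follow the usual CMIP6 facet order. `gsurl_to_facets` is a hypothetical helper, not part of `myconfig` or `mysearch`.

```python
# Assumed stand-in for myconfig.target_keys (not shown in this notebook).
TARGET_KEYS = [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
]
BUCKET_PREFIX = "gs://cmip6/"


def gsurl_to_facets(gsurl, keys=TARGET_KEYS):
    """Map the path components of a zarr store URL onto search facets."""
    if not gsurl.startswith(BUCKET_PREFIX):
        raise ValueError(f"not a {BUCKET_PREFIX} url: {gsurl!r}")
    # Drop the bucket prefix and any trailing slash, then split the path.
    values = gsurl[len(BUCKET_PREFIX):].strip("/").split("/")
    # zip() stops at the shorter sequence, so extra path parts are ignored.
    return dict(zip(keys, values))


facets = gsurl_to_facets(
    "gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr/"
)
```

Because `zip` truncates silently, a URL with a different layout (for example an extra leading `CMIP6/` component) would be mapped onto the wrong keys without raising an error, so the layout is worth checking before the result is used in a search.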

In [6]:
# standard example
#gsurl = 'gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr/'
#gsurl = 'gs://cmip6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/tas/gr/'
#gsurl = 'gs://cmip6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/tas/gn/'
gsurl = 'gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225'
version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

current version from GC catalog =  20200225
current version from GC tracks =  ['20200225']
latest version from handler =  20200225
version(s) available from ESGF =  ['v20200225']


In [13]:
# Just created, but the scripts report that a newer version exists:
#gsurl = 'gs://cmip6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/3hr/uas/gn/'

# Duplicate dataset_ids and versions: two files belong to both datasets,
# and one file belongs only to the most recent dataset.
gsurl = 'gs://cmip6/ScenarioMIP/HAMMOZ-Consortium/MPI-ESM-1-2-HAM/ssp370/r1i1p1f1/3hr/uas/gn/'
tracks = gsurl2tracks(gsurl)
tracking_ids = tracks.split('\n')
tracking_ids

['hdl:21.14100/d5aae306-16fe-4a2f-9706-8c41b4d20584',
 'hdl:21.14100/43133a86-783d-4d67-a73f-c512d2e27582',
 'hdl:21.14100/4fd3a7ec-d1ab-421b-963e-bcf82fcd8cdb']

In [5]:
# Now try another store. This example has two features: the tracking_ids of
# the netcdf files are not unique, and this is a replacement version.
gsurl = 'gs://cmip6/ScenarioMIP/NCAR/CESM2/ssp370/r4i1p1f1/Amon/ts/gn/'
tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('version = ', version)

# But since the tracking_ids were not unique, we won't get all of the urls!!
source_urls = tracks2source(tracks) 
source_urls


netcdf file tracking_ids are NOT UNIQUE!
['hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf', 'hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf']
cloud version from tracks =  20200528
version =  20200528


['http://esgf-data.ucar.edu/thredds/fileServer/esg_dataroot/CMIP6/ScenarioMIP/NCAR/CESM2/ssp370/r4i1p1f1/Amon/ts/gn/v20200528/ts_Amon_CESM2_ssp370_r4i1p1f1_gn_206501-210012.nc']
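
The duplicate check reported in the output above can be sketched with the standard library alone. This is a minimal illustration, assuming the `tracks` string is newline-separated as in the `tracks.split('\n')` call earlier; `split_tracks` and `duplicated_ids` are hypothetical helpers, not the code inside `myidentify`.

```python
from collections import Counter


def split_tracks(tracks):
    """Split a newline-separated tracking_id string, dropping empty entries."""
    return [t.strip() for t in tracks.split("\n") if t.strip()]


def duplicated_ids(tracking_ids):
    """Return the tracking_ids that occur more than once, in first-seen order."""
    counts = Counter(tracking_ids)
    return [tid for tid, n in counts.items() if n > 1]


# The two identical ids reported for the CESM2 ssp370 store above.
tracks_example = (
    "hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf\n"
    "hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf"
)

ids = split_tracks(tracks_example)
dupes = duplicated_ids(ids)
# De-duplicate while keeping the original order.
unique_ids = list(dict.fromkeys(ids))
```

With only one distinct id left, a lookup per id can return at most one source URL, which matches the single URL listed above even though the store was built from more than one file.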

In [14]:
# This example is a Dataset where there are three versions (and an ES-DOC ERRATA link from second version)
gsurl = 'gs://cmip6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tasmin/gr/'

version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

current version from GC catalog =  20190926
current version from GC tracks =  20190926

*** Newer version exists, see: http://hdl.handle.net/hdl:21.14100/480d0915-c4de-3b4a-89da-dbce9ace46ce
*** Newer version exists, see: http://hdl.handle.net/hdl:21.14100/b7fc3bc4-2489-3627-b8ce-bf665b908fb6

latest version from handler =  20200310
version(s) available from ESGF =  ['v20200310', 'v20190926']
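
The three sources print the version in slightly different forms (`20190926` from the catalog, `v20200310` from ESGF), so comparing them needs a normalization step first. The sketch below is one possible way to do that; `normalize_version` and `is_outdated` are hypothetical helpers, and the assumption is that every version is an eight-digit date with an optional leading `v`.

```python
def normalize_version(version):
    """Return the 8-digit date part of 'v20200310' or '20200310' as a string."""
    text = str(version).strip()
    if text.startswith("v"):
        text = text[1:]
    if not (len(text) == 8 and text.isdigit()):
        raise ValueError(f"unexpected version string: {version!r}")
    return text


def is_outdated(catalog_version, other_versions):
    """True if any of other_versions is newer than catalog_version."""
    current = normalize_version(catalog_version)
    # YYYYMMDD strings sort chronologically, so string comparison is enough.
    return any(normalize_version(v) > current for v in other_versions)


# Values taken from the two outputs above.
ec_earth_outdated = is_outdated("20190926", ["v20200310", "v20190926"])
taiesm_outdated = is_outdated("20200225", ["v20200225"])
```

For the EC-Earth3 example the catalog entry is flagged as outdated, while the TaiESM1 entry is not.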


In [None]:
gsurl = 'gs://cmip6/CMIP/THU/CIESM/historical/r1i1p1f1/Amon/tasmin/gr/'

In [6]:
# read every column as text to avoid the mixed-dtype warning from pandas
dfcat = pd.read_csv('https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv', dtype='unicode')

df = search_df(dfcat,table_id='Amon',experiment_id='historical',variable_id='tas')
# sort by realization index, i.e. the N in a member_id such as 'rNi1p1f1'
df['member'] = [int(s.split('r')[-1].split('i')[0]) for s in df['member_id']]
df = df.sort_values(by=['member'])
df = df.reset_index(drop=True)

len(df)

1552

In [8]:
# only check the first 15 stores
for index, row in df.head(15).iterrows():
    gsurl = row['zstore']
    version = row['version']
    print(gsurl)
    
    tracks = gsurl2tracks(gsurl)
    (version_new,jdict) = tracks2cloudversion(tracks)
    print('\t',version, version_new)

gs://cmip6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/tas/gn/
	 20200623 ['20200623']
gs://cmip6/CMIP/UA/MCM-UA-1-0/historical/r1i1p1f1/Amon/tas/gn/
	 20190731 ['20190731']
gs://cmip6/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/Amon/tasmin/gr1/
	 20190610 ['20190610']
gs://cmip6/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/Amon/tasmax/gr1/
	 20190610 ['20190610']
gs://cmip6/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/Amon/tas/gr1/
	 20190610 ['20190610']
gs://cmip6/CMIP/INM/INM-CM4-8/historical/r1i1p1f1/Amon/tasmin/gr1/
	 20190530 ['20190530']
gs://cmip6/CMIP/INM/INM-CM4-8/historical/r1i1p1f1/Amon/tasmax/gr1/
	 20190530 ['20190530']
gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/
	 20180803 ['20180803']
gs://cmip6/CMIP/INM/INM-CM4-8/historical/r1i1p1f1/Amon/tas/gr1/
	 20190530 ['20190530']
gs://cmip6/CMIP/HAMMOZ-Consortium/MPI-ESM-1-2-HAM/historical/r1i1p1f1/Amon/tasmax/gn/
	 20190627 ['20190627']
gs://cmip6/CMIP/HAMMOZ-Consortium/MPI-ESM-1-2-HAM/historical/r1i1p1f1/Amon/tas/gn/


In [5]:
# Here is a really bad one:

gsurl = 'gs://cmip6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r12i1p1f1/day/hurs/gr/'

# https://handle-esgf.dkrz.de/lp/21.14100/388fda69-61ed-44f0-af3d-73d20dab3502   'handle is not accessible'
#version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
#print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
#print(tracks)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

hdl:21.14100/e3c5b869-c4d2-3d40-9585-5dd1ba09a9f8;hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960


current version from GC tracks =  ['20200315']
latest version from handler =  20200315
version(s) available from ESGF =  ['v20200315']


In [13]:
id2jdict('hdl:21.14100/b8eec705-a216-4868-911e-f4b7d188e11f')

{'URL': 'https://handle-esgf.dkrz.de/lp/21.14100/b8eec705-a216-4868-911e-f4b7d188e11f',
 'AGGREGATION_LEVEL': 'FILE',
 'FIXED_CONTENT': 'TRUE',
 'FILE_NAME': 'hurs_day_EC-Earth3_historical_r12i1p1f1_gr_20140101-20141231.nc',
 'FILE_SIZE': '161893254',
 'IS_PART_OF': 'hdl:21.14100/e3c5b869-c4d2-3d40-9585-5dd1ba09a9f8;hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960',
 'FILE_VERSION': '1',
 'CHECKSUM': '391dc8728d86929978d86e809af71fafeb704ebb87ad4c15a5634b0e10178a34',
 'CHECKSUM_METHOD': 'SHA256',
 'URL_ORIGINAL_DATA': '<locations><location href="http://esgf.bsc.es/thredds/fileServer/esg_dataroot/a1tk-CMIP-r12/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r12i1p1f1/day/hurs/gr/v20201230/hurs_day_EC-Earth3_historical_r12i1p1f1_gr_20140101-20141231.nc" publishedOn="2020-12-30T18:11:22.384+00:00" host="esgf.bsc.es" dataset="hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960" /></locations>',
 'URL_REPLICA': '<locations><location href="http://esgf-data1.llnl.gov/thredds/fileServer/htt

In [9]:
tracks.split('\n')[-1:]

['hdl:21.14100/b8eec705-a216-4868-911e-f4b7d188e11f']
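
The `URL_ORIGINAL_DATA` field in the record above is a small XML fragment, and the version directory is visible inside its `href`. The following sketch pulls both out with the standard library; `parse_locations` and `version_from_href` are hypothetical helpers, and the `/vYYYYMMDD/` path pattern is an assumption based on the one URL shown here.

```python
import re
import xml.etree.ElementTree as ET


def parse_locations(xml_text):
    """Return a list of attribute dicts, one per <location> element."""
    root = ET.fromstring(xml_text)
    return [dict(loc.attrib) for loc in root.iter("location")]


def version_from_href(href):
    """Pull a 'vYYYYMMDD' path component out of a download URL, if present."""
    match = re.search(r"/v(\d{8})/", href)
    return match.group(1) if match else None


# URL_ORIGINAL_DATA copied from the id2jdict output above.
url_original_data = (
    '<locations><location href="http://esgf.bsc.es/thredds/fileServer/'
    'esg_dataroot/a1tk-CMIP-r12/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/'
    'historical/r12i1p1f1/day/hurs/gr/v20201230/'
    'hurs_day_EC-Earth3_historical_r12i1p1f1_gr_20140101-20141231.nc" '
    'publishedOn="2020-12-30T18:11:22.384+00:00" host="esgf.bsc.es" '
    'dataset="hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960" /></locations>'
)

locations = parse_locations(url_original_data)
path_version = version_from_href(locations[0]["href"])
```

For this file the path carries `v20201230`, whereas the handler and ESGF both reported `20200315` for the store.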

In [8]:
import xarray as xr
xr.open_zarr(fsspec.get_mapper('gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225'))

[xarray Dataset HTML repr; variable names were lost in extraction. Dask arrays shown:]

Array bytes  Chunk bytes  Array shape       Chunk shape      Count              Type
3.07 kB      3.07 kB      (192, 2)          (192, 2)         2 Tasks, 1 Chunks  float64
4.61 kB      4.61 kB      (288, 2)          (288, 2)         2 Tasks, 1 Chunks  float64
28.80 kB     28.80 kB     (1800, 2)         (1800, 2)        2 Tasks, 1 Chunks  object
398.13 MB    64.59 MB     (1800, 192, 288)  (292, 192, 288)  8 Tasks, 7 Chunks  float32

In [10]:
import gcsfs
import xarray as xr

# create a MutableMapping from a store URL
fs = gcsfs.GCSFileSystem(token='anon', access='read_only')
mapper = fs.get_mapper("gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")
# make sure to specify that metadata is consolidated
ds = xr.open_zarr(mapper, consolidated=True)


In [16]:
# get the path to a specific zarr store
# (df_subset is only defined in the next cell, so running this cell first fails)
zstore = df_subset.zstore.values[-1]
mapper = fs.get_mapper(zstore)
# open using xarray
ds = xr.open_zarr(mapper, consolidated=True)

NameError: name 'df_subset' is not defined

In [17]:
# read every column as text to avoid the mixed-dtype warning from pandas
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv", dtype='unicode')
df_subset = df.query("activity_id=='CMIP' & table_id=='Amon' & variable_id=='tas'")
zstore = df_subset.zstore.values[-1]


In [18]:
zstore = df_subset.zstore.values[-1]
mapper = fs.get_mapper(zstore)
# open using xarray
ds = xr.open_zarr(mapper, consolidated=True)


In [19]:
ds

[xarray Dataset HTML repr; variable names were lost in extraction. Dask arrays shown:]

Array bytes  Chunk bytes  Array shape     Chunk shape    Count                Type
1.28 kB      1.28 kB      (80, 2)         (80, 2)        2 Tasks, 1 Chunks    float64
1.54 kB      1.54 kB      (96, 2)         (96, 2)        2 Tasks, 1 Chunks    float64
96.00 kB     96.00 kB     (6000, 2)       (6000, 2)      2 Tasks, 1 Chunks    object
61.44 kB     61.44 kB     (80, 96)        (80, 96)       2 Tasks, 1 Chunks    float64
184.32 MB    18.43 MB     (6000, 80, 96)  (600, 80, 96)  11 Tasks, 10 Chunks  float32
