### Versions
- Unfortunately, `version` is stored neither in the metadata of each original netCDF file nor in the file name, and it is not part of the `Dataset` name. When the netCDF files are downloaded via Globus, for example, the original path information is lost.
- Data handles (persistent identifiers), called `tracking_id`s in the CMIP6 CV, are stored in the metadata of each netCDF file.
- The Cloud zarr stores, which aggregate the netCDF files belonging to a particular dataset, keep the `tracking_id`s of the constituent files in their metadata.
- The `tracking_id`s have been registered with the hdl.handle.net web service, with links to the [DKRZ web service](https://handle-esgf.dkrz.de/) and a Proxy Server [REST API](https://www.handle.net/proxy_servlet.html) that returns `version` when given a `tracking_id`.
- This notebook demonstrates the use of the REST API and documents various problems with the process.

Our Cloud CMIP6 catalogs record a `version` for each dataset. This `version` was obtained by querying the Handle REST API with the constituent `tracking_id`s.
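
As a rough illustration of the lookup described above, the sketch below builds the REST URL for a `tracking_id` and flattens a Handle record into a plain dictionary. The URL pattern is the one that appears in the error message later in this notebook. The helper names (`handle_api_url`, `values_by_type`), the all-zero handle, and the record contents are hypothetical, and the field names (`FILE_NAME`, `AGGREGATION_LEVEL`, `IS_PART_OF`) are assumptions about what an ESGF record contains. This is not the code inside `myidentify`.

```python
HANDLE_API = "http://hdl.handle.net/api/handles/"


def handle_api_url(tracking_id):
    """Build the Handle proxy REST URL for a tracking_id such as 'hdl:21.14100/<uuid>'."""
    handle = tracking_id.strip()
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    return HANDLE_API + handle


def values_by_type(record):
    """Flatten a Handle JSON record into a {type: value} dictionary."""
    return {v["type"]: v["data"]["value"] for v in record.get("values", [])}


# Hand-written stand-in for a REST response (NOT real data); the field names
# are assumptions about what a file-level ESGF handle record contains.
sample_record = {
    "responseCode": 1,
    "handle": "21.14100/00000000-0000-0000-0000-000000000000",
    "values": [
        {"index": 1, "type": "FILE_NAME",
         "data": {"format": "string", "value": "example_file.nc"}},
        {"index": 2, "type": "AGGREGATION_LEVEL",
         "data": {"format": "string", "value": "FILE"}},
        {"index": 3, "type": "IS_PART_OF",
         "data": {"format": "string", "value": "hdl:21.14100/11111111-1111-1111-1111-111111111111"}},
    ],
}

print(handle_api_url("hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960"))
print(values_by_type(sample_record))
```

Against the live service the record would come from something like `requests.get(handle_api_url(tid)).json()`. The sketch does not make that request, so it runs offline.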

In [1]:
import pandas as pd
import zarr
import fsspec

In [2]:
from myidentify import gsurl2tracks, tracks2version, tracks2source, jdict2source, id2jdict, tracks2cloudversion
from myutilities import search_df
from mysearch import esgf_search
import myconfig

In [3]:
dfcat = pd.read_csv('https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv', dtype='unicode')

In [4]:
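def gsurl2search(gsurl):
    # Drop the 'gs://cmip6/' prefix (11 characters) and the trailing '/', then split into facet values
    values = gsurl[11:-1].split('/')
    # Pair each value with its facet name, in the order given by myconfig.target_keys
    keys = myconfig.target_keys
    return dict(zip(keys, values))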
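
To make the mapping concrete, here is a self-contained variant that can be run without `myconfig`. The key list `ASSUMED_TARGET_KEYS` is an assumption about what `myconfig.target_keys` contains (the usual CMIP6 facet order for these store paths), and `gsurl_to_facets` is a hypothetical name. It is a sketch, not the notebook's own helper.

```python
# Assumed stand-in for myconfig.target_keys, which is not shown in this notebook
ASSUMED_TARGET_KEYS = [
    'activity_id', 'institution_id', 'source_id', 'experiment_id',
    'member_id', 'table_id', 'variable_id', 'grid_label',
]


def gsurl_to_facets(gsurl, keys=ASSUMED_TARGET_KEYS, prefix='gs://cmip6/'):
    """Split a zarr store URL into a {facet: value} search dictionary."""
    if not gsurl.startswith(prefix):
        raise ValueError('unexpected prefix in ' + gsurl)
    values = gsurl[len(prefix):].strip('/').split('/')
    if len(values) != len(keys):
        raise ValueError('expected %d path elements, got %d' % (len(keys), len(values)))
    return dict(zip(keys, values))


facets = gsurl_to_facets('gs://cmip6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/tas/gn/')
print(facets)
```

Checking the prefix and the number of path elements guards against silently mis-assigned facets if a URL has an unexpected layout.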

In [6]:
# standard example (uncomment one of the alternatives to try a different dataset)
#gsurl = 'gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr/'
#gsurl = 'gs://cmip6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/tas/gr/'
gsurl = 'gs://cmip6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/tas/gn/'
version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

current version from GC catalog =  20200720


hdl:21.14100/666dd286-bb54-39c8-a57e-b697339b7bfa;hdl:21.14100/e401cf38-1abf-37dd-af92-07e7da0b43c4


current version from GC tracks =  ['20191015', '20200720']
latest version from handler =  20200720
version(s) available from ESGF =  ['v20200511', 'v20200720']


In [13]:
#gsurl = 'gs://cmip6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/3hr/uas/gn/'  # just created, but scripts say there is a newer version!!!!

gsurl = 'gs://cmip6/ScenarioMIP/HAMMOZ-Consortium/MPI-ESM-1-2-HAM/ssp370/r1i1p1f1/3hr/uas/gn/' # duplicate dataset_ids and versions 
                                                                # since two files are in both datasets and one file is only in the most recent
tracks = gsurl2tracks(gsurl)
tracking_ids = tracks.split('\n')
tracking_ids

['hdl:21.14100/d5aae306-16fe-4a2f-9706-8c41b4d20584',
 'hdl:21.14100/43133a86-783d-4d67-a73f-c512d2e27582',
 'hdl:21.14100/4fd3a7ec-d1ab-421b-963e-bcf82fcd8cdb']

In [5]:
# Now try another dataset. This example has two features: the tracking_ids of the netcdf files are not unique, and this is a replacement version.
gsurl = 'gs://cmip6/ScenarioMIP/NCAR/CESM2/ssp370/r4i1p1f1/Amon/ts/gn/'
tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('version = ', version)

# But since the tracking_ids were not unique, we won't get all of the urls!!
source_urls = tracks2source(tracks) 
source_urls


netcdf file tracking_ids are NOT UNIQUE!
['hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf', 'hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf']



cloud version from tracks =  20200528
version =  20200528


['http://esgf-data.ucar.edu/thredds/fileServer/esg_dataroot/CMIP6/ScenarioMIP/NCAR/CESM2/ssp370/r4i1p1f1/Amon/ts/gn/v20200528/ts_Amon_CESM2_ssp370_r4i1p1f1_gn_206501-210012.nc']
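
The non-unique case above can be detected before any handle is resolved. The sketch below assumes, as the other cells suggest, that `tracks` is a newline-separated string of `tracking_id`s. The function name `duplicate_tracking_ids` is hypothetical and is not part of `myidentify`.

```python
from collections import Counter


def duplicate_tracking_ids(tracks):
    """Return the tracking_ids that occur more than once in a newline-separated string."""
    ids = [t.strip() for t in tracks.split('\n') if t.strip()]
    counts = Counter(ids)
    return [tid for tid, n in counts.items() if n > 1]


# The repeated id reported in the output above
tracks_example = '\n'.join([
    'hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf',
    'hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf',
])
dups = duplicate_tracking_ids(tracks_example)
print(dups)
```

When this list is non-empty, the number of distinct source URLs that can be recovered from the handles is smaller than the number of constituent files, which is why only one URL is returned above.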

In [14]:
# This example is a Dataset where there are three versions (and an ES-DOC ERRATA link from second version)
gsurl = 'gs://cmip6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tasmin/gr/'

version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

current version from GC catalog =  20190926
current version from GC tracks =  20190926



*** Newer version exists, see: http://hdl.handle.net/hdl:21.14100/480d0915-c4de-3b4a-89da-dbce9ace46ce


*** Newer version exists, see: http://hdl.handle.net/hdl:21.14100/b7fc3bc4-2489-3627-b8ce-bf665b908fb6



latest version from handler =  20200310
version(s) available from ESGF =  ['v20200310', 'v20190926']
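
The catalog stores versions as `20190926` while ESGF reports them as `v20190926`, so a comparison needs a normalization step. The following is a minimal sketch of such a check. The function names `normalize_version` and `is_outdated` are hypothetical, and it assumes every version is an 8-digit `YYYYMMDD` date.

```python
def normalize_version(v):
    """Return an 8-digit YYYYMMDD string from '20190926', 'v20190926' or 20190926."""
    s = str(v).strip()
    if s.startswith('v'):
        s = s[1:]
    if not (len(s) == 8 and s.isdigit()):
        raise ValueError('not a YYYYMMDD version: %r' % (v,))
    return s


def is_outdated(catalog_version, available_versions):
    """Compare the catalog version with the newest available one.

    YYYYMMDD strings of equal length sort chronologically, so plain
    string comparison is enough here.
    """
    latest = max(normalize_version(v) for v in available_versions)
    return normalize_version(catalog_version) < latest, latest


# Values taken from the output above
outdated, latest = is_outdated('20190926', ['v20200310', 'v20190926'])
print(outdated, latest)
```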


In [None]:
gsurl = 'gs://cmip6/CMIP/THU/CIESM/historical/r1i1p1f1/Amon/tasmin/gr/'

In [8]:
dfcat = pd.read_csv('https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv')

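# Select all monthly near-surface air temperature stores from the historical experiment
df = search_df(dfcat,table_id='Amon',experiment_id='historical',variable_id='tas')
# Extract the realization number (the integer after 'r', e.g. 11 in 'r11i1p1f1') so members sort numerically
df['member'] = [int(s.split('r')[-1].split('i')[0]) for s in df['member_id']]
df = df.sort_values(by=['member'])
df = df.reset_index(drop=True)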

len(df)


1539

In [11]:
for index, row in df.iterrows():
    #if index < 14:
    #    continue
    gsurl = row['zstore']
    version = row['version']
    print(index, gsurl)
    
    tracks = gsurl2tracks(gsurl)
    version_new = tracks2cloudversion(tracks)
    print('\t',version, version_new)

In [8]:
# Here is a really bad one:

gsurl = 'gs://cmip6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r12i1p1f1/day/hurs/gr/'

# https://handle-esgf.dkrz.de/lp/21.14100/388fda69-61ed-44f0-af3d-73d20dab3502   'handle is not accessible'
#version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
#print('current version from GC catalog = ',version_cat)

tracks = gsurl2tracks(gsurl)
(version,jdict) = tracks2version(tracks)
print('latest version from handler = ', version)

asearch = gsurl2search(gsurl)
dfs = esgf_search(asearch, toFilter = False)
version_ESGF = list(set(dfs.version_id))
print('version(s) available from ESGF = ', version_ESGF)

#source_urls =tracks2source(tracks) 
#source_urls

hdl:21.14100/e3c5b869-c4d2-3d40-9585-5dd1ba09a9f8;hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960


HTTPError: 404 Client Error:  for url: http://hdl.handle.net/api/handles/21.14100/0aecf69a-2b74-32f4-96ff-879669b90960
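
A single unresolvable handle, as in the 404 above, stops the whole lookup. One possible way to keep going is to resolve each `tracking_id` separately and collect the failures. The sketch below is an assumption about how that could look, not the behavior of `tracks2version`. `resolve_handles` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call so that the example runs offline.

```python
def resolve_handles(tracking_ids, fetch):
    """Call fetch(tracking_id) for each id; return (resolved, failed) dictionaries."""
    resolved, failed = {}, {}
    for tid in tracking_ids:
        try:
            resolved[tid] = fetch(tid)
        except Exception as err:  # broad on purpose for this sketch; narrow it in real code
            failed[tid] = str(err)
    return resolved, failed


def fake_fetch(tid):
    """Offline stand-in for a REST call: fails for the handle that returned 404 above."""
    if tid.endswith('0aecf69a-2b74-32f4-96ff-879669b90960'):
        raise RuntimeError('404 Client Error')
    return {'handle': tid}


ids = [
    'hdl:21.14100/e3c5b869-c4d2-3d40-9585-5dd1ba09a9f8',
    'hdl:21.14100/0aecf69a-2b74-32f4-96ff-879669b90960',
]
resolved, failed = resolve_handles(ids, fake_fetch)
print('resolved:', list(resolved))
print('failed:', failed)
```

With a real fetch function built on `requests`, the exception to catch would be `requests.exceptions.HTTPError`, raised by `response.raise_for_status()`.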

In [9]:
tracks

'hdl:21.14100/388fda69-61ed-44f0-af3d-73d20dab3502\nhdl:21.14100/101deaaa-6032-4474-b996-0d8e01a8d97f\nhdl:21.14100/7a7dcbf5-b1d5-4f02-a6ee-2cb63e06538e\nhdl:21.14100/f12c83af-e0b4-488a-90d7-a5d3ee04ed6e\nhdl:21.14100/5e023a03-2409-4145-bbfc-93a0386c297a\nhdl:21.14100/b6eb5b89-583c-47ae-89b3-e0bf2d22fa90\nhdl:21.14100/c8e927aa-b6a6-4654-843a-3515cf83f53d\nhdl:21.14100/b040c06b-9bfa-4fb0-bacb-fbdfda2bb339\nhdl:21.14100/a9ca1644-463f-4e6c-81e1-8f1eda3e137c\nhdl:21.14100/87d472a5-348f-4f07-9eaa-2e597435a61d\nhdl:21.14100/1b636d34-c549-47ba-a5f5-9fb3883ccae2\nhdl:21.14100/a223e1c1-ea08-4473-a673-e087e753d501\nhdl:21.14100/eacd9304-6e25-4d3d-9fa9-20c21fe70f1e\nhdl:21.14100/5fa98c38-30d0-4091-9497-01847458effc\nhdl:21.14100/78ee67d5-a986-4063-b6eb-0cdefb398c77\nhdl:21.14100/7f443051-5a3f-43ed-a56d-334515a12c3b\nhdl:21.14100/98753a67-f158-4050-9c6c-aa9b5bb56644\nhdl:21.14100/db799f8c-48e4-499d-9166-4552808fa4a4\nhdl:21.14100/b0eb89aa-7ff2-4a7d-9a24-06de3f2d7e5e\nhdl:21.14100/b091dc7b-4f7c-4e4