# Data Handles (aka persistent identifiers)
- Each netcdf file should correspond to a UNIQUE handle, which is called the tracking_id in CMIP6-world.
- We can get information about these tracking_ids from the German Climate Computing Centre at http://handle-esgf.dkrz.de

### For example,
- given tracking_id = hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b

- We can append the tracking_id to 'http://hdl.handle.net/', which redirects to 'https://handle-esgf.dkrz.de/lp/21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b' (try it out).
This gives us the following information:

```
       netcdf_file    = 'expc_Oyr_CESM2_historical_r11i1p1f1_gr_1850-1899.nc'
       file_size      = 440729165  
       source_url     = 'http://esgf-data.ucar.edu/thredds/fileServer/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/
                      Oyr/expc/gr/v20190514/expc_Oyr_CESM2_historical_r11i1p1f1_gr_1850-1899.nc'
       aggregation_id = 'hdl:21.14100/f03d7841-e21a-31fd-88c6-9632c0e2c5b1'  
       dataset_id     = 'CMIP6.CMIP.NCAR.CESM2.historical.r11i1p1f1.Oyr.expc.gr'
       version        = 20190514
       
```  
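The lookup URL from the example above can also be built in code. This is a minimal sketch, assuming the `hdl:` prefix is dropped before the handle is appended to the resolver address; the helper name `handle_url` is hypothetical.

```python
RESOLVER = 'http://hdl.handle.net/'

def handle_url(tracking_id, resolver=RESOLVER):
    """Return the resolver URL for a tracking_id such as 'hdl:21.14100/...'."""
    # Assumption: the 'hdl:' scheme prefix is not part of the URL path.
    handle = tracking_id[4:] if tracking_id.startswith('hdl:') else tracking_id
    return resolver + handle

url = handle_url('hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b')
print(url)
```

Opening the printed URL in a browser should lead to the same landing page as the manual lookup.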

- If you now follow the link to the aggregation, you see the following list of netcdf files and their corresponding tracking_ids, including the netcdf file with our given tracking_id (expc_Oyr_CESM2_historical_r11i1p1f1_gr_1850-1899.nc):

```
       expc_Oyr_CESM2_historical_r11i1p1f1_gr_1950-1999.nc	hdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f  
       expc_Oyr_CESM2_historical_r11i1p1f1_gr_2000-2014.nc	hdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb  
       expc_Oyr_CESM2_historical_r11i1p1f1_gr_1900-1949.nc	hdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4  
       expc_Oyr_CESM2_historical_r11i1p1f1_gr_1850-1899.nc	hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b  
```

- Note that this is a list of all netcdf files which were concatenated in time to make the complete dataset: 'CMIP6.CMIP.NCAR.CESM2.historical.r11i1p1f1.Oyr.expc.gr'
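The aggregation listing is not in time order, so it can help to parse it into a mapping and sort it. The sketch below is only illustrative: `parse_listing` is a hypothetical helper, and the sort assumes file names end in `_<start>-<end>.nc` as in this example.

```python
import re

listing = """\
expc_Oyr_CESM2_historical_r11i1p1f1_gr_1950-1999.nc hdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f
expc_Oyr_CESM2_historical_r11i1p1f1_gr_2000-2014.nc hdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb
expc_Oyr_CESM2_historical_r11i1p1f1_gr_1900-1949.nc hdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4
expc_Oyr_CESM2_historical_r11i1p1f1_gr_1850-1899.nc hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b
"""

def parse_listing(text):
    """Map each netcdf file name to its tracking_id."""
    files = {}
    for line in text.splitlines():
        parts = line.split()  # tabs or spaces both work
        if len(parts) == 2:
            name, tracking_id = parts
            files[name] = tracking_id
    return files

def start_stamp(name):
    """Extract the leading time stamp from '..._<start>-<end>.nc'."""
    return int(re.search(r'_(\d+)-\d+\.nc$', name).group(1))

files = parse_listing(listing)
ordered = sorted(files, key=start_stamp)
for name in ordered:
    print(name, files[name])
```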

# Zarr store persistent identifier
- Each zarr store in our Cloud collection also has a tracking_id. This tracking_id is the newline-separated concatenation of the tracking_ids for all netcdf files in the aggregation.

### For example,
- For the zarr store = 'gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr':

``` 
import fsspec
import xarray as xr

ds = xr.open_zarr(fsspec.get_mapper('gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr'), consolidated=True)
ds.attrs['tracking_id']

'hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b\nhdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4\nhdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f\nhdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb'
```
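Because the store's tracking_id is one newline-separated string, it has to be split before it can be compared with the aggregation's file list. A minimal sketch, with the hypothetical helper `split_tracking_ids` and the values from the example above copied in by hand:

```python
def split_tracking_ids(attr):
    """Split a zarr store's tracking_id attribute into individual handles."""
    return [t.strip() for t in attr.split('\n') if t.strip()]

# Attribute value as shown for the CESM2 expc store above.
store_attr = ('hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b\n'
              'hdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4\n'
              'hdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f\n'
              'hdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb')

# Handles listed on the aggregation's landing page.
aggregation_ids = {
    'hdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f',
    'hdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb',
    'hdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4',
    'hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b',
}

store_ids = split_tracking_ids(store_attr)
missing = aggregation_ids - set(store_ids)  # in the aggregation, not in the store
extra = set(store_ids) - aggregation_ids    # in the store, not in the aggregation
print(len(store_ids), missing, extra)
```

Two empty sets mean the store was built from exactly the files in the aggregation.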

# GOAL: Finding version_id for each zarr store
Once we know the version_id for each zarr store, we can iterate through the catalog and replace old versions with new versions when they become available. 

- Unfortunately, the version_id is NOT contained in the netcdf file metadata, nor as part of the netcdf file's name.  Since we download the netcdf files (using GLOBUS or wget), the directory structure of the source machine is lost and therefore the version_id is lost.   
- Fortunately, using the Handle REST API, we can find the version_id corresponding to the constituent tracking_ids of each zarr store.
- Ideally, each netcdf file would have a UNIQUE tracking_id.  This is not always true (e.g., CESM assigns the same tracking_id to multiple netcdf files).
- Ideally, each aggregation consists of netcdf files with known tracking_ids.  This is not always true (e.g., the netcdf file was not properly registered)
- Ideally, each aggregation corresponds to a single version_id.  This is not always true, either.
- Ideally, each zarr store has tracking_ids corresponding to the same aggregation. This is not always true, for example when the netcdf files returned by the ESGF search API were not all from the same version (e.g., v20191201 and v20191202).
- If all of the above were true, then each zarr store would correspond to a unique version_id.
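
The conditions in the list above can be checked mechanically once the per-file lookups are done. The following is a rough sketch, not part of the original workflow: `check_store` is a hypothetical helper, `None` is assumed to mark a file whose lookup failed, and the `agg-a`/`agg-b` values are made-up placeholders.

```python
from collections import Counter

def check_store(tracking_ids, aggregations, versions):
    """Return a list of problems; an empty list means one unique version_id."""
    problems = []
    repeated = sorted(t for t, n in Counter(tracking_ids).items() if n > 1)
    if repeated:
        problems.append('repeated tracking_ids: ' + ', '.join(repeated))
    if any(a is None for a in aggregations):
        problems.append('file without a registered aggregation')
    if len({a for a in aggregations if a is not None}) > 1:
        problems.append('files belong to more than one aggregation')
    if len({v for v in versions if v is not None}) > 1:
        problems.append('more than one version_id')
    return problems

# Well-behaved case: distinct files, one aggregation, one version.
clean = check_store(
    ['hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b',
     'hdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4'],
    ['hdl:21.14100/f03d7841-e21a-31fd-88c6-9632c0e2c5b1'] * 2,
    ['20190514'] * 2)

# Problem case: a repeated tracking_id, two aggregations, two versions.
mixed = check_store(
    ['hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf'] * 2,
    ['agg-a', 'agg-b'],  # placeholder aggregation ids
    ['20191201', '20191202'])

print(clean)
print(mixed)
```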


### So here is code to get a version_id from the zarr store's tracking_id

```
import requests

client = requests.session()
baseurl = 'http://hdl.handle.net/api/handles/'
query1 = '?type=IS_PART_OF'
query2 = '?type=VERSION_NUMBER'

# given the zarr store tracking_ids, find the dataset tracking_id and then the version of the dataset

# CESM2 - doesn't have unique tracking_id for each file?!
#tracking_ids = "hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf\nhdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf"

tracking_ids = 'hdl:21.14100/2f67746f-580b-4750-a25f-ce2559d1c57b\nhdl:21.14100/eab727d0-c996-419f-8f38-1bd79641aca4\nhdl:21.14100/9c92b764-4f3e-4aa2-9f4b-5a7612cf660f\nhdl:21.14100/405705f5-6566-4cad-83c1-07ce60e82fcb'

aggregations = []
versions = []
datasets = []
for file_tracking_id in tracking_ids.split('\n'):
    url1 = baseurl + file_tracking_id[4:] + query1
    r = client.get(url1)
    r.raise_for_status()
    dataset_tracking_id = r.json()['values'][0]['data']['value']
    aggregations += [dataset_tracking_id]
    url2 = baseurl + dataset_tracking_id[4:] + query2
    r = client.get(url2)
    r.raise_for_status()
    versions += [r.json()['values'][0]['data']['value']]
    url = baseurl + dataset_tracking_id[4:]
    r = client.get(url)
    r.raise_for_status()
    datasets += [r.json()['values'][3]['data']['value']]

aggregation_id = list(set(aggregations))
version_id = list(set(versions))
dataset_id = list(set(datasets))

print('dataset_id = ',dataset_id,'\n')
print('version_id = ',version_id,'\n')
print('aggregation_id = ', aggregation_id,'\n')
```

```
dataset_id =  ['CMIP6.CMIP.NCAR.CESM2.historical.r11i1p1f1.Oyr.expc.gr'] 

version_id =  ['20190514'] 

aggregation_id =  ['hdl:21.14100/f03d7841-e21a-31fd-88c6-9632c0e2c5b1'] 
```
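
The code above picks values by position (`['values'][0]`, `['values'][3]`), which depends on the order of entries in the response. A possible alternative is to select by type. The sketch below assumes that each entry in `values` carries a `type` field next to `data`; the helper `value_by_type` and the two abbreviated responses are hypothetical stand-ins, not real server output.

```python
def value_by_type(response, value_type):
    """Return the data value of the first entry with the given type, or None."""
    for entry in response.get('values', []):
        if entry.get('type') == value_type:
            return entry['data']['value']
    return None

# Hypothetical, abbreviated responses used only to exercise the helper.
file_response = {'values': [
    {'type': 'IS_PART_OF',
     'data': {'value': 'hdl:21.14100/f03d7841-e21a-31fd-88c6-9632c0e2c5b1'}},
]}
dataset_response = {'values': [
    {'type': 'VERSION_NUMBER', 'data': {'value': '20190514'}},
]}

aggregation = value_by_type(file_response, 'IS_PART_OF')
version = value_by_type(dataset_response, 'VERSION_NUMBER')
print(aggregation, version)
```

With real responses, `r.json()` would be passed in place of the stand-in dictionaries. A `None` result signals that the requested type was not present, which can be handled explicitly instead of failing with an index error.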

