Let's find two unique dataset ids (one netCDF3, one netCDF4) and the corresponding file links.
Then we can compare across storage locations and the reference files built from them!


- [ ] Test virtualizarr 
- [ ] Merge the main pangeo-forge-esgf once the async client is merged
- [ ] Fix the iid generation (with proper generation of member_id) upstream in xmip

In [1]:
# !pip install git+https://github.com/jbusecke/pangeo-forge-esgf.git@new-request-scheme

In [2]:
# !pip install virtualizarr

In [3]:
# !pip install pangeo-forge-esgf

In [4]:
import xarray as xr
from pangeo_forge_esgf.utils import CMIP6_naming_schema

In [5]:
import intake
from tqdm.auto import tqdm

In [14]:
from xmip.utils import cmip6_dataset_id

In [6]:
from virtualizarr.backend import automatically_determine_filetype
r_options = {'storage_options': {'anon': True}}  # fsspec options for anonymous S3 access

In [7]:
path = "s3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc"

In [8]:
automatically_determine_filetype(filepath=path, reader_options=r_options)

<FileType.hdf5: 'hdf5'>
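
For intuition on what a check like this has to distinguish: netCDF4 files are HDF5 containers and start with the 8-byte HDF5 signature, while netCDF classic-format files start with `CDF` followed by a version byte. The sketch below classifies a header from those magic bytes. `sniff_netcdf_flavor` is a hypothetical helper written for illustration; it is not virtualizarr's implementation, which may work differently.

```python
def sniff_netcdf_flavor(header: bytes) -> str:
    """Classify a file from its first bytes (hypothetical helper, not part of virtualizarr)."""
    # HDF5 signature; netCDF4 files are stored as HDF5 containers.
    if header.startswith(b"\x89HDF\r\n\x1a\n"):
        return "hdf5"
    # Classic-format family: CDF-1 (classic), CDF-2 (64-bit offset), CDF-5.
    if header[:3] == b"CDF" and header[3:4] in (b"\x01", b"\x02", b"\x05"):
        return "netcdf3"
    return "unknown"


# Synthetic headers, just to show the three outcomes.
print(sniff_netcdf_flavor(b"\x89HDF\r\n\x1a\n"))
print(sniff_netcdf_flavor(b"CDF\x01\x00\x00\x00\x00"))
print(sniff_netcdf_flavor(b"not a netcdf file"))
```

To try this on a remote file, the first 8 bytes read through `fsspec.open(path, anon=True)` should be enough as input.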

In [9]:
def make_iid(ds):
    iid_schema = CMIP6_naming_schema.replace('.member_id.', '.variant_label.sub_experiment_id.').replace('.version', '')
    return '.'.join([ds.attrs[facet] for facet in iid_schema.split('.')])

In [10]:
import fsspec
with fsspec.open(path, anon=True) as f:
    ds = xr.open_dataset(f)
# ds
make_iid(ds)

'CMIP6.CMIP.CCCma.CanESM5.historical.r10i1p1f1.none.Omon.uo.gn'

## Find a dataset with multiple nc3 files in the GFDL holdings.

In [11]:
col = intake.open_esm_datastore("https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.json")

In [12]:
col_fast_iter = col.search(experiment_id='historical', member_id='r1i1p1f1', grid_label='gn')
rows = list(col_fast_iter.df.iterrows())
    # Only check every 500th row, hoping this surfaces a model with netcdf3 output sooner
for i, row in tqdm(rows[0::500]):
    path = row['path']
    try:
        file_type = automatically_determine_filetype(filepath=path, reader_options={'storage_options':{'anon':True}})
        if file_type.value == 'netcdf3':
            # check the catalog for this file and get all the other relevant files
            cat = col.search(**{facet:row[facet] for facet in ['source_id', 'table_id', 'variable_id', 'member_id', 'version', 'experiment_id']})
            # only break if this has multiple files (better for the demo)
            flist = cat.df['path'].tolist()
            if len(flist) > 1:
                print(row)
                print(flist)
                break
            else:
                print(f"Found only a single file {flist} for {row['source_id']}{row['experiment_id']}{row['variable_id']}")
    except Exception as e:
        print(f"{row['source_id']}{row['experiment_id']}{row['variable_id']} failed with {e}")

  0%|          | 0/104 [00:00<?, ?it/s]

AWI-ESM-1-1-LRhistoricalua failed with Forbidden
project                                                        CMIP6
institution_id                                                   BCC
source_id                                                BCC-CSM2-MR
experiment_id                                             historical
frequency                                                        NaN
modeling_realm                                                   NaN
table_id                                                      6hrLev
member_id                                                   r1i1p1f1
grid_label                                                        gn
variable_id                                                       ps
temporal_subset                            195001010000-195412311800
version                                                    v20181127
path               s3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/his...
Name: 9000, dtype: object
['s3://esgf-world/CMIP6/CMIP

In [15]:
automatically_determine_filetype(filepath=flist[0], reader_options=r_options)

<FileType.netcdf3: 'netcdf3'>

In [16]:
with fsspec.open(flist[0], **r_options['storage_options']) as f:
    ds = xr.open_dataset(f, chunks={})
# cmip6_dataset_id(ds) # this would be a nice target for the warning todo below!!!
make_iid(ds)

'CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.none.6hrLev.ps.gn'

## Found two!

>[!WARNING]
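>The iid generation is broken: `make_iid` emits `variant_label` and `sub_experiment_id` as separate facets (hence the stray `none`) instead of a proper `member_id`. I need a reliable way to go from ds to iid that takes member_id generation into account. Where should this live? Test this against the pgf-esgf async client! [This is discussed here](https://github.com/jbusecke/xMIP/issues/291)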
> - [ ] Implement this upstream, release, and construct the iids above correctly! 
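
A minimal sketch of what the fix could look like, assuming the CMIP6 convention that `member_id` equals `variant_label` when `sub_experiment_id` is `none`, and `<sub_experiment_id>-<variant_label>` otherwise. The facet order in `IID_FACETS` is inferred from the `make_iid` output above, and both function names are hypothetical; this is not the xmip or pangeo-forge-esgf implementation.

```python
# Facet order inferred from the make_iid output above (assumption, no version facet).
IID_FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id", "grid_label",
]


def member_id_from_attrs(attrs: dict) -> str:
    """Combine sub_experiment_id and variant_label into a member_id."""
    variant = attrs["variant_label"]
    sub_exp = attrs.get("sub_experiment_id", "none")
    return variant if sub_exp == "none" else f"{sub_exp}-{variant}"


def make_iid_with_member_id(attrs: dict) -> str:
    """Build an iid (without version) from dataset attributes."""
    facets = dict(attrs)
    facets["member_id"] = member_id_from_attrs(attrs)
    return ".".join(facets[name] for name in IID_FACETS)


# Attributes matching the CanESM5 dataset opened earlier in the notebook.
example_attrs = {
    "mip_era": "CMIP6", "activity_id": "CMIP", "institution_id": "CCCma",
    "source_id": "CanESM5", "experiment_id": "historical",
    "variant_label": "r10i1p1f1", "sub_experiment_id": "none",
    "table_id": "Omon", "variable_id": "uo", "grid_label": "gn",
}
print(make_iid_with_member_id(example_attrs))
```

With a real dataset this would be called as `make_iid_with_member_id(ds.attrs)`. The version is not part of the file attributes used here, so it still has to come from somewhere else (for example the file path).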

### 'CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.none.6hrLev.ps.gn' - NETCDF3

S3 Filelist:
```
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195001010000-195412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195501010000-195912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_196001010000-196412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_196501010000-196912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_197001010000-197412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_197501010000-197912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_198001010000-198412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_198501010000-198912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_199001010000-199412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_199501010000-199912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_200001010000-200412311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_200501010000-200912311800.nc',
's3://esgf-world/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_201001010000-201412311800.nc'
```

### 'CMIP6.CMIP.CCCma.CanESM5.historical.r10i1p1f1.none.Omon.uo.gn' - NETCDF4(HDF5)
From the `proof-of-concept` notebook

```
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_187101-188012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_188101-189012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_186101-187012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_189101-190012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_190101-191012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_191101-192012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_192101-193012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_193101-194012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_194101-195012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_195101-196012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_196101-197012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_197101-198012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_198101-199012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_199101-200012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_200101-201012.nc',
's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_201101-201412.nc'
```
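
Two small observations about these lists: the S3 paths already encode every facet including the version, and the CanESM5 list above is not in chronological order (the `186101-187012` file comes after `188101-189012`). The sketch below reads a versioned iid off a path and sorts a file list by filename. Both helpers are hypothetical and assume the `s3://<bucket>/<10 facet directories>/<filename>` layout seen in the lists above.

```python
def iid_from_s3_path(path: str) -> str:
    """Derive a versioned iid from an esgf-world style path (hypothetical helper)."""
    parts = path.split("://", 1)[-1].split("/")
    facets = parts[1:-1]  # drop the bucket name and the filename
    if len(facets) != 10:
        raise ValueError(f"Expected 10 facet directories, got {len(facets)}: {path}")
    return ".".join(facets)


def sort_by_filename(paths: list) -> list:
    """Sort paths by filename; fixed-width time ranges make this chronological."""
    return sorted(paths, key=lambda p: p.rsplit("/", 1)[-1])


prefix = "s3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/"
example_paths = [
    prefix + "uo_Omon_CanESM5_historical_r10i1p1f1_gn_187101-188012.nc",
    prefix + "uo_Omon_CanESM5_historical_r10i1p1f1_gn_186101-187012.nc",
    prefix + "uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc",
]
print(iid_from_s3_path(example_paths[0]))
print(sort_by_filename(example_paths))
```

Sorting by filename rather than full path also makes it possible to line up files from different storage locations, since the filenames stay the same even when the prefix differs.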

## Get the equivalent ESGF urls

In [17]:
from pangeo_forge_esgf.async_client import ESGFAsyncClient, get_sorted_http_urls_from_iid_dict

In [18]:
iids = ["CMIP6.CMIP.CCCma.CanESM5.historical.r10i1p1f1.Omon.uo.gn.v20190429"]
async with ESGFAsyncClient() as client:
    res = await client.recipe_data(iids)

for iid, data in res.items():
    urls = get_sorted_http_urls_from_iid_dict(data)

urls

100%|██████████| 7/7 [00:01<00:00,  6.07it/s]
100%|██████████| 7/7 [00:01<00:00,  5.51it/s]


['http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc',
 'http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_186101-187012.nc',
 'http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_187101-188012.nc',
 'http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_188101-189012.nc',
 'http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_189101-190012.nc',
 'http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanES

In [19]:
iids = ["CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.6hrLev.ps.gn.v20181127"]
async with ESGFAsyncClient() as client:
    res = await client.recipe_data(iids)

for iid, data in res.items():
    urls = get_sorted_http_urls_from_iid_dict(data)

urls

100%|██████████| 7/7 [00:00<00:00,  7.08it/s]
100%|██████████| 7/7 [00:00<00:00, 12.80it/s]


['http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195001010000-195412311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195501010000-195912311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_196001010000-196412311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_196501010000-196912311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_197001010000-197412311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC