# Hello World!

Here's an example notebook with some documentation on how to access CMIP data.

In [1]:
%matplotlib inline

import xarray as xr
import intake

# util.py is in the local directory
# it contains code that is common across project notebooks
# or routines that are too extensive and might otherwise clutter
# the notebook design
import util 

In [2]:
print('hello world!')

hello world!


## Demonstrate how to use `intake-esm`
[Intake-esm](https://intake-esm.readthedocs.io) is a data cataloging utility that facilitates access to CMIP data. It's pretty awesome.

An `intake-esm` collection object establishes a link to a database that contains file locations and associated metadata (for example, which experiment and model each file comes from).

### Opening a collection
The first step is to open a collection by pointing to the collection definition file, which is a JSON file that conforms to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec).

The collection JSON files are stored locally in this repository for purposes of reproducibility---and because Cheyenne compute nodes don't have Internet access. 

The primary source for these files is the [intake-esm-datastore](https://github.com/NCAR/intake-esm-datastore) repository. Any changes made to these files should be pulled from that repo. For instance, the Pangeo cloud collection is available [here](https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json).
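For orientation, here's a minimal sketch of what a collection definition file looks like and how to inspect one with the standard library. The field names follow our reading of the ESM Collection Specification; the values (`demo-cmip6`, `demo.csv.gz`) are invented for illustration and are not the contents of the real catalog files.

```python
import json

# Hypothetical, heavily trimmed collection definition.
# Real files list many more attributes and also describe aggregation rules.
collection_text = """
{
  "esmcat_version": "0.1.0",
  "id": "demo-cmip6",
  "description": "Illustrative collection, not a real catalog",
  "catalog_file": "demo.csv.gz",
  "attributes": [
    {"column_name": "source_id"},
    {"column_name": "experiment_id"},
    {"column_name": "variable_id"}
  ],
  "assets": {"column_name": "path", "format": "netcdf"}
}
"""

collection = json.loads(collection_text)

# The searchable "attributes" become columns of the catalog table.
attribute_columns = [attr["column_name"] for attr in collection["attributes"]]

print(collection["id"], "->", collection["catalog_file"])
print("attribute columns:", attribute_columns)
print("asset column:", collection["assets"]["column_name"])
```

The JSON file itself is small; the bulk of the catalog lives in the table it points to via `catalog_file`.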

In [37]:
if util.is_ncar_host():
    col = intake.open_esm_datastore("../catalogs/glade-cmip6.json")
else:
    col = intake.open_esm_datastore("../catalogs/pangeo-cmip6.json")
col

glade-cmip6-ESM Collection with 687919 entries:
	> 12 activity_id(s)

	> 24 institution_id(s)

	> 47 source_id(s)

	> 66 experiment_id(s)

	> 162 member_id(s)

	> 35 table_id(s)

	> 1027 variable_id(s)

	> 12 grid_label(s)

	> 59 dcpp_init_year(s)

	> 246 version(s)

	> 6667 time_range(s)

	> 687919 path(s)

`intake-esm` is built on top of [pandas](https://pandas.pydata.org/pandas-docs/stable). It is possible to view the `pandas.DataFrame` as follows.

In [38]:
col.df.head()

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
0,AerChemMIP,BCC,BCC-ESM1,ssp370,r2i1p1f1,day,pr,gn,,v20190702,20150101-20551231,/glade/collections/cmip/CMIP6/AerChemMIP/BCC/B...
1,AerChemMIP,BCC,BCC-ESM1,ssp370,r2i1p1f1,Amon,hfls,gn,,v20190624,201501-205512,/glade/collections/cmip/CMIP6/AerChemMIP/BCC/B...
2,AerChemMIP,BCC,BCC-ESM1,ssp370,r2i1p1f1,Amon,prsn,gn,,v20190624,201501-205512,/glade/collections/cmip/CMIP6/AerChemMIP/BCC/B...
3,AerChemMIP,BCC,BCC-ESM1,ssp370,r2i1p1f1,Amon,va,gn,,v20190624,201501-205512,/glade/collections/cmip/CMIP6/AerChemMIP/BCC/B...
4,AerChemMIP,BCC,BCC-ESM1,ssp370,r2i1p1f1,Amon,tas,gn,,v20190624,201501-205512,/glade/collections/cmip/CMIP6/AerChemMIP/BCC/B...


It is possible to interact with the `DataFrame`; for instance, we can see what the "attributes" of the datasets are by printing the columns.

In [39]:
col.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'dcpp_init_year',
       'version', 'time_range', 'path'],
      dtype='object')

### Search and discovery

#### Finding unique entries
Let's query the data to see what models ("source_id"), experiments ("experiment_id") and temporal frequencies ("table_id") are available.

In [52]:
import pprint 
uni_dict = col.unique(['source_id', 'experiment_id', 'table_id'])

pprint.pprint(uni_dict, compact=True)

{'experiment_id': {'count': 66,
                   'values': ['ssp370', 'histSST-piNTCF', 'histSST',
                              'histSST-1950HC', 'hist-1950HC', 'hist-piNTCF',
                              'piClim-NTCF', 'ssp370SST-lowNTCF',
                              'ssp370-lowNTCF', 'ssp370SST', 'hist-bgc',
                              'esm-ssp585', 'amip-future4K', 'amip-m4K',
                              'a4SST', 'aqua-p4K', 'piSST', 'amip-4xCO2',
                              'a4SSTice', 'amip-p4K', 'aqua-control',
                              'aqua-4xCO2', 'abrupt-4xCO2', 'historical',
                              'piControl', 'amip', '1pctCO2', 'esm-piControl',
                              'esm-hist', 'ssp245', 'ssp585', 'ssp126',
                              'hist-GHG', 'hist-aer', 'dcppA-hindcast',
                              'dcppC-hindcast-noPinatubo',
                              'dcppC-hindcast-noElChichon', 'dcppA-assim',
                              'dcp
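To make the shape of that result concrete, here's a small pandas sketch that builds the same kind of `{'count': ..., 'values': [...]}` summary from a toy table. The helper `unique_summary` and the rows are hypothetical; this illustrates the idea and is not how `intake-esm` implements `col.unique`.

```python
import pandas as pd

# Toy stand-in for col.df; the rows are invented for illustration.
df = pd.DataFrame({
    "source_id": ["CESM2", "CESM2", "CanESM5", "MIROC6"],
    "experiment_id": ["historical", "ssp585", "historical", "historical"],
    "table_id": ["Amon", "Amon", "Omon", "Amon"],
})


def unique_summary(frame, columns):
    """Return {column: {'count': n, 'values': [...]}} for each requested column."""
    summary = {}
    for column in columns:
        # unique() keeps values in order of first appearance
        values = frame[column].dropna().unique().tolist()
        summary[column] = {"count": len(values), "values": values}
    return summary


uni = unique_summary(df, ["source_id", "experiment_id", "table_id"])
print(uni)
```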

#### Searching for specific datasets

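Let's search for specific datasets. The first query returns every entry for the `dcppC-hindcast-noPinatubo` experiment; the second narrows the search to monthly surface temperature (`ts` in the `Amon` table) from CESM2 for the `historical` experiment.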

In [41]:
cat = col.search(experiment_id=['dcppC-hindcast-noPinatubo']) #table_id='Amon', variable_id='ts', grid_label='gn')
cat.df

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
563285,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r8i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563286,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r5i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563287,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r1i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563288,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r7i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563289,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r9i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563290,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r10i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563291,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r2i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563292,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r4i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...
563293,DCPP,CCCma,CanESM5,dcppC-hindcast-noPinatubo,r3i1p2f1,Amon,tas,gn,1990.0,v20190429,199101-200012,/glade/collections/cmip/CMIP6/DCPP/CCCma/CanES...


In [53]:
cat = col.search(source_id=['CESM2'],experiment_id=['historical'], table_id='Amon', variable_id='ts')
#grid_label='gn'
cat.df

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
88140,CMIP,NCAR,CESM2,historical,r2i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
90571,CMIP,NCAR,CESM2,historical,r5i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
92806,CMIP,NCAR,CESM2,historical,r1i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
95041,CMIP,NCAR,CESM2,historical,r4i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
97276,CMIP,NCAR,CESM2,historical,r3i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101883,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,200001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101884,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,190001-194912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101885,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,195001-199912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101886,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,185001-189912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
106849,CMIP,NCAR,CESM2,historical,r8i1p1f1,Amon,ts,gn,,v20190311,190001-194912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...


It might be desirable to get more specific. For instance, we may want to select only the models that have *both* `historical` and `ssp585` data. We could do this as follows.

In [54]:
models = set(uni_dict['source_id']['values']) # all the models

for experiment_id in ['historical']:
    query = dict(experiment_id=experiment_id, table_id='Amon', 
                 variable_id='ts')  
    #grid_label='gn'
    cat = col.search(**query)
    models = models.intersection(cat.df.source_id.unique())

## ensure the CESM2 models are not included (oxygen was erroneously submitted to the archive)
#models = models - {'CESM2-WACCM', 'CESM2'}

models = list(models)
models

['SAM0-UNICON',
 'GFDL-ESM4',
 'CAMS-CSM1-0',
 'GFDL-CM4',
 'BCC-CSM2-MR',
 'NESM3',
 'BCC-ESM1',
 'MCM-UA-1-0',
 'CanESM5',
 'UKESM1-0-LL',
 'CESM2',
 'IPSL-CM6A-LR',
 'EC-Earth3-Veg',
 'GISS-E2-1-G',
 'CNRM-CM6-1',
 'E3SM-1-0',
 'HadGEM3-GC31-LL',
 'GISS-E2-1-H',
 'EC-Earth3',
 'NorESM2-LM',
 'MIROC-ES2L',
 'MIROC6',
 'FGOALS-g3',
 'CNRM-ESM2-1',
 'CESM2-WACCM',
 'MRI-ESM2-0']
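Note that the cell above, as written, loops over `historical` only, so it returns every model with `historical` data. Here's a self-contained sketch of the intersection idea with both experiments listed, using invented `(source_id, experiment_id)` pairs in place of the catalog.

```python
# Toy (source_id, experiment_id) pairs; invented for illustration.
records = [
    ("CESM2", "historical"), ("CESM2", "ssp585"),
    ("CanESM5", "historical"), ("CanESM5", "ssp585"),
    ("MIROC6", "historical"),
]

required = ["historical", "ssp585"]

models = {source for source, _ in records}  # start from all models
for experiment in required:
    have_it = {source for source, exp in records if exp == experiment}
    models &= have_it  # drop models that lack this experiment

print(sorted(models))
```

Here `MIROC6` drops out because it has no `ssp585` entry in the toy data.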

In [57]:
cat = col.search(source_id=['CESM2'], experiment_id=['historical'], table_id='Amon', 
                 variable_id='ts')#, source_id=models
cat.df

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
88140,CMIP,NCAR,CESM2,historical,r2i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
90571,CMIP,NCAR,CESM2,historical,r5i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
92806,CMIP,NCAR,CESM2,historical,r1i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
95041,CMIP,NCAR,CESM2,historical,r4i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
97276,CMIP,NCAR,CESM2,historical,r3i1p1f1,Amon,ts,gn,,v20190308,185001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101883,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,200001-201412,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101884,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,190001-194912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101885,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,195001-199912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
101886,CMIP,NCAR,CESM2,historical,r9i1p1f1,Amon,ts,gn,,v20190311,185001-189912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
106849,CMIP,NCAR,CESM2,historical,r8i1p1f1,Amon,ts,gn,,v20190311,190001-194912,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...


### Loading data

`intake-esm` enables loading data directly into an [xarray.Dataset](http://xarray.pydata.org/en/stable/api.html#dataset).

Note that data on the cloud are in 
[zarr](https://zarr.readthedocs.io/en/stable/) format and data on 
[glade](https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade-file-spaces) are stored as 
[netCDF](https://www.unidata.ucar.edu/software/netcdf/) files. `intake-esm` hides this difference from the user.

`intake-esm` has rules for aggregating datasets; these rules are defined in the collection-specification file.

In [58]:
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times': False}, 
                                cdf_kwargs={'chunks': {}, 'decode_times': False})


xarray will load netCDF datasets with dask using a single chunk for all arrays.
For effective chunking, please provide chunks in cdf_kwargs.
For example: cdf_kwargs={'chunks': {'time': 36}}

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There will be 1 group(s)




`dset_dict` is a dictionary of `xarray.Dataset` objects; its keys are constructed to refer to compatible groups.

In [59]:
dset_dict.keys()

dict_keys(['CMIP.NCAR.CESM2.historical.Amon.gn'])
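The key pattern reported in the log output can be reproduced by grouping the catalog rows on those six columns and joining the values with dots. Here's a minimal pandas sketch on an invented table; it shows how the keys relate to the rows and is not the `intake-esm` code path.

```python
import pandas as pd

group_columns = ["activity_id", "institution_id", "source_id",
                 "experiment_id", "table_id", "grid_label"]

# Toy catalog rows; invented for illustration.
df = pd.DataFrame({
    "activity_id": ["CMIP", "CMIP", "CMIP"],
    "institution_id": ["NCAR", "NCAR", "CCCma"],
    "source_id": ["CESM2", "CESM2", "CanESM5"],
    "experiment_id": ["historical"] * 3,
    "member_id": ["r1i1p1f1", "r2i1p1f1", "r1i1p1f1"],
    "table_id": ["Amon"] * 3,
    "grid_label": ["gn"] * 3,
})

# One dictionary entry per compatible group; members within a group
# are what get combined along the ensemble dimension.
members = {
    ".".join(key): sorted(rows["member_id"])
    for key, rows in df.groupby(group_columns)
}
print(members)
```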

We can access a particular dataset as follows.

In [60]:
dset_dict['CMIP.NCAR.CESM2.historical.Amon.gn']

<xarray.Dataset>
Dimensions:    (lat: 192, lon: 288, member_id: 11, nbnd: 2, time: 1980)
Coordinates:
  * time       (time) float64 6.749e+05 6.749e+05 ... 7.351e+05 7.351e+05
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * member_id  (member_id) <U9 'r10i1p1f1' 'r11i1p1f1' ... 'r8i1p1f1' 'r9i1p1f1'
Dimensions without coordinates: nbnd
Data variables:
    lon_bnds   (lon, nbnd) float64 -0.625 0.625 0.625 ... 358.1 358.1 359.4
    lat_bnds   (lat, nbnd) float64 -90.0 -89.53 -89.53 ... 89.53 89.53 90.0
    time_bnds  (time, nbnd) float64 dask.array<chunksize=(600, 2), meta=np.ndarray>
    ts         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
Attributes:
    parent_variant_label:   r1i1p1f1
    forcing_index:          1
    tracking_id:            hdl:21.14100/20016537-e6d1-403a-be0f-7facdee33089...
    parent_activity_id:     C

In [67]:
%matplotlib inline
import intake
import xarray as xr
import numpy as np

In [68]:
def _compute_slope(y):
    """
    Private function to compute slopes at each grid cell using
    polyfit. 
    """
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0] # return only the slope

def compute_slope(da):
    """
    Computes linear slope (m) at each grid cell.
    
    Args:
      da: xarray DataArray to compute slopes for
      
    Returns:
      xarray DataArray with slopes computed at each grid cell.
    """
    # apply_ufunc can apply a raw numpy function to a grid.
    # 
    # vectorize is only needed for functions that aren't already
    # vectorized. You don't need it for polyfit in theory, but it's
    # good to use when using things like np.cov.
    #
    # dask='parallelized' parallelizes this across dask chunks. It requires
    # an output_dtypes of the numpy array datatype coming out.
    #
    # input_core_dims should pass the dimension that is being *reduced* by this operation,
    # if one is being reduced.
    slopes = xr.apply_ufunc(_compute_slope,
                            da,
                            vectorize=True,
                            dask='parallelized', 
                            input_core_dims=[['time']],
                            output_dtypes=[float],
                            )
    return slopes
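As a sanity check on the slope calculation, here's a NumPy-only sketch on synthetic data with known trends. It relies on `np.polyfit` accepting a 2-D `y` and fitting each column, so the grid is flattened to `(time, cells)` first. The array sizes and values are made up, and cells containing NaN (for example, land points in an ocean field) would need to be masked out before a fit like this.

```python
import numpy as np

n_time, n_lat, n_lon = 24, 3, 4
t = np.arange(n_time)

# Synthetic field: every grid cell has a known linear trend plus an offset.
true_slopes = np.linspace(-1.0, 1.0, n_lat * n_lon).reshape(n_lat, n_lon)
field = true_slopes[np.newaxis, :, :] * t[:, np.newaxis, np.newaxis] + 5.0

# Flatten space so each column is one grid cell's time series.
flat = field.reshape(n_time, -1)

# polyfit returns coefficients with shape (2, n_cells); row 0 is the slope.
slopes = np.polyfit(t, flat, 1)[0].reshape(n_lat, n_lon)

print(np.allclose(slopes, true_slopes))
```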

In [None]:
col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")

In [None]:
# Just select CESM2 here for a single ensemble member.
cat = col.search(member_id='r1i1p1f1',
                 experiment_id='historical',
                 activity_id='CMIP',
                 table_id='Omon',
                 variable_id='spco2',
                 grid_label='gn',
                 source_id='CESM2')

In [None]:
# Chunk over the full time dimension since we're computing slope over the time dimension.
dsets = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": -1}})

In [None]:
ds = dsets['CMIP.NCAR.CESM2.historical.Omon.gn'].squeeze()

In [None]:
single_member = ds['spco2'].load() # Load the single member INTO MEMORY.

In [None]:
%%time
slopes = compute_slope(single_member)

In [None]:
slopes.plot()

In [None]:
#https://nbviewer.jupyter.org/gist/bradyrx/41e4fa86a92908deecd422503b62a29b

### More advanced queries

As motivation for diving into more advanced manipulations with `intake-esm`, let's consider the task of getting access to grid information in the `Ofx` table_id.

In [64]:
cat_fx = col.search(source_id=['CESM2'], member_id='r1i1p1f1', experiment_id=['historical'], table_id='fx')
cat_fx.df

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
92645,CMIP,NCAR,CESM2,historical,r1i1p1f1,fx,sftgif,gn,,v20190308,,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
92646,CMIP,NCAR,CESM2,historical,r1i1p1f1,fx,sftlf,gn,,v20190308,,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
92647,CMIP,NCAR,CESM2,historical,r1i1p1f1,fx,orog,gn,,v20190308,,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...
92648,CMIP,NCAR,CESM2,historical,r1i1p1f1,fx,areacella,gn,,v20190308,,/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/...


This, however, comes with lots of redundant information.

Additionally, it may be necessary to do more targeted manipulations of the search. For instance, we've found a handful of corrupted files on `glade` and might need to work around loading these. 

As an illustration of this, in the code below, we specify a list of queries (in this case one) to eliminate.

In [21]:
import numpy as np

# specify a list of queries to eliminate
corrupt_data = [dict(variable_id='areacello', source_id='IPSL-CM6A-LR',
                     experiment_id='historical', member_id='r2i1p1f1')
               ]


# copy the dataframe 
df = cat_fx.df.copy()

# eliminate data
for elim in corrupt_data:
    condition = np.ones(len(df), dtype=bool)
    for key, val in elim.items():
        condition = condition & (df[key] == val)
    df = df.loc[~condition]
df    

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,dcpp_init_year,version,time_range,path
294145,CMIP,CCCma,CanESM5,historical,r2i1p1f1,Ofx,areacello,gn,,v20190429,,/glade/collections/cmip/CMIP6/CMIP/CCCma/CanES...
294146,CMIP,CCCma,CanESM5,historical,r2i1p1f1,Ofx,thkcello,gn,,v20190429,,/glade/collections/cmip/CMIP6/CMIP/CCCma/CanES...
294147,CMIP,CCCma,CanESM5,historical,r2i1p1f1,Ofx,deptho,gn,,v20190429,,/glade/collections/cmip/CMIP6/CMIP/CCCma/CanES...
294684,CMIP,CCCma,CanESM5,historical,r5i1p1f1,Ofx,areacello,gn,,v20190429,,/glade/collections/cmip/CMIP6/CMIP/CCCma/CanES...
294685,CMIP,CCCma,CanESM5,historical,r5i1p1f1,Ofx,thkcello,gn,,v20190429,,/glade/collections/cmip/CMIP6/CMIP/CCCma/CanES...
...,...,...,...,...,...,...,...,...,...,...,...,...
636507,ScenarioMIP,CCCma,CanESM5,ssp585,r6i1p1f1,Ofx,deptho,gn,,v20190429,,/glade/collections/cmip/CMIP6/ScenarioMIP/CCCm...
686900,ScenarioMIP,MIROC,MIROC-ES2L,ssp585,r1i1p1f2,Ofx,sftof,gn,,v20190823,,/glade/collections/cmip/CMIP6/ScenarioMIP/MIRO...
687632,ScenarioMIP,IPSL,IPSL-CM6A-LR,ssp585,r1i1p1f1,Ofx,areacello,gn,,v20190119,,/glade/collections/cmip/CMIP6/ScenarioMIP/IPSL...
687633,ScenarioMIP,IPSL,IPSL-CM6A-LR,ssp585,r1i1p1f1,Ofx,hfgeou,gn,,v20190119,,/glade/collections/cmip/CMIP6/ScenarioMIP/IPSL...
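To see what the elimination loop does at the row level, here's the same pattern on a three-row toy table. Only the row that matches *every* key in the query is dropped; rows that match some keys but not all are kept. The data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy catalog rows; invented for illustration.
df = pd.DataFrame({
    "source_id": ["IPSL-CM6A-LR", "IPSL-CM6A-LR", "CanESM5"],
    "member_id": ["r2i1p1f1", "r1i1p1f1", "r2i1p1f1"],
    "variable_id": ["areacello", "areacello", "areacello"],
})

exclude = [dict(variable_id="areacello", source_id="IPSL-CM6A-LR",
                member_id="r2i1p1f1")]

for query in exclude:
    matches = np.ones(len(df), dtype=bool)
    for column, value in query.items():
        # AND the conditions together: all keys must match
        matches &= (df[column] == value).to_numpy()
    df = df.loc[~matches]

print(df)
```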


We then drop duplicates.

In [22]:
df.drop_duplicates(subset=['source_id', 'variable_id'], inplace=True)

Now, since we've only retained one ensemble member, we need to eliminate that column. If we omit this step, `intake-esm` will throw an error, complaining that different variables are present for each ensemble member. Setting the `member_id` column to NaN precludes attempts to join along the ensemble dimension.

After this final manipulation, we copy the `DataFrame` back to the collection object and proceed with loading the data.

In [23]:
df['member_id'] = np.nan
cat_fx.df = df
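The two manipulations above are easy to check on a toy table: `drop_duplicates` keeps the first row for each `(source_id, variable_id)` pair, and assigning NaN blanks out the `member_id` column. The rows below are invented; the sketch uses a copy instead of `inplace=True` so the original table is left untouched.

```python
import numpy as np
import pandas as pd

# Toy catalog rows; invented for illustration.
df = pd.DataFrame({
    "source_id": ["CanESM5", "CanESM5", "CanESM5", "MIROC-ES2L"],
    "member_id": ["r2i1p1f1", "r5i1p1f1", "r2i1p1f1", "r1i1p1f2"],
    "variable_id": ["areacello", "areacello", "deptho", "sftof"],
})

# Keep one row per (model, variable); the first occurrence wins.
deduped = df.drop_duplicates(subset=["source_id", "variable_id"]).copy()

# Blank out the ensemble member so rows are not grouped by member.
deduped["member_id"] = np.nan

print(deduped)
```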

In [24]:
fx_dsets = cat_fx.to_dataset_dict(zarr_kwargs={'consolidated': True}, cdf_kwargs={'chunks': {}})


xarray will load netCDF datasets with dask using a single chunk for all arrays.
For effective chunking, please provide chunks in cdf_kwargs.
For example: cdf_kwargs={'chunks': {'time': 36}}

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There will be 3 group(s)


In [25]:
fx_dsets.keys()

dict_keys(['CMIP.CCCma.CanESM5.historical.Ofx.gn', 'CMIP.IPSL.IPSL-CM6A-LR.historical.Ofx.gn', 'CMIP.MIROC.MIROC-ES2L.historical.Ofx.gn'])

In [26]:
for key, ds in fx_dsets.items():
    print(ds.data_vars)

Data variables:
    latitude            (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
    longitude           (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
    vertices_latitude   (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    vertices_longitude  (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    areacello           (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
    lev_bnds            (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    thkcello            (lev, j, i) float32 dask.array<chunksize=(45, 291, 360), meta=np.ndarray>
    deptho              (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
    type                |S3 ...
    sftof               (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
Data variables:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon  

## Demonstrate how to spin up a dask cluster

If you expect to require Big Data capabilities, here's how you spin up a [dask](https://dask.org) cluster using [dask-jobqueue](https://dask-jobqueue.readthedocs.io/en/latest/).

The syntax is different if on an NCAR machine versus the cloud.

In [27]:
if util.is_ncar_host():
    from ncar_jobqueue import NCARCluster
    cluster = NCARCluster(project='UCGD0006')
    cluster.adapt(minimum_jobs=1, maximum_jobs=10)
else:
    from dask_kubernetes import KubeCluster
    cluster = KubeCluster()
    cluster.adapt(minimum=1, maximum=10)
cluster

PermissionError: [Errno 13] Permission denied: '/ncar/usr/jupyterhub/envs/cmip6-201910/lib/python3.7/site-packages/ncar_jobqueue/jobqueue.yaml'

In [28]:
from dask.distributed import Client
client = Client(cluster) # Connect this local process to remote workers
client

NameError: name 'cluster' is not defined
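The two errors above are linked: the cluster constructor failed, so `cluster` was never assigned and the next cell could not find it. One way to make this more robust is to try several cluster factories in order and use the first that works. Here's a sketch of that pattern; `first_working` is a hypothetical helper, and the factories are placeholders so the example runs without dask installed.

```python
def first_working(factories):
    """Call each zero-argument factory in order; return (name, result) of the first success."""
    errors = {}
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as err:  # broad on purpose: config, permission, and import errors all count
            errors[name] = err
    raise RuntimeError(f"no factory succeeded: {errors}")


# Placeholder factories so the sketch runs anywhere.
def broken_factory():
    raise PermissionError(13, "Permission denied")


def local_factory():
    return "local-cluster-placeholder"


name, cluster = first_working([("ncar", broken_factory), ("local", local_factory)])
print(name, cluster)
```

In the notebook, the first factory could wrap the `NCARCluster(...)` call and the fallback could be `dask.distributed.LocalCluster`, which runs workers on the local machine. Whether a local cluster is adequate depends on the size of the data being processed.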