# Generate an [Intake-ESM](https://intake-esm.readthedocs.io/en/latest/) Catalog Using [ECGTools](https://ecgtools.readthedocs.io/en/latest/)
In this notebook, we use the data directory specified in `_config_calc.yml` to build a data catalog that is used throughout the analysis.

In [22]:
import yaml

from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_timeseries

In [14]:
with open('_config_calc.yml') as fid:
    # safe_load is sufficient for a plain configuration file and avoids constructing arbitrary Python objects
    config_dict = yaml.safe_load(fid)
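
The rest of this notebook reads two keys from the configuration: `esm_data_dir` and `esm_collection`. Checking for them right after loading can catch a misconfigured file early. The following is a minimal sketch, assuming the configuration is a flat mapping. The helper name `check_config` and the example values are hypothetical and are not part of the original notebook.

```python
REQUIRED_KEYS = ("esm_data_dir", "esm_collection")


def check_config(config):
    """Return the required keys that are missing from the config mapping."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical stand-in for the dictionary loaded from _config_calc.yml
example_config = {
    "esm_data_dir": "/path/to/data",
    "esm_collection": "data/catalog.json",
}

missing = check_config(example_config)
if missing:
    raise KeyError(f"_config_calc.yml is missing keys: {missing}")
```

In the notebook itself, `config_dict` would take the place of `example_config`.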

## Set Up the Builder
We create the builder object here, pointing it at the data directory given by the `esm_data_dir` entry in `_config_calc.yml`.

In [11]:
b = Builder(config_dict['esm_data_dir'])

## Build the Catalog
When we build the catalog, we pass the `parse_cesm_timeseries` parser, which extracts the catalog attributes from each file.

In [23]:
b.build(parse_cesm_timeseries)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done  43 out of  43 | elapsed:    1.7s finished


Builder(root_path=PosixPath('/glade/scratch/mclong/cesm2-marbl-data_nc'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)

In [54]:
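def add_experiment_to_dataframe(df):
    """Add an 'experiment' column built from the second and third fields of the case name."""
    # Split case names on '.', one column per field
    case_split = df.case.str.split('.', expand=True)
    experiment = case_split.iloc[:, 1] + '.' + case_split.iloc[:, 2]
    # Cases with fewer than three fields yield NaN; label them 'historical'
    df['experiment'] = experiment.fillna('historical')
    return df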

In [56]:
b.df = add_experiment_to_dataframe(b.df)
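
To make the effect of `add_experiment_to_dataframe` concrete, here is a sketch of the same logic applied to a single case name in plain Python. The helper `experiment_from_case` and the case names are illustrative assumptions, not values read from the catalog. The sketch matches the pandas version only when at least one case in the catalog has three or more dot-separated fields.

```python
def experiment_from_case(case):
    """Mirror, for one case name, the per-row logic of add_experiment_to_dataframe."""
    parts = case.split(".")
    if len(parts) < 3:
        # The pandas version produces NaN here, which fillna replaces with 'historical'
        return "historical"
    # Join the second and third fields of the case name
    return parts[1] + "." + parts[2]


# Illustrative case names (assumptions, not taken from the catalog)
for case in ["b.e21.BHIST.f09_g17.CMIP6-historical.001", "shortname"]:
    print(case, "->", experiment_from_case(case))
```

The first name maps to `e21.BHIST`, while the second has too few fields and falls back to `historical`.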

## Save the Catalog
Now that we have built the catalog, we save it to disk under the file name given by the `esm_collection` entry in `_config_calc.yml`.

In [57]:
b.save(
    config_dict['esm_collection'],
    # Name of the column that holds the file paths
    path_column_name='path',
    # Name of the column that holds the variable names
    variable_column_name='variable',
    # Data file format: netcdf or zarr (netcdf in this case)
    data_format="netcdf",
    # Attributes intake-esm groups by when reading in variables
    groupby_attrs=["component", "experiment", "stream"],
    # Aggregations passed to xarray when intake-esm reads in the data
    aggregations=[
        {
            "type": "join_existing",
            "attribute_name": "time_range",
            "options": {"dim": "time", "coords": "minimal", "compat": "override"},
        }
    ],
)

Saved catalog location: data/cesm2-cmip6-timeseries.json and data/cesm2-cmip6-timeseries.json
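
The `groupby_attrs` setting controls which catalog rows are combined into one dataset: rows that share the same `component`, `experiment`, and `stream` values end up together. The following is a rough pure-Python sketch of that grouping idea. The rows are made-up examples, and the dot-joined key only imitates the style of key intake-esm produces. The sketch does not call intake-esm.

```python
from collections import defaultdict

GROUPBY_ATTRS = ["component", "experiment", "stream"]

# Made-up catalog rows for illustration only
rows = [
    {"component": "ocn", "experiment": "e21.BHIST", "stream": "pop.h",
     "variable": "TEMP", "time_range": "185001-189912"},
    {"component": "ocn", "experiment": "e21.BHIST", "stream": "pop.h",
     "variable": "TEMP", "time_range": "190001-194912"},
    {"component": "atm", "experiment": "e21.BHIST", "stream": "cam.h0",
     "variable": "TS", "time_range": "185001-189912"},
]

# Collect the time ranges of all rows that share the same grouping attributes
groups = defaultdict(list)
for row in rows:
    key = ".".join(row[attr] for attr in GROUPBY_ATTRS)
    groups[key].append(row["time_range"])

for key, time_ranges in groups.items():
    print(key, time_ranges)
```

Within each group, the `join_existing` aggregation on `time_range` then tells xarray to concatenate the files along the existing `time` dimension.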
