# Building an Intake-esm catalog from CESM2 History Files

This example covers how to build an intake-esm catalog from Community Earth System Model v2 (CESM2) model output. In this case, we use output from a run with the default component set (compset) detailed in the [CESM Quickstart Guide](https://escomp.github.io/CESM/versions/cesm2.1/html/).

## What's a "history" file?
A history file is the default output from CESM, where each file is a single time "slice" with every variable from the component of interest. These files can be difficult to work with, since one is often interested in a time series of a single variable. Building a catalog helps with accessing your data, querying for certain variables, and potentially creating timeseries files later down the road.

Let's get started!

## Imports
The only parts of ecgtools we need are the `Builder` object and the `parse_cesm_history` parser from the CESM parsers! We import `glob` to take a look at the files we are parsing.

In [None]:
from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_history
import glob

### Understanding the Directory Structure

The first step to setting up the `Builder` object is determining where your files are stored. As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in `/glade/work/mgrover/cesm_test_data/`.

Taking a look at that directory, we see that there is a single case, `b.e20.B1850.f19_g17.test`:

In [None]:
glob.glob('/glade/work/mgrover/cesm_test_data/*')

Once we go into that directory, we see all the different components, including the atmosphere (atm), ocean (ocn), and land (lnd)!

In [None]:
glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/*')

If we go one step further, we notice that each component contains a `hist` directory, which holds the model output:

In [None]:
glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/*/*.nc')[0:3]

If we take a look at the `ocn` component though, we notice that there are a few timeseries files in there...

In [None]:
glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/*/*.nc')[0:3]

When we set up our catalog builder, we will need to exclude the timeseries (`tseries`) and restart (`rest`) directories!
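
To get a feel for what excluding these directories means in practice, here is a minimal sketch using Python's standard `fnmatch` module. It assumes glob-style patterns matched against the full file path; the file paths and the `is_excluded` helper are hypothetical, and ecgtools may implement its filtering differently.

```python
from fnmatch import fnmatch

# Hypothetical file paths that mimic the directory layout described above
paths = [
    "/data/case/atm/hist/case.cam.h0.0001-01.nc",
    "/data/case/ocn/hist/case.pop.h.0001-01.nc",
    "/data/case/ocn/tseries/case.pop.h.TEMP.000101-001012.nc",
    "/data/case/rest/0002-01-01-00000/case.cam.r.0002-01-01-00000.nc",
]

# Glob-style patterns for the directories we want to skip
exclude_patterns = ["*/tseries/*", "*/rest/*"]


def is_excluded(path, patterns):
    """Return True if the path matches any glob-style exclusion pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


# Keep only the files that do not match an exclusion pattern
kept = [p for p in paths if not is_excluded(p, exclude_patterns)]
for p in kept:
    print(p)
```

Only the two files under `hist` survive the filter in this sketch.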

Now that we understand the directory structure, let's make the catalog.

## Build the catalog!

Let's start by inspecting the `Builder` object:

In [None]:
Builder?

In [None]:
b = Builder(
    # Directory with the output
    "/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/",
    
    # Depth of 1 since we are sending it to the case output directory
    depth=1,
    
    # Use the parse_cesm_history parsing function
    parsing_func=parse_cesm_history,
    
    # Exclude the timeseries and restart directories
    exclude_patterns=["*/tseries/*", "*/rest/*"],
    
    # Number of jobs to execute - should be equal to # threads you are using
    njobs=5,
)

Double check the object is set up...

In [None]:
b

We are good to go! Let's build the catalog by calling `.build()` on the object! By default, it will use the `LokyBackend`, which is described in the [Joblib documentation](https://joblib.readthedocs.io/en/latest/parallel.html).

In [None]:
b = b.build()

## Inspect the Catalog

Now that the catalog is built, we can inspect the dataframe used to create the catalog by calling `.df` on the builder object:

In [None]:
b.df

The resultant dataframe includes the following columns:
* Component
* Stream
* Case
* Date
* Frequency
* Variables
* Path

We can also check to see which files ***were not*** parsed by calling `.invalid_assets`

In [None]:
b.invalid_assets

In [None]:
b.invalid_assets.INVALID_ASSET.values[0]

It appears that one of the invalid assets is a `pop.hv` stream, which is a time-invariant dataset we would not necessarily be interested in looking at. If there is a file you think ***should*** be included in the resultant catalog but isn't, be sure to add it to the `_STREAMS_DICT` used in the [parsing tool](https://github.com/NCAR/ecgtools/blob/main/ecgtools/parsers/cesm.py)!
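
To illustrate why a file with an unrecognized stream ends up unparsed, here is a simplified sketch of stream-based file name parsing. It is not the ecgtools implementation: the `EXAMPLE_STREAMS` table, the `split_history_filename` helper, and the file names are all hypothetical, and the sketch assumes the common `<case>.<stream>.<date>.nc` naming pattern.

```python
from pathlib import Path

# Hypothetical, heavily trimmed stream table for illustration only;
# the real parser maintains its own, larger mapping
EXAMPLE_STREAMS = {"cam.h0": "atm", "clm2.h0": "lnd", "pop.h": "ocn"}


def split_history_filename(path, streams=EXAMPLE_STREAMS):
    """Split a history file name into catalog fields, or return None."""
    name = Path(path).name
    # Check longer stream names first so the most specific one wins
    for stream in sorted(streams, key=len, reverse=True):
        marker = f".{stream}."
        if marker in name:
            case, _, rest = name.partition(marker)
            date = rest[:-3] if rest.endswith(".nc") else rest
            return {
                "component": streams[stream],
                "stream": stream,
                "case": case,
                "date": date,
            }
    # No known stream found: the file would be reported as invalid
    return None


parsed = split_history_filename(
    "/some/dir/b.e20.B1850.f19_g17.test.cam.h0.0001-01.nc"
)
unparsed = split_history_filename(
    "/some/dir/b.e20.B1850.f19_g17.test.pop.hv.nc"
)
print(parsed)
print(unparsed)
```

In this sketch, the `pop.hv` file returns `None` because no entry in the table matches it, which mirrors the idea of adding a missing stream to the lookup table.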

## Save the Catalog

In [None]:
b.save(
    # File path - could save as .csv (uncompressed csv) or .csv.gz (compressed csv)
    "/glade/work/mgrover/cesm-hist-test.csv",
    
    # Column name including filepath
    path_column='path',
    
    # Column name including variables
    variable_column='variables',
    
    # Data file format - could be netcdf or zarr (in this case, netcdf)
    data_format="netcdf",
    
    # Which attributes to groupby when reading in variables using intake-esm
    groupby_attrs=["component", "stream", "case"],
    
    # Aggregations which are fed into xarray when reading in data using intake
    aggregations=[
        {
            "type": "join_existing",
            "attribute_name": "date",
            "options": {"dim": "time", "coords": "minimal", "compat": "override"},
        }
    ],
)


## Using the Catalog

You'll notice that the resulting file paths are printed when calling `.save`. You can pass the path of the JSON file to intake-esm's `open_esm_datastore` function.

### Additional Imports

In [None]:
# Import intake-esm
import intake

# Import ast which helps with parsing the list of variables
import ast

### Use the catalog to read in data
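
The `converters` entry passed through `csv_kwargs` is needed because a list of variables is stored in a CSV file as plain text. The following standalone sketch, which uses a hypothetical in-memory CSV rather than the real catalog, shows how `ast.literal_eval` turns that text back into a Python list.

```python
import ast
import csv
import io

# Write a hypothetical one-row catalog to an in-memory CSV file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["path", "variables"])
writer.writerow(["/hypothetical/file.nc", ["TEMP", "SALT"]])

# Read it back: the list comes back as a string
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["variables"]

# literal_eval safely evaluates the string into a real list
parsed = ast.literal_eval(raw)

print(type(raw).__name__, raw)
print(type(parsed).__name__, parsed)
```

Without the converter, a search on the `variables` column would be comparing against one long string per file rather than against individual variable names.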

In [None]:
col = intake.open_esm_datastore(
    "/glade/work/mgrover/cesm-hist-test.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)
col

In [None]:
# The case name must match the catalog entry exactly (note the capital "B" in B1850)
cat = col.search(variables='TEMP', 
                 case='b.e20.B1850.f19_g17.test')
cat

In [None]:
dsets = cat.to_dataset_dict(cdf_kwargs={'use_cftime': True, 'chunks': {'time':10}})
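
The resulting dictionary is keyed by the values of the `groupby_attrs` columns, joined with the `sep` character given when opening the catalog. The sketch below shows how such a key could be assembled by hand; the catalog row is hypothetical, and the `pop.h` stream value is an assumption about what the parser assigns to the monthly ocean output.

```python
# Settings used earlier when saving and opening the catalog
groupby_attrs = ["component", "stream", "case"]
sep = "/"

# Hypothetical catalog row; the stream name is an assumption
row = {
    "component": "ocn",
    "stream": "pop.h",
    "case": "b.e20.B1850.f19_g17.test",
    "variables": ["TEMP"],
}

# Join the grouping attributes in order to form the dataset key
key = sep.join(str(row[attr]) for attr in groupby_attrs)
print(key)
```

Calling `dsets.keys()` shows the actual keys, and a single dataset can then be retrieved with `dsets[key]`.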