# Tutorial


In [None]:
%load_ext autoreload
%autoreload 2

## Builder


To begin, import `ecgtools` and instantiate a builder object.


In [None]:
import ecgtools

In [None]:
# Display the `Builder` signature and docstring (IPython `?` help syntax)
ecgtools.Builder?

In [None]:
builder = ecgtools.Builder(
    root_path="../../sample_data/cmip/CMIP6/",
    extension="*.nc",
    depth=4,
    exclude_patterns=["*/files/*", "*/latest/*"],
)
builder
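
To build some intuition for what `exclude_patterns` is meant to do, here is a minimal sketch using shell-style wildcard matching from the standard library. It assumes the patterns behave like `fnmatch` globs, which this tutorial does not confirm for `ecgtools` itself; the paths and the `is_excluded` helper are hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical relative paths, for illustration only
candidate_paths = [
    "CMIP/NCAR/CESM2/historical/tas_Amon.nc",
    "CMIP/NCAR/CESM2/files/d20190308/tas_Amon.nc",
    "CMIP/NCAR/CESM2/latest/tas_Amon.nc",
]
exclude_patterns = ["*/files/*", "*/latest/*"]


def is_excluded(path, patterns):
    """Return True if `path` matches any shell-style pattern (assumed semantics)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Keep only the paths that match none of the exclusion patterns
kept_paths = [p for p in candidate_paths if not is_excluded(p, exclude_patterns)]
print(kept_paths)
```

Under that assumption, only the path that contains neither a `files` nor a `latest` directory component is kept.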

## Parser

Let's create a custom function for parsing the global attributes and variable
attributes of an xarray dataset. For our use case, an xarray dataset corresponds
to the content of a single netCDF file:


In [None]:
# Define the global attributes to extract from each xarray.Dataset
global_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "frequency",
    "grid_label",
    "realm",
    "variable_id",
    "variant_label",
    "parent_experiment_id",
    "parent_variant_label",
    "sub_experiment",
]

# Define the variable attributes to extract from each xarray.Dataset
variable_attrs = ["standard_name"]

# We want to rename the following attributes once we've extracted their values
attrs_mapping = {
    "variant_label": "member_id",
    "parent_variant_label": "parent_member_id",
}
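
The way these three objects work together can be sketched with plain dictionaries before any file is opened. The snippet below is an illustration only: `fake_global_attrs` is a hypothetical stand-in for the `attrs` dictionary of a dataset, and `wanted_attrs` and `rename_mapping` are shortened, hypothetical versions of the lists defined above.

```python
# Hypothetical global attributes, standing in for `ds.attrs`
fake_global_attrs = {
    "activity_id": "CMIP",
    "source_id": "CESM2",
    "variant_label": "r1i1p1f1",
}

# Shortened, hypothetical versions of `global_attrs` and `attrs_mapping`
wanted_attrs = ["activity_id", "source_id", "variant_label", "parent_variant_label"]
rename_mapping = {
    "variant_label": "member_id",
    "parent_variant_label": "parent_member_id",
}

# Harvest each requested attribute; missing ones are recorded as None
harvested = {attr: fake_global_attrs.get(attr) for attr in wanted_attrs}

# Rename the harvested keys that appear in the mapping, keep the others as-is
renamed = {rename_mapping.get(key, key): value for key, value in harvested.items()}
print(renamed)
```

Attributes that are absent from the file end up as `None` rather than raising an error, and the renamed keys replace the original ones in the result.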

In [None]:
import xarray as xr


def cmip6_ds_parser(
    filepath: str,
    global_attrs: list,
    variable_attrs: list = None,
    attrs_mapping: dict = None,
    add_dim: bool = True,
):
    """
    Function that harvests global attributes and variable attributes
    for CMIP6 netCDF output.

    Parameters
    ----------
    filepath : str
        Path to the netCDF file to parse.
    global_attrs : list
        global attributes to extract from the netCDF file.
    variable_attrs : list, optional
        variable attributes to extract from the netCDF file, by default None
    attrs_mapping : dict, optional
        A mapping to use to rename some keys/attributes harvested from
        the netCDF file, by default None
    add_dim : bool, optional
        Whether to add variable's dimensionality information to harvested
        attributes, by default True

    Returns
    -------
    dict
        A dictionary of attributes harvested from the input CMIP6 netCDF file.
    """
    try:
        results = {"path": filepath}
        ds = xr.open_dataset(
            filepath, decode_times=True, use_cftime=True, chunks={}
        )
        g_attrs = ds.attrs
        variable_id = g_attrs["variable_id"]
        v_attrs = ds[variable_id].attrs
        for attr in global_attrs:
            results[attr] = g_attrs.get(attr, None)

        if variable_attrs:
            for attr in variable_attrs:
                results[attr] = v_attrs.get(attr, None)

        # Record the dimensionality of the data variable (e.g. "3D") if requested
        if add_dim:
            results["dim"] = f"{ds[variable_id].data.ndim}D"

        if "time" in ds.coords:
            times = ds["time"]
            start = times[0].dt.strftime("%Y-%m-%d").data.item()
            end = times[-1].dt.strftime("%Y-%m-%d").data.item()
            results["end"] = end
            results["start"] = start
        if attrs_mapping and isinstance(attrs_mapping, dict):
            for old_key, new_key in attrs_mapping.items():
                # Only rename keys that were actually harvested
                if old_key in results:
                    results[new_key] = results.pop(old_key)

        return results

    except Exception as e:
        # TODO: Record faulty files
        data = {"exception": str(e), "file": filepath}
        print(data)
        return {}

When parsing file attributes, `ecgtools` requires that the parser is a function
with the following signature:

```python
def myparser(filepath, global_attrs):
    ...
```

To meet this requirement, we wrap our `cmip6_ds_parser` function in a partial
function. `functools.partial` derives a new function from an existing one by
fixing the values of some of its parameters, so the resulting function takes
fewer arguments (which is what `ecgtools` expects).
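
As a minimal, self-contained sketch of the idea, the snippet below applies `functools.partial` to a toy function. `toy_parser` and `two_arg_parser` are hypothetical names used only for illustration; they are not part of `ecgtools`.

```python
import functools


def toy_parser(filepath, global_attrs, variable_attrs=None, attrs_mapping=None):
    """Hypothetical stand-in for a parser that takes extra keyword parameters."""
    return {
        "path": filepath,
        "global_attrs": global_attrs,
        "variable_attrs": variable_attrs,
        "attrs_mapping": attrs_mapping,
    }


# Fix the two extra keyword arguments; the result only needs the first two
two_arg_parser = functools.partial(
    toy_parser,
    variable_attrs=["standard_name"],
    attrs_mapping={"variant_label": "member_id"},
)

# Call it with just (filepath, global_attrs)
result = two_arg_parser("example.nc", ["source_id"])
print(result["variable_attrs"])
```

The derived function can be called with only `filepath` and `global_attrs`, while the fixed keyword values are still passed through to the original function.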


In [None]:
# Create a partial for our parser
import functools

cmip6_parser = functools.partial(
    cmip6_ds_parser, variable_attrs=variable_attrs, attrs_mapping=attrs_mapping,
)

## Crawl directories, compile list of files, and extract attributes

Now we are ready to compile the list of valid files by crawling the directories.
Once we have this list, we can extract attributes from each file as follows:


In [None]:
df = (
    builder.parse_files_attributes(
        global_attrs, parser=cmip6_parser, lazy=False
    )  # Extract attributes from each file
    .to_df()  # Create a pandas DataFrame containing all attributes extracted from the files
    .df  # Retrieve the constructed pandas DataFrame
)

In [None]:
df.head()