# Want to add a new datafile type to Intake-ESM?

This tutorial documents the workflow for adding [NA-CORDEX data](https://na-cordex.org/) to intake-esm.

The same steps should be replicable for the data catalogue of your choice:

## Step 1: Select Attributes for Collection Columns
Make sure you understand the formatting and attributes of your dataset.
Does your dataset follow a naming convention or template for its data files?


Each data file name in the NA-CORDEX collection contains its attributes in a set order: `[variable].[experiment].[global_climate_model].[regional_climate_model].[frequency].[grid].[bias_corrected_or_raw].nc`.
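
As a quick check that a naming convention is regular enough to parse, the template can be applied to a single file name by hand. The sketch below is illustrative only: the file name is an example that follows the template, and `FIELDS` is a name introduced here, not part of intake-esm.

```python
# Attribute names in the order they appear in an NA-CORDEX file name.
FIELDS = [
    'variable', 'experiment', 'global_climate_model', 'regional_climate_model',
    'frequency', 'grid', 'bias_corrected_or_raw',
]

# Example file name following the template (illustrative).
basename = 'uas.hist.CanESM2.CRCM5-UQAM.day.NAM-44i.raw.nc'

# Drop the trailing 'nc' extension, then pair each part with its field name.
parts = basename.split('.')[:-1]
if len(parts) != len(FIELDS):
    raise ValueError('file name does not follow the template: ' + basename)
attrs = dict(zip(FIELDS, parts))
print(attrs['regional_climate_model'], attrs['grid'])
```

Splitting on dots only works when no attribute value itself contains a dot, so it is worth confirming that for your own dataset.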


## Step 2: Generate .yaml File Listing Collection Columns

This .yaml file should be housed inside the `collection_defs` folder. For our cordex data the full file path is: `intake_esm/collection_defs/cordex.yaml`.


Our column names, listed under `collection_columns`, contain all of the attributes mentioned in Step 1, plus `resource`, `resource_type`, `direct_access`, `file_fullpath`, `file_basename`, and `file_dirname`. These are columns we recommend for any dataset.

Your .yaml file should follow the same template as below, but with the items listed under `collection_columns` appropriate for your data.

```yaml
collections:
  cordex:
    collection_columns:
      - resource
      - resource_type
      - direct_access
      - variable
      - experiment
      - global_climate_model
      - regional_climate_model
      - frequency
      - grid
      - bias_corrected_or_raw
      - file_fullpath
      - file_basename
      - file_dirname
    order_by_columns:
      - file_fullpath
    required_columns:
      - file_fullpath
```   
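
Before moving on, it seems sensible to confirm that every column named under `order_by_columns` and `required_columns` also appears in `collection_columns`. The sketch below runs that check on a plain Python dictionary mirroring the YAML above. `check_definition` is a hypothetical helper written for this tutorial, not an intake-esm function, and whether intake-esm enforces this rule itself is an assumption.

```python
# Plain-dict mirror of the 'cordex' entry in the YAML definition above.
definition = {
    'collection_columns': [
        'resource', 'resource_type', 'direct_access', 'variable', 'experiment',
        'global_climate_model', 'regional_climate_model', 'frequency', 'grid',
        'bias_corrected_or_raw', 'file_fullpath', 'file_basename', 'file_dirname',
    ],
    'order_by_columns': ['file_fullpath'],
    'required_columns': ['file_fullpath'],
}


def check_definition(defn):
    """Return names used in the auxiliary keys but absent from collection_columns."""
    known = set(defn['collection_columns'])
    missing = set()
    for key in ('order_by_columns', 'required_columns'):
        missing |= set(defn.get(key, [])) - known
    return sorted(missing)


missing = check_definition(definition)
print(missing)
```

An empty list means the definition is internally consistent.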

## Step 3: Create Another .yaml File

This file is external input, not part of the package. We housed it in a new test directory, `intake_esm/tests/cordex`, and named it `cordex-collection-input.yaml`.

This .yaml file requires the following keys: **name**, **collection_type**, and **data_sources**.

Here is a description of all of the assigned keys in our `cordex-collection-input.yaml` file:

- **name** - What is a meaningful name that points to this dataset? In this example, this is 'NA-CORDEX'.

- **collection_type** - What is the shorthand name you will use while coding? This should match the key listed under `collections` in the .yaml file from Step 2. Here it is 'cordex'.

- **data_sources** - Where should intake-esm look for your data? This can point to more than one directory, but in this example we have only one key, here named 'CORDEX-Data'. Within the data source 'CORDEX-Data' we point to locations. `locations` is a list of dictionaries, each with the following keys:

    - **name** - For us, all of our data is on the GLADE file storage system, so we name this location 'GLADE'. **It is very important that the combination of your data_source key and location name is unique**, so the combination of 'CORDEX-Data' and 'GLADE' should not be repeated inside this .yaml file.

    - **loc_type** - How is this data stored? Typically data is on a posix file system, but if the data is on Tape or Cloud Storage, you might have a different answer here.   

    - **direct_access** - Here this is set to 'True'    

    - **urlpath** - The root directory that contains your data. If data does not share a root directory, you will want to have several items in the locations list.
    
Below is what our .yaml file looks like for the cordex data:

```yaml
name: NA-CORDEX
collection_type: cordex
data_sources:
  CORDEX-Data:
    locations:
      - name: GLADE
        loc_type: posix
        direct_access: True
        urlpath: /glade/collections/cdg/data/cordex/data/
```

Here we display how intake-esm will interpret the structure of that YAML file. This may be easier to understand for some.
```
{'name': 'NA-CORDEX',
 'collection_type': 'cordex',
 'data_sources': {'CORDEX-Data': {'locations': [{'name': 'GLADE',
     'loc_type': 'posix',
     'direct_access': True,
     'urlpath': '/glade/collections/cdg/data/cordex/data/'}]}}}
```
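
The dictionary form also makes it easy to check the two requirements stated above: the three top-level keys must be present, and each combination of data source and location name must be unique. The following is a minimal sketch under those assumptions. `duplicate_locations` is a hypothetical helper introduced here, not something intake-esm provides.

```python
from collections import Counter

# Same structure as the parsed collection input shown above.
collection_input = {
    'name': 'NA-CORDEX',
    'collection_type': 'cordex',
    'data_sources': {
        'CORDEX-Data': {
            'locations': [
                {
                    'name': 'GLADE',
                    'loc_type': 'posix',
                    'direct_access': True,
                    'urlpath': '/glade/collections/cdg/data/cordex/data/',
                },
            ]
        }
    },
}


def duplicate_locations(spec):
    """Return (data_source, location name) pairs that occur more than once."""
    pairs = Counter(
        (source, loc['name'])
        for source, body in spec['data_sources'].items()
        for loc in body['locations']
    )
    return [pair for pair, count in pairs.items() if count > 1]


required_keys = {'name', 'collection_type', 'data_sources'}
missing_keys = required_keys - set(collection_input)
duplicates = duplicate_locations(collection_input)
print(missing_keys, duplicates)
```

Both results should be empty for a well-formed input file.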

## Step 4: Create a .py Script Containing Rules for Gathering Attributes

Inside the `intake-esm/intake_esm` directory create a new .py file, here called `cordex.py`. This file contains the rules for filling the columns (determined by the .yaml file in Step 2) via the information found in each datafile name.

The file defines two classes, a **Collection** and a **Source**. Ours are `CORDEXCollection` and `CORDEXSource`.

### Step 4a: Writing the Collection Class

The Collection class, called `CORDEXCollection` in our script, inherits from the base `Collection` class, but we still have to implement the method that determines how to get attributes from a file name.

This is the starting point. Whatever your data's naming scheme, the class will need at least this skeleton:

```python
class CORDEXCollection(Collection):

    __doc__ = docstrings.with_indents(
        """ Builds a NA-CORDEX collection for data
        stored on NCAR's GLADE
    %(Collection.parameters)s
    """
    )

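    def _get_file_attrs(self, filepath):
        # Work from the bare file name; the directory is stored separately below.
        file_basename = os.path.basename(filepath)
        # Dot-separated parts of the name, available for manual parsing.
        fs = file_basename.split('.')

        # Columns to fill from the file; the three excluded here are not
        # parsed from the file name.
        keys = list(set(self.columns) - set(['resource', 'resource_type', 'direct_access']))

        # Start with every column empty, then fill in the path-derived ones.
        fileparts = {key: None for key in keys}
        fileparts['file_basename'] = file_basename
        fileparts['file_dirname'] = os.path.dirname(filepath) + '/'
        fileparts['file_fullpath'] = filepath

        # Dataset-specific parsing of the file name goes here.
        return fileparts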
```

In our implementation of `_get_file_attrs`, we define a `filename_template` that names each dot-separated component of the file name. The `_reverse_filename_format` method then matches the file name against this template and returns the attribute values, which we merge into the collection columns.

```python
class CORDEXCollection(Collection):

    __doc__ = docstrings.with_indents(
        """ Builds a NA-CORDEX collection for data
        stored on NCAR's GLADE
    %(Collection.parameters)s
    """
    )

    def _get_file_attrs(self, filepath):
        file_basename = os.path.basename(filepath)
        fs = file_basename.split('.')

        keys = list(set(self.columns) - set(['resource', 'resource_type', 'direct_access']))

        fileparts = {key: None for key in keys}
        fileparts['file_basename'] = file_basename
        fileparts['file_dirname'] = os.path.dirname(filepath) + '/'
        fileparts['file_fullpath'] = filepath
        
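        # Template naming each dot-separated component of the file name.
        filename_template = '{variable}.{experiment}.{global_climate_model}.{regional_climate_model}.{frequency}.{grid}.{bias_corrected_or_raw}.nc'

        # Parse the file name against the template and merge the result.
        f = CORDEXCollection._reverse_filename_format(file_basename, filename_template)
        fileparts.update(f)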
            
        return fileparts
```

As an example, the attributes parsed from the file `uas.hist.CanESM2.CRCM5-UQAM.day.NAM-44i.raw.nc` look like:

```
{'variable': 'uas',
 'experiment': 'hist',
 'global_climate_model': 'CanESM2',
 'regional_climate_model': 'CRCM5-UQAM',
 'frequency': 'day',
 'grid': 'NAM-44i',
 'bias_corrected_or_raw': 'raw'}
```
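
To make the template matching less of a black box, here is a small regex-based sketch of how a template with `{field}` placeholders could be reversed into a dictionary. It is a hypothetical stand-in written for this tutorial and uses only the standard library. It is not intake-esm's `_reverse_filename_format`, whose internals may differ.

```python
import re


def reverse_format(template, filename):
    """Recover the values of '{field}' placeholders in template from filename."""
    # Split the template into literal text and '{field}' placeholders.
    pieces = re.split(r'(\{\w+\})', template)
    pattern = ''
    for piece in pieces:
        if piece.startswith('{') and piece.endswith('}'):
            # Each placeholder becomes a named group that stops at the next dot.
            pattern += '(?P<%s>[^.]+)' % piece[1:-1]
        else:
            # Literal text (dots, the extension) must match exactly.
            pattern += re.escape(piece)
    match = re.fullmatch(pattern, filename)
    if match is None:
        raise ValueError('%r does not match template %r' % (filename, template))
    return match.groupdict()


template = ('{variable}.{experiment}.{global_climate_model}.'
            '{regional_climate_model}.{frequency}.{grid}.{bias_corrected_or_raw}.nc')
attrs = reverse_format(template, 'uas.hist.CanESM2.CRCM5-UQAM.day.NAM-44i.raw.nc')
print(attrs)
```

Raising on a mismatch, rather than returning partial results, makes files that break the naming convention easy to spot while building the collection.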

#### **Not Done Yet! Now go into intake_esm/core.py and add the following lines:**

**1)** Near the top of the script you will find the imports of the Collection classes from each data type's .py file. Add yours:
```python
from .cordex import CORDEXCollection
```

**2)** Then you will find a `collection_types` dictionary:

```python
    collection_types = {
        'cesm': CESMCollection,
        'cesm-aws': CESMAWSCollection,
        'cmip5': CMIP5Collection,
        'cmip6': CMIP6Collection,
        'mpige': MPIGECollection,
        'gmet': GMETCollection,
        'era5': ERA5Collection,
    }
 ```
    
Add an element for your new collection:

```python
    collection_types = {
        'cesm': CESMCollection,
        'cesm-aws': CESMAWSCollection,
        'cmip5': CMIP5Collection,
        'cmip6': CMIP6Collection,
        'mpige': MPIGECollection,
        'gmet': GMETCollection,
        'era5': ERA5Collection,
        'cordex': CORDEXCollection,
    }
 ```
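
The reason the key must match `collection_type` from Step 3 is that the string in the input file selects the class. The sketch below shows that kind of lookup with stand-in classes. `build_collection` and the minimal `Collection` here are hypothetical, and the exact way `core.py` consumes the dictionary is an assumption.

```python
class Collection:
    """Minimal stand-in for the base class (illustrative only)."""

    def __init__(self, collection_input):
        self.collection_input = collection_input


class CORDEXCollection(Collection):
    pass


# Registry keyed by the collection_type string from the input YAML.
collection_types = {'cordex': CORDEXCollection}


def build_collection(collection_input):
    """Pick the Collection class named by collection_type and instantiate it."""
    ctype = collection_input['collection_type']
    try:
        cls = collection_types[ctype]
    except KeyError:
        known = sorted(collection_types)
        raise ValueError(f'unknown collection_type {ctype!r}; expected one of {known}')
    return cls(collection_input)


col = build_collection({'name': 'NA-CORDEX', 'collection_type': 'cordex'})
print(type(col).__name__)
```

If the key in the dictionary and the `collection_type` in your input file differ, even by case, a lookup like this fails.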

### Step 4b: Writing the Source Class

The Source class is called `CORDEXSource` in our `cordex.py` script.

First you need to decide which **dataset_fields** constitute unique datasets that are unlikely to be concatenated together (you probably wouldn't put data gridded differently in the same dataset, for example). In our case we chose 5 columns: `global_climate_model`, `regional_climate_model`, `frequency`, `grid`, and `bias_corrected_or_raw`.

You also need to decide how your data may be grouped. If you have multiple ensemble members that share all other attributes, you might want to group them together; this was not the case for the cordex data. Look inside `source.py` to understand more of the built-in functionality.
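
To see what the choice of `dataset_fields` implies, the sketch below groups a few catalogue rows by those five fields and builds one dot-joined identifier per group. The rows are made-up examples in the style of the NA-CORDEX catalogue, and the grouping uses plain dictionaries rather than the pandas `groupby` that the real class relies on.

```python
from collections import defaultdict

dataset_fields = ['global_climate_model', 'regional_climate_model',
                  'frequency', 'grid', 'bias_corrected_or_raw']

# Attributes shared by all illustrative rows below.
common = {'global_climate_model': 'CanESM2', 'regional_climate_model': 'CRCM5-UQAM',
          'frequency': 'day', 'bias_corrected_or_raw': 'raw'}

# Illustrative catalogue rows; real rows come from the collection's DataFrame.
rows = [
    dict(common, variable='uas', grid='NAM-44i'),
    dict(common, variable='tmax', grid='NAM-44i'),
    dict(common, variable='uas', grid='NAM-22i'),
]

# Rows that agree on every dataset field end up in the same dataset.
groups = defaultdict(list)
for row in rows:
    dset_id = '.'.join(row[field] for field in dataset_fields)
    groups[dset_id].append(row['variable'])

for dset_id, variables in sorted(groups.items()):
    print(dset_id, variables)
```

The two `NAM-44i` rows share a dataset, where their variables would be merged, while the `NAM-22i` row gets a dataset of its own because its grid differs.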

Below we have the CORDEXSource class:

```python
class CORDEXSource(BaseSource):
    name = 'cordex'
    partition_access = True

    def _open_dataset(self):
        # fields which define a single dataset
        dataset_fields = ['global_climate_model', 'regional_climate_model', 'frequency', 'grid', 'bias_corrected_or_raw']

        kwargs = self._validate_kwargs(self.kwargs)

        all_dsets = {}
        query_results = get_subset(self.collection_name, self.query)
        
        file_fullpath_column_name = 'file_fullpath'
        file_basename_column_name = 'file_basename'
        variable_column_name = 'variable'
        
        query_results = _ensure_file_access(
            query_results, file_fullpath_column_name, file_basename_column_name
        )
        grouped = query_results.groupby(dataset_fields)
        for dset_keys, dset_files in tqdm(grouped, desc='dataset'):
            dset_id = '.'.join(dset_keys)
            var_dsets = []
            for v_id, v_files in dset_files.groupby(variable_column_name):
                urlpath_ei_vi = v_files[file_fullpath_column_name].tolist()
                dsets = [
                    aggregate.open_dataset_delayed(
                        url,
                        data_vars=[v_id],
                        chunks=kwargs['chunks'],
                        decode_times=kwargs['decode_times'],
                    )
                    for url in urlpath_ei_vi
                ]

                var_dset_i = aggregate.concat_time_levels(
                    dsets,
                    time_coord_name_default=kwargs['time_coord_name'],
                    override_coords=kwargs['override_coords'],
                )
                var_dsets.append(var_dset_i)

            _dset_i = aggregate.merge(dsets=var_dsets)
            all_dsets[dset_id] = _dset_i

        self._ds = all_dsets
```

#### **Now go into `collection_defs/source.yaml` and add the new Source class to the list of sources:**

After the addition, the list should look like this:

```yaml
sources:
  cesm: intake_esm.cesm.CESMSource
  cesm-aws: intake_esm.cesm_aws.CESMAWSSource
  cmip5: intake_esm.cmip.CMIP5Source
  cmip6: intake_esm.cmip.CMIP6Source
  mpige: intake_esm.mpige.MPIGESource
  gmet: intake_esm.gmet.GMETSource
  era5: intake_esm.era5.ERA5Source
  cordex: intake_esm.cordex.CORDEXSource
```
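
Each entry maps a short name to a dotted path of the form `package.module.ClassName`. As a rough illustration of how such a string can be turned into a class, the sketch below resolves a dotted path with `importlib`. `resolve` is a hypothetical helper, and whether intake-esm resolves these entries in exactly this way is an assumption. A standard-library class is used so the example runs without intake-esm installed.

```python
import importlib


def resolve(dotted_path):
    """Import the module part of 'pkg.module.Name' and return the named attribute."""
    module_name, _, attr = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Standard-library stand-in for a path such as 'intake_esm.cordex.CORDEXSource'.
cls = resolve('collections.OrderedDict')
print(cls.__name__)
```

A typo in either the module or the class name surfaces as an `ImportError` or `AttributeError` at this point, which is a useful first thing to check if your new source is not found.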


## Step 5: Testing

#### **Congratulations! You have added a new datatype to the intake-esm ecosystem!**
Let's test the collection:

```python
import intake
import yaml
import pandas as pd
from intake_esm import config
from distributed.utils import format_bytes
```



**Note: Import intake_esm and its components after intake, not the other way around.**

This is because intake_esm is a plug-in for intake.

**1)** Let's generate our collection `col`:

```python
config.get('collections.cordex')

test = yaml.safe_load('''name: NA-CORDEX
collection_type: cordex
data_sources:
  CORDEX-Data:
    locations:
      - name: GLADE
        loc_type: posix
        direct_access: True
        urlpath: /glade/collections/cdg/data/cordex/data/''')

col = intake.open_esm_metadatastore(collection_input_definition=test, overwrite_existing=True)
```

```
Getting file listing: CORDEX-Data:GLADE:posix:/glade/collections/cdg/data/cordex/data/

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16897 entries, 14130 to 11851
Data columns (total 13 columns):
resource                  16897 non-null object
resource_type             16897 non-null object
direct_access             16897 non-null bool
variable                  16897 non-null object
experiment                16897 non-null object
global_climate_model      16897 non-null object
regional_climate_model    16897 non-null object
frequency                 16897 non-null object
grid                      16897 non-null object
bias_corrected_or_raw     16897 non-null object
file_fullpath             16897 non-null object
file_basename             16897 non-null object
file_dirname              16897 non-null object
dtypes: bool(1), object(12)
memory usage: 1.7+ MB
None
Persisting NA-CORDEX at : /glade/u/home/jkent/.intake_esm/collections/cordex/NA-CORDEX.cordex.csv
```


**2)** Take a look and make sure the column headings are as specified.

```python
col.df.head()
```

| | resource | resource_type | direct_access | variable | experiment | global_climate_model | regional_climate_model | frequency | grid | bias_corrected_or_raw | file_fullpath | file_basename | file_dirname |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14130 | CORDEX-Data:GLADE:posix:/glade/collections/cdg... | posix | True | huss | hist | CanESM2 | CRCM5-UQAM | day | NAM-22i | kddm-METDATA | /glade/collections/cdg/data/cordex/data/kddm-M... | huss.hist.CanESM2.CRCM5-UQAM.day.NAM-22i.kddm-... | /glade/collections/cdg/data/cordex/data/kddm-M... |
| 14125 | CORDEX-Data:GLADE:posix:/glade/collections/cdg... | posix | True | prec | hist | CanESM2 | CRCM5-UQAM | day | NAM-22i | kddm-METDATA | /glade/collections/cdg/data/cordex/data/kddm-M... | prec.hist.CanESM2.CRCM5-UQAM.day.NAM-22i.kddm-... | /glade/collections/cdg/data/cordex/data/kddm-M... |
| 14128 | CORDEX-Data:GLADE:posix:/glade/collections/cdg... | posix | True | rsds | hist | CanESM2 | CRCM5-UQAM | day | NAM-22i | kddm-METDATA | /glade/collections/cdg/data/cordex/data/kddm-M... | rsds.hist.CanESM2.CRCM5-UQAM.day.NAM-22i.kddm-... | /glade/collections/cdg/data/cordex/data/kddm-M... |
| 14129 | CORDEX-Data:GLADE:posix:/glade/collections/cdg... | posix | True | tmax | hist | CanESM2 | CRCM5-UQAM | day | NAM-22i | kddm-METDATA | /glade/collections/cdg/data/cordex/data/kddm-M... | tmax.hist.CanESM2.CRCM5-UQAM.day.NAM-22i.kddm-... | /glade/collections/cdg/data/cordex/data/kddm-M... |
| 14131 | CORDEX-Data:GLADE:posix:/glade/collections/cdg... | posix | True | tmin | hist | CanESM2 | CRCM5-UQAM | day | NAM-22i | kddm-METDATA | /glade/collections/cdg/data/cordex/data/kddm-M... | tmin.hist.CanESM2.CRCM5-UQAM.day.NAM-22i.kddm-... | /glade/collections/cdg/data/cordex/data/kddm-M... |


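**3)** Search by attribute. The matching rows are available as a DataFrame through `query_results`.

```python
query = col.search(
    variable='uas',
    global_climate_model='CanESM2',
    regional_climate_model='CRCM5-UQAM',
    experiment='hist',
    frequency='day',
    grid='NAM-44i',
)
query.query_results.head()
```

**4)** Then turn the search results into xarray datasets. Call `to_xarray` on the object returned by `col.search`, not on its `query_results` DataFrame: the DataFrame's own `to_xarray` does not accept `chunks` and fails with `TypeError: to_xarray() got an unexpected keyword argument 'chunks'`.

```python
dset = query.to_xarray(chunks={'lon': 50})
```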

**5)** Listing the keys of the result shows which datasets matched the search, without having to sift through their attributes.

```python
dset.keys()
```

```
dict_keys(['CanESM2.CRCM5-UQAM.day.NAM-44i.kddm-METDATA', 'CanESM2.CRCM5-UQAM.day.NAM-44i.mbcn-METDATA', 'CanESM2.CRCM5-UQAM.day.NAM-44i.raw'])
```

```python
dset['CanESM2.CRCM5-UQAM.day.NAM-44i.raw']
```

```
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 129, lon: 300, time: 20440)
Coordinates:
  * time       (time) object 1950-01-01 12:00:00 ... 2005-12-31 12:00:00
  * lat        (lat) float64 12.25 12.75 13.25 13.75 ... 74.75 75.25 75.75 76.25
  * lon        (lon) float64 -171.8 -171.2 -170.8 ... -23.25 -22.75 -22.25
    time_bnds  (time, bnds) object dask.array<shape=(20440, 2), chunksize=(20440, 2)>
Dimensions without coordinates: bnds
Data variables:
    uas        (time, lat, lon) float32 dask.array<shape=(20440, 129, 300), chunksize=(20440, 129, 50)>
Attributes:
    Conventions:                    CF-1.4
    institution:                    Universite du Quebec a Montreal
    contact:                        Winger.Katja@uqam.ca
    comment:                        CORDEX North America CRCM5 v333 0.44 deg ...
    model:                          CRCM5 (dynamics GEM v_3.3.3, physics RPN ...
    model_grid:                     rotated lat-lon 236x241 incl. 10p pilot a...
    geophysical_f
```

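How big is each chunk of our new dataset? The chunk shape of `uas` is (20440, 129, 50), so we multiply the dimensions:

```python
format_bytes(20440*129*50)
```

```
'131.84 MB'
```

Note that this product counts array elements rather than bytes. `uas` is stored as float32 (4 bytes per value), so each chunk occupies about four times this figure in memory, roughly 527 MB.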